Tuesday, March 29, 2016

Context for prooffreader.com post about White House petitions

Since this is so long, I stuck it here instead of in the main prooffreader.com post. For each of the top 25 most characteristic words in successful petitions, and each of the top 10 most characteristic words in unsuccessful petitions, this lists the titles of five randomly selected petitions containing that word.


### SUCCESSFUL PETITIONS ###


gun
- Create ‘National Gun Safety Day’ to Make Gun Responsibility a Part of Mainstream American Culture
- Keep guns in America! No weapons ban!
- Honor our service members who used their personal firearms to fight back against the terrorist attacker in Chattanooga.
- To award the Medal of Freedom to the 4 Firefighters who were ambushed in West Webster New York on Christmas Eve 2012
- RESIGN!

tragedy
- Immediately sign Executive Order banning sale of assault weapons and high-capacity magazines until Congress acts on this
- Enforce ceasefire and send humanitarian/international aid to  Syria to stop violence and slaughter of innocent people.
- To Honor the Granite Mountain Hot Shot Team with the Public Safety Officer Medal
- Spend the $25 million already appropriated on a supercomputer for increased weather-prediction capabilities.
- Disclose the Truth About Benghazi 1) Who Issued the Stand Down Order? 2) Who Concocted the YouTube Lie? WE DEMAND TRUTH!

access
- Enact the Restroom Access Act, or Ally's Law
- Promote dystonia awareness by recognizing Dystonia Awareness Month in September
- MODERNIZE THE RAIL NETWORK INTO A HIGH-CAPACITY, GRADE-SEPARATED, ELECTRIFIED SYSTEM TO SERVE FREIGHT AND PASSENGERS
- To force cellular carriers to allow us to install Google Wallet and use this feature on all Google NFC devices.
- Admonish OCR for its Arbitrary Action and Request that it Withdraw its Directive for Illinois School District 211

imperative
- Enforce the tax code, and strip violating Religious institutions of their tax exempt 501(c) status.
- protect coal ash recycling by promptly enacting disposal regulations that do NOT designate coal ash a “hazardous waste.'
- Require that Insurance Companies offer PPO plans to Individuals and small groups in all states.
- Establish Lunar New Year as a National Holiday. Give it the same importance and weight as the other cultural holidays.
- Mandate all freight trains have two-person crews.

regulatory
- Appoint Susan Crawford as FCC Chairman
- reverse Secretary Sebelius's decision to restrict access to emergency contraception.
- Call on Congress to repeal the job-killing, USPS-strangling Postal Accountability and Enhancement Act of 2006.
- Reform ECPA: Tell the Government to Get a Warrant
- impose relatively quick and effective sanctions on Russia, the sponsor of terrorismus

halbach
- Pardon Steven Avery and Brendan Dassey for their alleged involvement in the murder of Teresa Halbach.
- Retrie Steven Avery in the murder case of Theresa halbach
- Federally Investigate the Manitowoc County Sheriffs Department and the County Court System for Criminal Behavior.
- Encourage the Judiciary to re-examine the trials of Steven Avery and Brendan Dassey, & the search for Halbach's killer.
- Open discussion with Wisconsin Gov. Scott Walker allowing a retrial by way of due process in the case of Steven Avery

westboro
- Define the Westboro Baptist Church as a hate group due to promoting animosity against differing cultural demographics.
- Revoke the tax exempt status of the Westboro Baptist Church & re-classify Westboro Baptist Church as a hate group.
- Ban protests for any cause within three hundred feet of any funeral service, both during and two hours before and after.
- Make Petitioning funerals illegal.
- Investigate the IRS Tax-Exempt Status of the Westboro Baptist Church

clinical
- Include Gender Reassignment Surgery within the ACA, Medicare/aid, and the Veterans Administration.
- Change the Direction of EHR Technology, in Order to Improve Healthcare Outcomes and Control Costs
- Urge the FDA to grant Accelerated Approval to critical new treatments for cystic fibrosis.
- Recognize pharmacists as health care providers!
- authorize the FDA to grant a compassionate use exemption to Refael Elisha Cohen for Antineoplaston therapy.

superior
- Initiate a civil rights investigation against Texas Attorney General Ken Paxton for civil rights violations.
- Secure resources and funding, and begin construction of a Death Star by 2016.
- Immediately halt the cruel and unnecessary use of monkeys in Army chemical casualty management training courses.
- Stop Apache Land Grab
- Demand Brooklyn District Attorney Kenneth P. Thompson to withdraw indictment against Asian minority Officer Peter Liang!

abiding
- Pass the DREAM act or Development, Relief, and Education for Alien Minors
- Dissolve any petitions on an Assault Weapons Ban as unconstitutional under amendment II  of the Constitution
- Expedite the process of obtaining a Green Card
- Address our petition for redress of grievance against any proposed legislation violating our 2nd amendment rights.
- Lower the American flag to half staff nationally to honor the death of Fr. Theodore Hesburgh.

baptist
- Ban protests for any cause within three hundred feet of any funeral service, both during and two hours before and after.
- BAN THE WESTBORO BAPTIST CHURCH FROM ENTERING BOSTON AND PICKETING THE FUNERALS OF THOSE WHO DIED DURING THE BOMBING
- Revoke the tax exempt status of the Westboro Baptist Church & re-classify Westboro Baptist Church as a hate group.
- Define the Westboro Baptist Church as a hate group due to promoting animosity against differing cultural demographics.
- Investigate the IRS Tax-Exempt Status of the Westboro Baptist Church

avery
- Retrie Steven Avery in the murder case of Theresa halbach
- Pardon Steven Avery and Brendan Dassey for their alleged involvement in the murder of Teresa Halbach.
- Defund our state religion: The Church of Political Correctness whose dogma is WHITE GENOCIDE
- Open DOJ investigation into Manitowoc County Sheriff Depart for convictions of Steven Avery and Brendan Dassey
- Work with other world governments to stop human trafficking and torture by Bedouins in the Sinai.

definitely
- take steps with our allies and the UN to put pressure on the Iranian government to release Pastor Behnam Irani.
- ban the use of gas chambers for killing shelter companion animals. They are inhumane, expensive and dangerous to humans.
- Repeal the unconstitutional NDAA and FISA which allow intelligence agencies to secretly spy on US Citizens!
- Advance EB2 and EB3 priority dates for India and China
- We petition Obama Administration to consider revoking the citizenship of Ms. Eun Mi Shin who is praising N. Korea now.

connecticut
- Address The Civil Rights Violations By The State Of Connecticut Against Gerald O'Donnell and George Gould
- Extradite Warren Anderson, Bhopal fugitive and bail-jumper, to India to face charges related to thousands of deaths
- Use executive authority to reinstate the Federal Assault Weapons Ban of 1994 (expired 2004).
- Call on Congress to pass sensible gun legislation from the pediatricians of America
- Not punish the tens of millions of law-abiding gun owners with ineffective and unconstitutional 'assault weapons' bans.

sensible
- Preserve 6 Day Mail Delivery
- Replace Gil Kerlikowske
- begin a national conversation on sensible gun control.
- Establish federal gun control laws
- reinstate the public tours of the White House. This is the people's house and they deserve the opportunity to tour it!

brendan
- Encourage the Judiciary to re-examine the trials of Steven Avery and Brendan Dassey, & the search for Halbach's killer.
- Initiate a Federal Investigation of the Sheriff's Offices of Manitowoc County and Calumet County, Wisconsin
- Pardon Steven Avery and Brendan Dassey for their alleged involvement in the murder of Teresa Halbach.
- Open DOJ investigation into Manitowoc County Sheriff Depart for convictions of Steven Avery and Brendan Dassey
- Open discussion with Wisconsin Gov. Scott Walker allowing a retrial by way of due process in the case of Steven Avery

turnout
- Make Election Day a National Holiday.
- Make Election Day a National Holiday
- propose legislation that would make all federal election days national holidays to increase voter turnout.
- There are election rigging made by Progressive Program that have been used in the 18th Presidential Election of S. KOREA
- Designate Election Day as a national holiday.

recount
- Recount the election!
- There are election rigging made by Progressive Program that have been used in the 18th Presidential Election of S. KOREA
- Demand from El Salvador's Electoral Tribunal to hold a vote-by-vote recount of March 9th's presidential elections
- Cut off the $1.3B in aid annually to Egypt, unless they bring to justice the killers of the massacre of Coptic Christian
- Call upon the International Community to urge that a full recount of votes be done in Venezuela's presidential elections

dassey
- Initiate a Federal Investigation of the Sheriff's Offices of Manitowoc County and Calumet County, Wisconsin
- Open DOJ investigation into Manitowoc County Sheriff Depart for convictions of Steven Avery and Brendan Dassey
- Pardon Steven Avery and Brendan Dassey for their alleged involvement in the murder of Teresa Halbach.
- Encourage the Judiciary to re-examine the trials of Steven Avery and Brendan Dassey, & the search for Halbach's killer.
- Open discussion with Wisconsin Gov. Scott Walker allowing a retrial by way of due process in the case of Steven Avery

practically
- Peacefully grant the State of Texas to withdraw from the United States of America and create its own NEW government.
- Restore humane horse slaughter to improve horse welfare, stop needless & wasteful suffering & create jobs.
- Wage Justice, Debt Payment to Cops of Puerto Rico, and Restitution of their Setirement System
- Allow commercial flights to Cuba.
- Make the Metric system the standard in the United States, instead of the Imperial system.

conversation
- increase NASA funding to 1% of the federal budget
- Work with Congress to close the 13th Amendment slavery loophole
- Designate the sea between the Korean Peninsula and islands of Japan as 'East Sea/Sea of Japan' in all maps and texts.
- Stop Cyberlaw HR4681
- record all senate and congressional phone, email, and chat correspondence and make it public for anyone to see.

produce
- Release the full text of the Trans-Pacific Partnership to the American public ahead of a Congressional vote.
- Immediately halt the cruel and unnecessary use of monkeys in Army chemical casualty management training courses.
- Stop San Francisco from abusing David Gizzarelli and his puppy Charlie and return Charlie home where he belongs.
- Put Authentic America Made Products in our National Park Gift Shops! Using Presidential Executive Order power.
- Enact legislation guaranteeing women equal access & opportunity for employment in government funded arts organizations.

columbia
- Fully Legalize Same-Sex Marriage Across The Nation
- Award Yogi Berra The Presidential Medal of Freedom for his military service and civil rights and educational activism.
- Stop the hazing of sea lions in Washington & Oregon & have the fws publicly report the truth about decline salmon.
- Restore Net Neutrality By Directing the FCC to Classify Internet Providers as 'Common Carriers'.
- to Impeach all politicians who signed the Norquist oath in violation of their oath to the government of the USA.

ct
- call for the Chinese authority to stop blocking major Internet services, such as Gmail, via the Great Firewall
- give assistance to the homeless people and also to the poor
- Reform copyright law to allow libraries to keep digital copies of ebooks and other media.
- Provide More Transparency Around Government Surveillance of Internet Users
- JUSTICE FOR TURKISH VICTIMS OF 1915 CIVIL WAR WHICH WAS COMMENCED BY ARMENIAN BANDITS IN OTTOMAN SOIL.

regulation
- Continue to allow Americans to import life-sustaining prescription drugs from safe international online pharmacies.
- Legalize Concealed Carry for Soldiers on Military Installations
- make experimental drugs available for terminally ill ALS patients through existing FDA Accelerated Approval Program.
- Stop using Homeland Security funds to seize imported vehicles, and change the DOT/EPA exemption to 15 years.
- Rule the 'NY SAFE ACT' to be UNCONSTITUTIONAL!


### UNSUCCESSFUL PETITIONS ###


condition
- Help the families with open adoption cases in Russia bring their babies home
- Award the Presidential Medal of Freedom to each Woman in the video “STOP WHITE GENOCIDE”
- Require FAA to re-examine its 65 Decibel (dBA) noise safety level and consider 55 dBA as a new standard for human health
- (Jul #2 of 4) Allow White Americans to vote 'Yes' or 'No' on WHITE GENOCIDE!
- Support and sign into law H.R. 2858 the Wildland Firefighter Protection Act

2014
- Allow Staff Sergeant Carl Lee Wheless, Jr., U.S. Army, to continue on Federal active duty an additional 55 days.
- Look into the terrible and despicable officiating of the Lions and Cowboys NFC Wildcard game that took place on 1/4/2015
- Revoke The Tennessee Religious Viewpoints Anti-discrimination Act
- CANCEL White House Invitation To PM Modi-Organizer of 2002 Massacre of MUSLIMS. Ban BJP For 1984 Attack On Golden Temple
- help free Nadiya Savchenko

often
- SUPPORT more aggressive SARCOIDOSIS RESEARCH, add SARCOIDOSIS & it's Complications to SSA Compassionate Allowance List!
- Offer a federal tax credit of $1,000 per child to homeschooling families for educational expenses incurred.
- Pass legislation to ensure that military working dogs be classified as canine members of the armed services.
- Sign An Executive Order Barring Federal Agencies And Contractors From Discriminating On The Basis Of Prior Conviction
- Support and Implement Instant Runoff Voting.

duty
- exonerate Arnold Abbott!
- Military POV Shipping Contract Review (International Auto Logistics)
- retract its support of the current proposed increase of Tricare fees for active duty and retired military members.
- Allow United States Military service members to place their hands in their pockets.
- To keep its word to improve access to behavioral/mental health care for active duty, veterans, and all Americans.

physical
- Speak out against anti-whites who use the word “hate” to promote White Genocide
- (April #1 of 6) STOP WHITE GENOCIDE! Halt MASSIVE third world immigration and FORCED assimilation in White countries!
- Appoint board of inquiry: Was the Charleston shooting a tragic result of anti-Whites' program of White Genocide?
- (Aug #1 of 6) STOP WHITE GENOCIDE! Halt MASSIVE third world immigration and FORCED assimilation in White countries!
- Proclaim Oct. 29, 2011, as World Psoriasis Day

down
- remove the South Carolina Confederate flag
- Declare an executive order for all executive departments and agencies to be closed 12/26/2014, for a four-day weekend.
- Call for the resignation of Texas Lt. Gov. David Dewhurst based on misconduct during Wendy Davis' filibuster of sb5.
- Help fight for Parrot. Parrot was a innocent dog who got killed by a Police officer in Washington DC.
- arrest and prosecute the House GOP for treason.

genocide
- LIST ERDOGAN’S TURKEY AS STATE SPONSOR OF TERRORISM; VOID U.S. ALLIANCE WITH TURKEY
- Demand The Release of Supreme Religious Leader of Sikhism, Jathedar Jagtar Singh Hawara, Head  of  Sri Akal Takhat
- Award the Presidential Medal of Freedom to each Woman in the video “STOP WHITE GENOCIDE”
- Tell White Americans: 'White folk, look at your own family, or families you know, and see it happening.'
- Join presidential candidate Bob Whitaker in approving Donald Trump to head Department of Deportation NOW!

i
- Use the executive powers to fix the broken immigration system for legal immigrants
- Before Obama leaves his office, He’d better apologize his black people in Ferguson for calling them “criminals and thugs
- Recognize the Confederate flag as equal to the official national flag of the United States of America
- Stop States From Legalizing Discrimination
- Allow non-violent marijuana drug offenders to serve 65% of federal sentences as oppose to the current 85%.

code
- Explain why Asia is for the Asians, Africa is for the Africans, but White countries are for EVERYBODY
- Reverse our nation’s “No Child Left White” policy.
- Candidate Bob Whitaker invites the President and Mr. Trump to his seminar, 'Diversity is a code word for WHITE GENOCIDE'
- Censure the anti-white media for trivializing 'Polar Bear Hunting' as a 'game'
- Designate National White GeNOcide Day

say
- (#3 of 6 for March) Teach public school children the truth: ANTI-RACIST IS A CODE WORD FOR ANTI-WHITE!
- make 'Trophy Hunting' of endangered species outside of the U.S. illegal for all U.S. citizens.
- Publically ask Postmaster General & Congress to create a new  fundraising stamp called Stamp Out PTSD to Honor Veterans
- Petition urging the Congress and President Obama to intervene to stop violations of children's rights in Vietnam.
- Stop ALL construction of the Keystone tar sands pipeline once and forever !!

• • •

Tuesday, October 27, 2015

They Might Be Random Fingertips


I have a bit of an obsession with Fingertips, a sequence of 21 unrelated songs mostly between 4 and 12 seconds long from They Might Be Giants' 1992 album Apollo 18. The liner notes say to put the CD on random shuffle so that the brief snippets play in a random order.

There are 51,090,942,171,709,440,000 ways to permute 21 items -- that's a little excessive, so I made YouTube videos of 100 such permutations. I selected two video clips for each track, mostly by searching for the track name in YouTube, and assigned them randomly, so there should be some visual surprises upon multiple viewings.
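
For the curious, the arithmetic and the shuffling are easy to check. This is only a minimal sketch, not the script used to build the videos; `random_playlists` and its `seed` parameter are made-up names for illustration.

```python
import math
import random

N_TRACKS = 21  # number of Fingertips snippets, as stated in the post

# Number of distinct orderings of 21 tracks is 21 factorial.
n_orderings = math.factorial(N_TRACKS)
print(f"{n_orderings:,}")


def random_playlists(n_tracks, n_playlists, seed=0):
    """Return n_playlists independent random orderings of tracks 1..n_tracks.

    Hypothetical helper: the seed only makes the illustration repeatable.
    """
    rng = random.Random(seed)
    tracks = list(range(1, n_tracks + 1))
    # sample() with k == len(tracks) returns a shuffled copy each time.
    return [rng.sample(tracks, k=n_tracks) for _ in range(n_playlists)]


playlists = random_playlists(N_TRACKS, 100)
print(playlists[0])
```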

Here's a link to the YouTube channel I made to hold all 100 videos.

And here is Random Fingertips #1 of 100:




• • •

Monday, September 28, 2015

Methodology for comparison of Pope Francis address with presidential inaugural addresses

This post goes deeper into the methodology of this post on Prooffreader.com.

The code is in this gist.

Here is the graphic for that post:




DATASET: Links are in the gist. Note that Pope Francis's address is the one published and given to the media, not the one actually delivered. This may well be the case for inaugural addresses as well.

TFIDF: A standard technique in NLP (natural language processing): Wikipedia entry. I removed the default English stopwords in the TextBlob module (although, of course, TFIDF will give low scores to stopwords anyway.)

CALCULATION OF SIMILARITY BY PARTY: The average cosine similarities for presidents by party were:

republican 0.825
democratic 0.832
other 0.891
range: 0.743-0.968
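
To make the TFIDF and cosine similarity steps concrete, here is a minimal standard-library sketch. It is not the code from the gist: the smoothed idf formula is just one common variant, the function names are invented for illustration, and the toy "speeches" are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: dict of name -> list of tokens. Returns name -> {term: weight}.

    Uses a smoothed idf, log((1 + N) / (1 + df)) + 1, which is one common
    variant and an assumption here, not necessarily what the gist uses.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        vectors[name] = {
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy documents, with stopwords assumed to be already removed.
toy_docs = {
    "pope": "people common good dialogue people hope".split(),
    "president_a": "people union government hope".split(),
    "president_b": "war treasury government union".split(),
}
toy_vectors = tfidf_vectors(toy_docs)
for name in ("president_a", "president_b"):
    print(name, round(cosine_similarity(toy_vectors["pope"], toy_vectors[name]), 3))
```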

In order to create a metric that would be useful when hacking the t-SNE, below, I calculated the fraction of presidents from each party who ranked in the top half of similarity scores:

republican 0.417
democratic 0.500
other 1.000

TSNE: My favorite manifold learning algorithm, but it's not without its problems, mostly in terms of reproducibility. This is less of a problem when it's used to show similarities in a group in general, because it spreads the error inherent in such dimensionality reduction around to all points (a different way each time it's run), so in general it minimizes error at each point. But this is not what I wanted here. The thrust of this analysis is to portray the similarity between one point (the pope) and every other point (the presidents); I therefore needed to privilege this point (and similarity vector), minimize its error the most, and spread whatever error I saved among the other points. In other words, the president-pope distances would be more accurate than the president-president distances.

There are several possible solutions to this problem; I went with a brute-force hack! I simply re-ran the t-SNE over and over until (a) the pope point was more or less in the center, and (b) the percentages of republicans, democrats and others closest to the pope were reasonably close to those in the similarity matrix.

Oh, and there was a third criterion: the t-SNE had to look aesthetically pleasing, in a more or less globular shape without too many points overlapping so the mouseovers weren't annoying.

Then it just became a matter of tuning. If my criteria were too strict, I'd never find a solution. Less strict, and I'd find a solution every few minutes, but they might not look nice and I'd have to start again. Finally, I went with a solution (% democrats in top half - % republicans in top half > 4%, %other in top half - % democrats in top half > 8%) that took about 50 iterations and less than a minute to find, and ran it a few times till the first time I got something that 'looked nice'.
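
The re-run-until-acceptable loop can be sketched roughly as follows. This is an illustration only: a random 2-D layout stands in for the actual t-SNE run, the names (`top_half_fractions`, `acceptable`, `search_layout`, `random_layout`), the `center_tol` tolerance and the party labels are all hypothetical, and the aesthetic check stays a manual step.

```python
import math
import random


def top_half_fractions(points, parties, target="pope"):
    """Fraction of each party among the half of points nearest the target.

    points: name -> (x, y); parties: president name -> party (no target entry).
    """
    tx, ty = points[target]
    others = sorted(
        (name for name in points if name != target),
        key=lambda name: math.hypot(points[name][0] - tx, points[name][1] - ty),
    )
    nearest = set(others[: len(others) // 2])
    fractions = {}
    for party in set(parties.values()):
        members = [name for name in others if parties[name] == party]
        fractions[party] = sum(name in nearest for name in members) / len(members)
    return fractions


def acceptable(points, parties, target="pope", center_tol=0.25):
    """Criteria (a) and (b); center_tol is an assumed tolerance."""
    xs = [x for x, _ in points.values()]
    ys = [y for _, y in points.values()]
    spread = max(max(xs) - min(xs), max(ys) - min(ys))
    if spread == 0:
        return False
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    tx, ty = points[target]
    if math.hypot(tx - cx, ty - cy) > center_tol * spread:
        return False  # (a) target is not close enough to the center
    f = top_half_fractions(points, parties, target)
    # (b) thresholds quoted in the post: > 4% and > 8%
    return (f.get("democratic", 0) - f.get("republican", 0) > 0.04
            and f.get("other", 0) - f.get("democratic", 0) > 0.08)


def search_layout(embed, parties, max_tries=1000):
    """Call embed(attempt) until a layout passes; return (attempt, points) or None."""
    for attempt in range(1, max_tries + 1):
        points = embed(attempt)
        if acceptable(points, parties):
            return attempt, points
    return None


# Hypothetical presidents; a real run would use the t-SNE output instead.
PARTIES = {f"{party[0]}{i}": party
           for party in ("democratic", "republican", "other") for i in range(3)}


def random_layout(seed):
    """Stand-in for one t-SNE run: random points, reproducible per seed."""
    rng = random.Random(seed)
    names = list(PARTIES) + ["pope"]
    return {name: (rng.uniform(-1, 1), rng.uniform(-1, 1)) for name in names}


result = search_layout(random_layout, PARTIES)
print("no acceptable layout found" if result is None else f"accepted on try {result[0]}")
```

Capping the number of tries keeps the loop from running forever when the criteria are too strict, which is the failure mode described above.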

If anyone has ideas for a less hacky way to solve this problem (and yes, I tried using graphs, but they ended up just too darn symmetrical-looking), I'd love to hear it!

TOP THREE CHARACTERISTIC SIMILAR WORDS MOUSEOVER: Finding the top three most characteristic words shared between the pope and each president was another hack. My first thought was to use a simple Dunning log-likelihood test, with one corpus being the pope's speech concatenated with the speech of the president in question, and the other corpus being the speeches of all of the other presidents. Dunning usually does a good job of determining words that are overrepresented in one corpus, and penalizes words that are either too rare or too common to be statistically significant. But of course, the pope used a bunch of words that no president used, like 'Merton', whom he mentioned five times. So I tried a few variations on the theme, and ended up using, for corpus 1, the maximum frequency (I used a bag of words, not a TFIDF) of each word that both the pope and the president used. Again, it's a little hacky. I also added +1 to every word count in both corpuses so there were no divide-by-zero errors and the algorithm didn't ignore words that the pope and president used but no other president used -- these are definitely of interest in this analysis.
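
Here is a rough sketch of that variation, assuming the two-term form of the log-likelihood statistic with a sign added to show which corpus a word favors. The function names and the toy counts are made up, and the gist may differ in its details.

```python
import math


def shared_max_counts(pope_counts, president_counts):
    """Corpus 1: for each word both speakers used, the larger of the two counts."""
    shared = pope_counts.keys() & president_counts.keys()
    return {word: max(pope_counts[word], president_counts[word]) for word in shared}


def signed_log_likelihood(counts_1, counts_2):
    """Two-term log-likelihood (G2) per word, with +1 added to every count.

    Positive scores mean the word is overrepresented in corpus 1.
    """
    vocab = set(counts_1) | set(counts_2)
    a_counts = {word: counts_1.get(word, 0) + 1 for word in vocab}
    b_counts = {word: counts_2.get(word, 0) + 1 for word in vocab}
    total_1 = sum(a_counts.values())
    total_2 = sum(b_counts.values())
    scores = {}
    for word in vocab:
        a, b = a_counts[word], b_counts[word]
        # Expected counts if the word were equally frequent in both corpora.
        expected_1 = total_1 * (a + b) / (total_1 + total_2)
        expected_2 = total_2 * (a + b) / (total_1 + total_2)
        g2 = 2 * (a * math.log(a / expected_1) + b * math.log(b / expected_2))
        scores[word] = g2 if a / total_1 >= b / total_2 else -g2
    return scores


# Made-up toy word counts, purely for illustration.
pope = {"mercy": 3, "people": 2, "the": 5}
president = {"people": 4, "the": 6, "union": 2}
other_presidents = {"the": 60, "union": 10, "people": 2}

corpus_1 = shared_max_counts(pope, president)
scores = signed_log_likelihood(corpus_1, other_presidents)
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_words)
```

The +1 smoothing is what keeps the logarithms defined when a word is missing from one of the two corpora.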

Again, anyone who has a better solution, I'm all ears!

TOP POPE WORDS & PRESIDENTS WHO USED THEM: I took the top 100 words used by the pope in terms of tf-idf, so it penalized words he used rarely, and words few or no presidents used. Then I determined the tf-idf for each word for each president, listed the top three presidents in terms of tf-idf, and indicated how many presidents in total used the word.

INTERPRETATION OF PARTY RESULTS: Here is a plot of the rank of cosine similarities versus the rank of Jaccard similarities (the Jaccard similarity is simply the ratio of the intersection of terms used to the union of terms used) between the pope and each president. The rank is perfectly preserved, indicating that the preponderance of 'other' presidents among the most similar might be due to some combination of a large intersection of words in common and a small union of total words. In other words, maybe early presidents had low lexical diversity, short speeches, or both.


Plotting length of speeches and lexical diversity (number of unique words divided by total number of words) vs. cosine distance, however, shows that the green 'other' dots tend to be towards the left, but they're still spread out somewhat along the top third of the graph, indicating that perhaps the pope did indeed genuinely share more of a vocabulary with them, just possibly not to the extent that the cosine similarities alone might indicate.
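
Both measures used in this section are short enough to write out. A minimal sketch, with made-up token lists and illustrative function names:

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection of the two vocabularies divided by their union."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return len(vocab_a & vocab_b) / len(union)


def lexical_diversity(tokens):
    """Number of unique words divided by total number of words."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Made-up toy speeches.
pope_tokens = "we the people seek the common good".split()
president_tokens = "we the people of the union".split()
print(jaccard_similarity(pope_tokens, president_tokens))
print(lexical_diversity(president_tokens))
```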

Again, anyone with better tools to analyze this is welcome to contribute!




• • •

Monday, August 17, 2015

A Python script to make choropleth grid maps

In May 2015, there was a sudden fad in the Dataviz community (on Twitter, anyway) for hexagonal grid-type choropleth maps. A choropleth (not "chloropleth") is a map in which areas are filled in with a color whose intensity and/or hue is proportional to a quantity; we've all seen them. The problem with traditional choropleths is they are dependent on area so that, for example, a choropleth of the United States will emphasize relatively deserted Wyoming over highly populated Massachusetts.

One recent partial solution to this has been to create choropleths with equal representations of subunits with squares or hexagons. They're still not proportional to population (attempts to solve this last hurdle generally result in ugly and/or unrecognizable maps), but they're pretty cool nonetheless.

As soon as the hubbub started, I thought it would be relatively painless and an interesting exercise to code up a Python script that would make these choropleths. Since they're geometric, it's a cinch to output SVG vector markup. I got about 90% of the way through this project in three weeks, and then a new job and some health issues interfered, but I finally found a weekend to finish it off, in at least a beta, v.0.1 way.
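
To show why SVG output is such a good fit, here is a minimal sketch that draws a few pointy-top hexagons on an offset grid. It is not the script from the GitHub repository: the function names, the grid positions and the colors are all invented for illustration.

```python
import math


def hexagon_points(cx, cy, size):
    """Six vertices of a pointy-top regular hexagon centered at (cx, cy)."""
    return [
        (cx + size * math.cos(math.radians(60 * i - 30)),
         cy + size * math.sin(math.radians(60 * i - 30)))
        for i in range(6)
    ]


def hex_grid_svg(cells, size=30):
    """cells: label -> (col, row, fill color). Returns SVG markup as a string."""
    width = math.sqrt(3) * size  # horizontal distance between hexagon centers
    shapes = []
    for label, (col, row, fill) in cells.items():
        # Odd rows are shifted half a hexagon to the right so rows interlock.
        cx = width * (col + 0.5 * (row % 2)) + width
        cy = 1.5 * size * row + 1.5 * size
        pts = " ".join(f"{x:.1f},{y:.1f}" for x, y in hexagon_points(cx, cy, size))
        shapes.append(f'<polygon points="{pts}" fill="{fill}" stroke="white"/>')
        shapes.append(
            f'<text x="{cx:.1f}" y="{cy:.1f}" text-anchor="middle" '
            f'dominant-baseline="middle" font-size="{size * 0.5:.0f}">{label}</text>'
        )
    max_col = max(col for col, _, _ in cells.values())
    max_row = max(row for _, row, _ in cells.values())
    svg_width = width * (max_col + 2.5)
    svg_height = 1.5 * size * (max_row + 2)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{svg_width:.0f}" height="{svg_height:.0f}">'
        + "".join(shapes)
        + "</svg>"
    )


# Hypothetical grid positions and colors, not a real map layout.
demo_cells = {
    "WY": (0, 0, "#deebf7"),
    "MA": (1, 0, "#3182bd"),
    "NY": (0, 1, "#9ecae1"),
}
svg_markup = hex_grid_svg(demo_cells)
print(svg_markup[:80])
```

Writing `svg_markup` to a file with an `.svg` extension should be enough to view it in a browser.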

The script is hosted on my GitHub, and here's the IPython notebook/Jupyter tutorial that goes along with it. (There's also a script called Colorbin that's supposed to remove the hassle from associating colors with quantities for the choropleths).

Feedback is welcome. Here are some examples of what the script has produced:























• • •

Saturday, August 8, 2015

Dataset: Single word frequencies per decade from Google Books

I have crunched a public English language dataset in order to remove information that is least likely to be of interest to users, and I offer it for anyone to download:

[1 GB] Google 1-grams English v.20120701 by decade, lowercase, no parts of speech (zipped csv)

The original dataset is licensed under the Creative Commons Attribution 3.0 Unported License.

***

Google Ngram Viewer is an online tool to track the uses of English words from the 16th century to 2008, with the following caveats, among others:

  • It only contains words that were used at least 40 times in any given year, in order to preserve copyright (so you can't tell exactly what book a given word appeared in). This means, for example, if a word occurs 40 times in 1970, 39 times in 1971 and 41 times in 1972, in the database the word will occur 40 times in 1970, 0 times in 1971 and 41 times in 1972.
  • The database is mostly based on library books, so it is heavily biased towards the types of books found in libraries; this includes, for example, directories of names. It is also biased towards the availability of books in a given year, so, for example, 1994 will be much more representatively represented (so to speak) than 1731.
  • A lot of the older books have many, many typos, and many books have the wrong date. I wrote an amusing (I hope) blog post about this.
  • Culturomics, the hyperbolically named organization that prepared the data and made the search tool, warns that data before 2000 should not be compared to data from 2000-2008; they don't explain why, but a reasonable hypothesis would be that the availability of electronic documents drastically changed the nature of the underlying documents and thus their word frequencies.

I have downloaded the over 6 GB of data, converted the words to lowercase, removed part-of-speech tags, converted dates to the decade year and aggregated the results. That means, for example, all of the following 72 entries were collapsed into only one word, "after":

AFteR  AFTEr      aFTER_ADP  AFTER_DET   AFter_NOUN
aFTER  AFTER      AFter_ADP  afTer_DET   AFTeR_NOUN
aFTer  AftEr      aftEr_ADP  AfTER_DET   AFteR_NOUN
afteR  AfTEr      AftEr_ADP  aFter_DET   AFTER_NUM
After  AfTER      AfTER_ADP  aFTER_DET   AFTER_PRON
AfTer  after      aFter_ADP  AfTer_NOUN  after_PRT
AFtER  AFter      after_ADP  AfTER_NOUN  AFTER_PRT
aftEr  AFTER_ADJ  AfteR_ADP  after_NOUN  aFTER_VERB
AFTer  AfTER_ADJ  AFTER_ADP  AFTer_NOUN  AFter_VERB
afTer  after_ADJ  afTer_ADP  afTer_NOUN  AfTER_VERB
AftER  After_ADJ  after_ADV  aFter_NOUN  AFTER_VERB
AFTeR  afteR_ADP  After_ADV  AFTER_NOUN  AFTER_X
aFter  AfTer_ADP  AFTER_ADV  AFTEr_NOUN  
aftER  AftER_ADP  AFter_ADV  AFtER_NOUN  
AfteR  After_ADP  AFter_DET  aFTER_NOUN  

And then, for example, the 10 entries for "after" for 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998 and 1999 are aggregated to only one entry, 1990.

This makes the dataset much easier to use, small enough to hold in memory for most computers, and it smooths out some of the weirdness like all of the different capitalizations.
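
The collapsing described above can be sketched in a few lines. This is an illustration, not the script that produced the download: it assumes rows have already been parsed into (word, year, count) tuples, the tag list is taken from the suffixes in the table above plus CONJ, and the function names are made up.

```python
import re
from collections import defaultdict

# Part-of-speech suffixes such as _NOUN or _ADP at the end of an entry.
POS_SUFFIX = re.compile(r"_(ADJ|ADP|ADV|CONJ|DET|NOUN|NUM|PRON|PRT|VERB|X)$")


def normalize(word):
    """Strip a trailing part-of-speech tag, then lowercase."""
    return POS_SUFFIX.sub("", word).lower()


def aggregate_by_decade(rows):
    """rows: iterable of (word, year, count). Returns {(word, decade): total}."""
    totals = defaultdict(int)
    for word, year, count in rows:
        decade = (year // 10) * 10  # e.g. 1990..1999 all become 1990
        totals[(normalize(word), decade)] += count
    return dict(totals)


# Made-up counts, just to show the collapsing.
sample_rows = [
    ("After", 1990, 5),
    ("AFTER_ADP", 1994, 3),
    ("after", 1999, 2),
    ("after_NOUN", 2000, 1),
]
print(aggregate_by_decade(sample_rows))
```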

If you end up using this data, I'd love it if you dropped me a line. Enjoy.
• • •

Wednesday, May 13, 2015

Let's include Europe in The Great Grid Map Debate

I've been enjoying what Nathan Yau calls 'The Great Grid Map Debate', as different choropleths of the United States with equal-area states have been proposed. I thought we should make this a little less American-centric, so I tried my hand at Europe:

Really tiny countries like Andorra, Liechtenstein, Monaco and Vatican City were left out. I extended Europe as far to the east as the most generous definitions of Europe I could find allowed; those countries can always be left out depending on the dataset. I hope I got all the ISO three-letter abbreviations right; I had memorized the three-letter IOC abbreviations when I was a kid and they kept jumping into my head.

Here's the NPR graphic that started it all:



And here's my current favorite, a four-hex grid, posted by Jason Emory Parker of The Post and Courier this morning:


• • •

Thursday, May 7, 2015

Methodology for Most characteristic words in pro- and anti-feminist tweets

Here I discuss the methodology used in the prooffreader.com post Most characteristic words in pro- and anti-feminist tweets. I both describe what we did (my team and I) and explain the reasoning behind it. I try not to make it too technical.

The team consisted of myself, Zafarali Ahmed, Jerome Boisvert-Chouinard, Dave Gurnsey, Nancy Lin, and Reda Lotfi. We won the Data Science and the Natural Language Processing prizes at the Montreal Big Data Week Hackathon on April 19, 2015.

The code for this post is available in my Github. The code we created during the hackathon is in teammate Zaf's Github, and includes some features we'd like to eventually include, such as tweet frequency tracking and topic modelling. The script downloads the dataset from my website; if you want, you can download it yourself for perusal here (it's 248 MB). If you just want to see the log-likelihood results, they're here. And if you'd like to see if you agree with our manual curation, the set is here. Finally, the actual code that searches Twitter is here; I didn't put it in the Github because it's embarrassingly poorly hacked together, but it works (which gives me not enough incentive to clean it up).

1. Data collection: I wrote a simple Python script that uses the free Twitter Search API and runs every 15 minutes. I have a list of about 20 search terms I'm interested in; I randomly collect the last 100 tweets of each until the limit maxes out. Additionally, at random intervals about once per day, I collect the results for only one search term, as many results as possible. Between January and April 2015, I collected 988,000 tweets containing "feminism", "feminist" and/or "feminists".

2. Curation: We used a simple script to manually curate 1,000 randomly chosen tweets into three categories: pro-feminist, anti-feminist, and other (e.g. neither, neutral, a news report, we couldn't tell, not in English, etc.)

You may have heard of sentiment analysis; this is what we called 'attitude analysis' and it's more difficult. A typical pipeline of sentiment analysis uses curated movie reviews and makes inferences from texts containing words like 'hate', 'love', 'like', 'terrific', 'awful', i.e. positive, negative, subjective and/or non-subjective language.

For example, look at these two hypothetical tweets:
    1. Man, do I ever hate feminists.
    2. I hate that my mom does not like the word 'feminism'.

Sentiment analysis will typically consider these tweets similar due to their use of 'hate', but they're diametrically opposed in attitude. By using machine learning we try to see if other words ('man', 'ever', 'feminists' vs 'feminism', 'mom', etc.) are good predictors of the underlying attitude. (We verified that sentiment analysis was not up to the task; you can see it in the GitHub repo.)
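To make the problem concrete, here is a minimal sketch of a word-list sentiment scorer applied to the two hypothetical tweets above. The tiny lexicon and the `sentiment_score` function are assumptions made up for illustration; this is not the sentiment tool tested in the repo.

```python
import re

# Hypothetical five-word lexicon; real sentiment tools use far larger ones.
LEXICON = {"hate": -1, "awful": -1, "love": 1, "like": 1, "terrific": 1}

def sentiment_score(text):
    """Sum the lexicon weights of every word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

tweets = [
    "Man, do I ever hate feminists.",
    "I hate that my mom does not like the word 'feminism'.",
]
scores = [sentiment_score(tweet) for tweet in tweets]
print(scores)
```

The first tweet scores -1 and the second 0. Both trigger on 'hate', and neither score says anything about whom the feeling is directed at, which is exactly the information attitude analysis needs.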

The manual curation was time-consuming, especially since we really wanted to get it right. If we were in any doubt (e.g. if we thought the tweet might be sarcastic) we classified it as 'other'. The final count was about 50% neither, 25% pro and 25% anti.

3. Removal of 'low-effort' tweets. We wanted tweets that people spent time writing, choosing their words deliberately (if not always carefully). In other words, we wanted to eliminate anything that could be tweeted with just a click of a button, like retweets or those 'Share me' links on websites. From our original 988,000 we were left with 390,000 after this adjustment. (People sure do retweet/share a lot.) (BTW, to minimize the shared tweets, we just eliminated duplicates, leaving only one copy in the database.)
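The filtering step could be sketched roughly as follows. Two things here are assumptions and not taken from the hackathon code: retweets are recognized by a leading "RT @", and two tweets count as duplicates when they match after lowercasing and collapsing whitespace.

```python
def remove_low_effort(tweets):
    """Drop retweets and keep only the first copy of each duplicated text."""
    seen = set()
    kept = []
    for text in tweets:
        if text.startswith("RT @"):            # assumed retweet marker
            continue
        key = " ".join(text.lower().split())   # assumed duplicate test
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept

# Made-up sample tweets, for illustration only.
sample = [
    "RT @someone: Feminism is great",
    "Read this article on feminism",
    "Read this  article on FEMINISM",
    "I wrote this one myself",
]
kept = remove_low_effort(sample)
print(kept)
```

Only one copy of the shared text survives, along with the tweet someone actually typed.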

4. Tokenizing. Since tweets are limited to 140 characters, we did not eliminate stopwords (articles, pronouns, linking verbs, common prepositions, etc.) on the theory that they were carefully chosen and significant. We eliminated very common punctuation, but considered most non-alphabetic characters (quotation marks, question marks, special characters, emoji, etc.) as separate tokens. (This let us see if one group quoted, questioned, exclaimed, etc. more than another.) We also considered 'curly quotes' different from 'straight quotes', since they are used most often in copied-and-pasted text, but we did not see a significant difference between groups once duplicate tweets were removed. (We saw a big difference in favor of pro-feminism before the removal, indicating they are the ones who are 'sharing' from traditional news sites.)
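A tokenizer along these lines can be sketched with one regular expression. The choice of commas and periods as the 'very common punctuation' to drop is an assumption, as is the lowercasing; the post does not say exactly which characters were removed.

```python
import re

# Assumption: commas and periods stand in for the dropped "very common punctuation".
DROPPED = {",", "."}

# A word (apostrophes allowed inside it), or any single non-space symbol.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(tweet):
    """Split a tweet into words and symbol tokens, keeping stopwords."""
    tokens = TOKEN_RE.findall(tweet.lower())
    return [token for token in tokens if token not in DROPPED]

print(tokenize('Why, though? "Feminism".'))
print(tokenize("I don't \u201cget\u201d it!"))
```

Because every symbol is its own token, a curly quote (U+201C) and a straight quote (U+0022) end up as different entries in the vocabulary, so their frequencies can be compared between groups.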

5. Classification. We used machine learning to classify the 390,000 tweets as pro- or anti-feminist (or other) based on our 1,000 tweets. We used a bag-of-words approach, i.e. taking the set of all words used in all tweets, and seeing which words are used in individual tweets and whether any of those individual words are good predictors of the class (pro-, anti-, other) of the tweet. We used a Naive Bayes classifier because it's commonly used in NLP (Natural Language Processing) for bag-of-words tasks. We sequestered 25% of our curated tweets to test our classifier; it classified about 90% of the non-sequestered tweets correctly, but only about 40% of the sequestered tweets correctly. This is called 'overfitting', and the obvious solution is to curate more tweets. (Maybe Mechanical Turk could help?) Still, one important measure of false-positive rates was around 60%, which is pretty good for Twitter data, indicating that we weren't getting a lot of misclassifications inside the two classes of interest, pro- and anti-; most of our errors were false negatives, i.e. real pro- and anti- tweets being classified as 'other'. We can live with that; false positives in this scenario are far worse than false negatives.
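For readers who have not seen one, here is a minimal sketch of a bag-of-words Naive Bayes classifier in plain Python, with add-one smoothing. It is not the hackathon code (the function names and the toy tweets are made up for illustration), but it shows the train, hold-out and compare-accuracies loop described above.

```python
import math
from collections import Counter, defaultdict

def train(labelled):
    """labelled: list of (tokens, label) pairs. Returns the counts needed to predict."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the class with the highest log prior plus summed log word likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in tokens:
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, labelled):
    """Fraction of (tokens, label) pairs the model classifies correctly."""
    hits = sum(predict(model, tokens) == label for tokens, label in labelled)
    return hits / len(labelled)

# Toy, made-up curated set: 75% for training, 25% sequestered for testing.
training = [
    (["proud", "feminist", "equality"], "pro"),
    (["equality", "matters"], "pro"),
    (["feminists", "ruin", "everything"], "anti"),
    (["tired", "of", "feminists"], "anti"),
    (["article", "published", "today"], "other"),
    (["panel", "starts", "today"], "other"),
]
held_out = [
    (["feminists", "ruin", "it"], "anti"),
    (["equality", "for", "all"], "pro"),
]
model = train(training)
print("training accuracy:", accuracy(model, training))
print("held-out accuracy:", accuracy(model, held_out))
```

A large gap between the two printed accuracies is the overfitting symptom described above. The toy set is far too small and too tidy to show it, but the comparison is made the same way with real data.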

6. We calculated the log-likelihood of every word/token that appeared in both pro- and anti- tweets at least 10 times, as a measure of how characteristic they were of one class as opposed to the other. It's a measure of statistical significance (a test statistic from which a p-value can be derived), which can be misused, but we're using it properly.

My best explanation in layman's terms is, we choose a word and see how common it is in each dataset. The log-likelihood is a measure of how 'surprised' we would be if we mixed up the two datasets and divided them randomly, and then came up with a similarly unequal distribution. In other words, it's a measure of the odds that our observation is due to chance rather than the true nature of our datasets.

Log-likelihood is a handy measure because it takes into account both the ratio and the absolute values of frequency differences between sets. In other words, if "goldilocks" (to pick a word totally at random) appeared 10 times in the anti-feminist tweets and 20 times in the pro-feminist tweets, it would have a lower log-likelihood than if it appeared 100 times and 200 times, respectively, even though the ratio between them is the same. How much lower depends on the total size of the dataset; we're more 'surprised' to find differences in large datasets (which are more predictable) than in small ones.
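Here is the "goldilocks" comparison worked through in code. The sketch assumes the two-term log-likelihood formula widely used in corpus linguistics (observed counts compared with the counts expected if the word were spread evenly across both sets); the post does not say which variant was used, and the corpus sizes below are made up.

```python
import math

def log_likelihood(count_a, count_b, size_a, size_b):
    """Log-likelihood (G2) of one word's counts in two corpora of given token sizes."""
    total = count_a + count_b
    # Expected counts if the word were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    if count_a > 0:
        ll += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        ll += count_b * math.log(count_b / expected_b)
    return 2 * ll

# Hypothetical corpus sizes, in tokens, for the anti- and pro- sets.
ANTI_SIZE = 1_000_000
PRO_SIZE = 1_000_000

small = log_likelihood(10, 20, ANTI_SIZE, PRO_SIZE)
large = log_likelihood(100, 200, ANTI_SIZE, PRO_SIZE)
print(round(small, 2), round(large, 2))
```

With equal-sized sets this gives about 3.4 for 10 vs. 20 occurrences and about 34 for 100 vs. 200: the ratio is the same, but the larger counts are ten times more 'surprising'.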

It's been my experience that log-likelihood is pretty robust with imperfectly classified data, like this. In other words, the most characteristic words might not be exactly the same words in the same order if the classification were better, but there would be very little substantive change, i.e. words changing their relative values spectacularly.

I'm interested in comments and questions; feel free! I have a thick skin and I acknowledge I'm capable of mistakes!



• • •