Monday, November 14, 2016

I scraped all the 2016 U.S. election data

I'm a weird guy. I enjoy building web scrapers; I find it relaxing. I have no real plans to use the 2016 U.S. election data, no particular horses to grind (I'm not too thrilled at the outcome, but hey), but I've been hanging around /r/datasets on Reddit, and lots of people were asking for the data and wondering if someone was going to scrape it. So I did.

All of the data is in this GitHub repo. Obviously, I did not put a license of any sort on it, so feel free to use it. If you end up doing any interesting analyses with it (even boring ones, I'm not picky) I'd love to hear about it!

If I understand correctly, official verified (as opposed to reported) results will start to be released by the states in 2-4 weeks, but it seemed a shame the data was not, as far as I could tell, easily available right now.

(Of course, given my experience as to how the universe works, probably there will be a better data dump of all this info somewhere soon, or someone will point out it already exists somewhere my Google Fu wasn't strong enough to find, making my 10-12 hours of effort redundant. But it wouldn't have happened if I hadn't forced fate's hand!)

Is scraping legal? Yes. Is it ethical? Pretty much. Feel free to disagree; my position is that this is all public data, not in any copyright, presented publicly, and I scraped it by automating an actual web browser, so I did not use up any more of the websites' resources than a regular visitor. If I have breached any terms of service, I'll just have to live with the consequences, of which there are likely to be none. Here's a good Quora post about the subject.





• • •

Tuesday, October 11, 2016

Hugo Larochelle's neural network & deep learning tutorial videos, subtitled & screengrabbed

Like a lot of data scientists (I consider myself more of a data spelunker, but I aspire to data science), I try my best to keep up with the latest discoveries in a very fast-changing field; and probably nothing has been as game-changing as the advent of deep learning.

Deep Learning, explained to a five-year-old (okay, maybe fifteen-year-old): Data science has been really good for a while now at data that can be laid out in Excel spreadsheets, i.e. columns and rows: one row per observation, one column per variable. This is called structured data. Deep Learning allows us to create rows of column variables that describe a representation of unstructured data, like images or text. It's as if you had an automatic algorithm that could look through all your images, and create one column based on the likelihood the image contains a cat, another the likelihood it contains a shovel -- without having to tell the algorithm what a cat or shovel is, or what they look like, or determine that there are cats and shovels at all before running the algorithm.
Deep Learning is rather math-intensive, and involves neural networks, a family of algorithms that's been around for a long time but has now come into its own. Unlike some skills, you can't learn it as a black box and then slowly come to understand it as you use it. There are foundations you need to acquire; tutorials you need to absorb.

I live in Montreal, which recently hosted its annual Deep Learning Summer School; I couldn't attend, but I heard great things about the lecture by Université de Sherbrooke's Hugo Larochelle.

There's just one thing; I hate listening to videos. It's why I don't take Coursera classes now that I only have a short commute to work every day. I need to learn at my own pace. And I prefer to read.

So when I realized Larochelle's lecture was based on a series of 92 videos on his YouTube channel, I wrote a Python script to add a black bar beneath them, burn subtitles into it, and take screenshots of every subtitle slide and make a pdf out of it so I can read them. I like to read.

Here's an example screenshot:


I'm sharing the fruits of my labor with you here: Videos with subtitles, pdfs of subtitled screenshots, and Python code I used to make them.



Hugo Larochelle neural network lecture videos
& pdfs with subtitles


These are zip files of subtitled videos and pdfs of screenshots made from Hugo Larochelle's (University of Sherbrooke) YouTube playlist of 92 videos in 10 parts on neural networks.


Videos with subtitles:

  1. Subtitled MP4s for Part 01, Feedforward neural networks [zip, 108.5 MB]
  2. Subtitled MP4s for Part 02, Training neural networks [zip, 238.6 MB]
  3. Subtitled MP4s for Part 03, Conditional random fields [zip, 250.2 MB]
  4. Subtitled MP4s for Part 04, Training CRFs [zip, 106.6 MB]
  5. Subtitled MP4s for Part 05, Restricted Boltzmann Machine [zip, 169.1 MB]
  6. Subtitled MP4s for Part 06, Autoencoder [zip, 136.8 MB]
  7. Subtitled MP4s for Part 07, Deep Learning [zip, 226.8 MB]
  8. Subtitled MP4s for Part 08, Sparse coding [zip, 152.8 MB]
  9. Subtitled MP4s for Part 09, Computer vision [zip, 191.4 MB]
  10. Subtitled MP4s for Part 10, Natural Language Processing [zip, 289.5 MB]


PDFs of screenshots:

  1. PDFs for Part 01, Feedforward neural networks [zip, 120.9 MB]
  2. PDFs for Part 02, Training neural networks [zip, 287.0 MB]
  3. PDFs for Part 03, Conditional random fields [zip, 284.2 MB]
  4. PDFs for Part 04, Training CRFs [zip, 132.7 MB]
  5. PDFs for Part 05, Restricted Boltzmann Machine [zip, 189.8 MB]
  6. PDFs for Part 06, Autoencoder [zip, 161.1 MB]
  7. PDFs for Part 07, Deep Learning [zip, 307.9 MB]
  8. PDFs for Part 08, Sparse coding [zip, 200.9 MB]
  9. PDFs for Part 09, Computer vision [zip, 239.8 MB]
  10. PDFs for Part 10, Natural Language Processing [zip, 371.6 MB]



In a Python script (which you can see here: Part 1, Part 2), I:
  • used requests and BeautifulSoup to parse the YouTube playlist;
  • used youtube-dl to download the videos and WEBVTT subtitles;
  • used pycaption to convert subtitles to SRT format;
  • used ffmpeg (from a subprocess call) to add a black letterbox below each video, burn the subtitles into that box and then save png screenshots wherever there was a new subtitle line;
  • used imagemagick to bundle pngs into pdfs;
  • used zipfile to zip similar files together and deleted the originals.
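
As a rough illustration of the ffmpeg step above, here is a minimal sketch (not the actual script linked above) that builds a command line using ffmpeg's `pad` and `subtitles` video filters. The file names and the 120-pixel bar height are assumptions made up for the example, and the function only builds the argument list; it does not run anything.

```python
from pathlib import Path


def build_burn_in_command(video, subtitles, output, bar_height=120):
    """Return an ffmpeg argument list; nothing is executed here.

    bar_height and all file names are illustrative assumptions.
    """
    filters = ",".join([
        # Extend the frame downward by bar_height pixels, filled with black.
        f"pad=iw:ih+{bar_height}:0:0:black",
        # Draw the SRT cues onto the padded frame.
        f"subtitles={Path(subtitles).as_posix()}",
    ])
    return ["ffmpeg", "-i", str(video), "-vf", filters, "-c:a", "copy", str(output)]


cmd = build_burn_in_command("lecture_01.mp4", "lecture_01.srt", "lecture_01_sub.mp4")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Padding before drawing the subtitles matters: the cues are placed relative to the padded frame, so they should end up in or near the black bar rather than over the slides.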



I'm David Taylor, aka prooffreader. About me


• • •

Tuesday, March 29, 2016

Context for prooffreader.com post about White House petitions

Since this is so long, I stuck it here instead of in the main prooffreader.com post. For each of the top 25 most characteristic words in successful petitions, and each of the top 10 most characteristic words in unsuccessful petitions, this is a list of the titles of five randomly selected petitions containing that word.


### SUCCESSFUL PETITIONS ###


gun
- Create ‘National Gun Safety Day’ to Make Gun Responsibility a Part of Mainstream American Culture
- Keep guns in America! No weapons ban!
- Honor our service members who used their personal firearms to fight back against the terrorist attacker in Chattanooga.
- To award the Medal of Freedom to the 4 Firefighters who were ambushed in West Webster New York on Christmas Eve 2012
- RESIGN!

tragedy
- Immediately sign Executive Order banning sale of assault weapons and high-capacity magazines until Congress acts on this
- Enforce ceasefire and send humanitarian/international aid to  Syria to stop violence and slaughter of innocent people.
- To Honor the Granite Mountain Hot Shot Team with the Public Safety Officer Medal
- Spend the $25 million already appropriated on a supercomputer for increased weather-prediction capabilities.
- Disclose the Truth About Benghazi 1) Who Issued the Stand Down Order? 2) Who Concocted the YouTube Lie? WE DEMAND TRUTH!

access
- Enact the Restroom Access Act, or Ally's Law
- Promote dystonia awareness by recognizing Dystonia Awareness Month in September
- MODERNIZE THE RAIL NETWORK INTO A HIGH-CAPACITY, GRADE-SEPARATED, ELECTRIFIED SYSTEM TO SERVE FREIGHT AND PASSENGERS
- To force cellular carriers to allow us to install Google Wallet and use this feature on all Google NFC devices.
- Admonish OCR for its Arbitrary Action and Request that it Withdraw its Directive for Illinois School District 211

imperative
- Enforce the tax code, and strip violating Religious institutions of their tax exempt 501(c) status.
- protect coal ash recycling by promptly enacting disposal regulations that do NOT designate coal ash a “hazardous waste.'
- Require that Insurance Companies offer PPO plans to Individuals and small groups in all states.
- Establish Lunar New Year as a National Holiday. Give it the same importance and weight as the other cultural holidays.
- Mandate all freight trains have two-person crews.

regulatory
- Appoint Susan Crawford as FCC Chairman
- reverse Secretary Sebelius's decision to restrict access to emergency contraception.
- Call on Congress to repeal the job-killing, USPS-strangling Postal Accountability and Enhancement Act of 2006.
- Reform ECPA: Tell the Government to Get a Warrant
- impose relatively quick and effective sanctions on Russia, the sponsor of terrorismus

halbach
- Pardon Steven Avery and Brendan Dassey for their alleged involvement in the murder of Teresa Halbach.
- Retrie Steven Avery in the murder case of Theresa halbach
- Federally Investigate the Manitowoc County Sheriffs Department and the County Court System for Criminal Behavior.
- Encourage the Judiciary to re-examine the trials of Steven Avery and Brendan Dassey, & the search for Halbach's killer.
- Open discussion with Wisconsin Gov. Scott Walker allowing a retrial by way of due process in the case of Steven Avery

westboro
- Define the Westboro Baptist Church as a hate group due to promoting animosity against differing cultural demographics.
- Revoke the tax exempt status of the Westboro Baptist Church & re-classify Westboro Baptist Church as a hate group.
- Ban protests for any cause within three hundred feet of any funeral service, both during and two hours before and after.
- Make Petitioning funerals illegal.
- Investigate the IRS Tax-Exempt Status of the Westboro Baptist Church

clinical
- Include Gender Reassignment Surgery within the ACA, Medicare/aid, and the Veterans Administration.
- Change the Direction of EHR Technology, in Order to Improve Healthcare Outcomes and Control Costs
- Urge the FDA to grant Accelerated Approval to critical new treatments for cystic fibrosis.
- Recognize pharmacists as health care providers!
- authorize the FDA to grant a compassionate use exemption to Refael Elisha Cohen for Antineoplaston therapy.

superior
- Initiate a civil rights investigation against Texas Attorney General Ken Paxton for civil rights violations.
- Secure resources and funding, and begin construction of a Death Star by 2016.
- Immediately halt the cruel and unnecessary use of monkeys in Army chemical casualty management training courses.
- Stop Apache Land Grab
- Demand Brooklyn District Attorney Kenneth P. Thompson to withdraw indictment against Asian minority Officer Peter Liang!

abiding
- Pass the DREAM act or Development, Relief, and Education for Alien Minors
- Dissolve any petitions on an Assault Weapons Ban as unconstitutional under amendment II  of the Constitution
- Expedite the process of obtaining a Green Card
- Address our petition for redress of grievance against any proposed legislation violating our 2nd amendment rights.
- Lower the American flag to half staff nationally to honor the death of Fr. Theodore Hesburgh.

baptist
- Ban protests for any cause within three hundred feet of any funeral service, both during and two hours before and after.
- BAN THE WESTBORO BAPTIST CHURCH FROM ENTERING BOSTON AND PICKETING THE FUNERALS OF THOSE WHO DIED DURING THE BOMBING
- Revoke the tax exempt status of the Westboro Baptist Church & re-classify Westboro Baptist Church as a hate group.
- Define the Westboro Baptist Church as a hate group due to promoting animosity against differing cultural demographics.
- Investigate the IRS Tax-Exempt Status of the Westboro Baptist Church

avery
- Retrie Steven Avery in the murder case of Theresa halbach
- Pardon Steven Avery and Brendan Dassey for their alleged involvement in the murder of Teresa Halbach.
- Defund our state religion: The Church of Political Correctness whose dogma is WHITE GENOCIDE
- Open DOJ investigation into Manitowoc County Sheriff Depart for convictions of Steven Avery and Brendan Dassey
- Work with other world governments to stop human trafficking and torture by Bedouins in the Sinai.

definitely
- take steps with our allies and the UN to put pressure on the Iranian government to release Pastor Behnam Irani.
- ban the use of gas chambers for killing shelter companion animals. They are inhumane, expensive and dangerous to humans.
- Repeal the unconstitutional NDAA and FISA which allow intelligence agencies to secretly spy on US Citizens!
- Advance EB2 and EB3 priority dates for India and China
- We petition Obama Administration to consider revoking the citizenship of Ms. Eun Mi Shin who is praising N. Korea now.

connecticut
- Address The Civil Rights Violations By The State Of Connecticut Against Gerald O'Donnell and George Gould
- Extradite Warren Anderson, Bhopal fugitive and bail-jumper, to India to face charges related to thousands of deaths
- Use executive authority to reinstate the Federal Assault Weapons Ban of 1994 (expired 2004).
- Call on Congress to pass sensible gun legislation from the pediatricians of America
- Not punish the tens of millions of law-abiding gun owners with ineffective and unconstitutional 'assault weapons' bans.

sensible
- Preserve 6 Day Mail Delivery
- Replace Gil Kerlikowske
- begin a national conversation on sensible gun control.
- Establish federal gun control laws
- reinstate the public tours of the White House. This is the people's house and they deserve the opportunity to tour it!

brendan
- Encourage the Judiciary to re-examine the trials of Steven Avery and Brendan Dassey, & the search for Halbach's killer.
- Initiate a Federal Investigation of the Sheriff's Offices of Manitowoc County and Calumet County, Wisconsin
- Pardon Steven Avery and Brendan Dassey for their alleged involvement in the murder of Teresa Halbach.
- Open DOJ investigation into Manitowoc County Sheriff Depart for convictions of Steven Avery and Brendan Dassey
- Open discussion with Wisconsin Gov. Scott Walker allowing a retrial by way of due process in the case of Steven Avery

turnout
- Make Election Day a National Holiday.
- Make Election Day a National Holiday
- propose legislation that would make all federal election days national holidays to increase voter turnout.
- There are election rigging made by Progressive Program that have been used in the 18th Presidential Election of S. KOREA
- Designate Election Day as a national holiday.

recount
- Recount the election!
- There are election rigging made by Progressive Program that have been used in the 18th Presidential Election of S. KOREA
- Demand from El Salvador's Electoral Tribunal to hold a vote-by-vote recount of March 9th's presidential elections
- Cut off the $1.3B in aid annually to Egypt, unless they bring to justice the killers of the massacre of Coptic Christian
- Call upon the International Community to urge that a full recount of votes be done in Venezuela's presidential elections

dassey
- Initiate a Federal Investigation of the Sheriff's Offices of Manitowoc County and Calumet County, Wisconsin
- Open DOJ investigation into Manitowoc County Sheriff Depart for convictions of Steven Avery and Brendan Dassey
- Pardon Steven Avery and Brendan Dassey for their alleged involvement in the murder of Teresa Halbach.
- Encourage the Judiciary to re-examine the trials of Steven Avery and Brendan Dassey, & the search for Halbach's killer.
- Open discussion with Wisconsin Gov. Scott Walker allowing a retrial by way of due process in the case of Steven Avery

practically
- Peacefully grant the State of Texas to withdraw from the United States of America and create its own NEW government.
- Restore humane horse slaughter to improve horse welfare, stop needless & wasteful suffering & create jobs.
- Wage Justice, Debt Payment to Cops of Puerto Rico, and Restitution of their Setirement System
- Allow commercial flights to Cuba.
- Make the Metric system the standard in the United States, instead of the Imperial system.

conversation
- increase NASA funding to 1% of the federal budget
- Work with Congress to close the 13th Amendment slavery loophole
- Designate the sea between the Korean Peninsula and islands of Japan as 'East Sea/Sea of Japan' in all maps and texts.
- Stop Cyberlaw HR4681
- record all senate and congressional phone, email, and chat correspondence and make it public for anyone to see.

produce
- Release the full text of the Trans-Pacific Partnership to the American public ahead of a Congressional vote.
- Immediately halt the cruel and unnecessary use of monkeys in Army chemical casualty management training courses.
- Stop San Francisco from abusing David Gizzarelli and his puppy Charlie and return Charlie home where he belongs.
- Put Authentic America Made Products in our National Park Gift Shops! Using Presidential Executive Order power.
- Enact legislation guaranteeing women equal access & opportunity for employment in government funded arts organizations.

columbia
- Fully Legalize Same-Sex Marriage Across The Nation
- Award Yogi Berra The Presidential Medal of Freedom for his military service and civil rights and educational activism.
- Stop the hazing of sea lions in Washington & Oregon & have the fws publicly report the truth about decline salmon.
- Restore Net Neutrality By Directing the FCC to Classify Internet Providers as 'Common Carriers'.
- to Impeach all politicians who signed the Norquist oath in violation of their oath to the government of the USA.

ct
- call for the Chinese authority to stop blocking major Internet services, such as Gmail, via the Great Firewall
- give assistance to the homeless people and also to the poor
- Reform copyright law to allow libraries to keep digital copies of ebooks and other media.
- Provide More Transparency Around Government Surveillance of Internet Users
- JUSTICE FOR TURKISH VICTIMS OF 1915 CIVIL WAR WHICH WAS COMMENCED BY ARMENIAN BANDITS IN OTTOMAN SOIL.

regulation
- Continue to allow Americans to import life-sustaining prescription drugs from safe international online pharmacies.
- Legalize Concealed Carry for Soldiers on Military Installations
- make experimental drugs available for terminally ill ALS patients through existing FDA Accelerated Approval Program.
- Stop using Homeland Security funds to seize imported vehicles, and change the DOT/EPA exemption to 15 years.
- Rule the 'NY SAFE ACT' to be UNCONSTITUTIONAL!


### UNSUCCESSFUL PETITIONS ###


condition
- Help the families with open adoption cases in Russia bring their babies home
- Award the Presidential Medal of Freedom to each Woman in the video “STOP WHITE GENOCIDE”
- Require FAA to re-examine its 65 Decibel (dBA) noise safety level and consider 55 dBA as a new standard for human health
- (Jul #2 of 4) Allow White Americans to vote 'Yes' or 'No' on WHITE GENOCIDE!
- Support and sign into law H.R. 2858 the Wildland Firefighter Protection Act

2014
- Allow Staff Sergeant Carl Lee Wheless, Jr., U.S. Army, to continue on Federal active duty an additional 55 days.
- Look into the terrible and despicable officiating of the Lions and Cowboys NFC Wildcard game that took place on 1/4/2015
- Revoke The Tennessee Religious Viewpoints Anti-discrimination Act
- CANCEL White House Invitation To PM Modi-Organizer of 2002 Massacre of MUSLIMS. Ban BJP For 1984 Attack On Golden Temple
- help free Nadiya Savchenko

often
- SUPPORT more aggressive SARCOIDOSIS RESEARCH, add SARCOIDOSIS & it's Complications to SSA Compassionate Allowance List!
- Offer a federal tax credit of $1,000 per child to homeschooling families for educational expenses incurred.
- Pass legislation to ensure that military working dogs be classified as canine members of the armed services.
- Sign An Executive Order Barring Federal Agencies And Contractors From Discriminating On The Basis Of Prior Conviction
- Support and Implement Instant Runoff Voting.

duty
- exonerate Arnold Abbott!
- Military POV Shipping Contract Review (International Auto Logistics)
- retract its support of the current proposed increase of Tricare fees for active duty and retired military members.
- Allow United States Military service members to place their hands in their pockets.
- To keep its word to improve access to behavioral/mental health care for active duty, veterans, and all Americans.

physical
- Speak out against anti-whites who use the word “hate” to promote White Genocide
- (April #1 of 6) STOP WHITE GENOCIDE! Halt MASSIVE third world immigration and FORCED assimilation in White countries!
- Appoint board of inquiry: Was the Charleston shooting a tragic result of anti-Whites' program of White Genocide?
- (Aug #1 of 6) STOP WHITE GENOCIDE! Halt MASSIVE third world immigration and FORCED assimilation in White countries!
- Proclaim Oct. 29, 2011, as World Psoriasis Day

down
- remove the South Carolina Confederate flag
- Declare an executive order for all executive departments and agencies to be closed 12/26/2014, for a four-day weekend.
- Call for the resignation of Texas Lt. Gov. David Dewhurst based on misconduct during Wendy Davis' filibuster of sb5.
- Help fight for Parrot. Parrot was a innocent dog who got killed by a Police officer in Washington DC.
- arrest and prosecute the House GOP for treason.

genocide
- LIST ERDOGAN’S TURKEY AS STATE SPONSOR OF TERRORISM; VOID U.S. ALLIANCE WITH TURKEY
- Demand The Release of Supreme Religious Leader of Sikhism, Jathedar Jagtar Singh Hawara, Head  of  Sri Akal Takhat
- Award the Presidential Medal of Freedom to each Woman in the video “STOP WHITE GENOCIDE”
- Tell White Americans: 'White folk, look at your own family, or families you know, and see it happening.'
- Join presidential candidate Bob Whitaker in approving Donald Trump to head Department of Deportation NOW!

i
- Use the executive powers to fix the broken immigration system for legal immigrants
- Before Obama leaves his office, He’d better apologize his black people in Ferguson for calling them “criminals and thugs
- Recognize the Confederate flag as equal to the official national flag of the United States of America
- Stop States From Legalizing Discrimination
- Allow non-violent marijuana drug offenders to serve 65% of federal sentences as oppose to the current 85%.

code
- Explain why Asia is for the Asians, Africa is for the Africans, but White countries are for EVERYBODY
- Reverse our nation’s “No Child Left White” policy.
- Candidate Bob Whitaker invites the President and Mr. Trump to his seminar, 'Diversity is a code word for WHITE GENOCIDE'
- Censure the anti-white media for trivializing 'Polar Bear Hunting' as a 'game'
- Designate National White GeNOcide Day

say
- (#3 of 6 for March) Teach public school children the truth: ANTI-RACIST IS A CODE WORD FOR ANTI-WHITE!
- make 'Trophy Hunting' of endangered species outside of the U.S. illegal for all U.S. citizens.
- Publically ask Postmaster General & Congress to create a new  fundraising stamp called Stamp Out PTSD to Honor Veterans
- Petition urging the Congress and President Obama to intervene to stop violations of children's rights in Vietnam.
- Stop ALL construction of the Keystone tar sands pipeline once and forever !!

• • •

Tuesday, October 27, 2015

They Might Be Random Fingertips


I have a bit of an obsession with Fingertips, a sequence of 21 unrelated songs mostly between 4 and 12 seconds long from They Might Be Giants' 1992 album Apollo 18. The liner notes say to put the CD on random shuffle so that the brief snippets play in a random order.

There are 51,090,942,171,709,440,000 ways to permute 21 items -- that's a little excessive, so I made YouTube videos of 100 such permutations. I selected two video clips for each track, mostly by searching for the track name in YouTube, and assigned them randomly, so there should be some visual surprises upon multiple viewings.
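
For the curious, the count above is just 21 factorial, and drawing random track orders takes a couple of lines of Python. This is a small sketch, not the code used for the videos; the seed value is an arbitrary assumption so the example is repeatable.

```python
import math
import random

N_TRACKS = 21

# Number of ways to order 21 distinct tracks.
print(f"{math.factorial(N_TRACKS):,}")  # 51,090,942,171,709,440,000


def random_orders(n_tracks, n_orders, seed=None):
    """Return n_orders random play orders of tracks numbered 1..n_tracks."""
    rng = random.Random(seed)
    tracks = list(range(1, n_tracks + 1))
    return [rng.sample(tracks, k=n_tracks) for _ in range(n_orders)]


orders = random_orders(N_TRACKS, 100, seed=2015)  # seed is arbitrary
print(orders[0])
```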

Here's a link to the YouTube channel I made to hold all 100 videos.

And here is Random Fingertips #1 of 100:




• • •

Monday, September 28, 2015

Methodology for comparison of Pope Francis address with presidential inaugural addresses

This post goes deeper into the methodology of this post on Prooffreader.com.

The code is in this gist.

Here is the graphic for that post:




DATASET: Links are in the gist. Note that Pope Francis's address is the one published and given to the media, not the one actually delivered. This may well be the case for inaugural addresses as well.

TFIDF: A standard technique in NLP (natural language processing): Wikipedia entry. I removed the default English stopwords in the TextBlob module (although, of course, TFIDF will give low scores to stopwords anyway.)
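
For readers who haven't met TFIDF before, here is a minimal pure-Python sketch of the textbook version (term frequency times log inverse document frequency). It is not the code from the gist, which may weight or smooth things differently, and the three toy "speeches" are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {name: list of tokens}. Returns {name: {term: tf-idf weight}}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


# Toy documents, made up for illustration only.
toy_docs = {
    "pope": "the dialogue of the people".split(),
    "president_a": "the union of the people".split(),
    "president_b": "the economy and the nation".split(),
}
vectors = tfidf_vectors(toy_docs)
print(vectors["pope"])
```

A word that appears in every document (like "the" here) gets a weight of zero, which is why TFIDF gives low scores to stopwords even if you don't remove them.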

CALCULATION OF SIMILARITY BY PARTY: The average cosine similarities for presidents by party were:

republican 0.825
democratic 0.832
other 0.891
range: 0.743-0.968

In order to create a metric that would be useful when hacking the t-SNE, below, I calculated the % of presidents from each party who numbered among the top half of similarity scores:

republican 0.417
democratic 0.500
other 1.000
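
Both numbers above are simple to compute. The following is a sketch under stated assumptions rather than the gist's code: vectors are plain dicts of term weights, and the president names, parties and scores are placeholders invented for the example.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_half_share(similarity, party):
    """Fraction of each party's members whose score ranks in the top half."""
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    top = set(ranked[: len(ranked) // 2])
    totals = Counter(party.values())
    hits = Counter(party[name] for name in top)
    return {p: hits[p] / totals[p] for p in totals}


# Placeholder data: four hypothetical presidents and their similarity to the pope.
similarity = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6}
party = {"A": "other", "B": "democratic", "C": "democratic", "D": "republican"}
shares = top_half_share(similarity, party)
print(shares)
```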

TSNE: My favorite manifold learning algorithm, but it's not without its problems, mostly in terms of reproducibility. This is less of a problem when it's used to show similarities in a group in general, because it spreads the error inherent in such dimensionality reduction around to all points (a different way each time it's run), so in general it minimizes error at each point. But this is not what I wanted here. The thrust of this analysis is to portray the similarity between one point (the pope) and every other point (the presidents); I therefore needed to privilege this point (and similarity vector), minimize its error the most, and spread whatever error I saved among the other points. In other words, the president-pope distances would be more accurate than the president-president distances.

There are several possible solutions to this problem; I went with a brute-force hack! I simply re-ran the t-SNE over and over until (a) the pope point was more or less in the center, and (b) the percentages of republicans, democrats and others closest to the pope were reasonably close to those in the similarity matrix.

Oh, and there was a third criterion: the t-SNE had to look aesthetically pleasing, in a more or less globular shape without too many points overlapping so the mouseovers weren't annoying.

Then it just became a matter of tuning. If my criteria were too strict, I'd never find a solution. Less strict, and I'd find a solution every few minutes, but they might not look nice and I'd have to start again. Finally, I went with a solution (% democrats in top half - % republicans in top half > 4%, %other in top half - % democrats in top half > 8%) that took about 50 iterations and less than a minute to find, and ran it a few times till the first time I got something that 'looked nice'.
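
The re-run-until-it-fits loop is easy to sketch. To keep the example self-contained, a random scatter stands in for a real t-SNE run, and the only acceptance test is "the pope is near the middle"; the point count, the 0.5 threshold and the function names are all assumptions for illustration, not what the gist does.

```python
import math
import random


def stand_in_embedding(rng, n_points):
    """Placeholder for one t-SNE run: random 2-D points, different each call."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_points)]


def pope_is_central(points, pope_index=0, max_offset=0.5):
    """Accept a layout only if the pope's point is close to the centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    px, py = points[pope_index]
    return math.hypot(px - cx, py - cy) <= max_offset


def rerun_until_accepted(embed, accept, max_tries=1000):
    """Call embed() repeatedly; return (attempt number, layout) or None."""
    for attempt in range(1, max_tries + 1):
        layout = embed()
        if accept(layout):
            return attempt, layout
    return None


rng = random.Random(0)
result = rerun_until_accepted(lambda: stand_in_embedding(rng, 45), pope_is_central)
print("accepted on attempt", result[0] if result else None)
```

Further criteria, such as the party percentages, would go into the `accept` function; the stricter it is, the more attempts the loop needs, which is the tuning trade-off described above.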

If anyone has ideas for a less hacky way to solve this problem (and yes, I tried using graphs, but they ended up just too darn symmetrical-looking), I'd love to hear it!

TOP THREE CHARACTERISTIC SIMILAR WORDS MOUSEOVER: Finding the top three most characteristic words shared between the pope and each president took another hack. My first thought was to use a simple Dunning log-likelihood test, with one corpus being the pope's speech concatenated with the speech of the president in question, and the other corpus the speeches of all of the other presidents. Dunning usually does a good job of determining words that are overrepresented in one corpus, and penalizes words that are either too rare or too common to be statistically significant. But of course, the pope used a bunch of words that no president used, like 'Merton', whom he mentioned five times. So I tried a few variations on the theme, and ended up using, for corpus 1, the maximum frequency (I used a bag of words, not a TFIDF) of each word that both the pope and the president used. Again, it's a little hacky. I also added +1 to every word count in both corpora so there were no divide-by-zero errors and the algorithm didn't ignore words that the pope and president used but no other president used -- these are definitely of interest in this analysis.
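
For reference, here is a sketch of a Dunning-style log-likelihood score with the +1 smoothing. It uses the common two-term form of the statistic and adds a sign to show which corpus the word leans toward; the gist may differ in its details, and the counts in the example are invented. The corpus totals are taken as given and are not adjusted for the smoothing.

```python
import math


def signed_log_likelihood(count_1, total_1, count_2, total_2):
    """Two-term Dunning G2 for one word, positive if it leans toward corpus 1.

    count_*: occurrences of the word in each corpus.
    total_*: total word count of each corpus (used as given).
    """
    a = count_1 + 1  # +1 smoothing avoids log(0) and division by zero
    b = count_2 + 1
    expected_1 = total_1 * (a + b) / (total_1 + total_2)
    expected_2 = total_2 * (a + b) / (total_1 + total_2)
    g2 = 2 * (a * math.log(a / expected_1) + b * math.log(b / expected_2))
    return g2 if a / total_1 >= b / total_2 else -g2


# Invented counts: a word used 9 times in a 1,000-word corpus 1
# and never in a 10,000-word corpus 2.
print(signed_log_likelihood(9, 1000, 0, 10000))
```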

Again, anyone who has a better solution, I'm all ears!

TOP POPE WORDS & PRESIDENTS WHO USED THEM: I took the top 100 words used by the pope in terms of tf-idf, so it penalized words he used rarely, and words few or no presidents used. Then I determined the tf-idf for each word for each president, listed the top three presidents in terms of tf-idf, and indicated how many presidents in total used the word.

INTERPRETATION OF PARTY RESULTS: Here is a plot of the rank of cosine similarities versus Jaccard similarities (the Jaccard similarity is simply the ratio of the intersection of terms used to the union of terms used) between the pope and each president. The rank is perfectly preserved, indicating that the preponderance of 'other' presidents among the most similar might be due to some combination of a large intersection of words in common and a small union of total words. In other words, maybe early presidents had low lexical diversity, short speeches, or both.
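
The Jaccard similarity described above fits in a few lines; this sketch uses two made-up token lists rather than the real speeches.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy token lists, invented for the example.
speech_a = "we the people of the union".split()
speech_b = "the people and the planet".split()
print(jaccard_similarity(speech_a, speech_b))  # 2 shared terms out of 7 distinct terms
```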


Plotting length of speeches and lexical diversity (number of unique words divided by total number of words) vs. cosine distance, however, shows that the green 'other' dots tend to be towards the left, but they're still spread out somewhat along the top third of the graph, indicating that perhaps the pope did indeed genuinely share a vocabulary with them more, just possibly not to the extent that the cosine similarities alone might indicate.
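
Lexical diversity as defined above is a one-liner; the sentence below is a toy example, not one of the speeches.

```python
def lexical_diversity(tokens):
    """Number of unique words divided by total number of words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


toy_speech = "the people and the planet".split()
print(len(toy_speech), lexical_diversity(toy_speech))  # 5 words, 4 of them unique
```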

Again, anyone with better tools to analyze this is welcome to contribute!




• • •

Monday, August 17, 2015

A Python script to make choropleth grid maps

In May 2015, there was a sudden fad in the Dataviz community (on Twitter, anyway) for hexagonal grid-type choropleth maps. A choropleth (not "chloropleth") is a map in which areas are filled in with a color whose intensity and/or hue is proportional to a quantity; we've all seen them. The problem with traditional choropleths is they are dependent on area so that, for example, a choropleth of the United States will emphasize relatively deserted Wyoming over highly populated Massachusetts.

One recent partial solution to this has been to create choropleths with equal representations of subunits with squares or hexagons. They're still not proportional to population (attempts to solve this last hurdle generally result in ugly and/or unrecognizable maps), but they're pretty cool nonetheless.

As soon as the hubbub started, I thought it would be relatively painless and an interesting exercise to code up a Python script that would make these choropleths. Since they're geometric, it's a cinch to output SVG vector markup. I got about 90% of the way through this project in three weeks, and then a new job and some health issues interfered, but I finally found a weekend to finish it off, in at least a beta, v.0.1 way.
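
To give a flavor of why SVG output is a cinch, here is a tiny sketch of a square-grid choropleth. It is not the script itself: the grid positions, values and colors are invented, and the equal-width binning is just one simple assumption about how quantities could map to colors.

```python
def square_grid_svg(cells, values, colors, cell=40, gap=4):
    """Return SVG markup for a square-grid choropleth.

    cells:  {label: (column, row)} grid positions
    values: {label: number} quantity to display
    colors: list of CSS colors, ordered from low to high
    """
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    step = cell + gap
    parts = []
    for label, (col, row) in cells.items():
        # Equal-width binning of the value into one of the colors.
        idx = min(int((values[label] - lo) / span * len(colors)), len(colors) - 1)
        x, y = col * step, row * step
        parts.append(
            f'<rect x="{x}" y="{y}" width="{cell}" height="{cell}" fill="{colors[idx]}"/>'
        )
        parts.append(
            f'<text x="{x + cell / 2}" y="{y + cell / 2}" text-anchor="middle" '
            f'dominant-baseline="central" font-size="12">{label}</text>'
        )
    width = (max(c for c, _ in cells.values()) + 1) * step
    height = (max(r for _, r in cells.values()) + 1) * step
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(parts)
        + "</svg>"
    )


# Made-up layout and values, for illustration only.
cells = {"WY": (0, 0), "MA": (1, 0), "NY": (0, 1)}
values = {"WY": 1, "MA": 10, "NY": 5}
svg = square_grid_svg(cells, values, ["#deebf7", "#9ecae1", "#3182bd"])
print(svg)
```

Every subunit gets the same size square regardless of its land area, which is the whole point of the grid approach.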

The script is hosted on my GitHub, and here's the IPython notebook/Jupyter tutorial that goes along with it. (There's also a script called Colorbin that's supposed to remove the hassle from associating colors with quantities for the choropleths).

Feedback is welcome. Here are some examples of what the script has produced:























• • •

Saturday, August 8, 2015

Dataset: Single word frequencies per decade from Google Books

I have crunched a public English language dataset in order to remove information that is least likely to be of interest to users, and I offer it for anyone to download:

[1 GB] Google 1-grams English v.20120701 by decade, lowercase, no parts of speech (zipped csv)

The original dataset is licensed under the Creative Commons Attribution 3.0 Unported License.

***

Google Ngram Viewer is an online tool to track the uses of English words from the 16th century to 2008, with the following caveats, among others:

  • It only contains words that were used at least 40 times in any given year, in order to preserve copyright (so you can't tell exactly what book a given word appeared in). This means, for example, if a word occurs 40 times in 1970, 39 times in 1971 and 41 times in 1972, in the database the word will occur 40 times in 1970, 0 times in 1971 and 41 times in 1972.
  • The database is mostly based on library books, so it is heavily biased towards the types of books found in libraries; this includes, for example, directories of names. It is also biased towards the availability of books in a given year, so, for example, 1994 will be much more representatively represented (so to speak) than 1731.
  • A lot of the older books have many, many typos, and many books have the wrong date. I wrote an amusing (I hope) blog post about this.
  • Culturomics, the hyperbolically named organization that prepared the data and made the search tool, warns that data before 2000 should not be compared to data from 2000-2008; they don't explain why, but a reasonable hypothesis would be that the availability of electronic documents drastically changed the nature of the underlying documents and thus their word frequencies.
I have downloaded the over 6 GB of data, converted the words to lowercase, removed part-of-speech tags, converted dates to the decade year and aggregated the results. That means, for example, all of the following 72 entries were collapsed into only one word, "after":

AFteR  AFTEr      aFTER_ADP  AFTER_DET   AFter_NOUN
aFTER  AFTER      AFter_ADP  afTer_DET   AFTeR_NOUN
aFTer  AftEr      aftEr_ADP  AfTER_DET   AFteR_NOUN
afteR  AfTEr      AftEr_ADP  aFter_DET   AFTER_NUM
After  AfTER      AfTER_ADP  aFTER_DET   AFTER_PRON
AfTer  after      aFter_ADP  AfTer_NOUN  after_PRT
AFtER  AFter      after_ADP  AfTER_NOUN  AFTER_PRT
aftEr  AFTER_ADJ  AfteR_ADP  after_NOUN  aFTER_VERB
AFTer  AfTER_ADJ  AFTER_ADP  AFTer_NOUN  AFter_VERB
afTer  after_ADJ  afTer_ADP  afTer_NOUN  AfTER_VERB
AftER  After_ADJ  after_ADV  aFter_NOUN  AFTER_VERB
AFTeR  afteR_ADP  After_ADV  AFTER_NOUN  AFTER_X
aFter  AfTer_ADP  AFTER_ADV  AFTEr_NOUN  
aftER  AftER_ADP  AFter_ADV  AFtER_NOUN  
AfteR  After_ADP  AFter_DET  aFTER_NOUN  

And then, for example, the 10 entries for "after" for 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998 and 1999 are aggregated to only one entry, 1990.

This makes the dataset much easier to use, small enough to hold in memory for most computers, and it smooths out some of the weirdness like all of the different capitalizations.
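
The collapsing steps described above can be sketched like this. It is a simplified illustration, not the code I ran: it assumes each input line is tab-separated as ngram, year, match count, volume count, the list of part-of-speech suffixes is my assumption about the tag set, and the sample rows are invented.

```python
import re
from collections import Counter

# Assumed part-of-speech suffixes, e.g. "after_NOUN".
POS_SUFFIX = re.compile(r"_(NOUN|VERB|ADJ|ADV|PRON|DET|ADP|NUM|CONJ|PRT|X)$")


def aggregate_by_decade(lines):
    """Sum match counts per (lowercased word without POS tag, decade)."""
    totals = Counter()
    for line in lines:
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        word = POS_SUFFIX.sub("", ngram).lower()
        decade = int(year) // 10 * 10  # 1990..1999 -> 1990
        totals[(word, decade)] += int(match_count)
    return totals


# Invented sample rows in the assumed format.
sample = [
    "After\t1990\t100\t10",
    "AFTER_ADP\t1994\t50\t5",
    "after_NOUN\t1999\t7\t2",
    "after\t2000\t3\t1",
]
totals = aggregate_by_decade(sample)
print(totals)
```

With the sample rows, the three 1990s entries end up as a single ("after", 1990) row, and the 2000 entry stays separate.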

If you end up using this data, I'd love it if you dropped me a line. Enjoy.
• • •