Saturday, August 8, 2015

Dataset: Single word frequencies per decade from Google Books

I have crunched a public English-language dataset to strip out the information least likely to interest users, and I offer the result for anyone to download:

[1 GB] Google 1-grams English v.20120701 by decade, lowercase, no parts of speech (zipped csv)

The original dataset is licensed under the Creative Commons Attribution 3.0 Unported License.


Google Ngram Viewer is an online tool to track the uses of English words from the 16th century to 2008, with the following caveats, among others:

  • It only contains words that were used at least 40 times in any given year, in order to preserve copyright (so you can't tell exactly what book a given word appeared in). This means, for example, if a word occurs 40 times in 1970, 39 times in 1971 and 41 times in 1972, in the database the word will occur 40 times in 1970, 0 times in 1971 and 41 times in 1972.
  • The database is mostly based on library books, so it is heavily biased towards the types of books found in libraries; this includes, for example, directories of names. It is also biased towards the availability of books in a given year, so, for example, 1994 will be much more representatively represented (so to speak) than 1731.
  • A lot of the older books have many, many typos, and many books have the wrong date. I wrote an amusing (I hope) blog post about this.
  • Culturomics, the hyperbolically named organization that prepared the data and made the search tool, warns that data before 2000 should not be compared to data from 2000-2008; they don't explain why, but a reasonable hypothesis would be that the availability of electronic documents drastically changed the nature of the underlying documents and thus their word frequencies.
I downloaded the more than 6 GB of data, converted the words to lowercase, removed the part-of-speech tags, rounded each year down to the first year of its decade and summed the counts. That means, for example, all of the following 72 entries were collapsed into only one word, "after":

aFTer  AftEr      aftEr_ADP  AfTER_DET   AFteR_NOUN
afteR  AfTEr      AftEr_ADP  aFter_DET   AFTER_NUM
AfTer  after      aFter_ADP  AfTer_NOUN  after_PRT
AFtER  AFter      after_ADP  AfTER_NOUN  AFTER_PRT
afTer  after_ADJ  afTer_ADP  afTer_NOUN  AfTER_VERB
AftER  After_ADJ  after_ADV  aFter_NOUN  AFTER_VERB
AfteR  After_ADP  AFter_DET  aFTER_NOUN  

And then, for example, the 10 entries for "after" for 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998 and 1999 are aggregated to only one entry, 1990.
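
For anyone who wants to reproduce the idea, here is a minimal Python sketch of the steps described above. It is not the script I actually used: the function names are hypothetical, the tag list is the one I assume the 2012 files use, the counts in the sample rows are made up for illustration, and it ignores special cases such as entries that consist of nothing but a tag.

```python
import re
from collections import Counter

# Assumed list of part-of-speech suffixes in the 2012 1-gram files.
POS_SUFFIX = re.compile(r"_(NOUN|VERB|ADJ|ADV|PRON|DET|ADP|NUM|CONJ|PRT|X)$")


def normalize(token):
    """Strip a trailing part-of-speech tag, then lowercase what is left."""
    return POS_SUFFIX.sub("", token).lower()


def decade(year):
    """Round a year down to the first year of its decade (1997 -> 1990)."""
    return year - year % 10


def aggregate(rows):
    """Sum counts per (word, decade) over rows of (token, year, count)."""
    totals = Counter()
    for token, year, count in rows:
        totals[(normalize(token), decade(year))] += count
    return totals


# Made-up sample rows, only to show the shape of the input.
rows = [
    ("After", 1990, 100),
    ("after_ADP", 1991, 50),
    ("AFTER_VERB", 1999, 5),
    ("after", 2000, 7),
]
totals = aggregate(rows)
print(totals[("after", 1990)])  # 155
print(totals[("after", 2000)])  # 7
```

The tag is stripped before lowercasing because the tags are uppercase; doing it the other way round would require a lowercase tag list.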

This makes the dataset much easier to use and small enough to hold in memory on most computers, and it smooths out some of the weirdness, like all of the different capitalizations.
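
As a rough illustration of holding the data in memory, the sketch below builds a nested dictionary of counts. The column layout (word, decade, count, with no header row) is an assumption made for the example, so check the actual csv before relying on it; `load_counts` is a hypothetical helper and the sample values are made up.

```python
import csv
import io
from collections import defaultdict


def load_counts(lines):
    """Build {word: {decade: count}} from csv rows of word, decade, count.

    Assumes three columns and no header row (an assumption, not a
    description of the real file).
    """
    table = defaultdict(dict)
    for word, decade, count in csv.reader(lines):
        table[word][int(decade)] = int(count)
    return table


# Made-up sample standing in for the real file.
sample = io.StringIO("after,1980,120\nafter,1990,155\nbefore,1990,90\n")
table = load_counts(sample)
print(table["after"][1990])  # 155
```

With a real file you would pass an open file object instead of the `io.StringIO` sample.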

If you end up using this data, I'd love it if you dropped me a line. Enjoy.