Monday, July 14, 2014

The 100 trendiest baby names (determined by an analytical chemistry technique)

With the technique described at prooffreader.com, here are the 100 trendiest names from the U.S. Social Security Administration database:

Name Sex Trendiness Peak height (%) Peak width (years) Peak range
Linda F 0.183 5.67 31 1938-1969
Dewey M 0.151 0.91 6 1897-1903
Brittany F 0.121 2.05 17 1983-2000
Debra F 0.118 2.59 22 1950-1972
Shirley F 0.112 4.04 36 1921-1957
Ashley F 0.105 3.16 30 1980-2010
Jennifer F 0.105 4.3 41 1961-2002
Deborah F 0.104 2.82 27 1947-1974
Lisa F 0.1 3.41 34 1955-1989
Jessica F 0.092 3.22 35 1971-2006
Jason M 0.085 3.48 41 1968-2009
Betty F 0.069 3.4 49 1910-1959
Barbara F 0.067 3.56 53 1918-1971
Judith F 0.063 1.96 31 1934-1965
Grover M 0.059 0.71 12 1883-1895
Amy F 0.058 2.21 38 1958-1996
Carol F 0.055 2.32 42 1928-1970
Patricia F 0.053 3.13 59 1922-1981
Heather F 0.052 1.67 32 1966-1998
Mark M 0.052 2.75 53 1946-1999
Melissa F 0.052 2.12 41 1960-2001
Dorothy F 0.051 3.24 63 1893-1956
Joan F 0.049 1.97 40 1924-1964
Tammy F 0.049 1.22 25 1957-1982
Debbie F 0.048 0.97 20 1952-1972
Jaime F 0.045 0.53 12 1975-1987
Lori F 0.043 1.24 29 1954-1983
Karen F 0.042 1.99 47 1937-1984
Sandra F 0.042 2.02 48 1934-1982
Woodrow M 0.041 0.46 11 1911-1922
Sharon F 0.041 1.77 43 1935-1978
Judy F 0.039 1.3 33 1935-1968
Michelle F 0.039 2.03 52 1954-2006
Gary M 0.039 2.03 52 1933-1985
Brian M 0.039 2.29 59 1950-2009
Cynthia F 0.038 1.92 50 1942-1992
Kathy F 0.038 1.19 31 1943-1974
Tracy F 0.038 1.06 28 1959-1987
Larry M 0.037 1.91 51 1931-1982
Scott M 0.037 1.75 47 1950-1997
Chelsea F 0.037 0.88 24 1982-2006
Kimberly F 0.036 2.01 55 1956-2011
Donald M 0.036 2.95 81 1902-1983
Donna F 0.035 1.8 51 1925-1976
Pamela F 0.033 1.41 43 1942-1985
Ronald M 0.033 2.05 63 1926-1989
Cindy F 0.032 0.99 31 1951-1982
Cheryl F 0.032 1.2 38 1943-1981
Megan F 0.031 1.16 37 1973-2010
Angela F 0.031 1.6 51 1953-2004
Dolores F 0.031 1.14 37 1914-1951
Jeffrey M 0.03 1.69 56 1944-2000
Cody M 0.029 1 34 1978-2012
Crystal F 0.029 1.13 39 1963-2002
Tiffany F 0.029 1.03 36 1968-2004
Carolyn F 0.028 1.48 52 1923-1975
Diane F 0.028 1.19 43 1932-1975
Tina F 0.027 0.9 33 1955-1988
Dawn F 0.027 0.9 33 1952-1985
Joyce F 0.027 1.3 48 1921-1969
Steven M 0.027 1.84 68 1941-2009
Amber F 0.027 0.99 37 1971-2008
Whitney F 0.026 0.56 21 1979-2000
Kelly F 0.026 1.19 46 1958-2004
Chad M 0.026 0.83 32 1966-1998
Courtney F 0.025 0.81 32 1974-2006
Kim F 0.025 0.62 25 1952-1977
Todd M 0.024 0.84 35 1956-1991
Doris F 0.024 1.48 62 1897-1959
Timothy M 0.024 1.64 69 1942-2011
Mildred F 0.024 1.59 67 1881-1948
Marilyn F 0.024 1.06 45 1922-1967
Shannon F 0.023 0.91 39 1963-2002
Danielle F 0.023 0.98 42 1967-2009
Dennis M 0.023 1.32 57 1932-1989
Stephanie F 0.022 1.37 61 1949-2010
Kelsey F 0.022 0.64 29 1983-2012
Brittney F 0.021 0.42 20 1982-2002
Brenda F 0.021 1.21 58 1939-1997
Erin F 0.021 0.89 43 1966-2009
Carole F 0.02 0.62 31 1931-1962
Robin F 0.02 0.78 39 1950-1989
Gladys F 0.019 1.12 58 1887-1945
Julie F 0.019 1.05 55 1942-1997
Rhonda F 0.019 0.62 33 1949-1982
Christina F 0.019 1.07 57 1949-2006
Jamie F 0.019 0.86 46 1959-2005
Kathleen F 0.018 1.54 84 1908-1992
Gregory M 0.018 1.06 58 1945-2003
Cathy F 0.018 0.54 31 1945-1976
Janice F 0.018 0.89 51 1924-1975
Beverly F 0.017 0.91 52 1921-1973
Jerry M 0.017 1.37 79 1904-1983
Misty F 0.017 0.41 24 1967-1991
Tonya F 0.017 0.49 29 1959-1988
Lindsay F 0.017 0.52 31 1976-2007
Kristin F 0.016 0.58 36 1963-1999
Lindsey F 0.016 0.54 33 1976-2009
Denise F 0.016 0.78 48 1946-1994
Brandy F 0.016 0.44 27 1972-1999
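
A quick way to read the columns: in the rows listed, the peak width is simply the span of the peak range, and the trendiness value matches peak height divided by peak width to within rounding. The sketch below checks that on a few rows copied from the table. It is an observation about the published numbers, not the method itself (that is described at prooffreader.com), and the function names are made up for illustration.

```python
# A few rows copied from the table above:
# (name, trendiness, peak height in %, peak range)
rows = [
    ("Linda", 0.183, 5.67, "1938-1969"),
    ("Dewey", 0.151, 0.91, "1897-1903"),
    ("Brittany", 0.121, 2.05, "1983-2000"),
    ("Debra", 0.118, 2.59, "1950-1972"),
    ("Shirley", 0.112, 4.04, "1921-1957"),
]


def peak_width(peak_range):
    """Span of a 'start-end' year range, in years."""
    start, end = (int(year) for year in peak_range.split("-"))
    return end - start


def height_over_width(height, peak_range):
    """Peak height divided by peak width (assumed relationship, see text)."""
    return height / peak_width(peak_range)


for name, trendiness, height, peak_range in rows:
    ratio = height_over_width(height, peak_range)
    print(f"{name:10s} table={trendiness:.3f}  height/width={ratio:.3f}")
```

Because the heights in the table are rounded to two decimals, the recomputed ratio can differ from the listed trendiness in the last digit.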

Here are links to my Baby Name GitHub Repo, and to an IPython notebook for this analysis.
• • •

Monday, July 7, 2014

Additional info for Comparison of letter position in words for eight languages

Link to this week's blog, Comparison of letter position in words for eight languages
Link to May 27 blog, Graphing the distribution of English letters toward the beginning, middle and end of words
Link to May 27 Methodology and analysis of letter distributions blog post
Link to GitHub repo containing code used to make these graphs

My kingdom for a corpus

English has an embarrassment of corpora: the Brown Corpus, the British National Corpus, the licensed COHA and, if you don't mind a lack of curation and a lot of number-crunching, Google Ngrams/Culturomics. I anglocentrically assumed other languages would be equally well endowed, but some are not. Where corpora exist at all, they may be proprietary and licensed, or based on Wikipedia, Project Gutenberg or spoken transcriptions (not a bad thing in and of itself, but it makes comparisons between corpora more difficult). Thank goodness for Europarl; it has its own biases (as all corpora do), but they're easy to identify.

Checking for bias

Europarl has a lot of words that are more common in parliamentary proceedings than in other corpora (like "parliament"). It also has many proper names of people who happened to be important and were mentioned often during the period the corpus covers. Member of Parliament Georg Jarzembowski, for example, noticeably boosts the counts of j, w, z and k in languages where those letters are rare.

First, a comparison of May 27's Brown Corpus graphs to those of English Europarl:


As you can see, the two corpora agree closely; the only letter that is significantly different is "z", and even there the change is not dramatic. In terms of letter frequency, Europarl has a few more c's and l's than Brown; I hypothesize that this is mostly due to proper names (I'm looking at you, Czesław).

Another check for bias is to remove overrepresented words. For example, the German "p" should be the most sensitive (because the letter is rarer there than in other languages) to removing the various forms of the cognates of "president", "parliament", "politics" and "europe". Here's what happens:


It's actually the word "Europe" that makes the shape of the German graph most unlike the other languages. So removing the overrepresented words does not change the front-loaded nature of the graph, and it makes the graphs more uniform across languages.
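
For anyone who wants to try this kind of check, here is a minimal Python sketch: it counts where a letter falls within each word, weighted by word frequency, with and without a list of excluded words. The function name, the five-bin split and the toy word counts are all made up for illustration; this is not the code or the Europarl data behind the graphs (those are in the GitHub repo linked above).

```python
def position_histogram(word_counts, letter, bins=5, exclude=()):
    """Share of a letter's occurrences in each position bin (word start -> end).

    word_counts maps word -> frequency; words in `exclude` are skipped.
    """
    excluded = {w.lower() for w in exclude}
    hist = [0] * bins
    for word, count in word_counts.items():
        word = word.lower()
        if word in excluded or len(word) < 2:
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if ch == letter:
                # Integer arithmetic: first letter -> bin 0, last letter -> last bin.
                hist[min(i * bins // last, bins - 1)] += count
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * bins


# Toy frequencies, invented for this example (NOT real Europarl counts).
toy_counts = {
    "europa": 40, "parlament": 30, "politik": 10,
    "pferd": 15, "platz": 15, "gruppe": 5,
}
overrepresented = ["europa", "parlament", "politik"]

full = position_histogram(toy_counts, "p")
filtered = position_histogram(toy_counts, "p", exclude=overrepresented)
print("all words:     ", [round(x, 3) for x in full])
print("after removal: ", [round(x, 3) for x in filtered])
```

Comparing the two lists shows how much a handful of very frequent words can pull a letter's distribution toward one end of the word.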




• • •