Monday, July 7, 2014

Additional info for Comparison of letter position in words for eight languages

Link to this week's blog, Comparison of letter position in words for eight languages
Link to May 27 blog, Graphing the distribution of English letters toward the beginning, middle and end of words
Link to May 27 Methodology and analysis of letter distributions blog post
Link to Github repo containing code used to make these graphs

My kingdom for a corpus

There is an embarrassment of corpora in English: from the Brown to the British National, to the licensed COHA and, if you don't mind a lack of curation and a lot of number-crunching, Google Ngrams/Culturomics. I anglocentrically assumed other languages would be equally endowed, but some are not. If corpora exist at all, they might be proprietary and licensed, or based on Wikipedia, Project Gutenberg or spoken transcriptions. None of those sources is a bad thing in and of itself, but the mix makes it more difficult to compare one corpus with another. Thank goodness for Europarl; it has its own biases (as all corpora do), but they're easy to identify.

Checking for bias

Europarl has a lot of words that are more common in parliamentary proceedings than in other corpora (like "parliament"). It also has a lot of proper names of people who happened to be important and mentioned in the proceedings during the time covered by the corpus. Member of parliament Georg Jarzembowski, for example, contributed instances of j, w, z and k to languages in which those letters are rare.

First, a comparison of May 27's Brown Corpus graphs to those of English Europarl:

As you can see, the two sets of graphs agree closely; the only letter whose distribution differs noticeably is "z", and even there the change is not dramatic. In terms of letter frequency, Europarl has a few more c's and l's than Brown; I hypothesize that it's mostly due to proper names (I'm looking at you, Czesław).
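A letter-frequency comparison like the one between Brown and Europarl can be sketched roughly as follows. This is a minimal illustration, not the code from the Github repo: the function names and the `{word: count}` input format are assumptions, and the tiny word lists are toy data, not real corpus counts.

```python
from collections import Counter


def letter_frequencies(word_counts):
    """Relative frequency of each letter, weighting each word by its count.

    `word_counts` is assumed to be a {word: count} mapping.
    """
    totals = Counter()
    for word, count in word_counts.items():
        for ch in word.lower():
            if ch.isalpha():
                totals[ch] += count
    n = sum(totals.values())
    return {ch: c / n for ch, c in totals.items()} if n else {}


def largest_differences(freq_a, freq_b, top=5):
    """Letters whose relative frequency changes most from corpus A to corpus B."""
    letters = set(freq_a) | set(freq_b)
    diffs = {ch: freq_b.get(ch, 0.0) - freq_a.get(ch, 0.0) for ch in letters}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]


# Toy data only; real input would be full word counts from each corpus.
corpus_a = {"the": 10, "cat": 2}
corpus_b = {"the": 10, "council": 2, "parliament": 3}
print(largest_differences(letter_frequencies(corpus_a), letter_frequencies(corpus_b)))
```

A positive difference means the letter is relatively more common in the second corpus, which is where extra c's and l's from proper names would show up.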

Another check for bias is to remove overrepresented words. For example, the German "p" should be the letter most sensitive to removing cognates of "president", "parliament", "politics" and "europe" in all their forms, because "p" is rarer in German than in the other languages. Here's what happens:

It's actually the word "Europe" that makes the German graph shape the most unlike the other languages, so not only does the removal of overrepresented words not change the front-loaded nature of the graph, it makes the graphs more uniform.
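The removal check can be sketched like this. It is only a rough illustration under stated assumptions: the three-bin split into beginning, middle and end, the `position_profile` name, and the exact-match `exclude` list are all hypothetical and may differ from the method in the Github repo. The word counts are invented toy numbers.

```python
def position_profile(word_counts, letter, bins=3, exclude=()):
    """Share of a letter's occurrences falling in each position bin of a word.

    Each occurrence is binned by the midpoint of its relative position,
    (index + 0.5) / word_length. Words in `exclude` are skipped entirely.
    """
    excluded = {w.lower() for w in exclude}
    hist = [0.0] * bins
    for word, count in word_counts.items():
        w = word.lower()
        if not w or w in excluded:
            continue
        n = len(w)
        for i, ch in enumerate(w):
            if ch == letter:
                b = min(int((i + 0.5) / n * bins), bins - 1)
                hist[b] += count
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy counts only, chosen to show the effect of one dominant word.
counts = {"europa": 6, "politik": 2, "gruppe": 2}
print(position_profile(counts, "p"))                      # with "europa"
print(position_profile(counts, "p", exclude=["europa"]))  # without it
```

Comparing the two profiles shows how much a single very frequent word can pull a rare letter's distribution toward one part of the word.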
