Tuesday, May 27, 2014

Methodology and analysis of letter distributions blog post

Link to prooffreader.com blog post: Graphing the distribution of English letters towards the beginning, middle or end of words

Here's the graphic (click to enlarge):

The code for this is in this Github repo, or you can view the IPython notebook at nbviewer.

1. Dataset

As I explained on prooffreader.com, I used the Brown corpus, which may not be everyone's first choice for some applications, but when you get down to the level of letter-by-letter analysis it seems fine. Here is a comparison of Brown and the same analysis done on the much larger and painstakingly curated the COHA corpus from Brigham Young University (for which I have an academic license):


They're pretty identical; the letter z shows the most difference, probably because its rarity and increased frequency of use in foreign words makes it more sensitive to differences between corpora (for example, as a quick check, COHA has many more words with the 'sz' bigram, but the few that are in Brown are at much greater relative frequency because of the smaller size of the corpus).

2. Quantization

Graphing discrete data is nothing new, but this is the first time I've tried to graph ordinal data (i.e. 1st, 2nd, ..., last) with such widely different scales (the most common word, "the", has three letters, literally a beginning, middle and end, and then there are some words with more than 10 letters). I approached it as sort of a variation on the Spearman correlation; it's not the actual numbers that matter, but the trends and comparisons among and between them.

Here's where binning is an effective and intuitive approach; to make words of different size comparable I just used a proportionate scale; here's an example for a four-letter word with five bins:
Letter f, bin 2 would have 25% of the word's frequency added to it, and letter o, bin 2, 75%. (We want to weight the analysis for word frequency, so "four" has a larger contribution than "fourscore", otherwise we're de facto weighting the analysis towards rare words).

Two-letter words were split between two bins, but one-letter words were omitted entirely; they don't fit the beginning-end paradigm. Here is the effect the word "a" would have had:
The effect isn't dramatic, but it's misleading towards the right side of the graph. Should the letter "a" contribute to the distribution at both the beginning and the end of a word? I really don't think so. "Aa", though (which does appear in both corpora) would.

3. Number of bins

Since the exact center of a word is an important, logical and instinctual reference, it made sense to have an odd number of bins so that this point was straddled. With any histogram, too few bins means you smear out the data and hide its patterns, and too many bins means you're showing spurious noise instead of signal. My gut instinct was to have fewer bins, somewhere around the average word length (which was 4.45 letters), to reflect the somewhat fudged, semi-quantitative and ordinal nature of the data and not lead the viewer to think the data is more finely grained than it is.  So I started out with five bins, and it's a good thing I've learned to verify my gut instincts, because I think it works better with more; I ended up using 15 bins. Here's a comparison of them all:

You could certainly make a case for choosing fewer than 15, and the "optimal" number of bins depends on (a) the individual letter, and (b) what you consider optimal. I think 3 or 5 bins hides some interesting details like the fact that "x" is much more often the second letter of a word than the first; I chose 15 instead of 7 because it looks like many letters are trending towards the shape they have in 15, and noise only starts to disrupt signal there at letters like "n" (there's a spurious peak before the middle of "n" that first appears at 15 bins; it disappears if you leave the #2 word overall, "and," out of the analysis. But perhaps that's too early to be worrying about spurious peaks; there are more at 21 bins, e.g. the "f" in "before", the #89 word, and they're rampant at 51.)

4. Ordinal (y) axes

There's a fundamental problem with comparing histograms: if the y axes aren't identical, then the area the bars take up make wider distributions look more significant because the space under the histogram is greater. In this example, all of the data in one bar on the left is distributed in nine bars; the total amount of blue ink is the same, so the graphs are comparable (pretty much; the human brain has proportion-perception issues, which we'll get to later.) If you autoscale the nine bars, however, you seem to be communicating that there's more "stuff" in that graph than in the first, even though they're showing the same total quantity.


Keeping the y scale immutable is a good approach when there isn't too much variation in that scale, but in this case the frequency range of the most common letter, "e", is over 100 times that of the least common, "z". Here's a grid of all the approaches I considered (and if you have any suggestions I didn't consider, I'd be very interested to hear them). If you look at the top left, "a", and bottom left, "z", you'll see the "z" has been scaled down so far you can't see it at all.

In these cases, the common solution is to use a log scale; you can see it in the second column. The results are fine, there's nothing wrong with it at all. The only problem is: it's a data transformation that does not serve the story of the data (in my opinion). Allow me to explain; I'll just take two letters from the above, "j" and "l"; the former is very skewed toward the beginning of words, the latter is more evenly distributed.


Logarithms are counterintuitive, and data scientists who are used to them tend to forget how difficult it was at first to internalize how they work, even if they picked up the algorithms quickly. (Look at the spacing on a slide rule). I've presented log-scale graphs to scientists with Ph.D.s, and spent more time explaining the ramifications of the logarithms than the actual data.

Also, is the difference in scale the "story" of the graph? (Graphs are inherently narrative; the ease in which one can make them perfectly accurate yet entirely misleading shows this; I always say a graph has the same scope, strengths and weaknesses as a paragraph.) We can make out the "j", but its scale is compressed; is the purpose of this graph to show the relative letter frequencies, or to show their distribution? It's the latter. The two properties are interrelated, but I chose to divorce them as much as possible, while keeping the frequency information as a color scale.

The color scale accomplishes two things: it keeps the frequency information yet abstracts it to a semi-quantitative form (color differences are inherently semiquantitative, the human brain cannot determine whether something is "twice as red", just "more red".) So the information is there, but it's not tightly married to the actual individual graphs. I think color-scaled graphing works best when it's additional relevant data that's not absolutely crucial to understanding the visualization.

And yet, autoscaling (normalizing the y axis so it is always from 0 to 100% of the maximum frequency of each letter) did not accomplish this goal; we're back to the histogram scaling problem. In the middle column, it's obvious "l" is more frequent than "j", because it has more ink (so to speak).

Well, we could always integrate the data and normalize the area under the curve, so that every plot has exactly the same amount of ink, as in the second-last column. Now we're up against a perception problem that I alluded to earlier; there's the same amount of ink, but because the 'j' has it so clustered together, it looks like it's a lot more.

So I went with a compromise: the average of equal height and equal area. Often compromises in data visualization obscure data and its meaning, but I think it works in this case. The last column's values are not comparable between plots -- but the data itself is semi-quantitiative binning of ordinals, so it's in herently non-comparable on a fine scale to begin with, and I think this approach accomplishes the goal of the graph.

Feel free to disagree!

5. Ordinal (y) and abscissa (x) labeling

I chose not to label the axes; anything I put would be potentially misleading, I think it's best to leave the viewer's interpretation an intuitive one, yet make the information easily available if it's needed. A few arrow labels were enough.

As always, comments are welcome. I don't claim to have the optimal solution (if there even is one, which I highly doubt), but I put a lot of thought into finding an optimal solution, and I'd like to know if I could have done it better.


• • •

2 comments:

  1. Great post. The notebook link is 404, though.

    ReplyDelete
  2. Whoops! I reorganized that repo and forgot to fix the link! Thanks for letting me know, I fixed it.

    ReplyDelete