Thursday, May 7, 2015

Methodology for Most characteristic words in pro- and anti-feminist tweets

Here I discuss the methodology behind the prooffreader.com post Most characteristic words in pro- and anti-feminist tweets. I describe what my team and I did and explain the reasoning behind it, and I try not to make it too technical.

The team consisted of myself, Zafarali Ahmed, Jerome Boisvert-Chouinard, Dave Gurnsey, Nancy Lin, and Reda Lotfi. We won the Data Science and the Natural Language Processing prizes at the Montreal Big Data Week Hackathon on April 19, 2015.

The code for this post is available in my Github. The code we created during the hackathon is in teammate Zaf's Github, and includes some features we'd like to eventually add, such as tweet frequency tracking and topic modelling. The script downloads the dataset from my website; if you want, you can download it yourself for perusal here (it's 248 MB). If you just want to see the log-likelihood results, they're here. And if you'd like to see whether you agree with our manual curation, that set is here. Finally, the actual code that searches Twitter is here; I didn't put it in the Github because it's embarrassingly poorly hacked together, but it works (which gives me little incentive to clean it up).

1. Data collection: I wrote a simple Python script that uses the free Twitter Search API and runs every 15 minutes. I have a list of about 20 search terms I'm interested in, and I cycle through them in random order, collecting the most recent 100 tweets for each until the rate limit maxes out. Additionally, at random intervals about once per day, I collect as many results as possible for a single search term. Between January and April 2015, I collected 988,000 tweets containing "feminism", "feminist" and/or "feminists".
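
If you're curious what such a collector looks like, here's a minimal sketch along the same lines using the tweepy library; it's not my actual script (that's linked above), and the credentials and term list are placeholders.

    import tweepy

    # Placeholder credentials; a real script needs its own Twitter API keys.
    auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
    auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
    api = tweepy.API(auth, wait_on_rate_limit=True)

    search_terms = ['feminism', 'feminist', 'feminists']  # a subset of my ~20 terms

    def collect(terms, per_term=100):
        """Fetch the most recent tweets for each search term."""
        results = []
        for term in terms:
            for tweet in api.search(q=term, count=per_term, result_type='recent'):
                results.append({'id': tweet.id,
                                'text': tweet.text,
                                'created_at': str(tweet.created_at),
                                'term': term})
        return results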

2. Curation: We used a simple script to manually curate 1,000 randomly chosen tweets into three categories: pro-feminist, anti-feminist, and other (e.g. neither, neutral, a news report, we couldn't tell, not in English, etc.).
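
A curation tool doesn't need to be fancy; something along these lines would do the job (this is a sketch, not our actual hackathon script, and the file names are made up):

    import json
    import random

    tweets = json.load(open('tweets.json'))   # hypothetical dump of collected tweets
    labels = {'p': 'pro', 'a': 'anti', 'o': 'other'}

    curated = []
    for tweet in random.sample(tweets, 1000):
        print('\n' + tweet['text'])
        choice = ''
        while choice not in labels:
            choice = input('[p]ro / [a]nti / [o]ther? ').strip().lower()
        curated.append({'text': tweet['text'], 'label': labels[choice]})

    with open('curated.json', 'w') as f:
        json.dump(curated, f)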

You may have heard of sentiment analysis; what we did is something we called 'attitude analysis', and it's more difficult. A typical sentiment analysis pipeline is trained on curated movie reviews and makes inferences from texts containing words like 'hate', 'love', 'like', 'terrific', 'awful', i.e. positive, negative, subjective and/or non-subjective language.

For example, look at these two hypothetical tweets:
    1. Man, do I ever hate feminists.
    2. I hate that my mom does not like the word 'feminism'.

Sentiment analysis will typically consider these tweets similar because of their use of 'hate', but they're diametrically opposed in attitude. By using machine learning, we try to see whether other words ('man', 'ever', 'feminists' vs 'feminism', 'mom', etc.) are good predictors of the underlying attitude. (We verified that sentiment analysis was not up to the task; you can see this in the Github repo.)
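
To see why generic sentiment scoring falls short here, you can run the two hypothetical tweets through an off-the-shelf library; I'm using TextBlob below purely as an illustration (see the repo for what we actually ran):

    from textblob import TextBlob

    tweets = ["Man, do I ever hate feminists.",
              "I hate that my mom does not like the word 'feminism'."]

    for tweet in tweets:
        # polarity ranges from -1 (negative) to +1 (positive)
        print('%.2f  %s' % (TextBlob(tweet).sentiment.polarity, tweet))

    # The word 'hate' pulls both tweets toward a negative score, even though
    # their attitudes toward feminism are diametrically opposed.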

The manual curation was time-consuming, especially since we really wanted to get it right. If we were in any doubt (e.g. if we thought the tweet might be sarcastic), we classified it as 'other'. The final count was about 50% other, 25% pro and 25% anti.

3. Removal of 'low-effort' tweets. We wanted tweets that people had spent time writing, choosing their words deliberately (if not always carefully). In other words, we wanted to eliminate anything that could be tweeted with just the click of a button, like retweets or those 'Share me' links on websites. To minimize the shared tweets, we simply eliminated duplicates, leaving only one copy in the database. From our original 988,000 tweets we were left with 390,000 after this adjustment (people sure do retweet and share a lot).
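
The filtering itself is simple; assuming each tweet is a dict with a 'text' key, a sketch of this step might look like this:

    def remove_low_effort(tweets):
        """Drop retweets and keep only one copy of any duplicated text."""
        seen = set()
        kept = []
        for tweet in tweets:
            text = tweet['text'].strip()
            if text.startswith('RT @'):   # conventional retweet marker
                continue
            if text in seen:              # 'Share me' buttons produce identical text
                continue
            seen.add(text)
            kept.append(tweet)
        return kept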

4. Tokenizing. Since tweets are limited to 140 characters, we did not eliminate stopwords (articles, pronouns, linking verbs, common prepositions, etc.), on the theory that they were carefully chosen and significant. We eliminated very common punctuation, but treated most non-alphabetic characters (quotation marks, question marks, special characters, emoji, etc.) as separate tokens. (This let us see whether one group quoted, questioned, exclaimed, etc. more than another.) We also treated 'curly quotes' as different from 'straight quotes', since they appear most often in copied-and-pasted text, but we did not see a significant difference between groups once duplicate tweets were removed. (We saw a big difference in favor of pro-feminism before the removal, indicating that pro-feminists are the ones 'sharing' from traditional news sites.)
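
In code, the tokenizer can be as simple as a regular expression that keeps word-like runs together and splits everything else into single-character tokens. This sketch is only illustrative; in particular, it guesses that commas and periods are the 'very common punctuation' we dropped:

    import re

    TOKEN_RE = re.compile(r"[A-Za-z0-9'’#@_-]+|\S")
    COMMON_PUNCT = {',', '.'}   # a guess at the 'very common' punctuation to drop

    def tokenize(text):
        return [t for t in TOKEN_RE.findall(text.lower()) if t not in COMMON_PUNCT]

    print(tokenize('Why does “feminism” scare you?'))
    # ['why', 'does', '“', 'feminism', '”', 'scare', 'you', '?']
    # Note that curly quotes and straight quotes come through as different tokens.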

5. Classification. We used machine learning to classify the 390,000 tweets as pro- or anti-feminist (or other) based on our 1,000 curated tweets. We used a bag-of-words approach, i.e. taking the set of all words used in all tweets, seeing which words appear in individual tweets, and checking whether any of those words are good predictors of the class (pro-, anti-, other) of the tweet. We used a Naive Bayes classifier because it's commonly used in NLP (Natural Language Processing) for bag-of-words tasks. We sequestered 25% of our curated tweets to test the classifier; it classified about 90% of the non-sequestered tweets correctly, but only about 40% of the sequestered ones. This is called 'overfitting', and the obvious solution is to curate more tweets (maybe Mechanical Turk could help?). Still, one important measure related to false positives was around 60%, which is pretty good for Twitter data, indicating that we weren't getting a lot of misclassifications within the two classes of interest, pro- and anti-; most of our errors were false negatives, i.e. real pro- and anti- tweets being classified as 'other'. We can live with that; false positives in this scenario are far worse than false negatives.
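
For the record, a bare-bones version of this kind of bag-of-words Naive Bayes pipeline in scikit-learn looks roughly like the following; the real hackathon code is in the repo linked above, and the input file name here is made up:

    import json

    from sklearn.cross_validation import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    curated = json.load(open('curated.json'))   # the ~1,000 hand-labelled tweets
    texts = [t['text'] for t in curated]
    labels = [t['label'] for t in curated]      # 'pro', 'anti' or 'other'

    # Sequester 25% of the curated tweets for testing
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        texts, labels, test_size=0.25, random_state=0)

    vectorizer = CountVectorizer()   # in practice, plug in the custom tokenizer above
    classifier = MultinomialNB()
    classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

    print('accuracy on training tweets: %.2f' %
          classifier.score(vectorizer.transform(train_texts), train_labels))
    print('accuracy on sequestered tweets: %.2f' %
          classifier.score(vectorizer.transform(test_texts), test_labels))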

6. Log-likelihood calculation. We calculated the log-likelihood of every word/token that appeared in both the pro- and anti- tweets at least 10 times, as a measure of how characteristic it was of one class as opposed to the other. Log-likelihood is a measure of significance, i.e. a form of p-value, which can be misused, but we're using it properly here.

My best explanation in layman's terms: we choose a word and see how common it is in each dataset. The log-likelihood is a measure of how 'surprised' we would be if we mixed the two datasets together, divided them randomly, and came up with a similarly unequal distribution. In other words, it's a measure of the odds that our observation is due to chance rather than to the true nature of our datasets.

Log-likelihood is a handy measure because it takes into account both the ratio and the absolute values of frequency differences between sets. In other words, if "goldilocks" (to pick a word totally at random) appeared 10 times in the anti-feminist tweets and 20 times in the pro-feminist tweets, it would have a lower log-likelihood than if it appeared 100 times and 200 times, respectively, even though the ratio between them is the same. How much lower depends on the total size of the dataset; we're more 'surprised' to find differences in large datasets (which are more predictable) than in small ones.
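
Concretely, the standard corpus-comparison log-likelihood calculation (the G2 statistic used in corpus linguistics) can be written in a few lines; the 'goldilocks' counts and corpus sizes below are invented just to illustrate the ratio-versus-absolute-values point:

    from math import log

    def log_likelihood(count_a, total_a, count_b, total_b):
        """G2 for one word: count_a/count_b are the word's frequencies in the
        two corpora, total_a/total_b are the corpora's total token counts."""
        expected_a = total_a * (count_a + count_b) / float(total_a + total_b)
        expected_b = total_b * (count_a + count_b) / float(total_a + total_b)
        ll = 0.0
        if count_a:
            ll += count_a * log(count_a / expected_a)
        if count_b:
            ll += count_b * log(count_b / expected_b)
        return 2 * ll

    # Same 1:2 ratio, same (invented) corpus sizes, ten times the counts:
    print(log_likelihood(10, 100000, 20, 100000))     # ~3.4
    print(log_likelihood(100, 100000, 200, 100000))   # ~34.0

The second score is ten times the first even though the proportions are identical, which is exactly the behaviour described above.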

It's been my experience that log-likelihood is pretty robust to imperfectly classified data like this. In other words, the most characteristic words might not be exactly the same words in the same order if the classification were better, but there would be very little substantive change, i.e. few words changing their relative values spectacularly.

I'm interested in comments and questions, so feel free! I have a thick skin, and I acknowledge I'm capable of mistakes!


