Data Mining "Lost" Tweets
Note: this article uses the Twitter V1 API which has been shut down. The concepts still apply but you'll need to map them to the new V2 API.
As some of you might know, Twitter provides a streaming API that pumps all of the tweets for a given search to you as they happen. There are other stream variants, including a sample feed (a small percentage of all tweets), "Gardenhose", which is a statistically sound sample, and "Firehose", which is every single tweet. All of them. That last one isn't actually all that useful unless you have some pretty beefy hardware and a really nice connection to keep up. The filtered stream is much more interesting if you have a target in mind. Since there was such a hubbub about "Lost" a few weeks ago, I figured I would gather relevant tweets and see what there was to see. In this first part I'll cover capturing tweets and doing a little basic analysis, and in the second part I'll go over some deeper analysis, including some pretty graphs!
Capturing
Let me preface this: I have never watched a single episode of "Lost". When it started I had way too much stuff going on to pay attention to television, and since then I've sort of consciously stayed away. I pass no judgements on anyone who is a fan or not, or who is evil or not.
The streaming API is pretty easy to work with. You basically give it a comma-separated list of search terms and it will give you any and all tweets that match those terms. For example, if you were to run this command:
$ curl -q http://stream.twitter.com/1/statuses/filter.json\?track=bpcares \
-uYourTwitterName:YourTwitterPass
you would get a stream of semi-humorous tweets about the oil spill.
I wrote a little Perl wrapper around curl which will automatically stop capturing after a given number of hours or after it has captured a given number of megabytes. It will also reconnect if the stream dies for any reason. To capture a workable number of tweets, I launched this script on May 23rd at 4:14pm PDT like this:
$ capture-tweet-stream.pl 24 10000 ~/data/lost-finale-tweets.txt \
'lost,locke,jack,sawyer,smokemonster,theisland,jacob,shepard'
This means: capture any tweets matching those terms for 24 hours or 10 GB, whichever comes first.
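The wrapper itself isn't shown in this post, but a minimal sketch of that kind of script might look something like the following. The argument order mirrors the invocation above; pulling the credentials from environment variables, and the exact reconnect behavior, are my own assumptions rather than what capture-tweet-stream.pl actually does.

#!/usr/bin/env perl
# Hypothetical sketch of a capture wrapper: keep curl attached to the filter
# stream, appending to the output file, until either the hour limit or the
# megabyte limit is hit. Reconnect whenever the stream drops.
use strict;
use warnings;

my ($hours, $max_mb, $outfile, $track) = @ARGV;
die "usage: $0 hours max_mb outfile 'term1,term2,...'\n" unless defined $track;

my $deadline  = time + $hours * 3600;
my $max_bytes = $max_mb * 1024 * 1024;
my $bytes     = 0;

open my $out, '>>', $outfile or die "can't open $outfile: $!";

OUTER: while (time < $deadline && $bytes < $max_bytes) {
    # Credentials are assumed to live in TWITTER_USER / TWITTER_PASS here.
    open my $stream, '-|', 'curl', '-s',
        '-u', "$ENV{TWITTER_USER}:$ENV{TWITTER_PASS}",
        "http://stream.twitter.com/1/statuses/filter.json?track=$track"
        or die "couldn't run curl: $!";

    # Copy the stream to the output file, checking the limits as we go.
    while (read $stream, my $buf, 8192) {
        print {$out} $buf;
        $bytes += length $buf;
        last OUTER if $bytes >= $max_bytes || time >= $deadline;
    }
    close $stream;
    sleep 5;    # brief pause before reconnecting
}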
A little analysis
For a while as I was running the capture I was tailing the output file, pausing it whenever a gem of a tweet scrolled past, just so I could retweet it. Here's my favorite:
I hope Dexter shows up on Lost and kills them all. #FuckLost
— Ed Battes (@EdBattes) May 24, 2010
I happen to be a fan of Dexter, and would have gladly paid money for a crossover. Anyway.
If you want to play along, the data is in my Dropbox and the code is all on GitHub. First, let's get an idea of how much raw data we're working with. Twitter sends carriage-return-separated JSON blobs. Awk to the rescue!
$ gzcat lost-finale-tweets.txt.gz | awk 'BEGIN{RS="\r"}{n+=1}END{print n}'
779750
$
Almost 780,000 tweets. Tweeps were busy! Ok, so what were they saying? A normal approach would be to run through all of the tweets and count up occurrences of each word, but there's so much data that doing it in one pass on my laptop would run out of memory. Instead, here's a map and a two-stage reduce. The map is a fairly small Perl script that everyone and their mother can pretty much write from memory, the classic word-count MapReduce example:
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::XS qw/ decode_json /;
use Try::Tiny;

# Twitter delimits records with carriage returns.
$/ = "\r";
binmode(STDIN,  ':utf8');
binmode(STDOUT, ':utf8');

while (<>) {
    # Skip anything that isn't valid JSON or has no tweet text.
    my $obj;
    try {
        $obj = decode_json($_);
    } catch { };
    next unless $obj;

    my $text = $obj->{text};
    next unless $text;

    # Strip punctuation (keeping '#' for hashtags) and lowercase.
    $text =~ s/[^\w\d#\s]//g;
    my @w = split(/\s+/, lc $text);

    # Emit every word, two-word phrase, and three-word phrase with a count of 1.
    for my $i ( 0 .. $#w ) {
        print_if_all(1, $w[$i]);
        print_if_all(2, @w[$i .. $i + 1]);
        print_if_all(3, @w[$i .. $i + 2]);
    }
}

# Only print the phrase if we actually got $n words, i.e. we didn't run off
# the end of the tweet.
sub print_if_all
{
    my $n = shift;
    @_ = grep { $_ } @_;
    print join(' ', @_) . "\t1\n" if @_ == $n;
}
This one has a few modifications, though. First, it removes all punctuation except '#' and lowercases everything. Second, it counts each individual word as well as each two- and three-word phrase in the tweet. We can run it like this:
$ gzcat lost-finale-tweets.txt.gz | ./stem.pl | split -l 1000000 - output/out.txt
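To make the n-gram emission concrete, here's what the mapper prints for a single made-up tweet whose text is "Watching the Lost finale!" (the tweet is hypothetical, but the output follows directly from the script above):

watching	1
watching the	1
watching the lost	1
the	1
the lost	1
the lost finale	1
lost	1
lost finale	1
finale	1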
The reduce happens in two phases, both using this even smaller Perl script, which just sums the counts coming out of the first one:
#!/usr/bin/env perl
use strict;
use warnings;

my %sum;
binmode(STDIN,  ':utf8');
binmode(STDOUT, ':utf8');

# Accumulate the total count for each key (word or phrase).
while (<>) {
    chomp;
    my ($key, $num) = split(/\t/, $_);
    $sum{$key} += $num;
}

# Emit one total per key.
print join("\t", $_, $sum{$_}) . "\n" for keys %sum;
We run it like this; the find pass sums each chunk file on its own, and the second sum.pl merges the per-chunk totals before sorting by count:
$ find output -exec ./sum.pl {} \; | ./sum.pl | sort -t $'\t' -k 2,2nr > stems.txt
Sort of like a poor man's Hadoop, no? No, you're right. Not really. But it gets the job done, and that's what counts.
Ok, so now we have our word counts. Here are the top 27 words and phrases people mentioned in these tweets, after removing really common English words:
lost 104181
#lost 53188
finale 25322
watching 11204
de lost 10475
tonight 9588
final 9107
lost finale 9101
series 9000
watch 8105
series finale 7487
the lost 7444
jack 5747
episode 5507
lost series 5062
end 4806
lost series finale 4696
watching lost 4179
the lost finale 3631
final de lost 3519
the end 2981
to watch 2964
watching the 2920
spoiler 2804
love 2768
the finale 2765
#lost finale 2579
In amongst all the tiny junk words, we have some really nice indicators that we can use in the next phase to filter down to just the tweets that are actually talking about Lost the TV show, rather than someone's lost kitten named Mittens. Interestingly, the phrase "you all everybody" only showed up 67 times. Sad.
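As a teaser for that filtering step, here's a rough sketch of the kind of thing those indicators make possible. The phrase list and the keep/drop rule here are my own guesses, not the filter I'll actually use in part two; it reads the same carriage-return-separated capture and passes through only the tweets that mention one of the indicator phrases.

#!/usr/bin/env perl
# Hypothetical filter: keep tweets whose text contains one of the
# show-specific indicator phrases found above.
use strict;
use warnings;
use JSON::XS qw/ decode_json /;
use Try::Tiny;

$/ = "\r";
binmode(STDIN,  ':utf8');
binmode(STDOUT, ':utf8');

# Phrases that strongly suggest the tweet is about the TV show (guessed list).
my @indicators = ('#lost', 'lost finale', 'series finale', 'lost series finale');

while (my $line = <>) {
    my $obj;
    try { $obj = decode_json($line) } catch { };
    next unless $obj && $obj->{text};

    my $text = lc $obj->{text};
    # Print the original record (separator and all) so the output stays in
    # the same carriage-return-separated format as the capture.
    print $line if grep { index($text, $_) >= 0 } @indicators;
}

It would run the same way as stem.pl does: pipe the gzipped capture through it and write the survivors to a new file.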