Twitter Dashboard

This is part 3 of the Twitter analytics tutorial, and it introduces some basic text-extraction techniques for discovering what people are tweeting about. We’ll develop two utilities that help you track trending terms and investigate the cause behind spikes in tweet traffic about your brand, products, or services, or those of your competitors. Previously, in part 1, we connected to the Twitter feed and created a dashboard with graphs that display count metrics for our topic; in part 2, we extended that to track the aggregate sentiment for the topic over time.

In this tutorial, we will process the tweets from the log files created by the Twitter client developed in part 1, using the same LX NLP Server, so you’ll want to go through that tutorial first.

Getting Started

In part 1 of this tutorial series, we developed a Twitter client that subscribes to the Twitter API and receives a continuous feed of tweets matching a set of topic keywords. In addition to collecting metrics, the client writes the tweets to a log file that is rotated daily.

We now want to write a couple of command-line tools that process the tweets in these log files. If you have not already done part 1, you’ll want to run through it and collect some data to work with. When you run the client, you should get log files named tweets.YYYY-MM-DD.

Reading the Data

Let’s first look at how we can read the data in the tweet log files so we can process them. Unfortunately, Node.js has no elegant built-in way to process a file line by line. For this tutorial, I chose the n-readlines module, which makes it easy to read lines from a file without any asynchronous programming. Install the module via npm as follows:

npm install n-readlines

Reading a file line by line is very simple using n-readlines with a loop like the following:

var readlines = require('n-readlines');

var line;
var reader = new readlines("tweets-2015-12-09");
while (line = reader.next()) {
...
}

The new readlines(...) line creates a reader for the file, and the next method returns one line at a time (as a Buffer) or false when there are no more lines.

Named Entity/Noun Phrase Extraction

This first utility (topics.js) is going to use named entity recognition (NER) and noun-phrase extraction to identify what things people are tweeting about. It reads all the tweets and then reports the top entities/noun phrases by their frequency of occurrence.

Setup

The code developed in part 2 used the NLP module via the REST API. This time we are using the NLP library directly via the Javascript API since we are processing data in batch mode. The Javascript API is exported by the lxnlp module that is globally installed on the LX NLP Server. See the full documentation for details about any of the NLP methods used here.

We will run the topics.js script with the filename of the tweet log to process. The script starts with some setup code to load required modules and check command-line arguments:

var nlp = require('lxnlp');
var readlines = require('n-readlines');

// The number of entities/noun phrases reported
var n_report = 10;

if (process.argv.length < 3) {
    console.log('Usage: node topics.js <filename>');
    process.exit(1);
}

NLP pipelines

Many NLP toolkits, like the lxnlp module, follow a common design pattern: the pipeline. In a pipeline, one or more pieces of data pass through several stages of processing in sequence. For NLP, the data is a document, and each stage either transforms or annotates the document in some way and passes the modified document to the next stage. The toolkit offers a library of functions, each performing a single transform or annotation, and the application simply applies selected functions in sequence to form the NLP “pipeline” for that application.
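
To make the pattern concrete, here is a minimal, library-agnostic sketch of the idea; runPipeline and the stage functions are hypothetical, not part of lxnlp:

// Each stage is a function that takes the document and returns it
// (transformed or annotated), so the stages run in sequence.
function runPipeline(doc, stages) {
    for (var i = 0; i < stages.length; i++) {
        doc = stages[i](doc);
    }
    return doc;
}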

Loading

In addition to a document, the lxnlp module has the concept of a collection of documents. The NER and noun-phrase extraction functions are available at both the document and the collection level. In this tool, we want to process the whole collection of tweets, so we create a collection:

var collection = nlp.collection();

Now, we can read the tweets from the file and add them to the collection as shown here:

var line;
var count = 0;
var reader = new readlines(process.argv[2]);
while (line = reader.next()) {
    var entry = JSON.parse(line);
    var tweet = entry.message;

    // Exclude retweets and tweets with URLs (potential spam)
    if (tweet.search(/^RT @/) < 0 &&
        tweet.search(/http:/) < 0 &&
        tweet.search(/https:/) < 0) {
        var text = tweet.replace(/[ \n]+/g,' ');
        var doc = nlp.tokenize(text);
        doc.postag();
        collection.add(doc);
    }
    count += 1;
}

The log file entries are in JSON format, so in the body of the loop we use the built-in JSON object to parse each line and then get the tweet text, which is in the message field of the parsed object. Then there is a little filtering to exclude retweets and potential spam.

The code inside the if statement is a very simple example of an NLP pipeline. In this case, we have the following steps in our pipeline:

  • preprocessing - standard Javascript methods are used to clean up the text at the string level. In this case, we just combine multi-line texts into a single line.
  • tokenization - Tokenization parses the text into an array of tokens (words and punctuation). This is always the first step in the pipeline. The tokenize method creates a new NlpDocument, imports the raw text, performs tokenization and returns the NlpDocument.
  • POS tagging - Both the NER and noun-phrase extraction procedures are based on POS tagging. POS tagging employs a machine-learned model that predicts the part of speech (noun, verb, etc.) of each token in a document. The postag method annotates the tokens in the document with their part of speech and returns the document.

All the NlpDocument methods that transform or annotate the document return the document so that they can be chained together. For example, the above could have been written:

var doc = nlp.tokenize(text)
             .postag();

The extraction step

Having loaded the tweets as a collection of documents, we are now ready to process the collection. Performing NER is just a single call:

var entities = collection.posNER();

This method applies NER to each document in the collection and then aggregates all of the recognized names into an array of pairs. The first element of each pair is the number of documents containing the named entity, and the second element is the text of the named entity. The array is sorted by count in ascending order, so we can use the Javascript pop method to easily report the most frequently occurring entities:

var n;
var m = count;  // total tweets read; denominator for the percentage column
console.log('\tENTITIES');
n = (entities.length > n_report) ? n_report : entities.length;
for (var i = 0; i < n; i++) {
    var e = entities.pop();
    console.log((100*e[0]/m).toFixed(1) + '\t' + e[0] + '\t' + e[1]);
}

Noun-phrase extraction works exactly the same way using the posNP method:

var phrases = collection.posNP(true);

console.log('\tNOUN PHRASES');
n = (phrases.length > n_report) ? n_report : phrases.length;
for (var i = 0; i < n; i++) {
    var np = phrases.pop();
    console.log((100*np[0]/m).toFixed(1) + '\t' + np[0] + '\t' + np[1]);
}

The topics.js script is now ready to run. Run it on the LX NLP Server as before:

/opt/node32/bin/node topics.js tweets.2015-12-10

Usage

One way to use this utility is to investigate what’s being talked about on a given day. For example, using the dashboard we built earlier in this series, we saw a spike in Twitter traffic on Aug 20th for our keywords “Final Fantasy”. Using the topics script, we saw iOS and Android as top named entities, and ipad, iphone and app as top noun phrases. This was due to the release of the mobile app on that day. Of course, you already know what is happening in your own business, but this is useful for tracking competitors.

Another way to use this data would be to create a sort of track board that displays the top entities/noun phrases daily. This could include stats like how long a term has been on the list or how far it has moved up or down, as sketched below. This is yet another way to gauge the effectiveness of PR and marketing campaigns.
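
As a rough sketch of how day-over-day movement might be computed (this helper and its inputs are hypothetical, not part of lxnlp):

// Given yesterday's and today's ranked term lists (most frequent first),
// report each term's movement: 'new', or positions moved (positive = up).
function movement(yesterday, today) {
    var moves = {};
    for (var rank = 0; rank < today.length; rank++) {
        var term = today[rank];
        var prev = yesterday.indexOf(term);
        moves[term] = (prev < 0) ? 'new' : prev - rank;
    }
    return moves;
}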

Retweet Extraction

Our second utility is retweet.js. Like the first, it will process a tweet log file, but this time the goal is to report the most retweeted tweets. Let’s start with a look at some sample retweets:

RT @smitharyy: » http://t.co/gI4s10Apsl #FinalFantasy7Remake Final Fantasy 7 Remake Everything we know about the Final Fantasy VII remake: …
RT @smitharyy: » http://t.co/gI4s10Apsl Final Fantasy 7 Remake #FinalFantasy7Remake Everything we know about the Final Fantasy VII … http:/…
RT @smitharyy: → http://t.co/gI4s10Apsl Final Fantasy 7 Remake #FinalFantasy7Remake Everything we know about the Final Fantasy VII remake: …

These are easy to recognize as retweets because they all start with RT @username. However, even though these are all retweets of the same original tweet, the text of the retweets varies. Some variation is due to the client the user tweets from, which may prefix the retweet differently or truncate the end differently. Users can also add their own text before or after the original text.

We’d like to identify and count all of these as the same retweet. The process is akin to clustering (creating groups of similar tweets), although the variation here is minimal, which makes it considerably easier. Nevertheless, we’ll make use of the bag-of-words model, which is commonly used for clustering.

Bag of words

There are many different ways to represent a text passage in a program. Most often they are just strings: arrays of characters. Strings are convenient for reading, writing, printing and basic cut and paste, but not very useful for NLP. For NLP, a text passage is at least tokenized, as discussed above, resulting in an array of tokens. The next most common representation is the bag of words (BOW), which is simply the set of unique tokens in the passage; that is, an unordered list of tokens without duplicates. The BOW model is useful for clustering because it makes it easy to compare the similarity of two documents while ignoring the original word order.
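
As a quick illustration, here is a plain-Javascript sketch of the idea (lxnlp’s bag method, used below, does this for us):

// Reduce an array of tokens to its bag of words: unique tokens, order ignored.
function bagOfWords(tokens) {
    var bag = {};
    for (var i = 0; i < tokens.length; i++) {
        bag[tokens[i]] = true;
    }
    return Object.keys(bag);
}

// bagOfWords(['to', 'be', 'or', 'not', 'to', 'be']) => ['to', 'be', 'or', 'not']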

The retweet pipeline

Let’s first look at the pipeline for creating the retweet BOW and then afterward incorporate that into a script to find the most frequent retweets. To create the BOW model, we chain a number of functions together to form the pipeline:

var doc = nlp.tokenize(text);
doc.removePunctuation()
    .removeURL()
    .filter(function (x) {
        return x.charCodeAt(x.length-1) != 8230;
    })
    .bag();

As always, it begins with nlp.tokenize, which creates the document and tokenizes the text. The next two steps remove punctuation and URLs, which are not useful for matching.

The filter method lets you filter the tokens in the document using a custom function, which takes a token as its parameter and returns true if the token should be kept or false otherwise. In this case, we are filtering out the truncated word near the end of the retweet. A truncated word consists of a few letters of the original word plus … (Unicode character 8230), and it varies depending on where the app decided to truncate the text.
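
For example, a word truncated to "rema…" (as in the sample retweets above) ends in that character:

var token = 'rema…';
token.charCodeAt(token.length - 1);  // 8230, so the filter returns false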

Finally, bag simply adds the BOW to the document based on the set of remaining tokens.

Matching tweets

Given the BOW models of two tweets, we would like to write a function that decides whether the two texts are retweets of the same original tweet. One possibility is the following:

// Return true if A is a subset of B or B is a subset of A
function match(A, B) {
    var pieces = A.intersect(B); // returns [A-B, A & B, B-A]
    return pieces[0].length == 0 || pieces[2].length == 0;
}

This function uses the NlpDocument’s intersect method, which compares the BOW of the document to the BOW of the argument. It returns three values in an array: the words in the first document that are not in the second (A-B), the words common to both documents (A & B), and the words in the second that are not in the first (B-A).

This match function simply returns true if either one is a subset of the other, which means A-B or B-A will be empty. You could also relax this with a threshold instead, e.g., treating two tweets as a match if their bags differ by no more than N words.
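
For example, a relaxed variant might look like the following sketch; matchLoose and maxDiff are hypothetical names, but intersect is the same call used above:

// Treat two tweets as the same retweet when their bags differ by at most
// maxDiff words in total (the symmetric difference of the two bags).
function matchLoose(A, B, maxDiff) {
    var pieces = A.intersect(B); // returns [A-B, A & B, B-A]
    return pieces[0].length + pieces[2].length <= maxDiff;
}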

The main loop

Putting all this together, we have the following main loop that loads the tweets and matches them by author and the content of the tweet.

var total = 0;
var by_author = {};

var line;
var reader = new readlines(process.argv[2]);
while (line = reader.next()) {
    var entry = JSON.parse(line);
    var tweet = entry.message;

    // Retweets have the form "RT @username: text"
    if (tweet.search(/^RT @/) == 0) {
        var i = tweet.indexOf(': ');
        var author = tweet.substring(4,i);
        var text = tweet.substring(i+2).replace(/[ \n]+/g,' ');
        var doc = nlp.tokenize(text);

        // Clean data and create bag-of-words
        doc.removePunctuation()
            .removeURL()
            .filter(function (x) {
                return x.charCodeAt(x.length-1) != 8230;
            })
            .bag();

        if (author in by_author) {
            // If we've already seen tweets by this author, try to match
            // this tweet to ones already seen.
            var seen = by_author[author];
            for (i = 0; i < seen.length; i++) {
                if (match(seen[i].doc, doc)) {
                    // We've seen this tweet, increment the count
                    seen[i].n += 1;
                    // If this copy is longer, replace the one in the seen list.
                    if (doc.bag.length > seen[i].doc.bag.length)
                        seen[i].doc = doc;
                    break;
                }
            }
            if (i == seen.length) {
                // Have not seen this one.
                seen.push({n: 1, doc: doc, text: text});
            }
        } else {
            // First tweet seen by author.
            by_author[author] = [{n: 1, doc: doc, text: text}];
        }
    }
    total += 1;
}

The by_author object is used as a dictionary mapping each author to an array of that author’s tweets that have been retweeted. For each retweet, we apply the BOW pipeline and try to match it to the tweets already seen from that author. If no match is found, the tweet is added to that author’s list. If it does match one already seen, we bump a counter that tracks how many times that tweet has been retweeted.

Reporting the top retweets

At the end of the main loop, we have a list of tweets organized by author, with counts of the number of times each was retweeted. We now want to flatten this into a single list of tweets and sort it by count to report the most frequently retweeted. The following fragment loops through each author, adding their tweets to a flat list.

var tweets = [];
for (var author in by_author) {
    var author_tweets = by_author[author];
    for (var i = 0; i < author_tweets.length; i++) {
        tweets.push(author_tweets[i]);
    }
}

Finally, we can sort the flat list of tweets and grab the most frequent retweets using pop, as we did in the first utility:

tweets.sort(function (a, b) {
    return a.n - b.n;  // ascending by retweet count, so pop yields the largest
});

var n = (tweets.length > n_report) ? n_report : tweets.length;
for (var i = 0; i < n; i++) {
    var tweet = tweets.pop();
    console.log(tweet.n + ' ' + tweet.text);
}

That’s the whole script! For simplicity, you can grab this script from my blog repository.

When we run this script on our Aug 20th data as we did in the topics.js section, we find the number one retweet is:

FINAL FANTASY VII IS AVAILABLE AT THE APP STORE

Of course, this explains why iOS, ipad and iphone appeared in the top entity and noun-phrase lists around that day.

Closing

That concludes part 3 of the tutorial and the initial set of tutorials planned for this series. There are still more topics I hope to cover in this series such as full-fledged tweet clustering and combining the information we’ve collected thus far with social-network analysis. If that sounds interesting, please follow me on Twitter or LinkedIn to hear about future posts. Also, please leave comments and suggestions below.

The full source code from this tutorial is also available from the GitHub blog repository.