This is a course project for Practical Data Science on a classification problem with Twitter data, covering preliminary text processing, feature construction, and model building.

Twitter Data


The tweets are extracted from six Twitter accounts: realDonaldTrump, mike_pence, GOP, HillaryClinton, timkaine, TheDemocrats. For each tweet, there are two pieces of information:

  • screen_name: the Twitter handle of the user tweeting and

  • text: the content of the tweet.

The overarching goal of the problem is to "predict" the political inclination (Republican/Democratic) of a Twitter user from one of their tweets. The ground truth (i.e., the true class labels) is determined from the screen_name of the tweet as follows:

  • realDonaldTrump, mike_pence, GOP are Republicans

  • HillaryClinton, timkaine, TheDemocrats are Democrats

Text Processing

I clean up the raw tweet text using various functions offered by the nltk package.

The generated list of tokens should meet the following specifications:

  1. The tokens must all be in lower case.

  2. The tokens should appear in the same order as in the raw text.

  3. The tokens must be in their lemmatized form. 

  4. The tokens must not contain any punctuation. Punctuation should be handled as follows: (a) the possessive 's must be dropped, e.g., She's becomes she; (b) other apostrophes should be omitted, e.g., don't becomes dont; (c) words must be broken at hyphens and other punctuation.

Part of the difficult work for me was figuring out the correct order in which to apply the above operations.
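The ordering matters, for instance, the possessive 's must be stripped before other apostrophes are removed. A minimal sketch of the punctuation handling (the project itself uses nltk, whose WordNetLemmatizer step is omitted here for brevity):

```python
import re

def tokenize(text):
    """Clean a raw tweet into tokens per the rules above.
    Lemmatization (rule 3) via nltk's WordNetLemmatizer is omitted here."""
    text = text.lower()                # rule 1: lower case
    text = re.sub(r"'s\b", "", text)   # rule 4a: drop the possessive 's
    text = text.replace("'", "")       # rule 4b: omit other apostrophes
    # rule 4c: break words at hyphens and all remaining punctuation
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]

print(tokenize("She's right: don't re-tweet!"))
# → ['she', 'right', 'dont', 're', 'tweet']
```

Note that swapping steps 4a and 4b would turn She's into shes rather than she, which is why the order is fixed.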

Feature Construction

In this part, I construct bag-of-words feature vectors and training labels from the processed text of the tweets and the screen_name column, respectively. The number of possible words is prohibitively large, and not all of them may be useful for our classification task. The first sub-task is to determine which words to retain and which to omit. The common heuristic is to construct a frequency distribution of words in the corpus and prune out the head and tail of the distribution. The intuition is as follows: very common words (i.e., stopwords) add almost no information regarding the similarity of two pieces of text, while very rare words tend to be typos.
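The pruning heuristic can be sketched as below; `min_count` and `max_doc_frac` are hypothetical thresholds for illustration, not the project's actual values.

```python
from collections import Counter

def build_vocab(tokenized_tweets, min_count=2, max_doc_frac=0.8):
    """Keep words that are neither too rare nor too common.
    min_count prunes the tail (likely typos); max_doc_frac prunes
    the head (stopword-like words appearing in most tweets)."""
    n_docs = len(tokenized_tweets)
    counts = Counter(tok for toks in tokenized_tweets for tok in toks)
    doc_freq = Counter(tok for toks in tokenized_tweets for tok in set(toks))
    return sorted(
        w for w, c in counts.items()
        if c >= min_count and doc_freq[w] / n_docs <= max_doc_frac
    )

tweets = [['the', 'tax', 'plan'], ['the', 'tax', 'cut'],
          ['the', 'economy'], ['the', 'vote', 'plan']]
print(build_vocab(tweets))
# → ['plan', 'tax']  ('the' is too common, the rest too rare)
```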

The next step is to derive feature vectors from the tokenized tweets. In this section, I construct a bag-of-words TF-IDF feature vector.

I also assign each tweet a class label (0 or 1) based on its screen_name: 0 for realDonaldTrump, mike_pence, and GOP, and 1 for the rest.
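Both steps can be sketched as follows. The TF-IDF computation is written out by hand to show the idea; in practice sklearn's TfidfVectorizer does this (with additional smoothing and normalization).

```python
import math
from collections import Counter

REPUBLICANS = {"realDonaldTrump", "mike_pence", "GOP"}

def label(screen_name):
    """Class label from screen_name: 0 = Republican, 1 = Democrat."""
    return 0 if screen_name in REPUBLICANS else 1

def tfidf_vectors(tokenized_tweets, vocab):
    """Plain bag-of-words TF-IDF: term count times log(N / doc_freq)."""
    n = len(tokenized_tweets)
    df = {w: sum(1 for toks in tokenized_tweets if w in toks) for w in vocab}
    idf = {w: math.log(n / df[w]) if df[w] else 0.0 for w in vocab}
    vectors = []
    for toks in tokenized_tweets:
        counts = Counter(toks)
        vectors.append([counts[w] * idf[w] for w in vocab])
    return vectors

docs = [['tax', 'plan', 'tax'], ['plan', 'vote']]
vecs = tfidf_vectors(docs, vocab=['plan', 'tax'])
# 'plan' appears in every tweet, so its idf (and tf-idf) is 0;
# 'tax' appears in half of them, so it gets a positive weight.
```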


I use the sklearn package to learn a model that classifies the tweets as desired.

The classifier I use is a Support Vector Machine (SVM). At the heart of SVMs is the concept of kernel functions, which determine how the similarity/distance between two data points is computed. I evaluate the classifier by its accuracy, i.e., the fraction of all data points that it classifies correctly.
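A minimal sketch of the training and evaluation loop with sklearn's SVC; the tiny 2-D dataset here is a synthetic placeholder standing in for the real TF-IDF feature vectors and labels.

```python
from sklearn.svm import SVC

# Placeholder data: two linearly separable classes in 2-D,
# standing in for the TF-IDF vectors and 0/1 party labels.
X = [[-2.0, 0.0], [-1.0, 1.0], [1.0, 0.0], [2.0, 1.0]]
y = [0, 0, 1, 1]

clf = SVC(kernel="linear")   # kernel choice governs the similarity measure
clf.fit(X, y)

# score() returns the accuracy: fraction of correctly classified points
print(clf.score(X, y))
```

In a real experiment the data would of course be split into training and test sets (e.g. with sklearn's train_test_split) and accuracy reported on the held-out portion.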