Dec 28, 2016

Sentiment analysis using tweets


This year's machine learning project for CIS 520 at Penn was to predict whether a Twitter user was happy or sad using the words and images associated with the user's tweet. A training dataset of 4,500 labeled tweets (0 = sad, 1 = happy) and 4,500 unlabeled tweets was given, along with 4,500 hold-out tweets to be used as test samples.

1. Data Wrangling

Two types of language data were given: the raw user tweets and a processed version of them (matrix X), in which each tweet was mapped onto counts of the 10,000 most popular words in the entire dataset. The first step was to inspect the raw tweets, since the word-count data was already in a straightforward format. Here are some examples of raw user tweets:

'little princess #girls #smile #beautiful /8 cchil5l'

':( but true now / rcmfy4kl'

'l''unica #lamborghini che avrò mai per le mani . / npea2teq' (Italian: "the only #lamborghini I'll ever have in my hands")

'#3ed #ushaigr #saudi تحلوي / ufigrx2h'

We immediately noticed that not all tweets were in English and that sentences contained many symbols (emoji included). It was clear that the data needed to be cleaned for meaningful feature extraction. The following procedures were carried out for data wrangling and feature extraction:

  • Filter out and count the number of external links
  • Remove punctuation and meaningless symbols (i.e. non-emoji)
  • Remove stop words such as the, la, at, to, a
  • Trim whitespace
  • Perform the negation trick, replacing 'not happy' with 'not ~happy', so that the word happy in a negated tweet doesn't contribute to a happy classification
  • Count happy and sad words (e.g. happy words: good, happy, birth; sad words: sad, bad, die, cry, ~happy)
  • Count happy and sad emojis (e.g. happy emojis: :), :], =]; sad emojis: :(, :[, =[)
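The wrangling steps above can be sketched roughly as follows. The lexicons here are toy versions of the hand-curated lists actually used, and the `wrangle` function name and link pattern are illustrative assumptions, not the project's exact code.

```python
import re

# Toy lexicons for illustration; the project's hand-curated lists were larger.
STOP_WORDS = {"the", "la", "at", "to", "a"}
HAPPY_WORDS = {"good", "happy", "birth"}
SAD_WORDS = {"sad", "bad", "die", "cry", "~happy"}
HAPPY_EMOJIS = {":)", ":]", "=]"}
SAD_EMOJIS = {":(", ":[", "=["}

def wrangle(tweet):
    """Clean one raw tweet and extract simple sentiment features."""
    # 1) Filter out and count external links (assumed to start with "http")
    links = re.findall(r"http\S+", tweet)
    tweet = re.sub(r"http\S+", " ", tweet)

    # 2) Count emoticons before punctuation is stripped
    n_happy_emoji = sum(tweet.count(e) for e in HAPPY_EMOJIS)
    n_sad_emoji = sum(tweet.count(e) for e in SAD_EMOJIS)

    # 3) Remove punctuation/symbols, lowercase, trim whitespace
    tokens = re.sub(r"[^\w\s]", " ", tweet.lower()).split()

    # 4) Negation trick: the word after "not" gets a "~" prefix,
    #    so "not happy" becomes "not ~happy"
    out, negate = [], False
    for tok in tokens:
        out.append("~" + tok if negate else tok)
        negate = (tok == "not")

    # 5) Drop stop words
    tokens = [t for t in out if t not in STOP_WORDS]

    # 6) Count happy and sad words
    feats = {
        "links": len(links),
        "happy_words": sum(t in HAPPY_WORDS for t in tokens),
        "sad_words": sum(t in SAD_WORDS for t in tokens),
        "happy_emojis": n_happy_emoji,
        "sad_emojis": n_sad_emoji,
    }
    return tokens, feats
```

For example, `wrangle("not happy :( http://x.co the end")` yields the tokens `["not", "~happy", "end"]` along with one link, one sad emoji, and one sad word (`~happy`).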

The final 'trimmed' dataset was converted to a custom dictionary (different from the top-10,000-word dictionary), and we also had quantitative measurements of the sentiment level from the happy/sad word and emoji counts.

2. Training a Model

Given the sparsity of the word-count matrix X, we decided to implement a Naive Bayes (NB) model, since some discriminative models (e.g. logistic regression) may underestimate the probability of rare events. Before training the NB model, we performed principal component analysis (PCA) via singular value decomposition (SVD) to reduce the dimensionality of the word-count feature matrix.

[Figure: Reconstruction accuracy as a function of the number of dimensions retained after PCA]

Since PCA is an unsupervised learning method (it uses no labels), all labeled and unlabeled training data were used to learn a projection matrix W that transforms the word-count matrix into a lower-dimensional representation.
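PCA via SVD reduces to a few lines of NumPy: center the data, take the top-k right singular vectors as the projection matrix W, and project. This is a minimal sketch (the function name, the rank k, and the toy random data are assumptions for illustration):

```python
import numpy as np

def pca_project(X, k):
    """Learn a rank-k PCA projection of the rows of X via SVD.

    Returns the projection matrix W (d x k) and the projected data X @ W.
    """
    Xc = X - X.mean(axis=0)  # center each feature (word-count) column
    # Economy SVD: Xc = U @ diag(S) @ Vt; rows of Vt are the principal axes,
    # sorted by decreasing singular value
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T  # top-k right singular vectors as columns
    return W, Xc @ W

# Toy usage with a random stand-in for the word-count matrix
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 50)).astype(float)
W, Z = pca_project(X, 10)  # W: (50, 10), Z: (100, 10)
```

The columns of W are orthonormal, so the same W learned on the pooled labeled and unlabeled data can be applied to the hold-out tweets with a single matrix product.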

Here is a more detailed post on PCA via SVD written by my group mate

For tweets that had few or no words, we trained a support vector machine (SVM) model on image features associated with the tweets and combined the SVM predictions with the NB predictions.
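One simple way to stitch the two models together is to fall back to the image-based SVM label whenever a tweet has too few words for the text model to be reliable. The fallback rule, word threshold, and function name below are assumptions for illustration, not the project's exact scheme:

```python
def combine_predictions(nb_pred, svm_pred, word_counts, min_words=3):
    """Use the image-based SVM label for tweets with fewer than
    `min_words` tokens; otherwise keep the Naive Bayes label.

    The min_words threshold is an illustrative assumption.
    """
    return [svm if n < min_words else nb
            for nb, svm, n in zip(nb_pred, svm_pred, word_counts)]
```

For example, `combine_predictions([1, 0, 1], [0, 1, 0], [5, 1, 4])` keeps the NB labels for the 5- and 4-word tweets and takes the SVM label for the 1-word tweet, giving `[1, 1, 1]`.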

3. Testing the Model

The NB model trained on the PCA-reduced word counts, combined with the SVM predictions on images, gave a test accuracy of 79.4%. When the custom dictionary was added as extra features, the test accuracy improved to 81.0%. Finally, running a separate logistic regression on the extracted features (counts of happy/sad words and emojis) and using confidence thresholds of 10% and 85% for strongly sad and strongly happy predictions, respectively, to replace the NB predictions improved the test accuracy to 81.58%.
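The final override step can be sketched as follows: where the logistic regression's predicted probability of "happy" is extreme (below 10% or above 85%, the thresholds quoted above), its label replaces the NB label; otherwise the NB label stands. The function name is illustrative:

```python
def override_with_lr(nb_pred, lr_prob, low=0.10, high=0.85):
    """Replace NB labels where the logistic regression is confident:
    probability below `low` -> strongly sad (0),
    probability above `high` -> strongly happy (1),
    otherwise keep the NB prediction.
    """
    out = []
    for nb, p in zip(nb_pred, lr_prob):
        if p < low:
            out.append(0)
        elif p > high:
            out.append(1)
        else:
            out.append(nb)
    return out
```

For example, `override_with_lr([1, 0, 1, 0], [0.05, 0.5, 0.9, 0.95])` returns `[0, 0, 1, 1]`: the first and last predictions are overridden by the confident logistic regression, while the middle one (probability 0.5) keeps its NB label.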


I am a computational scientist finishing my PhD at the University of Pennsylvania. I picked up programming coming into graduate school, and after years of computational research, I'm amazed by what data can do. I love using data analytics to find trends that, when exposed, empower people to make informed decisions about the world they live in. I'm also the co-founder of the Penn Data Science Group.