1. Data Wrangling
Two types of language data were given: the raw user tweets, and a processed version of those tweets (matrix X) in which each tweet was mapped to a word count over the 10,000 most popular words in the entire dataset. Since the word-count data was already in a straightforward format, the first step was to inspect the raw tweets. Here are some examples of raw user tweets:
'little princess #girls #smile #beautiful /8 cchil5l'
':( but true now / rcmfy4kl'
'l''unica #lamborghini che avrò mai per le mani . / npea2teq'
'#3ed #ushaigr #saudi ØªØÙ„ÙˆÙŠ / ufigrx2h'
We immediately noticed that not all tweets were in English and that many contained symbols (emoji included). It was clear that the data needed to be cleaned before meaningful features could be extracted. The following procedures were carried out for data wrangling and feature extraction:
- Filter out and count the number of external links
- Remove punctuation and meaningless (non-emoji) symbols
- Remove stop words such as the, la, at, to, a
- Trim whitespace
- Perform the negation trick, replacing 'not happy' with 'not ~happy', so that the word 'happy' in a negative tweet doesn't get misread as a happy signal
- Count happy and sad words (e.g. happy words: good, happy, birth; sad words: sad, bad, die, cry, ~happy)
- Count happy and sad emojis (e.g. happy emojis: :), :], =]; sad emojis: :(, :[, =[)
The final 'trimmed' dataset was converted to a custom dictionary (distinct from the top-10,000-word vocabulary), and we also obtained quantitative measurements of sentiment level.
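The wrangling steps above can be sketched as a single cleaning function. This is a minimal illustration, not the actual pipeline: the tiny word/emoji lexicons and the `http\S+` link pattern are assumptions, and the real lists were certainly larger.

```python
import re

# Hypothetical minimal lexicons, using the examples from the post.
STOP_WORDS = {"the", "la", "at", "to", "a"}
HAPPY_WORDS = {"good", "happy", "birth"}
SAD_WORDS = {"sad", "bad", "die", "cry", "~happy"}
HAPPY_EMOJIS = {":)", ":]", "=]"}
SAD_EMOJIS = {":(", ":[", "=["}

def clean_tweet(tweet):
    """Strip links, punctuation, and stop words; apply the negation
    trick; count sentiment-bearing words and emoticons."""
    # Filter out and count external links (assumed to look like 'http...').
    links = re.findall(r"http\S+", tweet)
    tweet = re.sub(r"http\S+", " ", tweet)

    # Record emoticons before punctuation stripping destroys them.
    emojis = [t for t in tweet.split() if t in HAPPY_EMOJIS | SAD_EMOJIS]

    # Remove punctuation / meaningless symbols, then trim whitespace.
    tweet = re.sub(r"[^\w\s~]", " ", tweet.lower())
    tokens = [t for t in tweet.split() if t not in STOP_WORDS]

    # Negation trick: "not happy" -> "not ~happy".
    out, negate = [], False
    for tok in tokens:
        out.append("~" + tok if negate and tok != "not" else tok)
        negate = tok == "not"

    return {
        "tokens": out,
        "n_links": len(links),
        "n_happy": sum(t in HAPPY_WORDS for t in out)
                   + sum(e in HAPPY_EMOJIS for e in emojis),
        "n_sad": sum(t in SAD_WORDS for t in out)
                 + sum(e in SAD_EMOJIS for e in emojis),
    }

res = clean_tweet("not happy :( http://t.co/abc the end")
print(res["tokens"])   # -> ['not', '~happy', 'end']
print(res["n_sad"])    # -> 2  (the '~happy' token plus the ':(' emoticon)
```

Note how the negation trick turns 'happy' into a sad-lexicon hit ('~happy'), which is exactly the misclassification it is meant to prevent.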
2. Training a Model
Given the sparsity of the word-count matrix X, we decided to implement a Naive Bayes (NB) model, since some discriminative models (e.g. logistic regression) may underestimate the probability of rare events. Before training the NB model, we performed principal component analysis (PCA) via singular value decomposition (SVD) to reduce the dimensionality of the word-count feature matrix.
Since PCA is an unsupervised learning method (i.e. it requires no labels), all labeled and unlabeled training data were used to learn a projection matrix W that transforms the word-count matrix into a lower-dimensional representation.
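A minimal sketch of PCA via SVD on a word-count matrix follows. The centering step is a standard PCA choice rather than something stated in the post, and the toy matrix is purely illustrative.

```python
import numpy as np

def pca_via_svd(X, k):
    """Learn a k-dimensional projection from the word-count matrix X.

    Returns the projection matrix W (top-k right singular vectors)
    and the reduced representation X @ W.
    """
    Xc = X - X.mean(axis=0)                # center each word-count column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                           # shape (n_words, k)
    return W, Xc @ W

# Toy example: 6 "tweets" over a 4-word vocabulary, reduced to 2 dimensions.
X = np.array([[3, 0, 1, 0],
              [2, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 2, 0, 2],
              [1, 1, 1, 1],
              [4, 0, 0, 0]], dtype=float)
W, Z = pca_via_svd(X, k=2)
print(W.shape, Z.shape)   # -> (4, 2) (6, 2)
```

Because PCA needs no labels, the same call can be run on the concatenation of labeled and unlabeled tweets, and the learned W then applied to any new word-count vector.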
Here is a more detailed post on PCA via SVD written by my group mate
For tweets with few or no words, we trained a support vector machine (SVM) on image features associated with the tweets and used the SVM predictions alongside the NB predictions.
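The post doesn't spell out exactly how the SVM and NB outputs were combined; one plausible reading, sketched here as an assumption, is that tweets with too little text simply take the image-based SVM prediction instead of the text-based NB one. The `min_words` cutoff is hypothetical.

```python
import numpy as np

def combine_nb_svm(nb_pred, svm_pred, word_counts, min_words=2):
    """Use the image-based SVM prediction wherever a tweet has
    too few words for the text-based NB model to be reliable."""
    use_svm = word_counts < min_words      # little or no text available
    return np.where(use_svm, svm_pred, nb_pred)

nb_pred = np.array([1, 0, 1])              # 0 = sad, 1 = happy
svm_pred = np.array([0, 1, 1])
word_counts = np.array([0, 12, 1])         # words per tweet after cleaning
print(combine_nb_svm(nb_pred, svm_pred, word_counts))  # -> [0 0 1]
```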
3. Testing the Model
A NB model using the PCA-reduced word counts together with the SVM image predictions gave a test accuracy of 79.4%. When the custom dictionary was added as extra features, the test accuracy improved to 81.0%. Finally, running a separate logistic regression on the extracted features (counts of happy/sad words and emoticons) and replacing the NB predictions with its output whenever its confidence fell below 10% (strongly sad) or above 85% (strongly happy) improved the test accuracy to 81.58%.
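The final ensembling step can be sketched as a confidence-gated override: the logistic regression on the extracted counts replaces the NB label only when its predicted P(happy) is extreme. The 10%/85% thresholds come from the post; the function name and label encoding (0 = sad, 1 = happy) are assumptions.

```python
import numpy as np

def combine_predictions(nb_labels, lr_probs, lo=0.10, hi=0.85):
    """Override NB labels where the logistic regression is very confident.

    nb_labels: NB predictions (0 = sad, 1 = happy).
    lr_probs:  logistic-regression P(happy) for each tweet.
    """
    final = np.asarray(nb_labels).copy()
    final[lr_probs < lo] = 0    # strongly sad: trust the logistic regression
    final[lr_probs > hi] = 1    # strongly happy
    return final                # everything in between keeps the NB label

nb = np.array([1, 0, 1, 0])
probs = np.array([0.05, 0.90, 0.50, 0.40])
print(combine_predictions(nb, probs))   # -> [0 1 1 0]
```

Only the first two tweets are overridden; the middle-confidence cases (0.50, 0.40) keep their NB labels, which is what limits how much damage a poorly calibrated logistic regression can do.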