The sentiment on US Economy from Twitter

Is the economic crisis over? How do people feel about the US Economy and its future? These are some of the questions that many people ask these days, and the signs are somewhat mixed. The Dow Jones is close to the 10000 mark and some US Economy indices suggest that the worst is behind us. But do people feel the same?

To answer these questions, 10000 Tweets containing the word economy were collected to find out what people think and how they feel about the US Economy and the economic crisis. The following web chart shows some of the results :



PositiveSentiment is an annotation type that includes all words that suggest positivity, such as good, better, advances, while the opposite annotation (NegativeSentiment) exists for all keywords that suggest negativity.

The bolder the line between two words, the heavier the association. To get an idea of how people feel, look at the line that connects NegativeSentiment and the word still, which implies that the strongest sentiment is that the US Economy still has big problems.

Some other findings :

- The US President says that the economy is getting better but people don't feel the same.

- The economy cannot be getting better while at the same time there are layoffs.

- People express very negative feelings after losing their jobs.


Notice also the association between NegativeSentiment and people, job, money, sales. Interesting insights can also be found if brand names and product categories are taken into account : In this analysis a specific brand was found that was associated with the word sales and a good overall sentiment. Buying behavior, in terms of consumer intentions, can also be identified.

You will also find that an association exists between finance_institution keywords (in this case the keyword Fed) and PositiveSentiment. This association exists because a number of Re-Tweets are about the Fed signaling the start of the exit from recession and its impact on housing. Also interesting is the association between the word fool and the PositiveSentiment annotation (...)

Specific Tweets were removed such as spam Tweets (that try to sell investing products). Re-Tweets were kept intact since we are making the assumption that if someone Re-Tweets -say- a positive sentiment Tweet then he/she also feels the same -positive- sentiment. Tweets that were jokes were identified, marked accordingly and removed.

As with many examples in the past, the software used consisted of GATE (for annotating unstructured text from Tweets) and SPSS Clementine (now PASW Modeler). Here is the setup from GATE :




Specific rules (JAPE) were used to identify and annotate negative and positive sentiment. Consider the following sentences :

- The economy is most likely bad at the moment
- If the economy is great then why so many people can't find a job?

The first sentence clearly has a negative sentiment since the word bad exists. The second sentence, however, contains the word great, so a specific matching rule should take the word If into consideration and annotate this sentence as having negative sentiment despite the presence of the word great.
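The idea behind such a rule can be sketched in a few lines of Python. This is not the JAPE grammar that was actually used (it is not shown here); the keyword lists and the `annotate` function are hypothetical, and the "If" check is a deliberately crude stand-in for a real pattern rule.

```python
import re

# Hypothetical keyword lists; the real gazetteers are not shown in this post.
POSITIVE = {"good", "better", "great", "advances"}
NEGATIVE = {"bad", "worse", "layoffs"}

def annotate(sentence):
    """Return a sentiment annotation and an IfGood flag for one sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    # Conditional rule: a sentence that opens with "if" and contains a
    # positive word is treated as doubting that positive claim.
    if_good = bool(tokens) and tokens[0] == "if" and has_pos
    if if_good or has_neg:
        sentiment = "NegativeSentiment"
    elif has_pos:
        sentiment = "PositiveSentiment"
    else:
        sentiment = None
    return {"sentiment": sentiment, "IfGood": if_good}

for s in ("The economy is most likely bad at the moment",
          "If the economy is great then why so many people can't find a job?"):
    print(annotate(s))
```

Both example sentences end up as NegativeSentiment, and only the second one sets the IfGood flag.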

After running GATE, here is what the -now structured- data look like for a smaller sample of the original dataset (notice the highlighted record and the IfGood flag) :


With data in a structured form like the one depicted above, we are ready to identify which Tweets were found to have a positive or negative sentiment, spot erroneous annotations, take corrective actions and finally analyze the information and extract knowledge from it.

Mining the Tweets

I received through my Google Alerts a very interesting article : Twitter is in talks with Microsoft and Google regarding the use of Data Mining technology on user Tweets.

Although Twitter execs do not appear eager to close the deal as soon as possible, this news clearly shows where things are going. If and when the deal is finalized it will be very interesting to see :


1) What kind of Data and Text Mining techniques will be mostly used? Which of them will prove useful?

Many examples of what can be done in terms of Data and Text Mining application on Twitter were given in this blog (starting from January 2009). In my opinion, types of analysis that will prove to be interesting -apart from Sentiment Mining for Products and Services which is already taking place- are Cluster Analysis (see post "Clustering the Thoughts of Twitter Users" here) and Prediction of Virality.

Although Twitter will be able to monetize through insights extracted from Cluster Analysis and Opinion - Sentiment Mining, perhaps the most important analysis is finding patterns in user emotional states. Recall that everything needed for such an analysis exists in user Tweets : Life Events, thoughts and their associated emotional states. What emotions drive people when they make decisions such as which Product to buy or which Politician to support? What kind of feelings are generated during a bad economy? Perhaps by analyzing Tweets we could understand people (and thus consumers) in entirely new ways, since this is the first time that this information is available to us.

2) How will Twitter users react when knowing their Tweets are being analyzed?

My first impression is that Twitter users do not care too much if companies extract the insights discussed above. However, this does not mean that people's opinion will stay this way. User reaction on this matter could change at any time and should be watched closely.

3) Which other technologies will be mostly sought?

Although no one can give a definitive answer, I would expect Natural Language Processing (NLP) and Ontologies to be heavily used and sought after as expertise.

Surviving Cancer, Happiness and Twitter

Twitter is a great source of information on how people feel and how they behave. In previous posts we have discussed several examples of extracting from Twitter posts the feelings of Twitter users, their beliefs and values.

My latest analysis goal was to extract specific life events (such as the birth of a child) and the associated feelings and emotions of such an event.

First I wanted to identify life events associated with happiness. To do this I used text classification and a great piece of software called GATE. The data used originated from the tweets and biographies of 60K Twitter users.

After completing the analysis, several "patterns of happiness" emerged, but I believe there is one that deserves a post of its own : One of the happiest groups of people on Twitter is cancer survivors. I was really amazed to find that these people, who faced -and are possibly still facing- this life-threatening disease, were amongst the happiest people on Twitter and very frequently used words expressing happiness, satisfaction and blessedness.

I do believe that Twitter is a huge source of information and insights for Marketing, Branding and PR. It also appears that by analyzing Tweets we could learn some important life lessons as well.

More to come soon.

A computer program predicts Viral Tweets

In the previous post we saw that the author of a Tweet is the most important factor in making a Tweet viral. This time we will use Text Mining to score Tweets and see how viral they could become. Each Tweet is fed to a computer program (an algorithm) and the algorithm responds with the probability that the Tweet will become viral (we consider a Tweet viral when it receives more than 30 Re-Tweets).




The information given to the algorithm is the text of the Tweet and its author. Many other parameters can be taken into consideration, such as the time the Tweet was posted, the type of the Tweet (i.e. politics, technology, health, etc.) or even whether the Tweet is part of a novel subject. Here is the output of the software that performs the predictions :




The number of Re-Tweets is shown in squares. Pay close attention also to the circled text shown above. For each Tweet the most probable outcome is given ('t'= Tweet will become viral, 'f'=otherwise) together with a confidence for the prediction, expressed as a number from 0 to 1. As an example, the first Tweet shown above was posted by Paula Abdul, saying that she will not return to American Idol. The algorithm predicts with a confidence of 63.38% that what Paula Abdul posted will be interesting enough to become viral (and it actually was).

The predictive model has an overall accuracy of 72.88% in predicting which Tweets will be viral, on a total of 59 Tweets. An example of an incorrect prediction can be seen at the 4th circle from the top : the algorithm gave a 53.66% confidence that this Tweet would not become viral, but it actually did.

You can find the text file of the actual run from the algorithm here.

By looking at the text file, result metrics such as TP (True positives) versus FP (False positives) can be calculated. It is also interesting to see how the algorithm switches to negative predictions when the number of Re-Tweets of a Tweet drops below 30.
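As a rough illustration of how such metrics could be derived, here is a minimal Python sketch. It assumes each row of the run has been parsed into a (predicted label, Re-Tweet count) pair; the sample rows below are made up and are not taken from the actual run. As a sanity check on the figures quoted above, 43 correct predictions out of 59 gives 72.88%.

```python
VIRAL_THRESHOLD = 30  # more than 30 Re-Tweets counts as viral

def confusion_counts(rows):
    """rows: iterable of (predicted_label, retweet_count), labels 't' or 'f'."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for predicted, retweets in rows:
        actual = "t" if retweets > VIRAL_THRESHOLD else "f"
        if predicted == "t":
            counts["TP" if actual == "t" else "FP"] += 1
        else:
            counts["TN" if actual == "f" else "FN"] += 1
    return counts

def accuracy(counts):
    """Share of predictions that matched the actual outcome."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0

# Hypothetical rows, for illustration only.
rows = [("t", 120), ("t", 45), ("f", 60), ("f", 12), ("t", 8)]
counts = confusion_counts(rows)
print(counts, accuracy(counts))
print(round(43 / 59 * 100, 2))  # the accuracy reported for the 59 Tweets
```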

Even though the example given here is very simplistic -and optimistic-, the application of a tool of this kind for PR, Marketing and Branding could prove very useful. Marketeers can try different messages and see what impact each message is likely to have. Consider the following run that shows that @mashable is more influential than @lifeanalytics :




The following run shows that specific keywords raise our chances of making a Viral Tweet :




In theory this information could provide the basis for performing A/B tests : One could simply use the 2 messages shown above and record what impact each one has using Google Analytics (a process which could prove whether this technology works or not).

Finding information that is interesting to the masses is actually a much harder problem. Twitter is a data source that is biased for many reasons : Specific people can spread their messages with great ease and Twitter is used by specific population segments. Almost a week ago I came across reddit and I believe that this site (and also Digg) is able to capture the preferences of the masses more efficiently than Twitter. The truth is that the available information from forums, blogs and many other websites can capture different aspects of human behavior. All that is needed to extract useful knowledge is an efficient blending of these facts, emotions and beliefs of people from different web sources.

Predicting the next Viral Tweet

It is time to use Twitter data for another reason : Can Predictive Analytics be used to identify which tweets have an increased probability of becoming viral?





First we have to define the problem and see what information we should consider. Every Tweet has an author and content, and is posted on a specific day and time. More specifically, for every tweet we can collect usage data such as

  • Day of Post
  • Time of post
  • Elapsed minutes since tweet has been posted
  • Author of tweet (Twitter username)
  • Number of followers of the author
and also information such as :

  • Subject of post
  • Whether the tweet involves a question being asked
  • Whether the tweet contains hashtags
  • Whether the tweet contains a "Please Re-Tweet" directive (or variants)
  • Whether a user is mentioned
  • The text of the tweet itself.
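To make the two lists above concrete, here is a minimal Python sketch of how these fields could be derived for a single tweet. The function names, field names and regular expressions are illustrative assumptions, not the extraction that was actually used, and the subject of the post is left out because it needs a separate classification step.

```python
import re
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def usage_features(author, followers, posted_at, collected_at):
    """Usage data: who posted, when, and how long ago."""
    elapsed = collected_at - posted_at
    return {
        "author": author,
        "followers": followers,
        "day_of_post": DAYS[posted_at.weekday()],
        "hour_of_post": posted_at.hour,
        "elapsed_minutes": int(elapsed.total_seconds() // 60),
    }

def content_features(text):
    """Simple yes/no flags derived from the text of the tweet."""
    lowered = text.lower()
    return {
        "has_question": "?" in text,
        "has_hashtag": re.search(r"#\w+", text) is not None,
        # Matches "please RT", "pls retweet", "plz re-tweet" and similar.
        "asks_for_retweet": re.search(
            r"\b(please|pls|plz)\s+(rt|re-?tweet)\b", lowered) is not None,
        "mentions_user": re.search(r"@\w+", text) is not None,
        "text": text,
    }

# Hypothetical tweet, for illustration only.
row = usage_features("someone", 1000,
                     datetime(2009, 8, 3, 14, 30), datetime(2009, 8, 3, 16, 0))
row.update(content_features("Please RT: is the #economy recovering? via @someone"))
print(row)
```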

Our goal then is to combine the information mentioned above and come up with a predictive model that, when given an author, day, time of post and text of a tweet, will be able to tell us whether this tweet has an increased probability of becoming viral.

For this Data & Text mining exercise (and keeping in mind that tweets have been sampled from one website and not Twitter itself) let's define what a viral tweet is : After collecting approx. 8000 tweets from dailyrt.com it was found that the median number of Re-tweets is 17. Here we make the assumption that a tweet is considered viral if it exceeds 30 Re-tweets (this specific assumption actually makes the classification task much easier).
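The labeling step itself is simple enough to sketch in Python. The Re-tweet counts below are made up so that their median is 17, as in the collected sample; they are not the real data. Because the threshold sits above the median, fewer than half of the tweets end up labeled viral.

```python
from statistics import median

VIRAL_THRESHOLD = 30  # a tweet with more than 30 Re-tweets is considered viral

def label_viral(retweet_counts, threshold=VIRAL_THRESHOLD):
    """Turn raw Re-tweet counts into 't' (viral) / 'f' (not viral) labels."""
    return ["t" if n > threshold else "f" for n in retweet_counts]

def viral_share(labels):
    """Fraction of tweets labeled viral."""
    return labels.count("t") / len(labels) if labels else 0.0

# Hypothetical counts, for illustration only.
sample = [3, 9, 12, 17, 17, 31, 48, 120]
labels = label_viral(sample)
print(median(sample), labels, viral_share(labels))
```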

As discussed above, usage data do not tell us anything about the content of a tweet. They tell us the name of the author, the number of his/her followers, when the tweet was posted and how many minutes have elapsed since then. Can this information alone predict whether a tweet will become viral? A data mining model (which did not use the elapsed time as an input field) predicted whether a tweet would be viral with an overall accuracy of 75.03% and -perhaps as expected- showed that the most important factor for making a viral tweet is its author. Running a process called Feature Selection tells us just that :



But what we have seen so far only tells us one -the Data Mining- side of the story. With Text Mining we can see the importance of words and authors. To do that, each author is appended at the end of each tweet (so essentially the author becomes a part of each tweet text). Here is what Feature Selection tells us :
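The feature selection method used by the software is not named here, so the following Python sketch uses information gain purely as a stand-in to show the idea: the author is added to each tweet as an extra token (with a hypothetical `author=` prefix so it cannot be confused with a mention), and every token is scored by how well its presence separates viral from non-viral tweets. The four tweets are made up.

```python
import math
from collections import Counter

def tokens_with_author(text, author):
    """Tokenize the tweet and add its author as one extra token."""
    return set(text.lower().split()) | {"author=" + author.lower()}

def entropy(pos, neg):
    """Binary entropy (in bits) of a pos/neg split."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(docs):
    """docs: list of (token_set, is_viral). Returns {token: gain in bits}."""
    n = len(docs)
    n_pos = sum(1 for _, viral in docs if viral)
    base = entropy(n_pos, n - n_pos)
    present, present_pos = Counter(), Counter()
    for toks, viral in docs:
        for t in toks:
            present[t] += 1
            if viral:
                present_pos[t] += 1
    gains = {}
    for t, with_t in present.items():
        without = n - with_t
        pos_with = present_pos[t]
        pos_without = n_pos - pos_with
        # Entropy that remains after splitting on "token present or not".
        cond = (with_t / n) * entropy(pos_with, with_t - pos_with)
        if without:
            cond += (without / n) * entropy(pos_without, without - pos_without)
        gains[t] = base - cond
    return gains

# Hypothetical tweets, for illustration only.
docs = [
    (tokens_with_author("new gadget review", "mashable"), True),
    (tokens_with_author("new social media tips", "mashable"), True),
    (tokens_with_author("new coffee today", "someone"), False),
    (tokens_with_author("new lunch photo", "someone"), False),
]
gains = information_gain(docs)
for token, gain in sorted(gains.items(), key=lambda kv: -kv[1])[:3]:
    print(token, round(gain, 3))
```

In this toy sample the author tokens separate the two classes perfectly, while a word that appears in every tweet carries no information at all.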



A Tweet mentioning Michael Jackson has a great probability of becoming viral, but perhaps it should also be posted by a popular author to make a greater impact. Note also that @mashable and @theonion are at the top of the feature selection list shown above.

The difficult -but also interesting- task is to predict a viral tweet that has an impact not because of its author but because of its content. To do this, the methodology of data collection and analysis differs significantly.

In the next post we will see a model predicting viral tweets in action : We will submit several tweets and their authors and the model will tell us the probability that each submitted tweet will become viral.

How Habitat UK *should* have used Twitter

Following the great post from Tiphereth Gloria, I wanted to take the opportunity to show an example of how Habitat UK should be using Twitter.

My suggestion would be that, instead of the "initiative" they took, they should identify the values, beliefs and needs of their customers by capturing and analyzing relevant tweets. And here is how they could do it :

First they should capture all relevant Tweets every -say- month :



The second step would be to identify what people want when they talk about furniture. If they had used Text Mining they would have found specific furniture products that customers want to buy and the values associated with these product types. For an example, look at the following table :



The table shows us (pay attention to the dark red cells) that customers looking to buy baby furniture have Safety as their number one associated value. With this knowledge Habitat UK could make sure that when they advertise Baby furniture they use this word in their advertisements to capture the interest of their customers. Of course what is shown above is not new information; it is only meant as an example.
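A table of this kind boils down to counting how often a product type and a value are mentioned in the same tweet. Here is a minimal Python sketch of that counting step; the keyword groups and the sample tweets are hypothetical, and a real analysis would use proper annotation rather than plain keyword matching.

```python
from collections import Counter

# Hypothetical keyword groups, for illustration only.
PRODUCTS = {
    "baby furniture": ["crib", "cot", "nursery"],
    "sofa": ["sofa", "couch"],
}
VALUES = {
    "Safety": ["safe", "safety"],
    "Comfort": ["comfortable", "comfy", "cozy"],
}

def matches(text, keywords):
    """True if any keyword appears as a whole word in the text."""
    words = set(text.lower().split())
    return any(k in words for k in keywords)

def cooccurrence(tweets):
    """Count tweets in which a product type and a value appear together."""
    table = Counter()
    for text in tweets:
        for product, product_words in PRODUCTS.items():
            if not matches(text, product_words):
                continue
            for value, value_words in VALUES.items():
                if matches(text, value_words):
                    table[(product, value)] += 1
    return table

tweets = [
    "looking for a safe crib for the nursery",
    "is this cot safe enough",
    "our new couch is so comfy",
]
for (product, value), count in cooccurrence(tweets).most_common():
    print(product, value, count)
```

The highest counts in each row would correspond to the darkest cells of the table.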

Some more things that Habitat UK could have done with Text Mining would be to see :

  • How important it is to suggest solutions to customers
  • Which rooms people want to re-furnish more often and -more importantly- why.
  • How problems (such as furniture that arrives damaged or is difficult to assemble) affect their brand.
  • How people feel excited when they wait for their new furniture...and how bad they feel when furniture is not delivered on time.

There is much more that can be done : By running Cluster analysis, many kinds of customer thoughts can be grouped together. One such group showed how closely "Feeling good" is related to new furniture and how it affects people's psyche.

By using Social Media Analytics, Habitat UK -and most other companies- would understand their customers better, see what is important for them and with this knowledge they would be able to take informed decisions that would -most likely- make a real difference.


How people use Twitter - 10 distinct usage groups

In this post we will look at another example of cluster analysis performed on Twitter. The analysis was performed on 17000 Twitter users with the goal of extracting distinct usage groups, which essentially show us the different types of usage behavior of Twitter users. The following parameters were taken into consideration :

  • Number of Followers
  • Number of Links posted per 20 Tweets (not during RT)
  • Number of Updates
  • Elapsed Days

The following table shows the results :


Note that each cluster has a specific number from 1 to 10. Clusters are listed according to their size, which means that cluster "10" is the largest usage group, while cluster "5" is the smallest.
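For readers who want to try something similar, here is a minimal Python sketch of the overall procedure. The clustering algorithm used for this analysis is not stated, so plain k-means is used here as an assumption; the five users are made up, each described by (followers, links per 20 tweets, updates, elapsed days).

```python
import random
from collections import Counter

def standardize(rows):
    """Scale each column to zero mean and unit variance so no field dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def nearest(point, centers):
    """Index of the center closest to the point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])))

def kmeans(rows, k, iterations=50, seed=0):
    """Plain k-means: returns the cluster index of each row and the centers."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    for _ in range(iterations):
        assignment = [nearest(r, centers) for r in rows]
        for i in range(k):
            members = [r for r, a in zip(rows, assignment) if a == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(c) / len(members) for c in zip(*members)]
    return [nearest(r, centers) for r in rows], centers

def clusters_by_size(assignment):
    """List (cluster, size) pairs, largest cluster first."""
    return Counter(assignment).most_common()

# Hypothetical users: (followers, links per 20 tweets, updates, elapsed days)
users = [(100, 1, 200, 300), (120, 2, 220, 310), (110, 1, 210, 305),
         (5000, 9, 150, 60), (5200, 10, 160, 55)]
assignment, _ = kmeans(standardize(users), k=2)
print(assignment, clusters_by_size(assignment))
```

Standardizing first matters because follower counts are orders of magnitude larger than link counts and would otherwise dominate the distance.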

Let's see what the table tells us, starting with the first line : Cluster 10 is the largest (=most frequent) type of usage behavior. Users of that group have an average number of followers, have been using Twitter for relatively many days (elapsedDays=high), have a high number of updates, while the number of links they provide per 20 tweets is average (say, around 3 links).

Now consider -highlighted- cluster 8, which we will call The Information providers : Notice that even though this group of users has relatively few elapsed days and an average number of updates, they achieve a High number of followers. The reason is that these users provide a large number of links per 20 Tweets (note that this confirms findings from a previous analysis).

See also cluster 3 : Even though this group of users has been on Twitter for many days and also has a high number of updates, it appears that it pays a price for not providing links.

Recall that the "#OfLinks" parameter counts only those links that are NOT part of a Retweet. This tells us that users who are able to find original content and provide it to the community tend to gain more followers.
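One way such a parameter could be computed is sketched below in Python. The exact counting rules used in the analysis are not given, so this is an assumption: take a user's 20 most recent tweets, skip anything that looks like a Retweet, and count the links in the rest. The patterns and sample tweets are illustrative.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# Treat a tweet as a Retweet if it starts with "RT" or contains "RT @user".
RETWEET_PATTERN = re.compile(r"^\s*RT\b|\bRT\s+@\w+", re.IGNORECASE)

def links_per_20(tweets):
    """Count links in the non-Retweet posts among the first 20 tweets given."""
    recent = tweets[:20]
    return sum(len(URL_PATTERN.findall(t))
               for t in recent if not RETWEET_PATTERN.search(t))

# Hypothetical tweets, for illustration only.
tweets = [
    "Great read http://example.com/a",
    "RT @someone: look at this http://example.com/b",
    "no link here",
    "two links http://example.com/c http://example.com/d",
]
print(links_per_20(tweets))
```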

This analysis was given with the aim of providing a simple example and should not be considered as a detailed analysis since few parameters have been taken into account. Cluster Analysis on Twitter data (which include things that people like doing, professions, interests, marital status, mention of products or opinions to name a few) can -potentially- give us excellent insights on different aspects of user behavior.