Clustering the thoughts of Twitter Users

During the last two posts I presented the reasons for, and some problems of, analyzing the thoughts of users on the web, particularly on Twitter (for more, see Part1 and Part2).

As an example, we are going to look at a specific kind of thought that Twitter users express: what they don't want. Using the Twitter API, I extracted all tweets containing the phrase "i don't want to". The following text file shows the results:




The next step is to remove all phrases that tell us nothing about what users do not want:



Finally we remove the phrase "i don't want to". However, consider the following example:

"I must go to Chicago. I don't want to do that"


The steps discussed above will discard the first sentence, which is actually what the user does not want to do, and leave only the phrase "i don't want to do that", which is not particularly informative. At this point we must quantify the problem (let's assume it affects 8.5% of our records) and recall what the Pareto principle is all about.
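To make the issue concrete, here is a minimal Python sketch of the phrase-removal step and of how one might measure the share of records affected. It is only an illustration, not the pre-processing actually used for this post: the names `extract_unwanted`, `is_uninformative` and `uninformative_share`, as well as the list of vague remainders, are assumptions made up for the example.

```python
import re

# The trigger phrase from the post; the apostrophe may be straight, curly or missing.
TRIGGER = re.compile(r"\bi\s+don['’]?t\s+want\s+to\b", re.IGNORECASE)

# Hypothetical list of remainders that carry no information on their own.
VAGUE_REMAINDERS = {"", "do that", "do this", "do it", "that", "this", "it"}


def extract_unwanted(tweet):
    """Return the text that follows the trigger phrase, or None if it is absent."""
    match = TRIGGER.search(tweet)
    if match is None:
        return None
    remainder = tweet[match.end():]
    # Keep only the rest of the current sentence, lower-cased and with tidy spacing.
    remainder = re.split(r"[.!?\n]", remainder, maxsplit=1)[0]
    return " ".join(remainder.lower().split())


def is_uninformative(remainder):
    """True when the remainder only points back to an earlier sentence."""
    return remainder in VAGUE_REMAINDERS


def uninformative_share(tweets):
    """Fraction of matching tweets whose remainder is uninformative."""
    remainders = [r for r in map(extract_unwanted, tweets) if r is not None]
    if not remainders:
        return 0.0
    return sum(is_uninformative(r) for r in remainders) / len(remainders)
```

Running `uninformative_share` over the extracted tweets would give a figure comparable to the 8.5% assumed above, which is the number needed to decide whether the problem is worth fixing.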


After some additional pre-processing steps, which are not discussed here, I feed the data to K-Means to see which clusters the algorithm comes up with. For a better presentation of the results, here is a screen capture from IBM's UI Modeler:
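For readers who have not seen K-Means before, the following is a small plain-Python sketch of the idea (Lloyd's algorithm on word-count vectors). It is not the IBM tool used for the screen capture, and the phrases are hypothetical stand-ins for the pre-processed tweets. The starting centroids are fixed by hand so the run is repeatable; real implementations usually pick them at random.

```python
from collections import Counter


def vectorize(texts):
    """Turn each text into a word-count vector over a shared vocabulary."""
    vocabulary = sorted({word for text in texts for word in text.split()})
    vectors = []
    for text in texts:
        counts = Counter(text.split())
        vectors.append([float(counts[word]) for word in vocabulary])
    return vocabulary, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors, initial_centroids, max_iterations=100):
    """Assign each vector to its nearest centroid, then recompute the means."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean_vector(members)
    return labels


# Hypothetical remainders left after removing "i don't want to".
phrases = [
    "go to work",
    "go to work today",
    "go to school",
    "go to school tomorrow",
]
vocabulary, vectors = vectorize(phrases)
# Seed with one "work" phrase and one "school" phrase (an assumption for the demo).
labels = k_means(vectors, initial_centroids=[vectors[0], vectors[2]])
```

With these seeds the two "work" phrases end up sharing one label and the two "school" phrases the other, which mirrors the kind of grouping discussed in this post on a much smaller scale.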




We immediately see, in descending order, what Twitter users do not want:

1) They do not want to go to work
2) They do not want to go to school
3) They do not want to hear about various issues
4) They do not want to buy things


Notice also the top two categories, named Miscellaneous and None. These categories contain thoughts whose frequency is too low to form a cluster. Together they make up 69.56% of our records, and at this point we should think again about the Pareto principle.

Please note that not all the necessary work is discussed here; I had to omit several actions that have to take place. In trying to understand what people actually think, I am using an approach that combines Ontologies, Information Extraction, Clustering and Classification analysis, with the ultimate goal of minimizing the percentage of thoughts that cannot form a cluster (69.56% in this example) and increasing the accuracy of the analysis.

It is also interesting that we could move further down the sentence branch (see this post) for even better insight. Here I presented a cluster analysis of what users do not want. As an example, we could apply clustering to user thoughts specifically for "I don't want to feel".
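Moving down the sentence branch could start with something as simple as counting the word that follows the phrase and then keeping only the tweets for one branch. The sketch below assumes this simple reading of "branch"; the function names `next_word_counts` and `branch` are invented for the example and are not part of any library.

```python
import re
from collections import Counter

# Capture the single word that comes right after "i don't want to".
TRIGGER = re.compile(r"\bi\s+don['’]?t\s+want\s+to\s+(\w+)", re.IGNORECASE)


def next_word_counts(tweets):
    """Count the word that immediately follows "i don't want to"."""
    counts = Counter()
    for tweet in tweets:
        for match in TRIGGER.finditer(tweet):
            counts[match.group(1).lower()] += 1
    return counts


def branch(tweets, word):
    """Keep only the tweets that continue the phrase with the given word."""
    pattern = re.compile(
        r"\bi\s+don['’]?t\s+want\s+to\s+" + re.escape(word) + r"\b",
        re.IGNORECASE,
    )
    return [tweet for tweet in tweets if pattern.search(tweet)]
```

For instance, `branch(tweets, "feel")` would return the subset of tweets to cluster for "I don't want to feel", and `next_word_counts(tweets).most_common()` would show which branches are large enough to be worth exploring.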



4 Responses to "Clustering the thoughts of Twitter Users"

Rafa Says :
March 9, 2009 at 11:52 PM

Hi Themos, this seems like a good way. I have tried to wrap opinions on Blogs from Twitter and process it in Weka but I failed due to language specificity and due to the fact that URLs are hardly correctly resolved using the search api. Do you have any experience on that?

Themos Kalafatis Says :
March 10, 2009 at 12:07 AM

Hi Rafa,

I am not sure as to what you mean by saying about URL resolving,so please send me an email with more details.

Krish Says :
April 6, 2009 at 10:10 PM

This is pretty interesting. I guess if you extend this further to reduce mundane issues (like work, school, etc.), good business insights can be found.

arash Says :
October 31, 2010 at 4:44 PM

My Master Thesis Project is about Clustering the Users in Twitter. Im going to cluster users based on the keyword. Its a just a part of my project. So when you type "Sony" in search , the software should find the users who have talked about that and group those who have interaction with each other and have the keyword in their twits. Any idea or help that how can I do that?
I will work and pay to any one who can help me in my project.
write me : arash2001h@yahoo.com
thanks
