All of us who use Twitter know the problem of spam Tweets. Spamming on Twitter can happen in several ways. For example, spammers can attach a trending topic to tweets that have nothing to do with that topic, simply to make them visible. Other tweets do not carry misleading hashtags but still contain uninteresting information.
In a previous example, Tweets were used to analyze the sentiment of Twitter users on the U.S. economy. The study used several thousand Tweets to extract insights. However, among the tweets that genuinely discussed the economy there were several spam Tweets such as "make money online even if the economy is bad".
It is well known that the most time-consuming process in a Data / Text Mining project is pre-processing. Therefore, when one wants to analyze tweets and extract knowledge from them, one necessary step is to remove spam and uninteresting Tweets to minimize the chances of GIGO (garbage in, garbage out).
Spam detection in Tweets, and in unstructured Social Media data in general, is a difficult task. It requires "concept-aware" analysis of text. One of the interesting facets of analytics is the ability to solve the same problem in several ways or, perhaps even better, to combine all available tools to reach a better solution.
There is an ever-growing number of companies that analyze Social Media data, and erroneous data may be seriously distorting their insights, even when millions of records are available. Perhaps in the very near future, providing cleaned social media data to analytics companies or other information consumers could be a business in its own right.
It is possible to perform spam detection in many ways. Using machine learning methods is one: in other words, training a classifier on, say, hundreds of thousands of tweets that have been marked as "spam" or "no-spam", and then letting it sift through new tweets. We could instead use a more elaborate methodology and define, by non-automatic methods, rules that characterize spam Tweets. We could even consider other information, such as who Tweeted, how many followers this user has or how often '@' is used to address other users. Once again, the problem representation and the choice of algorithms, and how they are used, should be considered carefully.
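To make the classifier idea concrete, here is a minimal sketch of one possible approach: a multinomial Naive Bayes classifier written with the Python standard library only. The choice of Naive Bayes, the function names (tokenize, train, classify) and the six training tweets are all assumptions made for illustration. Only the first spam tweet is the one quoted earlier; the rest are invented, and a real project would need a far larger labeled set.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a tweet and split it into word, hashtag and mention tokens."""
    return re.findall(r"[#@]?\w+", text.lower())

def train(labeled_tweets):
    """Count tokens per label; labeled_tweets is a list of (text, label) pairs."""
    token_counts = {}
    doc_counts = Counter()
    for text, label in labeled_tweets:
        doc_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in token_counts.values():
        vocab.update(counts)
    return token_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    token_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_tokens = sum(token_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # tokens never seen in training are ignored
            smoothed = token_counts[label][token] + 1
            score += math.log(smoothed / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-labeled examples (illustration only).
TRAINING = [
    ("make money online even if the economy is bad", "spam"),
    ("earn money online fast click here", "spam"),
    ("win free money now click this link", "spam"),
    ("the economy added fewer jobs than expected this month", "no-spam"),
    ("worried about the economy and rising unemployment", "no-spam"),
    ("fed decision on interest rates will shape the economy", "no-spam"),
]

model = train(TRAINING)
print(classify("make easy money online", model))
print(classify("unemployment is rising and the economy is slowing", model))
```

With this toy training set the first tweet is labeled "spam" and the second "no-spam". The tokenizer keeps the leading '#' and '@' so that hashtags and mentions count as separate tokens, which is one simple way to let the classifier pick up on how often other users are addressed.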
Spam detection in Social Media data is one of the problems that will become more important as more analytics companies are created. Detecting interesting information is another area to watch. People want real insights.
In the previous post, tweets were used to identify what people want, feel or don't like when they visit a shopping mall. While analyzing this information, it was found that the word "Omaha" was associated with the word "Mall". On closer inspection I realized that "Omaha Mall" is a song by Justin Bieber. Of course I am not suggesting that these Tweets about Justin's song were spam, but they had nothing to do with the purpose of the analysis. Could an automated technique identify this inconsistency and suggest filtering out this information? Being able to automatically select the right information will probably become more important as the volume of text increases and fast, correct and actionable intelligence becomes a necessity.
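One simple, semi-automated way to approach this question is sketched below: count which terms most often appear alongside the keyword of interest, let an analyst review the top terms, and then drop tweets that contain a phrase the analyst judges off-topic. Everything here is an assumption for illustration, including the function names, the small stopword list and the four sample tweets, which are invented. The sketch only surfaces candidates; deciding that "Omaha Mall" is off-topic is still left to a person.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "at", "to", "is", "by", "of", "in", "and",
             "i", "my", "this", "on"}

def tokens(text):
    """Lowercase a tweet and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def cooccurring_terms(tweets, keyword):
    """Count, per term, how many tweets containing `keyword` also contain it."""
    counts = Counter()
    for text in tweets:
        words = set(tokens(text))
        if keyword in words:
            counts.update(words - STOPWORDS - {keyword})
    return counts

def drop_phrases(tweets, blocked_phrases):
    """Remove tweets containing any phrase an analyst marked as off-topic."""
    return [t for t in tweets
            if not any(p in t.lower() for p in blocked_phrases)]

# Invented sample tweets (illustration only).
TWEETS = [
    "Parking at the mall is impossible on weekends",
    "Omaha Mall by Justin Bieber is stuck in my head",
    "Listening to Omaha Mall on repeat",
    "The food court at the mall needs more seating",
]

counts = cooccurring_terms(TWEETS, "mall")
print(counts.most_common(1))   # the term most often seen next to "mall"

kept = drop_phrases(TWEETS, ["omaha mall"])
print(kept)                    # tweets left after the analyst blocks the phrase
```

In this toy sample "omaha" is the only term that appears with "mall" in more than one tweet, which is the same kind of unexpected association that prompted the closer inspection described above.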