<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-9150291873749799355</id><updated>2012-01-25T13:37:06.179+02:00</updated><category term='ontologies'/><category term='clustering'/><category term='google wave'/><category term='computational linguistics'/><category term='predictive analytics'/><category term='data mining'/><category term='concept mining'/><category term='trading'/><category term='apple'/><category term='spam detection'/><category term='politics'/><category term='economy'/><category term='sequence detection'/><category term='text mining'/><category term='real estate'/><category term='financial markets'/><category term='text analytics'/><category term='reality mining'/><category term='life analytics'/><category term='sentiment analysis'/><category term='association rule learning'/><category term='banking'/><category term='kaggle'/><category term='telecoms'/><category term='unstructured information'/><category term='decision tree'/><category term='novelty detection'/><category term='scoutlabs'/><category term='correlation matrix'/><category term='information extraction'/><category term='personalization'/><category term='knowledge hub'/><category term='digg'/><category term='rss'/><category term='concept trending'/><category term='twitter'/><category term='debt crisis'/><category term='social media analytics'/><category term='Event Detection'/><category term='GATE'/><category term='model testing'/><category term='feature selection'/><category term='financial news'/><category term='credit risk'/><category term='stock exchange indices'/><category term='R'/><title type='text'>Life Analytics</title><subtitle 
type='html'>Practical Applications of Data Mining, Text Mining and Information Extraction</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>73</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-4359165995789943964</id><published>2012-01-25T13:37:00.000+02:00</published><updated>2012-01-25T13:37:06.189+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='telecoms'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Text Analytics for Telecommunications - Part 1</title><content type='html'>&lt;div style="text-align: justify;"&gt;As discussed in the &lt;a href="http://lifeanalytics.blogspot.com/2012/01/case-study-competitive-intelligence-for.html"&gt;previous post&lt;/a&gt;, performing Text Analytics for a language for which no tools exist is not an easy task. 
The Case Study which I will present at the &lt;a href="http://www.textanalyticsnews.com/text-mining-conference-europe/"&gt;9th European Text Analytics Summit&lt;/a&gt; is about analyzing and understanding thousands of Non-English FaceBook posts and Tweets about Telco Brands and their Topics, leading to what is known as Competitive Intelligence.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;The Telcos used for the Case Study are &lt;a href="http://www.telenor.rs/"&gt;Telenor&lt;/a&gt;, &lt;a href="http://www.mts.telekom.rs/index.php/naslovnastandard.html"&gt;MT:S&lt;/a&gt; and &lt;a href="http://www.vipmobile.rs/"&gt;VIP Mobile&lt;/a&gt;, all of which are located in Serbia. The analysis aims to identify how Customers perceive each of the three Companies and to understand the Positive and Negative elements of each Telco as captured from the Voice of the Customers - Subscribers.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;By analyzing several thousand Tweets and FaceBook posts and comments we can get a first glimpse of Competitive Intelligence. 
For example, when we look for the words that frequently occur together with mentions of postpaid packages, this is what we find :&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-UYSe_RVEq-k/TxRKB2q1qlI/AAAAAAAAAg4/bWAqq34pTeo/s1600/Screen+shot+2012-01-16+at+4.48.28+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="90" src="http://4.bp.blogspot.com/-UYSe_RVEq-k/TxRKB2q1qlI/AAAAAAAAAg4/bWAqq34pTeo/s400/Screen+shot+2012-01-16+at+4.48.28+PM.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Red boxes show Telco Brands - notice "mts" and "mtsa", which both point to the same Telco, namely MT:S. &amp;nbsp;Blue boxes indicate similar words that should be merged. 
&amp;nbsp;From a first look at the results above we see that :&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;a) MT:S is found more frequently when users mention PostPaid packages.&lt;br /&gt;&lt;br /&gt;b) Telenor and VIP Mobile are not found as frequently as MT:S in PostPaid package conversations.&lt;br /&gt;&lt;br /&gt;c) We see several problems caused by insufficient pre-processing : &lt;i&gt;Kredit&lt;/i&gt; and &lt;i&gt;Kredita&lt;/i&gt;&amp;nbsp;(=credit) should merge into one word, and the same applies to&amp;nbsp;&lt;i&gt;telefona -&lt;/i&gt;&amp;nbsp;&lt;i&gt;telefon,&lt;/i&gt;&amp;nbsp;&lt;i&gt;internet &lt;/i&gt;- &lt;i&gt;interneta &lt;/i&gt;and &lt;i&gt;mts - mtsa.&lt;/i&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Notice that we can perform the same High-level analysis for several Telco Topics such as Network, Billing, Customer Care, Promotions, Questions of subscribers and so on. The next task is to identify the reason(s) why MT:S was found to have more mentions about PostPaid packages. Note that at this point we do not know why this is so : It could be that MT:S prices of prepaid packages are high or very cheap, or that something else is happening that needs to be identified.&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;The Serbian Language poses extra work because it is a highly inflected language : Even the endings of Brand names change according to usage. &amp;nbsp;Consider the following :&lt;br /&gt;&lt;br /&gt;&lt;i&gt;U mts-u &lt;/i&gt;(At mts)&lt;br /&gt;&lt;i&gt;Sa mts-om &lt;/i&gt;(With mts)&lt;br /&gt;&lt;i&gt;Bez mts-a &lt;/i&gt;(Without mts)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;It is evident that a highly inflected language explodes our feature space, and this is where R can come to the rescue with some success. 
We can use R for changing several synonyms to one word, removing (Serbian) stop words, removing URLs and performing several other pre-processing steps that are necessary prior to an extensive analysis. More on the next post.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-4359165995789943964?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/4359165995789943964/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=4359165995789943964' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4359165995789943964'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4359165995789943964'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2012/01/text-analytics-for-telecommunications.html' title='Text Analytics for Telecommunications - Part 1'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-UYSe_RVEq-k/TxRKB2q1qlI/AAAAAAAAAg4/bWAqq34pTeo/s72-c/Screen+shot+2012-01-16+at+4.48.28+PM.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8629790781529132140</id><published>2012-01-09T11:55:00.000+02:00</published><updated>2012-01-09T11:55:15.469+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='text analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='feature selection'/><category scheme='http://www.blogger.com/atom/ns#' term='GATE'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Case Study : Competitive Intelligence for Telecommunications</title><content type='html'>&lt;div style="text-align: justify;"&gt;Telcos are a good example of a fast moving business environment and a good candidate for using Competitive Intelligence analysis from Social Media sources. The Case Study involves three major Telcos located in an Eastern European Country and shows the results from the analysis of thousands of Tweets and FaceBook wall posts to understand the following :&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- How subscribers perceive each Telco Brand?&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Which information do subscribers tend to Re-Tweet and "Like" on FaceBook Wall Posts?&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Which words and Topics are commonly found with Intense feelings / thoughts?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Which topics are mostly discussed when subscribers compare two or more Telco operators?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: 
justify;"&gt;- What do subscribers say about Network Quality and Speed, Billing, Promotions, Marketing Events, Customer Care, TV Commercials etc.?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- How do they prioritize these topics and which of them are interesting and why? &amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- What do subscribers talk about in general (i.e. without any Telco Brand being mentioned) regarding Internet speed and Charges, and what would they like to see more of?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;I will present the Case Study mentioned above at the forthcoming &lt;a href="http://www.textanalyticsnews.com/text-mining-conference-europe/conference-agenda.php"&gt;9th Annual European Text Analytics Summit&lt;/a&gt; in April in London, UK. The Case Study is an example of applying Text Analytics to a language for which no tools currently exist, so the difficulties and possible solutions will also be discussed. 
Examples will also be given of analyzing information at different conceptual levels and of how this technique provides further insight into consumer behavior.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The following tools were used for the analysis :&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- &lt;a href="http://gate.ac.uk/"&gt;GATE&lt;/a&gt;&amp;nbsp;to annotate all Topics that occur within Telco conversations (such as "sms", "internet", "dropped call", "network", "promotion") and for setting up Conceptual Levels.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; for pre-processing Text and performing Text Classification, Topic Detection and Cluster Analysis.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- &lt;a href="http://www.cs.waikato.ac.nz/ml/weka/"&gt;WEKA&lt;/a&gt;&amp;nbsp;for Feature Selection and Text Classification.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Finally, Java is used to manage the information generated by GATE, for example to understand how subscribers prioritize various Telco Concepts and Topics and to identify important phrases and/or words that frequently occur when these Topics are discussed. 
&amp;nbsp;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8629790781529132140?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8629790781529132140/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8629790781529132140' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8629790781529132140'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8629790781529132140'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2012/01/case-study-competitive-intelligence-for.html' title='Case Study : Competitive Intelligence for Telecommunications'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8603213132618064715</id><published>2011-11-28T13:22:00.002+02:00</published><updated>2011-11-28T19:03:28.724+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='sequence detection'/><category scheme='http://www.blogger.com/atom/ns#' term='concept mining'/><title type='text'>New Insights from Text Analytics</title><content type='html'>&lt;div style="text-align: justify;"&gt;Text Analytics has gained the attention it deserves in the past few years. 
Sentiment Analysis is perhaps the most frequently discussed type of analysis but there will always be new ways to analyze and gain insights from text data. &amp;nbsp;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;In my opinion, two examples of new types of analysis with vast potential are Sequence Detection and Concept Mining. I am not aware whether these types of analysis are currently being implemented by any Text Mining practitioner; if you know of one, feel free to add your comments below.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;So what are Sequence Detection and Concept Mining? Some examples :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Suppose that you receive several similar e-mails from customers, like the one below :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;"&lt;i&gt;I have been trying repeatedly to solve my billing problem through customer care. I first talked with someone called&amp;nbsp;Mrs Jane Doe. She said she should transfer my call to another representative from the sales department&lt;/i&gt;. &lt;i&gt;Yet another rep from the sales department informed me that I should be talking with the Billing department instead. Unfortunately my bad experience of being transferred through various representatives was not over because the Billing department informed me that I should speak to the......&lt;/i&gt;"&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Current Text Analytics software will identify key elements of the above text but a very important piece of information goes unnoticed. 
It is the sequence of events which takes place :&lt;br /&gt;&lt;br /&gt;&amp;nbsp;(Jane Doe =&amp;gt; Sales Dept =&amp;gt; Billing Dept =&amp;gt; ...)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Being able to detect the sequence of events is an important element in understanding customer interaction. In our example above, imagine the possibility of detecting similar sequences across thousands of e-mails or call center transcripts and running a sentiment analysis, a process which could then correlate sentiment with specific event sequences.&lt;br /&gt;&lt;br /&gt;Next is the use of Concept Mining (this is just a phrase I coined for this post) : being able to analyze information at different conceptual levels. This is a very powerful technique, so let's see why.&lt;br /&gt;&lt;br /&gt;People who attended the 7th annual&amp;nbsp;&lt;a href="http://www.textanalyticsnews.com/text-mining-conference/agenda.shtml"&gt;Text Analytics Summit in Boston&lt;/a&gt; had the opportunity to listen to several presentations regarding Semantics. The discussions between experts from the Semantics Panel and the attendees revealed that people did not find Semantics practical, for several reasons. Yet in Semantics lies the power of being able to find patterns at different conceptual levels.&lt;br /&gt;&lt;br /&gt;As a very basic example, if we use Information Extraction to annotate, say, the Tweets containing mentions of American Telcos, we can tag each Telco with a more general category called TELCOS. We can also tag individual prepaid packages with a more general category called PREPAID_PACKAGES. By doing that we can then search for patterns at a more general conceptual level, rather than only at the level of a Telco Brand or a specific Telco's prepaid package. 
As an example we can run &amp;nbsp;a sentiment analysis on &lt;i&gt;all &lt;/i&gt;prepaid packages&amp;nbsp;mentions, &amp;nbsp;identify patterns of negative or positive sentiment and see which Telco is the winner of positive sentiment at a conceptual level.&lt;br /&gt;&lt;br /&gt;The possibilities are endless.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8603213132618064715?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8603213132618064715/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8603213132618064715' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8603213132618064715'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8603213132618064715'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/11/new-insights-from-text-analytics.html' title='New Insights from Text Analytics'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-5995588624124350772</id><published>2011-09-20T08:50:00.001+03:00</published><updated>2011-09-20T08:52:50.877+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='predictive analytics'/><title 
type='text'>Big Data : Case Studies, Best Practices and Why America should care</title><content type='html'>&lt;div style="text-align: justify;"&gt;We know that Knowledge is Power. Due to Data Explosion more Data Scientists will be needed and being a Data Scientist becomes increasingly a "cool" profession. Needless to say that America should be preparing for the increased need for Predictive Analytics professionals in Research and Businesses.&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;Being able to collect, analyze and extract knowledge from a huge amount of Data is not only about Businesses being able to make the right decisions but also critical for a Country as a whole. The more efficient and fast this cycle is, the better for the Country that puts Analytics to work.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="margin: 0px;"&gt;This Blog post is actually about the words and phrases being used for this post :&amp;nbsp;All words and phrases on the title of the post (and the introductory text) were carefully selected to produce specific thoughts which can be broken down in three parts :&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="margin: 0px;"&gt;&lt;ul&gt;&lt;li&gt;&amp;nbsp;Being a Data Scientist has high value.&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;"Case Studies" and "Best Practices" communicate to readers successful applications and knowledge worthwhile reading.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;"America should". This phrase obviously creates specific emotions and feelings to Americans.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="margin: 0px;"&gt;"Case Study" and "Best Practices" were phrases found to be commonly associated with posts of high visibility. 
You might also get many views if you create a post which proves that whatever concept you are writing about is the right thing to do (for example, write a post that clearly demonstrates yet another reason to use Social Media and have this post shown to Social Media Professionals).&amp;nbsp;Regarding our example : It is very probable (and logical) for Data Miners to look at and then re-tweet (or otherwise share) information which is "proof" that Data Mining is useful and also a "cool" profession. The higher concept / motive which works behind the scenes is that "I am doing the right job and this post proves it".&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="margin: 0px;"&gt;You might also get many views by submitting a post which&amp;nbsp;&lt;i&gt;disproves&lt;/i&gt;&amp;nbsp;well-accepted concepts or which demonstrates the difficulties that well-accepted concepts face : For example, if you were a Data Scientist or a BI Professional, you would be inclined to read a post titled "Big Data is a Big Hype". &amp;nbsp;Whether you will re-tweet or share the post is of course at your discretion. At this point it should be noted that there is a big difference between the number of clicks a post gets and the number of shares it gets (by Retweeting it, Liking it, etc.) because sharing a post means that this post is considered worthwhile to read.&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;All of the above (and much more) was found by analyzing thousands of Blog posts along with the number of clicks and shares they got (RTs, FaceBook "Likes", etc.) and this is what I will be presenting at&amp;nbsp;&lt;a href="http://www.textanalyticsworld.com/"&gt;Text Analytics World&lt;/a&gt;&amp;nbsp;in New York this October. 
It was also very interesting to see that some findings are in line with findings discussed by&amp;nbsp;&lt;a href="http://www.linkedin.com/in/josephcarrabis"&gt;Joseph Carrabis&lt;/a&gt;&amp;nbsp;during the Text Analytics Summit 2011 in Boston back in May.&amp;nbsp;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;Of course, this does not suggest that using specific words and phrases guarantees a successful post that is re-tweeted by thousands of people; there are many reasons for this which I will not get into here. Additionally,&amp;nbsp;&lt;b&gt;Text Analytics cannot infer the higher meaning and concepts suggested within Text&lt;/b&gt;&amp;nbsp;and this problem deserves a post of its own. This analysis however identifies concepts and/or phrases that point Bloggers and Marketers in a specific direction and, with this knowledge, increase the probability of a successful Web presence. Again, this is an example of true Social Media Intelligence. 
Not (just) Reports.&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;So, in case that this post title immediately got your attention from other posts, you've just had a little taste of Predictive Analytics in action.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-5995588624124350772?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/5995588624124350772/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=5995588624124350772' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/5995588624124350772'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/5995588624124350772'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/09/big-data-case-studies-best-practices.html' title='Big Data : Case Studies, Best Practices and Why America should care'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-5478960523760079120</id><published>2011-09-09T11:48:00.000+03:00</published><updated>2011-09-09T11:48:34.671+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='social media analytics'/><title 
type='text'>Do Social Media Monitoring tools provide True Intelligence?</title><content type='html'>&lt;div style="text-align: justify;"&gt;Having recently read a &lt;a href="http://www.webliquidgroup.com/social-media-monitoring-survey.html"&gt;report&lt;/a&gt; from WebLiquid one of the interesting facts to consider is that around 70% of Marketers replied finding the insights gleaned from Social Media Monitoring tools "Somewhat Valuable". &amp;nbsp;Slightly more than 20% of them found these insights "Extremely valuable". The report also shows that most Marketers plan to invest more in SMM tools with few of them retreating from any further investment.&lt;br /&gt;&lt;br /&gt;This is Big News. 70% of Marketers finding insights gleaned from SMM tools "Somewhat" valuable is not a good thing and perhaps there are reasons for this.&amp;nbsp;It would be very interesting to know what do Marketers consider Insights, how they prioritize those Insights and how easily they can act once they have those insights . The problem can be summarized in one sentence:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;- Marketers do not want (just) Reports.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;There is a lot of useful information provided by many Social Media Monitoring tools &amp;nbsp;:&amp;nbsp;The number of mentions of a Brand (or Product or Service) per channel, which users talk frequently about your Brand &amp;nbsp;(and which of them are considered influential). 
Sentiment Analysis provides Marketers with the perception of a Brand but also the perception of competing Brands, leading to what is known as Competitive Intelligence.&amp;nbsp;Perhaps Social Media Monitoring platforms have many types of metrics still to offer : For example, a potentially useful metric could be the ability to identify Consumer Intentions&amp;nbsp;("I will definitely buy...")&amp;nbsp;and how these intentions differ - such as "I would buy 'ABC' if it was cheaper" or "I would buy 'ABC' if I hadn't purchased 'XYZ' already".&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Notice that SMM tools provide metrics : Number of mentions per channel, Top influential users, percentage of positive / negative / neutral sentiment and sentiment intensity, how mentions of a new product disperse through different social media channels, etc.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;But what is considered Intelligence in Social Media? Would someone identify as intelligence the fact that during the past 2 months there was an increase in specific Brand Mentions on Twitter but not on YouTube? Or is it Intelligence when we notice that there has been a decline in positive sentiment about a product? &amp;nbsp;All of this information is Reporting and Feedback. This does not mean that the information is not useful : It is important to know what is happening and why.&lt;br /&gt;&lt;br /&gt;So what is True Intelligence all about?&lt;br /&gt;&lt;br /&gt;True Intelligence is about knowing how to successfully Promote and Market a Brand, Product or Service. To do that a Marketer wants to know the Best Practices : With Social Media Reports, Marketers know what is happening (a decline in positive mentions of our new smartphone) and why this is happening (a potential hardware problem). Social Media Analytics can identify the right strategies to make things happen. 
True Social Media Intelligence is about knowing which parameters (channels, number of mentions) are important in achieving a result. Is it important to have a product associated with intense (positive) sentiment? Or could it be more important to have a Product highly associated with Rumors?&lt;br /&gt;&lt;br /&gt;There is still a long way to go in terms of Insights from Social Media Monitoring tools. There are many processes and parameters that will eventually be used for deriving more Insights and better Strategies. The answer to true Social Media Intelligence is the use of Predictive Analytics (Data and Text Mining) &amp;nbsp;applied to Social Data : One area that is currently untouched by most Social Media Monitoring tools.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-5478960523760079120?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/5478960523760079120/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=5478960523760079120' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/5478960523760079120'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/5478960523760079120'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/09/do-social-media-monitoring-tools.html' title='Do Social Media Monitoring tools provide True Intelligence?'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3215372434950054417</id><published>2011-07-15T13:37:00.001+03:00</published><updated>2011-09-20T08:53:09.852+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='debt crisis'/><title type='text'>More Trends of the Greek Debt Crisis</title><content type='html'>&lt;div style="text-align: justify;"&gt;Here are some more results on mentions of various Concepts being discussed in Greek Blogs about the Greek Debt Crisis. Using Text Analytics, thousands of Greek Blogs are being annotated on a daily basis with the purpose of identifying the frequency with which several aspects of the Greek Debt crisis are discussed.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;First let's have a look at the trend line of the &lt;a href="http://en.wikipedia.org/wiki/2010%E2%80%932011_Greek_protests#The_.22Indignant_Citizens_Movement.22_.28May.E2.80.93present.29"&gt;Indignant Citizens Movement&lt;/a&gt;&amp;nbsp;:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-w_QadUEPsj8/TiAJ6GoKTaI/AAAAAAAAAeo/lQsfVUV4XM8/s1600/Screen+shot+2011-07-15+at+11.33.46+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="172" src="http://3.bp.blogspot.com/-w_QadUEPsj8/TiAJ6GoKTaI/AAAAAAAAAeo/lQsfVUV4XM8/s400/Screen+shot+2011-07-15+at+11.33.46+AM.png" width="400" 
/&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;We can see that there is a clear down-trend in the number of Blog Mentions. This is also supported by a very significant reduction in the total number of Tweets found for this subject.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Next let's see what the trend of mentions of "Greek default" has looked like over the past month :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-DNxSVKkoGqI/TiANoqdlD7I/AAAAAAAAAew/-kXcGxWPkUA/s1600/Screen+shot+2011-07-15+at+12.50.23+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="185" src="http://1.bp.blogspot.com/-DNxSVKkoGqI/TiANoqdlD7I/AAAAAAAAAew/-kXcGxWPkUA/s400/Screen+shot+2011-07-15+at+12.50.23+PM.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&amp;nbsp;We notice a sharp spike beginning on July 12th, because several Blogs and News sites mentioned a possible "Selective Default" for Greece. 
&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Interestingly, the trend on mentions of a US Default is also rising in Greek Blogs but is found with a much smaller frequency :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-nN9-mKUUsZY/TiAX_oQn43I/AAAAAAAAAe0/wRaknlqxtwE/s1600/Screen+shot+2011-07-15+at+1.34.30+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="213" src="http://2.bp.blogspot.com/-nN9-mKUUsZY/TiAX_oQn43I/AAAAAAAAAe0/wRaknlqxtwE/s400/Screen+shot+2011-07-15+at+1.34.30+PM.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3215372434950054417?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/3215372434950054417/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3215372434950054417' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3215372434950054417'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3215372434950054417'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/07/more-trends-of-greek-debt-crisis.html' title='More Trends 
of the Greek Debt Crisis'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-w_QadUEPsj8/TiAJ6GoKTaI/AAAAAAAAAeo/lQsfVUV4XM8/s72-c/Screen+shot+2011-07-15+at+11.33.46+AM.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8508418790210390864</id><published>2011-06-17T19:49:00.000+03:00</published><updated>2011-06-17T19:49:51.392+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='debt crisis'/><title type='text'>The Greek Debt crisis - Some Trends</title><content type='html'>&lt;div style="text-align: justify;"&gt;Friends and blog readers frequently ask me what I think about Greece and the problems it faces on a Social and Economic level. Since this is not a blog about Politics or the Economy, I will try to give my point of view with some analytics added.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&amp;nbsp;It is always interesting to know how people feel, what they think about the economy, their future and the politicians, and what the general sentiment is. Also of great importance is the trend of all opinions and/or sentiment as this is recorded in Blog posts and other Social Media sources.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Here are some examples from data that I collect from Greek blogs on a daily basis, several times a day. 
Hundreds of Concepts are annotated within thousands of Blog entries and collected for further analysis.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The results that I will show here are for :&lt;br /&gt;&lt;br /&gt;- the latest Government Reform&lt;br /&gt;&lt;br /&gt;- words that communicate Negative Sentiment.&lt;br /&gt;&lt;br /&gt;- The "Indignants Movement" : Citizens who disagree with the practices of the two largest Greek political parties &amp;nbsp;during the past 30 years and with the spending cuts directed by the IMF.&lt;br /&gt;&lt;br /&gt;- Debt Crisis&lt;br /&gt;&lt;br /&gt;Let us begin with the trend of "Government Reform" which at the time of writing (17/06/11 - Note that date format is &amp;nbsp;DD/MM/YY) has just happened. Here is the trend of mentions :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-M1CI82SirHo/TftkWHV9BLI/AAAAAAAAAeI/DlUoAJbB0Hw/s1600/reform.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="180" src="http://2.bp.blogspot.com/-M1CI82SirHo/TftkWHV9BLI/AAAAAAAAAeI/DlUoAJbB0Hw/s400/reform.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Notice how few mentions were captured during the previous days and how much the trend increases until June 17th, when the reform took place.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Next, let's look at entries that communicate "economic default" and their trend :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-MrxA2LxjIzM/TftlEUdiNvI/AAAAAAAAAeM/Eg_s-AnqEdE/s1600/default.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="157" 
src="http://4.bp.blogspot.com/-MrxA2LxjIzM/TftlEUdiNvI/AAAAAAAAAeM/Eg_s-AnqEdE/s400/default.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Again notice how mentions of Greek default start to rise from June 3rd and how the trend then gradually appears to fade out (French and German leaders said on June 17th that they will back up Greek debt). It was no surprise that on June 8th and 9th (yet more) Greeks rushed to Banks to withdraw their money.&lt;br /&gt;&lt;br /&gt;Here is the Trend of "The Indignants" movement :&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-2wtEPg37WSA/Tftn_AG9IPI/AAAAAAAAAeQ/DSL5W7xkosU/s1600/indignant.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="167" src="http://2.bp.blogspot.com/-2wtEPg37WSA/Tftn_AG9IPI/AAAAAAAAAeQ/DSL5W7xkosU/s400/indignant.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;Notice the dates May 29th-30th, June 5th and June 12th-13th. All of these dates are Sundays (or close to Sundays), which is the day that most people gather in Syntagma square to express their anger at the IMF and Government practices. The trend however appears to be falling but &amp;nbsp;this may well change in the coming days. Time will tell.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;How about the words that communicate Negative Sentiment? 
Here is the trend :&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-n2BVKaD8pDQ/Tftr-ppdPvI/AAAAAAAAAeU/ryPisEMZOR4/s1600/negativesentiment.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="185" src="http://1.bp.blogspot.com/-n2BVKaD8pDQ/Tftr-ppdPvI/AAAAAAAAAeU/ryPisEMZOR4/s400/negativesentiment.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;Negative sentiment words appear to be somewhat rising after 31/05 but are coming down to previous levels.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;FYI, &amp;nbsp;words that frequently occur with the concept "Politicians" are : "leaders", "cheats", "traitors".&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: justify;"&gt;More on the next post.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8508418790210390864?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://lifeanalytics.blogspot.com/feeds/8508418790210390864/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8508418790210390864' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8508418790210390864'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8508418790210390864'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/06/greek-debt-crisis-some-trends.html' title='The Greek Debt crisis - Some Trends'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-M1CI82SirHo/TftkWHV9BLI/AAAAAAAAAeI/DlUoAJbB0Hw/s72-c/reform.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-1677366605394936401</id><published>2011-06-16T10:20:00.001+03:00</published><updated>2011-06-16T14:49:13.275+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><category scheme='http://www.blogger.com/atom/ns#' term='apple'/><title type='text'>Apple Products on Twitter - A Text Analytics example</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;table border="0" cellpadding="0" cellspacing="0" style="text-align: 
justify;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="font-family: arial, sans-serif; font: inherit; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;" valign="top"&gt;&lt;div&gt;&lt;div style="text-align: justify;"&gt;My presentation at the 7th Annual Text Analytics Summit was a tutorial on one of the methodologies one could use to analyze unstructured text.&amp;nbsp;The sample consisted of 365,000 tweets that contained keywords of Apple products and concepts such as &lt;i&gt;iPad, iPhone, iPod, Apple Store, Mac, Steve Jobs&lt;/i&gt; and the goal was to get an understanding of what people were tweeting about each product or concept.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The first step is to use a text analysis toolkit (I used GATE) to annotate the tweets and identify which concepts and keywords occur within the tweets. But this is not always easy. Take the word &lt;i&gt;Mac&lt;/i&gt; for example. Depending on the context, &lt;i&gt;Mac&lt;/i&gt; could be a computer type, &amp;nbsp;a burger type, the MAC beauty products or Mac Arthur airport. So when a query that contains the word &amp;nbsp;&lt;i&gt;Mac&lt;/i&gt; is sent to the Twitter API, we end up with lots of erroneous information.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;So one of the things that has to be done to ensure good results &amp;nbsp;is word sense disambiguation. We know for example that if a tweet contains a word such as &lt;i&gt;fries, lettuce&lt;/i&gt; and/or &lt;i&gt;salad&lt;/i&gt; then quite likely the word &lt;i&gt;Mac&lt;/i&gt; found within this tweet refers to the Big Mac (even though the word &lt;i&gt;Big&lt;/i&gt;&amp;nbsp;may not be present). If we find the word &lt;i&gt;Arthur&lt;/i&gt; next to the word &lt;i&gt;Mac&lt;/i&gt; then the tweet is about the Mac Arthur airport, etc. 
Here is GATE in action, identifying different keywords and concepts in Tweets :&lt;/div&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-5ZQ1ucu65nU/TfntVPhGhPI/AAAAAAAAAd8/b8sPgFMEngc/s1600/gateclip.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="473" src="http://2.bp.blogspot.com/-5ZQ1ucu65nU/TfntVPhGhPI/AAAAAAAAAd8/b8sPgFMEngc/s640/gateclip.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: arial, sans-serif;"&gt;Now we can see which concepts and keywords appear frequently in Re-Tweets ('USER' denotes that a '@' was present in the Tweet, 'URL' that a URL link was found in the Tweet,etc)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-McTpbJe5rYA/Tfnte2V2EfI/AAAAAAAAAeA/awHlWUvYF0A/s1600/RTs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="460" src="http://2.bp.blogspot.com/-McTpbJe5rYA/Tfnte2V2EfI/AAAAAAAAAeA/awHlWUvYF0A/s640/RTs.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: arial, sans-serif;"&gt;&amp;nbsp;We can also see which words frequently occur with iPhone5 :&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: arial, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-IKUNPt9Xlzk/TfntokFB7AI/AAAAAAAAAeE/tATzgHknoYc/s1600/iPhone5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="444" 
src="http://4.bp.blogspot.com/-IKUNPt9Xlzk/TfntokFB7AI/AAAAAAAAAeE/tATzgHknoYc/s640/iPhone5.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-family: arial, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: arial, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: arial, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-1677366605394936401?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/1677366605394936401/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=1677366605394936401' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/1677366605394936401'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/1677366605394936401'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/06/apple-products-on-twitter-text.html' title='Apple Products on Twitter - A Text Analytics example'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' 
src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-5ZQ1ucu65nU/TfntVPhGhPI/AAAAAAAAAd8/b8sPgFMEngc/s72-c/gateclip.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7126022282410483989</id><published>2011-04-26T12:53:00.000+03:00</published><updated>2011-04-26T12:53:44.174+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='Event Detection'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Event Detection:  Analytics becoming more personal</title><content type='html'>&lt;div style="text-align: justify;"&gt;Sentiment Analysis is a hot technology at the moment. Marketers are interested in the perception that consumers have about &amp;nbsp;a specific brand, product or service as this is found in unstructured text. Some people claim that Sentiment Analysis does not meet their expectations but also that it is not straightforward for a company to find the "right" solution. 
&amp;nbsp;Comparing different Sentiment Analysis solutions could prove a difficult task.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Marketers and Decision makers need insights with which they can make better decisions - They need both Reports &lt;i&gt;and&lt;/i&gt; Intelligence.&amp;nbsp;Therefore the question that always follows the finding that "Your product has a 35% negative sentiment in the past 10 days" is "Why".&amp;nbsp;&amp;nbsp;Social Media Monitoring tools must also &amp;nbsp;provide actionable Intelligence.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;All this is important information as it shows why your Brand / Product / Service could be losing customers. You monitor what is being said, identify whether a negative or positive Sentiment Trend is declining or rising and take necessary actions accordingly.&lt;br /&gt;&lt;br /&gt;One of the questions I often get is what other applications can emerge from using Text Analytics and Data Mining.&amp;nbsp;With Text Analytics and Data Mining we can find behavior patterns on many levels and -assuming that information such as Tweets will keep coming- the understanding of consumers can &amp;nbsp;go to the next -and sometimes more personal- level.&lt;br /&gt;&lt;br /&gt;One of these applications is Event Detection.&amp;nbsp;I am not aware of any tool that provides Event Detection at the moment, but I believe that this type of analysis could become the next major source of consumer insights. But what exactly is "Event Detection"?&lt;br /&gt;&lt;br /&gt;Since we are able to have a computer automatically identify whether a phrase contains positive, negative or neutral sentiment, perhaps we could use Text Analytics and Machine Learning to detect from someone's Tweets, such as "i've just returned from holidays", that a specific event has happened to that individual. But that's not all. 
We can mine for patterns of consumer behavior given the fact that an event has occurred. The knowledge gained from such an analysis could be very powerful. Just as a product / service / person generates emotions, so do the events happening in our lives. These events and the emotions they create can sometimes change our lives and also drive our decisions. A logical next step is to collect behavioral Data and use Data Mining to analyze this information. &lt;br /&gt;&lt;br /&gt;I will discuss an example of using Event Detection towards the end of my presentation at the &lt;a href="http://www.textanalyticsnews.com/text-mining-conference/"&gt;7th Annual Text Analytics Summit&lt;/a&gt; in Boston this May, along with the reasons why such an analysis is important, and I am looking forward to the reactions. &lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The fact is that with more insights, privacy issues become more pressing and I get an increasing number of people asking me about privacy. I was also interviewed by a major British newspaper last month on what companies can learn by applying "Super Crunching" to Tweets. 
I tried to show both worlds of "Super Crunching" but the truth is that consumer insights become more personal as companies understand the value of structured and (more recently) unstructured information.&amp;nbsp;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7126022282410483989?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7126022282410483989/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7126022282410483989' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7126022282410483989'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7126022282410483989'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/04/event-detection-analytics-becoming-more.html' title='Event Detection:  Analytics becoming more personal'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-9032613636787386783</id><published>2011-03-15T16:06:00.000+02:00</published><updated>2011-03-15T16:06:27.727+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Social Media Data and what analysts can do with it</title><content type='html'>&lt;div style="text-align: justify;"&gt;It is worth looking at what 
having our lives "digitalized" means since all of the information currently generated from usage of Social Media is available for analysis : "Collective Intelligence" and "Behavior Mining" are terms that are becoming increasingly known.&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;But what exactly is Social Media Data? Here are some examples :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The number of followers you have on Twitter and number of friends on Facebook.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;The number of links you provide, groups you join, retweets you make and how often you talk with other friends / followers.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;The number of re-tweets, Facebook "likes", comments and views that a blog post generates.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;The personal information you provide (such as Twitter Bio)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;The concepts being discussed in Tweets and on Facebook walls.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh4.googleusercontent.com/-qYnJLfqYlpo/TX8PlMCk0eI/AAAAAAAAAds/MdrUXHuPPAs/s1600/clust1.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="337" src="https://lh4.googleusercontent.com/-qYnJLfqYlpo/TX8PlMCk0eI/AAAAAAAAAds/MdrUXHuPPAs/s640/clust1.JPG" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;By applying Predictive Analytics to all of this information an impressive number of applications arises such as : &lt;br /&gt;&lt;br /&gt;- Analysis of your Twitter Bio and words that are contained in your Tweets. For example we can identify what people who describe themselves in their Bio as "Computer Geeks" discuss more frequently (in terms of Electronic Brands, technology trends etc). 
(See more &lt;a href="http://lifeanalytics.blogspot.com/2009/05/twitter-analytics-cluster-analysis.html"&gt;here&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;- Analyze thousands of Twitter accounts and find words that could make a difference in your follower count. (It appears that you should &amp;nbsp;keep things positive -at least most of the time-. See why&amp;nbsp;&lt;a href="http://lifeanalytics.blogspot.com/2009/05/twitter-analytics-these-words-may-be.html"&gt;here&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;- Identify best practices on how to use Social Media &amp;nbsp;: When to post your new blog post, which words and concepts to avoid writing about and ultimately what concepts (such as &lt;i&gt;Personal Branding&lt;/i&gt;) you should focus on. ( See more&amp;nbsp;&lt;a href="http://lifeanalytics.blogspot.com/2010/09/social-media-insights-from-predictive.html"&gt;here&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;- Understand consumer behavior : What people liked, how they feel and what they would like to see in upcoming products and/or experiences. See &lt;a href="http://lifeanalytics.blogspot.com/2010/11/mining-consumer-behavior-in-tweets.html"&gt;this&lt;/a&gt; example on how different aspects of consumer behavior in shopping malls are "mined".&lt;br /&gt;&lt;br /&gt;Note that these are just some examples. The list goes on.&lt;br /&gt;&lt;br /&gt;There is no doubt that new exciting Social Media apps will become available. This in turn will produce even more Social Media data (such as data that contains location information). 
Being able to combine Data Mining and Text Mining techniques to extract insights from Social Media Data will become a very &amp;nbsp;important skill to have.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-9032613636787386783?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/9032613636787386783/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=9032613636787386783' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/9032613636787386783'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/9032613636787386783'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/03/social-media-data-and-what-analysts-can.html' title='Social Media Data and what analysts can do with it'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='https://lh4.googleusercontent.com/-qYnJLfqYlpo/TX8PlMCk0eI/AAAAAAAAAds/MdrUXHuPPAs/s72-c/clust1.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3283805850845077835</id><published>2011-02-16T16:21:00.001+02:00</published><updated>2011-03-02T21:38:30.510+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text 
mining'/><title type='text'>7th Annual Text Analytics Summit</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;I would like to say a few words about an upcoming major event for all of those interested in Text Analytics and its various uses in Social Media, Marketing and Business Intelligence. Starting on May 18th, the annual &lt;a href="http://www.textanalyticsnews.com/text-mining-conference/index.shtml"&gt;Text Analytics summit&lt;/a&gt;&amp;nbsp;(the only conference dedicated completely to Text Mining) will take place in Boston, MA with a total of &lt;a href="http://www.textanalyticsnews.com/text-mining-conference/speakers.shtml"&gt;28 speakers&lt;/a&gt; presenting material on applications of Text Analytics including :&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-uTiMVNJ69pw/TVt-cHtKU7I/AAAAAAAAAdU/uTMHtZEs4Xg/s1600/250-125-static.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-uTiMVNJ69pw/TVt-cHtKU7I/AAAAAAAAAdU/uTMHtZEs4Xg/s1600/250-125-static.gif" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;Social Media Analytics&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;Sentiment Analysis&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" 
style="font-size: 110%;"&gt;Voice of the Customer&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;Marketing&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;Semantics&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;Well-known names in the industry will be there (Seth Grimes,&amp;nbsp;Tom Anderson,&amp;nbsp;Gregory Piatetsky-Shapiro, Ronen Feldman) as well as experts from companies such as SAS, IBM, Forrester Research, Attensity, Adobe, J.D. Power &amp;amp; Associates, Clarabridge and others. &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;My presentation will be about &lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;Behavior Mining in Social Media&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt; using Text Analytics and I will be giving a step-by-step tutorial&amp;nbsp;on the analysis of data originating from Twitter regarding a major Electronics Brand in the USA. 
More specifically :&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;I will show how Tweets can be transformed and then analyzed using various statistical NLP techniques and software.&amp;nbsp;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;I will discuss the various problems that arise when analyzing Text data.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;I will discuss and introduce new ways of seeking valuable information and extracting insights when it comes to Mining Behavior.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;Although the Case Study will be using data from Twitter, the techniques shown can be applied to any other Text Data such as those found in FaceBook, Blog posts, User comments, etc.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: 110%;"&gt;I am looking forward to seeing the work done by others, learning about successful applications of Text Analytics and the knowledge gained, and seeing the issues that professionals come across and how they deal with them.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3283805850845077835?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://lifeanalytics.blogspot.com/feeds/3283805850845077835/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3283805850845077835' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3283805850845077835'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3283805850845077835'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/02/7th-annual-text-analytics-summit.html' title='7th Annual Text Analytics Summit'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-uTiMVNJ69pw/TVt-cHtKU7I/AAAAAAAAAdU/uTMHtZEs4Xg/s72-c/250-125-static.gif' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3008111430828754252</id><published>2011-02-11T09:37:00.002+02:00</published><updated>2011-02-11T17:30:38.350+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='trading'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Forex Trading with R : Part 2</title><content type='html'>&lt;div style="text-align: justify;"&gt;In the previous post the first steps were given for building the basis for trading forex. 
Now it is time to build the actual classifiers that can give us future buy / hold / sell signals.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Assuming that everything is in working order and the instructions given in the previous post were followed, we can start building these classifiers.&lt;br /&gt;&lt;br /&gt;First let's train a Neural Network. The following commands train a Neural Network, apply the trained model to our test data and output a confusion matrix of actual versus predicted buy/sell/hold signals :&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;set.seed(134)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;nn &amp;lt;- nnet(class~.,traindata, size = 3, rang = 0.1,decay = 0.001, maxit = 3000,trace=FALSE)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;table(actual=testdata$class,predicted=predict(nn,newdata=testdata,type="class"))&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Note that a seed number was used. &amp;nbsp;You should either try different seed numbers (so that the network weights are initialized differently) or omit the set.seed() directive. 
You should also experiment with other Neural Net parameters such as the maximum number of iterations (maxit), the weight decay (decay), etc.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The confusion matrix gives us the information needed to calculate TP, FP, TN and FN rates for each class (i.e. for each signal type).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Similarly we can train and test a Random Forest :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;rf.model&amp;lt;-randomForest(class~.,data=traindata,nodesize=40,importance=FALSE,mtry=3,ntree=100)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;table(actual=testdata$class,predicted=predict(rf.model,newdata=testdata,type="class"))&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Now let's train an SVM for our data. We can issue the following command :&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;###train SVM&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;sv&amp;lt;-svm(class~.,traindata,gamma=0.01,cost=5,kernel="radial")&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;To see how the classifier did on the test set, we enter :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: 
justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;table(actual=testdata$class,predicted=predict(sv,newdata=testdata,type="class"))&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Next we can try to optimize the parameters of the SVM classifier as follows :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;#find optimal values of Gamma and Cost for an RBF-SVM classifier&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;tuned &amp;lt;- tune(svm, class~., data = traindata,ranges = list(gamma = c(0.0001,0.001,0.05,0.1,0.2,0.3), cost = c(1,5,10,20,50,100,120,130)),tunecontrol = tune.control(sampling = "cross", cross = 10))&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;tuned&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The first command uses 10-fold cross validation to identify the best gamma and cost parameters among some predetermined values. We then issue the command &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;tuned&lt;/span&gt; to see which combination of parameters gives us the lowest classification error. Knowing these parameters, we can then use them to train an SVM classifier and see how this model performs (as was shown previously).&lt;br /&gt;&lt;br /&gt;Be aware of the following key points :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Three sets of data should be used : Training, Test and Validation. 
The Validation set should not be a part of the optimization process (i.e. finding the best algorithm parameters).&lt;/li&gt;&lt;li&gt;Make sure that you create classifiers for several time periods. Test how the performance of any classifier varies with the percentage of available data you use for training / testing / validation and the number of periods you use for the sliding window.&lt;/li&gt;&lt;li&gt;Also make sure that, once you have chosen your model, you test your system correctly by simulating buy / hold / sell signals and taking into consideration all associated trading costs.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3008111430828754252?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/3008111430828754252/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3008111430828754252' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3008111430828754252'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3008111430828754252'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/02/forex-trading-with-r-part-2.html' title='Forex Trading with R : Part 2'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' 
src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-709780501091137241</id><published>2011-01-10T19:09:00.001+02:00</published><updated>2011-01-10T19:10:44.552+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='trading'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Forex Trading with R : Part 1</title><content type='html'>&lt;div style="text-align: justify;"&gt;I recently started learning R - probably something I should have done a long time ago - and since learning by doing is the best way to learn something, I decided to use R to generate buy/sell/hold signals for the EUR/USD Pair. For those who wish to use R for making Trading decisions, this series of posts is a short introduction from which one can pursue the subject further. By no means is it implied that this post's methodology is the one that you should use to trade : response variables, signal thresholds, technical indicators and classifiers other than the ones presented here should be tried. Then, elaborate testing methods should be put to use to assess the performance of each classifier and the worth of your trading strategy.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;First, the resources that need to be downloaded and installed :  &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;1) &amp;nbsp;Packages : quantmod, nnet, e1071, tseries, randomForest&lt;/div&gt;&lt;div style="text-align: justify;"&gt;2) A file that contains OHLC Data. For this example EUR/USD Data are used. 
You can download an example file &lt;a href="http://www.megaupload.com/?d=H6PDIA6S"&gt;here&lt;/a&gt;. I downloaded the EUR/USD historical data from&amp;nbsp;&lt;a href="http://www.fxhistoricaldata.com/"&gt;here&lt;/a&gt;.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;I would also highly recommend getting &lt;a href="http://www.amazon.com/Data-Mining-Learning-Knowledge-Discovery/dp/1439810184"&gt;Data Mining with R&lt;/a&gt; : a concise book that starts with an introduction to using R and then presents various case studies, one of which is about using R to predict variations of the S&amp;amp;P Index. The author also provides the package "DMwR", which includes all necessary functionality for generating signals, extracting precision / recall metrics of generated models, performing Monte Carlo Estimates and evaluating trading strategies.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;After downloading the file from step (2), place the file in a directory of your choice.  
Now copy and paste the following commands into R :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;library(e1071)&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;library(nnet)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;library(randomForest)&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;library(quantmod)&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;library(tseries)&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Next we import the csv file from the directory where it was saved (replace "YOUR_PATH" with the path of the directory where you saved the csv file) : &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;#get data OHLC from csv file&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;raw&amp;lt;- read.delim2("/YOUR_PATH/EURUSD.csv",header=TRUE,sep=",")&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, paste the following into R (again, replace YOUR_PATH accordingly) : &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier 
new';"&gt;#convert date&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;stripday&amp;lt;-strptime(raw$DATE,format="%Y%m%d")&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;fxdata&amp;lt;-data.frame(stripday,raw)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;fxdata$TIME&amp;lt;-NULL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;fxdata$TICKER&amp;lt;-NULL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;fxdata$DATE&amp;lt;-NULL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;colnames(fxdata)&amp;lt;-c("Date","Open","Low","High","Close")&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;#write data to .csv&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;write.table(fxdata,"/YOUR_PATH/eurusd.csv",quote=FALSE,sep=",",row.names=FALSE)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;##transform to an xts object&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier 
new';"&gt;EURUSD&amp;lt;-as.xts(read.zoo("/YOUR_PATH/eurusd.csv",sep=",",format="%Y-%m-%d",header=TRUE))&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Now we define some technical indicators, the model to work on and a function that generates our trading signals &amp;nbsp;&lt;/span&gt;:&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;#setup Technical Indicators&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myATR &amp;lt;- function(x) ATR(HLC(x))[,'atr']&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;mySMI &amp;lt;- function(x) SMI(HLC(x))[,'SMI']&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myADX &amp;lt;- function(x) ADX(HLC(x))[,'ADX']&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myAroon &amp;lt;- function(x) aroon(x[,c('High','Low')])$oscillator&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myBB &amp;lt;- function(x) BBands(HLC(x))[,'pctB']&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier 
new';"&gt;myChaikinVol&amp;lt;-function(x)Delt(chaikinVolatility(x[,c("High","Low")]))[,1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myCLV &amp;lt;- function(x) EMA(CLV(HLC(x)))[,1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myMACD &amp;lt;- function(x) MACD(Cl(x))[,2]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;mySAR &amp;lt;- function(x) SAR(x[,c('High','Close')]) [,1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myVolat &amp;lt;- function(x) volatility(OHLC(x),calc="garman")[,1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myEMA10 &amp;lt;- function(x) EMA(Cl(x),n=10)[,1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myEMA20 &amp;lt;- function(x) EMA(Cl(x),n=20)[,1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myEMA30 &amp;lt;- function(x) EMA(Cl(x),n=30)[,1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myEMA50 &amp;lt;- function(x) EMA(Cl(x),n=50)[,1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;myEMA60 &amp;lt;- function(x) EMA(Cl(x),n=60)[,1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;data.model &amp;lt;- specifyModel(Delt(Cl(EURUSD)) ~ &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;       myATR(EURUSD) + mySMI(EURUSD) + myADX(EURUSD) + myAroon(EURUSD) + 
&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;       myBB(EURUSD)  + myChaikinVol(EURUSD) + myCLV(EURUSD)  +myEMA10(EURUSD) +myEMA20(EURUSD) +myEMA30(EURUSD) +myEMA50(EURUSD) + myEMA60(EURUSD) +&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;       CMO(Cl(EURUSD)) + EMA(Delt(Cl(EURUSD))) + &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;       myVolat(EURUSD)  + myMACD(EURUSD) + RSI(Cl(EURUSD)) +&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;       mySAR(EURUSD) + runMean(Cl(EURUSD)) + runSD(Cl(EURUSD)))&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;Tdata.train &amp;lt;- as.data.frame(modelData(data.model,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;                       data.window=c('2008-01-01','2010-01-01')))&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;Tdata.eval &amp;lt;- na.omit(as.data.frame(modelData(data.model,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;                       data.window=c('2010-01-02','2010-11-01'))))&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;                       &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" 
style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;# a very simple signal function : hold within +/-0.005, buy above, sell below&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;signals&amp;lt;-function(x) {&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;if(x &amp;gt;= -0.005 &amp;amp;&amp;amp; x &amp;lt;= 0.005) {result&amp;lt;-"hold"} else &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;if(x &amp;gt; 0.005) {result&amp;lt;-"buy"} else &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;if(x &amp;lt; -0.005) {result&amp;lt;-"sell"}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;result&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;#create class vector that holds TRAINING buy,sell,hold signals&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;class&amp;lt;-sapply(Tdata.train$Delt.Cl.EURUSD,signals)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;#paste both to a new data frame that holds everything&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;traindata&amp;lt;-cbind(Tdata.train,class)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;#remove Delt.Cl.EURUSD - not needed anymore. &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;traindata$Delt.Cl.EURUSD&amp;lt;-NULL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;#create class vector that holds TESTING buy,sell,hold signals&lt;br /&gt;&lt;br /&gt;class&amp;lt;-sapply(Tdata.eval$Delt.Cl.EURUSD,signals)&lt;br /&gt;&lt;br /&gt;#paste to a new data frame that holds everything&lt;br /&gt;testdata&amp;lt;-cbind(Tdata.eval,class)&lt;br /&gt;&lt;br /&gt;#remove Delt.Cl.EURUSD - not needed anymore.&lt;br /&gt;testdata$Delt.Cl.EURUSD&amp;lt;-NULL&lt;br /&gt;&lt;br /&gt;#get a summary of our traindata&lt;br /&gt;summary(traindata)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The last command prints out some summary statistics of the &lt;i&gt;traindata&lt;/i&gt; sample. Notice the 'class' attribute and the distribution of buy, hold and sell signals.&lt;br /&gt;&lt;br /&gt;Now we are ready to apply some modeling techniques using &lt;i&gt;traindata&lt;/i&gt; and &lt;i&gt;testdata&lt;/i&gt;&amp;nbsp;as the datasets to work with. 
Although I would also suggest using a third sample for validation (remember the "elaborate testing" discussed at the beginning), for this example we will keep things simple. More in the next post.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-709780501091137241?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/709780501091137241/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=709780501091137241' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/709780501091137241'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/709780501091137241'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2011/01/forex-trading-with-r-part-1.html' title='Forex Trading with R : Part 1'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-514111956484802850</id><published>2010-12-09T09:32:00.032+02:00</published><updated>2010-12-31T15:09:09.910+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>What Women Want - As seen in Tweets</title><content 
type='html'>&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;This post will probably be interesting to many of us. 15,000 Tweets containing the phrase "women want" were collected. What do women value most when it comes to how they want to feel? What do women really love? How important is it for women to feel special? And finally, can Text Mining techniques applied to Tweets really tell us this?&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Normally at this point I would describe technical details such as how I pre-processed the Tweets and the problems I ran into while trying to analyze this information. Following advice from &lt;a href="http://twitter.com/Nathalief"&gt;@nathalief&lt;/a&gt;, I will instead focus on giving information not only about what women want but also about their feelings. &lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;First let's see the results from Tweets that, apart from the phrase "women want", also contain words such as "feel", "feeling", "feels" or "felt". 
The following chart shows which words were frequently found in these Tweets (and thus what feelings a woman wants to experience):&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); "&gt;&lt;img src="http://3.bp.blogspot.com/_koDJi0ps7Mw/TQH1efrCDKI/AAAAAAAAAco/Yct9agcNyaw/s400/Screen%2Bshot%2B2010-12-10%2Bat%2B11.35.12%2BAM.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5548986120144030882" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 227px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;So it appears that one of the first priorities in terms of how women want to feel is security (shown as &lt;i&gt;safe&lt;/i&gt; and &lt;i&gt;secure&lt;/i&gt; in the chart). 
Notice how important it also is for women to feel &lt;i&gt;special &lt;/i&gt;and to feel that someone loves them (words &lt;i&gt;love, loved, like&lt;/i&gt;).&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;How about words that frequently occur with the word "Love"?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="color: rgb(0, 0, 238);  -webkit-text-decorations-in-effect: underline; font-size:16px;"&gt;&lt;img src="http://4.bp.blogspot.com/_koDJi0ps7Mw/TQH0-X7nqEI/AAAAAAAAAcg/sw-40wPXpsE/s400/Screen%2Bshot%2B2010-12-10%2Bat%2B11.34.57%2BAM.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5548985568310306882" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 261px; " /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;It appears that women want "to love and to be loved", with "&lt;i&gt;respect&lt;/i&gt;", "&lt;i&gt;affection&lt;/i&gt;" and "&lt;i&gt;sex&lt;/i&gt;" coming next. &lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Surely there must be men who also give their opinions on "what women want" within these Tweets. 
Quite possibly many guys would say that "women just love money". In order to capture those who believe that women want money, let's see which words occur frequently with the word&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt; "money" &lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;within these Tweets :&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span"  style="font-size:100%;"&gt;&lt;span class="Apple-style-span"  style="font-size:11px;"&gt;&lt;span class="Apple-style-span"   style="font-size:130%;color:#0000EE;"&gt;&lt;span class="Apple-style-span"  style="font-size:16px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-style-span"  style="font-family:Georgia, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-style-span"  style="font-family:Georgia, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style=" color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; font-size:16px;"&gt;&lt;img src="http://3.bp.blogspot.com/_koDJi0ps7Mw/TQH1s0jU9DI/AAAAAAAAAcw/z24aWni35ro/s400/Screen%2Bshot%2B2010-12-10%2Bat%2B11.35.29%2BAM.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5548986366267028530" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 251px; " /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0); "&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0); "&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Notice how the landscape of keywords changes here: apart from &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;"love", "secure" and "hurt"&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; we see words labeled as &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;"censored"&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; (for obvious reasons), &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;"shoes" &lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;and &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;"future"&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;. These words communicate a more materialistic and logical point of view on what women want. Unfortunately, for this analysis there was no way to identify which Tweets originated from women and which from men. Also, at the time these Tweets were collected, a specific re-tweet was about a more 'materialistic' profile of women (words &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;multiple, censored and shoes&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;). 
I decided to keep this re-tweet in the analyzed data because I felt that, since it was heavily re-tweeted, it was also liked by a large audience.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0); "&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0); "&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Perhaps these results show once again that "Men are from Mars and Women are from Venus".&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-style-span"  style="font-family:Georgia, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-514111956484802850?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/514111956484802850/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=514111956484802850' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/514111956484802850'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/514111956484802850'/><link rel='alternate' type='text/html' 
href='http://lifeanalytics.blogspot.com/2010/12/what-women-want-as-seen-in-tweets.html' title='What Women Want - As seen in Tweets'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_koDJi0ps7Mw/TQH1efrCDKI/AAAAAAAAAco/Yct9agcNyaw/s72-c/Screen%2Bshot%2B2010-12-10%2Bat%2B11.35.12%2BAM.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-4957897240685439463</id><published>2010-11-24T09:21:00.019+02:00</published><updated>2010-12-07T16:27:31.428+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='spam detection'/><title type='text'>Spam Detection in Social Data : A new business?</title><content type='html'>&lt;div style="text-align: justify;"&gt;All of us who use Twitter know the problem of spam Tweets. Spamming on Twitter can happen in several ways. For example, spammers can attach a trending topic to their tweets to make them visible, even though the tweets have nothing to do with that topic. Other tweets do not contain misleading hashtags but still contain uninteresting information.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;In a previous &lt;a href="http://lifeanalytics.blogspot.com/2009/10/sentiment-on-us-economy-from-twitter.html"&gt;example&lt;/a&gt;, Tweets were used to analyze the sentiment of Twitter users on the U.S. economy. The study used several thousand Tweets to extract insights. 
However, among all the tweets that discussed the economy there were several spam Tweets such as "make money online even if the economy is bad". &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;It is well known that the most time-consuming process in a Data / Text Mining project is pre-processing. Therefore, when one wants to analyze tweets and extract knowledge from them, one obvious step is to remove spam and uninteresting Tweets to minimize the chances of &lt;a href="http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out"&gt;GIGO&lt;/a&gt;. &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Spam detection in Tweets -and Social Media unstructured data in general- is a difficult task. It requires "concept-aware" analysis of Text. One of the interesting facets of analytics is the ability to solve the same problem in several ways, or -perhaps even better- to combine all available tools to reach a better solution. &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;There is an ever growing number of companies that analyze Social Media Data, and erroneous data may be seriously altering their insights - even if millions of records are available. Perhaps in the very near future, providing cleaned social media data to analytic companies or other information consumers could be a business in its own right. &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Spam detection can be performed in many ways. Using machine learning methods is one: in other words, training a classifier on -say- hundreds of thousands of tweets that have been marked as "spam" or "no-spam". 
We could also use a more elaborate methodology and manually build and define rules that characterize spam Tweets. We could even consider other information such as who Tweeted, how many followers this user has or how often '@' is used to address other users. Once again, the problem representation and the choice of algorithms should be carefully considered.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;  &lt;/div&gt;&lt;div style="text-align: justify;"&gt;Spam detection in Social Media Data is one of the problems that will become more important as more analytic companies are created. Detecting interesting information is another area to watch. People want real insights. &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;In the previous &lt;a href="http://lifeanalytics.blogspot.com/2010/11/mining-consumer-behavior-in-tweets.html"&gt;post&lt;/a&gt;, tweets were used to identify what people want / feel / don't like when they visit a shopping mall. While analyzing this information it was found that the word 'Omaha' was associated with the word "Mall". On closer inspection I realized that "Omaha Mall" is a song by Justin Bieber. Of course I am not suggesting that these Tweets about Justin's song were spam, but they had nothing to do with the purpose of the analysis. Could an automated technique identify this inconsistency and suggest filtering out this information? 
Being able to automatically select the right information will probably become more important as text information increases and a fast, correct and actionable intelligence becomes a necessity.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-4957897240685439463?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/4957897240685439463/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=4957897240685439463' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4957897240685439463'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4957897240685439463'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/11/spam-detection-in-social-data-new.html' title='Spam Detection in Social Data : A new business?'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8608020052354908151</id><published>2010-11-03T02:43:00.024+02:00</published><updated>2010-11-17T15:11:39.101+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Mining consumer behavior in 
Tweets</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_koDJi0ps7Mw/TNQFeJLf1nI/AAAAAAAAAcY/B4Ua_TYhp0E/s1600/Screen+shot+2010-11-04+at+7.01.45+PM.png"&gt;&lt;br /&gt;&lt;/a&gt;In the &lt;a href="http://smartdatacollective.com/themoskalafatis/28283/inside-consumers-mind-text-analytics"&gt;previous&lt;/a&gt; post we discussed the first steps necessary to understand what consumers write in their Tweets regarding their recent visit to a shopping Mall. In this post we will see how from this information Marketeers are able to understand spending patterns, know what consumers liked about their visit to a shopping Mall and know what is important for consumers. According to &lt;a href="http://www.city.academic.gr/staff/profile.asp?Id=30"&gt;Dr Dimitriadis&lt;/a&gt; with whom i teamed up for  this analysis, important things to look at include (list not exhaustive) :&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- spending patterns and situations (when, what and with whom people spend money)&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- tenant mix preference (which products people like to buy and what else / new they want)&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- experience evaluation (safety, availability of stores / products, cleanliness etc)&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- perception of shopping mall communications (what people think about mall ads / messages)&lt;/div&gt;&lt;div style="text-align: justify;"&gt; &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"   style="border-collapse: collapse;font-family:arial,sans-serif;font-size:13px;"&gt; &lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Although there are many behaviors and opinions we could look for, let's identify first what makes a consumer happy. 
To find this out we can analyze all Tweets containing a :-) (smilie) and find which keywords co-exist in these Tweets. Here are some of the results:&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_koDJi0ps7Mw/TNQE7dGQE0I/AAAAAAAAAcQ/pMTSLSX63hk/s1600/Screen+shot+2010-11-05+at+3.19.59+PM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 473px; height: 317px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/TNQE7dGQE0I/AAAAAAAAAcQ/pMTSLSX63hk/s400/Screen+shot+2010-11-05+at+3.19.59+PM.png" alt="" id="BLOGGER_PHOTO_ID_5536055261414822722" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt; &lt;/div&gt;&lt;div style="text-align: justify;"&gt;Apart from some typical words that suggest positive feelings, we also identify that 'friend' and 'birthday' are commonly found with smilies. It was found that consumers that shop for a birthday present or outfit use smilies often. Let's see what happens with tweets  that contain negative feelings :-( (frownie):&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_koDJi0ps7Mw/TNQDl6DbIVI/AAAAAAAAAcI/7-mwxH3qaXQ/s1600/Screen+shot+2010-11-05+at+3.09.09+PM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 466px; height: 316px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/TNQDl6DbIVI/AAAAAAAAAcI/7-mwxH3qaXQ/s400/Screen+shot+2010-11-05+at+3.09.09+PM.png" alt="" id="BLOGGER_PHOTO_ID_5536053791718842706" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt; A frown appears more often when consumers do not find what they were looking for and also when they are at the mall alone.  But what about what people &lt;i&gt;hate&lt;/i&gt; when they visit a mall? 
A similar statistical test is performed to identify words which co-occur with the phrase "I hate it when." These words are:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;-Park&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;-People&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;-Walk&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;By looking at the actual tweets we can identify that many people hate it when:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;1. a mall is very busy&lt;/div&gt;&lt;div style="text-align: justify;"&gt;2. it is difficult to park at the mall &lt;/div&gt;&lt;div style="text-align: justify;"&gt;3. people in front walk at a much slower pace (particularly older people)&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Next we can perform a cluster analysis for these tweets to identify common "thought clusters" of the consumers and their behavior. 
As an example I have used &lt;a href="http://rapid-i.com/"&gt;Rapid-I&lt;/a&gt; to generate these clusters using the following setup:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_koDJi0ps7Mw/TNQFeJLf1nI/AAAAAAAAAcY/B4Ua_TYhp0E/s1600/Screen+shot+2010-11-04+at+7.01.45+PM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 487px; height: 324px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/TNQFeJLf1nI/AAAAAAAAAcY/B4Ua_TYhp0E/s400/Screen+shot+2010-11-04+at+7.01.45+PM.png" alt="" id="BLOGGER_PHOTO_ID_5536055857363539570" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Without getting into technical details (such as tokenization, stop word removal and optimization of the process), executing the stream shown above runs a cluster analysis that identifies common consumer thoughts on their visit to the shopping Mall. 
Some of the clusters found are:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- People who state their intent to buy something&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Consumers who eat a meal and then go to the movies&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- "saw a cute guy / girl looking at me"&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- "I had a good time at the mall"&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;As discussed in previous posts, cluster analysis not only allows us to find common groups of behavior and thoughts but also to identify the frequency with which these behaviors and thoughts appear in consumer Tweets.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;This behavior mining seems endless: in the same manner we can look for mentions of food (for example, see how often 'Chinese', 'Indian' or 'Pizza' appear in Tweets), for buying patterns (which items are discussed more frequently in "i want to buy" Tweets) or for whether users feel happier when they buy gifts for themselves or for others. 
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8608020052354908151?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8608020052354908151/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8608020052354908151' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8608020052354908151'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8608020052354908151'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/11/mining-consumer-behavior-in-tweets.html' title='Mining consumer behavior in Tweets'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/TNQE7dGQE0I/AAAAAAAAAcQ/pMTSLSX63hk/s72-c/Screen+shot+2010-11-05+at+3.19.59+PM.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3689635653524732678</id><published>2010-09-28T08:17:00.051+03:00</published><updated>2010-11-10T19:37:05.075+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Inside a consumer's mind with Text 
Analytics</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/TKxCupDIgRI/AAAAAAAAAaQ/J_uhOM9mQJA/s1600/Screen+shot+2010-10-06+at+12.32.47+PM.png"&gt;&lt;br /&gt;&lt;/a&gt;&lt;div style="text-align: justify;"&gt;So far we have seen &lt;a href="http://lifeanalytics.blogspot.com/search/label/twitter"&gt;several examples&lt;/a&gt; of how Predictive Analytics applied to Social Media and Blog posts can help us suggest better strategies in Marketing, Branding, Sales and PR. This post is a walk-through example of how we can choose a concept, extract what users write about this concept on Twitter, get insights on how consumers think / behave about it and finally group similar consumer thoughts and experiences using Cluster Analysis. A "concept" could be:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Any activity&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- A Brand (e.g. Apple Inc.)&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- A Product / Service &lt;/div&gt;&lt;div style="text-align: justify;"&gt;- A Politician&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;and -almost- anything discussed in user Tweets.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;What we will look at is work that was done specifically to understand what consumers think, like or dislike while visiting a shopping Mall. What do people feel when visiting a Mall? Which words are associated with a positive experience or with the presence of a smiley in Tweets about Malls? 
Using the Twitter API, approximately 36,000 distinct Tweets were collected on consumer experiences from visiting a shopping Mall (the sample below shows an example of a consumer's negative sentiment):&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;a href="http://2.bp.blogspot.com/_koDJi0ps7Mw/TKGCQmuYYXI/AAAAAAAAAZo/4UsvwAwIFco/s1600/Screen+shot+2010-09-28+at+8.35.01+AM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 521px; height: 250px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/TKGCQmuYYXI/AAAAAAAAAZo/4UsvwAwIFco/s400/Screen+shot+2010-09-28+at+8.35.01+AM.png" alt="" id="BLOGGER_PHOTO_ID_5521837839917539698" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;So how can an analyst get into a consumer's mind by analyzing Tweets, and how would this information be useful? To find some answers I teamed up with Marketing Strategist &lt;a href="http://www.city.academic.gr/staff/profile.asp?Id=30"&gt;Dr Nikos Dimitriadis&lt;/a&gt; to help me assess the actionability and interestingness of each extracted insight. Note that we capture thoughts from a biased sample, which means that we cannot make inferences about the general population. 
However, this work can be a great additional tool for finding new ideas and insights for Marketing initiatives -on top of more traditional methods such as focus groups- and also enables us to form several hypotheses as to what is likely to work.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;After a number of pre-processing steps (cleaning captured Tweets of irrelevant information such as links, replacing words with their synonyms, removing frequently occurring words such as 'and', 'to', 'at', 'in' and 'mall', and filtering out very short Tweets), I started performing frequency counts of the words contained in Tweets about Malls :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;a href="http://4.bp.blogspot.com/_koDJi0ps7Mw/TKmymp9OE0I/AAAAAAAAAZw/f-EDy7nbWP4/s1600/Screen+shot+2010-10-04+at+1.46.23+PM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 500px; height: 253px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/TKmymp9OE0I/AAAAAAAAAZw/f-EDy7nbWP4/s400/Screen+shot+2010-10-04+at+1.46.23+PM.png" alt="" id="BLOGGER_PHOTO_ID_5524142795114025794" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;We immediately notice how often LOL and :-) (smiley) appear in Tweets about being at, going to or returning from the Mall, which also gives us examples of consumers being in a specific mood. 
Here is what happens when we look at the most frequently occurring 2-word phrases :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;a href="http://1.bp.blogspot.com/_koDJi0ps7Mw/TKxCupDIgRI/AAAAAAAAAaQ/J_uhOM9mQJA/s1600/Screen+shot+2010-10-06+at+12.32.47+PM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 500px; height: 262px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/TKxCupDIgRI/AAAAAAAAAaQ/J_uhOM9mQJA/s400/Screen+shot+2010-10-06+at+12.32.47+PM.png" alt="" id="BLOGGER_PHOTO_ID_5524864211937165586" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;and 3-word phrases (Note : ive = i've) :&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_koDJi0ps7Mw/TKmzif8rdbI/AAAAAAAAAaA/REsCfc9KxW8/s1600/Screen+shot+2010-10-04+at+1.47.13+PM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 500px; height: 266px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/TKmzif8rdbI/AAAAAAAAAaA/REsCfc9KxW8/s400/Screen+shot+2010-10-04+at+1.47.13+PM.png" alt="" id="BLOGGER_PHOTO_ID_5524143823219553714" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Looking at the two charts we also notice that we frequently find the phrases :&lt;br /&gt;&lt;br /&gt;-   &lt;span style="font-style: italic;"&gt;My best friend&lt;/span&gt; : since consumers Tweet the fact that they are visiting a Mall with their best friend.&lt;br /&gt;&lt;br /&gt;- &lt;span style="font-style: italic;"&gt;My nails done&lt;/span&gt; : appears to be one of the activities most frequently discussed by women.&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;We then 
could look at Words and Phrases that seem interesting for understanding consumer experiences and values when visiting a Mall, such as :&lt;br /&gt;&lt;br /&gt;- Shop&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Shoes&lt;br /&gt;- Parking Lot&lt;br /&gt;- Food Court&lt;br /&gt;- Need / Want&lt;br /&gt;- Walk around&lt;br /&gt;- Made my Day&lt;br /&gt;- Post Picture FaceBook&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;and mine through all these words / phrases to understand what consumers think : What exactly made the day of consumers who used the phrase "Made my Day" in their tweets? How do consumers feel when they visit the Mall with their best friend? When they are alone? Which activities trigger positive feelings? But more importantly : How could one use this information to better understand consumers and Market a concept? More in the next post.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3689635653524732678?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/3689635653524732678/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3689635653524732678' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3689635653524732678'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3689635653524732678'/><link rel='alternate' type='text/html' 
href='http://lifeanalytics.blogspot.com/2010/09/inside-consumers-mind-with-text.html' title='Inside a consumer&apos;s mind with Text Analytics'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_koDJi0ps7Mw/TKGCQmuYYXI/AAAAAAAAAZo/4UsvwAwIFco/s72-c/Screen+shot+2010-09-28+at+8.35.01+AM.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3831503551453977698</id><published>2010-09-22T09:17:00.019+03:00</published><updated>2010-09-24T10:51:02.279+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='social media analytics'/><title type='text'>Social Media Insights from Predictive Analytics</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/TJxPbydnFcI/AAAAAAAAAZY/IIIvVm6DQPw/s1600/Screen+shot+2010-09-24+at+10.08.33+AM.png"&gt;&lt;br /&gt;&lt;/a&gt;&lt;div style="text-align: justify;"&gt;Here is one more example of how Predictive Analytics may help professionals make better decisions. For this post, the titles of a total of 3000 Social Media posts were analyzed to gain -hopefully- important insights for Social Media professionals. To achieve this, Text Mining was used to analyze the text of titles, identify the most important subjects (do posts about &lt;i&gt;Personal Branding &lt;/i&gt;tend to be re-tweeted more than&lt;i&gt; &lt;/i&gt; &lt;i&gt;Social Media Monitoring&lt;/i&gt;?) and also try to prioritize the various areas of Social Media. 
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;We start with the basics. Many Social Media pros read (and write) about various subjects (How-to's, things to avoid, Adoption of Social Media etc). The first goal was to identify the most frequently occurring subject areas in Social Media posts using simple keyword frequencies. The following chart shows this information :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;a href="http://2.bp.blogspot.com/_koDJi0ps7Mw/TJw_6e2m_8I/AAAAAAAAAZI/RZEQsJaGWb0/s1600/Screen+shot+2010-09-24+at+8.55.46+AM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 280px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/TJw_6e2m_8I/AAAAAAAAAZI/RZEQsJaGWb0/s400/Screen+shot+2010-09-24+at+8.55.46+AM.png" alt="" id="BLOGGER_PHOTO_ID_5520357517196459970" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p style="margin: 0px; font: 11px Monaco;"&gt;&lt;br /&gt;&lt;/p&gt;&lt;div style="text-align: justify;"&gt;It is not much of an insight that &lt;span style="font-style: italic;"&gt;Social&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;Media&lt;/span&gt; are on top of the list, or that &lt;span style="font-style: italic;"&gt;Twitter&lt;/span&gt; appeared in posts more frequently than &lt;span style="font-style: italic;"&gt;FaceBook&lt;/span&gt;, but we also see that &lt;span style="font-style: italic;"&gt;Brand&lt;/span&gt; is found more frequently than &lt;span style="font-style: italic;"&gt;Marketing&lt;/span&gt; or &lt;span style="font-style: italic;"&gt;Strategy&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;However, there is a slight problem : The chart shown above is about single words, and measuring how often 2 adjacent words occur in Social Media posts could perhaps be more useful, with &lt;span style="font-style: italic;"&gt;Social 
Media&lt;/span&gt; being omitted (click to enlarge):&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_koDJi0ps7Mw/TJxBI-mSRaI/AAAAAAAAAZQ/mDgL284CFaM/s1600/Screen+shot+2010-09-24+at+8.57.55+AM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 458px; height: 234px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/TJxBI-mSRaI/AAAAAAAAAZQ/mDgL284CFaM/s400/Screen+shot+2010-09-24+at+8.57.55+AM.png" alt="" id="BLOGGER_PHOTO_ID_5520358865747723682" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This shows that most Social Media posts were found to be about How-to's (note that the phrases &lt;span style="font-style: italic;"&gt;How to&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;ways to&lt;/span&gt; have a similar meaning). One could dig deeper to identify the concepts to which How-to's apply (How to monetize, How to be successful, How to avoid mistakes etc).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The next goal was to find words and phrases that are commonly found in posts with a high number of retweets (&gt;40). To get this insight, various Text Mining techniques were used. 
The following features were taken into consideration :&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Author of Post&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Title of Post&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Number of Retweets&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;and here are some of the results :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_koDJi0ps7Mw/TJxPbydnFcI/AAAAAAAAAZY/IIIvVm6DQPw/s1600/Screen+shot+2010-09-24+at+10.08.33+AM.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 443px; height: 206px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/TJxPbydnFcI/AAAAAAAAAZY/IIIvVm6DQPw/s400/Screen+shot+2010-09-24+at+10.08.33+AM.png" alt="" id="BLOGGER_PHOTO_ID_5520374582070416834" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;Words that have a negative weight tend to be found in SM posts with a low number of re-tweets (&lt;span style="font-style: italic;"&gt;write, talk, trust, sentiment&lt;/span&gt;) while &lt;span style="font-style: italic;"&gt;launch &lt;/span&gt;and&lt;span style="font-style: italic;"&gt; America&lt;/span&gt; were commonly found in popular posts. Please notice (the reason will be explained later) that &lt;span style="font-style: italic;"&gt;personal &lt;/span&gt;is one of the hot words, as are &lt;span style="font-style: italic;"&gt;link&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;increase.&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;With this information, an analyst may then identify why such words tend to appear in popular Social Media posts. 
Here are some insights :&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Personal Branding&lt;/span&gt; appears to be a hot area. People are primarily interested in the various ways they can increase their "personal worth" in the Social Media arena.  &lt;/li&gt;&lt;/ul&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;IWOM&lt;/span&gt; : Internet Word Of Mouth is also a concept that frequently occurs in SM posts with many re-tweets.&lt;/li&gt;&lt;/ul&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Positive &amp;amp; Possible&lt;/span&gt; : It appears that posts that discuss various possibilities in a positive way (use of the word &lt;span style="font-style: italic;"&gt;could&lt;/span&gt;) were found to be re-tweeted more (recall the &lt;span style="font-style: italic;"&gt;link&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;increase &lt;/span&gt;keywords discussed previously).&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3831503551453977698?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/3831503551453977698/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3831503551453977698' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3831503551453977698'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3831503551453977698'/><link rel='alternate' type='text/html' 
href='http://lifeanalytics.blogspot.com/2010/09/social-media-insights-from-predictive.html' title='Social Media Insights from Predictive Analytics'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_koDJi0ps7Mw/TJw_6e2m_8I/AAAAAAAAAZI/RZEQsJaGWb0/s72-c/Screen+shot+2010-09-24+at+8.55.46+AM.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-1636430381651109127</id><published>2010-09-05T15:40:00.015+03:00</published><updated>2010-09-09T15:39:01.483+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>"Ways to stop Social Media and Sentiment Mining"</title><content type='html'>&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/TIjIoMW6zKI/AAAAAAAAAY4/r9vQPJeaczo/s1600/clust1.jpg"&gt;&lt;/a&gt;&lt;div style="text-align: justify;"&gt;While looking at my Google Analytics account I came across a keyword search originating from Australia that was different from the keywords that usually drive traffic to my blog. 
The keywords were the following :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;"Ways to stop Social Media and Sentiment Mining"&lt;br /&gt;&lt;br /&gt;&lt;/b&gt;I decided to write this post assuming that the person who submitted this search does not like the fact that machines are mining his/her points of view about people or products, or "understand" to some extent whether he/she feels happy or not.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;One of the many interesting aspects of being a Data Miner is explaining to other people what a Data Miner does (this was also discussed by &lt;a href="http://www.kdnuggets.com/gps.html"&gt;G Piatetsky - Shapiro&lt;/a&gt;  if my memory serves me well). When asked, I sometimes say that I also "analyze emotions as these are expressed on the Web". At first people are very interested, but after a short while the next responses almost always go along these lines :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Are you allowed to do this?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Is this legal?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Have you ever heard about Big Brother?&lt;/div&gt;&lt;div style="text-align: justify;"&gt; &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;It's no big secret that emotions play a major role in our lives and drive our decisions. Many people are starting to realize that companies are already using Information Extraction and Data - Text Mining techniques to extract the things that we discuss about various products or people and better understand our behavior. 
I believe that the most important thing in this area is not just Sentiment Mining, or in other words whether we feel positive or negative about a Person, Product or Brand, but the ability of Analytics to extract our core values and analyze our emotions.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;img src="http://1.bp.blogspot.com/_koDJi0ps7Mw/TIjIoMW6zKI/AAAAAAAAAY4/r9vQPJeaczo/s400/clust1.jpg" alt="" id="BLOGGER_PHOTO_ID_5514878336552848546" style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 488px; height: 258px;" border="0" /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt; &lt;/div&gt;&lt;div style="text-align: justify;"&gt;When applying Text Mining or a mixture of Data and Text Mining methods to -for example- Twitter, we are able to see more than just the sentiment for a product. We can identify a user who is alone, feeling bored and watching television. We can form several hypotheses on whether users who have survived Cancer express more positive thoughts than other user groups (see &lt;a href="http://lifeanalytics.blogspot.com/2009/08/surviving-cancer-happiness-and-twitter.html"&gt;Surviving Cancer, Happiness and Twitter&lt;/a&gt;), find what sort of lifestyle makes a CEO happy or whether a specific profession increases your chances of being single (see &lt;a href="http://lifeanalytics.blogspot.com/2009/05/twitter-analytics-cluster-analysis.html"&gt;Twitter Analytics : Cluster Analysis reveals similar users&lt;/a&gt;). 
Cluster Analysis can also &lt;a href="http://lifeanalytics.blogspot.com/2009/01/clustering-thoughts-of-twitter-users.html"&gt;identify core values&lt;/a&gt; of people and what they want or what they are trying to avoid.&lt;br /&gt;&lt;br /&gt;Some of the examples discussed above have a clear business value while others don't. The important fact however is that analysts now have data to analyze emotions and our responses to events happening in our lives on a much deeper level. This information has not been available on this scale before.&lt;br /&gt;&lt;br /&gt;Should we stop extracting these insights and how dangerous can these insights become? &lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-1636430381651109127?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/1636430381651109127/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=1636430381651109127' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/1636430381651109127'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/1636430381651109127'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/09/ways-to-stop-social-media-and-sentiment.html' title='&quot;Ways to stop Social Media and Sentiment Mining&quot;'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' 
src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_koDJi0ps7Mw/TIjIoMW6zKI/AAAAAAAAAY4/r9vQPJeaczo/s72-c/clust1.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8830628950960419884</id><published>2010-08-31T11:52:00.004+03:00</published><updated>2010-09-01T13:21:23.304+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='banking'/><category scheme='http://www.blogger.com/atom/ns#' term='credit risk'/><title type='text'>Banks, Risk Disclosure and Text Analytics</title><content type='html'>&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_koDJi0ps7Mw/THzD8Yo-iXI/AAAAAAAAAYw/Xh_KzHNdVSA/s1600/Screen+shot+2010-08-28+at+3.47.08+PM.png"&gt;&lt;/a&gt;&lt;div style="text-align: justify;"&gt;A UK-based MSc student at Kingston Business School, Christos Gkemitzis, had an idea for his MSc project which immediately caught my attention : Apply Text Analytics methods to annual reports published by Banks, extract metrics on how these Banks handle their Credit and Interest rate risk as explained in these reports, and then test several hypotheses (do Banks with a higher risk profile disclose a larger amount of risk-related information than those with a lower risk profile?) 
and also identify any correlations :&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;ul&gt;&lt;li&gt;between the size of the Bank and volume of risk disclosures&lt;/li&gt;&lt;li&gt;between the risk of the Bank's profile and volume of risk disclosures&lt;/li&gt;&lt;li&gt;between the profitability of the Bank and volume of risk disclosures&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;div style="text-align: justify; "&gt;Essentially the problem is to -automatically- identify mentions of credit risk but in a specific way :&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;1) Identify whether sentences mentioning risk refer to the present, past or future&lt;/div&gt;&lt;div style="text-align: justify; "&gt;2) Identify positive, negative or neutral sentiment mentions about Credit Risk&lt;/div&gt;&lt;div style="text-align: justify; "&gt;3) Identify qualitative versus quantitative information regarding the Bank's Credit Risk&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;For example, consider the following text which is part of an actual Bank report :&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;div style="text-align: justify; "&gt;&lt;i&gt;"A substantial increase of credit risk and provisions is also expected, as from 2009 on, the &lt;/i&gt;&lt;i&gt;economy will be entering a period of low growth."&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;The sentence above contains qualitative information ("substantial increase of credit risk and provisions") and negative Sentiment referring to the future ("also 
expected" and "will be entering a period of...").&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;while the following sentence :&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;div style="text-align: justify; "&gt;&lt;i&gt;"The Group’s ongoing efforts to manage efficiently credit risk led the level of loan losses to 3.3% in December 2008"&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;contains quantitative information ("level of loan losses to 3.3%") with a positive sentiment about Credit Risk handling in the past.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;After receiving some PDF samples of Bank reports from Christos, I began feeding these reports to the &lt;a href="http://gate.ac.uk/"&gt;GATE&lt;/a&gt; Text Analysis toolkit in order to assess the feasibility of such an analysis. After some tutorials through Skype, Christos -who had no prior knowledge of programming- started using the toolkit on his own in a very short amount of time. 
Here is a snapshot of GATE in action for the analysis :&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;span class="Apple-style-span" style="-webkit-text-decorations-in-effect: underline; "&gt;&lt;img src="http://2.bp.blogspot.com/_koDJi0ps7Mw/THzD8Yo-iXI/AAAAAAAAAYw/Xh_KzHNdVSA/s400/Screen+shot+2010-08-28+at+3.47.08+PM.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5511495486168533362" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 224px; " /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;span class="Apple-style-span" style="-webkit-text-decorations-in-effect: underline; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center; "&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;span class="Apple-style-span" style="-webkit-text-decorations-in-effect: underline; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;span class="Apple-style-span" style="-webkit-text-decorations-in-effect: underline; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;The snapshot shows how GATE correctly identified a part of text that communicates a negative sentiment 
for Credit Risk in a qualitative manner for the future (notice that "QualitativeBadNewsFuture" is checked).&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0); "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;After running GATE on many documents, Christos had the necessary metrics (=how many mentions of different Risk types exist in a document) to test his hypotheses using a 2-tailed Wilcoxon test. To identify correlations, Spearman's rank correlation coefficient was also used.&lt;/div&gt;&lt;div style="text-align: justify; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify; "&gt;Since this work has not been submitted yet, I am not permitted to post the findings of this research. The post does however show another application of Text Analytics and the many sources of unstructured information that could be mined for knowledge.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8830628950960419884?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8830628950960419884/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8830628950960419884' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8830628950960419884'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8830628950960419884'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/08/banks-risk-disclosure-and-text_31.html' title='Banks, Risk Disclosure and Text 
Analytics'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_koDJi0ps7Mw/THzD8Yo-iXI/AAAAAAAAAYw/Xh_KzHNdVSA/s72-c/Screen+shot+2010-08-28+at+3.47.08+PM.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-4211992821085702278</id><published>2010-08-09T08:17:00.015+03:00</published><updated>2010-08-10T08:59:29.776+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='kaggle'/><title type='text'>Interview with Kaggle CEO Anthony GoldBloom</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;For those that haven't heard of &lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.kaggle.com/"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Kaggle&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; before, Kaggle is a &lt;/span&gt;&lt;/span&gt;&lt;a href="http://kaggle.com/About-Us/ourteam"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;team&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" 
style="font-size: medium;"&gt; of people that provide the functionality and support to host Data Mining contests. Here is how it works: suppose that you are working for a Telco and wish to implement a new Churn prediction model. Rather than running this project in-house, you submit your data to Kaggle. What happens next is that, hopefully, many statisticians around the world will each analyze your dataset, produce a model and then submit their prediction model(s) to Kaggle. The best model (and hence its creator) wins the prize, which is offered by the Telco. Here is the interview with Kaggle CEO, Anthony GoldBloom:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;img src="http://4.bp.blogspot.com/_koDJi0ps7Mw/TF-WCU89XaI/AAAAAAAAAYM/5xqwzogXb5Q/s400/Screen+shot+2010-08-09+at+8.43.15+AM.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5503282236398329250" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 212px; height: 154px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;- What is Kaggle and what new ideas does it bring to the predictive analytics arena?&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span 
class="Apple-style-span" style="border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;Kaggle aims to help companies and researchers make predictions more precise by providing a platform for data prediction competitions. Competitions turn out to be a great way to get the most out of a dataset. This is because there are infinitely many approaches to any data modeling problem. By opening up a data prediction problem to a wide audience, a competition makes it possible to get to the frontier of what is possible given a dataset's inherent noise and richness. &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;- Can you tell us more about "real-time science" and how it could help Research globally?&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;   &lt;/span&gt; &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"   style="  border-collapse: collapse; font-family:arial, sans-serif;font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;Data modeling competitions can facilitate real-time science. 
Consider the recent announcement about the discovery of genetic markers that correlate with extreme longevity.  Work on the study began in 1995, with results published in 2010.  Had the study been run as a data modeling competition, the results would have been generated in real time and insights available much sooner (and with a higher level of precision).&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-size:medium;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;Data modeling competitions also benchmark, in real time, new techniques against old.  
A technique that performs well in competitions can prove its mettle long before any paper can be published, helping the science to progress more quickly.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-size:medium;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;Competitions also help to avoid situations where valuable techniques are overlooked by the scientific establishment.  This aspect of the case for competitions  is neatly illustrated by Ruslan Salakhutdinov, now a postdoctoral fellow at the Massachusetts Institute of Technology, who had a new algorithm rejected by the NIPS conference.  According to Ruslan, the reviewer ‘basically said “it’s junk and I am very confident it’s junk”’.  
It later turned out that his algorithm was good enough to make him an early leader in the Netflix Prize (he called his Netflix Prize team NIPS_reject).&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-size:medium;"&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:-webkit-xxx-large;"&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;- How can companies benefit through Kaggle?&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"   style="  border-collapse: collapse; font-family:arial, sans-serif;font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-size:-webkit-xxx-large;"&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;Companies can use Kaggle to gain an advantage over their competitors. Consider a bank that wants to improve the algorithms that vet loan applicants. If a bank can develop a more effective algorithm they will have fewer defaults and can charge lower interest rates than their competitors. 
Kaggle has proven to be an effective way to improve existing models very quickly.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"   style="  border-collapse: collapse; font-family:arial, sans-serif;font-size:13px;"&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Competitions are also really useful to companies that want to develop new products and capabilities. Consider a hedge fund that wants to be able to generate long-range weather forecasts in key agriculture regions. They can attempt to hire a weather forecasting expert or they can use Kaggle to throw the problem open to a wide audience. Using Kaggle they can be sure they'll get great results very quickly.&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;span class="Apple-style-span"  style="border-collapse: separate;   font-size:medium;"&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;- How is the best model selected?&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span class="Apple-style-span" style="border-collapse: separate;  "&gt;&lt;span class="Apple-style-span" style=" border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;The competition host will typically split their dataset into two parts - a training dataset and a test dataset. The training dataset includes all explanatory variables as well as the dependent variable (or the answer). The test dataset also includes all the explanatory variables but the dependent variable (or answer) is withheld. 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span class="Apple-style-span" style="border-collapse: separate;  "&gt;&lt;span class="Apple-style-span" style=" border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;Participants train their models on the training dataset. They then apply their models to generate predictions on the test dataset. Those predictions are then scored on-the-fly against the actual answers (using one of several evaluation methods). Once the competition deadline passes, the team that generates the most accurate predictions gives the winning methodology to the competition host in exchange for the prize money.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-4211992821085702278?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/4211992821085702278/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=4211992821085702278' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4211992821085702278'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4211992821085702278'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/08/interview-with-kaggle-ceo-anthony.html' title='Interview with Kaggle CEO Anthony GoldBloom'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/TF-WCU89XaI/AAAAAAAAAYM/5xqwzogXb5Q/s72-c/Screen+shot+2010-08-09+at+8.43.15+AM.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3401358794283689678</id><published>2010-07-27T11:00:00.015+03:00</published><updated>2010-07-30T16:59:09.899+03:00</updated><title type='text'>Summarization of Blog posts with "Web Pulse" Reports</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;In the past couple of months I was looking for a way to best capture and understand what happens on the Web, and more specifically what people write in blogs, in terms of sentiment and emerging trends. The first thing that I came up with was the idea of creating a "Web Pulse" Report: a way to summarize what people are discussing on the web. The implementation was not as complex as I expected, and I was pleased to find that the knowledge that can be extracted is, to say the least, very useful and interesting. 
Before looking at an actual Report example, here are the elements that comprise it:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;1) Concept Frequencies: Identifies the concepts that bloggers most frequently write about&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;2) Global Co-occurrence Matrix: Identifies the most frequent word &lt;/span&gt;&lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Bigram"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;bigrams&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;3) Keyword Associations for Concepts: Which keywords tend to co-occur with a specific concept?&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;4) Most frequent &lt;/span&gt;&lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/N-gram"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;n-grams&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt; associated with a given Concept (where n=2,3,4,5) &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;As an example, we will identify what bloggers were discussing in Greek blogs on July 27th, 2010, looking specifically at the Blog titles of more than 300 Greek blogs. 
&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;  &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Here are the concept frequencies found (in descending order) on that date: &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;[&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;Turkey]=178&lt;br /&gt;[Politics]=128&lt;br /&gt;[Economy]=101&lt;br /&gt;[International Monetary Fund - IMF]=62&lt;br /&gt;[Banking]=61&lt;br /&gt;[Public Sector]=50&lt;br /&gt;[Negative Characterizations]=30&lt;br /&gt;[Political Parties]=29&lt;br /&gt;[George Papandreou]=29 (=Prime Minister of Greece)&lt;br /&gt;[Loans]=22&lt;br /&gt;[Society]=20&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"   style="  border-collapse: collapse; font-family:arial, sans-serif;font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style=" border-collapse: collapse; font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;T&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;he first interesting fact is that "Turkey" appears at the top of the list for Greek blog articles, even though Greek mass media did not place as much weight on the latest Turkish behavior in the Aegean Sea on that day. The second concept is Politics, followed by the Economy.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;Here is the top part of the Global Co-occurrence Matrix found (in Greek):&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-family:arial, 
sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;ΣΚΑΦΟΣ,ΤΟΥΡΚΙΚΟ : 25&lt;br /&gt;ΡΕΙΣ,ΤΟΥΡΚΙΚΟ : 25&lt;br /&gt;ΠΙΡΙ,ΤΟΥΡΚΙΚΟ : 25&lt;br /&gt;ΡΕΙΣ,ΣΚΑΦΟΣ : 24&lt;br /&gt;ΕΛΛΑΔΑ,ΧΩΡΑ : 23&lt;br /&gt;ΥΠΟΥΡΓΕΙΟΥ,ΟΙΚΟΝΟΜΙΚΩΝ : 22&lt;br /&gt;ΡΕΙΣ,ΕΡΕΥΝΗΤΙΚΟ : 22&lt;br /&gt;ΠΙΡΙ,ΕΡΕΥΝΗΤΙΚΟ : 22&lt;br /&gt;ΡΕΙΣ,ΠΙΡΙ : 21&lt;br /&gt;ΑΝΑΜΕΝΕΤΑΙ,ΣΥΜΦΩΝΑ : 21&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;ΠΟΛΙΤΙΚΗ,ΧΩΡΑΣ : 19&lt;br /&gt;ΟΙΚΟΝΟΜΙΑ,ΕΛΛΗΝΙΚΗ : 19&lt;br /&gt;ΜΟΝΑΔΩΝ,ΔΕΗ : 19&lt;br /&gt;ΚΥΒΕΡΝΗΣΗ,ΠΑΠΑΝΔΡΕΟΥ : 19&lt;br /&gt;ΗΓΕΣΙΑ,ΠΟΛΙΤΙΚΗ : 19&lt;br /&gt;ΕΡΕΥΝΗΤΙΚΟ,ΤΟΥΡΚΙΚΟ : 19&lt;br /&gt;ΥΠΟΧΡΕΩΣΕΙΣ,ΜΝΗΜΟΝΙΟ : 16&lt;br /&gt;ΧΩΡΑ,ΜΝΗΜΟΝΙΟ : 15&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="  border-collapse: collapse; font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;The four most frequent keyword associations are, again, about the latest problems of Greece with Turkey and more specifically with the fact that a Turkish boat named "Piri Reis" (in Greek : &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"   style="  border-collapse: collapse; 
font-family:georgia;font-size:medium;"&gt;ΠΙΡΙ &lt;/span&gt;&lt;span class="Apple-style-span"   style="  border-collapse: collapse; font-family:georgia;font-size:medium;"&gt;ΡΕΙΣ&lt;/span&gt;&lt;span class="Apple-style-span"   style="  border-collapse: collapse; font-family:georgia;font-size:medium;"&gt;)&lt;/span&gt;&lt;span class="Apple-style-span"   style="  border-collapse: collapse; font-family:georgia;font-size:medium;"&gt; has repeatedly been entering a Greek part of the Aegean Sea without permission.&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;Let's look at the Association frequencies found between specific Concepts. The following is an example of concepts associated with "Giorgos Papandreou" (Greek Prime Minister):&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;span class="Apple-style-span"  style=" border-collapse: collapse; 
font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;International Monetary Fund - IMF=32&lt;br /&gt;Politics=28&lt;br /&gt;Political Reform=6&lt;br /&gt;Nea Dimokratia=3 (=Opposition Political Party)&lt;br /&gt;Politics, International Monetary Fund, Loans, Political Parties=2&lt;br /&gt;Negative Sentiment=2&lt;br /&gt;Public Sector=2&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style=" ;font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Uncertainty=2&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style=" ;font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style=" ;font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;It appears that George Papandreou is frequently mentioned where the IMF is involved, and that a political reform might be on its way.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  
style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style=" ;font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style=" ;font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;The fourth element of the report shows phrases that are commonly found in Blog posts. Since many blogs tend to use the same titles, this functionality makes it possible to see how information spreads from one blog to another.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style=" ;font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;The report can be enhanced in various ways. For example, by tokenizing Blog posts into sentences, I have added the option of performing chi-square tests to identify co-occurrences in a more concise way, rather than relying strictly on absolute term frequencies. 
Through different types of analysis and knowledge representation we are able to look at our subject(s) of interest in different ways, which hopefully leads us to better insights. &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="  border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style=" ;font-family:arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;From my experience so far, this type of report is a simple but efficient way to summarize the content of Blogs and also to show what is 'hot' at the moment and why. 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3401358794283689678?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/3401358794283689678/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3401358794283689678' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3401358794283689678'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3401358794283689678'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/07/summarization-of-blog-posts-with-web.html' title='Summarization of Blog posts with &quot;Web Pulse&quot; Reports'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-6348708079234137408</id><published>2010-05-26T21:45:00.025+03:00</published><updated>2010-06-29T10:53:26.653+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='concept trending'/><title type='text'>Concept Trending : A Glimpse into the future?</title><content type='html'>&lt;div style="text-align: justify;"&gt;In the previous post some ideas were presented on the trends of Text Analytics. 
Analyzing and extracting knowledge from text is hard, whether it involves Sentiment Analysis, Text Classification, Cluster Analysis or Information Extraction.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;A particularly interesting application of Text Analytics is the identification of trends for specific concepts. In contrast with simple keyword trending, this type of trending attempts to disambiguate keywords according to their context and uses co-reference resolution to identify the subjects to which the sentiment relates. &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;To better understand concept trending, let's look at an example: suppose that one wishes to identify the trend of negative characterizations (and even swear words) that exist on the Greek web. The first step would be to collect the information from various blogs and forums whenever a negative keyword is found. 
A text analysis toolkit could then provide the means of identifying the subject(s) of negative characterizations on the Greek web such as Politicians, the Economy or the International Monetary Fund which recently came to the rescue.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;From a &lt;a href="http://lifeanalytics.blogspot.com/2009/12/building-knowledge-hub.html"&gt;post&lt;/a&gt; dated December 28th, 2009  :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="  color: rgb(51, 51, 51); "&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Over the past month there has been a considerable amount of increase in negative economy sentiment, crime-related incidents and/or terms that communicate future social instability and uneasiness&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;."&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Although it was intentionally left unstated, the country which the article addressed was Greece, and the increase in negative sentiment was found to start at the beginning of December 2009. 
This is a photo of a Greek newspaper taken on February 4, 2010 &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;img src="http://1.bp.blogspot.com/_koDJi0ps7Mw/TCj_lYu1l9I/AAAAAAAAAX0/vKf_-JhBJj8/s400/IMG_0376.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5487917163710093266" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 300px; height: 400px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The headline shown is about the "Fear of Social Explosion". On May 6th 2010, after clashes in the center of Athens, mentions of "Social Explosion" in Greece started appearing on the Web. The following Google search uses a timeline for "Social Unrest". The increase in mentions appears to start in February 2010.  
&lt;/div&gt;&lt;div style="text-align: justify;"&gt; &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;span class="Apple-style-span"  style="color:#000000;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;img src="http://3.bp.blogspot.com/_koDJi0ps7Mw/TCmBYEBzSBI/AAAAAAAAAYE/EmfBfW55ZjY/s400/Screen+shot+2010-06-29+at+8.14.00+AM.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5488059871325800466" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 233px; " /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;span class="Apple-style-span"  style="color:#000000;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Although concept trending poses significant challenges, it is a process which in my experience has proven itself many times. A recent &lt;a href="http://www.newscientist.com/article/mg20627655.800-blogs-and-tweets-could-predict-the-future.html"&gt;article&lt;/a&gt; in New Scientist suggests that by capturing the sentiment of crowds we are able to predict the moves of the S&amp;amp;P 500, or that by looking at keyword searches such as "job search engine" we can predict coming changes in the US unemployment rate.  
&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span"   style="  color: rgb(51, 51, 51); font-family:Verdana;font-size:12px;"&gt;&lt;span style="font-style: italic; "&gt;&lt;/span&gt;&lt;/span&gt;   &lt;/div&gt;&lt;div style="text-align: justify;"&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-6348708079234137408?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/6348708079234137408/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=6348708079234137408' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6348708079234137408'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6348708079234137408'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/05/concept-trending-glimpse-into-future.html' title='Concept Trending : A Glimpse into the future?'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_koDJi0ps7Mw/TCj_lYu1l9I/AAAAAAAAAX0/vKf_-JhBJj8/s72-c/IMG_0376.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-4675268633679860273</id><published>2010-05-17T17:58:00.019+03:00</published><updated>2010-05-19T13:51:50.087+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><title type='text'>The future and trends of Text Analytics</title><content type='html'>&lt;div style="text-align: justify;"&gt;I recently attended a &lt;a href="http://gate.ac.uk/family/"&gt;GATE&lt;/a&gt; seminar at the University of Sheffield. Having used &lt;a href="http://gate.ac.uk/family/"&gt;GATE&lt;/a&gt; for quite some time now, I was happy to see that the &lt;a href="http://gate.ac.uk/people/"&gt;GATE team&lt;/a&gt; is committed to developing the GATE Text Analysis Workbench by constantly adding more functionality.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Although many of the participants were PhD students, I was also happy to see people from companies that now wish to leverage the hidden knowledge that exists in unstructured text.  Whether it was analysis of the text of Patents, intelligent search on the text of Photo Captions for a large News Agency or understanding what a customer wants, Text Analytics are becoming an important tool for making better decisions.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;I also had the opportunity to speak with several people about the future of Text Analytics. 
What are we likely to see happening in the coming years in Information Extraction and Text Analytics?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/S_O5KGTlvWI/AAAAAAAAAXs/JYDk8ru2en4/s1600/econ_gate.jpg"&gt;&lt;img src="http://3.bp.blogspot.com/_koDJi0ps7Mw/S_O5KGTlvWI/AAAAAAAAAXs/JYDk8ru2en4/s400/econ_gate.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5472921555327892834" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 216px; " /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;First we have to understand how Text Analytics deliver results. In order for a computer to 'understand' unstructured text, it should be 'taught' that the word 'Dollar' is a currency of a country that is called 'US' and also that US, United States, USA and U.S.A are the same concept. This means that hundreds of thousands of concepts and synonyms have to be specified so that a computer identifies them in unstructured text. This process is called &lt;i&gt;Text Annotation.&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The Gold Standard of Text Annotation is annotation done by humans : A computer sifts through the text of a web page, annotates it with concepts and then these annotations are checked against annotations made by humans on the same text to assess the accuracy with which a computer 'understands' this text and the concepts and entities that exist in it.   
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;So what does the future hold? First of all, as unstructured text becomes more available there will be a greater need for 'annotation farms' : Groups of people who will be manually annotating free text, identifying an ever-growing number of Companies, Managers, Politician names, or anything else that has to be 'taught' to a computer. Note that Annotation Farms already exist but the need for this service will become greater. &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The second trend in Text Analytics could be something equivalent to what we have seen happening with Netflix. Suppose that you own a company that produces Brand 'X' and you wish to track the reputation of your product online. You would then submit a sample of your product's mentions to various companies that analyze text and have them compete against each other in terms of -for example- &lt;a href="http://en.wikipedia.org/wiki/Precision_and_recall"&gt;Precision and Recall&lt;/a&gt;.  The one that &lt;i&gt;consistently&lt;/i&gt; produces the best metrics (whether Precision - Recall, Kappa statistic or F-Measure) will also get the job.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;A third trend could be the development of text analytics for specific concepts : Sentiment Analysis and Named Entity recognition are hard work if one wants to produce sound and accurate results. 
So it could be probable that Text Analytics experts will choose a specific concept -For example reputation of Banks- and then work in the analysis of this -very specific- concept so that they achieve better metrics.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-4675268633679860273?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/4675268633679860273/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=4675268633679860273' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4675268633679860273'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4675268633679860273'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/05/future-and-trends-of-text-analytics.html' title='The future and trends of Text Analytics'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_koDJi0ps7Mw/S_O5KGTlvWI/AAAAAAAAAXs/JYDk8ru2en4/s72-c/econ_gate.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8681624950140262153</id><published>2010-03-23T23:33:00.019+02:00</published><updated>2010-04-06T16:02:10.671+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text 
mining'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='politics'/><title type='text'>Predictive Analytics and Politics - Part 2</title><content type='html'>&lt;div style="text-align: justify;"&gt;In the &lt;a href="http://lifeanalytics.blogspot.com/2010/03/predictive-analytics-and-politics-part.html"&gt;previous post&lt;/a&gt; we saw an example of analyzing messages sent from citizens regarding a new taxation plan. We identified some correlations between keywords and concepts but there are more ways to gain knowledge from such unstructured information.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;By using Cluster Analysis we can extract groups of similar concepts among thousands of comments written by citizens and also establish an order among them. Let's assume that Cluster Analysis reveals the following clusters (or similar concepts) within submitted messages :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;-  battling tax fraud&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- requests for a fair tax plan&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- requests for less taxation for large families&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- various incentives for citizens&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Our problem is finding the order of importance that people place on the various concept categories shown above : Is battling tax fraud considered more important (=discussed more frequently by citizens) than requesting a fair tax plan? 
How about taxation for larger families?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;A cluster analysis can reveal to us the size of each cluster and -as a consequence- how important each cluster is :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/S6lNf9ESS7I/AAAAAAAAAXc/gt3jCdQjLWc/s1600-h/Screen+shot+2010-03-24+at+1.19.27+AM.png"&gt;&lt;img src="http://1.bp.blogspot.com/_koDJi0ps7Mw/S6lNf9ESS7I/AAAAAAAAAXc/gt3jCdQjLWc/s400/Screen+shot+2010-03-24+at+1.19.27+AM.png" alt="" id="BLOGGER_PHOTO_ID_5451974035272518578" style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 236px;" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;We make the assumption that in the text representation shown above Cluster 5 (which contains 329 citizen messages) is about requests for a fair tax plan and Cluster 10 contains messages with requests that tax fraud should be minimized.  It appears that significantly fewer people are concerned with battling fraudulent activity; instead they request -more immediate- benefits through a fair tax plan.  &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Collecting and analyzing information found in blogs and forum entries is another area of analysis that could prove very interesting. 
Let's see an example with the Political / Social / Economic situation in Greece : The goal is to identify and extract trends and co-occurrences of key concepts from blog titles and forum posts such as :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt; - Names of major Political parties&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Names of Politicians &lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Economy (words/phrases such as "austerity plan") &lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Negative characterizations &lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Company Names&lt;/div&gt;&lt;div style="text-align: justify;"&gt;...etc&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Several applications can emerge from this kind of data. We could track specific concepts through time and see their trends. We can also identify which concepts are discussed together. As an example we could identify the reasons why Giorgos Papandreou (PM of Greece) is characterized negatively in blog posts. (= what other concepts are found in Blog posts containing keywords 'Giorgos Papandreou'  AND Bad Characterizations?)  
:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;(Note : PASOK = Governmental Political Party )&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Politics = 120&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Economy=72&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Economy, Politics=40&lt;/div&gt;&lt;div style="text-align: justify;"&gt;PASOK=24&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Politics, PASOK, Referendum=8&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Economy, Politics,PASOK,Referendum, Immigrants=8&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Economy, Politics, Society=8&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Society, PASOK=4&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;In other words : Giorgos Papandreou is criticized mainly for his Political decisions and the Economy followed by criticism on PASOK. 
Negative sentiment also exists because of the fact that a percentage of Greek citizens require that a referendum should take place concerning the latest decision of the Greek government to give to a large proportion of Immigrants the Greek citizenship.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8681624950140262153?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8681624950140262153/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8681624950140262153' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8681624950140262153'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8681624950140262153'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/03/predictive-analytics-and-politics-part_23.html' title='Predictive Analytics and Politics - Part 2'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_koDJi0ps7Mw/S6lNf9ESS7I/AAAAAAAAAXc/gt3jCdQjLWc/s72-c/Screen+shot+2010-03-24+at+1.19.27+AM.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7905889618323092315</id><published>2010-03-12T16:16:00.032+02:00</published><updated>2010-03-17T18:46:26.271+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='association rule learning'/><category scheme='http://www.blogger.com/atom/ns#' term='politics'/><title type='text'>Predictive Analytics and Politics - Part 1</title><content type='html'>&lt;div style="text-align: justify;"&gt;One of the most interesting applications of Data/Text Mining and Information Extraction is Politics. I started collecting information from various blogs, websites and forums and applying Information Extraction and Data/Text Mining techniques to extract potentially useful knowledge in this area. By combining different pieces of information one could come up with trends that may tell us what lies ahead of us.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The latest developments in Greece are more or less known to most of people that read International News. The situation is difficult and the voice of citizens in various blogs and forums could give us the sentiment of Greek Web Users. 
For example :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Which are the most frequently occurring words?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- Which are the most frequently occurring thoughts?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- What are the things that have to be changed by Greek politicians?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;To answer these questions I have started collecting information found on the top 120 Greek blogs, the OpenGov website (a state-run website where Greek citizens express their opinions) and a couple more Greek sites with economic content. For blogs and forums a Java program scans every 20 minutes for new information :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;img src="http://3.bp.blogspot.com/_koDJi0ps7Mw/S5wP2_4pPJI/AAAAAAAAAW0/8wZzcHYWXuU/s400/Screen+shot+2010-03-14+at+12.15.43+AM.png" alt="" id="BLOGGER_PHOTO_ID_5448247086747827346" style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 196px;" border="0" /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;This information is then sent to an annotation engine which analyzes the textual content. Once the text is analyzed we can -for example- produce a keyword vector that we can later use to understand what citizens are saying on the Web. We can then find answers to many interesting questions such as :&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- With which words is Mr George Papandreou (PM of Greece) associated? 
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- When there are some very negative words (such as swearing) what other words are found in the same text?&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;- What does keyword trending tell us? (For example, we identify an increasing number of swear words in citizen posts)&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;First let's see some examples regarding the OpenGov website where thousands of citizens have expressed their opinions on the tax policy of the Greek state. The following chart shows us a number of pairwise correlations between words written in these comments  :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238);"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0);"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/S5yoJWr8WaI/AAAAAAAAAXE/Zma8ocMcN3A/s1600-h/Screen+shot+2010-03-14+at+11.07.10+AM.png"&gt;&lt;img src="http://4.bp.blogspot.com/_koDJi0ps7Mw/S5yoJWr8WaI/AAAAAAAAAXE/Zma8ocMcN3A/s400/Screen+shot+2010-03-14+at+11.07.10+AM.png" alt="" id="BLOGGER_PHOTO_ID_5448414527873636770" style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 352px; height: 290px;" border="0" /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Under the red rectangle appear two 
words (dikigoros,iatros) which in Greek mean "Lawyer" and "Medical Doctor" respectively. This essentially tells us that these two professions are frequently mentioned together in citizen discussions. Looking closely at these messages, one finds that professionals in these two sectors are said to avoid taxes by not issuing receipts.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Next we could use association rule learning to look for some more -potentially interesting - rules :&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/S50SXkOiK3I/AAAAAAAAAXM/8aLCoKt9SY8/s1600-h/Screen+shot+2010-03-14+at+6.40.04+PM.png"&gt;&lt;img src="http://4.bp.blogspot.com/_koDJi0ps7Mw/S50SXkOiK3I/AAAAAAAAAXM/8aLCoKt9SY8/s400/Screen+shot+2010-03-14+at+6.40.04+PM.png" alt="" id="BLOGGER_PHOTO_ID_5448531320259095410" style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 576px; height: 171px;" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The highlighted rule, although one of low support, could prove interesting : A subset of citizens are requesting that freelancers and the self-employed should be more closely monitored for tax fraud.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Apart from rule learning, it is interesting to identify the proportion of the total dataset for which each rule holds. 
That also gives us a sense of the order in which different ideas and thoughts exist in the minds of citizens.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;In the next post : What does the Voice of the Citizen tell us in Blogs and forums? &lt;/div&gt;&lt;div style="text-align: justify;"&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7905889618323092315?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7905889618323092315/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7905889618323092315' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7905889618323092315'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7905889618323092315'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/03/predictive-analytics-and-politics-part.html' title='Predictive Analytics and Politics - Part 1'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_koDJi0ps7Mw/S5wP2_4pPJI/AAAAAAAAAW0/8wZzcHYWXuU/s72-c/Screen+shot+2010-03-14+at+12.15.43+AM.png' height='72' 
width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3475949956575348425</id><published>2010-01-04T19:55:00.026+02:00</published><updated>2010-01-05T13:04:15.652+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><category scheme='http://www.blogger.com/atom/ns#' term='novelty detection'/><title type='text'>Detecting Novelty in Twitter posts</title><content type='html'>&lt;div style="text-align: justify;"&gt;A question one could come up with is the following :  How can we easily identify and extract novel information from the web? Although we could apply this "novelty detection" to many areas, I would like to discuss for now the idea of semi-automatically identifying novelty among posts on Twitter.&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Let's take for example the &lt;a href="http://www.apple.com/iphone/"&gt;IPhone&lt;/a&gt;. Thousands of Tweets are generated every day regarding the Apple IPhone. These tweets mainly discuss :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Which new apps are available / used / liked.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;New accessories (cases, chargers, etc)&lt;/li&gt;&lt;li&gt;User Experiences and sentiment (such as blaming IPhone's short battery life)&lt;/li&gt;&lt;li&gt;Pros and cons of the IPhone vs other similar devices&lt;/li&gt;&lt;li&gt;Upgrading / hacking etc.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;So the problem is : How can we identify novel information among thousands of tweets? Some would argue that we should first define what "novelty" is, such as finding a new application or a new accessory for the famous mobile device. Others might argue that novelty is a customer idea about the IPhone that not many people have thought of and which &lt;a href="http://www.apple.com/"&gt;Apple&lt;/a&gt; would be interested in identifying among thousands of Tweets. 
As an example consider the following Tweets :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/S0L9XPNnATI/AAAAAAAAAWM/C9DjweBb8hg/s1600-h/novelty2.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 326px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/S0L9XPNnATI/AAAAAAAAAWM/C9DjweBb8hg/s400/novelty2.JPG" alt="" id="BLOGGER_PHOTO_ID_5423175476969931058" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;A subset of users experiences problems with the automatic orientation of the iPhone : This subset of iPhone users is perhaps very small, but identifying these Tweets could give Apple some ideas to work on.&lt;br /&gt;&lt;br /&gt;Here is another subset of Tweets that talk about the charger's cable length :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/S0L-74hhpII/AAAAAAAAAWc/77Kx9dlekfE/s1600-h/novelty1.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 271px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/S0L-74hhpII/AAAAAAAAAWc/77Kx9dlekfE/s400/novelty1.JPG" alt="" id="BLOGGER_PHOTO_ID_5423177206046237826" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In the example shown above notice that using just "iphone cable" as search terms would return a large number of Tweets, making it hard to identify novelty among all these Tweets.&lt;br /&gt;&lt;br /&gt;Searching for novelty and identifying new ideas among Tweets is not an easy task. 
The problem is that we do not know what we are looking for in the first place : We can define the general context -such as wanting to identify novelty in user experience- but then we get stuck on which techniques to use (cluster analysis being perhaps an exception).&lt;br /&gt;&lt;br /&gt;The potential of using semi-automatic novelty detection on Twitter and other websites -such as &lt;a href="http://delicious.com/"&gt;delicious&lt;/a&gt; links- is very big. Although this is still work in progress, a general methodology for novelty detection on Twitter could be to :&lt;br /&gt;&lt;br /&gt;1) Collect a large subset of Tweets mentioning iPhone and a keyword that identifies context (such as the word &lt;span style="font-style: italic;"&gt;charger&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;2) Identify keyword frequencies&lt;br /&gt;&lt;br /&gt;3) Generate search queries using a subset of keywords chosen in an "intelligent" way, otherwise there would be too many search queries to evaluate in practice.&lt;br /&gt;&lt;br /&gt;4) Test these combinations of keywords by submitting them to Twitter search and evaluating the results.&lt;br /&gt;&lt;br /&gt;Steps (3) and (4) are of course the key to success. In our example about the iPhone cable being too short, results were returned because the combination of keywords submitted made sense. 
Trying out &lt;span style="font-style: italic;"&gt;iPhone, cable, snow &lt;/span&gt;tells us that such a keyword combination is not a valid one and -hence- not an "intelligent" keyword subset :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/S0MOfl_xqjI/AAAAAAAAAWk/3SbxdNTnOw8/s1600-h/novelty3.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 271px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/S0MOfl_xqjI/AAAAAAAAAWk/3SbxdNTnOw8/s400/novelty3.JPG" alt="" id="BLOGGER_PHOTO_ID_5423194312222550578" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3475949956575348425?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/3475949956575348425/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3475949956575348425' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3475949956575348425'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3475949956575348425'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2010/01/detecting-novelty-in-twitter-posts.html' title='Detecting Novelty in Twitter posts'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/S0L9XPNnATI/AAAAAAAAAWM/C9DjweBb8hg/s72-c/novelty2.JPG' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-1970353091333510496</id><published>2009-12-28T18:13:00.027+02:00</published><updated>2009-12-30T17:51:38.686+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='knowledge hub'/><title type='text'>Building a Knowledge Hub</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;br /&gt;The web is a huge source of information. It stores facts, thoughts, feelings and intentions of people. It also records what people like and what they don't in an indirect way - something that we are going to be looking at shortly .  
Some examples of harnessing this information were shown previously in this blog, such as :&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Extraction of user opinions, beliefs and values from Twitter&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Prediction of popular stories on Digg&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Prediction of popular Tweets&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Consider the following snapshot from a BBC &lt;a href="http://news.bbc.co.uk/2/shared/bsp/hi/live_stats/html/bysection.stm"&gt;webpage&lt;/a&gt; :&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SznUTeH2xfI/AAAAAAAAAVc/nBApY2m7txU/s1600-h/bbc.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 260px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SznUTeH2xfI/AAAAAAAAAVc/nBApY2m7txU/s400/bbc.JPG" alt="" id="BLOGGER_PHOTO_ID_5420597057485719026" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;The table above shows the most popular business stories on the BBC on 22nd December 2009. Even though we do not have specific metrics, we intuitively understand that the order in which the stories are listed also tells us the popularity of each story. Notice that the first of the most read stories is about the British economy while the last one is a headline about football.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;This is knowledge that we can harness. No doubt it is a very specific kind of knowledge because it tells us only what -mostly- British readers of the BBC have found interesting. 
In other words this is knowledge for a specific population : Most likely in another country -say France- the headline about the UK still being in recession would not be as interesting, but a headline about France being in the same situation would. Subject, Time and Location are all important parameters that need to be captured and taken into account.&lt;br /&gt;&lt;br /&gt;Let's consider the idea of creating a Knowledge Hub : This could be done by collecting massive amounts of information from Social Media, blogs, comments from forums and news titles (and their popularity). Techniques such as Information Extraction with concept annotation, Data and Text Mining could be used to extract knowledge by combining incidents, opinions, intentions and emotions found in different sources.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;For the past 3 months I have been monitoring and collecting news and forum posts generated from/for a specific country. The information collected is then annotated in such a way as to extract concepts. This text annotation is matched with keywords of concepts, incidents and intentions. Over the past month there has been a considerable increase in negative economic sentiment, crime-related incidents and/or terms that communicate future social instability and uneasiness.&lt;br /&gt;&lt;br /&gt;It is a very interesting fact that our behavior is recorded -up to a point- by the web. Again, the key is how we organize this information into logical chunks and then use this representation to find possible insights.&lt;br /&gt;&lt;br /&gt;2009 has been a year of big changes. 
Best wishes for a Happy and Prosperous New Year for everyone.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-1970353091333510496?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/1970353091333510496/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=1970353091333510496' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/1970353091333510496'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/1970353091333510496'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/12/building-knowledge-hub.html' title='Building a Knowledge Hub'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_koDJi0ps7Mw/SznUTeH2xfI/AAAAAAAAAVc/nBApY2m7txU/s72-c/bbc.JPG' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-167382529668312466</id><published>2009-10-31T16:22:00.059+02:00</published><updated>2009-11-02T11:32:33.184+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><category scheme='http://www.blogger.com/atom/ns#' term='information extraction'/><category 
scheme='http://www.blogger.com/atom/ns#' term='economy'/><title type='text'>The sentiment  on US Economy from Twitter</title><content type='html'>&lt;div style="text-align: justify;"&gt;Is the economic crisis over? What is the sentiment of people regarding the US economy and the future? These are some of the questions that many people ask these days and the signs are somewhat mixed. The Dow Jones is close to the 10000 mark and some US economic indices suggest that the worst is behind us. But do people feel the same?&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;To answer these questions 10000 Tweets containing the word &lt;span style="font-style: italic;"&gt;economy&lt;/span&gt; were collected with the purpose of finding out what people think and how they feel about the US economy and the economic crisis. The following web chart shows some of the results :&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/Su1reGXYLZI/AAAAAAAAAVA/BCysPbZ69zY/s1600-h/econ_webchart.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 197px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/Su1reGXYLZI/AAAAAAAAAVA/BCysPbZ69zY/s400/econ_webchart.jpg" alt="" id="BLOGGER_PHOTO_ID_5399089693136006546" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;PositiveSentiment &lt;/span&gt;is an annotation type that includes all words that suggest positivity such as &lt;span style="font-style: italic;"&gt;good, better, advances &lt;/span&gt;while the opposite annotation (&lt;span style="font-style: italic;"&gt;NegativeSentiment&lt;/span&gt;) exists for all keywords that suggest negativity.&lt;br /&gt;&lt;br /&gt;The bolder the line between two words, the stronger the association. 
To get an idea of how people feel, look at the line that connects &lt;span style="font-style: italic;"&gt;NegativeSentiment&lt;/span&gt; and the word &lt;span style="font-style: italic;"&gt;still&lt;/span&gt;, which suggests that the strongest sentiment is that the US economy still has big problems.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;Some other findings :&lt;br /&gt;&lt;br /&gt;- The US President says that the economy is getting better but people don't feel the same.&lt;br /&gt;&lt;br /&gt;- The economy cannot be getting better while at the same time there are layoffs.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;- People express very negative feelings after losing their jobs.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Notice also the association between &lt;span style="font-style: italic;"&gt;NegativeSentiment&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;people&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;job&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;money&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;sales&lt;/span&gt;. Interesting insights can also be found if brand names and product categories are taken into account : In this analysis a specific brand was found to be associated with the word &lt;span style="font-style: italic;"&gt;sales &lt;/span&gt;and a good overall sentiment. Buying behavior, in terms of consumer intentions, can also be identified.&lt;br /&gt;&lt;br /&gt;You will also find that an association exists between &lt;span style="font-style: italic;"&gt;finance_institution&lt;/span&gt; keywords (here, the keyword &lt;span style="font-style: italic;"&gt;Fed&lt;/span&gt;) and &lt;span style="font-style: italic;"&gt;PositiveSentiment&lt;/span&gt;. This association exists because a number of Re-Tweets are about the Fed signaling the start of the exit from recession and its impact on housing. 
Interesting also is the association between &lt;span&gt;the word&lt;/span&gt;&lt;span style="font-style: italic;"&gt; fool &lt;/span&gt;and the annotation &lt;span style="font-style: italic;"&gt;PositiveSentiment &lt;/span&gt;&lt;span&gt;(...)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Some Tweets were removed, such as spam Tweets (that try to sell investment products). Re-Tweets were kept intact since we are making the assumption that if someone Re-Tweets -say- a positive sentiment Tweet then he/she also feels the same -positive- sentiment. Tweets that were jokes were identified, marked accordingly and removed.&lt;br /&gt;&lt;br /&gt;As with many examples in the past, the software used was &lt;a href="http://gate.ac.uk/"&gt;GATE&lt;/a&gt; (for annotating unstructured text from Tweets) together with SPSS Clementine (now PASW Modeler). Here is the setup from GATE :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/Su1lLULSBoI/AAAAAAAAAU4/KyyalN8Z12Q/s1600-h/econ_gate.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 216px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/Su1lLULSBoI/AAAAAAAAAU4/KyyalN8Z12Q/s400/econ_gate.JPG" alt="" id="BLOGGER_PHOTO_ID_5399082773356086914" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Specific JAPE rules were used to identify and annotate negative and positive sentiment. Consider the following sentences :&lt;br /&gt;&lt;br /&gt;- &lt;span style="font-style: italic;"&gt;The economy is most likely bad at the moment&lt;/span&gt;&lt;br /&gt;- &lt;span style="font-style: italic;"&gt;If the economy is great then why so many people can't find a job?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The first sentence clearly has a negative sentiment since it contains the word &lt;span style="font-style: italic;"&gt;bad&lt;/span&gt;. 
However, the second sentence contains the word &lt;span style="font-style: italic;"&gt;great&lt;/span&gt;, so a specific matching rule should take the word &lt;span style="font-style: italic;"&gt;If&lt;/span&gt; into consideration and annotate this sentence as having negative sentiment despite the presence of the word &lt;span style="font-style: italic;"&gt;great&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;After running GATE, here is what the -now structured- data look like for a smaller sample of the original dataset (notice the highlighted record and the &lt;span style="font-style: italic;"&gt;IfGood&lt;/span&gt; flag) :&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/Su1s2rzSAfI/AAAAAAAAAVI/hj6yJ_-cafo/s1600-h/econ-table.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 147px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/Su1s2rzSAfI/AAAAAAAAAVI/hj6yJ_-cafo/s400/econ-table.JPG" alt="" id="BLOGGER_PHOTO_ID_5399091215013642738" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;With data in a structured form such as the one depicted above, we are ready to identify which Tweets were found to have a positive or negative sentiment, spot erroneous annotations, take corrective actions and finally analyze the information and extract knowledge from it.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-167382529668312466?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/167382529668312466/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=167382529668312466' title='9 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9150291873749799355/posts/default/167382529668312466'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/167382529668312466'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/10/sentiment-on-us-economy-from-twitter.html' title='The sentiment  on US Economy from Twitter'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/Su1reGXYLZI/AAAAAAAAAVA/BCysPbZ69zY/s72-c/econ_webchart.jpg' height='72' width='72'/><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-5692298787848935176</id><published>2009-10-12T10:21:00.010+03:00</published><updated>2009-10-12T11:43:13.948+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Mining the Tweets</title><content type='html'>&lt;div style="text-align: justify;"&gt;I received through my Google Alerts a very interesting &lt;a href="http://kara.allthingsd.com/20091008/twitter-talking-separately-to-microsoft-and-also-google-about-big-data-mining-deals/"&gt;article&lt;/a&gt; : Twitter is in talks with Microsoft and Google regarding the use of Data Mining technology on user Tweets.&lt;br /&gt;&lt;br /&gt;Although Twitter execs do not appear eager to close the deal as soon as possible, this news clearly shows where things are going. 
If and when the deal is finalized it will be very interesting to see :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;1) &lt;span style="font-style: italic;"&gt;Which Data and Text Mining techniques will be used the most? Which of them will prove useful?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Many examples of what Data and Text Mining can do on Twitter have been given in this blog (starting from January 2009). In my opinion, the types of analysis that will prove interesting -apart from Sentiment Mining for Products and Services which is already taking place- are Cluster Analysis (see post "Clustering the Thoughts of Twitter Users" &lt;a href="http://lifeanalytics.blogspot.com/2009/01/clustering-thoughts-of-twitter-users.html"&gt;here&lt;/a&gt;) and Prediction of Virality.&lt;br /&gt;&lt;br /&gt;Although Twitter will be able to monetize insights extracted from Cluster Analysis and Opinion - Sentiment Mining, perhaps the most important analysis is finding patterns in user emotional states. Recall that everything needed for such an analysis exists in user Tweets : Life Events, thoughts and their associated emotional states. What emotions drive people to make decisions such as which Product to buy or which Politician to support? What kind of feelings are generated during a bad economy? Perhaps by analyzing Tweets we could understand people (and thus consumers) in entirely new ways since this is the first time that this information is available to us.&lt;br /&gt;&lt;br /&gt;2) &lt;span style="font-style: italic;"&gt;How will Twitter users react when they know their Tweets are being analyzed?&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;My first impression is that Twitter users do not care too much if companies extract the insights discussed above; however, this does not mean that people's opinion will stay this way. 
Again, user reaction on this matter could change at any time and should be watched closely.&lt;br /&gt;&lt;br /&gt;3) &lt;span style="font-style: italic;"&gt;Which other technologies will be most sought after?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Although no one can give a definitive answer, I would expect Natural Language Processing (NLP) and Ontologies to be heavily used as well, and expertise in them to be sought after.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-5692298787848935176?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/5692298787848935176/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=5692298787848935176' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/5692298787848935176'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/5692298787848935176'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/10/mining-tweets.html' title='Mining the Tweets'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-2868954214242006071</id><published>2009-08-09T21:14:00.015+03:00</published><updated>2009-08-27T14:16:28.303+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text 
mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Surviving Cancer, Happiness and Twitter</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: justify;"&gt;Twitter is a great source of information on how people feel and how they behave. In previous posts we have discussed several examples of extracting the feelings, beliefs and values of Twitter users from their posts.&lt;br /&gt;&lt;br /&gt;The goal of my latest analysis was to extract specific life events (such as the birth of a child) and the feelings and emotions associated with such events.&lt;br /&gt;&lt;br /&gt;First I wanted to identify life events associated with happiness. To do this I used text classification and a great piece of software called &lt;a href="http://gate.ac.uk/"&gt;GATE&lt;/a&gt;. The data came from the Tweets of 60K Twitter users and their biographies.&lt;br /&gt;&lt;br /&gt;After completing the analysis, several "patterns of happiness" emerged but I believe that there is one that deserves a post of its own and should be shared : One of the happiest groups of people on Twitter is &lt;span style="font-style: italic;"&gt;cancer survivors&lt;/span&gt;. I was really amazed to find out that these people, who faced -and are possibly still facing- this life-threatening disease, were amongst the happiest people on Twitter and very frequently used words expressing happiness, satisfaction and &lt;span&gt;blessedness.&lt;br /&gt;&lt;br /&gt;I do believe that Twitter is a huge source of information and insights for Marketing, Branding and PR. 
It also appears that by analyzing Tweets we could also learn some important life lessons as well.&lt;br /&gt;&lt;br /&gt;More to come soon.&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-2868954214242006071?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/2868954214242006071/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=2868954214242006071' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2868954214242006071'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2868954214242006071'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/08/surviving-cancer-happiness-and-twitter.html' title='Surviving Cancer, Happiness and Twitter'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8032673299398654919</id><published>2009-08-04T12:55:00.023+03:00</published><updated>2009-08-05T21:09:21.964+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>A computer program predicts Viral Tweets</title><content type='html'>&lt;div style="text-align: justify;"&gt;In the &lt;a 
href="http://lifeanalytics.blogspot.com/2009/06/predicting-next-viral-tweet.html"&gt;previous post&lt;/a&gt; we saw that the author of a Tweet is the most important factor in making a Tweet viral. This time we will use Text Mining to score Tweets and see how viral they could become. Each Tweet is fed to a computer program (an algorithm) and the algorithm responds with the probability that the Tweet will become viral (we consider a Tweet viral when it receives more than 30 Re-Tweets).&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/Snl4KGSEO4I/AAAAAAAAAS4/NWrmB824H74/s1600-h/knowledgeflow.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 170px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/Snl4KGSEO4I/AAAAAAAAAS4/NWrmB824H74/s400/knowledgeflow.JPG" alt="" id="BLOGGER_PHOTO_ID_5366452545869069186" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;The information given to the algorithm is the text of the Tweet and its author. Many other parameters can be taken into consideration, such as the time the Tweet was posted, the type of the Tweet (e.g. politics, technology, health, etc.) or even whether this Tweet is part of a novel subject. 
Here is the output of the software that performs the predictions :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SnnGXWhsdTI/AAAAAAAAATo/WSnsWqTTIFs/s1600-h/classifier-run2.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 225px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SnnGXWhsdTI/AAAAAAAAATo/WSnsWqTTIFs/s400/classifier-run2.JPG" alt="" id="BLOGGER_PHOTO_ID_5366538535474853170" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The number of Re-Tweets is shown in squares. Also pay close attention to the circled text shown above. For each Tweet the most probable outcome is given ('t'= Tweet will become viral, 'f'=otherwise) and a confidence for each prediction is given as a number from 0 to 1. As an example, the first Tweet shown above was posted by Paula Abdul, saying that she would not return to American Idol. The algorithm predicts with a confidence of 63.38% that what Paula Abdul posted will become viral (and it actually did).&lt;br /&gt;&lt;br /&gt;The predictive model has an overall accuracy of 72.88% in predicting which Tweets will be viral, on a total of 59 Tweets. An example of an incorrect prediction can be seen at the 4th circle from the top. The algorithm gave a 53.66% confidence that this Tweet would not become viral, but it actually was a viral Tweet.&lt;br /&gt;&lt;br /&gt;You can find the text file of the actual run of the algorithm &lt;a href="http://www.filefactory.com/file/ahh77hf/n/classifier-run_txt"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;By looking at the text file, result metrics such as TP (True Positives) versus FP (False Positives) can be calculated. 
It is also interesting to see how the algorithm switches to negative predictions when the number of Re-Tweets of a Tweet falls below 30.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Even though the example given here is very simplistic -and optimistic-, a tool of this kind could prove very useful for PR, Marketing and Branding. Marketers can try different messages and see what impact each message is likely to have. Consider the following run, which shows that &lt;a href="http://twitter.com/mashable"&gt;@mashable&lt;/a&gt; is more influential than &lt;a href="http://twitter.com/lifeanalytics"&gt;@lifeanalytics&lt;/a&gt; :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SnmXqR3yB-I/AAAAAAAAATQ/X9FOdGBe0RE/s1600-h/classifier-run1.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 542px; height: 118px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SnmXqR3yB-I/AAAAAAAAATQ/X9FOdGBe0RE/s400/classifier-run1.JPG" alt="" id="BLOGGER_PHOTO_ID_5366487183596324834" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The following run shows that specific keywords raise our chances of making a viral Tweet :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_koDJi0ps7Mw/SnmdaTJRtWI/AAAAAAAAATY/1JIyCj6RtD8/s1600-h/classifier-run3.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 577px; height: 119px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/SnmdaTJRtWI/AAAAAAAAATY/1JIyCj6RtD8/s400/classifier-run3.JPG" alt="" id="BLOGGER_PHOTO_ID_5366493506129999202" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In theory this information could provide the basis for performing A/B tests : One could simply use the 2 messages shown above 
and record what impact each one has using Google Analytics (a process which could prove whether this technology works or not).&lt;br /&gt;&lt;br /&gt;Finding information that is interesting to the masses is actually a much harder problem. Twitter is a data source that is biased for several reasons : specific people can spread their messages with great ease, and Twitter is used by specific population segments. Almost a week ago I came across &lt;a href="http://www.reddit.com/"&gt;reddit&lt;/a&gt;, and I believe that this site (and also Digg) is able to capture the preferences of the masses more efficiently than Twitter. The truth is that the information available from forums, blogs and many other websites can capture different aspects of human behavior. All that is needed to extract useful knowledge is an efficient blending of the facts, emotions and beliefs of people from different web sources.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8032673299398654919?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8032673299398654919/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8032673299398654919' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8032673299398654919'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8032673299398654919'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/08/computer-program-predicts-viral-tweets.html' title='A computer program predicts Viral Tweets'/><author><name>Themos 
Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/Snl4KGSEO4I/AAAAAAAAAS4/NWrmB824H74/s72-c/knowledgeflow.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-6156875757707532335</id><published>2009-06-30T15:52:00.031+03:00</published><updated>2009-07-22T18:52:15.867+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Predicting the next Viral Tweet</title><content type='html'>It is time to use Twitter data for another reason : Can Predictive Analytics be used to identify which tweets have an increased probability to become viral?&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_koDJi0ps7Mw/SmWNhzoDKtI/AAAAAAAAASo/DHL6ebvHVOc/s1600-h/viraltree.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 178px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/SmWNhzoDKtI/AAAAAAAAASo/DHL6ebvHVOc/s400/viraltree.jpg" alt="" id="BLOGGER_PHOTO_ID_5360846543387830994" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;First we have to identify the problem and see what information we should consider. Every Tweet has an author, a content and is posted on a specific day and time. 
More specifically, for every tweet we can collect usage data such as :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Day of post&lt;/li&gt;&lt;li&gt;Time of post&lt;/li&gt;&lt;li&gt;Elapsed minutes since the tweet was posted&lt;/li&gt;&lt;li&gt;Author of tweet (Twitter username)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Number of followers of the author&lt;/li&gt;&lt;/ul&gt;and also content information such as :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Subject of post&lt;/li&gt;&lt;li&gt;Whether the tweet involves a question being asked&lt;/li&gt;&lt;li&gt;Whether the tweet contains hashtags&lt;/li&gt;&lt;li&gt;Whether the tweet contains a "Please Re-Tweet" directive (or variants)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Whether a user is mentioned&lt;/li&gt;&lt;li&gt;The text of the tweet itself.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Our goal then is to combine the information mentioned above and come up with a predictive model that, when given an author, day, time of post and text of a tweet, will be able to tell us whether this tweet has an increased probability of becoming viral.&lt;br /&gt;&lt;br /&gt;For this Data &amp;amp; Text mining exercise (and keeping in mind that tweets have been sampled from one website and not Twitter itself) let's define what a viral tweet is : After collecting approx. 8000 tweets from &lt;a href="http://www.dailyrt.com/"&gt;dailyrt.com&lt;/a&gt; it was found that the median number of Re-tweets is 17. Here we make the assumption that a tweet is considered viral if it exceeds 30 Re-tweets (and actually this specific assumption makes the classification task much easier).&lt;br /&gt;&lt;br /&gt;As discussed above, usage data do not tell us anything about the content of a tweet. Usage data tell us the name of the author, his/her number of followers, when the tweet was posted and how many minutes have elapsed since it was posted. Can this information alone predict whether a tweet will become viral? 
A data mining model predicted (without using the elapsed time as an input field) whether a tweet would become viral with an overall accuracy of 75.03% and -perhaps as expected- showed that the most important factor in making a tweet viral is its author. Running a process called Feature Selection tells us just that :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SmSxeBzIKoI/AAAAAAAAASQ/OBD1U_yJ9cQ/s1600-h/usagefs.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 197px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SmSxeBzIKoI/AAAAAAAAASQ/OBD1U_yJ9cQ/s400/usagefs.JPG" alt="" id="BLOGGER_PHOTO_ID_5360604585914804866" border="0" /&gt;  &lt;/a&gt;&lt;br /&gt;But what we have seen so far only tells us one -the Data Mining- side of the story. With Text Mining we can see the importance of both words and authors. To do that, the author is appended to the end of each tweet (so essentially the author becomes part of the tweet text). Here is what Feature Selection tells us :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SmSyPmYkjqI/AAAAAAAAASY/_Hdzh6Ga4hU/s1600-h/textfs.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 152px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SmSyPmYkjqI/AAAAAAAAASY/_Hdzh6Ga4hU/s400/textfs.JPG" alt="" id="BLOGGER_PHOTO_ID_5360605437549121186" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;A Tweet mentioning Michael Jackson has a high probability of becoming viral, but perhaps it should also be posted by a popular author to make a greater impact. 
Pay attention also to the fact that &lt;a href="http://twitter.com/mashable"&gt;@mashable&lt;/a&gt; and the &lt;a href="http://twitter.com/theonion"&gt;@theonion&lt;/a&gt; are on top of our feature selection list shown above.&lt;br /&gt;&lt;br /&gt;The difficult -but also interesting- task is to predict a viral tweet that  has an impact not because of its author but because of its content and to do this the methodology of data collection and analysis differs significantly.&lt;br /&gt;&lt;br /&gt;On the next post we will see a model predicting viral tweets in action : We will submit  several tweets and their author and the model will tell us the probability that each submitted tweet has to become viral.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-6156875757707532335?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/6156875757707532335/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=6156875757707532335' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6156875757707532335'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6156875757707532335'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/06/predicting-next-viral-tweet.html' title='Predicting the next Viral Tweet'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' 
src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_koDJi0ps7Mw/SmWNhzoDKtI/AAAAAAAAASo/DHL6ebvHVOc/s72-c/viraltree.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7923760654094503173</id><published>2009-06-23T14:05:00.011+03:00</published><updated>2009-06-24T08:11:53.563+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='social media analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>How Habitat UK *should* have used Twitter</title><content type='html'>&lt;div style="text-align: justify;"&gt;Following the great &lt;a href="http://socialmediatoday.com/SMC/103334"&gt;post&lt;/a&gt; from Tiphereth Gloria i wanted to take the opportunity to show an example of how Habitat UK &lt;span&gt;should&lt;/span&gt; be using Twitter.&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;My suggestion would be that instead of the "initiative" they took they should identify the values, beliefs and needs of their customers by capturing and analyzing relevant tweets instead. 
And here is how they could do it :&lt;br /&gt;&lt;br /&gt;First they should capture all relevant Tweets every -say- month :&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_koDJi0ps7Mw/SkDFbEKfSUI/AAAAAAAAAR4/pCSldDOUKPs/s1600-h/habitat3.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 474px; height: 127px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/SkDFbEKfSUI/AAAAAAAAAR4/pCSldDOUKPs/s400/habitat3.JPG" alt="" id="BLOGGER_PHOTO_ID_5350493426081024322" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The second step would be to identify what people want when they talk about furniture. If they used Text Mining they would have found specific furniture products that customers want to buy and the values associated with these types. For an example look at the following table :&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SkDF-z35GII/AAAAAAAAASA/RHROuBHm3pE/s1600-h/habitat1.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 217px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SkDF-z35GII/AAAAAAAAASA/RHROuBHm3pE/s400/habitat1.JPG" alt="" id="BLOGGER_PHOTO_ID_5350494040183347330" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The table shows us (pay attention to dark red cells) that customers looking to buy baby furniture have &lt;span style="font-style: italic;"&gt;Safety&lt;/span&gt; as their number one associated value. With this knowledge then perhaps Habitat UK would make sure that when they advertise Baby furniture they would use this word on their advertisements to capture the interest of their customers.  
Of course, what was shown above is not new information; it is only meant as an example.&lt;br /&gt;&lt;br /&gt;Some more things that Habitat UK could have examined with Text Mining are :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How important it is to suggest solutions to customers&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Which rooms people want to re-furnish more often and -more importantly- why.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;How problems (such as furniture that arrives damaged or is difficult to assemble) affect their brand.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;How people &lt;span style="font-style: italic;"&gt;feel excited&lt;/span&gt; when they wait for their new furniture...and how bad they feel when furniture is not delivered on time.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;There is much more that can be done : By running Cluster analysis, many kinds of customer thoughts can be grouped together. One of the groups found was about how closely "Feeling good" is related to new furniture and how it affects people's psyche.&lt;br /&gt;&lt;br /&gt;By using Social Media Analytics, Habitat UK -and most other companies- would understand their customers better, see what is important to them and, with this knowledge, be able to make informed&lt;span style="font-style: italic;"&gt; &lt;/span&gt;decisions that would -most likely- make a real difference.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7923760654094503173?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7923760654094503173/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7923760654094503173' title='0 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7923760654094503173'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7923760654094503173'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/06/how-habitat-uk-should-have-used-twitter.html' title='How Habitat UK *should* have used Twitter'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_koDJi0ps7Mw/SkDFbEKfSUI/AAAAAAAAAR4/pCSldDOUKPs/s72-c/habitat3.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8304360112878908903</id><published>2009-06-19T11:10:00.018+03:00</published><updated>2009-06-19T16:08:47.804+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>How people use Twitter - 10 distinct usage groups</title><content type='html'>&lt;div style="text-align: justify;"&gt;During this post we will be looking at another example of cluster analysis performed on Twitter.  The analysis was performed on 17000 Twitter users with the goal of extracting distinct groups of usage which essentially shows us the different types of Usage behavior of Twitter users. 
The following parameters were taken into consideration :&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Number of Followers&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Number of Links posted per 20 Tweets (not during RT)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Number of Updates&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Elapsed Days&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The following table shows the results :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SjtZF5jYfwI/AAAAAAAAARw/oUDU1t76Rbg/s1600-h/twitclusters.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 527px; height: 200px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SjtZF5jYfwI/AAAAAAAAARw/oUDU1t76Rbg/s400/twitclusters.JPG" alt="" id="BLOGGER_PHOTO_ID_5348966940316040962" border="0" /&gt;&lt;/a&gt;Note that each cluster has a specific number from 1 to 10. Clusters are listed by size, which means that cluster "10" is the largest usage group, while cluster "5" is the smallest.&lt;br /&gt;&lt;br /&gt;Let's see what the table tells us, starting with the first line : Cluster 10 is the largest (= most frequent) type of usage behavior. Users of that group have an average number of followers, have been using Twitter for a relatively long time (elapsedDays=high) and have a high number of updates, while the number of links they provide per 20 tweets is average -say around 3 links-.&lt;br /&gt;&lt;br /&gt;Now consider -highlighted- cluster 8 which we will call &lt;span style="font-style: italic;"&gt;The Information providers&lt;/span&gt; : Notice that even though this group of users has relatively few elapsed days and an average number of updates, it achieves a high number of followers. 
The reason appears to be that these users provide a large number of links per 20 Tweets (note that this confirms findings from a previous analysis).&lt;br /&gt;&lt;br /&gt;See also cluster 3 : Even though this group of users has been on Twitter for many days and also has a high number of updates, it appears to pay a price for not providing links.&lt;br /&gt;&lt;br /&gt;Recall that the "#OfLinks" parameter counts only those links that are NOT part of a Retweet. This tells us that users who are able to find &lt;span style="font-style: italic;"&gt;original&lt;/span&gt; content and provide it to the community tend to gain more followers.&lt;br /&gt;&lt;br /&gt;This analysis is meant as a simple example and should not be considered a detailed analysis, since only a few parameters have been taken into account. Cluster Analysis on Twitter data (which include things that people like doing, professions, interests, marital status, mentions of products or opinions, to name a few) can -potentially- give us excellent insights into different aspects of user behavior.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8304360112878908903?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8304360112878908903/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8304360112878908903' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8304360112878908903'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8304360112878908903'/><link rel='alternate' type='text/html' 
href='http://lifeanalytics.blogspot.com/2009/06/how-people-use-twitter-10-distinct.html' title='How people use Twitter - 10 distinct usage groups'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_koDJi0ps7Mw/SjtZF5jYfwI/AAAAAAAAARw/oUDU1t76Rbg/s72-c/twitclusters.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3287863657046630280</id><published>2009-06-04T09:32:00.062+03:00</published><updated>2009-06-08T17:13:59.259+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='google wave'/><category scheme='http://www.blogger.com/atom/ns#' term='social media analytics'/><title type='text'>Social Media, Corporate Decisions and Analytics</title><content type='html'>&lt;div style="text-align: justify;"&gt;Over the past 6 months we have seen real-world applications of Data and Text Mining applied on Social Media Data from Twitter. 
We went through many &lt;a href="http://lifeanalytics.blogspot.com/search/label/twitter"&gt;examples&lt;/a&gt; that look at Social Media Data in different ways  :&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;We identified what Twitter users don't want, grouped their beliefs and also ordered all of this information accordingly&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;We identified which usage behavior increases our chances of having a large number of followers (if a large number of followers is our goal)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SijTlfh5STI/AAAAAAAAARA/DuP-kU66zN0/s1600-h/twitterdecisiontree.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 136px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SijTlfh5STI/AAAAAAAAARA/DuP-kU66zN0/s400/twitterdecisiontree.jpg" alt="" id="BLOGGER_PHOTO_ID_5343753598947379506" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;We found which words appear to be associated with a large number of followers. 
(We have seen that negative thinking and words in Tweets possibly drive people away)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/Sio4umCyYeI/AAAAAAAAARQ/lvdz0NUELRI/s1600-h/twitt-table2.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 217px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/Sio4umCyYeI/AAAAAAAAARQ/lvdz0NUELRI/s400/twitt-table2.JPG" alt="" id="BLOGGER_PHOTO_ID_5344146280965890530" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;We extracted segments of Twitter users with similar characteristics.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;The list of possible applications does not end here. In upcoming posts we will also discuss :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Predicting whether a Tweet has the potential to become "viral".&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Associating specific events and user emotional states.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;To recap : A computer program is able to monitor the words and phrases that you use and your emotions, flag them as positive or negative, track the rate at which you increase your follower count, track the number of updates, Re-Tweets, replies, hashtags, smileys and questions that you make, flag any mentions of products and services and assign you to a predefined segment of users sharing similar behavior and interests. 
Then for each segment its "social media fitness value" is identified (by looking at the follower count).&lt;br /&gt;&lt;br /&gt;Usage of &lt;a href="http://wave.google.com/"&gt;Google Wave&lt;/a&gt; will possibly reveal other insights : Because the sequence of posts will be easy to extract, we could also take into consideration the number of consecutive posts that had a positive sentiment and whether these positive posts appeared at the beginning, middle or end of each thread's sequence. We could also look at the number of posts -that are part of the same thread- having videos or pictures attached and ultimately identify how all of this information may affect one's point of view. Of course I am not certain whether such a scenario could prove useful. I sure would like to try though.&lt;br /&gt;&lt;br /&gt;We are presented with a unique opportunity to understand people much better than before, and the examples shown so far should have made this clearer. Predictive Analytics is about extracting knowledge and identifying what is more likely to work. As &lt;a href="http://islandia.law.yale.edu/ayers/"&gt;Ian Ayres&lt;/a&gt; put it in his book &lt;span style="font-style: italic;"&gt;Super Crunchers&lt;/span&gt;, decisions are beginning to be based even more on facts and less on intuition. 
It appears that Social Media Analytics will play an important role in making Corporate decisions for PR, Branding and Marketing and this will happen through better understanding of human behavior.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3287863657046630280?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/3287863657046630280/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3287863657046630280' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3287863657046630280'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3287863657046630280'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/06/social-media-corporate-decisions-and.html' title='Social Media, Corporate Decisions and Analytics'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_koDJi0ps7Mw/SijTlfh5STI/AAAAAAAAARA/DuP-kU66zN0/s72-c/twitterdecisiontree.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8461399928706898263</id><published>2009-05-21T21:38:00.051+03:00</published><updated>2009-05-26T17:21:11.110+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Twitter Analytics : Cluster Analysis reveals similar Twitter Users</title><content type='html'>&lt;div style="text-align: justify;"&gt;So far we have seen various examples of using analytics to gain insights from Twitter. Using cluster analysis is a personal favorite : It enables us to identify common groups of users and in this post we are going to look at a segmentation based on user biography keywords. This analysis was also presented in an &lt;a href="http://lifeanalytics.blogspot.com/2009/02/know-your-customers-twitter-way.html"&gt;older&lt;/a&gt; post but some readers asked me to  elaborate a bit more on this type of analysis.&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Biography information allows us to segment Twitter users in groups of similar interests, professions and qualities. What is more interesting however is that we can identify the words that each segment appears to be associated with.   
Let's see an example of words that tend to co-exist with the phrase "social media" in the Biographies of Twitter users :&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/ShfOK29ELZI/AAAAAAAAAPw/rbF10AShQtM/s1600-h/clust1.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 456px; height: 241px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/ShfOK29ELZI/AAAAAAAAAPw/rbF10AShQtM/s400/clust1.JPG" alt="" id="BLOGGER_PHOTO_ID_5338962569216667026" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;By looking at the column named "social_media" we see some associated keywords like : &lt;span style="font-style: italic;"&gt;addiction&lt;/span&gt; (synonym for addicted, junkie etc), &lt;span style="font-style: italic;"&gt;evangelist, enthusiast, analytics&lt;/span&gt; etc.&lt;br /&gt;&lt;br /&gt;Other groups found and their associated words were :&lt;br /&gt;&lt;br /&gt;The Geeks  : &lt;span style="font-style: italic;"&gt;Developer, Linux, Mac, gaming, photography&lt;/span&gt;&lt;br /&gt;The Parents : &lt;span style="font-style: italic;"&gt;married, boys, girls, christian,conservative&lt;/span&gt;&lt;br /&gt;The business owners : &lt;span style="font-style: italic;"&gt;CEO, entrepreneur, marketing, founder, lifestyle&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span&gt;Note that "The Geeks" have &lt;span style="font-style: italic;"&gt;Mac&lt;/span&gt; as an associated keyword which of course refers to Apple Macintosh : An example suggesting a possible strong bond between a brand and a specific customer segment.&lt;br /&gt;&lt;br /&gt;Now imagine running a similar analysis for other segments such as Single Dads and Mothers,  Teenage Girls, Nice Guys, &lt;/span&gt;&lt;span&gt;IT Developers&lt;/span&gt;&lt;span&gt;, VIPs or any other "segment" you prefer &lt;/span&gt;(see &lt;a 
href="http://lifeanalytics.blogspot.com/2009/01/emotions-beliefs-and-analytics.html"&gt;this&lt;/a&gt; entry -posted Jan. 2009- for more)&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;On a personal note : Having used Text Mining on Twitter over the past 6 months I realized that whenever a new cycle of analysis is run I mostly come up with things that I already know. But apart from the expected results, some of the fine details of people's lives also appear, such as the implications of a life-changing event, the joy of owning something new or the plain fact of "watching TV and feeling bored". Many of the insights found during these months -although not discussed here on purpose- are highly thought-provoking.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;Perhaps Twitter Analytics could also give us some &lt;span&gt;possible clues&lt;/span&gt; on:&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Whether a specific profession could be a risk factor for being single.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How important fashion is for girls.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How mobile phone user requirements change according to the "segment" they belong to.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;What the most common things are that people &lt;a href="http://lifeanalytics.blogspot.com/2009/01/clustering-thoughts-of-twitter-users.html"&gt;don't want&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Which individuals &lt;span style="font-style: italic;"&gt;do not&lt;/span&gt; fit any "segment".&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;But the list of potential applications does not end here : Using a technique called &lt;a href="http://en.wikipedia.org/wiki/Association_rule_learning"&gt;Association Rule Learning&lt;/a&gt; (or Association Discovery) we can extract emotions or thoughts that appear to 
co-exist and also emotions that seem to be associated with specific events. Classification Analysis can also play an important part  (more on these techniques soon).&lt;br /&gt;&lt;br /&gt;Each technique looks at the Social Media Data world from a different perspective. The usage behavior, cluster membership, the emotions and thoughts and also the Tweets that users seem to prefer most  (using data from sites such as &lt;a href="http://www.repeets.com/"&gt;repeets.com&lt;/a&gt;) may  be combined.  What we can potentially achieve from a combined analysis of this kind will be discussed in later posts.&lt;br /&gt;&lt;br /&gt;As already stated in previous posts: The use of the methods described so far enables us to form &lt;span style="font-style: italic;"&gt;hypotheses&lt;/span&gt; but in no way it is assumed that associations found are the definite cause of a specific event.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8461399928706898263?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8461399928706898263/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8461399928706898263' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8461399928706898263'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8461399928706898263'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/05/twitter-analytics-cluster-analysis.html' title='Twitter Analytics : Cluster Analysis reveals similar Twitter Users'/><author><name>Themos 
Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_koDJi0ps7Mw/ShfOK29ELZI/AAAAAAAAAPw/rbF10AShQtM/s72-c/clust1.JPG' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-1411576489274620189</id><published>2009-05-15T09:16:00.059+03:00</published><updated>2009-05-18T16:05:03.727+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Twitter Analytics : Bio information and popularity</title><content type='html'>&lt;div style="text-align: justify;"&gt;In the previous post we identified words used in Tweets that appear to be associated with a low number of followers : We found that when someone uses foul or negative language, his/her follower count appears to be affected negatively (see &lt;a href="http://lifeanalytics.blogspot.com/2009/05/twitter-analytics-these-words-may-be.html"&gt;here&lt;/a&gt; for more).&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;It is time to identify the words contained in the biographies of popular Twitter users, and more specifically in the biographies of users in the top 30% (in terms of number of followers) of a random sample of 10000 users. As I have always stated in this series of posts : &lt;span style="font-style: italic;"&gt;Treat results as possible clues only. &lt;/span&gt;&lt;span&gt;Please also notice how I used (in this and older posts) the words "appears" or "were found" when discussing correlation.&lt;/span&gt; The technique shown is the same as discussed in the previous post. 
Results are as follows :&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/Sg--veU2RmI/AAAAAAAAAPY/hwo4obeZX5Y/s1600-h/twitterbios.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 533px; height: 335px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/Sg--veU2RmI/AAAAAAAAAPY/hwo4obeZX5Y/s400/twitterbios.JPG" alt="" id="BLOGGER_PHOTO_ID_5336693806260962914" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Student&lt;/span&gt; appears to be correlated with low popularity accounts.&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Engineer &lt;/span&gt;also appears to exist often in low popularity accounts although the correlation was not found to be as strong as for students.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;Common words existing in popular users Bio appear to be the following : &lt;span style="font-style: italic;"&gt;social, media, marketing, CEO, founder, author, entrepreneur, blog, twitter, news, writer, internet. &lt;/span&gt;&lt;span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Some comments :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;It is not suggested that by having specific words in your bio, you will get more followers. Many other things are and could be important in achieving a high follower count. 
The same applies to unpopular accounts.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;Looking at the results I wondered why students were found to be associated with low follower numbers, and I think that this requires more attention. One possible reason could be that students might be spending most of their social media time on Facebook or other social media sites. There can be many pitfalls in performing random sampling from Twitter and "Students" could be one of these cases. Please share your comments on this.&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Notice that some words that appear to be associated with high follower numbers are words that communicate authority (e.g. founder, CEO).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;To recap from the last 3 posts :&lt;br /&gt;&lt;br /&gt;1) Do not use foul language - keep your conversations positive.&lt;br /&gt;2) Use "Thank you" often. "Stay tuned" seems to work well also.&lt;br /&gt;3) Post frequently.  
Posting some links is also important.&lt;br /&gt;4) Make sure you have a good Bio filled in.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;Finally, if you find the contents of this blog interesting you can always have a look for more updates on my new account on Twitter  &lt;a href="http://twitter.com/lifeanalytics"&gt;@lifeanalytics&lt;/a&gt;&lt;span style="text-decoration: underline;"&gt;&lt;/span&gt; and also send me your suggestions and/or comments.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-1411576489274620189?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/1411576489274620189/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=1411576489274620189' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/1411576489274620189'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/1411576489274620189'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/05/twitter-analytics-bio-information-and.html' title='Twitter Analytics : Bio information and popularity'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://1.bp.blogspot.com/_koDJi0ps7Mw/Sg--veU2RmI/AAAAAAAAAPY/hwo4obeZX5Y/s72-c/twitterbios.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7577495947708581792</id><published>2009-05-06T00:59:00.106+03:00</published><updated>2009-05-12T08:27:36.250+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Twitter Analytics : These words may be affecting  your popularity</title><content type='html'>Text Mining techniques can be used to identify specific words that  are correlated with Twitter accounts having high or low popularity. This can be done in two ways :  (1) By analyzing the text of the Tweets of each user and (2) By analyzing the text of the biography of each user.&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Let's start with the results of the first type of analysis with data originating from user Tweets. &lt;span&gt;Pay attention only to cells that are&lt;/span&gt; &lt;span&gt;highlighted in red&lt;/span&gt;&lt;span style="font-style: italic;"&gt;,&lt;/span&gt; their corresponding category column (LOWFOLLOWERS , HIGHFOLLOWERS) and the word at the beginning of each corresponding row.&lt;span style="font-style: italic;"&gt; &lt;/span&gt;&lt;span&gt;Results show which&lt;/span&gt; &lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;words&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;appear &lt;/span&gt;&lt;span&gt;to be important especially because the affinity shown here is moderate&lt;/span&gt;&lt;span style="font-style: italic;"&gt;. 
Use results as possible clues only.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SgRYwySVYfI/AAAAAAAAAO4/vcT3pOfZFHg/s1600-h/twitt-table1.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 515px; height: 278px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SgRYwySVYfI/AAAAAAAAAO4/vcT3pOfZFHg/s400/twitt-table1.JPG" alt="" id="BLOGGER_PHOTO_ID_5333485453869146610" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The results so far show us that :&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;hate, bed&lt;/span&gt; : are found to be correlated with low popularity&lt;br /&gt;&lt;span style="font-style: italic;"&gt;top, online, send, list,web,media, join&lt;/span&gt; : with high popularity&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Here is another portion of the results table  :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SgRY9GlipFI/AAAAAAAAAPA/H3zizp1N9yY/s1600-h/twitt-table2.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 515px; height: 279px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SgRY9GlipFI/AAAAAAAAAPA/H3zizp1N9yY/s400/twitt-table2.JPG" alt="" id="BLOGGER_PHOTO_ID_5333485665476846674" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The pattern should be evident by now : Words of negative attitude appear to be influencing  a user's follower count negatively. As also shown above, foul language appears to work negatively also. Several other insights were found  such as the existence of specific phrases that are correlated with low popularity ("watching TV")  while  other phrases ("stay tuned" ) with popular accounts. 
The number shown in parentheses quantifies the magnitude of the association that each word has and thus enables us to order words by their importance.&lt;br /&gt;&lt;br /&gt;Some of the words -and their synonyms- that were found to be associated with &lt;span style="font-style: italic;"&gt;very&lt;/span&gt; low follower counts are :&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;br /&gt;- Sleep, Hate, Damn, Feeling, Homework, Class, Boring, Stuck&lt;br /&gt;&lt;/span&gt;&lt;span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;A total of 63 words and 25 phrases were found to have either a positive or a negative association with the follower count. Interestingly, specific phrases that communicate any kind of &lt;span&gt;opportunity&lt;/span&gt; are also associated with a high number of followers. "Thank you" is strongly associated with high popularity.&lt;br /&gt;&lt;br /&gt;Here comes the interesting part : Once the Text Mining analysis is completed, a predictive model can be generated that may be used for scoring &lt;span style="font-style: italic;"&gt;future&lt;/span&gt; Tweets. Let's assume that you are about to send the following 2 Tweets :&lt;br /&gt;&lt;br /&gt;1)&lt;span style="font-style: italic;"&gt; 'Today i feel like sleeping all day. Yawn...'&lt;/span&gt;&lt;br /&gt;2)&lt;span style="font-style: italic;"&gt; '@xyz Your website traffic can be increased with good marketing'&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Before you post, however, you decide to feed these 2 sentences to a predictive model. For every Tweet the predictive model returns the predicted result (GOOD or BAD) and the associated probability. 
Here are the results for these 2 examples from an actual run :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SgPn693Y5_I/AAAAAAAAAOw/gFHF7pqy8Zw/s1600-h/classifier-run.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 104px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SgPn693Y5_I/AAAAAAAAAOw/gFHF7pqy8Zw/s400/classifier-run.JPG" alt="" id="BLOGGER_PHOTO_ID_5333361383962109938" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In other words :&lt;br /&gt;&lt;br /&gt;1) The first Tweet may have a negative effect with a probability of 83.5%&lt;br /&gt;2) The second Tweet may have a positive effect with a probability of 99.9%&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Note that :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A predictive model is able to consider &lt;span&gt;combinations&lt;/span&gt; of words, not just single words. This can considerably raise the accuracy of a prediction.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;In any real world application of Text Mining a 100% prediction accuracy &lt;span&gt;cannot&lt;/span&gt; be achieved: Although application-specific, a 72-78% accuracy may be achieved - with considerable effort. Of course many more things are important to achieve high popularity and the example above is given merely to discuss what techniques currently exist. A combination of analytical techniques is the best option and this will be discussed in a future post.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Several other types of analysis can extract similarly interesting insights : Let's not forget that Tweets contain the emotions, beliefs and values of users. They contain what people want and what they don't want. 
See &lt;a href="http://lifeanalytics.blogspot.com/2009/01/clustering-thoughts-of-twitter-users.html"&gt;Clustering the thoughts of Twitter Users&lt;/a&gt; and &lt;a href="http://lifeanalytics.blogspot.com/2009/02/know-your-customers-twitter-way.html"&gt;Know your customers the Twitter way&lt;/a&gt; for a further discussion on this.&lt;br /&gt;&lt;br /&gt;There will be more to say about Text Mining and how it can be put to use by PR Agencies and Marketing companies with practical examples shortly.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7577495947708581792?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7577495947708581792/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7577495947708581792' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7577495947708581792'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7577495947708581792'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/05/twitter-analytics-these-words-may-be.html' title='Twitter Analytics : These words may be affecting  your popularity'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://3.bp.blogspot.com/_koDJi0ps7Mw/SgRYwySVYfI/AAAAAAAAAO4/vcT3pOfZFHg/s72-c/twitt-table1.JPG' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-9009707470865167303</id><published>2009-05-03T23:25:00.032+03:00</published><updated>2009-05-07T04:51:32.948+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Twitter Analytics : Which usage behavior attracts many followers?</title><content type='html'>&lt;div style="text-align: justify;"&gt;This is the first part of a series of posts where Data Mining and Text Mining will be applied  to extract potentially useful facts about the usage of Twitter and to draw some conclusions such as what makes a Twitter account interesting enough to other users.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The conclusions that will be presented here are from the analysis of 3651 Twitter accounts and are meant to show how Predictive Analytics can help. Please note that results are shown for informational purposes only. 
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;First, the data used can be summarized with the following table :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/Sf7RF33AmII/AAAAAAAAANw/wdUis1Q6pVg/s1600-h/twittertable.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 500px; height: 122px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/Sf7RF33AmII/AAAAAAAAANw/wdUis1Q6pVg/s400/twittertable.JPG" alt="" id="BLOGGER_PHOTO_ID_5331928907677472898" border="0" /&gt;&lt;/a&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;You can immediately see how extreme the ranges of the data are, especially for the number of "followers" and "following". This is to be expected since among the users captured were Jack Dorsey (co-founder of Twitter), Sen. McCain and George Stephanopoulos - users that obviously have a huge number of followers.&lt;br /&gt;&lt;br /&gt;Before finding which usage behavior attracts many followers, one should be able to identify what exactly a "popular Twitter account" is. Is it just the &lt;span style="font-style: italic;"&gt;absolute&lt;/span&gt; number of followers? Perhaps it could be equally important -or at least interesting- to also look at :&lt;br /&gt;&lt;br /&gt;1) The followers/following ratio&lt;br /&gt;&lt;br /&gt;2) The number of followers per day&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;For our example the absolute number of followers was used as the only criterion of a successful Twitter account. 
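&lt;br /&gt;&lt;br /&gt;For readers who want to experiment with the alternative definitions of popularity listed above, here is a minimal Python sketch that computes all three measures for one account. The function name, the argument names and the sample numbers are hypothetical, and the handling of zero values is an assumption made only to keep the sketch safe to run.&lt;br /&gt;
&lt;pre&gt;
```python
def popularity_measures(followers, following, days_active):
    """Return the three candidate popularity measures for one account."""
    # Guard against division by zero for accounts that follow nobody
    # or that were created today (an assumption of this sketch).
    ratio = followers / following if following else float("inf")
    per_day = followers / days_active if days_active else float(followers)
    return {
        "followers": followers,
        "followers_per_following": ratio,
        "followers_per_day": per_day,
    }


# Hypothetical account: 1200 followers, following 300 users, 400 days old.
m = popularity_measures(followers=1200, following=300, days_active=400)
print(m)
```
&lt;/pre&gt;
Ratios of this kind are less dominated by a handful of celebrity accounts than the absolute follower count, which is one reason they may be worth a look.&lt;br /&gt;&lt;br /&gt;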
The results of the analysis can be summarized with the following decision tree :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/Sf9jw4cmRuI/AAAAAAAAAOQ/xPgWqTxMDW4/s1600-h/twitterdecisiontree.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 482px; height: 164px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/Sf9jw4cmRuI/AAAAAAAAAOQ/xPgWqTxMDW4/s400/twitterdecisiontree.jpg" alt="" id="BLOGGER_PHOTO_ID_5332090175267161826" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Some usage patterns that raise the chance of having a successful Twitter account are the following :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Having a bio is an absolute must : 82.3% of unsuccessful Twitter accounts have their biography information missing.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;You should provide more than 3 links per 20 tweets and also more than 0.960 updates per day.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If you don't want to provide more than 3 links per 20 tweets, then try to post more than 5.857 updates per day.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Users that post more than 3 links per 20 tweets but post 0.960 or fewer updates per day will need more than 222.5 days of usage to get an adequate number of followers.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;By using Feature Selection we are also able to look at the relative importance of each parameter in achieving many followers : Here are the results of Feature Selection using the ChiSquare, GainRatio and InfoGain attribute evaluators.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;=== Attribute selection 10 fold cross-validation (stratified), seed: 1 ===&lt;br /&gt;&lt;br /&gt;average merit      average rank  attribute&lt;br /&gt;362.743 +-10.419     1   +- 0       4 numberOfLinks&lt;br /&gt;319.397 +-10.133   
  2.4 +- 0.49    6 hasBlankProfile?&lt;br /&gt;311.661 +- 8.612     2.6 +- 0.49    7 updatesPerDay&lt;br /&gt;192.525 +- 7.481     4.1 +- 0.3     3 retweetsNumber&lt;br /&gt;178.236 +- 5.963     4.9 +- 0.3     1 elapsedDays&lt;br /&gt;36.148 +- 3.579     6   +- 0       2 otherUsersTalk&lt;br /&gt;17.843 +- 4.475     7   +- 0       5 questionsAsked&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;average merit      average rank  attribute&lt;br /&gt;0.1   +- 0.003     1   +- 0       6 hasBlankProfile?&lt;br /&gt;0.042 +- 0.001     2.4 +- 0.49    4 numberOfLinks&lt;br /&gt;0.039 +- 0.002     3.2 +- 0.6     3 retweetsNumber&lt;br /&gt;0.04  +- 0.004     3.4 +- 0.92    7 updatesPerDay&lt;br /&gt;0.025 +- 0.001     5   +- 0       1 elapsedDays&lt;br /&gt;0.011 +- 0.001     6   +- 0       2 otherUsersTalk&lt;br /&gt;0.005 +- 0.001     7   +- 0       5 questionsAsked&lt;br /&gt;&lt;br /&gt;average merit      average rank  attribute&lt;br /&gt;0.082 +- 0.002     1   +- 0       4 numberOfLinks&lt;br /&gt;0.074 +- 0.003     2.1 +- 0.3     6 hasBlankProfile?&lt;br /&gt;0.071 +- 0.002     2.9 +- 0.3     7 updatesPerDay&lt;br /&gt;0.044 +- 0.002     4.1 +- 0.3     3 retweetsNumber&lt;br /&gt;0.041 +- 0.001     4.9 +- 0.3     1 elapsedDays&lt;br /&gt;0.008 +- 0.001     6   +- 0       2 otherUsersTalk&lt;br /&gt;0.004 +- 0.001     7   +- 0       5 questionsAsked&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We see that all three attribute evaluators agree that the number of links  provided on Tweets and whether the profile of the user is filled in are the two most important parameters in achieving many followers. Notice also that sending messages to other users (otherUsersTalk) and asking questions (questionsAsked) is not as important as one would expect.&lt;br /&gt;&lt;br /&gt;The analysis shown above gives many insights but it does not take into account what the users say and how this affects the popularity of a Twitter account. 
Text Mining will try to give some answers for this question and also identify which keywords on Twitter profiles seem to be associated with many followers.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-9009707470865167303?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/9009707470865167303/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=9009707470865167303' title='17 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/9009707470865167303'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/9009707470865167303'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/05/twitter-analytics-which-usage-behavior.html' title='Twitter Analytics : Which usage behavior attracts many followers?'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/Sf7RF33AmII/AAAAAAAAANw/wdUis1Q6pVg/s72-c/twittertable.JPG' height='72' width='72'/><thr:total>17</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-2245360945335358166</id><published>2009-04-27T20:39:00.015+03:00</published><updated>2009-05-15T09:39:17.850+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text 
mining'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Twitter Analytics : Words that make a difference</title><content type='html'>&lt;div style="text-align: justify;"&gt;Predictive Analytics are already widely used on Twitter to extract -potentially- interesting insights. In &lt;a href="http://lifeanalytics.blogspot.com/search/label/twitter"&gt;previous posts&lt;/a&gt; we discussed :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Sentiment Analysis and Ontologies&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Analyzing the biographies of Twitter users and identifying clusters of similar users.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Cluster Analysis on the thoughts of Twitter users&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Identifying the values and beliefs of Twitter users.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;One more interesting insight is the knowledge of what makes a Twitter user have many followers. Consider the following questions :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Are there words that could potentially decrease the popularity of a Twitter account?&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;How important is it to have an actual photo (and not the default o_O photo)?&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Which interests or professions tend to be associated with many followers?&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;How important is it to have &lt;span style="font-style: italic;"&gt;at least&lt;/span&gt; a short text of biography information?&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;To answer these questions, data from 100000 Twitter users were collected over the past few weeks. 
Information collected includes the number of followers, number of friends, total updates, number of Retweets (per 20 tweets), number of replies to other users, number of links to external URLs, number of months that the user has been on Twitter, etc. Here is what the data looks like :&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SfY0zYx3yKI/AAAAAAAAANo/FL6Lxy6b9Io/s1600-h/usercapt.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 577px; height: 301px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SfY0zYx3yKI/AAAAAAAAANo/FL6Lxy6b9Io/s400/usercapt.JPG" alt="" id="BLOGGER_PHOTO_ID_5329505266469161122" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;You will notice that the caret '^' is used as the field separator. The first portion of each line contains the user name, date of account creation, months elapsed since account creation, number of friends, number of re-tweets etc.&lt;br /&gt;&lt;br /&gt;The first analysis performed was to identify whether specific keywords that exist in user biographies seem to be associated with a large number of followers. A second type of analysis was performed with numeric data only (such as number of re-tweets, number of user replies, number of updates, etc). A third type of analysis uses both a vector of keywords and the numerical data. 
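&lt;br /&gt;&lt;br /&gt;To make the record layout a little more concrete, here is a minimal Python sketch that splits one caret-separated line into named fields. Only the first five fields mentioned above are handled; the field names, the sample record and the ISO date format are assumptions made for illustration and are not taken from the actual capture file.&lt;br /&gt;
&lt;pre&gt;
```python
from datetime import date

# Hypothetical names for the first five fields described in the text.
FIELDS = ["user_name", "created", "months_elapsed", "friends", "retweets"]


def parse_record(line, sep="^"):
    """Split one caret-separated line into a dictionary of typed fields."""
    parts = line.strip().split(sep)
    record = dict(zip(FIELDS, parts))
    # Assumes an ISO date (YYYY-MM-DD); the real file may use another format.
    record["created"] = date.fromisoformat(record["created"])
    for key in ("months_elapsed", "friends", "retweets"):
        record[key] = int(record[key])
    return record


# Made-up sample line, not a real user.
sample = "some_user^2008-11-03^6^154^2"
rec = parse_record(sample)
print(rec)
```
&lt;/pre&gt;
Records parsed this way can feed the numeric analysis directly, while the biography text needs to be turned into a keyword vector first.&lt;br /&gt;&lt;br /&gt;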
Since a lot of work is needed, the process (but not all results) will be presented during the next posts.&lt;br /&gt;&lt;br /&gt;FYI  : Users that tend to use a lot the words "boredom", "boring" or "bored"  tend to minimize their chances of being popular.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-2245360945335358166?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/2245360945335358166/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=2245360945335358166' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2245360945335358166'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2245360945335358166'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/04/twitter-analytics-words-that-make.html' title='Twitter Analytics : Words that make a difference'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_koDJi0ps7Mw/SfY0zYx3yKI/AAAAAAAAANo/FL6Lxy6b9Io/s72-c/usercapt.JPG' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7958861129239442111</id><published>2009-03-12T12:56:00.010+02:00</published><updated>2009-04-06T21:47:07.288+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='scoutlabs'/><title type='text'>Social Media Monitoring with ScoutLabs - Interview</title><content type='html'>&lt;div style="text-align: justify;"&gt;In my previous posts I have shown some &lt;a href="http://lifeanalytics.blogspot.com/search/label/twitter"&gt;examples&lt;/a&gt; of Sentiment Analysis using Twitter. I came across a Sentiment Analysis product named &lt;a href="http://www.scoutlabs.com/"&gt;ScoutLabs&lt;/a&gt; which is able to give insights on what customers are saying on the Web about a product or service and decided to interview ScoutLabs CEO Jennifer Zeszut :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SbjvUor-2II/AAAAAAAAANg/4ZkMzPrPthQ/s1600-h/scoutlabs1.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 258px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SbjvUor-2II/AAAAAAAAANg/4ZkMzPrPthQ/s400/scoutlabs1.jpg" alt="" id="BLOGGER_PHOTO_ID_5312258898281814146" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;- &lt;i&gt;Please tell us about ScoutLabs and how companies may benefit from using it.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:Calibri,Verdana,Helvetica,Arial;"&gt;&lt;span style="font-size:11;"&gt;Scout Labs is a powerful, web-based application that finds signals in the noise of social media to help teams build better products and stronger customer relationships.&lt;br 
/&gt;&lt;br /&gt;Scout Labs is a product company, not an agency. We provide cutting-edge technology and a collaborative platform for companies and their agents to listen to customers and engage with them out across the Internet. With Scout Labs, our users:&lt;br /&gt;&lt;br /&gt;   * Know when to tune in and what’s most important to pay attention to&lt;br /&gt;   * Hear what customers love and hate about brands&lt;br /&gt;   * Reach out to influential customers to build relationships&lt;br /&gt;   * Engage in proactive customer service&lt;br /&gt;   * Let the voice of the customer inspire new product and marketing ideas&lt;br /&gt;&lt;br /&gt;Scout Labs has grown significantly since it was founded in 2006. With offices in San Francisco and users all over the world, the company currently employs over 20 professionals. Our CEO and product team guide the application with insight from the world of marketing, brand management and product management, but the majority of Scout Labs employees are senior engineers with expertise in search technology, high-performance systems, natural language processing, machine learning, web crawling and data visualization.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;- &lt;i&gt;How is ScoutLabs different from other sentiment analysis solutions?&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Many sentiment analysis solutions are really human-powered, which is great, if you’ve got a big budget and or a lot of time. Ours is automated, and we process millions of posts per day and score it for sentiment as it happens, with an accuracy rate (agrees with humans) 73% of the time and it will get better, because as users (across our system) change sentiment values in our system, we aggregate that data and use it as labeled data to improve our algorithms even further. 
We can also “backfill” or back-score 3 months of previous data with its sentiment scores in 20 minutes to (at most) 24 hours.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;- &lt;i&gt;&lt;b&gt;What kind of information (product names, company names, areas, city names) can ScoutLabs identify in user conversations "Out of the Box" ?&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Scout Labs lets you track anything you like. It is purely search based (not a database of x fixed company names). We see searches on everything! Company names, product names, people, industries, “the price of rubber”, green energy, styrofoam, “tricked out shoe”, “canceling cable”, “favorite hotel in bali”-- you name it. For some of these, sentiment doesn’t make a ton of sense, but we’ll score it for you nonetheless (just in case).&lt;br /&gt;&lt;b&gt;&lt;i&gt;&lt;br /&gt;- One interesting capability is identifying the keywords associated with each product. For example, consider two different brands of running shoes such as Nike and Adidas. "Great design" might be associated with the first brand while the other brand is perceived as being "comfortable". Is ScoutLabs able to automatically identify such information between two or more similar brands-products?&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;&lt;/b&gt;We do offer an analysis of the top conversations associated with any search. This comes straight from the conversations themselves. They are the frequent words that are emerging from the conversations for the time period. And yes! Very often we find powerful adjectives that describe a brand (sometimes good, but not always). For example, when I did a search for Dora the Explorer (the popular cartoon character for the pre-school set), the words that emerged were “skank” and “sexy”. Huh? Sure enough, parents are outraged by a Mattel announcement about how they are going to have Dora grow up, move to New York and get fashionable (with a short skirt). 
This is a true story: &lt;a href="http://www.scoutlabs.com/2009/03/12/scandalous-doll-drama-scout-labs-style/"&gt;http://www.Scoutlabs.Com/2009/03/12/scandalous-doll-drama-scout-labs-style/&lt;/a&gt;&lt;br /&gt;&lt;i&gt;  &lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;- &lt;i&gt;How much should one expect to pay for such a service?&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;/div&gt;A single team with 25 searches (which is where most companies start) is only $249. For now. We have only committed to this low price for the first 1000 companies to try us. That price may go up one day soon. All our pricing plans are here: &lt;a class="linkification-ext" href="http://www.Scoutlabs.Com/plans/" title="Linkification: http://www.Scoutlabs.Com/plans/"&gt;http://www.Scoutlabs.Com/plans/&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7958861129239442111?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7958861129239442111/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7958861129239442111' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7958861129239442111'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7958861129239442111'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/03/social-media-monitoring-with-scoutlabs.html' title='Social Media Monitoring with ScoutLabs - Interview'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' 
height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_koDJi0ps7Mw/SbjvUor-2II/AAAAAAAAANg/4ZkMzPrPthQ/s72-c/scoutlabs1.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-5866891251659525816</id><published>2009-02-23T13:49:00.033+02:00</published><updated>2009-03-12T13:05:06.450+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><category scheme='http://www.blogger.com/atom/ns#' term='ontologies'/><category scheme='http://www.blogger.com/atom/ns#' term='information extraction'/><title type='text'>Making more sense out of Twitter Tweets</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Over the last 5 posts i have described how unstructured text information from Twitter can be used for Knowledge Extraction. Specific examples were given such as &lt;a href="http://lifeanalytics.blogspot.com/2009/02/sentiment-mining-for-amazons-kindle.html"&gt;Sentiment Analysis for products (Amazon's Kindle)&lt;/a&gt;, &lt;a href="http://lifeanalytics.blogspot.com/2009/02/know-your-customers-twitter-way.html"&gt;Segmentation of Twitter users&lt;/a&gt;, and finally &lt;a href="http://lifeanalytics.blogspot.com/2009/01/clustering-thoughts-of-twitter-users.html"&gt;cluster analysis of the emotions and thoughts expressed from twitter users&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;So far i have discussed some ways that text mining could help us in getting more insight on how people think. 
Now it is time to bring Information Extraction and Ontologies into the equation.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Information_extraction"&gt;Information Extraction&lt;/a&gt; (IE) is the automated extraction of information such as (to name a few) names (first names, city names, country names etc), facts or events from unstructured text. An example of IE was given in &lt;a href="http://lifeanalytics.blogspot.com/search/label/real%20estate"&gt;these posts&lt;/a&gt; where thousands of adverts for flats are extracted and then data mining analysis is performed to identify which characteristics are important for achieving a high rental price.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Ontology_%28computer_science%29"&gt;Ontologies&lt;/a&gt; are used for knowledge representation and may also be used for structuring the information that exists on the web.  To give an example, consider  the following product keywords :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Coke&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Sprite&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Dr Pepper&lt;/li&gt;&lt;/ul&gt;If someone asks you what they have in &lt;span&gt;common&lt;/span&gt;, your brain looks for generalizations and comes up with the following answers :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;They are all &lt;span style="font-style: italic;"&gt;Carbonated Drinks&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;(Possibly) &lt;span&gt;they&lt;/span&gt;&lt;span style="font-style: italic;"&gt; all contain sugar&lt;/span&gt; since the word "Diet" or "Zero" or "Light" is not mentioned.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Now let's assume we have an Ontology Engine that can do this and &lt;span&gt;infer automatically&lt;/span&gt; that all these products are sugared carbonated drinks. Such an action enables us to extract facts in a more coherent way. 
The reason behind this is that we lessen the effect discussed in &lt;a href="http://lifeanalytics.blogspot.com/2009/01/statistics-of-everyday-talk.html"&gt;The Statistics of Everyday Talk&lt;/a&gt; and thus are able to capture growing trends such as people expressing their thoughts regarding &lt;span style="font-style: italic;"&gt;carbonated drinks&lt;/span&gt; rather than matching "Coke", "Sprite" and "Dr Pepper" individually. Without Ontologies such a trend could be easily missed.&lt;br /&gt;&lt;br /&gt;By using Ontologies or taxonomies where applicable, an associations discovery algorithm can search at different levels of detail. For example, data miners usually employ taxonomic information (e.g. Sprite, Coke, Pepsi = carbonated drinks) when performing associations discovery analysis on supermarket data, and the effort of applying taxonomies almost always pays off in terms of the knowledge extracted regarding consumer behavior.&lt;br /&gt;&lt;br /&gt;I have used Ontologies over the past 3 years and have seen them in action. That Ontologies can give access to inference and deductive reasoning techniques is of great use. 
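&lt;br /&gt;&lt;br /&gt;The taxonomy roll-up described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the taxonomy is a tiny hand-made dictionary, the tweets are invented, and the matching is a naive substring test rather than a real Ontology Engine.

```python
from collections import Counter

# Hypothetical hand-made taxonomy: product keyword to broader category.
TAXONOMY = {
    "coke": "carbonated drinks",
    "sprite": "carbonated drinks",
    "dr pepper": "carbonated drinks",
    "pepsi": "carbonated drinks",
}


def count_mentions(tweets, taxonomy):
    """Count keyword mentions per product and per taxonomy category."""
    by_product = Counter()
    by_category = Counter()
    for tweet in tweets:
        text = tweet.lower()
        for keyword, category in taxonomy.items():
            if keyword in text:  # naive substring match, for illustration
                by_product[keyword] += 1
                by_category[category] += 1
    return by_product, by_category


# Invented tweets, each mentioning a different product.
tweets = [
    "Had a Coke with lunch",
    "Sprite is all I drink these days",
    "Dr Pepper, anyone?",
]
by_product, by_category = count_mentions(tweets, TAXONOMY)
print(by_product)
print(by_category)
```

Each product is mentioned only once, so no single keyword stands out, yet the category count adds the mentions together. That is the kind of trend that is easy to miss without a taxonomy.&lt;br /&gt;&lt;br /&gt;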
The application of Information Extraction, Natural Language Processing and subsequent insertion of this information in an Ontological setting has  many potential applications.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-5866891251659525816?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/5866891251659525816/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=5866891251659525816' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/5866891251659525816'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/5866891251659525816'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/02/making-more-sense-out-of-twitter-tweets.html' title='Making more sense out of Twitter Tweets'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-2888714321826440230</id><published>2009-02-15T12:40:00.024+02:00</published><updated>2009-05-08T21:13:43.942+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Know your 
customers - The Twitter way</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;br /&gt;The more I analyze tweets on Twitter, the more interesting I find the whole process. First it was &lt;a href="http://lifeanalytics.blogspot.com/2009/01/clustering-thoughts-of-twitter-users.html"&gt;Cluster analysis of specific thoughts expressed by Twitter users&lt;/a&gt; and then it was &lt;a href="http://lifeanalytics.blogspot.com/2009/02/sentiment-mining-for-amazons-kindle.html"&gt;Sentiment Mining for Amazon's Kindle.&lt;/a&gt; It was just a matter of time before I had the urge to analyze Tweets from a broader perspective.&lt;br /&gt;&lt;br /&gt;So I decided to perform a segmentation of the Twitter users : extract common groups of users but this time not for &lt;span style="font-style: italic;"&gt;specific thoughts&lt;/span&gt; or &lt;span style="font-style: italic;"&gt;specific products&lt;/span&gt; but on a more generic basis.&lt;br /&gt;&lt;br /&gt;I had two goals in this cluster analysis :&lt;br /&gt;&lt;br /&gt;1) Cluster the biographies of users&lt;br /&gt;2) Cluster the tweets of the users.&lt;br /&gt;&lt;br /&gt;I then decided that the more information I could collect the better, so the first thing I did was to write a 'spider' program to extract 10,000 twitter user names. 
Then for each twitter user the software visits his/her page and extracts :&lt;br /&gt;&lt;br /&gt;a) The user's bio&lt;br /&gt;b) Number of followers&lt;br /&gt;c) Number of people following&lt;br /&gt;d) Number of updates&lt;br /&gt;e) 20 latest Tweets&lt;br /&gt;f) Number of re-tweets&lt;br /&gt;g) Number of replies to other users (e.g. when an @user directive exists)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Let's see now what we could -potentially- do with such information :&lt;br /&gt;&lt;br /&gt;1) Cluster analysis on user bios&lt;br /&gt;&lt;br /&gt;2) Cluster analysis on user tweets&lt;br /&gt;&lt;br /&gt;3) Classification analysis for identifying the common characteristics of users with many followers&lt;br /&gt;&lt;br /&gt;4) Associations discovery between products : Which products tend to be mentioned &lt;span&gt;together&lt;/span&gt; in each user's tweets?&lt;br /&gt;&lt;br /&gt;5) Identification of common keywords per cluster : If we identify a cluster of users that we characterize as the "Parents", what keywords do "Parents" tend to use more? What about the "Tech junkies" cluster?&lt;br /&gt;&lt;br /&gt;But let's start with the first analysis : Clustering the biographies of Twitterers. The analysis generated 30 clusters of users. Some of them are :&lt;br /&gt;&lt;br /&gt;1) The Parents&lt;br /&gt;2) The computer Geeks&lt;br /&gt;3) The students&lt;br /&gt;4) The social media addicts&lt;br /&gt;5) The entrepreneurs&lt;br /&gt;&lt;br /&gt;I looked at the "Parents" cluster more closely and wanted to find keywords that this cluster is associated with : &lt;span style="font-style: italic;"&gt;Single &lt;/span&gt;and &lt;span style="font-style: italic;"&gt;Jesus &lt;/span&gt;were some of them.&lt;br /&gt;&lt;br /&gt;So we immediately identify one of the many customer groups : The parents, a significant percentage of whom are &lt;span style="font-style: italic;"&gt;single. 
&lt;/span&gt;The "Parents" cluster also expresses one of its &lt;span style="font-style: italic;"&gt;values&lt;/span&gt; :&lt;span style="font-style: italic;"&gt; &lt;/span&gt;Christianity.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;br /&gt;By moving on to each generated cluster and finding the associated keywords, i was able to retrieve &lt;span&gt;the values and beliefs&lt;/span&gt;&lt;span style="font-style: italic;"&gt; &lt;/span&gt;of each cluster. Knowledge Extraction at its best.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-2888714321826440230?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/2888714321826440230/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=2888714321826440230' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2888714321826440230'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2888714321826440230'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/02/know-your-customers-twitter-way.html' title='Know your customers - The Twitter way'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' 
src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-4677079308616046161</id><published>2009-02-11T16:32:00.020+02:00</published><updated>2009-05-08T09:57:10.864+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Sentiment Mining for Amazon's Kindle</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Following the post on &lt;a href="http://lifeanalytics.blogspot.com/2009/01/clustering-thoughts-of-twitter-users.html"&gt;Clustering the thoughts of Twitter users&lt;/a&gt;, it is time to look at another example where Twitter can be used. So I decided to analyze -just- 1054 tweets that are about Amazon's e-reader &lt;a href="http://www.amazon.com/Kindle-Amazons-Wireless-Reading-Device/dp/B000FI73MA"&gt;Kindle&lt;/a&gt; to see what I could come up with.&lt;br /&gt;&lt;br /&gt;My goal was not to classify tweets as positive or negative in sentiment but to extract the general "buzz" about the product by means of cluster analysis. After extracting the tweets that contain the word "kindle" I went on to remove irrelevant information (such as tinyurl links) using regular expressions.&lt;br /&gt;&lt;br /&gt;Next, it was time to understand the data and a good way to do this is to look at word frequencies using &lt;a href="http://www.niederlandistik.fu-berlin.de/textstat/software-en.html"&gt;TextStat&lt;/a&gt;. 
Here is what I came up with :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SZLpmtQuFKI/AAAAAAAAANA/PjSNQYPpiEM/s1600-h/textstat.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 210px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SZLpmtQuFKI/AAAAAAAAANA/PjSNQYPpiEM/s400/textstat.JPG" alt="" id="BLOGGER_PHOTO_ID_5301556562562520226" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;At the top of the word frequency list are the usual suspects :   "I", "and", "to", but also "kindle", "kindle2" and "amazon", which was expected. Now, let's see some of the words that do not occur frequently :&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_koDJi0ps7Mw/SZLrjmFPK7I/AAAAAAAAANI/HojL0VKZK-U/s1600-h/textstat2.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 218px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/SZLrjmFPK7I/AAAAAAAAANI/HojL0VKZK-U/s400/textstat2.JPG" alt="" id="BLOGGER_PHOTO_ID_5301558708118956978" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;One fact here requires attention : Text miners use stop-word lists to remove the most frequent words but they also remove words that do not occur frequently. The table above shows that one infrequently occurring word is &lt;span style="font-style: italic;"&gt;disappointed&lt;/span&gt; and if we had chosen to omit words of a specific frequency range -such as less than 3- we could lose this important information. So caution is needed.&lt;br /&gt;&lt;br /&gt;After running the analysis, I came up with 20 different clusters of similar "thinking". 
Note that we are interested not only in what those clusters are but also -more importantly- in the proportion of cases that each cluster contains (see &lt;a href="http://lifeanalytics.blogspot.com/2009/01/clustering-thoughts-of-twitter-users.html"&gt;previous&lt;/a&gt; post). Some examples of the clusters found are :&lt;br /&gt;&lt;br /&gt;1) A cluster of users that are questioning the usefulness of the product&lt;br /&gt;2) Excited users&lt;br /&gt;3) Users that are happy about the text-to-speech feature of the product&lt;br /&gt;4) Text-to-speech and potential copyright issues&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Twitter is a great source for sentiment extraction but one problem is the fact that people are re-tweeting the same news (" The new Kindle 2 is out") or they tweet about similar information from various tech news websites.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-4677079308616046161?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/4677079308616046161/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=4677079308616046161' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4677079308616046161'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4677079308616046161'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/02/sentiment-mining-for-amazons-kindle.html' title='Sentiment Mining for Amazon&apos;s Kindle'/><author><name>Themos 
Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_koDJi0ps7Mw/SZLpmtQuFKI/AAAAAAAAANA/PjSNQYPpiEM/s72-c/textstat.JPG' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-898843660289624703</id><published>2009-01-21T02:51:00.054+02:00</published><updated>2009-05-08T09:59:48.697+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Clustering the thoughts of Twitter Users</title><content type='html'>In the last two posts I presented the reasons for, and some problems with, analyzing the thoughts of users on the web and particularly on Twitter. (For more see  &lt;a href="http://lifeanalytics.blogspot.com/2009/01/emotions-beliefs-and-analytics.html"&gt;Part1&lt;/a&gt; and &lt;a href="http://lifeanalytics.blogspot.com/2009/01/statistics-of-everyday-talk.html"&gt;Part2&lt;/a&gt; ).&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;As an example, we are going to be looking at a specific kind of thought that Twitter users express : What they &lt;span style="font-style: italic;"&gt;don't&lt;/span&gt; want. By using the Twitter API I managed to extract all tweets containing the phrase "i don't want to". 
The following text file shows the results :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/SXnELPQUcLI/AAAAAAAAAMg/4ykX8p6E95s/s1600-h/idontwanto1.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 547px; height: 262px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/SXnELPQUcLI/AAAAAAAAAMg/4ykX8p6E95s/s400/idontwanto1.JPG" alt="" id="BLOGGER_PHOTO_ID_5294478534303314098" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The next step is to remove all phrases that do not give us any information about what users do not want :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_koDJi0ps7Mw/SXwhgt0pMNI/AAAAAAAAAM4/wY4c-d3KYXI/s1600-h/idontwanto2.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 544px; height: 360px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/SXwhgt0pMNI/AAAAAAAAAM4/wY4c-d3KYXI/s400/idontwanto2.JPG" alt="" id="BLOGGER_PHOTO_ID_5295144107820789970" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Finally we remove the phrase "i don't want to". However, consider the following example:&lt;br /&gt;&lt;br /&gt;"I must go to Chicago. I don't want to do that"&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The steps discussed above will discard the first sentence, which is actually what the user does not want to do, and leave only the phrase "i don't want to do that", which is not particularly informative. At this point we must quantify the problem -let's assume it involves 8.5% of our records- and recall what the Pareto principle is all about.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;After some additional pre-processing steps which are not discussed here, I feed the data to K-Means to see the clusters the algorithm comes up with. 
For a better presentation of the results, here is a screen capture from   &lt;a href="http://www.alphaworks.ibm.com/tech/uimodeler"&gt;IBM's UI Modeler&lt;/a&gt; :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/SXnbpW83SdI/AAAAAAAAAMw/OhKpeHmMkTw/s1600-h/thoughtclusters.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 590px; height: 249px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/SXnbpW83SdI/AAAAAAAAAMw/OhKpeHmMkTw/s400/thoughtclusters.JPG" alt="" id="BLOGGER_PHOTO_ID_5294504340532709842" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;We immediately see -in descending order- what Twitter users do &lt;span style="font-style: italic;"&gt;not &lt;/span&gt;want :&lt;br /&gt;&lt;br /&gt;1) They do not want to go to work&lt;br /&gt;2) They do not want to go to school&lt;br /&gt;3) They do not want to hear about various issues&lt;br /&gt;4) They do not want to buy things&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Notice also the top two categories named Miscellaneous and None. These categories contain thoughts that occur too infrequently to form a cluster. These two categories make up 69.56% of our records and at this point we should think again about the Pareto principle.&lt;br /&gt;&lt;br /&gt;Please note that not all necessary work is discussed here and I had to omit several steps that have to take place. 
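&lt;br /&gt;&lt;br /&gt;To make the clustering step more concrete, here is a minimal sketch of K-Means on bag-of-words vectors in plain Python. It is an illustration under assumptions rather than the pipeline used for this post: the sample phrases, the function names and the choice of seed rows are all made up, and a real run would rely on a data mining toolkit plus the additional pre-processing mentioned above.

```python
from collections import Counter


def vectorize(texts):
    """Turn each text into a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    vectors = []
    for text in texts:
        counts = Counter(text.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, seeds, max_iter=20):
    """Plain K-Means; `seeds` lists the row indices used as initial centroids."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented phrases, as they might look once "i don't want to" is removed.
phrases = ["go to work", "go to work today", "go to school", "go to school tomorrow"]
vocab, vectors = vectorize(phrases)
labels = kmeans(vectors, seeds=[0, 2])

# Share of records per cluster, the proportion this post cares about.
sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print(cluster, size / len(phrases))
```

With real tweets many phrases are too rare to settle into a meaningful cluster, which is what the Miscellaneous and None categories above capture.&lt;br /&gt;&lt;br /&gt;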
To understand what people actually think, I am using an approach that combines Ontologies, Information Extraction, Clustering and Classification analysis, with the ultimate goal of minimizing the percentage of thoughts (69.56% in this example) that cannot form a cluster and increasing the accuracy of the analysis.&lt;br /&gt;&lt;br /&gt;It is also worth noting that we could move further down the sentence branch (see &lt;a href="http://lifeanalytics.blogspot.com/2009/01/statistics-of-everyday-talk.html"&gt;this&lt;/a&gt; post) for even better insight. Here I presented a cluster analysis of what users do not want. As an example of going deeper, we could apply clustering to user thoughts specifically for "I don't want to feel".&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-898843660289624703?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/898843660289624703/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=898843660289624703' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/898843660289624703'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/898843660289624703'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/01/clustering-thoughts-of-twitter-users.html' title='Clustering the thoughts of Twitter Users'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' 
src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/SXnELPQUcLI/AAAAAAAAAMg/4ykX8p6E95s/s72-c/idontwanto1.JPG' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-2860969719841529690</id><published>2009-01-15T18:52:00.027+02:00</published><updated>2009-05-15T09:40:21.580+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>The Statistics of Everyday Talk</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;br /&gt;As discussed in the previous &lt;a href="http://lifeanalytics.blogspot.com/2009/01/emotions-beliefs-and-analytics.html"&gt;post&lt;/a&gt;, the analysis of free text on the Web (for example, the thoughts expressed by Twitter users) could yield very interesting insights into how users think and how they behave.&lt;br /&gt;&lt;br /&gt;In 2001 I visited &lt;a href="http://www.trilliumsoftware.com/"&gt;Trillium&lt;/a&gt;, where I attended a very useful seminar on Data Cleaning, Data Quality and Standardization during which the &lt;a href="http://en.wikipedia.org/wiki/Pareto_principle"&gt;Pareto principle&lt;/a&gt; became, once again, evident.  
When someone wishes to standardize entries in a Database so that the word "Parkway" is written in the same way across all records, he might find the following distribution of "parkway" entries :&lt;br /&gt;&lt;br /&gt;15% of records contain the word "Parkway"&lt;br /&gt;3% of records contain the word "Pkwy"&lt;br /&gt;0.2% of records contain the word "Prkwy"&lt;br /&gt;0.01% of records contain the word "Parkwy"&lt;br /&gt;&lt;br /&gt;What that essentially means is that a single SQL query can find the most common variant, which covers 15% of records, and correct it to whatever standardized form is needed. For each of the remaining variations, however, one query solves only a very small fraction of the problem, and this in turn increases the amount of work required, sometimes overwhelmingly.&lt;br /&gt;&lt;br /&gt;In capturing and analyzing natural language we are confronted with the same problem : 60% of people might describe the fact that they don't want to go to sleep with a simple "I don't want to go to sleep". But another 20% might use something like "i don't feel like sleeping" and another 10% something like "i don't want to go to bed right now".&lt;br /&gt;&lt;br /&gt;So we immediately see one of the issues that Text Miners face : we can use different phrases to communicate the same meaning. If we wish to analyze text information for classification purposes (say, the sentiment of customers) we could achieve 60-65% accuracy in our results with some effort. 
For a mere 4-point increase in accuracy (from 65% to 69%) the amount of extra effort required could prove prohibitive.&lt;br /&gt;&lt;br /&gt;Consider the following chart :&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/SXL8sRvKeOI/AAAAAAAAAMY/DjgrrIoFr4c/s1600-h/sentencebranch.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 292px; height: 400px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/SXL8sRvKeOI/AAAAAAAAAMY/DjgrrIoFr4c/s400/sentencebranch.png" alt="" id="BLOGGER_PHOTO_ID_5292570349719419106" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;These are all examples of phrases people use in their everyday talk. We can visualize such phrases as a tree that starts with "i don't want to", where each branch adds a new meaning to the phrase. The branches marked with numbers are the parts of speech that give us an idea of what a person doesn't want to do : to go, to feel, to visit, to know.  
Things get &lt;span&gt;much&lt;/span&gt; more difficult in terms of the effort required if we wish to add more detail (and probably insight) to our analysis by moving further down the branches of our sentence tree.&lt;br /&gt;&lt;br /&gt;For marketeers, perhaps, the ability to quantify the distribution of words on the first level of the tree depicted above could be enough. Suppose we end up with the following word distribution :&lt;br /&gt;&lt;br /&gt;To feel : 15%&lt;br /&gt;To know : 7%&lt;br /&gt;To go : 1%&lt;br /&gt;To visit : 1%&lt;br /&gt;&lt;br /&gt;Then we gain insight into which words to use to market products more effectively.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In the next post we will go through a hands-on example of analyzing the thoughts of Twitter users, and specifically what people say they "don't want".&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-2860969719841529690?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/2860969719841529690/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=2860969719841529690' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2860969719841529690'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2860969719841529690'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/01/statistics-of-everyday-talk.html' title='The Statistics of Everyday Talk'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/SXL8sRvKeOI/AAAAAAAAAMY/DjgrrIoFr4c/s72-c/sentencebranch.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7770456858487779445</id><published>2009-01-05T18:34:00.007+02:00</published><updated>2009-05-25T18:44:29.536+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Emotions, Beliefs and Analytics</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;br /&gt;When I first came across Data Mining and Machine Learning in 1998 I had no idea of the kind of applications this field could have. As time passes, the knowledge available to a data/text miner becomes more and more of a serious business... actually, a very serious one.&lt;br /&gt;&lt;br /&gt;Not long ago I saw a presentation in which a map of emotions from the web was created &lt;span&gt;in real time&lt;/span&gt; by aggregating specific keywords from blogs and forum posts. &lt;a href="http://twistori.com/"&gt;Twistori&lt;/a&gt; is an example of such an application. Now, let's take this idea one step further.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://twitter.com/"&gt;Twitter&lt;/a&gt; is a "social messaging utility" in which users describe what they are doing (or what they are feeling/thinking) right now. Users can send "tweets" even through SMS messages. 
The way that these messages are written is an ideal format for text mining : Short phrases that summarize what a user wants to say are a text miner's paradise.&lt;br /&gt;&lt;br /&gt;It is logical to assume that Text mining and Information extraction techniques will become more important, since more data will be generated in the future.  It is only a matter of time until the next "killer app" like FaceBook, YouTube and Twitter appears. Data/Text miners will be able to identify common "thought clusters" of people.&lt;br /&gt;&lt;br /&gt;Now, consider the following example : By visiting &lt;a href="http://search.twitter.com/search?q=%22i+don%27t+want+to%22"&gt;this link&lt;/a&gt; you will get a list of people that have written in their "tweets" the phrase "I don't want to....".&lt;br /&gt;&lt;br /&gt;Once this textual information is captured, preprocessed and then analyzed through cluster analysis we could end up with the following clusters of "I don't want-er's " :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;- The cluster of users that do not want to work again/tomorrow/today (18.5%)&lt;br /&gt;&lt;br /&gt;- The cluster of users that do not want to go to sleep (6%)&lt;br /&gt;&lt;br /&gt;- The cluster of users that do not want to hurt someone (4.2%)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;What is also interesting, is the ability to quantify the proportion of cases belonging to each cluster to the total of tweets. As shown in the example above,  the most frequently occurring thought is from people that do not feel like working.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Now in the same way one could perform this type of analysis for :&lt;br /&gt;&lt;br /&gt;"I Believe...."&lt;br /&gt;"I wish i...."&lt;br /&gt;"I want to buy..."&lt;br /&gt;&lt;br /&gt;Essentially, what we are talking about is the extraction of the values, hopes and beliefs of hundreds of thousands -or even millions- of users...and  in &lt;span style="font-style: italic;"&gt;descending&lt;/span&gt; order. 
Once a first run is performed and clusters are extracted, one could run this process again every month and see how those clusters trend over time. It would also be interesting to see how these thought clusters change after specific world events.&lt;br /&gt;&lt;br /&gt;For some people, such as marketeers and social researchers, this information is invaluable, provided that the results are accurate enough. Others might feel that such an analysis is bad practice. Of course, there are companies that already capture brand sentiment across the web : &lt;a href="http://www.crimsonhexagon.com/home/"&gt;Crimson Hexagon&lt;/a&gt; and &lt;a href="http://twitrratr.com/"&gt;Twitrratr&lt;/a&gt; are just two examples.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This post is the first in a series of posts discussing the application of Analytics to capture the thoughts that exist on the Web as we speak. We will go through ways in which one could explore this information, and more specifically we will look at :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How clustering can group people's values, beliefs and emotions. 
&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Why Ontologies and Natural Language Processing are needed for better results.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How classification analysis might give us knowledge on what are the common characteristics of various 'categories' of users.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7770456858487779445?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7770456858487779445/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7770456858487779445' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7770456858487779445'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7770456858487779445'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2009/01/emotions-beliefs-and-analytics.html' title='Emotions, Beliefs and Analytics'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-6573384083107420896</id><published>2008-12-18T22:24:00.022+02:00</published><updated>2008-12-22T13:00:26.633+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' 
term='personalization'/><category scheme='http://www.blogger.com/atom/ns#' term='rss'/><title type='text'>Personalizing your RSS Feeds</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;br /&gt;78.8%. This is the estimated accuracy with which an algorithm is able to predict what kind of information I like to read, and thus which news items are likely to interest me.&lt;br /&gt;&lt;br /&gt;The system proves itself every day, and it actually helps a lot because it can instantly spot the information that I like among hundreds of &lt;a href="http://en.wikipedia.org/wiki/RSS_%28file_format%29"&gt;RSS&lt;/a&gt; news headers. Such a service is much more than simple keyword matching: because it takes into account the &lt;span style="font-style: italic;"&gt;combination&lt;/span&gt; of words, it can differentiate news items (in terms of how interesting they are) even when they are about similar concepts.&lt;br /&gt;&lt;br /&gt;Personalization of RSS feeds is a well-known application of text classification. The input (the header) is almost always 2-3 sentences long, which makes it ideal for feeding to a classifier. The software that I built is quite simple. First, I have a list of about 10 RSS sources : Financial, Medical, International News, Tech news etc. The application scans the RSS feeds every 20 minutes and appends each new header to a text file on my hard disk.&lt;br /&gt;&lt;br /&gt;When I ran the application for the first time, I simply saved all headers to the hard disk and built my first text classifier. But that was back then. Today the classifier scans the RSS feeds and automatically appends each RSS header to either the "Interesting" text file or the "Uninteresting" text file...and it does so correctly most of the time.&lt;br /&gt;&lt;br /&gt;When I have some spare time, I look at the classified headers and correct the errors my classifier made by moving the misfiled headers to the right file. 
I then re-train the classifier and everything is ready for the next run.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SUq5FGxWYUI/AAAAAAAAAMI/aUqHPiU9YWY/s1600-h/RSS.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 501px; height: 205px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SUq5FGxWYUI/AAAAAAAAAMI/aUqHPiU9YWY/s400/RSS.png" alt="" id="BLOGGER_PHOTO_ID_5281237010413412674" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Also interesting is the fact that the more I use the classifier, the less often it needs retraining. During the first week of usage I had to produce a new model almost every day...but not any more.&lt;br /&gt;&lt;br /&gt;So RSS personalization is an application worth looking at. First of all, it saves a lot of time. Second, really useful applications can emerge. For example, consider an investor who wishes to know whether something &lt;span style="font-style: italic;"&gt;significant enough&lt;/span&gt; occurs in the news that might affect the markets for better or for worse. Notice that in a previous &lt;a href="http://lifeanalytics.blogspot.com/2008/11/text-mining-on-financial-news.html"&gt;post&lt;/a&gt;, I described how I am &lt;a href="http://4.bp.blogspot.com/_koDJi0ps7Mw/SR_dlukAbMI/AAAAAAAAAI4/yCR5Kuzn1vs/s1600-h/textnewscrop.JPG"&gt;flagging&lt;/a&gt; every news header as important or unimportant. Therefore, if a classifier is able to distinguish accurately enough what is important, an investor can receive e-mail alerts about the event (or even SMS messages if he is not online). The message might even include how much a stock or an index is likely to be affected by the breaking news.&lt;br /&gt;&lt;br /&gt;There is also the personalization that results from collaborative filtering. 
However, I believe that a "personal news classifier" (if I may call it that) does a much better job in terms of predictive accuracy after sufficient training time.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-6573384083107420896?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/6573384083107420896/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=6573384083107420896' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6573384083107420896'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6573384083107420896'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/12/personalizing-your-rss-feeds.html' title='Personalizing your RSS Feeds'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_koDJi0ps7Mw/SUq5FGxWYUI/AAAAAAAAAMI/aUqHPiU9YWY/s72-c/RSS.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-499122288772241241</id><published>2008-12-09T01:14:00.017+02:00</published><updated>2009-03-12T13:07:58.702+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment 
analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='computational linguistics'/><category scheme='http://www.blogger.com/atom/ns#' term='information extraction'/><category scheme='http://www.blogger.com/atom/ns#' term='telecoms'/><title type='text'>When Telecom customers complain-Pt. 2</title><content type='html'>&lt;div style="text-align: justify;"&gt;In the previous &lt;a href="http://lifeanalytics.blogspot.com/2008/12/when-telecom-customers-complain-part1.html"&gt;post&lt;/a&gt; I explained the first steps in deploying Information Extraction, Text Mining and Computational Linguistics to capture the essence of Telecom customers' complaints.&lt;br /&gt;&lt;br /&gt;We have already discussed the big picture : retrieve data (essentially user messages from forums) and then use Information Extraction to transform unstructured information into a structured form. This transformation is done by building a set of matching rules for specific phrases or keywords such as&lt;br /&gt;&lt;br /&gt;-signal&lt;br /&gt;-antenna&lt;br /&gt;-customer care&lt;br /&gt;&lt;br /&gt;and words of &lt;span style="font-style: italic;"&gt;sentiment&lt;/span&gt; such as&lt;br /&gt;&lt;br /&gt;-worse&lt;br /&gt;-worst&lt;br /&gt;-better&lt;br /&gt;-best&lt;br /&gt;-outraged&lt;br /&gt;&lt;br /&gt;among many others.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;So here comes the interesting part : Suppose that a telecom company has at its disposal an application that can search unstructured information and extract sentiment from it. 
Having such a tool means that :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A user can &lt;span style="font-style: italic;"&gt;query user forums directly&lt;/span&gt; for, say, specific network problems and break those problems down by area name.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;A user can directly &lt;span style="font-style: italic;"&gt;query for hot phrases&lt;/span&gt; such as "canceling my subscription" and cluster keywords around those messages. If the telecom company is also running (and most likely it &lt;span&gt;is&lt;/span&gt; running) churn prediction models, then analysts have yet another source with which to cross-check and/or enhance the conclusions of their churn models.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Special matching rules can be applied to extract why users prefer company XYZ over company ABC.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;This technology can be applied to e-mails and/or free text complaints to the customer care center, which means that analysts can further enhance their churn models with additional data.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Matching rules can be built that associate keywords with Telecom companies in terms of their co-occurrence. For example, telecom company XYZ might have the phrase "good signal" &lt;span style="font-style: italic;"&gt;associated with its brand &lt;/span&gt;whilst company ABC has "bargain" as its associated keyword.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Billing plan &lt;/span&gt;keywords can be matched and then clustered with sentiment keywords. In other words, how do customers perceive the new billing plan and what is the sentiment about it?&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;It is easy to see that Information Extraction, Text Mining and linguistics form a powerful combination that can extract many "knowledge nuggets".  
The fact that such an application &lt;span style="font-style: italic;"&gt;cannot&lt;/span&gt; be 100% accurate may raise acceptance problems, but it is surely worth the effort in the end, provided that potential problems are clearly presented before the application is implemented.&lt;br /&gt;&lt;br /&gt;Let us not forget that a complaint made by a customer to the customer care center &lt;span style="font-style: italic;"&gt;remains there&lt;/span&gt;, within the boundaries of the company. A complaint posted on a forum can be seen by hundreds of thousands of others (and it will most likely stay there for a long time), influencing potential and existing customers negatively.&lt;br /&gt;&lt;br /&gt;A Sentiment analysis application may also be used for :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Banking&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Pharmaceuticals&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Insurance&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Consumer Products (Customer Reviews)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;and of course for capturing the sentiment of citizens towards politicians (...)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-499122288772241241?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/499122288772241241/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=499122288772241241' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/499122288772241241'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/499122288772241241'/><link rel='alternate' 
type='text/html' href='http://lifeanalytics.blogspot.com/2008/12/when-telecom-customers-complain-pt-2.html' title='When Telecom customers complain-Pt. 2'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-4119913980696031358</id><published>2008-12-07T16:48:00.008+02:00</published><updated>2008-12-07T18:15:15.393+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unstructured information'/><category scheme='http://www.blogger.com/atom/ns#' term='information extraction'/><category scheme='http://www.blogger.com/atom/ns#' term='telecoms'/><title type='text'>When Telecom customers complain</title><content type='html'>&lt;div style="text-align: justify;"&gt;Probably one of the best uses of Information Extraction, Text mining and Computational Linguistics combined together, is their ability to show us  the sentiment of customers.  Today we are going to see an example for capturing the sentiment of Telecom customers.&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;When a customer writes his/her opinion on a forum, a wealth of information is generated because -more importantly- a customer uses words and phrases that cannot be found during a controlled study.  
These words, phrases and expressions are far more emotionally powerful than a Likert scale answer of the type "Totally disagree / agree".&lt;br /&gt;&lt;br /&gt;So let us see the steps required :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;First Step&lt;/span&gt; : The first thing, of course, is to actually find the data. User forums where people talk about mobile phones and mobile companies are obviously the place to look, and there are lots of them. The volume of messages may occasionally be too small, but usually the available information is more than enough. Special code can be written to extract text from posts &lt;span style="font-style: italic;"&gt;but without loss of the nature of the posting&lt;/span&gt;. As an example, the fact that a post has generated 20 replies is considered valuable information. The more replies a post attracts, the more sentiment exists, and this information has to be taken into consideration.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Second Step&lt;/span&gt; : Deploy information extraction techniques to identify phrases of good or bad sentiment (and actually many other things) about Telecom keywords such as :&lt;br /&gt;&lt;br /&gt;- Signal&lt;br /&gt;- Customer Care&lt;br /&gt;- Billing&lt;br /&gt;&lt;br /&gt;....etc&lt;br /&gt;&lt;br /&gt;The following screen capture shows an example which is in Greek, but I will provide all the necessary explanation. Please also note that this is a simplified version of the process :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/STvsSAfhQyI/AAAAAAAAALo/0xsreNdgPDQ/s1600-h/telecom.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 420px; height: 292px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/STvsSAfhQyI/AAAAAAAAALo/0xsreNdgPDQ/s400/telecom.jpg" alt="" 
id="BLOGGER_PHOTO_ID_5277071182508671778" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Notice that on the right-hand side there are some bars that denote the type of keywords found. The first category is called "Characterization", and if it is checked (as it is in the screen capture above) the software will highlight only posts that contain some kind of characterization, whether good or bad. Notice also the yellow bar named "Network". Because it is checked, words that are synonyms of "Network" are highlighted, and indeed this is the case because&lt;br /&gt;&lt;br /&gt;Signal = σήμα (in Greek) and&lt;br /&gt;Flawless = άψογο&lt;br /&gt;&lt;br /&gt;so the highlighted phrase &lt;span style="font-style: italic; font-weight: bold;"&gt;άψογο σήμα&lt;/span&gt; means "flawless signal", which is a good characterization for the signal of two particular telecom companies. Notice also a line under the "Features" tab which says that between positions 3425 and 3429 there is a mention of signal ("mentionsSignal = true").&lt;br /&gt;&lt;br /&gt;Again, I have to point out that this is a simplified version of the process. Text Mining and Information Extraction are actually very hard work, but they are also very rewarding for those who ultimately deploy and use them. 
On the next post we will see  the problems (and there are many of them) but also how this unstructured information is turned to "nuggets of gold".&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-4119913980696031358?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/4119913980696031358/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=4119913980696031358' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4119913980696031358'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/4119913980696031358'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/12/when-telecom-customers-complain-part1.html' title='When Telecom customers complain'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_koDJi0ps7Mw/STvsSAfhQyI/AAAAAAAAALo/0xsreNdgPDQ/s72-c/telecom.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7271871406078217910</id><published>2008-12-03T17:42:00.012+02:00</published><updated>2008-12-04T01:15:59.693+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='financial markets'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><title type='text'>Analytics and the Financial Markets</title><content type='html'>&lt;div style="text-align: justify;"&gt;In previous posts, I explained ways to analyze the financial markets by using data mining and text mining techniques. I also went through some potential pitfalls and perils of this type of analysis.&lt;br /&gt;&lt;br /&gt;By&lt;span style="font-style: italic;"&gt; combining&lt;/span&gt; different data sources (worldwide indices, moving averages, oscillators, clustering or categorization of financial news) an investor could make better decisions on where and when to invest. After such an analysis our goal is to make &lt;span style="font-style: italic;"&gt;predictions that are sufficiently better than mere chance&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;A few days ago I came across a website called &lt;a href="http://www.inner8.com/"&gt;Inner8&lt;/a&gt;. Inner8 is a really interesting idea: collaborative filtering of stock picking. Combine this with analytics and an investor has yet another investing tool in his arsenal. Imagine thousands of Inner8 subscribers making stock predictions and giving their ideas, insights and sentiment for the stock market. After a few months &lt;span style="font-style: italic;"&gt;some&lt;/span&gt; users &lt;span style="font-style: italic;"&gt;will&lt;/span&gt; be "prediction super stars" by mere chance, so one has to proceed with caution.
Nevertheless, it is a website to keep an eye on in the future, especially if the subscriber volume increases significantly.&lt;br /&gt;&lt;br /&gt;So let us go back to our problem: we have to think of a good way to combine the information in our possession (aka problem representation) and feed this data to one or more algorithms with the goal of achieving models of high predictive value.&lt;br /&gt;&lt;br /&gt;Some of the things to consider:&lt;br /&gt;&lt;br /&gt;1) Should the "sliding window" technique be used? Could repetition of training data (because there &lt;span style="font-style: italic;"&gt;is&lt;/span&gt; repetition of data in sliding window training) affect the predictive power of the model?&lt;br /&gt;&lt;br /&gt;2) How many variables? Which are good predictors?&lt;br /&gt;&lt;br /&gt;3) Do we care only about the predictive power of the model? How about the interpretation of why a stock behaves as it does?&lt;br /&gt;&lt;br /&gt;4) How can we represent the "additive effect" of 2 straight days of bad market news if a sliding window is not used?&lt;br /&gt;&lt;br /&gt;5) Prediction goal: are we after price prediction (Regression) or price limits (Classification)?&lt;br /&gt;&lt;br /&gt;Unfortunately the list does not end here: since I am after predictions of stock prices on the Greek Stock Exchange, the data should be presented to the learning algorithm in a coherent way. European markets are affected by the close of the US and Asian markets. During Greek trading hours the US markets open (at 16:30 EET, approx. 45 minutes before the end of Greek trading), a fact that should also be taken into account.&lt;br /&gt;&lt;br /&gt;I am sure that there are many users out there who have read a couple of data mining books, downloaded an open-source data mining tool, fed some data in and expect to see results.
My only advice to them without the slightest sign of criticism: Paper-trade first...&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7271871406078217910?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7271871406078217910/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7271871406078217910' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7271871406078217910'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7271871406078217910'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/12/analytics-and-financial-markets.html' title='Analytics and the Financial Markets'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8786755319808739343</id><published>2008-11-26T09:45:00.007+02:00</published><updated>2008-11-26T12:43:37.313+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='digg'/><title type='text'>Predicting popular stories on Digg</title><content type='html'>On its latest news, KDNuggets mentions  a paper from HP Labs that outlines a  process of analyzing and predicting the popularity of a Digg story or a YouTube 
video submission.&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;No question that this is interesting material. In my &lt;a href="http://lifeanalytics.blogspot.com/2007/10/what-people-digg-more.html"&gt;post&lt;/a&gt; dated October 16th, 2007, I presented an analysis of which keywords seem to play a part in a post becoming popular on Digg. You can find all 3 parts of the post &lt;a href="http://lifeanalytics.blogspot.com/search/label/digg"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;So I made a new run to collect the stories from Digg and this is an example of what I came up with (please note: for illustrative purposes only):&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SS0imazeGNI/AAAAAAAAAKQ/QRSRkjBn4vU/s1600-h/digg.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 274px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SS0imazeGNI/AAAAAAAAAKQ/QRSRkjBn4vU/s400/digg.JPG" alt="" id="BLOGGER_PHOTO_ID_5272908782146296018" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://www.hpl.hp.com/research/scl/papers/predictions/predictions.pdf"&gt;paper&lt;/a&gt; from HP Labs takes a different route and makes its predictions based on the popularity of a submitted story in the first few hours rather than after some days. The authors also conclude that after a Digg story is out, users tend to vote for it in the beginning, but once a specified threshold time has passed the rate at which the story is 'digged' fades away. In contrast, videos submitted to YouTube keep being viewed by users on a linear trend after the video is submitted.&lt;br /&gt;&lt;br /&gt;It is true that there is an inherent seasonality in the news and in the way that users 'digg' stories.
It is also interesting to look at buzzwords that seem to keep &lt;span style="font-style: italic;"&gt;repeating&lt;/span&gt; (in terms of how interesting they are, or not) over time.&lt;br /&gt;&lt;br /&gt;Between the previous runs that I have made and the current one, I have seen some repeating patterns. One of these patterns shows Microsoft on a declining trend in terms of how interesting a subject it appears to be for Digg users. Here is what Google Trends shows about the term 'Microsoft':&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SS0mb51KjTI/AAAAAAAAAKY/ySI4FjnQhsI/s1600-h/viz.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 179px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SS0mb51KjTI/AAAAAAAAAKY/ySI4FjnQhsI/s400/viz.png" alt="" id="BLOGGER_PHOTO_ID_5272912999542852914" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Could such a trend somehow be a glimpse into Microsoft's 'future'? &lt;br /&gt;&lt;br /&gt;I have already built a text classifier which accepts a phrase and shows the probability of that phrase being highly 'digged' based on the keywords that the phrase contains.
More on this on a future post.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8786755319808739343?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8786755319808739343/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8786755319808739343' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8786755319808739343'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8786755319808739343'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/11/predicting-popular-stories-on-digg.html' title='Predicting popular stories on Digg'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_koDJi0ps7Mw/SS0imazeGNI/AAAAAAAAAKQ/QRSRkjBn4vU/s72-c/digg.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7910860431481081530</id><published>2008-11-21T13:43:00.002+02:00</published><updated>2008-11-21T14:28:49.899+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='reality mining'/><category scheme='http://www.blogger.com/atom/ns#' term='life analytics'/><title type='text'>Reality Mining vs Life Analytics</title><content 
type='html'>&lt;div style="text-align: justify;"&gt;We will be taking a short break from making predictions for the financial markets because I just came across a novel term (at least for me) regarding an analytics application: it is called &lt;span style="font-style: italic;"&gt;Reality Mining&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;In short, Reality Mining is about using smart phones to record the interaction of a user with his device and the interaction of the user with other cell phone users. By analyzing this data, patterns of behavior can be extracted that can potentially be interesting to social researchers. For more information on Reality Mining see &lt;a href="http://reality.media.mit.edu/"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;What made me start this blog was the following question: if there could be a way to capture (and thus record) what a person feels, thinks and otherwise experiences every day, what kind of patterns might emerge from analyzing this information? I think that this is one step (or even several steps) beyond Reality Mining. What would happen if this kind of information was recorded for a vast number of people and then analyzed? What if we could predict how the thoughts we have and what we experience might affect our lives later on and the decisions we will make?
This is full-blown 'Life Mining'.&lt;br /&gt;&lt;br /&gt;I get some e-mails from readers on "how this life analytics project is going" and for this subject there will be some future posts very soon.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7910860431481081530?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7910860431481081530/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7910860431481081530' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7910860431481081530'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7910860431481081530'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/11/reality-mining-vs-life-analytics.html' title='Reality Mining vs Life Analytics'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7960423217407067465</id><published>2008-11-16T10:20:00.010+02:00</published><updated>2008-11-16T22:01:43.802+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='financial news'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' 
term='clustering'/><title type='text'>Text Mining on Financial News</title><content type='html'>As discussed previously, an analyst should pay particular attention to problem representation, especially when dealing with text data. A way to do this will be discussed below; however, something has to give and there is no perfect solution for such a task.&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;First of all we have to find the source of the news: it could be financial news sites such as Bloomberg or the Financial Times, or RSS feed URLs such as the ones provided by MarketWatch. RSS feeds might be a better solution because there is already some predetermined categorization of news according to the feed type and this can be a great help for some analysts.&lt;br /&gt;&lt;br /&gt;After finding the news sources and writing the necessary code to get the actual information we could end up with the following text file:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/SR_dlukAbMI/AAAAAAAAAI4/yCR5Kuzn1vs/s1600-h/textnewscrop.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 275px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/SR_dlukAbMI/AAAAAAAAAI4/yCR5Kuzn1vs/s400/textnewscrop.JPG" alt="" id="BLOGGER_PHOTO_ID_5269173729270721730" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;You can see that I use a '^' separator to differentiate between:&lt;br /&gt;&lt;br /&gt;1) A date stamp&lt;br /&gt;2) A date string&lt;br /&gt;3) The news string&lt;br /&gt;4) A characterization of the news (important or unimportant)&lt;br /&gt;5) A categorization of the financial news&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This simple file could provide the basis for a training file for text categorization.
Assuming that we have trained algorithms to automatically classify news, we could use a news classifier to first categorize news as important or unimportant and pass &lt;span style="font-style: italic;"&gt;only the&lt;/span&gt; &lt;span style="font-style: italic;"&gt;important&lt;/span&gt; news to a second classifier which will do the detailed classification of the news.&lt;br /&gt;&lt;br /&gt;Another option is to use clustering: you can imagine that the solution detailed above involves a tremendous amount of work depending on how much data you are planning to collect...so more data means more work, while less data could mean (usually but not always) less accuracy.&lt;br /&gt;&lt;br /&gt;But how could clustering be performed on such data? Simply put, we use field number (4) of our training text file to train a clustering algorithm and then see what 'classes' the algorithm has come up with.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;So let's see a small example of clustering. This is a capture from WEKA just before the clustering process:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SSACsdhH_4I/AAAAAAAAAJA/8mQuCkbvGuM/s1600-h/wekacapture.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 291px;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SSACsdhH_4I/AAAAAAAAAJA/8mQuCkbvGuM/s400/wekacapture.JPG" alt="" id="BLOGGER_PHOTO_ID_5269214526884544386" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;I have produced a training file which essentially contains the 'buzzwords' of financial news: barrel, recession, Yen, Euro, ECB, price, consumer, etc. The file is then analyzed by the &lt;a href="http://en.wikipedia.org/wiki/K-means_algorithm"&gt;K-means&lt;/a&gt; algorithm to extract clusters that share the same 'buzzwords'.
Each cluster is assigned a number, so each news header ultimately falls into one numbered cluster.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;After running the K-Means algorithm I ended up with 16 clusters. Let's see two instances that K-Means assigned to cluster '6':&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Instance_number : 130.0&lt;br /&gt;&lt;br /&gt;Fear&lt;br /&gt;Decrease&lt;br /&gt;US&lt;br /&gt;Economy&lt;br /&gt;Futures&lt;br /&gt;  &lt;br /&gt;and&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Instance_number : 174.0&lt;br /&gt;&lt;br /&gt;Fear&lt;br /&gt;Decrease&lt;br /&gt;US&lt;br /&gt;Price&lt;br /&gt;Oil&lt;br /&gt;Banking&lt;br /&gt;Recession&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;So the first instance is about fears for the US economy, which result in US futures dropping, and the second instance must be something about a decrease in oil prices and banking stocks because of the fear of a US recession. Not bad at all...&lt;br /&gt;&lt;br /&gt;But not so fast: clustering presents a lot of problems later in the process. Remember that what we are after is to &lt;span style="font-style: italic;"&gt;combine&lt;/span&gt; text mining and data mining to better understand how the markets react. Should one use classification or clustering?
There are many more things to take into consideration and for obvious reasons I cannot disclose all the details of such a project...but I am hoping to give the interested reader a good enough introduction to the subject.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7960423217407067465?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7960423217407067465/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7960423217407067465' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7960423217407067465'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7960423217407067465'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/11/text-mining-on-financial-news.html' title='Text Mining on Financial News'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/SR_dlukAbMI/AAAAAAAAAI4/yCR5Kuzn1vs/s72-c/textnewscrop.JPG' height='72' width='72'/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-2074987296946806238</id><published>2008-11-14T10:32:00.006+02:00</published><updated>2008-11-15T21:53:45.224+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='financial news'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><title type='text'>Capturing the Financial Facts</title><content type='html'>&lt;div style="text-align: justify;"&gt;So far, we have seen the &lt;span style="font-style: italic;"&gt;data &lt;/span&gt;mining part of analyzing the financial markets and some of the problems that arise during such an analysis: data have to be collected and pre-processed accordingly. There are dangers of over-fitting and the analyst must make sure that the model(s) created have the expected quality. The analyst also has to choose the relevant attributes with which the analysis will be performed and decide how the algorithms will be trained.&lt;br /&gt;&lt;br /&gt;The markets react to financial news and there is no question about this. Of course there are other factors that make people buy or sell: for example, if a stock price hits a &lt;a href="http://en.wikipedia.org/wiki/Support_%28technical_analysis%29"&gt;support or resistance&lt;/a&gt; level, then some investors are going to buy or sell at that level. Investors are also going to buy or sell when specific technical indicators such as &lt;a href="http://en.wikipedia.org/wiki/MACD"&gt;MACD&lt;/a&gt; or oscillators show the signals to do so. Even when bad news is out, markets will go up by an unknown percentage after an unknown number of consecutive drops, and vice versa.&lt;br /&gt;&lt;br /&gt;People who are involved with Machine Learning know that the representation of the problem at hand is of high importance...so first we are going to see how financial news can be represented in a helpful way for the analysis.&lt;br /&gt;&lt;br /&gt;We have to see what we are dealing with here. To do this, we have to analyze and categorize the financial information as it is created.
Financial news can be news about a number of things:&lt;br /&gt;&lt;br /&gt;1) The number of jobless claims in the US is higher than last year&lt;br /&gt;2) Sales of automotive company XYZ dropped by 15%&lt;br /&gt;3) Oil prices hit yet another record high&lt;br /&gt;4) The dollar is dropping&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;...and the list goes on.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;So the first problem arises: should we &lt;span style="font-style: italic;"&gt;categorize &lt;/span&gt;the information according to its content and present it to the algorithms? We could do that by having a boolean field for each type of news in our training file and setting it to TRUE or FALSE accordingly. By using this method we could easily reach thousands of input fields, since for the "jobless claims" news type we could have the following variants:&lt;br /&gt;&lt;br /&gt;-A specific country for the jobless claims report (not only the US, it could be any country)&lt;br /&gt;&lt;br /&gt;-Jobless claims could be higher than expected, higher than last year, or the highest in the last decade&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;It is easy to see that this gets out of control way too fast. Perhaps a better solution would be to try to create clusters of (more or less) the same news. The idea of &lt;a href="http://en.wikipedia.org/wiki/Data_clustering"&gt;clustering&lt;/a&gt; the financial news might seem an interesting one: an analyst could define a number of clusters (say, 100) and let the clustering process categorize all the news accordingly. But is clustering the solution?
More on this on the next post...&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-2074987296946806238?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/2074987296946806238/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=2074987296946806238' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2074987296946806238'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2074987296946806238'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/11/capturing-financial-facts.html' title='Capturing the Financial Facts'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-8244592874981577805</id><published>2008-11-12T10:41:00.008+02:00</published><updated>2008-11-12T15:08:57.648+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='model testing'/><title type='text'>Model Testing</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: justify;"&gt;Once a model has been created (such as the decision tree for our example), the analyst is required to test the model. 
During model testing, an analyst performs specific tests that show the &lt;span style="font-style: italic;"&gt;actual &lt;/span&gt;predictive power of a model.&lt;br /&gt;&lt;br /&gt;Many methods can be used for model testing, depending on the problem. For our example, and since the available volume of data is sufficiently large, the model training and testing methodology I used was as follows:&lt;br /&gt;&lt;br /&gt;1) 50% of the data were used for model training&lt;br /&gt;2) 25% of the data were used for model validation (fine tuning)&lt;br /&gt;3) 25% of the data were used for testing the model&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;In other words, 75% of the data were used for training the algorithm and assessing the impact that changes in algorithm parameters have on the accuracy of the model. For a decision tree algorithm (and depending on the type of decision tree used) an analyst might try different settings for the splitting criteria and/or the minimum number of cases per branch, etc.&lt;br /&gt;&lt;br /&gt;Unfortunately, an analyst often finds that the accuracy estimated during the training and validation phases (i.e. steps 1 and 2 shown above) is in no way representative of the accuracy obtained when the model is tested on unseen cases (step 3).&lt;br /&gt;&lt;br /&gt;During my analysis, numerous models showed an estimated accuracy of 85% or more, but when they were applied to actual data the accuracy dropped to 50-53%, suggesting that &lt;a href="http://en.wikipedia.org/wiki/Overfitting"&gt;overfitting&lt;/a&gt; was present.
Consequently, the use of these biased models to predict new cases would have detrimental effects in actual stock trading.&lt;br /&gt;&lt;br /&gt;When all models are built, the analyst should choose a model (when there is a requirement to use only one model) according to :&lt;br /&gt;&lt;br /&gt;1) (Statistically significant) best accuracy.&lt;br /&gt;2) Misclassification costs, if these are not taken into account during the model building process.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;On the next post we will see how text mining may help us in making better predictions for the markets.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-8244592874981577805?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/8244592874981577805/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=8244592874981577805' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8244592874981577805'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/8244592874981577805'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/11/evaluating-what-is-learned-model.html' title='Model Testing'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' 
src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3069036271437131283</id><published>2008-10-30T16:34:00.009+02:00</published><updated>2008-10-31T11:31:09.743+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='feature selection'/><title type='text'>Decision Tree Interpretation</title><content type='html'>&lt;div style="text-align: justify;"&gt;On the &lt;a href="http://lifeanalytics.blogspot.com/2008/10/insights-from-decision-tree.html"&gt;previous post&lt;/a&gt; i went through some basic steps required for predicting the price changes of a specific stock of the Greek stock exchange market. As a result of this process, the following decision tree was generated :&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_koDJi0ps7Mw/SQnHRg23TJI/AAAAAAAAAHM/m-yJhR9Mi08/s1600-h/stockdecisiontree2.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 159px;" src="http://1.bp.blogspot.com/_koDJi0ps7Mw/SQnHRg23TJI/AAAAAAAAAHM/m-yJhR9Mi08/s400/stockdecisiontree2.jpg" alt="" id="BLOGGER_PHOTO_ID_5262956743250889874" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;To interpret a decision tree, the analyst starts from the root of the tree  and reads through it until a leaf node is reached. 
For example, a rule that can be extracted from the decision tree above is the following:&lt;br /&gt;&lt;br /&gt;"IF aseStockExchange &gt; 0.360 AND aseStockExchange &gt; 1.985 THEN price&gt;+2"&lt;br /&gt;&lt;br /&gt;The rule above can be found by starting from the root of the tree, moving along the left branch and then continuing to the right sub-branch. In the same way an analyst is able to find the rest of the rules identified by the decision tree.&lt;br /&gt;&lt;br /&gt;When using decision tree learners or rule extractors, analysts record the &lt;a href="http://en.wikipedia.org/wiki/Precision_and_recall"&gt;precision and recall&lt;/a&gt; of a rule, which are not shown in the decision tree above. However, for the sake of simplicity I will omit this information and describe the insights provided by the analysis. Decision trees possess the following two qualities:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;1) They provide easy model interpretation&lt;br /&gt;&lt;br /&gt;and&lt;br /&gt;&lt;br /&gt;2) They show us the relative importance of the variables&lt;br /&gt;&lt;br /&gt;When confronted with many variables, analysts usually start by building a decision tree and then use the variables which the decision tree algorithm has selected with other methods that suffer from the complexity of many variables, such as neural networks. However, decision trees that split on one variable at a time perform worse when the classes cannot be separated well by such axis-parallel splits. For the purpose of our example though, a decision tree 'explains' the behavior of the stock nicely.&lt;br /&gt;&lt;br /&gt;It should be noted that during the &lt;a href="http://lifeanalytics.blogspot.com/2008/10/sowhats-important.html"&gt;Feature Selection analysis&lt;/a&gt; of our stock example we found that the features 'aseStockExchange' and 'DAX' are important. Other features such as 'xaaPersonalHouseProducts' were flagged as important by the Feature Selection algorithm but were not used in the decision tree.
Different feature selection methods produce different results (and one might say that this is not very reassuring), but most methods usually produce a common feature subset that is of high predictive value.&lt;br /&gt;&lt;br /&gt;The importance of the attributes can be seen from the level at which they appear in the decision tree (the higher the level, the greater the predictive power of the attribute). So in our example, the 'aseStockExchange' feature is the most important (since it is the attribute with which the decision tree starts), while 'xaaLeisure' and 'xaaBenefit' seem to be less important attributes.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3069036271437131283?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/3069036271437131283/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3069036271437131283' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3069036271437131283'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3069036271437131283'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/10/decision-tree-interpretation.html' title='Decision Tree Interpretation'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://1.bp.blogspot.com/_koDJi0ps7Mw/SQnHRg23TJI/AAAAAAAAAHM/m-yJhR9Mi08/s72-c/stockdecisiontree2.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-9057252958658388124</id><published>2008-10-16T00:22:00.017+03:00</published><updated>2008-10-16T15:49:28.061+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='financial markets'/><category scheme='http://www.blogger.com/atom/ns#' term='stock exchange indices'/><title type='text'>Insights from a Decision Tree</title><content type='html'>&lt;div  style="text-align: justify;font-family:georgia;"&gt;&lt;span style=";font-family:georgia;font-size:100%;"  &gt;Assuming that an analyst has made all necessary &lt;a href="http://en.wikipedia.org/wiki/Data_Pre-processing"&gt;pre-processing&lt;/a&gt; tasks prior to the data mining phase, we are ready to deploy analytical methods such as &lt;a href="http://en.wikipedia.org/wiki/Decision_tree_learning"&gt;decision tree learners&lt;/a&gt; that can classify unseen cases. For the goal of stock prediction we assume that we have the following data collected :&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SPZihzYXWdI/AAAAAAAAAG8/rEXy2xLhbvo/s1600-h/stockdata.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SPZihzYXWdI/AAAAAAAAAG8/rEXy2xLhbvo/s400/stockdata.JPG" alt="" id="BLOGGER_PHOTO_ID_5257497947869239762" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-family:georgia;"&gt;The column named as XAACLASS is the target column that we wish to classify. 
Essentially, we have the following classes:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;-price change percentage greater than 2%&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;-price change percentage less than -2%&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:georgia;" &gt;-price change percentage greater than 0% and up to +2% inclusive&lt;/span&gt;&lt;br /&gt;&lt;price&gt;&lt;price&gt;&lt;price&gt;&lt;span style="font-family:georgia;" &gt;-price change percentage between -2% inclusive and 0% inclusive  &lt;/span&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;In other words, each line shows us the &lt;span style="font-style: italic;"&gt;state &lt;/span&gt;of the stock we wish to predict, which occurs &lt;span style="font-style: italic;"&gt;given &lt;/span&gt;the rest of the market indices (such as realTimeFTSE, realTimeDAX, etc.).&lt;br /&gt;&lt;br /&gt;So, let us assume that we are ready to build such a model. However, we have to decide the time window for which our predictions will be made. Do we wish to predict what the stock price change will be 2 hours ahead? How about 1 day ahead?&lt;br /&gt;&lt;br /&gt;Before dealing with this issue, I wanted to see how good a predictive model is at predicting the stock price percentage change &lt;span style="font-style: italic;"&gt;right now&lt;/span&gt;, based on the &lt;span style="font-style: italic;"&gt;current &lt;/span&gt;market conditions. 
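&lt;br /&gt;&lt;br /&gt;To make the four target classes listed earlier concrete, here is a minimal Python sketch of how a percentage price change could be mapped to a class. The function name and the label strings are assumptions made for this illustration; they are not the actual values stored in the XAACLASS column.&lt;br /&gt;&lt;pre&gt;
```python
def price_change_class(pct_change: float) -> str:
    """Map a percentage price change to one of the four classes (labels are illustrative)."""
    if pct_change > 2:
        return "greater than +2%"
    if pct_change > 0:
        # Greater than 0% and up to +2% inclusive.
        return "above 0% and up to +2%"
    if pct_change >= -2:
        # Between -2% inclusive and 0% inclusive.
        return "from -2% up to 0%"
    return "less than -2%"


for change in (3.1, 0.8, 0.0, -2.0, -4.5):
    print(change, price_change_class(change))
```
&lt;/pre&gt;Each row of the data set would get exactly one of these labels, which is what the classifier is then asked to predict.&lt;br /&gt;&lt;br /&gt;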
Here is a decision tree that is created from such data:&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/price&gt;&lt;/price&gt;&lt;/price&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/SPZp_hf5F8I/AAAAAAAAAHE/tPBeASdPbsA/s1600-h/stockdecisiontree2.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/SPZp_hf5F8I/AAAAAAAAAHE/tPBeASdPbsA/s400/stockdecisiontree2.jpg" alt="" id="BLOGGER_PHOTO_ID_5257506155046442946" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-family:georgia;" &gt;More to come on the next post where the model seen above will be explained in detail. Until then  please read the post from &lt;/span&gt;&lt;a style="font-family: georgia;" href="http://dataminingresearch.blogspot.com/2008/09/stock-prediction-using-decision-tree.html"&gt;this&lt;/a&gt;&lt;span style="font-family:georgia;" &gt; blog about the same problem. 
If you can, read &lt;/span&gt;&lt;a style="font-family: georgia;" href="http://www.amazon.com/Fooled-Randomness-Hidden-Chance-Markets/dp/1400067936/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1224109126&amp;amp;sr=8-1"&gt;Fooled By Randomness&lt;/a&gt;&lt;span style="font-family:georgia;" &gt; also...&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-9057252958658388124?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/9057252958658388124/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=9057252958658388124' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/9057252958658388124'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/9057252958658388124'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/10/insights-from-decision-tree.html' title='Insights from a Decision Tree'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_koDJi0ps7Mw/SPZihzYXWdI/AAAAAAAAAG8/rEXy2xLhbvo/s72-c/stockdata.JPG' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7622466189506491000</id><published>2008-10-10T00:23:00.009+03:00</published><updated>2009-09-26T19:29:10.982+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='feature selection'/><title type='text'>So...What's important??</title><content type='html'>&lt;div style="text-align: justify;"&gt;A step of a Knowledge Discovery Process is to perform what is known as &lt;a href="http://en.wikipedia.org/wiki/Feature_selection"&gt;Feature Selection&lt;/a&gt;, which essentially is the identification of a subset of features with high predictive value.&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Feature selection can potentially help in increasing the accuracy of prediction models. Methods such as &lt;a href="http://en.wikipedia.org/wiki/Naive_bayes"&gt;Naive Bayes&lt;/a&gt; can perform better when presented with a subset of selected features, rather than the whole feature set (because of feature redundancy).&lt;br /&gt;&lt;br /&gt;Even if feature selection does not prove to help too much, it is important to &lt;span style="font-style: italic;"&gt;know &lt;/span&gt;the predictive power of each feature. There are numerous methods to do this and -as normally is the case- there is no universally better method to perform an optimal feature selection. 
The following is a representation of all available Feature Selection methods in &lt;a href="http://www.cs.waikato.ac.nz/ml/weka/"&gt;WEKA&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_koDJi0ps7Mw/Sr5A9t09YpI/AAAAAAAAAUA/vHKjLJb8xL0/s1600-h/Feature+Selection.jpeg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 332px; height: 400px;" src="http://4.bp.blogspot.com/_koDJi0ps7Mw/Sr5A9t09YpI/AAAAAAAAAUA/vHKjLJb8xL0/s400/Feature+Selection.jpeg" alt="" id="BLOGGER_PHOTO_ID_5385813633399612050" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Let us stick to our example with stocks, to make things clearer. Suppose that I would like to know which features seem to be important for predicting the behavior of a stock. For our example we will try to find out how the stock of &lt;a href="http://www.nyse.com/about/listed/nbg.html"&gt;NBG&lt;/a&gt; reacts.&lt;br /&gt;&lt;br /&gt;By using a feature selection method we extract the following information:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_koDJi0ps7Mw/SO7oXuk88ZI/AAAAAAAAAEE/8_EkxnHi948/s1600-h/fs1.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_koDJi0ps7Mw/SO7oXuk88ZI/AAAAAAAAAEE/8_EkxnHi948/s400/fs1.JPG" alt="" id="BLOGGER_PHOTO_ID_5255393309525602706" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The feature selection method above shows us how many times each attribute was selected during a 10-fold cross-validation. We can see that some attributes are selected more often than others across the folds. 
For example:&lt;br /&gt;&lt;br /&gt;realTimeDax&lt;br /&gt;aseStockExchangeIndex&lt;br /&gt;xaaPersonalHouseProducts&lt;br /&gt;xaaTechnology&lt;br /&gt;bankAgrotiki&lt;br /&gt;bankAlpha&lt;br /&gt;bankPiraeus&lt;br /&gt;bankEuro&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;are present in all 10 folds of our cross-validation, hence the 10(100%) entry. The xaaFinancialServices index has been selected fewer times (8 out of 10), hence the 8(80%) entry. Other features never appear in any of the cross-validation folds.&lt;br /&gt;&lt;br /&gt;Of course feature selection does not stop here and there are many ways to enhance the process. Data Mining is &lt;span style="font-style: italic;"&gt;both &lt;/span&gt;an art &lt;span style="font-style: italic;"&gt;and&lt;/span&gt; a science. However, for our purpose we were able to identify those attributes that seem to be important in the prediction of the NBG stock. We immediately see, for example, that the &lt;a href="http://en.wikipedia.org/wiki/DAX"&gt;DAX&lt;/a&gt; index and the Athens Stock Exchange Index are two important features, plus the stocks of four specific banks. 
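&lt;br /&gt;&lt;br /&gt;The per-fold counts described above can be sketched in a few lines of Python. This is not how WEKA computes them internally; it only illustrates the idea of counting in how many folds each attribute was selected. The three folds below are invented sample data (the real analysis used 10 folds), and the function name is an assumption.&lt;br /&gt;&lt;pre&gt;
```python
from collections import Counter

# Hypothetical per-fold selections; the real ones come from the feature selection tool.
folds = [
    {"realTimeDax", "aseStockExchangeIndex", "xaaFinancialServices"},
    {"realTimeDax", "aseStockExchangeIndex"},
    {"realTimeDax", "aseStockExchangeIndex", "xaaFinancialServices"},
]


def selection_counts(selected_per_fold):
    """Count in how many folds each feature was selected."""
    counts = Counter()
    for selected in selected_per_fold:
        counts.update(selected)
    return counts


counts = selection_counts(folds)
for feature, times in counts.most_common():
    # Integer percentage of folds in which the feature was selected.
    share = 100 * times // len(folds)
    print(f"{feature}: {times}({share}%)")
```
&lt;/pre&gt;Features selected in every fold end up with a 100% share, in the same spirit as the 10(100%) entries in the table above.&lt;br /&gt;&lt;br /&gt;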
Other methods of feature selection produce &lt;span style="font-style: italic;"&gt;weights&lt;/span&gt; that essentially rank the importance of each attribute for class prediction.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7622466189506491000?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7622466189506491000/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7622466189506491000' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7622466189506491000'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7622466189506491000'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/10/sowhats-important.html' title='So...What&apos;s important??'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_koDJi0ps7Mw/Sr5A9t09YpI/AAAAAAAAAUA/vHKjLJb8xL0/s72-c/Feature+Selection.jpeg' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7248887647151395392</id><published>2008-10-06T15:59:00.006+03:00</published><updated>2008-10-06T17:10:29.647+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='correlation 
matrix'/><category scheme='http://www.blogger.com/atom/ns#' term='stock exchange indices'/><title type='text'>Always know your data!</title><content type='html'>&lt;div style="text-align: justify;"&gt;Before rushing into analyzing and predicting the Financial Markets (and actually &lt;span style="font-style: italic;"&gt;anything &lt;/span&gt;else) it is essential that we get an idea about the data at hand. So after data collection (i.e. getting values of different market indices) I first wanted to understand what is going on in the markets. A correlation matrix tells us just that. Let's see what happens on the Greek Stock Exchange:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_koDJi0ps7Mw/SOoPvCB9XdI/AAAAAAAAADs/RZ_e8tFhpHs/s1600-h/correlation.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 290px; height: 212px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/SOoPvCB9XdI/AAAAAAAAADs/RZ_e8tFhpHs/s400/correlation.JPG" alt="" id="BLOGGER_PHOTO_ID_5254029215954460114" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;By looking at the matrix we can immediately see some interesting things:&lt;br /&gt;&lt;br /&gt;1) There is a high correlation (=0.847) between the &lt;a href="http://en.wikipedia.org/wiki/DAX"&gt;DAX&lt;/a&gt; index and the Greek stock exchange index (marked as aseStockExchangeIndex)&lt;br /&gt;&lt;br /&gt;2) The Insurance sector index (xaaInsurance) and the Media sector index (xaaMedia) have a low correlation with the aseStockExchangeIndex. 
Consider the following scatter chart that shows the poor correlation between Insurance sector stocks and the aseStockExchangeIndex:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_koDJi0ps7Mw/SOoVvzwexrI/AAAAAAAAAD0/fwOchmdfnbM/s1600-h/scatter.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 378px; height: 199px;" src="http://2.bp.blogspot.com/_koDJi0ps7Mw/SOoVvzwexrI/AAAAAAAAAD0/fwOchmdfnbM/s400/scatter.JPG" alt="" id="BLOGGER_PHOTO_ID_5254035826372691634" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Those two facts alone can help significantly in trading. For example, if an investor's trading decision is heavily based on the aseStockExchangeIndex, then the investor should also keep a close eye on the &lt;a href="http://en.wikipedia.org/wiki/DAX"&gt;DAX&lt;/a&gt; index as opposed to other European indices (such as FTSE, CAC40, etc.).&lt;br /&gt;&lt;br /&gt;A lot of problems later in the analysis can be prevented if one pays attention to the "Data Understanding" phase. 
Plus, we also get an insight as to what kind of results should we expect from the learning algorithms.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7248887647151395392?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7248887647151395392/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7248887647151395392' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7248887647151395392'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7248887647151395392'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/10/always-know-your-data.html' title='Always know your data!'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_koDJi0ps7Mw/SOoPvCB9XdI/AAAAAAAAADs/RZ_e8tFhpHs/s72-c/correlation.JPG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-3925922604248847406</id><published>2008-09-25T13:46:00.004+03:00</published><updated>2008-09-25T14:46:54.079+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='financial markets'/><category 
scheme='http://www.blogger.com/atom/ns#' term='data mining'/><title type='text'>Predicting the Financial markets</title><content type='html'>&lt;div style="text-align: justify;"&gt;After a &lt;span style="font-style: italic;"&gt;very &lt;/span&gt;long break I decided to start writing again. It is very interesting to see that people are still answering my questionnaire (see Links area) and every week I see more answers coming in... but more data is always welcome!&lt;br /&gt;&lt;br /&gt;Since February I have been involved with yet another Data &amp;amp; Text Mining application, namely predicting the financial markets by using financial and world news (the Text Mining side) and key financial indices (the Data Mining side). I have found numerous blogs and articles about this problem. One example is the blog &lt;a href="http://www.neuralmarkettrends.com/"&gt;neural market trends&lt;/a&gt;, along with a series of &lt;a href="http://www.b-eye-network.com/view/6386"&gt;articles&lt;/a&gt; which I originally found on &lt;a href="http://www.kdnuggets.com"&gt;kdnuggets&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;There is no question that such an application of Predictive Analytics in Financial Markets is interesting, but it can also be (potentially) dangerous. 
However my experience so far on the subject has shown me that by a) getting a grip on the risk of a trading decision and b) using Predictive Analytics to make the trading decision, a user has more chances in making a successful trade.&lt;br /&gt;&lt;br /&gt;On the next post, we will go through the data mining side of the problem.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-3925922604248847406?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/3925922604248847406/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=3925922604248847406' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3925922604248847406'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/3925922604248847406'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/09/predicting-financial-markets.html' title='Predicting the Financial markets'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-6218535885331863068</id><published>2008-02-11T13:08:00.001+02:00</published><updated>2008-02-14T20:25:24.627+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='real estate'/><category scheme='http://www.blogger.com/atom/ns#' term='text 
mining'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><title type='text'>Analyzing the Real Estate Market - Part 2</title><content type='html'>&lt;div style="text-align: justify;"&gt;In the &lt;a href="http://lifeanalytics.blogspot.com/2007/12/analyzing-real-estate-market-part-1.html"&gt;previous part&lt;/a&gt; I listed the first steps required to turn the unstructured information of flat-for-rent adverts into a form suitable for further analysis of the Greek Real Estate Market.&lt;br /&gt;&lt;br /&gt;Once the Information Extraction step is finished, the characteristics of each flat advert (price, square meters, type of heating, age in years, etc.) are inserted into a database. Once the flat advert data are inserted, we are able to extract key information about price trends for specific areas of Athens such as Nea Smyrni. The following screen capture shows a portion of the records that exist in the database after the information extraction phase:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_koDJi0ps7Mw/R7F2IHanW7I/AAAAAAAAADk/Xyj4n7KLp0A/s1600-h/sql.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp2.blogger.com/_koDJi0ps7Mw/R7F2IHanW7I/AAAAAAAAADk/Xyj4n7KLp0A/s400/sql.JPG" alt="" id="BLOGGER_PHOTO_ID_5166040129372380082" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;With the advert data in place we are ready to deploy data mining algorithms that can reveal potentially useful patterns. 
For example, a classification analysis aimed at finding which characteristics are important for obtaining a high rental price produces the following decision tree:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_koDJi0ps7Mw/R7Bl1XanW6I/AAAAAAAAADc/XuzJdwaLr3s/s1600-h/treeexample.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp2.blogger.com/_koDJi0ps7Mw/R7Bl1XanW6I/AAAAAAAAADc/XuzJdwaLr3s/s400/treeexample.jpg" alt="" id="BLOGGER_PHOTO_ID_5165740740087077794" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The decision tree depicted above essentially gives us the following information:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The most important characteristic for obtaining a high rental value (in terms of Euros per square meter) is the provision of a parking space with the flat.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;If a flat provides a parking space, has a storage area and has up to two bedrooms, then the flat obtains the highest rental rate (i.e. 7.54 Euros per square meter).&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;If a flat does not provide a parking space but has at least one bedroom and is located on the fourth floor (or higher), then the flat obtains the highest rental rate per square meter.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-6218535885331863068?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/6218535885331863068/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=6218535885331863068' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6218535885331863068'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6218535885331863068'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2008/01/analyzing-real-estate-market-part-2.html' title='Analyzing the Real Estate Market - Part 2'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_koDJi0ps7Mw/R7F2IHanW7I/AAAAAAAAADk/Xyj4n7KLp0A/s72-c/sql.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-2680257023994496860</id><published>2008-01-06T16:27:00.001+02:00</published><updated>2008-10-15T19:54:38.567+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='real estate'/><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='unstructured information'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='information extraction'/><title type='text'>Analyzing the  Real Estate Market - Part 1</title><content type='html'>&lt;div style="text-align: justify;"&gt;Over the next days i will present an example of using Data Mining and Information Extraction techniques to analyze Real Estate in the Greek Market.&lt;br /&gt;&lt;br /&gt;The problem is as follows : In 
a specific suburb of Athens in Greece (let's say Nea Smyrni), what are the key factors (or characteristics) that contribute to a high rental price for a flat? Which is more important: having a parking space, or the house being less than 5 years old?&lt;br /&gt;&lt;br /&gt;In my experience, this piece of information is particularly valuable for flat owners, real estate investors and real estate agents (to name a few).&lt;br /&gt;&lt;br /&gt;I really like this example because it shows the power of Information Extraction and Data Mining combined, and the insight that these techniques can reveal.&lt;br /&gt;&lt;br /&gt;In order to implement this analysis, the first required action is the collection of information. For this reason, special software collects flat-for-rent adverts from Greek websites. The next step is to extract each flat's characteristics from its advert. Information extraction is used to extract these characteristics as shown below:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_koDJi0ps7Mw/R2kxXjGQkZI/AAAAAAAAAC0/ylHxtAa89mY/s1600-h/realestateIE.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp1.blogger.com/_koDJi0ps7Mw/R2kxXjGQkZI/AAAAAAAAAC0/ylHxtAa89mY/s400/realestateIE.jpg" alt="" id="BLOGGER_PHOTO_ID_5145698329875747218" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The goal of Information Extraction is to transform unstructured information into a form suitable for further analysis. More specifically, after the Information Extraction phase, the characteristics of each flat advert are inserted into a database. 
More on this on Part 2...&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-2680257023994496860?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/2680257023994496860/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=2680257023994496860' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2680257023994496860'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/2680257023994496860'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2007/12/analyzing-real-estate-market-part-1.html' title='Analyzing the  Real Estate Market - Part 1'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp1.blogger.com/_koDJi0ps7Mw/R2kxXjGQkZI/AAAAAAAAAC0/ylHxtAa89mY/s72-c/realestateIE.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7330810331389158923</id><published>2007-12-14T08:21:00.000+02:00</published><updated>2007-12-16T23:10:49.732+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='unstructured information'/><category scheme='http://www.blogger.com/atom/ns#' term='digg'/><title 
type='text'>What People Digg More? - Part 3</title><content type='html'>&lt;div style="text-align: justify;"&gt;In this third and final part on how &lt;a href="http://www.digg.com/"&gt;digg &lt;/a&gt;stories are analyzed, I will present an example of the co-occurrence table used to find statistically significant correlations of words.&lt;br /&gt;&lt;br /&gt;In the &lt;a href="http://lifeanalytics.blogspot.com/2007/11/what-people-digg-more-part-2.html"&gt;previous&lt;/a&gt; part I outlined the way stories from digg are collected and how these stories are transformed into a form suitable for analysis. The last step of the analysis is finding out which subjects people seem to really like and which subjects are not 'digged' so much. To do that, a co-occurrence table is used which maps words to two categories: HighDiggs (=stories that are interesting) and LowDiggs (=stories that are not interesting).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This is an example of a word co-occurrence table from IBM's &lt;a href="http://www.alphaworks.ibm.com/tech/uimodeler"&gt;UI Modeler&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_koDJi0ps7Mw/R2ItLbZBO2I/AAAAAAAAACU/qrLxJ9v1IMI/s1600-h/correlations.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp2.blogger.com/_koDJi0ps7Mw/R2ItLbZBO2I/AAAAAAAAACU/qrLxJ9v1IMI/s400/correlations.JPG" alt="" id="BLOGGER_PHOTO_ID_5143723398765034338" border="0" /&gt;&lt;/a&gt;The statistical significance of the association between words and categories of interestingness is denoted by colors. The more intense the color, the higher the affinity between the word and the category.&lt;br /&gt;&lt;br /&gt;In the example table above we see that people are interested in (and therefore 'digged' more):&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;1) Stories that have pictures&lt;br /&gt;2) US President George W 
Bush&lt;br /&gt;3) Apple Leopard&lt;br /&gt;4) Ron Paul (not shown in table)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;On the other hand people do not 'digg' stories about :&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;1) Microsoft (..!)&lt;br /&gt;2) Blogs (not shown in table)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7330810331389158923?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7330810331389158923/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7330810331389158923' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7330810331389158923'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7330810331389158923'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2007/12/what-people-digg-more-part-3.html' title='What People Digg More? 
- Part 3'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_koDJi0ps7Mw/R2ItLbZBO2I/AAAAAAAAACU/qrLxJ9v1IMI/s72-c/correlations.JPG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-6671641356575249723</id><published>2007-11-06T11:55:00.001+02:00</published><updated>2008-10-15T19:55:49.648+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='digg'/><title type='text'>What people Digg More? - Part 2</title><content type='html'>&lt;div style="text-align: justify;"&gt;After getting some e-mails requesting more details about the way I analyze diggs, here are the details of the process:&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;First of all, some coding is obviously necessary to implement software that sifts through &lt;a href="http://digg.com/"&gt;digg &lt;/a&gt;and records the number of diggs of each story as well as the time that the story has been out. 
The software is also responsible for selecting all stories that have been out for 10-11 days and for calculating the diggs_per_minute metric,&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;where:&lt;br /&gt;&lt;br /&gt;diggs_per_minute = total_diggs / total_minutes&lt;br /&gt;&lt;br /&gt;During the analysis it turned out that the diggs_per_minute metric was not normally distributed: its &lt;a href="http://en.wikipedia.org/wiki/Skewness"&gt;skewness&lt;/a&gt; was positive, with a value of 2.795.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_koDJi0ps7Mw/RzC7CTJvKgI/AAAAAAAAABc/HckBg86HBb0/s1600-h/unnormalized.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp2.blogger.com/_koDJi0ps7Mw/RzC7CTJvKgI/AAAAAAAAABc/HckBg86HBb0/s400/unnormalized.png" alt="" id="BLOGGER_PHOTO_ID_5129805623750240770" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;After applying a log transformation, the skewness dropped to 0.534 and the transformed metric had a mean value of -2.878:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_koDJi0ps7Mw/RzC8IDJvKhI/AAAAAAAAABk/bJr4lEPiu6U/s1600-h/normalized.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp1.blogger.com/_koDJi0ps7Mw/RzC8IDJvKhI/AAAAAAAAABk/bJr4lEPiu6U/s400/normalized.png" alt="" id="BLOGGER_PHOTO_ID_5129806822046116370" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The next step is to create a text file, as follows:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="http://bp2.blogger.com/_koDJi0ps7Mw/RzCCaTJvKbI/AAAAAAAAAAU/9hDLnIr7-NI/s1600-h/diggfile.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp2.blogger.com/_koDJi0ps7Mw/RzCCaTJvKbI/AAAAAAAAAAU/9hDLnIr7-NI/s400/diggfile.jpg" alt="" id="BLOGGER_PHOTO_ID_5129743363904317874" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;Notice that each line (story) ends with the word 'highdiggs' or 'lowdiggs'. If the log-transformed diggs_per_minute metric of a story exceeds the threshold value of -2.878 (the mean reported above), then 'highdiggs' is appended at the end of the line; otherwise 'lowdiggs' is added.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The last step of the analysis is to use a co-occurrence matrix to see which words are associated with high-digg and low-digg stories. A chi-square test is used to test the statistical significance of the word co-occurrences.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;For the last part of the analysis I use a tool called &lt;a href="http://www.alphaworks.ibm.com/tech/uimodeler"&gt;Unstructured Information Modeler&lt;/a&gt; from &lt;a href="http://ibm.com/"&gt;IBM&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-6671641356575249723?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/6671641356575249723/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=6671641356575249723' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6671641356575249723'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6671641356575249723'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2007/11/what-people-digg-more-part-2.html' title='What people Digg More? - Part 2'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_koDJi0ps7Mw/RzC7CTJvKgI/AAAAAAAAABc/HckBg86HBb0/s72-c/unnormalized.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-530315443077295877</id><published>2007-10-16T11:31:00.001+03:00</published><updated>2008-10-15T19:56:15.273+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='digg'/><title type='text'>What people Digg More?</title><content type='html'>&lt;div style="text-align: justify;"&gt;Due to too much work I wasn't able to write on this blog as much as I wanted. Although people continue to answer the questionnaire (over 400 so far!), I haven't been able to carry out any further analysis that would shed some light on the patterns that emerge from living our lives.&lt;br /&gt;&lt;br /&gt;However, I feel that I should write something about my new ventures in text mining. The question that came to my mind was simple:&lt;br /&gt;&lt;br /&gt;"What stories do people tend to digg more?"&lt;br /&gt;&lt;br /&gt;So I collected all stories on &lt;a href="http://www.digg.com/"&gt; digg&lt;/a&gt; and, for each story, recorded the number of diggs and the time that the story has been around. 
By dividing the number of diggs by the total minutes the story has been out, you get a "Diggs_per_Minute" score which essentially designates which stories are "hot" and which are not.&lt;br /&gt;&lt;br /&gt;After the preliminary analysis I immediately found out that it is essential to use data from a specific time period and not just everything. If you think about it, a story should be out for quite a while (say 10 days) so that you are able to get a good estimate of the "Diggs_per_Minute" variable. Stories that have been out for less than 2 days tend to have a much greater Diggs_per_Minute score than older stories.&lt;br /&gt;&lt;br /&gt;So the process is as follows: diggs from stories that have been out for 10-11 days are collected. I then use text mining techniques to find out what words the stories with many diggs have in common. Don't you think that marketing people would love to know this information?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;First results for the most digged stories:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;1) Stories that have pictures tend to be digged more&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;2) Stories containing the phrase "Digg this if you..."&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;3) Stories about specific companies, technologies etc. (e.g. Apple and iPod)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;That's all for now but I will come back with more.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-530315443077295877?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/530315443077295877/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=530315443077295877' 
title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/530315443077295877'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/530315443077295877'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2007/10/what-people-digg-more.html' title='What people Digg More?'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-6793108941301678463</id><published>2007-09-14T15:15:00.001+03:00</published><updated>2008-12-11T00:39:55.358+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='life analytics'/><title type='text'>First Results Out : Phobias</title><content type='html'>&lt;div style="text-align: justify;"&gt;Well over 300 people have described their lives by submitting the questionnaire. I must admit that I couldn't wait for this time to come so that I could start finding out about the patterns that emerge from our lives.&lt;br /&gt;&lt;br /&gt;As explained in another &lt;a href="http://lifeanalytics.blogspot.com/2007/07/on-public-demand-here-is-more.html"&gt;post&lt;/a&gt; regarding classification analysis, the target variable in my first attempt is the class "Phobias". Simply put, what are the common characteristics of people who have phobias?&lt;br /&gt;&lt;br /&gt;It seems that people who answered &gt; 2 on the good-looking scale question are 85% less likely to have phobias. 
Does this make sense to you?&lt;br /&gt;&lt;br /&gt;This is just one example of how interesting patterns may arise from analyzing submitted data. If you have already submitted the questionnaire, why not ask your friends to do the same? The questionnaire can be reached at the following URL:&lt;br /&gt;&lt;br /&gt;&lt;a style="font-weight: bold;" href="http://lifeanalytics.org/MainSurvey"&gt;http://lifeanalytics.org/MainSurvey&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-6793108941301678463?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/6793108941301678463/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=6793108941301678463' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6793108941301678463'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6793108941301678463'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2007/09/first-results-out-phobias.html' title='First Results Out : Phobias'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-820306802334320063</id><published>2007-08-27T12:02:00.001+03:00</published><updated>2009-05-08T19:58:39.182+03:00</updated><title 
type='text'>LifeAnalytics : Over a month online</title><content type='html'>&lt;div style="text-align: justify;"&gt;Time sure passes quickly... it is over a month now since this blog started. Over 200 people have submitted their answers to the &lt;a href="http://lifeanalytics.org/MainSurvey"&gt;questionnaire&lt;/a&gt; so far. Over half of the visitors come from the US, but people from Europe (especially the UK, Germany and the Netherlands) are also producing many hits. A big thanks to kdnuggets (see links area), the best site for data mining news, for listing this blog on its pages.&lt;br /&gt;&lt;br /&gt;The truth is that with over 60 questions, we need more people to fill in the survey, so if you are reading this and you haven't filled in the questionnaire yet, you can submit your answers here:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a style="font-weight: bold;" href="http://lifeanalytics.org/MainSurvey"&gt;http://lifeanalytics.org/MainSurvey&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Very shortly, I will make a new post explaining what other types of analysis can be made, apart from classification and clustering, once we have enough data...&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-820306802334320063?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/820306802334320063/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=820306802334320063' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/820306802334320063'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9150291873749799355/posts/default/820306802334320063'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2007/08/lifeanalytics-over-month-online.html' title='LifeAnalytics : Over a month online'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-6132845783027433887</id><published>2007-07-26T14:26:00.000+03:00</published><updated>2007-07-31T11:00:51.230+03:00</updated><title type='text'>By public demand, here is more information</title><content type='html'>&lt;div style="text-align: justify;"&gt;The first answers are already coming in. A big thanks to everyone who has already submitted answers to our &lt;a href="http://lifeanalytics.org/MainSurvey"&gt;questionnaire&lt;/a&gt;. I get quite a few e-mails asking me how I am going to use the results, so I feel it is time for more explanation.&lt;br /&gt;&lt;br /&gt;The process of Data Mining consists of the following steps (simplified):&lt;br /&gt;&lt;br /&gt;1) Data Collection&lt;br /&gt;2) Data Preparation&lt;br /&gt;3) Analysis&lt;br /&gt;4) Application of results&lt;br /&gt;&lt;br /&gt;Currently we are on step (1), collecting the data. Although the questions in your e-mails are about step (4), I feel that it is important to talk about step (3) as well:&lt;br /&gt;&lt;br /&gt;By applying specialized algorithms, we try to find the common characteristics of a specific &lt;span style="font-style: italic;"&gt;class&lt;/span&gt; in the data (a task known as &lt;a href="http://en.wikipedia.org/wiki/Statistical_classification"&gt;classification&lt;/a&gt;). 
By &lt;span style="font-style: italic;"&gt;class &lt;/span&gt;we mean a category. For "happiness" we have two possible categories of people: those that are generally happy and those that are not. In the same manner several other classes of people exist.&lt;br /&gt;&lt;br /&gt;Once we decide which class to analyze (say married/divorced), it is time to perform the analysis with the goal of creating a &lt;a href="http://en.wikipedia.org/wiki/Predictive_modelling"&gt;predictive model&lt;/a&gt;.&lt;span style="font-style: italic;"&gt; &lt;/span&gt;Once a reasonably accurate model is created, &lt;span style="font-style: italic;"&gt;we are ready to predict new, unseen cases of people.  &lt;/span&gt;That essentially means that when new users submit the questionnaire, they will be given a score (a percentage) expressing the probability of getting a divorce. Note that "divorce" is just one class; predictions can be made for other classes too, once a predictive model has been created for each class. But that's not all. Not only are we able to predict new cases but, with specific algorithms, we are also able to find out &lt;span style="font-style: italic;"&gt;what &lt;/span&gt;is important for an outcome (i.e. getting a divorce) and how important each parameter is (where each parameter is a question on the questionnaire).&lt;br /&gt;&lt;br /&gt;In our example rule:&lt;br /&gt;&lt;br /&gt;&lt;span&gt;&lt;span style="font-style: italic;"&gt;IF AGE &gt;31 AND AGE&lt;=40 AND NUM_OF_CHILDREN = 0 THEN DIVORCE="TRUE"    &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Not only do we realize that age and the number of children are important in getting a divorce, but we are also able to know &lt;span style="font-style: italic;"&gt;the relative importance of each parameter&lt;/span&gt;.  
For example, a model may tell us that the most important factor in getting a divorce is, &lt;span style="font-style: italic;"&gt;first of all&lt;/span&gt;, having children and that &lt;span style="font-style: italic;"&gt;the second factor in importance&lt;/span&gt; is age.&lt;br /&gt;&lt;br /&gt;Of course there are other kinds of analysis which do not search for common characteristics but seek to find &lt;a href="http://en.wikipedia.org/wiki/Association_rule_learning"&gt;associations&lt;/a&gt; between variables. Another type of analysis finds homogeneous groups through &lt;a href="http://en.wikipedia.org/wiki/Data_clustering"&gt;clustering&lt;/a&gt;, but we will leave examples of these types of analysis for another post.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-6132845783027433887?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/6132845783027433887/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=6132845783027433887' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6132845783027433887'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6132845783027433887'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2007/07/on-public-demand-here-is-more.html' title='By public demand, here is more information'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' 
height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-7392113502970471971</id><published>2007-07-20T09:15:00.001+03:00</published><updated>2008-12-11T00:42:33.193+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='life analytics'/><title type='text'>Questionnaire is Ready</title><content type='html'>&lt;div style="text-align: justify;"&gt;Although I thought it would take a lot of time, the questionnaire is ready. I used a great tool called &lt;a href="http://sourceforge.net/project/screenshots.php?group_id=80811"&gt;websurveytoolbox&lt;/a&gt; to create the web questionnaire. By using it, I managed to have the questionnaire ready in less than 2 days. All data and questions of the questionnaire are saved in a &lt;a href="http://www.mysql.com/"&gt;MySQL&lt;/a&gt; database and the tool automatically creates &lt;a href="http://en.wikipedia.org/wiki/JavaServer_Pages"&gt;JSP&lt;/a&gt; pages with the questions. I also registered the domain lifeanalytics.org, although there is not much to see there for now, except that it hosts the necessary code and database for the questionnaire.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-weight: bold;"&gt;To fill in the questionnaire please visit&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;a style="font-weight: bold;" href="http://lifeanalytics.org/MainSurvey"&gt;http://lifeanalytics.org/MainSurvey&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;I timed how long it takes to fill out the questionnaire: assuming you know "Everyday English", it shouldn't take more than 5 minutes to complete. 
The questionnaire comprises all kinds of questions, with the (tough) aim of describing one's life as well as possible in terms of facts, personal decisions and points of view.&lt;br /&gt;&lt;br /&gt;As a last note, I would like to stress that no personal data is requested, not even your e-mail. If you feel that this effort is worth it, please let other people know. You can even &lt;a href="http://www.digg.com/software/LifeAnalytics_Predict_how_your_life_will_be"&gt;digg&lt;/a&gt; it to spread the story ;-)&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-7392113502970471971?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/7392113502970471971/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=7392113502970471971' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7392113502970471971'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/7392113502970471971'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2007/07/questionnaire-is-ready.html' title='Questionnaire is Ready'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' 
src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-840205959890493236</id><published>2007-07-18T20:21:00.002+03:00</published><updated>2007-07-21T18:42:49.790+03:00</updated><title type='text'>Life, Uncertainty and Mathematics</title><content type='html'>&lt;div style="text-align: justify;"&gt;One of the immediate questions that may arise is "How on earth are you going to predict if someone is going to get a divorce?".&lt;br /&gt;&lt;br /&gt;So, what I am essentially trying to do here is to model life and its uncertainties with Mathematics... Now, is this possible?&lt;br /&gt;&lt;br /&gt;Certainly there are hundreds of factors that could play a role in getting a divorce. A questionnaire of 100 questions or fewer cannot capture all the facts of a person's life. But my goal is to just give it a try and see how it goes. Perhaps tens or even hundreds of thousands of answers may be able to give us a clue as to what is happening.&lt;br /&gt;&lt;br /&gt;Each extracted rule (see the previous &lt;a href="http://lifeanalytics.blogspot.com/2007/07/lifeanalytics-blog-has-started.html"&gt;post&lt;/a&gt; for what I mean by rules) will be tested for statistical validity through &lt;a href="http://en.wikipedia.org/wiki/Chi-square_test"&gt;chi-square tests&lt;/a&gt;, with adjustments made through &lt;a href="http://en.wikipedia.org/wiki/Bonferroni"&gt;Bonferroni correction&lt;/a&gt;. Several other techniques will be used to assess the quality of the extracted models. In other words, if there is something there, we will find it.&lt;br /&gt;&lt;br /&gt;Once a model is produced and is reasonably accurate, we will be ready to predict unseen cases. 
In other words, and continuing our divorce example, if a model is 80% correct in predicting whether someone will get a divorce, then anyone who fills in the questionnaire at the end of the process will also find out the probability of getting a divorce. More importantly: &lt;span style="font-style: italic;"&gt;why &lt;/span&gt;he or she is likely to get one.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-840205959890493236?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/840205959890493236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=840205959890493236' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/840205959890493236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/840205959890493236'/><link rel='alternate' type='text/html' href='http://lifeanalytics.blogspot.com/2007/07/life-uncertainty-and-mathematics.html' title='Life, Uncertainty and Mathematics'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9150291873749799355.post-6228812806272056107</id><published>2007-07-17T14:53:00.002+03:00</published><updated>2008-12-11T00:40:35.486+02:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='life analytics'/><title type='text'>LifeAnalytics blog has started</title><content type='html'>&lt;div style="text-align: justify;"&gt;I finally made the decision to start the LifeAnalytics blog. Hopefully, many people will benefit from the research findings on the patterns that emerge from just &lt;span style="font-style: italic;"&gt;living&lt;/span&gt;&lt;span&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;What is LifeAnalytics? Simply put, I will be using analytical techniques (especially &lt;a href="http://en.wikipedia.org/wiki/Statistical_classification"&gt;classification&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Association_rule_learning"&gt;association discovery&lt;/a&gt;), collectively known as &lt;a href="http://en.wikipedia.org/wiki/Data_mining"&gt;Data Mining&lt;/a&gt;, to understand key facts about a person's life. For example, what are the common characteristics of people who are divorced? What factors play an important role in increasing the risk of getting a divorce?&lt;br /&gt;&lt;br /&gt;Of course, getting a divorce is just one possible outcome for someone who is married. Several other facets and facts make up our lives. As an example, consider the following life facts:&lt;br /&gt;&lt;br /&gt;- Having a good marriage&lt;br /&gt;- Being happy about work&lt;br /&gt;- Having phobias&lt;br /&gt;- Having an above-average salary&lt;br /&gt;- Being depressed&lt;br /&gt;&lt;br /&gt;The goal, then, is to look at all of those possible outcomes in one's life and try to extract "rules" that increase (or decrease) the probability of experiencing the above facts. 
In order to do this, thousands of people must somehow describe their lives and the idiosyncrasies of their character by submitting a questionnaire.&lt;br /&gt;&lt;br /&gt;Example: by analyzing thousands of people's life facts (submitted via the questionnaire), we may extract the following rule:&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;IF AGE &gt;31 AND AGE&lt;=40 AND NUM_OF_CHILDREN = 0 THEN DIVORCE="TRUE"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In other words, the example rule above says that people between 32 and 40 years old without children have an increased probability (say, for example, 82%) of getting a divorce. Findings and conclusions like the example shown above will be given to anyone interested &lt;/span&gt;(free of charge, of course) through this blog.&lt;br /&gt;&lt;br /&gt;Think about it. Living our lives creates "data", and together with the "data" of thousands of others we may find some really interesting answers.&lt;br /&gt;&lt;br /&gt;Stay tuned, the journey to this kind of knowledge has, hopefully, just begun...&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9150291873749799355-6228812806272056107?l=lifeanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lifeanalytics.blogspot.com/feeds/6228812806272056107/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9150291873749799355&amp;postID=6228812806272056107' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6228812806272056107'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9150291873749799355/posts/default/6228812806272056107'/><link rel='alternate' type='text/html' 
href='http://lifeanalytics.blogspot.com/2007/07/lifeanalytics-blog-has-started.html' title='LifeAnalytics blog has started'/><author><name>Themos Kalafatis</name><uri>http://www.blogger.com/profile/14323291739097798038</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_koDJi0ps7Mw/TLvrneEoO1I/AAAAAAAAAbY/vDxsCjZAISQ/S220/DSC07812.jpg'/></author><thr:total>4</thr:total></entry></feed>
