Monday, February 23, 2015

Extracting Insights from Consumer Reviews

Here is one more example of how we can extract Insights from Consumer Reviews. This time we will use Reviews that were given for several Supplement Brands of Omega-3 Fish Oil.

For this example we analyze 4018 Reviews from Consumers who bought Omega-3 Supplements. Keep in mind that in most cases each Product Review has an associated Rating (usually given as 1-5 stars) which signifies the overall satisfaction of each Consumer. Therefore, after collecting the Reviews and Ratings we have a file with the following entries per row :

[Text of Review,Rating]

The fact that a Customer also gives a Score is especially helpful because we can identify the words and Phrases that differentiate Positive experiences (i.e. those having 5 Star Ratings) from Negative ones (we assume that any Review having a Rating of 4 stars or less is Negative). So, for example, Positive Reviews may contain mostly words and phrases such as "Great", "Happy" and "Will buy again", whereas Negative Reviews may contain words and phrases such as "Never buying again", "not happy" or "damaged".

The tools used for this example are NLTK and Python. The code simply reads the reviews and their associated ratings and creates a Matrix with the same representation as the file it read.

Next, we want to identify which Insights we can extract from this representation. For example :

-Identify which words commonly occur in 5-star reviews
-Identify which words commonly occur in Reviews with a rating of 4 Stars or Lower.
-Identify potentially Interesting Phrases and Words
-Extract term Co-Occurrences
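
The first two items in the list above can be sketched with nothing more than the Python standard library. This is a minimal illustration and not the code used for this post: the sample rows, the tokenizer and the function names are all assumptions, and only the 5-star cut-off follows the text.

```python
import re
from collections import Counter

# Hypothetical sample rows in the [Text of Review, Rating] layout described above.
rows = [
    ("Great product, will buy again", 5),
    ("Great price and no fishy taste", 5),
    ("Fishy taste and rancid smell", 2),
    ("Not sure it works, fishy burps", 3),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_counts_by_class(rows, positive_rating=5):
    """Count, per class, the number of reviews each term appears in."""
    positive, negative = Counter(), Counter()
    for text, rating in rows:
        target = positive if rating >= positive_rating else negative
        target.update(set(tokenize(text)))  # set(): count each term once per review
    return positive, negative

positive, negative = term_counts_by_class(rows)
# Terms seen in more negative reviews than positive ones, most frequent first.
negative_terms = [term for term, n in negative.most_common() if n > positive[term]]
print(negative_terms[:5])
```

Comparing raw counts only makes sense when both classes have a similar number of reviews; with unbalanced classes, dividing each count by the class size would be a safer comparison.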

We start with terms occurring more frequently in Negative Reviews for Omega-3 Supplements. Here is what we've found :






So it appears that people tend to give negative Reviews when the Taste (and possibly After-Taste) is not quite right. A lot of people complain about a Fishy odor. Notice also that the 3rd Term is "sure", which we can assume originates from customers saying that they are not sure whether the Product works or not (notice also that the 4th term is "yet"). Some more terms to consider :

however
rancid
krill (a type of Oil which is an alternative Product to Omega-3 Supplementation)
soy
stick


Now let's look at the Terms associated with Positive Reviews :




"great" and "excellent" are terms that were expected to be found in Positive Reviews. Some terms to consider are :

price
quality
brain
triglycerides
cholesterol

We move on to identifying potentially interesting terms and Phrases. Here is a Screenshot from the Software that I used :







I added a Red Rectangle wherever sensitive information (such as Company Names) appears which for the purpose of this post is not relevant (but it certainly is relevant in a different setting).

We immediately see some interesting mentions, for example : Heavy Metal poisoning, Upset Stomach incidents, Cognitive Function, Joint Pains, Panic Attacks, Reasonably Priced Items, Postpartum Depression, Allergic Reactions, Speedy Delivery and Soft Gels that Stick together.

Recall that earlier we found that the term "however" occurs frequently within Negative Reviews. Some analysts may have chosen to treat this term as a stopword, which in this case would be a serious mistake. The reason is that the term "however" very often introduces the reason for which a product or service is not receiving a perfect rating, and vice-versa. Therefore, if a Data Scientist had chosen to exclude this term from the Analysis (stopwords are typically removed from the text), potentially interesting insights would never have surfaced.

Ideally, we would like to know the context that follows the term "however" whenever this term occurs within a negative review. That will help us focus on all occurrences of "however" with negative sentiment. To do this, we take into account only reviews containing the term "however" and having a Rating of 3 stars or less. It appears that the most common terms occurring after "however" were Fishy odor and After-taste. In other words, fishy odor is a main reason that keeps Customers from giving a 5-star Rating.
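
One way to look at the context that follows a term is to count the words in a small window after each occurrence. The sketch below assumes reviews are (text, rating) pairs. The window size, the function name and the sample reviews are illustrative choices, not taken from the original analysis.

```python
import re
from collections import Counter

def words_after(term, reviews, max_rating=3, window=4):
    """Count the words in the `window` tokens following `term`,
    for reviews rated `max_rating` or lower."""
    counts = Counter()
    for text, rating in reviews:
        if rating > max_rating:
            continue
        tokens = re.findall(r"[a-z'-]+", text.lower())
        for i, token in enumerate(tokens):
            if token == term:
                counts.update(tokens[i + 1 : i + 1 + window])
    return counts

# Hypothetical reviews, for illustration only.
reviews = [
    ("Good value, however the fishy odor is strong", 3),
    ("Works well however there is a fishy after-taste", 2),
    ("Great, however a bit pricey", 5),  # skipped: rating above max_rating
]
print(words_after("however", reviews).most_common(3))
```

Stopwords are deliberately left in here; in practice they would be filtered out of the window after the target term has been located, not before.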

On the other hand, phrases such as highly recommend are interesting because we may use co-occurrence analysis to see which terms co-occur with a highly recommended product.

Of course this is by no means the end of what we can do. To extract even better insights we have to spend significantly more time on proper Pre-processing, use Information Extraction and apply several other techniques to analyze Text Data in novel and potentially interesting ways.



Thursday, October 16, 2014

Sequence Data Mining for Health Applications

An often overlooked type of Analysis is Sequence Data Mining (or Sequential Pattern Mining).


Sequence Data Mining is a type of Analysis which aims at extracting patterns from sequences of Events. We can also see Sequence Data Mining as an Associations Discovery Analysis with a Temporal Element.

Sequence Data Mining has many potential applications (Web Page Analytics, Complaint Events, Business Processes) but today we will show an application for Health. I believe that this type of Analysis will become even more important as wearable technology becomes more widely used and therefore more Data of this kind is generated.

Consider the following hypothetical scenario : 

A 30-year-old Male patient complains about several symptoms which, for simplicity, we will name Symptom1, Symptom2, Symptom3, etc.

His Doctor tries to identify what is going on, but after the patient has all the necessary Blood work done, no problems are found. After thorough evaluation the Doctor believes that his patient suffers from Chronic Fatigue Syndrome. Under the Doctor's supervision the patient will record his symptoms along with the different supplements he takes, to understand more about his condition. Several events (e.g. a Visit to the Gym, a stressful Event) will also be taken into consideration to see if any patterns emerge.

-How Can we easily record Data for the scenario above?
-Can we extract sequences of events that occur more frequently than mere chance?
-Can we identify which sequences of Events / Food / Medication may potentially lead to specific Symptoms or to a lack of Symptoms?


Looking at the problem through the eyes of a Data Scientist, we have :

A series of Events that happen during a day : A Stressful event, A sedentary day, Cardio workouts, Weight Lifting, Abrupt Weather Deterioration, etc

A Number of Symptoms : Headaches, "Brain Fog", Mood problems, Insomnia, Arthralgia, etc.


Let's begin with Data Collection. We first suggest that the patient use an Android app called MyLogsPro (or some other equivalent application) to easily input information as it happens :


  
So if the patient feels a specific Symptom he will press the relevant Symptom button on his mobile device. The same applies to any events that have happened and any Food or Medication taken. As the day passes we have the following data collected :



The snapshot shows what happened starting on the 20th of August 2014: upon waking up, our patient logged the intake of Medication and/or Supplements (at 08:22 AM), then a Food entry was added at 08:47. At 11:06 the patient had a Symptom and immediately reached for his phone and pressed the relevant Symptom button (Symptom No 4).

After many days of Data Collection we decide that it's time to analyze this information. We export the data from the application as a csv file which looks as follows :



We will use KNIME to read the csv file, change the contents of the entries accordingly so that an Algorithm can read the events and then perform Sequence Data Mining. We have the following layout :



The File Reader reads the .csv file. Then, in the Pre-processing block (shown in yellow), a String Manipulation node removes the colon (:) from the time field (e.g. 12:10 becomes 1210), the Sorter sorts the data by date and then by time, and a Java snippet uses the replaceAll() function to remove all leading zeros from the time field (e.g. 0010 becomes 10).
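
For readers who do not use KNIME, the same pre-processing steps can be sketched in Python. This is only an approximation of the workflow: the date,time,event column layout and the day/month/year date order are assumptions based on the sample entries shown in this post, and the function name is made up for the example.

```python
import csv
import io

# Hypothetical export in a date,time,event layout (the real export may differ).
raw = """10/09/14,08:00,Medication1
10/09/14,00:10,Symptom3
09/09/14,21:05,Symptom2
"""

def preprocess(csv_text):
    """Return (date, time_as_int, event) rows sorted by date, then time."""
    rows = []
    for date, time, event in csv.reader(io.StringIO(csv_text)):
        # Assumption: dates are written day/month/two-digit year.
        day, month, year = (int(part) for part in date.split("/"))
        # Dropping the colon and converting to int also drops leading zeros:
        # "00:10" -> "0010" -> 10.
        rows.append(((year, month, day), int(time.replace(":", "")), event))
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows

for row in preprocess(raw):
    print(row)
```

Storing the date as a (year, month, day) tuple makes the sort chronological, which a plain string sort on day/month/year text would not guarantee.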

The R Snippet loads the cSPADE Algorithm and then uses this Algorithm to extract sequence patterns.


After executing the stream we get the following output :


The information consists of two outputs : The first one is a list of sequences along with their support and the second one contains the output from rule induction which gives us two more useful metrics (namely the lift and the confidence for each rule).

We immediately notice an interesting entry on the first output :

Medication1->Symptom2

and on the second output we see that this particular rule has a lift of 1.4 and 0.8 confidence.
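
To make the three metrics concrete, here is a small sketch that computes support, confidence and lift for a rule of the form A->B over a set of daily event sequences. It uses the common textbook definitions (confidence = support of the sequence divided by support of the antecedent, lift = confidence divided by support of the consequent); the exact definitions used by the R implementation should be checked in its documentation. The data and function names are hypothetical.

```python
def occurs_in_order(sequence, first, second):
    """True if `first` appears somewhere before `second` in the sequence."""
    if first not in sequence:
        return False
    return second in sequence[sequence.index(first) + 1 :]

def rule_metrics(sequences, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.
    Assumes the antecedent and consequent each occur in at least one sequence."""
    n = len(sequences)
    support_rule = sum(occurs_in_order(s, antecedent, consequent) for s in sequences) / n
    support_antecedent = sum(antecedent in s for s in sequences) / n
    support_consequent = sum(consequent in s for s in sequences) / n
    confidence = support_rule / support_antecedent
    lift = confidence / support_consequent
    return support_rule, confidence, lift

# Hypothetical data: one ordered list of events per day.
days = [
    ["Medication1", "Food", "Symptom2"],
    ["Medication1", "Symptom2"],
    ["Food", "Symptom4"],
    ["Medication1", "Food"],
    ["Symptom2", "Food"],
]
support, confidence, lift = rule_metrics(days, "Medication1", "Symptom2")
print(round(support, 2), round(confidence, 2), round(lift, 2))
```

A lift above 1 means that Symptom2 follows Medication1 more often than its overall frequency would suggest, which is how the lift of 1.4 reported above can be read.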

However, as Data Scientists we should always double-check the extracted knowledge and must be aware of pitfalls. Let's see some examples (list not exhaustive) :

1) The algorithm does not account for time as it should : As an example, consider the following entries :

10/09/14,08:00,Medication1
10/09/14,08:05,Symptom2

We assume that Medication1 is taken by mouth and needs 60 minutes to be properly dissolved, and that these entries occur frequently enough in that order in our data set. Even though the algorithm might show a statistically significant pattern, it is not logical to hypothesize that Medication1 could be related to Symptom2, because only five minutes separate the two entries. The Analyst should first examine these entries to see what proportion of the records has a time difference of at least (say) 60 minutes.
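
The check described above can be sketched as follows. The code measures, for every Medication1 entry, the delay until the next Symptom2 entry on the same day and then counts how many delays reach the 60-minute mark. The date format and the helper names are assumptions made for the example.

```python
from datetime import datetime, timedelta

def parse(date, time):
    # Assumption: day/month/two-digit-year dates, as in the entries above.
    return datetime.strptime(f"{date} {time}", "%d/%m/%y %H:%M")

def gaps_between(events, cause, effect):
    """For each `cause` entry, return the delay until the next `effect`
    entry logged on the same day (events must be sorted by time)."""
    gaps = []
    for i, (when, name) in enumerate(events):
        if name != cause:
            continue
        for later_when, later_name in events[i + 1 :]:
            if later_when.date() != when.date():
                break
            if later_name == effect:
                gaps.append(later_when - when)
                break
    return gaps

# Hypothetical log, already sorted by time.
events = [
    (parse("10/09/14", "08:00"), "Medication1"),
    (parse("10/09/14", "08:05"), "Symptom2"),
    (parse("11/09/14", "08:00"), "Medication1"),
    (parse("11/09/14", "09:30"), "Symptom2"),
]
gaps = gaps_between(events, "Medication1", "Symptom2")
plausible = [gap for gap in gaps if gap >= timedelta(minutes=60)]
print(f"{len(plausible)} of {len(gaps)} pairs are at least 60 minutes apart")
```

An upper bound on the delay could be added to the same filter to rule out pairs that are too far apart to be related.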

Apart from the example shown above we must consider the opposite effect. Consider this entry :

10/09/14,08:00,Medication1
...
...
...
10/09/14,21:05,Symptom2

In other words : Is it possible for a Medication taken in the morning to generate a Symptom about 13 hours later?


2) The algorithm is not able to account for the compounding effect of a Medication. For example, the patient might have low levels of Taurine and for this level to be replenished, an x amount of days of Taurine supplementation is needed. The algorithm cannot account for this possibility.


3) The patient should also input entries of "No Symptoms". It is not clear, however, when this should be done (e.g. at the end of each day? Or assess every 6 hours and add 2 entries accordingly?)


However, this does not mean that a Sequence Mining algorithm should not be used under these circumstances. This technique can generate several potentially interesting hypotheses which Doctors and/or Researchers may wish to pursue further.
 




Thursday, July 3, 2014

Becoming a Data Scientist : A RoadMap

I receive a lot of questions regarding which books one should read to become a Data Miner / Data Scientist. Here is a suggested reading list and also a proposed RoadMap (apart from the requirement of having an appropriate University degree) for becoming a Data Scientist.

Before going further, it appears that a Data Scientist should possess an awful lot of skills : Statistics, Programming, Databases, Presentation Skills, Knowledge of Data Cleaning and Transformations.
 

The skills that ideally you should acquire are as follows :

1) Sound Statistical Understanding and Data Pre-Processing
2) Know the Pitfalls : You must be aware of the Biases that could affect you as an analyst and also the common mistakes made during Statistical Analysis
3) Understand how several Machine Learning / Statistical Techniques work.
4) Time Series Forecasting
5) Computer Programming (R, Java, Python, Scala)
6) Databases (SQL and NoSQL Databases)
7) Web Scraping (Apache Nutch, Scrapy, JSoup)
8) Text Data




Statistical Understanding : A good Introductory Book is Fundamental Statistics for the Behavioral Sciences by Howell. Also IBM SPSS for Introductory Statistics - Use and Interpretation and IBM SPSS for Intermediate Statistics by Morgan et al. Although all of these books (especially the latter two) are heavy on IBM SPSS Software, they provide a good introduction to key statistical concepts, while the books by Morgan et al. give a methodology to use along with a practical example of analyzing the High School and Beyond Dataset.

Data Pre-Processing : I must re-iterate the importance of thoroughly checking and identifying problems within your Data. Data Pre-processing guards against the possibility of feeding erroneous data to a Machine Learning / Statistical Algorithm but also transforms data in such a way so that an algorithm can extract/identify patterns more easily. Suggested Books :

  • Data Preparation for Data Mining by Dorian Pyle
  • Mining Imperfect Data: Dealing with Contamination and Incomplete Records by Pearson
  • Exploratory Data Mining and Data Cleaning by Johnson and Dasu


Know the Pitfalls : There are many cases of Statistical Misuse and biases that may affect your work even if, at times, you are not consciously aware of them. This has happened to me on various occasions. Actually, this blog contains a couple of examples of Statistical Misuse even though I tried (and keep trying) to highlight limitations due to the nature of Data as much as I can. Big Data is another technology where caution is warranted. For example, see : Statistical Truisms in the Age of Big Data and The Hidden biases of Big Data.

Some more examples :

-Quora Question : What are common fallacies or mistakes made by beginners in Statistics / Machine Learning / Data Analysis

-Identifying and Overcoming Common Data Mining Mistakes by SAS Institute

The following Book is suggested :

  • Common Errors in Statistics (and How to Avoid Them) by P. Good and J. Hardin

In case you are into Financial Forecasting I strongly suggest reading Evidence-Based Technical Analysis by David Aronson, which is heavy on how Data Mining Bias (and several other cognitive biases) may affect your Analysis.


Understand how several Machine Learning / Statistical Algorithms work : You must be able to understand the pros and cons of each algorithm. Does the algorithm that you are about to try handle noise well? How does it scale? What kind of optimizations can be performed? Which Data transformations are necessary? Here is an example for fine-tuning Regression SVMs :

Practical Selection of SVM Parameters and Noise Estimation for SVM Regression 

Another book which deserves attention is Applied Predictive Modeling by Kuhn and Johnson, which also gives numerous examples of using the caret R Package which, among other things, has extensive Parameter Optimization capabilities.


When it comes to getting to know Machine Learning / Statistical Algorithms I'd suggest the following books :

  • Data Mining : Practical Machine Learning Tools and Techniques by Witten and Frank
  • The Elements of Statistical Learning by Hastie, Tibshirani and Friedman


Time Series Forecasting : In many situations you might have to identify and predict trends from Time Series Data. A very good Introductory Book is Forecasting : Principles and Practice by Hyndman and Athanasopoulos which contains sections on Time Series Forecasting. Time Series Analysis and its Applications with R Examples by Shumway and Stoffer is another book with Practical Examples and R Code as the title suggests.

In case you are more interested in Time Series Forecasting I would also suggest the ForeCA (Forecastable Component Analysis) R package written by Georg Goerg (working at Google at the moment of writing), which tells you how forecastable a Time Series is (Ω = 0 : white noise, therefore not forecastable; Ω = 100 : sinusoid, perfectly forecastable).

Computer Programming Knowledge : This is another essential skill. It allows you to use several Data Science Tools/APIs that require mainly Java and Python skills. Scala also appears to be becoming an important Programming Language for Data Science. R Knowledge is considered a "must". Having prior knowledge of Programming gives you the edge if you wish to learn a new Programming Language. You should also constantly be looking for Trends in programming language requirements (see Finding the right Skillset for Big Data Jobs). It appears that, currently, Java is the most sought-after Computer Language, followed by Python and SQL. It is also useful to look at Google Trends, but interestingly "Python" is not available as a Programming Language Topic at the moment of writing.

Database Knowledge : In my experience this is a very important skill to have. More often than not, Database Administrators (or other IT Engineers) who are supposed to extract Data for you are just too busy to do that. That means that you must have the knowledge to connect to a Database, Optimize a Query and perform several Queries/Transformations to get the Data that you want in the format that you want.

Web Scraping : This is a useful skill to have. There are tons of useful Data which you can access if you know how to write code to access and extract information from the Web. You should get to know HTML Elements and XPath. Some examples of Software that can be used for this purpose :

-Scrapy
-Apache Nutch
-JSoup

Text Data: Text Data contain valuable information : Consumer Opinions, Sentiment, Intentions to name just a few. Information Extraction and Text Analytics are important Technologies that a Data Scientist should ideally know.

Information Extraction :

-GATE
-UIMA

Text Analytics :

-The "tm" R Package
-LingPipe
-NLTK

The following Books are suggested :

  • Introduction to Information Retrieval by Manning, Raghavan and Schütze
  • Handbook of Natural Language Processing by Indurkhya, Damerau (Editors)
  • The Text Mining HandBook - Advanced Approaches in Analyzing Unstructured Data by Feldman and Sanger

Finally here are some Books that should not be missed by any Data Scientist :

  • Data Mining and Statistics for Decision Making by Stéphane Tufféry (A personal favorite)
  • Introduction to Data Mining by Tan, Steinbach, Kumar 
  • Applied Predictive Modeling by Kuhn and Johnson
  • Data Mining with R - Learning with Case Studies by Torgo
  • Principles of Data Mining by Bramer


Thursday, February 13, 2014

Analyzing PubMed Entries with Python and NLTK

I decided to take my first steps in learning Python with the following task : Retrieve all entries from PubMed for a given query and then analyze those entries using Python and the Text Mining library NLTK.

We assume that we are interested in learning more about a condition called Sudden Hearing Loss. Sudden Hearing Loss is considered a medical emergency and has several causes, although usually it is idiopathic (according to Wikipedia, a disease or condition whose cause is not known or that arises spontaneously).

At the moment of writing, the PubMed Query for sudden hearing loss returns 2919 entries :

  


To collect this information from PubMed we can use the Entrez module from Biopython, which enables us to save to a csv file the Title (Column A) and the Abstract (Column B) of each PubMed entry :


Now that we have the PubMed entries in place we may start Text Analysis with the NLTK Toolkit. The Python code simply reads the csv file which was created, removes stop words and uses a simple function to search and replace specific words. For example, for this type of Data it is a good idea to replace occurrences of :

genes to gene,
induced, induces, inducing to induction,
antagonists to antagonist,
agonists to agonist,
..etc.
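
A search-and-replace step of this kind can be sketched with a lookup table and a regular expression. The table below contains only the replacements listed above; the function name is illustrative, and this is not the original code.

```python
import re

# Replacement table built from the examples above; extend as needed.
REPLACEMENTS = {
    "genes": "gene",
    "induced": "induction",
    "induces": "induction",
    "inducing": "induction",
    "antagonists": "antagonist",
    "agonists": "agonist",
}

def normalize(text, replacements=REPLACEMENTS):
    """Lowercase the text and replace whole words found in the table."""
    def swap(match):
        word = match.group(0)
        return replacements.get(word, word)
    return re.sub(r"[a-z]+", swap, text.lower())

print(normalize("Genes inducing agonists"))  # prints "gene induction agonist"
```

Matching whole words matters here: a plain substring replace of "agonists" would also rewrite part of "antagonists".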

This pre-processing work makes the retrieval of (possibly) interesting findings more efficient. For this example we want to find Collocations of terms and to do this we will use the BigramCollocationFinder from the NLTK Toolkit. After running the Bigram Collocator the program prints the top 100 most-important word-pairs scored using Pointwise Mutual Information :


Let's try to "relax" our requirements by increasing the amount of words that are fed to our analysis. Here are our new results :


We immediately notice the differences between the first analysis and this one, since in the second instance we see many more potentially interesting word pairs (more Medical Conditions, Substances, etc. are shown) than in the first set of results.
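
To show what Pointwise Mutual Information measures, here is a from-scratch sketch of bigram scoring. It is not NLTK's implementation and not the code behind the results above. It estimates the pair probability as the pair count divided by the number of tokens, and the `min_count` parameter is an assumed stand-in for whatever frequency filter was used in the two runs.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by PMI = log2(p(x, y) / (p(x) * p(y)))."""
    total = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), n_xy in pair_counts.items():
        if n_xy < min_count:  # rare pairs get inflated PMI scores, so drop them
            continue
        scores[(x, y)] = math.log2(n_xy * total / (word_counts[x] * word_counts[y]))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical token stream standing in for the cleaned abstracts.
tokens = ("sudden hearing loss gene polymorphism hearing loss "
          "treatment sudden hearing loss gene").split()
for pair, score in pmi_bigrams(tokens):
    print(pair, round(score, 2))
```

Lowering `min_count` lets more word pairs into the ranking, at the price of more noise from pairs that were seen only once or twice.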

Let's suppose that we are interested in finding which gene(s) could be involved. To do this a Python function is used which scores the co-occurrence of a word of interest (in this case 'gene') with other words.
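
The function itself is not shown in this post, so the following is only a guess at its general shape: it counts the words that appear within a fixed window around a word of interest. The raw counts used here are a simplification of whatever scoring the original function applies, and all names and sample documents are hypothetical.

```python
from collections import Counter

def cooccurrences(documents, target, window=3):
    """Count words appearing within `window` tokens on either side of `target`."""
    counts = Counter()
    for tokens in documents:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1 : i + 1 + window]
            counts.update(neighbours)
    return counts

# Hypothetical tokenized abstracts.
documents = [
    "mthfr gene polymorphism sudden hearing loss".split(),
    "the mthfr gene and homocysteine levels".split(),
]
print(cooccurrences(documents, "gene", window=2).most_common(3))
```

Widening `window` brings in words that sit further away from the word of interest.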

Here are the results :



The result is not very useful. Nevertheless, it reminds us that we should probably replace polymorphisms with polymorphism.

We may then decide to relax again the way in which bigrams are created: we increase the number of subsequent words that are searched and re-run the code :



The Top-Rated results from the 'gene' analysis return a term named MTHFR, which is the gene for Methylenetetrahydrofolate reductase. The same happens with the occurrences of the genes in the bigram co-occurrence analysis just before our 'gene' inspection. We also notice that Co-enzyme Q10 (a well known and popular supplement) shows at the top of the list. After a bit of searching within PubMed entries it was found that CoQ10 is used for treatment of Sudden Hearing Loss and also that CoQ10 was found in low concentrations in people having this condition.

We can use BioGraph to submit a query for sudden hearing loss and see which concepts are associated with this Condition :



So MTHFR was found on BioGraph as well; however, at the moment of writing CoQ10 was not in the list (not shown because of its length). We submit a query to the same engine for CoQ10, this time filtering specifically for diseases :




Again - at the moment of writing - Sudden hearing loss was not found on BioGraph as an associated condition. Of course it is not suggested here that BioGraph entries are incomplete. Different types of Analysis may be used, and the data that I used for this example were much more targeted to the specific problem (which in its own right should warrant extreme caution).

BioGraph is a wonderful resource (more on this later) which enables researchers to form several hypotheses (notice the Known and Inferred keywords in the results) with which new solutions to medical problems may be found.

The subject of the analysis was not random. For more than 2 years a person whom I will not name had several incidents of Sudden Hearing Loss which, luckily, were not permanent. Several ENTs examined him and dismissed this event as "Too much Stress" and "Idiopathic" after making sure that no other problems (e.g. acoustic neuroma) were present.

Upon further investigation the person was found to have an MTHFR C677 homozygous polymorphism, and additional testing revealed elevated Homocysteine levels. After administration of 5-Methyltetrahydrofolate (an activated form of Folic Acid - Levomefolic Acid) there were no further incidents of Sudden Hearing Loss.

The solution was originally found using BioGraph.















Wednesday, December 11, 2013

Venture Capitals in an Age of Algorithms (Revisited)


Some time ago I wanted to explore the idea of analyzing several kinds and sources of Information (e.g. TechCrunch, TheNextWeb, News sites and Twitter) to identify promising Investment opportunities in Technology and more specifically Startups.

Here is a snapshot of a Webpage from TechCrunch :



Many posts in this Blog have discussed how our Reactions to almost any kind of information are recorded. This was not possible when everyone was reading newspapers in paper form, whereas now any kind of Text is associated with a number of Views, Re-Tweets, Comments and FaceBook "Likes".

The second important kind of information being generated is our Emotions about any Topic, as these are expressed within Comments, Twitter and FaceBook posts. The intensity of our emotions is also captured, and this information is very important since whatever we associate with intense emotions really stays within our psyche, fuels our interest and (usually) drives our purchase decisions.

We may then continue with some Exploratory work as follows : We can collect Posts from various Tech sources together with their associated Reactions, annotate the text with Sentiment, Events and Topics and analyze this information to understand which Topics and/or Events appear to have an affinity for a high number of Reactions or High Sentiment intensity for Startups or Tech Topics.

As an example, 10K posts from various Tech sources were collected and each one of the posts was marked as generating either HIGH or LOW interest based on the amount of Reactions (Re-Tweets, FB Likes, Comments) that it generated. Special filtering is applied to the frequencies of the words that appear in each post :




Then this information is fed to KNIME for further analysis. The implementation shown here is rather naive and simplistic for many reasons : only keywords are used as input (as opposed to Topics or Events), and many other parameters are involved which will be discussed later. For our example, however, we will keep things simple.
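
Before the data reaches KNIME, each post needs its HIGH or LOW label. The post does not say which cut-off was used, so the sketch below simply assumes the median of the total Reactions as the threshold. The field names and the sample posts are also assumptions.

```python
from statistics import median

def label_posts(posts):
    """Label each post HIGH if its total reactions exceed the median, else LOW."""
    totals = [p["retweets"] + p["likes"] + p["comments"] for p in posts]
    cutoff = median(totals)  # assumed threshold; the original cut-off is not stated
    return ["HIGH" if total > cutoff else "LOW" for total in totals]

# Hypothetical posts; the field names are illustrative.
posts = [
    {"title": "A", "retweets": 120, "likes": 300, "comments": 45},
    {"title": "B", "retweets": 3, "likes": 10, "comments": 0},
    {"title": "C", "retweets": 40, "likes": 55, "comments": 12},
]
print(label_posts(posts))
```

A median split keeps the two classes roughly balanced, which makes a score such as the F-Measure easier to interpret. A stricter cut-off (for example the top 10% of posts) would be another reasonable choice.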

The workflow uses 3 algorithms namely PART (so that some rules are generated), SMO and Random Forests :


 
This, again, is a very naive approach which gave a result of 61.9% (F-Measure) in identifying keywords that commonly appear in posts that generate Interest vs posts that do not. We keep in mind that this knowledge alone is not enough to base a decision on, but we decide to explore things a little further.
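
For reference, the F-Measure is the harmonic mean of precision and recall. The small function below shows the calculation on made-up counts; the confusion-matrix numbers behind the 61.9% figure are not given in this post.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """F1 score: harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for the HIGH class.
print(round(f_measure(true_positives=60, false_positives=40, false_negatives=35), 3))
```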

We may find that some words that we expected do appear in posts of High interest (such as Google, Apple, Pinterest). There could be however some words that deserve more of our attention such as Education and Schools which during the analysis appeared to exist more frequently in High Interest posts.

So how can this information be used for a potential investment in a Startup, and is there really a way to model new ideas and predict their performance? Again, it is not suggested here that if you come across a startup aimed at Education you should immediately put your money in, but this observation could be one parameter to consider. There are so many other considerations, such as whether the idea is novel or not, how many competitors exist, who the people behind the Startup are, whether its founders have created a successful Startup in the past, which people have already invested in the particular Startup, what "buzz" this Startup has generated so far and so on.

Whenever we read about a new startup there are some immediate thoughts going through our minds : Does this sound like a good idea? Is it applicable to me and would it make my life easier? Is this idea truly disruptive or not? What does our "gut feeling" tell us?

We should always keep in mind that there are limitations to what Predictive Analytics can do but perhaps we can extract some hints that we may then use to make better decisions.

It was also interesting to read this post (hence the use of the word "Revisited" in this Post's Title) on Gigaom regarding the same Subject. This is a fascinating area that I have started looking at and there will be similar posts in the future on this Subject.


Friday, June 7, 2013

Finding the Right Skillset for Big Data Jobs

Perhaps one of the key skills of a Data Scientist is the ability to collect and access data that are not readily available.

I was wondering about the trends in Job Postings and more specifically which skills and qualities employers (or agencies) search for in a candidate for a job in "Big Data", so I decided to use R to answer this question.

Of course, one must first find Data (in this case Job Postings) so that they may be analyzed. This is possible by using the scrapeR library of R to scrape content from websites that contain Job Postings. Once this is done, the tm package can be used to analyze thousands of Job Advertisements so that we may extract useful knowledge.

The analysis which you will see below is based on around one thousand Job Postings that contain the phrase "Big Data". Better pre-processing could help in getting better term co-occurrences, but here I aimed at presenting the application. Once the data are collected we can start by looking at the Frequency distribution of the words found (after removal of stop words) :

Note that the word 'big' is removed from the bar chart. Notice also how the term "experience" (which also includes occurrences of term "experienced") was frequently found in Big Data Job Postings.

Interestingly, the term "skill" (which also counts the term "skilled") is found way below in the frequency diagram.

Next we can use Text Analytics to find which words co-occur with topics of interest. We start by looking at which terms co-occur in Job Postings where Hadoop is mentioned :


Suppose that one wishes to better understand which skills are discussed along with the Java programming language :


When it comes to skills, it appears that communication skills are the ones that matter (as expected) :

In the same manner we can :

-Find the frequencies of skills of interest (e.g. Java, Python, Ruby, NoSQL, Oracle DB) and generate trend charts for each of them.

-Run term co-occurrence analysis on the skills which are "good to have" or "preferred".

-Capture early trends on emerging skills (in the "Big Data" case, this could be Pig)
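
The analysis in this post was done in R with the tm package. As a language-neutral illustration of the co-occurrence queries above (Hadoop, Java), here is a small Python sketch that reports, for the postings mentioning a topic, the share of them that also mention each other term. The postings, the stop word list and the function name are all made up for the example.

```python
import re
from collections import Counter

def terms_cooccurring_with(postings, topic, stop_words=frozenset()):
    """For postings mentioning `topic`, return the share that mention each other term."""
    counts = Counter()
    matching = 0
    for text in postings:
        terms = set(re.findall(r"[a-z+#]+", text.lower()))
        if topic not in terms:
            continue
        matching += 1
        counts.update(terms - {topic} - stop_words)
    return {term: n / matching for term, n in counts.items()} if matching else {}

# Hypothetical job postings.
postings = [
    "Big Data engineer with Hadoop and Java experience",
    "Hadoop administrator, Pig and Hive experience",
    "Data analyst with SQL skills",
]
shares = terms_cooccurring_with(postings, "hadoop", stop_words=frozenset({"and", "with"}))
print(sorted(shares.items(), key=lambda item: -item[1])[:3])
```

Running the same function on postings grouped by month would give the raw numbers for the trend charts mentioned in the list above.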

The idea of analyzing Job postings and CVs using Information Extraction (and then using Predictive Analytics once this information becomes structured) is quite interesting. Extracting inferred knowledge is also quite challenging : For example, could we infer from the text found in CVs :

-The total number of years of experience in Project Management of an Applicant in case that this has not explicitly been stated in his/her CV?

-Whether an Applicant shows a coherent Career growth through the years ?

-The years needed for an Applicant to move to a Managerial Position?


Tuesday, February 19, 2013

Personal Data Mining - (Part 2)

In the previous post I described the way I captured one year's worth of personal data using my Smartphone, with the purpose of identifying trends in my immune system, which at times gave me perennial conjunctivitis and also swollen lymph nodes. Now it was time to analyze all of this data in the hope that some useful knowledge could be found.

I had to decide which tools to use. I used WEKA and also decided to give KNIME a try, so here is an example of a KNIME workflow :






I first use the File Reader to read in my one year's worth of life data, then an R Script which is used for several data transformations. I then send the streams of Data first to an R Script which runs the FSelector package, with which several Feature Selection algorithms (about 10 of them) are applied to get an understanding of which Features are important for the problem at hand.

Then another stream sends the Data to an R node which creates dummy Variables and then sends the transformed Features to a Linear Correlation node for further inspection.
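
What the dummy-variable and correlation steps do can be shown on a toy example. The sketch below is plain Python, not the R code used in the workflow; the food categories, the symptom flags and the function names are invented for illustration.

```python
from math import sqrt

def dummy_variables(values):
    """Build one 0/1 column per distinct category."""
    return {cat: [1 if v == cat else 0 for v in values] for cat in sorted(set(values))}

def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical daily log: a categorical food entry and a 0/1 symptom flag.
food = ["garlic", "none", "yoghurt", "garlic", "none", "none"]
symptom = [1, 0, 1, 1, 0, 0]
for name, column in dummy_variables(food).items():
    print(name, round(pearson(column, symptom), 2))
```

A correlation computed this way only describes a linear association on the same day; it says nothing about delayed or cumulative effects.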

A third stream (not shown) sends the data to 3 Machine-Learning algorithms (namely an SVM, Decision Tree and Random Forest) and the Scorer shows how each algorithm performed.

I first executed the FSelector node using 10-fold Cross-Validation because I wanted to get a first feel for the features that are important in identifying some patterns about my perennial conjunctivitis. 7 out of 10 of the FSelector algorithms agreed that :

1) Vitamin D3
2) Garlic
3) Yoghurt

...appear to have the most predictive power. The problem is that at this point we do not know whether any of these features actually helps or aggravates my condition. However, the output of FSelector gives an idea of which features should be looked at more closely.
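
The "7 out of 10 algorithms agreed" idea amounts to a simple vote over several feature rankings. Here is a sketch with three made-up rankings; the method outputs, the extra "Coffee" feature and the agreement threshold are all hypothetical.

```python
from collections import Counter

def top_k_votes(rankings, k=3):
    """Count how many rankings place each feature among their top k."""
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:k])
    return votes

# Hypothetical rankings (best first), one per feature selection method.
rankings = [
    ["VitaminD3", "Garlic", "Yoghurt", "Coffee"],
    ["Garlic", "VitaminD3", "Coffee", "Yoghurt"],
    ["VitaminD3", "Yoghurt", "Garlic", "Coffee"],
]
votes = top_k_votes(rankings)
agreed = [feature for feature, n in votes.items() if n >= 2]
print(votes.most_common(), agreed)
```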

Then the third stream was run, namely the one which sends the data to 3 machine learning algorithms, so that I could get a first feel for how the algorithms perform. All three algorithms gave an F-Score of around 59 - 62%.

By looking at the results some patterns appeared to arise (note the word "appeared") :

1) A rather large daily dose (>5200 IU) of Vitamin D3 appears to be associated with fewer incidents of conjunctivitis
2) Garlic consumption appears to increase my conjunctivitis incidents.
3) Yoghurt consumption appears to increase my conjunctivitis incidents.


For Pattern (1) we need to be aware that Vitamin D3 dosage has a compounding effect so it is rather naive to think that boolean logic applies (see previous post for more).
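
One simple way to give an algorithm some view of a compounding effect is to replace the daily yes/no flag with a rolling total over the preceding days. The sketch below is only an illustration of that idea; the window length and the doses are arbitrary, and it is not necessarily the approach described in the previous post.

```python
def rolling_sum(daily_values, window=7):
    """Sum of the current day and the previous window-1 days."""
    return [
        sum(daily_values[max(0, i - window + 1) : i + 1])
        for i in range(len(daily_values))
    ]

# Hypothetical daily Vitamin D3 intake in IU.
daily_iu = [0, 5000, 5000, 0, 5000]
print(rolling_sum(daily_iu, window=3))  # prints [0, 5000, 10000, 10000, 10000]
```

The rolling total can then be used as a feature in place of (or next to) the same-day dose.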

Next I had to look at patterns (2) and (3). One of the things that I realized when searching the web for the effect of various nutrients on functions of the human body is that for any nutrient you can find several entries that sometimes contradict each other. My very brief web search found Garlic and Yoghurt to be "immune boosters". Of course caution should be exercised in drawing any conclusions because of the way the data have been collected and also the problematic origin of the analysis. Moreover, I am not a doctor and I cannot possibly know whether Garlic or Yoghurt can aggravate an immune response in such a way.

I began taking Vitamin D3 and eliminating Garlic and Yoghurt from my diet. The result was that over a period of one month I stopped getting bouts of conjunctivitis and incidents of swollen lymph nodes. So has Vitamin D3 acted as an "immune response regulator" and Garlic - Yoghurt as "immune boosters"?


Although my bouts of conjunctivitis have ceased, I am not in any position to make any claims because there are a lot of uncontrolled variables :

- It could be a placebo effect.
- There may be unknown hidden variables that are important
- (My) Genetics
- Environment
- Variations in Dosage and Nutrient Content
- Interactions between nutrients

and lots of others that could not possibly be accounted for under these circumstances.

What I can say (and this is the reason for writing this post) is that analytics may help us identify several patterns that may then be used to guide a sound knowledge discovery process. If people had the ability to collect data on a daily basis (see Quantified Self) and then analyze them on a massive scale, several unknown patterns that call for closer investigation could emerge.