An Example of Predictive Analytics for Beginners (With R Code)
This post describes my experience of participating in StatWars, a competition organised by the Department of Business Economics, University of Delhi, as part of its annual flagship event, Vishleshan. (I ended up winning the event.) Since the competition was about predictive analytics, readers who are new to the subject may first want to read my earlier post, How can you learn Predictive Modelling?
Introduction to StatWars
In this event, we were asked to predict whether the residents of a city would purchase a one-time investment scheme. The full problem description is here.
My Methodology
Before explaining the methodology, I assume you have some idea of four algorithms: logistic regression, random forest, decision tree and k-nearest neighbours (KNN). I used an ensemble of these four data mining algorithms.
There are two ways to read my code:
1. Read the individual scripts first and then read Ensemble.R. (This takes longer, but it is the better way.)
2. Read only Ensemble.R and skip the optimization of the individual algorithms. (Use this if you have limited time.)
I assume that you are following method 1.
Decision Tree (refer decision_tree.R)
There are three ways to build a tree in R, and this file contains all three. You can use any one of the methods below:
1. tree()
2. rpart()
3. ctree()
In my opinion, the best one is ctree(). Note, however, that each of these functions is based on a different underlying algorithm.
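As an illustration of the second option, here is a minimal sketch that fits a classification tree with rpart(). It uses kyphosis, a small example data set bundled with the rpart package, as a stand-in for the competition data, so the formula and column names are not the ones in decision_tree.R.

```r
library(rpart)  # recommended package that ships with most R installations

# kyphosis is a stand-in data set; the competition data has different columns.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Predicted class for each row; type = "class" returns factor labels.
pred <- predict(fit, newdata = kyphosis, type = "class")

# Confusion table of actual vs. predicted values on the training rows.
table(actual = kyphosis$Kyphosis, predicted = pred)
```

tree() and ctree() accept a formula and a data frame in the same way, so switching between the three mostly means changing the fitting call and the arguments passed to predict().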
To fit the final model I have used fulltrain, since it is generally better to learn from all the available training data. For cross-validation, I have divided the training data (fulltrain) into two parts, sub.train and sub.test.
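A minimal sketch of such a split is shown below. The 70/30 ratio, the seed and the toy columns are my assumptions for illustration; they are not taken from decision_tree.R.

```r
# Toy stand-in for the competition data; the real fulltrain has other columns.
set.seed(42)
fulltrain <- data.frame(
  age      = round(runif(200, 20, 60)),
  income   = round(rnorm(200, 50000, 10000)),
  purchase = rbinom(200, 1, 0.3)
)

# Hold out 30% of the rows (assumed ratio) for validation.
test.idx  <- sample(nrow(fulltrain), size = round(0.3 * nrow(fulltrain)))
sub.test  <- fulltrain[test.idx, ]
sub.train <- fulltrain[-test.idx, ]
```

The model is then fitted on sub.train and scored on sub.test, so the score reflects rows the model has not seen.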
During cross-validation, calcscore() simply returns the sum of sensitivity and accuracy for the given true values and predictions.
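A function of this kind could look roughly like the sketch below. The argument names and the 0/1 coding of the response are assumptions; the version in the downloadable scripts may differ.

```r
# Sum of sensitivity and accuracy for 0/1 labels (1 = purchased).
calcscore <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted == 1)   # true positives
  fn <- sum(actual == 1 & predicted == 0)   # false negatives
  sensitivity <- tp / (tp + fn)
  accuracy    <- mean(actual == predicted)
  sensitivity + accuracy
}

# Made-up labels: 2 of 3 positives are caught, 6 of 8 rows are correct.
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 0, 0, 0, 0, 1, 0)
calcscore(actual, predicted)   # 2/3 + 6/8
```

The score ranges from 0 to 2. Adding sensitivity to accuracy rewards models that actually find the buyers, which matters when buyers are the minority class.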
Random Forest (refer randomforest.R)
I have used k-fold cross-validation to find the best value of the mtry parameter (the number of variables tried at each split). calcscore() and the rest of the code are similar to the decision tree script.
KNN (refer KNN.R)
It is important to note that KNN relies on distances between observations, so it applies directly only to numerical data. Hence I have used only the numerical variables. For validation I have used the built-in function knn.cv(), which performs leave-one-out cross-validation.
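The sketch below shows how knn.cv() from the class package can be used to compare a few values of k. It uses the built-in iris data as a stand-in, and the candidate values of k and the scaling step are my assumptions rather than what KNN.R does.

```r
library(class)  # provides knn() and knn.cv()

# iris is a stand-in data set; its four predictors are all numeric.
x <- scale(iris[, 1:4])   # put the variables on a common scale before measuring distances
y <- iris$Species

# Cross-validated accuracy for a few candidate values of k.
ks  <- c(1, 3, 5, 7, 9, 11)
acc <- sapply(ks, function(k) {
  set.seed(1)             # knn.cv breaks ties at random
  mean(knn.cv(train = x, cl = y, k = k) == y)
})

best.k <- ks[which.max(acc)]
```

Each observation is classified using all the other observations, so no separate sub.train and sub.test split is needed here.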
Logistic Regression (refer Logistic.R )
I have used k-fold cross-validation to find the best cutoff (the probability above which the response is considered true).
I have used two passes to find the best cutoff. The first pass finds the best cutoff to two decimal places. The next pass refines it to four decimal places. (The four-decimal-place best cutoff is stored in the variable VBcutoff.)
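The two-pass search can be sketched as follows. The toy data, the columns x1, x2 and y, the choice of 5 folds and the helper best.cutoff() are all hypothetical. The sketch pools the out-of-fold probabilities before searching, which is one of several reasonable ways to do this and not necessarily the way Logistic.R does it.

```r
# Sum of sensitivity and accuracy, following the description of calcscore() in this post.
calcscore <- function(actual, predicted) {
  sum(actual == 1 & predicted == 1) / sum(actual == 1) + mean(actual == predicted)
}

# Toy data standing in for the competition set (hypothetical columns x1, x2, y).
set.seed(7)
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1 + 1.5 * dat$x1 - dat$x2))

# Out-of-fold predicted probabilities from 5-fold cross-validation.
k     <- 5
fold  <- sample(rep(1:k, length.out = n))
probs <- numeric(n)
for (i in 1:k) {
  fit <- glm(y ~ x1 + x2, data = dat[fold != i, ], family = binomial)
  probs[fold == i] <- predict(fit, newdata = dat[fold == i, ], type = "response")
}

# Return the cutoff on a grid that gives the highest score.
best.cutoff <- function(grid) {
  scores <- sapply(grid, function(cut) calcscore(dat$y, as.integer(probs > cut)))
  grid[which.max(scores)]
}

# Pass 1: coarse grid, two decimal places.
cut2 <- best.cutoff(seq(0.01, 0.99, by = 0.01))
# Pass 2: fine grid, four decimal places, around the pass-1 winner.
VBcutoff <- best.cutoff(seq(max(cut2 - 0.01, 0.0001), min(cut2 + 0.01, 0.9999), by = 0.0001))
```

Searching the coarse grid first keeps the work small: 99 candidates in the first pass and about 200 in the second, instead of nearly 10,000 for a single four-decimal grid.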
Those who want to see the other things I tried in logistic regression can refer to LogisticFull.R. This code is longer because I tried many things in it, but it should be self-explanatory.
Now coming to the best part. Ensemble!
Why Ensemble?
"N weak predictors can outperform one strong predictor"
Ensemble (refer ensemble.R)
In this code, I have used the best value found for every parameter. Please note that I have used two logistic regression models with two different cutoffs, so as to have 5 predictors in total.
I have combined the models by equal-weight voting, the same rule used in bootstrap aggregating (often abbreviated as bagging): each model gets one vote in the ensemble. In our case of five predictors, the predicted value is 1 only if 3 or more of the 5 predictors predict 1.
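The voting rule itself takes only a few lines. In the sketch below the prediction values and the column names are made up for illustration; in ensemble.R the votes would come from the five fitted models.

```r
# Hypothetical 0/1 predictions from the five models for six residents
# (rows = residents, columns = models); the values are made up.
preds <- cbind(
  tree    = c(1, 0, 1, 0, 1, 0),
  rf      = c(1, 0, 0, 0, 1, 1),
  knn     = c(0, 0, 1, 0, 1, 0),
  logit.a = c(1, 1, 1, 0, 0, 0),
  logit.b = c(1, 0, 0, 0, 1, 1)
)

votes    <- rowSums(preds)           # number of models voting 1
ensemble <- as.integer(votes >= 3)   # 1 only if at least 3 of the 5 agree
ensemble
```

Using an odd number of predictors means a tie can never occur, which is one reason for adding the second logistic regression.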
Codes
The code can be downloaded from Google Drive by clicking here. There are 6 files in that folder. You can also read the R files in Notepad.
What can you do to improve the performance further?
The methodology I have used can be improved further by following the guidelines below.
1. Include more predictors. In my case, I have used only 5 predictors. You can add a few more, including additional predictors built with the same algorithms.
2. Improve each individual predictor by tuning its parameters for your data set. For example, you can tune mtry, ntree, nodesize, etc. in the case of randomForest.
(There is a lot of scope for improving the parameters.)
3. You can also use techniques such as outlier detection.
Thank You!