I joined Kaggle four years ago, during the last few semesters of my Bachelor’s degree. During that time I took part in various courses on machine learning and discovered my love for this field. I always wanted to join one of the competitions on the site, but somehow it wasn’t until now that I finally got around to actually participating. In the past few years I focused more on areas like Natural Language Processing and robotics, so I felt like a total beginner coming back to data analysis. This post is a recap of the things I learned during my first Kaggle competition; maybe it can serve as a small guide to others who are just starting out.
While I was browsing through the Kaggle competitions earlier this year, the Santander Customer Satisfaction competition seemed like a good choice to get started, because the data was very easy to process and one could focus more on the machine learning part and the overall process of entering a competition on Kaggle. (Before participating in a real competition I also did the Titanic tutorial competition on Dataquest, which I highly recommend.)
The goal of this competition was to estimate the probability of a customer being dissatisfied with the service of Santander, giving them the chance to react before a customer would leave. For this purpose, they provided data with more than 300 anonymized features. The special challenge of this data set was that the columns weren’t labeled in an intelligible way – with column headers like ‘var38’ and ‘saldo_medio_var33_hace3’, one could only guess what kind of data each column contained. Fortunately the data was all numerical and mostly cleaned up already.
The data consisted of 76020 rows with 337 features. The target value was either 0 (satisfied) or 1 (dissatisfied), with a very small percentage of rows describing dissatisfied customers. The goal was to predict the probability of the target value being 1 (dissatisfied).
The area under the ROC curve (AUC) between the predicted values and the actual targets was used to evaluate submissions, so this metric is used throughout this post as well.
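Since AUC comes up repeatedly below, here is a minimal sketch of how it can be computed with scikit-learn; the label and probability arrays are made up purely for illustration.

```python
from sklearn.metrics import roc_auc_score

# Made-up example: actual labels (1 = dissatisfied) and predicted
# probabilities of dissatisfaction for six customers.
y_true = [0, 0, 1, 0, 1, 0]
y_pred = [0.1, 0.7, 0.8, 0.2, 0.6, 0.4]

# AUC is the probability that a randomly chosen dissatisfied customer
# is ranked above a randomly chosen satisfied one.
score = roc_auc_score(y_true, y_pred)
print(score)  # → 0.875
```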
Data Clean-up and Feature Engineering
One thing that struck me as remarkable during the competition was that people were publishing their high-scoring code in the scripts section while the competition was still ongoing! In this way one could learn from the experts in real-time, so to say. It was a lot of fun to read about and try different approaches used by others.
I started out with a script titled “~0.83 score with 36 features only” by user Koba. It was basically an XGBOOST classifier that used an ExtraTreesClassifier for feature selection.
```python
# Feature selection: fit an ExtraTreesClassifier and keep only the
# features whose importance exceeds SelectFromModel's threshold.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

clf = ExtraTreesClassifier(random_state=1729)
selector = clf.fit(X_train, y_train)
fs = SelectFromModel(selector, prefit=True)
X_train = fs.transform(X_train)
X_test = fs.transform(X_test)
```
The ExtraTreesClassifier fits a number of randomized decision trees on various sub-samples of the training data and averages the results to prevent overfitting. The fitted classifier is then used by SelectFromModel, a meta-transformer that works with any estimator exposing a coef_ or feature_importances_ attribute. Depending on a threshold, it determines whether a feature should be kept (attribute value over the threshold) or removed. The script’s page linked above has a nice plot of the feature weights, which shows that only two features really stuck out from the rest: ‘var38’ and ‘var15’. (Users guessed that ‘var38’ contained some kind of mortgage value.)
The script also contained other basic functionality like removing constant and duplicated columns. I also cleaned up the ‘-999999’ values in the ‘var3‘ column and replaced them with the most common value of that column.
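These clean-up steps could be sketched in pandas roughly like this; the toy DataFrame and its values are made up, and only ‘var3’ matches an actual column name from the data set.

```python
import pandas as pd

# Toy stand-in for the training data: one constant column, one
# duplicated column, and a -999999 sentinel value in 'var3'.
train = pd.DataFrame({
    "var3":  [2, 2, -999999, 5, 2],
    "const": [1, 1, 1, 1, 1],   # constant column
    "a":     [3, 4, 5, 6, 7],
    "b":     [3, 4, 5, 6, 7],   # duplicate of "a"
})

# Drop constant columns (a single unique value carries no information).
train = train.loc[:, train.nunique() > 1]

# Drop duplicated columns, keeping the first occurrence of each.
train = train.loc[:, ~train.T.duplicated()]

# Replace the -999999 sentinel in 'var3' with the column's most common value.
most_common = train["var3"].mode()[0]
train["var3"] = train["var3"].replace(-999999, most_common)
```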
Submitting the predictions of the model fitted this way led to a relatively good score, as expected. Out of curiosity I also uploaded a submission generated by a model without feature engineering, only with the minor clean-up described in the paragraph above. This submission scored 0.836892 on the Public Leaderboard, which was better than the feature-engineered score, and none of the feature engineering efforts I tried thereafter managed to score higher than the basic XGBOOST algorithm on its own.
Looking at the plot of the feature importance values, this is not really surprising, as most columns seemed equally important or unimportant to the final outcome of the customer’s satisfaction.
I also tried two other approaches. The first was calculating the Pearson coefficient, a measure of the linear correlation between two variables, and only keeping the features whose absolute correlation with the target was higher than 0.01. This scored similarly to the approach with the ExtraTreesClassifier and SelectFromModel (also not surprising).
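A sketch of this correlation filter, with a made-up DataFrame; the label column name ‘TARGET’ and the feature columns are illustrative assumptions.

```python
import pandas as pd

# Toy data: 'f1' correlates strongly with the target, 'f2' barely at all.
df = pd.DataFrame({
    "f1":     [1, 2, 3, 4, 5],
    "f2":     [1, 5, 3, 2, 4],
    "TARGET": [0, 0, 0, 1, 1],
})

# Pearson correlation of every feature column with the target.
correlations = df.drop(columns="TARGET").corrwith(df["TARGET"])

# Keep only features with an absolute correlation above 0.01.
selected = correlations[correlations.abs() > 0.01].index.tolist()
```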
The second approach was using Principal Component Analysis (PCA) to generate new features. PCA applies an orthogonal transformation that projects the original feature set onto at most as many new features, chosen so that they capture the highest possible variance while the resulting vectors are all at right angles to each other. (Some, like me, might remember having to calculate such transformations by hand in advanced math classes… Yuck.)
I played around with including different numbers of PCA-generated features to see what influence they would have on the final score. I tried this in combination with the Pearson-reduced features, as well as with the whole set. In both cases, the AUC score went up until a certain number of features, and then down again. Combined with Pearson, the best score was 0.827698 on the Public LB, using 10 PCA-generated features in addition to the training data. Without excluding any features based on their correlation coefficient, the plateau was reached at a Public LB score of 0.830498 with 50 PCA features. Both were pretty far away from the un-engineered score shown above. (I also tried using only the features generated by PCA, which scored an AUC of 0.82, so I didn’t play around with this further.)
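Appending PCA-generated features to the original matrix could look roughly like this; the random stand-in data is made up, and the component count of 10 is just one of the values tried above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a numeric feature matrix: 100 rows, 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Project onto the 10 directions of highest variance.
n_pca_features = 10
pca = PCA(n_components=n_pca_features)
X_pca = pca.fit_transform(X)

# Concatenate the new components alongside the original columns.
X_augmented = np.hstack([X, X_pca])
print(X_augmented.shape)  # → (100, 30)
```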
This is how you win ML competitions: you take other peoples’ work and ensemble them together.
As the competition was almost over, I didn’t have much time to experiment with ensembling. The best score was achieved by the XGBOOST without feature engineering, so I played around with combining the predictions of different algorithms into one submission file by simply averaging the results. With every new model that was added, the score went up a little. At the beginning of the competition I had also played with Random Forests, but abandoned them because they either were horribly overfit or took too long to run as I adjusted the parameters. The article the quote above is taken from also cites a post on Deep Learning by Ilya Sutskever:
One may be mystified as to why averaging helps so much, but there is a simple reason for the effectiveness of averaging. Suppose that two classifiers have an error rate of 70%. Then, when they agree they are right. But when they disagree, one of them is often right, so now the average prediction will place much more weight on the correct answer.
This led me to the idea of averaging two normally trained models and including one obviously overfit Random Forest. At this point I only had one submission left before the competition would end, and I uploaded these averaged predictions just for fun. The score actually improved quite a bit compared to the other averaged submissions, to 0.836734, but it was still not better than the XGBOOST by itself.
| Averaged algorithms | Public LB score |
| --- | --- |
| XGBOOST, RandomForest, AdaBoost | 0.836381 |
| XGBOOST, AdaBoost, RandomForest (overfit) | 0.836734 |
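The averaging itself is as simple as it sounds; a minimal sketch with made-up prediction arrays:

```python
import numpy as np

# Made-up predicted probabilities from three models for three customers.
pred_xgb = np.array([0.10, 0.80, 0.30])
pred_ada = np.array([0.20, 0.60, 0.40])
pred_rf  = np.array([0.00, 0.90, 0.20])  # the deliberately overfit forest

# Each submission entry becomes the mean of the three models' predictions.
ensemble = np.mean([pred_xgb, pred_ada, pred_rf], axis=0)
```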
There would probably have been quite some potential for improvement if I had experimented more with ensembling, and I will definitely keep it in mind for future projects!
This competition was a great learning experience, especially since the more advanced members of the community shared their scripts during the process. It was also a lot of fun to watch the Public LB change during the competition. The final results have definitely taught me that focusing too much on the Public LB score only leads to overfitting on the public test data. One user was among the top three for almost the whole competition, and I was actually rooting for them – but they dropped over 3000 (!) positions on the Private LB!
From what I have read in the forums, a lot of users suffered from this “overfitting drop”, because many of them used one of the high-scoring public scripts to improve their position on the Public LB. I didn’t bother with climbing the Public LB, as I was participating for the learning experience. In the end I was even pleasantly surprised to rise 361 positions on the Private LB. My final placement was 2621st out of 5123.
All in all Kaggle-ing is a lot of fun, and I will definitely keep participating in competitions in the future!