MY FIRST KAGGLE COMPETITION – AVITO DEMAND PREDICTION
- Kaivalya Kandukuri
- Aug 2, 2018
- 7 min read
Alongside my part-time Research Assistant job on campus, I decided to take on something extra over the summer when I received an invite to the Avito Demand Prediction Challenge on Kaggle.
I had previously taken plenty of Data Science and Machine Learning courses, both online and as part of my Master's degree, but I hadn't really had a chance to apply much of what I learnt. I decided a Kaggle competition would give me that opportunity, so I went ahead and signed up.
About the competition:
Avito is a Russian online classified advertisements website. In this challenge, they wanted us to predict the demand for an online ad from various attributes: the image in the ad (essentially its quality), the ad title, its description, and a few others such as where it was posted and the price of the product being advertised.
It fell into the category of Image Processing and Text Mining, since one had to process the images to assess their quality and mine the text of the title and description to see whether they affected demand in any way. Predicting the demand helps the sales team value the product in the ad, mainly to prevent overpricing items with little demand and underpricing items with huge demand.
Data description:
The data was provided as two comma-separated values (CSV) files, one containing the training data and the other the test data. A separate zip file was also provided with about a million images in .jpg format. The data had columns describing the advertisement as follows –
· Itemid
· Userid
· Region and city from where the ad was posted
· Category name
· Ad Title
· Ad Description
· Activation date of the ad
· Image (an ID; the file names in the zip file are the same as this image ID)
There were a few other columns related to Avito's ad model. The text columns were mostly in Russian. The train set consisted of about 1.6 million rows, while the test set contained about 10,000 rows.
Data Cleaning:
As I had heard is the case for most Kaggle competitions, the data didn't require a lot of cleaning. There were a few missing values, but those were simply ads without an image, in which case the image column was empty. There weren't any outliers either.
Image Processing:
I first started with processing the one million images provided. It was a very tedious task, and extracting and processing all those images took a very long time (about 3-4 days in total). I did most of my image processing in Python using the cv2 package.
I only extracted a few basic features: the image size, the mean of the color components, and the standard deviation of the color values. The color consists of red, green and blue components, and I thought it would give some indication of the image quality.
I would like to mention here that the Kaggle kernels were very useful. I got most of my ideas for image processing from other Kagglers. I had no experience with image processing in Python, and the kernels were a lot of help when it came to learning the basics. To be honest, I didn't stray very far from the basics.
From the image file name, I also extracted the image ID and joined all these columns to the main table. The table now had three extra columns: the size of the image, the mean color and the standard deviation of the color.
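My actual extraction was done in Python with cv2, but for anyone who wants the rough idea in R, a minimal sketch using the 'jpeg' package (which I didn't use in the competition) would look something like this:

```r
# Illustrative sketch only; my real pipeline used Python's cv2.
library(jpeg)

image_features <- function(path) {
  img <- readJPEG(path)   # array: height x width x 3, values in [0, 1]
  data.frame(
    image      = tools::file_path_sans_ext(basename(path)),  # image ID = file name
    img_size   = file.size(path),     # here: file size in bytes
    color_mean = mean(img),           # mean over all R, G, B values
    color_sd   = sd(as.vector(img))   # spread of the color values
  )
}

# Run over the unzipped image folder and bind into one table to join on image ID
files     <- list.files("train_jpg", pattern = "\\.jpg$", full.names = TRUE)
img_feats <- do.call(rbind, lapply(files, image_features))
```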
Text Mining:
I started off with the 'tm' package for text mining the ad title and description. The data was mostly in Russian, and I couldn't understand the words in the word clouds and plots I was generating, but I created a corpus and a Term Document Matrix from the text attributes and used them in my model. I also extracted the character counts of the title and the description.
Other than these features, I also created one extra column called 'has_image'. This was a flag indicating whether an ad had an image or not; I figured an ad with an image would have a higher probability of resulting in a deal than one without.
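Roughly, the text features looked like the sketch below (column names such as title, description and image are from memory and may differ slightly from the actual dataset):

```r
library(tm)

# Corpus and Term Document Matrix from the (mostly Russian) ad descriptions
corp <- VCorpus(VectorSource(train$description))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
tdm  <- removeSparseTerms(TermDocumentMatrix(corp), 0.99)  # drop very rare terms

# Simple length features from the title and description
train$title_len <- nchar(ifelse(is.na(train$title), "", train$title))
train$desc_len  <- nchar(ifelse(is.na(train$description), "", train$description))

# Flag indicating whether the ad has an image at all
train$has_image <- ifelse(is.na(train$image) | train$image == "", 0, 1)
```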
Modeling:
After adding all these new features to my dataset, I moved on to the modeling part of the project. From the articles I read by fellow Kagglers and competition winners, I gathered that tree-based models were doing better than other models, so I decided to use tree-based models for my predictions. I used R for modeling. I first split my data into a training and a test set with a 70-30 split. All of this functionality is provided by the 'caret' package in R. I would like to mention that 'caret' is a magical package for Data Science in R: it helps with everything from splitting the data to modeling and hyper-parameter tuning.
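The split itself is a one-liner with 'caret' (a sketch, assuming the target column is called deal_probability):

```r
library(caret)

set.seed(42)
idx <- createDataPartition(train$deal_probability, p = 0.7, list = FALSE)
tr  <- train[idx, ]    # 70% used for training
val <- train[-idx, ]   # 30% held out for validation
```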
I started by building an XGBoost model with a learning rate of 0.05, max_depth of 18, alpha of 2.25 and 500 rounds. This competition, like most Kaggle competitions, used RMSE as the evaluation metric, and this model gave me an RMSE of about 0.2218 (the deal_probability target is in the range 0-1), which loosely means the predictions were off by about 0.22 on average.
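In slightly simplified form, the model looked something like this (the feature columns and ID names are placeholders, not my exact code):

```r
library(xgboost)

# Build numeric matrices from everything except the ID and target columns
feat_cols <- setdiff(names(tr), c("item_id", "user_id", "deal_probability"))
dtrain <- xgb.DMatrix(data.matrix(tr[, feat_cols]),  label = tr$deal_probability)
dval   <- xgb.DMatrix(data.matrix(val[, feat_cols]), label = val$deal_probability)

params <- list(
  objective   = "reg:linear",  # regression on the 0-1 deal_probability
  eta         = 0.05,          # learning rate
  max_depth   = 18,
  alpha       = 2.25,          # L1 regularization
  eval_metric = "rmse"
)

xgb_fit <- xgb.train(params, dtrain, nrounds = 500,
                     watchlist = list(val = dval), print_every_n = 50)

val_pred <- predict(xgb_fit, dval)
sqrt(mean((val_pred - val$deal_probability)^2))   # RMSE on the hold-out set
```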
I performed hyper-parameter tuning using grid search from the 'mlr' package in R. The tuned parameters didn't change the performance much, and I was still getting the best result with the model I started out with.
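A grid search with 'mlr' looks roughly like this (the grid values below are illustrative, not the exact grid I searched):

```r
library(mlr)

task <- makeRegrTask(data = tr[, c(feat_cols, "deal_probability")],
                     target = "deal_probability")
lrn  <- makeLearner("regr.xgboost", nrounds = 500)

# Candidate values for the parameters I cared about
ps <- makeParamSet(
  makeDiscreteParam("eta",       values = c(0.01, 0.05, 0.1)),
  makeDiscreteParam("max_depth", values = c(6, 12, 18)),
  makeDiscreteParam("alpha",     values = c(1, 2.25, 5))
)

ctrl  <- makeTuneControlGrid()
rdesc <- makeResampleDesc("CV", iters = 3)

tuned <- tuneParams(lrn, task = task, resampling = rdesc,
                    par.set = ps, control = ctrl, measures = rmse)
tuned$x   # best parameter combination found by the grid search
```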
I built a set of other models –
· Random Forest (RMSE 0.242)
· GBM (RMSE 0.27)
· Logistic Regression (RMSE 0.28)
I was still getting the best result with the XGBoost model. That was when I tried ensembling.
Ensembling, another idea I got from fellow Kagglers, is the process of combining different base models to get a better-performing model. There are various ways of ensembling, such as averaging, weighted averaging, boosting, bagging and stacking. Stacking is very popular amongst Kagglers, and most competition winners get their prize-winning RMSE with stacking.
I first tried averaging all the models and got an RMSE of 0.267. A weighted average, where I gave more weight to a model that wasn't performing as well, gave me an RMSE of 0.25.
The initial XGBoost model I created was still the best performing model.
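The averaging itself is trivial once you have hold-out predictions from each model; something like this (the weights below are just for illustration):

```r
# Assume pred_xgb, pred_rf and pred_gbm are hold-out predictions from the three
# base models, and actual is the true deal_probability on the hold-out set
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

avg_pred <- (pred_xgb + pred_rf + pred_gbm) / 3                 # plain average
rmse(avg_pred, actual)

wtd_pred <- 0.5 * pred_xgb + 0.3 * pred_rf + 0.2 * pred_gbm     # weights sum to 1
rmse(wtd_pred, actual)
```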
Then I took to stacking. Stacking is the process where the outputs of the base models are fed to another top-layer model to improve performance. One thing to note here is that the outputs of the base models should not be highly correlated. Strictly, the inputs to the top layer should be the out-of-fold predictions from the base models, and I hadn't performed any cross-validation for my models.
I was running out of time, and the h2o package came to my rescue. The 'h2oEnsemble' package in R lets you build a stack with a few default base learners: in this case deep learning, random forest, logistic regression and GBM. Since these were similar to the models I had already built, I went ahead with the default learners and created a stack with a logistic regression model as my top layer. This gave me an RMSE comparable to the one I was getting with my XGBoost model (0.2227). I was a little disappointed that I didn't get to use the best-performing XGBoost model in my stack, but I was short on time, so I went ahead and extracted the output of this model too.
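From memory, the stack was set up roughly like this; the wrapper names come from the h2oEnsemble README, so treat this as a sketch rather than my verbatim code:

```r
library(h2o)
library(h2oEnsemble)   # installed from GitHub, not on CRAN

h2o.init(nthreads = -1)
train_h2o <- as.h2o(tr)
val_h2o   <- as.h2o(val)

y <- "deal_probability"
x <- setdiff(names(tr), y)

# Default-style base learners with a GLM (regression) metalearner on top
learners    <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
                 "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"

stack <- h2o.ensemble(x = x, y = y, training_frame = train_h2o,
                      family = "gaussian", learner = learners,
                      metalearner = metalearner, cvControl = list(V = 5))

pred <- predict(stack, val_h2o)
head(as.data.frame(pred$pred))   # stacked predictions from the top-layer model
```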
Conclusions:
Finally, I submitted two output files (which only had to contain the item ID and the predicted deal_probability on the test dataset). I got a public score of about 0.2391 for both my submissions and a private score of 0.2435. Not a great score, I agree, and I was ranked in the top 85%.
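Writing the submission file is just a couple of lines (I'm quoting the column names from memory, so check the sample submission for the exact headers):

```r
# test_pred holds the model's predictions on the competition test set
submission <- data.frame(item_id          = test$item_id,
                         deal_probability = pmin(pmax(test_pred, 0), 1))  # clip to [0, 1]
write.csv(submission, "submission.csv", row.names = FALSE)
```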
What I learnt:
In spite of all the hard work I put in, my R session kept crashing, and there were lots of incompatibilities between the 'caret' package and R 3.5.0. This caused me plenty of trouble, and I wasted a lot of time on models that wouldn't even run. But I was glad I could make a submission, and even though my rank wasn't great, I learnt a lot about Machine Learning that I didn't get to learn during my course. I was happy and satisfied that I had more or less accomplished what I set out to do.
After the competition, I looked at the solutions of the top rankers and they were very enlightening. Most of the top rankers generated about 10-20 models and ensembled (mostly stacked) various combinations of these models to get better outputs. The team that secured third place advised that, instead of concentrating so much on modeling (which almost everybody does), you should concentrate more on getting more features out of the data. Feature engineering is a major part, and every small feature that can be extracted makes a difference to the model.
They had extracted a lot of interesting features, like the weekday from the activation date and the price binned into different ranges.
One other thing all the top rankers had in common was that none of them used R; all of them used Python. When I asked why they chose Python when most of the packages are also available in R, they said the package APIs are better in Python, and that the R packages lag behind their Python counterparts.
Maybe I could explore Python for my next competition and try to get more creative with the data I have.
You can look at my code in R here –
I know it might not help you win a competition, but it will give you a basic sense of what to do and how to tackle things in a competition.
I had lots of fun doing it, and I want to take up more such projects in the future, try to get better, and maybe one day win one. Who knows?