Sample scripts and data can be found on GitHub:
The competition details can be found on Kaggle.
This competition is about predicting the price of each tube assembly based on detailed data such as tube parameters, components, annual volume, and so on. The training set has around 30K rows you can use to train your model, and the final test set has roughly the same number of rows to predict. Submissions are judged on Root Mean Squared Logarithmic Error (RMSLE).
The challenging part of this competition is feature engineering. I would say 60% of my time was spent on feature engineering and the remaining 40% on modeling.
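For reference, RMSLE can be computed in a few lines of R (a minimal sketch; pred and actual are placeholder vectors):

```r
# Root Mean Squared Logarithmic Error
rmsle <- function(pred, actual) {
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}
```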
OK, let's first take a look at the top few rows of train_set.csv:
This is the primary training file that includes all the rows for training. However, there are not enough features in this file to train your model and get good predictions. As you might guess, most of the valuable information is hiding behind the tube_assembly_id column, and you need to mine it from all the other data provided in the competition.
Looking at the ER diagram below (created by EaswerC in the competition forum), you will get a rough idea of the complicated relationships among the files in the provided data set:
- Each tube_assembly_id represents a tube whose parameters (diameter, length, wall, etc.) can be found in tube.csv.
- Each tube may contain 0-4 different components, which are listed in bill_of_materials.csv.
- Finally, the parameters for all the components live in the comp_[type].csv files. This is the most challenging part for feature engineering, because the formats of the comp files all differ, making it hard to create consistent features across all the tubes.
Generally speaking, the features I managed to come up with fall into two main categories: tube features and component features.
1. Tube features
- Based on tube.csv, group by tube_assembly_id and add in all tube parameters such as ‘diameter’, ‘wall’, ‘length’, etc. (see the sketch after this list)
- Create ‘forming’ features (logical T/F) for the ‘end_x’ and ‘end_a’ columns, based on tube_end_form.csv
- Categorize the ‘supplier’ feature, collapsing rare levels into a single level
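Here is a minimal sketch of those tube features in R, assuming the CSVs use the column names from the competition data (the rare-supplier threshold of 10 is a hypothetical choice):

```r
library(dplyr)

train    <- read.csv("train_set.csv", stringsAsFactors = FALSE)
tube     <- read.csv("tube.csv", stringsAsFactors = FALSE)
end_form <- read.csv("tube_end_form.csv", stringsAsFactors = FALSE)

# Join tube parameters (diameter, wall, length, ...) by tube_assembly_id
train <- left_join(train, tube, by = "tube_assembly_id")

# Forming features: look up whether each end form requires forming
forming <- setNames(end_form$forming == "Yes", end_form$end_form_id)
train$end_a_forming <- forming[train$end_a]
train$end_x_forming <- forming[train$end_x]

# Collapse rare supplier levels into a single "other" level
counts <- table(train$supplier)
rare   <- names(counts)[counts < 10]   # hypothetical threshold
train$supplier <- ifelse(train$supplier %in% rare, "S-other", train$supplier)
```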
2. Component features
- Based on bill_of_materials.csv (bill), calculate the total weight of the components contained in each tube_assembly (see the sketch after this list)
- Based on bill, calculate the maximum component weight for each tube_assembly
- Based on bill, calculate the average component weight for each tube_assembly
- Categorize components based on component_type_id in components.csv, create 30 component_type columns, and for each tube_assembly one-hot encode the component type(s) it contains, if any
- Create a ‘unique_feature’ feature (logical T/F column) indicating whether each tube_assembly contains components that have ‘unique_feature’, based on all comp_[type].csv files.
- Create an ‘orientation’ feature (logical T/F column) indicating whether each tube_assembly contains components that require ‘orientation’, based on all comp_[type].csv files.
- From components.csv, identify the top 10 special processings that are significant/important (e.g. in CONNECTOR-SEAL, the SEAL is considered a special processing). Create 10 ‘special processing’ columns for all tube_assemblies and one-hot encode which special processings each tube_assembly’s components have.
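A sketch of the weight features in R: melt bill_of_materials.csv from its wide component_id_1..8 / quantity_1..8 layout into long form, then aggregate per tube_assembly_id (comp_weights is a hypothetical lookup of component_id to weight, assembled from the comp_[type].csv files):

```r
library(dplyr)
library(tidyr)

bill <- read.csv("bill_of_materials.csv", stringsAsFactors = FALSE)

# One row per (tube_assembly_id, component_id, quantity)
long <- bill %>%
  pivot_longer(-tube_assembly_id,
               names_to = c(".value", "slot"),
               names_pattern = "(component_id|quantity)_(\\d)") %>%
  filter(!is.na(component_id) & component_id != "")

# comp_weights: hypothetical lookup (component_id, weight) from the comp files
long <- left_join(long, comp_weights, by = "component_id")

weight_feats <- long %>%
  group_by(tube_assembly_id) %>%
  summarise(total_weight = sum(weight * quantity, na.rm = TRUE),
            max_weight   = max(weight, na.rm = TRUE),
            avg_weight   = mean(weight, na.rm = TRUE))
```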
Modeling / Ensembling
1. Models based on the above data set (level 1)
The pre-processing step I did prior to any training was transforming the cost with log(1+cost). Later, someone on the forum mentioned that a root transform of the cost actually works better than the log transform, so I tried both the root transform (N=15) and the log transform.
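Both transforms are one-liners in R; the predictions just need the corresponding inverse before scoring (I'm assuming the root transform with N = 15 means the 15th root):

```r
# Log transform and its inverse
y_log   <- log1p(train$cost)        # log(1 + cost)
inv_log <- function(p) expm1(p)

# Root transform (N = 15) and its inverse -- assuming cost^(1/15)
N <- 15
y_root   <- train$cost^(1 / N)
inv_root <- function(p) p^N
```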
The first model I used was xgboost. With 1000 rounds, eta = 0.1, and max_depth = 8, I got around 0.227 (CV and LB) for a single run.
Bagging the xgb model actually boosts the result to 0.219 (with the subsample rate set to 0.8)!
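A sketch of that bagged xgboost setup in R (X_train, X_test, and y_log are placeholder names for the feature matrices and log-transformed target; 30 bagging rounds is a hypothetical choice):

```r
library(xgboost)

params <- list(objective = "reg:linear", eta = 0.1, max_depth = 8)

n_bags <- 30   # hypothetical number of bagging rounds
preds  <- matrix(0, nrow(X_test), n_bags)
for (i in seq_len(n_bags)) {
  # Train on a random subsample each round (subsample rate = 0.8)
  idx <- sample(nrow(X_train), floor(0.8 * nrow(X_train)))
  dtr <- xgb.DMatrix(X_train[idx, ], label = y_log[idx])
  bst <- xgb.train(params, dtr, nrounds = 1000)
  preds[, i] <- predict(bst, xgb.DMatrix(X_test))
}
pred_bag <- expm1(rowMeans(preds))   # average in log space, then invert
```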
The other model I tried that gives relatively good results is Extra Trees (the extraTrees package in R), which scores around 0.23 on both LB and CV.
So the first ensemble model is 0.8 * xgb_bagging + 0.2 * extra_trees, which gives a result of around 0.2165. (1)
The second model, metaBagging (the code can be found on my GitHub), gives around 0.2156. (2)
(MetaBagging is basically bagging of xgb. However, instead of throwing away the OOB data in each bagging round, the OOB data is used to train a base model (in my case, extraTrees), whose predictions are then stacked as a new meta feature onto the BAG and test data. The BAG data is then used to train the meta model, which predicts on the test data. We repeat this over many rounds, just like bagging.
The details of metaBagging can be found in this thread, where Kaggle Master Mike Kim introduced this famous meta-bagging method for the Otto Group competition.
However, instead of sampling with replacement, I sampled without replacement at a rate of 80% (versus 100% for sampling with replacement previously). The remaining 20% is my OOB data, and the 80% is my BAG data. Making the sampling rate an adjustable parameter can be quite powerful sometimes, especially in this case, and it can also be faster because less BAG and OOB data is used.)
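One round of my metaBagging variant, sketched in R under the same placeholder names (extraTrees as the base model, xgb as the meta model):

```r
library(extraTrees)
library(xgboost)

meta_bag_round <- function(X, y, X_test, bag_rate = 0.8) {
  # Sample WITHOUT replacement: 80% becomes the BAG, the remaining 20% is OOB
  idx   <- sample(nrow(X), floor(bag_rate * nrow(X)))
  X_bag <- X[idx, ];  y_bag <- y[idx]
  X_oob <- X[-idx, ]; y_oob <- y[-idx]

  # Train the base model on OOB and stack its predictions as a meta feature
  base    <- extraTrees(X_oob, y_oob)
  X_bag2  <- cbind(X_bag,  meta = predict(base, X_bag))
  X_test2 <- cbind(X_test, meta = predict(base, X_test))

  # Train the meta model on the augmented BAG and predict on test
  bst <- xgb.train(list(objective = "reg:linear", eta = 0.1, max_depth = 8),
                   xgb.DMatrix(X_bag2, label = y_bag), nrounds = 1000)
  predict(bst, xgb.DMatrix(X_test2))
}

# Average the predictions over many rounds, exactly as in ordinary bagging
```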
The ensemble of (1) and (2) gives around 0.215 LB and 0.212 CV, which I used as my level 1 ensembled model.
2. Models based on level 2 data
Now that I had my level 1 ensembled model, the next thing I did was create five-fold cross predictions as meta features and use only those prediction features to train my level 2 models.
The algorithms I used to create those five-fold predictions were xgboost (log and root transforms), bagged neural networks (h2o), extra trees, random forest, SVM, KNN (with k ranging from 2 to 1024), a linear model, and xgboost trained on the PCA of the first 140 columns (90%).
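Each of those meta features was built the same way; here is the generic five-fold scheme in R (fit_fn and pred_fn are hypothetical stand-ins for whichever algorithm from the list above):

```r
# Out-of-fold predictions: each training row's meta feature comes from a
# model that never saw that row; test meta features average over the folds
oof_predict <- function(X, y, X_test, fit_fn, pred_fn, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(X)))
  oof   <- numeric(nrow(X))
  test_preds <- matrix(0, nrow(X_test), k)
  for (f in 1:k) {
    model <- fit_fn(X[folds != f, ], y[folds != f])
    oof[folds == f] <- pred_fn(model, X[folds == f, ])
    test_preds[, f] <- pred_fn(model, X_test)
  }
  list(train_meta = oof, test_meta = rowMeans(test_preds))
}
```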
Combined, these prediction features became my level 2 data set. I then used linear regression (the main model), ensembled with other models such as NN, xgb, RF, and ET. The final prediction from level 2 was my second ensembled model, which gives an LB of 0.216 and a CV of 0.2125.
3. Final Ensemble
Ensembling my level 1 and level 2 models (50/50) gives me an LB of 0.214 and a CV of 0.210.
Joining team MDT (Nabil & Adil)
At this point, I could probably say that my ensembling and stacking had squeezed every bit of juice out of my data set, and it would be very hard to improve the score much further. Since all my models were built on the same data set, substantial improvement would only be possible if I could merge with another team that could bring in a different data set.
I joined team MDT, and by using a simple ensemble of both our best models, we increased our score to 0.209 on the LB!
MDT's data set was quite different from mine by the time we merged, and their modeling was quite different as well. With such diverse models, we boosted our ranking to No. 7.
Final takeaways
I think I made a good choice in the end by joining a top team, and it was a really great experience working with Nabil and Adil, even though it was only for a few days.
Relying on stacking can boost your score, but only up to a limit; beyond that, further improvement is hard unless you can bring in a diverse data set. In that situation, choosing the right team to merge with is definitely a strategic decision.
Always trust your CV!