Caterpillar Tube Pricing – feature engineering and ensemble approach


Sample scripts and data can be found on GitHub:

https://github.com/aaxwaz/Caterpillar-Tube-Pricing-Kaggle

The competition details can be found on Kaggle.

Introduction

This competition is about predicting the price of each tube based on detailed data such as the tube assembly, components, annual volume and so on. The training set has around 30K rows you can use to train your model, and the final test set has roughly the same number of rows to predict. Submissions are judged on Root Mean Squared Logarithmic Error (RMSLE) as the evaluation metric.
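
For reference, RMSLE is simply RMSE computed on log(1 + x) of the predictions and the actual costs; a minimal R version looks like this:

    # RMSLE: root mean squared error on the log(1 + x) scale
    rmsle <- function(actual, predicted) {
      sqrt(mean((log1p(predicted) - log1p(actual))^2))
    }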

Feature Engineering

The challenging part of this competition is feature engineering. I would say 60% of my time was spent on feature engineering, and the remaining 40% on modeling.

OK, let’s first take a look at the top few rows of the file train_set.csv:

[Screenshot: top rows of train_set.csv]

This is the primary training file that includes all the rows for training. However, there are not enough features in this file alone to train your model and get good predictions. As you might guess, most of the valuable information is hiding behind the tube_assembly_id column and needs to be mined from the other data files provided in the competition.

Looking at the ER diagram below (created by EaswerC in the competition forum), you will get a rough idea of the complicated relationships among the files in the provided data set:

[ER diagram showing the relationships between the competition data files]

Basically,

  1. Each tube_assembly_id represents a tube whose parameters (diameter, length, wall, etc.) can be found in the file tube.csv.
  2. Each tube might contain 0-4 different components, which can be found in the file bill_of_materials.csv (a minimal join sketch follows this list).
  3. Finally, parameters for all the components are located in the comp_[type].csv files. This is the most challenging part of the feature engineering, because the formats of the comp files all differ, which makes it hard to create consistent features for all the tubes.
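
To make the relationships above concrete, here is a minimal R sketch of the join on tube_assembly_id (file and column names are as provided by the competition; the full merging code is in the GitHub repo):

    library(dplyr)

    train <- read.csv("train_set.csv", stringsAsFactors = FALSE)
    tube  <- read.csv("tube.csv", stringsAsFactors = FALSE)
    bill  <- read.csv("bill_of_materials.csv", stringsAsFactors = FALSE)

    # attach tube parameters (diameter, wall, length, ...) and the bill of
    # materials (component_id / quantity slots) to each training row
    train <- train %>%
      left_join(tube, by = "tube_assembly_id") %>%
      left_join(bill, by = "tube_assembly_id")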

Generally speaking, there are two main categories of features that I managed to come up with – namely tube features and component features.

1.  Tube features

  1. Based on tube.csv, join on tube_assembly_id and add all tube parameters such as ‘diameter’, ‘wall’, ‘length’, etc.
  2. Create ‘forming’ features (logical T/F) for the ‘end_x’ and ‘end_a’ columns based on tube_end_form.csv
  3. Categorize the ‘supplier’ feature and collapse rare levels into a single level (a sketch of 2 and 3 follows this list)
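
A rough sketch of points 2 and 3, assuming tube.csv has already been joined onto train as above, and assuming tube_end_form.csv has end_form_id and forming columns (the cutoff of 10 occurrences for rare suppliers is illustrative, not my exact value):

    # 2. 'forming' flags: is end_a / end_x a forming end form?
    end_form    <- read.csv("tube_end_form.csv", stringsAsFactors = FALSE)
    forming_ids <- end_form$end_form_id[end_form$forming == "Yes"]
    train$end_a_forming <- train$end_a %in% forming_ids
    train$end_x_forming <- train$end_x %in% forming_ids

    # 3. collapse rare supplier levels into a single level
    supplier_counts <- table(train$supplier)
    rare_suppliers  <- names(supplier_counts[supplier_counts < 10])  # cutoff is an assumption
    train$supplier[train$supplier %in% rare_suppliers] <- "S-other"
    train$supplier <- factor(train$supplier)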

2.  Components features

  1. Based on bill_of_materials.csv (‘bill’ below), calculate the total weight of the components contained in each tube_assembly
  2. Based on bill, calculate the maximum component weight for each tube_assembly
  3. Based on bill, calculate the average component weight for each tube_assembly
  4. Categorize components based on component_type_id in the components.csv file, create 30 component_type columns, and for each tube_assembly one-hot encode the component type(s) it contains, if any
  5. Create a ‘unique_feature’ feature (logical T/F column) indicating whether each tube_assembly contains components that have ‘unique_feature’, based on all comp_[type].csv files
  6. Create an ‘orientation’ feature (logical T/F column) indicating whether each tube_assembly contains components that require ‘orientation’, based on all comp_[type].csv files
  7. From the components.csv file, identify the top 10 special processings that are significant/important (e.g. in CONNECTOR-SEAL, the SEAL is considered a special processing). Create 10 ‘special processing’ columns for all tube_assemblies and one-hot encode, for each tube_assembly, which special processings its components have. (A sketch of the weight features 1–3 follows this list.)
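
Here is a sketch of the weight features (1–3), assuming each comp_[type].csv has component_id and weight columns, and using bill as read above with its component_id_N / quantity_N slot pairs:

    library(dplyr)
    library(tidyr)

    # component_id -> weight lookup built from all comp_[type].csv files
    comp_files <- list.files(pattern = "^comp_.*\\.csv$")
    weights <- do.call(rbind, lapply(comp_files, function(f) {
      d <- read.csv(f, stringsAsFactors = FALSE)
      d[, c("component_id", "weight")]
    }))

    # reshape the component_id_N / quantity_N slots into long format
    bill_long <- bill %>%
      pivot_longer(-tube_assembly_id,
                   names_to = c(".value", "slot"),
                   names_pattern = "(component_id|quantity)_(\\d)") %>%
      filter(!is.na(component_id) & component_id != "") %>%
      left_join(weights, by = "component_id")

    # total / max / average component weight per tube_assembly
    comp_feats <- bill_long %>%
      group_by(tube_assembly_id) %>%
      summarise(total_weight = sum(weight * quantity, na.rm = TRUE),  # one way to define 'total'; drop quantity for a raw sum
                max_weight   = max(weight, na.rm = TRUE),
                avg_weight   = mean(weight, na.rm = TRUE))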

Modeling / Ensembling

1. Models based on the above data set (level 1) 

The pre-processing step I did prior to any training was transforming the cost with log(1 + cost). Later, someone on the forum mentioned that a root transform of the cost actually works better than the log transform, so I tried both the root transform (N = 15) and the log transform.
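
In R, the two target transforms and their inverses are simply:

    # log transform of the cost and its inverse
    y_log   <- log1p(train$cost)
    inv_log <- function(p) expm1(p)

    # root transform with N = 15 and its inverse
    N        <- 15
    y_root   <- train$cost^(1 / N)
    inv_root <- function(p) p^N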

The first model I used was xgboost. For 1000 rounds with eta = 0.1 and max_depth = 8, I got around 0.227 (CV and LB) for a single run.

Bagging of xgb actually boosts the result to 0.219! (Setting the subsample rate to 0.8.)
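
Roughly, the bagging looks like the sketch below (here I read the 0.8 subsample rate as the per-bag row sampling rate; X and X_test stand for the engineered train/test matrices, y_log is the transformed target from above, and the number of bags is illustrative):

    library(xgboost)

    set.seed(2015)
    n_bags <- 10                                     # illustrative
    preds  <- matrix(0, nrow(X_test), n_bags)

    for (b in seq_len(n_bags)) {
      idx <- sample(nrow(X), floor(0.8 * nrow(X)))   # 80% of rows per bag
      bst <- xgboost(data = X[idx, ], label = y_log[idx],
                     nrounds = 1000, eta = 0.1, max_depth = 8,
                     objective = "reg:linear", verbose = 0)
      preds[, b] <- predict(bst, X_test)
    }
    pred_xgb_bag <- expm1(rowMeans(preds))           # average bags, undo the log transform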

The other model I also tried that gives relatively good results is Extra Trees (the extraTrees package in R), which scores around 0.23 on both LB and CV.

So the first ensemble model is 0.8 * xgb_bagging + 0.2 * extra_trees, which gives a result of around 0.2165. — (1)

The second model – metaBagging (the code can be found in my GitHub) – gives around 0.2156. — (2)

(MetaBagging is basically bagging of xgb. However, instead of throwing away the out-of-bag (OOB) data in each bagging round, the OOB data is used to train a base model (in my case extraTrees), and its predictions are stacked as a new meta feature onto the BAG and test data; the BAG data is then used to train the meta model and predict on the test data. We repeat this for many rounds, just like ordinary bagging.

The details of metaBagging can be found in this thread, where Kaggle Master Mike Kim introduced this well-known meta-bagging method in the Otto Group competition.

However, instead of sampling with replacement, I sampled without replacement at a rate of 80% (compared with the 100% rate used when sampling with replacement). The remaining 20% is my OOB data, and the 80% is the BAG data. Making the sampling rate an adjustable parameter can be quite powerful sometimes, especially in this case, and it can also be faster since less data is used in each round.)
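
A rough sketch of one way to implement the metaBagging loop described above, reusing X, X_test and y_log from the earlier sketches (the round count and xgb parameters are illustrative, not my exact settings):

    library(xgboost)
    library(extraTrees)

    set.seed(2015)
    n_rounds   <- 20                      # illustrative
    bag_rate   <- 0.8                     # sample without replacement; the rest is OOB
    test_preds <- matrix(0, nrow(X_test), n_rounds)

    for (r in seq_len(n_rounds)) {
      bag_idx <- sample(nrow(X), floor(bag_rate * nrow(X)))   # BAG rows
      oob_idx <- setdiff(seq_len(nrow(X)), bag_idx)           # OOB rows

      # base model (extraTrees) trained on the OOB data only
      base <- extraTrees(X[oob_idx, ], y_log[oob_idx])

      # stack its predictions as a new meta feature onto BAG and test
      bag_meta  <- cbind(X[bag_idx, ], meta = predict(base, X[bag_idx, ]))
      test_meta <- cbind(X_test,       meta = predict(base, X_test))

      # meta model (xgb) trained on BAG, predicts on test
      meta_model <- xgboost(data = bag_meta, label = y_log[bag_idx],
                            nrounds = 1000, eta = 0.1, max_depth = 8,
                            objective = "reg:linear", verbose = 0)
      test_preds[, r] <- predict(meta_model, test_meta)
    }
    pred_meta_bag <- expm1(rowMeans(test_preds))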

The ensemble of (1) and (2) gives around 0.215 LB and 0.212 CV, which I used as my level 1 ensemble.

2. Models based on level 2 data

Now that I had my level 1 ensemble, the next thing I did was create five-fold cross-validation predictions as meta features, and I used only those prediction features to train my level 2 models.

The algorithms I used to create those five-fold predictions were xgboost (log and root transforms), bagged neural networks (h2o), extra trees, random forest, SVM, KNN (with k ranging from 2 to 1024), a linear model, and xgboost trained on the PCA of the first 140 columns (90%).

These prediction features were combined to form my level 2 data set. I then used linear regression (the main model), ensembled with other models such as NN, xgb, RF and ET. The final prediction from level 2 was my second ensemble, which gives an LB of 0.216 and a CV of 0.2125.
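
For one base learner, the five-fold out-of-fold (OOF) predictions are generated roughly like this (xgboost shown as the example; parameters as above). The same loop is repeated for every algorithm listed, and the resulting OOF columns become the level 2 features:

    set.seed(2015)
    folds     <- sample(rep(1:5, length.out = nrow(X)))
    oof_pred  <- numeric(nrow(X))
    test_pred <- matrix(0, nrow(X_test), 5)

    for (k in 1:5) {
      tr <- which(folds != k)
      va <- which(folds == k)
      m  <- xgboost(data = X[tr, ], label = y_log[tr],
                    nrounds = 1000, eta = 0.1, max_depth = 8,
                    objective = "reg:linear", verbose = 0)
      oof_pred[va]   <- predict(m, X[va, ])
      test_pred[, k] <- predict(m, X_test)
    }

    # level 2: OOF columns from all base learners are the only features
    level2_train <- data.frame(xgb_oof = oof_pred, cost = y_log)  # plus ET, RF, NN, ... columns
    lm_fit <- lm(cost ~ ., data = level2_train)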

3. Final Ensemble

The 50/50 ensemble of my level 1 and level 2 models gives me an LB of 0.214 and a CV of 0.210.

Joining team MDT (Nabil & Adil)

At this point, I could probably say that my ensembling and stacking had squeezed every last bit of juice out of my data set, and it would be very hard to improve the score much further. Since all my models were built on the same data set, substantial improvements would only be possible if I merged with another team that could bring in a different data set.

I joined team MDT, and by using a simple ensemble of our best models, we improved our score to 0.209 on the LB!

MDT’s data set was quite different from mine by the time we merged, and they used quite different modeling as well. With such diverse models, we boosted our ranking to No. 7.

[Screenshot: final private leaderboard standing]

Final takeaways

I think I made a good choice in the end to join a top team, and it was really a great experience working with Nabil and Adil, even though it was only for a few days.

Relying on stacking can boost your score, but only up to a point; further improvements are hard to come by unless you can bring in a diverse data set. In that situation, choosing the right team to merge with is definitely a strategic decision.

Always trust your CV!

7 thoughts on “Caterpillar Tube Pricing – feature engineering and ensemble approach”

  1. Congratulations Weimin and thanks for sharing your solution!

    It looks to me that you did a great job both in terms of feature engineering and model ensembling.

    I also used XGBoost for my best single model and got a score of 0.219 on the public LB. However, the set of parameters was quite different: 9000 rounds, eta=0.01, max_depth=12, min_child_weight=18, subsample=0.7, colsample_bytree=0.65 (actually max_depth=8 and min_child_weight=3 also worked well for a similar model). With this set of parameters I got the highest 5-fold CV score, which was consistent with the public LB. I am wondering how you did the parameter tuning for XGBoost and why you picked these values. It looks like there is room for improvement if you pick a smaller eta and set min_child_weight to a value larger than zero.

    Also, what is the total number of features in your models? Did you apply any feature selection method? I had hundreds of features after the feature engineering stage and I used Random Forest and Gradient Boosting from Scikit-learn to rank them. I got the highest 5-fold CV score using around 40-60 features (depending on the model).

    This was actually my first Kaggle competition, and by reading your solution I realized that I haven’t done a very good job in model ensembling. My best score on the public LB was around 0.217, which is not much better than my best single model. I may also have overfitted the public LB a little bit. All in all, participating in this competition was a great experience and I look forward to the next one.

    Hope we can work together in a future competition!

    Thanks,
    Stathis

    1. Hi Stathis,

      I also tried increasing the number of rounds and depth and decreasing eta, but it didn’t give much improvement, probably because my data set is quite different. Another reason is that using a large number of trees for both training and CV would be too time consuming, and I always prefer to get results quickly :).

      Well, for my data set I only have fewer than 200 columns (including one-hot encodings) and I used most of them, so I guess there is still a lot of potential in getting more features. For feature selection, I tried leave-one-out on the top 20 features suggested by xgboost’s importance function, and deleted those whose removal improved the CV score.

      Weimin

  2. Dear Weimin

    “Well, for my data set I only have fewer than 200 columns (including one-hot encodings) and I used most of them, so I guess there is still a lot of potential in getting more features. For feature selection, I tried leave-one-out on the top 20 features suggested by xgboost’s importance function, and deleted those whose removal improved the CV score.”

    Could you please elaborate on the above point? Did you run various combinations of features in turn to get CV scores?

    Also, you indicate using 200 features but limit the above exercise to the top 20. Could you please explain?

    I tried using xgboost importance info plus correlation info among variables, with limited success. My guess is this also requires tweaking the xgb column sampling parameter.

    Thanks in advance.

    Krishna

    1. Thanks for your comments!

      Since most of the features were hand-picked and kept only because they improved the CV score, I didn’t do much feature selection/reduction (there are not that many of them anyway). The only selection I did, after creating all the features, was a leave-one-feature-out (backward elimination) over the top 20 significant features indicated by xgb importance: omit one feature at a time and check the CV score accordingly. If the score improved, I removed that feature. (I remember I only deleted one feature after this, because the rest were all useful.)

      I didn’t do any feature combinations because I didn’t have enough time.

      By 200 features I actually mean columns after one-hot encoding. Quite a few of the features have many levels, so after one-hot encoding they expand to 30 or even more columns; the actual number of features is probably fewer than 100 (I don’t remember exactly, but not too many), and most of them were hand-picked or verified by CV as mentioned above.

      By limiting the exercise to the top 20, I just wanted to see whether removing any feature would give a significant improvement on CV, and the top 20 have the highest potential. I probably should have done this for all features, but that would take a lot of time (5 rounds of CV * ~100 features) on a single laptop.
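
      In code form it is roughly the loop below (bst is a fitted xgboost model, and cv_score() is a hypothetical helper that runs my usual 5-fold CV and returns the RMSLE for a given set of columns):

        # leave-one-feature-out over the top 20 important features (sketch)
        imp     <- xgb.importance(model = bst)
        top20   <- head(imp$Feature, 20)
        base_cv <- cv_score(colnames(X))          # cv_score() is assumed, not a package function

        for (f in top20) {
          cv_without <- cv_score(setdiff(colnames(X), f))
          if (cv_without < base_cv)
            cat("dropping", f, "improves CV from", base_cv, "to", cv_without, "\n")
        }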

      My way of doing selection is definitely not perfect, maybe not even good enough; if you have suggestions or criticisms please let me know :)

      Hope this clarifies things.

      Thanks,
      Weimin

        1. Thanks for the clarification. Regards

  3. Thanks for the write up.
    How do you choose the weights for your ensembles?

    1. Sorry for missing your reply earlier.

      For the ensembling weights – say you have two models, RF and XGB, that you want to ensemble, and assume their CV performances are 0.23 and 0.22 respectively. I would try weights in the range from (0.5, 0.5) through (0.2, 0.8), because beyond (0.2, 0.8) the ensembling effect becomes very small.

      That gives candidate weights like (0.2, 0.8), (0.3, 0.7), (0.4, 0.6) and (0.5, 0.5). I then use the LB as feedback and try each of them to see which gives the best performance.
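
      In code form it is just a small grid (pred_rf and pred_xgb are the two models’ predictions; here I score against a hypothetical cv_rmsle() helper, though in practice I also checked the LB as described):

        # try blend weights from (0.5, 0.5) through (0.2, 0.8)
        for (w in seq(0.2, 0.5, by = 0.1)) {
          blend <- w * pred_rf + (1 - w) * pred_xgb
          cat(sprintf("RF %.1f / XGB %.1f -> CV %.4f\n", w, 1 - w, cv_rmsle(blend)))
        }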
