Given a customer’s search query and the returned product in text format, your predictive model needs to tell whether it is what the customer was looking for! This is the recently completed Kaggle competition – Home Depot Product Search Relevance – where I tried out tons of feature engineering techniques to compete against 2,000+ Kaggle teams around the world.
The Business Challenge
Search relevancy is an implicit measure Home Depot uses to gauge how quickly they can get customers to the right products. It serves as their Search Engine Auditor.
However, their current approach of using human raters to evaluate the search engine is slow and sometimes subjective. Therefore, Home Depot wants a robust predictive model that can automatically predict a relevance score (from 1 to 3) for every pair of search query and returned product.
The data sets
There are four data sets provided for this competition. The relationship can be easily seen from the above ER Diagram.
There are 70,000+ rows in the training data and 160,000+ in the testing data. Each row in the train or test set represents a (query, product) pair, where the Product_uid key can be used to join the Product_Description and Product_Attributes tables to extract product information such as description, brand, color, bullets and so on.
Top 3 rows from train.csv are shown below:
| id | product_uid | product_title | search_term | relevance |
|---|---|---|---|---|
| 2 | 100001 | Simpson Strong-Tie 12-Gauge Angle | angle bracket | 3 |
| 3 | 100001 | Simpson Strong-Tie 12-Gauge Angle | l bracket | 2.5 |
| 9 | 100002 | BEHR Premium Textured DeckOver 1-gal. #SC-141 Tugboat Wood and Concrete Coating | deck over | 3 |
The test.csv is the same except, of course, for the last column, relevance, which you need to predict.
Feature Engineering
This is the most challenging part of the competition!
Since raw English text is not ‘understandable’ by our machine learning models, we need to create numeric features to form new training and testing data sets for modeling.
Overall I divide my feature engineering session into four parts, namely Preprocessing and Cleaning, Similarity and Distance Features, TF-IDF Features and Word Vectors Features. I will explain each in detail.
Preprocessing and Cleaning
The raw text is full of spelling errors, html tags, units, numbers and many others which need to be corrected, removed or standardized.
Some examples:
"high boy tolet", "28 snow thower", "Steves &amp;", "4 in. x 4 in. x 5-1/3 ft"
a) Spelling Correction
There are in general two ways to do spelling correction. One is to build a vocabulary from our existing data sets, specifically the product data (title, description and attributes), since they are more likely to be typo-free. We then use this set as a dictionary to correct the typos found in the query terms. There is a good reference for writing a spell checker in Python (however, you will need to prepare your own training data!)
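The dictionary-based approach can be sketched along the lines of the classic Norvig-style spell checker. The vocabulary below is a toy stand-in for the real product titles and descriptions:

```python
import re
from collections import Counter

def build_vocab(texts):
    """Word frequencies over (assumed typo-free) product text."""
    words = re.findall(r"[a-z]+", " ".join(texts).lower())
    return Counter(words)

def edits1(word):
    """All strings one edit (delete/transpose/replace/insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab):
    """Most frequent in-vocabulary candidate within one edit."""
    if word in vocab:
        return word
    candidates = [w for w in edits1(word) if w in vocab]
    return max(candidates, key=vocab.get) if candidates else word

# Toy vocabulary; the real one comes from product titles/descriptions.
vocab = build_vocab(["high boy toilet seat", "snow thrower electric start"])
print(correct("tolet", vocab))   # -> toilet
print(correct("thower", vocab))  # -> thrower
```

A production version would also consider candidates two edits away and weight them by frequency, but this already fixes the query typos shown earlier.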
Another way is to use Google’s spelling suggestions: someone in this competition actually posted a script that enters each query term into the Google search bar and replaces it with whatever comes up after “Did you mean:”.
b) Regular Expressions
Spelling is crucial and must be corrected properly. After that, the rest is mostly a matter of applying regular expressions. For example, we can standardize all the length units like ‘inches’, ‘inch’ and ‘in’ into the single word ‘inch’, and HTML entities like ‘&amp;’, ‘&#39;’ and ‘&quot;’ need to be removed as well. For the details, you can refer to the excellent script from the forum.
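A minimal sketch of such normalization rules (the patterns here are illustrative, not the forum script’s):

```python
import re

def normalize(text):
    """Unify length units and strip leftover HTML entities (partial sketch)."""
    text = text.lower()
    # Drop common HTML entities left over in the raw text.
    text = re.sub(r"&amp;|&nbsp;|&#39;|&quot;", " ", text)
    # Standardize length units: "4 in." / "4 inches" -> "4 inch", etc.
    text = re.sub(r"(\d+)\s*(inches|inch|in)\b\.?", r"\1 inch", text)
    text = re.sub(r"(\d+)\s*(feet|foot|ft)\b\.?", r"\1 ft", text)
    # Collapse whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize("4 in. x 4 in. x 5-1/3 ft"))  # -> 4 inch x 4 inch x 5-1/3 ft
```

In practice there are dozens of such rules (pounds, gallons, volts, dimensions written as `4x4`, and so on), all applied in one cleaning pass.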
c) Words Stemming
Use the Porter Stemmer or Snowball Stemmer to reduce each word to a normalized base form. A formal definition of the Porter Stemmer can be found here. For example,
'argue', 'argued', 'argues', 'arguing', 'argus'
all stemmed to
'argu'
Notice – This is a useful step when we are doing Similarity and Distance Features, TF-IDF Features, and self-trained word2vec generations later on.
However, when using pre-trained word vectors, such as the GloVe pre-trained models, stemming is not recommended: those models were trained on original word forms, so the original forms are needed to look up the correct word vectors.
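To illustrate the idea (in practice you would use NLTK’s PorterStemmer or SnowballStemmer, not this toy), even a crude suffix-stripper collapses the example above:

```python
def toy_stem(word):
    """A toy suffix-stripper, illustration only; a real stemmer applies
    ordered rewrite rules with measure conditions (Porter's algorithm)."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Keep a minimal stem length so short words aren't mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["argue", "argued", "argues", "arguing", "argus"]
print({w: toy_stem(w) for w in words})  # all map to 'argu'
```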
Similarity and Distance Features
Now we come to generating new features!
There are basic (and important) counting features like
- String and word length of query, title, brand and description.
- Number of matched digits between query and product (title, description).
- Ratio of matched digits between query and product (title, description).
- Number of digits in query and product respectively.
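A sketch of the digit-matching features from the list above (the feature names are hypothetical):

```python
import re

def digit_features(query, title):
    """Digit-overlap counting features between a query and a product title."""
    q_digits = set(re.findall(r"\d+", query))
    t_digits = set(re.findall(r"\d+", title))
    matched = q_digits & t_digits
    return {
        "len_query": len(query),                 # string length
        "words_query": len(query.split()),       # word length
        "n_digits_query": len(q_digits),
        "n_digits_title": len(t_digits),
        "n_matched_digits": len(matched),
        "ratio_matched_digits": len(matched) / len(q_digits) if q_digits else 0.0,
    }

feats = digit_features("12 gauge angle", "Simpson Strong-Tie 12-Gauge Angle")
```

The same pattern repeats for (query, description) and (query, brand) pairs, which is how a handful of simple ideas fans out into many columns.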
Also, I generated word n-grams (unigram to trigram) and character n-grams (2-gram to 7-gram) on query, product title and brand, and word n-grams on product description only (character n-grams would be too large for the description).
For example, a query of
["Simpson Strong Tie 12"]
has word bigrams of
["Simpson Strong", "Strong Tie", "Tie 12"]
and character 2-grams of
["Si", "im", "mp", "ps", "so", "on", "n ", " S", "St", "tr", "ro", "on", "ng", "g ", " T", "Ti", "ie", "e ", " 1", "12"]
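The n-gram generation itself is only a couple of lines:

```python
def word_ngrams(text, n):
    """All consecutive n-word windows of a string."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """All consecutive n-character windows of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(word_ngrams("Simpson Strong Tie 12", 2))
# ['Simpson Strong', 'Strong Tie', 'Tie 12']
```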
With these n-grams, I generated numeric features like:
- Number of matches in word n-grams between query and title, query and description, query and brand
- Number of matches in char n-grams between query and title, query and brand
- Number of unique matches of above two
- Ratio of the matches of the above
- Statistics (min, max, median, std, mean) of matched positions of query in product title, brand, description in word n-grams.
- Statistics (min, max, median, std, mean) of matched positions of query in product title, brand, description in character n-grams.
- Normalized statistics of the above
- Jaccard and Dice distances between query and product titles, brand, description in word n-gram
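For instance, the Jaccard and Dice distances over two n-gram lists can be computed as (a minimal sketch):

```python
def jaccard(a, b):
    """Jaccard distance between two n-gram lists: 1 - |A∩B| / |A∪B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def dice(a, b):
    """Dice distance between two n-gram lists: 1 - 2|A∩B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))

q = ["angle", "bracket"]
t = ["simpson", "strong", "tie", "angle"]
print(jaccard(q, t), dice(q, t))  # 0.8 and about 0.667
```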
In total, these give me a few hundred feature columns.
TF-IDF Features
With all queries, product titles, descriptions and brands as the vocabulary, I trained my TF-IDF vectorizer and transformed query, title and description into TF-IDF space (a very high-dimensional, sparse matrix). Then cosine similarities were calculated between query and title, and between query and description, for each row.
I also applied SVD to reduce the TF-IDF space to 50 and 100 dimensions, and recalculated the cosine similarities. The raw SVD vectors were also used as features.
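The TF-IDF and SVD similarity pipeline can be sketched with scikit-learn on a toy pair of rows (the real model fits on the full corpus and reduces to 50/100 components rather than 2):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

queries = ["angle bracket", "deck over"]
titles = ["Simpson Strong-Tie 12-Gauge Angle",
          "BEHR Premium Textured DeckOver Wood and Concrete Coating"]

# Fit the vectorizer on all text, then transform each column separately.
vec = TfidfVectorizer().fit(queries + titles)
q_tfidf, t_tfidf = vec.transform(queries), vec.transform(titles)

# Row-wise cosine similarity between each query and its returned title.
sims = [cosine_similarity(q_tfidf[i], t_tfidf[i])[0, 0]
        for i in range(len(queries))]

# Reduce the sparse TF-IDF space with SVD and recompute the similarities;
# the dense SVD vectors themselves can also be used as raw features.
svd = TruncatedSVD(n_components=2, random_state=0).fit(vec.transform(queries + titles))
q_svd, t_svd = svd.transform(q_tfidf), svd.transform(t_tfidf)
svd_sims = [cosine_similarity(q_svd[i:i + 1], t_svd[i:i + 1])[0, 0]
            for i in range(len(queries))]
```

Note how the second pair (“deck over” vs. “DeckOver”) gets zero raw TF-IDF similarity because the tokens don’t match, which is exactly the kind of gap the SVD and word-vector features try to close.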
Bonus Point – It is also possible to calculate pairwise cosine similarity statistics grouped by relevance. This is done by first calculating pairwise cosine similarities for each row of the test set against all rows of the training set (for query, title and description individually). Then for each relevance level (1.0, 1.25, 1.5, 2.0, …, 3.0), we calculate the common statistics (min, max, std, mean, median) of the similarities belonging to that relevance level. We do this for each relevance level, and for each of query, title and description. However, it took forever to run on my machine, so in the end I didn’t adopt this feature set. :(
Word Vectors Features
Word vectors have great potential here because they can capture information quite different from the counting features or TF-IDF.
However, I only had time to explore the gensim package in Python, using its word2vec implementation to train on my own data sets with an output word dimension of 100.
I then averaged across all word vectors for each query, title and description, and used the final averaged vector for each row as my training feature.
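The averaging step looks like this; the embeddings below are hypothetical 4-dimensional vectors standing in for the 100-dimensional word2vec output:

```python
import numpy as np

# Hypothetical word vectors; in the post these come from a gensim
# word2vec model trained on the product text (100 dimensions).
embeddings = {
    "angle":   np.array([0.1, 0.3, -0.2, 0.5]),
    "bracket": np.array([0.0, 0.4,  0.1, 0.2]),
}

def average_vector(text, embeddings, dim=4):
    """Average the word vectors of all in-vocabulary words in a string."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)  # fall back to zeros for fully OOV text
    return np.mean(vectors, axis=0)

print(average_vector("angle bracket", embeddings))
# [ 0.05  0.35 -0.05  0.35]
```

Averaging throws away word order, but it turns variable-length text into a fixed-width numeric row that any model can consume.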
The Model
The main model used for this competition is xgboost (eXtreme Gradient Boosting), which performed much better than other models like random forest, extra trees and neural networks.
More about xgboost can be found here.
I have also tried stacking, which uses the predictions from other models such as
- random forest,
- extra trees,
- neural networks,
- ridge & lasso,
- KNN with 1, 2, 4, 8, … 1024 neighbors,
and ran 5-fold cross-validation to create predictions for the training and testing data, which form my level-2 data.
I then used xgboost again on this level-2 data for my final submission. My model’s final ranking was 33 out of 2,125 participants.
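The out-of-fold construction behind the level-2 data can be sketched as follows; for brevity this uses a single least-squares base model in place of the model zoo above:

```python
import numpy as np

def oof_predictions(X, y, n_folds=5, seed=0):
    """Out-of-fold predictions from one base model: each row is predicted
    by a model that never saw it, giving one leakage-free level-2 column.
    (Simplified sketch: plain least squares instead of the models above.)"""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    oof = np.zeros(len(y))
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        oof[val] = X[val] @ w
    return oof

# Synthetic demo data standing in for the engineered feature matrix.
rng = np.random.RandomState(1)
X = rng.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(100)
level2 = oof_predictions(X, y)  # one column of the level-2 training data
```

With several base models, each contributes one such column; the test-set columns come from models refit on the full training data, and xgboost is then trained on the stacked columns.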
Limitation and Improvement Areas
This has been quite a fun competition for me, but to have real impact in the business world, there are areas that need attention.
The top models are not easily scalable
Since most top models come from teams who ensembled and stacked thousands of feature columns and many different models, it is extremely hard to scale the implementation to larger data sets.
Solution – A simpler and easy-to-implement model should be given higher priority.
Essentially, we are just mimicking humans …
Since the target variables are all labeled by human raters, every model is just trying to get as close as possible to the raters’ judgement, rather than to real relevance. This can be quite subjective, since different raters may judge differently. It is also slow: every time we need to update the model with new data, we need human raters to do the labeling again.
Solution – What we could do instead is use a real-time customer feedback loop. For example, we could treat the clicking of a customer on a returned product as positive, and not clicking as negative. We can even prompt the customer for their opinions on the search results from time to time, and use that as response.
Google search bar might not be a good idea …
Correcting spelling errors is an important step in improving the predictive model’s accuracy. However, relying on Google’s spelling correction, which most top-performing teams used (including myself, of course), does not sound like a good long-term strategy to implement.
Solution – Make full use of Home Depot’s own data sets and build a reliable dictionary for spelling correction.