Recommendations Studio: Using Auto-Encoders To Find Similar Items


Today’s blog post is the second of two posts that provide an under-the-hood view of how the expanded Blueshift Recommendation Studio works. The expansion includes the launch of 100+ pre-built AI marketing recipes. In the last blog post, we talked about how to rank inside recipes. Today, we’ll show you how to embed items using the Auto-encoder. 

This blog post is written by Anmol Suag, senior data scientist at Blueshift, and Siqi Li, a Blueshift summer intern.

INTRODUCTION

Using Auto-Encoder To Find Similar Items

Finding similar items is critical for recommendations. A user who has shown interest in an item is usually also interested in other similar items. Similarity among items can be found by multiple methods:

  • Collaborative Filtering – Similarity is based on how other similar users interact with items
  • Content-based Filtering – Similarity is based on item attributes

In the absence of interactions or for a new item that hasn’t had much exposure yet, content-based filtering is helpful to find other similar items by just looking at the items’ attributes such as the title, price, category, brand, location, tags, rating, etc. In this blog, we talk about how Blueshift uses Auto-encoders to encode an item’s attribute information to find other similar items.


DEFINITION

1. Auto-Encoder

An Auto-encoder is an artificial neural network that learns to encode its input as a fixed-length array of numbers (called an embedding), and then learns to reconstruct the data from this representation so that the reconstructed output is as close to the input as possible. An Auto-encoder model usually has a symmetrical architecture. The layer in the middle is called the ‘bottleneck’, and the embeddings are read from it. The diagram below illustrates the Auto-encoder architecture.

 

Auto-encoder architecture

By training an Auto-encoder model, we can generate a numeric representation of each item using its various attributes as input. In other words, we encode the attribute information into a list of numbers (an embedding) for each item, and then find similar items by looking for the items whose embeddings are nearest to it.
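To make the idea concrete, here is a minimal sketch of an auto-encoder with a single bottleneck layer, written with NumPy. It is not Blueshift's production model: the layer sizes, the tanh activation, the learning rate and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def train_autoencoder(X, bottleneck=2, lr=0.02, epochs=2000, seed=0):
    """Train a tiny symmetric auto-encoder (input -> bottleneck -> input)
    with plain gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n_items, n_features = X.shape
    W_enc = rng.normal(0.0, 0.1, (n_features, bottleneck))
    b_enc = np.zeros(bottleneck)
    W_dec = rng.normal(0.0, 0.1, (bottleneck, n_features))
    b_dec = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W_enc + b_enc)   # bottleneck activations (the embeddings)
        X_hat = Z @ W_dec + b_dec        # reconstruction of the input
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error through decoder and encoder.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ W_dec.T) * (1.0 - Z ** 2)   # tanh derivative
        W_dec -= lr * (Z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)

    def encode(A):
        """Map already-numeric item features to their embeddings."""
        return np.tanh(A @ W_enc + b_enc)

    return encode, losses

# Hypothetical data: 40 items with 6 numeric features driven by 2 hidden factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(40, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

encode, losses = train_autoencoder(X)
embeddings = encode(X)
print(embeddings.shape, round(losses[0], 3), round(losses[-1], 3))
```

Each row of `embeddings` is the fixed-length representation of one item. A real model would typically use more layers and a deep-learning framework, but the encode-then-reconstruct structure is the same.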

2. Modeling Process

Auto-encoder modeling process

The core purpose of the Auto-encoder is to transform the mixed-datatype attribute information of an item into a fixed-length numeric representation called the embedding. The item embeddings can then be used to find similar items by a nearest neighbor search.

An item’s attributes can be of various data types. Attributes like price and user rating are numeric, attributes like title and description are textual, attributes like brand and shipping method are categorical, and attributes like tags are lists of keywords. However, the Auto-encoder can only take numeric input, so we need to transform every type of data into a numeric form.

We classify item attributes into four types: numeric, categorical, textual and array-of-string. Each data type is feature-engineered into a valid input for the auto-encoder as follows:

A. Numeric Attributes

Numeric attributes don’t need much transformation. We simply impute empty values with mean or median and then normalize the values so that they have zero mean and unit standard deviation.
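The imputation and normalization step can be sketched with the Python standard library alone. The function name `standardize` and the sample prices are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean, median, pstdev

def standardize(values, strategy="mean"):
    """Fill missing values (None) with the mean or median of the present
    values, then rescale to zero mean and unit standard deviation."""
    present = [v for v in values if v is not None]
    fill = mean(present) if strategy == "mean" else median(present)
    filled = [fill if v is None else v for v in values]
    mu = mean(filled)
    sigma = pstdev(filled) or 1.0   # avoid dividing by zero for a constant column
    return [(v - mu) / sigma for v in filled]

prices = [10.0, None, 30.0, 20.0]   # hypothetical price column with one gap
scaled = standardize(prices)
print(scaled)
```

Whether the mean or the median is the better fill value depends on the attribute; the median is less sensitive to outliers such as a few very expensive items.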

B. Categorical Attributes

Categorical attributes are usually string values that contain a single word or phrase for each item. Such attributes are handled with eigen-categorical-encoding, where each category is represented by a real number.

C. Textual Attributes

An item may have several textual attributes like title, description, etc. We utilize Natural Language Processing (NLP) models to convert the text into numeric embeddings. These embeddings are fed into the auto-encoder as input.

D. Array-of-String Attributes

Array-of-string attributes such as tags or keywords are values consisting of an array of words or phrases. We apply a top-k based one-hot encoding to them, and the resulting vector is used as an input to the auto-encoder.
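One plausible reading of a top-k based one-hot encoding is sketched below: keep the k tags that appear on the most items, and represent each item as a binary vector over that vocabulary. This interpretation, the helper names `fit_top_k_vocab` and `encode_tags`, and the sample tags are assumptions rather than details given in this post.

```python
from collections import Counter

def fit_top_k_vocab(tag_lists, k):
    """Return the k tags that appear on the most items."""
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    return [tag for tag, _ in counts.most_common(k)]

def encode_tags(tags, vocab):
    """Binary vector with a 1 for every vocabulary tag the item carries."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocab]

# Hypothetical catalog: one list of tags per item.
catalog_tags = [
    ["Adventure", "Fantasy", "Family"],
    ["Adventure", "Comedy"],
    ["Adventure", "Fantasy"],
    ["Drama"],
]
vocab = fit_top_k_vocab(catalog_tags, k=2)
vectors = [encode_tags(tags, vocab) for tags in catalog_tags]
print(vocab)
print(vectors)
```

Limiting the vocabulary to the top k tags keeps the input width fixed and small, at the cost of ignoring rare tags such as "Drama" in this toy catalog.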

The transformed attribute values for all items are then passed into the auto-encoder that learns a fixed-length numeric representation for each item (embedding).

3. Finding Similar Items

Once we extract the trained item embeddings from the auto-encoder, we can use these embeddings to find the nearest items to a given item in the n-dimensional embedding space. An approximate nearest-neighbor algorithm searches this space using a distance metric such as cosine distance or Euclidean distance. The items with the smallest distance to a given item are the most similar to it based on the item attributes.
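The search can be illustrated with an exact, brute-force version that compares the query against every other item. Note that this is a swapped-in simplification: an approximate nearest-neighbor index avoids the full scan on large catalogs. The item ids and the three-dimensional embeddings are made up for the example.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_items(query_id, embeddings, top_n=2):
    """Return the ids of the top_n items closest to query_id."""
    query = embeddings[query_id]
    scored = [
        (cosine_distance(query, vec), item_id)
        for item_id, vec in embeddings.items()
        if item_id != query_id
    ]
    return [item_id for _, item_id in sorted(scored)[:top_n]]

# Hypothetical embeddings keyed by item id.
embeddings = {
    "movie_a": [0.9, 0.1, 0.0],
    "movie_b": [0.8, 0.2, 0.1],
    "movie_c": [0.0, 0.1, 0.9],
    "movie_d": [0.7, 0.0, 0.3],
}
neighbors = nearest_items("movie_a", embeddings)
print(neighbors)
```

Cosine distance ignores vector length and compares direction only, which is why it is a common choice for embeddings.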

Auto-encoder finding similar items

4. Example: IMDB Movie Dataset

We trained the auto-encoder model to construct embeddings for an IMDB movie dataset of over 7,000 movies. Each movie in the catalog has many attributes: numeric attributes like user rating, box office revenue and number of reviews; textual attributes like title and plot; and categorical attributes like category and genre.

Auto-encoder movie in the catalog

We apply the feature engineering described above to feed the movies into the auto-encoder model and extract 50-dimensional item embeddings. An approximate nearest-neighbor algorithm is then used to find similar movies using cosine distance.

Here are a few sample results:

Example 1


For “Harry Potter and the Half-Blood Prince,” we see that the nearest neighbors of its embedding are the following:

  1. Harry Potter and the Deathly Hallows: Part 1
  2. Harry Potter and the Chamber of Secrets
  3. Harry Potter and the Prisoner of Azkaban
  4. The Lord of the Rings: The Return of the King
  5. Harry Potter and the Goblet of Fire
  6. The Hobbit: An Unexpected Journey
  7. Harry Potter and the Order of the Phoenix
  8. Harry Potter and the Deathly Hallows: Part 2
  9. Harry Potter and the Philosopher’s Stone
  10. Pirates of the Caribbean: Dead Man’s Chest

The list includes the other seven movies of the Harry Potter series, which can be considered the most relevant. The other three movies are also in the same genres (‘Adventure’ and ‘Fantasy’).

Example 2


For “Ice Age: Dawn of the Dinosaurs,” the nearest neighbors are:

  1. The Secret Life of Pets
  2. Minions
  3. Ice Age: The Meltdown
  4. Ice Age: Continental Drift
  5. Despicable Me 3
  6. Despicable Me 2
  7. Moana
  8. Finding Dory
  9. Shrek the Third
  10. Frozen II

All of the nearest neighbors are popular animated movies with the tags ‘Adventure’, ‘Family’ and ‘Comedy’.

We observe that the auto-encoder is able to identify similar movies that have common or similar attributes. Across the 10 nearest neighbors of every movie considered, 93.4% share at least one tag (genre) with the queried movie, and 57.9% share at least two tags with it. Moreover, we find that the average standard deviation of numerical attributes among similar movies is much smaller than the standard deviation over the whole dataset.
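A tag-overlap rate of this kind can be computed roughly as follows. This is a sketch of the metric as described, not the evaluation code used for the figures above; the function name `tag_overlap_rate` and the toy data are hypothetical.

```python
def tag_overlap_rate(neighbors, tags, min_shared=1):
    """Fraction of (query, neighbor) pairs that share at least
    min_shared tags."""
    pairs = [(q, n) for q, ns in neighbors.items() for n in ns]
    hits = sum(
        1 for q, n in pairs
        if len(set(tags[q]) & set(tags[n])) >= min_shared
    )
    return hits / len(pairs)

# Hypothetical tags per movie and nearest neighbors per queried movie.
tags = {
    "a": ["Adventure", "Fantasy"],
    "b": ["Adventure", "Fantasy", "Family"],
    "c": ["Comedy"],
    "d": ["Adventure", "Comedy"],
}
neighbors = {"a": ["b", "d"], "c": ["d", "a"]}

at_least_one = tag_overlap_rate(neighbors, tags, min_shared=1)
at_least_two = tag_overlap_rate(neighbors, tags, min_shared=2)
print(at_least_one, at_least_two)
```

Because the tags are also part of the model's input, this check measures how well the embeddings preserve attribute information rather than how users respond to the recommendations.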

Conclusion

Auto-encoders make it possible to find similar items in a catalog using only the item attributes, without any data about users or historical activity. The Auto-encoder algorithm at Blueshift can work with most attribute data types and find similar products. These similar products can be recommended to users on the item page or in newsletter campaigns.

We announced the expanded Recommendations Studio and the AI recipes at Engage 2022 San Francisco. Go to our Engage 2022 on-demand page to watch the session hosted by Manyam Mallela, Blueshift co-founder and head of AI. 

 
