Generating relevant complementary item recommendations that drive conversion at eBay is no easy task. eBay is an e-commerce marketplace with 1.2 billion items and 179 million buyers, where users can buy and sell virtually anything. In addition to the challenge of scale, few structured data attributes, such as ISBN, are available for these items, which makes it difficult to use traditional collaborative filtering approaches for generating recommendations.
The inventory is also volatile; some items on eBay are listed for just a week and never appear again. Given all of these constraints, it is difficult to even generate similar item recommendations given an input seed item (example: seed = iPhone 7 32GB, recommendation = iPhone 7 64GB). Generating items that would complement the seed item, so that the seed and recommended items might be purchased together in a bundle for example, is even more challenging (example: seed = iPhone 7 32GB, recommendation = iPhone 7 case). Here, we describe the complementary items algorithm we developed to solve this task.
The algorithm is used in modules on several pages on the eBay site, most notably on the item page in the “Frequently Bought Together” merchandising strip below the seed item description (see example below).
The module is designed to provide the user with suggestions for add-on items to the seed item that the user might not have thought of buying, thereby enhancing the user’s overall shopping mission and experience. This can increase the number of items bought, which is good for the eBay marketplace as a whole.
Using implicit user feedback (item purchases) with traditional collaborative filtering methods alone does not work at eBay due to the extreme sparsity of the user-item matrix. In layman’s terms, because of the very large number of often short-lived items, the available information about items purchased together by the same user is often insufficient to make recommendations with confidence. However, performing item-based collaborative filtering on aggregations of item-level implicit user data makes sense.
How do you choose the appropriate level of aggregation of items? A natural choice would be to aggregate items at the category level. eBay has a category taxonomy, and every item belongs to a specific leaf category in this category tree. We can aggregate user purchases to form the user-category matrix, where the rows represent users, the columns represent leaf categories, and each entry is 1 if the user has purchased from that category and 0 otherwise. We then use cosine similarity, with appropriate thresholds, to find the top-K nearest categories to the input seed category.
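To make the category-level aggregation concrete, here is a minimal sketch of the top-K search, not our production code. It assumes purchases arrive as (user, category) pairs; the function name `related_categories` and the parameters `min_similarity` and `min_buyers` are illustrative placeholders for the thresholds mentioned above. Because the matrix entries are 0 or 1, the cosine similarity of two category columns reduces to the number of shared buyers divided by the square root of the product of the two buyer counts.

```python
import math
from collections import defaultdict


def related_categories(purchases, seed, k=4, min_similarity=0.0, min_buyers=1):
    """Return up to k (category, cosine similarity) pairs nearest to `seed`.

    `purchases` is an iterable of (user, leaf_category) pairs. The threshold
    parameters are hypothetical stand-ins for "appropriate thresholds".
    """
    # Column of the binary user-category matrix: the set of users with
    # at least one purchase in that category.
    buyers = defaultdict(set)
    for user, category in purchases:
        buyers[category].add(user)

    seed_buyers = buyers.get(seed, set())
    if not seed_buyers or len(seed_buyers) < min_buyers:
        return []

    scored = []
    for category, users in buyers.items():
        if category == seed or len(users) < min_buyers:
            continue
        overlap = len(seed_buyers & users)
        # Cosine similarity of two 0/1 vectors.
        similarity = overlap / math.sqrt(len(seed_buyers) * len(users))
        if similarity > min_similarity:
            scored.append((category, similarity))

    # Highest similarity first; category name breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

At eBay scale this computation would run as an offline batch job rather than in memory; the sketch only shows the arithmetic.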
Finding the nearest categories (related categories) constrains our search space of possible recommendation candidate items significantly and also reduces the possibility of irrelevant recommendations. All of the items that we recommend, in all the recall sets described below, will come from these related categories.
Here are two examples of the top four related categories for the given input seed categories. Each category here shows the full breadcrumb in the category tree, with the last part being the leaf category.
Seed category = Cell Phones & Accessories:Cell Phones & Smartphones
Cell Phones & Accessories:Cell Phone Accessories:Cases, Covers & Skins
Cell Phones & Accessories:Cell Phone Accessories:Screen Protectors
Cell Phones & Accessories:Phone Cards & SIM Cards:SIM Cards
Cell Phones & Accessories:Cell Phone & Smartphone Parts
Seed category = Clothing, Shoes & Accessories:Men's Shoes:Athletic
Clothing, Shoes & Accessories:Men's Accessories:Hats
Clothing, Shoes & Accessories:Men's Clothing:Jeans
Clothing, Shoes & Accessories:Men's Shoes:Casual
Clothing, Shoes & Accessories:Kids' Clothing, Shoes & Accs:Boys' Shoes
One interesting question here is how to decide whether to include the original seed category in the list of related categories. Most of the time when you are performing a K-nearest neighbor (KNN) search, you would not include the input entity in your search results. However, in the case of categories, we may want to recommend items from the same category as the seed item. Think of the Baseball Trading Cards or Video Games categories. It is reasonable to assume that a user who is looking at a baseball trading card will want to see more baseball trading cards (same category), as opposed to, say, basketball trading cards (different category). We capture this logic with the following heuristic: we calculate d, the mean number of purchases per user, for each category. If the value of d for a specific seed category is above a threshold, we include the seed category as a related category; otherwise, we exclude it.
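The heuristic above can be sketched as follows. This is an illustration under stated assumptions: the mean is taken over users who bought at least once in the category, each purchase is one (user, category) pair, and the threshold value is hypothetical rather than the one we use.

```python
from collections import Counter, defaultdict


def mean_purchases_per_user(purchases):
    """Compute d for every category: total purchases / distinct buyers.

    `purchases` is an iterable of (user, category) pairs, one per purchase.
    Assumption: only users with at least one purchase in the category count.
    """
    totals = Counter()
    buyers = defaultdict(set)
    for user, category in purchases:
        totals[category] += 1
        buyers[category].add(user)
    return {category: totals[category] / len(buyers[category]) for category in totals}


def include_seed_category(d, seed, threshold):
    """Keep the seed category as a related category only if d exceeds the threshold."""
    return d.get(seed, 0.0) > threshold
```

A category with many repeat buyers, such as trading cards, ends up with a high d and is kept; a category where users typically buy once ends up with d close to 1 and is dropped.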
Now that we have a first-level relevance filter (related categories), we turn our attention to how to generate the actual candidate recommendation items. A set of such candidate items is referred to as a Recall Set. The input to generating the recall sets is the information about the seed item. This is a very strong piece of context, so it is imperative that the recommendations shown to the user have some relevance to the seed item. As we saw in the previous section, we use the seed category to generate a set of related categories.
Here are some of the ways we generate candidate items for recommendations using a variety of signals:
Related Products: This recall set uses the collaborative filtering approach seen in the previous section, but aggregated at the product level. “What is the difference between a product and an item?” you might ask. An item refers to any listing posted by a seller, while a product at eBay is defined as a concrete entity in the real world. For instance, for books, think of the ISBN. Having product information for items allows many items to be aggregated to the same product entity.
If the seed item can be mapped to a product, we generate a recall set of related products by taking the cosine similarity of vectors of implicit feedback in the form of product-level purchase data. The relevance quality of the recommended products depends heavily on the minimum thresholds applied to this similarity, which for binary purchase vectors is the Ochiai coefficient. There is always a coverage/quality tradeoff here. Coverage is defined as the percentage of input seed items for which our algorithms produce recommendation results. Higher coverage typically means lower quality and vice versa. Human judgment and business rules often guide the balance of this tradeoff.
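To illustrate the coverage side of this tradeoff, the sketch below measures coverage at a given similarity threshold. The input shape (a dict from seed product to scored neighbors) and the function name are assumptions made for the example; quality still has to be judged separately, for instance by human review.

```python
def coverage_at_threshold(scored_neighbors, min_score):
    """Percentage of seeds that keep at least one neighbor at `min_score`.

    `scored_neighbors` maps each seed product to a list of
    (related_product, similarity) pairs. This shape is hypothetical.
    """
    if not scored_neighbors:
        return 0.0
    covered = sum(
        1
        for neighbors in scored_neighbors.values()
        if any(score >= min_score for _, score in neighbors)
    )
    return 100.0 * covered / len(scored_neighbors)
```

Sweeping `min_score` upward over the same data shows coverage falling as the threshold becomes stricter, which is the tradeoff described above.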
Since the final results that we want to show the user are items, we have a separate mapping from products to items that is stored in a cache. We generate this product-to-item mapping by aggregating the most-viewed items for a given product, which incorporates a popularity signal into the results.
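A product-to-item mapping of this kind could be built roughly as follows. This is a sketch, not our pipeline: the view counts, the item-to-product lookup, and the `top_n` cutoff are all hypothetical inputs, and plain dicts stand in for the cache.

```python
import heapq
from collections import defaultdict


def build_product_to_items(item_views, item_to_product, top_n=3):
    """Map each product to its `top_n` most-viewed items.

    `item_views` maps item id -> view count; `item_to_product` maps
    item id -> product id. Items without a product are skipped.
    """
    by_product = defaultdict(list)
    for item, views in item_views.items():
        product = item_to_product.get(item)
        if product is not None:
            by_product[product].append((views, item))
    return {
        product: [item for _, item in heapq.nlargest(top_n, pairs)]
        for product, pairs in by_product.items()
    }
```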
Co-views: While the last recall set utilized the purchase behavioral signal, this recall set utilizes the view behavioral signal. An item purchase is the ultimate sign of user intention. A view signal carries less intention (a user might simply be browsing), but the benefit of using it is the sheer increase in volume and coverage of recommendations. We use this signal to generate recall sets at the product level and directly at the item level, since co-view data is dense enough. Recall sets that use the co-view signal are high quality in terms of conversion.
Related Queries: Besides co-purchase and co-view signals, another source of behavioral data is co-searches. The related queries recall set is contextualized to a search session and incorporates the user search query into the recommendations. We developed a cache of related queries that uses user co-search signals (for example, “digital camera” might be co-searched with “canon SLR”). To map the queries to items, we utilize another cache where we store the most popular items for a particular query. When a user arrives at an item page from a search page, popular items from related queries will be displayed.
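The two lookups chain together as sketched below. Plain dicts stand in for the two caches, and the function name, the `limit` parameter, and the de-duplication step are assumptions for illustration.

```python
def related_query_recall(query, related_queries, popular_items_by_query, limit=10):
    """Return popular items for queries co-searched with `query`.

    `related_queries` maps a query to its related queries, and
    `popular_items_by_query` maps a query to its most popular items.
    Both are hypothetical stand-ins for the caches described above.
    """
    seen = set()
    results = []
    for related in related_queries.get(query, []):
        for item in popular_items_by_query.get(related, []):
            if item in seen:
                continue  # the same item can be popular for several queries
            seen.add(item)
            results.append(item)
            if len(results) == limit:
                return results
    return results
```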
Compatibility: Issues with compatibility between the seed and recommended items can be a serious concern for quality in hard goods categories such as electronics. Recommending a Samsung cell phone case for an Apple iPhone cell phone is a bad user experience that will make the user lose trust in eBay’s recommendations. In general, there is an implicit user assumption that recommended items will fit well with the seed item. Therefore it is important to take compatibility into consideration when generating complementary recommendations.
So far we have discussed sources of recall that use behavioral signals in some way. Often, behavioral signals are not available (the cold start problem), so we look to content-based signals to generate recommendations. Some items have compatibility/fitment data associated with them. We curate pairs of aspect names in certain hard goods categories to validate compatibility. An aspect is a structured data attribute for an item. For example, we make sure recommended items with a “compatible model” aspect match the seed item’s “model” aspect. In addition to generating a recall set using this method, we also have a filter so that other recall sets benefit from this compatibility enforcement.
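A compatibility filter along these lines might look like the sketch below. The data shapes are assumptions: each curated pair is (seed aspect name, candidate aspect name), a seed aspect holds a single string, and a candidate aspect holds a list of strings. The sketch also assumes that a candidate lacking the aspect is passed through rather than rejected, and that matching is case-insensitive; both are illustrative choices.

```python
def is_compatible(seed_aspects, candidate_aspects, aspect_pairs):
    """Check a candidate against every curated aspect pair that applies."""
    for seed_name, candidate_name in aspect_pairs:
        seed_value = seed_aspects.get(seed_name)
        candidate_values = candidate_aspects.get(candidate_name)
        if seed_value is None or candidate_values is None:
            continue  # assumption: nothing to validate, so do not reject
        normalized = {value.strip().lower() for value in candidate_values}
        if seed_value.strip().lower() not in normalized:
            return False
    return True


def filter_compatible(seed_aspects, candidates, aspect_pairs):
    """Keep only candidates that pass the compatibility check."""
    return [
        candidate
        for candidate in candidates
        if is_compatible(seed_aspects, candidate["aspects"], aspect_pairs)
    ]
```

Because the check is a pure function of the seed and candidate aspects, the same function can serve both as a recall source and as a filter applied to other recall sets.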
Complementary of Similar: Often we encounter the situation where there are complementary recommendations for a product (for example, “Silver iPhone 7 32GB”), but a nearly identical product (for example, “Gold iPhone 7 32GB”) has no results, perhaps due to a lack of behavioral data. We address this problem with a “complementary of similar product” algorithm. We generate product embeddings using textual information from the product title and aspects, and use eBay’s GPU cluster to find similar products with a KNN search of the product embeddings. Therefore, when there are no direct complementary results for a seed product, this recall set returns complementary items from similar products (found via the product embeddings).
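The fallback logic can be sketched as follows. This is a toy version: a brute-force cosine search over in-memory vectors replaces the GPU KNN search, the embeddings are assumed to be precomputed, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def complementary_of_similar(seed, embeddings, complements, k=3):
    """Return the seed's own complements, or those of its k nearest products.

    `embeddings` maps product -> vector; `complements` maps product -> list
    of complementary items. Both are stand-ins for precomputed data.
    """
    direct = complements.get(seed)
    if direct:
        return list(direct)
    if seed not in embeddings:
        return []
    # Brute-force KNN; a real system would use an accelerated index.
    neighbors = sorted(
        (product for product in embeddings if product != seed),
        key=lambda product: cosine(embeddings[seed], embeddings[product]),
        reverse=True,
    )[:k]
    results = []
    for product in neighbors:
        for item in complements.get(product, []):
            if item not in results:
                results.append(item)
    return results
```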
DeepRecs: While the last recall set focused on finding product-based embeddings, the DeepRecs recall set explores item embeddings directly. This recall set uses a text-based deep learning model that incorporates the title, aspects, and category information from both seed and recommendation candidate item pairs, trained with the implicit co-purchase signal as the target. The co-purchase probability between the seed and recommendation candidate items within the related categories is then calculated using the neural network architecture on a GPU cluster, and the top-K results are returned. Comparing item embeddings, which incorporate textual content information, instead of implicit item vectors directly, as in the case of collaborative filtering, helps address the sparsity issue endemic to eBay data. Details of this approach can be found here.
Popular: When we run out of other behavioral or content-based signals, we fall back to the popular items in a related category recall set. Due to its low relevance quality, this recall set is not used in all versions of the algorithm. This recall set has the lowest operational performance in terms of click-through rate (CTR) and purchase-through rate (PTR) metrics, and it is used as a baseline when developing new recall sets. In addition to storing popular items in a category in our cache, we also store popular items in a category with an aspect for all category-aspect combinations. An improved version of this recall set additionally matches an important aspect in each category (such as brand for fashion) between the seed and recommended item. The important aspect for a given category is generated using another sub-model.
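The aspect-matched variant can be sketched as a pair of lookups. The key layout (category, aspect name, aspect value), the dict-based caches, and the fallback to plain category popularity when no aspect-matched entry exists are all assumptions made for this example.

```python
def popular_recall(
    seed_aspects,
    related_categories,
    important_aspect,
    popular_by_category_aspect,
    popular_by_category,
    limit=5,
):
    """Collect popular items per related category, preferring aspect matches.

    `important_aspect` maps category -> aspect name (e.g. "brand").
    `popular_by_category_aspect` is keyed by (category, aspect name, value).
    `popular_by_category` maps category -> popular items. All hypothetical.
    """
    results = []
    for category in related_categories:
        name = important_aspect.get(category)
        value = seed_aspects.get(name) if name else None
        items = None
        if value is not None:
            items = popular_by_category_aspect.get((category, name, value))
        if not items:
            # Assumed fallback: plain category-level popularity.
            items = popular_by_category.get(category, [])
        for item in items:
            if item not in results:
                results.append(item)
    return results[:limit]
```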
For the most part, we have focused on the algorithms and modeling details up to this point. Here is a high-level overview of our engineering architecture. We typically perform most of our aggregation/offline computation of historical data in a Hadoop cluster using Twitter’s Scalding library or Spark. Training deep learning models as well as performing KNN search of embeddings is done in a GPU cluster. All of the pre-aggregated results are then stored in a Couchbase DB cache for access at run time. The backend Scala application for serving live production traffic uses data from eBay internal real-time services as well as the Couchbase DB to generate the results.
We are leaving out many details here, but this is a simplified overview of the overall architecture. Our real-time application serves over 1 billion impressions daily.
We presented numerous algorithmic features and models that together constitute the complementary recommender system at eBay. It is the combination of many smaller components that produces a high level of quality as well as coverage. It is important to have stable sub-components and sub-models as well as the overall engineering infrastructure to produce final recommendations that are robust. Incorporating several different signals, including those based on behavior (co-purchase, co-view, co-search, popularity) and content (title text), significantly enriches the coverage of complementary recommendations.
In a typical information retrieval system, the retrieval process is divided into a recall and a ranking stage. In this blog post, we have focused mainly on the recommendation candidate generation (recall) stage. In another blog post, we will review our approach for the ranking stage, which is analogous to what was previously done for the similar items algorithm. An analogy comes to mind. Building a recommender system is like baking a cake: the recall sets are the cake, the ranking is the frosting on the cake, and personalization is the cherry on top. Here we described how to bake the cake, and we will leave applying the frosting for another time!
We would like to acknowledge several people who have worked on components of this algorithm including Tommy Chen, Daniel Galron, Sourigna Phetsarath, Mike Firer, Natraj Srinivasan, Ved Deshpande, Aditi Nair, Katrina Evtimova, Paul Wang, Shuo Yang, and Barbara Duckworth.