Image search is one of our central products. The user experience has shifted from textual search to visual search, where a user can either capture a live image or select an image from their device.
Through the use of convolutional neural networks, we are able to process images and find similar live eBay listings to suggest to a user. These suggested listings are ranked by their similarity and then displayed to the user. As we train these models, we face the challenge of evaluating their performance. How can we compare several visual search models and say which of them works better?
Since our main objective is to create compelling customer experiences, this article describes a method that tackles this problem directly through the eyes of our users.
Preparing the data set for the evaluation
We use a fixed set of n randomly selected user-uploaded images that serve as our query images for both models during this evaluation.
These images were not part of the training set, which consists of eBay's active listings, but they are reflective of the true images our buyers use to search for eBay products. For each query (i.e., anchor image), we call a model, obtain the top 10 results, and collect 10×n images per model for our evaluation dataset.
Adding the human to the loop
Once we have the evaluation dataset, we upload these images to Figure Eight (formerly CrowdFlower), a crowd tagging platform that we use to collect responses on how well the output of a model compares to the given anchor image (see Figure 1).
Since images are extremely subjective to evaluate, we decided to incorporate dynamic judgments in order to establish a confidence score for every image pair in question. We start by asking three people the same question and reviewing their responses. If they all give the same answer, we keep it. If they answer differently, we ask up to two more people (five in total) to ensure high confidence in the response.
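The dynamic-judgment flow above can be sketched as follows. This is a minimal illustration, not Figure Eight's actual implementation; the function names and the way labelers are drawn are assumptions.

```python
def collect_judgments(ask_labeler, question, initial=3, maximum=5):
    """Ask `initial` labelers the same question; while they disagree,
    keep asking additional labelers, up to `maximum` in total."""
    answers = [ask_labeler(question) for _ in range(initial)]
    while len(set(answers)) > 1 and len(answers) < maximum:
        answers.append(ask_labeler(question))
    return answers


# Example: the first three labelers disagree, so two more are consulted.
responses = iter(["Good Match", "Fair Match", "Good Match",
                  "Good Match", "Good Match"])
print(collect_judgments(lambda q: next(responses), "pair-001"))
```

With unanimous initial answers, the loop never runs and only three judgments are collected.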
Our evaluators are also tested while answering these questions. Every evaluator must first pass a set of test questions, handpicked by our team, in order to qualify as a valid labeler. Their accuracy on these test questions determines their trust score, and they must score at least 70% to be accepted for the task. In addition to this pre-test, further test questions are distributed throughout the task; if a labeler's trust score falls below our designated threshold of 0.7, they are removed from the task.
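In code, the trust bookkeeping amounts to a running accuracy check against the 0.7 threshold. The threshold comes from the article; the update logic below is an illustrative assumption, not Figure Eight's exact formula.

```python
TRUST_THRESHOLD = 0.7  # from the article; labelers below this are removed

def trust_score(correct, total):
    """A labeler's trust is their accuracy on test questions seen so far."""
    return correct / total if total else 0.0

def is_active(correct, total):
    """A labeler stays on the task only while trust >= threshold."""
    return trust_score(correct, total) >= TRUST_THRESHOLD


print(is_active(7, 10))  # 7/10 correct: still qualified
print(is_active(6, 10))  # 6/10 correct: removed from the task
```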
The overall confidence score for each answer is calculated from the level of agreement between labelers, weighted by their assigned trust scores.
For example, if two different answers were given for the same question, we take the answer with the higher overall confidence score. Only questions with a confidence of at least 70% are included in the evaluation (see Figure 2).
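One common way crowd platforms compute such a score is to divide the total trust behind an answer by the total trust of all labelers who responded; a minimal sketch under that assumption (the article does not spell out the exact formula):

```python
from collections import defaultdict

def best_answer(judgments):
    """judgments: list of (answer, labeler_trust) pairs.
    Returns (answer, confidence) for the highest-confidence answer,
    where confidence is the agreeing trust mass over the total trust."""
    weight = defaultdict(float)
    for answer, trust in judgments:
        weight[answer] += trust
    total = sum(weight.values())
    answer, mass = max(weight.items(), key=lambda kv: kv[1])
    return answer, mass / total


# Two labelers (trust 0.9 and 0.8) agree, one (trust 0.7) disagrees.
answer, confidence = best_answer([("Good Match", 0.9),
                                  ("Good Match", 0.8),
                                  ("Fair Match", 0.7)])
print(answer, round(confidence, 2))  # confidence >= 0.7, so the pair is kept
```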
Calculating the total score per model
We do this to obtain a score for each of the models we are evaluating, so that we can compare them fairly and decide which one users might prefer. We use DCG (Discounted Cumulative Gain), a standard metric for ranked results (see Figure 3).
The weights we use are described in the following table.
| Label | Description | Weight |
| --- | --- | --- |
| Good Match | Exact match or a very good substitute | 1 |
| Fair Match | Same product type, but with slight variation (e.g. color) | 0.8 |
| Bad Match | Same product type, but significant differences | 0.3 |
| Very Bad Match | Different product type | 0 |
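Putting the table's weights into DCG looks roughly like this. The article's exact formula is in Figure 3 and is not reproduced here, so the sketch below assumes the standard log2 position discount:

```python
import math

# Relevance weights from the table above.
WEIGHTS = {"Good Match": 1.0, "Fair Match": 0.8,
           "Bad Match": 0.3, "Very Bad Match": 0.0}

def dcg(labels):
    """Standard DCG over a ranked result list:
    sum of rel_i / log2(i + 1) for positions i = 1..len(labels)."""
    return sum(WEIGHTS[label] / math.log2(i + 1)
               for i, label in enumerate(labels, start=1))


# Ten crowd labels for one anchor image's ranked results:
labels = ["Good Match", "Good Match", "Fair Match", "Bad Match",
          "Good Match", "Fair Match", "Very Bad Match", "Bad Match",
          "Fair Match", "Good Match"]
print(round(dcg(labels), 3))
```

The log2 discount rewards models that place their best matches near the top: the same labels in a worse order yield a lower DCG.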
Once we have all the answers from the crowd, we can plug the corresponding weights into the formula and accumulate a total score for each model. The model with the higher score produced more relevant search results on this 10×n evaluation set, and is therefore the one we choose.
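The final comparison then reduces to summing per-anchor DCG values over the n query images and picking the maximum. The model names and numbers below are made up for illustration:

```python
def total_score(per_anchor_dcgs):
    """Total score for a model: accumulated DCG over all n anchor images."""
    return sum(per_anchor_dcgs)


# Hypothetical per-anchor DCG values for two models over n = 3 anchors.
model_scores = {
    "model_a": total_score([3.2, 2.9, 3.5]),
    "model_b": total_score([3.4, 3.1, 3.6]),
}
winner = max(model_scores, key=model_scores.get)
print(winner)  # model_b, with the higher accumulated DCG
```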