How do you normalize Edit Distance? Some simple ideas to get useful numbers about the changes in your text.
Divided by what?
One thing is absolutely true: the absolute number that is Edit Distance (see our previous post) is not very useful in most MT cases. A change of 5 words in a sentence of 20 words is an Edit Distance of 5 (and is 25%). A change of 2 words in a sentence with 4 is an Edit Distance of 2 (and is 50%). So, 5 words looks like much more than 2 words, but 25% is less than 50%. That is why we need to use percentages or relative numbers, because the change effort needs to be placed in context, related to the length of the text. I used word counts below because they are easier to visualize.
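The arithmetic above can be sketched in a few lines of Python. The function name `percent_of_change` is only an illustrative choice on my part, not a term defined by TAUS or by any library.

```python
def percent_of_change(edit_distance, sentence_length):
    """Express an edit distance as a percentage of the sentence length."""
    return 100 * edit_distance / sentence_length

# 5 changed words in a 20-word sentence
print(percent_of_change(5, 20))  # 25.0

# 2 changed words in a 4-word sentence
print(percent_of_change(2, 4))   # 50.0
```

The smaller absolute change (2 words) turns out to be the larger relative change.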
Let’s talk about naming: TAUS is calling this Edit Density, which is a fine name. I used to call it “Percent Edit Distance” or “Percentage of Change”. It is easy to slip into saying an “Edit Distance of 30%”, but strictly the “pure” Edit Distance is in words or characters (or operations of change applied to words or characters) and it is not a percentage. TAUS presents it as number of edits per 100 characters, which is a percentage.
Now that we see that we need a percentage, the question becomes: the Edit Distance should be divided by… what? It should be a percentage of what?
You may hear this question as “How should we normalize Edit Distance?” There are various statistical definitions of normalization, but what we want to do with Edit Distance is simply called scaling: we want to bring the values into the same range so that we can compare them. It is a simple form of normalization.
There are three possibilities:
- Divide by the initial (MT) count
In MT, this means that you are calculating based on the number of words that the posteditor started working with.
- Divide by the final (PE) count
In MT, this is the postedited word count, the number of words that the posteditor ended with.
- Divide by the maximum of the initial (MT) and final (PE) counts
With this option, the resulting values always fall between 0 and 1.
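The three options can be sketched as one small Python function. This is a minimal sketch, not a standard implementation: the name `scaled_edit_distance`, the `basis` parameter, and the handling of an empty text are my own assumptions, and the edit distance is taken as an input that was already computed elsewhere.

```python
def scaled_edit_distance(edit_distance, mt_count, pe_count, basis="max"):
    """Scale an edit distance by one of the three denominators.

    basis: "mt"  -> initial (MT) count
           "pe"  -> final (PE) count
           "max" -> the larger of the two counts
    """
    denominators = {
        "mt": mt_count,
        "pe": pe_count,
        "max": max(mt_count, pe_count),
    }
    if basis not in denominators:
        raise ValueError(f"unknown basis: {basis!r}")

    denominator = denominators[basis]
    if denominator == 0:
        # Assumed convention for an empty text: no change is 0.0,
        # any change is reported as infinite.
        return 0.0 if edit_distance == 0 else float("inf")
    return edit_distance / denominator


# Hypothetical segment: 3 edits, 10 MT words, 12 PE words
for basis in ("mt", "pe", "max"):
    print(basis, scaled_edit_distance(3, 10, 12, basis))
```

Multiply the result by 100 to read it as a percentage.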
One could argue that using the MT count lets you know the cost before the work is done, while the other two options require the work to be finished before you can get the number. I think this is traditionally why the industry uses source word counts: you know the count before you work on it. Then you can give your price to the client, and the translator knows at the beginning how much they will make. But it doesn’t necessarily have to be that way. Translators often struggle with projects that take much more effort than they initially estimated. I can’t think of a reason to use the PE count. If you have one, please share.
The best option is to use the maximum of the MT and PE counts. Placing the results between 0 and 1 means that your percentage is between 0 and 100%, which is very convenient for creating charts.
Here is how this works. Let’s take our previous example of “Roses are sometimes red > Violets are blue and you are sweet”. There were 7 changes in this transformation, the Edit Distance is 7 words.
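Here is a small Python sketch of the three percentages for this example. The Edit Distance of 7 is taken as given from the count above rather than recomputed, and the variable names are only illustrative.

```python
mt = "Roses are sometimes red"
pe = "Violets are blue and you are sweet"

edit_distance = 7            # changes counted in the example above
mt_count = len(mt.split())   # 4 words
pe_count = len(pe.split())   # 7 words

percentages = {
    "MT": 100 * edit_distance / mt_count,
    "PE": 100 * edit_distance / pe_count,
    "Max": 100 * edit_distance / max(mt_count, pe_count),
}

for basis, value in percentages.items():
    print(f"% on {basis}: {value:.0f}%")
# % on MT: 175%
# % on PE: 100%
# % on Max: 100%
```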
You can see the percentage based on the MT count (4 words) getting a high score of 175%. But something different could happen:
The percentage based on the PE count could go high, or the one based on the MT count could go really high, to 400%. Meanwhile, the percentage based on Max is always well behaved, between 0 and 100%. So, let’s use Max.
What do you think?
You can find the first two articles in this series at Going the Distance — Edit Distance 1 and Going the Distance — Edit Distance 2. If you enjoyed this article, please check other posts from the eBay MT Language Specialists series.