Characters or words?
Before we go any further: Edit Distance is, by definition, an absolute number, namely the number of operations needed to change one string into another. It is not a relative number. Can you do anything with absolute numbers? Absolutely! Spellcheckers look in a dictionary for the words that have the smallest Edit Distance to the word that has been misspelled (or is it mispelled?). The smallest possible distance is 1 character, an absolute value. But most applications in MT will need a relative number, proportional to some form of length.
The Edit Distance can be measured in characters or words. Which one should we choose?
Some arguments in favor of calculating in characters could be:
- It can be used for any language, including Asian languages.
- It can be used for German, where a single compound word can be equivalent to several English words, which throws the word-based calculation off a little.
  - This is more of an “against-words” argument than an “in-favor-of-characters” one.
- It better represents minor changes to words, such as adding an “s” for a plural.
  - Change Rose to Roses = 1 character operation
  - Change Rose to Roses = 1 word operation
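The Rose to Roses comparison can be sketched in a few lines of Python. This is a minimal illustration, not the calculation used for the numbers in this article: the `levenshtein` helper is a textbook dynamic-programming version written for this example, and splitting on whitespace is an assumed, very naive way of finding words.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Works on any sequences: strings (characters) or lists (words).
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


before, after = "Rose", "Roses"

char_ops = levenshtein(before, after)                  # compares characters
word_ops = levenshtein(before.split(), after.split())  # compares whole words

print(char_ops, "character operation(s) out of", len(before), "characters")
print(word_ops, "word operation(s) out of", len(before.split()), "word(s)")
```

Both counts are 1, but relative to the original length that is 1 of 4 characters versus 1 of 1 word, which is why the character count represents this small change better.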
Some arguments against characters could be:
- Changing one Asian character is not the same as changing one non-Asian character.
  - Asian characters carry more meaning, and a sentence contains fewer of them than its English equivalent, for example.
  - So, you can’t really compare character distances across all languages.
- Some languages, such as Japanese, will have Asian and non-Asian characters in the same sentence. They would need to have different weights.
- And a reordering of one word from MT to PE:
MT: The seller voluntarily refunded the buyer.
PE: The seller refunded the buyer voluntarily.
The change of position of the word “voluntarily” means two word operations: the deletion of the word where it was initially, and the addition of the word where it is now. For 2 changes out of 6 words, we get 33%. But for characters, it looks like this:
- First, the letters of “voluntarily” are deleted from the original position, one character at a time.
- Then the letters are typed again at the end of the sentence, one character at a time: “The seller refunded the buyer v.”, “The seller refunded the buyer vo.”, and so on.
- Operation 22: The seller refunded the buyer voluntarily.
That is a lot of change: 22 out of 42 characters, so more than 50% of the characters were moved. Does this look more like two word changes or more like changing over 50% of the characters? So, certain changes are better represented as word changes.
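The same comparison can be tried out with Python's standard `difflib` module. This is only a rough sketch: `difflib` is a stand-in that lists delete, insert and replace operations and is not guaranteed to find the minimal Edit Distance, the `changed_units` helper is a name made up for this example, and dropping the final period before splitting into words is an assumption (otherwise “buyer.” and “buyer” would count as different words).

```python
from difflib import SequenceMatcher


def changed_units(before, after):
    """Count units that difflib marks as deleted, inserted or replaced."""
    total = 0
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replace touches max(old, new) units; delete/insert touch one side.
            total += max(i2 - i1, j2 - j1)
    return total


mt = "The seller voluntarily refunded the buyer."
pe = "The seller refunded the buyer voluntarily."

# Assumption: drop the final period so "buyer." and "buyer" are the same word.
mt_words = mt.rstrip(".").split()
pe_words = pe.rstrip(".").split()

word_changes = changed_units(mt_words, pe_words)
char_changes = changed_units(mt, pe)

print(f"words: {word_changes} of {len(mt_words)} = {word_changes / len(mt_words):.0%}")
print(f"chars: {char_changes} of {len(mt)} = {char_changes / len(mt):.0%}")
```

The word line should reproduce the 2 operations out of 6 words. The character count may differ slightly from the 22 used above, depending on whether the space next to the moved word is counted, but it stays above half of the 42 characters.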
Some other possible arguments that could be made in favor of words:
- The “unit of attention” of a translator is a word, not a character. Nobody changes characters; they change words.
- Translators think of the meaning of the whole word and then apply a change in meaning, which may be just a character. So, the “effort of PE” is a thinking effort in words.
- It is easier to think of changing 2 words out of 5 than 17 characters out of 85.
The image below shows what the difference between the word-based and the character-based edit distance looks like for a sample of about 1,000 segments. The average edit distance was 44% in words and 28% in characters. In the numbers I have seen, the character-based edit distance is usually smaller than the word-based one by about 35 to 40% of the word-based value. Here, 44 - 28 = 16, and 16 is about 36% of 44.
Some automatic metrics are based on words, such as BLEU and TER. Other metrics, such as CharacTer and chrF++, have words and characters working together to produce scores. And some metrics are designed to ignore changes in position, such as Position-independent Word Error Rate (PER) and, to some extent, BLEU. These are all attempts to better represent the changes.
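To give an idea of what “ignoring changes in position” means, here is a simplified sketch of a position-independent error rate. It treats each sentence as a bag of words; published definitions of PER vary in their details, so take the formula, the function name and the whitespace tokenization as assumptions made for this example.

```python
from collections import Counter


def position_independent_error_rate(reference, hypothesis):
    """Simplified PER: compare the two sentences as bags of words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # Words shared by both sentences, counted with multiplicity, order ignored.
    matches = sum((Counter(ref_words) & Counter(hyp_words)).values())
    errors = max(len(ref_words), len(hyp_words)) - matches
    return errors / len(ref_words)


# Hypothetical inputs: the reordering example with the final period removed.
pe = "The seller refunded the buyer voluntarily"
mt = "The seller voluntarily refunded the buyer"

print(position_independent_error_rate(pe, mt))
```

For the reordered sentence this prints 0.0: both sentences contain exactly the same words, so a position-independent measure sees no change at all, while the word-based Edit Distance counted two operations.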
We should just calculate both edit distances (by characters and by words) for a while, until we get better numbers that help us choose one or the other in each situation.
This list of arguments for one calculation or the other is by no means exhaustive. What others can you think of?
There is one more choice about Edit Distance: how should we normalize the Edit Distance?
You can find the other two articles in this series at Going the Distance — Edit Distance 1 and Going the Distance — Edit Distance 3. If you enjoyed this article, please check out the other posts from the eBay MT Language Specialists series.