Imagine this.

You are a beginning English learner. You enjoy the methodical approach, so you tackle the language systematically, memorizing lists of irregular verbs, spelling norms, and syntactic rules. No conversational practice, no watching movies. You want to get the theory right first.

One day, you think you have mastered it. You are a walking grammar book.

[Image: you after 6 months of English studies]

Then you realize you have been so engrossed in your studies that you skipped lunch, so you ask a passer-by:

Excuse me, sir. I am heavily hungry. Could you point me to the nearest swift-food restaurant?

Which he greets with a baffled stare.

[Image: “You don’t say?”]

What went wrong?

You studied the standard rules of English, but there is a part of the language (of any language) that will never fit in that tidy set of axioms — collocations.

Collocations: a vast n-gram web that connects all words in a language.

According to the Oxford English Dictionary, collocations are pairs or groups of words that are habitually juxtaposed, such as strong coffee or heavy drinker. As such, they are the final touch that foreign learners, or, say, machine translation (MT) systems, need to “speak properly.” You can communicate without knowing them, but you will sound pretty weird to the native ear.

In a wider sense, collocations are a sort of lightweight idioms, but they differ from idioms in a couple of ways:

  • Their meaning is transparent — you can guess it the first time you see them (which you can’t with proper, metaphorical idioms, such as kick the bucket or a piece of cake).
  • There are vastly more collocations than idioms. (The largest explanatory collocation dictionary in existence covers only 1% of all possible vocabulary.)

Most importantly, collocations don’t follow clear rules. There is no apparent reason why we should say a burning thirst and not a blazing thirst, except that most people say the former and not the latter. In a way, these whimsical word patterns are like an unexplored realm at the edges of grammar — a lush rainforest with all sorts of curious and apparently random species. At the edge of the forest, the human language learner and the MT system both face the same problem — how to chart it?

Playing Linnaeus to the human language “biosphere” is no trivial task, but fortunately there is help: massive computational power applied to vast sets of texts (linguistic corpora) is producing resources for us all.

The work with collocations is far from over. For MT, the challenge is finding enough corpora. (Except for a few — such as English, French, and Spanish — most languages don’t have enough online texts to create accurate models.) For human learners, there is the additional problem of analyzing and describing the vast amount of data in terms useful to the language student.
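To give a feel for what "applying computational power to a corpus" can look like, here is a minimal sketch that counts adjacent word pairs in a tiny made-up corpus and ranks them by pointwise mutual information (PMI), one common association measure among several. Everything in it is an assumption made for illustration: the names `tokenize`, `score_bigrams`, and `TOY_CORPUS` are hypothetical, the sentences are invented, and this is not the method used by any particular dictionary or MT system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_bigrams(sentences, min_count=2):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unigrams.update(tokens)
        # Pair each token with its right-hand neighbour.
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # pairs seen only once are too unreliable to score
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        # PMI: how much more often the pair occurs than chance would predict.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Invented example sentences; a real corpus would hold millions of words.
TOY_CORPUS = [
    "She ordered a strong coffee after lunch.",
    "Nothing beats strong coffee in the morning.",
    "He is a heavy drinker of strong coffee.",
    "The heavy drinker asked for a glass of water.",
    "A burning thirst drove them to the river.",
    "After the hike they had a burning thirst.",
    "The coffee was cold and the water was warm.",
]

if __name__ == "__main__":
    ranked = sorted(
        score_bigrams(TOY_CORPUS).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    for (w1, w2), pmi in ranked:
        print(f"{w1} {w2}: {pmi:.2f}")
```

Even on this toy data, pairs such as strong coffee and heavy drinker surface because they recur, while a pair like blazing thirst never appears at all. Pairs involving function words (a burning, for instance) also slip through, which hints at why real extraction pipelines need far larger corpora plus filtering by part of speech or stop words.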

The good news is that here, as in other areas, human linguists and MT systems can leverage each other’s efforts. Every new language model provides helpful data that can be used by the next generation of dictionaries, and every dictionary throws new light on the relationship patterns between words that MT can incorporate.


Meanwhile, language students will do well heading to the pub every once in a while for some conversational practice.

Want to learn more?

  • Go academic: Check out Mike Dillinger and Brigitte Orliac’s paper on Collocation extraction for MT.
  • Go deeper: Learn about collocations’ extended family, the phrasemes.
  • Go pro: Find your own collocations. Juan Rowda’s article on MT tools (the “Concordance Tool” section) tells you how.

And if you enjoyed this article, please check out the other posts in the eBay MT Language Specialists series.