Google Translate is by no means a perfect translation service — you’re still going to have to invest in those language classes if you want to be able to fluently communicate with speakers of foreign tongues — but if you remember the online wasteland that preceded it, you’ll know it’s pretty darn good at conveying the gist of what you’re trying to express. Also: That ‘detect language’ feature saves you the hassle of fiddling with lots of dropdown menus. So how does it do it?
As with many of Google’s technologies, Google Translate’s M.O. consists of sifting through large piles of data — in this case, text. Google calls this process of translating by finding patterns in vast swathes of writing “statistical machine translation.” As humans, we learn languages by navigating the sets of rules that govern them, so Google’s process might seem deeply unintuitive. But compare its results to those of translation services like Babel Fish, which is powered by SYSTRAN’s rule-based machine translation, and the improved accuracy speaks for itself.
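To make the idea concrete, here is a toy sketch of how a statistical system might pick translations: instead of applying grammar rules, it looks up which target phrases have most often corresponded to each source phrase in its training data. The phrase table and probabilities below are invented for illustration; Google’s real system is vastly more sophisticated.

```python
# Hypothetical phrase table: each source phrase maps to candidate
# translations with probabilities learned from example data.
phrase_table = {
    "la maison": [("the house", 0.7), ("the home", 0.3)],
    "bleue": [("blue", 0.9), ("sad", 0.1)],
}

def translate(phrases):
    """Pick the most probable target phrase for each source phrase."""
    out = []
    for p in phrases:
        candidates = phrase_table.get(p, [(p, 1.0)])  # pass unknowns through
        best, _ = max(candidates, key=lambda c: c[1])
        out.append(best)
    return " ".join(out)

print(translate(["la maison", "bleue"]))  # -> "the house blue"
```

Note that nothing here encodes French grammar; the choice between “the house” and “the home” comes purely from the frequencies observed in past translations.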
Indeed, Google used SYSTRAN for its translations up until 2007, when it switched to its own system. At the time, Google research scientist Franz Och explained the switch as follows:
“Most state-of-the-art commercial machine translation systems in use today have been developed using a rules-based approach and require a lot of work by linguists to define vocabularies and grammars. Several research systems, including ours, take a different approach: we feed the computer with billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model.”
With a much more robust set of data at its disposal today — Google has reportedly fed thousands of multilingual United Nations and European Union official documents to Google Translate to get it up to speed, Rosetta Stone-style, and the millions of pages it indexes can’t hurt — Google continues to stick up for its data-driven approach to translation while granting that it’s imperfect. Recently, Google released the video below to explain Google Translate.
From the video: (via Google OS’ transcript)
“If you want to teach someone a new language you might start by teaching them vocabulary words and grammatical rules that explain how to construct sentences. A computer can learn foreign language the same way – by referring to vocabulary and a set of rules. But languages are complicated and, as any language learner can tell you, there are exceptions to almost any rule. When you try to capture all of these exceptions, and exceptions to the exceptions, in a computer program, the translation quality begins to break down. Google Translate takes a different approach…”
“…Once the computer finds a pattern, it can use this pattern to translate similar texts in the future. When you repeat this process billions of times you end up with billions of patterns and one very smart computer program. For some languages however we have fewer translated documents available and therefore fewer patterns that our software has detected. This is why our translation quality will vary by language and language pair.”
Google Translate has been through 20 different stages of development, the most recent of which, in June of this year, allowed for the romanization of Arabic text. It’s not without its critics: Forbes’ Lee Gomes recently compared its fluency to that of “a barely competent human translator, one who happens to be both distracted and drunk,” and wrote that he was “doubtful” that its language proficiency would ever equal a human’s.
To humans who like to do things in human ways, there is something distasteful about using computational brute force to solve problems: However good the solutions might look, the process of arriving at them is so opaque, so unintuitive, that it rarely sheds light on other problems. That’s why mathematicians tend to dislike proofs by exhaustion, valid though they are. But sometimes the geyser of data is the best approach: The team that recently proved that any Rubik’s Cube can be solved in 20 moves or less did so by checking an enormous number of positions, though human cleverness came into play in greatly reducing the number of positions that needed checking.
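A proof by exhaustion in miniature, for the curious: the claim that every perfect square is congruent to 0 or 1 mod 4 can be settled by checking just four cases, because n² mod 4 depends only on n mod 4. The symmetry argument that collapses infinitely many integers down to four residues is the same kind of human cleverness that pruned the cube team’s search space.

```python
# Since (n ** 2) % 4 depends only on n % 4, checking the residues
# 0, 1, 2, 3 covers every integer -- a symmetry argument reduces
# infinitely many cases to four, which brute force then exhausts.
assert all((n * n) % 4 in (0, 1) for n in range(4))
print("verified: every square is 0 or 1 mod 4")
```

The check is a valid proof, yet it offers no feel for *why* squares behave this way, which is exactly the complaint the paragraph above describes.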
Is language solvable in the same way as a Rubik’s Cube? The answer depends on how tight a solution you want: It will probably be some time before you’ll want to venture into a foreign land armed only with your Google Translate-equipped Android and hope for the best. But statistical machine translation still holds a lot of promise: Even if we can’t understand exactly how it makes every choice from our vantage, it already represents a leap past the decades-old rule-based systems that preceded it, and any future attempt to sort out the world’s languages for human usage will owe something to Google’s opaque stacks of data.
(via Google OS)
Published: Aug 12, 2010 11:16 am