
Google Translate

A Brief Discussion on the AI behind Google Translate

Mariam Jaludi


Google Translate is a translation service developed by Google. If you’ve travelled, tried to learn another language, or wanted to understand the comments section under a video or post, chances are you’ve used Google Translate to help. Google Translate is available through a web interface, as mobile apps for Android and iOS, and even as an API that helps developers build browser extensions and software applications. Google Translate launched in 2006, using UN and European Parliament transcripts to gather linguistic data. It currently supports over 100 languages and boasts over 500 million daily users.

Let’s take a look under the hood of Google Translate and discuss what it takes to build a translator.

Google Machine Translation

Let’s look at what approaches we could take to build our own translator:

Approach 1: Word-for-Word Translation

Word-for-Word translation involves taking every single word in the source language sentence and finding the corresponding word in the target language.

To do this, all we need is a curated database with translations from one language to another. For example, if we are translating from English to French, for every English word, we look up the corresponding French word in the database. We repeat this process for every word in the sentence.
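The lookup idea above can be sketched in a few lines of Python. The tiny English-to-French dictionary here is purely illustrative:

```python
# A minimal word-for-word translator: one dictionary lookup per token.
# This toy English-to-French vocabulary is made up for the example.
EN_TO_FR = {
    "the": "le",
    "cat": "chat",
    "eats": "mange",
    "fish": "poisson",
}

def word_for_word(sentence: str) -> str:
    """Translate each word independently; unknown words pass through unchanged."""
    return " ".join(EN_TO_FR.get(word, word) for word in sentence.lower().split())

print(word_for_word("the cat eats fish"))  # → "le chat mange poisson"
```

Note that a fluent French speaker would say “le chat mange du poisson” — the word-for-word output drops the partitive article entirely, which is exactly the kind of grammatical failure discussed next.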

Although this approach is simple and easy to implement, it generally does not produce grammatically correct sentences. Languages are composed of two things:

  1. Tokens: Smallest units of language.
  2. Grammar: Defines the order of tokens so that they make sense.

Every word is a token. If languages were only constructed by tokens and grammar did not matter, the word-for-word translation model would be acceptable and the problem of language translation would be easily implemented.

Alas, grammar is key to making sense of things (ask the grammar nazis of the interwebs). Grammar must be incorporated into a translator’s logic. To add to that, if you are bilingual or have ever attempted to learn a foreign language, you know that each language has exceptions to its rules of grammar. Trying to capture all of these rules, exceptions, and exceptions to the exceptions in a program quickly becomes unmanageable, and the quality of the translations deteriorates rapidly.

How can we incorporate grammar?

To incorporate grammar, two things need to be considered. The first is syntax analysis: syntax is the basic structure of a sentence. The second is semantics: the meaning of a sentence. Does the sentence make sense in context? If these two components are not considered, the translator’s output will not be valid language.

Approach 2: Neural Networks

Neural networks learn to solve problems by looking at a vast number of examples. They can be used to learn the grammar a translator needs. A simplistic model of a translator using a neural network may look something like the figure below:

Simple NN Language Translator

Neural networks are taught language patterns and eventually are able to translate a given English sentence into French all on their own.

To continue our example of English to French translation, a neural network takes an English sentence or sequence of words as an input and gives a French sentence or sequence as an output. In order for this input to be interpreted by a neural network, it must be converted into a format it understands, i.e. a vector or matrix.

Vectors and matrices are arrays of numbers representing data. The component that converts a sentence to a vector, called the Vector Mapper, is the first part of the network. It takes our English sentence and returns a vector that a computer can understand.
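A toy Vector Mapper can be sketched as a vocabulary lookup into an embedding table. The vocabulary, embedding size, and random weights below are all made-up placeholders — in a real system the embeddings are learned during training:

```python
import numpy as np

# Assign each word an integer id, then look up a row in an embedding matrix.
# Vocabulary and embedding size (4) are arbitrary for this sketch.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sleeps": 3}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))  # one 4-dim vector per word

def sentence_to_vectors(sentence: str) -> np.ndarray:
    """Map a sentence to a matrix: one row of numbers per word."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.lower().split()]
    return embeddings[ids]  # shape: (num_words, 4)

vecs = sentence_to_vectors("the cat sleeps")
print(vecs.shape)  # → (3, 4)
```

Words outside the vocabulary map to a shared `<unk>` (unknown) vector, a common convention in real systems.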

Because a translator deals with sentences, i.e. sequences of words, we can use a Recurrent Neural Network (RNN). RNNs are networks designed to process sequential data such as sentences.

Once a sentence has been converted into a vector, that vector needs to be turned into a French sentence. This reverse mapping is done with a second neural network. Once again, because we are working with sentences, another RNN can be used. Together, these two neural networks form the basic foundation of a language translator. This is called the Encoder-Decoder Architecture.

The encoder encodes the input sentence into vector form, and the decoder decodes those vectors back into language.

Encoder-Decoder Architecture: The first RNN (sequence-to-vector) encodes the English sentence to computer data. The second network (vector-to-sequence) decodes the computer data into the French sentence.
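The sequence-to-vector and vector-to-sequence halves can be sketched with plain numpy recurrences. The weights here are random and untrained, and the sizes are arbitrary — this is the shape of the computation, not a working translator:

```python
import numpy as np

rng = np.random.default_rng(1)
H, E = 8, 4  # hidden-state and word-vector sizes (arbitrary for this sketch)
Wx = rng.normal(size=(H, E)) * 0.1  # input-to-hidden weights
Wh = rng.normal(size=(H, H)) * 0.1  # hidden-to-hidden (recurrent) weights

def encode(word_vectors):
    """Sequence-to-vector: fold the whole sentence into one hidden state."""
    h = np.zeros(H)
    for x in word_vectors:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def decode(h, steps, Wout):
    """Vector-to-sequence: unroll the state, emitting one output per step.
    (A real decoder also feeds each generated word back in; omitted here.)"""
    outputs = []
    for _ in range(steps):
        h = np.tanh(Wh @ h)
        outputs.append(Wout @ h)
    return outputs

Wout = rng.normal(size=(E, H)) * 0.1
h = encode(rng.normal(size=(5, E)))    # stand-in for a 5-word English sentence
out = decode(h, steps=6, Wout=Wout)    # stand-in for a 6-word French sentence
print(len(out), out[0].shape)  # → 6 (4,)
```

Notice that the decoder sees the source sentence only through the single vector `h` — the bottleneck that motivates the attention mechanism discussed later.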

More specifically, these RNNs are called “Long Short-Term Memory Recurrent Neural Networks” (LSTM-RNN). LSTM networks are able to deal with longer sentences fairly well. They were introduced by Hochreiter and Schmidhuber in 1997 and were refined and popularized by many people in later works.
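An LSTM improves on the plain recurrence above by adding a gated cell state that can carry information over long spans. The following is a minimal numpy sketch of one LSTM step following the standard gate equations; parameter shapes and values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b stack the input/forget/output/candidate
    parameters into four blocks of size H."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])          # input gate: how much new content to write
    f = sigmoid(z[H:2*H])        # forget gate: how much of c to keep
    o = sigmoid(z[2*H:3*H])      # output gate: how much state to expose
    g = np.tanh(z[3*H:4*H])      # candidate cell content
    c_new = f * c + i * g        # long-term memory update
    h_new = o * np.tanh(c_new)   # short-term (hidden) output
    return h_new, c_new

# Run a 10-step random sequence through the cell (untrained weights).
rng = np.random.default_rng(2)
H, E = 8, 4
W = rng.normal(size=(4 * H, E)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)
h = c = np.zeros(H)
for x in rng.normal(size=(10, E)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # → (8,)
```

The forget gate `f` is what lets the cell state persist (or discard) information across many steps, which is why LSTMs handle longer sentences better than vanilla RNNs.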

The encoder-decoder architecture works well for medium-length sentences (around 15–20 words). However, LSTM-RNN encoder-decoder structures do not fare as well with longer sentences. RNNs are not able to address the complexity of grammar in longer sentences. RNNs use persisted past information to make decisions about the present: while translating the 8th English word to French, the RNN looks back at the previous 7 words to make a decision. However, in language, a word depends not only on the words that come before it in a sentence but also on the words that come after it.

In order to look in both directions, forwards and backward, a normal RNN is replaced with a bi-directional recurrent neural network.

Approach 3: Bi-Directional Recurrent Neural Network

Source: Ekaterina Lobacheva, JetBrains, 2016

Bi-directional recurrent neural networks were introduced in the 1990s but gained popularity recently with the emergence of deep learning. If we are performing English-to-French translation, then while generating a given word in the French translation, we want to look both at the words that come before it and at the words that come after it.
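The bi-directional idea is mechanically simple: run one RNN pass forwards, one backwards, and concatenate the two hidden states at each position. A minimal numpy sketch (untrained random weights; a real BiRNN uses separate weights per direction, shared here for brevity):

```python
import numpy as np

def run_rnn(word_vectors, Wx, Wh):
    """Plain forward pass, keeping the hidden state at every position."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in word_vectors:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states

def bidirectional(word_vectors, Wx, Wh):
    """Each position gets [forward state; backward state], so it has seen
    both the words before it and the words after it."""
    fwd = run_rnn(word_vectors, Wx, Wh)
    bwd = run_rnn(word_vectors[::-1], Wx, Wh)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(3)
H, E = 6, 4  # arbitrary sizes for this sketch
Wx = rng.normal(size=(H, E)) * 0.1
Wh = rng.normal(size=(H, H)) * 0.1
states = bidirectional(rng.normal(size=(5, E)), Wx, Wh)  # a 5-"word" sentence
print(len(states), states[0].shape)  # → 5 (12,)
```

Each position now carries a 2H-dimensional state summarising its full left and right context — exactly the property the translation problem needs.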

With a bi-directional network, we are able to do this. This solves a big problem but raises a new issue: does every word in a sentence matter equally to the words around it? Which words should we focus on more in a long sentence?

A method to figure this out was introduced by Bahdanau et al. in 2015: learning to jointly align and translate. Alignment refers to the order of the words as well as an individual word’s weight in affecting the words before and after it.

Source: Bahdanau et al., 2016

The figure above is from Bahdanau et al.: the vertical axis shows the French words and the horizontal axis shows the English words. The squares, shaded from black to white, represent the weight of the alignment. Whiter squares mark the English words that receive more emphasis and most affect the surrounding word structure. This alignment is learned by an extra unit called an “Attention Mechanism”, which sits between the encoder and the decoder and learns which English words to focus on while generating each word of the French translation.

The Bi-Directional RNN model works as follows:

  1. An English sentence is fed to an encoder.
  2. The encoder translates the sentence to a vector (numbers).
  3. The vector is sent to the Attention Mechanism (AM). The AM decides which French words will be generated by which English words.
  4. The decoder will then generate the French translation, one word at a time, focusing its attention on the words determined by the AM, producing the French sentence.
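Step 3 can be sketched as a weighted average: score every encoder state against the current decoder state, normalize the scores into alignment weights, and mix the encoder states accordingly. Bahdanau et al. score with a small feed-forward network; the dot-product scorer below is a deliberate simplification, and all values are random placeholders:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Score each English position against the current decoder state,
    then mix the encoder states by those alignment weights."""
    scores = np.array([decoder_state @ h for h in encoder_states])
    weights = softmax(scores)  # one alignment weight per source word
    context = sum(w * h for w, h in zip(weights, encoder_states))
    return weights, context

rng = np.random.default_rng(4)
encoder_states = list(rng.normal(size=(5, 8)))  # 5 source positions, size 8
weights, context = attend(rng.normal(size=8), encoder_states)
print(weights.shape, context.shape)  # → (5,) (8,)
```

The `weights` vector is exactly what the black-to-white alignment squares in the figure visualise: one row of attention weights per generated French word.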

Bahdanau et al. found that this model performs better than the original encoder/decoder architecture.

Source: Bahdanau et al., 2016. The RNNenc-30 and RNNenc-50 curves show the performance of the original encoder/decoder architecture.

Let’s go back to Google Translate. In November 2016, Google announced its transition to neural machine translation.

Google Translate’s AI works much like the bi-directional RNN model above, but at a much larger scale. Instead of a single LSTM layer each for the encoder and decoder, it uses 8 layers for the encoder and another 8 for the decoder, with connections between the layers. The bottom layers of the encoder are bi-directional, taking both forward and backward context into consideration. Google chose not to use bi-directional RNNs at every layer to save computation time.

Source: Google’s Neural Machine Translation System, Jiaming Song, http://tsong.me/blog/google-nmt/

This depth matters because deeper networks are better at modelling complex problems; a deep network is more capable of capturing the semantics and grammar of language. In the final model, English text is passed, word by word, to the encoder, which converts the words into a number of word vectors. These word vectors are then passed to an attention mechanism, which determines which English words to focus on while generating each French word. This data is passed to the decoder, which generates the French sentence one word at a time. This is a very high-level summary of how Google Translate’s AI works. Google’s work on neural machine translation uses state-of-the-art training techniques to improve the system: Google Translate uses RNNs to directly learn the mapping from a sentence in one language to a sentence in another.

Great! You now know how Google Translate works! Now onto a deeper look into neural networks in a future discussion…
