Today, Microsoft announced that Microsoft Translator, its AI-powered text translation service, now supports more than 100 different languages and dialects. With the addition of 12 new languages, including Georgian, Macedonian, Tibetan and Uighur, Microsoft claims that Translator can now make text and information in documents available to 5.66 billion people worldwide.
Translator is not the first service to support more than 100 languages; Google Translate reached that milestone in early February 2016. (Amazon Translate supports only 71.) But Microsoft says the new languages are underpinned by unique advances in AI and will be available in the Translator apps, Office, and Translator for Bing, as well as Azure Cognitive Services Translator and Azure Cognitive Services Speech.
“One hundred languages is a good milestone for us to achieve our ambition that everyone should be able to communicate regardless of the language they speak,” Microsoft Azure AI Chief Technology Officer Xuedong Huang said in a statement. “We can take advantage of [commonalities between languages] and use that … to improve whole language famil[ies].”
As of today, Translator supports the following new languages, which Microsoft says are natively spoken by 84.6 million people in total:
- Mongolian (Cyrillic)
- Mongolian (Traditional)
- Uzbek (Latin)
Powering Translator’s upgrades is Z-Code, part of Microsoft’s larger XYZ Code initiative to combine AI models for text, vision, audio, and language to create AI systems that can speak, see, hear, and understand. The team consists of researchers and engineers from Azure AI and the Project Turing research group, which focuses on building multilingual, large-scale language models that support different production teams.
Z-Code provides the framework, architecture, and models for text-based, multilingual AI language translation across entire language families. Because of the sharing of linguistic elements across similar languages and transfer learning, which applies knowledge from one task to another related task, Microsoft claims to have dramatically improved the quality and reduced the cost of its machine translation capabilities.
With Z-Code, Microsoft uses transfer learning to move beyond the most common languages and improve the translation accuracy of “low-resource” languages, meaning languages with fewer than 1 million sentences of training data. (Like all models, Microsoft’s learn from examples in large datasets drawn from a mix of public and private archives.) Approximately 1,500 known languages fit this criterion, which is why Microsoft developed a multilingual translation training process that marries language families and language models.
Techniques such as neural machine translation, rewriting-based paradigms, and on-device processing have led to quantifiable leaps in machine translation accuracy. But until recently, even state-of-the-art algorithms lagged behind human performance. Efforts beyond Microsoft illustrate the scale of the problem: the Masakhane project, which aims to make thousands of languages on the African continent automatically translatable, has not yet moved beyond the data collection and transcription phase. In addition, Common Voice, Mozilla’s effort to build an open source collection of transcribed speech data, has vetted only dozens of languages since its 2017 launch.
Z-Code language models are trained multilingually across many languages, and that knowledge is transferred between languages. Another round of training transfers knowledge between tasks. For example, the models’ translation skills (“machine translation”) are used to help improve their ability to understand natural language (“natural language understanding”).
In August, Microsoft said a Z-Code model with 10 billion parameters could achieve state-of-the-art results on machine translation and multilingual summarization tasks. In machine learning, parameters are internal configuration variables that a model uses when making predictions, and their values essentially, but not always, define the model’s skill on a problem.
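To make the idea of a parameter count concrete, here is a minimal Python sketch that tallies the weights and biases of a small fully connected network. The layer sizes and function names are hypothetical and chosen only for illustration; they do not describe Z-Code’s actual architecture.

```python
# Illustrative only: count the parameters in a toy fully connected network.
# The layer sizes below are hypothetical, not taken from any Microsoft model.

def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


def count_params(layer_sizes: list[int]) -> int:
    """Sum the parameters of each consecutive pair of layers."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


toy_layers = [512, 2048, 512]    # input, hidden, and output widths
print(count_params(toy_layers))  # prints 2099712
```

Even this toy network has about 2.1 million parameters, which gives a rough sense of scale for a model with 10 billion.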
Microsoft is also working on training a 200-billion-parameter version of the aforementioned benchmark-beating model. For reference, OpenAI’s GPT-3, one of the world’s largest language models, has 175 billion parameters.
Leading competitor Google is also using new AI techniques to improve language translation quality across its services. Not to be outdone, Facebook recently unveiled a model that uses a combination of word-for-word translations and back-translations to outperform existing systems for more than 100 language pairings. And in academia, MIT CSAIL researchers have presented an unsupervised model (i.e., a model that learns from data that has not been explicitly labeled or categorized) that can translate between texts in two languages without direct translational data between the two.
Of course, no machine translation system is perfect. Some researchers claim that AI-translated text is less “lexically” rich than human translations, and there is ample evidence that language models amplify the biases found in the datasets on which they are trained. AI researchers from MIT, Intel, and the Canadian AI initiative CIFAR have found high levels of bias in language models including BERT, XLNet, OpenAI’s GPT-2, and RoBERTa. Beyond this, Google identified (and claims to have addressed) gender bias in the translation models underlying Google Translate, particularly with regard to resource-poor languages such as Turkish, Finnish, Persian, and Hungarian.
Microsoft, for its part, points to Translator’s traction as evidence of the platform’s sophistication. In a blog post, the company notes that thousands of organizations around the world use Translator for their translation needs, including Volkswagen.
“The Volkswagen Group uses machine translation technology to serve customers in more than 60 languages, translating more than 1 billion words each year,” writes Microsoft’s John Roach. “The reduced data requirements … enable the translation team to build models for languages with limited resources, or which are threatened by declining populations of native speakers.”