Microsoft and Nvidia team up to train one of the world’s largest language models


Microsoft and Nvidia announced today that they have trained what they claim is the largest and most capable AI-powered language model to date: Megatron-Turing Natural Language Generation (MT-NLG). The successor to the companies’ Turing NLG 17B and Megatron-LM models, MT-NLG contains 530 billion parameters and achieves “unsurpassed” accuracy in a wide range of natural language tasks, Microsoft and Nvidia say, including reading comprehension, general reasoning, and natural language inference.

“The quality and results we have achieved today are a major step forward on the journey towards unlocking the full promise of AI in natural language. The innovations in DeepSpeed and Megatron-LM will benefit existing and future AI model development and make large AI models cheaper and faster to train,” wrote Nvidia’s Paresh Kharya and Microsoft’s Ali Alvi in a blog post. “We look forward to how MT-NLG will shape tomorrow’s products and motivate the community to push the boundaries of natural language processing (NLP) even more. The trip is long and far from finished, but we are excited about what is possible and what awaits.”

Training massive language models

In machine learning, parameters are the part of the model that is learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. Language models with large numbers of parameters, more data, and more training time have been shown to acquire a richer, more nuanced understanding of language, for example the ability to summarize books and even complete programming code.

To train MT-NLG, Microsoft and Nvidia say they created a training set of 270 billion tokens from English-language websites. Tokens, a way of separating pieces of text into smaller units of natural language, can be words, characters, or parts of words. Like all AI models, MT-NLG had to “train” by ingesting a set of examples to learn patterns among data points, such as grammatical and syntactic rules.
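To make the three granularities concrete, here is a minimal Python sketch that splits one phrase into words, characters, and word pieces. The fixed-width `naive_subwords` helper is a hypothetical stand-in used only for illustration; production models typically learn their subword vocabulary from data, and nothing here reflects the tokenizer actually used for MT-NLG.

```python
def word_tokens(text: str) -> list[str]:
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text: str) -> list[str]:
    """One token per character, spaces included."""
    return list(text)


def naive_subwords(text: str, width: int = 4) -> list[str]:
    """Hypothetical subword scheme: cut each word into fixed-width pieces.

    Real subword tokenizers learn their pieces from data; this only
    illustrates that a token can be a fragment of a word.
    """
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + width] for i in range(0, len(word), width))
    return pieces


phrase = "language models"
print(word_tokens(phrase))        # ['language', 'models']
print(len(char_tokens(phrase)))   # 15
print(naive_subwords(phrase))     # ['lang', 'uage', 'mode', 'ls']
```

The same phrase yields 2, 15, or 4 tokens depending on the scheme, which is why a token count such as 270 billion only has meaning relative to a particular tokenizer.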

The dataset came largely from The Pile, an 835GB collection of 22 smaller datasets created by the open source AI research effort EleutherAI. The Pile spans academic sources (e.g. arXiv, PubMed), communities (StackExchange, Wikipedia), code repositories (GitHub) and more, which Microsoft and Nvidia say they curated and combined with filtered snapshots of Common Crawl, a large collection of web pages including news stories and social media posts.


Above: The data used to train MT-NLG.

Training took place across 560 Nvidia DGX A100 servers, each containing 8 Nvidia A100 80 GB GPUs.
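As a quick sanity check on the scale involved, the short Python sketch below multiplies out the figures quoted above. It is plain arithmetic on the article's numbers; the constant names are illustrative, and the result says nothing about how the GPUs were actually partitioned during training.

```python
SERVERS = 560          # DGX A100 servers, per the article
GPUS_PER_SERVER = 8    # A100 GPUs in each server
GPU_MEMORY_GB = 80     # memory per A100 GPU

total_gpus = SERVERS * GPUS_PER_SERVER
total_memory_gb = total_gpus * GPU_MEMORY_GB

print(total_gpus)        # 4480
print(total_memory_gb)   # 358400
```

That works out to 4,480 GPUs with roughly 358 terabytes of combined GPU memory.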

In benchmarks, Microsoft says that MT-NLG can infer basic mathematical operations, even when the symbols are “poorly veiled”. Although the model is not extremely accurate, it seems to go beyond memorization for arithmetic and manages to complete tasks containing questions that prompt it for an answer, a major challenge in NLP.

It is well established that models such as MT-NLG can amplify the biases in the data they were trained on, and in fact Microsoft and Nvidia acknowledge that the model “captures stereotypes and biases from [training] data.” This is probably because part of the dataset was sourced from societies with pervasive gender, race, physical and religious prejudices that curation cannot completely resolve.

In a paper, the Middlebury Institute of International Studies’ Center on Terrorism, Extremism and Counterterrorism claims that GPT-3 and similar models can generate “informative” and “influential” text that could radicalize people into far-right extremist ideologies and behaviors. A group at Georgetown University has used GPT-3 to generate misinformation, including stories built around a false narrative, articles altered to push a false perspective, and tweets riffing on particular points of misinformation. Other studies, such as one released in April by researchers at Intel, MIT and the Canadian AI initiative CIFAR, have found high levels of stereotypical bias in some of the most popular open source models, including Google’s BERT and XLNet and Facebook’s RoBERTa.

Microsoft and Nvidia claim that they are “committed to working on addressing [the] problem” and encourage “continued research to help quantify the model bias.” They also state that any use of Megatron-Turing in production “shall ensure that appropriate measures are taken to mitigate and minimize potential harm to users” and follow principles such as those outlined in Microsoft’s responsible AI principles.

“We live in a time when AI progress far exceeds Moore’s law. We continue to see more computing power being made available with newer generations of GPUs associated with lightning speeds. At the same time, we continue to see that hyper-scaling of AI models leads to better performance, apparently no end in sight,” continued Kharya and Alvi. “Marrying these two trends together is software innovation that pushes the boundaries of optimization and efficiency.”

The cost of large models

Projects like MT-NLG, AI21 Labs’ Jurassic-1, Huawei’s PanGu-Alpha, Naver’s HyperCLOVA and the Beijing Academy of Artificial Intelligence’s Wu Dao 2.0 are impressive from an academic point of view, but building them does not come cheap. For example, the training dataset for OpenAI’s GPT-3, one of the world’s largest language models, was 45 terabytes in size, enough to fill 90 500GB hard drives.

AI training costs fell 100-fold between 2017 and 2019, according to one source, but the totals still exceed the computing budgets of most startups. The inequity favors companies with extraordinary access to resources at the expense of small entrepreneurs, entrenching the advantages of incumbents.

For example, OpenAI’s GPT-3 required an estimated 3.14 × 10^23 floating-point operations (FLOPs) of compute during training. In computer science, FLOPS (floating-point operations per second) is a measure of raw processing performance, typically used to compare different types of hardware. Assuming OpenAI reserved 28 teraflops (28 trillion floating-point operations per second) of compute across a bank of Nvidia V100 GPUs, a common GPU available through cloud services, a single training run would cost $4.6 million. One Nvidia RTX 8000 GPU with 15 teraflops of compute would be substantially cheaper, but it would take 665 years to finish the training.
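The 665-year figure can be roughly reproduced with a few lines of Python. This sketch assumes the GPT-3 estimate refers to about 3.14 × 10^23 total floating-point operations and that the hardware sustains its quoted throughput the whole time; both are simplifying assumptions, and the function and constant names are illustrative.

```python
TRAINING_FLOPS = 3.14e23             # assumed total operations for GPT-3 training
SECONDS_PER_YEAR = 365 * 24 * 3600   # ignores leap years


def training_years(sustained_flops_per_second: float) -> float:
    """Years needed to perform TRAINING_FLOPS at a constant throughput."""
    return TRAINING_FLOPS / sustained_flops_per_second / SECONDS_PER_YEAR


rtx_years = training_years(15e12)    # 15 teraflops
v100_years = training_years(28e12)   # 28 teraflops

print(f"{rtx_years:.0f} years at 15 teraflops")
print(f"{v100_years:.0f} years at 28 teraflops")
```

Under these assumptions the 15-teraflops case lands within a couple of years of the quoted 665, and the 28-teraflops case comes to about 355 years of device time. The dollar figure additionally depends on cloud pricing, which is not given here, so it is left out of the sketch.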

A Synced report estimated that a fake news detection model developed by researchers at the University of Washington cost $25,000 to train, and Google spent about $6,912 to train a language model called BERT, which it used to improve the quality of Google’s search results. Storage costs also rise quickly when handling datasets at the terabyte or petabyte scale. To take an extreme example, one of the datasets accumulated by Tesla’s self-driving team, 1.5 petabytes of video footage, would cost over $67,500 to store in Azure for three months, according to CrowdStorage.
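To put the storage example in more familiar units, the following Python sketch backs out the per-gigabyte monthly rate implied by the figures above. It assumes decimal units (1 petabyte = 1,000,000 gigabytes) and a flat rate with no tiering or egress fees; the helper name is hypothetical, and the result is not a quoted Azure price.

```python
PETABYTE_IN_GB = 1_000_000   # decimal units assumed


def implied_rate_per_gb_month(total_cost_usd: float, size_pb: float, months: int) -> float:
    """Flat per-GB monthly rate implied by a total storage bill."""
    return total_cost_usd / (size_pb * PETABYTE_IN_GB * months)


rate = implied_rate_per_gb_month(67_500, 1.5, 3)
print(f"${rate:.3f} per GB-month")   # $0.015 per GB-month
```

At about 1.5 cents per gigabyte per month, the rate itself is modest; the total is large only because of the dataset's size.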

The effects of AI and machine learning model training on the environment have also been highlighted. In June 2020, researchers at the University of Massachusetts at Amherst published a report estimating that the amount of power required for training and searching a certain model involves emissions of approximately 626,000 pounds of carbon dioxide, equivalent to nearly 5 times the lifetime emissions of the average American car. OpenAI itself has admitted that models like Codex require significant amounts of compute, on the order of hundreds of petaflops per day, which contributes to carbon emissions.

In a sliver of good news, the cost of FLOPS and basic machine learning operations has dropped over the last few years. An OpenAI survey from 2020 showed that the amount of compute needed to train a model to the same performance when classifying images in a popular benchmark, ImageNet, has fallen by a factor of two every 16 months since 2012. Other recent studies suggest that large language models are not always more complex than smaller models, depending on the techniques used to train them.
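A halving every 16 months compounds quickly. The Python sketch below shows the cumulative reduction implied by that rate, assuming the trend holds smoothly over the whole period; the function name and the 96-month example window (2012 to 2020) are illustrative choices, not figures from the survey.

```python
def compute_reduction_factor(months_elapsed: float, halving_months: float = 16.0) -> float:
    """Factor by which required compute shrinks after `months_elapsed`."""
    return 2.0 ** (months_elapsed / halving_months)


print(compute_reduction_factor(16))   # 2.0
print(compute_reduction_factor(96))   # 64.0 (eight years, six halvings)
```

On this idealized curve, reaching the same ImageNet accuracy after eight years would take about 1/64 of the original compute.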

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says that when it comes to natural language, it is an open question whether larger models are the right approach. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping huge amounts of data into models is uncertain.

“The current structure of the field is task-focused, where the community gathers to try to solve specific problems on specific datasets,” Antoniak told VentureBeat in an earlier interview. “These tasks are usually very structured and can have their own weaknesses, so even though they help our field move forward in some ways, they can also limit us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to a true language understanding is up for debate.”

