AI Weekly: Researchers attempt an open source alternative to GitHub’s Copilot



In June, OpenAI teamed up with GitHub to launch Copilot, a service that suggests entire lines of code inside development environments like Microsoft Visual Studio. Powered by an AI model called Codex, which OpenAI later made available through an API, Copilot can translate natural language into code across more than a dozen programming languages, interpreting commands in plain English and executing them.

A community effort is now underway to create an open source, freely available alternative to Copilot and OpenAI’s Codex model. Named GPT Code Clippy, its contributors hope to create an AI pair programmer that will allow researchers to study large code-trained AI models to better understand their capabilities – and limitations.

Open source models

Codex is trained on billions of lines of public code and works with a broad set of frameworks and languages, adapting to the edits developers make to match their coding styles. Similarly, GPT Code Clippy learned from hundreds of millions of code examples to generate code much the way a human programmer would.

The contributors to the GPT Code Clippy project used GPT-Neo as the basis for their AI models. Developed by the grassroots research collective EleutherAI, GPT-Neo is what is known as a transformer model, meaning it weighs the influence of different parts of the input data rather than treating all of it equally. Transformers do not have to process the beginning of a sentence before the end; instead, they identify the context that lends each word in the sentence its meaning, which lets them process input data in parallel.
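The attention mechanism described above can be illustrated with a minimal, dependency-free sketch; the toy embeddings and dimensions are made up for demonstration and have nothing to do with GPT-Neo's actual weights or sizes:

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution over positions."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: every position weighs every other
    position at once, so no left-to-right pass over the sequence is needed."""
    d_k = len(q[0])
    out, all_weights = [], []
    for qi in q:
        # Relevance of each position in the sequence to this token.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        w = softmax(scores)
        all_weights.append(w)
        # Contextualized vector: weighted mix of all value vectors.
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v))
                    for d in range(len(v[0]))])
    return out, all_weights

# Toy "sentence": 3 tokens with 4-dimensional embeddings, self-attention (q = k = v).
x = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.3], [0.5, 0.5, 0.0, 0.9]]
out, weights = attention(x, x, x)
print(len(out), len(out[0]))      # 3 4 -> one contextualized vector per token
print(round(sum(weights[0]), 6))  # 1.0 -> each token's weights form a distribution
```

Because each token's weights are computed independently of the others, the loop over tokens can run in parallel, which is what makes transformers fast to train on long inputs.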

Above: Visual Studio plugin for GPT Code Clippy.

GPT-Neo was pretrained on The Pile, an 825GB collection of 22 smaller datasets spanning academic sources (e.g., arXiv, PubMed), communities (StackExchange, Wikipedia), code repositories (GitHub), and more. Through fine-tuning, the GPT Code Clippy contributors improved the models' code understanding by exposing them to GitHub repositories that met specific search criteria (e.g., more than 10 GitHub stars and two commits), filtered for duplicate files.
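The kind of corpus curation described here, keeping only repositories above a popularity threshold and dropping exact-duplicate files, can be sketched in a few lines. The repository names, thresholds, and file contents below are hypothetical, not the project's actual data:

```python
import hashlib

# Hypothetical repository metadata, standing in for results from a GitHub scrape.
repos = [
    {"name": "alice/fastlib", "stars": 120, "commits": 340},
    {"name": "bob/scratchpad", "stars": 3, "commits": 1},
    {"name": "carol/tools", "stars": 45, "commits": 88},
]

# Keep only repositories meeting the search criteria (e.g., >10 stars, >2 commits).
selected = [r for r in repos if r["stars"] > 10 and r["commits"] > 2]

# Duplicate files are common in scraped code corpora and skew training;
# a simple exact-dedup pass keys each file on a hash of its contents.
files = {"a.py": "print('hi')\n", "b.py": "print('hi')\n", "c.py": "x = 1\n"}
seen, unique = set(), {}
for path, text in files.items():
    digest = hashlib.sha256(text.encode()).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique[path] = text

print([r["name"] for r in selected])  # ['alice/fastlib', 'carol/tools']
print(sorted(unique))                 # ['a.py', 'c.py']
```

Real pipelines typically go further, with near-duplicate detection rather than exact hashing, but the filtering logic follows this shape.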

“We used the Hugging Face’s Transformers library … to fine-tune our model[s] on various code datasets, including one of our own, which we scraped from GitHub,” the contributors explain on the GPT Code Clippy project page. “We decided to fine-tune rather than train from scratch, as OpenAI’s Codex paper reports that training from scratch and fine-tuning the model [result in equivalent] performance. But fine-tuning allowed the model[s] to converge faster than training from scratch. Therefore, all versions of our models are fine-tuned.”
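As a rough illustration of how such a fine-tuning run is wired up with Hugging Face's Transformers library: the checkpoint name, output path, and hyperparameters below are illustrative assumptions, not the project's actual settings, and the heavyweight steps are left as comments since they require downloading a model and a prepared dataset:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative configuration sketch only; every value here is an assumption.
checkpoint = "EleutherAI/gpt-neo-125M"   # smallest GPT-Neo checkpoint

args = TrainingArguments(
    output_dir="gpt-code-clippy-ft",     # hypothetical output directory
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-5,
)

# tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# model = AutoModelForCausalLM.from_pretrained(checkpoint)
# `code_dataset` would be the tokenized corpus of scraped repositories.
# trainer = Trainer(model=model, args=args, train_dataset=code_dataset)
# trainer.train()
```

Starting from a pretrained checkpoint this way is exactly the trade the contributors describe: the same eventual quality as training from scratch, reached in far fewer steps.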

GPT Code Clippy contributors have trained several models to date using third-generation tensor processing units (TPUs), Google’s custom AI accelerator chips available via Google Cloud. While the work is still early, they have created a plugin for Visual Studio and plan to extend GPT Code Clippy’s capabilities to other languages, especially underrepresented ones.

“Our ultimate goal is not only to develop an open source version of GitHub’s Copilot, but one that is of comparable performance and ease of use,” the contributors wrote. “[We hope to eventually] devise ways to handle versioning and updates to programming languages.”

Upsides and setbacks

AI-powered coding models are valuable not only for writing new code, but also for lower-hanging fruit like upgrading existing code. Migrating an existing codebase to a modern or more efficient language such as Java or C++ requires expertise in both the source and target languages, and it is often expensive. The Commonwealth Bank of Australia spent about $750 million over five years converting its platform from COBOL to Java.

But there are many potential pitfalls, such as bias and unwanted code suggestions. In a recent paper, the Salesforce researchers behind CodeT5, a Codex-like system that can understand and generate code, acknowledge that the datasets used to train CodeT5 could encode stereotypes around race and gender from text comments, or even from the source code itself. They also note that CodeT5 could contain sensitive information such as personal addresses and identification numbers, and that it could produce vulnerable code that negatively affects software.

Similarly, OpenAI found that Codex could suggest compromised packages, invoke functions insecurely, and produce programming solutions that appear correct but do not actually perform the intended task. The model can also be prompted to generate racist and otherwise harmful output as code, such as the words “terrorist” and “violent” when writing code comments in response to the prompt “Islam.”

The GPT Code Clippy team has not said how it might mitigate bias that may be present in its open source models, but the challenges are clear. While the models could ultimately cut down on Q&A sessions and repetitive feedback from code reviews, they could cause harm if not carefully vetted, especially in light of research showing that coding models fall short of human accuracy.

For AI coverage, send news tips to Kyle Wiggers – and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.

Thanks for reading,

Kyle Wiggers

AI Staff Writer



