The Quality of Auto-Generated Code

Kevlin Henney and I recently riffed on some ideas about GitHub Copilot, the tool for automatically generating code, built on GPT-3’s language model and trained on the code hosted on GitHub. This article asks some questions and (maybe) offers some answers, without trying to present any conclusions.

First, we wondered about code quality. There are lots of ways to solve a given programming problem, but most of us have some ideas about what makes code “good” or “bad.” Is it readable? Is it well organized? Things like that. In a professional setting, where software must be maintained and modified over long periods, readability and organization count for a lot.

We know how to test whether or not code is correct (at least up to a certain limit). Given enough unit tests and acceptance tests, we can imagine a system that automatically generates correct code. Property-based testing might give us some additional ideas about building test suites robust enough to verify that code works properly. But we don’t have methods to test for code that’s “good.” Imagine asking Copilot to write a function that sorts a list. There are lots of ways to sort. Some are pretty good, for example, quicksort. Some of them are awful. But a unit test has no way of telling whether a function is implemented using quicksort, permutation sort (which completes in factorial time), sleep sort, or one of the other strange sorting algorithms that Kevlin has written about.
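To make that concrete, here is a minimal sketch in Python; it is illustrative code of our own, not anything Copilot produced. Both implementations below pass exactly the same black-box test, even though one is quicksort and the other is a factorial-time permutation sort.

import random

def quick_sort(items):
    # Roughly O(N log N) on average: partition around a pivot and recurse.
    if len(items) <= 1:
        return list(items)
    pivot, *rest = items
    return (quick_sort([x for x in rest if x < pivot])
            + [pivot]
            + quick_sort([x for x in rest if x >= pivot]))

def permutation_sort(items):
    # O(N!) expected: shuffle until the list happens to come out sorted.
    items = list(items)
    while any(a > b for a, b in zip(items, items[1:])):
        random.shuffle(items)
    return items

def check_sort(sort_fn):
    # The same black-box test: it cannot see how the sorting was done.
    assert sort_fn([5, 3, 8, 1, 2]) == [1, 2, 3, 5, 8]
    assert sort_fn([]) == []

check_sort(quick_sort)        # passes
check_sort(permutation_sort)  # also passes; the test can't tell the difference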

Do we care? Well, we care about O(N log N) behavior versus O(N!). But assuming we have some way to solve that problem, if we can specify a program’s behavior precisely enough that we’re highly confident Copilot will write code that’s correct and tolerably performant, do we care about its aesthetics? Do we care whether it’s readable? 40 years ago, we might have cared about the assembly language code generated by a compiler. But today we don’t, except for a few increasingly rare corner cases that usually involve device drivers or embedded systems. If I write something in C and compile it with gcc, realistically I’m never going to look at the compiler’s output. I don’t need to understand it.

To get to this point, we may need a meta-language to describe what we want the program to do that is almost as detailed as a modern, high-level language. That may be what the future holds: an understanding of “prompt engineering” that lets us tell an AI system exactly what we want a program to do, rather than how it should be done. Testing would become much more important, as would understanding the business problem to be solved. “Slinging code” in any language would become less common.

But what if we never reach the point where we trust auto-generated code as much as we now trust the output of a compiler? Readability will be at a premium as long as humans need to read code. If we have to read the output from one of Copilot’s descendants to judge whether or not it will work, or if we have to debug that output because it mostly works but fails in some cases, then we will need it to generate code that is readable. Not that humans currently do a good job of writing readable code; but we all know how painful it is to debug code that isn’t readable, and we all have some idea of what “readability” means.

Second: Copilot was trained on the body of code on GitHub. At this point, that code is all (or almost all) written by humans. Some of it is good, high-quality, readable code; a lot of it isn’t. What if Copilot became so successful that Copilot-generated code came to constitute a significant percentage of the code on GitHub? The model will certainly need to be retrained from time to time. So now we have a feedback loop: Copilot trained on code that has been (at least partially) generated by Copilot. Does code quality improve? Or does it degrade? And, again, retraining is a process that someone has to pay for.

This question can be argued either way. People working on automated tagging for AI seem to be taking the position that iterative tagging leads to better results: i.e., after a tagging pass, use a human in the loop to check some of the tags, correct them where they’re wrong, and then use that additional input in another training pass. Repeat as needed. That’s not all that different from current (non-automated) programming: write, compile, run, and debug, as often as needed to get something that works. The feedback loop enables you to write good code.
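Sketched as a toy Python loop, the workflow looks roughly like this; the keyword “model” and the oracle standing in for the human reviewer are both invented for illustration, not any real labeling tool’s API.

import random

def train(labeled):
    # Build a trivial keyword model: each tag maps to the words seen with it.
    model = {}
    for text, tag in labeled:
        model.setdefault(tag, set()).update(text.split())
    return model

def auto_tag(model, texts):
    # Tag each text with whichever tag shares the most words with it.
    def best_tag(text):
        words = set(text.split())
        return max(model, key=lambda tag: len(words & model[tag]))
    return [(text, best_tag(text)) for text in texts]

def human_review(sample, oracle):
    # The 'human' checks a sample of proposed tags and corrects the wrong ones.
    return [(text, oracle(text)) for text, _ in sample]

def oracle(text):
    # Stand-in for the human reviewer: the 'true' tag.
    return "bug" if "crash" in text else "feature"

labeled = [("app crash on login", "bug"), ("add dark mode", "feature")]
unlabeled = ["crash when saving", "new export feature", "crash in editor"]

for _ in range(3):
    model = train(labeled)                    # retrain on labels we trust so far
    proposed = auto_tag(model, unlabeled)     # the model proposes tags
    sample = random.sample(proposed, k=2)     # a human reviews only a sample
    labeled += human_review(sample, oracle)   # corrections feed the next pass

print(auto_tag(train(labeled), unlabeled))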

A human-in-the-loop approach to training an AI code generator is one possible way to get “good code” (whatever “good” means), even if it’s only a partial solution. Questions like indentation style, meaningful variable names, and the like are only a start. It’s a much harder problem to assess whether code is structured into coherent modules, has well-designed APIs, and can be understood easily by maintainers. Humans can evaluate code with these qualities in mind, but it takes time. A human in the loop might help to train AI systems to design good APIs, but at some point, the “human” part of the loop will start to dominate the rest.

If you look at this problem from the standpoint of evolution, you see something else. If you breed plants or animals (a highly selected form of evolution) for one desired quality, you will almost certainly see all the other qualities degrade: you’ll get large dogs with hips that don’t work, or dogs with flat faces that can’t breathe properly.

What direction will automatically generated code take? We don’t know. Our guess is that, without rigorous ways to measure “code quality,” code quality will probably degrade. Ever since Peter Drucker, management consultants have been fond of saying, “If you can’t measure it, you can’t improve it.” And we suspect that applies to code generation, too: the aspects of code that can be measured will improve; the aspects that can’t won’t. Or, as the accounting historian H. Thomas Johnson said, “Perhaps what you measure is what you get. More likely, what you measure is all you’ll get. What you don’t (or can’t) measure is lost.”

We can write tools to measure some superficial aspects of code quality, like obeying stylistic conventions. We already have tools that can “fix” fairly superficial quality problems like indentation. But again, that superficial approach doesn’t touch the more difficult parts of the problem. If we had an algorithm that could score readability, and we restricted Copilot’s training set to code that scores in the 90th percentile, we would certainly see output that looks better than most human code. Even with such an algorithm, though, it’s still unclear whether that algorithm could determine whether variables and functions had appropriate names, let alone whether a large project was well structured.
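As a sketch of what that filtering could look like, here is some illustrative Python. The readability_score() function is a made-up, crude proxy, not a real metric, and nothing like this is known to be part of Copilot’s actual training pipeline.

import statistics

def readability_score(source: str) -> float:
    # Hypothetical scorer: a crude proxy based on comment density and short
    # lines. A real readability metric would be far harder to define.
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for ln in lines if ln.lstrip().startswith("#"))
    short = sum(1 for ln in lines if len(ln) <= 79)
    return (comments + short) / (2 * len(lines))

def filter_training_set(corpus: list[str]) -> list[str]:
    # Keep only the files that score at or above the 90th percentile.
    scores = [readability_score(src) for src in corpus]
    cutoff = statistics.quantiles(scores, n=10)[-1]
    return [src for src, s in zip(corpus, scores) if s >= cutoff]

Even this sketch shows where the hard part lives: everything interesting is hidden inside readability_score(), and a proxy like comment density says nothing about naming, modularity, or API design.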

And, a third time: do we care? If we have a rigorous way to express what we want a program to do, we may never need to look at the underlying C or C++. At some point, one of Copilot’s descendants may not need to generate code in a high-level language at all: perhaps it will generate machine code for your target machine directly. And perhaps that target machine will be WebAssembly, the JVM, or something else that’s highly portable.

Do we care whether tools like Copilot write good code? We will, until we don’t. Readability will matter as long as humans have a part to play in the debugging loop. The important question probably isn’t “do we care?”; it’s “when will we stop caring?” When we can trust the output of a code model, we’ll see a rapid phase change. We’ll care less about the code, and more about describing the task (and appropriate tests for that task) correctly.
