GitHub Next | GitHub Copilot for *Your* Codebase

This is a concluded exploration by GitHub Next.

GitHub may or may not release products related to this research in the future.

User Statement

As a: developer on a private codebase
I want to: get completions that match my codebase's APIs and idioms
But: Copilot gives generic completions that are ignorant of the specifics of my codebase
Copilot for Your Codebase helps by: showing Copilot snippets from my private codebase that are relevant to the task I'm trying to accomplish
Unlike: Copilot today

Real programming tasks are seldom self-contained: What you’re about to write in one file crucially depends on details elsewhere in the codebase. It may be type declarations, function signatures, how the codebase interfaces with particular libraries, consistent patterns for recurrent problems, how to format error messages, etc.

Right now, GitHub Copilot cannot look across your whole codebase for information: it only sees the current file and other files you may have open in your editor. That limits its overview and often leaves it guessing at how your codebase fits together for the task at hand. Imagine how much better its suggestions could be if it knew about your entire codebase.

We want to give GitHub Copilot this ability. There are two crucial problems we need to solve to do that:

Firstly, we need to be able to recognize whether a code snippet in another file is relevant for the current task or not.
Secondly, we need to be able to search through your entire codebase extremely quickly, so we do not delay the suggestions.

So how may we do this?

In the ML literature, the key concept is a “Retriever”: a precompiled index that allows quickly looking up data items relevant to a given query. In our case, the index will consist of code snippets harvested from the user’s repository or some larger corpus. The query will be a digest of the code that the user is currently working on, and we want the Retriever to return similar-looking snippets of code from the index.
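To make the idea concrete, here is a minimal sketch of a Retriever, assuming a toy similarity measure (cosine similarity over identifier counts) in place of a learned embedding. The class name `ToyRetriever`, the helper functions, and the example snippets are all hypothetical and only illustrate the index/query split described above.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(code: str) -> Counter:
    """Count identifier-like tokens; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyRetriever:
    """Precomputes a digest per snippet, then ranks snippets against a query."""

    def __init__(self, snippets):
        # The "index": each snippet stored next to its precomputed digest.
        self.index = [(s, tokenize(s)) for s in snippets]

    def query(self, context: str, k: int = 2):
        q = tokenize(context)
        ranked = sorted(self.index, key=lambda item: cosine(q, item[1]), reverse=True)
        return [snippet for snippet, _ in ranked[:k]]


# Hypothetical snippets harvested from a repository.
snippets = [
    "def load_user(user_id):\n    return db.fetch('users', user_id)",
    "def format_error(code, msg):\n    return f'[E{code}] {msg}'",
    "def save_user(user):\n    db.store('users', user)",
]
retriever = ToyRetriever(snippets)

# The query is a digest of what the user is currently typing.
top = retriever.query("def raise_not_found(code, msg):\n    raise Error(format_error(", k=1)
```

Here `top` holds the `format_error` snippet, since it shares the most identifiers with the code being written. A real system would swap the token counts for a learned embedding and the linear scan for a nearest-neighbour index.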

The Retriever then needs to be wired together with the code generation model. A straightforward way to do this is to put retrieved snippets directly into the prompt sent to the code generation model. This is the approach used in REALM, where the language model and the Retriever are trained simultaneously, which makes the Retriever good at finding the snippets that seem most helpful to the language model. Another approach, used by both kNN-LM and RETRO, is to send the prompt to the code generation model without retrieved snippets, and then afterwards directly adjust the probabilities of the text to be produced by looking at how the retrieved snippets continue.
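The first approach, placing retrieved snippets in the prompt, can be sketched as follows. This is only an illustration under assumptions: the function name `build_prompt`, the comment-style framing of snippets, and the character budget are hypothetical choices, not a description of how GitHub Copilot builds its prompts.

```python
def build_prompt(retrieved, current_code, max_chars=2000):
    """Prepend retrieved snippets as comments, always keeping the current code.

    `retrieved` is assumed to be ordered most relevant first. Snippets are
    added until the (hypothetical) character budget is used up.
    """
    budget = max_chars - len(current_code)
    parts = []
    for snippet in retrieved:
        block = "# Relevant snippet from elsewhere in the codebase:\n"
        block += "\n".join("# " + line for line in snippet.splitlines()) + "\n\n"
        if len(block) > budget:
            break  # stop rather than crowd out the code being edited
        parts.append(block)
        budget -= len(block)
    return "".join(parts) + current_code


prompt = build_prompt(
    ["def format_error(code, msg):\n    return f'[E{code}] {msg}'"],
    "def raise_not_found(code, msg):\n    ",
)
```

The current code goes last so that the model continues from the user's cursor position, while the retrieved snippets act as background context.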

The Retriever needs to fetch similar-looking snippets extremely quickly. To this end, the index will often be built as a fast k-nearest neighbour data structure. There are several great production-level libraries that do this at amazingly low latencies, e.g. SPANN, FAISS and ScaNN. They operate on vectors with hundreds of floating-point entries: given a query vector, they output an approximate set of closest vectors in the index. There may be different ways to measure closeness, e.g. by Euclidean distance or by cosine similarity.
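What such a library computes can be illustrated with an exact, brute-force search. The sketch below is not how SPANN, FAISS or ScaNN work internally (they use approximate index structures to avoid scanning every vector); the function names and the tiny two-dimensional vectors are made up for illustration. It also shows that the choice of closeness measure can change which neighbour is returned.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: compares direction and ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, vectors, k, distance):
    """Exact k-nearest neighbours by scanning every vector in the index."""
    ranked = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return ranked[:k]


vectors = [[1.0, 0.0], [10.0, 1.0], [0.0, 1.0]]
query = [2.0, 0.15]

by_euclid = nearest(query, vectors, 1, euclidean)        # index 0: closest point
by_cosine = nearest(query, vectors, 1, cosine_distance)  # index 1: closest direction
```

The linear scan costs time proportional to the index size for every query, which is why production systems trade a little accuracy for approximate structures.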

This means we need to “embed” our code snippets into floating-point vector space. One approach to this is to simply feed the code snippet to a code generation model like the one powering GitHub Copilot, and then use the internal state representation - which is a floating-point vector - as the embedding. Another approach is to specifically train a model to optimize for embedding similar snippets close to each other and disparate snippets far apart; CodeBERT and its successor UniXCoder are examples of this.
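As a rough sketch of the first approach, suppose a model exposes one hidden-state vector per input token. One common way to reduce these to a single fixed-size embedding is to average them and normalize the result. Mean pooling is an assumption here, not something the project is known to use, and the `hidden_states` values below are invented stand-ins for real model output.

```python
from math import sqrt


def mean_pool(hidden_states):
    """Average per-token vectors into one vector of the same dimension."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in hidden_states) / n for d in range(dim)]


def l2_normalize(vec):
    """Scale to unit length so cosine similarity reduces to a dot product."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


# Hypothetical hidden states for a three-token snippet (real ones have
# hundreds of dimensions).
hidden_states = [[1.0, 2.0, 0.0], [3.0, 0.0, 0.0], [2.0, 1.0, 0.0]]
embedding = l2_normalize(mean_pool(hidden_states))
```

Models trained specifically for embedding, such as those mentioned above, may define their own pooling, so this step depends on the model chosen.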

So there’s no shortage of proposals in the literature on how to approach this problem. Some of the key challenges we face in rolling it out and delivering value to our GitHub Copilot users are:

The architecture: Is the index stored server-side or client-side? Who does the embedding of the user’s current code context? A crucial constraint here is that we can’t run the large code generation model client-side, while performing the embedding server-side easily ends up adding latency.
The embedding: Which embedding best fits the chosen architecture and offers a good trade-off between speed and quality? And how do we evaluate such an embedding and pit it against another without serving it to users?

These are all important questions which we will need to answer as this project progresses.

What's next

When we began this project in August 2021, combining retrievers and embeddings with LLMs was mostly seen in the research literature and not in actual products. Since then the landscape has completely changed, and "Retrieval-augmented generation" (RAG) and vector databases have become staples of advanced systems around LLMs, with off-the-shelf systems built into AI platforms, and scalable and robust open-source projects like qdrant available for building your own.

We used the RAG + vector database system that we built in GitHub Next to power our technical preview of Copilot for Docs. RAG is also a key component of GitHub Copilot Chat in VS Code and other editors. And by combining RAG with GitHub's sophisticated non-neural code search capabilities, Copilot Chat on github.com allows the model to know lots about your repo.

Though this project has concluded from GitHub Next's side, experiments are ongoing in the GitHub Copilot team on optimising the use of RAG + vector databases for the Copilot customization use case outlined above.