
Discovery Agent

Agentic Setup, Build, and Testing of Repositories.

What's it for?
Automatically setting up, building, and testing a developer environment to aid Agentic Software Development.
Stage
Research prototype

As we enter the era of software engineering agents, the ability to automatically set up, build, and test repositories will greatly enhance AI-based code-generation experiences. To achieve this, we developed Discovery Agent, an AI system that autonomously sets up, builds, and tests GitHub repositories in containerized environments. It uses a ReAct-style loop, tool-based execution, and a structured system prompt to reason through projects without relying on hardcoded heuristics or web searches.

Evaluated across three datasets, the agent showed strong performance with low runtimes, even on complex or manually configured builds. This capability paves the way for smarter AI-assisted development and better tooling for code generation, validation, and cloud IDE adoption.

Generative AI tools are increasingly becoming valuable partners to developers, enabling them to deliver functionality at a velocity previously thought to be out of reach. These partnerships range from developers passively accepting incremental code suggestions directly in their IDEs to explicitly delegating tasks to AI agents capable of autonomously rewriting partial or entire sets of files across a repository.

Regardless of how extensive the AI’s involvement is, developers ultimately rely on the trusty build-test-run validation loop to ensure code quality and correctness. While not an absolute guarantee of correctness, an uneventful build-test-run loop is a pretty good indicator that no net new problems have been introduced by the changes (or, ideally, that the changes have resolved existing issues). In contrast, if this validation loop fails, at a minimum the output of these commands gives developers enough information to plan the next remedial steps. Thus, we believe that, just as this loop helps developers, the capability to automatically set up, build, and test a repository will give AI-assisted code generation the same kind of feedback. However, enabling the build-test-run loop requires an often manual setup process that blocks the AI from delivering functionality until that setup is implemented or fixed, greatly slowing down the velocity that can be achieved, especially for new and rapidly evolving repositories.

The goal of this exploration is to prototype an AI agent that can automatically figure out the setup, build, and test commands of a GitHub repository using a containerized environment. Eventually, we would like to build towards a system where the AI agent could augment existing LLM-based code generation capabilities by performing an iterative validation of successive code changes, similar to how a typical developer would.

Approach

To achieve this goal, we built “Discovery Agent”, an AI agent that can iteratively determine the setup, build, and test commands of a given repository. Our approach is based on Bouzenia and Pradel’s recent paper, "You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects", particularly leveraging iterative refinement based on execution feedback. The figure below shows the high-level components of Discovery Agent.


Overview of Discovery Agent architecture.

Discovery Agent is essentially a ReAct-loop-style agent with chain-of-thought reasoning. At each step, it selects and executes a tool from a predefined set. For instance, the Shell tool lets the agent execute shell commands and captures their output, which is fed back into the ReAct loop to determine the next steps. The Read File and Write File tools let the agent read and update files in the repository. While shell commands could handle file operations, we observed that having explicit tools improves the overall performance of the agent, especially for file writes. We also provide the agent with a special “exit” tool, which it can use to leave the ReAct loop once it determines that the exploration is complete.
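To make this concrete, here is a minimal sketch of such a ReAct-style dispatch loop in Python. The tool implementations, message format, and the call_model helper are illustrative placeholders, not the actual Discovery Agent code.

import json
import subprocess

def shell(command: str) -> str:
    # Run a shell command and return the exit code plus combined stdout/stderr as the observation.
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=600)
    return f"exit code: {result.returncode}\n{result.stdout}{result.stderr}"

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def write_file(path: str, contents: str) -> str:
    with open(path, "w") as f:
        f.write(contents)
    return f"wrote {len(contents)} characters to {path}"

TOOLS = {"shell": shell, "read_file": read_file, "write_file": write_file}

def run_discovery_loop(call_model, system_prompt: str, max_steps: int = 50) -> str:
    # call_model is a placeholder for the LLM client; it is assumed to return a JSON
    # tool call such as {"thought": "...", "tool": "shell", "args": {"command": "ls"}}.
    messages = [{"role": "system", "content": system_prompt}]
    for _ in range(max_steps):
        decision = json.loads(call_model(messages))
        if decision["tool"] == "exit":
            return decision.get("summary", "")
        observation = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "assistant", "content": json.dumps(decision)})
        messages.append({"role": "user", "content": f"Observation:\n{observation}"})
    return "Step budget exhausted before the agent called exit."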

Discovery Agent always executes tool invocations in a virtualized environment, based on either Docker or Codespaces. In both cases, the agent has access to the cloned repository and can execute arbitrary shell commands in the container.
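As a sketch of what a containerized Shell tool might look like, the snippet below routes commands through docker exec into a sandbox container; the container name and workspace path are assumptions for illustration.

import subprocess

CONTAINER = "discovery-agent-sandbox"  # hypothetical container with the repository cloned at /workspace

def shell_in_container(command: str, timeout: int = 1800) -> str:
    # Execute the command inside the sandbox container rather than on the host,
    # and return the exit code plus captured output as the agent's observation.
    result = subprocess.run(
        ["docker", "exec", "-w", "/workspace", CONTAINER, "bash", "-lc", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return f"exit code: {result.returncode}\n{result.stdout}{result.stderr}"

# Example probe the agent might issue early in its exploration:
# shell_in_container("ls && head -n 40 README.md")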

Initially, we experimented with a tool that utilized heuristic-based strategies to populate the prompt with relevant context and attempt to infer the build and test commands in one shot. For example, we can detect the primary programming language of a repository and programmatically extract key files related to setup and build. For a TypeScript project, package.json is critical; for a typical Java project, pom.xml is important. README files like README.md or README.txt also frequently contain setup instructions.

The idea behind providing this approach as a tool for the agent was that, even if the one-shot approach did not yield the final commands, it would give the agent enough initial context to build upon and refine in the subsequent ReAct loop. However, we noticed that even when we pre-filled the prompt with relevant context, the agentic loop often re-read those files to confirm the information, effectively duplicating the work.

As a result, we shifted to providing these heuristics as suggestions in the system prompt rather than injecting the extracted file contents directly into the prompt. This approach improves efficiency by avoiding unnecessary confirmations. It also allows the model to use its internal world knowledge to prioritize relevant heuristics over less useful ones, making the exploration process more effective overall.

While the ReAct loop and chain-of-thought reasoning effectively handle the step-by-step operations by selecting the appropriate tools, the system prompt establishes the overall guiding strategy, ensuring that tool invocations contribute meaningfully to the repository-building process. In effect, the system prompt provides high-level guidance on how the model should go about problem-solving and how to leverage the collection of provided tools along the way. We chose a structured approach to the system prompt, delineating the objective, a list of higher-level suggestions for approaching the problem, and the constraints the agent must adhere to while making tool selections in the course of its ReAct loop. Here is a simplified high-level system prompt for our Discovery Agent:

# Objective:

You are an autonomous coding assistant tasked with validating developer changes by setting up, building, and testing projects automatically in a non-interactive, virtualized environment.

# Steps:

1. Look for devcontainer.json, CI scripts, and README files to get ideas about how to set up, build, and test the project.
2. Consider exploring the contents of the repository to some degree because there can be script files specifically for building and testing the project, and so it may be a good idea to reuse them.
3. The repository may be implemented in a variety of programming languages, so you may need to explore building and testing for multiple languages.
4. Given that your ultimate goal is to execute tests in the project, you must first ensure that the project is set up and built successfully.
5. It is not enough to build a subsystem of the project. Your goal is to build the entire project or provide reasonable evidence on why focusing on a subsystem is enough.
6. It is okay if there are no tests in the project, so that the testing step can be skipped, but you must verify this.
7. It is not enough to execute tests only for a subsystem in the project. Ultimately, your goal is to run all tests or provide clear evidence on why this is not necessary or possible.
8. It is not sufficient to just stop with the test exploration when _some_ test command is found but that fails due to e.g., missing dependencies. You must make a reasonable effort to actually kick off a test run successfully.

# Constraints:

1. Do not execute interactive commands.
2. If a command fails, analyze the error output and attempt corrective actions before retrying.
3. If a tool/dependency is missing, install it and continue.

Technical Challenges

Handling interactive and stuck terminal commands

Example of interactive terminal command

The model should avoid suggesting interactive terminal commands, such as vi or nano, since these require user input during execution. Such commands can lead to scenarios where manual intervention is needed, disrupting automated processes and complicating the overall workflow. By steering clear of interactive commands, the model ensures that instructions remain clear, non-blocking, and better suited for automated or scripted operations.

To address this, we include a system prompt constraint that discourages the use of interactive commands. However, we found that suggestions alone aren’t always sufficient. Due to the inherent randomness in model outputs, these suggestions are sometimes ignored. Moreover, interactivity may arise unexpectedly—for example, from a poorly written unit test that prompts for user input or a dependency installation that requires confirmation.

To handle these cases, we implemented an AI-based monitor that periodically samples the standard output and standard error of the running command. Based on the deltas between samples, the monitor determines whether the command is making progress or if it’s stuck, potentially waiting for user input. This approach allows us to effectively distinguish between long-running commands and those blocked on user interaction.
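A rough sketch of such a monitor is shown below; the sampling interval and the classify callback (standing in for the AI-based progress/stuck classification) are illustrative assumptions.

import subprocess
import threading
import time

def run_with_progress_monitor(command: str, classify, sample_interval: float = 30.0) -> str:
    # Start the command, collect its output in the background, and periodically hand
    # the delta since the last sample to `classify`, a placeholder for the AI-based
    # monitor that answers "progress" or "stuck".
    proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    chunks = []

    def pump():
        for line in proc.stdout:
            chunks.append(line)

    threading.Thread(target=pump, daemon=True).start()

    last_seen = 0
    while proc.poll() is None:
        time.sleep(sample_interval)
        current = len(chunks)
        delta = "".join(chunks[last_seen:current])
        last_seen = current
        if classify(delta) == "stuck":
            proc.kill()
            return "".join(chunks) + "\n[monitor] command appeared to be waiting for input; terminated."
    return "".join(chunks)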

Summarizing command output to preserve context

Example of terminal commands with large output

It’s important to summarize the output of terminal commands before feeding it back to the model. This serves two main purposes. First, large outputs can quickly consume the model’s context window, preventing further instruction processing. Second, verbose or unfiltered outputs may distract or mislead the model from its intended task. Summarization helps maintain clarity, relevance, and efficiency in interactions.

That said, not all outputs require summarization—especially if they’re small. We implemented a simple strategy based on a size threshold: if the output exceeds the threshold, it’s summarized; otherwise, it’s passed along in full. This threshold is determined empirically through experimentation.
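A minimal version of this strategy might look like the following; the threshold value and the summarization prompt are illustrative, not the empirically tuned ones.

OUTPUT_CHAR_THRESHOLD = 8_000  # assumed value; the real threshold was tuned empirically

def prepare_observation(command: str, output: str, summarize) -> str:
    # Small outputs are passed back verbatim; large ones are summarized first so they
    # do not exhaust the context window. `summarize` is a placeholder for an LLM call.
    if len(output) <= OUTPUT_CHAR_THRESHOLD:
        return output
    prompt = (
        "Summarize the following command output, preserving error messages, failing "
        "test names, and anything relevant to setting up, building, or testing the project.\n"
        f"Command: {command}\nOutput:\n{output}"
    )
    return summarize(prompt)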

Choosing the output format for agent explorations

One way to make the output of Discovery Agent reusable is to emit scripts that encapsulate the setup, build, and test commands. Initially, we experimented with providing the logs of the agent's run and prompting the model to generate three separate scripts: setup.sh, build.sh, and test.sh. However, the model frequently produced duplicate or even conflicting commands across these scripts. We also observed that the output would sometimes include unnecessary commands, such as installing and configuring Python in an environment where Python was already preinstalled and configured.

Through experimentation and iterative refinement of the prompt, we found that generating a single script with clearly delineated sections for setup, build, and test resulted in more coherent and consistent outputs. This structure provided better clarity and reduced redundancy in the generated scripts.
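As an illustration of what this final generation step could look like, the sketch below prompts for one script with three delineated sections; the template wording and the call_model helper are assumptions, not the exact prompt we use.

SCRIPT_PROMPT_TEMPLATE = """Based on the agent log below, produce ONE bash script with
three clearly delineated sections, in this order:

# --- setup ---   (install dependencies, prepare the environment)
# --- build ---   (build the entire project)
# --- test ---    (run the full test suite, or explain why there is none)

Do not repeat commands across sections and do not install tools that the
environment already provides.

Agent log:
{log}
"""

def generate_discovery_script(call_model, agent_log: str) -> str:
    # `call_model` is a placeholder for the LLM client used by the agent.
    return call_model(SCRIPT_PROMPT_TEMPLATE.format(log=agent_log))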

Can the Discovery Agent successfully determine the setup, build, and test commands of a repository?

To answer this question, we did the following experiments.

Agent vs Agent

We started with the dataset that was used as part of Bouzenia and Pradel's Execution Agent work. This dataset consists of 50 popular repositories using different programming languages.

Discovery Agent could successfully build ~30 repositories out of the 50, and it could successfully execute the tests in ~25 repositories out of the 50. While these numbers are lower than the Execution Agent’s numbers reported in the paper (41 & 33), they cannot be directly compared because the Execution Agent uses a best-of-N metric while Discovery Agent uses best-of-1. However, the Execution Agent takes on average 74 minutes per repository, while the Discovery Agent only needs ~10 minutes per repository.

The lower success rates in our setting may be partly due to the Execution Agent dataset. The dataset focuses on popular repositories, and the Execution Agent incorporates a tool that performs web searches for "how to build [repository name]." This increases the likelihood of success, as popular repositories are more likely to have tutorials, blog posts, or forum discussions that provide build instructions. Additionally, their evaluation reports the best result out of multiple runs, whereas our assessment does not follow a best-of-N strategy.

In the future, we plan to experiment with incorporating a web search tool to improve the agent's performance on popular repositories while still preserving its ability to reason through unfamiliar ones without relying on pre-existing instructions.

Copilot Offline Eval

GitHub Copilot’s Offline Evaluation project features a dataset of containerized repositories that serve as the basis for thousands of automated tests and live internal evaluations. In these tests, the repositories are deliberately modified to challenge the models, while performance is measured using metrics such as unit test pass rates, token efficiency, and response accuracy. This rigorous evaluation process helps ensure that GitHub Copilot delivers high-quality, reliable code suggestions.

One benefit of the Copilot Offline Eval dataset is that it includes manually curated ground truth: the setup, build, and test commands used for each repository, encoded as a reference Dockerfile. We programmatically extracted the required setup, build, and test commands from these Dockerfiles as ground truth. In all, we used 160 of these repositories across a range of programming languages.
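The extraction itself can be largely mechanical; below is a simplified sketch of pulling candidate commands out of a reference Dockerfile (the real extraction pipeline may differ in its details).

import re

def extract_ground_truth_commands(dockerfile_text: str) -> list:
    # Join backslash line continuations so multi-line RUN instructions become one entry,
    # then collect RUN and CMD instructions as candidate setup/build/test commands.
    joined = re.sub(r"\\\s*\n", " ", dockerfile_text)
    commands = []
    for line in joined.splitlines():
        match = re.match(r"\s*(?:RUN|CMD)\s+(.*)", line, flags=re.IGNORECASE)
        if match:
            commands.append(match.group(1).strip())
    return commands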

We used an LLM-as-a-judge evaluation methodology. Specifically, we created an LLM judge as part of the benchmark driver that takes the ground truth information together with snapshots of the agent’s work to determine whether the agent has done a fair job of the build and test exploration. The LLM judge was provided with the output of the Discovery Agent run along with the ground truth. The model was then prompted to “assess the agent's logs to determine if the agent has successfully and exhaustively explored how the project can be built and tested”. We also leveraged structured output to force the model to produce a true/false judgment along with a rationale, enabling mechanical evaluation.
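A simplified sketch of this judging step follows; the JSON verdict shape and the call_model helper are illustrative assumptions rather than the exact benchmark driver code.

import json

def judge_run(call_model, judge_system_prompt: str, ground_truth: str, agent_log: str) -> dict:
    # The judge sees the ground truth alongside the agent's log and must return a
    # structured verdict; `call_model` is a placeholder for an LLM client constrained
    # to emit JSON (for example via structured output).
    user_prompt = (
        "Assess the agent's logs to determine if the agent has successfully and "
        "exhaustively explored how the project can be built and tested.\n\n"
        f"Ground truth:\n{ground_truth}\n\nAgent log:\n{agent_log}\n\n"
        'Respond as JSON: {"build_success": true/false, "test_success": true/false, "rationale": "..."}'
    )
    return json.loads(call_model(judge_system_prompt, user_prompt))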

The system prompt of the LLM judge was iteratively refined over many runs to sharpen our evaluation criteria. Here is a snippet of the system prompt for the LLM judge we used:

# Evaluation Criteria

## Build exploration:

- If the project requires a build step, success means the agent ran the build commands.
- Just building a subsystem of a bigger project is not sufficient. The goal of the agent should be to build the entire project.
- For projects that do not require a build (for example, because the project is written in Python), the build is successful only if the agent verified that no build step was necessary.
- Compiling or building a single example or subsystem in the project is not considered building.
- It is perfectly fine to try various build commands; it is not a problem if the agent does not figure out the right command immediately.

## Test exploration:

- Primarily, the test exploration should be about figuring out how to run the unit tests of a project.
- If there are no tests to run, the test execution is considered successful, but the agent must verify that no tests are present.
- Tests must be executed by the agent directly. It is not sufficient if the agent only runs the application in some form.
- Compilation or building must not be considered testing. Testing is about directly running some kind of test suite.
- Failing tests are okay as long as the agent acknowledges the presence of test failures. It is not the agent's job to try to fix the test failures.
- It is perfectly fine to try various test commands; it is not a problem if the agent does not figure out the right command immediately.

For both build and test, the agent must actually execute commands to verify something. It is not sufficient that the agent just prints out a command or simulates their execution somehow.

The Discovery Agent performs very well on this dataset. It successfully built 146 (out of the 160) repositories and successfully executed the tests in 135 (out of the 160) repositories. The average run time per repository is only 2 minutes.

Compared to the Execution Agent dataset, the complexity of the build commands in this dataset is lower. This is likely because the Execution Agent dataset consists of popular open source projects, which tend to be more complex and thus have more involved setup steps. In contrast, Copilot’s Offline Evaluation dataset appears to be more representative of an average GitHub open-source repository.

While the majority of build and test commands in Copilot’s Offline Evaluation dataset fall into the “simple” category because they are “standard” commands typically without special flags (e.g., npm run build for TypeScript projects or mvn clean package for Java projects), we also observed some outliers with fairly sophisticated setup, build, and test steps.

CodeQL Dataset

CodeQL is a language and toolchain for code analysis, designed to empower security researchers and developers to scale their vulnerability detection by extending their knowledge of a single bug into finding variants across multiple codebases, thus automating security checks and enhancing code quality.

CodeQL operates by first creating a detailed CodeQL database through the extraction of repository information and then running standard or custom queries against it. Thus, an understanding of how the code is built allows for a more accurate analysis. CodeQL leverages Autobuild (a heuristic-based approach) to automatically infer the build command. If Autobuild fails, then the repository maintainer must configure CodeQL manually by explicitly specifying the build command.
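To illustrate why the build command matters here: for compiled languages, a CodeQL database is typically created by letting CodeQL observe the build. The snippet below is only a sketch, with an assumed database name, paths, and example build command.

import subprocess

def create_codeql_database(repo_path: str, language: str, build_command: str) -> None:
    # For compiled languages, CodeQL traces the build, so the inferred build command
    # is passed to `codeql database create`. Database name and paths are illustrative.
    subprocess.run(
        ["codeql", "database", "create", "codeql-db",
         f"--language={language}",
         f"--command={build_command}",
         "--source-root", repo_path],
        check=True,
    )

# e.g. create_codeql_database("/workspace/repo", "java", "mvn -B clean package -DskipTests")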

We queried public repositories that have CodeQL enabled and focused explicitly on repositories where CodeQL is manually configured. In all, we curated 160 repositories and created a new dataset. Using an LLM, we also extracted ground truth build information for these repositories from the GitHub Actions YAML files that configure CodeQL. We then used the same LLM-based judge as for the Copilot Offline Eval dataset, but this time the judge only has access to the ground truth build command, and the test exploration has to be assessed without ground truth.

The Discovery Agent also performs well on this considerably harder dataset. It successfully built 63 (out of the 160) repositories, and it successfully executed the tests in 52 (out of the 160). The average run time per repository is 6 minutes.

We are excited about these results, as these repositories are the ones where CodeQL was manually configured. This likely implies that Autobuild, the existing state-of-the-art approach to guessing the build command, was not quite up to the task. Either the build was too complicated, or the maintainer needed better control over the CodeQL configuration.

A majority of these build commands are orders of magnitude more complex than those in the Copilot Offline Eval dataset, with some dealing with builds across multiple programming languages.

Open Problems / Limitations

  • How to deal with external dependencies? Some repositories require external fixtures (e.g., a Redis database) or multiple interacting containers (e.g., client, load balancer, server) to test correctly. Addressing this may involve giving the agent access to container orchestration tools in a secure way.
  • An effective secrets management system. There are significant security and authorization challenges here. Ideally, the agent could be authorized to perform specific actions, but in a way that would not expose those secrets directly to the agent or to the LLM itself.
  • Mono-repos and multi-language repos. For repos that span multiple languages and target environments, it would be helpful to select the subset of functionality we need to build/test for a particular change. Specifically, how can we be smart about identifying and ignoring tests that are expensive (time- and resource-intensive) or unnecessary in the context of the current code changes? Developers typically ignore many tests intuitively when iterating on code changes.

What's next?

As we enter the era of software engineering agents, the core ability to set up, build, and test repositories will significantly improve all AI-based code generation experiences.

In addition, automatically inferring the setup/build/test scripts may also improve the general adoption of cloud IDEs. While the devcontainer standard exists, it is not widely adopted because of the inertia and friction involved in authoring the JSON configuration. If we can infer that configuration automatically, adoption should improve, as developers will have assistance in authoring bespoke environments for their needs.
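As a sketch of how discovered commands could seed such a configuration, the snippet below writes a minimal devcontainer.json from an inferred setup command; the base image and field choices are assumptions for illustration, not a prescribed output of Discovery Agent.

import json
import pathlib

def write_devcontainer(repo_path: str, setup_command: str) -> None:
    # Emit a minimal devcontainer.json that runs the inferred setup command after
    # the container is created. Image and structure are illustrative assumptions.
    config = {
        "name": "auto-generated dev environment",
        "image": "mcr.microsoft.com/devcontainers/universal:2",
        "postCreateCommand": setup_command,
    }
    target = pathlib.Path(repo_path) / ".devcontainer" / "devcontainer.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(config, indent=2))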

Finally, we are excited about the possibility that the artifacts generated by our approach can be leveraged for fine-tuning and/or reinforcement learning to further improve the code generation capability of core models and make the next iteration of models natively better at dealing with code in motion.