SWE-gen: Scaling SWE-bench task generation

January 18, 2026 • Rishi Desai

Check out SWE-gen and the SWE-gen-JS dataset.

Farming SWE-bench tasks

Standing up arbitrary GitHub repos is challenging. Each repo has a different language, build system, dependencies, and test framework. While Python repos are simple to dockerize, JS/TS repos are more difficult due to the variance in frameworks and repo styles (e.g., monorepos).

SWE-gen uses Claude Code to clone the repo, infer how to build and test it, and generate the task’s Dockerfile. Automating this per-repo setup is what lets task farming scale across many repos.

The Reversed Baseline

In SWE-bench and many other PR-based datasets, the agent starts from the PR’s base commit. While intuitive, the PR’s unit tests are written against the head commit, so differences between base and head often break them for reasons unrelated to the bug. The result is an invalid task (i.e., one that fails nop or oracle agent validation).

The head commit may have:

Added new dependencies that tests now import.
Set up a build step that only exists at HEAD.
Updated data or scripts that the tests now expect.

Rolling back to BASE forces the agent to rediscover those changes even though they often aren’t mentioned in the linked issues or PR body. Instead, SWE-gen constructs the reversed baseline. It:

Clones the repo at HEAD, where the maintainers have already shown that the tests pass.
Generates a bug.patch that reintroduces the bug.
Applies that patch to recreate the buggy baseline the agent sees.

Because everything except the reverted lines stays identical to the maintainer’s passing state, the environment, lockfiles, and tooling remain in sync with the tests.
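The idea can be illustrated with a minimal Python sketch. It models a checkout as a dictionary of path to contents and assumes the bug is reintroduced by a plain file-level revert. That is an assumption made for illustration: how SWE-gen actually produces bug.patch is not described here, and reversed_baseline, the Tree alias, and the file contents are all hypothetical.

```python
from typing import Dict, Iterable

Tree = Dict[str, str]  # path -> file contents; a stand-in for a git checkout


def reversed_baseline(head: Tree, base: Tree, reverted_paths: Iterable[str]) -> Tree:
    """Start from HEAD and revert only the listed files to their BASE contents."""
    buggy = dict(head)  # everything else (lockfiles, build config) stays at HEAD
    for path in reverted_paths:
        if path in base:
            buggy[path] = base[path]  # restore the pre-fix (buggy) version
        else:
            buggy.pop(path, None)  # file was added by the PR, so remove it
    return buggy


# Hypothetical PR: the fix touched src/parse.js and also bumped the lockfile.
head = {
    "src/parse.js": "export const parse = (s) => s.trim();\n",
    "package-lock.json": '{"lockfileVersion": 3}\n',
}
base = {
    "src/parse.js": "export const parse = (s) => s;\n",
    "package-lock.json": '{"lockfileVersion": 2}\n',
}

# Only the source file carrying the fix is reverted; the lockfile is left alone.
buggy = reversed_baseline(head, base, ["src/parse.js"])
```

Here buggy contains the pre-fix src/parse.js next to the HEAD lockfile, which is the property the reversed baseline relies on: the bug is back, but the environment still matches what the tests expect.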

Ensuring Task Difficulty

SWE-gen withholds PR-modified unit tests from the agent: tests are stored separately and only copied into the container at verification time. Furthermore, we add rm -rf .git at the end of the Dockerfile and encourage users to run agents without network access to prevent cheating via git or fetching external artifacts.
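As a rough sketch of the withholding step, the PR's changed files can be split into tests that are held back until verification and files the agent is allowed to see. The glob patterns and the names TEST_PATTERNS, is_test_file, and split_pr_files below are hypothetical. Real repos use many test layouts, and SWE-gen's own detection logic may differ.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple

# Hypothetical patterns for common JS/TS test layouts; not SWE-gen's actual rules.
TEST_PATTERNS = (
    "*.test.[jt]s",
    "*.spec.[jt]s",
    "test/*",
    "tests/*",
    "__tests__/*",
    "*/__tests__/*",
)


def is_test_file(path: str) -> bool:
    """Return True if the path matches any of the assumed test patterns."""
    return any(fnmatchcase(path, pattern) for pattern in TEST_PATTERNS)


def split_pr_files(changed: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split PR-modified paths into (withheld tests, files visible to the agent)."""
    withheld: List[str] = []
    visible: List[str] = []
    for path in changed:
        (withheld if is_test_file(path) else visible).append(path)
    return withheld, visible


changed = [
    "src/parse.ts",
    "src/parse.test.ts",
    "packages/core/__tests__/lexer.js",
    "README.md",
]
withheld, visible = split_pr_files(changed)
```

In this example the two test files end up in withheld and would only be copied into the container at verification time, while src/parse.ts and README.md stay visible to the agent.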

SWE-gen-JS

We used SWE-gen to create 1000 tasks across 30 popular JavaScript and TypeScript GitHub repos. Tasks follow the Harbor format and are available on the Harbor registry for easy access.

The PRs span from 2016 through early 2026.

Check out the GitHub repo for more information!
