Rishi Desai

Designing eval harnesses that prevent reward hacking

While building long-horizon SWE tasks for SWE-marathon, I encountered a lot of reward-hacking behavior: agents wrapping open-source implementations instead of solving the task, searching the internet for answers, or modifying tests and configs so passing the unit tests became trivial.

In dockerized RL environments, the agent executes arbitrary code inside the same environment that later gets graded. There are two boundaries that matter most:

  1. The verifier boundary. The agent should not be able to mutate the environment that computes reward.
  2. The network boundary. The agent’s network needs (e.g., model APIs) should not grant the rest of the environment broad internet access.

The core rule is:

The reward must be computed from declared submission bytes in an environment the agent could not mutate.

[Figure: two setups. Hackable: agent and verifier share one mutated container (tests, tools, processes); the verifier operates inside the agent's container. Safe: the agent container and verifier container are separate; the verifier starts clean and only the artifact crosses.]

The Verifier Boundary

Harbor is a popular framework for RL environments because it makes agent evaluations feel like real software engineering: agents run in Docker containers with a filesystem, shell, and the rough edges of a normal development box. Harbor now supports a fresh verifier sandbox, with explicit artifact handoff between the agent and verifier environments.

Without that boundary, the reward cannot be trusted. The agent gets to mutate the environment that grades it: its tests, its shells, its toolchain, its DNS, its leftover processes. Sharing one container is convenient (one image, one lifecycle), but it gives the agent too much influence over its own grader.

This failure mode is independent of where the agent loop itself runs. Whether the agent is a closed-source binary running inside the container or an off-container loop driving the sandbox through a tool protocol, if the verifier shares the agent’s container, it inherits all the state the agent touched.

| Failure mode | Why it matters | The fix |
|---|---|---|
| Agent writes the reward file directly | Reward is decoupled from task semantics | Reward path only exists in the verifier container |
| Agent edits data, tests, or tools | The verifier is contaminated | Verifier image owns tests and toolchain |
| Agent leaves a daemon | Verification runs in a hostile process environment | Agent container is destroyed before verification |

Untrusted code should not control the thing that judges it.

Artifacts

The fix is an explicit handoff. The agent produces a declared artifact, the harness extracts just those bytes, and a fresh verifier sandbox grades them.

[Figure: agent sandbox → harness copies artifact → verifier sandbox → reward. The agent container is destroyed before verification starts.]

An artifact is the explicit submission the verifier grades — a directory, a patch, a checkpoint, a binary, a JSON file. The agent can edit code, install dependencies, run tests, and produce that submission, but the harness should extract only the declared bytes and evaluate them somewhere fresh.

In concrete terms:

The agent’s container is destroyed before the verifier’s container is created. The only causal channel between them is the submitted bytes that crossed the harness.
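The handoff described above can be sketched in a few lines. This is a minimal illustration that uses temporary directories as stand-ins for containers; `extract_artifact`, `run_trial`, and the `submission/` layout are hypothetical names, not Harbor's API, and a real harness would copy bytes out of a container runtime rather than the local filesystem.

```python
import shutil
import tempfile
from pathlib import Path


def extract_artifact(agent_root: Path, declared: list[str]) -> dict[str, bytes]:
    """Read only the declared files out of the agent sandbox, as raw bytes."""
    root = agent_root.resolve()
    artifact = {}
    for rel in declared:
        path = (root / rel).resolve()
        # Refuse paths that escape the sandbox root (via ".." or symlinks)
        # and anything that is not a regular file.
        if not path.is_relative_to(root) or not path.is_file():
            raise ValueError(f"declared artifact path is not usable: {rel}")
        artifact[rel] = path.read_bytes()
    return artifact


def run_trial(agent_fn, verify_fn, declared: list[str]) -> float:
    """Run the agent, carry over declared bytes, then grade in a fresh directory."""
    agent_root = Path(tempfile.mkdtemp(prefix="agent-"))
    try:
        agent_fn(agent_root)  # untrusted: may write anything under agent_root
        artifact = extract_artifact(agent_root, declared)
    finally:
        shutil.rmtree(agent_root)  # agent sandbox is gone before grading starts

    verifier_root = Path(tempfile.mkdtemp(prefix="verifier-"))
    try:
        for rel, data in artifact.items():
            dest = verifier_root / "submission" / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(data)
        # The verifier sees only the copied bytes, never the agent's directory.
        return verify_fn(verifier_root / "submission")
    finally:
        shutil.rmtree(verifier_root)
```

In this sketch, an agent that writes its own `reward.txt` next to the solution gains nothing: the file is not in the declared list, so it is deleted along with the agent directory and never reaches the verifier.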

A fresh verifier does not mean submitted code is trusted. If the verifier has to execute a module, binary, server, or checkpoint, that code should run behind a narrow interface with limited network, no reward-path access, resource limits, and structured outputs.

The mechanism can vary (e.g., runc, nested gVisor). The key point is that trusted verifier code should not casually import arbitrary agent code and let that same process write the reward.
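As one illustration of a narrow interface, the sketch below runs a submitted script in a child process with CPU and memory limits, a wall-clock timeout, and an empty environment, and accepts only a single JSON object on stdout. It assumes a POSIX host (the `resource` module and `preexec_fn` are not available on Windows), and the `answer` field and the function names are hypothetical. It does not restrict network or filesystem access; that still needs a container-level mechanism such as the ones mentioned above.

```python
import json
import resource
import subprocess
import sys


def _cap(kind: int, value: int) -> None:
    """Lower both the soft and hard limit for `kind` to at most `value`."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def _limit_resources() -> None:
    # Runs in the child process just before the submitted code is exec'd.
    _cap(resource.RLIMIT_CPU, 5)        # 5 seconds of CPU time
    _cap(resource.RLIMIT_AS, 1 << 30)   # 1 GiB of address space


def run_submission(entry: str, payload: dict, timeout_s: float = 10.0) -> dict:
    """Run a submitted script out of process and parse one JSON result."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", entry],  # -I: isolated mode
            input=json.dumps(payload),
            capture_output=True,
            text=True,
            timeout=timeout_s,
            env={},                          # no inherited secrets or paths
            preexec_fn=_limit_resources,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout"}
    if proc.returncode != 0:
        return {"ok": False, "error": f"exit code {proc.returncode}"}
    try:
        result = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return {"ok": False, "error": "malformed output"}
    if not isinstance(result, dict) or "answer" not in result:
        return {"ok": False, "error": "missing answer"}
    return {"ok": True, "answer": result["answer"]}


def grade(entry: str) -> float:
    # The trusted side computes the reward; the child only returns data.
    result = run_submission(entry, {"x": 20, "y": 22})
    return 1.0 if result["ok"] and result["answer"] == 42 else 0.0
```

The design choice here is that the submitted code never sees a reward path and cannot return anything except data. A script that prints `reward=1.0` fails JSON parsing and scores zero.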


The Network Boundary

The verifier boundary above is independent of a second design choice: where the agent lives. It can sit inside the task sandbox (as claude-code does) or outside it, driving the sandbox through tool calls. On-container is the more realistic setup for evaluating closed-source production agents, because those agents access the filesystem and shell directly, but it makes the network problem harder: the agent’s LLM API calls now live inside the sandbox alongside whatever the task code does.

[Figure: agent placement. Off-container agent: the agent and its model-API traffic stay outside the sandbox; only actions (bash, file edits) enter the sandbox's shell and files, which may reach the internet. On-container agent: the agent, shell, and files all live in the sandbox, so model traffic and task traffic share the sandbox.]

A single internet on/off switch is too crude. If the network is off, on-container agents (e.g., claude-code) break because they cannot reach their model APIs. If the network is on, task code gets unrestricted egress: curl, npm, the whole internet.

| Agent placement | Use cases | Network consequence |
|---|---|---|
| Off-container | Easy for simple agents like mini-swe-agent. | Network restriction is simple. LLM-provider traffic never enters the sandbox, so egress can be deny-by-default with a small allowlist. |
| On-container | Required for closed-source agents like claude-code. | Network restriction is harder. Model-API traffic and task traffic share one sandbox, so the harness must separate them. |

So the harness has to separate agent egress from task egress. Task creators shouldn’t need to maintain fragile allowlists of Anthropic or OpenAI endpoints just to run codex or claude-code. Those are harness requirements, not task requirements.

Task creators should declare only task egress: docs, benchmark APIs, package proxies, or no internet at all. The harness should own agent egress. Ideally, it can distinguish the agent process from commands spawned by the agent, so model traffic gets the harness profile while shell commands and submitted code get only the task allowlist.

If the harness cannot distinguish those processes yet, the fallback is coarser: give the agent sandbox the union of the agent egress profile and task allowlist.
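The split between harness-owned and task-declared egress, including the coarse fallback, can be written down as a small policy object. This is a sketch of the decision logic only: `EgressPolicy`, `Origin`, and the `.example` hostnames are hypothetical, and actual enforcement (a proxy, firewall rules, or per-process attribution) is out of scope here.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    AGENT = "agent"      # the agent process itself (model-API calls)
    TASK = "task"        # shell commands and submitted code spawned by the agent
    UNKNOWN = "unknown"  # the harness could not attribute the connection


@dataclass(frozen=True)
class EgressPolicy:
    agent_hosts: frozenset[str]   # owned by the harness
    task_hosts: frozenset[str]    # declared by the task creator; may be empty
    process_aware: bool = True    # can the harness tell agent from task traffic?

    def allows(self, host: str, origin: Origin) -> bool:
        if not self.process_aware:
            # Coarse fallback: the whole sandbox gets the union of both lists.
            return host in (self.agent_hosts | self.task_hosts)
        if origin is Origin.AGENT:
            return host in self.agent_hosts
        if origin is Origin.TASK:
            return host in self.task_hosts
        return False  # deny by default when attribution fails


# Hypothetical hostnames, for illustration only.
policy = EgressPolicy(
    agent_hosts=frozenset({"api.model-provider.example"}),
    task_hosts=frozenset({"packages.example"}),
)
```

With process awareness, a `curl` spawned by the agent cannot reach the model API host even though the agent process can. With `process_aware=False`, that distinction is lost, which is the cost of the fallback.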


Practical Costs

The verifier boundary costs a second container per trial and forces task creators to declare artifacts explicitly. That’s more work than “grade whatever the agent left behind,” but it’s also the discipline that makes the task meaningful.

Supporting on-container agents costs network plumbing: egress logging, managed package proxies, and (ideally) process-aware separation between agent traffic and task traffic. Off-container agents avoid that complexity, but at the cost of running two sandboxes per trial: a cheap one for the agent loop plus the task sandbox, whose size varies with the workload.


Takeaways

Destroy the agent container before verification. The verifier runs in a fresh container that only sees the declared artifact.

Build for on-container agents. Production systems like claude-code, codex, cursor, and devin run on the user’s computer or in a cloud sandbox with a shell and filesystem. Pulling the model loop out and wrapping it for off-container execution biases the evaluation.

The harness owns agent network egress. Task creators declare task egress; model-API allowlists are not their concern.