Chapter 4: The Reproduce-and-Review Loop

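The first three chapters do not depend on running code. This chapter covers ACR's experimental enhancement: when test infrastructure is available, how real execution results are used to push a patch toward correctness. This path is enabled by config.reproduce_and_review (config.py:37) and is available only for SWE-bench tasks.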

4.1 The small problem it solves

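When an LLM finishes a patch and says "I fixed it", why should we believe it? The hardest evidence is a script that fails before the patch is applied and passes afterwards. This chapter revolves around two things: producing that script, and using it to judge patches.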

4.2 Step one: write a script that reproduces the issue

TestAgent (agent_reproducer.py:57) is responsible for writing reproducer.py. The flow:

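First, decide whether the issue can be reproduced at all. _issue_has_reproduction_steps (agent_reproducer.py:128) asks the model "does this issue contain a reproducible example?" and expects {"has-reproducible-example": true/false}. If it does not, NoReproductionStep is raised and the whole reproduction path is skipped.
Write the script. The prompt (INITIAL_REQUEST, agent_reproducer.py:21) asks for a standalone reproducer.py, placed in the project root and run as python3 reproducer.py. While the issue is present the script must raise AssertionError and print the stack trace; once the issue is fixed it must exit with code 0. The prompt also includes a standard print_stacktrace function template so that line numbers in the trace are clear (useful for the later retrieval stage).
Run it and inspect the result. task.execute_reproducer runs the script and produces a ReproResult. The criterion for "reproduction succeeded" is simple (data_structures.py:174):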

# app/data_structures.py:170-174
self.reproduced = returncode != 0 and "AssertionError" in stderr

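That is, a non-zero exit code plus AssertionError in stderr. On success the test is registered (_register_reproducing_test, agent_reproducer.py:213); otherwise the execution output is turned into feedback (_feedback_from_repro_result, agent_reproducer.py:246) and the agent retries.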
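The script contract and the criterion above can be tried end to end with a small sketch. Everything here is illustrative and not ACR source: ReproOutcome, run_reproducer, and the three toy scripts are hypothetical names introduced for this example, and only the boolean expression in reproduced mirrors the line quoted from data_structures.py.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ReproOutcome:
    """Hypothetical stand-in for ReproResult; not the ACR class."""
    stdout: str
    stderr: str
    returncode: int

    @property
    def reproduced(self) -> bool:
        # Same rule as the quoted line: non-zero exit AND AssertionError in stderr.
        return self.returncode != 0 and "AssertionError" in self.stderr


def run_reproducer(script_source: str, timeout: float = 60.0) -> ReproOutcome:
    """Write the script to a temporary directory and run it with this interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "reproducer.py"
        path.write_text(script_source, encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return ReproOutcome(proc.stdout, proc.stderr, proc.returncode)


# Toy script following the contract: an uncaught AssertionError makes Python
# print the traceback to stderr and exit with a non-zero status.
ISSUE_PRESENT = "assert sorted([3, 1, 2]) == [3, 1, 2], 'list was reordered'\n"
# The same check after a hypothetical fix: the assertion holds, exit status 0.
ISSUE_FIXED = "assert sorted([3, 1, 2]) == [1, 2, 3]\n"
# A script that fails for an unrelated reason: non-zero exit, no AssertionError.
UNRELATED_CRASH = "raise ValueError('environment problem')\n"

before = run_reproducer(ISSUE_PRESENT)
after = run_reproducer(ISSUE_FIXED)
crash = run_reproducer(UNRELATED_CRASH)
print(before.reproduced, after.reproduced, crash.reproduced)
```

The third case shows why both conditions are needed: a script that dies from something other than an assertion exits non-zero but does not count as a reproduction.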

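A successful reproduction yields two artifacts: test content that confirms the problem exists, and its stderr. The latter is fed to the retrieval stage as an extra clue (in _run_one_task, repro_stderr is passed into search_iterative, inference.py:285 and inference.py:303).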

4.3 Step two: the reproduce-and-review loop

The loop lives in ReviewManager._generator (review_manage.py:78). Each round does the following:

     reproducer test + first patch
              │
   ┌──────────▼───────────┐
   │ run reproducer on    │  yields patched_repro_result
   │ the patched code     │
   └──────────┬───────────┘
              │ orig_stderr / patched_stderr
   ┌──────────▼───────────┐
   │ reviewer agent       │  judges both: is the patch right? is the test right?
   └──────────┬───────────┘
     ┌────────┼─────────────────┐
  patch=YES   patch=NO         test=NO
     │          │                │
  send to     feed analysis    feed analysis
  validation  + advice back    + advice back
  (yield)     → rewrite patch  → rewrite test

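Key point: what the reviewer receives is the execution difference. agent_reviewer.run (agent_reviewer.py:88) feeds all of the following to the model (run_with_retries, agent_reviewer.py:115): the issue, the test content, the stdout/stderr of the script before the patch, the patch content, and the stdout/stderr of the script after the patch. From these the model outputs a JSON object: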

{
  "patch-correct": "yes|no",
  "test-correct": "yes|no",
  "patch-analysis": "...", "patch-advice": "...",
  "test-analysis": "...", "test-advice": "..."
}

The idea that the test may also be wrong is the heart of this design (the system prompt states explicitly that both the test and the patch may be wrong, agent_reviewer.py:23). So the reviewer audits both sides:

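patch_decision == NO → compose_feedback_for_patch_generation (review_manage.py:168) assembles the test, the execution analysis, and the improvement advice into feedback, and PatchAgent rewrites the patch with that feedback.
test_decision == NO → compose_feedback_for_test_generation (review_manage.py:189) has TestAgent rewrite the test.
patch_decision == YES → the patch is yielded to the validation stage (Chapter 5); if validation rejects it, the validation message is fed back as further feedback (review_manage.py:133-139).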

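This is a bidirectional error-correcting loop: the patch and the test act as mirrors for each other, and whichever is wrong is exposed by the execution results of the other.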
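The three branches can be sketched as a small parse-and-dispatch step. This is a minimal illustration, not ACR code: ReviewSketch, parse_review, next_actions, and the action names are hypothetical, and only the six JSON keys come from the reviewer output shown above. The sketch treats the patch and test decisions independently; how ACR orders the work when both are wrong is not covered here.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    YES = "yes"
    NO = "no"


@dataclass
class ReviewSketch:
    """Hypothetical container for the six JSON fields; not ACR's Review class."""
    patch_decision: Decision
    test_decision: Decision
    patch_analysis: str
    patch_advice: str
    test_analysis: str
    test_advice: str


def parse_review(raw: str) -> ReviewSketch:
    """Parse the reviewer reply. Raises KeyError or ValueError on malformed input."""
    data = json.loads(raw)
    return ReviewSketch(
        patch_decision=Decision(data["patch-correct"].strip().lower()),
        test_decision=Decision(data["test-correct"].strip().lower()),
        patch_analysis=data["patch-analysis"],
        patch_advice=data["patch-advice"],
        test_analysis=data["test-analysis"],
        test_advice=data["test-advice"],
    )


def next_actions(review: ReviewSketch) -> list[str]:
    """Map a review onto the branches of the diagram (action names are made up)."""
    actions = []
    if review.patch_decision is Decision.YES:
        actions.append("validate_patch")
    else:
        actions.append("rewrite_patch")
    if review.test_decision is Decision.NO:
        actions.append("rewrite_test")
    return actions


raw_reply = json.dumps({
    "patch-correct": "no",
    "test-correct": "yes",
    "patch-analysis": "AssertionError is still raised after the patch",
    "patch-advice": "handle the empty-input case",
    "test-analysis": "the test fails before the patch as expected",
    "test-advice": "",
})
review = parse_review(raw_reply)
print(next_actions(review))
```

Parsing into an enum rather than comparing raw strings means an unexpected value such as "maybe" fails loudly with ValueError instead of silently falling into one of the branches.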

4.3.1 Illustration of the principle: why the execution difference is a good signal

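# Illustrative only, not source code: the reviewer's intuition
def judge(orig_stderr, patched_stderr):
    issue_present_before = "AssertionError" in orig_stderr       # reproduces before the patch = test is valid
    issue_gone_after = "AssertionError" not in patched_stderr    # gone after the patch = patch may be valid
    if issue_present_before and issue_gone_after:
        return "patch likely correct"     # red before, green after: strongest evidence
    if not issue_present_before:
        return "test may be wrong"         # the test never reproduced the issue: suspect the test first
    return "patch likely wrong"            # red before, still red after: the patch did not fix it
# Key point: anchor the judgment on the observed execution difference, rather than having the model guess correctness from reading the patch alone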

The real judgment is made by the LLM inside run_with_retries, based on the full stdout/stderr (agent_reviewer.py:134-179); the code above only makes its intuition explicit.

4.4 Fallback: no reproduction, no review

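If the issue is not reproducible (NoReproductionStep) or no reproducer test can be written (InvalidLLMResponse), the whole review path is skipped and _run_one_task falls back to iterative patching without review (write_patch_iterative, inference.py:63): it only writes patches and runs validation, with no reviewer. The logic at the end of _run_one_task (inference.py:327-340) makes this explicit: the reviewed path is taken only when reproduce_and_review and reproduced both hold.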
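The fallback decision can be sketched as follows. This is an assumption-laden illustration rather than the code in inference.py: the two exception classes are local stand-ins for the ones named above, try_reproduce and choose_path are hypothetical helpers, and the returned strings are simply the names of the two patching routines mentioned in this chapter.

```python
from typing import Callable, Optional


class NoReproductionStep(Exception):
    """Local stand-in for the exception named in the text."""


class InvalidLLMResponse(Exception):
    """Local stand-in for the exception named in the text."""


def try_reproduce(write_test: Callable[[], str]) -> Optional[str]:
    """Return the reproducer source, or None when the review path must be skipped."""
    try:
        return write_test()
    except (NoReproductionStep, InvalidLLMResponse):
        return None


def choose_path(reproduce_and_review: bool, test: Optional[str]) -> str:
    """Pick the patching routine: review only if enabled AND a reproducer exists."""
    reproduced = test is not None
    if reproduce_and_review and reproduced:
        return "write_patch_iterative_with_review"
    return "write_patch_iterative"


def no_steps() -> str:
    raise NoReproductionStep("issue has no reproducible example")


def working_test() -> str:
    return "assert 1 + 1 == 3\n"


print(choose_path(True, try_reproduce(no_steps)))
print(choose_path(True, try_reproduce(working_test)))
print(choose_path(False, try_reproduce(working_test)))
```

Only the middle call takes the reviewed path; a missing reproducer or a disabled flag both lead to plain iterative patching.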

4.5 Key details and pitfalls

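The reproduction criterion is narrow. Only AssertionError counts (data_structures.py:174), so this enhancement mainly covers issues that an assertion script can trigger, and it struggles with bugs that need a complex environment or multi-step interaction.
The loop has a round limit. _generator(rounds=5) (review_manage.py:63) allows at most 5 rounds of back-and-forth.
Every round is written to disk. Patches (extracted_patch_*.diff), tests (reproducer_*.py), reviews (review_p*_t*.json), and execution results (execution_*.json) are all saved, and the patch-correct field in the review JSON is read later during patch selection (Chapter 5).
The reviewer uses JSON mode. common.SELECTED_MODEL.call(..., response_format="json_object") (agent_reviewer.py:184) forces structured output, which reduces parse failures.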
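One way a round-capped loop that yields accepted patches and takes validation feedback back in could be structured is with a Python generator and send(). This is a sketch under stated assumptions, not ACR's _generator: review_loop, drive, and the toy propose_patch, review, and validate callables are all hypothetical, and the choice to count a validation rejection against the same round budget is an assumption of this example.

```python
from typing import Callable, Generator, Optional, Tuple


def review_loop(
    propose_patch: Callable[[Optional[str]], str],
    review: Callable[[str], Tuple[bool, str]],
    rounds: int = 5,
) -> Generator[str, Optional[str], None]:
    """Yield reviewer-approved patches; receive validation feedback via send()."""
    feedback: Optional[str] = None
    for _ in range(rounds):
        patch = propose_patch(feedback)
        accepted, advice = review(patch)
        if not accepted:
            feedback = advice          # reviewer said no: retry with its advice
            continue
        verdict = yield patch          # hand the patch out, wait for validation
        if verdict is None:
            return                     # caller reported success
        feedback = verdict             # validation failed: feed its message back


def drive(gen: Generator[str, Optional[str], None],
          validate: Callable[[str], Tuple[bool, str]]) -> Optional[str]:
    """Run the loop until validation passes or the round budget is exhausted."""
    try:
        patch = next(gen)
        while True:
            ok, message = validate(patch)
            if ok:
                gen.close()
                return patch
            patch = gen.send(message)
    except StopIteration:
        return None


# Toy collaborators: patch v1 fails review, v2 fails validation, v3 passes both.
attempts = []

def propose_patch(feedback: Optional[str]) -> str:
    attempts.append(feedback)
    return f"patch-v{len(attempts)}"

def version_of(patch: str) -> int:
    return int(patch.rsplit("v", 1)[1])

def review(patch: str) -> Tuple[bool, str]:
    return version_of(patch) >= 2, f"{patch} still fails the reproducer"

def validate(patch: str) -> Tuple[bool, str]:
    return version_of(patch) >= 3, f"{patch} breaks a regression test"

result = drive(review_loop(propose_patch, review), validate)
print(result, attempts)
```

The generator form keeps the loop state (current feedback, rounds used) in one place while letting the caller decide how validation is performed.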

→ Next chapter: using the test suite for fault localization, regression validation, and final patch selection

4.6 Code map

Topic	File	Symbols
Reproducer agent	`app/agents/agent_reproducer.py`	`TestAgent._write_reproducing_test`, `_issue_has_reproduction_steps`
Reproduction criterion	`app/data_structures.py`	`ReproResult`
Code block extraction	`app/agents/agent_reproducer.py`	`convert_response_to_test`, `extract_markdown_code_blocks`
Reviewer agent	`app/agents/agent_reviewer.py`	`run`, `run_with_retries`, `Review`
Reproduce-and-review loop	`app/api/review_manage.py`	`ReviewManager._generator`, `compose_feedback_for_patch_generation`
Loop wiring	`app/inference.py`	`_run_one_task`, `write_patch_iterative_with_review`
