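Chapter 1: Program-Structure-Aware Code Search

This chapter covers ACR's most distinctive component: a set of search APIs that understand code structure. By the end you will know why ACR does not let the LLM grep on its own, and how its index and each of the 8 APIs work.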

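1.1 The small problem it solves

An LLM's context window cannot hold an entire large repository, so the agent has to fetch code on demand. The most naive approach is to hand it a grep: the model supplies a string and the system returns the matching lines. But grep has two weaknesses:

It returns isolated lines, not a complete method or class, so the model has to ask again for context.
It does not know which class or method a line belongs to, and that ownership information is exactly what matters most when localizing a bug.

ACR's approach: first index the structure of the whole project with the AST, then expose APIs that query by structure. When the model says "give me the _array_converter method in class WCS", the system returns the complete source of that method directly, along with its owning class and exact line range.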

1.2 The index: one scan, four tables

When a project starts, SearchBackend.__init__ calls _build_index (search_backend.py:49), which scans every non-test .py file once and builds four index tables:

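Index	Key → value	Purpose
`class_index`	class name → `[(file, line range)]`	Look up a class by name
`class_func_index`	class name → {method name → `[(file, line range)]`}	Look up a method inside a given class
`function_index`	function name → `[(file, line range)]`	Look up top-level (non-class) functions
`class_relation_index`	class name → `[parent class names]`	Look up inheritance (used to find methods overridden from a parent class)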

Every value is a list because a class or method name can be defined in more than one place (the comment at search_backend.py:34-47 says so explicitly).
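To make the shape of the four tables concrete, here is a minimal sketch that types them and fills them by hand. The aliases `LineRange` and `Location`, the file paths, and the line numbers are invented for illustration; only the four index names come from the table above, and the real value layout may differ.

```python
from collections import defaultdict

# Hypothetical aliases, for illustration only.
LineRange = tuple[int, int]        # (start_line, end_line), 1-based, inclusive
Location = tuple[str, LineRange]   # (file path, line range)

class_index: dict[str, list[Location]] = defaultdict(list)
class_func_index: dict[str, dict[str, list[Location]]] = defaultdict(
    lambda: defaultdict(list)
)
function_index: dict[str, list[Location]] = defaultdict(list)
class_relation_index: dict[str, list[str]] = defaultdict(list)

# Two classes named Config in different files: both entries are kept,
# which is why every value is a list.
class_index["Config"].append(("pkg/a.py", (10, 40)))
class_index["Config"].append(("pkg/b.py", (5, 22)))
class_func_index["Config"]["load"].append(("pkg/a.py", (12, 20)))
function_index["main"].append(("pkg/cli.py", (1, 9)))
class_relation_index["Config"].append("BaseConfig")

print(len(class_index["Config"]))  # number of definitions named Config
```

A caller that wants one specific `Config` has to disambiguate, for example by file name, which is what the `*_in_file` APIs later in this chapter are for.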

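How the scan works. parse_python_file (search_utils.py:58) reads one file, turns it into a syntax tree with ast.parse, and traverses it with ast.walk: on an ast.ClassDef it records the class name and its start and end lines, and collects the ast.FunctionDef nodes inside the class into the class → method table; a top-level ast.FunctionDef is recorded as a top-level function. Line numbers come from the AST's node.lineno / node.end_lineno (1-based). Files that fail to parse are skipped (the try/except at search_utils.py:73-78 returns None), so the index covers only the subset of files that ast can parse.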

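Only source code is indexed, not tests. find_python_files (search_utils.py:21) uses is_test_file (search_utils.py:8: the path contains test/tests, or the file name matches *_test.py / test_*) to filter out test files, because the goal at this stage is context for writing a patch, and tests are irrelevant to that.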
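The filter rule can be sketched in a few lines. This is not the real is_test_file: the function name `looks_like_test_file` is hypothetical, and treating "the path contains test/tests" as "some directory component is named test or tests" is an assumption about how the rule is meant.

```python
from pathlib import PurePath


def looks_like_test_file(path: str) -> bool:
    """Hypothetical stand-in for the test-file filter described above."""
    p = PurePath(path)
    # Rule 1 (assumed reading): a directory component named "test" or "tests".
    if any(part in ("test", "tests") for part in p.parts[:-1]):
        return True
    # Rule 2: file name of the form test_* or *_test.py.
    return p.name.startswith("test_") or p.name.endswith("_test.py")


paths = [
    "astropy/wcs/wcs.py",
    "astropy/wcs/tests/test_wcs.py",
    "utils/io_test.py",
]
source_files = [p for p in paths if not looks_like_test_file(p)]
print(source_files)
```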

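Index caching. _build_python_index is decorated with @classmethod @cache (search_backend.py:74-75), keyed on project_path. Building the index a second time for the same project path therefore hits the cache: the whole-task retry in run_one_task (inference.py:110) creates a new ProjectApiManager, and with it a new SearchBackend, on every round, but the index scan really runs only once.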
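The effect of stacking @classmethod on functools.cache can be seen in a small stand-alone sketch. The `Backend` class, its `scans` counter, and the placeholder return value are all hypothetical; only the decorator order mirrors the description above.

```python
from functools import cache


class Backend:
    scans = 0  # counts how many real scans happened

    def __init__(self, project_path: str):
        self.index = self._build_index(project_path)

    @classmethod
    @cache
    def _build_index(cls, project_path: str) -> dict:
        # The cache key is the argument tuple (cls, project_path).
        cls.scans += 1
        return {"project": project_path}  # placeholder for the real tables


first = Backend("/tmp/proj")
second = Backend("/tmp/proj")   # new instance, same path: cache hit
third = Backend("/tmp/other")   # different path: scanned again
print(Backend.scans)
```

One consequence worth keeping in mind: a cache hit returns the same object, so `first.index` and `second.index` are one shared dict, and mutating it through one instance is visible through the other.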

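1.3 Principle demo: roughly what the index looks like

The following **sketch (not the actual source)** shows the core idea of scanning one file to build an index: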

# Sketch, not the actual source: shows the core idea behind the AST index
import ast

def index_one_file(path, src):
    tree = ast.parse(src)
    class_index = {}            # class name -> [(file, start, end)]
    class_func_index = {}       # class name -> {method name -> [(file, start, end)]}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            class_index.setdefault(node.name, []).append((path, node.lineno, node.end_lineno))
            # Collect the methods of this class.
            # Note: ast.walk also visits functions nested deeper inside the class.
            methods = {}
            for n in ast.walk(node):
                if isinstance(n, ast.FunctionDef):
                    methods.setdefault(n.name, []).append((path, n.lineno, n.end_lineno))
            class_func_index[node.name] = methods
    return class_index, class_func_index
# Key point: keys are names, values carry file + line range, so a lookup by name leads straight to that span of source

For the real implementation, see parse_python_file (search_utils.py:58-118) and _build_python_index (search_backend.py:101-116), which loads the results into the tables.

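1.4 The eight search APIs: what each one looks up

These APIs are the tools exposed to the LLM (they are listed in the prompt; see agent_search.SELECT_PROMPT, agent_search.py:25). Every API returns a (tool_output, search_results, call_ok) triple. Only the first element, the tool_output string, is sent to the model (the comment at search_backend.py:235-237 says so explicitly); the other two are for ACR's own use.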

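API	What it looks up	Source location
`search_class(class_name)`	Finds a class across the repository; returns only the signature (class name + base classes + method signatures)	`search_backend.py:276`
`search_class_in_file(class_name, file_name)`	Finds a class in a given file; returns the whole class definition	`search_backend.py:318`
`search_method(method_name)`	Finds a method across the repository (top-level functions + methods in every class)	`search_backend.py:451`
`search_method_in_file(method_name, file_name)`	Finds a method in a given file	`search_backend.py:361`
`search_method_in_class(method_name, class_name)`	Finds a method in a given class	`search_backend.py:409`
`search_code(code_str)`	Finds a piece of code across the repository; returns the method it sits in / the surrounding region	`search_backend.py:481`
`search_code_in_file(code_str, file_name)`	Finds a code fragment in a given file	`search_backend.py:530`
`get_code_around_line(file_name, line_no, window)`	Returns ±window lines around a given line of a file	`search_backend.py:588`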

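A few design choices worth noting:

Class signature vs. whole class. search_class deliberately returns only the signature rather than all the code (search_backend.py:294 calls get_class_signature), because a class can run to hundreds of lines and returning all of it would blow the token budget. The signature is extracted by extract_class_sig_from_ast (search_utils.py:253): the class header, the signature lines of each method (including decorators; see extract_func_sig_from_ast, search_utils.py:224), and class-level assignments (skipping __doc__). Once the model has seen the signature, it can call search_method_in_class if it wants the implementation of a particular method.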
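A rough sketch of signature extraction, assuming every class header, decorator, `def` line, and assignment fits on a single source line (multi-line headers would be truncated here). The function `class_signature` is a hypothetical illustration and not the code of extract_class_sig_from_ast.

```python
import ast


def class_signature(src: str, class_name: str) -> str:
    """Return the class header plus decorator, def, and assignment lines."""
    lines = src.splitlines()
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            out = [lines[node.lineno - 1]]  # the "class X(Base):" line
            for stmt in node.body:
                if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    # Decorator lines first, then the "def ...:" line.
                    for deco in stmt.decorator_list:
                        out.append(lines[deco.lineno - 1])
                    out.append(lines[stmt.lineno - 1])
                elif isinstance(stmt, ast.Assign):
                    out.append(lines[stmt.lineno - 1])
                # Docstrings are ast.Expr nodes and are skipped.
            return "\n".join(out)
    return ""


SRC = '''class WCS(Base):
    naxis = 2

    @property
    def shape(self):
        return (1, 2)

    def convert(self, x):
        y = x + 1
        return y
'''
print(class_signature(SRC, "WCS"))
```

The ten-line class shrinks to five lines of outline, and method bodies never reach the model unless it asks for them.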

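Result count limit. RESULT_SHOW_LIMIT = 3 (search_backend.py:22): when more than 3 results share a name, only the first 3 are expanded to full code, and the rest are collapsed to file level (SearchResult.collapse_to_file_level, data_structures.py:231), listing only the file name and the number of hits. This is another token-saving detail.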
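The "expand a few, collapse the rest" rule might look like the sketch below. The `render` function, the `(file, code)` result shape, and the output wording are assumptions made for illustration; only the constant and its value are taken from the text.

```python
from collections import Counter

RESULT_SHOW_LIMIT = 3  # value quoted in the text above


def render(results: list[tuple[str, str]]) -> str:
    """results holds (file, code) pairs; the output wording is hypothetical."""
    shown = results[:RESULT_SHOW_LIMIT]
    rest = results[RESULT_SHOW_LIMIT:]
    parts = [f"- {path}\n{code}" for path, code in shown]
    if rest:
        # Collapse to file level: one line per file with its hit count.
        counts = Counter(path for path, _ in rest)
        parts.append("Other results are in these files:")
        parts += [f"- {path} ({n} matches)" for path, n in counts.items()]
    return "\n".join(parts)


results = [
    ("a.py", "code_A"),
    ("b.py", "code_B"),
    ("c.py", "code_C"),
    ("d.py", "code_D"),
    ("d.py", "code_E"),
]
print(render(results))
```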

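search_code fills in which method the code sits in. It first uses get_code_region_containing_code (search_utils.py:121, a regex search within the file, with ±3 lines of context) to find the line number, then uses _file_line_to_class_and_func (search_backend.py:125) to look up which class / method that line belongs to. The snippet returned to the model therefore carries its structural ownership instead of being bare text.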
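The reverse lookup from a line number to its enclosing class and function can be sketched with the AST alone. The helper `enclosing_scope` is hypothetical and re-parses the source on every call; how _file_line_to_class_and_func actually does the lookup is not shown in this chapter.

```python
import ast
from typing import Optional


def enclosing_scope(src: str, line_no: int) -> tuple[Optional[str], Optional[str]]:
    """Return (class_name, function_name) whose line range contains line_no."""
    cls_name = func_name = None
    # ast.walk goes breadth-first, so inner definitions are seen after outer
    # ones and the innermost match wins.
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.lineno <= line_no <= node.end_lineno:
                if isinstance(node, ast.ClassDef):
                    cls_name = node.name
                else:
                    func_name = node.name
    return cls_name, func_name


SRC = '''def helper():
    return 0

class WCS:
    def convert(self, x):
        y = x + 1
        return y
'''
print(enclosing_scope(SRC, 6))
print(enclosing_scope(SRC, 2))
```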

1.5 The real implementation: one API end to end

Take search_method_in_class (search_backend.py:409-448) as an example of how an API pulls data from the index and assembles the string shown to the model:

# app/search/search_backend.py:420-433(节选)
if class_name not in self.class_index:
    tool_output = f"Could not find class {class_name} in the codebase."
    return tool_output, [], False
search_res = self._search_func_in_class(method_name, class_name)
if not search_res:
    tool_output = f"Could not find method {method_name} in class {class_name}`."
    return tool_output, [], False
tool_output = f"Found {len(search_res)} methods with name {method_name} in class {class_name}:\n\n"

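Key points:

It first checks class_index to confirm the class exists; if not, it returns call_ok=False together with a one-sentence explanation for the model.
The code is actually fetched by _search_func_in_class (search_backend.py:149): it takes the (file, line range) entries from class_func_index[class_name][method_name], uses get_code_snippets (search_utils.py:203, which reads those lines with line numbers attached) to pull out the source, and wraps each one in a SearchResult.
The string for the model is wrapped by to_tagged_str (data_structures.py:225) in a tagged format, <file>...</file>\n<class>...</class> <func>...</func>\n<code>...</code>, which makes the structure easy for the model (and later parsing) to recognize.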
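The three steps above (existence check, index lookup plus line-range read, tagged output) fit together roughly as follows. Everything here is a simplified stand-in: the in-memory `FILES` dict, `read_lines`, `lookup_method_in_class`, and the exact whitespace inside the tags are assumptions, and results are plain tuples rather than SearchResult objects.

```python
# Hypothetical in-memory "repository" and index, for illustration only.
FILES = {
    "wcs.py": "class WCS:\n    def convert(self, x):\n        return x + 1\n",
}
class_func_index = {"WCS": {"convert": [("wcs.py", (2, 3))]}}


def read_lines(file: str, start: int, end: int) -> str:
    """Return lines start..end (1-based, inclusive)."""
    lines = FILES[file].splitlines()
    return "\n".join(lines[start - 1:end])  # start - 1: 1-based to 0-based


def lookup_method_in_class(method: str, cls: str):
    """Return a (tool_output, results, call_ok) triple."""
    if cls not in class_func_index:
        return f"Could not find class {cls} in the codebase.", [], False
    hits = class_func_index[cls].get(method, [])
    if not hits:
        return f"Could not find method {method} in class {cls}.", [], False
    results = [(file, read_lines(file, start, end)) for file, (start, end) in hits]
    body = "\n\n".join(
        f"<file>{file}</file>\n<class>{cls}</class> <func>{method}</func>\n"
        f"<code>\n{code}\n</code>"
        for file, code in results
    )
    header = f"Found {len(results)} methods with name {method} in class {cls}:\n\n"
    return header + body, results, True


output, results, ok = lookup_method_in_class("convert", "WCS")
print(output)
```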

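Fault-tolerance decorators. Almost every API is wrapped in @catch_all_and_log (utils.py:339): any exception is swallowed and logged, and a safe failure value is returned, so an exception in one search API cannot crash the whole task. search_code additionally carries @timeout_decorator.timeout(120) (search_backend.py:480), because a regex search over the whole repository can be slow.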
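A decorator with this behaviour can be sketched as below. The name `catch_all_sketch` and the shape of the failure value are assumptions; the real catch_all_and_log may log and return something different.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def catch_all_sketch(func):
    """Hypothetical decorator: swallow any exception, log it, return a failure triple."""

    @functools.wraps(func)  # also sets wrapper.__wrapped__ = func
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            logger.exception("search API %s failed", func.__name__)
            # Assumed failure value: message for the model, no results, call_ok=False.
            return f"An error occurred: {exc!r}", [], False

    return wrapper


@catch_all_sketch
def broken_search(name: str):
    raise KeyError(name)


tool_output, search_results, call_ok = broken_search("WCS")
print(call_ok)
```

Because functools.wraps records the original function in `__wrapped__`, code that needs the real signature can still reach it, which matters for the argument-count check in the next section.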

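1.6 proxy: translating "I want to see X" into executable calls

During the agent_search stage the LLM states in natural language which APIs it wants to call (it writes a prose analysis and lists API calls along the way). To turn that text into functions that can actually be called, an agent_proxy (agent_proxy.py) sits in between; its job is to extract JSON from the prose: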

{
  "API_calls": ["search_method_in_class(\"_array_converter\", \"WCS\")"],
  "bug_locations": [{"file": "...", "class": "...", "method": "...", "intended_behavior": "..."}]
}

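After extraction the result is validated (is_valid_response, agent_proxy.py:90):

Every API call must be parseable by parse_function_invocation (utils.py:305, which uses ast.parse to parse function-call syntax).
The function being called must actually exist on SearchBackend (getattr(SearchBackend, func_name, None), agent_proxy.py:116).
The argument count must match: it uses inspect.getfullargspec to get the real parameter names and compares them with the number of arguments the model supplied (agent_proxy.py:122-129). Note that it must first peel off the @catch_all_and_log decorator layers with while "__wrapped__" in ... to reach the original function's real signature.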
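The three checks can be sketched as a single function. `FakeBackend`, `parse_call`, and `is_valid_call` are hypothetical stand-ins for SearchBackend, parse_function_invocation, and is_valid_response; the sketch handles positional literal arguments only and ignores keyword arguments.

```python
import ast
import inspect


class FakeBackend:
    """Hypothetical stand-in for SearchBackend."""

    def search_method_in_class(self, method_name, class_name): ...

    def search_class(self, class_name): ...


def parse_call(call: str) -> tuple[str, list]:
    """Parse 'name("a", "b")' into ('name', ['a', 'b'])."""
    expr = ast.parse(call, mode="eval").body
    if not isinstance(expr, ast.Call) or not isinstance(expr.func, ast.Name):
        raise ValueError(f"not a plain function call: {call}")
    return expr.func.id, [ast.literal_eval(arg) for arg in expr.args]


def is_valid_call(call: str) -> bool:
    try:
        name, args = parse_call(call)       # check 1: parseable
    except (SyntaxError, ValueError):
        return False
    func = getattr(FakeBackend, name, None)  # check 2: function exists
    if not callable(func):
        return False
    while hasattr(func, "__wrapped__"):      # peel decorator layers
        func = func.__wrapped__
    params = inspect.getfullargspec(func).args
    return len(args) == len(params) - 1      # check 3: arity, minus self


print(is_valid_call('search_method_in_class("_array_converter", "WCS")'))
```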

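If validation fails, it retries (run_with_retries, 5 attempts by default, agent_proxy.py:45). This "extract JSON first, then validate against the real signatures" step is how ACR funnels unreliable natural language into reliable function calls; the next chapter shows how the main loop drives it.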

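1.7 Key details / pitfalls

Same-name entities are told apart by lists plus file filtering. Index values are lists, and the search_*_in_file family filters them further through _get_candidate_matched_py_files (search_backend.py:215, which matches file names case-insensitively with endswith), so a short file name from the model still hits.
Line numbers are always 1-based. The whole search stack uses the AST's 1-based line numbers, and get_code_snippets applies a start-1 offset when reading (search_utils.py:216).
search_code's pattern is passed through re.escape. get_code_region_containing_code (search_utils.py:142) applies re.escape to code_str, so the match is literal rather than a regex match: the model's code fragment is searched for verbatim, and special characters are not accidentally interpreted.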
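The difference re.escape makes is easy to demonstrate with a fragment that contains regex metacharacters. The sample strings below are invented for illustration.

```python
import re

code_str = "a[i] += f(x) * 2"              # fragment as a model might send it
text = "total = 0\na[i] += f(x) * 2\n"     # file contents

# Unescaped: [i], +, (x) and * are interpreted as regex syntax.
raw = re.search(code_str, text)

# Escaped: every character is matched verbatim.
literal = re.search(re.escape(code_str), text)

# Convert the match offset to a 1-based line number.
line_no = text.count("\n", 0, literal.start()) + 1
print(raw, line_no)
```

Without escaping, the pattern silently fails to find a line that is plainly present in the file; with other inputs, such as an unbalanced parenthesis, re.search would raise re.error instead.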

→ Next chapter: how these parts are wired into a self-iterating retrieval loop

1.8 Code map

Topic	File	Symbols
Index construction (with caching)	`app/search/search_backend.py`	`_build_python_index`, `_build_index`, `_update_indices`
AST parsing of a single file	`app/search/search_utils.py`	`parse_python_file`, `parse_class_def_args`
Fetching code / signatures	`app/search/search_utils.py`	`get_code_snippets`, `get_class_signature`, `extract_class_sig_from_ast`
Locating code fragments	`app/search/search_utils.py`	`get_code_region_containing_code`, `get_code_region_around_line`
The 8 search APIs	`app/search/search_backend.py`	`search_class`, `search_method_in_class`, `search_code`, `get_code_around_line`
Line number → class / method lookup	`app/search/search_backend.py`	`_file_line_to_class_and_func`
Result collapsing / tagging	`app/data_structures.py`	`SearchResult.collapse_to_file_level`, `to_tagged_str`
Natural language → JSON validation	`app/agents/agent_proxy.py`	`is_valid_response`, `run_with_retries`
Function-call parsing	`app/utils.py`	`parse_function_invocation`
