主循环:从一条指令到一串动作

这一章把 ComputerAgent.run() 一圈圈拆开看。读完你会知道:一次“看-想-做”循环到底发生了什么,模型的动作怎么被执行,截图怎么喂回去。

它要解决的小问题

computer-use 是多步的:模型看一张图、出一个动作,但一个任务要几十步。所以必须有一个循环:执行动作 → 截新图 → 再问模型。run() 就是这个循环。

思路/直觉:一切都是“事件项”

Cua 把对话历史表示成一串 Responses API 事件项(item)。每个 item 有个 type:

item type	含义	谁产生
`message`	文字消息(用户指令、模型的思考/最终回答)	用户 / 模型
`computer_call`	模型想做一个动作(如 `{action: {type: "click", x, y}}`)	模型
`computer_call_output`	动作执行后的结果(通常是新截图)	框架
`function_call` / `function_call_output`	自定义函数工具的调用与返回	模型 / 框架

循环的本质: 模型产出 computer_call,框架执行它、产出 computer_call_output(新截图),拼回历史,再问模型。直到模型只产出 message(没有更多动作)——任务结束。

图示:一圈循环

          old_items + new_items (历史)
                 │
        ┌────────▼─────────┐  回调预处理
        │ _on_llm_start()  │  (裁图/PII/...)
        └────────┬─────────┘
                 ▼
        loop.predict_step()  ── 问模型 ──►  返回 output[] (含 computer_call)
                 │
        ┌────────▼──────────────┐
        │ for item in output:   │
        │   _handle_item(item)  │
        │     ├ computer_call ─► computer.click(x,y) ─► 等待 ─► 截新图
        │     │                  └─► computer_call_output (input_image)
        │     └ function_call ─► 调你的函数 ─► function_call_output
        └────────┬──────────────┘
                 ▼
      new_items += 动作 + 结果,回到顶部
                 │
     直到 new_items 末尾是 role:"assistant" 的 message → 退出

原理演示(简化)

这段把循环的骨架演出来,省掉回调和重试:

# 示意,非源码:run() 的骨架
async def run(self, messages):
    old_items = to_items(messages)       # 用户指令变成事件项
    new_items = []
    # 末尾不是 assistant 文字 → 还没做完
    while not (new_items and new_items[-1].get("role") == "assistant"):
        history = old_items + new_items
        result = await self.agent_loop.predict_step(history, self.model, ...)  # 问模型
        new_items += result["output"]    # 模型给的 computer_call 等
        for item in result["output"]:
            outputs = await self._handle_item(item, self.computer_handler)  # 执行
            new_items += outputs         # 执行结果(新截图)拼回去
        yield result                     # 把这一圈 yield 给调用方

重点看: 循环的退出条件是 new_items[-1] 是个 assistant 角色的消息——模型这一圈只说话、不动手,就意味着它认为任务完成了。

真实实现

循环条件(agent/cua_agent/agent.py:954):末尾不是 assistant 就继续转。

问模型这一步包了一层指数退避重试(agent/cua_agent/agent.py:1011-1022,_predict_step_with_retry):只对“瞬时错误”(429、5xx、超时、连接)重试,并且关掉了 liteLLM 自己的内层重试(max_retries: 0),避免两层重试叠乘(agent/cua_agent/agent.py:974-976)。

执行动作的核心在 _handle_item(agent/cua_agent/agent.py:730-896)。对 computer_call,它用 getattr 按动作名取方法再调用:

# agent/cua_agent/agent.py:771-777  (真实源码,节选)
computer_method = getattr(computer, action_type, None)
if computer_method:
    assert_callable_with(computer_method, **action_args)
    action_result = await computer_method(**action_args)
else:
    raise ToolError(f"Unknown computer action: {action_type}")

这行 getattr(computer, action_type) 是动作分发的关键:模型说 "click",框架就调 computer.click(**args)。computer 是个 AsyncComputerHandler(见 05 章),它把 click 翻成沙箱里的真实点击。

截图喂回:执行后(非 terminate)等 screenshot_delay 秒,截图,封装成 computer_call_output 里的 input_image(agent/cua_agent/agent.py:800-837):

# agent/cua_agent/agent.py:829-837  (真实源码,节选)
call_output = {
    "type": "computer_call_output",
    "call_id": item.get("call_id"),
    "acknowledged_safety_checks": acknowledged_checks,
    "output": {
        "type": "input_image",
        "image_url": f"data:image/png;base64,{screenshot_base64}",
    },
}

这个 call_id 把“动作”和“它的结果截图”配对——下一圈模型就能看到“我上一步点完后屏幕变成这样”。

关键细节/坑

terminate 动作不截图。 模型可以发 terminate 动作主动收尾;此时不再截图,output 里带 {"terminated": True}(agent/cua_agent/agent.py:796-827)。这是除“输出纯文字”之外的第二种结束方式。
失败的 computer_call 会被改写。 进 loop 前调 replace_failed_computer_calls_with_function_calls(agent/cua_agent/agent.py:965),把没成功的 computer_call 转成普通 function_call,避免某些模型 API 因“有 call 没有对应 output”而报错。
函数工具走另一条路。 自定义 function_call 不经 computer_handler,而是 _get_tool(name) 找到你注册的函数,async 直接 await、sync 用 asyncio.to_thread 跑(agent/cua_agent/agent.py:850-892)。
每一圈都 yield。 run() 是异步生成器,调用方能实时拿到每一步的 message/computer_call/截图,适合做流式 UI。

代码地图

主题	文件路径	符号名
主循环	`agent/cua_agent/agent.py`	`ComputerAgent.run`
动作分发	`agent/cua_agent/agent.py`	`ComputerAgent._handle_item`
退避重试	`agent/cua_agent/agent.py`	`_predict_step_with_retry`, `_is_retryable_error`
失败 call 改写	`agent/cua_agent/responses.py`	`replace_failed_computer_calls_with_function_calls`
loop 协议	`agent/cua_agent/loops/base.py`	`AsyncAgentConfig.predict_step`

它要解决的小问题​

思路/直觉:一切都是“事件项”​

图示:一圈循环​

原理演示(简化)​

真实实现​

关键细节/坑​

代码地图​