Composed grounding: a thinker plus a pointer, working as a team

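This is Cua's neatest trick. By the end you will see why writing the model name as A+B lets a strong reasoning model that cannot output coordinates still operate a computer.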

The small problem it solves

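Computer use needs two capabilities, and few models are strong at both:

Planning: reading the screen and working out which button to click next. Large reasoning models (GPT, Gemini, Claude) are good at this.
Grounding: turning "that red submit button" into exact pixel coordinates such as (412, 380). Small dedicated grounding models (GTA1, UI-TARS, Holo) are good at this.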

Asking Gemini to output exact coordinates directly is often inaccurate, so Cua splits the work between the two kinds of model.

Idea/intuition: the planning model speaks plain language, the grounding model fills in coordinates

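The core idea: the planning model never outputs coordinates, only element descriptions ("click the search text field"). A separate grounding model then takes the current screenshot plus that description and returns an (x, y).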

The model name is written as grounding+thinking (for example huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro), and the registry pattern .*\+.* (priority 1) routes it to ComposedGroundedConfig.
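The routing rule can be illustrated with a toy registry. This is a minimal sketch, not Cua's registry code: `Entry`, `REGISTRY`, `find_loop`, and `split_composed` are hypothetical names, the GTA1 pattern is made up for the demo, and the rule that a larger priority number wins is an assumption. Only the `.*\+.*` pattern, its priority of 1, and the `grounding+thinking` order come from the text.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    pattern: str   # regex matched against the full model string
    priority: int  # assumption: the larger number wins on a tie
    name: str      # name of the loop config that handles the model


# Hypothetical registry contents, for illustration only.
REGISTRY: List[Entry] = [
    Entry(r".*GTA1.*", 0, "GTA1Config"),
    Entry(r".*\+.*", 1, "ComposedGroundedConfig"),
]


def find_loop(model: str) -> Optional[str]:
    """Return the highest-priority entry whose pattern matches the model."""
    matches = [e for e in REGISTRY if re.fullmatch(e.pattern, model)]
    if not matches:
        return None
    return max(matches, key=lambda e: e.priority).name


def split_composed(model: str) -> Tuple[str, str]:
    """Split 'grounding+thinking' at the first '+' only."""
    grounding, thinking = model.split("+", 1)
    return grounding, thinking


composed = "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro"
print(find_loop(composed), split_composed(composed))
```

Both patterns match the composed name here, so the priority is what sends it to the composed loop rather than to the grounding loop alone.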

Diagram: the conversion chain of one predict_step

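  The history contains a coordinate action (the previous step clicked (412,380))
        │
  ① xy2desc: swap each (x,y) in the history back to its description, so the planning model only sees "clicked the search field"
        │
  ② convert to completion format ──► litellm.acompletion(thinking_model)
        │                         (Gemini/GPT plans in "description space")
        ▼
   Model replies: computer_call { action: click, element_description: "red submit button" }
        │
  ③ collect every element_description ──► ["red submit button", ...]
        │
  ④ for each description, run the grounding model's predict_click on the current screenshot (up to 3 attempts)
        │      desc2xy["red submit button"] = (256, 540)
        ▼
  ⑤ desc2xy: swap descriptions back to coordinates ──► computer_call { action: click, x:256, y:540 }
        │
        ▼
   Handed back to the main loop for execution (the main loop never knows this "translation" happened)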

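How to read it: the whole chain is a round-trip translation, coordinates → descriptions → (planning) → descriptions → coordinates. The planning model lives entirely in description space and never touches pixels; the grounding model lives entirely in pixel space and never reasons. A single desc2xy dictionary bridges the two.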
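The round trip can be sketched on plain dictionaries. This is a simplified illustration, not the code in responses.py: `xy_to_desc` and `desc_to_xy` are hypothetical stand-ins for the real conversion helpers, and the item shape (a `computer_call` with an `action` dict) is an assumption based on the diagram.

```python
from typing import Any, Dict, List, Tuple

Item = Dict[str, Any]
Coords = Tuple[int, int]


def xy_to_desc(items: List[Item], desc2xy: Dict[str, Coords]) -> List[Item]:
    """Replace known (x, y) pairs in computer_call actions with descriptions."""
    xy2desc = {xy: desc for desc, xy in desc2xy.items()}  # invert the bridge
    out: List[Item] = []
    for item in items:
        action = item.get("action", {})
        xy = (action.get("x"), action.get("y"))
        if item.get("type") == "computer_call" and xy in xy2desc:
            rest = {k: v for k, v in action.items() if k not in ("x", "y")}
            item = {**item, "action": {**rest, "element_description": xy2desc[xy]}}
        out.append(item)
    return out


def desc_to_xy(items: List[Item], desc2xy: Dict[str, Coords]) -> List[Item]:
    """Replace element descriptions with the coordinates stored in desc2xy."""
    out: List[Item] = []
    for item in items:
        action = item.get("action", {})
        desc = action.get("element_description")
        if item.get("type") == "computer_call" and desc in desc2xy:
            x, y = desc2xy[desc]
            rest = {k: v for k, v in action.items() if k != "element_description"}
            item = {**item, "action": {**rest, "x": x, "y": y}}
        out.append(item)
    return out


desc2xy = {"search text field": (412, 380)}
history = [{"type": "computer_call", "action": {"type": "click", "x": 412, "y": 380}}]
described = xy_to_desc(history, desc2xy)  # what the planning model sees
restored = desc_to_xy(described, desc2xy)  # what the main loop executes
print(described, restored)
```

Both helpers build new items instead of mutating the input, so the original history stays intact whichever direction is applied.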

How it works (simplified)

# Illustrative sketch, not the source: the five core steps of the composed loop
async def predict_step(self, messages, model, computer_handler, ...):
    grounding_model, thinking_model = model.split("+", 1)
    img = get_last_screenshot(messages) or await computer_handler.screenshot()

    # ① Replace coordinates in the history with descriptions, so the planner reads plain language
    msgs = xy2desc(messages, self.desc2xy)
    # ② The planning model emits actions (element_description only, no coordinates)
    resp = await litellm.acompletion(model=thinking_model, messages=to_completion(msgs), ...)
    items = to_responses_items(resp)

    # ③④ For each description, ask the grounding model for coordinates on the screenshot
    grounding_agent = find_agent_config(grounding_model).agent_class()
    for desc in get_all_element_descriptions(items):
        for _ in range(3):                       # up to 3 attempts
            xy = await grounding_agent.predict_click(grounding_model, img, desc)
            if xy:
                self.desc2xy[desc] = xy
                break

    # ⑤ Swap descriptions back to coordinates and hand control back to the main loop
    return {"output": desc2xy_convert(items, self.desc2xy), "usage": ...}

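What to notice: planning and grounding are two independent model calls, and the desc2xy dictionary is the only link between them. The grounding model is used here as a "coordinate function", relying on the predict_click capability described in chapter 02 (the kind of loop whose get_capabilities() returns ["click"]).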

The real implementation

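The seven-step flow is documented in the predict_step docstring (composed_grounded.py:150-162). The corresponding code: xy→desc (:216), convert to completion format (:218-221), call the thinking model (:241), convert back to items (:255-263), collect descriptions (:266), ground them to coordinates (:268-281), desc→xy (:284).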

The core loop that asks the grounding model for coordinates (composed_grounded.py:268-281, excerpt from the real source):

element_descriptions = get_all_element_descriptions(thinking_output_items)
if element_descriptions and last_image_b64:
    grounding_agent_conf = find_agent_config(grounding_model)   # reuses the registry!
    if grounding_agent_conf:
        grounding_agent = grounding_agent_conf.agent_class()
        for desc in element_descriptions:
            for _ in range(3):  # try 3 times
                coords = await grounding_agent.predict_click(
                    model=grounding_model, image_b64=last_image_b64, instruction=desc)
                if coords:
                    self.desc2xy[desc] = coords
                    break

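Note find_agent_config(grounding_model): the composed loop reuses the registry from chapter 02 to find the loop for its grounding half. There is a recursive elegance here: the composed loop is itself a loop in the registry, and internally it looks up the registry again.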

predict_click forwarding (composed_grounded.py:289-312): if someone calls predict_click directly on a composed model, only the grounding half is used.
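The retry loop and the forwarding behaviour can be exercised without any real model. This is a minimal sketch under stated assumptions: `FlakyGroundingAgent` and `ComposedSketch` are hypothetical classes, the stub fails once and then always answers (256, 540), and the real loop looks the agent up through the registry rather than receiving it in the constructor.

```python
import asyncio
from typing import Dict, List, Optional, Tuple

Coords = Tuple[int, int]


class FlakyGroundingAgent:
    """Hypothetical stand-in for a grounding loop: fails once, then answers."""

    def __init__(self) -> None:
        self.calls = 0
        self.last_model: Optional[str] = None

    async def predict_click(self, model: str, image_b64: str, instruction: str) -> Optional[Coords]:
        self.calls += 1
        self.last_model = model
        if self.calls < 2:
            return None  # simulate an unusable first answer
        return (256, 540)


class ComposedSketch:
    """Hypothetical composed loop reduced to grounding and forwarding."""

    def __init__(self, grounding_agent: FlakyGroundingAgent) -> None:
        self.grounding_agent = grounding_agent
        self.desc2xy: Dict[str, Coords] = {}

    async def ground(self, model: str, image_b64: str, descriptions: List[str], attempts: int = 3) -> None:
        grounding_model, _thinking_model = model.split("+", 1)
        for desc in descriptions:
            for _ in range(attempts):  # stop at the first usable answer
                coords = await self.grounding_agent.predict_click(
                    model=grounding_model, image_b64=image_b64, instruction=desc)
                if coords:
                    self.desc2xy[desc] = coords
                    break

    async def predict_click(self, model: str, image_b64: str, instruction: str) -> Optional[Coords]:
        # Forwarding: only the grounding half of "grounding+thinking" is used.
        grounding_model, _thinking_model = model.split("+", 1)
        return await self.grounding_agent.predict_click(
            model=grounding_model, image_b64=image_b64, instruction=instruction)


agent = FlakyGroundingAgent()
loop = ComposedSketch(agent)
asyncio.run(loop.ground("grounder+thinker", "<base64>", ["red submit button"]))
print(loop.desc2xy, agent.calls)
```

A description that fails on every attempt is simply left out of desc2xy in this sketch; how the later desc→xy step treats such a missing entry is not covered here.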

Key details/pitfalls

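Step 0 fills in a screenshot automatically. If the history contains no screenshot from a computer_call yet (at the very start), the loop takes one and inserts a message + computer_call(screenshot) + computer_call_output group into the history (composed_grounded.py:172-206), so the grounding model always has an image to look at.
The tool schema is rewritten into a "description version". _prepare_tools_for_grounded replaces the computer tool with GROUNDED_COMPUTER_TOOL_SCHEMA, whose parameters are element_description/start_element_description rather than x/y (composed_grounded.py:28-105). At the API level, the planning model can therefore only speak in descriptions.
Tool results carry no images. The conversion to completion format sets allow_images_in_tool_results=False (composed_grounded.py:220), because many planning models do not accept images inside tool results.
desc2xy accumulates across steps. The dictionary is an instance attribute, so description→coordinate mappings are reused across steps, avoiding repeated grounding.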

Side-by-side comparison

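This planning/grounding decoupling pairs with the grounding loops of chapter 03 (for example GTA1, whose predict_click is at loops/gta1.py:99): a grounding loop implements only predict_click (get_capabilities() returns ["click"]), and its predict_step simply raises NotImplementedError (gta1.py:97). It is built to serve as a "grounding plugin" for the composed loop and cannot run a complete task on its own.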
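The shape of such a grounding-only loop can be sketched as follows. `GroundingOnlySketch` is a hypothetical class, not the GTA1 implementation: its predict_click returns a fixed placeholder instead of running a model, and the error message is invented for the demo.

```python
import asyncio
from typing import List, Optional, Tuple


class GroundingOnlySketch:
    """Hypothetical grounding-only loop: it can click, but it cannot plan."""

    def get_capabilities(self) -> List[str]:
        return ["click"]

    async def predict_click(self, model: str, image_b64: str, instruction: str) -> Optional[Tuple[int, int]]:
        # Placeholder: a real grounding loop would run the model on the image.
        return (0, 0)

    async def predict_step(self, messages, model, **kwargs):
        # Full-task stepping is deliberately unsupported.
        raise NotImplementedError("grounding-only loop: use it inside a composed loop")


loop = GroundingOnlySketch()
try:
    asyncio.run(loop.predict_step(messages=[], model="grounder"))
    step_supported = True
except NotImplementedError:
    step_supported = False
print(loop.get_capabilities(), step_supported)
```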

Code map

Topic	File path	Symbol
Composed loop	`agent/cua_agent/loops/composed_grounded.py`	`ComposedGroundedConfig`
Description-based tool schema	`agent/cua_agent/loops/composed_grounded.py`	`GROUNDED_COMPUTER_TOOL_SCHEMA`, `_prepare_tools_for_grounded`
xy↔desc conversion	`agent/cua_agent/responses.py`	`convert_computer_calls_xy2desc`, `convert_computer_calls_desc2xy`, `get_all_element_descriptions`
Grounding example	`agent/cua_agent/loops/gta1.py`	`GTA1Config.predict_click`, `smart_resize`