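ACI and Visual Grounding

This chapter covers the most engineering-heavy part of Agent S: the ACI (Agent-Computer Interface). It defines which actions the agent can take, and how each action maps the model's textual description of a target to precise screen coordinates.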

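2.1 What ACI is: a set of "parameterized" actions

The traditional approach has the model emit coordinates directly ("click (1340, 88)"), but the planner model (the "brain") is not accurate at reporting coordinates. ACI instead has the model describe the target in natural language and defers coordinate generation to execution time, where a grounding model produces them.

Each action is a method on OSWorldACI that takes a description rather than coordinates: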

# Excerpt from the real source, grounding.py:346
@agent_action
def click(self, element_description: str, num_clicks: int = 1,
          button_type: str = "left", hold_keys: List = []):
    """Click on the element
    Args:
        element_description:str, a detailed descriptions of which element
        to click on. This description should be at least a full sentence.
    ..."""
    coords1 = self.generate_coords(element_description, self.obs)  # description → (x, y)
    x, y = self.resize_coordinates(coords1)                       # scale to the real screen resolution
    ...
    command += f"pyautogui.click({x}, {y}, clicks={num_clicks}, ...)"
    return command  # returns an executable pyautogui string
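
To make the "description in, command string out" contract concrete, here is a minimal sketch. `ToyACI` and `fake_grounder` are hypothetical stand-ins invented for illustration and are not part of Agent S; a real grounding model would take the place of the lookup table.

```python
from typing import Callable, Dict, Tuple


# Hypothetical stand-in for a grounding model: maps a description to a point.
def fake_grounder(description: str) -> Tuple[int, int]:
    known: Dict[str, Tuple[int, int]] = {
        "The blue Save button in the top-right toolbar": (1340, 88),
    }
    return known[description]


class ToyACI:
    """Illustrative only: actions take descriptions and return code strings."""

    def __init__(self, grounder: Callable[[str], Tuple[int, int]]):
        self.grounder = grounder

    def click(self, element_description: str, num_clicks: int = 1) -> str:
        # Coordinates are resolved here, at execution time, not by the planner.
        x, y = self.grounder(element_description)
        return f"import pyautogui; pyautogui.click({x}, {y}, clicks={num_clicks})"


aci = ToyACI(fake_grounder)
command = aci.click("The blue Save button in the top-right toolbar")
print(command)
```

The planner only ever writes the sentence; the numbers appear when the method runs.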

Action list (all in grounding.py, all decorated with @agent_action):

Action	How the target is resolved	Location
`click` / `type` / `scroll` / `drag_and_drop`	Visual grounding model generates (x, y)	`grounding.py:346 / 413 / 605 / 474`
`highlight_text_span`	OCR word-level coordinates (no visual grounding model)	`grounding.py:503`
`hotkey` / `hold_and_press`	No coordinates needed; keyboard only	`grounding.py:621 / 631`
`open` / `switch_applications`	Platform-specific shortcut scripts	`grounding.py:391 / 374`
`set_cell_values`	Calls the LibreOffice UNO API directly (no UI clicks)	`grounding.py:527`
`call_code_agent`	Hands the task off to a nested Code Agent	`grounding.py:542` (see chapter `04`)
`save_to_knowledge`	Writes to the task-local text notes, `notes`	`grounding.py:465`
`done` / `fail` / `wait`	Return the sentinel strings `DONE` / `FAIL`, or a sleep	`grounding.py:649 / 664 / 657`

2.2 The `@agent_action` decorator: the action is the doc, the doc is the prompt

This is an elegant piece of design. The decorator body is only two lines:

# Real source, grounding.py:25
def agent_action(func):
    func.is_agent_action = True   # just sets a marker
    return func

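Its power comes from the fact that the system prompt is generated automatically from the marked methods. construct_simple_worker_procedural_memory (procedural_memory.py:12) uses inspect to walk the ACI class and splices the signature and docstring of every method carrying the is_agent_action marker straight into the prompt, telling the planner model "these are the functions you can call":

# Excerpt from the real source, procedural_memory.py:74-85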
for attr_name in dir(agent_class):
    if attr_name in skipped_actions:
        continue
    attr = getattr(agent_class, attr_name)
    if callable(attr) and hasattr(attr, "is_agent_action"):
        signature = inspect.signature(attr)
        procedural_memory += f"\n    def {attr_name}{signature}:\n    '''{attr.__doc__}'''\n"

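This brings two benefits:

Single source of truth: an action's docstring is both the documentation people read and the API description the model reads, so the two cannot drift apart.
Trimmable by platform or capability: skipped_actions can hide actions at runtime. For example, set_cell_values is hidden on non-Linux platforms, and call_code_agent is hidden when no code environment is available (worker.py:63-73), so the model never even sees actions it should not use.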
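
The mechanism can be tried in isolation. The sketch below reuses the decorator and the loop shape from the excerpts above; `ToyACI`, its method signatures, and `build_action_docs` are hypothetical names made up for this example, and the real function builds a much larger prompt around this fragment.

```python
import inspect


def agent_action(func):
    func.is_agent_action = True  # marker only, same idea as the real decorator
    return func


class ToyACI:
    # Signatures here are illustrative, not the real OSWorldACI signatures.
    @agent_action
    def click(self, element_description: str, num_clicks: int = 1):
        """Click on the described element."""

    @agent_action
    def set_cell_values(self, cell_values: dict):
        """Write values into spreadsheet cells."""

    def helper(self):
        """Unmarked, so it never reaches the prompt."""


def build_action_docs(agent_class, skipped_actions=()):
    """Collect signature + docstring of every marked, non-skipped method."""
    docs = ""
    for attr_name in dir(agent_class):
        if attr_name in skipped_actions:
            continue
        attr = getattr(agent_class, attr_name)
        if callable(attr) and hasattr(attr, "is_agent_action"):
            signature = inspect.signature(attr)
            docs += f"\n    def {attr_name}{signature}:\n    '''{attr.__doc__}'''\n"
    return docs


full_prompt = build_action_docs(ToyACI)
trimmed_prompt = build_action_docs(ToyACI, skipped_actions=["set_cell_values"])
print(trimmed_prompt)
```

Editing a docstring changes what the model is told on the next run, and adding a name to `skipped_actions` removes the action from the model's view without touching the class.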

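2.3 Visual grounding: `generate_coords`, from a description to a point

This is the core of "aiming". The flow is minimal: reset the grounding model, feed it the description plus the screenshot, then use a regex to pull the first two numbers out of the reply as (x, y).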

# Excerpt from the real source, grounding.py:230
def generate_coords(self, ref_expr: str, obs: Dict) -> List[int]:
    self.grounding_model.reset()
    prompt = f"Query:{ref_expr}\nOutput only the coordinate of one point in your response.\n"
    self.grounding_model.add_message(
        text_content=prompt, image_content=obs["screenshot"], put_text_last=True)
    response = call_llm_safe(self.grounding_model)
    numericals = re.findall(r"\d+", response)   # pull the integers out of the reply
    assert len(numericals) >= 2
    return [int(numericals[0]), int(numericals[1])]

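Key details:

The grounding model gets no system prompt (source comment: "UI-TARS demo does not use system prompt"). It is a model fine-tuned specifically to point at whatever it is told to find, so the simpler the prompt, the better.
Parsing is deliberately plain: re.findall(r"\d+") just grabs the integers, which tolerates whatever formatting noise surrounds them in the model's reply.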
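
To see what that regex tolerates and where it stops being safe, here is a small standalone sketch. `parse_point` is a hypothetical helper that mirrors the parsing step only; it raises ValueError where the excerpt uses an assert, and the sample replies are invented formats, not captured model output.

```python
import re
from typing import List


def parse_point(response: str) -> List[int]:
    """Take the first two runs of digits in the reply as (x, y)."""
    numericals = re.findall(r"\d+", response)
    if len(numericals) < 2:
        raise ValueError(f"no coordinate pair found in: {response!r}")
    return [int(numericals[0]), int(numericals[1])]


# Invented reply formats: brackets, wrapper tokens, and labels are all ignored.
samples = [
    "(1340, 88)",
    "<|box_start|>(235,512)<|box_end|>",
    "x=42 y=7",
]
parsed = [parse_point(s) for s in samples]
print(parsed)  # → [[1340, 88], [235, 512], [42, 7]]

# Caveat: decimals are split at the dot, so normalized outputs would be misread.
print(parse_point("0.52, 0.31"))  # → [0, 52]
```

The second print shows the trade-off: the parser is robust to wrappers but assumes integer pixel output, which is one reason the prompt asks for nothing except a single point.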

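2.4 Coordinate scaling: model coordinate space ≠ screen coordinate space

The grounding model is trained to output coordinates at a fixed resolution (for example 1920×1080), but the real screen may be a different size. Once a point comes back, it has to be rescaled proportionally:

# Real source, grounding.py:337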
def resize_coordinates(self, coordinates: List[int]) -> List[int]:
    grounding_width = self.engine_params_for_grounding["grounding_width"]
    grounding_height = self.engine_params_for_grounding["grounding_height"]
    return [
        round(coordinates[0] * self.width / grounding_width),
        round(coordinates[1] * self.height / grounding_height),
    ]

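This is why the CLI makes --grounding_width / --grounding_height mandatory (cli_app.py:286-297): the system can only scale correctly if you tell it which coordinate space the grounding model outputs in. The README also lists the values for different UI-TARS versions (for example, 1000×1000 for 72B).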
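
A worked example of the arithmetic, as a standalone function. This is a sketch: the free function and its tuple parameters are assumptions for illustration (the real method reads the sizes from instance attributes), and the 1920×1080 screen is just a sample value.

```python
from typing import List, Tuple


def resize_coordinates(
    point: Tuple[int, int],
    screen_size: Tuple[int, int],
    grounding_size: Tuple[int, int],
) -> List[int]:
    """Scale a point from the grounding model's space to the real screen."""
    screen_w, screen_h = screen_size
    grounding_w, grounding_h = grounding_size
    return [
        round(point[0] * screen_w / grounding_w),
        round(point[1] * screen_h / grounding_h),
    ]


# Model outputs in a 1000×1000 space; the screen is 1920×1080.
centre = resize_coordinates((500, 500), (1920, 1080), (1000, 1000))
print(centre)  # → [960, 540]

# When both spaces match, the point passes through unchanged.
same = resize_coordinates((1340, 88), (1920, 1080), (1920, 1080))
print(same)  # → [1340, 88]
```

Note that x and y are scaled by different factors (1.92 and 1.08 here), so passing the wrong grounding size skews clicks more the farther they are from the top-left corner.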

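2.5 Text highlighting takes a different route: OCR word-level grounding

Highlighting a span of text (selecting from one word through another) is awkward with visual grounding: what it needs is the exact boundary of a word, not a rough idea of where something is. So highlight_text_span goes through OCR instead:

get_ocr_elements (grounding.py:249) uses pytesseract to extract every word on screen as <id, text, bounding box> and assembles them into a word table.
generate_text_coords (grounding.py:286) feeds the phrase, the word table, and the screenshot to a text-grounding agent and asks it which word id to pick; alignment="start"/"end" controls whether the left or the right edge of that word's bounding box is used.
The start point is the left edge of the first word and the end point is the right edge of the last word; together they form a pyautogui.dragTo that performs the drag selection (grounding.py:503).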
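
The geometry of the steps above can be sketched without OCR or a model. Everything named here (`WordBox`, `word_anchor`, `drag_command`) is hypothetical; the word ids are passed in by hand instead of being chosen by the text-grounding agent, and using the vertical centre of the box for y is an assumption of this sketch rather than something the source states.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordBox:
    word_id: int
    text: str
    left: int
    top: int
    width: int
    height: int


def word_anchor(box: WordBox, alignment: str) -> Tuple[int, int]:
    """Left edge for "start", right edge for "end"; y is the box centre (assumed)."""
    y = box.top + box.height // 2
    if alignment == "start":
        return box.left, y
    if alignment == "end":
        return box.left + box.width, y
    raise ValueError(f"unknown alignment: {alignment!r}")


def drag_command(words: List[WordBox], start_id: int, end_id: int) -> str:
    """Build a drag from the first word's left edge to the last word's right edge."""
    by_id = {w.word_id: w for w in words}
    x1, y1 = word_anchor(by_id[start_id], "start")
    x2, y2 = word_anchor(by_id[end_id], "end")
    return (
        "import pyautogui; "
        f"pyautogui.moveTo({x1}, {y1}); "
        f"pyautogui.dragTo({x2}, {y2}, duration=1.0, button='left')"
    )


# A hand-written word table standing in for OCR output.
words = [
    WordBox(0, "Agent", left=100, top=200, width=60, height=20),
    WordBox(1, "S", left=165, top=200, width=12, height=20),
    WordBox(2, "rocks", left=182, top=200, width=55, height=20),
]
command = drag_command(words, start_id=0, end_id=2)
print(command)
```

Because the endpoints come from bounding-box edges, the selection starts and stops exactly at word boundaries, which a single "roughly here" point from a visual model cannot guarantee.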

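Why this is clever: the same "where does it land" problem gets two grounding methods, chosen by the nature of the target. Elements and icons use the visual model (semantic matching); text selections use OCR (exact boundaries). Not every nail gets the same hammer.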

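2.6 Actions as code strings: why not call the functions directly

Action methods return a string, typically a piece of pyautogui code, instead of executing anything themselves. For example, click returns "import pyautogui; pyautogui.click(1340, 88, ...)". The reasons:

Generation is decoupled from execution: the upper layer can first take the string and run a format check (eval it once to see whether it raises; see CODE_VALID_FORMATTER in chapter 03), and only exec it on the real machine after it passes.
Unicode-friendly: when the type action detects non-ASCII characters, it switches to clipboard paste (pyperclip.copy + ctrl/cmd+v) instead of writing character by character (grounding.py:450-459), sidestepping pyautogui's inability to type Chinese or emoji.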
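
Both points can be illustrated with a short sketch. `build_type_command` and `is_valid_python` are hypothetical helpers; the paste branch hard-codes ctrl+v (the cmd variant for macOS is omitted), and the check uses Python's built-in compile() to test syntax only, which is a lighter substitute for the eval-based check described above, not a copy of it.

```python
def build_type_command(text: str) -> str:
    """Return a pyautogui snippet as a string; nothing is executed here."""
    if text.isascii():
        # Plain ASCII can be typed key by key.
        return f"import pyautogui; pyautogui.write({text!r})"
    # Non-ASCII text goes through the clipboard instead.
    return (
        "import pyautogui, pyperclip; "
        f"pyperclip.copy({text!r}); "
        "pyautogui.hotkey('ctrl', 'v')"
    )


def is_valid_python(command: str) -> bool:
    """Syntax-only check: compiles the string without running it."""
    try:
        compile(command, "<action>", "exec")
        return True
    except SyntaxError:
        return False


ascii_cmd = build_type_command("hello")
unicode_cmd = build_type_command("你好")
print(ascii_cmd)
print(unicode_cmd)
print(is_valid_python(unicode_cmd), is_valid_python("pyautogui.click("))
```

Since the action is just data until something executes it, a malformed command can be rejected and regenerated before it ever touches the real desktop.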

Code map (navigation index)

Topic	File path	Symbol
Action base class / decorator	`gui_agents/s3/agents/grounding.py`	`ACI` / `agent_action`
All action implementations	`gui_agents/s3/agents/grounding.py`	`OSWorldACI` (`click`/`type`/`drag_and_drop`/`highlight_text_span`…)
Visual grounding	`gui_agents/s3/agents/grounding.py`	`generate_coords` / `resize_coordinates`
OCR text grounding	`gui_agents/s3/agents/grounding.py`	`get_ocr_elements` / `generate_text_coords`
Automatic prompt generation	`gui_agents/s3/memory/procedural_memory.py`	`construct_simple_worker_procedural_memory`
Direct spreadsheet writes (UNO)	`gui_agents/s3/agents/grounding.py`	`set_cell_values` / `SET_CELL_VALUES_CMD`
