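Chat UI: OpenAI Adapter and Reasoning Stream

Upstream model backends vary widely (llama.cpp, Ollama, HF router, OpenRouter…), and each one names its reasoning field differently. This chapter explains how Chat UI uses an adapter layer to normalize them into one internal format, and how it handles reasoning (<think>).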

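1. Endpoint abstraction: a single function

In Chat UI, "how to call a model" is abstracted as an Endpoint: an async function that takes a set of parameters (messages, preprompt, whether the request is multimodal, abortSignal…) and returns a unified token stream, AsyncIterable<TextGenerationStreamOutput>.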

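Each model is given a getEndpoint() method when it is assembled in models.ts (models.ts:137). In this build, for ordinary models it always returns endpoints.openai(...) (models.ts:148), that is, endpointOai (endpoints/openai/endpointOai.ts:53). Router-type models are the exception: they return makeRouterEndpoint (Chapter 4).

The benefit: generate.ts does not care who the upstream is; it simply runs for await over one unified stream. Switching protocols only requires swapping the Endpoint implementation.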
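
As a rough illustration of that contract, the sketch below defines a simplified Endpoint type and a consumer that only ever sees the unified stream. The field lists, `echoEndpoint`, and `collect` are hypothetical stand-ins for this example; the real Chat UI types carry more fields.

```typescript
// Hypothetical, simplified shapes; the real types in Chat UI carry more fields.
interface TextGenerationStreamOutput {
  token: { text: string };
  generated_text: string | null; // set only on the final chunk
}

interface EndpointParams {
  messages: { from: "user" | "assistant" | "system"; content: string }[];
  preprompt?: string;
  abortSignal?: AbortSignal;
}

type Endpoint = (params: EndpointParams) => Promise<AsyncIterable<TextGenerationStreamOutput>>;

// A stand-in endpoint that echoes the last message word by word.
const echoEndpoint: Endpoint = async ({ messages }) => {
  const words = (messages[messages.length - 1]?.content ?? "").split(" ");
  return (async function* () {
    let full = "";
    for (let i = 0; i < words.length; i++) {
      const text = (i > 0 ? " " : "") + words[i];
      full += text;
      yield { token: { text }, generated_text: i === words.length - 1 ? full : null };
    }
  })();
};

// The consumer only sees the unified stream, never the upstream protocol.
async function collect(endpoint: Endpoint, params: EndpointParams): Promise<string> {
  let answer = "";
  for await (const output of await endpoint(params)) {
    answer += output.token.text;
  }
  return answer;
}
```

Swapping `echoEndpoint` for any other function of the same type leaves `collect` untouched, which is the point of the abstraction.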

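2. endpointOai: assembling the request

endpointOai (the chat_completions branch at endpointOai.ts:166) does the following:

Build messages: prepareMessagesWithFiles converts internal messages to the OpenAI format; for multimodal requests it turns images into image_url blocks.
Handle the preprompt/system message: if a system message already exists, the preprompt is prepended to it; otherwise a system message is inserted (endpointOai.ts:186).
Build parameters: temperature/top_p/stop/max_tokens and so on; the token limit is sent as either max_tokens or max_completion_tokens, selected by the useCompletionTokens switch (endpointOai.ts:217).
Provider suffix: on HuggingChat the model id can be written as model:together to pin a provider (endpointOai.ts:210).
Call and adapt: openai.chat.completions.create(...), then hand the returned stream to openAIChatToTextGenerationStream for conversion.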
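
Two of the steps above, the preprompt merge and the choice of token-limit parameter, can be sketched as pure functions. `applyPreprompt`, `buildBody`, and the blank-line separator are assumptions made for this example, not names or details taken from endpointOai.ts.

```typescript
// Hypothetical helpers; only the request field names come from the OpenAI chat-completions format.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Prepend the preprompt to an existing leading system message, or insert a new one.
function applyPreprompt(messages: ChatMessage[], preprompt?: string): ChatMessage[] {
  if (!preprompt) return messages;
  const [first, ...rest] = messages;
  if (first?.role === "system") {
    // Assumption: joined with a blank line; the real separator may differ.
    return [{ role: "system", content: `${preprompt}\n\n${first.content}` }, ...rest];
  }
  return [{ role: "system", content: preprompt }, ...messages];
}

interface GenerationSettings {
  temperature?: number;
  top_p?: number;
  stop?: string[];
  max_tokens?: number;
}

// Send the token limit under exactly one of the two parameter names.
function buildBody(
  model: string,
  messages: ChatMessage[],
  settings: GenerationSettings,
  useCompletionTokens: boolean
): Record<string, unknown> {
  const { max_tokens, ...sampling } = settings;
  const limit = useCompletionTokens ? { max_completion_tokens: max_tokens } : { max_tokens };
  return { model, messages, stream: true, ...sampling, ...limit };
}
```

Keeping both steps free of I/O makes them easy to test separately from the network call.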

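A custom fetch (endpointOai.ts:80) wraps the call: it captures X-Router-Route / x-inference-provider from the response headers as a fallback source of "routing metadata" (some providers send it only in headers, not in the stream).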
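
A minimal sketch of such a wrapper follows. The header names are the ones mentioned above; `withRouteHeaders`, `RouteMetadata`, and the reduced `ResponseLike` shape are hypothetical and exist only to keep the example self-contained.

```typescript
// Hypothetical wrapper; minimal structural types stand in for the real fetch/Response.
interface HeadersLike {
  get(name: string): string | null;
}
interface ResponseLike {
  headers: HeadersLike;
}
type FetchLike<R extends ResponseLike> = (url: string, init?: unknown) => Promise<R>;

interface RouteMetadata {
  route?: string;
  provider?: string;
}

// Wrap a fetch function so routing headers are recorded as a side effect.
function withRouteHeaders<R extends ResponseLike>(
  baseFetch: FetchLike<R>,
  metadata: RouteMetadata
): FetchLike<R> {
  return async (url, init) => {
    const response = await baseFetch(url, init);
    // Keep any earlier value when the header is absent.
    metadata.route = response.headers.get("X-Router-Route") ?? metadata.route;
    metadata.provider = response.headers.get("x-inference-provider") ?? metadata.provider;
    return response; // the response itself is passed through untouched
  };
}
```

The caller can later prefer metadata found in the stream and fall back to this object when the stream carries none.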

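3. Reasoning: stitching reasoning into a <think> block

This is the core of the chapter. The problem: providers express the "thinking process" in different ways:

Provider behavior	Example
A separate `delta.reasoning` field	Some HF router models
A separate `delta.reasoning_content` field	The DeepSeek family
`<think>…</think>` emitted directly in the body	Native DeepSeek R1

The unified representation Chat UI wants: the body text itself is <think>thinking</think>final answer, so the client can recognize and collapse the thinking block with a single regex.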
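
To make the "single regex" idea concrete, here is one way a client could split the unified text. The pattern and `splitThinking` are assumptions for illustration, not the regex Chat UI's client actually uses.

```typescript
// Assumption: at most one <think> block matters; an unclosed block runs to the end of the text.
const THINK_BLOCK = /<think>([\s\S]*?)(?:<\/think>|$)/;

function splitThinking(text: string): { thinking: string; answer: string } {
  const match = THINK_BLOCK.exec(text);
  if (!match) return { thinking: "", answer: text };
  return {
    thinking: match[1].trim(),
    // Everything outside the matched block is treated as the visible answer.
    answer: (text.slice(0, match.index) + text.slice(match.index + match[0].length)).trim(),
  };
}
```

Note that with a pattern like this an unclosed <think> captures everything after it, so the answer comes back empty.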

openAIChatToTextGenerationStream (openai/openAIChatToTextGenerationStream.ts:8) is that "stitcher". At its core is a thinkOpen state machine:

On a reasoning delta:
   thinkOpen=false → emit "<think>" + reasoning, set thinkOpen=true
   thinkOpen=true  → emit reasoning as-is

On a content delta:
   thinkOpen=true  → emit "</think>" to close the block, then emit content, set thinkOpen=false
   thinkOpen=false → emit content as-is

An illustrative sketch (the skeleton of that state machine):

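// Illustrative, not the source: stitch separate reasoning/content deltas into a <think>…</think> body
async function* stitchReasoning(stream) {
  let thinkOpen = false;
  let full = "";                     // accumulated text, returned with the last chunk
  for await (const chunk of stream) {
    const choice = chunk.choices?.[0];
    const delta = choice?.delta ?? {};
    const reasoning = delta.reasoning ?? delta.reasoning_content ?? "";
    const content = delta.content ?? "";
    let out = "";
    if (reasoning) {                 // reasoning fragment
      out += thinkOpen ? reasoning : "<think>" + reasoning;
      thinkOpen = true;
    }
    if (content) {                   // answer fragment
      out += thinkOpen ? "</think>" + content : content;
      thinkOpen = false;
    }
    full += out;
    const last = Boolean(choice?.finish_reason); // sketch: treat any finish_reason as the end
    yield { token: { text: out /* … */ }, generated_text: last ? full : null };
  }
}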

For the real implementation see openAIChatToTextGenerationStream.ts:100-123. The non-streaming version, openAIChatToTextGenerationSingle (:170), works the same way: it wraps the reasoning once as a <think>…</think> prefix.
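
The non-streaming case needs no state machine. A minimal sketch, assuming a message object with the same field names as the streaming deltas (`wrapReasoning` and `CompletionMessage` are hypothetical names):

```typescript
// Hypothetical shape: only the message fields named in the text are modeled.
interface CompletionMessage {
  content?: string | null;
  reasoning?: string | null;
  reasoning_content?: string | null;
}

// Prefix the answer with a <think>…</think> block when reasoning is present.
function wrapReasoning(message: CompletionMessage): string {
  const reasoning = message.reasoning ?? message.reasoning_content ?? "";
  const content = message.content ?? "";
  return reasoning ? `<think>${reasoning}</think>${content}` : content;
}
```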

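4. How reasoning is used downstream

The stitched, unified stream flows back into generate.ts, which has a second layer of reasoning handling for models that mark reasoning boundaries with tokens themselves (the model.reasoning config, whose type can be tokens/regex/summarize):

tokens: use beginToken/endToken (for example <think>/</think>) to cut out the reasoning segment and trim it from the final answer to avoid duplication (generate.ts:144).
regex: use a regular expression to extract the final answer from the reasoning buffer (generate.ts:113).
summarize: after reasoning ends, call the model once more to condense the long reasoning into a single paragraph (generate.ts:116).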
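
The tokens and regex modes above can be sketched as follows. The config shape, `splitByTokens`, `extractAnswer`, and the convention that the first capture group holds the final answer are all assumptions for this example; the summarize mode is omitted because it needs a model call.

```typescript
// Hypothetical config shape modeled on the modes described above.
type ReasoningConfig =
  | { type: "tokens"; beginToken: string; endToken: string }
  | { type: "regex"; regex: string };

// Cut the span between beginToken and endToken out of the text.
function splitByTokens(text: string, beginToken: string, endToken: string) {
  const start = text.indexOf(beginToken);
  if (start === -1) return { reasoning: "", answer: text };
  const bodyStart = start + beginToken.length;
  const end = text.indexOf(endToken, bodyStart);
  if (end === -1) {
    // Reasoning never closed: everything after beginToken counts as reasoning.
    return { reasoning: text.slice(bodyStart), answer: text.slice(0, start) };
  }
  return {
    reasoning: text.slice(bodyStart, end),
    answer: text.slice(0, start) + text.slice(end + endToken.length),
  };
}

function extractAnswer(text: string, config: ReasoningConfig): string {
  if (config.type === "tokens") {
    return splitByTokens(text, config.beginToken, config.endToken).answer.trim();
  }
  // regex mode: assume the first capture group holds the final answer.
  const match = new RegExp(config.regex, "s").exec(text);
  return match?.[1]?.trim() ?? text;
}
```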

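There is also an optional "reasoning status summary": with REASONING_SUMMARY enabled, generateSummaryOfReasoning (reasoning.ts:4) is called roughly every 4 s to produce a one-line status message describing what the model is currently thinking about (generate.ts:223).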
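
The "roughly every 4 s" cadence is a time-based throttle. The helper below is a hypothetical sketch of that idea with an injectable clock; it is not taken from generate.ts, and whether the real code fires on the very first chunk is an assumption here.

```typescript
// Hypothetical helper: decides when a new status summary is due.
function createSummaryThrottle(intervalMs = 4000, now: () => number = Date.now) {
  let lastRun = -Infinity; // assumption: the first check always passes
  return function shouldSummarize(): boolean {
    const t = now();
    if (t - lastRun < intervalMs) return false;
    lastRun = t;
    return true;
  };
}
```

Inside the streaming loop, a caller would check `shouldSummarize()` on each chunk and only then start a summary request, so the extra model calls stay bounded regardless of token rate.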

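5. Clever bits

Endpoint = single-function abstraction: switching upstream protocols means swapping one function, with zero changes to the generation logic (endpointOai.ts:53).
thinkOpen state machine: unifies the three provider styles of reasoning into <think> body text, so the client needs only one regex (openAIChatToTextGenerationStream.ts:100).
Header fallback for routing metadata: if the provider does not send it in the stream, it is read from the response headers (endpointOai.ts:80).
Auto-closing an unclosed <think>: in the tool loop, a </think> is forcibly appended at the end of each round; otherwise the client regex would swallow all subsequent content into the thinking block (runMcpFlow.ts:626).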
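
The auto-close step can be illustrated with a small helper. `closeDanglingThink` is a hypothetical name, and the check used in runMcpFlow.ts may differ; this version simply compares the positions of the last opening and closing tags.

```typescript
// Append </think> when the most recent <think> has no closing tag after it.
function closeDanglingThink(text: string): string {
  const lastOpen = text.lastIndexOf("<think>");
  const lastClose = text.lastIndexOf("</think>");
  return lastOpen > lastClose ? text + "</think>" : text;
}
```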

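6. Boundaries and limitations

This layer assumes the upstream speaks standard OpenAI chat-completions; fields that deviate from the protocol (each vendor's reasoning field) have to be special-cased one by one, and newly introduced field names must be added by hand.
Multimodal support covers images only, and they are compressed to ≤1MB / 1024px and converted to jpeg (endpointOai.ts:36).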

Code map

Topic	File	Symbols
OpenAI endpoint	`src/lib/server/endpoints/openai/endpointOai.ts`	`endpointOai`, `endpointOAIParametersSchema`
Stream adaptation + reasoning stitching	`src/lib/server/endpoints/openai/openAIChatToTextGenerationStream.ts`	`openAIChatToTextGenerationStream`, `openAIChatToTextGenerationSingle`
Downstream reasoning handling	`src/lib/server/textGeneration/generate.ts`	`generate`
Reasoning summary	`src/lib/server/textGeneration/reasoning.ts`	`generateSummaryOfReasoning`
Model assembly	`src/lib/server/models.ts`	`addEndpoint`, `getEndpoint`