Chapter 2: cognify, or How Text Becomes a Knowledge Graph

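This is Cognee's core value: turning a passage of text into entity nodes plus relationship edges. This chapter covers three things: what the LLM extracts, what the extracted output looks like, and, the neatest trick of all, how entities with the same name are automatically merged into a single node.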

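2.1 The small problem it solves

Pure vector RAG splits text into chunks, stores their embeddings, and answers questions by retrieving similar chunks. But if "Einstein was born in Ulm" and "Ulm is in Germany" sit in different paragraphs, vector RAG struggles to connect them to answer "In which country was Einstein born?"

The goal of cognify is to extract the entities in the text (Einstein, Ulm, Germany) and the relationships between them (born_in, located_in) explicitly into a graph, so that cross-paragraph connections become edges that can be traversed.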
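To make the multi-hop idea concrete, here is a minimal sketch that stores the two facts as triples and answers the question by following two edges. The triples are typed in by hand and the helper names (`build_adjacency`, `follow`) are illustrative assumptions, not Cognee APIs.

```python
from collections import defaultdict

# Toy triples typed in by hand; in Cognee they would come from LLM extraction.
TRIPLES = [
    ("Einstein", "born_in", "Ulm"),       # stated in one paragraph
    ("Ulm", "located_in", "Germany"),     # stated in another paragraph
]


def build_adjacency(triples):
    """Index edges by source node so they can be followed hop by hop."""
    adjacency = defaultdict(list)
    for source, relation, target in triples:
        adjacency[source].append((relation, target))
    return adjacency


def follow(adjacency, start, relations):
    """Follow a fixed sequence of relation names; return the end node or None."""
    current = start
    for relation in relations:
        targets = [t for r, t in adjacency.get(current, []) if r == relation]
        if not targets:
            return None
        current = targets[0]
    return current


graph = build_adjacency(TRIPLES)
birth_country = follow(graph, "Einstein", ["born_in", "located_in"])
print(birth_country)  # Germany
```

Neither triple alone answers the question; the answer only appears once both edges live in the same graph.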

2.2 What the extracted output looks like: the KnowledgeGraph model

The LLM is asked to output a KnowledgeGraph, which is simply a set of nodes plus a set of edges (cognee/shared/data_models.py:73, non-Gemini branch):

# Actual model (cognee/shared/data_models.py)
from pydantic import BaseModel

class Node(BaseModel):
    id: str
    name: str = ""
    type: str          # entity category, e.g. "Person" / "City"
    description: str

class Edge(BaseModel):
    source_node_id: str
    target_node_id: str
    relationship_name: str          # e.g. "born_in"
    description: str | None         # the one-sentence fact this edge expresses

class KnowledgeGraph(BaseModel):
    nodes: list[Node]
    edges: list[Edge]

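Note that the field description for Edge.description asks the LLM to write "a one-sentence concrete fact expressed using the names of both endpoints" (data_models.py:67-71). This fact text later becomes context material fed to the LLM at retrieval time.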
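As an illustration of why the edge description matters, the sketch below turns a small graph into context lines. It uses plain dataclasses as stand-ins for the Pydantic models, and `edges_to_context` is a hypothetical helper, not a function from the Cognee codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    id: str
    name: str
    type: str
    description: str


@dataclass
class Edge:
    source_node_id: str
    target_node_id: str
    relationship_name: str
    description: Optional[str] = None


def edges_to_context(nodes, edges):
    """Hypothetical helper: one context line per edge.

    Prefers the edge's own one-sentence fact; falls back to
    "<source name> <relationship> <target name>" when it is missing.
    """
    names = {node.id: node.name for node in nodes}
    lines = []
    for edge in edges:
        if edge.description:
            lines.append(edge.description)
        else:
            source = names[edge.source_node_id]
            target = names[edge.target_node_id]
            lines.append(f"{source} {edge.relationship_name} {target}")
    return "\n".join(lines)


nodes = [
    Node("n1", "Einstein", "Person", "Physicist"),
    Node("n2", "Ulm", "City", "City in Germany"),
    Node("n3", "Germany", "Country", "Country in Europe"),
]
edges = [
    Edge("n1", "n2", "born_in", "Einstein was born in Ulm."),
    Edge("n2", "n3", "located_in"),  # no description supplied
]
context = edges_to_context(nodes, edges)
print(context)
```

The edge that carries a description reads as a natural sentence, while the fallback line is terser, which is one reason to ask the LLM for the description up front.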

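Pitfall: Gemini does not allow empty dictionaries in the data model, so the repository keeps a separate set of Node/KnowledgeGraph definitions for Gemini (the if get_llm_config().llm_provider.lower() == "gemini" branch at data_models.py:11-46). One concept, two sets of models: any change has to be made in both places.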

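2.3 The graph extraction step in the default pipeline

Recall from Chapter 1 that step ③ of cognify's five default steps is extract_graph_and_summarize (cognee/tasks/graph/extract_graph_and_summarize.py). It actually does two things concurrently: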

# Actual source (extract_graph_and_summarize): asyncio.gather runs both coroutines concurrently
result_chunks = await asyncio.gather(
    extract_graph_from_data(data_chunks=..., graph_model=graph_model, ...),  # extract entities/relationships
    summarize_text(data_chunks=..., summarization_model=...),                # summarize each chunk
)
return result_chunks[1]   # only the summaries go downstream; the graph is already in the DB as a side effect

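Key point: the extracted graph is not returned. It is written directly to the graph store as a side effect inside extract_graph_from_data → integrate_chunk_graphs; what flows downstream is the summaries (TextSummary).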
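The "side effect plus return value" pattern can be sketched with stdlib asyncio alone. Everything here is a stand-in: `FAKE_GRAPH_DB` is a plain list playing the graph store, and the two coroutines only mimic the shape of the real tasks.

```python
import asyncio

FAKE_GRAPH_DB = []  # stand-in for the graph store (assumption for this sketch)


async def extract_graph(chunks):
    """Writes to the store as a side effect; its return value is ignored below."""
    for chunk in chunks:
        FAKE_GRAPH_DB.append(("chunk_graph", chunk))
    return chunks


async def summarize(chunks):
    """Produces one toy summary per chunk."""
    return [f"summary of: {chunk}" for chunk in chunks]


async def extract_and_summarize(chunks):
    # gather preserves argument order: index 0 is the graph result, index 1 the summaries
    results = await asyncio.gather(extract_graph(chunks), summarize(chunks))
    return results[1]


chunks = ["Einstein was born in Ulm.", "Ulm is in Germany."]
summaries = asyncio.run(extract_and_summarize(chunks))
print(summaries)
print(len(FAKE_GRAPH_DB))  # 2
```

Because `asyncio.gather` returns results in the order the awaitables were passed, indexing with `[1]` reliably selects the summaries regardless of which coroutine finishes first.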

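2.4 The neatest trick: deterministic node ids = free deduplication

Problem

Two documents both mention "Einstein". If every extraction assigned it a random UUID, the graph would end up with two unrelated Einstein nodes, and their relationships would never connect.

Idea

Derive the node id from the node's identity fields instead of generating it randomly. Same name → same id → the nodes naturally merge into one when written to the store.

Actual implementation

All graph nodes inherit from DataPoint (cognee/infrastructure/engine/models/DataPoint.py). The Entity model declares identity_fields: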

# Actual source cognee/modules/engine/models/Entity.py
class Entity(DataPoint):
    name: str
    description: str
    ...
    # the deterministic id is derived from name; Entity.id_for(name) yields the same value
    metadata: dict = {"index_fields": ["name"], "identity_fields": ["name"]}

How the id is computed is shown in DataPoint.id_for (DataPoint.py:154-170):

# Actual source DataPoint.id_for: the class name itself serves as the namespace prefix
@classmethod
def id_for(cls, *values):
    joined = "|".join(cls._normalize_identity_value(v) for v in values)
    return uuid5(NAMESPACE_OID, f"{cls.__name__}:{joined}")

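A few key points:

Class name as namespace: Entity:einstein and EntityType:einstein get different ids; callers cannot forget to add a prefix either, because the class supplies it (docstring at DataPoint.py:155-162).
Case/whitespace normalization: _normalize_identity_value lowercases the value, turns spaces into underscores, and strips single quotes (DataPoint.py:148-151). So "Einstein" and "einstein" yield the same id.
Effective at construction: when no explicit id is passed, DataPoint.__init__ goes through _generate_identity_id (which in turn delegates to id_for) to compute the id automatically (DataPoint.py:77-86). So an instance created with Entity(name="Einstein", ...) is guaranteed to have the same id as Entity.id_for("Einstein").

Intuition in one sentence: a node's id is a hash fingerprint of its identity, so the same thing always lands on the same node in the graph no matter how many times it is mentioned. Deduplication needs no extra comparison: equal ids mean the same node.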
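The behaviour described above can be reproduced with the stdlib uuid module. This is a sketch, not the Cognee source: `DataPointSketch` and `normalize` are stand-ins, and the normalization only approximates what the text says `_normalize_identity_value` does (lowercase, spaces to underscores, single quotes removed).

```python
from uuid import NAMESPACE_OID, uuid5


def normalize(value):
    """Approximation of the normalization described in the text (assumption)."""
    return str(value).lower().replace(" ", "_").replace("'", "")


class DataPointSketch:
    """Stand-in base class; only the id derivation is modelled."""

    @classmethod
    def id_for(cls, *values):
        joined = "|".join(normalize(v) for v in values)
        # The class name is baked into the hashed string, so subclasses never share ids.
        return uuid5(NAMESPACE_OID, f"{cls.__name__}:{joined}")


class Entity(DataPointSketch):
    pass


class EntityType(DataPointSketch):
    pass


print(Entity.id_for("Einstein") == Entity.id_for("einstein"))      # True
print(Entity.id_for("New York") == Entity.id_for("new_york"))      # True
print(Entity.id_for("einstein") == EntityType.id_for("einstein"))  # False
```

Since uuid5 is a pure function of its namespace and name, two processes that never talk to each other still compute the same id for the same entity name.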

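2.5 Entity expansion and edge integration

The raw extracted KnowledgeGraph (Node/Edge with plain string ids) has to be turned into real DataPoint nodes and edges. That happens in expand_with_nodes_and_edges (cognee/modules/graph/utils/expand_with_nodes_and_edges.py), called from integrate_chunk_graphs (cognee/tasks/graph/extract_graph_from_data.py:108-112):

First, retrieve_existing_edges pulls the edges already in the graph and builds an existing_edges_map, so edges are not added twice (extract_graph_from_data.py:104-107).
Key-based deduplication via _create_node_key(node_id, category) / _create_edge_key(...) prevents the same node or edge from being created twice within one batch (expand_with_nodes_and_edges.py:19-26).
If an ontology (a controlled vocabulary) is attached, get_subgraph aligns each extracted entity to the closest class in the ontology; on a successful match the ontology node's id is used instead (closest_class in _create_type_node, expand_with_nodes_and_edges.py:128-141). This lets free-form extraction converge on a canonical vocabulary.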
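The first two points, skipping edges that already exist and skipping repeats within a batch, can be sketched as follows. The key formats and the `expand` function are assumptions made for illustration; the real `_create_node_key` / `_create_edge_key` may build their keys differently.

```python
def node_key(node_id, category):
    """Hypothetical key: one entry per (id, category) pair."""
    return (node_id, category)


def edge_key(source_id, target_id, relationship_name):
    """Hypothetical key: one entry per directed, named relationship."""
    return (source_id, target_id, relationship_name)


def expand(nodes, edges, existing_edges_map):
    """Return only the nodes and edges that are new to this batch and to the store."""
    added_nodes = {}
    for node_id, category in nodes:
        # setdefault keeps the first occurrence and ignores later duplicates
        added_nodes.setdefault(node_key(node_id, category), (node_id, category))

    seen_edges = set(existing_edges_map)  # start from what the graph already holds
    new_edges = []
    for source_id, target_id, relationship_name in edges:
        key = edge_key(source_id, target_id, relationship_name)
        if key in seen_edges:
            continue
        seen_edges.add(key)
        new_edges.append((source_id, target_id, relationship_name))
    return list(added_nodes.values()), new_edges


existing = {edge_key("einstein", "ulm", "born_in"): True}
new_nodes, new_edges = expand(
    nodes=[("einstein", "Person"), ("ulm", "City"), ("einstein", "Person")],
    edges=[
        ("einstein", "ulm", "born_in"),      # already stored, skipped
        ("ulm", "germany", "located_in"),    # new
        ("ulm", "germany", "located_in"),    # repeat within the batch, skipped
    ],
    existing_edges_map=existing,
)
print(new_nodes)  # [('einstein', 'Person'), ('ulm', 'City')]
print(new_edges)  # [('ulm', 'germany', 'located_in')]
```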

2.6 The chunking step: DocumentChunk

The input to graph extraction is DocumentChunk (cognee/modules/chunking/models/DocumentChunk.py), a text chunk that inherits from DataPoint:

# Actual source DocumentChunk (selected fields)
class DocumentChunk(DataPoint):
    text: str
    chunk_index: int
    is_part_of: Document
    contains: List[Union[Entity, Event, tuple[Edge, Entity]]] = None  # entities extracted from this chunk
    metadata: dict = {"index_fields": ["text"]}   # the text field gets embedded as a vector

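Note index_fields: ["text"]. It tells the system to embed the text field as a vector, so the chunk itself is both a graph node and retrievable by vector search (used in Chapters 3 and 4).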
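One way to picture how index_fields could drive embedding is the sketch below. `ChunkSketch` is a dataclass stand-in for DocumentChunk, and `fields_to_embed` is a hypothetical helper showing how a generic indexer might read the metadata; it does not reflect Cognee's actual indexing code.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkSketch:
    """Stand-in for DocumentChunk with only the fields needed here."""
    text: str
    chunk_index: int
    metadata: dict = field(default_factory=lambda: {"index_fields": ["text"]})


def fields_to_embed(data_point):
    """Hypothetical helper: collect the attribute values named in index_fields."""
    names = data_point.metadata.get("index_fields", [])
    return {name: getattr(data_point, name) for name in names}


chunk = ChunkSketch(text="Einstein was born in Ulm.", chunk_index=0)
to_embed = fields_to_embed(chunk)
print(to_embed)  # {'text': 'Einstein was born in Ulm.'}
```

Under this reading, the model declares what is searchable and the indexer stays generic: an Entity with index_fields ["name"] would go through the same helper unchanged.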

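2.7 Temporal variant

Passing temporal_cognify=True swaps in a different set of tasks: chunks are first extracted into events plus timestamps, and the graph is built from those (get_temporal_tasks, cognee/api/v1/cognify/cognify.py:344; tasks extract_events_and_timestamps → extract_knowledge_graph_from_events). This confirms the reuse idea from Chapter 1: changing capability = changing the Task list.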
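The "swap the task list" idea can be shown with a toy pipeline runner. All functions below are illustrative stand-ins with made-up behaviour; only the pattern of choosing a task list from a flag mirrors what the text describes.

```python
def run_tasks(data, tasks):
    """Feed the output of each task into the next one."""
    for task in tasks:
        data = task(data)
    return data


def chunk_text(text):
    """Toy chunker: one chunk per sentence."""
    return [part.strip() for part in text.split(".") if part.strip()]


def tag_as_graph(chunks):
    """Toy stand-in for the default graph extraction step."""
    return [("graph", chunk) for chunk in chunks]


def tag_as_events(chunks):
    """Toy stand-in for the temporal event extraction step."""
    return [("event", chunk) for chunk in chunks]


def select_tasks(temporal=False):
    """Same runner, different task list depending on the flag."""
    return [chunk_text, tag_as_events] if temporal else [chunk_text, tag_as_graph]


text = "Einstein was born in 1879. He moved to Bern in 1902."
default_result = run_tasks(text, select_tasks())
temporal_result = run_tasks(text, select_tasks(temporal=True))
print(default_result[0])   # ('graph', 'Einstein was born in 1879')
print(temporal_result[0])  # ('event', 'Einstein was born in 1879')
```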

2.8 Code map

Topic	File path	Symbol
LLM graph extraction entry point	`cognee/tasks/graph/extract_graph_and_summarize.py`	`extract_graph_and_summarize`
Graph integration / store write	`cognee/tasks/graph/extract_graph_from_data.py`	`integrate_chunk_graphs`
Node/edge expansion and deduplication	`cognee/modules/graph/utils/expand_with_nodes_and_edges.py`	`expand_with_nodes_and_edges`, `_create_node_key`
KnowledgeGraph model	`cognee/shared/data_models.py`	`KnowledgeGraph`, `Node`, `Edge`
Deterministic id	`cognee/infrastructure/engine/models/DataPoint.py`	`id_for`, `_generate_identity_id`
Entity model	`cognee/modules/engine/models/Entity.py`	`Entity`
Text chunk model	`cognee/modules/chunking/models/DocumentChunk.py`	`DocumentChunk`
