ColPali Meets DocLayNet: Building a Visual Question-Answering Tool That Can "Read" Document Layout

Published 2025-8-14 07:58


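This article presents a multimodal RAG system that combines ColPali with DocLayNet. It uses vision-language modeling to understand layout information such as tables and charts, which improves answer accuracy and context awareness when answering questions over complex documents.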

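Introduction

Retrieval-augmented generation (RAG) has become the standard paradigm for building open-domain and domain-specific question-answering systems. Traditionally, RAG pipelines rely heavily on text-based retrievers that index and retrieve passages using dense or sparse embeddings. These methods work well for plain text, but they often struggle with visually complex documents such as scientific papers, financial reports, or scanned PDFs, where key information is embedded in tables, figures, or structured layouts that do not convert cleanly to plain text.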

Figure from the paper: compared with traditional approaches, ColPali simplifies the document retrieval pipeline while delivering higher performance and lower latency (source: arXiv).

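To overcome these limitations, recent work by Manuel Faysse et al. introduced ColPali (ICLR 2025), a vision-language retrieval framework that retrieves document content from page images using ColBERT-style late interaction over visual embeddings. Alongside it, YOLO-DocLayNet is a fast, layout-aware object detector trained on DocLayNet, the layout dataset of Pfitzmann et al. (see the references), and is used to extract document components such as tables, figures, and section headers with high accuracy and efficiency.

In this article, I walk through a practical implementation of a hybrid RAG pipeline that combines ColPali and YOLO-DocLayNet to answer questions over documents that are both read and seen.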

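The visual blind spot in RAG systems

Despite their success with text queries, most RAG systems overlook a key problem: they largely ignore visual elements such as tables, charts, and figures, which carry essential meaning in many real-world documents. Consider the image below.

Image: excerpt from the SLB 2023 annual report (source: the report).

Applying a common OCR tool to extract the text of the table gives the following result.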

Option Awards Stock Awards
 Name
 Option/ 
PSU/RSU 
...
...
... (truncated)
Shares, Units, or 
Other Rights That 
Have Not Vested 
($)(1)
 D. Ralston 1/20/2021 67,220(2) 3,498,129
 1/20/2021 33,610(3) 1,749,064
 2/3/2021 29,390(4) 1,529,456
 1/19/2022 64,587(5) 3,361,107
 1/19/2022 22,321(6) 1,161,585
 1/18/2023 41,668(7) 2,168,403
 1/18/2023 14,427(8) 750,781

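Clearly, the OCR result fails to capture the multi-level header structure and the column grouping. The text comes out flat and linear, and the link between each value and its metric is lost. This makes it hard to tell which category a value belongs to (for example, grant date versus stock award), which reduces the usefulness of the extracted data for downstream analysis.

User query: What is the market value of unearned shares, units, or other rights that have not vested for the 1/20/2021 grant date?

RAG response: The market value of unearned shares, units, or other rights that have not vested for the 1/20/2021 grant date is $1,749,064.

Because the extracted information lacks structure and context, the RAG response is inaccurate: the system cannot reliably tie a value to its intended meaning in the original table.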
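To make the ambiguity concrete, here is a minimal sketch that parses three rows copied from the OCR output above. The helper `group_values_by_date` is hypothetical and is not part of the article's pipeline; it only shows that, once the column headers are gone, a lookup by grant date returns more than one candidate amount with nothing left to say which column each belongs to.

```python
from collections import defaultdict

# Rows copied from the flattened OCR output: grant date, share count with a
# footnote marker, and a dollar amount. The column headers are gone.
ocr_rows = [
    "1/20/2021 67,220(2) 3,498,129",
    "1/20/2021 33,610(3) 1,749,064",
    "2/3/2021 29,390(4) 1,529,456",
]

def group_values_by_date(rows):
    """Group the trailing dollar amount of each row by its leading date."""
    grouped = defaultdict(list)
    for row in rows:
        date, _shares, amount = row.split()
        grouped[date].append(int(amount.replace(",", "")))
    return dict(grouped)

candidates = group_values_by_date(ocr_rows)
print(candidates["1/20/2021"])  # two candidate amounts for the same date
```

A text-only retriever sees exactly this kind of flat data, so any answer keyed on the date alone is a guess between candidates.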

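Let us look at another example that further illustrates the limits of conventional RAG systems.

Image: chart excerpt from the SLB 2023 annual report (source: the report).

This is the OCR result extracted from the image above.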

Total Shareholder Return
 Indexed ($100)
 $50.0 \n$40.0 \n$30.0 \n$20.0 \n$10.0 \n$0.0-$10.0-$20.0
 2020 2021 2023 2022
 CEO CAP Avg. NEO CAP SLB TSR-$10.6
 $2.2 \n$32.4 \n$13.1 \n$39.4 \n$26.2 \n$17.3 \n$9.3 
 ...
 ...
 ... (truncated)
 $40.00 \n$20.00 \n$0.00
 OSX TSR
 CAP vs. Total Shareholder Return
 (SLB and OSX)
 (in millions of US dollars)
 Compensation Actually Paid

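As can be seen, the OCR output for this chart also lacks structural consistency. It fails to capture the relationships between data points, such as which bars or labels correspond to a given year or to the shareholder-return line. It also fails to align values with their visual elements, making it hard to tell apart the CEO CAP, NEO CAP, and TSR values for each year.

Question: What was the SLB Total Shareholder Return (TSR) in 2022?

RAG response: The SLB Total Shareholder Return (TSR) in 2022 was $134.09.

As in the previous example, the RAG response here is inaccurate because structure and context were lost in the OCR-extracted data.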

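Multimodal RAG architecture

Author's illustration: the multimodal RAG architecture used in our experiment.

The architecture has two main components: an indexing pipeline and a chat inference pipeline. During indexing, the document corpus is processed along two parallel paths. The first path uses the YOLO-DocLayNet detector to identify visual elements such as tables and figures, then embeds them with the ColPali image encoder and stores them in an image vector collection. The second path uses PyTesseract OCR to extract raw text from the document, encodes it with the Mxbai-Embed model, and stores it in a text vector collection in the same vector database.

During chat inference, the user query is encoded by both the Mxbai-Embed encoder for text retrieval and the ColPali encoder for vision-language retrieval. The system then performs dual retrieval (text-to-text and text-to-image) against the respective vector collections. The retrieved text and image regions are forwarded to a multimodal LLM (Llama 4), which combines both modalities to produce a context-aware response. This design pairs text understanding with fine-grained visual reasoning for document question answering (QA).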

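My test setup

A. Environment setup

To run the full multimodal RAG pipeline efficiently, I used a single NVIDIA RTX A6000 GPU with 48 GB of VRAM, which provides enough memory to run ColPali, the YOLO model, and the sentence-embedding model.

For the software environment, I recommend Miniconda to isolate dependencies and keep the setup reproducible.

1. Create the Conda environment

conda create -n multimodal_rag python=3.11
conda activate multimodal_rag

2. Prepare requirements.txt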

ultralytics 
git+https://github.com/illuin-tech/colpali 
groq 
pillow
pymilvus 
sentence-transformers 
uvicorn 
fastapi 
opencv-python 
pytesseract 
PyMuPDF 
pydantic 
chainlit 
pybase64 
huggingface_hub[hf_transfer]

3. Install the Python dependencies

pip install -r requirements.txt

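B. Pretrained model setup

To reproduce this experiment, you need to download the three pretrained models used in the retrieval and layout-extraction flow. All models are stored under the pretrained_models/ directory for consistency and easier loading.

The commands below download them with the Hugging Face CLI, with hf_transfer enabled for faster downloads:

# Download YOLO DocLayNet for layout detection
HF_HUB_ENABLE_HF_TRANSFER=1 hf download hantian/yolo-doclaynet --local-dir pretrained_models/yolo-doclaynet

# Download ColQwen 2.5 (the ColPali-style model) for image-based retrieval
HF_HUB_ENABLE_HF_TRANSFER=1 hf download vidore/colqwen2.5-v0.2 --local-dir pretrained_models/colqwen2.5-v0.2

# Download the Mixedbread Embed Large model for text-based retrieval
HF_HUB_ENABLE_HF_TRANSFER=1 hf download mixedbread-ai/mxbai-embed-large-v1 --local-dir pretrained_models/mxbai-embed-large-v1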

C. Code setup and initialization

import os
from typing import cast

import torch
from ultralytics import YOLO
from transformers.utils.import_utils import is_flash_attn_2_available
from colpali_engine.models.paligemma.colpali.processing_colpali import ColPaliProcessor
from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor
from sentence_transformers import SentenceTransformer

# Select the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Knowledge-base source directory and target image directory
document_source_dir = "document_sources"
img_dir = "image_database"
os.makedirs(img_dir, exist_ok=True)  # make sure the directory exists

# YOLO-12L-DocLayNet
yolo_model = YOLO("pretrained_models/yolo-doclaynet/yolov12l-doclaynet.pt")
yolo_model = yolo_model.to(device)

# ColQwen2.5 (ColPali)
colpali_model = ColQwen2_5.from_pretrained(
    "pretrained_models/colqwen2.5-v0.2",
    torch_dtype=torch.bfloat16,
    device_map=device,  # or "mps" on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
colpali_processor = ColQwen2_5_Processor.from_pretrained("pretrained_models/colqwen2.5-v0.2")
processor = cast(ColPaliProcessor, colpali_processor)

# Mxbai-embed-large-v1
embed_model = SentenceTransformer("pretrained_models/mxbai-embed-large-v1", device=device)

# Colors used to draw each entity type
ENTITIES_COLORS = {
    "Picture": (255, 72, 88),
    "Table": (128, 0, 128),
}

print("FINISH SETUP...")

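The code above initializes the core components of the multimodal retrieval system. It selects the device (GPU or CPU), makes sure the image output directory exists, and loads three pretrained models: the YOLOv12L-DocLayNet model for detecting layout elements such as tables and pictures, the ColQwen2.5 model and its processor for vision-language embedding of cropped image regions, and the mxbai-embed-large-v1 model for embedding text through SentenceTransformers. It also defines a color map for the detected entity types to support visualization during preprocessing.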

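Indexing pipeline

Author's illustration: the indexing pipeline.

The indexing pipeline turns raw documents into structured, searchable text and visual representations. In this section we walk through the implementation step by step, showing how the code processes, encodes, and stores document content.

A. Preparation

Before going further, let us prepare a few helper functions that are useful for inspection.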

import matplotlib.pyplot as plt
import cv2

def display_img_array(img_arr):
    # Convert BGR (OpenCV order) to RGB and show the image with matplotlib
    image_rgb = cv2.cvtColor(img_arr, cv2.COLOR_BGR2RGB)
    plt.figure(figsize=(30, 30))
    plt.imshow(image_rgb)
    plt.axis('off')
    plt.show()

def show_layout_detection(detection_results, img_arr):
    # Note: boxes and labels are drawn on img_arr in place
    for result in detection_results:
        boxes = result.boxes  # detected boxes
        for box in boxes:
            x, y, w, h = box.xywh[0]  # box coordinates (center x, center y, width, height)
            x, y, w, h = int(x), int(y), int(w), int(h)  # convert to integers
            conf = box.conf.item()  # confidence score
            cls = int(box.cls.item())  # class ID
            label = f"{yolo_model.model.names[cls]} {conf:.2f}"
            color = ENTITIES_COLORS[yolo_model.model.names[cls]]  # color for this class
            top_left = (x - w // 2, y - h // 2)
            bottom_right = (x + w // 2, y + h // 2)
            # Class-colored box with a label in the same color
            cv2.rectangle(img_arr, top_left, bottom_right, color, 2)
            cv2.putText(img_arr, label, (top_left[0], top_left[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, color, 2)
    display_img_array(img_arr)

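The two helper functions above are used to visualize layout detection. display_img_array converts a BGR image to RGB and shows it with matplotlib, while show_layout_detection overlays bounding boxes and labels on the detected layout elements (for example, tables and pictures) using the YOLO detection results and the class-specific colors.

Next, let us load a specific page of the PDF, as follows.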

import fitz  # PyMuPDF
import numpy as np
import cv2
import os

# File name and page index
page_id = 38
filename = "SLB-2023-Annual-Report.pdf"

# Read the document and render the page
doc = fitz.open(os.path.join(document_source_dir, filename))
page = doc.load_page(page_id)
pix = page.get_pixmap(dpi=300)
img_rgb = np.frombuffer(pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, pix.n))
img_page = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2BGR)

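This code uses PyMuPDF to load a specific page from the PDF file, renders it at high resolution (300 dpi), and converts it to a BGR image array suitable for OpenCV processing.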
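As a rough guide to the image size this produces, the sketch below works out the pixel dimensions of a rendered page from its size in PDF points (1 point = 1/72 inch). The function name `rendered_size_px` is hypothetical, and the US Letter page size is an assumption for illustration; the actual report page may differ, and a renderer may round by a pixel.

```python
def rendered_size_px(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at `dpi`, given its size in PDF points."""
    scale = dpi / 72  # PDF points are 1/72 inch
    return round(width_pt * scale), round(height_pt * scale)

# A US Letter page is 612 x 792 points (assumed page size, for illustration).
print(rendered_size_px(612, 792, 300))  # (2550, 3300)
```

At 300 dpi a single page is therefore several megapixels, which helps small table text survive detection and OCR at the cost of memory.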

B. Layout detection

Now let us write the layout detection code, as follows.

def layout_detection(img_doc):
    return yolo_model.predict(
        source=img_doc,
        classes=[6, 8],
        conf=0.25,
        iou=0.45)

layout_results = layout_detection(img_page)
show_layout_detection(layout_results, img_page)

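In this step, we restrict detection to class labels 6 (Picture) and 8 (Table), the visual regions that will be embedded. These label names match the keys of ENTITIES_COLORS defined earlier.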
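Hard-coded class ids are easy to get wrong, so one option is to derive them from label names. The sketch below is an illustration only: `DOCLAYNET_NAMES` is an assumed 0-indexed label map for a DocLayNet-trained detector and `class_ids_for` is a hypothetical helper. In the real pipeline the mapping should be read from `yolo_model.names` rather than typed in.

```python
# Assumed 0-indexed class names of a DocLayNet-trained detector; verify against
# `yolo_model.names` before relying on this ordering.
DOCLAYNET_NAMES = {
    0: "Caption", 1: "Footnote", 2: "Formula", 3: "List-item",
    4: "Page-footer", 5: "Page-header", 6: "Picture",
    7: "Section-header", 8: "Table", 9: "Text", 10: "Title",
}

def class_ids_for(names, wanted):
    """Return the sorted class ids whose label is in `wanted`."""
    missing = set(wanted) - set(names.values())
    if missing:
        raise KeyError(f"unknown class labels: {sorted(missing)}")
    return sorted(idx for idx, label in names.items() if label in wanted)

visual_class_ids = class_ids_for(DOCLAYNET_NAMES, {"Picture", "Table"})
print(visual_class_ids)  # [6, 8]
```

The resulting list could then be passed as the `classes` argument of `predict`, in place of the literal `[6, 8]`.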

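Image: page 37 of the SLB 2023 annual report (source: the report).

Here we have successfully located the relevant visual objects, such as the circular charts and the table, which will be used for visual embedding downstream.

C. Extract and save

Next, let us crop the localized table and picture regions from the image and mask them with white to remove them from the original page view.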

def extract_and_masking_images(img_doc, layout_results):
    height, width, _ = img_doc.shape
    extracted_imgs = []
    for box in layout_results[0].boxes:
        x, y, w, h = map(int, box.xywh[0])  # box coordinates (center x, center y, width, height)
        # Compute the top-left corner (x_min, y_min) and the bottom-right corner
        x_min = x - w // 2
        y_min = y - h // 2
        x_max = x_min + w
        y_max = y_min + h
        # Clamp the coordinates to the image bounds
        x_start = max(0, x_min)
        y_start = max(0, y_min)
        x_end = min(width, x_max)
        y_end = min(height, y_max)
        # Skip the region if it is empty
        if x_start >= x_end or y_start >= y_end:
            continue
        # Copy the region into the extracted_imgs list
        extracted_imgs.append(img_doc[y_start:y_end, x_start:x_end].copy())
        # Paint the region white on the page image
        img_doc[y_start:y_end, x_start:x_end] = [255, 255, 255]
    return extracted_imgs, img_doc

extracted_imgs, img_page = extract_and_masking_images(img_page, layout_results)
display_img_array(img_page)

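This function crops the detected table and picture regions out of the page image and masks those regions with white, returning the cropped regions and the updated image. The updated page image is shown below.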
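The coordinate handling is the part most likely to go wrong, so here is the same center-to-corner conversion and clamping isolated as a pure function. The name `xywh_to_clamped_xyxy` is hypothetical; the arithmetic mirrors the function above.

```python
def xywh_to_clamped_xyxy(cx, cy, w, h, img_width, img_height):
    """Convert a center-format box to corner format, clamped to the image.

    Returns (x_start, y_start, x_end, y_end), or None when the clamped
    region is empty.
    """
    x_min, y_min = cx - w // 2, cy - h // 2
    x_max, y_max = x_min + w, y_min + h
    x_start, y_start = max(0, x_min), max(0, y_min)
    x_end, y_end = min(img_width, x_max), min(img_height, y_max)
    if x_start >= x_end or y_start >= y_end:
        return None
    return x_start, y_start, x_end, y_end

print(xywh_to_clamped_xyxy(50, 40, 60, 40, 100, 100))    # (20, 20, 80, 60)
print(xywh_to_clamped_xyxy(10, 10, 40, 40, 100, 100))    # (0, 0, 30, 30)
print(xywh_to_clamped_xyxy(150, 150, 20, 20, 100, 100))  # None
```

The second call shows a box that spills over the top-left edge being trimmed, and the third shows a box entirely outside the image being rejected.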

Image: page 37 of the SLB 2023 annual report (source: the report).

Next, let us save the extracted figures with the following code.

import os
import cv2

def save_img_files(extracted_imgs, filename, page_id):
    # Target path
    save_path = os.path.join(img_dir, filename, f"page_{page_id}/")
    # Make sure the directory exists
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    # Save the images
    for i in range(len(extracted_imgs)):
        cv2.imwrite(save_path + f"fig_{i}.jpg", extracted_imgs[i])

save_img_files(extracted_imgs, filename, page_id)

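This code saves the extracted picture and table images under the local target directory img_dir.

D. Text OCR

In this step, we run plain OCR with pytesseract on the masked page image to extract the remaining text.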

import pytesseract

text = pytesseract.image_to_string(img_page)
print(text)

The result is as follows:

Short-Term Cash Incentive Awards

We pay performance-based short-term (annual) cash incentives to
our executives to foster a results-driven, pay-for-performance culture,
and to align executives’ interests with those of our shareholders. STI
awards are earned according to the achievement of quantitative
Company financial and non-financial objectives, as well as strategic
objectives. Our Compensation Committee selects performance
measures that it believes support our strategy and strike a balance
between motivating our executives to increase near-term financial and
operating results and driving profitable long-term Company growth
and value for shareholders.

2022 STI Opportunity Mix

Compensation Discussion and Analysis

For 2023, 70% of our NEOs’ target STI opportunity was based on
achieving quantitative Company financial objectives, 10% was based
on achieving quantitative Company non-financial objectives, and 20%
was based on strategic personal objectives. The financial portion of
the target plan was evenly split between adjusted EBITDA and free
cash flow performance goals. The total maximum STI payout for 2023
was 200% of target—consistent with 2022—and the weighted payout
range for each metric as a percentage of target is reflected by the outer
bars in the 2023 STI Opportunity Mix chart below.

2023 STI Opportunity Mix

»>

»>

In January 2023, our Compensation Committee determined to leave the target STI opportunity for all NEOs unchanged from 2022, following
a review of market data indicating that our NEOs’ target STI opportunity (as a percentage of base salary) was competitively positioned. As a
result, the 2023 target STI opportunity for our CEO was 150% of his base salary and for our other NEOs it was 100% of base salary.

The following table reflects our NEOs’ full-year 2023 STI results, together with relevant weightings of the different components and payouts

under each component.

(1) Equals the sum of the financial, non-financial, and personal portions of the STI achieved, shown as a percentage of base salary.
(2) In January 2024, due to factors not contemplated in the 2023 forecast, our Compensation Committee applied a discretionary downward
adjustment to reduce all executive payouts by 5% under our 2023 STI plan.

2024 Proxy Statement

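E. Indexing with the Milvus DB client

In this step we use the Milvus database for vector storage and retrieval. For this experiment we use a simple local instance backed by the file milvus_file.db rather than a scalable production-grade deployment.

1. Retriever classes

Let us define two retrievers: a ColBERT-style retriever for fine-grained image-based retrieval and a basic dense retriever for text-based retrieval.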

from pymilvus import MilvusClient, DataType
import numpy as np
import concurrent.futures
import os
import base64

class MilvusColbertRetriever:
    def __init__(self, milvus_client, collection_name, img_dir, dim=128):
        # Initialize the retriever with a Milvus client, a collection name,
        # and the dimension of the vector embeddings.
        # If the collection exists, load it.
        self.collection_name = collection_name
        self.client = milvus_client
        if self.client.has_collection(collection_name=self.collection_name):
            self.client.load_collection(collection_name)
        self.dim = dim
        self.img_dir = img_dir

    def create_collection(self):
        # Create a new collection in Milvus to store the embeddings.
        # Drop the collection first if it already exists, then define its schema.
        if self.client.has_collection(collection_name=self.collection_name):
            self.client.drop_collection(collection_name=self.collection_name)
        schema = self.client.create_schema(auto_id=True, enable_dynamic_field=True)
        schema.add_field(field_name="pk", datatype=DataType.INT64, is_primary=True)
        schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=self.dim)
        schema.add_field(field_name="seq_id", datatype=DataType.INT16)
        schema.add_field(field_name="doc_id", datatype=DataType.INT64)
        schema.add_field(field_name="doc", datatype=DataType.VARCHAR, max_length=65535)
        self.client.create_collection(collection_name=self.collection_name, schema=schema)

    def create_index(self):
        # Create an index on the vector field for fast similarity search.
        # Release the collection and drop any existing index before creating the new one.
        self.client.release_collection(collection_name=self.collection_name)
        self.client.drop_index(collection_name=self.collection_name, index_name="vector")
        index_params = self.client.prepare_index_params()
        index_params.add_index(
            field_name="vector",
            index_name="vector_index",
            index_type="IVF_FLAT",
            metric_type="IP",
        )
        self.client.create_index(collection_name=self.collection_name, index_params=index_params, sync=True)

    def search(self, data, topk):
        # Run a vector search over the collection to find the top-k most similar documents.
        search_params = {"metric_type": "IP", "params": {}}
        results = self.client.search(
            self.collection_name,
            data,
            limit=int(50),
            output_fields=["vector", "seq_id", "doc_id", "$meta"],
            search_params=search_params,
        )
        # Collect the metadata of every candidate document hit by any query vector
        doc_meta = {}
        for r_id in range(len(results)):
            for r in range(len(results[r_id])):
                entity = results[r_id][r]["entity"]
                doc_id = entity["doc_id"]
                if doc_id not in doc_meta:
                    doc_meta[doc_id] = {
                        "page_id": entity["page_id"],
                        "fig_id": entity["fig_id"],
                        "filename": entity["filename"],
                    }
        scores = []

        def rerank_single_doc(doc_id, data, client, collection_name):
            # Rerank one document: fetch all of its embeddings and compute
            # its late-interaction (MaxSim) similarity to the query.
            doc_colbert_vecs = client.query(
                collection_name=collection_name,
                filter=f"doc_id in [{doc_id}]",
                output_fields=["seq_id", "vector", "doc"],
                limit=1000,
            )
            doc_vecs = np.vstack(
                [doc_colbert_vecs[i]["vector"] for i in range(len(doc_colbert_vecs))]
            )
            score = np.dot(data, doc_vecs.T).max(1).sum()
            return (score, doc_id)

        with concurrent.futures.ThreadPoolExecutor(max_workers=300) as executor:
            futures = {
                executor.submit(
                    rerank_single_doc, doc_id, data, self.client, self.collection_name
                ): doc_id
                for doc_id in doc_meta.keys()
            }
            for future in concurrent.futures.as_completed(futures):
                score, doc_id = future.result()
                meta = doc_meta[doc_id]
                img_path = os.path.join(self.img_dir, meta["filename"], f"page_{meta['page_id']}", f"fig_{meta['fig_id']}.jpg")
                with open(img_path, "rb") as f:
                    img_base64 = base64.b64encode(f.read()).decode('utf-8')
                scores.append({
                    "score": float(score),
                    "page_id": meta["page_id"],
                    "fig_id": meta["fig_id"],
                    "filename": meta["filename"],
                    "content": img_base64})

        scores.sort(key=lambda x: x["score"], reverse=True)
        if len(scores) >= topk:
            return scores[:topk]
        else:
            return scores

    def insert(self, data):
        # Insert a document's ColBERT embeddings and metadata into the collection.
        # The data is stored as multiple rows (one per vector in the sequence),
        # each carrying the same metadata.
        colbert_vecs = [vec for vec in data["colbert_vecs"]]
        seq_length = len(colbert_vecs)
        self.client.insert(
            self.collection_name,
            [
                {
                    "vector": colbert_vecs[i],
                    "seq_id": i,
                    "doc_id": data["doc_id"],
                    "doc": "",
                    "page_id": data["page_id"],
                    "fig_id": data["fig_id"],
                    "filename": data["filename"],
                }
                for i in range(seq_length)
            ],
        )

class MilvusBasicRetriever:
    def __init__(self, milvus_client, collection_name, dim=1024):
        # Initialize the retriever with a Milvus client, a collection name,
        # and the dimension of the vector embeddings.
        # If the collection exists, load it.
        self.collection_name = collection_name
        self.client = milvus_client
        if self.client.has_collection(collection_name=self.collection_name):
            self.client.load_collection(collection_name)
        self.dim = dim

    def normalize(self, vec):
        # Scale the vector to unit length (zero vectors are returned unchanged)
        norm = np.linalg.norm(vec)
        if norm == 0:
            return vec
        return vec / norm

    def create_collection(self):
        # Create a new collection in Milvus to store the embeddings.
        # Drop the collection first if it already exists, then define its schema.
        if self.client.has_collection(collection_name=self.collection_name):
            self.client.drop_collection(collection_name=self.collection_name)
        schema = self.client.create_schema(auto_id=True, enable_dynamic_field=True)
        schema.add_field(field_name="pk", datatype=DataType.INT64, is_primary=True)
        schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=self.dim)
        schema.add_field(field_name="content", datatype=DataType.VARCHAR, max_length=65535)
        self.client.create_collection(collection_name=self.collection_name, schema=schema)

    def create_index(self):
        # Create an index on the vector field for fast similarity search.
        # Release the collection and drop any existing index before creating the new one.
        self.client.release_collection(collection_name=self.collection_name)
        self.client.drop_index(collection_name=self.collection_name, index_name="vector")
        index_params = self.client.prepare_index_params()
        index_params.add_index(
            field_name="vector",
            index_name="vector_index",
            index_type="IVF_FLAT",  # or any other index type you want
            metric_type="IP",  # or the appropriate metric type
        )
        self.client.create_index(collection_name=self.collection_name, index_params=index_params, sync=True)

    def search(self, data, topk):
        # Run a vector search over the collection to find the top-k most similar documents.
        normalized_data = self.normalize(data)
        search_params = {"metric_type": "IP", "params": {}}
        results = self.client.search(
            self.collection_name,
            [normalized_data],
            limit=topk,
            output_fields=["vector", "content", "$meta"],
            search_params=search_params,
        )
        return_arr = []
        for hit in results[0]:
            return_arr.append({
                "score": hit.distance,
                "page_id": hit["entity"]["page_id"],
                "filename": hit["entity"]["filename"],
                "content": hit["entity"]["content"]
            })
        return return_arr

    def insert(self, data):
        # Normalize the vector before storing it so that IP equals cosine similarity
        data["vector"] = self.normalize(np.array(data["vector"])).tolist()
        self.client.insert(
            self.collection_name,
            [data]
        )

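This code defines two retriever classes that talk to Milvus to support hybrid retrieval:

MilvusColbertRetriever is designed for image-based retrieval with ColBERT-style multi-vector embeddings. It inserts patch-level visual embeddings from ColPali, reranks candidates with late interaction (MaxSim), and returns the best-matching image regions together with their metadata and base64-encoded content.
MilvusBasicRetriever handles text-based retrieval. It stores and searches a single dense vector per text from the sentence-embedding model, normalizes vectors so that the inner product equals cosine similarity, and returns the most relevant text chunks with their source metadata.

Both retrievers handle collection creation, indexing, and insertion, which allows flexible multimodal retrieval over visual and textual document content.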
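The late-interaction score is easier to follow on toy data. The sketch below reimplements it in plain Python; `maxsim_score` is a hypothetical name and the two-dimensional vectors are made up for illustration. It computes the same quantity as the NumPy expression `np.dot(data, doc_vecs.T).max(1).sum()` used in the reranking step: for each query vector, take the best inner product over all document vectors, then sum those maxima.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query vectors of the maximum
    inner product against any document vector."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]  # two query-token vectors
doc_a = [[0.9, 0.1], [0.2, 0.8]]  # has a good match for both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]  # has a good match for the first token only

print(maxsim_score(query, doc_a))  # about 1.7
print(maxsim_score(query, doc_b))  # about 1.1
```

Because every query token contributes its own best match, a document only scores highly when it covers all parts of the query, which is what makes the scheme suitable for fine-grained matching against image patches.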

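2. Retriever setup

Using the retriever classes defined above, let us initialize the ColBERT-style image retriever and the basic text retriever on a local Milvus instance backed by milvus_file.db for storage and retrieval.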

client = MilvusClient("milvus_file.db")
colbert_retriever = MilvusColbertRetriever(collection_name="colbert", milvus_client=client,img_dir=img_dir)
basic_retriever = MilvusBasicRetriever(collection_name="basic", milvus_client=client)

As an initialization step, we must create the collection and the index for both retrievers, as follows.

colbert_retriever.create_collection() 
colbert_retriever.create_index() 
basic_retriever.create_collection() 
basic_retriever.create_index()

3. Image data loader

Let us wrap the image-embedding process with the ColPali model in a data-loading function.

from colpali_engine.utils.torch_utils import ListDataset
from torch.utils.data import DataLoader
from typing import List
from tqdm import tqdm
from PIL import Image

def create_image_embedding_loader(extracted_imgs):
    # Note: the arrays come from OpenCV in BGR channel order;
    # Image.fromarray does not reorder channels.
    images = [Image.fromarray(img_arr) for img_arr in extracted_imgs]
    dataloader_images = DataLoader(
        dataset=ListDataset[str](images),
        batch_size=1,
        shuffle=False,
        collate_fn=lambda x: processor.process_images(x),
    )
    ds: List[torch.Tensor] = []
    for batch_doc in tqdm(dataloader_images):
        with torch.no_grad():
            batch_doc = {k: v.to(colpali_model.device) for k, v in batch_doc.items()}
            embeddings_doc = colpali_model(**batch_doc)
        # Split the batch into one multi-vector tensor per image
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
    return ds

embedding_loader = create_image_embedding_loader(extracted_imgs)
embedding_loader

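This function wraps the list of extracted image regions in a DataLoader, processes them with the ColPali model, and returns a list with one multi-vector embedding per image. The returned list looks like this.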

[tensor([[-0.0055, 0.0991, -0.0903, ..., -0.0474, -0.0042, -0.1138],
 [-0.0067, 0.1064, -0.0488, ..., -0.0723, 0.0535, -0.0986],
 [-0.0200, 0.1113, -0.1084, ..., -0.0747, 0.0447, -0.0786],
 ...,
 [-0.0027, 0.0811, -0.1602, ..., 0.0354, -0.0112, -0.1670],
 [-0.0557, -0.1099, 0.0128, ..., 0.0203, -0.0728, -0.0688],
 [ 0.1025, 0.0145, -0.0420, ..., 0.0894, -0.0413, 0.1650]], dtype=torch.bfloat16),
 tensor([[-0.0055, 0.0991, -0.0903, ..., -0.0474, -0.0042, -0.1138],
 [-0.0067, 0.1064, -0.0488, ..., -0.0723, 0.0535, -0.0986],
 [-0.0200, 0.1113, -0.1084, ..., -0.0747, 0.0447, -0.0786],
 ...,
 [-0.0141, 0.0645, -0.1377, ..., 0.0430, -0.0061, -0.1338],
 [-0.0835, -0.1094, 0.0049, ..., 0.0211, -0.0608, -0.0645],
 [ 0.1396, 0.0549, -0.0669, ..., 0.0942, 0.0038, 0.1514]], dtype=torch.bfloat16),
 tensor([[-0.0053, 0.0996, -0.0894, ..., -0.0471, -0.0042, -0.1128],
 [-0.0068, 0.1060, -0.0491, ..., -0.0713, 0.0532, -0.0986],
 [-0.0204, 0.1118, -0.1089, ..., -0.0752, 0.0444, -0.0791],
 ...,
 [ 0.0330, 0.0398, -0.0505, ..., 0.0586, 0.0250, -0.1099],
 [-0.0508, -0.0981, -0.0126, ..., 0.0183, -0.0791, -0.0713],
 [ 0.1387, 0.0698, -0.0330, ..., 0.0238, 0.0923, 0.0337]], dtype=torch.bfloat16)]

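4. Image and text indexing

In this step, let us index the data components extracted from the page image, both images and text, into the database collections.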

import random

for i in range(len(extracted_imgs)):
    data = {
        "colbert_vecs": embedding_loader[i].float().numpy(),
        "doc_id": random.getrandbits(63),
        "page_id": page_id,
        "fig_id": i,
        "filename": filename,
    }
    colbert_retriever.insert(data)

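This code inserts each image embedding into the ColBERT-style retriever together with its metadata, assigning a randomly generated 63-bit doc_id and storing the page and figure indices.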
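Because the doc_id is random, re-running the indexing loop for the same page stores the same region again under a new id. If repeatable ids are preferred, one possible alternative is to derive the id from the region's metadata. The sketch below is not what the article's code does; `stable_doc_id` is a hypothetical helper that hashes the file name, page, and figure index and keeps 63 bits so the value fits a signed 64-bit integer field.

```python
import hashlib

def stable_doc_id(filename: str, page_id: int, fig_id: int) -> int:
    """Derive a repeatable 63-bit integer id from a region's metadata."""
    key = f"{filename}|{page_id}|{fig_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Take 64 bits of the digest and drop one so the result is below 2**63.
    return int.from_bytes(digest[:8], "big") >> 1

a = stable_doc_id("SLB-2023-Annual-Report.pdf", 38, 1)
b = stable_doc_id("SLB-2023-Annual-Report.pdf", 38, 1)
c = stable_doc_id("SLB-2023-Annual-Report.pdf", 38, 0)
print(a == b, a != c)
```

With ids like this, an indexing job could check whether a region is already present before inserting it again.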

data = {
 "vector": embed_model.encode(text),
 "content": text,
 "page_id": page_id,
 "filename": filename
}
basic_retriever.insert(data)

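This code inserts the text embedding and its metadata (content, page ID, and file name) into the basic text retriever for indexing.

5. Retriever test

Finally, let us test the retrievers set up above.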

query = "O.Le Peuch Payout Results in percentage according to SLB Financial Objectives"
batch_query = colpali_processor.process_queries([query]).to(device)
embeddings_query = torch.unbind(colpali_model(**batch_query).to("cpu"))[0].float().numpy()
colbert_retriever_result = colbert_retriever.search(embeddings_query, topk=3)
colbert_retriever_result

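This code embeds the user query with the ColPali model and retrieves the top 3 most relevant image regions from the ColBERT-style retriever. The result is as follows.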

[{ 'score' : 20.13466208751197, 
 'page_id' : 38, 
 'fig_id' : 1, 
 'filename' : 'SLB-2023-Annual-Report.pdf' , 
 'content' : '/9j/4AAQSkZJRgABAQAAAQABAAD/...' }, 
{ 'score' : 20.13466208751197, 
 'page_id' : 38, 
 'fig_id' : 1, 
 'filename' : 'SLB-2023-Annual-Report.pdf' , 
 'content' : '/9j/4AAQSkZJRgABAQAAAQABAAD/...' }, 
{ 'score' : 15.088707166083623, 
 'page_id' : 41, 
 'fig_id' : 1, 
 'filename' : 'SLB-2023-Annual-Report.pdf' , 
 'content' : '/9j/4AAQSkZJRgABAQAAAQABAAD/...' }]

Next, let us test the basic retriever.

query = "Potential Payout as a % of Target Opportunity"
basic_retriever_result = basic_retriever.search(embed_model.encode(query), topk=3)
basic_retriever_result

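This code encodes the query with the text-embedding model and retrieves the top 3 most relevant text entries from the basic retriever. The result is as follows.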

[{'score': 0.6565427184104919,
 'page_id': 38,
 'filename': 'SLB-2023-Annual-Report.pdf',
 'content': 'Short-Term Cash Incentive Awards\n\nWe pay performance-based short-term (annual) cash incentives to\nour ... (truncated)'},
 {'score': 0.6533020734786987,
 'page_id': 40,
 'filename': 'SLB-2023-Annual-Report.pdf',
 'content': "Free Cash Flow Targets and Results\n\nIn January 2023, our Compensation Committee considered SLB's\n2022 ... (truncated)"},
 {'score': 0.6505128145217896,
 'page_id': 59,
 'filename': 'SLB-2023-Annual-Report.pdf',
 'content': "Executive Compensation Tables\n\nChange in Control\n\nUnder our omnibus incentive plans, in the event of ... (truncated)"}]

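The scores differ in scale because of how each retriever computes them. The ColBERT-style retriever sums, over all query token vectors, the maximum inner product (IP) against the document vectors, so its scores can grow well above 1. The basic retriever compares unit-normalized vectors, so its inner product is a cosine similarity and stays within the range -1 to +1.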
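A small numeric check makes the normalization point concrete. The sketch below uses plain Python and made-up two-dimensional vectors; the helper names are hypothetical and only mirror what the normalize step of the basic retriever does.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return inner_product(a, b) / (math.hypot(*a) * math.hypot(*b))

u, v = [3.0, 4.0], [4.0, 3.0]
print(inner_product(u, v))                        # 24.0, depends on vector length
print(inner_product(normalize(u), normalize(v)))  # about 0.96, equal to the cosine
print(cosine(u, v))                               # about 0.96
```

Once both vectors have unit length, the inner product and the cosine similarity coincide, which is why an IP index can serve cosine search.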

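Chat inference pipeline

Author's illustration: the chat inference pipeline.

The chat inference pipeline handles a user query by retrieving relevant content from both the text and the image embeddings, then generating an accurate, context-aware response. In this section we implement the query encoding, retrieval, and multimodal generation steps that complete the end-to-end question-answering workflow.

A. Preparation

For the chat-completion model we use Llama 4, a multimodal LLM developed by Meta. We access it through a Groq API key. The code is as follows.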

import os
from groq import Groq

# Groq API-Llama4
os.environ["GROQ_API_KEY"] = "<your-api-key>"
client_groq = Groq()

Next, let us prepare a few helper functions, as follows.

def url_conversion(img_base64):
    return f"data:image/jpeg;base64,{img_base64}"

def llama4_inference(messages, token=1024):
    completion = client_groq.chat.completions.create(
        model="meta-llama/llama-4-maverick-17b-128e-instruct",
        messages=messages,
        temperature=0.1,
        max_completion_tokens=token,
        top_p=1,
        stream=True,
        stop=None,
    )
    # Concatenate the streamed chunks into the full response text
    inference_result = ""
    for chunk in completion:
        chunk_inference = chunk.choices[0].delta.content or ""
        inference_result += chunk_inference
    text = inference_result
    return text

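This code defines a function that turns a base64-encoded image into a data URL, and a Llama inference function that runs streaming inference with the Llama 4 Maverick model through the Groq API.

B. User query and relevant context

Now let us define the user query and retrieve the relevant context, as follows.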

user_query = "I want to know the payout"
batch_query = colpali_processor.process_queries([user_query]).to(device)
embeddings_query = torch.unbind(colpali_model(**batch_query).to("cpu"))[0].float().numpy()
colbert_retriever_result = colbert_retriever.search(embeddings_query, topk=3)
basic_retriever_result = basic_retriever.search(embed_model.encode(user_query), topk=3)

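This code embeds the query with the ColPali model for text-to-image retrieval and with the sentence-embedding model for text-to-text retrieval, following the same approach as the retriever test above.

C. System instruction

Next, let us define the system instruction prompt for our Llama model, as follows.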

system_instruction = """
You are a helpful assistant designed to answer user queries based on document-related content.

You will be provided with two types of context:
1. Text-based context — extracted textual content from documents.
2. Image-based context — visual content (e.g., figures, tables, or screenshots) extracted from documents.

Your tasks are:
- Analyze the user query and determine the appropriate response using the available context.
- Decide whether the answer requires information from the image-based context.

If the image context is necessary to answer the query:
- Set "need_image" to True.
- Set "image_index" to the appropriate index of the image used (e.g., 0 for the first image, 1 for the second, and so on).
- Include a clear explanation or reasoning in the response.

If the image context is **not** needed:
- Set "need_image" to False.
- Set "image_index" to -1.

All responses **must be returned in strict JSON format**:
{"response": <string>, "need_image": <true|false>, "image_index": <int>}

If you are unsure or cannot answer based on the given context, clearly state that you do not know.

Examples:
{"response": "The chart in image 1 shows the revenue trend.", "need_image": true, "image_index": 1}
{"response": "The policy details are outlined in the text section.", "need_image": false, "image_index": -1}
"""

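This system instruction defines how the assistant should answer user queries from two kinds of document context, text-based and image-based. It directs the model to decide whether an image is needed to answer the query and to structure its reply as strict JSON with a flag (need_image) and, when applicable, an image_index. The instruction keeps responses consistent, interpretable, and multimodal-aware for document-understanding tasks.

D. Message payload

Next, let us create the message payload that will be passed to the Llama model API, as follows.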

# Start the payload with the user query
payload_content = [{
    "type": "text",
    "text": f"User Query: {user_query}"
}]

# Add the retrieved images as data URLs
for i in range(len(colbert_retriever_result)):
    img_payload = {
        "type": "image_url",
        "image_url": {"url": url_conversion(colbert_retriever_result[i]["content"])}
    }
    payload_content.append(img_payload)

# Add the text-based context
for i in range(len(basic_retriever_result)):
    txt_payload = {
        "type": "text",
        "text": f"Text-based Context #{i+1}:\n{basic_retriever_result[i]['content']}"
    }
    payload_content.append(txt_payload)

# Build the final message list
messages = [
    {
        "role": "system",
        "content": system_instruction
    },
    {
        "role": "user",
        "content": payload_content
    }
]

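This code builds the LLM input payload by combining the user query, the retrieved images (as base64 data URLs), and the text-based context into a structured message format. It then wraps this content together with the system instruction to form the final messages input for inference.

The constructed messages look like this.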

[{'role': 'system',
 'content': '\nYou are a helpful assistant designed to answer user queries based on document-related content.\n\nYou will be provided with two types of context:\n1. Text-based ... (truncated)'},
 {'role': 'user',
 'content': [{'type': 'text',
 'text': 'User Query: I want to know the payout'},
 {'type': 'image_url',
 'image_url': {'url': 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/...'}},
 {'type': 'image_url',
 'image_url': {'url': 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/...'}},
 {'type': 'image_url',
 'image_url': {'url': 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/...'}},
 {'type': 'text',
 'text': 'Text-based Context #1:\nExecutive Compensation Tables\n\nSummary Compensation Table\n\nThe following table sets forth information regarding the total compensation ... (truncated)'},
 {'type': 'text',
 'text': 'Text-based Context #2:\nExecutive Compensation Tables\n\nGrants of Plan-Based Awards in 2023\n\nThe following table provides additional information regarding cash ... (truncated)'},
 {'type': 'text',
 'text': 'Text-based Context #3:\nPay vs. Performance Comparison\n\nPay vs. Performance Comparison\n\nAs discussed in the CD&A above, our Compensation Committee has implemented ... (truncated)'}]}]

E. Model inference

Now let us use the Llama model to generate a response, as follows.

import json
import re

chat_result = llama4_inference(messages)
chat_result = re.findall(r'\{[^{}]+\}', chat_result)
chat_result = json.loads(chat_result[-1])
chat_result

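This code runs the LLM inference, uses a regular expression to extract the last JSON-formatted object from the output, and parses it into a Python dictionary for further use.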
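The pattern `\{[^{}]+\}` only matches brace pairs with no braces inside, so it picks the wrong span if the answer text itself contains a brace. As a hedged alternative, the sketch below scans for JSON objects with the standard library's `json.JSONDecoder.raw_decode`. The function name `extract_last_json_object` and the sample string are illustrative and not part of the original code.

```python
import json

def extract_last_json_object(text: str):
    """Return the last decodable JSON object embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    found = None
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict):
            found = obj
        # Continue scanning after the object that was just decoded.
        pos = text.find("{", end)
    return found

raw = 'Sure. {"response": "Use set {a, b}.", "need_image": false, "image_index": -1}'
print(extract_last_json_object(raw))
```

On this sample the regular expression would match only `{a, b}`, which is not valid JSON, while the decoder-based scan recovers the full object.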

The inference yields the following result.

{'response': 'The payout varies based on the performance metric. For Relative TSR Percentile Rank, Delta ROCE, and FCF Margin, the payouts are illustrated in the provided graphs. For example, at a Relative TSR Percentile Rank of 60%, the payout is 60%; at a Delta ROCE of 0 bps, the payout is 100%; and at an FCF Margin of 10%, the payout is 100%.',
 'need_image': True,
 'image_index': 0}

F. Output response

The next step is to build the output response, as follows.

if chat_result["need_image"]:
    img_content = colbert_retriever_result[chat_result['image_index']]['content']
else:
    img_content = ""

output_response = {
    "response": chat_result["response"],
    "need_image": chat_result["need_image"],
    "img_base64": img_content
}
output_response

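This code checks whether the LLM response needs an image. If it does, it fetches the corresponding base64 image content from the ColBERT retriever results. It then builds the final response dictionary containing the answer text, the image flag, and the image content where applicable.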
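The model's image_index is not guaranteed to be valid: an index beyond the number of retrieved images raises an IndexError, and -1 silently selects the last image. A defensive variant, sketched here under the assumption that falling back to a text-only answer is acceptable, could look like this. `build_output_response` is a hypothetical helper, not part of the original code.

```python
def build_output_response(chat_result: dict, image_results: list) -> dict:
    """Assemble the final response, attaching an image only when the index is usable."""
    index = chat_result.get("image_index", -1)
    wants_image = bool(chat_result.get("need_image", False))
    has_valid_index = isinstance(index, int) and 0 <= index < len(image_results)
    use_image = wants_image and has_valid_index
    return {
        "response": chat_result.get("response", ""),
        "need_image": use_image,
        "img_base64": image_results[index]["content"] if use_image else "",
    }

images = [{"content": "AAA"}, {"content": "BBB"}]
ok = build_output_response({"response": "See image.", "need_image": True, "image_index": 1}, images)
bad = build_output_response({"response": "See image.", "need_image": True, "image_index": 5}, images)
print(ok["img_base64"], repr(bad["img_base64"]))  # BBB ''
```

The answer text is returned in both cases; only the image attachment is dropped when the index cannot be trusted.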

The final response looks like this.

{'response': 'The payout varies based on the performance metric. For Relative TSR Percentile Rank, Delta ROCE, and FCF Margin, the payouts are illustrated in the provided graphs. For example, at a Relative TSR Percentile Rank of 60%, the payout is 60%; at a Delta ROCE of 0 bps, the payout is 100%; and at an FCF Margin of 10%, the payout is 100%.',
 'need_image': True,
 'img_base64': '/9j/4AAQSkZJRgABAQAAAQABAAD/...'}

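Evaluation

In this section we compare, qualitatively, our proposed multimodal RAG pipeline with a standard text-only RAG pipeline, focusing on key differences in retrieval relevance and answer quality, especially for visually grounded queries.

Author's illustration: our pipeline compared with a common RAG pipeline.

Our qualitative comparison shows that the multimodal RAG pipeline gives more accurate answers than a standard text-only RAG system, especially for queries that involve structured visual content such as tables, figures, and charts. A standard RAG pipeline relies on OCR to turn the visual content of a document into plain text, which often loses spatial structure and misreads key information.

By contrast, our system combines ColPali-based image retrieval, YOLO-DocLayNet for layout detection, and standard text embeddings, so it preserves both visual and semantic context. This lets it retrieve and reason over content that OCR-based pipelines often miss, which highlights the value of a genuinely multimodal approach.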

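Further demonstration

I built a simple Chainlit-based application to wrap up our experiment. Below is an overview of the web application's front end.

Author's illustration: front end of the multimodal RAG application.

With this application, the chatbot can retrieve both text and image information to answer user queries. When an image is relevant to the query, it can also display that image to aid understanding and give clearer context.

To reproduce this web application and its backend server, I created a GitHub repository. The repository fully reproduces our experiment, including the complete knowledge-base indexing pipeline and all components needed for end-to-end deployment.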

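Conclusion

In this article we built a multimodal RAG system that combines ColPali for image-based retrieval with YOLO-DocLayNet for visual region detection, going beyond the limits of conventional text-only retrieval. Through worked results, we showed how combining textual and visual context gives more accurate and more context-aware answers in document question-answering tasks.

References

Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., and Colombo, P. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv preprint arXiv:2407.01449. https://arxiv.org/abs/2407.01449
Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A. S., and Staar, P. (2022). DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 3743-3751).
Reimers, N., and Gurevych, I. (2020). Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. https://arxiv.org/abs/2004.09813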

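About the translator

Zhu Xianzhong is a 51CTO community editor, a 51CTO expert blogger and lecturer, a computer science teacher at a university in Weifang, and a veteran of freelance programming.

Original title: ColPALI Meets DocLayNet: A Vision-Aware Multimodal RAG for Document-QA, by Abu Hanif Muhammad Syarubany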


Last modified 2025-8-14 09:41:34
