Does Your AI Keep Forgetting? 9 Techniques to Supercharge Your Agent's Memory


From sliding windows to OS-like memory: testing and analysis


9 Techniques for Optimizing AI Agent Memory: From Beginner to Advanced

One way to improve an AI agent is to design a multi-sub-agent architecture for better accuracy. In conversational AI, however, optimization goes far beyond that: memory becomes critical.

As your conversations with an AI agent grow longer and deeper, it consumes more and more memory, because the agent relies on components such as stored conversation history, tool calls, and database searches.

In this post we will write code for, and evaluate, nine memory optimization techniques ranging from beginner to advanced, so you can see how to apply each one and weigh its pros and cons, from the simple sequential approach all the way to an advanced OS-like memory management implementation.

[Figure: summary of the nine techniques]

To keep things clear and practical, we will use one simple AI agent throughout, observing the inner workings of each technique so you can extend and implement these strategies in more complex systems.

All of the code (theory + notebook) is available in my GitHub repository:
​​​https://github.com/PulsarPioneers/Multi-Agent-AI-System​

Table of Contents

  • Environment Setup
  • Creating Helper Functions
  • Creating the Base Agent and Memory Class
  • The Problem with the Sequential Optimization Approach
  • Sliding Window Approach
  • Summarization Based Optimization
  • Retrieval Based Memory
  • Memory Augmented Transformers
  • Hierarchical Optimization for Multi-tasks
  • Graph Based Optimization
  • Compression & Consolidation Memory
  • OS-Like Memory Management
  • Choosing the Right Strategy

Environment Setup

Before we can optimize and test the agent's memory techniques, we need to initialize a few components, and before that we must install the required Python libraries:

  • openai: the client library for talking to the LLM API.
  • numpy: for numerical operations, especially on embeddings.
  • faiss-cpu: Facebook AI's library for efficient similarity search, which powers our retrieval memory; a perfect in-memory vector database.
  • networkx: for creating and managing the knowledge graph in Graph-Based Memory.
  • tiktoken: for counting tokens precisely and managing context window limits.

Install the modules:

pip install openai numpy faiss-cpu networkx tiktoken

Next, initialize the client module used to call the LLM:

import os
from openai import OpenAI

API_KEY = "YOUR_LLM_API_KEY"
BASE_URL = "https://api.studio.nebius.com/v1/"

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)

print("OpenAI client configured successfully.")

We will access open-source models through an API provider such as Nebius or Together AI. Next, import and choose the open-source LLM used to build the AI agent:

import tiktoken
import time

GENERATION_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
EMBEDDING_MODEL = "BAAI/bge-multilingual-gemma2"

The main tasks use the LLaMA 3.1 8B Instruct model. Some of the optimizations rely on an embedding model; for that we will use the BGE Multilingual Gemma-2 embedding model.

Next, let's define several helper functions that we will use throughout this post.

Creating Helper Functions

To avoid repeating code and to follow good coding practice, we will define three helper functions:

  • generate_text: generates content from a system prompt and a user prompt.
  • generate_embedding: generates embeddings for the retrieval-based approaches.
  • count_tokens: counts the total number of tokens for each approach.

Let's start with the generate_text function, which generates text from an input prompt:

def generate_text(system_prompt: str, user_prompt: str) -> str:
    """
    Call the LLM API to generate a text response.

    Args:
        system_prompt: Instructions that define the AI's role and behavior.
        user_prompt: The user input the AI should respond to.

    Returns:
        The AI-generated text content.
    """
    response = client.chat.completions.create(
        model=GENERATION_MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    return response.choices[0].message.content

The generate_text function takes a system prompt and a user prompt and generates a response with LLaMA 3.1 8B.

Next, the generate_embedding function, which uses the Gemma-2 model to generate embeddings:

def generate_embedding(text: str) -> list[float]:
    """
    Generate a numerical embedding for the given text using the embedding model.

    Args:
        text: The input string to convert into an embedding.

    Returns:
        A list of floats representing the embedding vector.
    """
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=text
    )
    return response.data[0].embedding

The embedding function uses the Gemma-2 model to return an embedding of the input text.

Finally, we create a function that counts tokens over the entire AI and user chat history, which helps us see how well each optimization works.

We will use a common modern tokenizer, OpenAI's cl100k_base, which is a **Byte Pair Encoding (BPE)** tokenizer. In short, BPE is an algorithm that efficiently splits text into sub-word units.

BPE example:
​​​"lower", "lowest" → ["low", "er"], ["low", "est"]​

Initialize the tokenizer:

tokenizer = tiktoken.get_encoding("cl100k_base")

Now create the function that tokenizes a string and counts its total number of tokens:

def count_tokens(text: str) -> int:
    """
    Count the number of tokens in the given string using the preloaded tokenizer.

    Args:
        text: The string to tokenize.

    Returns:
        The token count as an integer.
    """
    return len(tokenizer.encode(text))

Done! With the helper functions in place, we can start exploring and evaluating the different techniques.

Creating the Base Agent and Memory Class

Now we need the agent's core design structure, which we will use throughout this guide. With respect to memory, our AI agent needs three key capabilities:

  • Add past messages to the agent's memory so it is aware of context.
  • Retrieve relevant content to help the AI generate its response.
  • Clear the agent's memory after each strategy has been tested.

Object-Oriented Programming (OOP) is the best way to build this memory-based functionality, so let's implement it:

import abc

class BaseMemoryStrategy(abc.ABC):
    """Abstract base class for every memory strategy."""

    @abc.abstractmethod
    def add_message(self, user_input: str, ai_response: str):
        """Add a new user-AI interaction to the memory store."""
        pass

    @abc.abstractmethod
    def get_context(self, query: str) -> str:
        """Retrieve and format the relevant context from memory to send to the LLM."""
        pass

    @abc.abstractmethod
    def clear(self):
        """Reset the memory, e.g. when starting a new conversation."""
        pass

We use the **@abstractmethod** decorator, a common pattern when every subclass must provide its own implementation. Each strategy (subclass) works differently, so the design calls for abstract methods.

Using the memory base class and the helper functions we just defined, let's build the AI agent structure with the same OOP principles:

class AIAgent:
    """主AI代理类,设计为可与任何memory策略配合使用。"""
    
    def __init__(self, memory_strategy: BaseMemoryStrategy, system_prompt: str = "You are a helpful AI assistant."):
        """
        Initialize the agent.

        Args:
            memory_strategy: An instance of a BaseMemoryStrategy subclass that determines how the agent remembers the conversation.
            system_prompt: The initial instruction to the LLM, defining its role and task.
        """
        self.memory = memory_strategy
        self.system_prompt = system_prompt
        print(f"Agent initialized with {type(memory_strategy).__name__}.")

    def chat(self, user_input: str):
        """
        Process one turn of the conversation.

        Args:
            user_input: The user's latest message.
        """
        print(f"\n{'='*25} NEW INTERACTION {'='*25}")
        print(f"User > {user_input}")
        
        start_time = time.time()
        context = self.memory.get_context(query=user_input)
        retrieval_time = time.time() - start_time
        
        full_user_prompt = f"### MEMORY CONTEXT\n{context}\n\n### CURRENT REQUEST\n{user_input}"
        
        prompt_tokens = count_tokens(self.system_prompt + full_user_prompt)
        print("\n--- Agent Debug Info ---")
        print(f"Memory Retrieval Time: {retrieval_time:.4f} seconds")
        print(f"Estimated Prompt Tokens: {prompt_tokens}")
        print(f"\n[Full Prompt Sent to LLM]:\n---\nSYSTEM: {self.system_prompt}\nUSER: {full_user_prompt}\n---")
        
        start_time = time.time()
        ai_response = generate_text(self.system_prompt, full_user_prompt)
        generation_time = time.time() - start_time
        
        self.memory.add_message(user_input, ai_response)
        
        print(f"\nAgent > {ai_response}")
        print(f"(LLM Generation Time: {generation_time:.4f} seconds)")
        print(f"{'='*70}")

The agent works in six simple steps:

1. Retrieve context from memory according to the active strategy, recording how long retrieval takes.

2. Merge the retrieved memory context with the current user input to prepare the full prompt.

3. Print debug information, such as the prompt's token count and the context retrieval time.

4. Send the full prompt (system + user + context) to the LLM and wait for its response.

5. Update memory with the new interaction so it is available for future context retrieval.

6. Display the AI's response and the generation time, ending the turn.

Great! With the components in place, let's start understanding and implementing each memory optimization technique.

The Problem with the Sequential Optimization Approach

This is the most basic and simplest approach. Many developers use it, and it was the standard way early chatbots managed conversation history.

It appends every new message to a running log and feeds the entire conversation back to the model every time, forming a linear memory chain that keeps everything that was said.



How the sequential approach works:

1. The user starts a conversation with the AI agent.

2. The agent responds.

3. The user-AI exchange (one "turn") is saved as a single block of text.

4. On the next turn, the agent takes the entire history (turn 1 + turn 2 + turn 3 ...) and combines it with the new user query.

5. This ever-growing block of text is sent to the LLM to generate the next response.

Let's implement sequential optimization with our memory class:

class SequentialMemory(BaseMemoryStrategy):
    def __init__(self):
        """初始化memory,包含一个空列表存储对话历史。"""
        self.history = []

    def add_message(self, user_input: str, ai_response: str):
        """将新的用户-AI交互添加到历史。"""
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": ai_response})

    def get_context(self, query: str) -> str:
        """检索整个对话历史,格式化为单一字符串作为LLM的上下文。"""
        return "\n".join([f"{turn['role'].capitalize()}: {turn['content']}" for turn in self.history])

    def clear(self):
        """通过清空列表重置对话历史。"""
        self.history = []
        print("Sequential memory cleared.")

Code walkthrough:

  • __init__(self): initializes an empty self.history list to store the conversation.
  • add_message(...): appends the user input and the AI response to the history.
  • get_context(...): formats the history as "Role: Content" lines and returns them as the context string.
  • clear(): resets the history for a new conversation.

Initialize the memory class and build the AI agent:

sequential_memory = SequentialMemory()
agent = AIAgent(memory_strategy=sequential_memory)

To test the sequential approach, let's run a multi-turn conversation:

agent.chat("Hi there! My name is Sam.")
agent.chat("I'm interested in learning about space exploration.")
agent.chat("What was the first thing I told you?")

Output

==== NEW INTERACTION ====
User: Hi there! My name is Sam.  
Bot: Hello Sam! Nice to meet you. What brings you here today?  
>>>> Tokens: 23 | Response Time: 2.25s

==== NEW INTERACTION ====
User: I am interested in learning about space exploration.  
Bot: Awesome! Are you curious about:  
- Mars missions  
- Space agencies  
- Private companies (e.g., SpaceX)  
- Space tourism  
- Search for alien life?  
...  
>>>> Tokens: 92 | Response Time: 4.46s

==== NEW INTERACTION ====
User: What was the first thing I told you?  
Bot: You said, "Hi there! My name is Sam."  
...  
>>>> Tokens: 378 | Response Time: 0.52s

The conversation flows nicely, but look at the token counts: they grow larger after every turn. Our agent does not rely on external tools that would add many tokens, so the growth comes entirely from the sequential accumulation of messages.

Drawback: the bigger the conversation, the higher the token cost, which makes the sequential approach expensive.
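To see that growth concretely, you can replay a few turns and print the size of the accumulated context after each one. A minimal sketch reusing the SequentialMemory and count_tokens helpers defined above (the sample replies are made up, and the exact numbers will differ from the run shown):

growth_memory = SequentialMemory()
sample_turns = [
    ("Hi there! My name is Sam.", "Hello Sam! Nice to meet you."),
    ("I'm interested in space exploration.", "Great, let's talk about Mars missions."),
    ("What was the first thing I told you?", "You said your name is Sam."),
]

for user_msg, ai_msg in sample_turns:
    growth_memory.add_message(user_msg, ai_msg)
    context = growth_memory.get_context(query=user_msg)
    print(f"Turns stored: {len(growth_memory.history) // 2} | Context tokens: {count_tokens(context)}")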

Sliding Window Approach

To avoid the problem of an ever-growing context, we turn next to the sliding window approach: the agent does not need to remember every past message, only the context of the most recent N messages.


The agent keeps only the most recent N messages as context. When a new message arrives, the oldest one is dropped and the window slides forward.

How the sliding window approach works:

1. Define a fixed window size, for example N = 2 turns.

2. The first two turns fill the window.

3. On the third turn, the first turn is pushed out of the window.

4. The context sent to the LLM contains only what is currently inside the window.

Implementing the sliding window memory class:

from collections import deque

class SlidingWindowMemory(BaseMemoryStrategy):
    def __init__(self, window_size: int = 4):
        """
        Initialize memory with a fixed-size deque.

        Args:
            window_size: The number of conversation turns to keep (one user message plus one AI reply is one turn).
        """
        self.history = deque(maxlen=window_size)

    def add_message(self, user_input: str, ai_response: str):
        """添加新对话回合到历史,deque满时自动移除最旧回合。"""
        self.history.append([
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": ai_response}
        ])

    def get_context(self, query: str) -> str:
        """检索当前窗口内的对话历史,格式化为单一字符串。"""
        context_list = []
        for turn in self.history:
            for message in turn:
                context_list.append(f"{message['role'].capitalize()}: {message['content']}")
        return "\n".join(context_list)

sequentialsliding memory类相似,区别在于添加了上下文窗口。代码解析:

  • init(self, window_size=2):设置固定大小的deque,实现上下文窗口的自动滑动。
  • add_message(...):添加新回合,deque满时丢弃旧条目。
  • get_context(...):仅从当前滑动窗口内的消息构建上下文。

Initialize the sliding window memory and build the AI agent:

sliding_memory = SlidingWindowMemory(window_size=2)
agent = AIAgent(memory_strategy=sliding_memory)

To test this optimization, run a multi-turn conversation:

agent.chat("My name is Priya and I'm a software developer.")
agent.chat("I work primarily with Python and cloud technologies.")
agent.chat("My favorite hobby is hiking.")

Output

==== NEW INTERACTION ====
User: My name is Priya and I am a software developer.  
Bot: Nice to meet you, Priya! What can I assist you with today?  
>>>> Tokens: 27 | Response Time: 1.10s

==== NEW INTERACTION ====
User: I work primarily with Python and cloud technologies.  
Bot: That is great! Given your expertise...  
>>>> Tokens: 81 | Response Time: 1.40s

==== NEW INTERACTION ====
User: My favorite hobby is hiking.  
Bot: It seems we had a nice conversation about your background...  
>>>> Tokens: 167 | Response Time: 1.59s

So far the conversation looks just like the sequential approach. Now let's ask about information that has fallen outside the window:

agent.chat("What is my name?")

Output

==== NEW INTERACTION ====
User: What is my name?  
Bot: I apologize, but I dont have access to your name from our recent conversation. Could you please remind me?  
>>>> Tokens: 197 | Response Time: 0.60s

The agent cannot answer because the relevant context has slid out of the window. Token counts go down, but important context can be lost; the window size has to be tuned to the kind of agent you are building.
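One common refinement is to size the window by a token budget rather than by a fixed number of turns. Below is a hedged sketch of that idea; the TokenBudgetWindowMemory class and its 1,000-token default are illustrative assumptions, not part of the original code:

class TokenBudgetWindowMemory(BaseMemoryStrategy):
    """Keeps as many recent turns as fit inside a fixed token budget."""

    def __init__(self, max_tokens: int = 1000):
        self.max_tokens = max_tokens
        self.history = []  # formatted turn strings, oldest first

    def add_message(self, user_input: str, ai_response: str):
        self.history.append(f"User: {user_input}\nAssistant: {ai_response}")
        # Drop the oldest turns until the whole history fits the budget.
        while self.history and count_tokens("\n".join(self.history)) > self.max_tokens:
            self.history.pop(0)

    def get_context(self, query: str) -> str:
        return "\n".join(self.history)

    def clear(self):
        self.history = []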

Summarization Based Optimization

The sequential approach suffers from a huge context, and the sliding window can drop important context. We need a way to compress the context without losing key information; that is summarization.


How the summarization approach works:

1. Recent messages are stored in a temporary "buffer".

2. When the buffer reaches a certain size (the "threshold"), the agent pauses and triggers an action.

3. It sends the buffer contents together with the previous summary to the LLM and asks for a new, consolidated summary.

4. The LLM produces a new summary that replaces the old one, and the buffer is cleared.

Implementing summarization-based optimization:

class SummarizationMemory(BaseMemoryStrategy):
    def __init__(self, summary_threshold: int = 4):
        """
        Initialize the summarization memory.

        Args:
            summary_threshold: The number of messages (user + AI) that triggers a summarization.
        """
        self.running_summary = ""
        self.buffer = []
        self.summary_threshold = summary_threshold

    def add_message(self, user_input: str, ai_response: str):
        """添加新交互到buffer,buffer满时触发memory consolidation。"""
        self.buffer.append({"role": "user", "content": user_input})
        self.buffer.append({"role": "assistant", "content": ai_response})

        if len(self.buffer) >= self.summary_threshold:
            self._consolidate_memory()

    def _consolidate_memory(self):
        """使用LLM总结buffer内容并与现有running summary合并。"""
        print("\n--- [Memory Consolidation Triggered] ---")
        buffer_text = "\n".join([f"{msg['role'].capitalize()}: {msg['content']}" for msg in self.buffer])
        
        summarization_prompt = (
            f"You are a summarization expert. Your task is to create a concise summary of a conversation. "
            f"Combine the 'Previous Summary' with the 'New Conversation' into a single, updated summary. "
            f"Capture all key facts, names, and decisions.\n\n"
            f"### Previous Summary:\n{self.running_summary}\n\n"
            f"### New Conversation:\n{buffer_text}\n\n"
            f"### Updated Summary:"
        )
        
        new_summary = generate_text("You are an expert summarization engine.", summarization_prompt)
        self.running_summary = new_summary
        self.buffer = []
        print(f"--- [New Summary: '{self.running_summary}'] ---")

    def get_context(self, query: str) -> str:
        """构建上下文,结合长期running summary和短期buffer。"""
        buffer_text = "\n".join([f"{msg['role'].capitalize()}: {msg['content']}" for msg in self.buffer])
        return f"### Summary of Past Conversation:\n{self.running_summary}\n\n### Recent Messages:\n{buffer_text}"

Code walkthrough:

  • __init__(...): sets up an empty running_summary string and buffer list.
  • add_message(...): appends messages to the buffer; once the summary_threshold is reached, it calls **_consolidate_memory**.
  • _consolidate_memory(): formats the buffer and the existing summary, asks the LLM for a new summary, updates running_summary, and clears the buffer.
  • get_context(...): provides both the long-term summary and the short-term buffer, giving the LLM a complete view of the conversation.

Initialize and test it:

summarization_memory = SummarizationMemory(summary_threshold=4)
agent = AIAgent(memory_strategy=summarization_memory)

agent.chat("I'm starting a new company called 'Innovatech'. Our focus is on sustainable energy.")
agent.chat("Our first product will be a smart solar panel, codenamed 'Project Helios'.")

Output

==== NEW INTERACTION ====
User: I am starting a new company called 'Innovatech'. Ou...
Bot: Congratulations on starting Innovatech! Focusing o ...  
>>>> Tokens: 45 | Response Time: 2.55s

==== NEW INTERACTION ====
User: Our first product will be a smart solar panel....  
--- [Memory Consolidation Triggered] ---  
--- [New Summary: The user started a compan ...  
Bot: That is exciting news about  ....  
>>>> Tokens: 204 | Response Time: 3.58s

After two turns a summary is generated. Let's keep testing:

agent.chat("The marketing budget is set at $50,000.")
agent.chat("What is the name of my company and its first product?")

Output

==== NEW INTERACTION ====
User: What is the name of my company and its first product?  
Bot: Your company is called 'Innovatech' and its first product is codenamed 'Project Helios'.  
>>>> Tokens: 147 | Response Time: 1.05s

By the fourth turn the token count has nearly halved: summarization greatly reduces token usage. The summarization prompts, however, must be carefully designed so they capture the key details.

Drawback: key information can be lost during summarization. For example, a 40-turn conversation may contain numeric or factual details (say, sales figures mentioned in turn four) that no longer appear in the summary.

Testing that scenario after 40 turns:

agent.chat("what was the gross sales of our company in the fiscal year?")

Output

==== NEW INTERACTION ====
User: what was the gross sales of our company in the fiscal year?  
Bot: I am sorry but I do not have that information. Could you please provide the gross sales figure for the fiscal year?  
>>>> Tokens: 1532 | Response Time: 2.831s

The summarized memory saves tokens, but answer quality can drop significantly. A good mitigation is to add a sub-agent that fact-checks answers, improving reliability.
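One possible shape of such a fact-checking step, built on the generate_text helper from earlier (the verify_answer function and its prompt wording are illustrative assumptions, not part of the original post):

def verify_answer(summary: str, recent_messages: str, question: str, draft_answer: str) -> str:
    """Ask the LLM whether a draft answer is actually supported by the stored memory."""
    verification_prompt = (
        f"### Stored Summary:\n{summary}\n\n"
        f"### Recent Messages:\n{recent_messages}\n\n"
        f"### Question:\n{question}\n\n"
        f"### Draft Answer:\n{draft_answer}\n\n"
        "Is the draft answer fully supported by the stored summary and recent messages? "
        "Reply 'SUPPORTED' or 'NOT SUPPORTED', followed by a one-sentence reason."
    )
    return generate_text("You are a strict fact-checking assistant.", verification_prompt)

# Example usage with the SummarizationMemory instance from above:
# report = verify_answer(summarization_memory.running_summary, "",
#                        "What were our gross sales?", "I do not have that information.")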

Retrieval Based Memory

This is the most powerful strategy for many AI agent use cases: RAG-based AI agents. The previous approaches cut token usage but risk losing context; RAG solves this by retrieving the context that is relevant to the current user query.


The context is stored in a database, and an embedding model converts text into vector representations, which makes retrieval efficient.

How RAG-based memory works:

1. Each new interaction is saved as a "document" in a database, and a numerical representation of it (an embedding) is generated and stored.

2. When the user sends a new message, the agent converts it into an embedding as well.

3. The query embedding is used to run a similarity search over all document embeddings.

4. The k most semantically relevant documents are retrieved (for example, the 3 most similar past turns).

5. Only these relevant documents are injected into the LLM's context window.

We use FAISS for vector storage:

import numpy as np
import faiss

class RetrievalMemory(BaseMemoryStrategy):
    def __init__(self, k: int = 2, embedding_dim: int = 3584):
        """
        Initialize the retrieval memory system.

        Args:
            k: The number of top relevant documents to retrieve.
            embedding_dim: The dimension of the vectors produced by the embedding model; 3584 for BAAI/bge-multilingual-gemma2.
        """
        self.k = k
        self.embedding_dim = embedding_dim
        self.documents = []
        self.index = faiss.IndexFlatL2(self.embedding_dim)

    def add_message(self, user_input: str, ai_response: str):
        """添加新对话回合到memory,分别嵌入和索引用户和AI消息。"""
        docs_to_add = [
            f"User said: {user_input}",
            f"AI responded: {ai_response}"
        ]
        for doc in docs_to_add:
            embedding = generate_embedding(doc)
            if embedding:
                self.documents.append(doc)
                vector = np.array([embedding], dtype='float32')
                self.index.add(vector)

    def get_context(self, query: str) -> str:
        """根据语义相似性检索k个最相关documents。"""
        if self.index.ntotal == 0:
            return "No information in memory yet."
        
        query_embedding = generate_embedding(query)
        if not query_embedding:
            return "Could not process query for retrieval."
        
        query_vector = np.array([query_embedding], dtype='float32')
        distances, indices = self.index.search(query_vector, self.k)
        
        retrieved_docs = [self.documents[i] for i in indices[0] if i != -1]
        if not retrieved_docs:
            return "Could not find any relevant information in memory."
        
        return "### Relevant Information Retrieved from Memory:\n" + "\n---\n".join(retrieved_docs)

Code walkthrough:

  • __init__(...): initializes the documents list and a faiss.IndexFlatL2 to store the search vectors, using the specified embedding_dim.
  • add_message(...): generates an embedding for both the user and the AI message and adds them to documents and the FAISS index.
  • get_context(...): embeds the user query, uses self.index.search to find the k most similar vectors, and returns the corresponding original texts as context.

Initialize and test it:

retrieval_memory = RetrievalMemory(k=2)
agent = AIAgent(memory_strategy=retrieval_memory)

agent.chat("I am planning a vacation to Japan for next spring.")
agent.chat("For my software project, I'm using the React framework for the frontend.")
agent.chat("I want to visit Tokyo and Kyoto while I'm on my trip.")
agent.chat("The backend of my project will be built with Django.")
agent.chat("What cities am I planning to visit on my vacation?")

Output

==== NEW INTERACTION ====
User: What cities am I planning to visit on my vacation?  
--- Agent Debug Info ---  
[Full Prompt Sent to LLM]:  
---  
SYSTEM: You are a helpful AI assistant.  
USER: MEMORY CONTEXT  
Relevant Information Retrieved from Memory:  
User said: I want to visit Tokyo and Kyoto while I am on my trip.  
---  
User said: I am planning a vacation to Japan for next spring.  
...  

Bot: You are planning to visit Tokyo and Kyoto while on your vacation to Japan next spring.  
>>>> Tokens: 65 | Response Time: 0.53s

The agent retrieves the relevant context successfully, and the token count stays very low because only relevant information is fetched. The choice of embedding model and vector storage database matters a lot; FAISS is popular for its efficiency. But the bigger the database, the more complex the agent becomes, and you will need optimizations such as parallel queries.
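As one example of such an optimization, embeddings for new documents can be generated concurrently before being added to the index. A minimal sketch wrapping the generate_embedding helper in a thread pool (the add_documents_concurrently helper and worker count are illustrative assumptions):

from concurrent.futures import ThreadPoolExecutor

import numpy as np

def add_documents_concurrently(memory: RetrievalMemory, docs: list[str], max_workers: int = 4):
    """Embed several documents in parallel, then add them to the FAISS index sequentially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        embeddings = list(pool.map(generate_embedding, docs))

    for doc, embedding in zip(docs, embeddings):
        if embedding:
            memory.documents.append(doc)
            memory.index.add(np.array([embedding], dtype='float32'))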

Memory Augmented Transformers

AI systems are adopting ever more sophisticated approaches and pushing the boundary of what is possible.

Think of a normal AI as a student with one small notebook: they can only write so much, and during a long exam they have to erase old notes to make room for new ones. Memory-Augmented Transformers are like handing that student a stack of sticky notes: the notebook handles the current work, while the sticky notes preserve key information from earlier.


For example, suppose you are designing a non-violent space video game and mention early on "space setting, no violence". A normal AI might forget this, but a memory-augmented AI writes it on a sticky note, so a later query can still be matched against the original vision.

How Memory Augmented Transformers work here:

  • A SlidingWindowMemory manages the recent chat.
  • After each turn, an LLM acts as a "fact extractor": it analyzes the conversation and decides whether it contains a core fact, preference, or decision.
  • If an important fact is found, it is stored as a memory token (a concise string).
  • The final context given to the agent combines the recent chat window with all persistent memory tokens.

Implementation:

class MemoryAugmentedMemory(BaseMemoryStrategy):
    def __init__(self, window_size: int = 2):
        """
        Initialize the memory-augmented system.

        Args:
            window_size: The number of recent turns to keep in short-term memory.
        """
        self.recent_memory = SlidingWindowMemory(window_size=window_size)
        self.memory_tokens = []

    def add_message(self, user_input: str, ai_response: str):
        """添加回合到近期memory,并使用LLM决定是否创建持久memory token。"""
        self.recent_memory.add_message(user_input, ai_response)
        
        fact_extraction_prompt = (
            f"Analyze the following conversation turn. Does it contain a core fact, preference, or decision that should be remembered long-term? "
            f"Examples include user preferences ('I hate flying'), key decisions ('The budget is $1000'), or important facts ('My user ID is 12345').\n\n"
            f"Conversation Turn:\nUser: {user_input}\nAI: {ai_response}\n\n"
            f"If it contains such a fact, state the fact concisely in one sentence. Otherwise, respond with 'No important fact.'"
        )
        
        extracted_fact = generate_text("You are a fact-extraction expert.", fact_extraction_prompt)
        
        if "no important fact" not in extracted_fact.lower():
            print(f"--- [Memory Augmentation: New memory token created: '{extracted_fact}'] ---")
            self.memory_tokens.append(extracted_fact)

    def get_context(self, query: str) -> str:
        """结合短期近期对话和长期memory tokens构建上下文。"""
        recent_context = self.recent_memory.get_context(query)
        memory_token_context = "\n".join([f"- {token}" for token in self.memory_tokens])
        return f"### Key Memory Tokens (Long-Term Facts):\n{memory_token_context}\n\n### Recent Conversation:\n{recent_context}"

Code walkthrough:

  • __init__(...): initializes a SlidingWindowMemory and an empty memory_tokens list.
  • add_message(...): adds the turn to the sliding window, then makes an extra LLM call to check whether a key fact should be extracted and appended to memory_tokens.
  • get_context(...): combines the "sticky notes" (memory_tokens) with the recent chat history to build a rich prompt.

Initialize and test it:

mem_aug_memory = MemoryAugmentedMemory(window_size=2)
agent = AIAgent(memory_strategy=mem_aug_memory)

agent.chat("Please remember this for all future interactions: I am severely allergic to peanuts.")
agent.chat("Okay, let's talk about recipes. What's a good idea for dinner tonight?")
agent.chat("That sounds good. What about a dessert option?")
agent.chat("Could you suggest a Thai green curry recipe? Please ensure it's safe for me.")

Output

==== NEW INTERACTION ====
User: Please remember this for all future interactions: I am severely allergic to peanuts.  
--- [Memory Augmentation: New memory token created: 'The user has a severe allergy to peanuts.'] ---  
Bot: I have taken note of your long-term fact: You are severely allergic to peanuts. I will keep this in mind...  
>>>> Tokens: 45 | Response Time: 1.32s

...

==== NEW INTERACTION ====
User: Could you suggest a Thai green curry recipe? Please ensure it is safe for me.  
--- Agent Debug Info ---  
[Full Prompt Sent to LLM]:  
---  
SYSTEM: You are a helpful AI assistant.  
USER: MEMORY CONTEXT  
Key Memory Tokens (Long-Term Facts):  
- The user has a severe allergy to peanuts.  
...  
Recent Conversation:  
User: Okay, lets talk about recipes...  
...  

Bot: Of course. Given your peanut allergy, it is very important to be careful with Thai cuisine as many recipes use peanuts or peanut oil. Here is a peanut-free Thai green curry recipe...  
>>>> Tokens: 712 | Response Time: 6.45s

Because it makes an extra LLM call for fact extraction on every turn, this approach is more complex and more expensive, but it preserves key information over the long run, which makes it well suited to building reliable personal assistants.

Hierarchical Optimization for Multi-tasks

So far we have treated memory as a single system. What if the agent, like a human, had different kinds of memory for different purposes? That is the idea behind Hierarchical Memory: combine several simpler memory types into a more sophisticated, organized system.

An analogy with human memory:

  • Working Memory: the last few sentences you heard; fast but fleeting.
  • Short-Term Memory: the key points from this morning's meeting; easy to recall for a few hours.
  • Long-Term Memory: your home address or a key fact learned years ago; durable and deeply stored.

How the hierarchical approach works:

1. The user's message is captured into working memory.

2. The agent checks whether the information is important enough to be promoted to long-term memory.

3. Promoted content is stored in retrieval memory for future use.

4. On a new query, the agent searches long-term memory for relevant context.

5. Relevant memories are injected into the context to generate a better response.

Implementation:

class HierarchicalMemory(BaseMemoryStrategy):
    def __init__(self, window_size: int = 2, k: int = 2, embedding_dim: int = 3584):
        """
        Initialize the hierarchical memory system.

        Args:
            window_size: The number of turns kept in short-term working memory.
            k: The number of documents to retrieve from long-term memory.
            embedding_dim: The embedding vector dimension used by long-term memory.
        """
        print("Initializing Hierarchical Memory...")
        self.working_memory = SlidingWindowMemory(window_size=window_size)
        self.long_term_memory = RetrievalMemory(k=k, embedding_dim=embedding_dim)
        self.promotion_keywords = ["remember", "rule", "preference", "always", "never", "allergic"]

    def add_message(self, user_input: str, ai_response: str):
        """添加消息到working memory,基于内容有条件提升到long-term memory。"""
        self.working_memory.add_message(user_input, ai_response)
        
        if any(keyword in user_input.lower() for keyword in self.promotion_keywords):
            print(f"--- [Hierarchical Memory: Promoting message to long-term storage.] ---")
            self.long_term_memory.add_message(user_input, ai_response)

    def get_context(self, query: str) -> str:
        """结合long-term和short-term memory层构建丰富上下文。"""
        working_context = self.working_memory.get_context(query)
        long_term_context = self.long_term_memory.get_context(query)
        return f"### Retrieved Long-Term Memories:\n{long_term_context}\n\n### Recent Conversation (Working Memory):\n{working_context}"

Code walkthrough:

  • __init__(...): initializes a SlidingWindowMemory and a RetrievalMemory, and defines the promotion_keywords.
  • add_message(...): adds the message to working_memory, checks whether it contains any of the keywords, and if so also adds it to long_term_memory.
  • get_context(...): fetches context from both memory systems and merges them into one rich prompt.

Initialize and test it:

hierarchical_memory = HierarchicalMemory()
agent = AIAgent(memory_strategy=hierarchical_memory)

agent.chat("Please remember my User ID is AX-7890.")
agent.chat("Let's chat about the weather. It's very sunny today.")
agent.chat("I'm planning to go for a walk later.")
agent.chat("I need to log into my account, can you remind me of my ID?")

Output

==== NEW INTERACTION ====
User: Please remember my User ID is AX-7890.  
--- [Hierarchical Memory: Promoting message to long-term storage.] ---  
Bot: You have provided your User ID as AX-7890, which has been stored in long-term memory for future reference.  
...

==== NEW INTERACTION ====
User: I need to log into my account, can you remind me of my ID?  
--- Agent Debug Info ---  
[Full Prompt Sent to LLM]:  
---  
SYSTEM: You are a helpful AI assistant.  
USER: ### MEMORY CONTEXT  
### Retrieved Long-Term Memories:  
### Relevant Information Retrieved from Memory:  
User said: Please remember my User ID is AX-7890.  
...  
### Recent Conversation (Working Memory):  
User: Let's chat about the weather...  
User: I'm planning to go for a walk later...  

Bot: Your User ID is AX-7890. You can use this to log into your account. Is there anything else I can assist you with?  
>>>> Tokens: 452 | Response Time: 2.06s

The agent successfully combines the different memory types: it uses the fast working memory to keep the conversation flowing, and queries long-term memory to retrieve the critical User ID.

Graph Based Optimization

Until now, memory has been stored as blocks of text, whether a full transcript, a summary, or a retrieved document. What if the agent could understand the relationships between pieces of information? That is the leap Graph-Based Memory makes.

This strategy represents information as a knowledge graph:

Nodes (Entities): the "things" in the conversation, such as people (Clara), companies (FutureScape), or concepts (Project Odyssey).

Edges (Relations): the connections that describe how nodes relate, such as works_for, is_based_in, or manages.

The result is a structured, web-like memory. Instead of the plain fact "Clara works for FutureScape", we store the connection (Clara) --[works_for]--> (FutureScape).


This is powerful for answering complex queries that require reasoning over relationships. The challenge is populating the graph from unstructured conversation, so we use the LLM to extract structured (Subject, Relation, Object) triples.

Implementation, using the networkx library:

import networkx as nx
import re

class GraphMemory(BaseMemoryStrategy):
    def __init__(self):
        """初始化memory,包含空的NetworkX有向图。"""
        self.graph = nx.DiGraph()

    def _extract_triples(self, text: str) -> list[tuple[str, str, str]]:
        """使用LLM从文本提取(Subject, Relation, Object)三元组。"""
        print("--- [Graph Memory: Attempting to extract triples from text.] ---")
        extraction_prompt = (
            f"You are a knowledge extraction engine. Your task is to extract Subject-Relation-Object triples from the given text. "
            f"Format your output strictly as a list of Python tuples. For example: [('Sam', 'works_for', 'Innovatech'), ('Innovatech', 'focuses_on', 'Energy')]. "
            f"If no triples are found, return an empty list [].\n\n"
            f"Text to analyze:\n\"""{text}\""""
        )
        
        response_text = generate_text("You are an expert knowledge graph extractor.", extraction_prompt)
        
        try:
            found_triples = re.findall(r"\(['\"](.*?)['\"],\s*['\"](.*?)['\"],\s*['\"](.*?)['\"]\)", response_text)
            print(f"--- [Graph Memory: Extracted triples: {found_triples}] ---")
            return found_triples
        except Exception as e:
            print(f"Could not parse triples from LLM response: {e}")
            return []

    def add_message(self, user_input: str, ai_response: str):
        """从最新对话回合提取三元组并添加到knowledge graph。"""
        full_text = f"User: {user_input}\nAI: {ai_response}"
        triples = self._extract_triples(full_text)
        for subject, relation, obj in triples:
            self.graph.add_edge(subject.strip(), obj.strip(), relation=relation.strip())

    def get_context(self, query: str) -> str:
        """通过查询中的实体查找graph,返回所有已知关系。"""
        if not self.graph.nodes:
            return "The knowledge graph is empty."
        
        query_entities = [word.capitalize() for word in query.replace('?','').split() if word.capitalize() in self.graph.nodes]
        
        if not query_entities:
            return "No relevant entities from your query were found in the knowledge graph."
        
        context_parts = []
        for entity in set(query_entities):
            for u, v, data in self.graph.out_edges(entity, data=True):
                context_parts.append(f"{u} --[{data['relation']}]--> {v}")
            for u, v, data in self.graph.in_edges(entity, data=True):
                context_parts.append(f"{u} --[{data['relation']}]--> {v}")
        
        return "### Facts Retrieved from Knowledge Graph:\n" + "\n".join(sorted(list(set(context_parts))))

Code walkthrough:

  • _extract_triples(...): the core of the strategy; it sends the conversation text to the LLM and asks for structured data back.
  • add_message(...): calls **_extract_triples** and adds the resulting triples to the networkx graph.
  • get_context(...): searches the graph for entities mentioned in the query and returns all of their relationships as structured context.

Test it:

graph_memory = GraphMemory()
agent = AIAgent(memory_strategy=graph_memory)

agent.chat("A person named Clara works for a company called 'FutureScape'.")
agent.chat("FutureScape is based in Berlin.")
agent.chat("Clara's main project is named 'Odyssey'.")
agent.chat("Tell me about Clara's project.")

Output

==== NEW INTERACTION ====
User: A person named Clara works for a company called 'FutureScape'.  
--- [Graph Memory: Attempting to extract triples from text.] ---  
--- [Graph Memory: Extracted triples: [('Clara', 'works_for', 'FutureScape')]] ---  
Bot: Understood. I've added the fact that Clara works for FutureScape to my knowledge graph.  
...

==== NEW INTERACTION ====
User: Clara's main project is named 'Odyssey'.  
--- [Graph Memory: Attempting to extract triples from text.] ---  
--- [Graph Memory: Extracted triples: [('Clara', 'manages_project', 'Odyssey')]] ---  
Bot: Got it. I've noted that Clara's main project is Odyssey.  

==== NEW INTERACTION ====
User: Tell me about Clara's project.  
--- Agent Debug Info ---  
[Full Prompt Sent to LLM]:  
---  
SYSTEM: You are a helpful AI assistant.  
USER: ### MEMORY CONTEXT  
### Facts Retrieved from Knowledge Graph:  
Clara --[manages_project]--> Odyssey  
Clara --[works_for]--> FutureScape  
...  

Bot: Based on my knowledge graph, Clara's main project is named 'Odyssey', and Clara works for the company FutureScape.  
>>>> Tokens: 78 | Response Time: 1.5s

By navigating its internal graph, the agent surfaces all of the relevant facts, which makes this approach a good fit for building highly knowledgeable expert agents.
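Because the memory is a real networkx graph, you can also answer multi-hop questions by traversing it instead of matching strings. A hedged sketch (the related_facts helper is an illustrative assumption, not part of the original implementation):

import networkx as nx

def related_facts(graph: nx.DiGraph, entity: str, max_hops: int = 2) -> list[str]:
    """Collect every relation within max_hops of an entity, e.g. Clara -> FutureScape -> Berlin."""
    if entity not in graph:
        return []
    undirected = graph.to_undirected(as_view=True)
    reachable = set(nx.single_source_shortest_path_length(undirected, entity, cutoff=max_hops))
    facts = []
    for u, v, data in graph.edges(data=True):
        if u in reachable and v in reachable:
            facts.append(f"{u} --[{data.get('relation', 'related_to')}]--> {v}")
    return sorted(set(facts))

# Example: related_facts(graph_memory.graph, "Clara") would also surface
# FutureScape --[is_based_in]--> Berlin once that fact is in the graph.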

Compression & Consolidation Memory

Summarization manages long conversations reasonably well, but can we be even more aggressive about reducing token usage? That is Compression & Consolidation Memory, a stronger sibling of summarization.

The goal is to distill every piece of information into its densest factual representation, like turning a long meeting transcript into a set of concise single-sentence bullet points.


How the compression approach works:

1. Each turn (user input + AI response) is sent to the LLM.

2. A dedicated prompt asks the LLM to act as a "data compression engine".

3. The LLM rewrites the turn as a single essential statement, stripping out greetings, pleasantries, and other fluff.

4. The compressed fact is stored in a simple list.

5. The agent's memory becomes a highly token-efficient list of core facts.

Implementation:

class CompressionMemory(BaseMemoryStrategy):
    def __init__(self):
        """初始化memory,包含空的compressed facts列表。"""
        self.compressed_facts = []

    def add_message(self, user_input: str, ai_response: str):
        """使用LLM将最新回合压缩为简洁事实语句。"""
        text_to_compress = f"User: {user_input}\nAI: {ai_response}"
        
        compression_prompt = (
            f"You are a data compression engine. Your task is to distill the following text into its most essential, factual statement. "
            f"Be as concise as possible, removing all conversational fluff. For example, 'User asked about my name and I, the AI, responded that my name is an AI assistant' should become 'User asked for AI's name.'\n\n"
            f"Text to compress:\n\"{text_to_compress}\""
        )
        
        compressed_fact = generate_text("You are an expert data compressor.", compression_prompt)
        print(f"--- [Compression Memory: New fact stored: '{compressed_fact}'] ---")
        self.compressed_facts.append(compressed_fact)

    def get_context(self, query: str) -> str:
        """返回所有compressed facts列表,格式为项目符号列表。"""
        if not self.compressed_facts:
            return "No compressed facts in memory."
        return "### Compressed Factual Memory:\n- " + "\n- ".join(self.compressed_facts)

Code walkthrough:

  • __init__(...): creates an empty compressed_facts list.
  • add_message(...): sends the turn to the LLM with the compression prompt and stores the concise result.
  • get_context(...): formats the compressed facts as a concise bulleted list.

Test it:

compression_memory = CompressionMemory()
agent = AIAgent(memory_strategy=compression_memory)

agent.chat("Okay, I've decided on the venue for the conference. It's going to be the 'Metropolitan Convention Center'.")
agent.chat("The date is confirmed for October 26th, 2025.")
agent.chat("Could you please summarize the key details for the conference plan?")

Output

==== NEW INTERACTION ====
User: Okay, I've decided on the venue for the conference. It's going to be the 'Metropolitan Convention Center'.  
--- [Compression Memory: New fact stored: 'The conference venue has been decided as the 'Metropolitan Convention Center'.'] ---  
Bot: Great! The Metropolitan Convention Center is an excellent choice. What's next on our planning list?  
...

==== NEW INTERACTION ====
User: The date is confirmed for October 26th, 2025.  
--- [Compression Memory: New fact stored: 'The conference date is confirmed for October 26th, 2025.'] ---  
Bot: Perfect, I've noted the date.  
...

==== NEW INTERACTION ====
User: Could you please summarize the key details for the conference plan?  
--- Agent Debug Info ---  
[Full Prompt Sent to LLM]:  
---  
SYSTEM: You are a helpful AI assistant.  
USER: ### MEMORY CONTEXT  
### Compressed Factual Memory:  
- The conference venue has been decided as the 'Metropolitan Convention Center'.  
- The conference date is confirmed for October 26th, 2025.  
...  

Bot: Of course. Based on my notes, here are the key details for the conference plan:  
- **Venue:** Metropolitan Convention Center  
- **Date:** October 26th, 2025  
>>>> Tokens: 48 | Response Time: 1.2s

This strategy slashes the token count while preserving the core facts, which suits applications that need long-term factual recall on a tight token budget. For conversations that depend on subtle tone and personality, though, compression may be too aggressive.

OS-Like Memory Management

What if we gave the agent a memory system that works the way a computer's does?


This advanced concept borrows from how a computer's operating system manages RAM and the hard disk:

  • RAM: the computer's ultra-fast memory for active programs; expensive and limited in size. The agent's LLM context window is its RAM: quick to access but strictly bounded.
  • Hard Disk: long-term storage, large and cheap but slower to access. For the agent this is an external database or set of files holding old conversation history.

How OS-like memory management works:

  • Active Memory (RAM): the most recent conversation turns are kept in a fast-access buffer.
  • Passive Memory (Disk): when active memory fills up, the oldest information is moved to long-term storage; this is called "paging out".
  • Page Fault: when the user asks about something that is not in active memory, a "page fault" occurs.
  • The system looks up the relevant information in passive storage and loads it into the active context for the LLM to use; this is called "paging in".

Implementation, simulating active_memory (a deque) and passive_memory (a dictionary):

class OSMemory(BaseMemoryStrategy):
    def __init__(self, ram_size: int = 2):
        """
        Initialize the OS-like memory system.

        Args:
            ram_size: The maximum number of conversation turns kept in active memory (RAM).
        """
        self.ram_size = ram_size
        self.active_memory = deque()
        self.passive_memory = {}
        self.turn_count = 0

    def add_message(self, user_input: str, ai_response: str):
        """添加回合到active memory,RAM满时将最旧回合paging out到passive memory。"""
        turn_id = self.turn_count
        turn_data = f"User: {user_input}\nAI: {ai_response}"
        
        if len(self.active_memory) >= self.ram_size:
            lru_turn_id, lru_turn_data = self.active_memory.popleft()
            self.passive_memory[lru_turn_id] = lru_turn_data
            print(f"--- [OS Memory: Paging out Turn {lru_turn_id} to passive storage.] ---")
        
        self.active_memory.append((turn_id, turn_data))
        self.turn_count += 1

    def get_context(self, query: str) -> str:
        """提供RAM上下文,模拟page fault从passive memory拉取数据。"""
        active_context = "\n".join([data for _, data in self.active_memory])
        
        paged_in_context = ""
        for turn_id, data in self.passive_memory.items():
            if any(word in data.lower() for word in query.lower().split() if len(word) > 3):
                paged_in_context += f"\n(Paged in from Turn {turn_id}): {data}"
                print(f"--- [OS Memory: Page fault! Paging in Turn {turn_id} from passive storage.] ---")
        
        return f"### Active Memory (RAM):\n{active_context}\n\n### Paged-In from Passive Memory (Disk):\n{paged_in_context}"

    def clear(self):
        """清空active和passive memory。"""
        self.active_memory.clear()
        self.passive_memory = {}
        self.turn_count = 0
        print("OS-like memory cleared.")

Code walkthrough:

  • __init__(...): sets up a fixed-size active_memory deque and an empty passive_memory dictionary.
  • add_message(...): appends the new turn to active_memory; when it is full, the oldest turn is popleft()-ed into passive_memory (paging out).
  • get_context(...): includes active_memory, searches passive_memory, and pages matching data into the context when the query requires it.

To test it, we tell the agent a secret code, force that turn to be paged out to passive memory, and then ask for the code:

os_memory = OSMemory(ram_size=2)
agent = AIAgent(memory_strategy=os_memory)

agent.chat("The secret launch code is 'Orion-Delta-7'.")
agent.chat("The weather for the launch looks clear.")
agent.chat("The launch window opens at 0400 Zulu.")
agent.chat("I need to confirm the launch code.")

Output

...

==== NEW INTERACTION ====
User: The launch window opens at 0400 Zulu.  
--- [OS Memory: Paging out Turn 0 to passive storage.] ---  
Bot: PROCESSING NEW LAUNCH WINDOW INFORMATION...  
...

==== NEW INTERACTION ====
User: I need to confirm the launch code.  
--- [OS Memory: Page fault! Paging in Turn 0 from passive storage.] ---  
--- Agent Debug Info ---  
[Full Prompt Sent to LLM]:  
---  
SYSTEM: You are a helpful AI assistant.  
USER: ### MEMORY CONTEXT  
### Active Memory (RAM):  
User: The weather for the launch looks clear.  
...  
User: The launch window opens at 0400 Zulu.  
...  
### Paged-In from Passive Memory (Disk):  
(Paged in from Turn 0): User: The secret launch code is 'Orion-Delta-7'.  
...  

Bot: CONFIRMING LAUNCH CODE: The stored secret launch code is 'Orion-Delta-7'.  
>>>> Tokens: 539 | Response Time: 2.56s

It works perfectly: the agent pages old data out to passive storage and intelligently retrieves it only when the query requires it.

This model is a good fit for building large-scale systems with nearly unlimited memory while keeping the active context small and fast.
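To push the hard-disk analogy further, the passive tier can live in an actual file or database instead of an in-memory dict, so it survives restarts. A minimal sketch using JSON files (the save/load helpers and file path are illustrative assumptions):

import json

def save_passive_memory(memory: OSMemory, path: str = "passive_memory.json"):
    """Persist the 'disk' tier so it survives restarts."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(memory.passive_memory, f, ensure_ascii=False, indent=2)

def load_passive_memory(memory: OSMemory, path: str = "passive_memory.json"):
    """Reload previously paged-out turns into the passive store."""
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)
    # JSON keys are strings; convert back to the integer turn ids used by OSMemory.
    memory.passive_memory = {int(turn_id): data for turn_id, data in loaded.items()}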

Choosing the Right Strategy

We have walked through nine different memory optimization strategies, from simple to complex. There is no single "best" strategy; the right choice balances your agent's needs, your budget, and your engineering resources.

When to choose what?

  • Simple, short-lived bots: Sequential or Sliding Window; easy to implement and they work well.
  • Long, creative conversations: Summarization keeps the conversation flowing while reducing token overhead.
  • Agents that need precise long-term recall: Retrieval-Based memory is the industry standard; powerful, scalable, and the cornerstone of RAG applications.
  • Highly reliable personal assistants: Memory-Augmented or Hierarchical approaches separate key facts from conversational noise.
  • Expert systems and knowledge bases: Graph-Based memory is unmatched for reasoning over relationships between data points.

The most powerful agents in production typically use hybrid approaches that combine these techniques; for example, a hierarchical system whose long-term memory pairs a vector database with a knowledge graph.
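As a rough sketch of what such a hybrid could look like, the class below simply delegates to the RetrievalMemory and GraphMemory strategies built earlier and concatenates their contexts (the HybridMemory name and this particular composition are illustrative assumptions, not a prescribed design):

class HybridMemory(BaseMemoryStrategy):
    """Combines semantic retrieval (vector store) with relational facts (knowledge graph)."""

    def __init__(self, k: int = 2, embedding_dim: int = 3584):
        self.vector_memory = RetrievalMemory(k=k, embedding_dim=embedding_dim)
        self.graph_memory = GraphMemory()

    def add_message(self, user_input: str, ai_response: str):
        # Every turn is indexed for semantic search and mined for graph triples.
        self.vector_memory.add_message(user_input, ai_response)
        self.graph_memory.add_message(user_input, ai_response)

    def get_context(self, query: str) -> str:
        return (f"{self.graph_memory.get_context(query)}\n\n"
                f"{self.vector_memory.get_context(query)}")

    def clear(self):
        self.vector_memory.clear()
        self.graph_memory.clear()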

The key is to be clear about what your agent needs to remember, for how long, and with what precision. Master these memory strategies and you can move beyond simple chatbots to truly intelligent agents that learn, remember, and perform better over time.


This article is reposted from AI大模型观察站; author: AI大模型观察站.
