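GEPA Reflection Process Explained

Reflection is GEPA's core innovation. It plays the role that backpropagation plays in neural networks, but it replaces mathematical gradients with an LLM's language understanding.

Reflection flow diagram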

┌──────────────────────────────────────────────────────────────────────┐
│                         GEPA reflection flow                         │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Inputs:                                                             │
│    - candidate: the current prompt                                   │
│    - eval_batch: evaluation results (scores + trajectories)          │
│    - components_to_update: components to update                      │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Step 1: build the reflective dataset                          │  │
│  │  dataset = all evaluated cases (successes + failures)          │  │
│  │                                                                │  │
│  │  for each evaluated case:                                      │  │
│  │    record = {                                                  │  │
│  │      "Inputs": "Question: 1+1=?",                              │  │
│  │      "Generated Outputs": "The answer is 3",                   │  │
│  │      "Feedback": "Incorrect. The correct answer is 2. ..."    │  │
│  │      # Feedback is produced by the evaluator                   │  │
│  │    }                                                           │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                  ↓                                   │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Step 2: generate the improved prompt (LLM call)               │  │
│  │                                                                │  │
│  │  prompt = f"""                                                 │  │
│  │    Current prompt: {candidate}                                 │  │
│  │                                                                │  │
│  │    Evaluation results on {minibatch_size} samples:             │  │
│  │    - Case 1: {record_1}   (success or failure)                 │  │
│  │    - Case 2: {record_2}   (success or failure)                 │  │
│  │    - Case 3: {record_3}   (success or failure)                 │  │
│  │                                                                │  │
│  │    Improve the prompt based on the feedback above.             │  │
│  │  """                                                           │  │
│  │                                                                │  │
│  │  new_prompt = reflection_lm(prompt)  # stronger LLM            │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                  ↓                                   │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Step 3: return the new candidate                              │  │
│  │  new_candidate = {                                             │  │
│  │    "system_prompt": new_prompt                                 │  │
│  │  }                                                             │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
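
The three steps in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not GEPA's implementation: the `reflect` function, the trajectory keys (`input`, `response`, `feedback`), and `fake_reflection_lm` are hypothetical names, and the stand-in LM returns a fixed string instead of calling a model.

```python
from typing import Callable


def reflect(
    candidate: dict[str, str],
    trajectories: list[dict[str, str]],
    reflection_lm: Callable[[str], str],
    component: str = "system_prompt",
) -> dict[str, str]:
    # Step 1: one record per evaluated case, successes and failures alike.
    records = [
        {
            "Inputs": t["input"],
            "Generated Outputs": t["response"],
            "Feedback": t["feedback"],
        }
        for t in trajectories
    ]

    # Step 2: assemble the reflection prompt and call the reflection LM.
    cases = "\n".join(f"- Case {i}: {r}" for i, r in enumerate(records, start=1))
    prompt = (
        f"Current prompt: {candidate[component]}\n\n"
        f"Evaluation results on {len(records)} samples:\n{cases}\n\n"
        "Improve the prompt based on the feedback above."
    )
    new_text = reflection_lm(prompt)

    # Step 3: return a new candidate; the original dict is left untouched.
    return {**candidate, component: new_text}


def fake_reflection_lm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call: always returns the same text.
    return "You are a careful assistant. Double-check arithmetic."


old = {"system_prompt": "You are a helpful assistant."}
trajs = [
    {
        "input": "1+1=?",
        "response": "The answer is 3",
        "feedback": "Incorrect. The correct answer is 2.",
    }
]
new = reflect(old, trajs, fake_reflection_lm)
print(new["system_prompt"])
```

Passing the LM in as a plain callable keeps the sketch testable without network access; any function that maps a prompt string to a response string would fit.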

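The reflective dataset in detail

Reflective dataset = successful + failed cases

Important: the reflective dataset is not limited to failures. It contains every evaluated case, successes and failures alike.

Why include successful cases?

1. Learn what works
   - See what a correct response looks like
   - Preserve behavior that is already good
2. Learn by contrast
   - Compare successes against failures
   - Identify the differences that matter
3. Avoid overcorrection
   - Looking only at failures can push the prompt too far
   - Successful cases provide balance

How the reflective dataset is generated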

┌──────────────────────────────────────────────────────────────────────┐
│                   Building the reflective dataset                    │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. Evaluation (on a training-set minibatch)                         │
│     ┌─────────────────────────────────────────────────────────────┐  │
│     │  minibatch = [sample1, sample2, sample3]  # usually 3       │  │
│     │                                                             │  │
│     │  for each sample:                                           │  │
│     │    1. Call the task LLM to get a response (LLM call)        │  │
│     │    2. Evaluator computes score + feedback (no LLM call)     │  │
│     │    3. Record the result in a trajectory                     │  │
│     └─────────────────────────────────────────────────────────────┘  │
│                                  ↓                                   │
│  2. Build the reflective dataset                                     │
│     ┌─────────────────────────────────────────────────────────────┐  │
│     │  for each trajectory:                                       │  │
│     │    record = {                                               │  │
│     │      "Inputs": traj["data"]["input"],                       │  │
│     │      "Generated Outputs": traj["full_assistant_response"],  │  │
│     │      "Feedback": traj["feedback"],  # from the evaluator    │  │
│     │    }                                                        │  │
│     └─────────────────────────────────────────────────────────────┘  │
│                                  ↓                                   │
│  3. Send to the reflection LLM                                       │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

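Where feedback comes from

Feedback is generated by the evaluator (no LLM call)

Feedback is produced during the evaluation phase by the evaluator, which fills in a text template. No LLM is called.

Evaluator code that generates feedback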

# src/gepa/adapters/default_adapter/default_adapter.py:69-84
class ContainsAnswerEvaluator:
    """Default evaluator: checks whether the answer appears in the response."""

    def __call__(self, data, response):
        # Decide correctness with a substring match
        is_correct = data["answer"] in response
        score = 1.0 if is_correct else 0.0

        # Build feedback from a template, depending on correctness (no LLM call)
        if is_correct:
            feedback = f"The response is correct. Includes '{data['answer']}'"
        else:
            feedback = (
                f"The response is incorrect. "
                f"The correct answer is '{data['answer']}'. "
                "Ensure that the correct answer is included in the response exactly as it is."
            )
            # Append extra context when the sample provides it
            if data.get("additional_context"):
                feedback += f" Context: {data['additional_context']}"

        return EvaluationResult(score=score, feedback=feedback)

Feedback template examples

Scenario	Feedback template
Correct	`"The response is correct. Includes '{answer}'"`
Incorrect	`"The response is incorrect. The correct answer is '{answer}'. Ensure that the correct answer is included in the response exactly as it is."`
Wrong format	`"Wrong format. Expected: {format}, Got: {output}"`
Timeout	`"Timeout after {seconds}s. Need faster execution."`
Overconfident	`"WRONG — 99% certainty on '{wrong}' but correct is '{right}'. Prompt is misleading."`
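
As a rough illustration of how templates like these could be wired into a custom evaluator, the sketch below checks the output format first and only then the answer. Everything in it is an assumption made for the example: `EvalResult` is a stand-in for whatever result type the library expects, and the `Answer: N` output convention is invented here.

```python
import re
from dataclasses import dataclass


@dataclass
class EvalResult:
    # Hypothetical stand-in for the library's evaluation result type.
    score: float
    feedback: str


def integer_answer_evaluator(data: dict, response: str) -> EvalResult:
    """Rule-based evaluator: expects the response to end with 'Answer: N'."""
    match = re.search(r"Answer:\s*(-?\d+)\s*$", response)
    if match is None:
        # Format problems get their own template so the reflection LM can tell
        # "wrong shape" apart from "wrong value".
        return EvalResult(0.0, f"Wrong format. Expected: 'Answer: N', Got: {response!r}")
    if match.group(1) == data["answer"]:
        return EvalResult(1.0, f"The response is correct. Includes '{data['answer']}'")
    return EvalResult(
        0.0,
        f"The response is incorrect. The correct answer is '{data['answer']}'.",
    )


sample = {"input": "2+2=?", "answer": "4"}
results = [
    integer_answer_evaluator(sample, r)
    for r in ["Answer: 4", "Answer: 5", "It is four."]
]
for result in results:
    print(result.score, result.feedback)
```

No LLM is involved at any point: the feedback text is chosen purely by rules, which keeps evaluation cheap and deterministic.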

Code implementation

Building the reflective dataset

# src/gepa/adapters/default_adapter/default_adapter.py
def make_reflective_dataset(
    self,
    candidate: dict[str, str],
    eval_batch: EvaluationBatch,
    components_to_update: list[str]
) -> Mapping[str, Sequence[Mapping[str, Any]]]:
    """Build the reflective dataset."""

    items = []
    trajectories = eval_batch.trajectories  # every evaluated case

    for traj in trajectories:
        record = {
            "Inputs": traj["data"]["input"],                       # input
            "Generated Outputs": traj["full_assistant_response"],  # LLM output
            "Feedback": traj["feedback"]                           # evaluator feedback
        }
        items.append(record)

    return {"system_prompt": items}
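
Before the records reach the reflection LLM they have to be flattened into text. The sketch below shows one possible layout; the `render_records` helper and its heading format are hypothetical and are not taken from the GEPA source.

```python
def render_records(records: list[dict[str, str]]) -> str:
    """Flatten reflective-dataset records into one text block (hypothetical layout)."""
    blocks = []
    for i, record in enumerate(records, start=1):
        lines = [f"# Example {i}"]
        # One sub-heading per field, in the order the record defines them.
        lines += [f"## {key}\n{value}" for key, value in record.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


reflective_dataset = {
    "system_prompt": [
        {
            "Inputs": "1+1=?",
            "Generated Outputs": "The answer is 3",
            "Feedback": "The response is incorrect. The correct answer is '2'.",
        },
        {
            "Inputs": "3+3=?",
            "Generated Outputs": "The answer is 6",
            "Feedback": "The response is correct. Includes '6'",
        },
    ]
}

text = render_records(reflective_dataset["system_prompt"])
print(text)
```

Keeping the field names ("Inputs", "Generated Outputs", "Feedback") visible in the rendered text lets the prompt explain what each field means.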

Generating feedback during evaluation

# src/gepa/adapters/default_adapter/default_adapter.py:139-158
# Abridged: litellm_requests and trajectories are set up earlier in the method (not shown).
def evaluate(self, batch, candidate, capture_traces=False):
    # 1. Call the LLM to generate responses (LLM call)
    responses = self._lm.batch_complete(litellm_requests)

    # 2. Evaluate each response (no LLM call)
    for data, response in zip(batch, responses):
        eval_result = self.evaluator(data, response)
        score = eval_result.score
        feedback = eval_result.feedback  # feedback is generated here

        # 3. Save to the trajectory
        if trajectories is not None:
            trajectories.append({
                "data": data,                        # original input
                "full_assistant_response": response, # LLM response
                "feedback": feedback                 # evaluator feedback
            })

Complete reflective dataset example

Input data

# The minibatch has 3 samples
minibatch = [
    {"input": "1+1=?", "answer": "2"},
    {"input": "2+2=?", "answer": "4"},
    {"input": "3+3=?", "answer": "6"}
]

# Responses from the LLM
responses = [
    "The answer is 2",    # correct
    "The answer is 5",    # incorrect
    "The answer is 6"     # correct
]

Reflective dataset

reflective_dataset = {
    "system_prompt": [
        {
            "Inputs": "1+1=?",
            "Generated Outputs": "The answer is 2",
            "Feedback": "The response is correct. Includes '2'"
        },
        {
            "Inputs": "2+2=?",
            "Generated Outputs": "The answer is 5",
            "Feedback": "The response is incorrect. The correct answer is '4'. Ensure that the correct answer is included in the response exactly as it is."
        },
        {
            "Inputs": "3+3=?",
            "Generated Outputs": "The answer is 6",
            "Feedback": "The response is correct. Includes '6'"
        }
    ]
}

Note: the reflective dataset contains 2 successful cases and 1 failed case.
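
The dataset above can be reproduced from the minibatch and the responses with the same substring rule the default evaluator uses. The helper name `contains_answer_feedback` is made up for this sketch, and it returns a plain tuple rather than the library's result object.

```python
def contains_answer_feedback(data: dict, response: str) -> tuple[float, str]:
    """Substring check plus templated feedback, mirroring the evaluator shown earlier."""
    if data["answer"] in response:
        return 1.0, f"The response is correct. Includes '{data['answer']}'"
    return 0.0, (
        "The response is incorrect. "
        f"The correct answer is '{data['answer']}'. "
        "Ensure that the correct answer is included in the response exactly as it is."
    )


minibatch = [
    {"input": "1+1=?", "answer": "2"},
    {"input": "2+2=?", "answer": "4"},
    {"input": "3+3=?", "answer": "6"},
]
responses = ["The answer is 2", "The answer is 5", "The answer is 6"]

records, scores = [], []
for data, response in zip(minibatch, responses):
    score, feedback = contains_answer_feedback(data, response)
    scores.append(score)
    # Every case is recorded, whether it passed or failed.
    records.append(
        {"Inputs": data["input"], "Generated Outputs": response, "Feedback": feedback}
    )

reflective_dataset = {"system_prompt": records}
print(scores)  # [1.0, 0.0, 1.0]
```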

Prompt sent to the reflection LLM

# The actual reflection prompt
prompt = """
You are given the following information about the performance of a candidate instruction on a small minibatch of examples.

## Current Instruction

{candidate['system_prompt']}


## Minibatch Evaluation Results
Here are the results of the current instruction on a small minibatch of training examples, with detailed feedback:

{dataset_with_feedback}


In the above, "Inputs" are the inputs to the system, "Generated Outputs" are the outputs produced by using the current instruction, and "Feedback" contains diagnostic feedback on the performance.

## Task
Analyze the evaluation results and propose an improved version of the instruction that addresses the failures and maintains the successes.

Return ONLY a valid JSON object with the new instruction.
"""

Reflection prompt template

REFINER_PROMPT_TEMPLATE = """
You are refining a candidate to improve its performance.

## Instructions
{refiner_prompt}

## Current Candidate (JSON)

{candidate_to_improve}


## Evaluation History
The following shows all evaluation attempts:

{evaluation_feedback}


## Task
Analyze the evaluation history and propose an improved version.
Return ONLY a valid JSON object with the improved parameters.
"""

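Analogy with neural networks

Aspect	Neural network	GEPA
Goal	Understand "why it failed"	Understand "why it failed"
Method	Backpropagation	LLM reflection
Gradient	`∇L = ∂L/∂W` (analytic gradient)	Reflective data (textual feedback)
Update	`W ← W - α·∇L`	`prompt ← reflect(prompt, feedback)`
Information type	Numeric gradients	Semantic understanding
Training data	All training samples (successes + failures)	Reflective dataset (successes + failures)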
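
To make the two update rules concrete, the toy sketch below puts one numeric gradient step next to one text update. Both functions are illustrative assumptions: the loss `L(W) = (W - 3)^2` is chosen only because its gradient is easy to write down, and `reflect_step` uses a hard-coded rule where a real system would call an LLM.

```python
def gradient_step(w: float, lr: float = 0.1) -> float:
    """One update W <- W - lr * dL/dW for the toy loss L(W) = (W - 3)^2."""
    grad = 2.0 * (w - 3.0)
    return w - lr * grad


def reflect_step(prompt: str, feedback: list[str]) -> str:
    """Stand-in for prompt <- reflect(prompt, feedback); a real system calls an LLM."""
    failures = [f for f in feedback if f.startswith("Wrong")]
    if not failures:
        return prompt  # nothing to fix, keep the prompt unchanged
    return prompt + " For math problems, show your work and double-check the result."


w = gradient_step(0.0)  # moves from 0.0 toward the minimum at 3.0
prompt = reflect_step(
    "You are a helpful assistant.",
    ["Wrong. Correct answer is 2.", "Correct."],
)
print(round(w, 6))
print(prompt)
```

The numeric step changes a weight by a small amount in a computed direction; the text step rewrites the prompt in a direction suggested by the feedback.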

A worked example

Scenario: math problems

# Current prompt
candidate = {
    "system_prompt": "You are a helpful assistant."
}

# Reflective dataset (successes and failures)
reflective_dataset = {
    "system_prompt": [
        {
            "Inputs": "1+1=?",
            "Generated Outputs": "The answer is 3",
            "Feedback": "Wrong. Correct answer is 2. The model made a calculation error."
        },
        {
            "Inputs": "What is the capital of France?",
            "Generated Outputs": "Paris",
            "Feedback": "Correct. The response includes the correct answer."
        },
        {
            "Inputs": "2+2=?",
            "Generated Outputs": "The answer is 5",
            "Feedback": "Wrong. Correct answer is 4. Another calculation error."
        }
    ]
}

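LLM analysis

The reflection LLM would analyze the records along these lines:

- Math problems fail (2 of 2, which is 2 of the 3 cases overall)
- The general-knowledge question succeeds (1 of 1)
- Cause of failure: calculation errors
- Direction for improvement: emphasize working step by step and verifying the result

Improvement generated by the LLM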

new_prompt = reflection_lm(prompt)
# => "You are a helpful assistant. For math problems, show your work step-by-step and double-check your calculations."

Types of reflective data

1. Basic feedback

{
    "Inputs": "...",
    "Generated Outputs": "...",
    "Feedback": "Wrong. Correct answer is ..."
}

2. Confidence feedback

{
    "Inputs": "...",
    "Generated Outputs": "...",
    "Feedback": "WRONG — model has 99% certainty on 'A' but correct is 'B'. The prompt is actively misleading it."
}

3. Execution traces

{
    "Inputs": "...",
    "Generated Outputs": "...",
    "Feedback": "Error: Division by zero at line 42",
    "Trace": "Stack trace..."
}

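Key advantages

Property	Backpropagation	LLM reflection
Interpretability	Numeric gradients	Human-readable text
Applicability	Differentiable functions	Any text component
Information type	Local gradients	Global semantic understanding
Compute cost	Low	Medium (LLM calls)
Data usage	All samples	All evaluated cases

The two LLM calls compared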

┌──────────────────────────────────────────────────────────────────────┐
│                      The two LLM calls in GEPA                       │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Call 1: task LLM (evaluation / forward pass)                        │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  model="gpt-4o-mini"          # smaller model                  │  │
│  │  messages = [system_prompt, user_input]                        │  │
│  │  → produces the prediction (the answer)                        │  │
│  │  → evaluator computes score + feedback (no LLM call)           │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                  ↓                                   │
│  Call 2: reflection LLM (reflection / backward pass)                 │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  model="gpt-4o" / "claude-sonnet-4"  # stronger model          │  │
│  │  prompt = current prompt + reflective dataset                  │  │
│  │           (all evaluated cases, successes + failures)          │  │
│  │  → produces the improved prompt                                │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

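Core concept: Actionable Side Information (ASI)

ASI is a key GEPA innovation: diagnostic feedback extracted from the execution environment.

Traditional approach: Loss = scalar (a single number)
GEPA approach: ASI = {
    "Inputs": "...",            # inputs
    "Generated Outputs": "...", # outputs
    "Feedback": "...",          # diagnostic feedback
    "error_message": "...",     # error message
    "profiling_data": "...",    # performance data
    "confidence_scores": [...], # confidence scores
    ...
}

ASI acts as a "textual gradient" that guides the LLM toward targeted improvements.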
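
One way to collect this kind of side information is to run the candidate and capture whatever the environment reports when it fails. The sketch below is an assumption-laden illustration: `run_with_asi` and `divide` are hypothetical helpers, and the record keys simply follow the ones used in this document.

```python
import traceback
from typing import Any, Callable


def run_with_asi(fn: Callable[..., Any], inputs: tuple) -> tuple[float, dict[str, str]]:
    """Run fn(*inputs) and return (score, record) with diagnostic text attached."""
    record = {"Inputs": repr(inputs)}
    try:
        output = fn(*inputs)
    except Exception as exc:
        # On failure, keep the error message and the stack trace as text so a
        # reflection LM could read them.
        record["Generated Outputs"] = ""
        record["Feedback"] = f"Error: {type(exc).__name__}: {exc}"
        record["Trace"] = traceback.format_exc()
        return 0.0, record
    record["Generated Outputs"] = repr(output)
    record["Feedback"] = "Ran without errors."
    return 1.0, record


def divide(a: float, b: float) -> float:
    return a / b


ok_score, ok_record = run_with_asi(divide, (6, 3))
bad_score, bad_record = run_with_asi(divide, (1, 0))
print(ok_score, ok_record["Feedback"])
print(bad_score, bad_record["Feedback"])
```

A bare score of 0.0 would say only that the second call failed; the attached message and trace say why, which is the information a scalar loss discards.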

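Key takeaways

1. Reflection is the counterpart of backpropagation
   - Same goal: understand "why it failed"
   - Different method: LLM understanding vs. mathematical computation
2. Reflective data = all evaluated cases
   - Includes successes and failures
   - Usually 3 samples (minibatch_size)
   - Given to the LLM to analyze
3. Feedback is generated by the evaluator
   - Uses text templates
   - Does not call an LLM
   - Differs depending on correctness
4. A stronger LLM does the reflecting
   - reflection_lm is usually stronger than task_lm
   - For example: GPT-4o vs. GPT-4o-mini
5. Iterative improvement
   - Each accepted candidate becomes a starting point for later iterations
   - Lessons from earlier failures persist in the prompt text that later candidates inherit

The full flow after reflection

Reflection yields a new prompt. That prompt then goes through the accept/reject test, full validation-set evaluation, and the Pareto front update.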

┌──────────────────────────────────────────────────────────────────────┐
│                      Full flow after reflection                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Reflection finished                                                 │
│  Result: new_candidate = the improved prompt                         │
│                                                                      │
│                                  ↓                                   │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Stage A: accept/reject test                                   │  │
│  │                                                                │  │
│  │  old_sum = sum(scores_before)  # parent, on the minibatch      │  │
│  │  new_sum = sum(scores_after)   # new candidate, same minibatch │  │
│  │                                                                │  │
│  │  if new_sum > old_sum:  # must strictly improve                │  │
│  │      → accept: continue to full evaluation                     │  │
│  │  else:                                                         │  │
│  │      → reject: discard the new candidate                       │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                  ↓                                   │
│                            (if accepted)                             │
│                                  ↓                                   │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Stage B: full evaluation on the validation set                │  │
│  │                                                                │  │
│  │  for val_sample in valset:  # e.g. 50 samples                  │  │
│  │      1. Build the messages                                     │  │
│  │         messages = [new_prompt, val_sample.input]              │  │
│  │      2. Call the task LLM (LLM call)                           │  │
│  │         response = task_lm.generate(messages)                  │  │
│  │      3. Evaluator computes the score (no LLM call)             │  │
│  │         score = evaluator(val_sample, response).score          │  │
│  │                                                                │  │
│  │  val_score = mean(all_scores)  # mean validation score         │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                  ↓                                   │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Stage C: add to the candidate pool                            │  │
│  │                                                                │  │
│  │  state.program_candidates.append(new_candidate)                │  │
│  │  state.program_full_scores_val_set.append(val_score)           │  │
│  │  state.parent_program_for_candidate[new_idx] = parent_ids      │  │
│  │                                                                │  │
│  │  The pool now holds 2 candidates:                              │  │
│  │  - candidate 0: score 0.70 (old)                               │  │
│  │  - candidate 1: score 0.82 (new, just added)                   │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                  ↓                                   │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Stage D: update the Pareto front                              │  │
│  │                                                                │  │
│  │  best = state.pareto_front_valset                              │  │
│  │  front = state.program_at_pareto_front_valset                  │  │
│  │  for val_id, score in val_scores.items():                      │  │
│  │      if score > best[val_id]:  # new candidate is better       │  │
│  │          best[val_id] = score                                  │  │
│  │          front[val_id] = {new_idx}                             │  │
│  │      elif score == best[val_id]:  # tie                        │  │
│  │          front[val_id].add(new_idx)                            │  │
│  │      # otherwise the front is unchanged                        │  │
│  │                                                                │  │
│  │  Example front:                                                │  │
│  │  {                                                             │  │
│  │    "val_1": {1},     # candidate 1 is best on val_1            │  │
│  │    "val_2": {0, 1},  # candidates 0 and 1 tie on val_2         │  │
│  │    "val_3": {0},     # candidate 0 is best on val_3            │  │
│  │  }                                                             │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                  ↓                                   │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Stage E: check the stop condition                             │  │
│  │                                                                │  │
│  │  if stop_callback.should_stop(state):                          │  │
│  │      → stop the optimization                                   │  │
│  │  else:                                                         │  │
│  │      → continue with the next iteration                        │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Stage A: accept/reject test

Code implementation

# src/gepa/core/engine.py:287-322
def _accept_reflective_proposal(self, proposal, iteration, state):
    """Decide whether to accept the new candidate."""

    # Sum the minibatch scores before and after
    old_sum = sum(proposal.subsample_scores_before or [])
    new_sum = sum(proposal.subsample_scores_after or [])

    # Apply the acceptance criterion
    if not self.acceptance_criterion.should_accept(proposal, state):
        # Rejected
        self.logger.log(f"New score {new_sum} not better than {old_sum}, skipping")
        notify_callbacks("on_candidate_rejected", ...)
        return False

    # Accepted
    self.logger.log(f"New score {new_sum} is better than {old_sum}. Continue to full eval.")
    return True

Acceptance criterion types

# src/gepa/strategies/acceptance.py

# Strict improvement: the new score must be strictly greater than the old one
class StrictImprovementAcceptance:
    def should_accept(self, proposal, state):
        old_sum = sum(proposal.subsample_scores_before or [])
        new_sum = sum(proposal.subsample_scores_after or [])
        return new_sum > old_sum

# Ties allowed: the new score must be greater than or equal to the old one
class ImprovementOrEqualAcceptance:
    def should_accept(self, proposal, state):
        old_sum = sum(proposal.subsample_scores_before or [])
        new_sum = sum(proposal.subsample_scores_after or [])
        return new_sum >= old_sum
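
The two criteria differ only when the sums tie. The sketch below shows this with a hypothetical `Proposal` dataclass that carries just the two score lists; it stands in for the library's proposal object and the functions are simplified re-statements of the rules above.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    # Hypothetical minimal stand-in for the proposal object used above.
    subsample_scores_before: list[float] = field(default_factory=list)
    subsample_scores_after: list[float] = field(default_factory=list)


def strict_improvement(p: Proposal) -> bool:
    return sum(p.subsample_scores_after) > sum(p.subsample_scores_before)


def improvement_or_equal(p: Proposal) -> bool:
    return sum(p.subsample_scores_after) >= sum(p.subsample_scores_before)


improved = Proposal([0.0, 1.0, 0.0], [1.0, 1.0, 1.0])  # 1.0 -> 3.0
tie = Proposal([1.0, 0.0, 1.0], [0.0, 1.0, 1.0])       # 2.0 -> 2.0, different cases solved

print(strict_improvement(improved), improvement_or_equal(improved))  # True True
print(strict_improvement(tie), improvement_or_equal(tie))            # False True
```

In the tie case the candidate solves a different set of samples without raising the total. The strict rule discards it, while the permissive rule lets it proceed to validation, at the cost of spending more evaluation budget.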

Stage B: full evaluation on the validation set

Code implementation

# src/gepa/core/engine.py:154-173
def _evaluate_on_valset(self, program, state):
    """Evaluate the new candidate on the validation set."""

    # 1. Get the IDs of the validation samples to evaluate
    val_ids = self.val_evaluation_policy.get_eval_batch(valset, state)
    # e.g. [0, 1, 2, ..., 49] (all validation samples)

    # 2. Batch evaluation (uses the cache)
    outputs_by_val_idx, scores_by_val_idx, _, _ = state.cached_evaluate_full(
        program,            # new candidate
        list(val_ids),      # validation sample IDs
        valset.fetch,       # function that fetches sample data
        self.evaluator      # evaluator function
    )

    # 3. Return the evaluation result
    return ValsetEvaluation(
        outputs_by_val_id=outputs_by_val_idx,   # {val_id: output}
        scores_by_val_id=scores_by_val_idx,     # {val_id: score}
        objective_scores_by_val_id=...          # {val_id: {objective: score}}
    )

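Evaluation example

# Evaluate on the validation set.
# task_lm and new_candidate are assumed to be defined elsewhere.
valset = [
    {"input": "1+1=?", "answer": "2"},
    {"input": "Paris capital?", "answer": "Paris"},
    {"input": "2+2=?", "answer": "4"}
]

val_scores = {}
for val_id, val in enumerate(valset):
    # Call the task LLM (LLM call)
    response = task_lm.generate(new_candidate, val["input"])

    # Evaluator computes the score (no LLM call)
    score = 1.0 if val["answer"] in response else 0.0

    # Store the result by sample index (the sample dict itself is not hashable)
    val_scores[val_id] = score

# Results:
# sample 0 ("1+1=?"): response="The answer is 2" → score=1.0
# sample 1 ("Paris capital?"): response="Paris" → score=1.0
# sample 2 ("2+2=?"): response="The answer is 4" → score=1.0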

Stage C: add to the candidate pool

Code implementation

# src/gepa/core/state.py:527-549
def update_state_with_new_program(
    self,
    new_program,          # new candidate
    valset_evaluation,    # validation-set evaluation result
    ...
):
    # 1. Add the new candidate to the pool
    new_program_idx = len(self.program_candidates)
    self.program_candidates.append(dict(new_program))

    # 2. Store the score for each validation sample
    valset_scores = dict(valset_evaluation.scores_by_val_id)
    # e.g. {0: 1.0, 1: 0.0, 2: 1.0, ..., 49: 1.0}

    self.prog_candidate_val_subscores.append(valset_scores)
    # prog_candidate_val_subscores[new_program_idx] = valset_scores

    # 3. The mean validation score is not stored here;
    # it is computed in get_program_average_val_subset:
    # val_score = mean(valset_scores.values())

    return new_program_idx

Candidate pool example

# Before adding
state.program_candidates = [
    {"system_prompt": "You are helpful."}  # candidate 0
]
state.program_full_scores_val_set = [0.70]

# After adding
state.program_candidates = [
    {"system_prompt": "You are helpful."},     # candidate 0
    {"system_prompt": "Show your work..."}     # candidate 1 (new)
]
state.program_full_scores_val_set = [0.70, 0.82]

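Stage D: update the Pareto front

Code implementation

# src/gepa/core/state.py:486-510
def _update_pareto_front_for_val_id(
    self,
    val_id,               # validation sample ID
    score,                # new candidate's score on this sample
    program_idx,          # index of the new candidate
    ...
):
    # 1. Look up the current best score for this sample
    prev_score = self.pareto_front_valset.get(val_id, float("-inf"))

    # 2. Compare and update the front
    if score > prev_score:  # the new candidate is better
        # Record the new best score
        self.pareto_front_valset[val_id] = score
        # The new candidate is the only one on the front for this sample
        self.program_at_pareto_front_valset[val_id] = {program_idx}

    elif score == prev_score:  # tie
        # Get the existing front for this sample
        pareto_front = self.program_at_pareto_front_valset.setdefault(val_id, set())
        # Add the new candidate to it
        pareto_front.add(program_idx)

    # Otherwise the existing candidates stay best and nothing changes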

Pareto front update example

# Suppose the validation set has 3 samples

# Front before the update
pareto_front_valset = {
    0: 1.0,  # sample 0: candidate 0 scored 1.0
    1: 0.0,  # sample 1: candidate 0 scored 0.0
    2: 1.0,  # sample 2: candidate 0 scored 1.0
}

program_at_pareto_front_valset = {
    0: {0},  # sample 0: candidate 0 is best
    1: {0},  # sample 1: candidate 0 is best
    2: {0},  # sample 2: candidate 0 is best
}

# Scores of the new candidate 1
candidate_1_scores = {0: 1.0, 1: 1.0, 2: 1.0}

# Update sample 1
prev_score = 0.0
new_score = 1.0
if new_score > prev_score:
    pareto_front_valset[1] = 1.0
    program_at_pareto_front_valset[1] = {1}  # candidate 1 alone

# Update samples 0 and 2 (equal scores, so candidate 1 joins the front)
# Candidates 0 and 1 tie as best on these samples

# Front after the update
pareto_front_valset = {
    0: 1.0,
    1: 1.0,  # candidate 1 improved this sample
    2: 1.0
}

program_at_pareto_front_valset = {
    0: {0, 1},  # sample 0: both tie as best
    1: {1},     # sample 1: candidate 1 is best
    2: {0, 1}   # sample 2: both tie as best
}
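
The walkthrough above can be checked with a small standalone function that applies the same three-way rule (better, tie, worse) to every sample. The function name `update_pareto_front` and its plain-dict arguments are choices made for this sketch rather than the library's interface.

```python
def update_pareto_front(
    best_scores: dict[int, float],
    best_programs: dict[int, set[int]],
    program_idx: int,
    scores: dict[int, float],
) -> None:
    """Fold one candidate's per-sample scores into the front, in place."""
    for val_id, score in scores.items():
        prev = best_scores.get(val_id, float("-inf"))
        if score > prev:
            # Strictly better: this candidate replaces whoever held the sample.
            best_scores[val_id] = score
            best_programs[val_id] = {program_idx}
        elif score == prev:
            # Tie: this candidate joins the existing holders.
            best_programs.setdefault(val_id, set()).add(program_idx)
        # Worse: nothing changes for this sample.


best_scores: dict[int, float] = {}
best_programs: dict[int, set[int]] = {}

update_pareto_front(best_scores, best_programs, 0, {0: 1.0, 1: 0.0, 2: 1.0})
update_pareto_front(best_scores, best_programs, 1, {0: 1.0, 1: 1.0, 2: 1.0})

print(best_scores)    # {0: 1.0, 1: 1.0, 2: 1.0}
print(best_programs)  # {0: {0, 1}, 1: {1}, 2: {0, 1}}
```

Starting from empty dictionaries works because the `-inf` default makes the first candidate strictly better on every sample.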

Pareto front data structures

# Storage layout
state.pareto_front_valset = {
    val_id: best_score  # best score seen on this sample
}

state.program_at_pareto_front_valset = {
    val_id: {program_idx_1, program_idx_2, ...}  # candidates that are best on this sample
}

# Complete example
pareto_front_valset = {
    0: 1.0,  # best score on sample 0 is 1.0
    1: 1.0,  # best score on sample 1 is 1.0
    2: 0.5,  # best score on sample 2 is 0.5
    3: 0.8,  # best score on sample 3 is 0.8
}

program_at_pareto_front_valset = {
    0: {0, 1},   # on sample 0, candidates 0 and 1 are both best
    1: {1},      # on sample 1, only candidate 1 is best
    2: {0},      # on sample 2, only candidate 0 is best
    3: {1},      # on sample 3, only candidate 1 is best
}
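
One thing this structure makes easy is counting how many samples each candidate is best on, and then favoring candidates with more wins when picking a parent for the next iteration. The sketch below illustrates that idea under stated assumptions: the win-weighted sampling is a plausible use of the data, not a description of GEPA's actual selection code, and the seed is fixed only to make the run repeatable.

```python
import random
from collections import Counter

program_at_pareto_front_valset = {
    0: {0, 1},
    1: {1},
    2: {0},
    3: {1},
}

# Count, for each candidate, the number of samples on which it is on the front.
wins = Counter(
    idx
    for programs in program_at_pareto_front_valset.values()
    for idx in programs
)
print(dict(wins))

# Sample a parent with probability proportional to its win count (assumed policy).
rng = random.Random(0)
candidates = sorted(wins)
chosen = rng.choices(candidates, weights=[wins[c] for c in candidates], k=1)[0]
print(chosen)
```

Candidate 0 stays eligible even though candidate 1 wins more samples, because it is still the only best candidate on sample 2. Keeping such specialists in play is the reason to track the front per sample instead of keeping a single best score.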

Complete example

Scenario: optimizing a prompt for math problems

# ===== Reflection finished =====
new_candidate = {
    "system_prompt": "You are a helpful assistant. For math problems, show your work step-by-step."
}

# Minibatch scores (3 training samples)
scores_before = [0.0, 1.0, 0.0]  # old candidate: 1 correct
scores_after = [1.0, 1.0, 1.0]   # new candidate: 3 correct

# ===== Stage A: accept/reject test =====
old_sum = sum(scores_before)  # 0.0 + 1.0 + 0.0 = 1.0
new_sum = sum(scores_after)   # 1.0 + 1.0 + 1.0 = 3.0
# 3.0 > 1.0 → accept

# ===== Stage B: full evaluation on the validation set =====
valset = [...]  # 50 validation samples
val_scores = []

for sample in valset:
    response = task_lm(new_candidate, sample)   # LLM call
    score = evaluator(sample, response).score   # rule-based, no LLM call
    val_scores.append(score)

val_score = mean(val_scores)  # e.g. 0.82

# ===== Stage C: add to the candidate pool =====
state.program_candidates.append(new_candidate)
state.program_full_scores_val_set.append(0.82)
# The pool now holds 2 candidates:
# - candidate 0: score 0.70
# - candidate 1: score 0.82 (new)

# ===== Stage D: update the Pareto front =====
for val_id, score in candidate_1_scores.items():
    old_best = pareto_front_valset.get(val_id, float("-inf"))
    if score > old_best:
        # The new candidate takes over this sample's front
        pareto_front_valset[val_id] = score
        program_at_pareto_front_valset[val_id] = {1}
    elif score == old_best:
        # Old and new candidates share the front
        program_at_pareto_front_valset[val_id].add(1)

# ===== Stage E: check the stop condition =====
# (this check runs inside the optimization loop)
if state.total_num_evals >= max_metric_calls:  # e.g. 150
    return state  # stop
else:
    continue  # go on to the next iteration

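Key takeaways

Stage	Description	Neural-network analogy
A. Accept/reject	Compare minibatch scores	Loss check
B. Validation evaluation	Fully evaluate the new candidate	Validation run
C. Add to candidate pool	Store the new candidate	Save a model checkpoint
D. Update front	Update the Pareto front	Update the best model
E. Check stop	Check the stop condition	Early-stopping check

Key points about accept/reject

1. The acceptance condition is strict
   - The candidate must improve on the training-set minibatch
   - Default: new_sum > old_sum
2. Validation evaluation is complete
   - Evaluate on every validation sample
   - Compute the mean score
3. The Pareto front is updated dynamically
   - Updated after every accepted candidate
   - Tracks the best candidates for each validation sample
4. Iterate until stopped
   - max_metric_calls is reached
   - Or another stop condition is met

Next steps

Pareto front - how multiple best solutions are maintained
Comparison with neural networks - a deeper look at the analogy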

GEPA 反思流程详解

反思是 GEPA 的核心创新，类似于神经网络的反向传播，但使用 LLM 的语言理解能力代替数学梯度。

反思流程图

┌─────────────────────────────────────────────────────────────────────────┐
│                         GEPA 反思流程                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  输入:                                                                  │
│    - candidate: 当前 prompt                                            │
│    - eval_batch: 评估结果 (分数 + 轨迹)                                │
│    - components_to_update: 需要更新的组件                               │
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │           Step 1: 构建反思数据集                                │   │
│  │  反思数据集 = 所有评估案例 (成功 + 失败)                         │   │
│  │                                                                 │   │
│  │  for 每个评估案例:                                               │   │
│  │    record = {                                                    │   │
│  │      "Inputs": "问题: 1+1=?",                                    │   │
│  │      "Generated Outputs": "答案是3",                             │   │
│  │      "Feedback": "错误！正确答案是2。..." ← 评估器生成           │   │
│  │    }                                                             │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              ↓                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │              Step 2: 生成改进建议 ✅ 调用 LLM                    │   │
│  │                                                                  │   │
│  │  prompt = f"""                                                   │   │
│  │    当前 prompt: {candidate}                                      │   │
│  │                                                                  │   │
│  │    以下是在 {minibatch_size} 个样本上的评估结果:                 │   │
│  │    - 案例 1: {record_1}  ← 可能成功或失败                        │   │
│  │    - 案例 2: {record_2}  ← 可能成功或失败                        │   │
│  │    - 案例 3: {record_3}  ← 可能成功或失败                        │   │
│  │                                                                  │   │
│  │    请基于以上反馈，改进 prompt。                                  │   │
│  │  """                                                             │   │
│  │                                                                  │   │
│  │  new_prompt = reflection_lm(prompt)  # 调用更强的 LLM          │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              ↓                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │              Step 3: 返回新候选者                                │   │
│  │  new_candidate = {                                              │   │
│  │    "system_prompt": new_prompt                                  │   │
│  │  }                                                               │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

反思数据集详解

反思数据集 = 成功 + 失败案例

重要：反思数据集不只是失败的案例，而是包含所有评估的案例（包括成功和失败）。

为什么包含成功案例？

学习成功模式
- 了解什么是对的
- 保持好的行为
对比学习
- 对比成功 vs 失败
- 理解关键差异
避免过度修正
- 只看失败可能会矫枉过正
- 成功案例提供平衡

How the reflective dataset is generated

┌─────────────────────────────────────────────────────────────────────────┐
│                    Reflective dataset generation                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  1. Evaluation phase (on a training-set minibatch)                      │
│     ┌───────────────────────────────────────────────────────────────┐   │
│     │  minibatch = [sample1, sample2, sample3]  (usually 3 samples) │   │
│     │                                                               │   │
│     │  for each sample:                                             │   │
│     │    1. Call the task LLM to generate a response   [LLM call]   │   │
│     │    2. Evaluator computes score and feedback      [no LLM call]│   │
│     │    3. Record the result in a trajectory                       │   │
│     └───────────────────────────────────────────────────────────────┘   │
│                              ↓                                          │
│  2. Build the reflective dataset                                        │
│     ┌───────────────────────────────────────────────────────────────┐   │
│     │  for each trajectory:                                         │   │
│     │    record = {                                                 │   │
│     │      "Inputs": traj["data"]["input"],                         │   │
│     │      "Generated Outputs": traj["full_assistant_response"],    │   │
│     │      "Feedback": traj["feedback"]  ← from the evaluator       │   │
│     │    }                                                          │   │
│     └───────────────────────────────────────────────────────────────┘   │
│                              ↓                                          │
│  3. Send to the reflection LLM                                          │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

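Where feedback comes from

Feedback is generated by the evaluator (no LLM call)

Feedback is produced during the evaluation phase. The evaluator fills in a text template and does not call an LLM.

Evaluator code that generates the feedback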

# src/gepa/adapters/default_adapter/default_adapter.py:69-84
class ContainsAnswerEvaluator:
    """Default evaluator: checks whether the answer appears in the response."""

    def __call__(self, data, response):
        # Decide correctness by substring match
        is_correct = data["answer"] in response
        score = 1.0 if is_correct else 0.0

        # Pick a feedback template based on correctness (no LLM call)
        if is_correct:
            feedback = f"The response is correct. Includes '{data['answer']}'"
        else:
            feedback = (
                f"The response is incorrect. "
                f"The correct answer is '{data['answer']}'. "
                "Ensure that the correct answer is included in the response exactly as it is."
            )
            # Append extra context when the sample provides it
            if data.get("additional_context"):
                feedback += f" Context: {data['additional_context']}"

        return EvaluationResult(score=score, feedback=feedback)
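
One consequence of the substring check above is worth seeing concretely: an answer of "2" also matches a response that says "12". The sketch below contrasts the substring check with a stricter variant. `token_match` is a hypothetical helper written for this illustration, not part of GEPA.

```python
import re


def substring_match(answer: str, response: str) -> bool:
    """The check used in the excerpt above: plain substring containment."""
    return answer in response


def token_match(answer: str, response: str) -> bool:
    """Hypothetical stricter check: the answer may not sit inside a longer word or number."""
    pattern = r"(?<!\w)" + re.escape(answer) + r"(?!\w)"
    return re.search(pattern, response) is not None


response = "The answer is 12"
print(substring_match("2", response))        # True: "2" is a substring of "12"
print(token_match("2", response))            # False: "2" is part of the token "12"
print(token_match("2", "The answer is 2."))  # True: trailing punctuation is not a word character
```

Whether the stricter check is appropriate depends on the task; for free-form answers a plain substring match may be the more forgiving choice.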

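Feedback template examples

The first two rows match the default evaluator shown above. The remaining rows illustrate what a custom evaluator might emit.

Scenario	Feedback template
Correct	`"The response is correct. Includes '{answer}'"`
Incorrect	`"The response is incorrect. The correct answer is '{answer}'. Ensure that the correct answer is included in the response exactly as it is."`
Wrong format	`"Wrong format. Expected: {format}, Got: {output}"`
Timeout	`"Timeout after {seconds}s. Need faster execution."`
Overconfident	`"WRONG — 99% certainty on '{wrong}' but correct is '{right}'. Prompt is misleading."`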

Code implementation

Building the reflective dataset

# src/gepa/adapters/default_adapter/default_adapter.py
def make_reflective_dataset(
    self,
    candidate: dict[str, str],
    eval_batch: EvaluationBatch,
    components_to_update: list[str]
) -> Mapping[str, Sequence[Mapping[str, Any]]]:
    """Build the reflective dataset."""

    items = []
    trajectories = eval_batch.trajectories  # every evaluated case, successes and failures

    for traj in trajectories:
        record = {
            "Inputs": traj["data"]["input"],           # input
            "Generated Outputs": traj["full_assistant_response"],  # task LLM output
            "Feedback": traj["feedback"]              # evaluator feedback
        }
        items.append(record)

    return {"system_prompt": items}

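Generating feedback during evaluation

# src/gepa/adapters/default_adapter/default_adapter.py:139-158 (abridged)
def evaluate(self, batch, candidate, capture_traces=False):
    # Collect trajectories only when traces are requested
    trajectories = [] if capture_traces else None

    # 1. Call the task LLM to generate responses (LLM call)
    #    litellm_requests is built from batch and candidate; construction omitted here
    responses = self._lm.batch_complete(litellm_requests)

    # 2. Evaluate each response (no LLM call)
    for data, response in zip(batch, responses):
        eval_result = self.evaluator(data, response)
        score = eval_result.score
        feedback = eval_result.feedback  # ← the feedback is produced here

        # 3. Save to the trajectory
        if trajectories is not None:
            trajectories.append({
                "data": data,                        # original input
                "full_assistant_response": response, # task LLM response
                "feedback": feedback                 # evaluator feedback
            })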

A complete reflective dataset example

Input data

# The minibatch has 3 samples
minibatch = [
    {"input": "1+1=?", "answer": "2"},
    {"input": "2+2=?", "answer": "4"},
    {"input": "3+3=?", "answer": "6"}
]

# Responses from the task LLM
responses = [
    "The answer is 2",    # correct
    "The answer is 5",    # incorrect
    "The answer is 6"     # correct
]

Reflective dataset

reflective_dataset = {
    "system_prompt": [
        {
            "Inputs": "1+1=?",
            "Generated Outputs": "The answer is 2",
            "Feedback": "The response is correct. Includes '2'"
        },
        {
            "Inputs": "2+2=?",
            "Generated Outputs": "The answer is 5",
            "Feedback": "The response is incorrect. The correct answer is '4'. Ensure that the correct answer is included in the response exactly as it is."
        },
        {
            "Inputs": "3+3=?",
            "Generated Outputs": "The answer is 6",
            "Feedback": "The response is correct. Includes '6'"
        }
    ]
}

Note: the reflective dataset contains 2 successful cases and 1 failed case.
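
The path from minibatch and responses to the dataset above can be reproduced end to end. The following is a minimal sketch, assuming the substring-match evaluator described earlier; `evaluate_case` and `build_reflective_dataset` are illustrative names, not GEPA functions.

```python
def evaluate_case(data: dict, response: str) -> tuple[float, str]:
    """Template-based scoring and feedback, no LLM call."""
    answer = data["answer"]
    if answer in response:
        return 1.0, f"The response is correct. Includes '{answer}'"
    return 0.0, (
        f"The response is incorrect. The correct answer is '{answer}'. "
        "Ensure that the correct answer is included in the response exactly as it is."
    )


def build_reflective_dataset(minibatch, responses, component="system_prompt"):
    """Keep every evaluated case, successes and failures alike."""
    records = []
    for data, response in zip(minibatch, responses):
        _score, feedback = evaluate_case(data, response)
        records.append({
            "Inputs": data["input"],
            "Generated Outputs": response,
            "Feedback": feedback,
        })
    return {component: records}


minibatch = [
    {"input": "1+1=?", "answer": "2"},
    {"input": "2+2=?", "answer": "4"},
    {"input": "3+3=?", "answer": "6"},
]
responses = ["The answer is 2", "The answer is 5", "The answer is 6"]

reflective_dataset = build_reflective_dataset(minibatch, responses)
for record in reflective_dataset["system_prompt"]:
    print(record["Inputs"], "->", record["Feedback"])
```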

The prompt sent to the reflection LLM

# The reflection prompt; the {…} placeholders are filled in before it is sent
prompt = """
You are given the following information about the performance of a candidate instruction on a small minibatch of examples.

## Current Instruction

{candidate['system_prompt']}


## Minibatch Evaluation Results
Here are the results of the current instruction on a small minibatch of training examples, with detailed feedback:

{dataset_with_feedback}


In the above, "Inputs" are the inputs to the system, "Generated Outputs" are the outputs produced by using the current instruction, and "Feedback" contains diagnostic feedback on the performance.

## Task
Analyze the evaluation results and propose an improved version of the instruction that addresses the failures and maintains the successes.

Return ONLY a valid JSON object with the new instruction.
"""

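The `{dataset_with_feedback}` placeholder in the prompt above has to be filled with a text rendering of the reflective records. How GEPA lays these records out is not shown in this section, so the sketch below assumes a simple "one block per example" layout; `TEMPLATE`, `render_records`, and `build_reflection_prompt` are hypothetical names.

```python
# Shortened, hypothetical template with the same two placeholders as the prompt above.
TEMPLATE = (
    "## Current Instruction\n{current_instruction}\n\n"
    "## Minibatch Evaluation Results\n{dataset_with_feedback}\n"
)


def render_records(records: list[dict]) -> str:
    """Render each record as a titled block of 'key: value' lines (assumed layout)."""
    blocks = []
    for i, record in enumerate(records, start=1):
        lines = [f"### Example {i}"]
        lines.extend(f"{key}: {value}" for key, value in record.items())
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def build_reflection_prompt(candidate: dict, reflective_dataset: dict,
                            component: str = "system_prompt") -> str:
    return TEMPLATE.format(
        current_instruction=candidate[component],
        dataset_with_feedback=render_records(reflective_dataset[component]),
    )


candidate = {"system_prompt": "You are a helpful assistant."}
dataset = {"system_prompt": [
    {"Inputs": "1+1=?", "Generated Outputs": "The answer is 2",
     "Feedback": "The response is correct. Includes '2'"},
    {"Inputs": "2+2=?", "Generated Outputs": "The answer is 5",
     "Feedback": "The response is incorrect. The correct answer is '4'."},
]}
prompt_text = build_reflection_prompt(candidate, dataset)
print(prompt_text)
```
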
Reflection prompt template

REFINER_PROMPT_TEMPLATE = """
You are refining a candidate to improve its performance.

## Instructions
{refiner_prompt}

## Current Candidate (JSON)

{candidate_to_improve}


## Evaluation History
The following shows all evaluation attempts:

{evaluation_feedback}


## Task
Analyze the evaluation history and propose an improved version.
Return ONLY a valid JSON object with the improved parameters.
"""

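Analogy with neural networks

Aspect	Neural network	GEPA
Goal	Understand "why it failed"	Understand "why it failed"
Method	Backpropagation	LLM reflection
Gradient	`∇L = ∂L/∂W` (analytic gradient)	Reflective data (textual feedback)
Update	`W ← W - α·∇L`	`prompt ← reflect(prompt, feedback)`
Information type	Numeric gradients	Semantic understanding
Training data	All training samples (successes + failures)	Reflective dataset (successes + failures)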

A worked example

Scenario: arithmetic questions

# Current prompt
candidate = {
    "system_prompt": "You are a helpful assistant."
}

# Reflective dataset (successes and failures)
reflective_dataset = {
    "system_prompt": [
        {
            "Inputs": "1+1=?",
            "Generated Outputs": "The answer is 3",
            "Feedback": "Wrong. Correct answer is 2. The model made a calculation error."
        },
        {
            "Inputs": "What is the capital of France?",
            "Generated Outputs": "Paris",
            "Feedback": "Correct. The response includes the correct answer."
        },
        {
            "Inputs": "2+2=?",
            "Generated Outputs": "The answer is 5",
            "Feedback": "Wrong. Correct answer is 4. Another calculation error."
        }
    ]
}

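LLM analysis

The reflection LLM would work out that:

Math questions fail (2 of 2)
General-knowledge questions succeed (1 of 1)
Cause of failure: calculation errors
Direction for improvement: emphasize step-by-step calculation and verification

The improvement the LLM produces

new_prompt = reflection_lm(prompt)
# => "You are a helpful assistant. For math problems, show your work step-by-step and double-check your calculations."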

Types of reflective data

1. Basic feedback

{
    "Inputs": "...",
    "Generated Outputs": "...",
    "Feedback": "Wrong. Correct answer is ..."
}

2. Confidence feedback

{
    "Inputs": "...",
    "Generated Outputs": "...",
    "Feedback": "WRONG — model has 99% certainty on 'A' but correct is 'B'. The prompt is actively misleading it."
}

3. Execution traces

{
    "Inputs": "...",
    "Generated Outputs": "...",
    "Feedback": "Error: Division by zero at line 42",
    "Trace": "Stack trace..."
}

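Key advantages

Property	Backpropagation	LLM reflection
Interpretability	Numeric gradients	Human-readable text
Applicability	Differentiable functions	Any text component
Information type	Local gradients	Global semantic understanding
Compute cost	Low	Medium (LLM calls)
Data usage	All samples	All evaluated cases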

The two LLM calls compared

┌─────────────────────────────────────────────────────────────────────────┐
│                    GEPA's two LLM calls                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Call 1: Task LLM (evaluation / forward pass)                           │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │  model="gpt-4o-mini"          # a smaller model                  │   │
│  │  messages = [system_prompt, user_input]                          │   │
│  │  → generates the prediction (the answer)                         │   │
│  │  → evaluator computes score and feedback  [no LLM call]          │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                              ↓                                          │
│  Call 2: Reflection LLM (reflection / backward pass)                    │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │  model="gpt-4o" / "claude-sonnet-4"  # a stronger model          │   │
│  │  prompt = current prompt + reflective data (all cases)           │   │
│  │  → generates the improved prompt                                 │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

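Core concept: Actionable Side Information (ASI)

ASI is a key innovation in GEPA: diagnostic feedback extracted from the execution environment.

Traditional approach: Loss = scalar (a single number)
GEPA approach: ASI = {
    "Inputs": "...",           # input
    "Generated Outputs": "...", # output
    "Feedback": "...",          # diagnostic feedback
    "error_message": "...",     # error message
    "profiling_data": "...",    # performance data
    "confidence_scores": [...], # confidence scores
    ...
}

ASI acts as a "text gradient" that steers the LLM toward targeted improvements.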
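
To make the contrast with a scalar loss concrete, here is a minimal sketch of an evaluation wrapper that returns a score together with diagnostic side information. `run_with_asi` is a hypothetical helper written for this illustration; the record keys follow the examples in this section.

```python
import traceback


def run_with_asi(fn, inputs: tuple) -> tuple[float, dict]:
    """Run fn(*inputs) and return (score, record), where record carries the diagnostics."""
    record = {"Inputs": repr(inputs)}
    try:
        output = fn(*inputs)
    except Exception as exc:
        # On failure, keep the error text and the stack trace instead of just a 0.0
        record["Generated Outputs"] = ""
        record["Feedback"] = f"Error: {type(exc).__name__}: {exc}"
        record["Trace"] = traceback.format_exc()
        return 0.0, record
    record["Generated Outputs"] = repr(output)
    record["Feedback"] = "Executed without error."
    return 1.0, record


score, record = run_with_asi(lambda a, b: a / b, (1, 0))
print(score)
print(record["Feedback"])
```

A scalar-only evaluator would report just the score for this call; the record additionally tells the reflection LLM which exception was raised and where.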

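Key takeaways

Reflection is the counterpart of backpropagation
- Same goal: understand "why it failed"
- Different method: LLM understanding instead of numerical computation
Reflective data = all evaluated cases
- Includes successes and failures
- Usually 3 samples (minibatch_size)
- Given to the LLM to analyze and learn from
Feedback is generated by the evaluator
- Uses text templates
- Does not call an LLM
- Varies with correctness
Use a stronger LLM
- reflection_lm is usually stronger than task_lm
- For example: GPT-4o vs GPT-4o-mini
Iterative improvement
- Each iteration builds on earlier ones
- The LLM learns from past failure patterns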

The full flow after reflection

Reflection yields a new prompt, which then goes through accept/reject → validation-set evaluation → Pareto front update.

┌─────────────────────────────────────────────────────────────────────────┐
│                    Full flow after reflection                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Reflection stage complete                                              │
│  Obtained: new_candidate = the improved prompt                          │
│                                                                         │
│                              ↓                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │           Stage A: Accept/reject test                            │   │
│  │                                                                  │   │
│  │  old_sum = sum(scores_before)  # parent on the training minibatch│   │
│  │  new_sum = sum(scores_after)   # new candidate, same minibatch   │   │
│  │                                                                  │   │
│  │  if new_sum > old_sum:  # must strictly improve                  │   │
│  │      → accept, continue to full evaluation                       │   │
│  │  else:                                                           │   │
│  │      → reject, discard the new candidate                         │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                              ↓                                          │
│                    (if accepted)                                        │
│                              ↓                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │           Stage B: Full validation-set evaluation                │   │
│  │                                                                  │   │
│  │  Evaluate the new candidate on the entire validation set         │   │
│  │  for val_sample in valset:  (e.g. 50 samples)                    │   │
│  │      1. Build the messages                                       │   │
│  │         messages = [new_prompt, val_sample.input]                │   │
│  │      2. Call the task LLM  [LLM call]                            │   │
│  │         response = task_lm.generate(messages)                    │   │
│  │      3. Evaluator computes the score  [no LLM call]              │   │
│  │         score = evaluator(response, val_sample.answer)           │   │
│  │                                                                  │   │
│  │  val_score = mean(all_scores)  # mean validation score           │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                              ↓                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │           Stage C: Add to the candidate pool                     │   │
│  │                                                                  │   │
│  │  state.program_candidates.append(new_candidate)                  │   │
│  │  state.program_full_scores_val_set.append(val_score)             │   │
│  │  state.parent_program_for_candidate[new_idx] = parent_ids        │   │
│  │                                                                  │   │
│  │  There are now 2 candidates:                                     │   │
│  │  - candidate 0: score 0.70 (old)                                 │   │
│  │  - candidate 1: score 0.82 (new) ← just added                    │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                              ↓                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │           Stage D: Update the Pareto front                       │   │
│  │                                                                  │   │
│  │  best = state.pareto_front_valset                                │   │
│  │  front = state.program_at_pareto_front_valset                    │   │
│  │  for val_id, score in val_scores.items():                        │   │
│  │      old_best = best.get(val_id, float("-inf"))                  │   │
│  │      if score > old_best:  # new candidate is better             │   │
│  │          best[val_id] = score                                    │   │
│  │          front[val_id] = {new_idx}                               │   │
│  │      elif score == old_best:  # tie                              │   │
│  │          front[val_id].add(new_idx)                              │   │
│  │      # otherwise the old candidates stay best; no change         │   │
│  │                                                                  │   │
│  │  Example Pareto front:                                           │   │
│  │  {                                                               │   │
│  │    "val_1": {1},     # candidate 1 is best on val_1              │   │
│  │    "val_2": {0, 1},  # candidates 0 and 1 tie on val_2           │   │
│  │    "val_3": {0},     # candidate 0 is best on val_3              │   │
│  │  }                                                               │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                              ↓                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │           Stage E: Check the stop condition                      │   │
│  │                                                                  │   │
│  │  if stop_callback.should_stop(state):                            │   │
│  │      → end the optimization                                      │   │
│  │  else:                                                           │   │
│  │      → continue with the next iteration                          │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Stage A: Accept/reject test

Code implementation

# src/gepa/core/engine.py:287-322
def _accept_reflective_proposal(self, proposal, iteration, state):
    """Decide whether to accept the new candidate."""

    # Sum the minibatch scores (used for logging)
    old_sum = sum(proposal.subsample_scores_before or [])
    new_sum = sum(proposal.subsample_scores_after or [])

    # Delegate the decision to the acceptance criterion
    if not self.acceptance_criterion.should_accept(proposal, state):
        # Rejected
        self.logger.log(f"New score {new_sum} not better than {old_sum}, skipping")
        notify_callbacks("on_candidate_rejected", ...)
        return False

    # Accepted
    self.logger.log(f"New score {new_sum} is better than {old_sum}. Continue to full eval.")
    return True

Types of acceptance criterion

# src/gepa/strategies/acceptance.py

# Strict improvement: the new sum must be strictly greater than the old sum
class StrictImprovementAcceptance:
    def should_accept(self, proposal, state):
        old_sum = sum(proposal.subsample_scores_before or [])
        new_sum = sum(proposal.subsample_scores_after or [])
        return new_sum > old_sum

# Ties allowed: the new sum must be greater than or equal to the old sum
class ImprovementOrEqualAcceptance:
    def should_accept(self, proposal, state):
        old_sum = sum(proposal.subsample_scores_before or [])
        new_sum = sum(proposal.subsample_scores_after or [])
        return new_sum >= old_sum
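
The two criteria differ only on ties, which occur when a new prompt fixes one sample and breaks another. The sketch below shows this with a hypothetical `Proposal` dataclass standing in for the real proposal object; only the two score fields from the excerpt above are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    # Hypothetical stand-in: only the two fields the criteria read
    subsample_scores_before: list = field(default_factory=list)
    subsample_scores_after: list = field(default_factory=list)


def strict_improvement(proposal: Proposal) -> bool:
    return sum(proposal.subsample_scores_after) > sum(proposal.subsample_scores_before)


def improvement_or_equal(proposal: Proposal) -> bool:
    return sum(proposal.subsample_scores_after) >= sum(proposal.subsample_scores_before)


# The new prompt fixes sample 0 but breaks sample 1: the sums are equal (2.0 vs 2.0)
tie = Proposal([0.0, 1.0, 1.0], [1.0, 0.0, 1.0])
better = Proposal([0.0, 1.0, 0.0], [1.0, 1.0, 1.0])

print(strict_improvement(tie), improvement_or_equal(tie))        # False True
print(strict_improvement(better), improvement_or_equal(better))  # True True
```

Because both criteria compare sums, neither can tell a tie that reshuffles which samples succeed from one that changes nothing.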

Stage B: Full validation-set evaluation

Code implementation

# src/gepa/core/engine.py:154-173
def _evaluate_on_valset(self, program, state):
    """Evaluate the new candidate on the validation set."""

    # 1. Get the IDs of the validation samples to evaluate
    val_ids = self.val_evaluation_policy.get_eval_batch(valset, state)
    # e.g. [0, 1, 2, ..., 49] (all validation samples)

    # 2. Evaluate in batch (with caching)
    outputs_by_val_idx, scores_by_val_idx, _, _ = state.cached_evaluate_full(
        program,            # the new candidate
        list(val_ids),      # validation sample IDs
        valset.fetch,       # function that fetches sample data
        self.evaluator      # evaluator function
    )

    # 3. Return the evaluation result
    return ValsetEvaluation(
        outputs_by_val_id=outputs_by_val_idx,   # {val_id: output}
        scores_by_val_id=scores_by_val_idx,     # {val_id: score}
        objective_scores_by_val_id=...          # {val_id: {objective: score}}
    )

Evaluation example

# Evaluate on the validation set
valset = [
    {"input": "1+1=?", "answer": "2"},
    {"input": "Paris capital?", "answer": "Paris"},
    {"input": "2+2=?", "answer": "4"}
]

val_scores = {}
for val_id, val in enumerate(valset):
    # Call the task LLM (LLM call); task_lm.generate is illustrative
    response = task_lm.generate(new_candidate, val["input"])

    # The evaluator computes the score (no LLM call)
    score = 1.0 if val["answer"] in response else 0.0

    # Store the result under the sample's index (a dict cannot be used as a key)
    val_scores[val_id] = score

# Results:
# sample 0 ("1+1=?"): response="The answer is 2" → score=1.0
# sample 1 ("Paris capital?"): response="Paris" → score=1.0
# sample 2 ("2+2=?"): response="The answer is 4" → score=1.0

Stage C: Add to the candidate pool

Code implementation

# src/gepa/core/state.py:527-549
def update_state_with_new_program(
    self,
    new_program,          # the new candidate
    valset_evaluation,     # validation-set evaluation result
    ...
):
    # 1. Add the new candidate to the pool
    new_program_idx = len(self.program_candidates)
    self.program_candidates.append(dict(new_program))

    # 2. Save the score for each validation sample
    valset_scores = dict(valset_evaluation.scores_by_val_id)
    # e.g. {0: 1.0, 1: 0.0, 2: 1.0, ..., 49: 1.0}

    self.prog_candidate_val_subscores.append(valset_scores)
    # prog_candidate_val_subscores[new_program_idx] = valset_scores

    # 3. The mean validation score is not stored here;
    # it is computed in get_program_average_val_subset
    # val_score = mean(valset_scores.values())

    return new_program_idx

Candidate pool example

# Before adding
state.program_candidates = [
    {"system_prompt": "You are helpful."}  # candidate 0
]
state.program_full_scores_val_set = [0.70]

# After adding
state.program_candidates = [
    {"system_prompt": "You are helpful."},     # candidate 0
    {"system_prompt": "Show your work..."}     # candidate 1 (new)
]
state.program_full_scores_val_set = [0.70, 0.82]

Stage D: Update the Pareto front

Code implementation

# src/gepa/core/state.py:486-510
def _update_pareto_front_for_val_id(
    self,
    val_id,               # validation sample ID
    score,                # the new candidate's score on this sample
    program_idx,          # index of the new candidate
    ...
):
    # 1. Get the current best score for this sample
    prev_score = self.pareto_front_valset.get(val_id, float("-inf"))

    # 2. Compare and update the front
    if score > prev_score:  # the new candidate is better
        # Update the best score
        self.pareto_front_valset[val_id] = score
        # The new candidate alone holds the front for this sample
        self.program_at_pareto_front_valset[val_id] = {program_idx}

    elif score == prev_score:  # tie
        # Get the existing front
        pareto_front = self.program_at_pareto_front_valset.setdefault(val_id, set())
        # Add the new candidate to it
        pareto_front.add(program_idx)

    # Otherwise the old candidates remain best and nothing is updated

Pareto front update example

# Assume the validation set has 3 samples

# The front before the update
pareto_front_valset = {
    0: 1.0,  # sample 0: candidate 0 scored 1.0
    1: 0.0,  # sample 1: candidate 0 scored 0.0
    2: 1.0,  # sample 2: candidate 0 scored 1.0
}

program_at_pareto_front_valset = {
    0: {0},  # sample 0: candidate 0 is best
    1: {0},  # sample 1: candidate 0 is best
    2: {0},  # sample 2: candidate 0 is best
}

# Scores of the new candidate 1
candidate_1_scores = {0: 1.0, 1: 1.0, 2: 1.0}

# Update sample 1
prev_score = 0.0
new_score = 1.0
if new_score > prev_score:
    pareto_front_valset[1] = 1.0
    program_at_pareto_front_valset[1] = {1}  # candidate 1 alone

# Update samples 0 and 2 (equal scores, so candidate 1 joins the front)
# Candidates 0 and 1 tie for best on these samples

# The front after the update
pareto_front_valset = {
    0: 1.0,
    1: 1.0,  # candidate 1 improved this sample
    2: 1.0
}

program_at_pareto_front_valset = {
    0: {0, 1},  # sample 0: both tie for best
    1: {1},     # sample 1: candidate 1 is best
    2: {0, 1}   # sample 2: both tie for best
}
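
The update above can be run as a single function over plain dictionaries. This is a minimal sketch that mirrors the comparison logic of the excerpt; `update_pareto_front` is an illustrative name, and the state is passed in explicitly instead of living on a state object.

```python
def update_pareto_front(best_scores: dict, best_programs: dict,
                        program_idx: int, scores: dict) -> None:
    """Update the per-sample front in place with one candidate's scores."""
    for val_id, score in scores.items():
        prev = best_scores.get(val_id, float("-inf"))
        if score > prev:
            # Strictly better: this candidate alone holds the front for the sample
            best_scores[val_id] = score
            best_programs[val_id] = {program_idx}
        elif score == prev:
            # Tie: the candidate joins the existing front
            best_programs.setdefault(val_id, set()).add(program_idx)


best_scores: dict = {}
best_programs: dict = {}
update_pareto_front(best_scores, best_programs, 0, {0: 1.0, 1: 0.0, 2: 1.0})
update_pareto_front(best_scores, best_programs, 1, {0: 1.0, 1: 1.0, 2: 1.0})

print(best_scores)    # {0: 1.0, 1: 1.0, 2: 1.0}
print(best_programs)  # {0: {0, 1}, 1: {1}, 2: {0, 1}}
```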

Pareto front data structures

# Storage layout
state.pareto_front_valset = {
    val_id: best_score  # best score seen on this sample
}

state.program_at_pareto_front_valset = {
    val_id: {program_idx_1, program_idx_2, ...}  # candidates that are best on this sample
}

# Complete example
pareto_front_valset = {
    0: 1.0,  # best score on sample 0 is 1.0
    1: 1.0,  # best score on sample 1 is 1.0
    2: 0.5,  # best score on sample 2 is 0.5
    3: 0.8,  # best score on sample 3 is 0.8
}

program_at_pareto_front_valset = {
    0: {0, 1},   # on sample 0, candidates 0 and 1 are both best
    1: {1},      # on sample 1, only candidate 1 is best
    2: {0},      # on sample 2, only candidate 0 is best
    3: {1},      # on sample 3, only candidate 1 is best
}
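
A quick way to read this structure is to count, for each candidate, the number of samples on which it is on the front. The sketch below does that for the example above; `front_counts` is a hypothetical helper, and what GEPA does with such counts is outside the scope of this section.

```python
from collections import Counter


def front_counts(program_at_pareto_front_valset: dict) -> Counter:
    """Count the samples on which each candidate is (jointly) best."""
    counts: Counter = Counter()
    for programs in program_at_pareto_front_valset.values():
        counts.update(programs)
    return counts


program_at_pareto_front_valset = {
    0: {0, 1},
    1: {1},
    2: {0},
    3: {1},
}

counts = front_counts(program_at_pareto_front_valset)
print(counts[0], counts[1])  # 2 3
```

Candidate 1 leads on more samples, yet candidate 0 is the only one that is best on sample 2, which is why the front keeps both.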

Complete example

Scenario: optimizing a prompt for arithmetic questions

# ===== Reflection stage complete =====
new_candidate = {
    "system_prompt": "You are a helpful assistant. For math problems, show your work step-by-step."
}

# Minibatch scores (3 training samples)
scores_before = [0.0, 1.0, 0.0]  # old candidate: 1 correct
scores_after = [1.0, 1.0, 1.0]   # new candidate: 3 correct

# ===== Stage A: accept/reject test =====
old_sum = sum(scores_before)  # 0.0 + 1.0 + 0.0 = 1.0
new_sum = sum(scores_after)   # 1.0 + 1.0 + 1.0 = 3.0
# 3.0 > 1.0 → accept

# ===== Stage B: full validation-set evaluation =====
# valset holds the 50 validation samples
val_scores = []

for sample in valset:
    response = task_lm(new_candidate, sample)  # LLM call
    score = evaluator(response, sample)        # rule-based, no LLM call
    val_scores.append(score)

val_score = mean(val_scores)  # e.g. 0.82

# ===== Stage C: add to the candidate pool =====
state.program_candidates.append(new_candidate)
state.program_full_scores_val_set.append(0.82)
# There are now 2 candidates:
# - candidate 0: score 0.70
# - candidate 1: score 0.82 (new)

# ===== Stage D: update the Pareto front =====
# candidate_1_scores maps val_id → the new candidate's score on that sample
for val_id, score in candidate_1_scores.items():
    old_best = pareto_front_valset.get(val_id, float("-inf"))
    if score > old_best:
        # The new candidate alone holds the front for this sample
        pareto_front_valset[val_id] = score
        program_at_pareto_front_valset[val_id] = {1}
    elif score == old_best:
        # Old and new candidates share the front
        program_at_pareto_front_valset[val_id].add(1)

# ===== Stage E: check the stop condition =====
# (this check runs inside the engine's optimization loop)
if state.total_num_evals >= max_metric_calls:  # e.g. 150
    return state  # stop
else:
    continue  # next iteration
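
The walkthrough above is pseudocode. The following self-contained sketch runs the same five stages with a stubbed task LLM so the state changes can be inspected. Everything here is an assumption made for illustration: `stub_task_lm`, `post_reflection_step`, and the dictionary-based `state` are not GEPA APIs.

```python
from statistics import mean


def evaluator(response: str, sample: dict) -> float:
    """Rule-based scoring, no LLM call."""
    return 1.0 if sample["answer"] in response else 0.0


def stub_task_lm(candidate: dict, sample: dict) -> str:
    """Hypothetical stand-in for the task LLM: only the improved prompt answers correctly."""
    if "step-by-step" in candidate["system_prompt"]:
        return f"The answer is {sample['answer']}"
    return "I am not sure"


def post_reflection_step(state, new_candidate, scores_before, scores_after,
                         valset, max_metric_calls):
    # Stage A: accept only on strict improvement over the minibatch
    if sum(scores_after) <= sum(scores_before):
        return {"accepted": False, "stop": state["total_num_evals"] >= max_metric_calls}

    # Stage B: score the new candidate on every validation sample
    val_scores = {
        val_id: evaluator(stub_task_lm(new_candidate, sample), sample)
        for val_id, sample in enumerate(valset)
    }
    state["total_num_evals"] += len(valset)

    # Stage C: add the candidate and its mean validation score to the pool
    new_idx = len(state["candidates"])
    state["candidates"].append(new_candidate)
    state["val_means"].append(mean(list(val_scores.values())))

    # Stage D: update the per-sample Pareto front
    for val_id, score in val_scores.items():
        prev = state["best_scores"].get(val_id, float("-inf"))
        if score > prev:
            state["best_scores"][val_id] = score
            state["best_programs"][val_id] = {new_idx}
        elif score == prev:
            state["best_programs"].setdefault(val_id, set()).add(new_idx)

    # Stage E: report whether the evaluation budget is exhausted
    return {"accepted": True, "stop": state["total_num_evals"] >= max_metric_calls}


valset = [
    {"input": "1+1=?", "answer": "2"},
    {"input": "2+2=?", "answer": "4"},
    {"input": "3+3=?", "answer": "6"},
]
state = {
    "candidates": [{"system_prompt": "You are a helpful assistant."}],
    "val_means": [2 / 3],
    "best_scores": {0: 1.0, 1: 0.0, 2: 1.0},
    "best_programs": {0: {0}, 1: {0}, 2: {0}},
    "total_num_evals": 3,
}
new_candidate = {
    "system_prompt": "You are a helpful assistant. For math problems, show your work step-by-step."
}

result = post_reflection_step(state, new_candidate, [0.0, 1.0, 0.0], [1.0, 1.0, 1.0],
                              valset, max_metric_calls=150)
print(result)
print(state["best_programs"])
```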

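Key takeaways

Stage	Description	Neural network analogy
A. Accept/reject	Compare minibatch scores	Loss check
B. Validation evaluation	Evaluate the new candidate in full	Validation-set test
C. Add to pool	Save the new candidate	Save a model checkpoint
D. Update front	Update the Pareto front	Update the best model
E. Stop check	Check the stop condition	Early-stopping check

Key points about accept/reject

The acceptance condition is strict
- The candidate must improve on the training-set minibatch
- Default: new_sum > old_sum
Validation evaluation is complete
- Evaluate on all validation samples
- Compute the mean score
The Pareto front is updated dynamically
- Updated after every acceptance
- Tracks the best candidates per sample
Iterate until stopped
- max_metric_calls is reached
- Or another stop condition is met

Next steps

Pareto front - how multiple best solutions are maintained
Comparison with neural networks - a deeper look at the analogy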
