Expand generation harness observability

2026-06-24 10:48:23 +08:00
parent 459ca9edef
commit 1f34d80083
35 changed files with 8003 additions and 112 deletions
--- a/backend/tests/harness-evaluation-test-cases.md
+++ b/backend/tests/harness-evaluation-test-cases.md
@@ -0,0 +1,610 @@
+# Test Cases: Harness Evaluation Driven Generation
+
+## Overview
+
+- **Feature**: Harness evaluation driven generation
+- **Requirements Source**: `docs/technical/harness-engineering-modernization.md`
+- **Test Coverage**: evaluation scoring, blocking quality failures, workflow plan events, trace aggregation, state transitions, internal golden replay, admin-only analytics, admin-only executor coverage summary, admin-only harness readiness
+- **Last Updated**: 2026-06-23
+
+## Test Case Categories
+
+### 1. Functional Tests
+
+#### TC-F-001: Plain story generation without images writes evaluation events
+
+- **Requirement**: H7-3, H7-4
+- **Priority**: High
+- **Preconditions**:
+  - The user is logged in.
+  - The text provider returns a complete, child-safe story.
+- **Test Steps**:
+  1. Call `POST /api/generations` with `output_mode=story` and `generate_images=false`.
+  2. Run the worker task.
+  3. Query the job detail.
+- **Expected Results**:
+  - The job status is `completed`.
+  - The event sequence includes `workflow_planned`.
+  - The event sequence includes `evaluation_completed`.
+  - `evaluation_completed.event_metadata.passed=true`.
+  - `evaluation_completed.event_metadata.overall_score >= 0.7`.
+- **Postconditions**: The story is persisted and `story_id` is written to the job.
+
+#### TC-F-003: User trace summary does not return the evaluation summary
+
+- **Requirement**: H7-4, H7B-1
+- **Priority**: High
+- **Preconditions**:
+  - The story already has an `evaluation_completed` job event.
+- **Test Steps**:
+  1. Call `GET /api/generations/{story_id}/trace-summary`.
+  2. Inspect the response fields.
+- **Expected Results**:
+  - The response does not contain an `evaluation` field.
+  - `by_step` does not contain `evaluation`.
+  - `by_artifact` does not increment the `story_text` count because of `evaluation_completed`.
+  - `failed_events` does not count `evaluation_completed`.
+  - `total_events` does not count `evaluation_completed`, so the event count cannot leak the internal evaluation step.
+- **Postconditions**: No data is modified.
+
+#### TC-F-004: User job detail does not return evaluation events
+
+- **Requirement**: H7-4, H7B-2
+- **Priority**: High
+- **Preconditions**:
+  - The job has recorded an `evaluation_completed` event.
+- **Test Steps**:
+  1. Call `GET /api/generations/jobs/{job_id}`.
+  2. Inspect the `events` list.
+- **Expected Results**:
+  - `events` does not contain `evaluation_completed`.
+  - The response does not contain evaluation scores, dimension scores, pass rates, or blocking thresholds.
+- **Postconditions**: Internal database events are not deleted.
+
+#### TC-F-002: Complete story output receives a passing score
+
+- **Requirement**: H7-1
+- **Priority**: High
+- **Preconditions**:
+  - Construct a complete `StoryOutput`.
+- **Test Steps**:
+  1. Call `evaluate_story_output`.
+  2. Read the `EvaluationResult`.
+- **Expected Results**:
+  - `passed=true`.
+  - `blocking=false`.
+  - scores include `structure`, `safety`, `age_fit`, `educational_value`, and `readability`.
+- **Postconditions**: No persistence side effects.
+
+#### TC-F-005: Complete storybook output receives a passing score
+
+- **Requirement**: H7-1, H7C-1
+- **Priority**: High
+- **Preconditions**:
+  - Construct a complete `Storybook`.
+- **Test Steps**:
+  1. Call `evaluate_storybook_output`.
+  2. Read the `EvaluationResult`.
+- **Expected Results**:
+  - `passed=true`.
+  - `blocking=false`.
+  - scores include `structure`, `safety`, `age_fit`, `educational_value`, and `readability`.
+- **Postconditions**: No persistence side effects.
+
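+#### TC-F-006: Internal golden cases can be replayed and all match expectations
+
+- **Requirement**: H7-7, H7-8
+- **Priority**: High
+- **Preconditions**:
+  - `backend/app/services/harness/fixtures/evaluation_golden_cases.json` exists.
+  - The fixture is read only by backend tests, internal tools, or the admin-only readiness check.
+- **Test Steps**:
+  1. Call `replay_evaluation_golden_cases`.
+  2. Read the `EvaluationReplaySuiteResult`.
+- **Expected Results**:
+  - `passed=true`.
+  - `failed_case_ids` is empty.
+  - Both plain story and storybook samples are covered.
+  - The samples cover a complete plain story, a longer plain story, an empty body, a missing cover prompt, a safety risk term, a short-text threshold block, duplicate storybook page numbers, a missing storybook page, a storybook safety risk, and short storybook pages.
+  - Results are not returned through any user-facing API.
+- **Postconditions**: No persistence side effects.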
+
+#### TC-F-007: Internal golden replay coverage summary is stable
+
+- **Requirement**: H7-8
+- **Priority**: High
+- **Preconditions**:
+  - The golden replay suite has been run.
+- **Test Steps**:
+  1. Call `coverage_summary`.
+  2. Check the artifact, age_band, risk_area, tags, and outcome distributions.
+- **Expected Results**:
+  - artifact covers `story=6` and `storybook=5`.
+  - age_band covers `3-4`, `5-6`, `7-8`, and `unknown`.
+  - risk_area covers `happy_path`, `schema_error`, `safety_error`, `readability_warning`, and `length_boundary`.
+  - outcome covers `passed=3` and `blocked=8`.
+  - The coverage summary is not returned through any user-facing API.
+- **Postconditions**: No persistence side effects.
+
+### 2. Edge Case Tests
+
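+#### TC-E-001: Very short story passes structure checks but raises a reading-experience warning for young readers
+
+- **Requirement**: H7-1
+- **Priority**: Medium
+- **Preconditions**:
+  - Construct a `StoryOutput` whose title, body, and cover prompt are all present but whose body is very short.
+- **Test Steps**:
+  1. Call `evaluate_story_output`.
+  2. Read the warnings and dimension scores.
+- **Expected Results**:
+  - The quality gate exception is not raised.
+  - The `age_fit` or `readability` score is lower than for a complete story.
+  - warnings include a reading-experience notice.
+- **Postconditions**: No persistence side effects.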
+
+#### TC-E-002: Internal golden replay reports expectation mismatches
+
+- **Requirement**: H7-7
+- **Priority**: Medium
+- **Preconditions**:
+  - Construct an `EvaluationReplayCase` whose actual score falls below the expected threshold.
+- **Test Steps**:
+  1. Call `run_evaluation_replay_cases`.
+  2. Read the `failure_report`.
+- **Expected Results**:
+  - `passed=false`.
+  - `failed_case_ids` contains that case id.
+  - `failure_report` includes the `overall_score` difference.
+- **Postconditions**: No persistence side effects.
+
+### 3. Error Handling Tests
+
+#### TC-ERR-001: Empty body blocks persistence
+
+- **Requirement**: H7-4
+- **Priority**: High
+- **Preconditions**:
+  - The text provider returns an empty `story_text`.
+- **Test Steps**:
+  1. Run the worker task.
+  2. Query the job and story tables.
+  3. Query the job events.
+- **Expected Results**:
+  - The job status is `failed`.
+  - No story is persisted.
+  - events include `quality_gate_failed`.
+  - events include `evaluation_completed`.
+  - `evaluation_completed.event_metadata.blocking=true`.
+- **Postconditions**: The user can retry the job.
+
+#### TC-ERR-002: Age-inappropriate risk terms block generation
+
+- **Requirement**: H7-1
+- **Priority**: High
+- **Preconditions**:
+  - Construct a `StoryOutput` containing clearly age-inappropriate risk terms.
+- **Test Steps**:
+  1. Call `evaluate_story_output`.
+  2. Read the `quality_gate` metadata.
+- **Expected Results**:
+  - `passed=false`.
+  - `blocking=true`.
+  - `quality_gate.issues[0].failure_category=safety_error`.
+- **Postconditions**: No persistence side effects.
+
+#### TC-ERR-003: Storybook structure errors block generation
+
+- **Requirement**: H7-1, H7C-1
+- **Priority**: High
+- **Preconditions**:
+  - Construct a `Storybook` with duplicate page numbers or missing pages.
+- **Test Steps**:
+  1. Call `evaluate_storybook_output`.
+  2. Read the `quality_gate` metadata.
+- **Expected Results**:
+  - `passed=false`.
+  - `blocking=true`.
+  - `quality_gate.issues[0].code=invalid_storybook_page_number`, or the code of the corresponding structure error.
+- **Postconditions**: No persistence side effects.
+
+### 4. State Transition Tests
+
+#### TC-ST-001: Event order is stable for the plain story path without images
+
+- **Requirement**: H7-3
+- **Priority**: High
+- **Preconditions**:
+  - The job's initial state is `running/request_accepted`.
+- **Test Steps**:
+  1. Run the worker task.
+  2. Query events ordered by id.
+- **Expected Results**:
+  - The event order is `request_accepted`, `worker_started`, `workflow_planned`, `context_prepared`, `evaluation_completed`, `narrative_generated`, `story_saved`, `generation_completed`.
+- **Postconditions**: job `current_step=generation_completed`.
+
+#### TC-ST-002: Plain story path with images records a recoverable asset plan
+
+- **Requirement**: H9-1, H9-3
+- **Priority**: High
+- **Preconditions**:
+  - The job's initial state is `running/request_accepted`.
+  - The request sets `output_mode=story` and `generate_images=true`.
+  - The text provider returns a passing story and the image provider returns a cover URL.
+- **Test Steps**:
+  1. Run the worker task.
+  2. Query internal events ordered by id.
+  3. Read `workflow_planned.event_metadata.plan`.
+- **Expected Results**:
+  - The event order is `request_accepted`, `worker_started`, `workflow_planned`, `context_prepared`, `evaluation_completed`, `narrative_generated`, `story_saved`, `cover_image_started`, `cover_image_succeeded`, `generation_completed`.
+  - `plan.mode=story_with_assets`.
+  - plan tasks include `evaluate_narrative`.
+  - plan tasks include `generate_cover_image`.
+  - `generate_cover_image.required=false`.
+  - `generate_cover_image.recoverable=true`.
+- **Postconditions**: job `current_step=generation_completed`; story `image_status=ready`.
+
+#### TC-ST-003: Storybook path records a storybook plan snapshot
+
+- **Requirement**: H9-2, H9-3
+- **Priority**: High
+- **Preconditions**:
+  - The job's initial state is `running/request_accepted`.
+  - The request sets `output_mode=storybook`.
+- **Test Steps**:
+  1. Run the worker task.
+  2. Query internal events ordered by id.
+  3. Read `workflow_planned.event_metadata.plan`.
+- **Expected Results**:
+  - The event sequence includes `workflow_planned`, positioned between `worker_started` and `context_prepared`.
+  - `plan.mode=storybook`.
+  - plan tasks include `generate_storybook_pages`.
+  - plan tasks include `evaluate_storybook_pages`.
+  - When `generate_images=true`, plan tasks include `generate_storybook_images`.
+  - `generate_storybook_images.required=false`.
+  - `generate_storybook_images.recoverable=true`.
+- **Postconditions**: job `current_step=generation_completed`.
+
+#### TC-ST-004: Storybook generation records evaluation internally but redacts user-facing events
+
+- **Requirement**: H7C-1, H7B-2, H9-4
+- **Priority**: High
+- **Preconditions**:
+  - The storybook generation job has finished running.
+- **Test Steps**:
+  1. Query the internal `generation_job_events` directly.
+  2. Call `GET /api/generations/jobs/{job_id}`.
+- **Expected Results**:
+  - Internal events include `evaluation_completed`.
+  - The internal `evaluation_completed.event_metadata.artifact=storybook_pages`.
+  - User API events do not include `evaluation_completed`.
+  - The user API response does not contain `overall_score`, dimension scores, thresholds, or golden replay fields.
+- **Postconditions**: The job is complete and the storybook is persisted.
+
+#### TC-ST-005: Asset generation and retry paths record asset plan snapshots
+
+- **Requirement**: H10-1, H10-2, H10-3
+- **Priority**: High
+- **Preconditions**:
+  - The story has image/audio assets that can be generated or retried.
+- **Test Steps**:
+  1. Run the `asset_generation` worker task.
+  2. Call `/api/generations/{story_id}/retry-assets`.
+  3. Query internal events ordered by id.
+- **Expected Results**:
+  - The `asset_generation` event sequence includes `workflow_planned`.
+  - For `asset_generation`, `plan.mode=asset_generation`.
+  - The `asset_retry` event sequence includes `workflow_planned`.
+  - For `asset_retry`, `plan.mode=asset_retry`.
+  - Image and audio tasks in the plan have `required=false` and `recoverable=true`.
+- **Postconditions**: Asset statuses are updated according to the existing semantics.
+
+#### TC-ST-006: User event metadata is redacted with an allowlist
+
+- **Requirement**: H10-4, H10-5
+- **Priority**: High
+- **Preconditions**:
+  - Internal job events contain the raw `plan.tasks`, `result_snapshot`, internal thresholds, or internal error details.
+- **Test Steps**:
+  1. Call `GET /api/generations/jobs/{job_id}`.
+  2. Inspect `events[*].event_metadata`.
+- **Expected Results**:
+  - The user response keeps explainable fields such as `step`, `artifact`, `asset`, `assets`, and `failure_category`.
+  - `workflow_planned` returns only `plan_mode`, `planned_task_count`, and `recoverable_task_count`.
+  - The user response does not contain the raw `plan`, `tasks`, `result_snapshot`, internal thresholds, or raw internal error text.
+  - The user response still does not contain `evaluation_completed`, `overall_score`, dimension scores, or golden replay fields.
+- **Postconditions**: Internal database events are not modified.
+
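+#### TC-ST-007: User request payload is redacted with an allowlist
+
+- **Requirement**: H11-1, H11-4
+- **Priority**: High
+- **Preconditions**:
+  - The generation job's `request_payload` contains user input, public control fields, an internal scheduling token, a provider override, and an evaluation policy.
+- **Test Steps**:
+  1. Call `GET /api/generations/jobs/{job_id}`.
+  2. Inspect `request_payload` in the response.
+- **Expected Results**:
+  - The user response keeps only safe control fields such as `output_mode`, `input_type`, `type`, `story_id`, `assets`, `page_count`, and `generate_images`.
+  - The user response does not contain the raw `data`, `education_theme`, the internal scheduling token, the provider override, or the evaluation policy.
+  - The full request payload in the internal database is not modified.
+- **Postconditions**: The user client can still show task progress and available actions from the public fields.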
+
+#### TC-ST-008: Asset plan runner executes tasks in WorkflowPlan order
+
+- **Requirement**: H12-1, H12-5
+- **Priority**: High
+- **Preconditions**:
+  - Construct an `asset_generation` or `asset_retry` plan containing image and audio tasks.
+- **Test Steps**:
+  1. Call `run_asset_plan(...)`.
+  2. Record the call order of the image/audio handlers.
+  3. Check the executed/ignored task keys returned by the runner.
+- **Expected Results**:
+  - Image and audio handlers run in the order of the `WorkflowTask` entries in the plan.
+  - Non-asset-producing tasks such as `start_asset_*` and `complete_asset_*` are recorded as ignored and do not trigger a provider handler.
+  - Unknown non-asset tasks are ignored by default and do not affect known asset tasks.
+- **Postconditions**: No database changes.
+
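+#### TC-ST-009: Background asset generation runs combined assets through the plan runner
+
+- **Requirement**: H12-2, H12-5
+- **Priority**: High
+- **Preconditions**:
+  - A persisted story has the inputs needed to generate both images and audio.
+  - Create an `asset_generation` job with `assets=["audio", "image"]`.
+- **Test Steps**:
+  1. Run the job through the worker.
+  2. Query the job events and story status.
+- **Expected Results**:
+  - In the event stream, `workflow_planned` is followed by the audio generation events and then the image generation events.
+  - The plan task order includes `complete_audio_asset` followed by `complete_image_asset`.
+  - The story's `audio_status` and `image_status` are both `ready`.
+  - The user API still exposes only coarse plan metadata and does not return the raw `plan.tasks`.
+- **Postconditions**: The job is complete and asset statuses match the existing semantics.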
+
+#### TC-ST-010: User-facing APIs filter internal executor coverage events
+
+- **Requirement**: H13-4, H13-5
+- **Priority**: High
+- **Preconditions**:
+  - The generation job contains an internal `executor_completed` event.
+  - `executor_completed.event_metadata` contains task keys and result assets.
+- **Test Steps**:
+  1. Call `GET /api/generations/jobs/{job_id}`.
+  2. Call `GET /api/generations/{story_id}/jobs`.
+  3. Call `GET /api/generations/{story_id}/trace-summary`.
+- **Expected Results**:
+  - The user job detail does not contain `executor_completed`.
+  - The user job detail does not contain `executed_task_keys`, `ignored_task_keys`, or any specific task key.
+  - When the job's current step briefly rests at `executor_completed`, the user summary shows the publicly safe `workflow_planned` progress instead.
+  - The user trace summary does not contain `executor_completed` or any specific task key.
+  - The user trace summary's `total_events` does not count internal `executor_completed` events.
+- **Postconditions**: Internal database events are not modified.
+
+### 5. Admin-Only Analytics Tests
+
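+#### TC-ADM-001: Admin evaluation analytics aggregates internal evaluation events
+
+- **Requirement**: H8-1, H8-2
+- **Priority**: High
+- **Preconditions**:
+  - The database contains `evaluation_completed` events from multiple users.
+  - The request passes the admin guard.
+- **Test Steps**:
+  1. Call `GET /admin/evaluations/analytics`.
+  2. Inspect the aggregated result.
+- **Expected Results**:
+  - Returns the pass count, blocked count, pass rate, and average score.
+  - Returns aggregates by artifact, output mode, score band, dimension score, quality gate issue, failure category, and warning.
+  - Does not return story body text, prompts, individual evaluation events, or score reasons.
+- **Postconditions**: No data is modified.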
+
+#### TC-ADM-002: Admin evaluation analytics supports filtering
+
+- **Requirement**: H8-3
+- **Priority**: Medium
+- **Preconditions**:
+  - The database contains both old and recent evaluation events, across different artifacts.
+- **Test Steps**:
+  1. Call `GET /admin/evaluations/analytics?days=7`.
+  2. Call `GET /admin/evaluations/analytics?artifact=story_text`.
+  3. Call the endpoint with an invalid artifact.
+- **Expected Results**:
+  - The `days` filter counts only events inside the window.
+  - The `artifact` filter counts only the matching artifact.
+  - An invalid artifact returns `422`.
+- **Postconditions**: No data is modified.
+
+#### TC-ADM-003: Admin evaluation analytics requires admin authentication
+
+- **Requirement**: H8-2
+- **Priority**: High
+- **Preconditions**:
+  - No admin Basic Auth is provided.
+- **Test Steps**:
+  1. Call `GET /admin/evaluations/analytics`.
+- **Expected Results**:
+  - Returns `401`.
+  - Returns no evaluation statistics.
+- **Postconditions**: No data is modified.
+
+#### TC-ADM-004: Admin full generation trace returns the internal event stream
+
+- **Requirement**: H11-2, H11-3, H11-4
+- **Priority**: High
+- **Preconditions**:
+  - The database contains a generation job with `workflow_planned` and `evaluation_completed` events.
+  - The request passes the admin guard.
+- **Test Steps**:
+  1. Call `GET /admin/generations/jobs/{job_id}/trace`.
+  2. Inspect the request payload and event stream.
+- **Expected Results**:
+  - Returns the full request payload, including raw user input and internal scheduling fields.
+  - Returns the full `workflow_planned.event_metadata.plan.tasks`.
+  - Returns the `evaluation_completed` event and its internal scoring metadata.
+  - The response includes `user_id` to support auditing from the admin control plane.
+- **Postconditions**: No data is modified.
+
+#### TC-ADM-005: Admin full generation trace requires admin authentication
+
+- **Requirement**: H11-3
+- **Priority**: High
+- **Preconditions**:
+  - No admin Basic Auth is provided.
+- **Test Steps**:
+  1. Call `GET /admin/generations/jobs/{job_id}/trace`.
+- **Expected Results**:
+  - Returns `401`.
+  - Returns no request payload or internal event metadata.
+- **Postconditions**: No data is modified.
+
+#### TC-ADM-006: Admin executor coverage aggregates internal execution events
+
+- **Requirement**: H13-1, H13-2, H13-3, H13-5
+- **Priority**: High
+- **Preconditions**:
+  - The database contains multiple `executor_completed` events.
+  - The request passes the admin guard.
+- **Test Steps**:
+  1. Call `GET /admin/executors/coverage`.
+  2. Call `GET /admin/executors/coverage?plan_mode=asset_retry`.
+  3. Call the endpoint with an invalid plan mode.
+- **Expected Results**:
+  - Returns total runs, planned/executed/ignored task counts, and the coverage ratio.
+  - Returns aggregates by plan mode, output mode, executed task keys, ignored task keys, and result assets.
+  - The `plan_mode` filter counts only matching executor runs.
+  - An invalid plan mode returns `422`.
+- **Postconditions**: No data is modified.
+
+#### TC-ADM-007: Admin executor coverage requires admin authentication
+
+- **Requirement**: H13-3
+- **Priority**: High
+- **Preconditions**:
+  - No admin Basic Auth is provided.
+- **Test Steps**:
+  1. Call `GET /admin/executors/coverage`.
+- **Expected Results**:
+  - Returns `401`.
+  - Returns no executor task keys or coverage metadata.
+- **Postconditions**: No data is modified.
+
+#### TC-ADM-008: Admin full generation trace returns a per-job executor coverage summary
+
+- **Requirement**: H14-1, H14-2, H14-4
+- **Priority**: High
+- **Preconditions**:
+  - The database contains a generation job with an `executor_completed` event.
+  - The request passes the admin guard.
+- **Test Steps**:
+  1. Call `GET /admin/generations/jobs/{job_id}/trace`.
+  2. Inspect `executor_coverage`.
+- **Expected Results**:
+  - The response includes `executor_coverage.scope=admin_internal_job_executor_coverage`.
+  - `executor_coverage` counts only the current job's runs, planned/executed/ignored task counts, and coverage ratio.
+  - `executor_coverage.executed_task_keys`, `ignored_task_keys`, and `result_assets` match the current job's internal executor events.
+  - The full event stream still retains `executor_completed` for admin debugging.
+- **Postconditions**: No data is modified.
+
+#### TC-ADM-009: Admin harness readiness aggregates internal quality gates
+
+- **Requirement**: H15-1, H15-2, H15-3, H15-4
+- **Priority**: High
+- **Preconditions**:
+  - The app's internal harness fixture contains golden replay cases.
+  - The database contains at least one passing `evaluation_completed` event.
+  - The database contains at least one `executor_completed` event.
+  - The request passes the admin guard.
+- **Test Steps**:
+  1. Call `GET /admin/harness/readiness`.
+  2. Inspect the readiness status, checks, and aggregate summaries.
+- **Expected Results**:
+  - `status=ready`.
+  - checks include `golden_replay`, `runtime_evaluation_samples`, `runtime_evaluation_quality`, `executor_coverage_samples`, and `executor_coverage_ratio`.
+  - Golden replay shows every case passing.
+  - Evaluation analytics and executor coverage are returned only in aggregate form.
+  - The response does not contain story titles, body text, prompts, score reasons, or quality gate messages.
+- **Postconditions**: No data is modified.
+
+#### TC-ADM-010: Admin harness readiness blocks on low-quality runtime samples and requires admin authentication
+
+- **Requirement**: H15-2, H15-3, H15-4, H15-5
+- **Priority**: High
+- **Preconditions**:
+  - The database contains low-quality or blocking `evaluation_completed` events.
+  - Executor coverage runtime samples are missing or insufficient.
+- **Test Steps**:
+  1. Call `GET /admin/harness/readiness` through the admin guard.
+  2. Call the same path without admin Basic Auth.
+- **Expected Results**:
+  - With admin access, returns `status=blocked`.
+  - `runtime_evaluation_quality.status=blocked`.
+  - When executor samples are missing, the corresponding check is `needs_attention`.
+  - Without admin access, returns `401`.
+  - The response does not contain quality gate messages or individual event details.
+- **Postconditions**: No data is modified.
+
+## Test Coverage Matrix
+
+| Requirement ID | Test Cases | Coverage Status |
+| --- | --- | --- |
+| H7-1 | TC-F-002, TC-F-005, TC-E-001, TC-ERR-002, TC-ERR-003 | Complete |
+| H7-2 | TC-F-001, TC-ST-001 | Complete |
+| H7-3 | TC-F-001, TC-ST-001 | Complete |
+| H7-4 | TC-F-003, TC-ERR-001 | Complete |
+| H7-5 | This document | Complete |
+| H7-7 | TC-F-006, TC-E-002 | Complete |
+| H7-8 | TC-F-006, TC-F-007 | Complete |
+| H7B-1 | TC-F-003 | Complete |
+| H7B-2 | TC-F-004 | Complete |
+| H7C-1 | TC-F-005, TC-ERR-003, TC-ST-004 | Complete |
+| H8-1 | TC-ADM-001 | Complete |
+| H8-2 | TC-ADM-001, TC-ADM-003 | Complete |
+| H8-3 | TC-ADM-002 | Complete |
+| H8-4 | TC-F-003, TC-F-004, TC-ADM-001 | Complete |
+| H9-1 | TC-ST-002 | Complete |
+| H9-2 | TC-ST-003 | Complete |
+| H9-3 | TC-ST-001, TC-ST-002, TC-ST-003 | Complete |
+| H9-4 | TC-F-003, TC-F-004, TC-ST-004 | Complete |
+| H10-1 | TC-ST-005 | Complete |
+| H10-2 | TC-ST-005 | Complete |
+| H10-3 | TC-ST-005 | Complete |
+| H10-4 | TC-ST-006 | Complete |
+| H10-5 | TC-ST-005, TC-ST-006 | Complete |
+| H11-1 | TC-ST-007 | Complete |
+| H11-2 | TC-ADM-004 | Complete |
+| H11-3 | TC-ADM-004, TC-ADM-005 | Complete |
+| H11-4 | TC-ST-007, TC-ADM-004, TC-ADM-005 | Complete |
+| H11-5 | This document, `docs/planning/harness-stage-11-report.md` | Complete |
+| H12-1 | TC-ST-008 | Complete |
+| H12-2 | TC-ST-009 | Complete |
+| H12-3 | TC-ST-005, TC-ST-008 | Complete |
+| H12-4 | TC-ST-005, backend story endpoint regression tests | Complete |
+| H12-5 | TC-ST-008, TC-ST-009 | Complete |
+| H13-1 | TC-ADM-006 | Complete |
+| H13-2 | TC-ST-009, TC-ADM-006 | Complete |
+| H13-3 | TC-ADM-006, TC-ADM-007 | Complete |
+| H13-4 | TC-ST-010 | Complete |
+| H13-5 | TC-ST-010, TC-ADM-006, TC-ADM-007 | Complete |
+| H14-1 | TC-ADM-006, TC-ADM-008 | Complete |
+| H14-2 | TC-ADM-008 | Complete |
+| H14-3 | TC-ST-010 | Complete |
+| H14-4 | TC-ST-010, TC-ADM-008 | Complete |
+| H14-5 | This document, `docs/planning/harness-stage-14-report.md` | Complete |
+| H15-1 | TC-F-006, TC-ADM-009 | Complete |
+| H15-2 | TC-ADM-009, TC-ADM-010 | Complete |
+| H15-3 | TC-ADM-009, TC-ADM-010 | Complete |
+| H15-4 | TC-ADM-009, TC-ADM-010 | Complete |
+| H15-5 | This document, `docs/planning/harness-stage-15-report.md` | Complete |
+
+## Notes
+
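+- Automation currently covers TC-F-001, TC-F-002, TC-F-003, TC-F-004, TC-F-005, TC-F-006, TC-F-007, TC-E-002, TC-ERR-001, TC-ERR-002, TC-ERR-003, TC-ST-001, TC-ST-002, TC-ST-003, TC-ST-004, TC-ST-005, TC-ST-006, TC-ST-007, TC-ST-008, TC-ST-009, TC-ST-010, TC-ADM-001, TC-ADM-002, TC-ADM-003, TC-ADM-004, TC-ADM-005, TC-ADM-006, TC-ADM-007, TC-ADM-008, TC-ADM-009, and TC-ADM-010.
+- TC-E-001 can be added as an explicit unit test in the next round.
+- All `evaluation_completed`, golden replay, and scoring dimension data are treated as internal quality assets and must not reach user-facing APIs or the user frontend.
+- `GET /admin/evaluations/analytics` may return only admin-only aggregate summaries and must not return raw content, prompts, individual events, or score reasons.
+- `GET /admin/generations/jobs/{job_id}/trace` is an admin-only debugging and review endpoint; it may return the full internal chain and must not be called by the user frontend.
+- `GET /admin/executors/coverage` is an admin-only executor coverage endpoint; it may return task keys and result assets and must not be called by the user frontend.
+- `GET /admin/generations/jobs/{job_id}/trace` may return the current job's `executor_coverage` summary; like task keys, this summary is an internal execution asset.
+- `GET /admin/harness/readiness` is an admin-only pre-launch harness review summary; it may return aggregate readiness, thresholds, golden coverage, evaluation analytics, and executor coverage, and must not return body text, prompts, score reasons, quality gate messages, or individual event details.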
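
Reviewer sketch (not part of the diff above): the allowlist redaction described in TC-ST-006 and TC-ST-010 could look roughly like the following. This is a minimal illustration under stated assumptions, not the code in this commit. The `event_type` field name and the helper names `redact_event` and `redact_events` are hypothetical; only the metadata keys and event names come from the test cases.

```python
from typing import Any, Optional

# Assumption: internal-only event types, taken from TC-F-004 and TC-ST-010.
INTERNAL_EVENT_TYPES = {"evaluation_completed", "executor_completed"}
# Assumption: explainable metadata keys, taken from TC-ST-006.
PUBLIC_METADATA_KEYS = {"step", "artifact", "asset", "assets", "failure_category"}


def redact_event(event: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a user-safe copy of one event, or None if it is internal-only."""
    if event["event_type"] in INTERNAL_EVENT_TYPES:
        return None
    metadata = event.get("event_metadata") or {}
    # Keep only allowlisted keys; everything else is dropped by default.
    public = {k: v for k, v in metadata.items() if k in PUBLIC_METADATA_KEYS}
    if event["event_type"] == "workflow_planned":
        # Replace the raw plan with the coarse summary fields from TC-ST-006.
        plan = metadata.get("plan") or {}
        tasks = plan.get("tasks") or []
        public["plan_mode"] = plan.get("mode")
        public["planned_task_count"] = len(tasks)
        public["recoverable_task_count"] = sum(
            1 for task in tasks if task.get("recoverable")
        )
    return {"event_type": event["event_type"], "event_metadata": public}


def redact_events(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Redact a whole event stream, dropping internal-only events."""
    redacted = (redact_event(event) for event in events)
    return [event for event in redacted if event is not None]
```

Building a new dictionary rather than deleting keys from the stored event keeps the internal record untouched, which matches the postcondition that internal database events are not modified. An allowlist also means a newly added internal field stays hidden until someone explicitly exposes it.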