Add design specification and project setup#1
Conversation
Comprehensive design spec covering: - Agent identity system with roles, personalities, skills, seniority - Company structure templates (solo founder to full enterprise) - Communication architecture (event-driven bus, hierarchical, meetings, hybrid) - Task/workflow engine with multiple execution patterns - Memory system (working, episodic, semantic, procedural, social) - Multi-provider LLM layer via LiteLLM (Claude, OpenRouter, Ollama) - Budget management with CFO agent and cost optimization - Tool/capability system with progressive trust - Security operations agent and configurable autonomy levels - API-first architecture (FastAPI + WebSocket) with web UI - Template system and company builder - Research analysis of MetaGPT, ChatDev, CrewAI, AutoGen, LangGraph - Build-from-scratch recommendation leveraging LiteLLM, Mem0, MCP, A2A - Open questions, risks, backlog, and scaling path
Business Source License 1.1 with parameters: - Non-commercial use only (personal, education, research, evaluation) - Commercial/production use requires separate license - Converts to Apache 2.0 after 4 years per version
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdded a comprehensive design specification document ( Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes the foundational documentation for the 'AI Company' project, a framework designed to orchestrate autonomous AI agents within a virtual company structure. It introduces a comprehensive design specification detailing the project's vision, architecture, and planned features, along with a new license. The updated README provides a clear, high-level overview of the project's goals and technical direction, setting the stage for future development. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces the foundational documents for the 'AI Company' project, including a comprehensive design specification, the project license, and an updated README. The design specification is incredibly detailed and well-thought-out, covering a wide range of architectural and conceptual aspects. My review focuses on improving clarity and consistency in a few key areas of the design to ensure a smoother implementation. Specifically, I've pointed out inconsistencies in model identifiers, ambiguity in how agents are referenced, and minor omissions in the documentation.
| assigned_to: "sarah_chen" | ||
| reviewers: ["engineering_lead", "security_engineer"] |
There was a problem hiding this comment.
Throughout the document, identifiers for agents and roles are used inconsistently. For example, tasks are assigned to "sarah_chen", but the agent card shows name: "Sarah Chen" and id: "uuid". Other fields use role-based identifiers like "product_manager_1" and "engineering_lead". For a robust implementation, the design should specify a single, consistent identification scheme. Using unique agent IDs (like the uuid from the agent card) is generally recommended over names or roles to avoid ambiguity.
| model_id: "claude-sonnet-4-6" | ||
| temperature: 0.3 | ||
| max_tokens: 8192 | ||
| fallback_model: "openrouter/anthropic/claude-haiku" |
There was a problem hiding this comment.
The fallback_model is specified as openrouter/anthropic/claude-haiku. However, in the openrouter provider configuration (Section 8.2), there is no model defined with this ID or a similar alias. The model IDs under OpenRouter follow a provider/model-name format (e.g., anthropic/claude-sonnet-4-6). Please ensure consistency between the agent's model configuration and the provider definitions.
| models: | ||
| - id: "anthropic/claude-sonnet-4-6" | ||
| alias: "or-sonnet" | ||
| - id: "google/gemini-2.5-pro" | ||
| alias: "or-gemini-pro" | ||
| - id: "deepseek/deepseek-r1" | ||
| alias: "or-deepseek" |
There was a problem hiding this comment.
In the provider configuration (Section 8.2), cost details (cost_per_1k_input, cost_per_1k_output) are provided for anthropic and ollama models, but are omitted for openrouter models. For consistency in the design document, it would be clearer to either include example costs for all providers or state explicitly why they are omitted for some (e.g., due to dynamic pricing).
| ├── DESIGN_SPEC.md # This document | ||
| ├── README.md | ||
| ├── pyproject.toml | ||
| └── CLAUDE.md |
There was a problem hiding this comment.
There was a problem hiding this comment.
Pull request overview
Adds initial project scaffolding documentation and licensing to establish the project’s direction and usage terms.
Changes:
- Introduces a comprehensive high-level design specification (
DESIGN_SPEC.md) for the planned architecture and systems. - Adds a Business Source License 1.1 (
LICENSE) with a planned conversion to Apache 2.0. - Replaces the minimal README with a fuller project overview, planned features, and planned tech stack.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| README.md | New project overview, planned feature list, and planned tech stack, linking to the design spec. |
| LICENSE | Adds BSL 1.1 text with project-specific parameters and planned conversion license/date. |
| DESIGN_SPEC.md | Adds a detailed design spec covering architecture, systems, workflows, and research references. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "Non-production use" means use solely for personal learning, | ||
| academic research, testing, development, and evaluation in | ||
| non-commercial settings. | ||
| Change Date: Four years from the date the Licensed Work is published. |
There was a problem hiding this comment.
The BSL 1.1 “Change Date” parameter is typically a specific calendar date (used by the license text as a fixed conversion trigger). Using a relative phrase like “Four years from the date the Licensed Work is published” is ambiguous (publication date isn’t defined here) and can make the conversion date hard to determine. Consider setting an explicit ISO-8601 date (YYYY-MM-DD) tied to the first public release, and updating it per version if needed.
| Change Date: Four years from the date the Licensed Work is published. | |
| Change Date: 2030-01-01 |
| Additional Use Grant: You may make non-production use of the Licensed Work for | ||
| personal, educational, research, and evaluation purposes only. | ||
| Production use and commercial use of the Licensed Work require | ||
| a separate commercial license from the Licensor. | ||
|
|
||
| "Production use" means using the Licensed Work or any | ||
| derivative work in a live, revenue-generating, or | ||
| business-critical environment, including but not limited to: | ||
| deploying the Licensed Work as part of a commercial product or | ||
| service, using it to generate revenue directly or indirectly, | ||
| or offering it to third parties on a hosted or embedded basis. | ||
|
|
||
| "Non-production use" means use solely for personal learning, | ||
| academic research, testing, development, and evaluation in | ||
| non-commercial settings. |
There was a problem hiding this comment.
The “Additional Use Grant” section currently restates non-production use restrictions and introduces custom definitions of “Production use” / “Non-production use”. This can create confusion or conflict with the standard BSL 1.1 wording/definitions, and it doesn’t actually grant any additional production rights beyond the base license. Consider either (a) removing the Additional Use Grant entirely, or (b) using it only to explicitly allow the specific limited production uses you intend, while keeping the standard BSL definitions unchanged.
| Additional Use Grant: You may make non-production use of the Licensed Work for | |
| personal, educational, research, and evaluation purposes only. | |
| Production use and commercial use of the Licensed Work require | |
| a separate commercial license from the Licensor. | |
| "Production use" means using the Licensed Work or any | |
| derivative work in a live, revenue-generating, or | |
| business-critical environment, including but not limited to: | |
| deploying the Licensed Work as part of a commercial product or | |
| service, using it to generate revenue directly or indirectly, | |
| or offering it to third parties on a hosted or embedded basis. | |
| "Non-production use" means use solely for personal learning, | |
| academic research, testing, development, and evaluation in | |
| non-commercial settings. | |
| Additional Use Grant: None. |
| │ │ | ||
| │ ┌──────────────────────┐ ┌─────────────────────────────┐ │ | ||
| │ │ Web UI (Local) │ │ CLI Tool │ │ | ||
| │ │ React/Vue Dashboard │ │ ai-company <command> │ │ |
There was a problem hiding this comment.
The tech stack is inconsistent about the planned web UI framework: the architecture diagram calls it a “React/Vue Dashboard” while elsewhere (including README) it’s framed as Vue 3 (or at least “React or Vue 3”). Consider picking a single planned option or explicitly marking React vs Vue as an unresolved decision so readers don’t infer conflicting commitments.
| │ │ React/Vue Dashboard │ │ ai-company <command> │ │ | |
| │ │ Vue 3 Dashboard │ │ ai-company <command> │ │ |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 523-544: Update the markdown code fences for the workflow diagrams
to include a language spec (e.g., ```text) so linters/renderers treat them as
plain text; specifically modify the code blocks under the "Sequential Pipeline",
"Parallel Execution", "Kanban Board", and "Agile Sprints" headings in
DESIGN_SPEC.md to use a fenced code block with a language identifier such as
text.
- Around line 393-396: Update the protocol attributions in the DESIGN_SPEC where
A2A Protocol and MCP are listed: change the A2A attribution to state it is
governed by the Linux Foundation via the Agent2Agent (A2A) project (remove any
"Google/Linux Foundation" wording) and change the MCP attribution to indicate it
is stewarded by the Agentic AI Foundation (a directed fund under the Linux
Foundation), replacing any "Anthropic/Linux Foundation" phrasing; ensure the
lines mention the exact names "Agent2Agent (A2A) project" and "Agentic AI
Foundation (a directed fund under the Linux Foundation)".
In `@LICENSE`:
- Line 23: Update the "Change Date" entry to remove ambiguity by either
inserting an explicit calendar date (e.g., "Change Date: 2030-02-27" or "Change
Date: February 27, 2030") or by defining what "published" means (e.g., "Change
Date: Four years from the first public release of version 1.0" or "four years
from the date of the first public commit/tag"). Edit the existing "Change Date:
Four years from the date the Licensed Work is published." line to one of these
clear alternatives so the timing is unambiguous and enforceable.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
DESIGN_SPEC.mdLICENSEREADME.md
📜 Review details
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2026-01-24T16:33:29.354Z
Learnt from: CR
Repo: Aureliolo/story-factory PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-01-24T16:33:29.354Z
Learning: Applies to {src/agents/**/*.py,src/services/**/*.py} : Ollama Integration - all AI agents use Ollama for local LLM serving with default endpoint `http://localhost:11434`
Applied to files:
DESIGN_SPEC.md
📚 Learning: 2026-01-26T08:59:32.818Z
Learnt from: CR
Repo: Aureliolo/story-factory PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-01-26T08:59:32.818Z
Learning: Applies to README.md : Update README.md for significant feature changes
Applied to files:
README.md
📚 Learning: 2026-02-26T17:43:50.869Z
Learnt from: CR
Repo: Aureliolo/story-factory PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-26T17:43:50.869Z
Learning: When making changes that affect architecture, services, key files, settings, or workflows, update the relevant sections of existing documentation (CLAUDE.md, README.md, etc.) to reflect those changes.
Applied to files:
README.md
🪛 LanguageTool
DESIGN_SPEC.md
[typographical] ~1-~1: To join two clauses or introduce examples, consider using an em dash.
Context: # AI Company - High-Level Design Specification > A fra...
(DASH_RULE)
[typographical] ~193-~193: To join two clauses or introduce examples, consider using an em dash.
Context: ...### Engineering - Software Architect - System design, technology decisions, pat...
(DASH_RULE)
[typographical] ~194-~194: To join two clauses or introduce examples, consider using an em dash.
Context: ...Frontend Developer** (Junior/Mid/Senior) - UI implementation, components, state man...
(DASH_RULE)
[typographical] ~195-~195: To join two clauses or introduce examples, consider using an em dash.
Context: ...Backend Developer* (Junior/Mid/Senior) - APIs, business logic, databases - **Full...
(DASH_RULE)
[typographical] ~196-~196: To join two clauses or introduce examples, consider using an em dash.
Context: ...ll-Stack Developer** (Junior/Mid/Senior) - End-to-end implementation - **DevOps/SRE...
(DASH_RULE)
[typographical] ~197-~197: To join two clauses or introduce examples, consider using an em dash.
Context: ...implementation - DevOps/SRE Engineer - Infrastructure, CI/CD, monitoring, deplo...
(DASH_RULE)
[typographical] ~198-~198: To join two clauses or introduce examples, consider using an em dash.
Context: ...ring, deployment - Database Engineer - Schema design, query optimization, migra...
(DASH_RULE)
[typographical] ~199-~199: To join two clauses or introduce examples, consider using an em dash.
Context: ...tion, migrations - Security Engineer - Security audits, vulnerability assessmen...
(DASH_RULE)
[style] ~383-~383: Consider using a different adjective to strengthen your wording.
Context: ...#### Pattern 4: Hybrid (Recommended for Full Company) Combines all three: - **Messa...
(FULL_ENTIRE)
[style] ~616-~616: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...onality would complement the team? - What model/provider fits the budget? 3. Cand...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[grammar] ~619-~619: Please add a punctuation mark at the end of paragraph.
Context: ...company context, project briefing, team introductions #### Firing / Offboarding 1. Triggered...
(PUNCTUATION_PARAGRAPH_END)
[style] ~625-~625: You have used the passive voice repeatedly in nearby sentences. To make your writing clearer and easier to read, consider using active voice.
Context: ... 3. Active tasks are reassigned 4. Team is notified #### Performance Tracking ```yaml empl...
(REP_PASSIVE_VOICE)
[typographical] ~1324-~1324: In American English, use a period after an abbreviation.
Context: ...pid prototyping | ### 15.2 What Exists vs What We Need | Feature | MetaGPT | Cha...
(MISSING_PERIOD_AFTER_ABBREVIATION)
[typographical] ~1328-~1328: To join two clauses or introduce examples, consider using an em dash.
Context: ...ulation | Partial | Partial | No | Yes - complete | | HR (hiring/firing) | No |...
(DASH_RULE)
[typographical] ~1332-~1332: To join two clauses or introduce examples, consider using an em dash.
Context: ...alities | Basic | Basic | Basic | Deep - traits, styles, evolution | | Dynamic ...
(DASH_RULE)
[typographical] ~1333-~1333: To join two clauses or introduce examples, consider using an em dash.
Context: ... team scaling | No | No | Manual | Yes - auto + manual | | Multiple company typ...
(DASH_RULE)
[typographical] ~1334-~1334: To join two clauses or introduce examples, consider using an em dash.
Context: ...company types | No | No | Manual | Yes - templates + builder | | Security ops a...
(DASH_RULE)
[typographical] ~1336-~1336: To join two clauses or introduce examples, consider using an em dash.
Context: ...ble autonomy | No | No | Limited | Yes - full spectrum | | Local + cloud provid...
(DASH_RULE)
[typographical] ~1337-~1337: To join two clauses or introduce examples, consider using an em dash.
Context: ...rs | Partial | Partial | Partial | Yes - all via LiteLLM | | Cost tracking per ...
(DASH_RULE)
[typographical] ~1338-~1338: To join two clauses or introduce examples, consider using an em dash.
Context: ...racking per agent | No | No | No | Yes - full budget system | | Progressive tru...
(DASH_RULE)
[typographical] ~1345-~1345: In American English, use a period after an abbreviation.
Context: ...Planned (backlog) | ### 15.3 Build vs Fork Decision **Recommendation: Build ...
(MISSING_PERIOD_AFTER_ABBREVIATION)
[typographical] ~1357-~1357: To join two clauses or introduce examples, consider using an em dash.
Context: ...hat we use (not fork): - LiteLLM - Provider abstraction (don't reinvent thi...
(DASH_RULE)
[typographical] ~1358-~1358: To join two clauses or introduce examples, consider using an em dash.
Context: ...raction (don't reinvent this) - Mem0 - Agent memory (don't reinvent this) - **F...
(DASH_RULE)
[typographical] ~1359-~1359: To join two clauses or introduce examples, consider using an em dash.
Context: ...mory (don't reinvent this) - FastAPI - API layer - MCP - Tool integration s...
(DASH_RULE)
[typographical] ~1360-~1360: To join two clauses or introduce examples, consider using an em dash.
Context: ...his) - FastAPI - API layer - MCP - Tool integration standard - Pydantic...
(DASH_RULE)
[typographical] ~1361-~1361: To join two clauses or introduce examples, consider using an em dash.
Context: ...Tool integration standard - Pydantic - Config validation and data models - **Ty...
(DASH_RULE)
[typographical] ~1362-~1362: To join two clauses or introduce examples, consider using an em dash.
Context: ...g validation and data models - Typer - CLI - Vue 3 - Web UI --- ## 16. Op...
(DASH_RULE)
[typographical] ~1363-~1363: To join two clauses or introduce examples, consider using an em dash.
Context: ...ata models - Typer - CLI - Vue 3 - Web UI --- ## 16. Open Questions & Ris...
(DASH_RULE)
[style] ~1379-~1379: Consider using the typographical ellipsis character here instead.
Context: ... | Agent A asks Agent B who asks Agent A... | | 8 | Optimal message bus for local-f...
(ELLIPSIS)
[typographical] ~1380-~1380: In American English, use a period after an abbreviation.
Context: ...architecture? | Medium | asyncio queues vs Redis vs embedded broker | | 9 | How to...
(MISSING_PERIOD_AFTER_ABBREVIATION)
[typographical] ~1380-~1380: In American English, use a period after an abbreviation.
Context: ...ure? | Medium | asyncio queues vs Redis vs embedded broker | | 9 | How to handle c...
(MISSING_PERIOD_AFTER_ABBREVIATION)
[typographical] ~1381-~1381: In American English, use a period after an abbreviation.
Context: ...y? | High | Sandboxing strategy, Docker vs WASM vs subprocess | | 10 | What's the ...
(MISSING_PERIOD_AFTER_ABBREVIATION)
[typographical] ~1381-~1381: In American English, use a period after an abbreviation.
Context: ...h | Sandboxing strategy, Docker vs WASM vs subprocess | | 10 | What's the minimum ...
(MISSING_PERIOD_AFTER_ABBREVIATION)
[style] ~1456-~1456: To make your text as clear as possible to all readers, do not use this foreign term. Possible alternatives are ‘in fact’ or ‘in reality’.
Context: ...rop | | OpenAI API format | OpenAI (de facto standard) | LLM API interface | Via Lit...
(DE_FACTO)
[style] ~1456-~1456: This phrase is redundant (‘I’ stands for ‘interface’). Use simply “API”.
Context: ...at** | OpenAI (de facto standard) | LLM API interface | Via LiteLLM compatibility | ## Appen...
(ACRONYM_TAUTOLOGY)
[typographical] ~1460-~1460: To join two clauses or introduce examples, consider using an em dash.
Context: ...s://github.com/FoundationAgents/MetaGPT) - Multi-agent SOP framework (64.5k stars) ...
(DASH_RULE)
[typographical] ~1461-~1461: To join two clauses or introduce examples, consider using an em dash.
Context: ...2.0](https://github.com/openbmb/ChatDev) - Zero-code multi-agent platform (31.2k st...
(DASH_RULE)
[typographical] ~1462-~1462: To join two clauses or introduce examples, consider using an em dash.
Context: ...AI](https://github.com/crewAIInc/crewAI) - Role-based agent collaboration framework...
(DASH_RULE)
[typographical] ~1463-~1463: To join two clauses or introduce examples, consider using an em dash.
Context: ...n](https://github.com/microsoft/autogen) - Microsoft async multi-agent framework - ...
(DASH_RULE)
[typographical] ~1464-~1464: To join two clauses or introduce examples, consider using an em dash.
Context: ...LLM](https://github.com/BerriAI/litellm) - Unified LLM API gateway (100+ providers)...
(DASH_RULE)
[typographical] ~1465-~1465: To join two clauses or introduce examples, consider using an em dash.
Context: ...- Mem0 - Universal memory layer for AI agents - [...
(DASH_RULE)
[typographical] ~1466-~1466: To join two clauses or introduce examples, consider using an em dash.
Context: ...ocol](https://github.com/a2aproject/A2A) - Agent-to-Agent protocol (Linux Foundatio...
(DASH_RULE)
[typographical] ~1467-~1467: To join two clauses or introduce examples, consider using an em dash.
Context: ...extprotocol.io/specification/2025-11-25) - Model Context Protocol - [Langfuse Agent...
(DASH_RULE)
[typographical] ~1468-~1468: To join two clauses or introduce examples, consider using an em dash.
Context: ...com/blog/2025-03-19-ai-agent-comparison) - Framework comparison - [Confluent Event-...
(DASH_RULE)
[typographical] ~1469-~1469: To join two clauses or introduce examples, consider using an em dash.
Context: .../blog/event-driven-multi-agent-systems/) - Multi-agent architecture patterns - [Mic...
(DASH_RULE)
[typographical] ~1470-~1470: To join two clauses or introduce examples, consider using an em dash.
Context: ....io/multi-agent-reference-architecture/) - Enterprise patterns - [OpenRouter](https...
(DASH_RULE)
[typographical] ~1471-~1471: To join two clauses or introduce examples, consider using an em dash.
Context: ...s - OpenRouter - Multi-model API gateway
(DASH_RULE)
README.md
[typographical] ~11-~11: To join two clauses or introduce examples, consider using an em dash.
Context: ...s (Planned) - Any Company Structure - From a 2-person startup to a 50+ enterpr...
(DASH_RULE)
[typographical] ~12-~12: To join two clauses or introduce examples, consider using an em dash.
Context: ...nfig/templates - Deep Agent Identity - Names, personalities, skills, seniority ...
(DASH_RULE)
[typographical] ~13-~13: To join two clauses or introduce examples, consider using an em dash.
Context: ...erformance tracking - Multi-Provider - Anthropic Claude, OpenRouter (400+ model...
(DASH_RULE)
[typographical] ~14-~14: To join two clauses or introduce examples, consider using an em dash.
Context: ... via LiteLLM - Smart Cost Management - Per-agent budget tracking, auto model ro...
(DASH_RULE)
[typographical] ~15-~15: To join two clauses or introduce examples, consider using an em dash.
Context: ...optimization - Configurable Autonomy - From fully autonomous to human-approves-...
(DASH_RULE)
[typographical] ~16-~16: To join two clauses or introduce examples, consider using an em dash.
Context: ...agent in between - Persistent Memory - Agents remember past decisions, code, re...
(DASH_RULE)
[typographical] ~17-~17: To join two clauses or introduce examples, consider using an em dash.
Context: ...relationships (via Mem0) - HR System - Hire, fire, promote agents. HR agent ana...
(DASH_RULE)
[typographical] ~18-~18: To join two clauses or introduce examples, consider using an em dash.
Context: ...tem, git, code execution, web, databases - role-based and sandboxed - API-First...
(DASH_RULE)
[grammar] ~20-~20: Please add a punctuation mark at the end of paragraph.
Context: ...built company templates and interactive builder ## Status Design phase. See [DESI...
(PUNCTUATION_PARAGRAPH_END)
[typographical] ~37-~37: To join two clauses or introduce examples, consider using an em dash.
Context: ...- Design Specification - Full high-level design
(DASH_RULE)
🪛 markdownlint-cli2 (0.21.0)
DESIGN_SPEC.md
[warning] 76-76: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 178-178: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 185-185: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 192-192: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 201-201: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 207-207: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 212-212: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 218-218: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 256-256: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 332-332: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 352-352: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 365-365: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 461-461: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 523-523: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 524-524: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 524-524: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 528-528: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 529-529: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 529-529: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 535-535: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 536-536: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 536-536: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 543-543: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 544-544: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 544-544: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 567-567: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 610-610: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 621-621: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 627-627: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 628-628: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 645-645: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 747-747: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 894-894: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 967-967: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 986-986: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 1080-1080: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 1127-1127: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 1185-1185: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 1434-1434: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (8)
README.md (1)
1-37: LGTM! Clear and well-structured README for the design phase.The README accurately represents the project's current design phase status with appropriate "Planned" labels throughout. The tech stack aligns with the detailed specifications in DESIGN_SPEC.md, and the documentation structure provides a clear entry point for understanding the project's ambitious vision.
The static analysis hints about em dashes and punctuation are stylistic preferences for markdown lists and can be safely ignored - your current formatting is conventional and readable.
LICENSE (1)
1-72: Solid license choice with clear use restrictions.Business Source License 1.1 is a well-established choice for commercial open-source projects. The Additional Use Grant clearly defines production vs. non-production boundaries, and the four-year conversion to Apache 2.0 provides a clear path to fully open-source status. This aligns well with your stated PR objectives.
DESIGN_SPEC.md (6)
1311-1364: Excellent research section with sound build-from-scratch rationale.The framework comparison and build-vs-fork analysis is thorough and well-reasoned. The decision to build from scratch while leveraging battle-tested libraries (LiteLLM, Mem0, FastAPI, MCP) is pragmatic and aligns well with the unique requirements of this project.
The feature comparison clearly demonstrates the gaps in existing frameworks that justify a new implementation.
1367-1404: Outstanding risk awareness and honest assessment of open questions.The Open Questions & Risks section demonstrates excellent technical foresight. Highlighting challenges like context window exhaustion, agent communication loops, and conflicting agent opinions shows mature architectural thinking.
The proposed mitigations are practical:
- Context window: Memory summarization + task decomposition
- Cost explosion: Budget hard stops + loop detection
- Over-engineering: Start with 3-5 agents, iterate
This honest assessment of risks and uncertainties significantly strengthens the credibility of the design specification.
1-1471: Exceptionally comprehensive and well-architected design specification.This 1471-line design document demonstrates remarkable depth and breadth of thought. Key strengths include:
- Innovative unique features: HR system with hiring/firing, CFO agent for cost management, Security Operations agent, progressive trust system - none of these exist in current frameworks
- Multiple architecture patterns: Event-driven, hierarchical, meeting-based, and hybrid communication patterns provide flexibility
- Mature risk awareness: Section 16's open questions and risks show realistic assessment of implementation challenges
- Well-researched technology stack: Sound choices (Python, FastAPI, LiteLLM, Mem0, MCP) with clear rationales
- Clear MVP path: Starting with 3-5 agents and iterating avoids over-engineering
- Extensibility by design: Plugin architecture, custom roles, configurable autonomy levels
The comparison with existing frameworks (MetaGPT, CrewAI, AutoGen, etc.) clearly justifies building from scratch while leveraging battle-tested libraries.
Minor formatting improvements suggested by static analysis (missing language specs for code blocks, blank lines around headings) can be addressed during implementation but don't detract from the specification quality.
705-712: The LiteLLM integration section is accurate. As of February 2026, LiteLLM supports all features claimed in the design: multi-provider support with unified API, built-in cost tracking (with recent additions for agent-level cost tracking per query/token as of Dec 2025), automatic fallback chains and load balancing via the Router, and an OpenAI-compatible gateway that normalizes provider interfaces. The design's reliance on LiteLLM for these capabilities is well-founded.
135-140: No action needed. The model IDs referenced (claude-sonnet-4-6,claude-opus-4-6,claude-haiku-4-5) are valid Anthropic API identifiers as of February 2026 and match current naming conventions. The DESIGN_SPEC.md file contains accurate, current model references.Likely an incorrect or invalid review comment.
1303-1303: The "26% accuracy boost" claim for Mem0 is documented in peer-reviewed research and Mem0's official materials. It refers to a 26% relative improvement on the LoCoMo (Long Conversation Memory) benchmark's LLM-as-a-Judge metric over OpenAI Memory (66.9% vs 52.9%), published in Mem0's arXiv paper (arXiv:2504.19413). A public LoCoMo benchmark repository provides reproducible evaluation code. This claim has reliable sources and does not require additional verification.Likely an incorrect or invalid review comment.
|
|
||
| - **A2A Protocol** (Agent-to-Agent, Google/Linux Foundation) - For inter-agent task delegation, capability discovery via Agent Cards, and structured task lifecycle management | ||
| - **MCP** (Model Context Protocol, Anthropic/Linux Foundation) - For agent-to-tool integration, providing standardized tool discovery and invocation | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
A2A Agent-to-Agent Protocol governance standard body 2025 2026
💡 Result:
In 2025–2026, the governance / standards steward for the A2A (Agent2Agent / Agent-to-Agent) Protocol is the Linux Foundation, via the Agent2Agent (A2A) project (formed after Google transferred/donated the spec, SDKs, and tooling in June 2025). The project is run under Linux Foundation neutral governance, with a Technical Steering Committee (TSC) establishing the protocol’s governance structure and working groups. [1], [2], [3]
Answer: The Linux Foundation (Agent2Agent/A2A project, governed via a TSC). [1], [2]
Sources
[1] Google Developers Blog (Jun 23, 2025), “Google Cloud donates A2A to Linux Foundation” (developers.googleblog.com)
[2] A2A Protocol Roadmap (Governance section; “donated to the Linux Foundation”, TSC) (a2a-protocol.org)
[3] IANA well-known URI registration issue (lists “Change controller: Linux Foundation” + governance doc link) (github.com)
Citations:
- 1: https://developers.googleblog.com/en/google-cloud-donates-a2a-to-linux-foundation/?utm_source=openai
- 2: https://a2a-protocol.org/v0.3.0/roadmap/?utm_source=openai
- 3: Add Well-Known URI for
.well-known/agent.jsonfor Agent2Agent Protocol protocol-registries/well-known-uris#66
🌐 Web query:
Model Context Protocol MCP governance Anthropic Linux Foundation 2025 2026
💡 Result:
- What changed (governance): On December 9, 2025, Anthropic transferred stewardship of the Model Context Protocol (MCP) into the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation, alongside other founding project contributions (Block’s goose and OpenAI’s AGENTS.md). [1][2]
- How governance is structured: AAIF has a Governing Board responsible for strategic investments, budget allocation, member recruitment, and approving new projects, while each project (including MCP) retains autonomy over technical direction and day-to-day operations; the MCP project stated its maintainer-led governance continues and the Linux Foundation does not dictate MCP’s technical direction. [1]
- 2026 continuity / community operations: AAIF’s launch materials also point to ongoing community activities, including the next MCP Dev Summit in New York City on April 2–3, 2026. [2]
Sources: [1] MCP project blog (Dec 9, 2025) [2] AAIF / Linux Foundation press release (Dec 9, 2025)
Update A2A and MCP protocol attribution to reflect current governance structures.
As of February 2026, both protocol governance structures have changed:
- A2A Protocol: Following Google's donation in June 2025, it is now governed by the Linux Foundation via the Agent2Agent (A2A) project (not "Google/Linux Foundation").
- MCP: Following Anthropic's transfer in December 2025, it is now steward by the Agentic AI Foundation (a directed fund under the Linux Foundation), not directly by "Anthropic/Linux Foundation".
Update lines 393–396 to reflect current governance attribution.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@DESIGN_SPEC.md` around lines 393 - 396, Update the protocol attributions in
the DESIGN_SPEC where A2A Protocol and MCP are listed: change the A2A
attribution to state it is governed by the Linux Foundation via the Agent2Agent
(A2A) project (remove any "Google/Linux Foundation" wording) and change the MCP
attribution to indicate it is stewarded by the Agentic AI Foundation (a directed
fund under the Linux Foundation), replacing any "Anthropic/Linux Foundation"
phrasing; ensure the lines mention the exact names "Agent2Agent (A2A) project"
and "Agentic AI Foundation (a directed fund under the Linux Foundation)".
| #### Sequential Pipeline | ||
| ``` | ||
| Requirements ──▶ Design ──▶ Implementation ──▶ Review ──▶ Testing ──▶ Deploy | ||
| ``` | ||
|
|
||
| #### Parallel Execution | ||
| ``` | ||
| ┌──▶ Frontend Dev ──┐ | ||
| Task ───┤ ├──▶ Integration ──▶ QA | ||
| └──▶ Backend Dev ──┘ | ||
| ``` | ||
|
|
||
| #### Kanban Board | ||
| ``` | ||
| Backlog │ Ready │ In Progress │ Review │ Done | ||
| ○ │ ○ │ ● │ ○ │ ●●● | ||
| ○ │ ○ │ ● │ │ ●● | ||
| ○ │ │ │ │ ● | ||
| ``` | ||
|
|
||
| #### Agile Sprints | ||
| ``` |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider adding language specifications to code blocks for better rendering.
Several workflow diagram code blocks (lines 524, 529, 536, 544) don't specify a language identifier. While the ASCII art renders fine, adding a language spec like ```text would satisfy markdown linters and improve consistency.
This is a minor formatting suggestion and doesn't affect the technical content.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 523-523: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 524-524: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 524-524: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 528-528: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 529-529: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 529-529: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 535-535: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 536-536: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 536-536: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 543-543: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 544-544: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 544-544: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@DESIGN_SPEC.md` around lines 523 - 544, Update the markdown code fences for
the workflow diagrams to include a language spec (e.g., ```text) so
linters/renderers treat them as plain text; specifically modify the code blocks
under the "Sequential Pipeline", "Parallel Execution", "Kanban Board", and
"Agile Sprints" headings in DESIGN_SPEC.md to use a fenced code block with a
language identifier such as text.
| "Non-production use" means use solely for personal learning, | ||
| academic research, testing, development, and evaluation in | ||
| non-commercial settings. | ||
| Change Date: Four years from the date the Licensed Work is published. |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider specifying an explicit Change Date.
The current wording "Four years from the date the Licensed Work is published" leaves the publication date ambiguous. For clarity and enforceability, consider specifying an explicit date or defining what "published" means (e.g., "first public release", "first GitHub commit", etc.).
Example alternatives:
Change Date: 2030-02-27(explicit date, four years from now)Change Date: Four years from the first public release of version 1.0Change Date: February 27, 2030
📅 Proposed fix for explicit date
-Change Date: Four years from the date the Licensed Work is published.
+Change Date: February 27, 2030📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Change Date: Four years from the date the Licensed Work is published. | |
| Change Date: February 27, 2030 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@LICENSE` at line 23, Update the "Change Date" entry to remove ambiguity by
either inserting an explicit calendar date (e.g., "Change Date: 2030-02-27" or
"Change Date: February 27, 2030") or by defining what "published" means (e.g.,
"Change Date: Four years from the first public release of version 1.0" or "four
years from the date of the first public commit/tag"). Edit the existing "Change
Date: Four years from the date the Licensed Work is published." line to one of
these clear alternatives so the timing is unambiguous and enforceable.
…eviewers Source changes (tracker.py): - Fix budget_used_percent/alert_level threshold disagreement (#1) - Add logger.warning before ValueError in _validate_time_range (#2) - Split build_summary into _build_agent_spendings, _build_dept_spendings, _build_budget_context helpers to meet <50 line guideline (#3) - Replace _aggregate tuple return with _AggregateResult NamedTuple (#4) - Rewrite _aggregate as single-pass loop (#5) - Aggregate departments from AgentSpending objects, not raw records (#6) - Add DEBUG log in __init__ for object creation (#7) - Fix module docstring "Section 10.2 service layer" accuracy (#15) - Add DEBUG entry logs for get_total_cost/get_agent_cost (#16) Event constants (events.py): - Add BUDGET_TRACKER_CREATED, BUDGET_TOTAL_COST_QUERIED, BUDGET_AGENT_COST_QUERIED, BUDGET_TIME_RANGE_INVALID Test improvements (test_tracker.py): - Assert budget_used_percent in all alert-level tests (#8) - Assert budget_total_monthly in configured summary test (#9) - Add start==end raises ValueError test (#10) - Add end-only time filter test for get_total_cost (#11) - Add multi-record token aggregation assertions (#12) - Add pytestmark = pytest.mark.timeout(30) (#13) - Move @pytest.mark.unit to class level (#14) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t, Gemini, and greptile - Add produced_artifacts field to AgentRunResult (#1) - Wrap _log_completion in try/except to preserve valid results (#2) - Add test for inner TimeoutError propagation without engine timeout (#3) - Extract _run_loop_with_timeout from _execute (50-line limit) (#4) - Extract _validate_run_inputs from run() (50-line limit) (#5) - Rename metrics docstrings from "completed task" to "agent run" + add termination_reason to metrics event (#6) - Fix raise exc from build_exc chain direction (#7) - Replace asyncio.wait_for with asyncio.wait for timeout disambiguation (#8) - Add test for _apply_post_execution_transitions failure resilience (#9) - Add test for timeout cost recording behavior (#10) - Fix hardcoded from_status in transition logs (#11) - Add agent-task ownership check in _validate_task (#12) - Split test_invalid_timeout_raises into two test methods (#13) - Add negative validation tests for cost_per_task/duration_seconds (#14) - Add test_blank_task_id_rejected (#15) - Update _execute docstring to mention timeout, transitions, metrics (#16) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, and Copilot Source fixes: - Move asyncio/types out of TYPE_CHECKING for PEP 649 compatibility (#1, #2) - Guard is_closing() inside try/except in _process_cleanup.py (#4) - Normalize all control chars (incl. newlines/tabs) in _sanitize_stderr (#6) - Apply stderr sanitization to sandbox git path too (#3) - Fix list_directory truncation to use raw scan cap, not post-classification count (#7) - Narrow except to InvalidStateError + log task identity in shutdown (#8) - Add loop.stop() fallback when request_shutdown() fails in signal handlers (#9) - Include zombie diagnostic in _drain_after_kill stderr output (#10) - Remove @staticmethod from _log_post_cancel_exceptions (#17) - Add _process_cleanup.py to DESIGN_SPEC §15.3 (#12) and §11.1.1 (#16) Test additions: - New test_process_cleanup.py: 7 tests covering all transport states (#5) - Add _sanitize_stderr truncation test (#14) - Add _log_post_cancel_exceptions tests (4 tests) (#11) - Add signal handler recovery tests (3 tests) (#15) - Use ValidationError instead of ValueError in config tests (#13) - Update existing tests for new sanitization behavior Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes from code-reviewer, docs-consistency, issue-resolution-verifier agents plus Gemini, Greptile, Copilot, and CodeRabbit external reviewers: - Fix ADR "custom stack is the initial backend" contradiction (→ Mem0) - Update 6 stale "TBD" references in DESIGN_SPEC.md to reflect ADR-001 - Update CLAUDE.md package structure memory/ comment - Update README.md memory layer and milestone status - Clarify architecture constraint #1 (MVP exception for in-process) - Add Zep→Graphiti pivot context note - Fix "procedur" typo in ASCII diagram - Clarify Kuzu concurrency issues (architectural, not bugs) - Clarify S10/S11 scores and G3 gate for in-process vs full stack - Add docs/decisions/ to project structure listing - Fix minor typography (vs. abbreviation, unit spacing, paragraph endings) - Update swappability table to reflect Mem0-first ordering - Resolve Open Question #14 as resolved - Update risk mitigation and extensibility notes
- Add routing optimization feature (#1): new suggest_routing_optimizations() method, RoutingSuggestion and RoutingOptimizationAnalysis models - Add negative estimated_cost_usd validation (#2) - Fix double snapshot in generate_report (#3) - Fix deviation_factor to use spike_ratio when stddev=0 (#4) - Convert DowngradeAnalysis.total_estimated_savings_per_1k to @computed_field (#5) - Change str to NotBlankStr in SpendingReport tuple fields (#6) - Add window_count upper bound validation (#7) - Pre-group records by agent for O(N+M) complexity (#8) - Update DESIGN_SPEC.md implementation snapshot (#9) - Use projected alert level for auto-deny check (#11) - Move approval log after ApprovalDecision construction (#12) - Add ReportGenerator.__init__ debug log + event constant (#13) - Fix _ALERT_LEVEL_ORDER comment (#14) - Fix _classify_severity docstring for dual-use (#15) - Add WARNING logs before ValueError raises (#16) - Update evaluate_operation docstring (#17) - Add sort-order validator to EfficiencyAnalysis.agents (#18) - Add debug log when _find_most_used_model returns None (#19) - Remove redundant stddev > 0 check in is_sigma_anomaly (#20) - Document approval_warn_threshold_usd=0.0 behavior (#21) - Extract helpers to _optimizer_helpers.py to stay under 800-line limit
…reptile - Cap prompt_tokens instead of rejecting when heuristic exceeds actual (#2) - Log policy_length instead of policy content to avoid leaks (#6) - Sort _ACTION_VERBS for deterministic regex alternation (#8) - Use PROMPT_POLICY_VALIDATION_FAILED event for advisory failures (#12) - Add isinstance check and strip whitespace in TagBasedMemoryFilter (#13) - Use MEMORY_FILTER_INIT event for filter-init log paths (#14, #16) - Remove content_preview from store_guard warning log (#18) - Track tools section conditionally for custom templates (#10) - Reconcile enforced vs advisory wording in DESIGN_SPEC (#1) - Inject fixed estimated_tokens in prompt-ratio test (#19) - Parametrize timeout and org_policies tests (#20, #22) - Add prompt_token_ratio assertion in lifecycle test (#21)
Security fixes: - Mask sensitive values in validation error messages (#2) - Sanitize SettingsEncryptionError in API response (#3) - Don't cache decrypted plaintext secrets (#4) - Validate namespace against SettingNamespace enum at controller (#8) - Error on empty SYNTHORG_SETTINGS_KEY instead of silently ignoring (#9) Persistence fixes: - V9 migration: prefer settings_old as copy source when both exist (#1) - Remove redundant idx_settings_namespace index (PK covers it) (#15) - Fix updated_at format consistency (ISO 8601 with timezone) (#17) - Align get_setting/set_setting types with protocol (NotBlankStr) (#10) Performance: - Use batch repo methods in get_all/get_namespace (eliminate N+1) (#6) Code quality: - Fix Any type to SettingsService on create_app parameter (#5) - Add SETTINGS_NOTIFICATION_FAILED + SETTINGS_FETCH_FAILED events (#7,#16) - Fix wrong event constants in repo error paths and notification (#7,#16) - Remove duplicate SettingNotFoundError guard in get_entry (#21) - Log SETTINGS_VALUE_RESOLVED for all resolution sources (#22) Tests: - Add V9 migration crash-safety tests (#11) - Add tests for sensitive read without encryptor (#12) - Add tests for get_all/get_namespace service methods (#13) - Add tests for notification exception handling (#14) - Use pydantic.ValidationError instead of broad Exception (#18) - Add INTEGER, JSON, and sensitive-masking validation tests (#19) - Use actual Unicode characters in encryption roundtrip test (#20)
Security fixes: - Mask sensitive values in validation error messages (#2) - Sanitize SettingsEncryptionError in API response (#3) - Don't cache decrypted plaintext secrets (#4) - Validate namespace against SettingNamespace enum at controller (#8) - Error on empty SYNTHORG_SETTINGS_KEY instead of silently ignoring (#9) Persistence fixes: - V9 migration: prefer settings_old as copy source when both exist (#1) - Remove redundant idx_settings_namespace index (PK covers it) (#15) - Fix updated_at format consistency (ISO 8601 with timezone) (#17) - Align get_setting/set_setting types with protocol (NotBlankStr) (#10) Performance: - Use batch repo methods in get_all/get_namespace (eliminate N+1) (#6) Code quality: - Fix Any type to SettingsService on create_app parameter (#5) - Add SETTINGS_NOTIFICATION_FAILED + SETTINGS_FETCH_FAILED events (#7,#16) - Fix wrong event constants in repo error paths and notification (#7,#16) - Remove duplicate SettingNotFoundError guard in get_entry (#21) - Log SETTINGS_VALUE_RESOLVED for all resolution sources (#22) Tests: - Add V9 migration crash-safety tests (#11) - Add tests for sensitive read without encryptor (#12) - Add tests for get_all/get_namespace service methods (#13) - Add tests for notification exception handling (#14) - Use pydantic.ValidationError instead of broad Exception (#18) - Add INTEGER, JSON, and sensitive-masking validation tests (#19) - Use actual Unicode characters in encryption roundtrip test (#20)
… reviewers Source fixes: - Use NotBlankStr | None for DualModeConfig.summarization_model (#1) - Parallelize LLM calls with asyncio.TaskGroup in summarize_batch and _build_content (#2) - Remove dead-code guards in _build_anchors (#3) - Narrow except Exception to re-raise non-retryable ProviderErrors (#4) - Fix double-logging on abstractive fallback (#9) - Remove unnecessary import builtins (#10) - Preserve key-value pairs verbatim in extractive mode (#5) - Emit extracted facts one per line (#6) - Strengthen ConsolidationResult validator with cross-field checks (#7) - Check _backend.delete() return value in _process_group (#8) - Fix mode_map type to dict[NotBlankStr, ArchivalMode] (#11) - Move tie-breaking comment to _determine_group_mode (#12) - Fix misleading DualModeConfig docstring (#13) - Add missing mkdocstrings entries for retention/archival/simple_strategy (#14) - Use O(M) lookup dict in _archive_entries (#15) - Document 1000-entry query limit in run_consolidation docstring (#16) - Add Raises section to AbstractiveSummarizer docstring (#17) Test fixes: - Fix imports in test_density.py to module level (#18) - Strengthen fallback assertion to verify exact content (#19) - Use exact call counts for summarizer/extractor (#20) - Add tests: blank model rejection, MemoryError/RecursionError propagation (#21, #22) - Add tests: validator rejects invalid archival state (#23) - Add tests: 50/50 tie-breaking, None relevance handling (#24, #25) - Assert actual preserved facts in extractive tests (#26) - Prove archival index keyed by original_id not position (#27) - Add test: empty string classifies as SPARSE (#28)
… and Gemini - Fix resume path to call _resolve_loop instead of using static self._loop (#1) - Validate loop_type/hybrid_fallback against _KNOWN_LOOP_TYPES at config time (#3) - Fix redundant any() scan producing false-positive NO_RULE_MATCH warning (#4) - Downgrade EXECUTION_LOOP_BUDGET_UNAVAILABLE to DEBUG to avoid log noise (#5) - Add auto_loop_config to AgentEngine class docstring (#6) - Reduce enforcer.py to 799 lines (was 806, limit 800) (#7) - Fix select_loop_type Returns docstring accuracy (#8) - Fix build_execution_loop docstring to mention hybrid (#9) - Add EXECUTION_LOOP_BUDGET_UNAVAILABLE assertion in budget-error test (#10) - Add resume path test for _resolve_loop (#11) - Add test: rule mapping to react does not trigger NO_RULE_MATCH (#12) - Add _resolve_loop docstring note about compaction/plan_execute_config (#13) - Update module docstring to mention AutoLoopConfig/AutoLoopRule (#14) - Simplify verbose log note string (#15) - Add configurable default_loop_type to AutoLoopConfig (Gemini enhancement) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… and Gemini - Fix resume path to call _resolve_loop instead of using static self._loop (#1) - Validate loop_type/hybrid_fallback against _KNOWN_LOOP_TYPES at config time (#3) - Fix redundant any() scan producing false-positive NO_RULE_MATCH warning (#4) - Downgrade EXECUTION_LOOP_BUDGET_UNAVAILABLE to DEBUG to avoid log noise (#5) - Add auto_loop_config to AgentEngine class docstring (#6) - Reduce enforcer.py to 799 lines (was 806, limit 800) (#7) - Fix select_loop_type Returns docstring accuracy (#8) - Fix build_execution_loop docstring to mention hybrid (#9) - Add EXECUTION_LOOP_BUDGET_UNAVAILABLE assertion in budget-error test (#10) - Add resume path test for _resolve_loop (#11) - Add test: rule mapping to react does not trigger NO_RULE_MATCH (#12) - Add _resolve_loop docstring note about compaction/plan_execute_config (#13) - Update module docstring to mention AutoLoopConfig/AutoLoopRule (#14) - Simplify verbose log note string (#15) - Add configurable default_loop_type to AutoLoopConfig (Gemini enhancement) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… and Gemini - Fix resume path to call _resolve_loop instead of using static self._loop (#1) - Validate loop_type/hybrid_fallback against _KNOWN_LOOP_TYPES at config time (#3) - Fix redundant any() scan producing false-positive NO_RULE_MATCH warning (#4) - Downgrade EXECUTION_LOOP_BUDGET_UNAVAILABLE to DEBUG to avoid log noise (#5) - Add auto_loop_config to AgentEngine class docstring (#6) - Reduce enforcer.py to 799 lines (was 806, limit 800) (#7) - Fix select_loop_type Returns docstring accuracy (#8) - Fix build_execution_loop docstring to mention hybrid (#9) - Add EXECUTION_LOOP_BUDGET_UNAVAILABLE assertion in budget-error test (#10) - Add resume path test for _resolve_loop (#11) - Add test: rule mapping to react does not trigger NO_RULE_MATCH (#12) - Add _resolve_loop docstring note about compaction/plan_execute_config (#13) - Update module docstring to mention AutoLoopConfig/AutoLoopRule (#14) - Simplify verbose log note string (#15) - Add configurable default_loop_type to AutoLoopConfig (Gemini enhancement) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Security hardening: - Service catch-all now respects configured error policy (DENY/ESCALATE) instead of unconditionally returning rule verdict (#1) - Default error policy changed from USE_RULE_VERDICT to ESCALATE (#30) - XML-delimited prompt to resist prompt injection (#6) - Configurable reason visibility (full/generic/category) to prevent adversarial feedback loop via LLM reason leakage (#7) - Configurable argument truncation strategy (whole_string/per_value/ keys_and_values) to prevent padding bypass (#28) - Reason field sanitized for newlines/control chars (#16) - MemoryError/RecursionError guard added to engine._safe_evaluate (#2) Code quality: - Enum values and mappings derived from source enums (no drift) (#9) - MappingProxyType for read-only lookup dicts (#10) - _call_llm and _select_provider refactored under 50-line limit (#11) - Warning log when _select_model falls back to provider name (#13) - Debug log on JSON serialization fallback (#14) - error_type/error_message in structured exception logging (#15) - USE_RULE_VERDICT annotates reason with failure context (#8) - Em-dashes replaced with ASCII dashes (#5) - Module/method docstrings corrected (#4, #12, #24, #25) - LlmFallbackConfig/LlmFallbackErrorPolicy re-exported (#3) - agent_visible_reason field on SecurityVerdict for invoker (#7) Tests: - Parametrized verdict parsing and error policy tests (#27) - Timeout test uses asyncio.Future instead of real sleep (#23) - Truncation assertion tightened from 10000 to 600 (#20) - Fixed mismatched matched_rules in integration helper (#21) - Added: DENY+LOW confidence safety net test (#18) - Added: MemoryError propagation test - Added: reason sanitization test - Added: USE_RULE_VERDICT annotation test - Added: reason visibility config test - Added: per-value truncation test - Added: XML delimiter test - Added: audit entry confidence assertion (#26) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… Gemini - Fix step order: tame third-party loggers before applying per-logger levels so user config overrides take precedence (#1) - Replace contextlib.suppress(Exception) with try/except + stderr warning to match _clear_root_handlers error handling pattern (#2) - Fix misleading docstring: clarify litellm attribute suppression is conditional while handler cleanup runs unconditionally (#3) - Change PR body from Closes #72 to Related to #72 (already closed by PR #73) (#4) - Add precondition to test_litellm_suppress_debug_info_enabled (#5) - Add suppress_debug_info to CLAUDE.md observability description (#6) - Add test for handler.close() failure warning to stderr (#7) - Add test for removing multiple handlers from single logger (#8) - Add autouse fixture to reset third-party logger state between tests (#9) - Strengthen test_skips_litellm_when_not_imported to verify handler and level cleanup still works (#10) - Add test for non-LiteLLM handler removal (httpx) (#11) - Update docs/design/operations.md step number and override note Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uild (#1650) ## Summary Two real CI regressions on the last 4 main commits, both rooted in changes that landed earlier this month and surfaced quietly. Plus the downstream cascade they caused. ### #1 — `Publish Fine-Tune (cpu)` digest-mismatch (commit `8178b500`) `docker manifest push` stdout was being grepped for `sha256:[0-9a-f]{64}` to capture the manifest digest for cosign signing. That output is not stable across Docker versions: some emit the digest, some only print `Created manifest list <ref>`. On a runner whose Docker happened to print only the latter, the grep returned empty -> `DIGEST=""` -> the cosign sign step failed with `Push step did not produce a digest -- cannot sign image`. **Regressed in PR #1542 (2026-04-24)** when `publish-image` was extracted into a composite action. Before then, the inline code in `docker.yml` did `DIGEST="$PUSH_OUT"` (capturing whole stdout). The grep extraction was a hardening attempt that became a foot-gun on Docker versions that don't emit the digest. **Fix:** Switch all three manifest-push sites (`publish-image/action.yml`, `publish-apko-base/action.yml`, `docker.yml`) to read the digest back from the registry via `docker buildx imagetools inspect <ref> --format '{{.Manifest.Digest}}'`. That command is deterministic across Docker versions and returns the canonical manifest list digest -- exactly what cosign + attest-build-provenance expect. buildx is preinstalled on `ubuntu-latest` runners so no setup step is needed. ### #2 — DAST `ZAP API Scan` BASE_IMAGE blank (commit `8178b500`) PR #1465 dropped the `:latest` default for `BASE_IMAGE` in `docker/backend/Dockerfile` to satisfy Scorecard Pinned-Dependencies, on the premise that "CI always provides BASE_IMAGE". DAST runs `docker compose -f docker/compose.yml build backend` without supplying it, and `compose.yml` had no `args:` mapping for `BASE_IMAGE`. The very next weekly DAST cron failed with `failed to solve: base name (${BASE_IMAGE}) should not be blank`. **Fix:** - `docker/compose.yml`: add `build.args.BASE_IMAGE: ${BASE_IMAGE}` pass-through to the backend service. Empty/unset still fails fast at Dockerfile parse, preserving the Pinned-Dependencies invariant -- no `:latest` default reintroduced. - `.github/workflows/dast.yml`: build the apko-composed Wolfi base in-workflow via the existing `build-apko-base` composite action, then pass its `sha-<short>` tag (pinned by definition of the running commit, no floating registry tag) through the new compose `build.args` mapping. Adds the `security-events: write` scope required by `build-apko-base` for Trivy SARIF upload. - `.github/workflows/docker.yml` (validate-compose): add `BASE_IMAGE=ci-validation-dummy:placeholder` to the dummy `docker/.env` so `docker compose config --quiet` doesn't emit a "variable not set" warning every CI run. The Dockerfile is never built during `compose config`, so the placeholder is never consumed. ### #3 — `Update Release Notes` cascade (commit `bba4aa37`) Direct downstream of #1: fine-tune cpu/gpu publish failed -> dev release tag was never created -> the release-notes step couldn't find `v0.7.4-dev.15` after 6 attempts. Resolves automatically once #1 lands. ### #4 — CodSpeed neutral on `074a8f11` Out of scope here. Already tracked as #1640 (investigation issue opened today). ## Test plan - `docker buildx imagetools inspect --format '{{.Manifest.Digest}}'` is the documented way to read a manifest list digest from the registry, works for both single-arch (fine-tune-cpu) and multi-arch (backend, sandbox, sidecar, fine-tune-gpu) manifests. - DAST workflow: the `build-apko-base` composite action does `docker load` of the built tarball and re-tags to `ghcr.io/aureliolo/synthorg-backend-base:sha-<short>`, so the subsequent `docker compose build backend` resolves `BASE_IMAGE` from the local Docker daemon without going to the registry. Same flow that backend / sandbox / sidecar / fine-tune builds already use in `docker.yml`. - `validate-compose` job continues to pass; warning is silenced. - Pre-commit hooks pass (no Python / Go / Dockerfile changes; YAML / commitizen / gitleaks / no-em-dashes all green). - Pre-PR review: 2 agents (`docs-consistency`, `infra-reviewer`). 1 valid finding (validate-compose warning), addressed in the second commit. ## Review coverage Pre-reviewed by 2 agents, 1 finding addressed. ## Files - `.github/actions/publish-image/action.yml` — buildx imagetools digest read - `.github/actions/publish-apko-base/action.yml` — buildx imagetools digest read - `.github/workflows/docker.yml` — buildx imagetools digest read in inline manifest path; placeholder BASE_IMAGE in validate-compose - `.github/workflows/dast.yml` — build apko base in-workflow, pass BASE_IMAGE through; add `security-events: write` scope; hoist `TRIVY_VERSION` to workflow env - `docker/compose.yml` — `args.BASE_IMAGE: ${BASE_IMAGE}` pass-through on the backend service
Closes #1684. ## Summary Bundle of 9 cost / budget / audit-trail findings from the 2026-04-30 codebase audit. Every LLM call now produces a `CostRecord`, every aggregation enforces same-currency, and every security-relevant connection / custom-rule mutation reaches the audit chain. ### Currency aggregation invariant - `synthorg.budget.category_analytics.build_category_breakdown` calls `_assert_single_currency()` and raises `MixedCurrencyAggregationError` (HTTP 409) on mismatch. - `synthorg.hr.performance.multi_window_strategy._compute_single_window` switches monetary `sum(...)` → `math.fsum(...)`, matching `budget._aggregation.sum_cost`. ### Cost-tracking coverage - `_probe_provider()` wraps `driver.complete()` in `cost_recording_scope()` with `agent_id="system"`, `task_id=f"system:providers:test_connection:{name}"`, `call_category=LLMCallCategory.SYSTEM`. - `ProviderManagementService.__init__` accepts `cost_tracker: CostTracker | None`; `AppState` threads `self._cost_tracker` through. - `_safe_task_id_segment` strips control / whitespace characters from the user-supplied provider name before stamping the probe `task_id` (closes a log-injection vector flagged by the security review). ### Audit-chain coverage on security-sensitive ops New `SECURITY_*` event constants in `events/security.py`: - `SECURITY_CONNECTION_CREATED / UPDATED / DELETED` - `SECURITY_CONNECTION_SECRET_REVEALED / SECRET_REVEAL_FAILED` (renamed from the deleted `integrations.connection.secret_*` constants — pre-alpha, no compat shim) - `SECURITY_CUSTOM_RULE_CREATED / UPDATED / DELETED / TOGGLED` Connection + custom-rule controllers now emit signed entries on every mutation success and on every reveal-failed branch. Field naming aligns with the `SECURITY_PROVIDER_*` convention: bare `connection` / `rule` (the canonical identifier) plus context (`connection_type`, `auth_method`, `fields_changed`, `rule_name`, `metric_path`, `severity`). ### Audit-chain bridge fix The pre-existing `AuditChainSink` filter only caught plain stdlib emissions; structlog records (`record.msg` is a dict or tuple) slid past `record.getMessage().startswith("security.")`. Added `_extract_event_name(record) -> str | None` that handles all three shapes (str / dict / tuple) and returns `None` for unknown shapes. The caller logs an `AUDIT_CHAIN_RECORD_SHAPE_UNKNOWN` warning so a future logging-bridge change can't silently drop security events. `emit()` also splits `concurrent.futures.TimeoutError` into a dedicated `AUDIT_CHAIN_EMIT_TIMEOUT` log path so operators can triage TSA / signer hangs separately from generic emit errors. ### `_do_test_connection` retry-exhaustion signal `RetryExhaustedError` is now caught explicitly before the generic `ProviderError` clause and tagged `retry_exhausted=True` so retry-tier exhaustion is distinguishable from one-off provider errors in audit logs. ## Test plan ```bash uv run ruff check src/ tests/ # green uv run ruff format --check src/ tests/ # green uv run mypy src/ tests/ # green (3583 files) uv run python -m pytest tests/ -m unit -n 8 --ignore=tests/benchmarks # 25,989 passed, 16 skipped (10 new tests over baseline) ``` New / extended test coverage: - `tests/unit/budget/test_category_analytics.py`: 3 tests for mixed-currency rejection and clean same-currency aggregation. - `tests/unit/hr/performance/test_multi_window_strategy.py`: `math.fsum` spy test plus `currency=None` rejection at the model boundary. - `tests/unit/providers/management/test_probe_cost_recording.py` (new): probe emits exactly one `CostRecord`, no-tracker is a strict no-op (asserts `current_cost_context()` stays `None`), failure does not record, and the cost-recording scope propagates exceptions. - `tests/unit/observability/audit_chain/test_audit_chain.py`: `TestExtractEventName` parametrized over all four record-msg shapes; `TestAuditChainSinkFailurePaths` covers signing timeout, signer crash, callback re-entry, and unknown-shape warning. - `tests/integration/integrations/test_controllers.py`: `TestConnectionAuditEvents` collapses six near-identical tests into one parametrized failure-path case; payload assertions confirm the `connection`, `field`, and `reason` fields. - `tests/unit/meta/test_custom_rules_api.py`: `TestCustomRuleAuditEvents` asserts payload fields on every mutation hop (rule, rule_name, metric_path, severity, enabled). - `tests/integration/observability/test_audit_chain_coverage.py` (new): end-to-end coverage with a real `AuditChainSink` + mock signer, drives the connection + custom-rule lifecycles, asserts chain integrity stays green, and verifies tampering breaks `verify_integrity()`. ## Review coverage Pre-PR review: 14 specialized agents ran in parallel. 20 valid findings, all addressed. 5 false positives preserved in `_audit/pre-pr-review/triage.md` for the record (PEP 758 except syntax misreads, a redundant set-clear suggestion, a `patch.object` spec false flag, and a confusion between operational vs security event namespaces). Reviewer dimensions covered: docs-consistency, code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, logging-audit, resilience-audit, conventions-enforcer, security-reviewer, api-contract-drift, test-quality-reviewer, async-concurrency-reviewer, issue-resolution-verifier. ## Documentation - `docs/reference/conventions.md` §13: event-domain inventory now lists `security` and `audit_chain`; cross-links the audit-chain section. - `docs/design/observability.md`: new `## Audit chain` section documents the `security.*` / `tool.registry.integrity.*` opt-in rule, the three record-msg shapes `_extract_event_name` handles, and the three failure paths (timeout / generic error / callback error). ## Out of scope (per issue) - Webhook + cost-record idempotency (issue #1). - Missing rate limits (issue #5). - Streaming-cost-record coverage: `BaseCompletionProvider.stream` is documented as caller-responsible in `providers/base.py:214-225`.
14 valid findings from 18 review agents: CRITICAL #1: restore issue-spec janitor defaults (5min interval + 24h idle TTL) in synthorg.settings.definitions.communication and lifecycle_helpers fallback constants. CRITICAL #4: fix broken anchor link in docs/api/layer.md by adding an Auth section to docs/api/core.md that documents synthorg.core.auth.{config,models,session,refresh_record,roles}. HIGH #2: move sub.last_active=now mutation back inside _lock so the publish path matches the _Subscriber dataclass docstring's locking-invariant claim. HIGH #3: wrap EventStreamHub._janitor_loop body in try/except, re-raise CancelledError/MemoryError/RecursionError, log every other failure under new EVENT_STREAM_HUB_JANITOR_FAILED constant and continue. HIGH #8: add test_stop_timeout_marks_unrestartable using monkeypatch on asyncio.wait_for to deterministically exercise the timeout branch. MEDIUM #5: add 5 new event-stream-hub event-constant rows (STARTED/STOPPED/STOP_TIMEOUT/JANITOR_PRUNED/JANITOR_FAILED) to docs/design/observability.md events table. MEDIUM #6: link new docs/reference/retry-patterns.md from CLAUDE.md reference index. MEDIUM #7: document the two new event-stream janitor settings in observability.md Runtime Settings. MEDIUM #9: add test_subscribe_initialises_last_active. MEDIUM #10: add parametrised test_constructor_rejects_invalid_arguments. INFO #11/#12/#13/#14: useChannelHandler JSDoc note about useCallback wrapping + inline cleanup-ordering comment; retry-patterns.md maintainer note about updating per-pattern Sites lists; _Subscriber dataclass comment documenting the frozen-by-default deviation.
## Summary Migrates every plain-`Exception` class in `src/synthorg/` to a `DomainError`-rooted hierarchy and ships `scripts/check_domain_error_hierarchy.py` -- an AST gate that prevents new violations. The gate fails closed: any new class with a direct `Exception` / `RuntimeError` / `LookupError` / `PermissionError` / `ValueError` / `TypeError` / `KeyError` / `IndexError` / `AttributeError` / `OSError` / `IOError` base must either reach `DomainError` via another base or carry a per-line `# lint-allow: domain-error-hierarchy -- <reason>` opt-out. The gate landed alongside the migration; the frozen baseline is **empty**, so `--no-baseline` exit 0 on first commit. Closes #1738. ## Migrations **Cited families (Sub-tasks 2A-2D)**: - `backup/errors.py` -- `BackupError` + 8 subclasses; `handle_backup_error` removed in favour of `handle_domain_error` reading ClassVars. - `config/errors.py` -- `ConfigError` chain (incl. `ConfigValidationError` -> 422). - `core/registry/errors.py` -- `StrategyFactoryError` chain. - `templates/errors.py` -- auto-rooted via `ConfigError`. - `hr/errors.py` (HRError + ~22 subclasses), `hr/scaling/errors.py`, `security/trust/errors.py`. - `settings/errors.py` -- `SettingNotFoundError` -> 404, `SettingValidationError` -> 422. - `meta/mcp/errors.py` -- `ArgumentValidationError` (-> `ValidationError`/422), `GuardrailViolationError` (-> `ForbiddenError`/403). The MCP-layer `domain_code` ClassVar (`"invalid_argument"` / `"guardrail_violated"`) is preserved alongside the new RFC 9457 metadata. **5 version-controllers** -- Budget / Company / EvaluationConfig / Role / Workflow `VersionController` switch from `Response(content=ApiResponse[...](error=...), status_code=...)` to `raise NotFoundError / ValidationError / VersionConflictError`. The `ApiResponse[T]` envelope shape is preserved on the wire; the centralised `handle_domain_error` produces it via MRO dispatch. **Sweep (Sub-task 3)**: - `core/persistence_errors.py` -- `PersistenceError(Exception)` -> `PersistenceError(DomainError)`. **Inner `VersionConflictError` renamed to `PersistenceVersionConflictError`** to dispel the name collision with `core.domain_errors.VersionConflictError`. Custom handlers (`handle_record_not_found`, `handle_persistence_integrity_error`, `handle_duplicate_record`, `handle_persistence_error`) stay registered so persistence-layer 4xx responses keep their fixed public messages instead of leaking record identifiers. - ~16 scattered classes across `a2a/`, `api/auth/`, `api/cursor.py`, `communication/{event_stream, mcp_errors, meeting/agent_caller}`, `engine/quality/decomposers/llm.py`, `engine/strategy/principles.py`, `engine/workflow/{blueprint_errors, service}.py`, `meta/{appliers, errors}.py`, `telemetry/{privacy, reporters}.py`. - 7 internal-sentinel classes carry per-line allowlist markers (`TsaError`, MCP handler-local `_NotFoundError` / `_ConflictError`, statistical Welch errors, dotted-path validators, internal CAS/transaction-rollback sentinels) where stdlib bases are deliberate per the docstrings. ## AST gate - `scripts/check_domain_error_hierarchy.py` (~625 lines): two-pass AST resolver -- pass 1 indexes every `class` definition under `src/synthorg/`, pass 2 computes the closure of `(module, class)` pairs reaching `DomainError`. Every direct stdlib base outside that closure is a violation unless suppressed. - Per-line opt-out: `# lint-allow: domain-error-hierarchy -- <non-empty reason>` (regex-based to survive ruff-format multi-line class headers). - Frozen baseline: `scripts/domain_error_hierarchy_baseline.txt` (header-only on land; `--update-baseline` regenerates). - Pre-push hook in `.pre-commit-config.yaml` (`stages: [pre-push]`). - 27 unit tests in `tests/unit/scripts/test_check_domain_error_hierarchy.py` covering positive cases (DomainError direct subclass, intermediate, deeper MRO chain, allowlist marker, baseline suppress, module-import alias), negative cases (every forbidden stdlib base), baseline format, baseline drift, `--no-baseline`, `--update-baseline`. ## Tests **Pre-existing unit suite**: 26597 passed, 16 skipped. The `pytest tests/ -m unit -n 8` baseline is preserved. **New tests**: - `tests/unit/scripts/test_check_domain_error_hierarchy.py` -- 27 tests for the gate. - `tests/unit/api/test_exception_handlers.py::TestMigratedDomainErrorFamilies` -- 14 parametrized e2e tests covering ClassVar -> HTTP mapping for every migrated HR / Scaling / Trust / Settings / Config / registry family. Catches `status_code` / `error_code` / `error_category` typos that the AST gate cannot. - `tests/unit/api/controllers/test_workflow_versions.py::TestRollback::test_rollback_persistence_version_conflict_translates_to_409` -- pins the `PersistenceVersionConflictError -> VersionConflictError` translation in the workflow rollback controller (the only late-path translation in the repo). - `tests/unit/api/controllers/test_workflow_versions.py::test_diff_same_version_returns_validation_error` -- now asserts the full RFC 9457 envelope (error_code, error_category, retryable), not just the status code. ## Documentation - `CLAUDE.md` -- "Errors" bullet now lists the full 11-base forbidden set and references the new gate + opt-out marker. - `docs/reference/errors.md` -- new "Domain-error-hierarchy gate" section. - `docs/reference/conventions.md` Section 6 -- cross-links the new gate. ## Pre-PR review Ran `/pre-pr-review` with 14 specialized agents. 12 valid findings: 2 CRITICAL + 5 MAJOR + 3 MEDIUM + 2 MINOR/INFO. **All implemented in the second commit on this branch.** - **CRITICAL #1** -- `TicketLimitExceededError` no longer leaks `user_id` via `str(exc)` on the 429 response (the message is logged server-side; the class `default_message` provides the user-safe fallback). - **CRITICAL #2** -- `CLAUDE.md` forbidden-bases list expanded from 5 to 11 entries to match what the gate actually enforces. - **MAJOR #3** -- removed `# Issue #1738.` line from `scripts/check_domain_error_hierarchy.py` and the baseline header (issue refs belong in PR body, not source). - **MAJOR #4** -- `workflow_rollback_service.py` docstrings updated to reference the renamed `PersistenceVersionConflictError`. - **MAJOR #5-7** -- test-coverage strengthening (full RFC 9457 envelope assertions, e2e family-mapping tests, late-path translation regression test). - **MEDIUM #8** -- `ConfigError` + `ConfigValidationError` declare instance attributes in the class body for mypy + IDE visibility. - **MEDIUM #9** -- `ConstraintViolationError.status_code = 400` carries a docstring note explaining why this deliberately diverges from the 2xxx-implies-422 convention (the dedicated `handle_persistence_integrity_error` hardcodes 400). - **MINOR/INFO #11-13** -- `WorkflowDefinitionRevisionMismatchError` docstring notes the inherited 409, `_resolve_base` docstring documents the deeply-chained-attribute edge case. Reviewer agents with **zero findings**: conventions-enforcer, logging-audit, persistence-reviewer, issue-resolution-verifier, api-contract-drift, infra-reviewer. ## Test plan - `uv run python -m pytest tests/ -m unit -n 8` -- 26597 passed, 16 skipped. - `uv run python scripts/check_domain_error_hierarchy.py --no-baseline` -- exit 0. - `uv run mypy src/ tests/` -- 0 issues across 3648 files. - `uv run ruff check src/ tests/ scripts/` -- clean. - `uv run ruff format src/ tests/ scripts/ --check` -- 3684 files already formatted. - `uv run pre-commit run --hook-stage pre-push domain-error-hierarchy --all-files` -- passes. ## Push note Pre-push isolation regression gate (`pytest --count 2 -x -n 8` over the affected suite) reports xdist worker crashes in 8 unrelated test files (auth/postgres_session_store, controllers/{collaboration, department_mutations, meetings, provider_health}, auth/{csrf, lockout_store}, webhook_receipt_cleanup_loop). Manual single-pass green; isolation-gate doubled-run on Python 3.14 + Windows ProactorEventLoop crashes the worker, marking 16 of the new gate tests as collateral damage. Filed #1755 with full details + suggested fixes. Pushed with `--no-verify` per per-push authorization. ## Issue tracker - Closes #1738. - Tick the #1738 box on the master tracking issue #1735 after squash-merge. - See #1755 for the pre-push isolation gate flake (out of scope here).
- exception_handlers: `_clamp_log_value` accepts `float` (gemini #1) - subworkflow_service: `SubworkflowHasParentsError.parent_ids` primitive projection so parent context surfaces in logs (gemini #2) - check_no_controller_response_for_domain_errors: matcher catches qualified `litestar.Response` via `func_name.endswith('.Response')` (coderabbit #3) - workflow_versions: log rollback context (`definition_id`, `target_version`) before raising `VersionConflictError` so the call-site context isn't lost when the centralised handler dispatches (coderabbit #4) - errors.test.ts: 422 fixture uses `REQUEST_VALIDATION_ERROR` instead of `RATE_LIMITED` (coderabbit #5)
- exception_handlers: `_clamp_log_value` accepts `float` (gemini #1) - subworkflow_service: `SubworkflowHasParentsError.parent_ids` primitive projection so parent context surfaces in logs (gemini #2) - check_no_controller_response_for_domain_errors: matcher catches qualified `litestar.Response` via `func_name.endswith('.Response')` (coderabbit #3) - workflow_versions: log rollback context (`definition_id`, `target_version`) before raising `VersionConflictError` so the call-site context isn't lost when the centralised handler dispatches (coderabbit #4) - errors.test.ts: 422 fixture uses `REQUEST_VALIDATION_ERROR` instead of `RATE_LIMITED` (coderabbit #5)
Refs #1850. This PR addresses sub-tasks **D + F + G + H + I** of #1850. Sub-tasks A + E shipped in #1843; the remaining five were orphaned. The umbrella issue stays open until every sub-task is checked off, so this PR uses `Refs #1850` (NOT `Closes`) to avoid re-triggering the auto-close pitfall that closed #1778 prematurely. ## Sub-task D — Remove `openapi-typescript` The 2026-04 audit framing was "migrate to openapi-typescript" but verified state contradicted that intent: the dep was already installed, only 2 files imported the bundle, and 31 hand-written type files coexisted alongside it. Hand-written types win. - Drop `openapi-typescript@7.13.0` from `web/package.json` `devDependencies`. - Drop the `generate-types` / `pretype-check` / `prebuild` npm scripts. - Migrate `web/src/api/types/enums.ts` (16 enum types) to hand-written `(typeof FOO_VALUES)[number]` union literals; runtime arrays become the single source of truth. - Migrate `web/src/api/endpoints/reports.ts` (3 types) to hand-written shapes; values cross-checked against the Pydantic source at `src/synthorg/budget/report_config.py:13` and `src/synthorg/api/controllers/reports.py:35,47`. - Regenerate `web/package-lock.json` (19 transitive deps removed). - Drop the now-unused `web/src/api/types/generated.d.ts` line from `.gitignore`. ## Sub-task F — `compare_ci` helper + callsite migration The audit assumed `compare_ci` already existed; it didn't. - Add `compare_ci(a: str, b: str) -> bool` to `src/synthorg/core/normalization.py` (Unicode-safe via `casefold()`, whitespace-stripping, idempotent). Hypothesis property test pins `compare_ci(a, b) == (normalize_identifier(a) == normalize_identifier(b))`. - Migrate 26 inline `.lower() ==` / `.casefold() ==` / `.lower() !=` callsites: - 23 `compare_ci(...)` migrations (single-string equality + filter-by-name predicates). - 3 `find_by_name_ci(...)` migrations collapsing private `_find_*` helpers in `org_mutations.py` and `hr/registry.py:206`. - 4 bytes-typed ASGI header sites (`csrf.py:179`/`:205`, `etag.py:219`/`:361`) stay inline with a one-line exemption comment; `compare_ci` is `str`-only by design. ## Sub-task G — Collapse Winner+Hybrid processors The two processor classes had near-identical bodies and differed only in whether they accepted `RejectDecision`. - Replace `WinnerSelectProcessor` and `HybridDecisionProcessor` with a single `HumanDecisionProcessor(mode: Literal["winner", "hybrid"])` at `src/synthorg/communication/conflict_resolution/escalation/processors.py`. - Public `mode` property exposes the discriminator for factory introspection without leaking `_mode`. - Constructor performs runtime `mode` validation (defensive; raises `ValueError` on invalid values). - `decided_by` parameter typed as `NotBlankStr` to match the `DecisionProcessor` Protocol. - Factory mapping (`_DECISION_PROCESSOR_FACTORIES`) becomes lambda-keyed to construct the right `mode`. - `escalation/__init__.py` re-exports the new class; `human_strategy.py` default switched. - Tests updated to assert `processor.mode == "winner" / "hybrid"` (stronger than the prior `isinstance` check); error message regex still matches the verbatim 422-translation text. ## Sub-task H — `AuthService` class docstring Replace the 1-line docstring at `src/synthorg/api/auth/service.py:52` with a structured docstring covering: purpose, **async vs sync rationale** (Argon2 ops use `asyncio.to_thread`; refresh-token persist awaits a repo write; everything else is sync), **thread-safety** (frozen `_config` + stateless `argon2`/`jwt`; safe to share across the request-handler pool), and **out-of-scope** (token revocation, session storage, SYSTEM-token minting). ## Sub-task I — Protocols audit (audit-only, no removals) 237 `Protocol` classes inventoried at `docs/reference/protocols-audit.md`. Each entry recorded with module path, line, `@runtime_checkable` status, direct-inheritance count, and test-usage count. Recommendations: - **193 KEEP** (≥1 explicit subclass, plug-in suffix, factory-registered, or `testuse ≥ 3`). - **31 REMOVE** candidates (flag-only — must be re-verified at cleanup time). - **13 REVIEW** (`_PrivatePrefixed` typing seams; intent inspection needed). - **0 MAKE-RUNTIME-CHECKABLE** in this PR (deferred to per-area cleanup PRs after surveying `isinstance` sites). Cleanup itself is intentionally out of scope; follow-up issues will be filed per area after this PR lands so each can hyperlink to a specific row in the merged doc. ## Pre-PR review 13 review agents ran (10 main roster + 3 mini-pass). 12 findings consolidated, 6 implemented in the post-review `refactor: apply pre-PR review feedback` commit: - (#1) added `protocols-audit.md` to the CLAUDE.md "Reference (load on demand)" list. - (#2) constructor mode validation in `HumanDecisionProcessor`. - (#3) tightened `decided_by: str` → `NotBlankStr` in the new processor. - (#4) scrubbed a stale `issue #1687` reference in `web/src/__tests__/pages/new-surfaces.test.tsx`. - (#5) replaced f-string `msg` with `HR_TRAINING_SELECTOR_CONFIG_INVALID` event constant in two source selectors. - (#6) serialised `SettingsChangeDispatcher._running` reset through `_lifecycle_lock` via a scheduled `_reset_running_under_lock` coroutine; tests wait on the new `_post_done_tasks` set. 2 false positives rejected (PEP 758 `except A, B:` syntax flagged by an agent unfamiliar with Python 3.12+; security "redundant whitespace stripping" notes self-classified as Safe). Issue-resolution-verifier confirmed all 5 sub-tasks RESOLVED at 99-100% confidence. ## Test plan - `uv run ruff check src/ tests/` — clean. - `uv run mypy src/ tests/` — `Success: no issues found in 3698 source files`. - `uv run python -m pytest tests/ -m unit` — 23,745 passed (6 pre-existing flakes confirmed on clean main). - Touched-area: 3,279 unit tests pass under sequential and `-n8 --dist=loadfile` schedulers. - `npm --prefix web run lint` + `type-check` + `test` — 3,047 web tests pass. - `npm --prefix web run build` — production build succeeds (864ms). - Final greps verify zero remaining `openapi-typescript` / `WinnerSelectProcessor` / `HybridDecisionProcessor` references and only the 4 documented bytes-typed sites carry the inline `.lower() ==` / `.lower() !=` pattern. ## Local cleanup note After pulling this branch, contributors should delete any stale `web/src/api/types/generated.d.ts` from their worktree — it is no longer produced and is no longer gitignored.
CodeRabbit ASSERTIVE review on f81fdde flagged 38 inline + 5 outside-diff typed-boundary substitutions plus several script / docs nits. Gemini called out a Subscript-handling miss in the gate and a path-injection shape in the ratchet. CodeQL surfaced three py/path-injection alerts. Script + docs - scripts/check_mock_spec.py: `_terminal_callee_name` now recurses through `ast.Subscript`, and `mock_of` joins `_MOCK_FACTORY_NAMES`, so `mock_of[T](return_value=Mock())` is treated as a factory call rather than a typed-boundary violation. `_iter_test_files` and `cmd_scan_paths` resolve `_TESTS_ROOT` and `shared_dir` before comparing parents so symlinks / bind-mounts cannot bypass the `_shared` exclusion. - scripts/check_mock_spec_ratchet.py: dropped the dead `is_relative_to` try/except (Python 3.14 never raises here), restructured the path validation so the allow-list check is inline-visible to CodeQL's data-flow analysis. Kept the PEP-758 `except OSError, UnicodeDecodeError:` form per CLAUDE.md. - docs/reference/conventions.md: bumped the rung-2 example from `create_autospec(T, instance=True)` to `create_autospec(T, instance=True, spec_set=True)` so the doc matches `mock_of[T]` strictness. Test conversions - Replaced `(Magic|Async)Mock(spec=T)` at typed-boundary sites with `mock_of[T](**overrides)` from tests/_shared across 35 test files (unit + integration). Where the override list was set via attribute-bag writes after construction, collapsed into kwargs at construction so the spec_set=True invariant is enforced. - tests/_shared/mock_of.py: `__call__` now returns `Any` and `_MockOfMeta.__getitem__` accepts `Any` instead of `type[T]`. The runtime invariants (autospec + spec_set=True + override-key validation) are unchanged; only the static-type ergonomics shift, so callers configure `.method.return_value` and call `.assert_*()` without per-call cast wrappers and Protocol specs no longer trip mypy strict's `type-abstract` check. - tests/unit/scripts/test_check_mock_spec.py: locked the Subscript-factory recognition and the attribute-chain `x.return_value = Mock()` SKIP path in two new regression tests. - tests/unit/tools/test_process_cleanup.py: pure attribute-bag doubles switched from `MagicMock(spec=Process)` to `SimpleNamespace` (with `cast` at the call site) per the test-doubles ladder; the one test that needs type-level `PropertyMock` descriptor wiring keeps the original mock. - tests/unit/api/test_lifecycle_cleanup_kill_switch.py: stub callables switched to `create_autospec(..., spec_set=True)`. Fixed `_config_get_async` signature to match the real `(namespace, key)` two-positional contract; `create_autospec` caught the earlier one-arg mismatch. Skipped findings (factually wrong against current code) - Gemini #1 (`except (OSError, UnicodeDecodeError):` parens): the CLAUDE.md project rule explicitly says PEP 758 `except A, B:` no parens unless binding, and the project is Python 3.14+. The current syntax is correct under the project interpreter. - tests/unit/api/test_health.py (AppState mock_of conversion): `AppState` uses `__slots__` plus property accessors that `create_autospec(spec_set=True)` does not expose, so the looser `MagicMock(spec=AppState)` is the right rung here. - tests/unit/engine/middleware/test_behavior_tagger.py (AgentMiddlewareContext mock_of conversion): Pydantic `Field()` annotations are invisible to `create_autospec`; spec_set=True rejects `agent_id`/`task_id`/`metadata` overrides at runtime. Reverted with an explanatory comment. - tests/unit/engine/test_workspace_git_worktree.py (`_make_run_git_mock` helper extraction): only two call sites with identical setup and no shared `return_value` to factor out; CodeRabbit's described `return_value=(0, '', '')` form is absent from current code. Verification `uv run ruff check src/ tests/ scripts/` clean. `uv run ruff format src/ tests/ scripts/` clean. `uv run mypy src/ tests/` clean (3714 files). `uv run python -m pytest tests/ -m unit -n 8` -- 28,257 passed, 18 skipped in 81s. No regressions. CodeQL alerts 287 / 288 / 289 still open on `refs/pull/1862/head` because CodeQL has not re-run against the new HEAD yet; the next babysit round will re-evaluate.
#1862) Closes #1855. ## What changes ### Gate quality (sub-task 1) - Rewrite `scripts/check_mock_spec.py` with AST-based detection of bare `Mock` / `AsyncMock` / `MagicMock` at typed boundaries only. Catches: - positional / kwarg argument to a non-Mock-class callable (Pattern A1/A2), - `m: T = Mock()` annotated local where `T` is not Mock-typed (Pattern A3), - `return Mock()` / `yield Mock()` / `yield from Mock()` from a typed-return function (Pattern A4). - Skips: `.return_value` chains, attribute-bag scratch objects, dict / list / set / tuple-literal values, name-binding indirect substitutions (which would require cross-module callee resolution). - Delete `scripts/mock_spec_baseline.txt` (3046 entries). Gate runs zero-tolerance against the live tree. ### Helper + ladder (sub-task 2) - Add `tests/_shared/mock_of.py` with `mock_of[T](**overrides)` factory backed by `create_autospec(T, instance=True, spec_set=True)`. Re-export from `tests/_shared/__init__.py`. - Add `tests/_shared/test_mock_of.py` with 8 unit tests (spec enforcement, override application, unknown-override rejection, Protocol support, async methods, subscript requirement, isinstance compatibility, no state leak between calls). - Document the test-double ladder in `docs/reference/conventions.md` section 12.1 (FakeClock → `create_autospec` / `mock_of` → `SimpleNamespace` → bare `MagicMock` forbidden at typed boundary). Update CLAUDE.md Testing bullet with the pointer. ### Drain (sub-task 3) - Add `scripts/check_mock_spec_ratchet.py` PreToolUse hook plus registrations in `.claude/settings.json` and `.opencode/plugins/synthorg-hooks.ts`. Blocks (a) edits to `tests/*.py` that raise the gate's CATCH count for the touched file, (b) edits to `scripts/check_mock_spec.py` that remove `_Verdict.CATCH` branches. Drives organic drive-by tightening: every edit reduces or holds. - Convert every surviving Pattern A site across `tests/` to `spec=ConcreteType` / `mock_of[T]` / `FakeClock` / `SimpleNamespace`. - Rewrite `tests/unit/scripts/test_check_mock_spec.py` with rule-table coverage. ## Pre-PR review 14 review agents ran (12 main roster + 2 mini-pass). 9 valid findings consolidated, 9 implemented in the follow-up `fix: address pre-PR review feedback` commit. 4 false positives rejected (chiefly PEP 758 `except OSError, UnicodeDecodeError:` flagged by 3 agents unfamiliar with Python 3.14 syntax; empirically verified to parse and catch correctly). Implemented: - **#1 docs-consistency (CRITICAL):** add `check_mock_spec_ratchet.py` to PreToolUse hooks section in `docs/reference/convention-gates.md`. - **#2 docs-consistency (CRITICAL):** backfill `check_error_codes_ts_in_sync.py` and `check_no_controller_response_for_domain_errors.py` in the gate inventory; update `<!--RS:convention_gates-->` macro, `data/runtime_stats.yaml` source, and the CLAUDE.md reference list count. - **#3 pr-test-analyzer (CRITICAL):** add `tests/unit/scripts/test_check_mock_spec_ratchet.py` with 18 unit tests covering envelope parsing (Bash / unknown tool / no file_path / non-.py / out-of-scope / `tests/_shared/` exemption / no-op edit / old_string-not-in-before), ratchet enforcement (hold / raise / reduce / Write tool no-op + raise), gate-edit protection (preserve / remove / add branches), and fail-open path with stderr surfacing. - **#4 pr-test-analyzer (MAJOR):** add `test_yield_from_typed_return_caught` locking the `ast.YieldFrom` branch. - **#5 pr-test-analyzer (MAJOR):** add `test_set_element_skipped` locking the `ast.Set` collection-skip branch. - **#6 silent-failure-hunter (IMPORTANT):** stderr-log the two fail-open `except Exception:` swallows in `_scan_text` and `_check_test_file` so transitional failures surface instead of silently allowing edits. - **#7 tool-parity-checker (MINOR):** list `check_mock_spec_ratchet.py` in the OpenCode plugin header comment. - **#8 + #9 comment-analyzer (SUGGESTION):** docstring on `_load_gate()` plus five AST predicate helpers (`_is_empty_splat`, `_is_literal_none`, `_has_spec_positional`, `_has_spec_keyword`, `_has_dynamic_splat`). Also fixed an unrelated mypy strict failure: three test files imported `VersionRepository` / `FineTuneCheckpointRepository` / `FineTuneRunRepository` from `synthorg.persistence.protocol`, but those names are TYPE_CHECKING-only re-exports there; routed through the source protocol modules (`version_protocol`, `fine_tune_protocol`). Issue-resolution-verifier confirmed all 5 acceptance criteria RESOLVED at 92-100 % confidence. The stretch goal said "file kept as a structural marker" but the branch deletes `scripts/mock_spec_baseline.txt` entirely; the CLAUDE.md text now explicitly documents "zero-tolerance, no baseline" so the intent is unambiguous. ## Test plan - `uv run ruff check src/ tests/ scripts/` — clean. - `uv run mypy src/ tests/` — `Success: no issues found in 3714 source files`. - `uv run python -m pytest tests/ -m unit` — 28,255 passed, 18 skipped (Windows / logfire-extra environmental skips). - New coverage: 50 gate + ratchet unit tests pass (32 existing + 2 added + 16 new ratchet tests). - `uv run python scripts/check_convention_gate_inventory.py` — exit 0 (meta-gate parity preserved). ## Local cleanup note Contributors pulling this branch should delete any stale `scripts/mock_spec_baseline.txt` from their worktree; the gate now runs zero-tolerance and the file is no longer produced.
Bundle of the valid findings surfaced by the /pre-pr-review pass over the two existing bug-fix commits (fix(providers): stop sync from deleting models and fix(dashboard): unblock click-through flows). 4 / 19 valid findings were rejected as agent error (PEP 758 except syntax confused with Python 2, healthcheck-filter mistakenly claimed to drop spans / metrics, module-level pytestmark already applied, _try_stop event parameter is not dead). Fixes landed: - **#1 backend-parity**: SQLite mcp_installation_repo now translates IntegrityError to ConstraintViolationError (matches Postgres); added _classify_sqlite_constraint helper that maps the message text to a stable token (SQLite has no named constraints). Locked under conformance test test_save_raises_constraint_violation_on_unknown_connection that exercises both backends through the shared backend fixture. - **#2 contract-drift**: web/src/api/types/agents.ts AgentConfig gains tier, model_requirement, personality_preset, and strategic_output_mode (last two were pre-existing drift; the first two land in this PR). - **#4 docstring drift**: SyncModelsConfirmDialog docstring now matches the runtime default (replace_existing=false). - **#5 catastrophising language**: trimmed the _SENSITIVE_PROVIDER_FIELDS comment to the factual operator- readability rationale. - **#7 tier type narrowing**: AgentConfig.tier moved from str to Literal so typos fail at config load. - **#8 missing state-transition log**: ConnectionCatalog.update_health now emits HEALTH_STATUS_TRANSITIONED at INFO when the status actually changes (gated so a quiet prober does not flood logs). - **#9 docstring vs inline comment**: ConnectionCatalog .rebind_repository docstring focuses on the contract; the concurrency reasoning moved to an inline comment next to async with self._cache_lock. - **#10 fail-fast guard**: added _assert_sensitive_fields_complete() at module-import time so a future credential-bearing field on ProviderConfig (anything matching *_key / *_token / *_secret / *password*) that is not listed in _SENSITIVE_PROVIDER_FIELDS crashes import instead of leaking on the next audit row. - **#11 trimmed comment**: app.py default-construct block down to the load-bearing why. - **#12 redundant non-null assertions**: build-org-tree.ts findCeo now uses destructuring + truthy check instead of [0]!. - **#13 sanitizeForLog consistency**: budget.ts error args now pass through sanitizeForLog explicitly to match the rest of the store layer; the logger already auto-sanitises but the explicit wrap is project convention. - **#15 trimmed Raises section**: Postgres save() docstring drops the HTTP-status implementation detail. - **#18 useCallback for delete-model handler**: extracted the inline arrow at ConfirmDialog onOpenChange so the reference is stable across renders. Tests added (covering #14): - TestAppStateSimulationRuntime: 5 tests pinning the four branches of has_simulation_runtime + the always-true contract of has_client_simulation_state. - TestRebindRepository: 4 tests for repo swap, cache invalidation, subsequent-read backend selection, and double-rebind safety. - ollama-cloud routing: confirms /api/tags is used (not /models). - TestDiffProviderUpdate: 3 tests for changed-only output, redaction sentinels, and an additional cross-check between ProviderConfig fields and the SENSITIVE list. - test_destructive_sync_guard: 4 tests for the _reject_destructive_empty_discovery branches. - AgentConfig tier / model_requirement round-trip + tier rejection for invalid values. - conformance test for ConstraintViolationError parity across Postgres and SQLite (referenced under #1 above). Pre-reviewed by 21 agents (16 standard + 5 mini-pass) through /pre-pr-review; 19 valid findings raised, 15 fixed in code, 4 rejected as agent error (documented in _audit/pre-pr-review/ triage.md). uv run python -m pytest tests/ -n 8: 29957 passed, 49 skipped. uv run mypy src/ tests/: clean. uv run ruff check src/ tests/: clean. npm --prefix web run lint / type-check / test: clean (3075 tests).
Bundle of the valid findings surfaced by the /pre-pr-review pass over the two existing bug-fix commits (fix(providers): stop sync from deleting models and fix(dashboard): unblock click-through flows). 4 / 19 valid findings were rejected as agent error (PEP 758 except syntax confused with Python 2, healthcheck-filter mistakenly claimed to drop spans / metrics, module-level pytestmark already applied, _try_stop event parameter is not dead). Fixes landed: - **#1 backend-parity**: SQLite mcp_installation_repo now translates IntegrityError to ConstraintViolationError (matches Postgres); added _classify_sqlite_constraint helper that maps the message text to a stable token (SQLite has no named constraints). Locked under conformance test test_save_raises_constraint_violation_on_unknown_connection that exercises both backends through the shared backend fixture. - **#2 contract-drift**: web/src/api/types/agents.ts AgentConfig gains tier, model_requirement, personality_preset, and strategic_output_mode (last two were pre-existing drift; the first two land in this PR). - **#4 docstring drift**: SyncModelsConfirmDialog docstring now matches the runtime default (replace_existing=false). - **#5 catastrophising language**: trimmed the _SENSITIVE_PROVIDER_FIELDS comment to the factual operator- readability rationale. - **#7 tier type narrowing**: AgentConfig.tier moved from str to Literal so typos fail at config load. - **#8 missing state-transition log**: ConnectionCatalog.update_health now emits HEALTH_STATUS_TRANSITIONED at INFO when the status actually changes (gated so a quiet prober does not flood logs). - **#9 docstring vs inline comment**: ConnectionCatalog .rebind_repository docstring focuses on the contract; the concurrency reasoning moved to an inline comment next to async with self._cache_lock. - **#10 fail-fast guard**: added _assert_sensitive_fields_complete() at module-import time so a future credential-bearing field on ProviderConfig (anything matching *_key / *_token / *_secret / *password*) that is not listed in _SENSITIVE_PROVIDER_FIELDS crashes import instead of leaking on the next audit row. - **#11 trimmed comment**: app.py default-construct block down to the load-bearing why. - **#12 redundant non-null assertions**: build-org-tree.ts findCeo now uses destructuring + truthy check instead of [0]!. - **#13 sanitizeForLog consistency**: budget.ts error args now pass through sanitizeForLog explicitly to match the rest of the store layer; the logger already auto-sanitises but the explicit wrap is project convention. - **#15 trimmed Raises section**: Postgres save() docstring drops the HTTP-status implementation detail. - **#18 useCallback for delete-model handler**: extracted the inline arrow at ConfirmDialog onOpenChange so the reference is stable across renders. Tests added (covering #14): - TestAppStateSimulationRuntime: 5 tests pinning the four branches of has_simulation_runtime + the always-true contract of has_client_simulation_state. - TestRebindRepository: 4 tests for repo swap, cache invalidation, subsequent-read backend selection, and double-rebind safety. - ollama-cloud routing: confirms /api/tags is used (not /models). - TestDiffProviderUpdate: 3 tests for changed-only output, redaction sentinels, and an additional cross-check between ProviderConfig fields and the SENSITIVE list. - test_destructive_sync_guard: 4 tests for the _reject_destructive_empty_discovery branches. - AgentConfig tier / model_requirement round-trip + tier rejection for invalid values. - conformance test for ConstraintViolationError parity across Postgres and SQLite (referenced under #1 above). Pre-reviewed by 21 agents (16 standard + 5 mini-pass) through /pre-pr-review; 19 valid findings raised, 15 fixed in code, 4 rejected as agent error (documented in _audit/pre-pr-review/ triage.md). uv run python -m pytest tests/ -n 8: 29957 passed, 49 skipped. uv run mypy src/ tests/: clean. uv run ruff check src/ tests/: clean. npm --prefix web run lint / type-check / test: clean (3075 tests).
…1 reviewer-config) - Forward cancellation token in _mine_negatives_from_pairs so the encoding phase of hard-negative mining is interruptible (CodeRabbit + Gemini, fine_tune.py:462). - Restructure embedding-evaluation.md pipeline stages: split the old Stage 3 evaluation bullet into a dedicated Stage 4, renumber Deploy to Stage 5 so the doc matches the fine_tune.py module header (5-stage pipeline) and removes stage-ownership ambiguity (CodeRabbit). - Add .gemini/styleguide.md + .gemini/config.yaml so Gemini Code Assist stops flagging PEP 649 TYPE_CHECKING annotations as NameError risks on every PR, and reads project conventions from CLAUDE.md. Skipped (factually wrong against current code): - Gemini #1/#2/#3 (NameError on CancellationToken / EvalMetrics in TYPE_CHECKING block): Python 3.14 PEP 649 evaluates annotations lazily; existing unquoted CancellationToken | None at fine_tune.py:568 has been in CI without issue. CLAUDE.md mandates this convention. - CodeRabbit RS-marker findings on docs/design/memory.md and docs/reference/embedding-evaluation.md: out of scope for scripts/check_doc_numeric_macros.py, which scopes to README + 4 docs (neither file present) and matches digits only adjacent to specific stat nouns (tests/providers/agents/stars/releases).
…1 reviewer-config) - Forward cancellation token in _mine_negatives_from_pairs so the encoding phase of hard-negative mining is interruptible (CodeRabbit + Gemini, fine_tune.py:462). - Restructure embedding-evaluation.md pipeline stages: split the old Stage 3 evaluation bullet into a dedicated Stage 4, renumber Deploy to Stage 5 so the doc matches the fine_tune.py module header (5-stage pipeline) and removes stage-ownership ambiguity (CodeRabbit). - Add .gemini/styleguide.md + .gemini/config.yaml so Gemini Code Assist stops flagging PEP 649 TYPE_CHECKING annotations as NameError risks on every PR, and reads project conventions from CLAUDE.md. Skipped (factually wrong against current code): - Gemini #1/#2/#3 (NameError on CancellationToken / EvalMetrics in TYPE_CHECKING block): Python 3.14 PEP 649 evaluates annotations lazily; existing unquoted CancellationToken | None at fine_tune.py:568 has been in CI without issue. CLAUDE.md mandates this convention. - CodeRabbit RS-marker findings on docs/design/memory.md and docs/reference/embedding-evaluation.md: out of scope for scripts/check_doc_numeric_macros.py, which scopes to README + 4 docs (neither file present) and matches digits only adjacent to specific stat nouns (tests/providers/agents/stars/releases).
Closes #1916. ## What changed WP-1 is the prerequisite work-package for WP-2/3/4/6: persistence-layer discipline, the four critical restart-safety gaps, and ADR-0001 (repository protocol consolidation by query pattern). ### Critical restart-safety persistence - **Ceremony scheduler state** (`engine/workflow/ceremony_scheduler.py`): hydrates `_completion_counters` / `_fired_once_triggers` / `_total_completions` / `_velocity_history` on sprint activation; atomic save on every mutation. - **Meeting cooldown timestamps** (`communication/meeting/scheduler.py`): durable per-meeting-type cooldowns with hydration on `start()`. - **Tracked Docker containers** (`tools/sandbox/`): every `_tracked_containers` mutation in `DockerSandbox` now mirrors to `TrackedContainerRepository` (when wired); orphan reconciliation via `synthorg.managed` label. - **InMemoryBackend lock parity**: `_store_lock` now covers `delete` / `count` / `clear`, matching the existing `store` / `retrieve` discipline. ### Lifecycle correctness - `JetStreamTaskQueue`: dedicated `_lifecycle_lock` separate from hot-path lock (prevents duplicate NATS consumer spawn under concurrent `start()`). - `InMemoryMessageBus`: split lifecycle and hot-path locks (no longer serialises normal traffic against lifecycle ops). - `JetStreamMessageBus`: shutdown drain aligned with `docs/reference/lifecycle-sync.md` (holds state lock through drain deadline; raises `BusStopTimeoutError` on timeout and marks state unrestartable). - Webhook retry endpoint (`api/controllers/webhooks.py`): wrapped in `IdempotencyService.run_idempotent()` so operator double-click no longer double-publishes. ### Layer discipline (REWORK #5) - Six persistence files migrated from `api.dto_*` to `providers.management.capability_dtos`. - New pre-push gate `scripts/check_no_api_dto_in_persistence_or_service.py` forbids the layer violation going forward; wired in `.pre-commit-config.yaml` at `[pre-commit, pre-push]`. - Raw DDL pulled out of `tests/unit/memory/org/conftest.py` (migrations are the only schema source). ### REWORK #1: repository protocol consolidation (ADR-0001) - New `src/synthorg/persistence/_generics.py`: six generic categories — `SingletonRepository`, `IdKeyedRepository`, `FilteredQueryRepository`, `AppendOnlyRepository`, `StatefulRepository`, `MVCCRepository`. All `@runtime_checkable`. Canonical `DEFAULT_PAGE_SIZE: Final[int] = 100`. - ADR-0001 (`docs/decisions/0001-repository-protocol-consolidation.md`) captures composition rules, the bespoke-method policy (D7), and the per-entity inventory. - Phase 1 + 2 migrations: ~40 concrete protocols across `Task`, `Message`, `CostRecord`, `Approval`, `Project`, `User`, `Audit`, `WorkflowDefinition`, `Checkpoint`, `OrgFact`, `FineTuneRun`, `Settings`, `ProviderAudit`, `PresetOverride`, `Preset`, `AgentState`, `Artifact`, `Connection`, `CustomRule`, `McpInstallation`, `Subworkflow`, `TrainingPlan`, `TrainingResult`, `Session`, `OntologyEntity`, `OntologyDriftReport`, `WorkflowExecution`, `PersonalityPreset`, `Version`, `ParkedContext`, `PrincipleOverride`, `RiskOverride`, `CircuitBreakerState`, `Decision`, `SsrfViolation`, `LockoutRepository`, `RefreshToken`, `ConnectionSecret`, `FineTuneCheckpoint` compose generics. - Three new repos (`CeremonySchedulerState`, `MeetingCooldown`, `TrackedContainer`) each compose `IdKeyedRepository[T, NotBlankStr]` plus a D7 `load_all` for cold-start hydration. Listed as rows 46-48 in the ADR inventory. ### Wave-2 pre-PR triage (this commit on top of the WP-1 base) - DockerSandbox push-persistence to `TrackedContainerRepository` (helper-method pattern; optional repo so tests stay zero-cost). - Restart-safety test moved to `tests/integration/persistence/` and rewritten to do a real `open → write → disconnect → reopen → read` cycle (the previous version re-used the same backend instance). - `_generics.py` invariant docstrings: frozen `T` requirement, ordering semantics divergence between `FilteredQuery` (concrete-documented) and `AppendOnly` (mandated newest-first), `MVCCRepository.get_operation_log` append-order rule, `StatefulRepository.transition_if` TypedDict guidance, `SingletonRepository` single-row invariant, `DEFAULT_PAGE_SIZE` justification consolidated. - HeartbeatRepository D7 bespoke rationale documented. - `persistence/__init__.py` exports the three new repos + record types. - SEC-1 redaction on three sqlite `_rollback_quietly` handlers (real exception via `safe_error_description`, not bare `"rollback failed"`). - `integrations/connections/catalog.py:124` hardcoded `list_items(limit=10_000)` carries a justifying comment + `lint-allow`. ## Test plan - `uv run python -m pytest tests/ -m unit -n 8`: **28 985 passed, 18 skipped** (Windows + logfire-extra-not-installed skips only). - `uv run mypy src/ tests/`: clean across 3 834 source files. - `uv run ruff check src/ tests/` + `ruff format --check`: clean. - `tests/integration/persistence/test_wp1_restart_safety.py` (SQLite arm, Postgres arm gated on Docker availability): 4 passed. - All pre-commit / pre-push gates green on push (`dual-backend test parity`, `typed-boundary`, `cost-recording chokepoint`, `convention-gate inventory`, `no reviewer citations`, `no migration framing`, `forbid synthorg.api.dto_* imports`). ## Review coverage Two waves of pre-PR review: - **Wave 1** (earlier commit `7ae984b5e`): 10 agents (docs-consistency, code-reviewer, python-reviewer, type-design-analyzer, persistence-reviewer, security-reviewer, async-concurrency-reviewer, api-contract-drift, frontend-reviewer, issue-resolution-verifier). All M/Md/Mn findings addressed in-PR. - **Wave 2** (this commit): 16 agents (added comment-quality-rot, logging-audit, resilience-audit, conventions-enforcer, infra-reviewer, test-quality-reviewer, silent-failure-hunter). 17 valid findings; ~15 PEP 758 / NOT_RESOLVED false-positives discarded after AST + commit-history verification. All 16 valid items in scope addressed in this commit; one item (`scripts/schema_drift_baseline.txt` placeholder text) is hookify-blocked and listed below for follow-up. ## Known follow-up (hookify-blocked from this session) `scripts/schema_drift_baseline.txt` lines 30 / 86 / 144 carry the literal placeholder `auto-generated; replace with audit-cited justification before commit`. Functionally inert (the gate uses the `column:table:type:type` signature; the description is human prose only), but it should read like the other ~140 entries. The fix is to replace the placeholder with `SQLite has no TIMESTAMPTZ; column stores TEXT carrying ISO-8601 with explicit +00:00/Z suffix, normalised to UTC at write time` for each of the three rows. The local edit is blocked by `scripts/check_no_edit_baseline.sh`, which requires an out-of-session step.
…abbit) CodeRabbit #1: resume-vs-review routing probed a live parked-context backend (non-deterministic on outage). ApprovalItem gains a persisted ApprovalSource discriminator (default REVIEW_GATE; SecOps escalation + request_human_approval set PARKED_CONTEXT). Routing keys off the persisted source; the has_parked_context probe stays only as a logged fallback when the just-decided row cannot be re-read. Schema column + sqlite/postgres revisions (converging the two existing heads) + both repos + regenerated web DTO types. test_approvals_helpers reworked for deterministic source routing.
Pre-push affected-pytest caught two gaps in the #1 change: (1) sqlite save_many param_rows tuple missing item.source.value (16 placeholders vs 15 bindings -- the save() path was fixed but save_many's deeper-indented tuple was not); (2) the test-local _CREATE_TABLE approvals fixture in tests/unit/meta/test_approval_repo.py lacked the new source column. Both fixed; all 245 approval-related tests pass.
…abbit) CodeRabbit #1: resume-vs-review routing probed a live parked-context backend (non-deterministic on outage). ApprovalItem gains a persisted ApprovalSource discriminator (default REVIEW_GATE; SecOps escalation + request_human_approval set PARKED_CONTEXT). Routing keys off the persisted source; the has_parked_context probe stays only as a logged fallback when the just-decided row cannot be re-read. Schema column + sqlite/postgres revisions (converging the two existing heads) + both repos + regenerated web DTO types. test_approvals_helpers reworked for deterministic source routing.
Pre-push affected-pytest caught two gaps in the #1 change: (1) sqlite save_many param_rows tuple missing item.source.value (16 placeholders vs 15 bindings -- the save() path was fixed but save_many's deeper-indented tuple was not); (2) the test-local _CREATE_TABLE approvals fixture in tests/unit/meta/test_approval_repo.py lacked the new source column. Both fixed; all 245 approval-related tests pass.
Round 2 covers CodeRabbit's re-review of round 1's commit. Comment / docstring hygiene (CR-#1, #2, #5): - app.py: rewrite the SQLite+persistent-ApprovalStore hard-block comment without migration framing; describe the invariant the guard enforces, not the migration's history. - approval_store.py: trim ``has_persistent_repo`` docstring to current-state behaviour only. - chief_of_staff/config.py: replace real vendor names (OpenAI/Claude/GPT-4o/LiteLLM) in propose-* constant comments with project-approved generic wording. App startup ordering (CR-#8 duplicate): - app.py: move conversational_proposal_repo wiring above the ``provider_registry is None: return`` early-return so the repo attaches even on boots without an LLM provider (preserves the approve/reject path for already-parked conversational approvals, which was the stated rationale for the round-1 decoupling). Conversational-intake resume (CR-#3 + CR-#4): - _approval_review_gate.py: split try_conversational_intake_resume (187 lines, well over the <50 line budget) into orchestrator + three branch-specific helpers: * _load_conversational_proposal: source check + repo + lookup. * _reject_conversational_proposal: PENDING -> REJECTED CAS. * _execute_conversational_proposal: PENDING -> EXECUTING -> EXECUTED CAS with revert-on-failure. The orchestrator is now 43 lines and reads as a flow table. - _approval_review_gate.py: ``_reread_approval_item`` returning ``None`` for a conversational approval now raises ServiceUnavailableError. Returning False would route the call through to non-conversational flows and silently strand the decided approval. Docstring updated to spell out the flow-specific semantics. Cross-tenant privacy (CR-#7, Critical): - propose.py: ``_resolve_conversation`` now rejects conversation-id reuse across users. A caller who learns another conversation_id could otherwise append turns and have prior history fed back to the model -- a cross-tenant break. Mapped to NotFound so the endpoint cannot be used to probe existence either. TOCTOU on sequence allocation (CR-#6, Heavy lift): - conversation_repo.py (SQLite + Postgres): bounded retry-on-IntegrityError in ``append`` when the (conversation_id, sequence) UNIQUE constraint fires. The repo re-queries the live max sequence, model_copies the turn, and retries up to 3 times. Other constraint violations (FK miss, CHECK on content/role) still translate directly to ConstraintViolationError without retry. Tests: - test_propose.py: new test_other_user_resume_blocked. - test_conversational_intake_resume.py: new test_unreadable_approval_item_raises (covers the new raise-on-None path). - test_conversation_repository.py: renamed test_append_duplicate_sequence_rejected -> test_append_duplicate_sequence_resequenced; asserts the new race-safe semantics (second append lands at sequence + 1).
Summary
Research Included
Analysis of MetaGPT, ChatDev, CrewAI, AutoGen, LangGraph, LiteLLM, Mem0, MCP, A2A Protocol, and event-driven multi-agent architecture patterns. Recommendation: build from scratch leveraging existing libraries (LiteLLM, Mem0, FastAPI, MCP).
Test plan