Describes the architecture for evolving Colloquip from a single-thread deliberation engine into a Reddit-like agent social platform with:
- Subreddits (communities defining agent types)
- Persistent agent identities across sessions
- Agent memory/learning system (post-deliberation extraction + recall)
- Cross-subreddit membership
- New API surface for communities, agents, and memories
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
Three major additions to the social platform plan:
- Agent pool/registry: agents selected from existing pool, new ones created only when no matching expertise exists
- Mandatory red team: every subreddit always has at least one topic-specific red team agent (cannot be removed)
- Literature search tools: PubMed, company docs, and web search via Anthropic's native tool-use API, configured per subreddit
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
Comprehensive plan to incrementally transform the working deliberation engine into the full social platform described in the Phase 1-2 and Phase 3-5 specs, without a ground-up rewrite.
Key decisions:
- Keep SQLite + SQLAlchemy (PostgreSQL deferred to Phase 3 for pgvector)
- Keep src/colloquip/ structure (not backend/app/)
- 10 curated personas: 8 from spec + protein engineering + synthetic biology
- 7 sprints: models → registry → tools → synthesis → prompts → API → integration
- All 181 existing tests must pass at every step
- Phase 3+ hooks designed now, built later
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
Evolve the Colloquip deliberation system toward a Reddit-like social platform for AI agents, implementing Phases 1-2 of the implementation spec.
Core additions:
- 10 curated YAML agent personas (molecular biology, medicinal chemistry, ADMET, clinical, regulatory, computational biology, protein engineering, synthetic biology, 2 red team agents) with weighted evaluation criteria and phase-specific mandates
- Agent registry with expertise-based recruitment scoring, find-or-create pattern, and mandatory red team enforcement per subreddit
- Tool system: PubMed (NCBI E-utilities), company docs (local search), web search (Semantic Scholar), citation verifier, all with mock implementations for testing
- 4 structured output templates (Assessment, Review, Analysis, Ideation) with named sections and metadata fields
- Template-driven synthesis generator with audit chains linking claims to posts and citations
- Per-thread cost tracking with budget enforcement
- Prompt builder v3: layered assembly (persona -> subreddit context -> role -> phase mandate -> citation/tool instructions)
- Platform manager orchestrating subreddit creation and agent recruitment
- REST API: subreddit CRUD, agent listing, thread creation, cost endpoints
- Extended DB schema: subreddits, agent identities, memberships, synthesis, cost records
All 188 existing tests pass unchanged. 111 new tests added (299 total).
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
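To make the recruitment rule concrete, here is a minimal sketch of expertise-based scoring combined with the mandatory red team slot. It is not the registry's actual code: the `Agent` fields, the tag-overlap scoring, and the `recruit` signature are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    # Hypothetical stand-in for the persisted agent identity.
    agent_type: str
    expertise_tags: set[str] = field(default_factory=set)
    is_red_team: bool = False


def expertise_score(agent: Agent, required_tags: set[str]) -> float:
    """Fraction of the required tags that the agent covers (0.0 to 1.0)."""
    if not required_tags:
        return 0.0
    return len(agent.expertise_tags & required_tags) / len(required_tags)


def recruit(pool: list[Agent], required_tags: set[str], max_agents: int) -> list[Agent]:
    """Pick the best-matching experts while always reserving one red team slot."""
    if max_agents < 1:
        raise ValueError("max_agents must be at least 1")
    red_team = [a for a in pool if a.is_red_team]
    if not red_team:
        raise ValueError("pool has no red team agent to satisfy the mandate")
    # The red team agent closest to the topic takes the reserved slot.
    chosen_red = max(red_team, key=lambda a: expertise_score(a, required_tags))
    experts = [
        a for a in pool
        if not a.is_red_team and expertise_score(a, required_tags) > 0
    ]
    experts.sort(key=lambda a: expertise_score(a, required_tags), reverse=True)
    return experts[: max_agents - 1] + [chosen_red]
```

With a pool holding a medicinal chemist, a clinician, and one red team agent, `recruit(pool, {"chemistry", "admet"}, max_agents=2)` would return the chemist plus the red team agent, however the experts score.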
Round 1 (Critical/High):
- synthesis: filter stop words in audit chain matching, raise overlap threshold from 0.2 to 0.3 to prevent false claim-post links
- synthesis: fix _parse_metadata to only match lines starting with field names, preventing false matches inside section prose
- tools/registry: handle None tool_configs gracefully
- persona_loader: validate non-empty expertise_tags and domain_keywords
- pubmed: pass email to efetch requests (NCBI API compliance)
- citation_verifier: use actual error from _verify_pmid in flagged detail
- cost_tracker: remove unused uuid4 import
- synthesis: remove unused StructuredCitation import
- platform_manager: simplify thread storage with setdefault
Round 2 (High):
- prompts: fix tool_descriptions to accept Union[str, List[str]], join list items for proper formatting in prompt
- synthesis: rewrite _parse_synthesis_sections to use exact heading matches (longest-first sort) preventing partial name collisions
- platform_routes: validate _initialized flag in _get_platform helper
- platform_routes: add UUID format validation on thread cost endpoint
Round 3 (High):
- registry: add max_agents parameter to recruit_for_subreddit, reserve slot for red team when recruiting optional expertise
- platform_manager: wire CostTracker into get_thread_costs instead of returning placeholder zeros
Round 4 (Medium/Low):
- synthesis: replace chr(10) with readable string formatting
- synthesis: move uuid import to module level, add type hint
- synthesis: add fallback validation for empty raw_text
- output_templates: add descriptive ValueError for missing template
- prompts: use proper Union type annotation instead of string literal
- web_search: simplify redundant fallback on externalIds
- pubmed: add null-safety on XML element .text access
- company_docs: log when file content is truncated for search
- platform_routes: add TYPE_CHECKING imports and type hints on helpers
- registry: downgrade duplicate agent_type log from warning to debug
- persona_loader: wrap persona_to_agent_identity in try/except for descriptive KeyError messages
All 299 tests pass.
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
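The Round 1 audit-chain fix (stop-word filtering plus a 0.3 overlap threshold) can be sketched roughly as follows. The stop-word list, the tokenizer, and the function names are illustrative assumptions; only the 0.3 threshold comes from the commit message.

```python
import re

# Illustrative stop-word list; the project's actual list is not shown in this PR.
STOP_WORDS = frozenset(
    {"the", "a", "an", "of", "to", "in", "is", "and", "that", "for", "with"}
)


def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def claim_overlap(claim: str, post: str) -> float:
    """Fraction of the claim's content words that also occur in the post."""
    claim_words = content_words(claim)
    if not claim_words:
        return 0.0
    return len(claim_words & content_words(post)) / len(claim_words)


def is_linked(claim: str, post: str, overlap_threshold: float = 0.3) -> bool:
    """Link a claim to a post only when the overlap reaches the threshold."""
    return claim_overlap(claim, post) >= overlap_threshold
```

Without the stop-word filter, two unrelated sentences that merely share words like "the" and "is" could clear a low threshold, which is the false-link case the fix targets.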
Bug fix:
- Fix enum string comparison in synthesis audit chains (post.stance.value == "critical" → post.stance == AgentStance.CRITICAL)
Principle #2 (testable without LLM):
- Extract parse_synthesis() standalone function from SynthesisGenerator
- Parsing logic now testable directly without LLM calls
Principle #3 (interfaces first):
- Replace AgentTool Protocol with ABC BaseSearchTool
- tool_schema and execute() are now decorated with @abstractmethod
- Keep AgentTool as backward-compatible alias
Principle #4 (configuration-driven):
- Add ScoringWeights dataclass for configurable expertise matching
- Make audit chain params configurable (max_chains, overlap_threshold, min_claim_words)
Principle #5 (minimal dependencies):
- Mock tool classes now inherit from real classes (PubMedTool, WebSearchTool, CompanyDocsTool), eliminating duplicated tool_schema properties
- Convert VerificationReport to Pydantic BaseModel for consistency
Simplification:
- Extract _subreddit_common() helper to DRY response builders
14 new tests (313 total, all passing)
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
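As an illustration of the Protocol-to-ABC change, the sketch below shows the general shape: an abstract base with `tool_schema` and `execute()`, a mock that inherits the real tool's schema, and the alias. The method signatures, the schema contents, and the synchronous `execute()` are assumptions; the real tools may well be async and take different arguments.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseSearchTool(ABC):
    """Abstract interface; subclasses must provide both members."""

    @property
    @abstractmethod
    def tool_schema(self) -> dict[str, Any]:
        """Tool definition passed to the LLM's tool-use API."""

    @abstractmethod
    def execute(self, **kwargs: Any) -> dict[str, Any]:
        """Run the search and return a result payload."""


class PubMedTool(BaseSearchTool):
    @property
    def tool_schema(self) -> dict[str, Any]:
        # Hypothetical schema, for illustration only.
        return {"name": "pubmed_search", "input_schema": {"type": "object"}}

    def execute(self, **kwargs: Any) -> dict[str, Any]:
        raise NotImplementedError("network call omitted in this sketch")


class MockPubMedTool(PubMedTool):
    """Inherits tool_schema from the real tool; only execute() is replaced."""

    def execute(self, **kwargs: Any) -> dict[str, Any]:
        return {"query": kwargs.get("query", ""), "results": []}


# Backward-compatible alias for code that still imports the old name.
AgentTool = BaseSearchTool
```

Because the mock subclasses the real tool, the schema is defined once, and a subclass that forgets `execute()` fails at instantiation rather than at call time.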
Comprehensive plan covering 13 feature sprints and 5 infrastructure sprints:
- Phase 3: Institutional memory (embedding interface, synthesis RAG, pgvector, human corrections)
- Phase 4: Event-driven triggers (watchers, triage agent, notifications, auto-deliberation)
- Phase 5: Cross-subreddit references, outcome tracking, agent calibration, export/external API
- Deployment: Docker multi-stage builds, docker-compose (Postgres+Redis), CI/CD pipelines, Alembic migrations, production config, Prometheus monitoring
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
Add embedding infrastructure (mock + OpenAI providers), in-memory
vector store with cosine similarity search, synthesis-to-memory
extraction pipeline, RAG prompt integration, human memory corrections
via annotations, and API routes for memory management.
Key components:
- EmbeddingProvider ABC with MockEmbeddingProvider and OpenAIEmbeddingProvider
- MemoryStore ABC with InMemoryStore (brute-force cosine search)
- SynthesisMemoryExtractor: pure text parsing, no LLM calls
- MemoryRetriever: arena-scoped + cross-subreddit retrieval
- Memory annotations: outdated, correction, confirmed, context types
- API routes: /api/memories, /api/memories/{id}/annotate
- DB tables: synthesis_memories, memory_annotations
- Phase 3b typed memory models (reserved for future use)
Review fixes applied:
- UUID validation in API routes with proper 400 responses
- list_all() used instead of private _memories access
- ValueError (not KeyError) for missing memories in annotate()
- IndexError safety check in annotation response
- Missing template_type update in repository save
- OpenAI API error handling with logging
104 new tests (313 → 417), all passing.
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
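A brute-force cosine search like the one named for InMemoryStore can be sketched in a few lines. The class below is a simplified stand-in: it stores bare embeddings keyed by ID, whereas the real store presumably holds full memory objects, and the `add`/`search` signatures are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryStore:
    """Simplified sketch: maps memory IDs to embeddings and scans all of them."""

    def __init__(self) -> None:
        self._memories: dict[str, list[float]] = {}

    def add(self, memory_id: str, embedding: list[float]) -> None:
        self._memories[memory_id] = embedding

    def list_all(self) -> list[str]:
        # Public accessor, so callers need not touch the private dict.
        return list(self._memories)

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        """Score every stored embedding against the query and keep the top_k."""
        scored = [
            (memory_id, cosine_similarity(query, embedding))
            for memory_id, embedding in self._memories.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

The linear scan costs O(n) per query, which is fine for tests and small arenas; the plan defers indexed search to pgvector.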
Add watcher infrastructure for monitoring external sources, triage agent for evaluating event relevance, notification system, and auto-deliberation policy with earned automation.
Key components:
- BaseWatcher ABC with WatcherRegistry for managing watcher instances
- LiteratureWatcher: PubMed monitoring with PMID deduplication
- ScheduledWatcher: time-based triggers with interval/day/hour constraints
- WebhookWatcher: external event ingestion via HTTP POST
- WatcherManager: async polling loop with error isolation per watcher
- MockTriageAgent: keyword heuristic triage (novelty/relevance/urgency)
- InMemoryNotificationStore: notification CRUD with status tracking
- AutoDeliberationPolicy: earned automation (20+ events, >70% useful, human approval, rate limits, budget sharing)
- API routes: watchers CRUD, notifications, webhook endpoint
- DB tables: watchers, watcher_events, notifications
- Models: WatcherType, WatcherEvent, TriageDecision, Notification
81 new tests (417 → 498), all passing.
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
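The "earned automation" gate can be read as three checks applied in order. The sketch below covers only the event-count, usefulness, and human-approval conditions named above; the record fields and function name are hypothetical, and the rate-limit and budget-sharing parts are omitted.

```python
from dataclasses import dataclass


@dataclass
class WatcherTrackRecord:
    # Hypothetical summary of a watcher's triage history.
    events_triaged: int
    events_marked_useful: int
    human_approved: bool


def may_auto_deliberate(
    record: WatcherTrackRecord,
    min_events: int = 20,
    min_useful_ratio: float = 0.70,
) -> bool:
    """Allow automatic deliberation only once a watcher has earned it."""
    if not record.human_approved:
        return False
    if record.events_triaged < min_events:
        return False
    useful_ratio = record.events_marked_useful / record.events_triaged
    # Strictly greater than the threshold, matching ">70% useful".
    return useful_ratio > min_useful_ratio
```

A watcher with 25 triaged events of which 20 were useful, and with human approval, would pass; the same watcher without approval, or with only 10 events, would not.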
- Fix list_subreddit_watchers: remove 'or True' filter bug, use subreddit_name stored in watcher config for proper filtering
- Fix create_watcher: inject pubmed_tool from app state for LiteratureWatcher creation so watchers are functional
- Store subreddit_name in watcher config dict for lookup
- Add source_metadata column to DBWatcherEvent to prevent data loss
- Add defensive None check for PubMed tool result
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
…ts 18-20)
- Cross-subreddit reference detection with triple criteria (similarity + shared entities + actionability)
- Entity extraction (PMIDs, gene names, compound IDs) for cross-reference matching
- Deliberation differ for comparing syntheses over time
- Outcome tracking system for real-world result recording
- Agent calibration with accuracy, domain-specific metrics, and bias detection
- External API with API key authentication for programmatic access
- Export system (Markdown and JSON formats)
- DB tables for cross-references and outcome reports
- 41 new tests (539 total), all passing
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
- Register all Phase 3-5 route modules (memory, watcher, export, external, feedback) in FastAPI app; previously all these endpoints returned 404
- Fix self-referential comparison in calibration bias detection: compare domain accuracy against overall accuracy instead of itself
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
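The calibration fix amounts to comparing each domain's accuracy with the agent's overall accuracy rather than with the domain value itself, which can never differ. A rough sketch follows; the function name, the tolerance value, and the flag labels are assumptions, not the project's API.

```python
def detect_domain_bias(
    domain_accuracy: dict[str, float],
    overall_accuracy: float,
    tolerance: float = 0.15,  # hypothetical cutoff for flagging a domain
) -> dict[str, str]:
    """Flag domains whose accuracy deviates from the agent's overall accuracy."""
    flags: dict[str, str] = {}
    for domain, accuracy in domain_accuracy.items():
        # Compare against the overall figure. Comparing `accuracy` with itself
        # (the original bug) would give a delta of zero for every domain.
        delta = accuracy - overall_accuracy
        if delta > tolerance:
            flags[domain] = "overperforms"
        elif delta < -tolerance:
            flags[domain] = "underperforms"
    return flags
```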
D1 - Containerization:
- Multi-stage Dockerfile (Python deps, Node frontend, production image)
- docker-compose.yml with app, PostgreSQL 16 + pgvector, Redis
- docker-compose.dev.yml with hot-reload and dev tools
- .dockerignore for clean builds
D2 - CI/CD Pipeline:
- GitHub Actions: ci.yml (lint + test + build), deploy.yml (GHCR push)
- db-migration.yml for PR migration validation
- Python 3.11/3.12 matrix, codecov integration
D3 - Production Configuration:
- Settings module with env var validation (database, LLM, embedding, memory, watchers, deployment)
- Structured JSON logging for production, text for development
- Request ID tracking, sensitive field redaction
- Production and staging YAML config overrides
- .env.example with documented variables
D4 - Database Migrations:
- Alembic setup with env.py and migration template
- 4 migration files: baseline, Phase 3 memory, Phase 4 watchers, Phase 5 cross-refs
- All migrations reversible with downgrade()
D5 - Monitoring:
- Prometheus metrics: deliberations, cost, memory retrieval, watchers, LLM usage
- /api/metrics endpoint
- docker-compose.monitoring.yml with Prometheus + Grafana
- Pre-built Grafana dashboard
26 new tests (565 total), all passing.
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
- Fix 138 ruff lint errors: remove unused imports (F401), fix line length (E501), rename ambiguous variables (E741), prefix unused vars (F841)
- Auto-format all 83 files with ruff format
- Configure ruff rules in pyproject.toml: select E/F/I, ignore E402, per-file ignores for tests (F841) and forward refs (F821)
- Add pre-commit hook: ruff check + ruff format --check + pytest (fast)
- Add install-hooks.sh for easy setup
All 565 tests pass. ruff check and ruff format --check both clean.
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
- Commit updated uv.lock (was stale, missing new dependencies from Phases 3-5 which caused --frozen sync failures in CI)
- Use actions/setup-python@v5 for Python version matrix instead of passing python-version to setup-uv (unsupported parameter)
- Add --frozen to all uv sync calls for reproducible CI builds
- Remove Docker build job from CI (requires Docker daemon; deploy workflow handles real builds on tag push)
- Add coverage.xml to .gitignore
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
Tests used absolute /home/user/Colloquip/ paths that only work locally. Replaced with paths computed relative to the project root using os.path.dirname(os.path.abspath(__file__)), which works on any CI runner. https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
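A small helper along these lines captures the idea. It assumes the test files live one directory below the project root (for example in `tests/`); the helper names and the `levels_up` parameter are hypothetical.

```python
import os


def project_root(test_file: str, levels_up: int = 1) -> str:
    """Walk up from the directory containing test_file to the project root."""
    path = os.path.dirname(os.path.abspath(test_file))
    for _ in range(levels_up):
        path = os.path.dirname(path)
    return path


def project_path(test_file: str, *parts: str) -> str:
    """Build a path under the project root, independent of the checkout location."""
    return os.path.join(project_root(test_file), *parts)
```

Inside a test module this would be called as `project_path(__file__, "some", "fixture.yaml")`, so the same test resolves correctly under `/home/user/Colloquip/` and on a CI runner's workspace.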
- tests/TEST_STRATEGY.md: Test conventions, fixtures, patterns, and checklist for writing effective tests (reference before writing new tests)
- tests/test_api_routes.py: 51 tests covering all route handlers (export, external, feedback, memory, watcher routes) with happy paths, validation, 404/400/503 error paths
- tests/test_tools.py: 26 tests for citation_verifier, company_docs, web_search tools with mocked external APIs
- tests/test_infrastructure.py: 18 tests for db/engine, display, CLI
Key coverage improvements:
- export_routes: 20% → 95%+
- external_routes: 51% → 95%+
- feedback_routes: 62% → 95%+
- memory_routes: 51% → 95%+
- watcher_routes: 45% → 95%
- citation_verifier: 47% → 95%
- company_docs: 31% → 90%
- web_search: 43% → 95%
- db/engine: 38% → 100%
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
Replace the static string confidence_level with Beta distribution parameters (alpha, beta) on SynthesisMemory. Retrieval scoring becomes similarity * confidence * decay instead of similarity-only.
Key changes:
- SynthesisMemory carries confidence_alpha/confidence_beta fields initialized from synthesis metadata (high→3:1, moderate→2:1.5, low→1:2)
- compute_confidence() returns clamped posterior mean [0.10, 0.95]
- temporal_decay() applies exponential decay with 120-day half-life
- composite_score() multiplies similarity × confidence × decay
- Annotations auto-update confidence: confirmed +2α, correction +3β, outdated +2β, context no change
- Outcome reports (confirmed/contradicted) update linked memory confidence
- Retrieval logging records every memory retrieval with similarity, confidence, decay factor, and composite score for future calibration
- DB schema, repository, and API responses updated for new fields
- 47 new tests covering all Bayesian math, decay, scoring, annotation wiring, retrieval logging, and prompt formatting
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo
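The scoring arithmetic described in this commit can be worked through with a short sketch. The constants (priors, clamp range, 120-day half-life, annotation increments) are taken from the commit message; the function signatures and the `apply_annotation` helper are assumptions and may differ from the project's code.

```python
# Beta priors (alpha, beta) by the synthesis's stated confidence level.
PRIORS = {"high": (3.0, 1.0), "moderate": (2.0, 1.5), "low": (1.0, 2.0)}

# Increments (delta_alpha, delta_beta) applied per annotation type.
ANNOTATION_UPDATES = {
    "confirmed": (2.0, 0.0),
    "correction": (0.0, 3.0),
    "outdated": (0.0, 2.0),
    "context": (0.0, 0.0),
}


def compute_confidence(alpha: float, beta: float) -> float:
    """Posterior mean alpha / (alpha + beta), clamped to [0.10, 0.95]."""
    mean = alpha / (alpha + beta)
    return min(0.95, max(0.10, mean))


def temporal_decay(age_days: float, half_life_days: float = 120.0) -> float:
    """Exponential decay: the factor halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def composite_score(similarity: float, alpha: float, beta: float, age_days: float) -> float:
    """similarity × confidence × decay."""
    return similarity * compute_confidence(alpha, beta) * temporal_decay(age_days)


def apply_annotation(alpha: float, beta: float, kind: str) -> tuple[float, float]:
    """Return the updated (alpha, beta) after one annotation (hypothetical helper)."""
    delta_alpha, delta_beta = ANNOTATION_UPDATES[kind]
    return alpha + delta_alpha, beta + delta_beta
```

For example, a "high" memory starts at 3/(3+1) = 0.75 confidence. Retrieved at similarity 0.8 after 120 days, it scores 0.8 × 0.75 × 0.5 = 0.30. One "confirmed" annotation moves it to (5, 1), or about 0.83 confidence, while one "correction" moves it to (3, 4), or about 0.43.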
Full frontend rebuild with TanStack Router/Query, Zustand, Tailwind v4, and shadcn-style primitives. Includes route structure for communities, threads, agents, memories, notifications, and settings. Migrates existing deliberation components with Tailwind restyling. Fixes Memory type to match Bayesian confidence model (confidence/alpha/beta fields, correct citations_used and confidence_level types). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename App.css -> app.css to match the import in main.tsx on case-sensitive filesystems. Remove unused legacy index.css. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Wrap UI primitives (Button, Card, Badge, Dialog, Tooltip, Skeleton) with HeroUI v3 beta components for consistent styling and accessibility
- Fix Dialog component: conditionally render Modal to prevent backdrop from blocking pointer events when closed
- Dockerfile: add alembic/ and alembic.ini to production image for migrations
- Dockerfile.dev: replace broad COPY with targeted copies, use uv pip install
- docker-compose.yml: add start_period to app healthcheck
- docker-compose.dev.yml: mount alembic directory for dev migrations
- Rewrite web/README.md to document actual tech stack (HeroUI, TanStack, TailwindCSS v4, Zustand) replacing default Vite boilerplate
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The alembic package is in the db-pg optional dependency group, not db. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SQLite does not support ALTER-based create_unique_constraint. Move the three constraints (consensus_maps, subreddit_memberships, syntheses) into their respective create_table calls as sa.UniqueConstraint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
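The underlying SQLite behaviour can be reproduced with the standard library alone, without Alembic. In the sketch below the table names come from the commit message, while the column and constraint names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declaring the constraint inside CREATE TABLE works on SQLite.
conn.execute(
    """
    CREATE TABLE subreddit_memberships (
        id INTEGER PRIMARY KEY,
        subreddit_id TEXT NOT NULL,
        agent_id TEXT NOT NULL,
        CONSTRAINT uq_membership UNIQUE (subreddit_id, agent_id)
    )
    """
)
conn.execute("INSERT INTO subreddit_memberships (subreddit_id, agent_id) VALUES ('s1', 'a1')")
try:
    conn.execute("INSERT INTO subreddit_memberships (subreddit_id, agent_id) VALUES ('s1', 'a1')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Adding the constraint afterwards via ALTER TABLE is rejected by SQLite.
conn.execute("CREATE TABLE syntheses (id INTEGER PRIMARY KEY, thread_id TEXT NOT NULL)")
try:
    conn.execute("ALTER TABLE syntheses ADD CONSTRAINT uq_thread UNIQUE (thread_id)")
    alter_supported = True
except sqlite3.OperationalError:
    alter_supported = False

conn.close()
```

Here the duplicate insert is rejected by the inline constraint, and the `ALTER TABLE ... ADD CONSTRAINT` statement fails to parse, which is why the migration declares the constraints at table creation time.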
Summary
This PR implements Phase 1-2 of the Emergent Deliberation Platform specification, transforming Colloquip from a single-thread deliberation engine into a multi-subreddit social platform for AI expert panels. Scientists can now submit hypotheses to topic-specific communities and receive structured, multi-perspective assessments with cited evidence in 10-15 minutes.
Key Changes
Platform Architecture
Synthesis & Output
Tools & Evidence
Cost & Governance
Memory & Watchers (Phase 3-4 Foundation)
API & Infrastructure
- /api/subreddits, /api/agents, /api/threads for community management
- /api/memory for retrieval and annotation
- /api/watchers for event management
- /api/feedback for outcome tracking and agent calibration
Database & Deployment
Testing
Implementation Details
Files Added/Modified
- registry.py, synthesis.py, platform_manager.py, output_templates.py
- pubmed.py, company_docs.py, web_search.py,
https://claude.ai/code/session_017HcdLV3pMPN1s3iXESokNo