feat(cycle-6): knowledge base — file upload, async indexing, pgvector search, memory tools#11
Conversation
… search, save_memory
- indexer.py: extract_text (PDF/DOCX/TXT/MD), chunk_text sliding window, index_source pipeline
- tools.py: tool_knowledge_search (pgvector cosine sim), tool_save_memory, ToolContext dataclass; SSRF-safe url_reader (async DNS + PinnedTransport); _REQUIRED_ARGS validation; result truncation
- loop.py: _trim_messages (32k char sliding window), dynamic KB/memory tool list, memory injection
- knowledge.py: upload/list/delete sources, embedding key, memories routes
- run.py: ToolContext wiring, embedding key resolved from LLM config (OpenAI) or stored fallback
- Embedding key UX: OpenAI agents use their LLM key; non-OpenAI agents are shown an OpenAI fallback field
- AgentPage.tsx: KB block (drag-drop upload, source list, provider-aware embedding key UI), memory block (retention days, stored memories list), stopped/onerror WS handlers
- 15 new tests for indexer (extract_text, chunk_text, index_source happy + error paths)
🤖 AI Code Review
AI review unavailable. Reviewed by GitHub Models (gpt-4o) · Not a substitute for human review
Critical:
- indexer: validate embedding count per batch matches chunk count (raises ValueError on mismatch)
- tools: validate memory_type server-side against allowed set {fact, preference, goal}
Major:
- indexer: sanitise error_message stored to DB — only surface ValueError text; others get generic type+phrase
- indexer: exponential backoff retry (3 attempts) on embedding API calls
- indexer: fix chunk overlap — cap tail to len(current) to avoid silent over-slice, remove spurious space
- knowledge: _parse_uuid() helper converts ValueError → HTTP 400 on all path param UUID conversions
- knowledge: apply _parse_uuid to source_id, memory_id, agent_id, user_id in delete routes
Moderate:
- tools: retention_days=0 now treated as 0-day (immediate expiry), not no-expiry
- tools: truncated tool results append "[Results truncated]" so LLM knows output was cut
- knowledge: validate embedding key starts with 'sk-' before storing
- knowledge: sanitise filename with os.path.basename + strip non-ASCII + cap at 255 chars
Minor:
- tools: remove unused dc_field import
- tools: wrap conn.close() in try/except in both KB and memory tools
- indexer: add source_id correlation to all log lines (start, download, indexed)
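The chunk-overlap fix in the list above can be illustrated with a minimal word-based chunker. This is a hypothetical sketch, not the real sentence-aware chunk_text; the point is the capped tail, which prevents a large overlap from silently over-slicing or stalling progress:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size characters."""
    words = text.split()
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for word in words:
        current.append(word)
        length += len(word) + 1  # +1 for the joining space
        if length >= chunk_size:
            chunks.append(" ".join(current))
            # cap the tail strictly below len(current) so the window
            # always advances, even for an oversized overlap
            tail = min(overlap, len(current) - 1)
            current = current[-tail:] if tail > 0 else []
            length = sum(len(w) + 1 for w in current)
    if current:
        chunks.append(" ".join(current))
    return chunks
```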
sameerchereddy
left a comment
Review
This is a substantial, well-structured feature. The architecture decisions (per-tool connection pooling, dynamic tool injection, SSRF-safe URL reader, sanitised error messages on the indexer) are solid. A few issues worth addressing before merge.
Major
1. Supabase upload before DB insert — orphaned file on failure (knowledge.py:773–797)
The storage upload runs first, then the INSERT INTO knowledge_sources. If the DB insert fails (constraint violation, network blip), the file is uploaded but no row exists, leaving storage permanently polluted with no cleanup path. Reverse the order: insert the DB row first (or in a transaction), then upload, then kick off BackgroundTasks.
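The suggested ordering can be sketched with the storage and DB operations abstracted as callables. insert_row, upload, and delete_row are stand-ins for the real asyncpg INSERT, Supabase upload, and rollback DELETE; none of these names come from the codebase:

```python
def create_source(insert_row, upload, delete_row, data: bytes) -> str:
    source_id = insert_row()       # 1. DB row first: nothing to orphan yet
    try:
        upload(source_id, data)    # 2. then the storage upload
    except Exception:
        delete_row(source_id)      # 3. roll the row back if the upload fails
        raise
    return source_id
```

With this ordering, a failed upload leaves a clean database and an empty bucket, instead of an unreferenced file.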
2. tool_save_memory lacks error handling on DB operations (tools.py:499–526)
conn.fetchval and conn.execute can raise. When they do, the exception propagates out of _run_one and into asyncio.gather in the loop, which aborts the entire tool-call round-trip. Every other tool returns a string on failure; this one should too. Wrap the DB block in a try/except Exception as exc: return f"Error saving memory: ...".
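A minimal sketch of the suggested boundary, with save_to_db standing in for the real conn.fetchval/conn.execute block (the stand-in and the exact return strings are assumptions):

```python
async def tool_save_memory(save_to_db, content: str) -> str:
    try:
        await save_to_db(content)
    except Exception as exc:
        # match the other tools: report failure as a string so the
        # exception never escapes into asyncio.gather
        return f"Error saving memory: {type(exc).__name__}"
    return "Memory saved."
```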
3. tool_knowledge_search leaks DB exception details to the LLM (tools.py:465–466)
except Exception as exc:
    return f"Error searching knowledge base: {exc}"

asyncpg exceptions routinely include the full connection string and query. That string ends up verbatim in the LLM's context window. Mirror the indexer's pattern: generic message for unexpected errors, only surface ValueError text.
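The sanitisation pattern the review asks tool_knowledge_search to mirror could look like this. ValueError text (our own validation messages) is surfaced; anything else collapses to the exception type name so connection strings and queries never reach the LLM. The helper name is hypothetical:

```python
def safe_error_message(exc: Exception, action: str) -> str:
    """Build a user/LLM-safe error string from an arbitrary exception."""
    if isinstance(exc, ValueError):
        return f"Error {action}: {exc}"  # our own validation text is safe
    return f"Error {action}: unexpected {type(exc).__name__}"
```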
Moderate
4. Retry catches all exceptions, not just retriable ones (indexer.py:177)
The embed retry loop catches bare Exception, so a 401 Unauthorized from OpenAI retries 3 times with backoff before failing. Only transient errors (rate-limit, connection) should be retried. Check for openai.RateLimitError / openai.APIConnectionError or inspect exc.status_code == 429.
5. Frontend deleteSource / deleteMemory are optimistic without error handling (AgentPage.tsx)
Both functions remove the item from state before confirming the server DELETE succeeded. If the DELETE returns 4xx/5xx, the UI and DB diverge silently. Check res.ok before updating state.
6. File is fully buffered before size check (knowledge.py:705–710)
data = await file.read()
if len(data) > max_bytes:
    ...

A 200 MB upload against a 20 MB limit is held entirely in memory before being rejected. Consider checking Content-Length first and returning 413 immediately, falling back to the buffer check only when the header is absent.
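The early-rejection idea, sketched under stated assumptions: content_length and read_body stand in for the real FastAPI request header and UploadFile.read, and the 20 MB limit is assumed. The ValueError here marks where the endpoint would return HTTP 413:

```python
MAX_BYTES = 20 * 1024 * 1024  # assumed 20 MB limit

def check_upload_size(content_length, read_body) -> bytes:
    if content_length is not None and int(content_length) > MAX_BYTES:
        # reject before a single byte is buffered
        raise ValueError("payload too large")
    data = read_body()
    if len(data) > MAX_BYTES:  # header absent, or it lied
        raise ValueError("payload too large")
    return data
```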
7. No error feedback in saveEmbeddingKey (AgentPage.tsx)
If the PUT returns 4xx, setSavingEmbKey(false) runs but no error is shown. The user sees nothing.
Minor
8. import re inside chunk_text (indexer.py:98)
This imports the module object on every call. Move it to the top-level imports.
9. if not settings is dead code (knowledge.py:617)
settings is a module-level Settings() singleton; it's never None. The guard is always false.
10. list_knowledge_sources uses SELECT * (knowledge.py:836)
Explicit column selection avoids surprises if the table schema grows.
Nit
The _get_conn() helper opens a raw asyncpg connection but the connection is never pooled. For a background-task-heavy endpoint with many concurrent uploads this will hit Postgres connection limits quickly. Something to revisit when concurrency becomes an issue.
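The pooling direction this nit points at could be sketched as one lazily created, shared pool instead of a raw connection per call. create_pool is injected here to keep the sketch self-contained; in the real code it would be something like asyncpg.create_pool(dsn), and a production version would also guard the first call with a lock against concurrent creation:

```python
_pool = None

async def get_pool(create_pool):
    """Return the shared pool, creating it on first use."""
    global _pool
    if _pool is None:
        _pool = await create_pool()  # created once, reused afterwards
    return _pool
```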
Overall the KB plumbing is clean and the test coverage for indexer.py is good. Mainly needs the error-propagation fix in tool_save_memory and the storage-before-DB ordering fix before ship.
Major:
- knowledge: DB INSERT before Supabase upload; rollback DB row if upload fails (no orphaned files)
- tools: wrap tool_save_memory in try/except; DB errors return error string, don't abort tool round
- tools: tool_knowledge_search returns generic error string, not raw asyncpg exception (no conn string leak)
Moderate:
- indexer: retry only retriable OpenAI errors (429, 5xx, network); auth/400 errors raise immediately
- frontend: deleteSource/deleteMemory check response before updating state; show error on failure
- knowledge: check Content-Length header before buffering to reject oversized uploads early
- frontend: saveEmbeddingKey shows server error detail on PUT failure
Minor:
- tools: move re import to module level (was inside hot function)
- knowledge: remove dead 'if not settings' guard; use explicit column list in SELECT
- knowledge: add logger instance
sameerchereddy
left a comment
All 10 issues from the previous review are addressed. The fixes are clean — particularly the storage/DB ordering with explicit rollback, _is_retriable_openai_error as a proper named predicate, and the _save_memory extraction making the error boundary obvious.
One nit in the storage rollback path (knowledge.py):
try:
    conn2 = await _get_conn()
    await conn2.execute("DELETE FROM knowledge_sources WHERE id=$1", uuid.UUID(source_id))
    await conn2.close()  # skipped if execute raises
except Exception:
    pass

conn2.close() won't run if execute raises, leaking a connection. Fix:
try:
    conn2 = await _get_conn()
    try:
        await conn2.execute("DELETE FROM knowledge_sources WHERE id=$1", uuid.UUID(source_id))
    finally:
        await conn2.close()
except Exception:
    pass

Rare path, not blocking — but worth a quick fix before the next cycle. LGTM otherwise.
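An alternative to the manual try/finally is a small async context manager that guarantees close(). This is a sketch, with get_conn injected in place of the real _get_conn:

```python
import contextlib

@contextlib.asynccontextmanager
async def db_conn(get_conn):
    """Yield a connection and close it no matter how the body exits."""
    conn = await get_conn()
    try:
        yield conn
    finally:
        await conn.close()
```

The rollback path then becomes `async with db_conn(_get_conn) as conn2: ...` inside the outer except-pass, and the leak is impossible by construction.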
Summary
- indexer.py: extract_text (PDF/DOCX/TXT/MD via pypdf + python-docx), chunk_text sliding-window sentence-aware chunker, index_source full async pipeline (download → extract → chunk → embed → pgvector insert) with status tracking (pending → indexing → ready/error)
- tools.py: tool_knowledge_search (pgvector cosine similarity, configurable top-k + threshold), tool_save_memory (enforces max_memories cap, expiry via retention_days), ToolContext dataclass; SSRF-safe tool_url_reader (async DNS + PinnedTransport); _REQUIRED_ARGS validation; 4000-char result truncation
- loop.py: _trim_messages sliding window (32k char limit, drops oldest complete exchanges atomically), build_system_prompt injects long-term memories, dynamic tool list (KB tool only if embedding key + ready sources, memory tool only if long_term_enabled)
- knowledge.py: upload/list/delete sources, embedding key endpoint, list/delete memories
- run.py + AgentPage.tsx: OpenAI agents reuse their configured LLM key (zero extra config); non-OpenAI agents (Anthropic/Gemini/Bedrock/Ollama) see a small "OpenAI API key for embeddings" field only when needed

Test plan
- Tests pass (python -m pytest tests/ -q)
- Type checks pass (python -m mypy app/)
- Frontend compiles (npx tsc --noEmit)
- Manual: upload a source, status moves pending → indexing → ready with chunk count
- Manual: knowledge_search tool call

🤖 Generated with Claude Code