feat(cycle-6): knowledge base — file upload, async indexing, pgvector search, memory tools#11

Merged
sameerchereddy merged 5 commits into main from feat/cycle-6-knowledge-base on Apr 7, 2026
Conversation

@sameerchereddy
Owner

Summary

  • Indexing pipeline (indexer.py): extract_text (PDF/DOCX/TXT/MD via pypdf + python-docx), chunk_text sliding-window sentence-aware chunker, index_source full async pipeline (download → extract → chunk → embed → pgvector insert) with status tracking (pending → indexing → ready/error)
  • KB + memory tools (tools.py): tool_knowledge_search (pgvector cosine similarity, configurable top-k + threshold), tool_save_memory (enforces max_memories cap, expiry via retention_days), ToolContext dataclass; SSRF-safe tool_url_reader (async DNS + PinnedTransport); _REQUIRED_ARGS validation; 4000-char result truncation
  • Agent loop (loop.py): _trim_messages sliding-window (32k char limit, drops oldest complete exchanges atomically), build_system_prompt injects long-term memories, dynamic tool list (KB tool only if embedding key + ready sources, memory tool only if long_term_enabled)
  • Knowledge API (knowledge.py): upload/list/delete sources, embedding key endpoint, list/delete memories
  • Embedding key UX (run.py + AgentPage.tsx): OpenAI agents reuse their configured LLM key (zero extra config); non-OpenAI agents (Anthropic/Gemini/Bedrock/Ollama) see a small "OpenAI API key for embeddings" field only when needed
  • Frontend KB block: drag-and-drop upload, source list with status badges, chunk size/overlap/top-k/similarity controls, provider-aware embedding key section, stored memories list with delete
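The sliding-window, sentence-aware chunker described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code — only the name `chunk_text` and the sliding-window/overlap behaviour come from the summary; the internals are assumptions:

```python
import re

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Sentence-aware sliding-window chunker (illustrative sketch)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > chunk_size:
            chunks.append(current)
            # Cap the overlap tail to len(current) so a short chunk is never
            # over-sliced (matching the overlap fix noted in the review).
            tail = current[-min(overlap, len(current)):]
            current = tail + " " + sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk starts with the tail of the previous one, so context spanning a chunk boundary still appears in at least one embedding.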

Test plan

  • 132 backend tests pass (python -m pytest tests/ -q)
  • mypy clean (python -m mypy app/)
  • TypeScript clean (npx tsc --noEmit)
  • OpenAI agent: KB block shows "using your configured OpenAI key" — no extra field
  • Non-OpenAI agent: KB block shows OpenAI fallback key field
  • No LLM configured: KB block shows "Configure an LLM above" message
  • File upload → source transitions pending → indexing → ready with chunk count
  • Chat with indexed sources triggers knowledge_search tool call
  • Ollama chat works (KB requires OpenAI key — deferred for later test)

🤖 Generated with Claude Code

… search, save_memory

- indexer.py: extract_text (PDF/DOCX/TXT/MD), chunk_text sliding window, index_source pipeline
- tools.py: tool_knowledge_search (pgvector cosine sim), tool_save_memory, ToolContext dataclass
  SSRF-safe url_reader: async DNS + PinnedTransport; _REQUIRED_ARGS validation; result truncation
- loop.py: _trim_messages (32k char sliding window), dynamic KB/memory tool list, memory injection
- knowledge.py: upload/list/delete sources, embedding key, memories routes
- run.py: ToolContext wiring, embedding key resolved from LLM config (OpenAI) or stored fallback
  Embedding key UX: OpenAI agents use their LLM key; non-OpenAI agents shown OpenAI fallback field
- AgentPage.tsx: KB block (drag-drop upload, source list, provider-aware embedding key UI),
  memory block (retention days, stored memories list), stopped/onerror WS handlers
- 15 new tests for indexer (extract_text, chunk_text, index_source happy + error paths)
@github-actions

github-actions Bot commented Apr 7, 2026

🤖 AI Code Review

AI review unavailable.


Reviewed by GitHub Models (gpt-4o) · Not a substitute for human review

Critical:
- indexer: validate embedding count per batch matches chunk count (raises ValueError on mismatch)
- tools: validate memory_type server-side against allowed set {fact, preference, goal}

Major:
- indexer: sanitise error_message stored to DB — only surface ValueError text; others get generic type+phrase
- indexer: exponential backoff retry (3 attempts) on embedding API calls
- indexer: fix chunk overlap — cap tail to len(current) to avoid silent over-slice, remove spurious space
- knowledge: _parse_uuid() helper converts ValueError → HTTP 400 on all path param UUID conversions
- knowledge: apply _parse_uuid to source_id, memory_id, agent_id, user_id in delete routes

Moderate:
- tools: retention_days=0 now treated as 0-day (immediate expiry), not no-expiry
- tools: truncated tool results append "[Results truncated]" so LLM knows output was cut
- knowledge: validate embedding key starts with 'sk-' before storing
- knowledge: sanitise filename with os.path.basename + strip non-ASCII + cap at 255 chars

Minor:
- tools: remove unused dc_field import
- tools: wrap conn.close() in try/except in both KB and memory tools
- indexer: add source_id correlation to all log lines (start, download, indexed)
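The server-side `memory_type` validation flagged as critical above could look like the sketch below. The helper name `validate_memory_type` is hypothetical; only the allowed set {fact, preference, goal} comes from the review:

```python
ALLOWED_MEMORY_TYPES = {"fact", "preference", "goal"}

def validate_memory_type(memory_type: str) -> str:
    """Validate the LLM-supplied memory_type argument server-side."""
    value = memory_type.strip().lower()
    if value not in ALLOWED_MEMORY_TYPES:
        raise ValueError(
            f"memory_type must be one of {sorted(ALLOWED_MEMORY_TYPES)}, got {memory_type!r}"
        )
    return value
```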


Owner Author

@sameerchereddy left a comment


Review

This is a substantial, well-structured feature. The architecture decisions (per-tool connection pooling, dynamic tool injection, SSRF-safe URL reader, sanitised error messages on the indexer) are solid. A few issues worth addressing before merge.


Major

1. Supabase upload before DB insert — orphaned file on failure (knowledge.py:773–797)

The storage upload runs first, then the INSERT INTO knowledge_sources. If the DB insert fails (constraint violation, network blip), the file is uploaded but no row exists, leaving storage permanently polluted with no cleanup path. Reverse the order: insert the DB row first (or in a transaction), then upload, then kick off BackgroundTasks.
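The suggested ordering can be sketched as below, with placeholder callables standing in for the real asyncpg INSERT, Supabase storage upload, compensating DELETE, and BackgroundTasks kickoff:

```python
async def create_source(insert_row, upload_file, delete_row, start_indexing):
    """Insert the DB row first, upload second, roll back the row on failure.

    The four callables are placeholders for the real DB/storage operations;
    only the ordering is the point.
    """
    source_id = await insert_row()       # 1. row exists before any bytes land
    try:
        await upload_file(source_id)     # 2. push the file to storage
    except Exception:
        await delete_row(source_id)      # 3. rollback: no orphaned row, no orphaned file
        raise
    start_indexing(source_id)            # 4. kick off async indexing
    return source_id
```

With this order, a failed upload leaves nothing behind, and a failed insert never uploads at all.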

2. tool_save_memory lacks error handling on DB operations (tools.py:499–526)

conn.fetchval and conn.execute can raise. When they do, the exception propagates out of _run_one and into asyncio.gather in the loop, which aborts the entire tool-call round-trip. Every other tool returns a string on failure; this one should too. Wrap the DB block in a try/except Exception as exc: return f"Error saving memory: ...".

3. tool_knowledge_search leaks DB exception details to the LLM (tools.py:465–466)

except Exception as exc:
    return f"Error searching knowledge base: {exc}"

asyncpg exceptions routinely include the full connection string and query. That string ends up verbatim in the LLM's context window. Mirror the indexer's pattern: generic message for unexpected errors, only surface ValueError text.


Moderate

4. Retry catches all exceptions, not just retriable ones (indexer.py:177)

The embed retry loop catches bare Exception, so a 401 Unauthorized from OpenAI retries 3 times with backoff before failing. Only transient errors (rate-limit, connection) should be retried. Check for openai.RateLimitError / openai.APIConnectionError or inspect exc.status_code == 429.

5. Frontend deleteSource / deleteMemory are optimistic without error handling (AgentPage.tsx)

Both functions remove the item from state before confirming the server DELETE succeeded. If the DELETE returns 4xx/5xx, the UI and DB diverge silently. Check res.ok before updating state.

6. File is fully buffered before size check (knowledge.py:705–710)

data = await file.read()
if len(data) > max_bytes:
    ...

A 200 MB upload against a 20 MB limit is held entirely in memory before being rejected. Consider checking Content-Length first and returning 413 immediately, falling back to the buffer check only when the header is absent.
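A sketch of the header check; the helper name and the 20 MB figure are illustrative, and the caller would map the ValueError to an HTTP 413:

```python
MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # illustrative 20 MB cap

def check_declared_size(headers: dict, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Reject an oversized upload from Content-Length before buffering the body.

    Returns silently when the header is absent or malformed; the caller then
    falls back to the existing len(data) check after reading.
    """
    raw = headers.get("content-length")
    if raw is None:
        return
    try:
        declared = int(raw)
    except (TypeError, ValueError):
        return
    if declared > max_bytes:
        raise ValueError(f"Declared size {declared} exceeds limit of {max_bytes} bytes")
```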

7. No error feedback in saveEmbeddingKey (AgentPage.tsx)

If the PUT returns 4xx, setSavingEmbKey(false) runs but no error is shown. The user sees nothing.


Minor

8. import re inside chunk_text (indexer.py:98)

This re-executes the import statement (a sys.modules lookup plus a local rebinding) on every call. Move it to the top-level imports.

9. if not settings is dead code (knowledge.py:617)

settings is a module-level Settings() singleton; it's never None. The guard is always false.

10. list_knowledge_sources uses SELECT * (knowledge.py:836)

Explicit column selection avoids surprises if the table schema grows.


Nit

The _get_conn() helper opens a raw asyncpg connection per call, with no pooling. For a background-task-heavy endpoint with many concurrent uploads this will exhaust Postgres connection limits quickly. Something to revisit when concurrency becomes an issue.

Overall the KB plumbing is clean and the test coverage for indexer.py is good. Mainly needs the error-propagation fix in tool_save_memory and the storage-before-DB ordering fix before ship.

Major:
- knowledge: DB INSERT before Supabase upload; rollback DB row if upload fails (no orphaned files)
- tools: wrap tool_save_memory in try/except; DB errors return error string, don't abort tool round
- tools: tool_knowledge_search returns generic error string, not raw asyncpg exception (no conn string leak)

Moderate:
- indexer: retry only retriable OpenAI errors (429, 5xx, network); auth/400 errors raise immediately
- frontend: deleteSource/deleteMemory check response before updating state; show error on failure
- knowledge: check Content-Length header before buffering to reject oversized uploads early
- frontend: saveEmbeddingKey shows server error detail on PUT failure

Minor:
- tools: move re import to module level (was inside hot function)
- knowledge: remove dead 'if not settings' guard; use explicit column list in SELECT
- knowledge: add logger instance

Owner Author

@sameerchereddy left a comment


All 10 issues from the previous review are addressed. The fixes are clean — particularly the storage/DB ordering with explicit rollback, _is_retriable_openai_error as a proper named predicate, and the _save_memory extraction making the error boundary obvious.

One nit in the storage rollback path (knowledge.py):

try:
    conn2 = await _get_conn()
    await conn2.execute("DELETE FROM knowledge_sources WHERE id=$1", uuid.UUID(source_id))
    await conn2.close()  # skipped if execute raises
except Exception:
    pass

conn2.close() won't run if execute raises, leaking a connection. Fix:

try:
    conn2 = await _get_conn()
    try:
        await conn2.execute("DELETE FROM knowledge_sources WHERE id=$1", uuid.UUID(source_id))
    finally:
        await conn2.close()
except Exception:
    pass

Rare path, not blocking — but worth a quick fix before the next cycle. LGTM otherwise.


@sameerchereddy sameerchereddy merged commit 4b7aeb0 into main Apr 7, 2026
3 checks passed