
fix(retrieval): replace window function with per-fact_type HNSW queries #540

Open
fabioscarsi wants to merge 2 commits into vectorize-io:main from fabioscarsi:fix/hnsw-semantic-retrieval

Conversation

@fabioscarsi
Contributor

Fixes #539

fix(retrieval): replace window function with per-fact_type HNSW queries

The problem

retrieve_semantic_bm25_combined() uses a window function:

ROW_NUMBER() OVER (PARTITION BY fact_type ORDER BY embedding <=> $1)

This pattern prevents pgvector from using the HNSW index and forces a full sequential scan of all vectors. On databases with 100K+ memory_units, every recall scans the entire table: hundreds of MB of buffers per query (451 MB observed on our deployment).

The impact is not limited to specific configurations:

  • Servers with ample RAM: under concurrent load, hundreds of MB × N parallel queries could put pressure on I/O and buffer pool. On multi-user deployments or with active consolidation, the degradation is cumulative.
  • VPS and containers: on memory-constrained systems, retrieval latency under consolidation load could become a limiting factor for production use.
  • An observed example, macOS with compressed memory: compressed vectors are decompressed on every scan, generating 5+ GB of decompression work per query.

Technical cause

pgvector can only use the HNSW index when the query has the form:

ORDER BY embedding <=> vector LIMIT n

The presence of PARTITION BY in the window function forces the planner to run a sequential scan over all rows before sorting and partitioning the results.

A global HNSW index with post-filtering by fact_type does not work: minority classes (e.g., experience with ~3K nodes) receive near-zero results because the index returns the nearest nodes regardless of fact_type, and the WHERE filter discards them.
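The recall collapse from post-filtering can be illustrated with a toy simulation (hypothetical data; brute-force distance on 1-D "embeddings" stands in for the HNSW traversal, but the class imbalance effect is the same):

```python
import random

random.seed(0)

# Toy corpus: 97% 'world' rows near 0.0, 3% 'experience' rows near 3.0.
corpus = [("world", random.gauss(0.0, 1.0)) for _ in range(970)]
corpus += [("experience", random.gauss(3.0, 1.0)) for _ in range(30)]

query = 0.0  # query embedding sits in the 'world' cluster
n = 10

# Global top-n then post-filter (one shared index + WHERE fact_type = ...):
global_top = sorted(corpus, key=lambda r: abs(r[1] - query))[:n]
experience_hits = [r for r in global_top if r[0] == "experience"]

# Per-fact_type top-n (what a partial index per fact_type enables):
per_type_top = sorted(
    (r for r in corpus if r[0] == "experience"),
    key=lambda r: abs(r[1] - query),
)[:n]

print(len(experience_hits))  # near-zero: the minority class is crowded out
print(len(per_type_top))     # always n results for the minority class
```

The majority class fills the global top-n, so the WHERE filter leaves the minority class with few or no results, exactly the failure mode described above.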

Solution

The core change replaces the full-table vector scan with targeted HNSW index lookups, then applies the existing RRF fusion and graph retrieval pipeline unchanged.

Concretely, we issue a separate query per fact_type, each with ORDER BY embedding <=> $1 LIMIT n, which enables an HNSW index scan for each query.

Key changes:

  1. Per-fact_type queries: one semantic query per fact_type instead of a single query with window function
  2. Partial indexes: requires partial HNSW indexes per fact_type (see Prerequisites section)
  3. ef_search = 200: increased from default 40 to ensure sufficient recall on sparse HNSW graphs
  4. 5x overfetch: HNSW is approximate — fetch 5x more results and trim in Python
  5. Parallelization: semantic queries for different fact_types execute in parallel via asyncio.gather() using separate pool connections, reducing total semantic retrieval time to the slowest single fact_type query

Note on SET hnsw.ef_search: we use SET + RESET instead of SET LOCAL because asyncpg in autocommit mode ignores transaction-local settings.
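The fan-out described in points 1, 4, and 5 can be sketched as follows. This is a minimal illustration, not the PR's code: `semantic_for_type` is a stub standing in for the real asyncpg query on its own pool connection, with the actual SQL shape shown in its docstring.

```python
import asyncio

FACT_TYPES = ("world", "observation", "experience")
OVERFETCH = 5  # HNSW is approximate: fetch 5x, trim after scoring

async def semantic_for_type(fact_type: str, limit: int) -> list[tuple[str, float]]:
    """Stand-in for one per-fact_type query on its own pool connection.

    The real version would run, per connection:
        SET hnsw.ef_search = 200;
        SELECT id, embedding <=> $1 AS dist
          FROM memory_units
         WHERE fact_type = $2
         ORDER BY embedding <=> $1
         LIMIT $3;
        RESET hnsw.ef_search;
    (SET + RESET rather than SET LOCAL, because asyncpg in autocommit
    mode ignores transaction-local settings.)
    """
    await asyncio.sleep(0)  # simulate query I/O
    return [(f"{fact_type}-{i}", i / 10) for i in range(limit * OVERFETCH)]

async def retrieve_semantic(limit_per_type: int) -> dict[str, list]:
    # Parallel fan-out: total latency ~= the slowest single fact_type query.
    results = await asyncio.gather(
        *(semantic_for_type(ft, limit_per_type) for ft in FACT_TYPES)
    )
    # Trim the 5x overfetch back down in Python.
    return {ft: rows[:limit_per_type] for ft, rows in zip(FACT_TYPES, results)}

out = asyncio.run(retrieve_semantic(10))
print({ft: len(rows) for ft, rows in out.items()})
```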

Alignment with project design principles

Hindsight's recall architecture uses parallel multi-axis retrieval (semantic, BM25, graph, temporal) fused via RRF. This patch extends the same principle to the semantic axis itself: instead of one monolithic embedding scan across all fact_types, we run parallel per-fact_type HNSW traversals.

This is the same pattern already used in _find_semantic_seeds(), which leverages ORDER BY embedding <=> $1 LIMIT n for HNSW-accelerated retrieval. The patch applies existing patterns consistently to the main retrieval path — it does not introduce new architectural concepts.
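For readers unfamiliar with the fusion step: Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) across the per-axis rankings. A minimal sketch (the function name and k = 60 are illustrative, not Hindsight's actual implementation):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# A doc ranked well on several axes beats one ranked first on a single axis.
semantic = ["a", "b", "c"]
bm25     = ["b", "c", "d"]
graph    = ["c", "b", "e"]
print(rrf_fuse([semantic, bm25, graph])[0])  # "b"
```

Because RRF only consumes rankings, swapping the single monolithic semantic scan for several per-fact_type rankings slots into the pipeline without touching the fusion logic.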

Prerequisites (migration note)

This PR includes an Alembic migration (a3b4c5d6e7f8_add_partial_hnsw_indexes.py) that auto-creates the required partial HNSW indexes on upgrade — no manual intervention required. The migration runs automatically at startup and is idempotent: if the indexes already exist, the operation is a no-op.

For reference, the indexes created are:

-- Created automatically by migration
CREATE INDEX IF NOT EXISTS idx_mu_emb_world
    ON memory_units USING hnsw (embedding vector_cosine_ops)
    WHERE fact_type = 'world';

CREATE INDEX IF NOT EXISTS idx_mu_emb_observation
    ON memory_units USING hnsw (embedding vector_cosine_ops)
    WHERE fact_type = 'observation';

CREATE INDEX IF NOT EXISTS idx_mu_emb_experience
    ON memory_units USING hnsw (embedding vector_cosine_ops)
    WHERE fact_type = 'experience';

Note: The migration uses CREATE INDEX IF NOT EXISTS (not CONCURRENTLY) because Alembic migrations run inside a transaction. For large existing deployments, operators may prefer to create the indexes manually with CONCURRENTLY before upgrading, to avoid blocking writes during index build.
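For operators pre-creating the indexes by hand, the DDL for each fact_type differs only in the index name and the WHERE clause. A small helper can generate it (a sketch; the index names mirror the migration, and the `concurrently` flag is only for manual pre-creation outside a transaction):

```python
FACT_TYPES = ("world", "observation", "experience")

def partial_hnsw_ddl(fact_type: str, concurrently: bool = False) -> str:
    """DDL for one partial HNSW index, mirroring the migration above."""
    conc = "CONCURRENTLY " if concurrently else ""
    return (
        f"CREATE INDEX {conc}IF NOT EXISTS idx_mu_emb_{fact_type}\n"
        f"    ON memory_units USING hnsw (embedding vector_cosine_ops)\n"
        f"    WHERE fact_type = '{fact_type}';"
    )

for ft in FACT_TYPES:
    print(partial_hnsw_ddl(ft, concurrently=True))
```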

Validation data

Quality check (overlap with pre-patch results)

The patch has been running in production for 48h on a deployment with ~170K memory_units across two banks with no quality regressions observed. Pre-deploy formal validation:

| Metric | Value |
| --- | --- |
| Test cases | 30 (10 embeddings × 3 fact_types) |
| Min overlap | 95.0% |
| Mean overlap | 99.3% |
| Max overlap | 100% |

The 95% minimum overlap is on experience (3.3K nodes, sparsest HNSW graph).
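The overlap metric above is the fraction of pre-patch (exact-scan) top-n ids that the post-patch HNSW query also returns. A minimal sketch of the computation (the 30-case harness itself is not part of this PR):

```python
def overlap_pct(baseline: list[str], candidate: list[str]) -> float:
    """Percentage of baseline (pre-patch, exact scan) result ids that
    also appear in the candidate (post-patch, HNSW) result set."""
    base = set(baseline)
    return 100.0 * len(base & set(candidate)) / len(base)

# e.g. HNSW missing one of twenty exact results -> 95.0
exact = [f"mu-{i}" for i in range(20)]
hnsw  = exact[:19] + ["mu-99"]
print(overlap_pct(exact, hnsw))  # 95.0
```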

EXPLAIN ANALYZE post-deployment

| fact_type | Index used | Buffers | Execution |
| --- | --- | --- | --- |
| world | idx_mu_emb_world | 45 MB | 14 ms |
| observation | idx_mu_emb_observation | 52 MB | 429 ms |
| experience | idx_mu_emb_experience | 53 MB | 138 ms |

Pre-patch: global index, 451 MB buffers, 1029 ms, sequential scan forced by window function.

Note on observation (429 ms): the value reflects a cold cache at measurement time (1,609 pages read from disk). The relevant data for comparison is the buffer count: 52 MB vs 451 MB pre-patch. With warm cache the execution time drops proportionally.

Real-world benchmark (bank with ~170K memory_units)

| Run | Median retrieval_s | Notes |
| --- | --- | --- |
| Pre-patch | 12.32s | System under stress (consolidation active) |
| Post-patch | 0.252s | System idle |
| Speedup | 49× | |

Under consolidation load the improvement is even more marked because the I/O cascade is eliminated.

Risks and rollback

  • Rollback: revert the commit and remove the partial indexes (the indexes do not harm the pre-patch code)
  • Main risk: on very large deployments, the Alembic migration may take several minutes to build the indexes during upgrade. Operators can pre-create them manually with CONCURRENTLY before upgrading to avoid any delay.
  • Compatibility: no changes to public signatures; pool is an optional parameter with sequential fallback

Changed files

  • hindsight-api/hindsight_api/engine/search/retrieval.py — per-fact_type HNSW queries in retrieve_semantic_bm25_combined()
  • hindsight-api/hindsight_api/alembic/versions/a3b4c5d6e7f8_add_partial_hnsw_indexes.py — migration to create partial indexes

Notes

  • Version compatibility: patch developed on 0.4.16 and verified on 0.4.17. retrieval.py is identical between the two versions (empty diff), zero conflicts.
  • Tested on: PostgreSQL 18 with pgvector 0.8.0. The logic should work on PostgreSQL 14+ with pgvector >= 0.5.0.

fabioscarsi and others added 2 commits March 10, 2026 21:29
Replace ROW_NUMBER() OVER (PARTITION BY fact_type ...) with separate
per-fact_type queries using ORDER BY embedding <=> $1 LIMIT n, enabling
HNSW index scans instead of sequential scans.

Key changes:
- Per-fact_type semantic queries with HNSW-friendly ORDER BY ... LIMIT
- Parallel execution via asyncio.gather() when pool is available
- ef_search=200 and 5x overfetch for approximate recall compensation
- New Alembic migration creates partial HNSW indexes per fact_type

Reduces buffer reads from ~450MB to ~50MB per recall on 170K+ deployments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

perf: window function in retrieve_semantic_bm25_combined() prevents HNSW index use
