
fix(retrieval): replace window function with per-fact_type HNSW queries #540

Open
fabioscarsi wants to merge 2 commits into vectorize-io:main from fabioscarsi:fix/hnsw-semantic-retrieval

Conversation

@fabioscarsi
Contributor

Fixes #539

fix(retrieval): replace window function with per-fact_type HNSW queries

The problem

retrieve_semantic_bm25_combined() uses a window function:

ROW_NUMBER() OVER (PARTITION BY fact_type ORDER BY embedding <=> $1)

This pattern prevents pgvector from using the HNSW index and forces a full sequential scan of all vectors. On databases with 100K+ memory_units, every recall scans the entire table: hundreds of MB of buffers per query (451 MB observed on our deployment).

The impact is not limited to specific configurations:

  • Servers with ample RAM: under concurrent load, hundreds of MB × N parallel queries could put pressure on I/O and buffer pool. On multi-user deployments or with active consolidation, the degradation is cumulative.
  • VPS and containers: on memory-constrained systems, retrieval latency under consolidation load could become a limiting factor for production use.
  • An observed example, macOS with compressed memory: compressed vectors are decompressed on every scan, generating 5+ GB of decompression work per query.

Technical cause

pgvector can only use the HNSW index when the query has the form:

ORDER BY embedding <=> vector LIMIT n

The presence of PARTITION BY in the window function forces the planner to run a sequential scan over all rows before sorting and partitioning the results.

A global HNSW index with post-filtering by fact_type does not work: minority classes (e.g., experience with ~3K nodes) receive near-zero results because the index returns the nearest nodes regardless of fact_type, and the WHERE filter discards them.
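The recall collapse from post-filtering can be illustrated with a toy simulation (hypothetical data; brute-force distance on 1-D "embeddings" stands in for the HNSW traversal, but the class imbalance effect is the same):

```python
import random

random.seed(0)

# Toy corpus: 97% 'world' rows near 0.0, 3% 'experience' rows near 3.0.
corpus = [("world", random.gauss(0.0, 1.0)) for _ in range(970)]
corpus += [("experience", random.gauss(3.0, 1.0)) for _ in range(30)]

query = 0.0  # query embedding sits in the 'world' cluster
n = 10

# Global top-n then post-filter (one shared index + WHERE fact_type = ...):
global_top = sorted(corpus, key=lambda r: abs(r[1] - query))[:n]
experience_hits = [r for r in global_top if r[0] == "experience"]

# Per-fact_type top-n (what a partial index per fact_type enables):
per_type_top = sorted(
    (r for r in corpus if r[0] == "experience"),
    key=lambda r: abs(r[1] - query),
)[:n]

print(len(experience_hits))  # near-zero: the minority class is crowded out
print(len(per_type_top))     # always n results for the minority class
```

The majority class fills the global top-n, so the WHERE filter leaves the minority class with few or no results, exactly the failure mode described above.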

Solution

The core change replaces the full-table vector scan with targeted HNSW index lookups, then applies the existing RRF fusion and graph retrieval pipeline unchanged.

Concretely, we issue a separate query per fact_type, each with ORDER BY embedding <=> $1 LIMIT n, which enables an HNSW index scan for each query.

Key changes:

  1. Per-fact_type queries: one semantic query per fact_type instead of a single query with window function
  2. Partial indexes: requires partial HNSW indexes per fact_type (see Prerequisites section)
  3. ef_search = 200: increased from default 40 to ensure sufficient recall on sparse HNSW graphs
  4. 5x overfetch: HNSW is approximate — fetch 5x more results and trim in Python
  5. Parallelization: semantic queries for different fact_types execute in parallel via asyncio.gather() using separate pool connections, reducing total semantic retrieval time to the slowest single fact_type query

Note on SET hnsw.ef_search: we use SET + RESET instead of SET LOCAL because asyncpg in autocommit mode ignores transaction-local settings.
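The fan-out described in points 1, 4, and 5 can be sketched as follows. This is a minimal illustration, not the PR's code: `semantic_for_type` is a stub standing in for the real asyncpg query on its own pool connection, with the actual SQL shape shown in its docstring.

```python
import asyncio

FACT_TYPES = ("world", "observation", "experience")
OVERFETCH = 5  # HNSW is approximate: fetch 5x, trim after scoring

async def semantic_for_type(fact_type: str, limit: int) -> list[tuple[str, float]]:
    """Stand-in for one per-fact_type query on its own pool connection.

    The real version would run, per connection:
        SET hnsw.ef_search = 200;
        SELECT id, embedding <=> $1 AS dist
          FROM memory_units
         WHERE fact_type = $2
         ORDER BY embedding <=> $1
         LIMIT $3;
        RESET hnsw.ef_search;
    (SET + RESET rather than SET LOCAL, because asyncpg in autocommit
    mode ignores transaction-local settings.)
    """
    await asyncio.sleep(0)  # simulate query I/O
    return [(f"{fact_type}-{i}", i / 10) for i in range(limit * OVERFETCH)]

async def retrieve_semantic(limit_per_type: int) -> dict[str, list]:
    # Parallel fan-out: total latency ~= the slowest single fact_type query.
    results = await asyncio.gather(
        *(semantic_for_type(ft, limit_per_type) for ft in FACT_TYPES)
    )
    # Trim the 5x overfetch back down in Python.
    return {ft: rows[:limit_per_type] for ft, rows in zip(FACT_TYPES, results)}

out = asyncio.run(retrieve_semantic(10))
print({ft: len(rows) for ft, rows in out.items()})
```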

Alignment with project design principles

Hindsight's recall architecture uses parallel multi-axis retrieval (semantic, BM25, graph, temporal) fused via RRF. This patch extends the same principle to the semantic axis itself: instead of one monolithic embedding scan across all fact_types, we run parallel per-fact_type HNSW traversals.

This is the same pattern already used in _find_semantic_seeds(), which leverages ORDER BY embedding <=> $1 LIMIT n for HNSW-accelerated retrieval. The patch applies existing patterns consistently to the main retrieval path — it does not introduce new architectural concepts.
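For readers unfamiliar with the fusion step: Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) across the per-axis rankings. A minimal sketch (the function name and k = 60 are illustrative, not Hindsight's actual implementation):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# A doc ranked well on several axes beats one ranked first on a single axis.
semantic = ["a", "b", "c"]
bm25     = ["b", "c", "d"]
graph    = ["c", "b", "e"]
print(rrf_fuse([semantic, bm25, graph])[0])  # "b"
```

Because RRF only consumes rankings, swapping the single monolithic semantic scan for several per-fact_type rankings slots into the pipeline without touching the fusion logic.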

Prerequisites (migration note)

This PR includes an Alembic migration (a3b4c5d6e7f8_add_partial_hnsw_indexes.py) that auto-creates the required partial HNSW indexes on upgrade — no manual intervention required. The migration runs automatically at startup and is idempotent: if the indexes already exist, the operation is a no-op.

For reference, the indexes created are:

-- Created automatically by migration
CREATE INDEX IF NOT EXISTS idx_mu_emb_world
    ON memory_units USING hnsw (embedding vector_cosine_ops)
    WHERE fact_type = 'world';

CREATE INDEX IF NOT EXISTS idx_mu_emb_observation
    ON memory_units USING hnsw (embedding vector_cosine_ops)
    WHERE fact_type = 'observation';

CREATE INDEX IF NOT EXISTS idx_mu_emb_experience
    ON memory_units USING hnsw (embedding vector_cosine_ops)
    WHERE fact_type = 'experience';

Note: The migration uses CREATE INDEX IF NOT EXISTS (not CONCURRENTLY) because Alembic migrations run inside a transaction. For large existing deployments, operators may prefer to create the indexes manually with CONCURRENTLY before upgrading, to avoid blocking writes during index build.
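For operators pre-creating the indexes by hand, the DDL for each fact_type differs only in the index name and the WHERE clause. A small helper can generate it (a sketch; the index names mirror the migration, and the `concurrently` flag is only for manual pre-creation outside a transaction):

```python
FACT_TYPES = ("world", "observation", "experience")

def partial_hnsw_ddl(fact_type: str, concurrently: bool = False) -> str:
    """DDL for one partial HNSW index, mirroring the migration above."""
    conc = "CONCURRENTLY " if concurrently else ""
    return (
        f"CREATE INDEX {conc}IF NOT EXISTS idx_mu_emb_{fact_type}\n"
        f"    ON memory_units USING hnsw (embedding vector_cosine_ops)\n"
        f"    WHERE fact_type = '{fact_type}';"
    )

for ft in FACT_TYPES:
    print(partial_hnsw_ddl(ft, concurrently=True))
```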

Validation data

Quality check (overlap with pre-patch results)

The patch has been running in production for 48h on a deployment with ~170K memory_units across two banks with no quality regressions observed. Pre-deploy formal validation:

| Metric | Value |
| --- | --- |
| Test cases | 30 (10 embeddings × 3 fact_types) |
| Min overlap | 95.0% |
| Mean overlap | 99.3% |
| Max overlap | 100% |

The 95% minimum overlap is on experience (3.3K nodes, sparsest HNSW graph).
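The overlap metric above is the fraction of pre-patch (exact-scan) top-n ids that the post-patch HNSW query also returns. A minimal sketch of the computation (the 30-case harness itself is not part of this PR):

```python
def overlap_pct(baseline: list[str], candidate: list[str]) -> float:
    """Percentage of baseline (pre-patch, exact scan) result ids that
    also appear in the candidate (post-patch, HNSW) result set."""
    base = set(baseline)
    return 100.0 * len(base & set(candidate)) / len(base)

# e.g. HNSW missing one of twenty exact results -> 95.0
exact = [f"mu-{i}" for i in range(20)]
hnsw  = exact[:19] + ["mu-99"]
print(overlap_pct(exact, hnsw))  # 95.0
```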

EXPLAIN ANALYZE post-deployment

| fact_type | Index used | Buffers | Execution |
| --- | --- | --- | --- |
| world | idx_mu_emb_world | 45 MB | 14 ms |
| observation | idx_mu_emb_observation | 52 MB | 429 ms |
| experience | idx_mu_emb_experience | 53 MB | 138 ms |

Pre-patch: global index, 451 MB buffers, 1029 ms, sequential scan forced by window function.

Note on observation (429 ms): the value reflects a cold cache at measurement time (1,609 pages read from disk). The relevant data for comparison is the buffer count: 52 MB vs 451 MB pre-patch. With warm cache the execution time drops proportionally.

Real-world benchmark (bank with ~170K memory_units)

| Run | Median retrieval_s | Notes |
| --- | --- | --- |
| Pre-patch | 12.32s | System under stress (consolidation active) |
| Post-patch | 0.252s | System idle |
| Speedup | 49× | |

Under consolidation load the improvement is even more marked because the I/O cascade is eliminated.

Risks and rollback

  • Rollback: revert the commit and remove the partial indexes (the indexes do not harm the pre-patch code)
  • Main risk: on very large deployments, the Alembic migration may take several minutes to build the indexes during upgrade. Operators can pre-create them manually with CONCURRENTLY before upgrading to avoid any delay.
  • Compatibility: no changes to public signatures; pool is an optional parameter with sequential fallback

Changed files

  • hindsight-api/hindsight_api/engine/search/retrieval.py — per-fact_type HNSW queries in retrieve_semantic_bm25_combined()
  • hindsight-api/hindsight_api/alembic/versions/a3b4c5d6e7f8_add_partial_hnsw_indexes.py — migration to create partial indexes

Notes

  • Version compatibility: patch developed on 0.4.16 and verified on 0.4.17. retrieval.py is identical between the two versions (empty diff), zero conflicts.
  • Tested on: PostgreSQL 18 with pgvector 0.8.0. The logic should work on PostgreSQL 14+ with pgvector >= 0.5.0.

fabioscarsi and others added 2 commits March 10, 2026 21:29
Replace ROW_NUMBER() OVER (PARTITION BY fact_type ...) with separate
per-fact_type queries using ORDER BY embedding <=> $1 LIMIT n, enabling
HNSW index scans instead of sequential scans.

Key changes:
- Per-fact_type semantic queries with HNSW-friendly ORDER BY ... LIMIT
- Parallel execution via asyncio.gather() when pool is available
- ef_search=200 and 5x overfetch for approximate recall compensation
- New Alembic migration creates partial HNSW indexes per fact_type

Reduces buffer reads from ~450MB to ~50MB per recall on 170K+ deployments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

perf: window function in retrieve_semantic_bm25_combined() prevents HNSW index use
