Skip to content

search_graph on large datasets with name_pattern= is slow #254

@awconstable

Description

@awconstable

Problem

On a large codebase (~200K nodes), every search_graph call using name_pattern= takes 1.5–2.5s regardless of how specific the pattern is. An exact name search is as slow as a broad regex:

search_graph(name_pattern="specificFunctionName") → 1.5s, 70% CPU
search_graph(label="Method", name_pattern=".Controller.") → 2.0s, 88% CPU

Scales linearly with node count — confirmed across four projects ranging from 1K to 200K+ nodes. The query= (BM25) path is unaffected and fast.

### Root cause — three compounding bugs

  1. sqlite_iregexp recompiles the regex on every row (store.c:431)
  static void sqlite_iregexp(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
      cbm_regex_t re;
      int rc = cbm_regcomp(&re, pattern, CBM_REG_EXTENDED | CBM_REG_NOSUB | CBM_REG_ICASE);
      // ...
      rc = cbm_regexec(&re, text, 0, NULL, 0);
      cbm_regfree(&re);   // ← compiled and freed for every row
  }

For a Method-scoped search across a large project, the pattern is compiled and freed once per node. The standard SQLite fix is sqlite3_get_auxdata/sqlite3_set_auxdata to cache the compiled regex for the lifetime of the statement.

  1. The count query runs a full second scan (store.c:2321)
snprintf(count_sql, sizeof(count_sql), "SELECT COUNT(*) FROM (%s)", sql);

The count query executes before the data query with identical WHERE conditions, so iregexp() fires twice for every row — doubling the already-expensive work.

  1. cbm_extract_like_hints is implemented but never called (store.c:2018, store.h:559)

There is already a function that extracts literal segments from a regex pattern to build LIKE pre-filters:

  int cbm_extract_like_hints(const char *pattern, char **out, int max_out);

For .*Controller.* it yields ["Controller"], enabling:
WHERE n.name LIKE '%Controller%' AND iregexp(?1, n.name)

The idx_nodes_name (project, name) index can satisfy the LIKE clause and cut the regex scan to only matching rows. But cbm_extract_like_hints is never called from search_where_basic or anywhere in the search path — it is dead code.

Suggested fixes

Fix 1 — cache the compiled regex in sqlite_regexp/sqlite_iregexp using sqlite3_set_auxdata:

cbm_regex_t *re = (cbm_regex_t *)sqlite3_get_auxdata(ctx, 0);
if (!re) {
    re = malloc(sizeof(cbm_regex_t));
    cbm_regcomp(re, pattern, CBM_REG_EXTENDED | CBM_REG_NOSUB | CBM_REG_ICASE);
    sqlite3_set_auxdata(ctx, 0, re, (void(*)(void*))regex_free_cb);
}

Fix 2 — wire cbm_extract_like_hints into search_where_basic to prepend LIKE conditions before the regex clause. The function is already correct; it just needs to be called.

Fix 3 — merge the count and data query into a single pass (e.g. COUNT(*) OVER () window function) to eliminate the double scan.

Note on query= mode

query= (BM25/FTS5) already takes a completely separate fast path (handle_search_graph:1318) and is unaffected. This bug is specific to name_pattern= and the cbm_store_search code path.

### Environment

  • Version: 0.6.0
  • OS: Linux (Ubuntu, x86_64)
  • 216,648 nodes, 661,477 edges, 509MB database (PHP monolith)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions