Skip to content

fix(chunker): match apollo MD5 content_hash schema; surface persistence failures#11

Merged
Esity merged 3 commits intoLegionIO:mainfrom
armstrongsamr:fix/content-hash-md5-match-apollo-schema
Apr 28, 2026
Merged

fix(chunker): match apollo MD5 content_hash schema; surface persistence failures#11
Esity merged 3 commits intoLegionIO:mainfrom
armstrongsamr:fix/content-hash-md5-match-apollo-schema

Conversation

@armstrongsamr
Copy link
Copy Markdown

fix(chunker): match apollo MD5 content_hash schema; surface persistence failures

Summary

Chunker.build_chunk emitted 64-char SHA-256 content_hash values, but Apollo's
apollo_entries.content_hash column is CHARACTER(32) (MD5 length). Postgres
rejected every INSERT with PG::StringDataRightTruncation, the error was
swallowed inside Apollo::Runners::Knowledge.handle_ingest's rescue Sequel::Error, and lex-knowledge's upsert_chunk_with_embedding ignored the
{success: false, ...} return value — reporting false-positive
chunks_created: counts to every caller. Nothing actually persisted to
apollo_entries. This PR aligns the hash format with
Legion::Extensions::Apollo::Helpers::Writeback.content_hash and makes the
upsert surface propagate persistence failures as :skipped with a warn log.

Symptom users see

  • POST /api/knowledge/ingest with a corpus path returns HTTP 200 with
    chunks_created: N (non-zero).
  • POST /api/apollo/query against the same content returns zero sources every
    time.
  • SELECT count(*) FROM apollo_entries WHERE content_type='document_chunk'
    stays at zero — the rows never landed.
  • Bedrock metering events fire successfully (embedding generation works), so
    there's no sign of trouble in the LLM side; only the DB INSERT silently
    fails.
  • No error logs emitted from the ingest path, because handle_ingest catches
    Sequel::Error without logging and returns {success: false, error: ...}
    into a caller that discards the return value.

End result: operators believe the corpus has been ingested, but
/api/apollo/query will never retrieve anything because
apollo_entries is empty.

Root cause

Two layered bugs double-silenced the persistence failure:

  1. Hash width mismatch. lib/legion/extensions/knowledge/helpers/chunker.rb
    computed content_hash as Digest::SHA256.hexdigest(content) — 64 hex
    chars. Apollo's schema defines the column as CHARACTER(32) (designed for
    MD5), so PG returned PG::StringDataRightTruncation: ERROR: value too long for type character(32) on every INSERT.

    For comparison, the existing
    Legion::Extensions::Apollo::Helpers::Writeback.content_hash in
    lex-apollo (lib/legion/extensions/apollo/helpers/writeback.rb) is:

    def content_hash(content)
      normalized = content.to_s.strip.downcase.gsub(/\s+/, ' ')
      Digest::MD5.hexdigest(normalized)
    end

    That returns 32 hex chars and fits the column. lex-knowledge should
    match this contract: same input → same digest, regardless of which gem
    is at the write site, so dedup behaves consistently across callers.

  2. Persistence outcome ignored. upsert_chunk_with_embedding called
    ingest_to_apollo (which wraps Apollo::Runners::Knowledge.handle_ingest)
    but discarded the return value, returning :created / :updated purely
    based on the force flag. So even when handle_ingest returned
    {success: false, error: 'PG::StringDataRightTruncation...'}, the caller
    counted the chunk as persisted.

Fix

File 1: lib/legion/extensions/knowledge/helpers/chunker.rb

Before:

def build_chunk(section, content, index)
  {
    content:      content,
    heading:      section[:heading],
    section_path: section[:section_path],
    source_file:  section[:source_file],
    token_count:  (content.length.to_f / CHARS_PER_TOKEN).ceil,
    chunk_index:  index,
    content_hash: ::Digest::SHA256.hexdigest(content)
  }
end

After:

# Hash must match Legion::Extensions::Apollo::Helpers::Writeback.content_hash so
# chunks land in apollo_entries.content_hash (CHARACTER(32) — MD5 length) without
# PG::StringDataRightTruncation. A bare SHA-256 (64 chars) silently fails the INSERT
# and the chain of rescues returns the chunk as "persisted" without actually writing.
def build_chunk(section, content, index)
  {
    content:      content,
    heading:      section[:heading],
    section_path: section[:section_path],
    source_file:  section[:source_file],
    token_count:  (content.length.to_f / CHARS_PER_TOKEN).ceil,
    chunk_index:  index,
    content_hash: apollo_compatible_content_hash(content)
  }
end

def apollo_compatible_content_hash(content)
  # `defined?(Const)` is unreliable under Zeitwerk autoloading: it returns nil
  # before the constant has been autoloaded and truthy after, so the same
  # chunk content could yield different hashes within a single request.
  # Memoize a stable Gem.loaded_specs check instead.
  return inline_md5_normalized(content) unless apollo_loaded?

  Legion::Extensions::Apollo::Helpers::Writeback.content_hash(content)
rescue StandardError => e
  # Defense-in-depth: if the delegate raises (e.g. helper internal error or
  # autoload race), fall back to the byte-for-byte equivalent inline
  # implementation so the chunker itself is never the source of a silent
  # ingest failure.
  log.warn(
    '[knowledge][chunker] apollo Writeback.content_hash delegate failed, ' \
    "using fallback: #{e.class}: #{e.message}"
  )
  inline_md5_normalized(content)
end
private_class_method :apollo_compatible_content_hash

def apollo_loaded?
  @apollo_loaded = Gem.loaded_specs.key?('lex-apollo') if @apollo_loaded.nil?
  @apollo_loaded
end
private_class_method :apollo_loaded?

# Byte-for-byte equivalent of lex-apollo's Writeback.content_hash, quoted in
# Root cause above. Kept inline here so the chunker can still run when
# lex-apollo is not loaded (e.g. unit tests, knowledge-only deployments).
def inline_md5_normalized(content)
  normalized = content.to_s.strip.downcase.gsub(/\s+/, ' ')
  ::Digest::MD5.hexdigest(normalized)
end
private_class_method :inline_md5_normalized

# Local `log` helper added to satisfy the org-wide `helper-logging-direct`
# lint rule (no direct `Legion::Logging.warn(...)` in `lib/`). Mirrors the
# same pattern used in `Helpers::Manifest` and `Runners::Ingest` in this
# gem. The rescue branch above invokes `log.warn(...)` rather than calling
# `Legion::Logging.warn(...)` directly.
def log
  Legion::Logging
end
private_class_method :log

File 2: lib/legion/extensions/knowledge/runners/ingest.rb

Before:

def upsert_chunk_with_embedding(chunk, embedding, dry_run: false, force: false, exists: false)
  return :created if dry_run
  return :created unless defined?(Legion::Extensions::Apollo)
  return :skipped if !force && exists

  ingest_to_apollo(chunk, embedding)
  force ? :updated : :created
rescue StandardError => e
  log.warn(e.message)
  :skipped
end

After:

def upsert_chunk_with_embedding(chunk, embedding, dry_run: false, force: false, exists: false)
  return :created if dry_run
  return :created unless defined?(Legion::Extensions::Apollo)
  return :skipped if !force && exists

  result = ingest_to_apollo(chunk, embedding)
  # handle_ingest returns a Hash on both success and failure paths. Treat
  # anything that is NOT an explicit success — nil, non-Hash, Hash without
  # :success, {success: false}, or any other truthy-but-not-:success value —
  # conservatively as :skipped. Previously the check was `result[:success] == false`,
  # which silently fell through to :created/:updated for nil, missing keys,
  # or non-Hash returns, producing false-positive responses to callers.
  unless result.is_a?(Hash) && result[:success] == true
    hash_prefix = chunk[:content_hash]&.slice(0, 12)
    content_len = chunk[:content]&.length
    err = result.is_a?(Hash) ? result[:error].inspect : "non-hash result class=#{result.class}"
    log.warn(
      '[knowledge][upsert_chunk] apollo persistence not confirmed ' \
      "error=#{err} chunk_hash=#{hash_prefix} chunk_len=#{content_len}"
    )
    return :skipped
  end
  force ? :updated : :created
rescue StandardError => e
  log.warn("[knowledge][upsert_chunk] unexpected error class=#{e.class} message=#{e.message} chunk_hash=#{chunk[:content_hash]&.slice(0, 12)}")
  :skipped
end

Counter aggregation — where :skipped lands in the API response

The outer caller is Runners::Ingest.process_file (lib/legion/extensions/knowledge/runners/ingest.rb). It runs each chunk through upsert_chunk_with_embedding and aggregates the returned symbols into named counters that are then surfaced in the /api/knowledge/ingest response:

# lib/legion/extensions/knowledge/runners/ingest.rb (process_file)
created = 0
skipped = 0
updated = 0

paired.each do |p|
  outcome = upsert_chunk_with_embedding(p[:chunk], p[:embedding],
                                        dry_run: dry_run, force: force,
                                        exists: p[:exists] || false)
  case outcome
  when :created then created += 1
  when :skipped then skipped += 1
  when :updated then updated += 1
  end
end

{ created: created, skipped: skipped, updated: updated }

So a chunk that this PR now reports as :skipped (apollo persistence not
confirmed, or ingest_to_apollo raised) is correctly counted in
chunks_skipped rather than silently dropped from all counters. Callers
that previously saw a false-positive chunks_created: N will now see
chunks_skipped: N plus a warn log per chunk explaining why — same total
chunk count, truthful labels.

Tests

Full suite: 200 examples, 0 failures. Rubocop: 0 offenses.

New / updated specs

spec/legion/extensions/knowledge/helpers/chunker_spec.rb

  • Updated computes a sha256 content_hash for each chunk to
    computes a 32-char (MD5-length) content_hash for each chunk — asserts the
    new /\A[0-9a-f]{32}\z/ pattern. The pre-existing 64-char assertion was
    codifying the bug.
  • Added content_hash matches MD5 of whitespace-normalized content — asserts
    Digest::MD5.hexdigest(content.to_s.strip.downcase.gsub(/\s+/, ' '))
    equivalence, keeping the apollo semantics pinned.
  • Added delegates to Apollo::Helpers::Writeback.content_hash when defined
    stubs a fake Writeback with a sentinel return and asserts the chunker
    delegates to it when Apollo is loaded.
  • Added fallback path produces digests identical to Apollo Writeback for the same input — drift guard: drives several real-world content samples (empty
    string, multi-paragraph markdown, content with mixed casing and tabs) through
    BOTH the inline fallback (inline_md5_normalized) and lex-apollo's
    Writeback.content_hash (when loaded), asserts the resulting hex digests
    are byte-identical. Catches future drift if either side's normalization
    changes.
  • Added falls back to inline implementation when delegate raises — stubs
    Writeback.content_hash to raise RuntimeError, asserts the chunker still
    returns a 32-char digest equal to the inline computation and emits a warn
    log naming the failed delegate. Validates the rescue in
    apollo_compatible_content_hash.

spec/legion/extensions/knowledge/runners/ingest_spec.rb

New describe '.upsert_chunk_with_embedding — persistence outcome propagation'
block with 9 examples:

  • when dry_run: true returns :created without contacting apollo
  • when Legion::Extensions::Apollo is not defined returns :created
  • when apollo is defined:
    • returns :skipped when exists: true and force: false
    • returns :created when handle_ingest returns {success: true, ...}
    • returns :updated when force: true and handle_ingest returns
      {success: true, ...}
    • returns :skipped and emits a warn log when handle_ingest returns
      {success: false, ...}
      — core regression guard for this bug
    • returns :skipped when handle_ingest returns a non-Hash result
      (e.g. nil or any unexpected value) — covers the strict
      unless result.is_a?(Hash) && result[:success] == true check; previously
      the :created fallback silently swallowed these
    • returns :skipped when handle_ingest returns a Hash without a
      :success key
      — same defensive check, different shape
    • returns :skipped and logs when ingest_to_apollo raises

Version

0.6.70.6.8 (patch bump; bug fix, no API break).

If the companion PR (fix/manifest-scan-tolerate-eperm on the same fork)
lands first and bumps to 0.6.8, this PR will need to be retargeted to
0.6.9 during merge. Both branches are independent against main at
0.6.7.

CHANGELOG entry added under [0.6.8]Fixed::

  • Chunker.build_chunk now produces a 32-char MD5 content_hash matching
    Legion::Extensions::Apollo::Helpers::Writeback.content_hash (delegates
    when lex-apollo is loaded; equivalent inline fallback otherwise) instead
    of a 64-char SHA-256. Previously, every INSERT into apollo_entries
    silently failed with PG::StringDataRightTruncation against the
    CHARACTER(32) column, the error was rescued without logging, and
    upsert_chunk_with_embedding reported false-positive chunks_created
    counts to /api/knowledge/ingest callers — corpus content appeared to
    be ingested but persisted no rows.
  • upsert_chunk_with_embedding now treats anything other than an explicit
    {success: true} from handle_ingest as :skipped (with a warn log),
    instead of silently returning :created/:updated based on the force
    flag.

Live validation

The fix was developed by patching a locally-installed (Homebrew Cellar) build
of lex-knowledge and verifying end-to-end behavior against a running
legionio daemon on http://127.0.0.1:4567. All commands and outputs below
were captured from that environment; redacted placeholders are marked ....

Before the fix — silent failure

Diagnostic warn logs added to Apollo::Runners::Knowledge.handle_ingest (being
filed separately as a defense-in-depth PR on LegionIO/lex-apollo) captured
the underlying Postgres error on every INSERT attempt from the ingest path:

WARN [apollo][handle_ingest] Sequel::Error
  class=Sequel::DatabaseError
  message=PG::StringDataRightTruncation: ERROR:  value too long for type character(32)
  content_type=document_chunk
  backtrace_head=... handle_ingest ... ApolloEntry.create ...

The error was caught by rescue Sequel::Error, converted to
{success: false, error: ...}, and returned into the caller
(upsert_chunk_with_embedding) which discarded the Hash and reported
:created / :updated to the outer API response based only on the force
flag. Net effect: POST /api/knowledge/ingest returned HTTP 200 with a
non-zero chunks_created while zero rows landed in apollo_entries.

After the fix — round-trip works

Fresh ingest of a 195-byte markdown probe (brand-new content hash, no dedup
path):

$ curl -sS -X POST http://127.0.0.1:4567/api/knowledge/ingest \
    -H 'Content-Type: application/json' \
    -d '{"path":"/tmp/legion-roundtrip-corpus/final-omega.md"}'
{"data":{"success":true,"file":"/tmp/legion-roundtrip-corpus/final-omega.md",
         "chunks_created":1,"chunks_skipped":0,"chunks_updated":0},
 "meta":{...}}

Immediately retrievable via Apollo semantic query:

$ curl -sS -X POST http://127.0.0.1:4567/api/apollo/query \
    -H 'Content-Type: application/json' \
    -d '{"query":"FINAL_VALIDATION_PROBE_OMEGA ROUND_TRIP_SIGMA_ETA",
         "status":["candidate","confirmed"],
         "limit":3,"min_confidence":0.0}'
{"data":{"success":true,
         "entries":[{
           "id":"...",
           "content":"FINAL_VALIDATION_PROBE_OMEGA_... confirms round-trip works end-to-end.\nResponsible party: ...\nIdentifier: ROUND_TRIP_SIGMA_ETA.",
           "content_type":"document_chunk",
           "confidence":0.5,
           "distance":0.268,
           "tags":["...","final-omega","documentchunk"],
           "knowledge_domain":"..."
         }],
         "count":1},
 "meta":{...}}

distance=0.268 (cosine) is a strong semantic match for the query terms that
appear literally in the content. content_type=document_chunk confirms the
row was written via the lex-knowledge ingest path (not a manual
/api/apollo/ingest call, which would default to observation).

The same round-trip exercised via the skill-level wrapper (shell script
calling legionio knowledge ingest) produced chunks_created: 1 and a
subsequent query distance of 0.238, confirming the fix works from the
CLI/skill layer and not only from direct HTTP ingest.

Known deviations from org guidelines

Two soft-conflicts with LegionIO/.github/CONTRIBUTING.md, flagged for
reviewer transparency rather than hidden:

  1. Commit message length. The first line of the commit
    (fix(chunker): match apollo MD5 content_hash schema; surface persistence failures,
    84 chars) exceeds the 72-char convention from CONTRIBUTING.md. Happy to
    amend with git commit --amend to either shorten the subject (e.g.
    fix(chunker): match apollo MD5 content_hash schema (32 chars), 60
    chars) and move the rest into the body, or split into two commits. Did
    not amend pre-emptively to avoid invalidating the existing review thread
    on the fork.

  2. Two coupled fixes in one PR. Strictly per "one concern per PR" this
    could be split into two: (a) chunker hash format change, (b)
    upsert_chunk_with_embedding return-value propagation. Defended here
    because either fix alone leaves the silent-failure intact: with (a) but
    not (b), a future hash-vs-schema regression would still be silent at
    the upsert layer; with (b) but not (a), the body of the persistence
    failure (the truncation error) becomes visible but ingests still
    produce zero rows. The two are diagnostically and operationally
    inseparable. Open to splitting if the reviewer prefers.

Checklist

  • Tests pass (bundle exec rspec) — 200/200 full suite, including the
    9-example upsert_chunk_with_embedding — persistence outcome propagation
    block and updated chunker specs
  • RuboCop passes (bundle exec rubocop) — 37 files inspected, no offenses
  • CHANGELOG.md updated ([0.6.8]Fixed: entry shown in Version
    section, including the version-collision note with the companion manifest
    walker PR)
  • No new security concerns introduced — MD5 is used only for content
    deduplication (non-cryptographic context); Gem.loaded_specs.key? check
    reads gem metadata, no user input; the rescue widening in
    apollo_compatible_content_hash catches StandardError (does not
    swallow SystemExit, SignalException, Interrupt, or
    NoMemoryError).

…ce failures

Chunker.build_chunk used Digest::SHA256.hexdigest(content) — 64 hex chars.
Apollo's apollo_entries.content_hash column is CHARACTER(32) (MD5 length).
PG silently rejected every INSERT with PG::StringDataRightTruncation,
which handle_ingest caught via rescue Sequel::Error and returned as
{success: false, error: ...}. lex-knowledge's upsert_chunk_with_embedding
ignored that return value and reported :created/:updated based purely
on the `force` flag — producing false-positive responses to every
/api/knowledge/ingest call. No chunks actually persisted to apollo_entries.

Fix: chunker now calls Legion::Extensions::Apollo::Helpers::Writeback
.content_hash when available (normalize + MD5 → 32 chars), with an
inline fallback that preserves the same semantics when apollo isn't
loaded. upsert_chunk_with_embedding now returns :skipped when
handle_ingest reports failure, so callers see truthful status.

Live validation: fresh ingest of a 1490-byte markdown now produces
chunks_created:1 and an apollo query by the content's distinctive
tokens returns the row with distance 0.27 (strong semantic match).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Esity Esity self-requested a review April 27, 2026 15:30
Copy link
Copy Markdown
Contributor

@Esity Esity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix the merge conflicts and we can merge this

@Esity
Copy link
Copy Markdown
Contributor

Esity commented Apr 28, 2026

Updated the PR branch with a normal merge from current main. The merge conflicts are resolved, the release metadata is now 0.6.10 on top of v0.6.9, and I tightened the Apollo persistence outcome handling so only an explicit {success: true} is counted as created/updated.

Validation:

  • Local: bundle exec rspec --format json --out tmp/rspec_results.json --format progress --out tmp/rspec_progress.txt passed
  • Local: bundle exec rubocop -A passed with 0 offenses
  • GitHub checks: RSpec 3.4, RSpec 4.0, RuboCop, Bundler Audit, Dependency Review, excluded-files, version, and changelog all passed

GitHub now reports the PR as mergeable. The prior changes-requested review appears to be stale from the old conflicting commit and needs reviewer follow-up before merge under our normal PR rules.

@Esity Esity self-requested a review April 28, 2026 16:28
@Esity Esity merged commit 9082176 into LegionIO:main Apr 28, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants