fix(chunker): match apollo MD5 content_hash schema; surface persistence failures #11
Merged
Esity merged 3 commits into LegionIO:main on Apr 28, 2026
Chunker.build_chunk used Digest::SHA256.hexdigest(content) — 64 hex chars.
Apollo's apollo_entries.content_hash column is CHARACTER(32) (MD5 length).
PG silently rejected every INSERT with PG::StringDataRightTruncation,
which handle_ingest caught via rescue Sequel::Error and returned as
{success: false, error: ...}. lex-knowledge's upsert_chunk_with_embedding
ignored that return value and reported :created/:updated based purely
on the `force` flag — producing false-positive responses to every
/api/knowledge/ingest call. No chunks actually persisted to apollo_entries.
Fix: chunker now calls Legion::Extensions::Apollo::Helpers::Writeback
.content_hash when available (normalize + MD5 → 32 chars), with an
inline fallback that preserves the same semantics when apollo isn't
loaded. upsert_chunk_with_embedding now returns :skipped when
handle_ingest reports failure, so callers see truthful status.
Live validation: fresh ingest of a 1490-byte markdown now produces
chunks_created:1 and an apollo query by the content's distinctive
tokens returns the row with distance 0.27 (strong semantic match).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Esity requested changes, Apr 27, 2026
Esity (Contributor) left a comment:
Fix the merge conflicts and we can merge this.
Contributor
Updated the PR branch with a normal merge from current main. The merge conflicts are resolved, the release metadata is now 0.6.10 on top of v0.6.9, and I tightened the Apollo persistence outcome handling so only an explicit …
Validation: …
GitHub now reports the PR as mergeable. The prior changes-requested review appears to be stale from the old conflicting commit and needs reviewer follow-up before merge under our normal PR rules.
Esity
approved these changes
Apr 28, 2026
fix(chunker): match apollo MD5 content_hash schema; surface persistence failures
Summary
`Chunker.build_chunk` emitted 64-char SHA-256 `content_hash` values, but Apollo's `apollo_entries.content_hash` column is `CHARACTER(32)` (MD5 length). Postgres rejected every INSERT with `PG::StringDataRightTruncation`, the error was swallowed inside `Apollo::Runners::Knowledge.handle_ingest`'s `rescue Sequel::Error`, and `lex-knowledge`'s `upsert_chunk_with_embedding` ignored the `{success: false, ...}` return value, reporting false-positive `chunks_created:` counts to every caller. Nothing actually persisted to `apollo_entries`. This PR aligns the hash format with `Legion::Extensions::Apollo::Helpers::Writeback.content_hash` and makes the upsert surface propagate persistence failures as `:skipped` with a warn log.
Symptom users see
- `POST /api/knowledge/ingest` with a corpus path returns HTTP 200 with `chunks_created: N` (non-zero).
- `POST /api/apollo/query` against the same content returns zero sources every time.
- `SELECT count(*) FROM apollo_entries WHERE content_type='document_chunk'` stays at zero: the rows never landed.
- There is no sign of trouble on the LLM side; only the DB INSERT silently fails.
- `handle_ingest` catches `Sequel::Error` without logging and returns `{success: false, error: ...}` into a caller that discards the return value.
End result: operators believe the corpus has been ingested, but `/api/apollo/query` will never retrieve anything because `apollo_entries` is empty.
Root cause
Two layered bugs double-silenced the persistence failure:
1. Hash width mismatch. `lib/legion/extensions/knowledge/helpers/chunker.rb` computed `content_hash` as `Digest::SHA256.hexdigest(content)`, which is 64 hex chars. Apollo's schema defines the column as `CHARACTER(32)` (designed for MD5), so PG returned `PG::StringDataRightTruncation: ERROR: value too long for type character(32)` on every INSERT.
For comparison, the existing `Legion::Extensions::Apollo::Helpers::Writeback.content_hash` in `lex-apollo` (`lib/legion/extensions/apollo/helpers/writeback.rb`) returns 32 hex chars and fits the column. `lex-knowledge` should match this contract: same input → same digest, regardless of which gem is at the write site, so dedup behaves consistently across callers.
2. Persistence outcome ignored. `upsert_chunk_with_embedding` called `ingest_to_apollo` (which wraps `Apollo::Runners::Knowledge.handle_ingest`) but discarded the return value, returning `:created`/`:updated` purely based on the `force` flag. So even when `handle_ingest` returned `{success: false, error: 'PG::StringDataRightTruncation...'}`, the caller counted the chunk as persisted.
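To make the width mismatch concrete, here is a minimal Ruby sketch comparing the two digests against a 32-character limit. The normalization follows the expression quoted in the chunker spec; the helper names (`sha256_content_hash`, `normalized_md5_content_hash`, `fits_column?`) and the sample string are hypothetical and not taken from either gem.

```ruby
require 'digest'

# Width of apollo_entries.content_hash as described above: CHARACTER(32).
CONTENT_HASH_WIDTH = 32

# Pre-fix behaviour: raw SHA-256 of the content, always 64 hex chars.
def sha256_content_hash(content)
  Digest::SHA256.hexdigest(content)
end

# Post-fix behaviour as described: normalize, then MD5 (always 32 hex chars).
def normalized_md5_content_hash(content)
  Digest::MD5.hexdigest(content.to_s.strip.downcase.gsub(/\s+/, ' '))
end

# Hypothetical helper: would this digest fit the fixed-width column?
def fits_column?(digest)
  digest.length <= CONTENT_HASH_WIDTH
end

sample = "  Hello   Apollo\n\tWorld  " # hypothetical content
puts sha256_content_hash(sample).length               # => 64
puts fits_column?(sha256_content_hash(sample))         # => false
puts normalized_md5_content_hash(sample).length        # => 32
puts fits_column?(normalized_md5_content_hash(sample)) # => true
```

Because the input is normalized before hashing, strings that differ only in case or whitespace produce the same digest, which is what lets dedup behave consistently across callers.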
Fix
File 1: `lib/legion/extensions/knowledge/helpers/chunker.rb`
Before:
After:
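The following is a rough sketch of the delegate-with-fallback hashing described for this file, not the PR's actual diff. The method names `apollo_compatible_content_hash` and `inline_md5_normalized` come from the spec descriptions in this PR; the wrapper module `ChunkerHashSketch` and the use of `defined?` to detect Apollo are assumptions made for illustration (the checklist suggests the real code consults `Gem.loaded_specs`).

```ruby
require 'digest'

# Hypothetical container module; the real methods live in the chunker helper.
module ChunkerHashSketch
  module_function

  # Inline fallback: strip, downcase, collapse whitespace, then MD5.
  def inline_md5_normalized(content)
    Digest::MD5.hexdigest(content.to_s.strip.downcase.gsub(/\s+/, ' '))
  end

  # Prefer Apollo's helper when it is loaded (assumed detection via defined?).
  # Any StandardError from the delegate falls back to the inline computation.
  def apollo_compatible_content_hash(content)
    if defined?(Legion::Extensions::Apollo::Helpers::Writeback)
      Legion::Extensions::Apollo::Helpers::Writeback.content_hash(content)
    else
      inline_md5_normalized(content)
    end
  rescue StandardError => e
    warn "Writeback.content_hash failed (#{e.class}); using inline fallback"
    inline_md5_normalized(content)
  end
end
```

With Apollo absent, `ChunkerHashSketch.apollo_compatible_content_hash("A  b\n")` returns the MD5 of `"a b"`, so both code paths are meant to yield the same 32-char digest for the same input.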
File 2: `lib/legion/extensions/knowledge/runners/ingest.rb`
Before:
After:
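As an illustration of the outcome propagation described for this file (again, not the actual diff), the sketch below maps an ingest result to a status symbol following the behaviours listed in the spec section. The method name `upsert_outcome`, its keyword arguments, and passing the ingest step as a callable are all hypothetical simplifications.

```ruby
# Hypothetical stand-in for the status logic of upsert_chunk_with_embedding.
# `ingest` is any callable playing the role of ingest_to_apollo.
def upsert_outcome(ingest:, exists: false, force: false, dry_run: false, apollo_loaded: true)
  # Dry runs and Apollo-less environments never touch the database.
  return :created if dry_run || !apollo_loaded
  # Existing chunk without force: nothing to write.
  return :skipped if exists && !force

  result = ingest.call
  # Strict check: only a Hash with success == true counts as persisted.
  unless result.is_a?(Hash) && result[:success] == true
    warn "apollo persistence not confirmed: #{result.inspect}"
    return :skipped
  end

  force ? :updated : :created
rescue StandardError => e
  warn "ingest_to_apollo raised #{e.class}: #{e.message}"
  :skipped
end

failing = -> { { success: false, error: 'PG::StringDataRightTruncation...' } }
puts upsert_outcome(ingest: failing) # => skipped
```

The strict `result[:success] == true` comparison is what turns `nil`, non-Hash values, and Hashes without a `:success` key into `:skipped` rather than a silent `:created`.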
Counter aggregation: where `:skipped` lands in the API response
The outer caller is `Runners::Ingest.process_file` (`lib/legion/extensions/knowledge/runners/ingest.rb`). It runs each chunk through `upsert_chunk_with_embedding` and aggregates the returned symbols into named counters that are then surfaced in the `/api/knowledge/ingest` response.
So a chunk that this PR now reports as `:skipped` (apollo persistence not confirmed, or `ingest_to_apollo` raised) is correctly counted in `chunks_skipped` rather than silently dropped from all counters. Callers that previously saw a false-positive `chunks_created: N` will now see `chunks_skipped: N` plus a warn log per chunk explaining why: same total chunk count, truthful labels.
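A minimal sketch of that aggregation, assuming one status symbol per chunk, might look like the following. `aggregate_outcomes` is a hypothetical name, and `chunks_updated` is an assumed counter; only `chunks_created` and `chunks_skipped` are named in this PR.

```ruby
# Hypothetical aggregation of per-chunk status symbols into response counters.
def aggregate_outcomes(outcomes)
  counts = outcomes.tally # e.g. { created: 1, skipped: 2 }
  {
    chunks_created: counts.fetch(:created, 0),
    chunks_updated: counts.fetch(:updated, 0), # assumed counter name
    chunks_skipped: counts.fetch(:skipped, 0)
  }
end

p aggregate_outcomes(%i[created skipped skipped])
```

Every chunk lands in exactly one counter, so the counters always sum to the number of chunks processed.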
Tests
Full suite: `200 examples, 0 failures`. Rubocop: `0 offenses`.
New / updated specs
`spec/legion/extensions/knowledge/helpers/chunker_spec.rb`
- `computes a sha256 content_hash for each chunk` → `computes a 32-char (MD5-length) content_hash for each chunk`: asserts the new `/\A[0-9a-f]{32}\z/` pattern. The pre-existing 64-char assertion was codifying the bug.
- `content_hash matches MD5 of whitespace-normalized content`: asserts `Digest::MD5.hexdigest(content.to_s.strip.downcase.gsub(/\s+/, ' '))` equivalence, keeping the apollo semantics pinned.
- `delegates to Apollo::Helpers::Writeback.content_hash when defined`: stubs a fake `Writeback` with a sentinel return and asserts the chunker delegates to it when Apollo is loaded.
- `fallback path produces digests identical to Apollo Writeback for the same input`: drift guard. Drives several real-world content samples (empty string, multi-paragraph markdown, content with mixed casing and tabs) through both the inline fallback (`inline_md5_normalized`) and `lex-apollo`'s `Writeback.content_hash` (when loaded), and asserts the resulting hex digests are byte-identical. Catches future drift if either side's normalization changes.
- `falls back to inline implementation when delegate raises`: stubs `Writeback.content_hash` to raise `RuntimeError`, asserts the chunker still returns a 32-char digest equal to the inline computation and emits a warn log naming the failed delegate. Validates the rescue in `apollo_compatible_content_hash`.
`spec/legion/extensions/knowledge/runners/ingest_spec.rb`
New `describe '.upsert_chunk_with_embedding — persistence outcome propagation'` block with 9 examples:
- `when dry_run: true`: returns `:created` without contacting apollo
- `when Legion::Extensions::Apollo is not defined`: returns `:created`
- `when apollo is defined`:
  - `:skipped` when `exists: true` and `force: false`
  - `:created` when `handle_ingest` returns `{success: true, ...}`
  - `:updated` when `force: true` and `handle_ingest` returns `{success: true, ...}`
  - `:skipped` and emits a warn log when `handle_ingest` returns `{success: false, ...}` (core regression guard for this bug)
  - `:skipped` when `handle_ingest` returns a non-Hash result (e.g. `nil` or any unexpected value): covers the strict `unless result.is_a?(Hash) && result[:success] == true` check; previously the `:created` fallback silently swallowed these
  - `:skipped` when `handle_ingest` returns a Hash without a `:success` key: same defensive check, different shape
  - `:skipped` and logs when `ingest_to_apollo` raises
Version
`0.6.7` → `0.6.8` (patch bump; bug fix, no API break).
If the companion PR (`fix/manifest-scan-tolerate-eperm` on the same fork) lands first and bumps to `0.6.8`, this PR will need to be retargeted to `0.6.9` during merge. Both branches are independent against `main` at `0.6.7`.
CHANGELOG entry added under `[0.6.8]` → `Fixed:`.
Live validation
The fix was developed by patching a locally-installed (Homebrew Cellar) build of `lex-knowledge` and verifying end-to-end behavior against a running `legionio` daemon on `http://127.0.0.1:4567`. All commands and outputs below were captured from that environment; redacted placeholders are marked `...`.
Before the fix: silent failure
Diagnostic warn logs added to `Apollo::Runners::Knowledge.handle_ingest` (being filed separately as a defense-in-depth PR on `LegionIO/lex-apollo`) captured the underlying Postgres error, `PG::StringDataRightTruncation`, on every INSERT attempt from the ingest path.
The error was caught by `rescue Sequel::Error`, converted to `{success: false, error: ...}`, and returned into the caller (`upsert_chunk_with_embedding`), which discarded the Hash and reported `:created`/`:updated` to the outer API response based only on the `force` flag. Net effect: `POST /api/knowledge/ingest` returned HTTP 200 with a non-zero `chunks_created` while zero rows landed in `apollo_entries`.
After the fix: round-trip works
A fresh ingest of a 195-byte markdown probe (brand-new content hash, no dedup path) was immediately retrievable via Apollo semantic query.
`distance=0.268` (cosine) is a strong semantic match for the query terms that appear literally in the content. `content_type=document_chunk` confirms the row was written via the lex-knowledge ingest path (not a manual `/api/apollo/ingest` call, which would default to `observation`).
The same round-trip exercised via the skill-level wrapper (shell script calling `legionio knowledge ingest`) produced `chunks_created: 1` and a subsequent query distance of `0.238`, confirming the fix works from the CLI/skill layer and not only from direct HTTP ingest.
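For readers unfamiliar with the distance figures, the sketch below shows one common definition of cosine distance, one minus cosine similarity, under which 0 means identical direction and values near 1 mean unrelated vectors. Whether Apollo computes its `distance` exactly this way is an assumption; `cosine_distance` and the toy vectors are hypothetical.

```ruby
# Cosine distance as 1 - cosine similarity (assumed definition).
# Inputs are equal-length numeric arrays with non-zero magnitude.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

puts cosine_distance([1.0, 0.0], [1.0, 0.0]) # => 0.0 (same direction)
puts cosine_distance([1.0, 0.0], [0.0, 1.0]) # => 1.0 (orthogonal)
```

Under this reading, the reported distances of 0.268 and 0.238 sit much closer to "same direction" than to "orthogonal".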
Known deviations from org guidelines
Two soft-conflicts with `LegionIO/.github/CONTRIBUTING.md`, flagged for reviewer transparency rather than hidden:
1. Commit message length. The first line of the commit (`fix(chunker): match apollo MD5 content_hash schema; surface persistence failures`, 84 chars) exceeds the 72-char convention from CONTRIBUTING.md. Happy to amend with `git commit --amend` to either shorten the subject (e.g. `fix(chunker): match apollo MD5 content_hash schema (32 chars)`, 60 chars) and move the rest into the body, or split into two commits. Did not amend pre-emptively to avoid invalidating the existing review thread on the fork.
2. Two coupled fixes in one PR. Strictly per "one concern per PR" this could be split into two: (a) chunker hash format change, (b) `upsert_chunk_with_embedding` return-value propagation. Defended here because either fix alone leaves the silent failure intact: with (a) but not (b), a future hash-vs-schema regression would still be silent at the upsert layer; with (b) but not (a), the body of the persistence failure (the truncation error) becomes visible but ingests still produce zero rows. The two are diagnostically and operationally inseparable. Open to splitting if the reviewer prefers.
Checklist
- … (`bundle exec rspec`): 200/200 full suite, including the 9-example `upsert_chunk_with_embedding — persistence outcome propagation` block and updated chunker specs
- … (`bundle exec rubocop`): 37 files inspected, no offenses
- … (`[0.6.8]` → `Fixed:` entry shown in Version section, including the version-collision note with the companion manifest walker PR)
- … deduplication (non-cryptographic context); `Gem.loaded_specs.key?` check reads gem metadata, no user input; the rescue widening in `apollo_compatible_content_hash` catches `StandardError` (does not swallow `SystemExit`, `SignalException`, `Interrupt`, or `NoMemoryError`).
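The last checklist point relies on Ruby's exception hierarchy: `rescue StandardError` only catches `StandardError` and its descendants. A small sketch that checks this for the four classes named above follows; `caught_by_standard_error?` and `with_fallback` are hypothetical helpers written for this illustration.

```ruby
# Classes the checklist says must not be swallowed by the rescue.
FATAL_CLASSES = [SystemExit, SignalException, Interrupt, NoMemoryError].freeze

# True when a `rescue StandardError` clause would catch the given class.
def caught_by_standard_error?(klass)
  klass.ancestors.include?(StandardError)
end

# Hypothetical helper: run the block, use the fallback only on StandardError.
def with_fallback(fallback)
  yield
rescue StandardError
  fallback
end

FATAL_CLASSES.each do |klass|
  puts "#{klass}: #{caught_by_standard_error?(klass)}"
end
puts with_fallback(:inline) { raise 'delegate failed' } # => inline
```

A `RuntimeError` from the delegate is absorbed and replaced by the fallback, while an `Interrupt` or `SystemExit` raised inside the block would still propagate to the caller.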