fix(manifest): tolerate EPERM/EACCES during corpus walk #12

Merged
Esity merged 1 commit into LegionIO:main from armstrongsamr:fix/manifest-scan-tolerate-eperm
Apr 27, 2026

@armstrongsamr

Summary

Manifest.scan previously used Ruby's Find.find with no per-entry rescue. On macOS, any path walk that descends into a TCC-protected directory (e.g. ~/Library/Accounts, ~/Library/Mail, ~/Library/Safari) raises Errno::EPERM from the enclosing dir_initialize call and aborts the entire traversal. Because Manifest.scan is invoked downstream of Runners::Ingest.scan_corpus and POST /api/knowledge/status, that single unreadable subdir surfaces to the client as an HTTP 500.

This PR replaces Find.find with a small recursive walker built on Dir.children, with a per-directory rescue for the well-known unreadable-tree errnos. Unreadable subdirs are pruned at debug level; sibling paths continue to be scanned. The public scan(path:, extensions:) signature and return shape ({ path:, size:, mtime:, sha256: }) are preserved.

Repro

On macOS with the daemon running:

cd ~
curl -s -X POST http://127.0.0.1:4567/api/knowledge/status \
  -H 'Content-Type: application/json' -d '{"path":"/Users/<you>"}'
# => HTTP 500
# => Errno::EPERM @ dir_initialize - /Users/<you>/Library/Accounts

The same call against any path that does not contain a TCC-protected subtree (e.g. /tmp) succeeds. Note that this bug composes with the Dir.pwd default fix being filed separately against LegionIO/legionio (/api/knowledge/status previously inherited the daemon's cwd as the default path). The walker fix here is independently valuable: even with the path default fixed upstream, callers can still pass $HOME or any other path containing a TCC-protected directory explicitly, and the scan should not crash on a single unreadable subdir.

The relevant stack trace from a live daemon log on 2026-04-24:

ERROR Errno::EPERM: Operation not permitted @ dir_initialize - /Users/<you>/Library/Accounts
  lex-knowledge-0.6.7/lib/legion/extensions/knowledge/helpers/manifest.rb:16:in 'Manifest.scan'
  lex-knowledge-0.6.7/lib/legion/extensions/knowledge/runners/ingest.rb:21:in 'Ingest.scan_corpus'
  legionio-1.9.0/lib/legion/api/knowledge.rb:71:in 'block in register_ingest_routes'
INFO [api] POST /api/knowledge/status 500

Root cause

Find.find does not rescue per-entry. When dir_initialize raises while opening a subdirectory, the exception propagates out of the entire Find.find block and the calling method, terminating the scan. There is no way to instruct Find to skip a specific erroring subdirectory and continue with siblings.
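
The propagation described above can be illustrated with a small simulation. This is a sketch, not the production failure: the EPERM is raised by hand from inside the block (standing in for the error macOS raises on a TCC-protected directory), and the directory names a_locked and b_readable are made up for the demo. It shows that once an exception escapes the traversal, the remaining siblings are never visited.

```ruby
require 'find'
require 'fileutils'
require 'tmpdir'

visited = []
error = nil

Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'a_locked'))
  FileUtils.mkdir_p(File.join(root, 'b_readable'))
  File.write(File.join(root, 'b_readable', 'note.md'), 'hello')

  begin
    Find.find(root) do |entry|
      # Simulated failure: a hand-raised EPERM standing in for the real
      # "Operation not permitted @ dir_initialize" error.
      raise Errno::EPERM, entry if File.basename(entry) == 'a_locked'

      visited << File.basename(entry)
    end
  rescue Errno::EPERM => e
    # The whole traversal has already stopped by the time we get here.
    error = e
  end
end

puts "aborted with #{error.class}; visited: #{visited.inspect}"
```

Find.find yields directory children in sorted order, so a_locked is reached before b_readable and the readable sibling (and its note.md) is never yielded.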

Fix

Before (lib/legion/extensions/knowledge/helpers/manifest.rb):

require 'find'
# ...
def scan(path:, extensions: %w[.md .txt .docx .pdf])
  results = []

  Find.find(path) do |entry|
    basename = ::File.basename(entry)
    Find.prune if basename.start_with?('.')

    next unless ::File.file?(entry)
    next unless extensions.include?(::File.extname(entry).downcase)

    results << build_entry(entry)
  end

  results
end

After:

def scan(path:, extensions: %w[.md .txt .docx .pdf])
  results = []
  walk(path, extensions, results)
  results
end

def walk(entry, extensions, results)
  basename = ::File.basename(entry)
  return if basename.start_with?('.')

  if ::File.directory?(entry)
    ::Dir.children(entry).each { |c| walk(::File.join(entry, c), extensions, results) }
  elsif ::File.file?(entry) && extensions.include?(::File.extname(entry).downcase)
    results << build_entry(entry)
  end
rescue Errno::EPERM, Errno::EACCES, Errno::ELOOP, Errno::ENOENT => e
  log.debug("[manifest] skipping unreadable #{entry}: #{e.class}: #{e.message}")
end
private_class_method :walk

def log
  Legion::Logging
end
private_class_method :log
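
For readers who want to try the walker outside the gem, the following is a standalone sketch of the same shape. Several names are assumptions made for the demo and are not in the PR: SketchManifest, the list_children seam (which raises EPERM for a directory named Accounts to mimic a TCC-protected path), the SKIPPED array (standing in for log.debug), and the body of build_entry (written to match the entry shape named in the Summary).

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Standalone sketch of the per-directory rescue; not the gem's code.
module SketchManifest
  SKIPPABLE = [Errno::EPERM, Errno::EACCES, Errno::ELOOP, Errno::ENOENT].freeze
  SKIPPED = [] # stand-in for log.debug so the demo needs no logger

  module_function

  def scan(path:, extensions: %w[.md .txt .docx .pdf])
    results = []
    walk(path, extensions, results)
    results
  end

  def walk(entry, extensions, results)
    return if File.basename(entry).start_with?('.')

    if File.directory?(entry)
      list_children(entry).each { |c| walk(File.join(entry, c), extensions, results) }
    elsif File.file?(entry) && extensions.include?(File.extname(entry).downcase)
      results << build_entry(entry)
    end
  rescue *SKIPPABLE => e
    # Only this entry's subtree is dropped; the caller's loop keeps going.
    SKIPPED << "#{File.basename(entry)}: #{e.class}"
  end

  # Hypothetical seam: simulate an unreadable directory by name.
  def list_children(dir)
    raise Errno::EPERM, dir if File.basename(dir) == 'Accounts'

    Dir.children(dir)
  end

  # Assumed implementation matching { path:, size:, mtime:, sha256: }.
  def build_entry(file)
    { path: file, size: File.size(file), mtime: File.mtime(file),
      sha256: Digest::SHA256.file(file).hexdigest }
  end
end

found = Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'Accounts'))
  FileUtils.mkdir_p(File.join(root, 'Documents'))
  File.write(File.join(root, 'Accounts', 'secret.md'), 'unreachable')
  File.write(File.join(root, 'Documents', 'Notes.MD'), 'reachable')
  SketchManifest.scan(path: root).map { |e| File.basename(e[:path]) }
end

puts found.inspect                   # ["Notes.MD"]
puts SketchManifest::SKIPPED.inspect # ["Accounts: Errno::EPERM"]
```

Because the rescue is attached to walk itself, the exception is absorbed in the recursive call for Accounts, and the each loop in the parent call moves on to Documents.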

Design choices

  1. Find.find removed entirely. require 'find' is no longer needed — manifest.rb was the only consumer in the gem.

  2. Dir.children instead of Dir.entries. Dir.children already excludes the . and .. pseudo-entries, so there's no infinite-recursion guard needed and no per-entry filter for those.

  3. Rescue catches Errno::EPERM, EACCES, ELOOP, ENOENT. EPERM/EACCES handle the macOS TCC case and standard POSIX permission denials; ELOOP handles symlink cycles; ENOENT handles the race where a file disappears between directory listing and File.size/Digest::SHA256.file (common on macOS's ephemeral caches under ~/Library/Caches).

  4. log.debug, not log.warn. TCC-protected directories are expected to be unreadable on macOS for any process without Full Disk Access — emitting a warn-level entry per skipped path would generate a lot of noise on every scan. Debug is the correct level for "this is fine; move on."

  5. Local log private_class_method returning Legion::Logging. This matches the existing pattern in sibling files in this gem — concretely, lib/legion/extensions/knowledge/runners/ingest.rb:12-15 defines the same shape:

    def log
      Legion::Logging
    end
    private_class_method :log

    Reusing the pattern keeps the helper-vs-runner module conventions consistent across the gem.

  6. Public signature preserved. scan(path:, extensions:) and the entry shape { path:, size:, mtime:, sha256: } are unchanged. No breaking changes for Runners::Ingest.scan_corpus, Runners::Corpus.corpus_stats, or any caller via Legion::Apollo.

  7. Per-entry rescue placement. The rescue lives on walk, which means an EPERM on one subdirectory only prunes that subtree. The recursion that produced sibling subtrees is unaffected.
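
The difference behind design choice 2 is easy to check in isolation. A minimal sketch using a temporary directory (the file name a.md is arbitrary):

```ruby
require 'tmpdir'

entries, children = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.md'), '')
  # Sort both lists so the comparison does not depend on filesystem order.
  [Dir.entries(dir).sort, Dir.children(dir).sort]
end

puts entries.inspect  # [".", "..", "a.md"]
puts children.inspect # ["a.md"]
```

A walker built on Dir.entries would have to filter "." and ".." itself before recursing; Dir.children removes that step.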

Operational note on log level

log.debug is used intentionally (not warn) because TCC-protected directories
are expected and non-actionable on macOS — every scan that touches a home
directory will skip several. Operators running with LOG_LEVEL=DEBUG in
production should be aware that this method will emit one debug entry per
unreadable path containing the path string itself. If your environment forwards
debug-level entries to an aggregation backend and the path strings are
sensitive, either:

  • Keep LOG_LEVEL at INFO or higher in production (the default), or
  • Add a redaction filter on [manifest] skipping unreadable in your log
    pipeline.

Standard log aggregation pipelines filter debug entries by default, so this is
informational rather than a behavior change.
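
As one possible shape for the redaction filter mentioned above, here is a minimal sketch. The method name redact_manifest_skip and the [REDACTED] placeholder are hypothetical, and where such a filter would plug into a given log pipeline is deployment-specific. It assumes the message format shown in the Fix section, and it drops the exception message as well as the path because the errno message itself repeats the path.

```ruby
# Matches "[manifest] skipping unreadable <path>: <Errno class>: <message>".
MANIFEST_SKIP = /(\[manifest\] skipping unreadable ).*?: (Errno::\w+): .*/

# Hypothetical helper: keep the prefix and errno class, drop path and message.
def redact_manifest_skip(line)
  line.sub(MANIFEST_SKIP, '\1[REDACTED]: \2')
end

sample = 'DEBUG [manifest] skipping unreadable /Users/alice/Library/Accounts: ' \
         'Errno::EPERM: Operation not permitted @ dir_initialize - /Users/alice/Library/Accounts'

puts redact_manifest_skip(sample)
# DEBUG [manifest] skipping unreadable [REDACTED]: Errno::EPERM
```

Lines that do not match the pattern pass through unchanged, so the filter can sit in front of all debug output.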

Tests

Added to spec/legion/extensions/knowledge/helpers/manifest_spec.rb:

  • treats extension filter as case-insensitive: .MD, .TxT regression guard.
  • skips dot-directories and does not recurse into them: pruning regression guard.
  • skips unreadable directories and continues scanning siblings: Errno::EPERM from one sibling, asserts the other is still scanned.
  • skips unreadable directories raising Errno::EACCES: same shape, different errno.
  • skips multiple unreadable subdirs at different depths without failing: EPERM at depth 1 and EACCES at depth 2.
  • skips files that disappear between listing and read (ENOENT): stubs File.size to raise Errno::ENOENT; asserts the scan returns the surviving sibling.
  • does not crash when the scan root itself is unreadable: defensive guard for the case where the top-level path itself raises.

All tests use Dir.mktmpdir for real on-disk paths plus allow(...).to receive(...).and_raise(...) for the errno injection, so they don't depend on any host filesystem layout.

Result:

spec/legion/extensions/knowledge/helpers/manifest_spec.rb
  18 examples, 0 failures
bundle exec rspec
  197 examples, 0 failures
bundle exec rubocop
  37 files inspected, no offenses detected

Version

0.6.7 → 0.6.9 (skipping 0.6.8).

Version allocation: 0.6.8 is reserved for the companion fix/content-hash-md5-match-apollo-schema PR (chunker SHA-256 → MD5 hash fix). Both branches target 0.6.7 as their merge base; this PR claims 0.6.9 to avoid intra-batch collision. The order matches criticality: the content_hash fix addresses a silent corpus-ingest data loss (claim 0.6.8); this PR addresses a macOS-specific HTTP 500 (claim 0.6.9).

CHANGELOG entry added under [0.6.9] → Fixed:

  • Manifest.scan no longer crashes the entire corpus walk when it encounters
    an unreadable directory. Previously, Find.find had no per-entry rescue,
    so any Errno::EPERM/EACCES/ELOOP/ENOENT raised by a subdir aborted the
    whole traversal — most visibly on macOS, where TCC-protected directories
    like ~/Library/Accounts cause EPERM to bubble up through
    Runners::Ingest.scan_corpus to POST /api/knowledge/status as a 500.
    Replaced with a recursive walker built on Dir.children with a per-method
    rescue that prunes the offending subtree at debug level and continues
    scanning siblings. Public scan(path:, extensions:) signature and entry
    shape preserved.

Live validation

The same patch has been running on the local Cellar copy (/opt/homebrew/Cellar/legionio/1.9.0-1/libexec/lib/ruby/gems/3.4.0/gems/lex-knowledge-0.6.7/lib/legion/extensions/knowledge/helpers/manifest.rb, mtime 2026-04-24 15:12:36) for roughly 75 minutes prior to this PR being filed. The daemon log at /opt/homebrew/var/log/legion/legion.log shows:

  • The most recent Errno::EPERM @ dir_initialize - /Users/<you>/Library/Accounts stack trace is at 2026-04-24 14:56:25 — i.e. before the patch was applied.
  • Zero such entries appear after 2026-04-24 15:12 despite continued daemon activity (thousands of log lines in the window between 15:00 and 16:00).

POST /api/knowledge/status now returns a normal scan result for paths containing TCC-protected subtrees, where it previously returned 500.

Related

LegionIO/legionio has a separate PR being filed against lib/legion/api/knowledge.rb to remove the || Dir.pwd default on the /api/knowledge/status route. Both fixes target the same user-visible 500 but are independent: the API-level fix addresses "the daemon should not silently inherit its cwd"; this PR addresses "even when an explicit path is passed, a single unreadable subdir should not abort the whole scan." Either fix alone narrows the failure surface; together they close it.

Checklist

  • Tests pass (bundle exec rspec) — 18/18 in spec/legion/extensions/knowledge/helpers/manifest_spec.rb; 197/197 full suite
  • RuboCop passes (bundle exec rubocop) — 37 files inspected, no offenses
  • CHANGELOG.md updated ([0.6.9] → Fixed entry, shown in the Version section)
  • No new security concerns introduced: the debug-level log emits unreadable paths only at LOG_LEVEL=DEBUG (the default, INFO, filters them out). The "Operational note on log level" section above documents this for production deployments.

Find.find had no internal rescue. A single unreadable subdir (common
on macOS for TCC-protected paths like ~/Library/Accounts, Mail,
Safari) crashed the entire scan and cascaded to the knowledge status
HTTP endpoint as a 500.

Replaced with a recursive walker that rescues per-dir. Unreadable
subdirs are pruned with a debug log; scan continues with siblings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Esity merged commit fe77f92 into LegionIO:main on Apr 27, 2026
11 checks passed