feat: search poll retry + checkpoint/resume #253

Merged
tomaz-lc merged 1 commit into cli-v2 from feat-search-poll-retry-checkpoint on Mar 17, 2026

Conversation


@tomaz-lc tomaz-lc commented Mar 16, 2026

Adds several features to search run:

  1. SDK-level poll retry with exponential backoff - Individual poll GET requests that fail with transient errors (5xx, connection reset, timeout, SSL errors) are now automatically retried with exponential backoff (1s, 2s, 4s... capped at 30s, default 3 retries). Permanent errors (401, 403, 404, 422, 429) and search-engine body errors (e.g. "context canceled") are NOT retried.

  2. CLI-level checkpoint/resume - New --checkpoint, --resume, and --force flags allow incremental persistence of search results to a JSONL file. Resume uses server-side pagination tokens to skip directly to the next un-fetched page (no re-fetching). The server re-runs the query from the cursor position embedded in the token, so resume works even after long delays between sessions - no server-side TTL limitation.

  3. search checkpoints - Lists all local checkpoints with pages, events, time range progress %, token, and timestamps. Automatically cleans up stale metadata when data files are deleted.

  4. search checkpoint-show - Displays results from a checkpoint data file through the same output pipeline as a live search (table, JSON, expand, raw, JSONL). Streams lazily for JSONL output.

  5. Security hardening - Checkpoint metadata directories (~/.limacharlie.d/search_checkpoints/) are created with 0o700 (owner-only) permissions via new secure_makedirs() helper. Data files use O_EXCL|O_NOFOLLOW for atomic creation with symlink rejection. Metadata reads use safe_open_read(). All files get 0o600.

Also fixes a pre-existing bug in client.py where KeyboardInterrupt during u.read() left the data variable unbound.
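The client.py fix follows the standard pattern of pre-binding a variable before a try block whose body can be interrupted. A minimal illustration of the pattern (the read_response helper and Fake reader here are hypothetical, not the actual client code):

```python
def read_response(u):
    data = None  # the fix: pre-bind so the interrupt path can't hit an unbound name
    try:
        data = u.read()
    except KeyboardInterrupt:
        # Before the fix, if the interrupt landed inside u.read(),
        # referencing `data` after the try block raised UnboundLocalError.
        pass
    return data
```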

Future: config directory migration

Currently all LimaCharlie config lives in a flat file at ~/.limacharlie. This PR introduces ~/.limacharlie.d/ as a sibling directory for new persistent data (checkpoints). Eventually the plan is to migrate all LimaCharlie files to a proper config directory - either ~/.limacharlie.d/, ~/.config/limacharlie/ (XDG on Linux), or the OS-recommended location (e.g. %APPDATA%/limacharlie on Windows). That migration requires more work to handle backward compatibility correctly (detecting old flat-file config, migrating credentials, ensuring no security regression). This PR establishes the .d pattern as a first step without breaking existing config.

Example Usage

Run a search with checkpointing

limacharlie search run \
    --query "* | NEW_PROCESS | event/COMMAND_LINE contains 'curl'" \
    --start 1700000000 --end 1700086400 \
    --stream event \
    --checkpoint /tmp/my_search.jsonl
Sample output (search completes successfully)
Running search... query_id: a22bb8e7-b233-491f-8f6c-b2707bfdb20a
Waiting for results... (page 1, 0 events, 0s elapsed)
Fetching page 2... (2,171 events, 9s elapsed, token=...YWdlIjoyNX0=)
Fetching page 3... (4,099 events, 17s elapsed, token=...ZXh0UGFnZSI6M30=)
Checkpoint complete: 3 results saved to /tmp/my_search.jsonl
time                 stream  routing.event_type  routing.hostname  event.FILE_PATH    event.COMMAND_LINE
2023-11-15 00:12:34  event   NEW_PROCESS         web-01            /usr/bin/curl       curl https://example.com
2023-11-15 00:15:02  event   NEW_PROCESS         web-02            /usr/bin/curl       curl -s https://api.test.io
...
(4,099 events)
Stats: matched: 4,099, scanned: 1,203,441, bytes: 2.1 GB, time: 18.3s, cost: $0.0042

Interrupt a search (Ctrl+C)

limacharlie search run \
    --query "* | * | *" \
    --start 1771065591 --end 1773657591 \
    --checkpoint /tmp/big_search.jsonl
# ... press Ctrl+C after a while
Sample output (Ctrl+C mid-search)
Running search... query_id: 3454721a-50fb-4a25-abf6-21f3b5c3064f
Fetching page 2... (2,171 events, 9s elapsed, token=...YWdlIjoxfQ==)
Fetching page 3... (4,099 events, 17s elapsed, token=...YWdlIjoyNX0=)
Fetching page 4... (6,250 events, 25s elapsed, token=...ZXh0UGFnZSI6M30=)
^C
Search canceled. Checkpoint saved: /tmp/big_search.jsonl
  Pages fetched:  4
  Results:        4
  Total events:   6,250
  Resume token:   lc1:eyJ2IjoxLCJjb250ZXh0IjoiZXlKMmFXVjNJam9pWVhOamMwMTBaQ0lzSW5CaFoyVWlPak4w

  Resume with:    limacharlie search run --resume --checkpoint /tmp/big_search.jsonl

Resume a search

limacharlie search run --resume --checkpoint /tmp/big_search.jsonl
Sample output (resume picks up from page 4)
Resuming from page 4 (4 results already fetched)...
Running search... query_id: 55dd0e29-494e-4d6e-8851-19aadd807b7f
Resuming from page 4 (token=...ZXh0UGFnZSI6M30=)
Fetching page 5... (8,412 events, 7s elapsed, token=...UGFnZSI6NX0=)
Fetching page 6... (10,583 events, 15s elapsed, token=...UGFnZSI6Nn0=)
Resume complete: 4 new results (8 total) saved to /tmp/big_search.jsonl
time                 stream  routing.event_type  routing.hostname  ...
...
(10,583 events)

Resume starts a fresh search session but passes the stored pagination cursor to the server, which re-runs the query from that position. No pages are re-fetched. Works even days after the original search.

Display results from a checkpoint

# Table output (default)
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl
Sample table output
Checkpoint: completed, 3 pages, 4,099 events, 3 results
Query: * | NEW_PROCESS | event/COMMAND_LINE contains 'curl'
time                 stream  routing.event_type  routing.hostname  event.FILE_PATH    event.COMMAND_LINE
2023-11-15 00:12:34  event   NEW_PROCESS         web-01            /usr/bin/curl       curl https://example.com
2023-11-15 00:15:02  event   NEW_PROCESS         web-02            /usr/bin/curl       curl -s https://api.test.io
...
(4,099 events)
Stats: matched: 4,099, scanned: 1,203,441, bytes: 2.1 GB
# Expanded JSON events
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl --expand

# JSON for scripting
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl --output json

# JSONL - streams lazily without loading full file into memory
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl --output jsonl | jq '.rows[].data'

# Raw API result objects
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl --raw

Force overwrite / context-aware errors

Completed checkpoint
Error: Checkpoint already exists and is completed (3 results, 4,099 events).
Use --force to overwrite with a new search, or delete the file and retry.

  View results:  limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl
  Overwrite:     add --force to your command
In-progress checkpoint
Error: Checkpoint already exists and is in-progress (page 4, 4 results, 6,250 events).
Use --resume to continue from where it left off, --force to overwrite, or delete the file and retry.

  Resume:    limacharlie search run --resume --checkpoint /tmp/big_search.jsonl
  Overwrite: add --force to your command

List checkpoints

limacharlie search checkpoints
Sample table output
data_file             query                 range                          pages  events  progress          status       token             created              updated
/tmp/big_search.jsonl * | * | *             2026-02-15 00:00 - 2026-03-16 4      6,250   42% (02-27 18:30) in-progress  ...YWdlIjoyNX0=   2026-03-16 10:30:00  2026-03-16 10:35:00
/tmp/my_search.jsonl  * | NEW_PROCESS | ... 2023-11-15 00:00 - 2023-11-16 3      4,099   100% (11-16 ...)  completed                      2026-03-16 09:00:00  2026-03-16 09:20:00

Stale checkpoints (deleted data files) and corrupt metadata are automatically cleaned up on listing.

Automatic retry on transient errors

Sample output when a transient 502 occurs mid-search
Fetching page 2... (2,171 events, 12s elapsed, token=...YWdlIjoyNX0=)
Retrying poll (attempt 2/4, waiting 1s)...
Fetching page 3... (4,099 events, 20s elapsed, token=...ZXh0UGFnZSI6M30=)

Retries happen transparently with exponential backoff (1s, 2s, 4s... max 30s). Only transient errors (5xx, connection reset, timeout) are retried.
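The classification and backoff policy above can be sketched as follows. This is an illustration under stated assumptions, not the SDK's actual implementation: the names is_transient and poll_with_retry, and the .status attribute on the raised exception, are all hypothetical.

```python
import time

TRANSIENT_STATUS = {500, 502, 503, 504}   # retried
PERMANENT_STATUS = {401, 403, 404, 422, 429}  # not retried

def is_transient(status_code):
    # Network-level failures (connection reset, timeout, SSL) would also
    # qualify; only the HTTP-status path is shown here.
    return status_code in TRANSIENT_STATUS

def poll_with_retry(poll_once, max_retries=3, cap=30.0, sleep=time.sleep):
    """Retry poll_once() on transient errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return poll_once()
        except Exception as e:
            status = getattr(e, "status", None)
            if attempt == max_retries or not is_transient(status):
                raise  # permanent error, or retries exhausted
            sleep(min(2 ** attempt, cap))  # 1s, 2s, 4s... capped at 30s
```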

Flag validation

| Scenario | Result |
| --- | --- |
| --resume without --checkpoint | Error |
| --checkpoint existing.jsonl without --resume/--force | Context-aware error (completed vs. in-progress) |
| --resume with --query/--start/--end/--stream | Error (params come from checkpoint) |
| --resume with --limit | Allowed |
| --resume on completed checkpoint | No-op, outputs existing results |
| --force with --checkpoint | Deletes existing, starts fresh |

Architecture

Server-side resume via pagination token

The pagination token contains a cursor encoding the position in the result set. On resume, a fresh search session is initiated (new queryId) and the stored token is passed to the first poll. The server extracts the cursor from the token and re-runs the query starting from that position. This means:

  • Resume works even after long delays between sessions (no TTL limitation)
  • No pages are re-fetched - the server jumps directly to the right position
  • Falls back to re-fetch + skip only when no token is stored
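The resume flow described above reduces to a small loop: start a fresh session, then seed the first poll with the stored token. A hypothetical sketch only — the start_search/poll callables and the checkpoint dict keys are illustrative stand-ins, not the real SDK API:

```python
def resume_search(start_search, poll, checkpoint):
    """Yield (page, events) pairs, starting from a stored pagination cursor.

    start_search(query) -> new query_id (a fresh session each resume)
    poll(query_id, token=...) -> {"events": [...], "next_token": ... or None}
    """
    query_id = start_search(checkpoint["query"])  # new queryId every resume
    token = checkpoint.get("token")               # stored server-side cursor
    page = checkpoint.get("page", 1)
    while True:
        result = poll(query_id, token=token)      # server jumps to cursor
        yield page, result["events"]
        token = result.get("next_token")
        if token is None:
            break  # search complete
        page += 1
```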

Two-file checkpoint design

  • Data file (user-specified): Append-only JSONL with 0o600 permissions
  • Metadata file (~/.limacharlie.d/search_checkpoints/<sha256>.meta.json): Atomic JSON with query params, progress, token. 0o600 permissions. Paired via SHA-256 of data file path.
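The SHA-256 pairing can be sketched in a few lines. This helper is hypothetical; the real module may normalize paths differently before hashing:

```python
import hashlib
import os
from pathlib import Path

def metadata_path(data_file, base="~/.limacharlie.d/search_checkpoints"):
    # Pair the metadata file with its data file via SHA-256 of the data
    # file's absolute path: two checkpoints can only collide if they point
    # at the exact same path, and moving the data file yields a new pairing.
    abs_path = os.path.abspath(os.path.expanduser(data_file))
    digest = hashlib.sha256(abs_path.encode()).hexdigest()
    return Path(base).expanduser() / (digest + ".meta.json")
```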

Security model

  • Metadata directories: 0o700 via secure_makedirs() (also tightens existing dirs)
  • Metadata files: 0o600 via atomic_write(), reads via safe_open_read() (symlink rejection)
  • Data files: 0o600, created with O_EXCL|O_NOFOLLOW (atomic TOCTOU prevention + symlink rejection)
  • list_checkpoints validates data_file is absolute path (rejects path traversal from corrupt metadata)
  • Shell-escaped suggested commands via shlex.quote()
  • Cross-platform: Windows relies on home directory ACLs
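The directory and data-file creation described above can be sketched with stdlib calls alone. The helper names mirror the PR's secure_makedirs, but these bodies are illustrative, not the actual implementation:

```python
import os

def secure_makedirs(path, mode=0o700):
    # Create owner-only; chmod afterwards also tightens a pre-existing
    # directory (and sidesteps the umask applied during creation).
    os.makedirs(path, mode=mode, exist_ok=True)
    os.chmod(path, mode)

def create_data_file(path):
    # O_EXCL: fail if the file already exists, atomically (no TOCTOU window
    # between an existence check and the open). O_NOFOLLOW: refuse to write
    # through a symlink planted at the target path.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    if hasattr(os, "O_NOFOLLOW"):  # not available on every platform
        flags |= os.O_NOFOLLOW
    fd = os.open(path, flags, 0o600)  # owner read/write only
    return os.fdopen(fd, "w")
```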

Performance model

  • Streaming JSONL: CheckpointReader.iter_results() is a generator that streams one line at a time. checkpoint-show --output jsonl uses this to output results without loading the full file into memory.
  • Table/expand/JSON output: requires the full result list in memory because table formatting needs all rows upfront for column width computation. For large checkpoints, use --output jsonl and pipe to jq or other streaming tools.
  • Resume loop: does not accumulate results in memory during the search. After completion, reads the final data file for output.
  • Resumer init: counts lines (count_results) without loading data.
  • Corrupt line detection: uses one-line lookahead in the generator, not two-pass.
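The one-line-lookahead streaming reader can be sketched as a generator. Illustrative only; the real CheckpointReader.iter_results() may differ in details such as how the tolerated final line is reported:

```python
import json

def iter_results(path):
    """Stream JSONL records one at a time with one-line lookahead.

    A corrupt line mid-file raises ValueError, but a corrupt *last* line
    (a partially written record from an interrupted run) is tolerated.
    """
    with open(path) as f:
        prev = None
        for line in f:
            if prev is not None:
                # prev has a successor, so it must parse cleanly;
                # json.JSONDecodeError is a ValueError subclass.
                yield json.loads(prev)
            prev = line
        if prev is not None and prev.strip():
            try:
                yield prev and json.loads(prev)
            except json.JSONDecodeError:
                pass  # truncated final record from an interrupted write
```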

Blast radius / isolation

  • SDK (search.py): Only poll loop affected. New start_token/start_page additive. Other operations unchanged.
  • CLI (commands/search.py): run() refactored. New checkpoint-show subcommand. Other search commands untouched.
  • New module (checkpoint.py): Self-contained.
  • file_utils.py: Added secure_dir_permissions()/secure_makedirs() - additive.
  • Bug fix (client.py): Single line - data=None before try block.
  • NOT affected: All other CLI commands, SDK modules, config, auth, JWT cache.

Test plan

1634 total tests pass (170+ new), 2 pre-existing skips.

SDK tests (42): retry classification (18), retry behavior (15), cancellation (9)

Checkpoint unit tests (50): writer (11), reader (8 incl. iter_results, count_results, mid-file corruption), resumer (4), path derivation (3), listing/cleanup (6), permissions (6), checkpoint ID (3), context managers (7)

File utils security tests (7): dir permissions (3), makedirs (4)

Integration tests (32): end-to-end (2), resume with token (3), correctness (4), cancellation (4), error recovery (4), context managers (7), listing (1)

CLI integration tests via CliRunner (47): checkpoint creation (6), resume (2), resume validation (5), checkpoints listing (4), non-checkpoint search (3), cancel message (1), rejected token (3), checkpoint-show (9 incl. streaming JSONL), _is_token_expired_error classifier (10), _build_fresh_query_cmd (4)

🤖 Generated with Claude Code

@tomaz-lc force-pushed the feat-search-poll-retry-checkpoint branch 15 times, most recently from ebd467c to 9150475 on March 16, 2026 at 16:26
…e support

SDK layer: add _is_transient_poll_error() classification and _poll_with_retry()
method to Search. Transient errors (5xx, ConnectionError, TimeoutError, SSLError)
are retried with exponential backoff (2^attempt, capped at 30s). Permanent errors
(401, 403, 404, 422, 429) and search-engine body errors are not retried.

Add start_token/start_page parameters to execute() for server-side resume.
Resume passes the stored pagination token to the server which re-runs the
query from the cursor position embedded in the token. Works even after long
delays between sessions.

CLI layer: add --checkpoint, --resume, and --force flags to 'search run'.
Add 'search checkpoints' to list checkpoints and 'search checkpoint-show'
to display results from a checkpoint file through the same output pipeline
as live searches. Context-aware error messages when checkpoint file exists.
On Ctrl+C prints session stats and exact resume command.

New limacharlie/search_checkpoint.py module (named to clarify it is
search-specific, leaving room for other checkpoint mechanisms in the future).

Performance: CheckpointReader.iter_results() streams JSONL lazily via
generator. checkpoint-show uses streaming for JSONL output. Resume loop
does not accumulate results in memory. Table/expand/JSON formats require
the full result set in memory for column width computation; for large
checkpoints use --output jsonl which streams.

Security: checkpoint directories 0o700, files 0o600. Data files use
O_EXCL|O_NOFOLLOW. Metadata reads use safe_open_read(). Path validation
rejects non-absolute paths. Shell-escapes via shlex.quote.

Robustness: corrupt mid-file JSONL lines raise ValueError (only last line
tolerated via lookahead parsing). secure_makedirs tightens existing dirs.

Also fixes pre-existing bug in client.py where KeyboardInterrupt during
u.read() left 'data' variable unbound.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tomaz-lc force-pushed the feat-search-poll-retry-checkpoint branch from 9150475 to 8e9d110 on March 16, 2026 at 16:42
@tomaz-lc tomaz-lc marked this pull request as ready for review March 16, 2026 17:14
@tomaz-lc tomaz-lc requested a review from maximelb March 16, 2026 17:14
@tomaz-lc tomaz-lc merged commit 158bd63 into cli-v2 Mar 17, 2026
1 check passed
@tomaz-lc tomaz-lc deleted the feat-search-poll-retry-checkpoint branch March 17, 2026 07:52