feat: search poll retry + checkpoint/resume #253

Merged
tomaz-lc merged 1 commit into cli-v2 from feat-search-poll-retry-checkpoint on Mar 17, 2026

Conversation


@tomaz-lc tomaz-lc commented Mar 16, 2026

Adds several features to search run:

  1. SDK-level poll retry with exponential backoff - Individual poll GET requests that fail with transient errors (5xx, connection reset, timeout, SSL errors) are now automatically retried with exponential backoff (1s, 2s, 4s... capped at 30s, default 3 retries). Permanent errors (401, 403, 404, 422, 429) and search-engine body errors (e.g. "context canceled") are NOT retried.

  2. CLI-level checkpoint/resume - New --checkpoint, --resume, and --force flags allow incremental persistence of search results to a JSONL file. Resume uses server-side pagination tokens to skip directly to the next un-fetched page (no re-fetching). The server re-runs the query from the cursor position embedded in the token, so resume works even after long delays between sessions - no server-side TTL limitation.

  3. search checkpoints - Lists all local checkpoints with pages, events, time range progress %, token, and timestamps. Automatically cleans up stale metadata when data files are deleted.

  4. search checkpoint-show - Displays results from a checkpoint data file through the same output pipeline as a live search (table, JSON, expand, raw, JSONL). Streams lazily for JSONL output.

  5. Security hardening - Checkpoint metadata directories (~/.limacharlie.d/search_checkpoints/) are created with 0o700 (owner-only) permissions via new secure_makedirs() helper. Data files use O_EXCL|O_NOFOLLOW for atomic creation with symlink rejection. Metadata reads use safe_open_read(). All files get 0o600.

Also fixes a pre-existing bug in client.py where KeyboardInterrupt during u.read() left the data variable unbound.
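The client.py fix follows the standard pattern of pre-binding a variable before a try block whose body can be interrupted. A minimal illustration of the pattern (the read_response helper and Fake reader here are hypothetical, not the actual client code):

```python
def read_response(u):
    data = None  # the fix: pre-bind so the interrupt path can't hit an unbound name
    try:
        data = u.read()
    except KeyboardInterrupt:
        # Before the fix, if the interrupt landed inside u.read(),
        # referencing `data` after the try block raised UnboundLocalError.
        pass
    return data
```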

Future: config directory migration

Currently all LimaCharlie config lives in a flat file at ~/.limacharlie. This PR introduces ~/.limacharlie.d/ as a sibling directory for new persistent data (checkpoints). Eventually the plan is to migrate all LimaCharlie files to a proper config directory - either ~/.limacharlie.d/, ~/.config/limacharlie/ (XDG on Linux), or the OS-recommended location (e.g. %APPDATA%/limacharlie on Windows). That migration requires more work to handle backward compatibility correctly (detecting old flat-file config, migrating credentials, ensuring no security regression). This PR establishes the .d pattern as a first step without breaking existing config.

Example Usage

Run a search with checkpointing

limacharlie search run \
    --query "* | NEW_PROCESS | event/COMMAND_LINE contains 'curl'" \
    --start 1700000000 --end 1700086400 \
    --stream event \
    --checkpoint /tmp/my_search.jsonl
Sample output (search completes successfully)
Running search... query_id: a22bb8e7-b233-491f-8f6c-b2707bfdb20a
Waiting for results... (page 1, 0 events, 0s elapsed)
Fetching page 2... (2,171 events, 9s elapsed, token=...YWdlIjoyNX0=)
Fetching page 3... (4,099 events, 17s elapsed, token=...ZXh0UGFnZSI6M30=)
Checkpoint complete: 3 results saved to /tmp/my_search.jsonl
time                 stream  routing.event_type  routing.hostname  event.FILE_PATH    event.COMMAND_LINE
2023-11-15 00:12:34  event   NEW_PROCESS         web-01            /usr/bin/curl       curl https://example.com
2023-11-15 00:15:02  event   NEW_PROCESS         web-02            /usr/bin/curl       curl -s https://api.test.io
...
(4,099 events)
Stats: matched: 4,099, scanned: 1,203,441, bytes: 2.1 GB, time: 18.3s, cost: $0.0042

Interrupt a search (Ctrl+C)

limacharlie search run \
    --query "* | * | *" \
    --start 1771065591 --end 1773657591 \
    --checkpoint /tmp/big_search.jsonl
# ... press Ctrl+C after a while
Sample output (Ctrl+C mid-search)
Running search... query_id: 3454721a-50fb-4a25-abf6-21f3b5c3064f
Fetching page 2... (2,171 events, 9s elapsed, token=...YWdlIjoxfQ==)
Fetching page 3... (4,099 events, 17s elapsed, token=...YWdlIjoyNX0=)
Fetching page 4... (6,250 events, 25s elapsed, token=...ZXh0UGFnZSI6M30=)
^C
Search canceled. Checkpoint saved: /tmp/big_search.jsonl
  Pages fetched:  4
  Results:        4
  Total events:   6,250
  Resume token:   lc1:eyJ2IjoxLCJjb250ZXh0IjoiZXlKMmFXVjNJam9pWVhOamMwMTBaQ0lzSW5CaFoyVWlPak4w

  Resume with:    limacharlie search run --resume --checkpoint /tmp/big_search.jsonl

Resume a search

limacharlie search run --resume --checkpoint /tmp/big_search.jsonl
Sample output (resume picks up from page 4)
Resuming from page 4 (4 results already fetched)...
Running search... query_id: 55dd0e29-494e-4d6e-8851-19aadd807b7f
Resuming from page 4 (token=...ZXh0UGFnZSI6M30=)
Fetching page 5... (8,412 events, 7s elapsed, token=...UGFnZSI6NX0=)
Fetching page 6... (10,583 events, 15s elapsed, token=...UGFnZSI6Nn0=)
Resume complete: 4 new results (8 total) saved to /tmp/big_search.jsonl
time                 stream  routing.event_type  routing.hostname  ...
...
(10,583 events)

Resume starts a fresh search session but passes the stored pagination cursor to the server, which re-runs the query from that position. No pages are re-fetched. Works even days after the original search.

Display results from a checkpoint

# Table output (default)
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl
Sample table output
Checkpoint: completed, 3 pages, 4,099 events, 3 results
Query: * | NEW_PROCESS | event/COMMAND_LINE contains 'curl'
time                 stream  routing.event_type  routing.hostname  event.FILE_PATH    event.COMMAND_LINE
2023-11-15 00:12:34  event   NEW_PROCESS         web-01            /usr/bin/curl       curl https://example.com
2023-11-15 00:15:02  event   NEW_PROCESS         web-02            /usr/bin/curl       curl -s https://api.test.io
...
(4,099 events)
Stats: matched: 4,099, scanned: 1,203,441, bytes: 2.1 GB
# Expanded JSON events
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl --expand

# JSON for scripting
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl --output json

# JSONL - streams lazily without loading full file into memory
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl --output jsonl | jq '.rows[].data'

# Raw API result objects
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl --raw

Force overwrite / context-aware errors

Completed checkpoint
Error: Checkpoint already exists and is completed (3 results, 4,099 events).
Use --force to overwrite with a new search, or delete the file and retry.

  View results:  limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl
  Overwrite:     add --force to your command
In-progress checkpoint
Error: Checkpoint already exists and is in-progress (page 4, 4 results, 6,250 events).
Use --resume to continue from where it left off, --force to overwrite, or delete the file and retry.

  Resume:    limacharlie search run --resume --checkpoint /tmp/big_search.jsonl
  Overwrite: add --force to your command

List checkpoints

limacharlie search checkpoints
Sample table output
data_file             query                 range                          pages  events  progress          status       token             created              updated
/tmp/big_search.jsonl * | * | *             2026-02-15 00:00 - 2026-03-16 4      6,250   42% (02-27 18:30) in-progress  ...YWdlIjoyNX0=   2026-03-16 10:30:00  2026-03-16 10:35:00
/tmp/my_search.jsonl  * | NEW_PROCESS | ... 2023-11-15 00:00 - 2023-11-16 3      4,099   100% (11-16 ...)  completed                      2026-03-16 09:00:00  2026-03-16 09:20:00

Stale checkpoints (deleted data files) and corrupt metadata are automatically cleaned up on listing.

Automatic retry on transient errors

Sample output when a transient 502 occurs mid-search
Fetching page 2... (2,171 events, 12s elapsed, token=...YWdlIjoyNX0=)
Retrying poll (attempt 2/4, waiting 1s)...
Fetching page 3... (4,099 events, 20s elapsed, token=...ZXh0UGFnZSI6M30=)

Retries happen transparently with exponential backoff (1s, 2s, 4s... max 30s). Only transient errors (5xx, connection reset, timeout) are retried.
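The classification and backoff policy above can be sketched as follows. This is an illustration under stated assumptions, not the SDK's actual implementation: the names is_transient and poll_with_retry, and the .status attribute on the raised exception, are all hypothetical.

```python
import time

TRANSIENT_STATUS = {500, 502, 503, 504}   # retried
PERMANENT_STATUS = {401, 403, 404, 422, 429}  # not retried

def is_transient(status_code):
    # Network-level failures (connection reset, timeout, SSL) would also
    # qualify; only the HTTP-status path is shown here.
    return status_code in TRANSIENT_STATUS

def poll_with_retry(poll_once, max_retries=3, cap=30.0, sleep=time.sleep):
    """Retry poll_once() on transient errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return poll_once()
        except Exception as e:
            status = getattr(e, "status", None)
            if attempt == max_retries or not is_transient(status):
                raise  # permanent error, or retries exhausted
            sleep(min(2 ** attempt, cap))  # 1s, 2s, 4s... capped at 30s
```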

Flag validation

| Scenario | Result |
| --- | --- |
| --resume without --checkpoint | Error |
| --checkpoint existing.jsonl without --resume/--force | Context-aware error (completed vs. in-progress) |
| --resume with --query/--start/--end/--stream | Error (params come from checkpoint) |
| --resume with --limit | Allowed |
| --resume on completed checkpoint | No-op, outputs existing results |
| --force with --checkpoint | Deletes existing, starts fresh |

Architecture

Server-side resume via pagination token

The pagination token contains a cursor encoding the position in the result set. On resume, a fresh search session is initiated (new queryId) and the stored token is passed to the first poll. The server extracts the cursor from the token and re-runs the query starting from that position. This means:

  • Resume works even after long delays between sessions (no TTL limitation)
  • No pages are re-fetched - the server jumps directly to the right position
  • Falls back to re-fetch + skip only when no token is stored
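The resume flow described above reduces to a small loop: start a fresh session, then seed the first poll with the stored token. A hypothetical sketch only — the start_search/poll callables and the checkpoint dict keys are illustrative stand-ins, not the real SDK API:

```python
def resume_search(start_search, poll, checkpoint):
    """Yield (page, events) pairs, starting from a stored pagination cursor.

    start_search(query) -> new query_id (a fresh session each resume)
    poll(query_id, token=...) -> {"events": [...], "next_token": ... or None}
    """
    query_id = start_search(checkpoint["query"])  # new queryId every resume
    token = checkpoint.get("token")               # stored server-side cursor
    page = checkpoint.get("page", 1)
    while True:
        result = poll(query_id, token=token)      # server jumps to cursor
        yield page, result["events"]
        token = result.get("next_token")
        if token is None:
            break  # search complete
        page += 1
```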

Two-file checkpoint design

  • Data file (user-specified): Append-only JSONL with 0o600 permissions
  • Metadata file (~/.limacharlie.d/search_checkpoints/<sha256>.meta.json): Atomic JSON with query params, progress, token. 0o600 permissions. Paired via SHA-256 of data file path.
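The SHA-256 pairing can be sketched in a few lines. This helper is hypothetical; the real module may normalize paths differently before hashing:

```python
import hashlib
import os
from pathlib import Path

def metadata_path(data_file, base="~/.limacharlie.d/search_checkpoints"):
    # Pair the metadata file with its data file via SHA-256 of the data
    # file's absolute path: two checkpoints can only collide if they point
    # at the exact same path, and moving the data file yields a new pairing.
    abs_path = os.path.abspath(os.path.expanduser(data_file))
    digest = hashlib.sha256(abs_path.encode()).hexdigest()
    return Path(base).expanduser() / (digest + ".meta.json")
```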

Security model

  • Metadata directories: 0o700 via secure_makedirs() (also tightens existing dirs)
  • Metadata files: 0o600 via atomic_write(), reads via safe_open_read() (symlink rejection)
  • Data files: 0o600, created with O_EXCL|O_NOFOLLOW (atomic TOCTOU prevention + symlink rejection)
  • list_checkpoints validates data_file is absolute path (rejects path traversal from corrupt metadata)
  • Shell-escaped suggested commands via shlex.quote()
  • Cross-platform: Windows relies on home directory ACLs
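The directory and data-file creation described above can be sketched with stdlib calls alone. The helper names mirror the PR's secure_makedirs, but these bodies are illustrative, not the actual implementation:

```python
import os

def secure_makedirs(path, mode=0o700):
    # Create owner-only; chmod afterwards also tightens a pre-existing
    # directory (and sidesteps the umask applied during creation).
    os.makedirs(path, mode=mode, exist_ok=True)
    os.chmod(path, mode)

def create_data_file(path):
    # O_EXCL: fail if the file already exists, atomically (no TOCTOU window
    # between an existence check and the open). O_NOFOLLOW: refuse to write
    # through a symlink planted at the target path.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    if hasattr(os, "O_NOFOLLOW"):  # not available on every platform
        flags |= os.O_NOFOLLOW
    fd = os.open(path, flags, 0o600)  # owner read/write only
    return os.fdopen(fd, "w")
```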

Performance model

  • Streaming JSONL: CheckpointReader.iter_results() is a generator that streams one line at a time. checkpoint-show --output jsonl uses this to output results without loading the full file into memory.
  • Table/expand/JSON output: requires the full result list in memory because table formatting needs all rows upfront for column width computation. For large checkpoints, use --output jsonl and pipe to jq or other streaming tools.
  • Resume loop: does not accumulate results in memory during the search. After completion, reads the final data file for output.
  • Resumer init: counts lines (count_results) without loading data.
  • Corrupt line detection: uses one-line lookahead in the generator, not two-pass.
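The one-line-lookahead streaming reader can be sketched as a generator. Illustrative only; the real CheckpointReader.iter_results() may differ in details such as how the tolerated final line is reported:

```python
import json

def iter_results(path):
    """Stream JSONL records one at a time with one-line lookahead.

    A corrupt line mid-file raises ValueError, but a corrupt *last* line
    (a partially written record from an interrupted run) is tolerated.
    """
    with open(path) as f:
        prev = None
        for line in f:
            if prev is not None:
                # prev has a successor, so it must parse cleanly;
                # json.JSONDecodeError is a ValueError subclass.
                yield json.loads(prev)
            prev = line
        if prev is not None and prev.strip():
            try:
                yield prev and json.loads(prev)
            except json.JSONDecodeError:
                pass  # truncated final record from an interrupted write
```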

Blast radius / isolation

  • SDK (search.py): Only poll loop affected. New start_token/start_page additive. Other operations unchanged.
  • CLI (commands/search.py): run() refactored. New checkpoint-show subcommand. Other search commands untouched.
  • New module (checkpoint.py): Self-contained.
  • file_utils.py: Added secure_dir_permissions()/secure_makedirs() - additive.
  • Bug fix (client.py): Single line - data=None before try block.
  • NOT affected: All other CLI commands, SDK modules, config, auth, JWT cache.

Test plan

1634 total tests pass (170+ new), 2 pre-existing skips.

SDK tests (42): retry classification (18), retry behavior (15), cancellation (9)

Checkpoint unit tests (50): writer (11), reader (8 incl. iter_results, count_results, mid-file corruption), resumer (4), path derivation (3), listing/cleanup (6), permissions (6), checkpoint ID (3), context managers (7)

File utils security tests (7): dir permissions (3), makedirs (4)

Integration tests (32): end-to-end (2), resume with token (3), correctness (4), cancellation (4), error recovery (4), context managers (7), listing (1)

CLI integration tests via CliRunner (47): checkpoint creation (6), resume (2), resume validation (5), checkpoints listing (4), non-checkpoint search (3), cancel message (1), rejected token (3), checkpoint-show (9 incl. streaming JSONL), _is_token_expired_error classifier (10), _build_fresh_query_cmd (4)

🤖 Generated with Claude Code

@tomaz-lc force-pushed the feat-search-poll-retry-checkpoint branch 15 times, most recently from ebd467c to 9150475 on March 16, 2026 at 16:26
…e support

SDK layer: add _is_transient_poll_error() classification and _poll_with_retry()
method to Search. Transient errors (5xx, ConnectionError, TimeoutError, SSLError)
are retried with exponential backoff (2^attempt, capped at 30s). Permanent errors
(401, 403, 404, 422, 429) and search-engine body errors are not retried.

Add start_token/start_page parameters to execute() for server-side resume.
Resume passes the stored pagination token to the server which re-runs the
query from the cursor position embedded in the token. Works even after long
delays between sessions.

CLI layer: add --checkpoint, --resume, and --force flags to 'search run'.
Add 'search checkpoints' to list checkpoints and 'search checkpoint-show'
to display results from a checkpoint file through the same output pipeline
as live searches. Context-aware error messages when checkpoint file exists.
On Ctrl+C prints session stats and exact resume command.

New limacharlie/search_checkpoint.py module (named to clarify it is
search-specific, leaving room for other checkpoint mechanisms in the future).

Performance: CheckpointReader.iter_results() streams JSONL lazily via
generator. checkpoint-show uses streaming for JSONL output. Resume loop
does not accumulate results in memory. Table/expand/JSON formats require
the full result set in memory for column width computation; for large
checkpoints use --output jsonl which streams.

Security: checkpoint directories 0o700, files 0o600. Data files use
O_EXCL|O_NOFOLLOW. Metadata reads use safe_open_read(). Path validation
rejects non-absolute paths. Shell-escapes via shlex.quote.

Robustness: corrupt mid-file JSONL lines raise ValueError (only last line
tolerated via lookahead parsing). secure_makedirs tightens existing dirs.

Also fixes pre-existing bug in client.py where KeyboardInterrupt during
u.read() left 'data' variable unbound.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tomaz-lc force-pushed the feat-search-poll-retry-checkpoint branch from 9150475 to 8e9d110 on March 16, 2026 at 16:42
@tomaz-lc tomaz-lc marked this pull request as ready for review March 16, 2026 17:14
@tomaz-lc tomaz-lc requested a review from maximelb March 16, 2026 17:14
@tomaz-lc tomaz-lc merged commit 158bd63 into cli-v2 Mar 17, 2026
1 check passed
@tomaz-lc tomaz-lc deleted the feat-search-poll-retry-checkpoint branch March 17, 2026 07:52