feat: search poll retry + checkpoint/resume #253
Merged
…e support

SDK layer: add `_is_transient_poll_error()` classification and `_poll_with_retry()` method to `Search`. Transient errors (5xx, `ConnectionError`, `TimeoutError`, `SSLError`) are retried with exponential backoff (2^attempt, capped at 30s). Permanent errors (401, 403, 404, 422, 429) and search-engine body errors are not retried. Add `start_token`/`start_page` parameters to `execute()` for server-side resume. Resume passes the stored pagination token to the server, which re-runs the query from the cursor position embedded in the token. Works even after long delays between sessions.

CLI layer: add `--checkpoint`, `--resume`, and `--force` flags to `search run`. Add `search checkpoints` to list checkpoints and `search checkpoint-show` to display results from a checkpoint file through the same output pipeline as live searches. Context-aware error messages when a checkpoint file exists. On Ctrl+C, prints session stats and the exact resume command.

New `limacharlie/search_checkpoint.py` module (named to clarify it is search-specific, leaving room for other checkpoint mechanisms in the future).

Performance: `CheckpointReader.iter_results()` streams JSONL lazily via a generator. `checkpoint-show` uses streaming for JSONL output. The resume loop does not accumulate results in memory. Table/expand/JSON formats require the full result set in memory for column-width computation; for large checkpoints, use `--output jsonl`, which streams.

Security: checkpoint directories are 0o700, files 0o600. Data files use `O_EXCL|O_NOFOLLOW`. Metadata reads use `safe_open_read()`. Path validation rejects non-absolute paths. Shell-escapes via `shlex.quote`.

Robustness: corrupt mid-file JSONL lines raise `ValueError` (only the last line is tolerated, via lookahead parsing). `secure_makedirs` tightens existing dirs.

Also fixes a pre-existing bug in `client.py` where `KeyboardInterrupt` during `u.read()` left the `data` variable unbound.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
maximelb approved these changes on Mar 16, 2026.
Adds several features to `search run`:

- **SDK-level poll retry with exponential backoff** - Individual poll GET requests that fail with transient errors (5xx, connection reset, timeout, SSL errors) are now automatically retried with exponential backoff (1s, 2s, 4s... capped at 30s, default 3 retries). Permanent errors (401, 403, 404, 422, 429) and search-engine body errors (e.g. "context canceled") are NOT retried.
- **CLI-level checkpoint/resume** - New `--checkpoint`, `--resume`, and `--force` flags allow incremental persistence of search results to a JSONL file. Resume uses server-side pagination tokens to skip directly to the next un-fetched page (no re-fetching). The server re-runs the query from the cursor position embedded in the token, so resume works even after long delays between sessions - no server-side TTL limitation.
- **`search checkpoints`** - Lists all local checkpoints with pages, events, time-range progress %, token, and timestamps. Automatically cleans up stale metadata when data files are deleted.
- **`search checkpoint-show`** - Displays results from a checkpoint data file through the same output pipeline as a live search (table, JSON, expand, raw, JSONL). Streams lazily for JSONL output.
- **Security hardening** - Checkpoint metadata directories (`~/.limacharlie.d/search_checkpoints/`) are created with 0o700 (owner-only) permissions via a new `secure_makedirs()` helper. Data files use `O_EXCL|O_NOFOLLOW` for atomic creation with symlink rejection. Metadata reads use `safe_open_read()`. All files get 0o600.

Also fixes a pre-existing bug in `client.py` where `KeyboardInterrupt` during `u.read()` left the `data` variable unbound.

Future: config directory migration
Currently all LimaCharlie config lives in a flat file at `~/.limacharlie`. This PR introduces `~/.limacharlie.d/` as a sibling directory for new persistent data (checkpoints). Eventually the plan is to migrate all LimaCharlie files to a proper config directory - either `~/.limacharlie.d/`, `~/.config/limacharlie/` (XDG on Linux), or the OS-recommended location (e.g. `%APPDATA%/limacharlie` on Windows). That migration requires more work to handle backward compatibility correctly (detecting the old flat-file config, migrating credentials, ensuring no security regression). This PR establishes the `.d` pattern as a first step without breaking existing config.

Example Usage
Run a search with checkpointing
```shell
limacharlie search run \
  --query "* | NEW_PROCESS | event/COMMAND_LINE contains 'curl'" \
  --start 1700000000 --end 1700086400 \
  --stream event \
  --checkpoint /tmp/my_search.jsonl
```

Sample output (search completes successfully)
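The checkpoint data file is created with owner-only permissions before any results are appended. A minimal sketch of that creation step, assuming a hypothetical helper name (`create_checkpoint_file`) rather than the PR's actual internals:

```python
import os

def create_checkpoint_file(path):
    """Atomically create a checkpoint data file with 0o600 permissions.

    O_EXCL makes creation fail if the path already exists (including as a
    symlink), and O_NOFOLLOW refuses to follow a symlink, matching the
    security model described in this PR.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    # O_NOFOLLOW is POSIX-only; guard for portability.
    if hasattr(os, "O_NOFOLLOW"):
        flags |= os.O_NOFOLLOW
    fd = os.open(path, flags, 0o600)
    return os.fdopen(fd, "w", encoding="utf-8")
```

With this shape, a second attempt to create the same checkpoint raises `FileExistsError`, which is what lets the CLI produce its context-aware "checkpoint exists" errors.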
Interrupt a search (Ctrl+C)
```shell
limacharlie search run \
  --query "* | * | *" \
  --start 1771065591 --end 1773657591 \
  --checkpoint /tmp/big_search.jsonl
# ... press Ctrl+C after a while
```

Sample output (Ctrl+C mid-search)
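On Ctrl+C the CLI prints the exact command needed to resume. A sketch of how such a message can be assembled safely, with `shlex.quote` guarding the path as the PR describes (the function name and exact command shape here are illustrative):

```python
import shlex

def build_resume_command(checkpoint_path):
    """Build a copy-pasteable resume command.

    shlex.quote protects checkpoint paths containing spaces or shell
    metacharacters, so the printed command is safe to paste verbatim.
    """
    return "limacharlie search run --checkpoint %s --resume" % shlex.quote(checkpoint_path)
```

A path like `/tmp/my search.jsonl` comes out single-quoted, so the printed command survives a round-trip through the shell.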
Resume a search
Sample output (resume picks up from page 4)
Resume starts a fresh search session but passes the stored pagination cursor to the server, which re-runs the query from that position. No pages are re-fetched. Works even days after the original search.
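That flow can be sketched as a poll loop that seeds its cursor from the checkpoint instead of starting at page zero. Names and the `poll_page` callback shape are assumptions for illustration, not the SDK's actual API:

```python
def run_search(poll_page, start_token=None, start_page=0):
    """Drive a paginated search lazily.

    `poll_page(token)` is assumed to return (results, next_token); a None
    next_token means the search is complete. On resume, `start_token` is the
    cursor stored in the checkpoint, so the server restarts the query from
    that position and no earlier pages are re-fetched.
    """
    token = start_token
    page = start_page
    while True:
        results, token = poll_page(token)
        yield page, results
        page += 1
        if token is None:
            return
```

Because this is a generator, the resume loop never accumulates results in memory; each page can be appended to the checkpoint file as it arrives.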
Display results from a checkpoint
```shell
# Table output (default)
limacharlie search checkpoint-show --checkpoint /tmp/my_search.jsonl
```

Sample table output
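`checkpoint-show` streams JSONL lazily and applies the lookahead rule described in this PR: a corrupt line raises `ValueError` unless it is the final line, which a crashed write can legitimately leave truncated. A minimal sketch of such a reader, with an illustrative name rather than `CheckpointReader.iter_results()` itself:

```python
import json

def iter_jsonl(path):
    """Yield parsed objects one line at a time, without loading the file.

    A malformed line is tolerated only if nothing follows it (a torn final
    write); a malformed line with a successor means mid-file corruption and
    raises ValueError.
    """
    bad_lineno = None
    with open(path, "r", encoding="utf-8") as f:
        for lineno, raw in enumerate(f, 1):
            if not raw.strip():
                continue
            if bad_lineno is not None:
                # The malformed line had a successor: hard error.
                raise ValueError("corrupt JSONL at line %d" % bad_lineno)
            try:
                yield json.loads(raw)
            except json.JSONDecodeError:
                bad_lineno = lineno
```

Deferring the error by one line is what distinguishes an interrupted last write (recoverable) from corruption in the middle of the file (not).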
Force overwrite / context-aware errors
Completed checkpoint
In-progress checkpoint
List checkpoints
Sample table output
Stale checkpoints (deleted data files) and corrupt metadata are automatically cleaned up on listing.
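The cleanup rule can be sketched as: for each metadata file, reject non-absolute `data_file` paths and drop entries whose data file is gone. The directory layout and field names below follow the PR's description, but the function body is an illustrative reconstruction, not the actual implementation:

```python
import json
import os

def list_checkpoints(meta_dir):
    """Return valid checkpoint metadata entries; delete stale/corrupt ones.

    A metadata file is stale if its data file no longer exists, and invalid
    if it is not JSON or if `data_file` is not an absolute path (guarding
    against path traversal from tampered metadata).
    """
    entries = []
    for name in sorted(os.listdir(meta_dir)):
        if not name.endswith(".meta.json"):
            continue
        meta_path = os.path.join(meta_dir, name)
        try:
            with open(meta_path, "r", encoding="utf-8") as f:
                meta = json.load(f)
            data_file = meta["data_file"]
            if not os.path.isabs(data_file):
                raise ValueError("data_file must be absolute")
        except (ValueError, KeyError):
            os.unlink(meta_path)  # corrupt metadata: clean up
            continue
        if not os.path.exists(data_file):
            os.unlink(meta_path)  # stale: data file was deleted
            continue
        entries.append(meta)
    return entries
```

(`json.JSONDecodeError` is a subclass of `ValueError`, so the single except clause covers both unparseable and structurally invalid metadata.)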
Automatic retry on transient errors
Sample output when a transient 502 occurs mid-search
Retries happen transparently with exponential backoff (1s, 2s, 4s... max 30s). Only transient errors (5xx, connection reset, timeout) are retried.
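The retry policy above can be sketched as a small classifier plus a backoff loop. Helper names mirror but are not copied from the SDK (`_is_transient_poll_error` / `_poll_with_retry`); the exception shape carrying `.status` is an assumption for illustration:

```python
import time

# Per this PR: 5xx is transient; 401/403/404/422/429 (and search-engine
# body errors) are permanent and must not be retried.
def is_transient(status):
    return status is not None and status >= 500

def backoff_delay(attempt, cap=30.0):
    """Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return min(2.0 ** attempt, cap)

def poll_with_retry(poll, max_retries=3, sleep=time.sleep):
    """Call `poll()`; on a transient failure (an exception carrying an
    HTTP status), back off and retry up to `max_retries` times. Permanent
    errors and exhausted retries propagate to the caller."""
    for attempt in range(max_retries + 1):
        try:
            return poll()
        except Exception as e:
            status = getattr(e, "status", None)
            if not is_transient(status) or attempt == max_retries:
                raise
            sleep(backoff_delay(attempt))
```

Injecting `sleep` keeps the loop testable; the delay sequence 1, 2, 4, 8, 16, 30, 30... matches the capped 2^attempt schedule described above.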
Flag validation
- `--resume` without `--checkpoint`
- `--checkpoint existing.jsonl` without `--resume`/`--force`
- `--resume` with `--query`/`--start`/`--end`/`--stream`
- `--resume` with `--limit`
- `--resume` on a completed checkpoint
- `--force` with `--checkpoint`

Architecture
Server-side resume via pagination token
The pagination token contains a cursor encoding the position in the result set. On resume, a fresh search session is initiated (new `queryId`) and the stored token is passed to the first poll. The server extracts the cursor from the token and re-runs the query starting from that position.

Two-file checkpoint design
- Data file (user-specified JSONL path): the search results, one JSON object per line. 0o600, created with `O_EXCL|O_NOFOLLOW`.
- Metadata file (`~/.limacharlie.d/search_checkpoints/<sha256>.meta.json`): atomic JSON with query params, progress, token. 0o600 permissions. Paired via SHA-256 of the data file path.

Security model
- Checkpoint directories: 0o700 via `secure_makedirs()` (also tightens existing dirs)
- Metadata files: 0o600 via `atomic_write()`, reads via `safe_open_read()` (symlink rejection)
- Data files: 0o600, created with `O_EXCL|O_NOFOLLOW` (atomic TOCTOU prevention + symlink rejection)
- `list_checkpoints` validates `data_file` is an absolute path (rejects path traversal from corrupt metadata)
- Printed resume commands are shell-escaped via `shlex.quote()`

Performance model
- `CheckpointReader.iter_results()` is a generator that streams one line at a time.
- `checkpoint-show --output jsonl` uses this to output results without loading the full file into memory.
- For large checkpoints, use `--output jsonl` and pipe to `jq` or other streaming tools.
- Listing counts results (`count_results`) without loading data.

Blast radius / isolation
- SDK (`search.py`): only the poll loop is affected. New `start_token`/`start_page` are additive. Other operations unchanged.
- CLI (`commands/search.py`): `run()` refactored. New `checkpoint-show` subcommand. Other search commands untouched.
- New module (`checkpoint.py`): self-contained.
- `file_utils.py`: added `secure_dir_permissions()`/`secure_makedirs()` - additive.
- Bug fix (`client.py`): single line - `data = None` before the try block.

Test plan
1634 total tests pass (170+ new), 2 pre-existing skips.

- SDK tests (42): retry classification (18), retry behavior (15), cancellation (9)
- Checkpoint unit tests (50): writer (11), reader (8 incl. `iter_results`, `count_results`, mid-file corruption), resumer (4), path derivation (3), listing/cleanup (6), permissions (6), checkpoint ID (3), context managers (7)
- File utils security tests (7): dir permissions (3), makedirs (4)
- Integration tests (32): end-to-end (2), resume with token (3), correctness (4), cancellation (4), error recovery (4), context managers (7), listing (1)
- CLI integration tests via CliRunner (47): checkpoint creation (6), resume (2), resume validation (5), checkpoints listing (4), non-checkpoint search (3), cancel message (1), rejected token (3), checkpoint-show (9 incl. streaming JSONL), `_is_token_expired_error` classifier (10), `_build_fresh_query_cmd` (4)
🤖 Generated with Claude Code