
feat: streaming output, orjson, and memory-efficient search rendering #254

Merged
tomaz-lc merged 8 commits into cli-v2 from feat-streaming-output
Mar 18, 2026
Conversation

@tomaz-lc (Contributor) commented Mar 17, 2026

Adds streaming output support across all search command paths, replacing the previous approach that buffered all results in memory before rendering. Output formats that support it (JSONL, JSON, expand, table) now stream results with constant or near-constant memory usage, making large searches (100K+ events) feasible on memory-constrained VMs.

Also adds orjson as an optional-but-preferred JSON backend (~3-10x faster than stdlib json), with automatic fallback to stdlib if orjson is unavailable.
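The fallback pattern described above can be sketched as follows; this is an illustrative shim in the spirit of the PR's `json_compat` module, not its actual code (the function names are assumptions):

```python
# Illustrative orjson-with-fallback shim: prefer orjson when importable,
# otherwise fall back to the stdlib json module behind a common str-based API.
import json as _stdlib_json

try:
    import orjson as _orjson
    backend_name = "orjson"

    def dumps(obj) -> str:
        # orjson.dumps returns bytes; decode to keep a str-based API
        return _orjson.dumps(obj).decode("utf-8")

    def loads(s):
        return _orjson.loads(s)
except ImportError:
    backend_name = "json"

    def dumps(obj) -> str:
        return _stdlib_json.dumps(obj)

    def loads(s):
        return _stdlib_json.loads(s)
```

Callers import `dumps`/`loads` from the shim and never touch either backend directly, so the speedup is transparent and the stdlib path keeps the package working where orjson cannot be installed.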

Output behavior by path and format

| Path | JSONL | JSON | expand | table | CSV/YAML |
|------|-------|------|--------|-------|----------|
| `search run` (no checkpoint) | Stream, O(1) | Stream, O(1) | Stream, O(1) | Stream (sampled widths), O(sample) | Buffer, O(N) |
| `search run --checkpoint` | Stream from file, O(1) | Stream from file, O(1) | Stream from file, O(1) | 2-pass stream from file, O(cols) | Buffer from file, O(N) |
| `search run --resume` | Stream from file, O(1) | Stream from file, O(1) | Stream from file, O(1) | 2-pass stream from file, O(cols) | Buffer from file, O(N) |
| `search checkpoint-show` | Stream from file, O(1) | Stream from file, O(1) | Stream from file, O(1) | 2-pass stream from file, O(cols) | Buffer from file, O(N) |
| `search saved-run` | Stream, O(1) | Stream, O(1) | Stream, O(1) | Stream (sampled widths), O(sample) | Buffer, O(N) |

Peak memory by format

| Format | Without checkpoint | With checkpoint |
|--------|--------------------|-----------------|
| JSONL | O(1) - one result at a time | O(1) - one line at a time |
| JSON | O(1) - streaming `[`, items, `]` | O(1) - streaming from file |
| expand | O(1) - one event block at a time | O(1) - one event block at a time |
| table | O(sample_pages * cols) - buffers ~3 pages for column width sampling, then streams | O(cols) - two-pass file scan: pass 1 computes exact column widths, pass 2 streams rows |
| CSV | O(N) - inherent to format | O(N) - loaded from file |
| YAML | O(N) - inherent to format | O(N) - loaded from file |

Memory profile - live validation

Measured during a real 30-day --resume --checkpoint search over a production org. The checkpoint file grew to 632 MB (484 result pages) while RSS stayed flat at ~138 MB, confirming O(1) memory for the checkpoint write path.

| Elapsed | RSS (MB) | Checkpoint File (MB) | Result Pages |
|---------|----------|----------------------|--------------|
| 0:00 | 85 | 113 | 80 |
| 1:30 | 87 | 161 | 128 |
| 3:00 | 100 | 208 | 164 |
| 4:30 | 116 | 270 | 200 |
| 6:00 | 119 | 321 | 252 |
| 7:30 | 142 | 413 | 304 |
| 9:00 | 141 | 498 | 368 |
| 10:30 | 141 | 587 | 444 |
| 14:46 | 138 | 632 | 484 |

Key observations:

  • RSS rose from 85 MB to ~140 MB in the first ~7 minutes (Python allocator warm-up, HTTP connection pools, orjson parser buffers), then stabilized completely for the remaining 7+ minutes.
  • Checkpoint file grew linearly from 113 MB to 632 MB (519 MB of new data written) with zero corresponding RSS growth.
  • At ~140 MB (0.4% of the 32 GB test machine's memory), this can comfortably run on 512 MB VMs.

Warnings

Three independent warnings based on search time range:

  1. Billing cost notice (>30 days): Notifies that data older than 30 days may incur additional charges. Shows the exact search estimate command to check costs.
  2. Memory warning (>7 days, CSV/YAML only): Warns that all results are buffered in memory. Suggests streaming formats or --checkpoint.
  3. Resumability warning (>14 days, all formats): Recommends --checkpoint so interrupted searches can be resumed.
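The three thresholds above can be sketched as a single check; helper and variable names here are illustrative, not the PR's actual implementation:

```python
# Illustrative sketch of the three time-range warnings: thresholds follow
# the PR description (>30d cost, >7d memory for buffered formats, >14d
# checkpoint recommendation); names and signature are assumptions.
DAY = 86400


def time_range_warnings(start: int, end: int, output_format: str,
                        has_checkpoint: bool) -> list:
    span_days = (end - start) / DAY
    warnings = []
    if span_days > 30:
        warnings.append("cost-notice")        # data older than 30 days may incur charges
    if span_days > 7 and output_format in ("csv", "yaml") and not has_checkpoint:
        warnings.append("memory-warning")     # buffered formats hold all results in RAM
    if span_days > 14 and not has_checkpoint:
        warnings.append("recommend-checkpoint")
    return warnings
```

Keeping the checks independent means a 40-day CSV search without `--checkpoint` triggers all three, while a 10-day JSONL search triggers none.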
Cost notice (>30 day search)

```console
$ limacharlie search run --query "* | * | *" --start 1771065591 --end 1773657591 --output jsonl
Notice: this search spans 30 days. Searches over data older than 30 days may incur additional costs.
To estimate the cost before running:

  limacharlie search estimate \
    --query '* | * | *' \
    --start 1771065591 \
    --end 1773657591
```

Validate/estimate exit codes and output

search validate and search estimate now exit with code 1 when the server returns an error (e.g. invalid query syntax). Previously they always exited 0.

For table output, stats and estimatedPrice fields are flattened into individual columns (stats.bytesScanned, price.value, etc.) so full values are visible without truncation.

Validate: invalid query (exit code 1)

```console
$ limacharlie search validate --query "* | * | * x" --output table
Field                     Value
------------------------  -------------------------------------------------------------------
query                     * | * | * x
startTime                 1773667570
endTime                   1773753970
error                     failed to transcode query: 1:56 (55): no match found, expected: ...
stats.bytesScanned        0
stats.eventsScanned       0
stats.eventsMatched       0
stats.eventsProcessed     0
stats.rulesEvaluated      0
stats.walltime            0
price.value               0
price.currency            USD cents

$ echo $?
1
```
Validate: valid query (exit code 0)

```console
$ limacharlie search validate --query "* | * | *" --output table
Field                     Value
------------------------  -------------------------------------------------------------------
query                     * | * | *
startTime                 1773667570
endTime                   1773753970
stats.bytesScanned        1234567
stats.eventsScanned       50000
stats.eventsMatched       0
stats.eventsProcessed     0
stats.rulesEvaluated      0
stats.walltime            0
price.value               42
price.currency            USD cents

$ echo $?
0
```
Estimate: JSON output (preserves nested structure)

```console
$ limacharlie search estimate --query "* | * | *" --start 1700000000 --end 1700086400 --output json
{
  "query": "* | * | *",
  "stats": {
    "bytesScanned": 1234567,
    "eventsScanned": 50000,
    "eventsMatched": 0,
    "walltime": 0
  },
  "estimatedPrice": {
    "value": 42,
    "currency": "USD cents"
  }
}
```

-h/--help flag fix

The -h short flag for help was not wired up at the CLI root level. This affected all commands and subcommands. Added `context_settings={"help_option_names": ["-h", "--help"]}` to the root group, which propagates to all subcommands.

Missing help/description strings for search subcommands

Added missing Click docstrings for: validate, estimate, saved-get, saved-create, saved-delete, saved-run.

Checkpoints list improvements

  • Added size column showing human-readable file size on disk
  • Moved created column to first position
  • Sorted by created timestamp descending (most recent first)

Changes

  • Streaming output functions: _stream_search_output() handles JSONL, JSON, expand, and table formats without buffering. Returns False for CSV/YAML/raw so the caller can fall back to list().
  • Two-pass table renderer: _stream_table_from_file() for checkpoint paths - pass 1 scans the JSONL file to compute exact column widths (O(cols) memory), pass 2 streams rows with the computed layout.
  • Sample-based table streaming: _stream_table_events() for live search paths - buffers first ~3 pages to determine column widths, then streams remaining rows.
  • orjson integration: limacharlie/json_compat.py provides dumps, loads, dumps_pretty using orjson when available, stdlib json fallback.
  • Billing cost notice: _warn_cost_if_over_30_days() warns when search spans >30 days and shows the search estimate command. Threshold matches server-side billing logic (replay: 31d, insight-go: 30d with >).
  • Validate/estimate fixes: Non-zero exit code on error response. Table output flattens stats/estimatedPrice into individual columns via _output_validate_or_estimate().
  • Two-tier warnings: Memory buffering warning (>7d, CSV/YAML only) and checkpoint resumability recommendation (>14d, all formats).
  • CLI help fixes: Added -h as help short flag (affects all commands/subcommands); added missing docstrings for search subcommands.
  • Checkpoints list: Added file size column, reordered columns (created first), sorted by created descending.
  • --ai-help updated: Added output format documentation to search run explain text covering streaming vs buffered behavior, --expand, --raw, and memory guidance.
  • Tests: 370 tests passing across related test files. New test_search_helpers.py with 144 unit tests covering all helper functions, streaming, warnings, cost notice, validate/estimate exit codes, and output formatting.

Blast radius / isolation

  • Affected: All search output paths (search run, search run --checkpoint, search run --resume, search checkpoint-show, search saved-run), search validate, search estimate, search help text, checkpoints list display, CLI -h flag.
  • NOT affected: Non-search CLI commands (except -h fix which benefits all), SDK classes, authentication, config.
  • Backward compatible: Output content is identical, only the internal buffering strategy changed. Users see the same results. Validate/estimate exit code change is a bug fix (was always 0, now 1 on error).

Performance characteristics

  • JSONL/JSON/expand output: memory drops from O(N) to O(1) for all search paths.
  • Table output (checkpoint): memory drops from O(N) to O(cols). The two file passes add ~50% wall time vs a single pass but avoid OOM on large result sets.
  • Table output (live search): memory is O(sample_pages * cols) instead of O(N). Column widths may be slightly off for late pages (sampled, not exact).
  • orjson gives ~3-10x faster JSON serialization/deserialization across all paths.
  • Live validation: 30-day checkpoint resume search - RSS stabilized at 138 MB while checkpoint file grew to 632 MB (484 pages). See memory profile table above.

Notable contracts / APIs

  • No wire format or API contract changes.
  • json_compat module is internal, not part of the public SDK API.
  • orjson is added as a dependency in pyproject.toml but the fallback ensures backward compatibility if it cannot be installed.
  • Validate/estimate exit code change: was 0, now 1 when response contains error field. This is a correctness fix.

Test plan

  • Unit tests for billing cost notice (12 unit + 6 CLI tests in TestWarnCostIfOver30Days, TestCostWarningCli)
  • Unit tests for validate/estimate exit codes (9 tests in TestValidateEstimateExitCode)
  • Unit tests for validate/estimate table output (6 tests in TestOutputValidateOrEstimate)
  • Unit tests for memory buffering warning (8 tests in TestLargeTimeRangeWarning)
  • Unit tests for checkpoint resumability recommendation (5 tests in TestCheckpointRecommendWarning)
  • Unit tests for streaming output correctness (JSONL, JSON, expand, table)
  • Unit tests for search helper functions (144 tests in test_search_helpers.py)
  • Manual test: search run --resume --checkpoint with 30-day range - RSS stable at 138 MB while file grew to 632 MB
  • Manual test: search validate with invalid query returns exit code 1
  • Manual test: search estimate with >30 day range shows cost notice

🤖 Generated with Claude Code

@tomaz-lc force-pushed the feat-streaming-output branch 2 times, most recently from 36c044a to 4a0602d on March 17, 2026 at 11:16

orjson is a mandatory dependency (specified in pyproject.toml) but the
fallback ensures the package still works if orjson cannot be installed
on a particular platform (e.g. missing Rust compiler for source builds).
@tomaz-lc (Contributor, Author) replied Mar 17, 2026:
Keep in mind this only applies to source builds - orjson is a well-maintained library that ships pre-built wheels for the most common platforms (Linux, macOS, Windows) and architectures (x86, ARM).

@tomaz-lc force-pushed the feat-streaming-output branch 6 times, most recently from 0ade604 to 7a37232 on March 17, 2026 at 13:45
@tomaz-lc requested a review from maximelb March 17, 2026 15:32
@tomaz-lc marked this pull request as ready for review March 17, 2026 15:34
maximelb previously approved these changes Mar 17, 2026
Add streaming output to avoid OOM on large searches. Previously all
search results were buffered in a list before output, causing OOM on
constrained VMs (e.g. 4GB RAM with 500K+ events). Now results stream
one at a time for JSONL, JSON, expand, and table formats.

Streaming behavior by format:
- JSONL: one result per line, constant memory (all paths)
- JSON: streaming array ([, item, item, ]), constant memory (all paths)
- expand: one event block at a time, constant memory (all paths)
- table (live search): sample first N pages for column widths, then
  stream remaining rows. O(sample + columns) memory.
- table (checkpoint): two-pass over file - pass 1 computes exact column
  widths O(columns), pass 2 streams rows. Perfectly accurate layout.
- CSV/YAML: still buffered (inherent to format, rarely used for large data)

Key changes:
- _stream_search_output(): core streaming function for JSONL, JSON,
  expand, and table from any iterable. Returns False for CSV/YAML.
- _stream_table_events(): sample-based streaming table for live searches
  (configurable via _TABLE_SAMPLE_PAGES constant).
- _stream_table_from_file(): two-pass streaming table for checkpoint files.
- _run_normal and saved_run: try streaming first, fall back to list() only
  for CSV/YAML.
- _run_with_checkpoint: search loop does not accumulate results in memory.

Add orjson as dependency for ~3-10x faster JSON serialization:
- New limacharlie/json_compat.py module: unified API (dumps, dumps_pretty,
  loads, backend_name) with graceful fallback to stdlib json.
- output.py: format_json, format_jsonl, _table_value, _csv_value all use
  json_compat. Benefits ALL CLI commands, not just search.
- Debug log (--debug) shows which JSON backend is active.

CLI improvements:
- Add -h as alias for --help on all commands (context_settings).
- Add help strings to all search subcommands (run, validate, estimate,
  saved-list, saved-get, saved-create, saved-delete, saved-run).
- Fix checkpoint-show --checkpoint error to show "--checkpoint" not
  "checkpoint_path" in missing parameter message.
- Warn on large time range searches (>7 days) without --checkpoint
  when using buffered output formats (table/CSV/YAML). Suggests
  --checkpoint or --output jsonl. Threshold configurable via
  _LARGE_TIME_RANGE_WARN_SECONDS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tomaz-lc force-pushed the feat-streaming-output branch from 8c01ee0 to 6920054 on March 18, 2026 at 07:28
tomaz-lc and others added 7 commits March 18, 2026 08:33
Add PyPI classifiers for Python 3.9-3.14, development status, topic,
and audience. CI already tests on Python 3.14 via cloudbuild_pr.yaml.

Add packaging tests: classifiers present, current Python version
included, requires-python minimum, production/stable status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add project URLs (Documentation, Repository, Issues, Changelog,
REST API Docs) so links render on the PyPI page. Update description
to better reflect the package scope.

Add packaging tests for URL presence and HTTPS validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… dist checks

orjson 3.11+ requires Python 3.10+, but we support 3.9. Split the
dependency into two environment markers:
- Python <3.10: orjson >=3.10.0,<3.11 (last series with 3.9 support)
- Python >=3.10: orjson >=3.10.0 (latest)

Add distribution install checks for Python 3.9-3.13 in CI (3.14
already covered by existing steps). All run in parallel. Python 3.9
step also verifies orjson 3.10.x is installed (not 3.11+) and runs
the full unit test suite to catch syntax/compat issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…allel

Consolidate the separate "Unit Tests" and "Dist Check" steps into
unified per-version steps that build, install, verify orjson, and run
the full unit test suite. All 6 versions run in parallel.

Use E2_HIGHCPU_8 machine type (8 vCPUs) to handle ~10 concurrent
steps efficiently. Previously used the default E2_MEDIUM (2 vCPUs).

Integration tests and benchmarks remain on Python 3.14 only since they
test API behavior, not Python version compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_jwt_cache.py used `float | int` union syntax which requires
Python 3.10+. Adding `from __future__ import annotations` makes
it work on 3.9.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
platform.freedesktop_os_release was added in Python 3.10. Use
create=True on mock.patch so the test works on 3.9 where the
attribute doesn't exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split each Python version into separate "Dist" and "Unit Tests" steps
for clearer CI output and easier debugging. Each step:

Dist steps: build wheel in /tmp/build-<ver>, install, verify pip show,
limacharlie --version, orjson backend. Clean isolation per step.

Unit test steps: install from source with dev deps in /tmp/test-<ver>,
run full pytest suite. Clean isolation per step.

All steps use unique /tmp dirs to avoid cross-step interference.
Added echo banners (======) and phase markers (--- phase ---) so
CI logs are easy to scan.

Also added sdist check as separate parallel step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tomaz-lc merged commit 6d93a72 into cli-v2 on Mar 18, 2026
1 check passed
@tomaz-lc deleted the feat-streaming-output branch March 18, 2026 12:42