Summary
output_hlquery currently treats batching as a document-shaping feature instead of a transport optimization: when batch_lines or batch_interval is enabled, hlog collapses multiple source events into one larger document containing event_count, events, and messages_text. This should be replaced with true bulk ingestion that keeps one source event equal to one indexed document while still sending batches efficiently.
Context
The current batching path in src/core/pipeline.cpp builds a synthetic aggregate document inside Pipeline::AsyncHlqueryOutput::Run() and then sends it through the normal single-document POST flow. That means enabling batching changes indexing semantics, not just throughput characteristics.
This is especially problematic for a high-performance search engine pipeline:
- Individual log lines stop being first-class searchable documents.
- Large aggregated documents increase indexing cost, payload size, and hot-path memory pressure.
- Downstream queries, retention, deduplication, and document-level filtering become less precise because one stored record now represents many events.
- Failure handling becomes coarse: a failed post affects a whole synthetic batch even though the failure buffer records original lines individually.
- The same pattern is mirrored in src/modules/m_irc.cpp, which suggests the event model is drifting toward “batch blobs” across multiple ingestion paths.
- The README describes hlog as forwarding structured events into an hlquery collection, but batching currently rewrites those structured events into wrapper documents.
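To make the difference in document granularity concrete, here is a minimal sketch contrasting the two shapes. The Event struct is a hypothetical stand-in for PipelineEvent (its real fields are not shown in this issue), and the exact layout of the wrapper document is an assumption built only from the field names mentioned above (event_count, events, messages_text).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for PipelineEvent; the real struct may differ.
struct Event {
    std::string source;
    std::string message;
};

// Minimal JSON string escaping: quotes, backslashes, control characters.
std::string JsonEscape(const std::string& in) {
    static const char* hex = "0123456789abcdef";
    std::string out;
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    out += "\\u00";
                    out += hex[c >> 4];
                    out += hex[c & 0xF];
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

std::string EventToJson(const Event& e) {
    return "{\"source\":\"" + JsonEscape(e.source) +
           "\",\"message\":\"" + JsonEscape(e.message) + "\"}";
}

// Behaviour described in this issue: the whole batch becomes ONE document.
// The field layout here is assumed, not copied from pipeline.cpp.
std::string BuildAggregateDocument(const std::vector<Event>& batch) {
    std::string events, text;
    for (std::size_t i = 0; i < batch.size(); ++i) {
        if (i > 0) { events += ","; text += "\n"; }
        events += EventToJson(batch[i]);
        text += batch[i].message;
    }
    return "{\"event_count\":" + std::to_string(batch.size()) +
           ",\"events\":[" + events + "]" +
           ",\"messages_text\":\"" + JsonEscape(text) + "\"}";
}

// Proposed behaviour: the batch stays a list of independent documents.
std::vector<std::string> BuildPerEventDocuments(const std::vector<Event>& batch) {
    std::vector<std::string> docs;
    docs.reserve(batch.size());
    for (const auto& e : batch) docs.push_back(EventToJson(e));
    return docs;
}
```

With a two-event batch, BuildAggregateDocument yields a single record that a query can only match or miss as a whole, while BuildPerEventDocuments yields two records that can be matched, filtered, and expired independently.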
Proposed Implementation
- Add a native bulk-ingest path in HlqueryHttpOutput that accepts a vector of PipelineEvent objects and sends them in one request to an hlquery multi-document ingestion API.
- Keep batching as a transport concern only: queue and flush multiple events together, but serialize each event as its own document.
- Preserve current per-event fields exactly as produced by the pipeline; do not inject events or messages_text wrapper fields in the normal output_hlquery path.
- Handle partial batch failures explicitly so the failure buffer can record only the documents that were rejected or not acknowledged.
- Create the destination collection once during startup or first use, then avoid retrying collection creation on the hot path each time a request fails with a 404-style response.
- Add tests covering batch size, batch interval, ordering, partial failures, and the guarantee that batching does not alter document granularity.
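The batching and partial-failure items above could be sketched roughly as follows. This is not the existing HlqueryHttpOutput code: BulkBatcher and BulkSender are hypothetical names, and the assumption that the hlquery bulk endpoint reports acceptance per document (modelled here as a vector of bool) is not confirmed by anything in this issue. Timestamps are passed in explicitly so the interval logic can be tested without sleeping.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical transport hook: sends all documents in one request and returns
// one accepted/rejected flag per document, in the same order.
using BulkSender =
    std::function<std::vector<bool>(const std::vector<std::string>&)>;

class BulkBatcher {
public:
    BulkBatcher(std::size_t batch_lines,
                std::chrono::milliseconds batch_interval,
                BulkSender sender)
        : batch_lines_(batch_lines),
          batch_interval_(batch_interval),
          sender_(std::move(sender)) {}

    // Queue one already-serialized document; flush when the size limit is hit.
    void Add(std::string doc, Clock::time_point now = Clock::now()) {
        if (pending_.empty()) first_queued_ = now;
        pending_.push_back(std::move(doc));
        if (batch_lines_ > 0 && pending_.size() >= batch_lines_) Flush();
    }

    // Call periodically; flushes once the oldest pending document has waited
    // at least batch_interval.
    void Tick(Clock::time_point now = Clock::now()) {
        if (!pending_.empty() && batch_interval_.count() > 0 &&
            now - first_queued_ >= batch_interval_) {
            Flush();
        }
    }

    // Send pending documents in their original order as one request.
    void Flush() {
        if (pending_.empty()) return;
        std::vector<std::string> batch;
        batch.swap(pending_);
        const std::vector<bool> acks = sender_(batch);
        for (std::size_t i = 0; i < batch.size(); ++i) {
            // A missing acknowledgement is treated the same as a rejection.
            if (i >= acks.size() || !acks[i]) {
                failed_.push_back(std::move(batch[i]));
            }
        }
    }

    const std::vector<std::string>& failed() const { return failed_; }
    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t batch_lines_;
    std::chrono::milliseconds batch_interval_;
    BulkSender sender_;
    std::vector<std::string> pending_;
    std::vector<std::string> failed_;  // stand-in for the failure buffer
    Clock::time_point first_queued_{};
};
```

The batcher only decides when to send; it never merges documents, so batch_lines and batch_interval affect request count but not what gets indexed. Only documents that were rejected or left unacknowledged reach failed(), which matches the per-line granularity the failure buffer already uses.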
Impact
This restores the correct ingestion model for a search engine: every source event remains independently searchable, filterable, and retainable, while batching still improves throughput by reducing per-request overhead. It should materially improve ingestion scalability, reduce payload amplification, and prevent subtle query regressions caused by storing log batches as oversized wrapper documents.