
Preserve Per-Event Documents by Replacing Synthetic Batch Documents with Native Bulk Ingestion #1

@cferrys

Summary

output_hlquery currently treats batching as a document-shaping feature instead of a transport optimization. When batch_lines or batch_interval is enabled, hlog collapses multiple source events into one larger document containing event_count, events, and messages_text. This should be replaced with true bulk ingestion that keeps one source event equal to one indexed document while still sending events in batches.

Context

The current batching path in src/core/pipeline.cpp builds a synthetic aggregate document inside Pipeline::AsyncHlqueryOutput::Run() and then sends it through the normal single-document POST flow. That means enabling batching changes indexing semantics, not just throughput characteristics.

This is especially problematic for a high-performance search engine pipeline:

  • Individual log lines stop being first-class searchable documents.
  • Large aggregated documents increase indexing cost, payload size, and hot-path memory pressure.
  • Downstream queries, retention, deduplication, and document-level filtering become less precise because one stored record now represents many events.
  • Failure handling becomes coarse: a failed POST fails the whole synthetic batch, even though the failure buffer records original lines individually.
  • The same pattern is mirrored in src/modules/m_irc.cpp, which suggests the event model is drifting toward “batch blobs” across multiple ingestion paths.
  • The README describes hlog as forwarding structured events into an hlquery collection, but batching currently rewrites those structured events into wrapper documents.
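
To make the granularity difference concrete, here is a minimal sketch contrasting the two document shapes. It is not taken from the hlog source: SketchEvent, JsonEscape, BuildAggregateDocs, and BuildPerEventDocs are hypothetical names, the single message field stands in for whatever fields a real PipelineEvent carries, and the wrapper layout is only an approximation built from the three field names mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a pipeline event; the real PipelineEvent
// presumably carries more fields than a single message.
struct SketchEvent {
    std::string message;
};

// Minimal JSON string escaping for the sketch (quotes, backslashes, control characters).
std::string JsonEscape(const std::string& in) {
    static const char kHex[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : in) {
        if (c == '"') { out += "\\\""; }
        else if (c == '\\') { out += "\\\\"; }
        else if (c == '\n') { out += "\\n"; }
        else if (c == '\r') { out += "\\r"; }
        else if (c == '\t') { out += "\\t"; }
        else if (c < 0x20) { out += "\\u00"; out += kHex[c >> 4]; out += kHex[c & 0xF]; }
        else { out += static_cast<char>(c); }
    }
    return out;
}

// Approximation of the current behavior: N events collapse into ONE wrapper document.
std::vector<std::string> BuildAggregateDocs(const std::vector<SketchEvent>& events) {
    std::string joined;
    std::string array;
    for (std::size_t i = 0; i < events.size(); ++i) {
        if (i > 0) { joined += "\n"; array += ","; }
        joined += events[i].message;
        array += "{\"message\":\"" + JsonEscape(events[i].message) + "\"}";
    }
    return {"{\"event_count\":" + std::to_string(events.size()) +
            ",\"events\":[" + array + "]" +
            ",\"messages_text\":\"" + JsonEscape(joined) + "\"}"};
}

// Proposed behavior: N events stay N documents, each serialized on its own.
std::vector<std::string> BuildPerEventDocs(const std::vector<SketchEvent>& events) {
    std::vector<std::string> docs;
    docs.reserve(events.size());
    for (const auto& e : events) {
        docs.push_back("{\"message\":\"" + JsonEscape(e.message) + "\"}");
    }
    return docs;
}
```

With two input events, BuildAggregateDocs yields a single indexed record while BuildPerEventDocs yields two, which is the property queries, retention, and deduplication depend on.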

Proposed Implementation

  1. Add a native bulk-ingest path in HlqueryHttpOutput that accepts a vector of PipelineEvent objects and sends them in one request to an hlquery multi-document ingestion API.
  2. Keep batching as a transport concern only: queue and flush multiple events together, but serialize each event as its own document.
  3. Preserve current per-event fields exactly as produced by the pipeline; do not inject event_count, events, or messages_text wrapper fields in the normal output_hlquery path.
  4. Handle partial batch failures explicitly so the failure buffer can record only the documents that were rejected or not acknowledged.
  5. Create the destination collection once, during startup or on first use, rather than retrying collection creation on the hot path each time a request fails with a 404-style response.
  6. Add tests covering batch size, batch interval, ordering, partial failures, and the guarantee that batching does not alter document granularity.
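
As a rough illustration of steps 2 through 4, the sketch below keeps batching purely as a flush policy and selects only unacknowledged documents for the failure buffer. Everything in it is an assumption made for illustration: SketchDoc, EventBatcher, SerializeBulk, and CollectFailed are hypothetical names, the newline-delimited body is an assumed wire format rather than a documented hlquery API, and the per-document acknowledgement vector assumes the bulk endpoint reports results per document in request order.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical: one already-serialized JSON document per source event.
struct SketchDoc {
    std::string json;
};

// Transport-only batching: decides WHEN to flush, never reshapes documents.
class EventBatcher {
public:
    EventBatcher(std::size_t batch_lines, std::chrono::milliseconds batch_interval)
        : batch_lines_(batch_lines), batch_interval_(batch_interval) {}

    // Queue one document; returns true when the caller should flush.
    bool Add(SketchDoc doc, Clock::time_point now) {
        if (pending_.empty()) first_queued_ = now;
        pending_.push_back(std::move(doc));
        return ShouldFlush(now);
    }

    // Flush when the size limit is reached or the oldest queued document is
    // older than the interval. A limit of zero disables that trigger.
    bool ShouldFlush(Clock::time_point now) const {
        if (pending_.empty()) return false;
        if (batch_lines_ > 0 && pending_.size() >= batch_lines_) return true;
        return batch_interval_.count() > 0 && (now - first_queued_) >= batch_interval_;
    }

    // Hand over the queued documents in arrival order and reset the queue.
    std::vector<SketchDoc> Take() {
        std::vector<SketchDoc> out;
        out.swap(pending_);
        return out;
    }

private:
    std::size_t batch_lines_;
    std::chrono::milliseconds batch_interval_;
    std::vector<SketchDoc> pending_;
    Clock::time_point first_queued_{};
};

// Assumed wire format: one JSON document per line, in arrival order.
std::string SerializeBulk(const std::vector<SketchDoc>& docs) {
    std::string body;
    for (const auto& d : docs) {
        body += d.json;
        body += '\n';
    }
    return body;
}

// Return only the documents that were rejected or not acknowledged.
// A document with no matching entry in `accepted` counts as unacknowledged.
std::vector<SketchDoc> CollectFailed(const std::vector<SketchDoc>& docs,
                                     const std::vector<bool>& accepted) {
    std::vector<SketchDoc> failed;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        if (i >= accepted.size() || !accepted[i]) failed.push_back(docs[i]);
    }
    return failed;
}
```

Passing the clock value into Add and ShouldFlush keeps the flush policy deterministic under test, which helps with the batch-size, batch-interval, and ordering cases in step 6. The line-per-document body only works if each serialized document contains no raw newline, which holds for JSON with properly escaped strings.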

Impact

This restores the correct ingestion model for a search engine: every source event remains independently searchable, filterable, and retainable, while batching still improves throughput by reducing per-request overhead. It should materially improve ingestion scalability, reduce payload amplification, and prevent subtle query regressions caused by storing log batches as oversized wrapper documents.
