[libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18#45574
Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)
Note comment in the issue about waiting for v18.4.1.
This pull request is now in conflicts. Could you fix it? 🙏
/test
Hi @elastic/beats-tech-leads, require a codeowner approval for merging.
@ShourieG can you please backport this to 8.19, 9.1?
Generally it’s not recommended to do so unless it’s an absolute necessity. This library update in turn triggers multiple ad hoc updates to other libraries, and some existing dependencies of the gcs input also had to be updated along with their tests. Backporting this could cause issues or break things, and I don’t want to take that risk.
@Mergifyio backport 8.19 9.1
✅ Backports have been created
…ry used in parquet reader to v18 (#47087)
* [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 (#45574)
Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs tests as a byproduct of errors introduced with newer storage library versions.
(cherry picked from commit b7c5a85)
# Conflicts:
# NOTICE.txt
# go.mod
# go.sum
* fix conflicts
---------
Co-authored-by: Shourie Ganguly <shourie.ganguly@elastic.co>
Co-authored-by: Khushi Jain <khushi.jain@elastic.co>
Type of change
Proposed commit message
NOTE
The v18 parquet library depends on a newer version of the Google storage library.
This upgrade resulted in a response change in the gcs tests, where sdk methods used in some scenarios now return more context in case of a 404 error. The respective tests have been updated to align to this change.
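The kind of test breakage described above can be sketched as follows. This is a minimal illustration, not the change made in this PR: `errNotFound`, `fetchOld`, `fetchNew` and `isNotFound` are hypothetical names, and the sketch assumes the extra context is added by wrapping the original error. Under that assumption, an assertion on the exact error string breaks, while a check with `errors.Is` keeps passing.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for a storage
// "object not found" error. It is not the SDK's real error value.
var errNotFound = errors.New("object doesn't exist")

// fetchOld mimics an SDK call that returns the bare sentinel error.
func fetchOld() error {
	return errNotFound
}

// fetchNew mimics an SDK call that wraps the same error with extra
// context, such as the object path and the HTTP status.
func fetchNew(bucket, name string) error {
	return fmt.Errorf("get %s/%s: status 404: %w", bucket, name, errNotFound)
}

// isNotFound reports whether err is, or wraps, errNotFound.
func isNotFound(err error) bool {
	return errors.Is(err, errNotFound)
}

func main() {
	oldErr := fetchOld()
	newErr := fetchNew("bucket", "obj")

	// Exact string comparison no longer holds once context is added.
	fmt.Println(oldErr.Error() == newErr.Error())
	// Matching on the wrapped sentinel still works.
	fmt.Println(isNotFound(oldErr), isNotFound(newErr))
}
```

Tests that assert on the full message need their expected strings updated whenever the SDK changes its wording, whereas sentinel-based checks only depend on the wrapped error.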
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Disruptive User Impact
Potentially larger memory footprint when using parquet decoding at a smaller scale.
Author's Checklist
How to test this PR locally
Related issues
Use cases
Much faster processing when using parquet decoding at larger scales, at the cost of higher memory demands for smaller-scale usage.
Screenshots
Logs
Analysis:
NOTE: The following summary was generated by feeding the relevant benchmark data into an LLM and was then edited manually.
Summary
v18 is ~2.6× faster than v17 for large-scale data processing.
This performance gain comes at the cost of ~2.6× more memory
usage and ~1.5× more allocations.
For smaller files (e.g., vpc_flow.parquet), v18 is ~3× slower and uses ~2.5× more memory.
v18 scales well with more CPU cores, showing up to ~4.5×
performance improvement from 1 → 10 cores.
v18 is best for high-throughput scenarios with ample memory.
For memory-constrained or small-file workloads, its overhead
is significant if batch_size is not constrained.
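To put the scaling figure in the summary above into perspective, the ~4.5× speedup on 10 cores can be turned into a parallel efficiency and, under Amdahl's law, into an implied serial fraction. This is a rough sketch: treating the workload as following Amdahl's law is an assumption made here, not something the benchmarks establish, and the helper names are illustrative.

```go
package main

import "fmt"

// parallelEfficiency returns the speedup divided by the core count.
func parallelEfficiency(speedup float64, cores int) float64 {
	return speedup / float64(cores)
}

// amdahlSerialFraction solves Amdahl's law, S = 1 / (f + (1-f)/N),
// for the serial fraction f. It requires cores > 1.
func amdahlSerialFraction(speedup float64, cores int) float64 {
	n := float64(cores)
	return (n/speedup - 1) / (n - 1)
}

func main() {
	// 4.5 and 10 are the approximate figures quoted in the summary.
	fmt.Printf("efficiency: %.2f\n", parallelEfficiency(4.5, 10))
	fmt.Printf("implied serial fraction: %.2f\n", amdahlSerialFraction(4.5, 10))
}
```

With these inputs the efficiency is about 0.45 and the implied serial fraction about 0.14, so under this model roughly one seventh of the work does not benefit from extra cores.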
Benchmark Environment
github.com/elastic/beats/v7/x-pack/libbeat/reader/parquet
goos: darwin, goarch: arm64
1. Large File Processing – taxi_2023_1.parquet
Single large Parquet file (47.7 MB), batch size = 10,000.
Analysis:
v18 is ~2.56× faster, but uses ~2.63× more memory and ~1.51× more allocations.
2. Small File Processing – vpc_flow.parquet
Smaller file (33 KB), batch size = 1,000, at 4 CPU cores.
Analysis:
v18 is ~3.15× slower, uses ~2.46× more memory, and makes ~3.43×
more allocations.
v18 Library – CPU Scaling & Parallelism
Scenario: Processing multiple files in parallel (batch size = 1,000).
Benchmark: BenchmarkReadParquet/Process_multiple_files_parallelly_in_batches_of_1000
Serial vs. Parallel Processing
Scenario: Processing a single file (batch size = 1,000).
Benchmark: Read_a_single_row_from_a_single_file...
Analysis:
Parallel implementation is ~2.28× faster, likely due to parallelizing
row-group reads.
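The idea of parallelizing row-group reads can be sketched with plain goroutines. This is a simplified illustration and not the reader's or the Arrow library's actual implementation: `rowGroup` is a hypothetical stand-in for a Parquet row group, and summing values stands in for decoding.

```go
package main

import (
	"fmt"
	"sync"
)

// rowGroup is a hypothetical stand-in for one Parquet row group.
type rowGroup []int64

// sumSerial processes row groups one after another.
func sumSerial(groups []rowGroup) int64 {
	var total int64
	for _, g := range groups {
		for _, v := range g {
			total += v
		}
	}
	return total
}

// sumParallel processes each row group in its own goroutine and
// combines the per-group results once all goroutines have finished.
func sumParallel(groups []rowGroup) int64 {
	partial := make([]int64, len(groups))
	var wg sync.WaitGroup
	for i, g := range groups {
		wg.Add(1)
		go func(i int, g rowGroup) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			for _, v := range g {
				partial[i] += v
			}
		}(i, g)
	}
	wg.Wait()

	var total int64
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	groups := []rowGroup{{1, 2, 3}, {4, 5}, {6}}
	fmt.Println(sumSerial(groups), sumParallel(groups))
}
```

Both functions return the same result. Only the parallel version can spread the per-group work over several cores, which is consistent with the gain being attributed to row-group level parallelism.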
Memory & Allocation Analysis
Memory remains stable across CPU counts but grows with batch size.
Scenario: Processing single files (Serial, 4 cores)
Analysis:
A 10× larger batch increases memory ~2.5× but barely changes allocation count,
indicating efficient buffer reuse.
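One way such a pattern can arise is a reader that allocates a single batch buffer and reuses it for every batch: the buffer size scales with the batch size, while the number of allocations stays constant. The sketch below illustrates that idea only. Whether v18 behaves this way internally is an inference from the numbers, and `forEachBatch` is a hypothetical helper.

```go
package main

import "fmt"

// forEachBatch walks rows in batches of at most batchSize, reusing one
// buffer for all batches. It returns the number of batches delivered.
// The batch passed to fn is only valid until fn returns.
func forEachBatch(rows []int, batchSize int, fn func(batch []int)) int {
	if batchSize <= 0 {
		return 0
	}
	// Single allocation: its size grows with batchSize, its count does not.
	buf := make([]int, 0, batchSize)
	batches := 0
	for _, r := range rows {
		buf = append(buf, r)
		if len(buf) == batchSize {
			fn(buf)
			batches++
			buf = buf[:0] // keep the capacity, drop the contents
		}
	}
	// Deliver the final, possibly shorter, batch.
	if len(buf) > 0 {
		fn(buf)
		batches++
	}
	return batches
}

func main() {
	rows := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	n := forEachBatch(rows, 4, func(batch []int) {
		fmt.Println(len(batch))
	})
	fmt.Println("batches:", n)
}
```

With a small input and a large batch size, the buffer is mostly empty yet still fully allocated, which matches the observation that small workloads pay for an oversized batch_size.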
Conclusion:
Performance vs. memory is a trade-off. With v18, batch_size plays a significant role in the memory footprint of smaller workloads:
small workloads with a large batch_size will consume significantly more memory than with v17.