[libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 #45574

Merged
ShourieG merged 12 commits into elastic:main from ShourieG:chore/libbeat/update_parquet_reader
Sep 12, 2025
Conversation

@ShourieG
Contributor

@ShourieG ShourieG commented Jul 28, 2025

Type of change

  • Enhancement

Proposed commit message

 Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs 
 tests as a byproduct of errors introduced with newer storage library versions.

NOTE

Parquet v18 depends on a newer version of the Google Cloud Storage library:

cloud.google.com/go/storage v1.49.0 -> cloud.google.com/go/storage v1.52.0

This upgrade changed the responses seen in the gcs tests: the SDK methods used in some scenarios now return more context on a 404 error. The affected tests have been updated to match this change.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Potentially larger memory footprint when using parquet decoding at a smaller scale.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Much faster processing when using parquet decoding at larger scales. The trade-off is that smaller-scale usage becomes more demanding in terms of memory.

Screenshots

Logs

Analysis:

NOTE: The following summary was generated by feeding the relevant benchmark data into an LLM and was then edited manually.

+----------------------------------------------------+
| Parquet-go Library: v17 vs. v18 Benchmark Analysis |
+----------------------------------------------------+

Summary

  • Massive Speed Improvement for Large Files:
    v18 is ~2.6× faster than v17 for large-scale data processing.
  • Increased Memory Consumption:
    This performance gain comes at the cost of ~2.6× more memory
    usage and ~1.5× more allocations.
  • Performance Regression on Smaller Files:
    For smaller files (e.g., vpc_flow.parquet), v18 is ~3× slower
    and uses ~2.5× more memory.
  • High CPU Scaling:
    v18 scales well with more CPU cores, showing up to ~4.5×
    performance improvement from 1 → 10 cores.
  • Conclusion:
    v18 is best for high-throughput scenarios with ample memory.
    For memory-constrained or small-file workloads, its overhead
    is significant if batch_size is not constrained.

Benchmark Environment

  • Package: github.com/elastic/beats/v7/x-pack/libbeat/reader/parquet
  • Platform: goos: darwin, goarch: arm64
  • CPU: Apple M1 Max
  • Concurrency Levels: 1, 2, 4, 8, 10

1. Large File Processing – taxi_2023_1.parquet

Single large Parquet file (47.7 MB), batch size = 10,000.

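| Version | Cores | Time per Op (ns/op) | Mem per Op (B/op)          | Allocs per Op |
|---------|-------|---------------------|----------------------------|---------------|
| v17     | 10    | 7,113,368,875       | 7,162,300,232 (~7.16 GB)   | 40,872,797    |
| v18     | 10    | 2,779,433,542       | 18,869,783,112 (~18.87 GB) | 61,709,457    |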

Analysis:
v18 is ~2.56× faster, but uses ~2.63× more memory and ~1.51× more allocations.


2. Small File Processing – vpc_flow.parquet

Smaller file (33 KB), batch size = 1,000, at 4 CPU cores.

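| Version | Cores | Time per Op (ns/op) | Mem per Op (B/op)      | Allocs per Op |
|---------|-------|---------------------|------------------------|---------------|
| v17     | 4     | 7,139,663           | 15,266,732 (~15.27 MB) | 55,042        |
| v18     | 4     | 22,460,141          | 37,518,437 (~37.52 MB) | 188,593       |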

Analysis:
v18 is ~3.15× slower, uses ~2.46× more memory, and makes ~3.43×
more allocations.
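
The ratios quoted in the two comparisons above can be re-derived from the raw figures. The following Go sketch does that arithmetic. The type and function names (`BenchResult`, `Ratios`, `Compare`) are made up for this illustration and are not part of the parquet reader package; the numbers are copied from the tables.

```go
package main

import "fmt"

// BenchResult holds the three figures reported per benchmark row.
type BenchResult struct {
	NsPerOp     float64
	BytesPerOp  float64
	AllocsPerOp float64
}

// Ratios expresses v18 relative to v17. A Time ratio below 1 means v18
// is faster; any ratio above 1 means v18 takes or allocates more.
type Ratios struct {
	Time, Mem, Allocs float64
}

// Compare divides each v18 figure by the matching v17 figure.
func Compare(v17, v18 BenchResult) Ratios {
	return Ratios{
		Time:   v18.NsPerOp / v17.NsPerOp,
		Mem:    v18.BytesPerOp / v17.BytesPerOp,
		Allocs: v18.AllocsPerOp / v17.AllocsPerOp,
	}
}

// Figures copied from the large-file and small-file tables.
var (
	largeV17 = BenchResult{NsPerOp: 7113368875, BytesPerOp: 7162300232, AllocsPerOp: 40872797}
	largeV18 = BenchResult{NsPerOp: 2779433542, BytesPerOp: 18869783112, AllocsPerOp: 61709457}
	smallV17 = BenchResult{NsPerOp: 7139663, BytesPerOp: 15266732, AllocsPerOp: 55042}
	smallV18 = BenchResult{NsPerOp: 22460141, BytesPerOp: 37518437, AllocsPerOp: 188593}
)

func main() {
	large := Compare(largeV17, largeV18)
	small := Compare(smallV17, smallV18)
	// For the large file the time ratio is inverted to express a speedup.
	fmt.Printf("large: %.2fx faster, %.2fx memory, %.2fx allocs\n",
		1/large.Time, large.Mem, large.Allocs)
	fmt.Printf("small: %.2fx slower, %.2fx memory, %.2fx allocs\n",
		small.Time, small.Mem, small.Allocs)
}
```

Note that B/op is the total number of bytes allocated per benchmark operation, so these ratios compare allocation volume rather than peak resident memory.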


v18 Library – CPU Scaling & Parallelism

Scenario: Processing multiple files in parallel (batch size = 1,000).
Benchmark: BenchmarkReadParquet/Process_multiple_files_parallelly_in_batches_of_1000

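| Cores | Time per Op (ns/op) | Speedup vs. 1 Core |
|-------|---------------------|--------------------|
| 1     | 41,729,156          | 1.00×              |
| 2     | 20,918,817          | 1.99×              |
| 4     | 12,080,647          | 3.45×              |
| 8     | 9,609,640           | 4.34×              |
| 10    | 9,251,827           | 4.51×              |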
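
To make the scaling figures above easier to interpret, the sketch below recomputes the speedup column and also derives a parallel-efficiency figure (speedup divided by the increase in core count). Efficiency is not reported in the original analysis; it is added here only as a reading aid. `Sample` and `Scaling` are hypothetical names for this example.

```go
package main

import "fmt"

// Sample is one row of the scaling table: core count and time per op.
type Sample struct {
	Cores   int
	NsPerOp float64
}

// Scaling returns the speedup of s over base, and the parallel
// efficiency: speedup divided by the factor by which cores increased.
func Scaling(base, s Sample) (speedup, efficiency float64) {
	speedup = base.NsPerOp / s.NsPerOp
	efficiency = speedup / (float64(s.Cores) / float64(base.Cores))
	return speedup, efficiency
}

// Figures copied from the CPU scaling table.
var samples = []Sample{
	{Cores: 1, NsPerOp: 41729156},
	{Cores: 2, NsPerOp: 20918817},
	{Cores: 4, NsPerOp: 12080647},
	{Cores: 8, NsPerOp: 9609640},
	{Cores: 10, NsPerOp: 9251827},
}

func main() {
	base := samples[0]
	for _, s := range samples {
		speedup, eff := Scaling(base, s)
		fmt.Printf("%2d cores: %.2fx speedup, %.0f%% efficiency\n",
			s.Cores, speedup, eff*100)
	}
}
```

Read this way, the table shows near-linear scaling up to 2 cores and diminishing returns beyond 4, which is consistent with the "up to ~4.5×" figure in the summary.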

Serial vs. Parallel Processing

Scenario: Processing a single file (batch size = 1,000).
Benchmark: Read_a_single_row_from_a_single_file...

| Mode     | Cores | Time per Op (ns/op) |
|----------|-------|---------------------|
| Serial   | 10    | 2,007,353           |
| Parallel | 10    | 880,824             |

Analysis:
Parallel implementation is ~2.28× faster, likely due to parallelizing
row-group reads.


Memory & Allocation Analysis

Memory remains stable across CPU counts but grows with batch size.

Scenario: Processing single files (Serial, 4 cores)

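| Benchmark                 | Batch Size | Mem per Op (B/op) | Allocs per Op |
|---------------------------|------------|-------------------|---------------|
| ...in_batches_of_1000-4   | 1,000      | 5,537,257         | 22,416        |
| ...in_batches_of_10000-4  | 10,000     | 13,670,323        | 22,460        |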

Analysis:
A 10× larger batch increases memory ~2.5× but barely changes allocation count,
indicating efficient buffer reuse.
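
One rough way to reason about the batch-size effect is to fit a straight line through the two measurements above and split B/op into a fixed part and a part that grows with batch size. This is a sketch under a strong assumption: two data points cannot confirm that the relationship is linear, and B/op counts total bytes allocated per operation rather than peak memory. `Point`, `LinearFit` and `Estimate` are hypothetical names for this example.

```go
package main

import "fmt"

// Point is one measurement: configured batch size and bytes per op.
type Point struct {
	BatchSize  float64
	BytesPerOp float64
}

// LinearFit returns the intercept (fixed bytes) and slope (bytes per
// unit of batch size) of the line through two measurements.
// Linearity is an assumption, not something the benchmark establishes.
func LinearFit(a, b Point) (fixed, perRow float64) {
	perRow = (b.BytesPerOp - a.BytesPerOp) / (b.BatchSize - a.BatchSize)
	fixed = a.BytesPerOp - perRow*a.BatchSize
	return fixed, perRow
}

// Estimate evaluates the fitted line at a given batch size.
func Estimate(fixed, perRow, batchSize float64) float64 {
	return fixed + perRow*batchSize
}

func main() {
	// Figures copied from the serial, 4-core table.
	p1 := Point{BatchSize: 1000, BytesPerOp: 5537257}
	p2 := Point{BatchSize: 10000, BytesPerOp: 13670323}

	fixed, perRow := LinearFit(p1, p2)
	fmt.Printf("fixed part: ~%.1f MB, growth: ~%.0f B per unit of batch size\n",
		fixed/1e6, perRow)
	// Interpolated value only; not a measured result.
	fmt.Printf("interpolated B/op at batch size 5000: ~%.1f MB\n",
		Estimate(fixed, perRow, 5000)/1e6)
}
```

Under that linear assumption, the fit gives roughly 4.6 MB that does not depend on batch size plus roughly 900 bytes per unit of batch size for this file. Any value between or beyond the two measured points is an estimate and would need its own benchmark run to confirm.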


Conclusion:
Performance and memory are a trade-off. With v18, batch_size plays a significant role in the memory footprint of smaller workloads:
small workloads with a large batch_size will consume significantly more memory than they did with v17.

@ShourieG ShourieG requested review from a team as code owners July 28, 2025 10:52
@botelastic botelastic Bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jul 28, 2025
@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify Bot commented Jul 28, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @ShourieG? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@ShourieG ShourieG added the Team:Security-Service Integrations Security Service Integrations Team label Jul 28, 2025
@botelastic botelastic Bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 28, 2025
@elasticmachine
Contributor

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@ShourieG ShourieG added libbeat needs_team Indicates that the issue/PR needs a Team:* label libbeat:reader labels Jul 28, 2025
@botelastic botelastic Bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 28, 2025
@ShourieG ShourieG added needs_team Indicates that the issue/PR needs a Team:* label and removed libbeat labels Jul 28, 2025
@botelastic botelastic Bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 28, 2025
@botelastic

botelastic Bot commented Jul 28, 2025

This pull request doesn't have a Team:<team> label.

@ShourieG ShourieG requested review from efd6 and removed request for efd6 July 28, 2025 10:59
@ShourieG ShourieG requested a review from efd6 July 28, 2025 12:11
@efd6
Contributor

efd6 commented Jul 28, 2025

Note comment in the issue about waiting for v18.4.1.

@mergify
Contributor

mergify Bot commented Jul 28, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b chore/libbeat/update_parquet_reader upstream/chore/libbeat/update_parquet_reader
git merge upstream/main
git push upstream chore/libbeat/update_parquet_reader

@ShourieG
Contributor Author

/test

Contributor

@efd6 efd6 left a comment

🎉

@ShourieG
Contributor Author

ShourieG commented Sep 10, 2025

@efd6, some otel tests are failing atm :(, won't be able to merge until that's resolved. Pending on PR: #46493. Update: PR merged so we should be good

@ShourieG
Contributor Author

Hi @elastic/beats-tech-leads, require a codeowner approval for merging.

@ShourieG ShourieG merged commit b7c5a85 into elastic:main Sep 12, 2025
205 of 208 checks passed
@ShourieG ShourieG deleted the chore/libbeat/update_parquet_reader branch September 12, 2025 05:48
@khushijain21
Contributor

@ShourieG can you please backport this to 8.19, 9.1?

@ShourieG
Contributor Author

ShourieG commented Sep 26, 2025

@ShourieG can you please backport this to 8.19, 9.1?

Generally it’s not recommended to do so unless it’s an absolute necessity. This library update in turn triggers multiple ad hoc updates to other libraries, and some existing dependencies of the gcs input also had to be updated along with their tests. So backporting this can cause issues or break stuff. I don’t want to take that risk.

@ShourieG
Contributor Author

@Mergifyio backport 8.19 9.1

@mergify
Contributor

mergify Bot commented Oct 14, 2025

backport 8.19 9.1

✅ Backports have been created

mergify Bot pushed a commit that referenced this pull request Oct 14, 2025
…er to v18 (#45574)

Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs
 tests as a byproduct of errors introduced with newer storage library versions.

(cherry picked from commit b7c5a85)

# Conflicts:
#	NOTICE.txt
#	go.mod
#	go.sum
mergify Bot pushed a commit that referenced this pull request Oct 14, 2025
…er to v18 (#45574)

Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs
 tests as a byproduct of errors introduced with newer storage library versions.

(cherry picked from commit b7c5a85)

# Conflicts:
#	NOTICE.txt
#	go.mod
#	go.sum
khushijain21 added a commit that referenced this pull request Oct 15, 2025
…ry used in parquet reader to v18 (#47087)

* [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 (#45574)

Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs
 tests as a byproduct of errors introduced with newer storage library versions.

(cherry picked from commit b7c5a85)

# Conflicts:
#	NOTICE.txt
#	go.mod
#	go.sum

* fix conflicts

---------

Co-authored-by: Shourie Ganguly <shourie.ganguly@elastic.co>
Co-authored-by: Khushi Jain <khushi.jain@elastic.co>
khushijain21 pushed a commit that referenced this pull request Nov 6, 2025
…ary used in parquet reader to v18 (#47086)

* [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 (#45574)
Projects

None yet

Development

Successfully merging this pull request may close these issues.

[libbeat][chore] - Update "apache/arrow" parquet library to the latest v18 version in their new go specific repo

5 participants