[libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 by ShourieG · Pull Request #45574 · elastic/beats

ShourieG · 2025-07-28T10:52:02Z

Type of change

Enhancement

Proposed commit message

 Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs 
 tests as a byproduct of errors introduced with newer storage library versions.

NOTE

Parquet v18 depends on a newer version of the Google Cloud Storage library:

cloud.google.com/go/storage v1.49.0 -> cloud.google.com/go/storage v1.52.0

This upgrade changed the responses seen in the gcs tests: in some scenarios the SDK methods now return more context on a 404 error. The affected tests have been updated to match this change.
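As an illustration of why added error context can break tests, here is a minimal sketch using only the Go standard library. The sentinel `errObjectNotExist` and the functions `fetchOld` and `fetchNew` are hypothetical stand-ins, not the real storage SDK API. The sketch assumes the newer error wraps the sentinel, and it does not claim to show how the tests in this PR were actually updated.

```go
package main

import (
	"errors"
	"fmt"
)

// errObjectNotExist is a hypothetical sentinel standing in for an SDK's
// "object not found" error; it is not the real storage library value.
var errObjectNotExist = errors.New("object doesn't exist")

// fetchOld mimics an SDK version that returns the bare sentinel on a 404.
func fetchOld(name string) error {
	return errObjectNotExist
}

// fetchNew mimics an SDK version that wraps the sentinel with extra context.
func fetchNew(name string) error {
	return fmt.Errorf("get %q: status 404: %w", name, errObjectNotExist)
}

// isNotFound matches on the error chain, so added context does not break it.
func isNotFound(err error) bool {
	return errors.Is(err, errObjectNotExist)
}

func main() {
	for _, err := range []error{fetchOld("a.json"), fetchNew("a.json")} {
		// The message text differs between versions; the chain match does not.
		fmt.Printf("%v | exact text match: %t | errors.Is: %t\n",
			err, err.Error() == errObjectNotExist.Error(), isNotFound(err))
	}
}
```

A test that compares the full error string would fail against `fetchNew`, while one that uses `errors.Is` would pass against both.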

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Potentially larger memory footprint when using parquet decoding on smaller workloads.

Author's Checklist

[ ]

How to test this PR locally

Related issues

Closes [libbeat][chore] - Update "apache/arrow" parquet library to the latest v18 version in their new go specific repo #45573

Use cases

Much faster processing when using parquet decoding at larger scales; the trade-off is that smaller-scale usage becomes more demanding in terms of memory.

Screenshots

Logs

Analysis:

NOTE: The following summary was generated by feeding the relevant benchmark data into an LLM and was then edited manually.

+----------------------------------------------------+
| Parquet-go Library: v17 vs. v18 Benchmark Analysis |
+----------------------------------------------------+

Summary

Massive Speed Improvement for Large Files: v18 is ~2.6× faster than v17 for large-scale data processing.
Increased Memory Consumption: This performance gain comes at the cost of ~2.6× more memory usage and ~1.5× more allocations.
Performance Regression on Smaller Files: For smaller files (e.g., vpc_flow.parquet), v18 is ~3× slower and uses ~2.5× more memory.
High CPU Scaling: v18 scales well with more CPU cores, showing up to ~4.5× performance improvement from 1 → 10 cores.
Conclusion: v18 is best for high-throughput scenarios with ample memory. For memory-constrained or small-file workloads, its overhead is significant if batch_size is not constrained.

Benchmark Environment

Package: github.com/elastic/beats/v7/x-pack/libbeat/reader/parquet
Platform: goos: darwin, goarch: arm64
CPU: Apple M1 Max
Concurrency Levels: 1, 2, 4, 8, 10

1. Large File Processing – `taxi_2023_1.parquet`

Single large Parquet file (47.7 MB), batch size = 10,000.

| Version | Cores | Time per Op (ns/op) | Mem per Op (B/op) | Allocs per Op |
|---|---|---|---|---|
| v17 | 10 | 7,113,368,875 | 7,162,300,232 (~7.16 GB) | 40,872,797 |
| v18 | 10 | 2,779,433,542 | 18,869,783,112 (~18.87 GB) | 61,709,457 |

Analysis:
v18 is ~2.56× faster, but uses ~2.63× more memory and ~1.51× more allocations.

2. Small File Processing – `vpc_flow.parquet`

Smaller file (33 KB), batch size = 1,000, at 4 CPU cores.

| Version | Cores | Time per Op (ns/op) | Mem per Op (B/op) | Allocs per Op |
|---|---|---|---|---|
| v17 | 4 | 7,139,663 | 15,266,732 (~15.27 MB) | 55,042 |
| v18 | 4 | 22,460,141 | 37,518,437 (~37.52 MB) | 188,593 |

Analysis:
v18 is ~3.15× slower, uses ~2.46× more memory, and makes ~3.43×
more allocations.
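The ratios quoted in the two analyses can be recomputed from the raw benchmark figures. The short program below is only a worked check of that arithmetic, using the numbers from the large-file and small-file tables. The helper `ratio` is introduced here for illustration.

```go
package main

import "fmt"

// ratio returns how many times larger a is than b.
func ratio(a, b float64) float64 { return a / b }

func main() {
	// Large file (taxi_2023_1.parquet, 10 cores): v17 time over v18 time,
	// then v18 over v17 for memory and allocations.
	fmt.Printf("large: %.2fx faster, %.2fx memory, %.2fx allocs\n",
		ratio(7113368875, 2779433542),
		ratio(18869783112, 7162300232),
		ratio(61709457, 40872797))

	// Small file (vpc_flow.parquet, 4 cores): v18 over v17 for all three.
	fmt.Printf("small: %.2fx slower, %.2fx memory, %.2fx allocs\n",
		ratio(22460141, 7139663),
		ratio(37518437, 15266732),
		ratio(188593, 55042))
}
```

Rounded to two decimals, the results agree with the ~2.56×, ~2.63×, ~1.51× and ~3.15×, ~2.46×, ~3.43× figures above.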

v18 Library – CPU Scaling & Parallelism

Scenario: Processing multiple files in parallel (batch size = 1,000).
Benchmark: BenchmarkReadParquet/Process_multiple_files_parallelly_in_batches_of_1000

| Cores | Time per Op (ns/op) | Speedup vs. 1 Core |
|---|---|---|
| 1 | 41,729,156 | 1.00× |
| 2 | 20,918,817 | 1.99× |
| 4 | 12,080,647 | 3.45× |
| 8 | 9,609,640 | 4.34× |
| 10 | 9,251,827 | 4.51× |
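The speedup column is the single-core time divided by the time at each core count. The sketch below recomputes it from the table and also prints parallel efficiency (speedup divided by core count). Efficiency is a derived metric added here for illustration and is not a figure reported in the PR.

```go
package main

import "fmt"

// sample is one row of the scaling table: core count and measured ns/op.
type sample struct {
	cores int
	nsOp  float64
}

// speedup is the single-core time divided by the time at a given core count.
func speedup(base, t float64) float64 { return base / t }

// efficiency is speedup divided by the core count; 1.0 means linear scaling.
func efficiency(s float64, cores int) float64 { return s / float64(cores) }

func main() {
	rows := []sample{
		{1, 41729156}, {2, 20918817}, {4, 12080647}, {8, 9609640}, {10, 9251827},
	}
	base := rows[0].nsOp
	for _, r := range rows {
		s := speedup(base, r.nsOp)
		fmt.Printf("%2d cores: speedup %.2fx, efficiency %.0f%%\n",
			r.cores, s, 100*efficiency(s, r.cores))
	}
}
```

By this measure scaling is close to linear at 2 cores and drops to roughly 45% efficiency at 10 cores, which matches the flattening visible between the 8-core and 10-core rows.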

Serial vs. Parallel Processing

Scenario: Processing a single file (batch size = 1,000).
Benchmark: Read_a_single_row_from_a_single_file...

| Mode | Cores | Time per Op (ns/op) |
|---|---|---|
| Serial | 10 | 2,007,353 |
| Parallel | 10 | 880,824 |

Analysis:
Parallel implementation is ~2.28× faster, likely due to parallelizing
row-group reads.
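For readers unfamiliar with the idea, parallelizing row-group reads can be pictured as below. This is a generic, standard-library-only sketch and not the library's or the reader's actual implementation. The `rowGroup` type and the functions `sumGroup` and `readParallel` are hypothetical, and summing integers stands in for decoding.

```go
package main

import (
	"fmt"
	"sync"
)

// rowGroup is a hypothetical stand-in for one Parquet row group.
type rowGroup struct{ rows []int }

// sumGroup stands in for decoding a row group; here it just sums the values.
func sumGroup(g rowGroup) int {
	total := 0
	for _, v := range g.rows {
		total += v
	}
	return total
}

// readParallel processes every row group in its own goroutine and
// returns the per-group results in the original order.
func readParallel(groups []rowGroup) []int {
	out := make([]int, len(groups))
	var wg sync.WaitGroup
	for i, g := range groups {
		wg.Add(1)
		go func(i int, g rowGroup) {
			defer wg.Done()
			out[i] = sumGroup(g) // each goroutine writes only its own slot
		}(i, g)
	}
	wg.Wait()
	return out
}

func main() {
	groups := []rowGroup{{[]int{1, 2, 3}}, {[]int{4, 5}}, {[]int{6}}}
	fmt.Println(readParallel(groups)) // prints [6 9 6]
}
```

Because row groups are independent units, each can be handled without locking as long as every goroutine writes to its own output slot.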

Memory & Allocation Analysis

Memory remains stable across CPU counts but grows with batch size.

Scenario: Processing single files (Serial, 4 cores)

| Benchmark | Batch Size | Mem per Op (B/op) | Allocs per Op |
|---|---|---|---|
| ...in_batches_of_1000-4 | 1,000 | 5,537,257 | 22,416 |
| ...in_batches_of_10000-4 | 10,000 | 13,670,323 | 22,460 |

Analysis:
A 10× larger batch increases memory ~2.5× but barely changes allocation count,
indicating efficient buffer reuse.
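The pattern of memory growing with batch size while the allocation count stays flat is what buffer reuse tends to look like. The following is a minimal sketch of that idea, assuming a hypothetical `batchReader` that allocates one batch-sized buffer up front. It is not taken from the arrow library or from the beats reader.

```go
package main

import "fmt"

// batchReader is a hypothetical reader that hands out rows in fixed-size
// batches, reusing one buffer so the allocation count does not depend on
// how many batches are read.
type batchReader struct {
	src []int64 // pretend column data
	pos int     // index of the next unread row
	buf []int64 // allocated once, sized to the batch
}

func newBatchReader(src []int64, batchSize int) *batchReader {
	return &batchReader{src: src, buf: make([]int64, batchSize)}
}

// next copies up to len(buf) rows into the shared buffer and returns the
// filled part; the result is only valid until the following call to next.
func (r *batchReader) next() []int64 {
	n := copy(r.buf, r.src[r.pos:])
	r.pos += n
	return r.buf[:n]
}

func main() {
	src := make([]int64, 25000)
	for _, size := range []int{1000, 10000} {
		r := newBatchReader(src, size)
		batches := 0
		for len(r.next()) > 0 {
			batches++
		}
		fmt.Printf("batch size %d: %d batches, one buffer of %d bytes\n",
			size, batches, 8*cap(r.buf))
	}
}
```

In this sketch a 10× larger batch means a 10× larger buffer but still a single buffer allocation, which is the qualitative shape of the numbers in the table.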

Conclusion:
Performance vs. memory is a trade-off. With v18, batch_size plays a significant role in the memory footprint of smaller workloads:
a small workload with a large batch_size will consume significantly more memory than it did with v17.

github-actions · 2025-07-28T10:52:11Z

🤖 GitHub comments

Just comment with:

run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify · 2025-07-28T10:52:40Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @ShourieG? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

elasticmachine · 2025-07-28T10:53:25Z

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

botelastic · 2025-07-28T10:53:35Z

This pull request doesn't have a Team:<team> label.

botelastic · 2025-07-28T10:53:38Z

This pull request doesn't have a Team:<team> label.

efd6 · 2025-07-28T20:57:12Z

Note comment in the issue about waiting for v18.4.1.

mergify · 2025-07-28T22:15:10Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b chore/libbeat/update_parquet_reader upstream/chore/libbeat/update_parquet_reader
git merge upstream/main
git push upstream chore/libbeat/update_parquet_reader

ShourieG · 2025-09-10T02:50:47Z

/test

efd6

🎉

ShourieG · 2025-09-10T09:18:47Z

@efd6, some otel tests are failing atm :(, won't be able to merge until that's resolved. Pending on PR: #46493. Update: PR merged so we should be good

ShourieG · 2025-09-10T12:04:19Z

Hi @elastic/beats-tech-leads, require a codeowner approval for merging.

khushijain21 · 2025-09-26T12:50:15Z

@ShourieG can you please backport this to 8.19, 9.1?

ShourieG · 2025-09-26T14:12:19Z

> @ShourieG can you please backport this to 8.19, 9.1?

Generally it’s not recommended to do so unless it’s absolutely necessary. This library update in turn pulls in multiple ad hoc updates to other libraries, and some existing dependencies of the gcs input also had to be updated along with their tests. So backporting this could cause issues or break things. I don’t want to take that risk.

ShourieG · 2025-10-14T13:02:21Z

@Mergifyio backport 8.19 9.1

mergify · 2025-10-14T13:02:34Z

backport 8.19 9.1

✅ Backports have been created

Details

#47086 [8.19](backport #45574) [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 has been created for branch 8.19 but encountered conflicts
#47087 [9.1](backport #45574) [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 has been created for branch 9.1 but encountered conflicts

…er to v18 (#45574) Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs tests as a byproduct of errors introduced with newer storage library versions. (cherry picked from commit b7c5a85) # Conflicts: # NOTICE.txt # go.mod # go.sum

…ry used in parquet reader to v18 (#47087) * [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 (#45574) Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs tests as a byproduct of errors introduced with newer storage library versions. (cherry picked from commit b7c5a85) # Conflicts: # NOTICE.txt # go.mod # go.sum * fix conflicts --------- Co-authored-by: Shourie Ganguly <shourie.ganguly@elastic.co> Co-authored-by: Khushi Jain <khushi.jain@elastic.co>

…ary used in parquet reader to v18 (#47086) * [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 (#45574)

updated apache/arrow library to v18

2477fce

ShourieG requested review from a team as code owners July 28, 2025 10:52

botelastic Bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jul 28, 2025

mergify Bot assigned ShourieG Jul 28, 2025

ShourieG added the Team:Security-Service Integrations Security Service Integrations Team label Jul 28, 2025

botelastic Bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 28, 2025

ShourieG added libbeat needs_team Indicates that the issue/PR needs a Team:* label libbeat:reader labels Jul 28, 2025

botelastic Bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 28, 2025

ShourieG added needs_team Indicates that the issue/PR needs a Team:* label and removed libbeat labels Jul 28, 2025

botelastic Bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 28, 2025

ShourieG added the enhancement label Jul 28, 2025

updated changelog

d64e2ed

ShourieG requested review from efd6 and removed request for efd6 July 28, 2025 10:59

ShourieG added 2 commits July 28, 2025 17:17

updated notice

f2a065c

updated failing gcs tests due to indirect dependency updates

e116320

ShourieG requested a review from efd6 July 28, 2025 12:11

ShourieG added 2 commits September 9, 2025 16:55

resolved merge conflicts

a977b6c

updated go-mod and NOTICE after lib update

0a61060

efd6 approved these changes Sep 10, 2025

Merge remote-tracking branch 'upstream/main' into chore/libbeat/update_parquet_reader

115007a

ShourieG added 2 commits September 10, 2025 14:51

Merge remote-tracking branch 'upstream/main' into chore/libbeat/update_parquet_reader

ee3691c

Merge remote-tracking branch 'upstream/main' into chore/libbeat/update_parquet_reader

25a4d45

cmacknz approved these changes Sep 10, 2025

ShourieG added 3 commits September 11, 2025 11:19

resolved merge conflicts

2b57c38

updated NOTICE and go mod after merge

cefc87d

Merge remote-tracking branch 'upstream/main' into chore/libbeat/update_parquet_reader

73a6cf2

ShourieG merged commit b7c5a85 into elastic:main Sep 12, 2025
205 of 208 checks passed

ShourieG deleted the chore/libbeat/update_parquet_reader branch September 12, 2025 05:48

khushijain21 mentioned this pull request Oct 7, 2025

[8.19](backport #46723) Add tests for beatsauth extension 1/2 #46801

Closed

6 tasks

khushijain21 mentioned this pull request Oct 14, 2025

[9.1](backport #46428) Remove settings on ES exporter config that no longer function #46736

Merged

6 tasks

mergify Bot mentioned this pull request Oct 14, 2025

[8.19](backport #45574) [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 #47086

Merged

6 tasks

mergify Bot mentioned this pull request Oct 14, 2025

[9.1](backport #45574) [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 #47087

Merged

6 tasks

khushijain21 pushed a commit that referenced this pull request Nov 6, 2025

[8.19](backport #45574) [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 (#47086) * [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 (#45574)

b290590

5 participants