Implement initial version of Rust batch support #229

Merged
wsxrdv merged 4 commits into google:main from wsxrdv:rust_batched
Aug 29, 2025

Conversation

wsxrdv (Collaborator) commented Aug 29, 2025

This commit sets up the API between Rust and Python. This is an initial and suboptimal implementation.

#200
gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @wsxrdv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an initial implementation for batching support within the Rust data processing pipeline, enabling more efficient data consumption, particularly for machine learning workloads. It establishes the API for interaction between Rust and Python for this new batching capability, although the current implementation is noted as suboptimal.

Highlights

  • New Rust Batching Implementation: Introduced a new BatchIterator in Rust to efficiently group data examples into batches, optimizing data consumption for machine learning workflows.
  • Python Integration: Exposed the Rust batching functionality to Python via pyo3 with BatchedRustIter and RustBatchedGenerator, enabling seamless interaction between the two languages.
  • Performance Benchmarking: Added a new benchmark (batch_iterator_benchmark) to measure the performance of the Rust batch iterator, ensuring performance tracking for the new feature.
  • ExampleIterator Alignment: Modified the existing ExampleIterator to align with the new batching approach, specifically in how ShardProgress is handled, ensuring compatibility and proper data flow.
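The batching flow described in the highlights can be sketched as a plain iterator adapter. This is a minimal sketch, not the PR's actual code: the `BatchIterator` name matches the highlight, but the struct, its fields, and the `main` below are illustrative and operate on plain `Vec<u8>` examples rather than sedpack's real types.

```rust
// Illustrative sketch: wrap any iterator of examples and yield
// fixed-size batches (the final batch may be shorter). Names are
// hypothetical; the real BatchIterator lives in sedpack's Rust crate
// and is exposed to Python via pyo3.
struct BatchIterator<I: Iterator> {
    inner: I,
    batch_size: usize,
}

impl<I: Iterator> Iterator for BatchIterator<I> {
    type Item = Vec<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // Pull up to batch_size examples from the underlying iterator.
        let batch: Vec<I::Item> =
            self.inner.by_ref().take(self.batch_size).collect();
        if batch.is_empty() { None } else { Some(batch) }
    }
}

fn main() {
    // Ten 4-byte "examples" batched four at a time -> sizes 4, 4, 2.
    let examples = (0u8..10).map(|i| vec![i; 4]);
    let batches: Vec<_> =
        BatchIterator { inner: examples, batch_size: 4 }.collect();
    assert_eq!(
        batches.iter().map(Vec::len).collect::<Vec<_>>(),
        vec![4, 4, 2]
    );
}
```

Keeping the grouping on the Rust side means Python sees one object per batch instead of one per example, which is where the benchmark below looks for the win.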
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@wsxrdv wsxrdv mentioned this pull request Aug 29, 2025
# limitations under the License.
"""Build and load tensorFlow dataset Record wrapper"""

# pylint: disable=invalid-name
wsxrdv (Collaborator, Author) commented on the hunk above.

values = {
"fixed": np.random.random((data_points, 138)).astype(dtype),
"fixed_2d": np.random.random((data_points, 3, 5)).astype(dtype),
# TODO(reintroduce) when https://github.com/google/sedpack/pull/227 is
wsxrdv (Collaborator, Author) commented:

Waiting for #227

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces an initial implementation for batched data iteration in Rust, exposing it to Python. This is a valuable addition for performance. The overall approach of handling batching in Rust and using pyo3 for the Python interface is sound. However, the implementation has some critical issues that need to be addressed. My review focuses on improving error handling, correctness, and performance. Specifically, I've pointed out unsafe panics in the Rust-Python boundary that should be converted to Python exceptions, a critical bug in the Python data decoding logic for batched attributes, and some opportunities for performance and code style improvements in the Rust implementation. Addressing these points will make the new feature more robust and efficient.

Comment on lines +63 to +77
true => BatchedAttribute::Static {
    data: numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
        cache.iter().flat_map(|e| e[attribute_index].clone().into_iter()),
    ),
},
false => BatchedAttribute::Dynamic {
    data: cache
        .iter()
        .map(|e| {
            numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
                e[attribute_index].clone(),
            )
        })
        .collect(),
},
gemini-code-assist bot (Contributor) commented, severity high:

Cloning the inner Vec<u8> for each attribute is inefficient and can be avoided. You can iterate over a reference to the vector (&e[attribute_index]) instead, which will improve performance by reducing allocations and copying.

Suggested change
-true => BatchedAttribute::Static {
-    data: numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
-        cache.iter().flat_map(|e| e[attribute_index].clone().into_iter()),
-    ),
-},
-false => BatchedAttribute::Dynamic {
-    data: cache
-        .iter()
-        .map(|e| {
-            numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
-                e[attribute_index].clone(),
-            )
-        })
-        .collect(),
-},
+true => BatchedAttribute::Static {
+    data: numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
+        cache.iter().flat_map(|e| &e[attribute_index]),
+    ),
+},
+false => BatchedAttribute::Dynamic {
+    data: cache
+        .iter()
+        .map(|e| {
+            numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
+                &e[attribute_index],
+            )
+        })
+        .collect(),
+},

wsxrdv (Collaborator) replied:

E0271
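The `E0271` reply points at a type mismatch in the bot's suggestion: iterating `&Vec<u8>` yields `&u8` items, while `Array::<u8, Ix1>::from_iter` collects `u8`. A minimal stand-in, using plain `Vec` instead of `numpy::ndarray` and a hypothetical `flatten_without_clone` helper, shows how `.iter().copied()` avoids both the wholesale clone and the mismatch:

```rust
// Why the suggestion fails to compile (E0271): `&e[attribute_index]`
// iterates by reference, producing &u8 where u8 is expected.
// `.iter().copied()` copies each byte instead of cloning the whole
// Vec, keeping the reviewer's intent without the type error.
fn flatten_without_clone(cache: &[Vec<Vec<u8>>], attribute_index: usize) -> Vec<u8> {
    cache
        .iter()
        // .copied() turns the &u8 items into u8, one byte at a time.
        .flat_map(|e| e[attribute_index].iter().copied())
        .collect()
}

fn main() {
    // Two cached examples, each with two attributes.
    let cache = vec![
        vec![vec![1u8, 2], vec![9u8]],
        vec![vec![3u8, 4], vec![8u8]],
    ];
    assert_eq!(flatten_without_clone(&cache, 0), vec![1, 2, 3, 4]);
}
```

Whether sedpack ultimately used `.iter().copied()` or another fix is not shown in this thread; the sketch only illustrates the compile error behind the one-word reply.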

@github-actions

🐰 Bencher Report

Branch: rust_batched
Testbed: ubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

Click to view all benchmark results

| Benchmark | Latency (ms) | Threshold |
| --- | --- | --- |
| BatchIterator | 894.11 | ⚠️ none |
| ExampleIterator | 221.33 | ⚠️ none |
| parallel_map | 107.41 | ⚠️ none |
🐰 View full continuous benchmarking report in Bencher

@kralka kralka requested a review from ia0 August 29, 2025 14:30
ia0
ia0 previously approved these changes Aug 29, 2025
@wsxrdv wsxrdv added this pull request to the merge queue Aug 29, 2025
Merged via the queue into google:main with commit 531a8cb Aug 29, 2025
86 checks passed
3 participants