Implement initial version of Rust batch support #229

Merged
wsxrdv merged 4 commits into google:main from wsxrdv:rust_batched
Aug 29, 2025

Conversation

wsxrdv (Collaborator) commented Aug 29, 2025

This commit sets up the API between Rust and Python. This is an initial and suboptimal implementation.

#200
gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @wsxrdv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an initial implementation for batching support within the Rust data processing pipeline, enabling more efficient data consumption, particularly for machine learning workloads. It establishes the API for interaction between Rust and Python for this new batching capability, although the current implementation is noted as suboptimal.

Highlights

  • New Rust Batching Implementation: Introduced a new BatchIterator in Rust to efficiently group data examples into batches, optimizing data consumption for machine learning workflows.
  • Python Integration: Exposed the Rust batching functionality to Python via pyo3 with BatchedRustIter and RustBatchedGenerator, enabling seamless interaction between the two languages.
  • Performance Benchmarking: Added a new benchmark (batch_iterator_benchmark) to measure the performance of the Rust batch iterator, ensuring performance tracking for the new feature.
  • ExampleIterator Alignment: Modified the existing ExampleIterator to align with the new batching approach, specifically in how ShardProgress is handled, ensuring compatibility and proper data flow.
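The batching flow described in the highlights can be sketched as a plain iterator adapter. This is a minimal sketch, not the PR's actual code: the `BatchIterator` name matches the highlight, but the struct, its fields, and the `main` below are illustrative and operate on plain `Vec<u8>` examples rather than sedpack's real types.

```rust
// Illustrative sketch: wrap any iterator of examples and yield
// fixed-size batches (the final batch may be shorter). Names are
// hypothetical; the real BatchIterator lives in sedpack's Rust crate
// and is exposed to Python via pyo3.
struct BatchIterator<I: Iterator> {
    inner: I,
    batch_size: usize,
}

impl<I: Iterator> Iterator for BatchIterator<I> {
    type Item = Vec<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // Pull up to batch_size examples from the underlying iterator.
        let batch: Vec<I::Item> =
            self.inner.by_ref().take(self.batch_size).collect();
        if batch.is_empty() { None } else { Some(batch) }
    }
}

fn main() {
    // Ten 4-byte "examples" batched four at a time -> sizes 4, 4, 2.
    let examples = (0u8..10).map(|i| vec![i; 4]);
    let batches: Vec<_> =
        BatchIterator { inner: examples, batch_size: 4 }.collect();
    assert_eq!(
        batches.iter().map(Vec::len).collect::<Vec<_>>(),
        vec![4, 4, 2]
    );
}
```

Keeping the grouping on the Rust side means Python sees one object per batch instead of one per example, which is where the benchmark below looks for the win.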
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@wsxrdv wsxrdv mentioned this pull request Aug 29, 2025
# limitations under the License.
"""Build and load tensorFlow dataset Record wrapper"""

# pylint: disable=invalid-name
wsxrdv (Collaborator, Author) commented on the hunk above.

values = {
"fixed": np.random.random((data_points, 138)).astype(dtype),
"fixed_2d": np.random.random((data_points, 3, 5)).astype(dtype),
# TODO(reintroduce) when https://github.com/google/sedpack/pull/227 is
wsxrdv (Collaborator, Author) commented:

Waiting for #227

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces an initial implementation for batched data iteration in Rust, exposing it to Python. This is a valuable addition for performance. The overall approach of handling batching in Rust and using pyo3 for the Python interface is sound. However, the implementation has some critical issues that need to be addressed. My review focuses on improving error handling, correctness, and performance. Specifically, I've pointed out unsafe panics in the Rust-Python boundary that should be converted to Python exceptions, a critical bug in the Python data decoding logic for batched attributes, and some opportunities for performance and code style improvements in the Rust implementation. Addressing these points will make the new feature more robust and efficient.

Comment on lines +63 to +77
true => BatchedAttribute::Static {
    data: numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
        cache.iter().flat_map(|e| e[attribute_index].clone().into_iter()),
    ),
},
false => BatchedAttribute::Dynamic {
    data: cache
        .iter()
        .map(|e| {
            numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
                e[attribute_index].clone(),
            )
        })
        .collect(),
},
gemini-code-assist bot (Contributor) commented, severity high:

Cloning the inner Vec<u8> for each attribute is inefficient and can be avoided. You can iterate over a reference to the vector (&e[attribute_index]) instead, which will improve performance by reducing allocations and copying.

Suggested change
-true => BatchedAttribute::Static {
-    data: numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
-        cache.iter().flat_map(|e| e[attribute_index].clone().into_iter()),
-    ),
-},
-false => BatchedAttribute::Dynamic {
-    data: cache
-        .iter()
-        .map(|e| {
-            numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
-                e[attribute_index].clone(),
-            )
-        })
-        .collect(),
-},
+true => BatchedAttribute::Static {
+    data: numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
+        cache.iter().flat_map(|e| &e[attribute_index]),
+    ),
+},
+false => BatchedAttribute::Dynamic {
+    data: cache
+        .iter()
+        .map(|e| {
+            numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
+                &e[attribute_index],
+            )
+        })
+        .collect(),
+},

wsxrdv (Collaborator) replied:

E0271
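The `E0271` reply points at a type mismatch in the bot's suggestion: iterating `&Vec<u8>` yields `&u8` items, while `Array::<u8, Ix1>::from_iter` collects `u8`. A minimal stand-in, using plain `Vec` instead of `numpy::ndarray` and a hypothetical `flatten_without_clone` helper, shows how `.iter().copied()` avoids both the wholesale clone and the mismatch:

```rust
// Why the suggestion fails to compile (E0271): `&e[attribute_index]`
// iterates by reference, producing &u8 where u8 is expected.
// `.iter().copied()` copies each byte instead of cloning the whole
// Vec, keeping the reviewer's intent without the type error.
fn flatten_without_clone(cache: &[Vec<Vec<u8>>], attribute_index: usize) -> Vec<u8> {
    cache
        .iter()
        // .copied() turns the &u8 items into u8, one byte at a time.
        .flat_map(|e| e[attribute_index].iter().copied())
        .collect()
}

fn main() {
    // Two cached examples, each with two attributes.
    let cache = vec![
        vec![vec![1u8, 2], vec![9u8]],
        vec![vec![3u8, 4], vec![8u8]],
    ];
    assert_eq!(flatten_without_clone(&cache, 0), vec![1, 2, 3, 4]);
}
```

Whether sedpack ultimately used `.iter().copied()` or another fix is not shown in this thread; the sketch only illustrates the compile error behind the one-word reply.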

@github-actions

🐰 Bencher Report

Branch: rust_batched
Testbed: ubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

Click to view all benchmark results

| Benchmark | Latency (ms) | Threshold |
| --- | --- | --- |
| BatchIterator | 894.11 | ⚠️ none |
| ExampleIterator | 221.33 | ⚠️ none |
| parallel_map | 107.41 | ⚠️ none |
🐰 View full continuous benchmarking report in Bencher

@kralka kralka requested a review from ia0 August 29, 2025 14:30
ia0
ia0 previously approved these changes Aug 29, 2025
@wsxrdv wsxrdv added this pull request to the merge queue Aug 29, 2025
Merged via the queue into google:main with commit 531a8cb Aug 29, 2025
86 checks passed
3 participants