Implement initial version of Rust batch support #229
Conversation
This commit sets up the API between Rust and Python. The implementation is suboptimal. google#200
Summary of Changes
Hello @wsxrdv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces an initial implementation for batching support within the Rust data processing pipeline, enabling more efficient data consumption, particularly for machine learning workloads. It establishes the API for interaction between Rust and Python for this new batching capability, although the current implementation is noted as suboptimal.
Highlights
- **New Rust Batching Implementation:** Introduced a new `BatchIterator` in Rust to efficiently group data examples into batches, optimizing data consumption for machine learning workflows.
- **Python Integration:** Exposed the Rust batching functionality to Python via `pyo3` with `BatchedRustIter` and `RustBatchedGenerator`, enabling seamless interaction between the two languages.
- **Performance Benchmarking:** Added a new benchmark (`batch_iterator_benchmark`) to measure the performance of the Rust batch iterator, ensuring performance tracking for the new feature.
- **ExampleIterator Alignment:** Modified the existing `ExampleIterator` to align with the new batching approach, specifically in how `ShardProgress` is handled, ensuring compatibility and proper data flow.
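The batching step in the highlights above can be sketched in plain Rust. This is a hypothetical, simplified stand-in for the PR's `BatchIterator` (the real one also tracks `ShardProgress` and is exposed through `pyo3`); it only shows the core idea of draining an inner iterator into fixed-size batches:

```rust
/// Simplified sketch: groups examples produced by an inner iterator into
/// fixed-size batches. Names are illustrative, not the PR's actual API.
struct BatchIterator<I: Iterator> {
    inner: I,
    batch_size: usize,
}

impl<I: Iterator> Iterator for BatchIterator<I> {
    type Item = Vec<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // Collect up to `batch_size` examples; a final partial batch is kept.
        let batch: Vec<I::Item> =
            self.inner.by_ref().take(self.batch_size).collect();
        if batch.is_empty() { None } else { Some(batch) }
    }
}

fn main() {
    // Ten 3-byte "examples" batched four at a time -> batches of 4, 4 and 2.
    let examples = (0u8..10).map(|i| vec![i; 3]);
    let batches: Vec<Vec<Vec<u8>>> =
        BatchIterator { inner: examples, batch_size: 4 }.collect();
    println!("{} batches", batches.len());
}
```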
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
```python
# limitations under the License.
"""Build and load TensorFlow dataset Record wrapper"""


# pylint: disable=invalid-name
values = {
    "fixed": np.random.random((data_points, 138)).astype(dtype),
    "fixed_2d": np.random.random((data_points, 3, 5)).astype(dtype),
    # TODO(reintroduce) when https://github.com/google/sedpack/pull/227 is
```
Code Review
This pull request introduces an initial implementation for batched data iteration in Rust, exposing it to Python. This is a valuable addition for performance, and the overall approach of handling batching in Rust with a `pyo3` interface for Python is sound. However, the implementation has some critical issues that need to be addressed. My review focuses on error handling, correctness, and performance: I've pointed out unhandled panics at the Rust-Python boundary that should be converted to Python exceptions, a critical bug in the Python data decoding logic for batched attributes, and some opportunities for performance and code-style improvements in the Rust implementation. Addressing these points will make the new feature more robust and efficient.
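On the panics point: `pyo3` turns a returned `Err` into a Python exception, whereas a Rust panic crossing the boundary is not a catchable Python error in the usual sense. A minimal plain-Rust sketch of the pattern (function and error type are hypothetical; with `pyo3` the error would be a `PyErr`, e.g. built via `PyValueError::new_err`):

```rust
// Instead of panicking (e.g. via `unwrap`/`expect`), return a Result so the
// binding layer can translate the error into a Python exception.
fn next_batch(remaining: &mut Vec<u8>, batch_size: usize) -> Result<Vec<u8>, String> {
    if batch_size == 0 {
        // With pyo3 this would be `Err(PyValueError::new_err(...))`.
        return Err("batch_size must be positive".to_string());
    }
    // Drain up to `batch_size` bytes; a short final batch is fine.
    let n = batch_size.min(remaining.len());
    Ok(remaining.drain(..n).collect())
}

fn main() {
    let mut data = vec![1u8, 2, 3, 4, 5];
    assert_eq!(next_batch(&mut data, 2), Ok(vec![1, 2]));
    assert!(next_batch(&mut data, 0).is_err()); // an error value, not a panic
    println!("ok");
}
```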
```rust
true => BatchedAttribute::Static {
    data: numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
        cache.iter().flat_map(|e| e[attribute_index].clone().into_iter()),
    ),
},
false => BatchedAttribute::Dynamic {
    data: cache
        .iter()
        .map(|e| {
            numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
                e[attribute_index].clone(),
            )
        })
        .collect(),
},
```
Cloning the inner Vec<u8> for each attribute is inefficient and can be avoided. You can copy the bytes out of a borrowed slice instead (e.g. `e[attribute_index].iter().copied()`, since `from_iter` needs owned `u8` items), which will improve performance by reducing allocations and copying.
```diff
 true => BatchedAttribute::Static {
     data: numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
-        cache.iter().flat_map(|e| e[attribute_index].clone().into_iter()),
+        cache.iter().flat_map(|e| e[attribute_index].iter().copied()),
     ),
 },
 false => BatchedAttribute::Dynamic {
     data: cache
         .iter()
         .map(|e| {
             numpy::ndarray::Array::<u8, numpy::Ix1>::from_iter(
-                e[attribute_index].clone(),
+                e[attribute_index].iter().copied(),
             )
         })
         .collect(),
 },
```
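The clone-versus-borrow trade-off in miniature, using plain `Vec`s instead of ndarray arrays (illustrative only; the `cache` layout and `attribute_index` mirror the snippet above):

```rust
fn main() {
    // Two examples, each with two byte-encoded attributes.
    let cache: Vec<Vec<Vec<u8>>> = vec![
        vec![vec![1, 2], vec![10]],
        vec![vec![3, 4], vec![20]],
    ];
    let attribute_index = 0;

    // Borrow each example's attribute and copy bytes out lazily: no
    // per-example Vec allocation, unlike `.clone().into_iter()`.
    let flat: Vec<u8> = cache
        .iter()
        .flat_map(|e| e[attribute_index].iter().copied())
        .collect();
    assert_eq!(flat, vec![1, 2, 3, 4]);
    println!("{:?}", flat);
}
```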
| | |
|---|---|
| Branch | `rust_batched` |
| Testbed | `ubuntu-latest` |

⚠️ WARNING: No Threshold found! Without a Threshold, no Alerts will ever be generated.
Click here to create a new Threshold.
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the `--ci-only-thresholds` flag.
Click to view all benchmark results

| Benchmark | Latency (ms) |
|---|---|
| BatchIterator | 894.11 |
| ExampleIterator | 221.33 |
| parallel_map | 107.41 |
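Latency figures like these can be sanity-checked locally with a minimal timing harness (hypothetical sketch; the PR's actual numbers come from `batch_iterator_benchmark`, and `std::time::Instant` micro-timing is no substitute for a proper benchmark run):

```rust
use std::time::Instant;

fn main() {
    // 10,000 fake 64-byte examples, batched 32 at a time.
    let examples: Vec<Vec<u8>> =
        (0..10_000u32).map(|i| vec![(i % 256) as u8; 64]).collect();

    let start = Instant::now();
    let batches: Vec<Vec<u8>> = examples
        .chunks(32)
        .map(|chunk| chunk.iter().flat_map(|e| e.iter().copied()).collect())
        .collect();
    let elapsed = start.elapsed();

    // 312 full batches plus one partial batch of 16 examples.
    assert_eq!(batches.len(), 10_000 / 32 + 1);
    println!("{} batches in {:?}", batches.len(), elapsed);
}
```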