Conversation
This has the potential to speed up dataset creation when both producing the values for an example and disk I/O are slow.
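The idea can be sketched as follows — a minimal, hypothetical illustration (the function and parameter names here are assumptions, not the PR's actual API): each shard is written on a worker thread so that generating the next example overlaps with the previous shard's disk write.

```python
from concurrent.futures import ThreadPoolExecutor

def build_dataset(shards, shard_paths, num_threads=4):
    """Write each shard on a worker thread so slow example
    generation can overlap with slow disk I/O (illustrative sketch)."""
    def write_shard(path, examples):
        with open(path, "wb") as f:
            for ex in examples:
                f.write(ex)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(write_shard, p, s)
                   for p, s in zip(shard_paths, shards)]
        for fut in futures:
            fut.result()  # surface any write errors instead of dropping them
```

Calling `fut.result()` on every future before returning is what makes the parallelism safe: a write that failed in a worker thread re-raises in the caller rather than being silently lost.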
Summary of Changes

Hello @wsxrdv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the dataset creation process by introducing multi-threading for writing shard files. The primary goal is to improve performance in scenarios where both generating data examples and performing disk I/O are bottlenecks. Allowing shard files to be written in parallel reduces the overall time required to build large datasets, making the system more efficient for data-intensive applications.

Highlights
Code Review
This pull request introduces an option for concurrent shard writing to speed up dataset creation, which is a valuable performance enhancement. The implementation for the flatbuffer writer is efficient, computing hashes from memory before the asynchronous write. The test suite has also been updated to cover the new concurrency paths.
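The pattern the review praises — hashing the serialized shard while it is still in memory, before handing the bytes off to a worker thread — might look like the following sketch. All names here (`submit_shard_write`, `_write_bytes`) are hypothetical, not the PR's actual implementation:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def submit_shard_write(pool, path, payload):
    """Hash the serialized shard while it is still in memory, then
    hand the bytes to a worker thread for the slow disk write.
    Hypothetical helper; `pool` is a ThreadPoolExecutor."""
    digest = hashlib.sha256(payload).hexdigest()  # computed before the write starts
    future = pool.submit(_write_bytes, path, payload)
    return digest, future

def _write_bytes(path, payload):
    with open(path, "wb") as f:
        f.write(payload)
```

Computing the digest up front avoids re-reading the file from disk just to hash it, so the checksum adds no extra I/O to the write path.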
However, I've identified a critical issue with error handling in the asynchronous operations that could lead to silent failures and data corruption. Additionally, the concurrency feature is not implemented for numpy and tfrecord shard writers, which is an inconsistency. My review provides detailed feedback and suggestions to address these points to improve the robustness and completeness of the feature.
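The silent-failure risk the review points to is a standard pitfall of `concurrent.futures`: an exception raised in a worker thread is stored on the `Future` and only re-raised when someone calls `result()`. A hedged sketch of a guard against it (the `flush_writes` name is an assumption for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def flush_writes(futures):
    """Wait for all pending shard writes and re-raise the first
    failure instead of letting it vanish with the worker thread.
    Hypothetical helper, not the PR's actual code."""
    errors = []
    for fut in futures:
        try:
            fut.result()  # blocks; re-raises any exception from the worker
        except Exception as exc:
            errors.append(exc)
    if errors:
        raise errors[0]
```

Without such a barrier before the writer is finalized, a failed shard write would leave a truncated or missing file while the dataset build reports success.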
Pull Request Test Coverage Report for Build 17921698713

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch, so the report includes changes from outside the original pull request, potentially including unrelated coverage changes.
💛 - Coveralls