
Concurrent shard writing #237

Merged
wsxrdv merged 9 commits into google:main from wsxrdv:multithreading_write
Sep 23, 2025

Conversation

@wsxrdv (Collaborator) commented Sep 19, 2025

Speed up dataset creation when both producing the values for an example and I/O are slow.

  • Compress, save the file, and compute hash checksums in a different process (or thread) than the one calling write_example.
  • For the FlatBuffer writer, compute hash checksums in memory instead of reading the file back from disk.
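The in-memory checksum idea from the second bullet can be sketched as follows. The summary below mentions a helper named hash_checksums_from_bytes; its exact signature is not shown in this PR, so the one here is an assumption for illustration.

```python
import hashlib

# Hypothetical sketch: compute all requested digests over the in-memory
# shard buffer, avoiding a second read of the shard file from disk.
# The real helper's signature may differ.
def hash_checksums_from_bytes(data: bytes,
                              hash_names=("sha256", "md5")) -> tuple[str, ...]:
    # hashlib.new accepts any algorithm name known to hashlib.
    return tuple(hashlib.new(name, data).hexdigest() for name in hash_names)

checksums = hash_checksums_from_bytes(b"example shard bytes")
```

Hashing the bytes that are about to be written (rather than re-reading the file) matters once the write itself happens asynchronously, since the file may not be fully flushed when the checksum is needed.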

@gemini-code-assist (Contributor)
Summary of Changes

Hello @wsxrdv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the dataset creation process by introducing multi-threading capabilities for writing shard files. The primary goal is to improve performance in scenarios where generating data examples and performing disk I/O are bottlenecks. By allowing shard files to be written in parallel, the overall time required to build large datasets can be reduced, making the system more efficient for data-intensive applications.

Highlights

  • Multi-threaded Shard File Writing: Introduced an option to enable multi-threaded writing of shard files, which can significantly speed up dataset creation, especially when both example value generation and I/O operations are slow. This is controlled by a new concurrency parameter in the DatasetWriting.filler() method.
  • Concurrency Integration: The DatasetFiller and DatasetFillerContext classes now manage a concurrent.futures.ThreadPoolExecutor based on the provided concurrency level. Shard closing operations in Shard and its ShardWriter implementations (FlatBuffer, NumPy, TFRecord) have been updated to optionally utilize this pool for asynchronous file writing.
  • Optimized Hash Checksum Calculation: A new utility function, hash_checksums_from_bytes, was added to compute hash checksums directly from in-memory bytes. This avoids re-reading files from disk after they are written, which is crucial for efficiency when writing asynchronously.
  • Copyright Year Updates: Updated copyright years in several files from 2024 to 2025.
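The concurrency integration described above can be sketched with the standard library. This is a hedged illustration, not the repository's actual API: the function name write_shard_async and the direct write_bytes call stand in for the real compression-and-write path performed by the ShardWriter implementations.

```python
import concurrent.futures
import tempfile
from pathlib import Path

# Hypothetical sketch: hand the finished shard bytes to a
# ThreadPoolExecutor so the caller can keep invoking write_example
# while the previous shard is flushed to disk on a pool thread.
def write_shard_async(pool: concurrent.futures.ThreadPoolExecutor,
                      shard_bytes: bytes,
                      shard_path: Path) -> concurrent.futures.Future:
    # write_bytes runs on a pool thread; the returned Future lets the
    # caller later verify that the write actually succeeded.
    return pool.submit(shard_path.write_bytes, shard_bytes)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "shard_000000.bin"
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        future = write_shard_async(pool, b"payload", path)
        future.result()  # blocks and re-raises any I/O error from the pool
    written = path.read_bytes()
```

Calling future.result() (or at least future.exception()) before finalizing the dataset is what turns an asynchronous write failure into a visible error instead of a silently missing shard.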

@gemini-code-assist bot left a comment

Code Review

This pull request introduces an option for concurrent shard writing to speed up dataset creation, which is a valuable performance enhancement. The implementation for the flatbuffer writer is efficient, computing hashes from memory before the asynchronous write. The test suite has also been updated to cover the new concurrency paths.

However, I've identified a critical issue with error handling in the asynchronous operations that could lead to silent failures and data corruption. Additionally, the concurrency feature is not implemented for numpy and tfrecord shard writers, which is an inconsistency. My review provides detailed feedback and suggestions to address these points to improve the robustness and completeness of the feature.
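The "silent failures" concern raised in the review comes from how concurrent.futures behaves by design: an exception raised inside a submitted task is stored on the Future object and is dropped unless someone calls result() or exception() on it. The snippet below illustrates the pitfall in isolation; it is not the project's code.

```python
import concurrent.futures

# A task that fails the way a disk write might.
def failing_write() -> None:
    raise OSError("disk full")

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(failing_write)
pool.shutdown(wait=True)  # finishes cleanly; nothing is raised here

# The failure is only visible if the future is inspected explicitly.
error = future.exception()
```

Collecting the futures for all pending shard writes and checking them before the dataset is declared complete is the standard way to avoid this class of data-corruption bug.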

@coveralls

Pull Request Test Coverage Report for Build 17921698713

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 77 of 79 (97.47%) changed or added relevant lines in 17 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 88.753%

Changes missing coverage:
  File                                              Covered Lines  Changed/Added Lines  %
  src/sedpack/io/shard/shard_writer_flatbuffer.py   10             11                   90.91%
  src/sedpack/io/shard/shard_writer_np.py           4              5                    80.0%

Files with coverage reduction:
  File                                      New Missed Lines  %
  src/sedpack/io/shard/shard_writer_np.py   1                 89.83%

Totals:
  Change from base Build 17905149035: +0.1%
  Covered Lines: 2912
  Relevant Lines: 3281

💛 - Coveralls

@wsxrdv changed the title from "Add an option for multi-threading shard file write" to "Concurrent shard writing" on Sep 23, 2025
@wsxrdv wsxrdv requested a review from jmichelp September 23, 2025 10:31
@wsxrdv wsxrdv enabled auto-merge September 23, 2025 12:00
@wsxrdv wsxrdv added this pull request to the merge queue Sep 23, 2025
Merged via the queue into google:main with commit 22fa29d Sep 23, 2025
60 of 61 checks passed