
Concurrent shard writing #237

Merged
wsxrdv merged 9 commits into google:main from wsxrdv:multithreading_write
Sep 23, 2025

Conversation

@wsxrdv (Collaborator) commented Sep 19, 2025

Speed up dataset creation when both producing the values for an example and I/O are slow.

  • Compress, save the file, and compute hash checksums in a different process (or thread) than the one calling write_example.
  • For the FlatBuffer writer, compute hash checksums in memory instead of reading the file back from disk.
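The in-memory checksum idea from the second bullet can be sketched as follows. The summary below mentions a helper named hash_checksums_from_bytes; its exact signature is not shown in this PR, so the one here is an assumption for illustration.

```python
import hashlib

# Hypothetical sketch: compute all requested digests over the in-memory
# shard buffer, avoiding a second read of the shard file from disk.
# The real helper's signature may differ.
def hash_checksums_from_bytes(data: bytes,
                              hash_names=("sha256", "md5")) -> tuple[str, ...]:
    # hashlib.new accepts any algorithm name known to hashlib.
    return tuple(hashlib.new(name, data).hexdigest() for name in hash_names)

checksums = hash_checksums_from_bytes(b"example shard bytes")
```

Hashing the bytes that are about to be written (rather than re-reading the file) matters once the write itself happens asynchronously, since the file may not be fully flushed when the checksum is needed.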

@gemini-code-assist (Contributor)
Summary of Changes

Hello @wsxrdv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the dataset creation process by introducing multi-threading capabilities for writing shard files. The primary goal is to improve performance in scenarios where generating data examples and performing disk I/O are bottlenecks. By allowing shard files to be written in parallel, the overall time required to build large datasets can be reduced, making the system more efficient for data-intensive applications.

Highlights

  • Multi-threaded Shard File Writing: Introduced an option to enable multi-threaded writing of shard files, which can significantly speed up dataset creation, especially when both example value generation and I/O operations are slow. This is controlled by a new concurrency parameter in the DatasetWriting.filler() method.
  • Concurrency Integration: The DatasetFiller and DatasetFillerContext classes now manage a concurrent.futures.ThreadPoolExecutor based on the provided concurrency level. Shard closing operations in Shard and its ShardWriter implementations (FlatBuffer, NumPy, TFRecord) have been updated to optionally utilize this pool for asynchronous file writing.
  • Optimized Hash Checksum Calculation: A new utility function, hash_checksums_from_bytes, was added to compute hash checksums directly from in-memory bytes. This avoids re-reading files from disk after they are written, which is crucial for efficiency when writing asynchronously.
  • Copyright Year Updates: Updated copyright years in several files from 2024 to 2025.
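The concurrency integration described above can be sketched with the standard library. This is a hedged illustration, not the repository's actual API: the function name write_shard_async and the direct write_bytes call stand in for the real compression-and-write path performed by the ShardWriter implementations.

```python
import concurrent.futures
import tempfile
from pathlib import Path

# Hypothetical sketch: hand the finished shard bytes to a
# ThreadPoolExecutor so the caller can keep invoking write_example
# while the previous shard is flushed to disk on a pool thread.
def write_shard_async(pool: concurrent.futures.ThreadPoolExecutor,
                      shard_bytes: bytes,
                      shard_path: Path) -> concurrent.futures.Future:
    # write_bytes runs on a pool thread; the returned Future lets the
    # caller later verify that the write actually succeeded.
    return pool.submit(shard_path.write_bytes, shard_bytes)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "shard_000000.bin"
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        future = write_shard_async(pool, b"payload", path)
        future.result()  # blocks and re-raises any I/O error from the pool
    written = path.read_bytes()
```

Calling future.result() (or at least future.exception()) before finalizing the dataset is what turns an asynchronous write failure into a visible error instead of a silently missing shard.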

@gemini-code-assist bot left a comment

Code Review

This pull request introduces an option for concurrent shard writing to speed up dataset creation, which is a valuable performance enhancement. The implementation for the flatbuffer writer is efficient, computing hashes from memory before the asynchronous write. The test suite has also been updated to cover the new concurrency paths.

However, I've identified a critical issue with error handling in the asynchronous operations that could lead to silent failures and data corruption. Additionally, the concurrency feature is not implemented for numpy and tfrecord shard writers, which is an inconsistency. My review provides detailed feedback and suggestions to address these points to improve the robustness and completeness of the feature.
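The "silent failures" concern raised in the review comes from how concurrent.futures behaves by design: an exception raised inside a submitted task is stored on the Future object and is dropped unless someone calls result() or exception() on it. The snippet below illustrates the pitfall in isolation; it is not the project's code.

```python
import concurrent.futures

# A task that fails the way a disk write might.
def failing_write() -> None:
    raise OSError("disk full")

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(failing_write)
pool.shutdown(wait=True)  # finishes cleanly; nothing is raised here

# The failure is only visible if the future is inspected explicitly.
error = future.exception()
```

Collecting the futures for all pending shard writes and checking them before the dataset is declared complete is the standard way to avoid this class of data-corruption bug.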

@coveralls

Pull Request Test Coverage Report for Build 17921698713

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 77 of 79 (97.47%) changed or added relevant lines in 17 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 88.753%

Changes missing coverage:
  File                                              Covered Lines  Changed/Added Lines  %
  src/sedpack/io/shard/shard_writer_flatbuffer.py   10             11                   90.91%
  src/sedpack/io/shard/shard_writer_np.py           4              5                    80.0%

Files with coverage reduction:
  File                                      New Missed Lines  %
  src/sedpack/io/shard/shard_writer_np.py   1                 89.83%

Totals:
  Change from base Build 17905149035: +0.1%
  Covered Lines: 2912
  Relevant Lines: 3281

💛 - Coveralls

@wsxrdv changed the title from "Add an option for multi-threading shard file write" to "Concurrent shard writing" on Sep 23, 2025
@wsxrdv wsxrdv requested a review from jmichelp September 23, 2025 10:31
@wsxrdv wsxrdv enabled auto-merge September 23, 2025 12:00
@wsxrdv wsxrdv added this pull request to the merge queue Sep 23, 2025
Merged via the queue into google:main with commit 22fa29d Sep 23, 2025
60 of 61 checks passed