Improve saving and loading of non-NumPy attributes by kralka · Pull Request #227 · google/sedpack

kralka · 2025-08-27T17:27:20Z

Improve saving of int, str, and bytes attributes. This PR mostly fixes existing bugs. It might cause a regression and incompatibility with older versions (version bumped).

- tfrec: str is represented as bytes (which is consistent with TensorFlow).
- int is represented as int64, which might be revisited later to enable arbitrarily large integers.
- npz: bytes are stored as one contiguous array with indexes. This is not the most robust implementation, but without it we could not allow dynamic size (with examples_per_shard > 1). The other option, saving them as padded fixed-size arrays, was causing length problems when decoding.
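As an illustration of the "contiguous array with indexes" idea, here is a minimal sketch. It is not the code from this PR: the helper names `pack_bytes` and `unpack_bytes` and the key names `payload` and `payload_lengths` are made up for the example. It also shows one well-known pitfall of fixed-width NumPy byte strings (trailing null bytes are dropped on read), which may or may not be the length problem mentioned above.

```python
import io

import numpy as np


def pack_bytes(values: list[bytes]) -> tuple[np.ndarray, np.ndarray]:
    """Concatenate byte strings into one uint8 array plus per-example lengths."""
    data = np.frombuffer(b"".join(values), dtype=np.uint8)
    lengths = np.array([len(v) for v in values], dtype=np.int64)
    return data, lengths


def unpack_bytes(data: np.ndarray, lengths: np.ndarray) -> list[bytes]:
    """Invert pack_bytes by slicing at the cumulative offsets."""
    ends = np.cumsum(lengths)
    starts = ends - lengths
    return [data[s:e].tobytes() for s, e in zip(starts, ends)]


values = [b"", b"ab\x00", b"\x00\x00trailing zeros\x00"]
data, lengths = pack_bytes(values)

# Round trip through an in-memory npz file without pickle.
buffer = io.BytesIO()
np.savez(buffer, payload=data, payload_lengths=lengths)
buffer.seek(0)
with np.load(buffer, allow_pickle=False) as shard:
    restored = unpack_bytes(shard["payload"], shard["payload_lengths"])

# Fixed-width alternative: NumPy's "S" dtype strips trailing null bytes on read.
padded = np.array(values)
```

With this layout every example keeps its exact length, including empty values and values that end in null bytes, at the cost of one extra array per bytes attribute.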

- as_tfdataset should also have strings? Or is a bytearray representation of a string OK here?
- Copy issues should be fine since Rust is allocating the memory and passing ownership to Python.
- Only difference may be str(value) vs value.decode("utf-8")?
- bytearray needs the same
- Need a dataset fixture for efficiency
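On the str(value) vs value.decode("utf-8") question, the two are not interchangeable in plain Python. A short sketch (the variable names are only for illustration):

```python
raw = bytearray("naïve".encode("utf-8"))

# str() without an encoding returns the repr of the bytearray, not its text.
via_str = str(raw)

# Both of these decode the UTF-8 payload into the original text.
via_decode = raw.decode("utf-8")
via_str_with_encoding = str(raw, "utf-8")
```

So a string attribute that comes back as bytes or bytearray needs an explicit decode (or `str(value, "utf-8")`) to recover the original str.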

kralka · 2025-08-28T09:30:33Z

src/sedpack/io/flatbuffer/iterate.py


-                # Copy otherwise the arrays are immutable and keep the whole
-                # file content from being garbage collected.
-                np_array = np.copy(np_array)


Copy issues should be fine since Rust is allocating the memory and
passing ownership to Python.
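For context, a minimal sketch of what the removed copy guards against in pure NumPy. This is an assumption about the motivation of the old comment, not code from sedpack: an array created with `np.frombuffer` is a read-only view that keeps the whole underlying buffer alive, while `np.copy` yields a small array that owns its memory.

```python
import numpy as np

# Stands in for the content of a whole shard file read into memory.
file_content = bytes(range(16))

# A view of 4 bytes at offset 8; it references file_content instead of copying.
view = np.frombuffer(file_content, dtype=np.uint8, count=4, offset=8)

# The copy owns its data, is writable, and no longer references file_content.
owned = np.copy(view)
owned[0] = 255
```

When the memory is allocated elsewhere and ownership is handed to Python, as described in the comment above for the Rust path, the array already owns its data and the extra copy is not needed.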

- not all of these pass - large int is still turned into a NumPy value
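The following sketch shows the general NumPy behaviour behind this note; it is not taken from the PR's tests. A value read back from an int64 array is a NumPy scalar rather than a Python int, and int64 cannot hold arbitrarily large integers. The helper `fits_int64` is a hypothetical name.

```python
import numpy as np

INT64_INFO = np.iinfo(np.int64)


def fits_int64(value: int) -> bool:
    """Return True if the Python int can be stored as int64 without overflow."""
    return int(INT64_INFO.min) <= value <= int(INT64_INFO.max)


# Indexing an int64 array gives a NumPy scalar, which is not a Python int.
loaded = np.array([2**40], dtype=np.int64)[0]

# .item() converts the NumPy scalar back to a plain Python int.
as_python_int = loaded.item()
```

A test that checks `type(value) is int` after a round trip would therefore fail unless the loader converts explicitly.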

kralka · 2025-08-29T10:31:08Z

/Gemini review

gemini-code-assist

Code Review

This pull request significantly improves the handling of str, int, and bytes attributes across various storage formats, which is a valuable enhancement. The changes introduce specialized serialization and deserialization logic to improve efficiency and security, notably by working to avoid Python's pickle module. The addition of comprehensive end-to-end tests is also a great step towards ensuring correctness. My review has identified a few areas for improvement, including a critical security concern, a potential memory leak, and some opportunities to enhance code correctness, efficiency, and API consistency. Addressing these points will help solidify the robustness of these new features.
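To make the pickle concern concrete, here is a small sketch of standard NumPy behaviour (not the PR's code): object arrays are pickled when saved, and `np.load` refuses to unpickle them unless `allow_pickle=True` is passed.

```python
import io

import numpy as np

# Variable-length bytes stored as an object array are serialized with pickle.
object_array = np.array([b"a", b"bcd"], dtype=object)
buffer = io.BytesIO()
np.savez(buffer, payload=object_array)
buffer.seek(0)

refused = False
with np.load(buffer, allow_pickle=False) as shard:
    try:
        shard["payload"]  # The array is only read when the key is accessed.
    except ValueError:
        refused = True
```

Loading with `allow_pickle=True` would work but executes pickle on the file content, which is unsafe for untrusted shards; storing only numeric arrays avoids the problem.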

src/sedpack/io/shard/shard_writer_np.py

src/sedpack/io/flatbuffer/iterate.py

src/sedpack/io/shard/shard_writer_np.py

tests/io/test_end2end_dtypes.py

src/sedpack/io/npz/iterate_npz.py

src/sedpack/io/shard/shard_writer_flatbuffer.py


github-actions · 2025-08-29T12:37:59Z

Bencher Report

Branch	dtype_tests
Testbed	ubuntu-latest

⚠️ WARNING: No Threshold found!
Without a Threshold, no Alerts will ever be generated.
Measure without a Threshold: Latency (nanoseconds (ns))
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

Benchmark	Latency (ms)
ExampleIterator	226.48
parallel_map	228.99

jmichelp · 2025-09-08T14:48:46Z

src/sedpack/io/npz/iterate_npz.py

-        shard_content: dict[str, list[AttributeValueT]] = np.load(file_path)
+        # A prefix such that prepended it creates a new name without collision
+        # with any attribute name.
+        self._counting_prefix: str = "len" + "_" * max(


I can't really say I'm a huge fan of this mangling, especially because string concatenation creates copies.

I am also not happy with this workaround. But at least now no extra strings are being created.

In the 1.0 version I would suggest either getting rid of npz shard format or not allowing bytes for it.
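A hedged sketch of how such a collision-free prefix can work. The argument of `max(` is truncated in the diff above, so the choice below (the length of the longest attribute name) is an assumption made for this example, and `counting_prefix` is a hypothetical helper name.

```python
def counting_prefix(attribute_names: list[str]) -> str:
    """Build a prefix such that prefix + name is never an attribute name.

    Assumption for this sketch: pad with as many underscores as the longest
    attribute name has characters. Every mangled key is then strictly longer
    than any attribute name, so it cannot collide with one.
    """
    longest = max((len(name) for name in attribute_names), default=0)
    return "len" + "_" * longest


names = ["label", "len_label", "len__label"]
prefix = counting_prefix(names)

# Compute the mangled keys once and reuse them for every example.
length_keys = {name: prefix + name for name in names}
```

Computing the keys once per shard reader, rather than concatenating for every example, is one way to avoid creating extra strings in the hot loop.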

jmichelp · 2025-09-08T14:51:23Z

src/sedpack/io/npz/iterate_npz.py

+                    IterateShardNP.decode_attribute(
+                        np_value=shard_content[attribute.name][i],
+                        attribute=attribute,
+                    ) if attribute.dtype != "bytes" else


That comprehension is a bit hard to read.
Consider writing it as a for loop with if: yield / else: yield.

Refactored out
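The suggested shape could look roughly like the generator below. This is a sketch and not the refactoring that landed in the PR: `iterate_attribute` is a hypothetical name, the non-bytes branch simply yields the stored value where the real code calls `IterateShardNP.decode_attribute`, and the bytes branch assumes the contiguous array plus lengths layout.

```python
from collections.abc import Iterator
from typing import Optional

import numpy as np


def iterate_attribute(
    dtype: str,
    stored: np.ndarray,
    lengths: Optional[np.ndarray] = None,
) -> Iterator[object]:
    """Yield one decoded value per example using a plain for / if / else."""
    count = len(stored) if lengths is None else len(lengths)
    offset = 0
    for i in range(count):
        if dtype != "bytes":
            # Placeholder for the per-value decoding of numeric attributes.
            yield stored[i]
        else:
            # Slice the next value out of the contiguous uint8 array.
            end = offset + int(lengths[i])
            yield stored[offset:end].tobytes()
            offset = end


packed = np.frombuffer(b"abcde", dtype=np.uint8)
decoded_bytes = list(iterate_attribute("bytes", packed, np.array([2, 0, 3])))
decoded_ints = list(iterate_attribute("int64", np.array([1, 2, 3])))
```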

bump sedpack version (this PR fixes major problem)

kralka added 4 commits August 27, 2025 19:26

Improve saving and loading of str attributes

efed8fd

[squash] fix nits

8478126

[squash] fix workflows

0396394

kralka commented Aug 28, 2025


kralka added 7 commits August 28, 2025 14:47

[squash] parametrized tests

3b86d4d

- not all of these pass - large int is still turned into a NumPy value

[squash]

be17d60

Merge branch 'main' into dtype_tests

9e32875

[squash] typo

ddec48e

[squash] fix deprecated warning

19b0d6f

[drop] Try allowing pickle when saving but not when loading

552e01f

[squash] add pickle explanation

6ee51c1

kralka marked this pull request as ready for review August 29, 2025 10:29

gemini-code-assist bot reviewed Aug 29, 2025


kralka added 4 commits August 29, 2025 13:21

[squash] nit allow pickle

f8d2ff8

[squash] NIT about zip (it is better this way, yes but not for the reasons mentioned)

e834ea6

[squash] fix nits

5956646

[squash] remove commented out code

8e9cee8

kralka changed the title from "Improve saving and loading of str attributes" to "Improve saving and loading of non-NumPy attributes" Aug 29, 2025

[squash] also bump package version to fix possible incompatibilities

e8ba34c

kralka requested a review from jmichelp August 29, 2025 12:50

wsxrdv mentioned this pull request Aug 29, 2025

Implement initial version of Rust batch support #229

Merged

Merge branch 'main' into dtype_tests

29aa9f6

jmichelp reviewed Sep 8, 2025


kralka added 2 commits September 8, 2025 17:56

[squash] fix incomprehensible comprehension

7e03cdc

bump sedpack version (this PR fixes major problem)

[squash] avoid creating new strings

52b8f78

kralka requested a review from jmichelp September 8, 2025 16:18

jmichelp approved these changes Sep 8, 2025


kralka added this pull request to the merge queue Sep 8, 2025

Merged via the queue into google:main with commit 198cd98 Sep 8, 2025
73 of 87 checks passed

kralka merged 19 commits into google:main from kralka:dtype_tests

3 participants
