
Improve saving and loading of non-NumPy attributes #227

Merged
kralka merged 19 commits into google:main from kralka:dtype_tests on Sep 8, 2025

Conversation

@kralka (Collaborator) commented Aug 27, 2025

Improve saving of int, str, and bytes attributes. This PR mostly fixes existing bugs. It might cause regressions and incompatibility with older versions, so the version has been bumped.

  • In tfrec, str is represented as bytes (which is consistent with TensorFlow).
  • int is represented as int64; this might be revisited later to enable arbitrarily large integers.
  • In npz, bytes are stored as a contiguous array with indexes. This is not the most robust implementation, but without it we could not allow dynamic sizes (with examples_per_shard > 1). The other option, saving them as padded arrays, caused length problems when decoding.
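
The npz layout described in the last bullet could look roughly like the sketch below. It is only an illustration of the "contiguous array with indexes" idea, not the code from this PR: the helper names (`pack_bytes`, `unpack_bytes`, `roundtrip`) and the npz keys (`payload`, `payload_ends`) are hypothetical, and the real implementation may store the indexes differently.

```python
import io

import numpy as np


def pack_bytes(values: list[bytes]) -> tuple[np.ndarray, np.ndarray]:
    """Concatenate all values into one uint8 array plus their end offsets."""
    data = np.frombuffer(b"".join(values), dtype=np.uint8)
    ends = np.cumsum([len(v) for v in values], dtype=np.int64)
    return data, ends


def unpack_bytes(data: np.ndarray, ends: np.ndarray) -> list[bytes]:
    """Slice the contiguous array back into the original values."""
    starts = np.concatenate(([0], ends[:-1]))
    return [data[s:e].tobytes() for s, e in zip(starts, ends)]


def roundtrip(values: list[bytes]) -> list[bytes]:
    """Save packed values to an in-memory npz and read them back."""
    data, ends = pack_bytes(values)
    buffer = io.BytesIO()
    # Hypothetical key names; two plain numeric arrays, so no pickle is needed.
    np.savez(buffer, payload=data, payload_ends=ends)
    buffer.seek(0)
    with np.load(buffer, allow_pickle=False) as shard:
        return unpack_bytes(shard["payload"], shard["payload_ends"])


values = [b"a", b"", b"xyz\x00"]
restored = roundtrip(values)
```

Values of different lengths, empty values, and values ending in a null byte all survive this round trip. NumPy's fixed-width bytes dtype drops trailing null bytes on read, which may be related to the length problems mentioned for padded arrays, though the PR description does not say so.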

kralka added 4 commits August 27, 2025 19:26
as_tfdataset should also have strings? or is bytearray representation of
string ok here?

Copy issues should be fine since Rust is allocating the memory and
passing ownership to Python. Only difference maybe str(value) vs
value.decode("utf-8")?

bytearray needs the same

need a dataset fixture for efficiency

# Copy otherwise the arrays are immutable and keep the whole
# file content from being garbage collected.
np_array = np.copy(np_array)
Review comment from @kralka (Collaborator, Author):

Copy issues should be fine since Rust is allocating the memory and
passing ownership to Python.
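
The code comment on `np.copy` describes general NumPy behavior that can be reproduced in isolation. The sketch below uses `np.frombuffer` over a `bytes` object as a stand-in for the file content; how the arrays are actually produced in this PR (including the Rust side) is not shown in the conversation, so this only illustrates the view-versus-copy distinction.

```python
import numpy as np

# Stand-in for the content of a whole shard file held in memory.
file_content = bytes(range(16))

# Zero-copy view over the buffer: read-only, and it references the buffer
# through .base, so the buffer cannot be garbage collected while it lives.
view = np.frombuffer(file_content, dtype=np.uint8)[4:8]

# Independent copy: writable, owns only its own 4 bytes, no link to the buffer.
owned = np.copy(view)
owned[0] = 255
```

After the copy, `owned` can be modified freely and `file_content` is unchanged.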

@kralka kralka marked this pull request as ready for review August 29, 2025 10:29
@kralka (Collaborator, Author) commented Aug 29, 2025

/Gemini review

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request significantly improves the handling of str, int, and bytes attributes across various storage formats, which is a valuable enhancement. The changes introduce specialized serialization and deserialization logic to improve efficiency and security, notably by working to avoid Python's pickle module. The addition of comprehensive end-to-end tests is also a great step towards ensuring correctness. My review has identified a few areas for improvement, including a critical security concern, a potential memory leak, and some opportunities to enhance code correctness, efficiency, and API consistency. Addressing these points will help solidify the robustness of these new features.

@kralka changed the title from "Improve saving and loading of str attributes" to "Improve saving and loading of non-NumPy attributes" on Aug 29, 2025
@github-actions

🐰 Bencher Report

Branch: dtype_tests
Testbed: ubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

Benchmark latency (milliseconds):

  • ExampleIterator: 226.48 ms (no threshold)
  • parallel_map: 228.99 ms (no threshold)

shard_content: dict[str, list[AttributeValueT]] = np.load(file_path)
# A prefix such that prepended it creates a new name without collision
# with any attribute name.
self._counting_prefix: str = "len" + "_" * max(
Review comment (Collaborator):

I can't really say I'm a huge fan of this mangling, especially because string concatenation creates copies.

Reply from @kralka (Collaborator, Author):

I am also not happy with this workaround. But at least now no extra strings are being created.

In the 1.0 version I would suggest either getting rid of the npz shard format or not allowing bytes in it.
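
For readers unfamiliar with this kind of name mangling, here is one way a collision-free prefix could be built. This is a hypothetical sketch, not the expression from the PR (which is truncated in the snippet above): `counting_prefix` is an invented helper, and the rule it uses (make every prefixed name longer than the longest attribute name) is an assumption chosen because it is easy to verify.

```python
def counting_prefix(attribute_names: list[str]) -> str:
    """Return a prefix p such that p + name never equals an attribute name.

    Hypothetical rule: pad "len" with as many underscores as the longest
    name has characters. Every prefixed name is then strictly longer than
    every attribute name, so the two sets cannot overlap.
    """
    longest = max((len(name) for name in attribute_names), default=0)
    return "len" + "_" * longest


names = ["key", "len_key", "len__key"]
prefix = counting_prefix(names)
mangled = [prefix + name for name in names]
```

The prefix is computed once per shard, so no string needs to be rebuilt on each lookup if the mangled names are cached.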

IterateShardNP.decode_attribute(
np_value=shard_content[attribute.name][i],
attribute=attribute,
) if attribute.dtype != "bytes" else
Review comment (Collaborator):

That comprehension is a bit hard to read. Consider writing it as a for loop with if: yield / else: yield.

Reply from @kralka (Collaborator, Author):

Refactored out
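
The shape suggested in the review (a for loop that yields from an if/else instead of a conditional expression inside a comprehension) might look like the sketch below. The final refactored code is not shown in this conversation, so `decode_column`, `decode_bytes`, and `decode_other` are hypothetical stand-ins; `decode_other` plays the role of `IterateShardNP.decode_attribute`.

```python
from collections.abc import Iterator
from typing import Any


def decode_bytes(raw: Any) -> bytes:
    # Hypothetical stand-in for the bytes-specific decoding path.
    return bytes(raw)


def decode_other(raw: Any) -> Any:
    # Hypothetical stand-in for IterateShardNP.decode_attribute.
    return raw


def decode_column(values: list[Any], dtype: str) -> Iterator[Any]:
    """Yield decoded values, branching on the dtype with a plain if/else."""
    for value in values:
        if dtype == "bytes":
            yield decode_bytes(value)
        else:
            yield decode_other(value)


decoded_bytes = list(decode_column([bytearray(b"ab"), b"c"], "bytes"))
decoded_ints = list(decode_column([1, 2], "int64"))
```

Each branch sits on its own line, which makes it easier to add a third dtype later than it would be inside a nested conditional expression.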

@kralka kralka requested a review from jmichelp September 8, 2025 16:18
@kralka kralka added this pull request to the merge queue Sep 8, 2025
Merged via the queue into google:main with commit 198cd98 Sep 8, 2025
73 of 87 checks passed