Improve saving and loading of non-NumPy attributes#227
Conversation
as_tfdataset should also have strings? or is bytearray representation of
string ok here?
Copy issues should be fine since Rust is allocating the memory and
passing ownership to Python. Only difference maybe str(value) vs
value.decode("utf-8")?
bytearray needs the same
need a dataset fixture for efficiency
|
|
||
| # Copy otherwise the arrays are immutable and keep the whole | ||
| # file content from being garbage collected. | ||
| np_array = np.copy(np_array) |
There was a problem hiding this comment.
Copy issues should be fine since Rust is allocating the memory and
passing ownership to Python.
- not all of these pass - large int is still turned into a NumPy value
|
/Gemini review |
There was a problem hiding this comment.
Code Review
This pull request significantly improves the handling of str, int, and bytes attributes across various storage formats, which is a valuable enhancement. The changes introduce specialized serialization and deserialization logic to improve efficiency and security, notably by working to avoid Python's pickle module. The addition of comprehensive end-to-end tests is also a great step towards ensuring correctness. My review has identified a few areas for improvement, including a critical security concern, a potential memory leak, and some opportunities to enhance code correctness, efficiency, and API consistency. Addressing these points will help solidify the robustness of these new features.
|
| Branch | dtype_tests |
| Testbed | ubuntu-latest |
⚠️ WARNING: No Threshold found!Without a Threshold, no Alerts will ever be generated.
Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the--ci-only-thresholdsflag.
Click to view all benchmark results
| Benchmark | Latency | milliseconds (ms) |
|---|---|---|
| ExampleIterator | 📈 view plot | 226.48 ms |
| parallel_map | 📈 view plot | 228.99 ms |
src/sedpack/io/npz/iterate_npz.py
Outdated
| shard_content: dict[str, list[AttributeValueT]] = np.load(file_path) | ||
| # A prefix such that prepended it creates a new name without collision | ||
| # with any attribute name. | ||
| self._counting_prefix: str = "len" + "_" * max( |
There was a problem hiding this comment.
I can't really say I'm a huge fan of this mangling, especially because string concatenation creates copies.
There was a problem hiding this comment.
I am also not happy with this workaround. But at least now no extra strings are being created.
In the 1.0 version I would suggest either getting rid of npz shard format or not allowing bytes for it.
src/sedpack/io/npz/iterate_npz.py
Outdated
| IterateShardNP.decode_attribute( | ||
| np_value=shard_content[attribute.name][i], | ||
| attribute=attribute, | ||
| ) if attribute.dtype != "bytes" else |
There was a problem hiding this comment.
That comprehension is a bit hard to read.
consider writing it for / if yield / else yield
bump sedpack version (this PR fixes major problem)
Improve saving of
int,str, andbytesattributes. This PR mostly fixes existing bugs. It might cause a regression and incompatibility with older versions (version bumped).examples_per_shard > 1). The other option of saving them as padded size arrays was causing length problems when decoding.