perf(geotiff): drop bytes(bytearray) copy in TIFF layout assemble (#1756) #1762
Open
brendancol wants to merge 2 commits into
The eager (non-streaming) writer builds the output TIFF in a bytearray inside _assemble_standard_layout and _assemble_cog_layout, then ends with ``return bytes(output)``. The bytes() call copies the entire buffer, transiently doubling peak Python-allocated memory for the duration of the conversion.

Measured on a 95 MB uint8 raster:

  Before: peak 202 MB (95 MB bytearray + 95 MB bytes copy)
  After: peak 107 MB (just the bytearray)

Returning the bytearray directly preserves correctness: ``_write_bytes`` already calls ``f.write(file_bytes)`` which accepts any buffer-protocol object, and the post-write ``parse_header(file_bytes[:16])`` validation slice works the same on bytearray and bytes.

The streaming writer is unaffected -- it writes straight to a file handle and never built a single contiguous output buffer.

Type annotations on _assemble_tiff, _assemble_standard_layout, _assemble_cog_layout, and _write_bytes are updated to reflect the buffer-protocol contract.

Tests in test_assemble_layout_no_bytes_copy_1756.py:

* Both layout helpers return bytearray (not bytes)
* _assemble_tiff propagates the bytearray return through CPU and GPU writer paths
* Round-trip via BytesIO and a tmp_path .tif still produces correct pixel data after the type change
* The assembler returns a writeable bytearray whose first 16 bytes parse as a valid TIFF header
Audited the geotiff subpackage on top of Pass 7 (2026-05-12). Found and fixed one new MEDIUM: the eager TIFF layout assembler ended with a ``bytes(bytearray)`` copy that doubled peak Python-allocated memory for the duration of the conversion. Filed #1756, fix landed in the same branch (PR pending). SAFE / IO-bound verdict holds. Peak memory now scales 1x with the output buffer size instead of 2x.
Summary
- `_assemble_standard_layout` and `_assemble_cog_layout` build the output TIFF in a `bytearray`, then ended with `return bytes(output)`, which copies the entire buffer
- Returning the `bytearray` directly cuts transient peak memory in half on eager writes; downstream consumers (`_write_bytes`, the `parse_header` validation slice, `BytesIO.write`, file handle `write`) all accept the buffer protocol, so no caller changes are needed
- Type annotations on `_assemble_tiff`, both layout helpers, and `_write_bytes` are updated to reflect that the return type is now `bytearray`

Fixes #1756.
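As a rough illustration of why callers do not need to change, the sketch below uses a 16-byte stand-in buffer (not the real assembler output) and only standard-library sinks. It shows that `bytes(...)` creates a second object with the same contents, while `BytesIO.write`, a file handle's `write`, and slicing all work on the `bytearray` itself.

```python
import io
import tempfile

# Stand-in for an assembled TIFF buffer; illustrative only, not real assembler output.
output = bytearray(b"II*\x00" + bytes(12))

# bytes(bytearray) allocates a second buffer holding the same contents.
copied = bytes(output)

# An in-memory sink accepts the bytearray without conversion.
sink = io.BytesIO()
written_mem = sink.write(output)

# So does a binary file handle.
with tempfile.TemporaryFile() as f:
    written_file = f.write(output)
    f.seek(0)
    round_trip = f.read()

# Slicing a bytearray yields a bytearray with the same contents as the bytes slice.
header = output[:16]
```

The slice is still a `bytearray`, so any consumer that strictly requires `bytes` (for example, as a dict key) would need an explicit conversion of that small slice only.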
Measurements
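On a 10000x10000 uint8 raster (95 MB output), peak Python-allocated memory was:

- Before: 202 MB (95 MB bytearray + 95 MB bytes copy)
- After: 107 MB (just the bytearray)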
The savings scale linearly with output size: a 1 GB write drops 1 GB of peak memory, a 10 GB write drops 10 GB.
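One way to reproduce the 2x versus 1x pattern on a small scale is with `tracemalloc` from the standard library. The two `assemble_*` functions below are hypothetical stand-ins that only allocate a zero-filled buffer; they are not the geotiff assemblers, and the 8 MiB size is an arbitrary choice for a quick run.

```python
import tracemalloc

def peak_bytes(fn, size):
    """Return the peak traced allocation (in bytes) while fn(size) runs."""
    tracemalloc.start()
    try:
        result = fn(size)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del result
    return peak

def assemble_with_copy(size):
    # Hypothetical stand-in for the old behaviour: build, then copy to bytes.
    output = bytearray(size)
    return bytes(output)

def assemble_no_copy(size):
    # Hypothetical stand-in for the new behaviour: return the buffer itself.
    output = bytearray(size)
    return output

SIZE = 8 * 1024 * 1024  # 8 MiB, arbitrary
peak_copy = peak_bytes(assemble_with_copy, SIZE)
peak_no_copy = peak_bytes(assemble_no_copy, SIZE)
print(f"with copy: {peak_copy / SIZE:.2f}x, without copy: {peak_no_copy / SIZE:.2f}x")
```

`tracemalloc` only sees allocations made through Python's allocators, which matches the "Python-allocated memory" figure quoted above; process-level RSS may differ.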
Test plan
- Tests in `test_assemble_layout_no_bytes_copy_1756.py` cover the `bytearray` return type at the layout, assembler, and end-to-end levels (`BytesIO` and `tmp_path` round-trips)
- … (`test_no_georef_windowed_coords_1710`, `test_predictor2_big_endian_gpu_1517`) reference the now-private `read_to_array` attribute (commit 8adb749 / issue #1708, "geotiff: read_to_array leaks into public namespace but is not in __all__ or docs") and predate this change
- … `_assemble_tiff`; the GPU path is covered by `test_gpu_writer_attrs_1563.py` and `test_kwarg_behaviour_2026_05_12.py` (all 36 pass)
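The kind of check described in the test plan can be sketched against a toy assembler. Both `assemble_stub` and `parse_header_stub` are hypothetical names invented for this example; the stub emits only the 8-byte classic TIFF header (byte order, magic number 42, first-IFD offset) followed by raw pixel bytes, so its output is not a complete TIFF.

```python
import struct

def assemble_stub(pixels: bytes) -> bytearray:
    """Hypothetical stand-in assembler: classic TIFF header plus raw pixels."""
    output = bytearray()
    # "II" = little-endian, 42 = TIFF magic, 8 = placeholder offset of the first IFD.
    output += struct.pack("<2sHI", b"II", 42, 8)
    output += pixels
    return output  # returned as-is, with no bytes(output) copy

def parse_header_stub(buf):
    """Hypothetical header parser; works on any buffer-protocol object."""
    return struct.unpack_from("<2sHI", buf, 0)

pixels = bytes([1, 2, 3, 4])
result = assemble_stub(pixels)

# Return type and header checks, mirroring the assertions listed above.
is_bytearray = isinstance(result, bytearray)
header_fields = parse_header_stub(result[:16])

# The buffer stays writeable, unlike a bytes object.
result[-1] = 9
```

Because `struct.unpack_from` reads through the buffer protocol, the same parser accepts `bytes`, `bytearray`, or a `memoryview`, which is the property the real `parse_header` slice is relying on according to the description.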