geotiff: account for IFD overhead when picking BigTIFF in streaming writer (#1785)#1787
Merged
Conversation
…riter (#1785)

The streaming writer's auto-BigTIFF decision compared only the uncompressed pixel-data size against UINT32_MAX. For rasters just under 4 GiB, the IFD plus the strip/tile-offset table pushed the actual file past the classic-TIFF uint32 offset ceiling, and the write failed late with `struct.error` during offset patching.

Factor the decision into `_should_use_bigtiff_streaming()` and include the header (8 bytes), an IFD estimate (~200 bytes), and the strip/tile table (`n_entries * 8`). Compare with `>=` because UINT32_MAX itself is not a referenceable offset in classic TIFF. The explicit `bigtiff=True/False` user override is unchanged.

Closes #1785
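The before/after decision described above reduces to simple arithmetic. A minimal sketch (the function names here are illustrative, not the repository's actual identifiers; the 8-byte header, ~200-byte IFD estimate, and `n_entries * 8` table size come from the PR description):

```python
# Sketch of the old vs. new streaming BigTIFF decision.
UINT32_MAX = 0xFFFFFFFF

def old_check(uncompressed_bytes: int) -> bool:
    # Pre-fix: only raw pixel bytes were considered.
    return uncompressed_bytes > UINT32_MAX

def new_check(uncompressed_bytes: int, n_entries: int) -> bool:
    # Post-fix: header (8) + IFD estimate (~200) + strip/tile table.
    overhead = 8 + 200 + n_entries * 8
    return uncompressed_bytes + overhead >= UINT32_MAX

# A raster just under 4 GiB with a large strip table: the old check
# stays classic TIFF (and would fail late), the new check promotes
# to BigTIFF up front.
pixels = UINT32_MAX - 1000
print(old_check(pixels))           # False
print(new_check(pixels, 100_000))  # True
```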
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a failure mode in the GeoTIFF streaming writer where it could incorrectly choose classic TIFF for rasters near the 4 GiB boundary, then fail late while patching strip/tile offsets. The change brings streaming BigTIFF auto-detection closer to the eager writer behavior and adds regression coverage for the decision boundary.
Changes:
- Add `_should_use_bigtiff_streaming()` to include reserved IFD + strip/tile table overhead in the streaming BigTIFF decision.
- Update `write_streaming()` to use the new helper instead of comparing only raw uncompressed pixel bytes.
- Add unit + integration regression tests for the helper and for `bigtiff=True/False/None` override behavior on a small dask-backed raster.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `xrspatial/geotiff/_writer.py` | Adds a helper for streaming BigTIFF selection and switches `write_streaming()` to use it. |
| `xrspatial/tests/test_geotiff_streaming_bigtiff_threshold_1785.py` | Adds regression tests covering the helper threshold logic and user override behavior. |
Comments suppressed due to low confidence (1)
xrspatial/geotiff/_writer.py:1589
- In classic TIFF, the highest valid byte offset is (file_size - 1). This helper compares the file size estimate (`uncompressed_bytes + reserved_overhead`) against `UINT32_MAX` using `>=`, which effectively reduces the classic-TIFF ceiling by 1 byte. This also diverges from the eager path's BigTIFF detection in `_assemble_tiff` (which uses `estimated_file_size > UINT32_MAX`). Consider aligning the streaming helper with the eager check (and/or comparing the last referenced offset) and update the docstring/tests accordingly to avoid an off-by-one promotion to BigTIFF.
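The off-by-one the reviewer points out can be checked with plain arithmetic (an illustrative sketch of the two comparisons, not code from either path):

```python
UINT32_MAX = 0xFFFFFFFF

# A file of exactly UINT32_MAX bytes: its last byte sits at offset
# UINT32_MAX - 1, which is still representable in a uint32 field,
# so classic TIFF can address every byte of it.
file_size = UINT32_MAX
last_offset = file_size - 1
assert last_offset < UINT32_MAX

# The streaming helper's `>=` promotes this file to BigTIFF, while
# the eager path's `>` keeps it classic -- a 1-byte divergence.
streaming_promotes = file_size >= UINT32_MAX  # True
eager_promotes = file_size > UINT32_MAX       # False
print(streaming_promotes, eager_promotes)
```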
```python
    Classic TIFF stores offsets as uint32, so any offset referenced from
    the IFD must satisfy ``offset < UINT32_MAX``. The streaming writer
    appends pixel data after the header and IFD, so the last byte of
    pixel data sits at approximately
    ``header + ifd + strip_table + uncompressed_bytes``. This helper
    promotes to BigTIFF if that total reaches ``UINT32_MAX``.

    See issue #1785: the previous check only compared ``uncompressed_bytes``
    against ``UINT32_MAX`` and missed the IFD/strip-table overhead, which
    pushed sub-4 GiB rasters over the classic-TIFF limit late in the write.
    """
    # 8 bytes per strip/tile entry: 4 for offset (LONG) + 4 for byte count.
    strip_table_overhead = n_entries * 8
    reserved_overhead = (
        header_size_classic + ifd_overhead_estimate + strip_table_overhead
    )
    UINT32_MAX = 0xFFFFFFFF
    # Use >= because UINT32_MAX itself is not a referenceable offset.
    return uncompressed_bytes + reserved_overhead >= UINT32_MAX
```
Comment on lines +1565 to +1586
```python
def _should_use_bigtiff_streaming(uncompressed_bytes: int,
                                  n_entries: int,
                                  header_size_classic: int = 8,
                                  ifd_overhead_estimate: int = 200) -> bool:
    """Decide whether the streaming writer must emit BigTIFF.

    Classic TIFF stores offsets as uint32, so any offset referenced from
    the IFD must satisfy ``offset < UINT32_MAX``. The streaming writer
    appends pixel data after the header and IFD, so the last byte of
    pixel data sits at approximately
    ``header + ifd + strip_table + uncompressed_bytes``. This helper
    promotes to BigTIFF if that total reaches ``UINT32_MAX``.

    See issue #1785: the previous check only compared ``uncompressed_bytes``
    against ``UINT32_MAX`` and missed the IFD/strip-table overhead, which
    pushed sub-4 GiB rasters over the classic-TIFF limit late in the write.
    """
    # 8 bytes per strip/tile entry: 4 for offset (LONG) + 4 for byte count.
    strip_table_overhead = n_entries * 8
    reserved_overhead = (
        header_size_classic + ifd_overhead_estimate + strip_table_overhead
    )
```
Comment on lines +88 to +101
```python
# Pick overhead components large enough that the sum crosses
# UINT32_MAX even though uncompressed_bytes < UINT32_MAX.
n_entries = 100_000  # 800 KB strip table
uncompressed = UINT32_MAX - (8 + 200 + n_entries * 8)
# Sum equals exactly UINT32_MAX -> promote (>=).
assert _should_use_bigtiff_streaming(
    uncompressed_bytes=uncompressed,
    n_entries=n_entries,
) is True
# One byte less and we stay classic.
assert _should_use_bigtiff_streaming(
    uncompressed_bytes=uncompressed - 1,
    n_entries=n_entries,
) is False
```
…gTIFF boundary

Address Copilot review feedback on PR #1787:

- The 200-byte `ifd_overhead_estimate` constant in `_should_use_bigtiff_streaming` was not guaranteed conservative. Variable-length payloads (`gdal_metadata_xml` ASCII overflow, user `extra_tags`, geo ascii params) can push real IFD overhead well past 200 bytes and let sub-4 GiB rasters overflow classic-TIFF uint32 offsets late in the write. Replace the fudge with `_compute_classic_ifd_overhead`, which reuses `_build_ifd` to measure the actual entry block plus overflow heap from the tag list. `write_streaming` now builds tags first, measures, decides BigTIFF, then promotes offset arrays via `_promote_offsets_to_long8`.
- The streaming helper used `>= UINT32_MAX` while the eager `_assemble_tiff` path uses `> UINT32_MAX`. Align both: a file of exactly UINT32_MAX bytes still fits classic (last byte sits at offset UINT32_MAX - 1, a valid uint32). Update the corresponding boundary tests so equality stays classic and the +1 case promotes.

Tests cover the boundary flip and a 5000-byte gdal_metadata blob that the old 200-byte fudge would have ignored.
Summary
The streaming writer's auto-BigTIFF check compared only the uncompressed pixel-data size against UINT32_MAX. For rasters just under 4 GiB the IFD plus the strip/tile-offset table pushed the actual file past the classic-TIFF uint32 offset ceiling, and the write failed late with `struct.error` while patching offsets. The eager path in `_assemble_tiff` already accounts for this overhead; the streaming path now matches.

- Factor the decision into `_should_use_bigtiff_streaming(uncompressed_bytes, n_entries, header_size_classic)` so it can be unit-tested without multi-GB allocations.
- Include the strip/tile table overhead (`n_entries * 8`, covering StripOffsets + StripByteCounts as LONG).
- Compare with `>=` because classic TIFF cannot reference an offset equal to `UINT32_MAX` (last valid offset is `UINT32_MAX - 1`).
- The explicit `bigtiff=True/False` user override is unchanged.

Closes #1785.
Test plan
- `pytest -xvs xrspatial/tests/test_geotiff_streaming_bigtiff_threshold_1785.py` (9 passed)
- `pytest -x xrspatial/ -k writer` (227 passed)
- `pytest -x xrspatial/geotiff/tests/test_streaming_write.py` (27 passed)