
geotiff: account for IFD overhead when picking BigTIFF in streaming writer (#1785)#1787

Merged
brendancol merged 2 commits into main from issue-1785 on May 13, 2026

Conversation

@brendancol
Contributor

Summary

The streaming writer's auto-BigTIFF check compared only the uncompressed pixel-data size against UINT32_MAX. For rasters just under 4 GiB the IFD plus the strip/tile-offset table pushed the actual file past the classic-TIFF uint32 offset ceiling, and the write failed late with struct.error while patching offsets. The eager path in _assemble_tiff already accounts for this overhead; the streaming path now matches.

  • Factor the decision into _should_use_bigtiff_streaming(uncompressed_bytes, n_entries, header_size_classic) so it can be unit-tested without multi-GB allocations.
  • Reserved overhead = header (8) + IFD estimate (~200 bytes) + strip/tile table (n_entries * 8, covering StripOffsets + StripByteCounts as LONG).
  • Compare with >= because classic TIFF cannot reference an offset equal to UINT32_MAX (last valid offset is UINT32_MAX - 1).
  • Explicit bigtiff=True/False user override is unchanged.

Closes #1785.
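To make the arithmetic concrete, here is a minimal sketch of the failure mode. The helper name `estimated_file_size` is illustrative, not the project's API; the constants follow the summary above (8-byte classic header, ~200-byte IFD estimate, 8 bytes per strip entry):

```python
# A raster just under 4 GiB passes the old pixel-only check but overflows
# the classic-TIFF uint32 offset ceiling once IFD/strip-table overhead
# is counted.
UINT32_MAX = 0xFFFFFFFF

def estimated_file_size(uncompressed_bytes, n_entries,
                        header=8, ifd_estimate=200):
    # 8 bytes per strip entry: LONG offset + LONG byte count.
    return uncompressed_bytes + header + ifd_estimate + n_entries * 8

uncompressed = UINT32_MAX - 100   # 100 bytes under the ceiling
n_entries = 65_536                # strip count for a large raster

old_check_says_bigtiff = uncompressed > UINT32_MAX
new_check_says_bigtiff = (
    estimated_file_size(uncompressed, n_entries) > UINT32_MAX
)
# old check: False (stays classic, then fails late while patching offsets)
# new check: True  (promotes to BigTIFF up front)
```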

Test plan

  • pytest -xvs xrspatial/tests/test_geotiff_streaming_bigtiff_threshold_1785.py (9 passed)
  • pytest -x xrspatial/ -k writer (227 passed)
  • pytest -x xrspatial/geotiff/tests/test_streaming_write.py (27 passed)

…riter (#1785)

The streaming writer's auto-BigTIFF decision compared only the
uncompressed pixel-data size against UINT32_MAX. For rasters just
under 4 GiB the IFD plus the strip/tile-offset table pushed the
actual file past the classic-TIFF uint32 offset ceiling, and the
write failed late with struct.error during offset patching.

Factor the decision into _should_use_bigtiff_streaming() and include
header (8 bytes), an IFD estimate (~200 bytes), and the strip/tile
table (n_entries * 8). Compare with >= because UINT32_MAX itself is
not a referenceable offset in classic TIFF. The explicit
bigtiff=True/False user override is unchanged.

Closes #1785
@github-actions bot added the performance (PR touches performance-sensitive code) label May 13, 2026
@brendancol brendancol requested a review from Copilot May 13, 2026 13:35

Copilot AI left a comment


Pull request overview

This PR fixes a failure mode in the GeoTIFF streaming writer where it could incorrectly choose classic TIFF for rasters near the 4 GiB boundary, then fail late while patching strip/tile offsets. The change brings streaming BigTIFF auto-detection closer to the eager writer behavior and adds regression coverage for the decision boundary.

Changes:

  • Add _should_use_bigtiff_streaming() to include reserved IFD + strip/tile table overhead in the streaming BigTIFF decision.
  • Update write_streaming() to use the new helper instead of comparing only raw uncompressed pixel bytes.
  • Add unit + integration regression tests for the helper and for bigtiff=True/False/None override behavior on a small dask-backed raster.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Reviewed files:

  • xrspatial/geotiff/_writer.py: Adds a helper for streaming BigTIFF selection and switches write_streaming() to use it.
  • xrspatial/tests/test_geotiff_streaming_bigtiff_threshold_1785.py: Adds regression tests covering the helper threshold logic and user override behavior.
Comments suppressed due to low confidence (1)

xrspatial/geotiff/_writer.py:1589

  • In classic TIFF, the highest valid byte offset is (file_size - 1). This helper compares the file size estimate (uncompressed_bytes + reserved_overhead) against UINT32_MAX using >=, which effectively reduces the classic-TIFF ceiling by 1 byte. This also diverges from the eager path’s BigTIFF detection in _assemble_tiff (which uses estimated_file_size > UINT32_MAX). Consider aligning the streaming helper with the eager check (and/or comparing the last referenced offset) and update the docstring/tests accordingly to avoid an off-by-one promotion to BigTIFF.
```python
    Classic TIFF stores offsets as uint32, so any offset referenced from
    the IFD must satisfy ``offset < UINT32_MAX``. The streaming writer
    appends pixel data after the header and IFD, so the last byte of
    pixel data sits at approximately
    ``header + ifd + strip_table + uncompressed_bytes``. This helper
    promotes to BigTIFF if that total reaches ``UINT32_MAX``.

    See issue #1785: the previous check only compared ``uncompressed_bytes``
    against ``UINT32_MAX`` and missed the IFD/strip-table overhead, which
    pushed sub-4 GiB rasters over the classic-TIFF limit late in the write.
    """
    # 8 bytes per strip/tile entry: 4 for offset (LONG) + 4 for byte count.
    strip_table_overhead = n_entries * 8
    reserved_overhead = (
        header_size_classic + ifd_overhead_estimate + strip_table_overhead
    )
    UINT32_MAX = 0xFFFFFFFF
    # Use >= because UINT32_MAX itself is not a referenceable offset.
    return uncompressed_bytes + reserved_overhead >= UINT32_MAX
```


Comment on lines +1565 to +1586

```python
def _should_use_bigtiff_streaming(uncompressed_bytes: int,
                                  n_entries: int,
                                  header_size_classic: int = 8,
                                  ifd_overhead_estimate: int = 200) -> bool:
    """Decide whether the streaming writer must emit BigTIFF.

    Classic TIFF stores offsets as uint32, so any offset referenced from
    the IFD must satisfy ``offset < UINT32_MAX``. The streaming writer
    appends pixel data after the header and IFD, so the last byte of
    pixel data sits at approximately
    ``header + ifd + strip_table + uncompressed_bytes``. This helper
    promotes to BigTIFF if that total reaches ``UINT32_MAX``.

    See issue #1785: the previous check only compared ``uncompressed_bytes``
    against ``UINT32_MAX`` and missed the IFD/strip-table overhead, which
    pushed sub-4 GiB rasters over the classic-TIFF limit late in the write.
    """
    # 8 bytes per strip/tile entry: 4 for offset (LONG) + 4 for byte count.
    strip_table_overhead = n_entries * 8
    reserved_overhead = (
        header_size_classic + ifd_overhead_estimate + strip_table_overhead
    )
```
Comment on lines +88 to +101

```python
# Pick overhead components large enough that the sum crosses
# UINT32_MAX even though uncompressed_bytes < UINT32_MAX.
n_entries = 100_000  # 800 KB strip table
uncompressed = UINT32_MAX - (8 + 200 + n_entries * 8)
# Sum equals exactly UINT32_MAX -> promote (>=).
assert _should_use_bigtiff_streaming(
    uncompressed_bytes=uncompressed,
    n_entries=n_entries,
) is True
# One byte less and we stay classic.
assert _should_use_bigtiff_streaming(
    uncompressed_bytes=uncompressed - 1,
    n_entries=n_entries,
) is False
```
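The reviewer's off-by-one point can be shown in a few lines. This is an illustrative stand-in (`must_promote` is not the project's function): a file of exactly UINT32_MAX bytes still fits classic TIFF because its last byte sits at offset UINT32_MAX - 1, a representable uint32, so the strict comparison used by the eager path is the correct one:

```python
UINT32_MAX = 0xFFFFFFFF

def must_promote(estimated_file_size):
    # Strict comparison, matching the eager path's
    # `estimated_file_size > UINT32_MAX` check.
    return estimated_file_size > UINT32_MAX

# Exactly UINT32_MAX bytes: last byte at offset UINT32_MAX - 1, classic OK.
must_promote(UINT32_MAX)      # False
# One more byte and the file needs BigTIFF.
must_promote(UINT32_MAX + 1)  # True
```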
…gTIFF boundary

Address Copilot review feedback on PR #1787:

* The 200-byte ifd_overhead_estimate constant in _should_use_bigtiff_streaming
  was not guaranteed conservative. Variable-length payloads (gdal_metadata_xml
  ASCII overflow, user extra_tags, geo ascii params) can push real IFD overhead
  well past 200 bytes and let sub-4 GiB rasters overflow classic-TIFF uint32
  offsets late in the write. Replace the fudge with _compute_classic_ifd_overhead,
  which reuses _build_ifd to measure the actual entry block plus overflow heap
  from the tag list. write_streaming now builds tags first, measures, decides
  BigTIFF, then promotes offset arrays via _promote_offsets_to_long8.

* The streaming helper used >= UINT32_MAX while the eager _assemble_tiff path
  uses > UINT32_MAX. Align both: a file of exactly UINT32_MAX bytes still fits
  classic (last byte sits at offset UINT32_MAX - 1, a valid uint32). Update
  the corresponding boundary tests so equality stays classic and the +1 case
  promotes.

Tests cover the boundary flip and a 5000-byte gdal_metadata blob that the old
200-byte fudge would have ignored.
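The measurement approach described above can be sketched as follows. The layout constants come from the TIFF 6.0 specification (a classic IFD is a 2-byte entry count, 12 bytes per entry, and a 4-byte next-IFD pointer, with tag values wider than the 4-byte inline slot spilling into an overflow heap, each padded to even length); the helper name and tag tuples are illustrative, not the project's actual `_compute_classic_ifd_overhead`:

```python
def classic_ifd_overhead(tags):
    """Exact classic-TIFF IFD footprint for `tags`, an iterable of
    (tag_id, value_byte_length) pairs."""
    # 2-byte entry count + 12 bytes per IFD entry + 4-byte next-IFD pointer.
    entry_block = 2 + 12 * len(tags) + 4
    # Values wider than the 4-byte inline slot go to the overflow heap,
    # word-aligned (odd-length payloads padded by one byte).
    overflow = sum(
        size + (size % 2)
        for _, size in tags
        if size > 4
    )
    return entry_block + overflow

# Example: 10 small inline tags plus a 5000-byte GDAL metadata XML blob
# (tag 42112), which a fixed 200-byte estimate would have undercounted.
tags = [(i, 4) for i in range(10)] + [(42112, 5000)]
overhead = classic_ifd_overhead(tags)  # 2 + 12*11 + 4 + 5000 = 5138
```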
@brendancol brendancol merged commit 4c601ae into main May 13, 2026
2 of 20 checks passed


Development

Successfully merging this pull request may close these issues.

geotiff: streaming dask writer picks classic TIFF too late near the 4 GB boundary
