
geotiff: account for IFD overhead when picking BigTIFF in streaming writer (#1785)#1787

Merged
brendancol merged 2 commits into main from issue-1785 on May 13, 2026

Conversation

@brendancol
Contributor

Summary

The streaming writer's auto-BigTIFF check compared only the uncompressed pixel-data size against UINT32_MAX. For rasters just under 4 GiB the IFD plus the strip/tile-offset table pushed the actual file past the classic-TIFF uint32 offset ceiling, and the write failed late with struct.error while patching offsets. The eager path in _assemble_tiff already accounts for this overhead; the streaming path now matches.

  • Factor the decision into _should_use_bigtiff_streaming(uncompressed_bytes, n_entries, header_size_classic) so it can be unit-tested without multi-GB allocations.
  • Reserved overhead = header (8) + IFD estimate (~200 bytes) + strip/tile table (n_entries * 8, covering StripOffsets + StripByteCounts as LONG).
  • Compare with >= because classic TIFF cannot reference an offset equal to UINT32_MAX (last valid offset is UINT32_MAX - 1).
  • Explicit bigtiff=True/False user override is unchanged.

Closes #1785.
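To make the arithmetic concrete, here is a minimal sketch of the failure mode. The helper name `estimated_file_size` is illustrative, not the project's API; the constants follow the summary above (8-byte classic header, ~200-byte IFD estimate, 8 bytes per strip entry):

```python
# A raster just under 4 GiB passes the old pixel-only check but overflows
# the classic-TIFF uint32 offset ceiling once IFD/strip-table overhead
# is counted.
UINT32_MAX = 0xFFFFFFFF

def estimated_file_size(uncompressed_bytes, n_entries,
                        header=8, ifd_estimate=200):
    # 8 bytes per strip entry: LONG offset + LONG byte count.
    return uncompressed_bytes + header + ifd_estimate + n_entries * 8

uncompressed = UINT32_MAX - 100   # 100 bytes under the ceiling
n_entries = 65_536                # strip count for a large raster

old_check_says_bigtiff = uncompressed > UINT32_MAX
new_check_says_bigtiff = (
    estimated_file_size(uncompressed, n_entries) > UINT32_MAX
)
# old check: False (stays classic, then fails late while patching offsets)
# new check: True  (promotes to BigTIFF up front)
```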

Test plan

  • pytest -xvs xrspatial/tests/test_geotiff_streaming_bigtiff_threshold_1785.py (9 passed)
  • pytest -x xrspatial/ -k writer (227 passed)
  • pytest -x xrspatial/geotiff/tests/test_streaming_write.py (27 passed)

…riter (#1785)

The streaming writer's auto-BigTIFF decision compared only the
uncompressed pixel-data size against UINT32_MAX. For rasters just
under 4 GiB the IFD plus the strip/tile-offset table pushed the
actual file past the classic-TIFF uint32 offset ceiling, and the
write failed late with struct.error during offset patching.

Factor the decision into _should_use_bigtiff_streaming() and include
header (8 bytes), an IFD estimate (~200 bytes), and the strip/tile
table (n_entries * 8). Compare with >= because UINT32_MAX itself is
not a referenceable offset in classic TIFF. The explicit
bigtiff=True/False user override is unchanged.

Closes #1785
@github-actions bot added the performance (PR touches performance-sensitive code) label May 13, 2026
@brendancol brendancol requested a review from Copilot May 13, 2026 13:35

Copilot AI left a comment


Pull request overview

This PR fixes a failure mode in the GeoTIFF streaming writer where it could incorrectly choose classic TIFF for rasters near the 4 GiB boundary, then fail late while patching strip/tile offsets. The change brings streaming BigTIFF auto-detection closer to the eager writer behavior and adds regression coverage for the decision boundary.

Changes:

  • Add _should_use_bigtiff_streaming() to include reserved IFD + strip/tile table overhead in the streaming BigTIFF decision.
  • Update write_streaming() to use the new helper instead of comparing only raw uncompressed pixel bytes.
  • Add unit + integration regression tests for the helper and for bigtiff=True/False/None override behavior on a small dask-backed raster.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Reviewed files:

  • xrspatial/geotiff/_writer.py: Adds a helper for streaming BigTIFF selection and switches write_streaming() to use it.
  • xrspatial/tests/test_geotiff_streaming_bigtiff_threshold_1785.py: Adds regression tests covering the helper threshold logic and user override behavior.
Comments suppressed due to low confidence (1)

xrspatial/geotiff/_writer.py:1589

  • In classic TIFF, the highest valid byte offset is (file_size - 1). This helper compares the file size estimate (uncompressed_bytes + reserved_overhead) against UINT32_MAX using >=, which effectively reduces the classic-TIFF ceiling by 1 byte. This also diverges from the eager path’s BigTIFF detection in _assemble_tiff (which uses estimated_file_size > UINT32_MAX). Consider aligning the streaming helper with the eager check (and/or comparing the last referenced offset) and update the docstring/tests accordingly to avoid an off-by-one promotion to BigTIFF.
```python
    Classic TIFF stores offsets as uint32, so any offset referenced from
    the IFD must satisfy ``offset < UINT32_MAX``. The streaming writer
    appends pixel data after the header and IFD, so the last byte of
    pixel data sits at approximately
    ``header + ifd + strip_table + uncompressed_bytes``. This helper
    promotes to BigTIFF if that total reaches ``UINT32_MAX``.

    See issue #1785: the previous check only compared ``uncompressed_bytes``
    against ``UINT32_MAX`` and missed the IFD/strip-table overhead, which
    pushed sub-4 GiB rasters over the classic-TIFF limit late in the write.
    """
    # 8 bytes per strip/tile entry: 4 for offset (LONG) + 4 for byte count.
    strip_table_overhead = n_entries * 8
    reserved_overhead = (
        header_size_classic + ifd_overhead_estimate + strip_table_overhead
    )
    UINT32_MAX = 0xFFFFFFFF
    # Use >= because UINT32_MAX itself is not a referenceable offset.
    return uncompressed_bytes + reserved_overhead >= UINT32_MAX
```


Comment on lines +1565 to +1586

```python
def _should_use_bigtiff_streaming(uncompressed_bytes: int,
                                  n_entries: int,
                                  header_size_classic: int = 8,
                                  ifd_overhead_estimate: int = 200) -> bool:
    """Decide whether the streaming writer must emit BigTIFF.

    Classic TIFF stores offsets as uint32, so any offset referenced from
    the IFD must satisfy ``offset < UINT32_MAX``. The streaming writer
    appends pixel data after the header and IFD, so the last byte of
    pixel data sits at approximately
    ``header + ifd + strip_table + uncompressed_bytes``. This helper
    promotes to BigTIFF if that total reaches ``UINT32_MAX``.

    See issue #1785: the previous check only compared ``uncompressed_bytes``
    against ``UINT32_MAX`` and missed the IFD/strip-table overhead, which
    pushed sub-4 GiB rasters over the classic-TIFF limit late in the write.
    """
    # 8 bytes per strip/tile entry: 4 for offset (LONG) + 4 for byte count.
    strip_table_overhead = n_entries * 8
    reserved_overhead = (
        header_size_classic + ifd_overhead_estimate + strip_table_overhead
    )
```
Comment on lines +88 to +101

```python
# Pick overhead components large enough that the sum crosses
# UINT32_MAX even though uncompressed_bytes < UINT32_MAX.
n_entries = 100_000  # 800 KB strip table
uncompressed = UINT32_MAX - (8 + 200 + n_entries * 8)
# Sum equals exactly UINT32_MAX -> promote (>=).
assert _should_use_bigtiff_streaming(
    uncompressed_bytes=uncompressed,
    n_entries=n_entries,
) is True
# One byte less and we stay classic.
assert _should_use_bigtiff_streaming(
    uncompressed_bytes=uncompressed - 1,
    n_entries=n_entries,
) is False
```
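The reviewer's off-by-one point can be shown in a few lines. This is an illustrative stand-in (`must_promote` is not the project's function): a file of exactly UINT32_MAX bytes still fits classic TIFF because its last byte sits at offset UINT32_MAX - 1, a representable uint32, so the strict comparison used by the eager path is the correct one:

```python
UINT32_MAX = 0xFFFFFFFF

def must_promote(estimated_file_size):
    # Strict comparison, matching the eager path's
    # `estimated_file_size > UINT32_MAX` check.
    return estimated_file_size > UINT32_MAX

# Exactly UINT32_MAX bytes: last byte at offset UINT32_MAX - 1, classic OK.
must_promote(UINT32_MAX)      # False
# One more byte and the file needs BigTIFF.
must_promote(UINT32_MAX + 1)  # True
```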
…gTIFF boundary

Address Copilot review feedback on PR #1787:

* The 200-byte ifd_overhead_estimate constant in _should_use_bigtiff_streaming
  was not guaranteed conservative. Variable-length payloads (gdal_metadata_xml
  ASCII overflow, user extra_tags, geo ascii params) can push real IFD overhead
  well past 200 bytes and let sub-4 GiB rasters overflow classic-TIFF uint32
  offsets late in the write. Replace the fudge with _compute_classic_ifd_overhead,
  which reuses _build_ifd to measure the actual entry block plus overflow heap
  from the tag list. write_streaming now builds tags first, measures, decides
  BigTIFF, then promotes offset arrays via _promote_offsets_to_long8.

* The streaming helper used >= UINT32_MAX while the eager _assemble_tiff path
  uses > UINT32_MAX. Align both: a file of exactly UINT32_MAX bytes still fits
  classic (last byte sits at offset UINT32_MAX - 1, a valid uint32). Update
  the corresponding boundary tests so equality stays classic and the +1 case
  promotes.

Tests cover the boundary flip and a 5000-byte gdal_metadata blob that the old
200-byte fudge would have ignored.
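The measurement approach described above can be sketched as follows. The layout constants come from the TIFF 6.0 specification (a classic IFD is a 2-byte entry count, 12 bytes per entry, and a 4-byte next-IFD pointer, with tag values wider than the 4-byte inline slot spilling into an overflow heap, each padded to even length); the helper name and tag tuples are illustrative, not the project's actual `_compute_classic_ifd_overhead`:

```python
def classic_ifd_overhead(tags):
    """Exact classic-TIFF IFD footprint for `tags`, an iterable of
    (tag_id, value_byte_length) pairs."""
    # 2-byte entry count + 12 bytes per IFD entry + 4-byte next-IFD pointer.
    entry_block = 2 + 12 * len(tags) + 4
    # Values wider than the 4-byte inline slot go to the overflow heap,
    # word-aligned (odd-length payloads padded by one byte).
    overflow = sum(
        size + (size % 2)
        for _, size in tags
        if size > 4
    )
    return entry_block + overflow

# Example: 10 small inline tags plus a 5000-byte GDAL metadata XML blob
# (tag 42112), which a fixed 200-byte estimate would have undercounted.
tags = [(i, 4) for i in range(10)] + [(42112, 5000)]
overhead = classic_ifd_overhead(tags)  # 2 + 12*11 + 4 + 5000 = 5138
```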
@brendancol brendancol merged commit 4c601ae into main May 13, 2026
2 of 20 checks passed


Development

Successfully merging this pull request may close these issues.

geotiff: streaming dask writer picks classic TIFF too late near the 4 GB boundary
