Skip to content

geotiff: split overloaded masked_nodata into lifecycle signals#2146

Merged
brendancol merged 4 commits into
xarray-contrib:mainfrom
brendancol:issue-2135
May 19, 2026
Merged

geotiff: split overloaded masked_nodata into lifecycle signals#2146
brendancol merged 4 commits into
xarray-contrib:mainfrom
brendancol:issue-2135

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #2135.

Summary

Splits the overloaded attrs['masked_nodata'] boolean into separate
lifecycle signals so consumers can answer four distinct questions
without one bit that lies on the cast-after-mask path.

The existing nodata / masked_nodata keys keep their post-#2092
semantics. Two additive attrs land alongside them:

  • nodata_pixels_present (bool): True iff the read window contained
    any pixel matching the declared sentinel before masking. Lets QA and
    writer code answer "any nodata in this tile" without rescanning. The
    eager numpy, GPU, and VRT eager paths compute it; the dask backends
    leave it unset because a strict per-chunk reduction would force eager
    .compute().
  • nodata_dtype_cast (str): target dtype name (e.g. "float64")
    when the caller passed an explicit dtype= kwarg. Distinguishes
    float-because-masked from float-because-promoted, which the existing
    masked_nodata lookup alone cannot do.

Backend coverage

  • numpy eager: scan integrated into the existing mask block; also
    scans in the mask_nodata=False branch so the attr stays useful.
  • dask numpy: pixels_present deliberately unset (lazy contract);
    dtype_cast recorded.
  • GPU eager (stripped and tiled): new
    _apply_nodata_mask_gpu_with_presence helper returns the bool
    alongside the masked buffer.
  • dask GPU: pixels_present unset; dtype_cast recorded.
  • VRT eager: _apply_integer_sentinel_mask_with_presence plus a
    float-path NaN-presence proxy plus a no-mask-scan helper for the
    mask_nodata=False branch.
  • VRT chunked: pixels_present unset; dtype_cast recorded.

Overlap with sibling PRs

This change brushes _attrs.py and the four read backends. It does
not pre-empt the scope of #2133, #2136, or #2139; each of those owns a
different attrs lifecycle problem on the same module. The only edit to
_set_nodata_attrs here adds two kwargs with None defaults so
the existing call shape stays valid.

Test plan

  • xrspatial/geotiff/tests/test_nodata_lifecycle_attrs_2135.py:
    18 tests across eager / dask / VRT (+ skip-marked GPU).
  • Existing test_masked_nodata_attr_2092.py (12) and
    test_nodata_semantics_split_1988.py (39) still green.
  • Full xrspatial/geotiff/tests/ suite (excluding GPU-only
    paths that need cupy + CUDA): 4200 passed.

…y-contrib#2135)

Adds two additive attrs alongside the existing nodata / masked_nodata
pair so consumers can answer separate questions without a single bit
that lies in the cast-after-mask case:

- nodata_pixels_present: bool, True iff the read window contained any
  pixel matching the declared sentinel before masking. Emitted by the
  eager numpy, GPU, and VRT paths; left unset on the dask backends so
  the lazy contract is preserved (a strict per-chunk reduction would
  force eager compute).
- nodata_dtype_cast: target dtype name (e.g. "float64") when the caller
  passed an explicit dtype= kwarg. Lets downstream tell
  float-because-masked from float-because-promoted.

The existing nodata and masked_nodata keys keep their post-xarray-contrib#2092
semantics; nothing renames or deprecates. attrs_contract.rst documents
both new keys.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 19, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: geotiff: split overloaded masked_nodata into lifecycle signals

Blockers

None.

Suggestions

  • xrspatial/geotiff/_backends/vrt.py:364: when the integer-sentinel helper already returned pixels_present=True, the follow-up np.isnan(arr).any() scan still runs (since the not nodata_pixels_present guard only short-circuits the True case before re-checking). Today the helper sets True on int-with-sentinel hits, so the float-NaN proxy is gated correctly. If a future change starts using this code path with a float-source-already-NaN-masked-by-internal-reader buffer that also went through the helper, we will scan twice. Worth a comment pinning the assumption that exactly one branch sets the bool.
  • xrspatial/geotiff/_vrt.py:866 (_apply_integer_sentinel_mask_with_presence): duplicates the body of _apply_integer_sentinel_mask. Tolerable for clarity right now; if a third variant ever lands, factor the shared loop into a single helper that takes a "report presence" flag.

Nits

  • xrspatial/geotiff/_backends/_gpu_helpers.py:45: _apply_nodata_mask_gpu_with_presence repeats most of _apply_nodata_mask_gpu. The presence-reporting variant adds one mask.any().item() plus a tuple return; the rest is identical. A single internal helper plus thin wrappers would cut the duplication.
  • xrspatial/geotiff/__init__.py:711: nodata_pixels_present initialised to None, then potentially set to False in the no-mask-scan branch. The mask_nodata=True branch never explicitly sets False when the dtype is neither float nor int (e.g. complex source). Leaving it None for that case is fine, but worth a one-line comment so a future maintainer does not "fix" the asymmetry.

What looks good

  • Additive contract: every existing caller of _set_nodata_attrs was updated by-name in this PR; the two new kwargs default to None so out-of-tree consumers stay valid.
  • Dask backends honour the issue's recommendation to leave nodata_pixels_present unset rather than forcing eager compute.
  • VRT path threads the presence bit through both the integer-sentinel helper (explicit return) and the inline float-NaN-masking case (post-mask proxy scan).
  • nodata_dtype_cast only emitted when the caller actually passed a dtype= kwarg, so the absence-means-no-cast contract holds.
  • attrs_contract.rst and the _attrs.py module docstring both updated; the test matrix from the issue is covered row-by-row.

Checklist

  • No off-by-one in window scans (uses existing mask kernels).
  • All implemented backends produce consistent attrs.
  • NaN handling routes through the existing helpers; no == np.nan slips.
  • Edge cases covered: out-of-range sentinel, no-sentinel file, mask off + cast on, sentinel absent in window.
  • Dask chunk boundaries unaffected; per-chunk path unchanged.
  • No premature materialisation; dask path explicitly skips the reduction.
  • [n/a] No benchmark needed for an attrs-only contract change.
  • [n/a] README feature matrix unchanged (no new public function).
  • Docstrings and the contract doc updated.

…b#2135)

- vrt.py: document the single-branch invariant the float-NaN proxy
  scan relies on so a future maintainer does not accidentally
  double-scan a buffer that both the integer helper and the inline
  reader masked.
- __init__.py: explain why nodata_pixels_present stays None on exotic
  dtypes rather than being defaulted to False.
- _gpu_helpers.py / _vrt.py: note why each presence-reporting helper
  is a sibling of its base rather than a folded variant; defer the
  refactor until a third variant exists.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review pass

Walked back through the four findings from the first review. Pushed 17c23b9.

Dispositions

  • vrt.py:364 (Suggestion): fixed. Added an explicit invariant comment so the single-branch contract that lets the float-NaN proxy use a not present guard is no longer implicit.
  • init.py:711 (Nit): fixed. Documented the asymmetry between the float / int gates and the exotic-dtype fall-through so a future maintainer does not "fix" it by forcing a False default.
  • _gpu_helpers.py:45 (Nit): deferred. The duplication is real, but _apply_nodata_mask_gpu is on the GPU writer's hot path; introducing an opt-in flag would force every call site to pay either the Python-side branch cost or the extra bool computation. Added a docstring note pointing at the consolidation opportunity once a third variant lands.
  • _vrt.py:866 (Suggestion): deferred for the same reason. The base helper is shared with the per-task chunked-VRT path; folding the variants now adds branch cost to every chunk. Documented the same conditional consolidation plan.

Re-ran the 70 tests in test_nodata_lifecycle_attrs_2135.py, test_masked_nodata_attr_2092.py, and test_nodata_semantics_split_1988.py: all green.

@brendancol brendancol merged commit 7b5aaf1 into xarray-contrib:main May 19, 2026
4 of 5 checks passed
brendancol added a commit that referenced this pull request May 20, 2026
#2146 (nodata lifecycle) and #2150 (tier surface) had landed cleanly on
main without touching _populate_attrs_from_geo_info, but #2125's
"drop attrs['crs'] on rotated reads" introduced a _vrt_is_rotated gate
on the VRT branches that our pre-merge dataclass-based VRT code did not
honour. Resolved by:

* _attrs.py: kept the GeoTIFFMetadata shim. The dataclass already
  produces the same attrs surface as the legacy field-by-field path.
* _backends/vrt.py: kept the dataclass build on both VRT branches and
  threaded _vrt_is_rotated through to drop crs/crs_wkt and set
  has_georef=False on rotated reads (so metadata_to_attrs stamps the
  no-georef marker).

Verified test_allow_rotated_no_crs_2122, test_geotiff_metadata_2139,
test_nodata_lifecycle_attrs_2135, test_supported_features_tiers_2137,
test_backend_parity_matrix, and the broader VRT/attrs/nodata suite
(1309 tests) all pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: split overloaded masked_nodata into separate nodata lifecycle signals

1 participant