geotiff: split overloaded masked_nodata into lifecycle signals#2146
Merged
Conversation
…y-contrib#2135) Adds two additive attrs alongside the existing nodata / masked_nodata pair so consumers can answer separate questions without a single bit that lies in the cast-after-mask case: - nodata_pixels_present: bool, True iff the read window contained any pixel matching the declared sentinel before masking. Emitted by the eager numpy, GPU, and VRT paths; left unset on the dask backends so the lazy contract is preserved (a strict per-chunk reduction would force eager compute). - nodata_dtype_cast: target dtype name (e.g. "float64") when the caller passed an explicit dtype= kwarg. Lets downstream tell float-because-masked from float-because-promoted. The existing nodata and masked_nodata keys keep their post-xarray-contrib#2092 semantics; nothing renames or deprecates. attrs_contract.rst documents both new keys.
brendancol
commented
May 19, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
PR Review: geotiff: split overloaded masked_nodata into lifecycle signals
Blockers
None.
Suggestions
xrspatial/geotiff/_backends/vrt.py:364: when the integer-sentinel helper already returnedpixels_present=True, the follow-upnp.isnan(arr).any()scan still runs (since thenot nodata_pixels_presentguard only short-circuits the True case before re-checking). Today the helper sets True on int-with-sentinel hits, so the float-NaN proxy is gated correctly. If a future change starts using this code path with a float-source-already-NaN-masked-by-internal-reader buffer that also went through the helper, we will scan twice. Worth a comment pinning the assumption that exactly one branch sets the bool.xrspatial/geotiff/_vrt.py:866(_apply_integer_sentinel_mask_with_presence): duplicates the body of_apply_integer_sentinel_mask. Tolerable for clarity right now; if a third variant ever lands, factor the shared loop into a single helper that takes a "report presence" flag.
Nits
xrspatial/geotiff/_backends/_gpu_helpers.py:45:_apply_nodata_mask_gpu_with_presencerepeats most of_apply_nodata_mask_gpu. The presence-reporting variant adds onemask.any().item()plus a tuple return; the rest is identical. A single internal helper plus thin wrappers would cut the duplication.xrspatial/geotiff/__init__.py:711:nodata_pixels_presentinitialised toNone, then potentially set toFalsein the no-mask-scan branch. Themask_nodata=Truebranch never explicitly setsFalsewhen the dtype is neither float nor int (e.g. complex source). Leaving itNonefor that case is fine, but worth a one-line comment so a future maintainer does not "fix" the asymmetry.
What looks good
- Additive contract: every existing caller of
_set_nodata_attrswas updated by-name in this PR; the two new kwargs default toNoneso out-of-tree consumers stay valid. - Dask backends honour the issue's recommendation to leave
nodata_pixels_presentunset rather than forcing eager compute. - VRT path threads the presence bit through both the integer-sentinel helper (explicit return) and the inline float-NaN-masking case (post-mask proxy scan).
nodata_dtype_castonly emitted when the caller actually passed adtype=kwarg, so the absence-means-no-cast contract holds.attrs_contract.rstand the_attrs.pymodule docstring both updated; the test matrix from the issue is covered row-by-row.
Checklist
- No off-by-one in window scans (uses existing mask kernels).
- All implemented backends produce consistent attrs.
- NaN handling routes through the existing helpers; no
== np.nanslips. - Edge cases covered: out-of-range sentinel, no-sentinel file, mask off + cast on, sentinel absent in window.
- Dask chunk boundaries unaffected; per-chunk path unchanged.
- No premature materialisation; dask path explicitly skips the reduction.
- [n/a] No benchmark needed for an attrs-only contract change.
- [n/a] README feature matrix unchanged (no new public function).
- Docstrings and the contract doc updated.
…b#2135) - vrt.py: document the single-branch invariant the float-NaN proxy scan relies on so a future maintainer does not accidentally double-scan a buffer that both the integer helper and the inline reader masked. - __init__.py: explain why nodata_pixels_present stays None on exotic dtypes rather than being defaulted to False. - _gpu_helpers.py / _vrt.py: note why each presence-reporting helper is a sibling of its base rather than a folded variant; defer the refactor until a third variant exists.
brendancol
commented
May 19, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review pass
Walked back through the four findings from the first review. Pushed 17c23b9.
Dispositions
- vrt.py:364 (Suggestion): fixed. Added an explicit invariant comment so the single-branch contract that lets the float-NaN proxy use a
not presentguard is no longer implicit. - init.py:711 (Nit): fixed. Documented the asymmetry between the float / int gates and the exotic-dtype fall-through so a future maintainer does not "fix" it by forcing a False default.
- _gpu_helpers.py:45 (Nit): deferred. The duplication is real, but
_apply_nodata_mask_gpuis on the GPU writer's hot path; introducing an opt-in flag would force every call site to pay either the Python-side branch cost or the extra bool computation. Added a docstring note pointing at the consolidation opportunity once a third variant lands. - _vrt.py:866 (Suggestion): deferred for the same reason. The base helper is shared with the per-task chunked-VRT path; folding the variants now adds branch cost to every chunk. Documented the same conditional consolidation plan.
Re-ran the 70 tests in test_nodata_lifecycle_attrs_2135.py, test_masked_nodata_attr_2092.py, and test_nodata_semantics_split_1988.py: all green.
brendancol
added a commit
that referenced
this pull request
May 20, 2026
#2146 (nodata lifecycle) and #2150 (tier surface) had landed cleanly on main without touching _populate_attrs_from_geo_info, but #2125's "drop attrs['crs'] on rotated reads" introduced a _vrt_is_rotated gate on the VRT branches that our pre-merge dataclass-based VRT code did not honour. Resolved by: * _attrs.py: kept the GeoTIFFMetadata shim. The dataclass already produces the same attrs surface as the legacy field-by-field path. * _backends/vrt.py: kept the dataclass build on both VRT branches and threaded _vrt_is_rotated through to drop crs/crs_wkt and set has_georef=False on rotated reads (so metadata_to_attrs stamps the no-georef marker). Verified test_allow_rotated_no_crs_2122, test_geotiff_metadata_2139, test_nodata_lifecycle_attrs_2135, test_supported_features_tiers_2137, test_backend_parity_matrix, and the broader VRT/attrs/nodata suite (1309 tests) all pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2135.
Summary
Splits the overloaded
attrs['masked_nodata']boolean into separatelifecycle signals so consumers can answer four distinct questions
without one bit that lies on the cast-after-mask path.
The existing
nodata/masked_nodatakeys keep their post-#2092semantics. Two additive attrs land alongside them:
nodata_pixels_present(bool): True iff the read window containedany pixel matching the declared sentinel before masking. Lets QA and
writer code answer "any nodata in this tile" without rescanning. The
eager numpy, GPU, and VRT eager paths compute it; the dask backends
leave it unset because a strict per-chunk reduction would force eager
.compute().nodata_dtype_cast(str): target dtype name (e.g."float64")when the caller passed an explicit
dtype=kwarg. Distinguishesfloat-because-masked from float-because-promoted, which the existing
masked_nodatalookup alone cannot do.Backend coverage
scans in the
mask_nodata=Falsebranch so the attr stays useful.pixels_presentdeliberately unset (lazy contract);dtype_castrecorded._apply_nodata_mask_gpu_with_presencehelper returns the boolalongside the masked buffer.
pixels_presentunset;dtype_castrecorded._apply_integer_sentinel_mask_with_presenceplus afloat-path NaN-presence proxy plus a no-mask-scan helper for the
mask_nodata=Falsebranch.pixels_presentunset;dtype_castrecorded.Overlap with sibling PRs
This change brushes
_attrs.pyand the four read backends. It doesnot pre-empt the scope of #2133, #2136, or #2139; each of those owns a
different attrs lifecycle problem on the same module. The only edit to
_set_nodata_attrshere adds two kwargs withNonedefaults sothe existing call shape stays valid.
Test plan
xrspatial/geotiff/tests/test_nodata_lifecycle_attrs_2135.py:18 tests across eager / dask / VRT (+ skip-marked GPU).
test_masked_nodata_attr_2092.py(12) andtest_nodata_semantics_split_1988.py(39) still green.xrspatial/geotiff/tests/suite (excluding GPU-onlypaths that need cupy + CUDA): 4200 passed.