geotiff: thread masked decision through _set_nodata_attrs (#2092)#2127
Merged
Conversation
`attrs['masked_nodata']` was inferred purely from the final array dtype. With `mask_nodata=False` on a float file with a non-NaN sentinel, the masking step was skipped, the buffer kept the literal sentinel pixels, but the attr still claimed True; downstream code that trusted the attr treated those pixels as already-NaN. Change `_set_nodata_attrs` to take an explicit `masked: bool` argument and have every read path pass the actual decision. The eager / dask / GPU paths compute it as `mask_nodata and final_dtype.kind == 'f'`. VRT keeps the dtype-driven rule because its internal reader inlines float NaN-masking unconditionally. Closes #2092
# Conflicts: # xrspatial/geotiff/_attrs.py # xrspatial/geotiff/_backends/gpu.py
brendancol
commented
May 19, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
PR Review: geotiff: thread masked decision through _set_nodata_attrs (#2092)
Blockers (must fix before merge)
- None.
Suggestions (should fix, not blocking)
docs/source/user_guide/attrs_contract.rst:69-76still describes the pre-fix rule: "Truewhen the in-memory array is float dtype and the reader's sentinel-to-NaN step ran." This PR decouples those two conditions (mask_nodata=Falseon a float file is now float dtype + no step ran + False). The doc should match the new contract so a reader doesn't trust the old coupled rule.- VRT path,
xrspatial/geotiff/_backends/vrt.py:350(eager) and:725(chunked) keep a dtype-only rule. The accompanying comment argues this is fine because the VRT internal reader inlines NaN-masking on float sources unconditionally, which is true for native-float VRT sources. But there's a remaining hole:open_geotiff(vrt, mask_nodata=False, dtype=np.float64)on an integer VRT source skips the integer mask helper atvrt.py:320, then casts to float at:317. The buffer holds literal int sentinels cast to float, but the rule reportsmasked_nodata=True. The non-VRT eager path avoids this via themask_nodata and ...conjunction. Either tighten the VRT rule to the same conjunction, or add a test pinning the discrepancy as documented behavior.
Nits (optional improvements)
test_masked_nodata_attr_2092.pycovers the dask "int source + dtype=float64 + mask_off" edge intest_dask_explicit_float_dtype_mask_off_reports_false, but no analogous test exists for the eager numpy path (open_geotiffwith the same kwargs) or the VRT eager path. Adding both would pin the new contract symmetrically across backends._set_nodata_attrsdocstring at_attrs.pycould lead with the one-line contract (maskedis the actual mask-decision the read path made) before the historical context paragraph. The explanation is useful but pushes the actual rule below the fold.
What looks good
- The signature change from
array_dtype=tomasked: boolis clean and matches the issue's spec. - All 7 call sites identified in the issue are updated, and each call site has a comment explaining the per-backend rule plus the VRT vs eager/dask/GPU asymmetry.
- The test file covers all four backend paths in both directions (
mask_nodata=Falseand=True), with realistic float-with-sentinel and int-out-of-range repros. test_masked_coerced_to_boolpins the bool-coercion contract for downstream serializers that can't take numpy scalars._should_restore_nan_sentinel(from a recent main commit) survives the merge intact.
Checklist
- Algorithm matches issue specification
- All four read backends agree for the common case
- NaN handling unchanged (the fix only touches the attr, not the buffer)
- Edge cases covered (mask_off, int-out-of-range, explicit dtype cast on dask path)
- Dask chunk boundaries unaffected (no neighborhood changes)
- No premature materialization or extra copies
- Benchmark — not applicable (attr-only fix)
- README feature matrix — not applicable
- Docstrings updated for the new signature
- User guide doc
attrs_contract.rststill describes the old rule (see Suggestions)
Follow-up to review feedback: * VRT eager (`vrt.py:329-353`) and chunked (`vrt.py:719-738`) paths now read the pre-cast / declared dtype, not the post-cast one. Fixes the corner case where `mask_nodata=False` + `dtype=float64` on an int VRT source claimed `masked_nodata=True` even though the buffer still held literal sentinels. * `attrs_contract.rst` updated to describe the actual rule (replaced pixels with NaN) instead of the old coupled "float dtype + step ran" wording. * `_set_nodata_attrs` docstring now leads with the one-line contract; history moves below. * Two new tests pin the int-source + mask_off + float-cast case for the eager (`__init__.py`) and VRT paths, mirroring the existing dask coverage.
brendancol
commented
May 19, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
Follow-up Review
Disposition of round-1 findings:
Blockers
- None.
Suggestions
attrs_contract.rst:69-76doc drift — Fixed in1953135. The doc now describes the post-#2092 contract (reader actually replaced sentinel pixels), and the float-cast-on-int corner case is called out explicitly.- VRT dtype-only rule had a hole — Fixed in
1953135. The eager VRT path capturespre_cast_dtypebefore the userdtype=cast and the chunked VRT path usesdeclared_dtype(also pre-cast) instead offinal_dtype. The float-with-NaN buffer from_vrt._read_data(native float source or int-into-float VRT) still reports True; an int source withmask_nodata=False+ float cast now correctly reports False.
Nits
- Missing eager / VRT analogues for the dask "int + float cast + mask off" test — Fixed in
1953135. Addedtest_eager_explicit_float_dtype_mask_off_reports_falseandtest_vrt_int_source_mask_off_with_float_cast_reports_false, mirroring the existing dask edge. _set_nodata_attrsdocstring buried the rule — Fixed in1953135. Docstring now opens with themaskedcontract and the history follows.
Verification
- 4229 passed, 25 skipped (full geotiff suite)
- 51 passed (new + helper tests)
No remaining findings.
This was referenced May 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_set_nodata_attrsnow takes an explicitmasked: boolargument instead of inferring from dtype. The eager, dask, and GPU read paths passmask_nodata and final_dtype.kind == 'f'so the attr matches the actual masking decision; the VRT path keeps the dtype-driven rule because its internal reader NaN-masks float sources unconditionally.mask_nodata=Falseleft literal sentinel pixels in the buffer but setattrs['masked_nodata']=True. Downstream code that trusted the attr ("NaN means missing, sentinels have been replaced") then treated -9999 pixels as already-masked valid data.Closes #2092
Test plan
masked_nodata=Falsewithmask_nodata=Falsetest_masked_nodata_attr_2092.pycover eager, dask, VRT, and GPU paths in both directionstest_nodata_semantics_split_1988.pyupdated for the newmasked=signature