geotiff: thread masked decision through _set_nodata_attrs (#2092) #2127

Merged
brendancol merged 3 commits into main from issue-2092 on May 19, 2026
@brendancol
Contributor

Summary

  • _set_nodata_attrs now takes an explicit masked: bool argument instead of inferring from dtype. The eager, dask, and GPU read paths pass mask_nodata and final_dtype.kind == 'f' so the attr matches the actual masking decision; the VRT path keeps the dtype-driven rule because its internal reader NaN-masks float sources unconditionally.
  • Before this fix, opening a float file with a non-NaN sentinel and mask_nodata=False left literal sentinel pixels in the buffer but set attrs['masked_nodata']=True. Downstream code that trusted the attr ("NaN means missing, sentinels have been replaced") then treated -9999 pixels as already-masked valid data.
  • Backend coverage: numpy, cupy, dask+numpy, dask+cupy, plus the VRT eager and chunked paths. 7 call sites updated.
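
The difference between the two rules can be sketched in a few lines. This is an illustration only, not the code in `_attrs.py`: the helper names `masked_from_dtype` and `masked_from_decision` are hypothetical, and the dtype is reduced to its one-character kind code (`'f'` for float, `'i'` for signed integer, as in NumPy's `dtype.kind`).

```python
def masked_from_dtype(final_kind: str) -> bool:
    """Pre-fix rule (hypothetical helper): infer the flag from dtype kind alone."""
    return final_kind == "f"


def masked_from_decision(mask_nodata: bool, final_kind: str) -> bool:
    """Post-fix rule for the eager, dask, and GPU paths (hypothetical helper).

    The flag is True only when masking was requested AND the output is float.
    """
    return bool(mask_nodata and final_kind == "f")


# Float file with a non-NaN sentinel, opened with mask_nodata=False.
old_flag = masked_from_dtype("f")            # dtype-only inference
new_flag = masked_from_decision(False, "f")  # explicit decision

print(old_flag, new_flag)
```

Under these assumptions the dtype-only rule reports True for the repro case while the explicit rule reports False, which is the behavior change the summary describes.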

Closes #2092

Test plan

  • Issue 2092 repro now reports masked_nodata=False with mask_nodata=False
  • New tests in test_masked_nodata_attr_2092.py cover eager, dask, VRT, and GPU paths in both directions
  • Helper unit tests in test_nodata_semantics_split_1988.py updated for the new masked= signature
  • Full geotiff test suite passes (4227 passed, 25 skipped)

`attrs['masked_nodata']` was inferred purely from the final array
dtype. With `mask_nodata=False` on a float file with a non-NaN
sentinel, the masking step was skipped, the buffer kept the literal
sentinel pixels, but the attr still claimed True; downstream code
that trusted the attr treated those pixels as already-NaN.

Change `_set_nodata_attrs` to take an explicit `masked: bool`
argument and have every read path pass the actual decision. The
eager / dask / GPU paths compute it as
`mask_nodata and final_dtype.kind == 'f'`. VRT keeps the
dtype-driven rule because its internal reader inlines float
NaN-masking unconditionally.
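
To illustrate why the stale attr matters, here is a minimal sketch of a downstream consumer that trusts `attrs['masked_nodata']`. The function `count_valid` and the `nodata` attr key are assumptions made for the example; they are not taken from this PR.

```python
import math


def count_valid(pixels, attrs):
    """Hypothetical consumer: pick the missing-data test from the attr."""
    if attrs.get("masked_nodata"):
        # Attr says sentinels were already replaced, so only NaN is missing.
        return sum(1 for p in pixels if not math.isnan(p))
    # Attr says the buffer still holds literal sentinels.
    sentinel = attrs.get("nodata")
    return sum(1 for p in pixels if p != sentinel)


# Buffer as read with mask_nodata=False: sentinels are still literal values.
pixels = [1.0, -9999.0, 2.5, -9999.0]

# Pre-fix attrs claimed masking had happened; post-fix attrs do not.
stale = count_valid(pixels, {"nodata": -9999.0, "masked_nodata": True})
fixed = count_valid(pixels, {"nodata": -9999.0, "masked_nodata": False})

print(stale, fixed)
```

With the stale attr the two sentinel pixels are counted as valid data; with the corrected attr they are excluded.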

Closes #2092
# Conflicts:
#	xrspatial/geotiff/_attrs.py
#	xrspatial/geotiff/_backends/gpu.py
@github-actions bot added the performance label (PR touches performance-sensitive code) on May 19, 2026
Contributor Author

@brendancol brendancol left a comment

PR Review: geotiff: thread masked decision through _set_nodata_attrs (#2092)

Blockers (must fix before merge)

  • None.

Suggestions (should fix, not blocking)

  • docs/source/user_guide/attrs_contract.rst:69-76 still describes the pre-fix rule: "True when the in-memory array is float dtype and the reader's sentinel-to-NaN step ran." This PR decouples those two conditions: with mask_nodata=False on a float file, the array is float dtype, the masking step did not run, and masked_nodata is now False. The doc should match the new contract so a reader doesn't rely on the old coupled rule.
  • VRT path, xrspatial/geotiff/_backends/vrt.py:350 (eager) and :725 (chunked) keep a dtype-only rule. The accompanying comment argues this is fine because the VRT internal reader inlines NaN-masking on float sources unconditionally, which is true for native-float VRT sources. But there's a remaining hole: open_geotiff(vrt, mask_nodata=False, dtype=np.float64) on an integer VRT source skips the integer mask helper at vrt.py:320, then casts to float at :317. The buffer holds literal int sentinels cast to float, but the rule reports masked_nodata=True. The non-VRT eager path avoids this via the mask_nodata and ... conjunction. Either tighten the VRT rule to the same conjunction, or add a test pinning the discrepancy as documented behavior.

Nits (optional improvements)

  • test_masked_nodata_attr_2092.py covers the dask "int source + dtype=float64 + mask_off" edge in test_dask_explicit_float_dtype_mask_off_reports_false, but no analogous test exists for the eager numpy path (open_geotiff with the same kwargs) or the VRT eager path. Adding both would pin the new contract symmetrically across backends.
  • _set_nodata_attrs docstring at _attrs.py could lead with the one-line contract (masked is the actual mask-decision the read path made) before the historical context paragraph. The explanation is useful but pushes the actual rule below the fold.

What looks good

  • The signature change from array_dtype= to masked: bool is clean and matches the issue's spec.
  • All 7 call sites identified in the issue are updated, and each call site has a comment explaining the per-backend rule plus the VRT vs eager/dask/GPU asymmetry.
  • The test file covers all four backend paths in both directions (mask_nodata=False and =True), with realistic float-with-sentinel and int-out-of-range repros.
  • test_masked_coerced_to_bool pins the bool-coercion contract for downstream serializers that can't take numpy scalars.
  • _should_restore_nan_sentinel (from a recent main commit) survives the merge intact.

Checklist

  • Algorithm matches issue specification
  • All four read backends agree for the common case
  • NaN handling unchanged (the fix only touches the attr, not the buffer)
  • Edge cases covered (mask_off, int-out-of-range, explicit dtype cast on dask path)
  • Dask chunk boundaries unaffected (no neighborhood changes)
  • No premature materialization or extra copies
  • Benchmark — not applicable (attr-only fix)
  • README feature matrix — not applicable
  • Docstrings updated for the new signature
  • User guide doc attrs_contract.rst still describes the old rule (see Suggestions)

Follow-up to review feedback:

* VRT eager (`vrt.py:329-353`) and chunked (`vrt.py:719-738`) paths
  now read the pre-cast / declared dtype, not the post-cast one.
  Fixes the corner case where `mask_nodata=False` + `dtype=float64`
  on an int VRT source claimed `masked_nodata=True` even though the
  buffer still held literal sentinels.
* `attrs_contract.rst` updated to describe the actual rule (replaced
  pixels with NaN) instead of the old coupled "float dtype + step
  ran" wording.
* `_set_nodata_attrs` docstring now leads with the one-line contract;
  history moves below.
* Two new tests pin the int-source + mask_off + float-cast case for
  the eager (`__init__.py`) and VRT paths, mirroring the existing
  dask coverage.
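
The VRT corner case above can be sketched as follows. This is a simplified illustration, not the code in `vrt.py`: the function names are hypothetical, dtypes are reduced to NumPy-style kind codes, and only the int-source case described in this commit message is modeled.

```python
def vrt_masked_post_cast(post_cast_kind: str) -> bool:
    """Earlier VRT rule (hypothetical helper): look at the dtype after the user cast."""
    return post_cast_kind == "f"


def vrt_masked_pre_cast(pre_cast_kind: str) -> bool:
    """Follow-up VRT rule (hypothetical helper): look at the dtype before the user cast."""
    return pre_cast_kind == "f"


# Int VRT source opened with mask_nodata=False and dtype=float64:
# the buffer is int before the cast and float after it, with sentinels intact.
pre_cast_kind, post_cast_kind = "i", "f"

before_fix = vrt_masked_post_cast(post_cast_kind)
after_fix = vrt_masked_pre_cast(pre_cast_kind)

print(before_fix, after_fix)
```

Reading the pre-cast kind keeps the user's `dtype=` cast from turning an unmasked integer buffer into one that looks masked.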
Contributor Author

@brendancol brendancol left a comment

Follow-up Review

Disposition of round-1 findings:

Blockers

  • None.

Suggestions

  • attrs_contract.rst:69-76 doc drift — Fixed in 1953135. The doc now describes the post-#2092 contract (reader actually replaced sentinel pixels), and the float-cast-on-int corner case is called out explicitly.
  • VRT dtype-only rule had a hole — Fixed in 1953135. The eager VRT path captures pre_cast_dtype before the user dtype= cast and the chunked VRT path uses declared_dtype (also pre-cast) instead of final_dtype. The float-with-NaN buffer from _vrt._read_data (native float source or int-into-float VRT) still reports True; an int source with mask_nodata=False + float cast now correctly reports False.

Nits

  • Missing eager / VRT analogues for the dask "int + float cast + mask off" test — Fixed in 1953135. Added test_eager_explicit_float_dtype_mask_off_reports_false and test_vrt_int_source_mask_off_with_float_cast_reports_false, mirroring the existing dask edge.
  • _set_nodata_attrs docstring buried the rule — Fixed in 1953135. Docstring now opens with the masked contract and the history follows.

Verification

  • 4229 passed, 25 skipped (full geotiff suite)
  • 51 passed (new + helper tests)

No remaining findings.

Labels

performance PR touches performance-sensitive code

Development

Successfully merging this pull request may close these issues.

Bug: attrs['masked_nodata'] reports True when masking was disabled

1 participant