You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
attrs['masked_nodata'] is a single boolean trying to describe a multi-stage process. The fix in 478dbd5 (PR #2127, closing #2092) tightened it so the attr now means "the reader replaced sentinel pixels with NaN," which is better than the pre-fix "final dtype is float." Callers still cannot answer four distinct questions from one bit:
Did the source file declare a sentinel?
Did the read path run the sentinel-to-NaN masking step?
Were sentinel pixels actually found in the read window?
Did a user-requested dtype cast happen after the masking decision (which can promote int to float without masking)?
Today (1) is encoded by the presence of attrs['nodata'], (2) is attrs['masked_nodata'], and (3) and (4) are not surfaced. Where each one matters in practice:
(1) declared sentinel exists. A round-trip writer needs the file-declared sentinel so it can preserve it on write even if the in-memory array has no missing pixels. attrs['nodata'] already covers this.
(2) masking step ran. A consumer deciding between np.isnan(arr) and arr == attrs['nodata'] needs this. That is what masked_nodata means after #2092.
(3) sentinel pixels present. A QA function asking "does this tile contain any nodata" currently has to scan the buffer. With a bool set at read time, the answer is free. Same problem on the writer: if no sentinel pixels were touched, the unmask-before-write scan can be skipped. Today the writer pays for it every time.
(4) dtype cast after masking. Caller passes dtype=float64 on an integer source with mask_nodata=False. Post-mask the buffer is integer with literal sentinels; post-cast it is float64 with literal sentinels (no NaNs). The #2092 follow-up reads pre-cast dtype so masked_nodata reports False here, but the cast itself is invisible downstream. A consumer wanting to know "float because promoted, or float because masked" cannot tell.
Proposed attribute scheme
Split the lifecycle into separate signals. Keep nodata and masked_nodata for backward compatibility, add two new attrs:
nodata (unchanged): declared file sentinel as a scalar of the source dtype. None/absent means no declared sentinel.
nodata_pixels_present (new): bool or int count. True/non-zero iff at least one pixel in the read window matched the sentinel before masking. Lets the writer skip the unmask scan, and lets QA code answer "any nodata?" without touching the buffer.
nodata_dtype_cast (new): None if the caller did not pass dtype=, otherwise the target dtype as a string. Records that a post-mask cast happened. Combined with masked_nodata this distinguishes float-because-masked from float-because-promoted.
A consumer reading these together can answer:
"Is there a declared sentinel?" -> 'nodata' in attrs
"Is the buffer NaN-aware?" -> attrs.get('masked_nodata') is True
"Were any sentinel pixels in this window?" -> attrs.get('nodata_pixels_present')
"Was this buffer cast after the read?" -> attrs.get('nodata_dtype_cast')
Backward compatibility
masked_nodata is already documented in attrs_contract.rst and exercised by the writer's _should_restore_nan_sentinel. The post-#2092 semantics are the ones to keep ("masking step ran"), so no rename or deprecation for that key.
The two new keys are additive. The writer's restore-on-write logic stays gated on masked_nodata, but it can also check nodata_pixels_present is False to skip the scan.
Recommend an additive change: ship nodata_pixels_present and nodata_dtype_cast alongside the existing pair, document them in attrs_contract.rst, and let downstream adopt at its own pace.
mask_nodata and arr.dtype.kind == 'f' (after #2092)
mask_nodata and target_dtype.kind == 'f'
mask_nodata and arr_gpu.dtype.kind == 'f'
pre_cast_dtype.kind == 'f'
same as eager (overview reads use the eager path)
nodata_pixels_present (new)
bool from the mask kernel's any() result; free side-effect of _apply_nodata_mask
per-chunk reduction, aggregated bool on the task graph
bool from the GPU mask kernel
bool from the VRT integer-sentinel mask helper plus the inline float-NaN-mask in _vrt._read_data
same
nodata_dtype_cast (new)
the dtype= kwarg passed to arr.astype(target)
the dtype= kwarg on the graph cast
same
same
same
For dask, nodata_pixels_present is the awkward one: a strict per-chunk reduction means the value is only known after .compute(). Two options: (a) compute eagerly as a delayed scalar attached to attrs, or (b) leave it unset on dask outputs and document that. Recommend (b) for the first cut; the writer can fall back to "scan if missing" on dask paths.
Tie-in with the GeoTIFFMetadata dataclass proposal
The companion proposal to back the attrs dict with a GeoTIFFMetadata dataclass gives a clean home for these fields. Once attrs is a view over a dataclass, the four signals become typed fields rather than loose dict keys: metadata.nodata, metadata.masked_nodata, metadata.nodata_pixels_present, metadata.nodata_dtype_cast. The dict view stays for backward compatibility and serialization.
The dataclass also resolves the "False vs missing" ambiguity that _should_restore_nan_sentinel currently handles with a deliberate identity check on value is not False at _attrs.py:244. With typed fields, None vs False vs True is unambiguous.
Tests
Matrix to cover, per backend (eager, dask, GPU, VRT):
Source
mask_nodata
dtype=
sentinel in window
Expected nodata
Expected masked_nodata
Expected nodata_pixels_present
Expected nodata_dtype_cast
float, sentinel declared
True
None
yes
sentinel
True
True
None
float, sentinel declared
False
None
yes
sentinel
False
True
None
float, sentinel declared
True
None
no
sentinel
True
False
None
int, sentinel declared
True
None
yes
sentinel
True (post-promotion)
True
None
int, sentinel declared
False
None
yes
sentinel
False
True
None
int, sentinel declared
False
float64
yes
sentinel
False
True
"float64"
int, sentinel out of range
True
None
n/a
sentinel
False (no cast happened)
False
None
no sentinel declared
True
None
n/a
absent
absent
absent
None
Extend test_nodata_semantics_split_1988.py and test_masked_nodata_attr_2092.py to assert on the new keys.
VRT also needs the multi-band case from #1611 (per-band sentinels) so nodata_pixels_present is computed per band when band=None.
Problem
attrs['masked_nodata']is a single boolean trying to describe a multi-stage process. The fix in 478dbd5 (PR #2127, closing #2092) tightened it so the attr now means "the reader replaced sentinel pixels with NaN," which is better than the pre-fix "final dtype is float." Callers still cannot answer four distinct questions from one bit:Today (1) is encoded by the presence of
attrs['nodata'], (2) isattrs['masked_nodata'], and (3) and (4) are not surfaced. Where each one matters in practice:(1) declared sentinel exists. A round-trip writer needs the file-declared sentinel so it can preserve it on write even if the in-memory array has no missing pixels.
attrs['nodata']already covers this.(2) masking step ran. A consumer deciding between
np.isnan(arr)andarr == attrs['nodata']needs this. That is whatmasked_nodatameans after #2092.(3) sentinel pixels present. A QA function asking "does this tile contain any nodata" currently has to scan the buffer. With a bool set at read time, the answer is free. Same problem on the writer: if no sentinel pixels were touched, the unmask-before-write scan can be skipped. Today the writer pays for it every time.
(4) dtype cast after masking. Caller passes
dtype=float64on an integer source withmask_nodata=False. Post-mask the buffer is integer with literal sentinels; post-cast it is float64 with literal sentinels (no NaNs). The #2092 follow-up reads pre-cast dtype somasked_nodatareports False here, but the cast itself is invisible downstream. A consumer wanting to know "float because promoted, or float because masked" cannot tell.Proposed attribute scheme
Split the lifecycle into separate signals. Keep
nodataandmasked_nodatafor backward compatibility, add two new attrs:nodata(unchanged): declared file sentinel as a scalar of the source dtype.None/absent means no declared sentinel.masked_nodata(unchanged after Bug: attrs['masked_nodata'] reports True when masking was disabled #2092): bool. True iff the reader replaced sentinel pixels with NaN. Only emitted whennodata is not None.nodata_pixels_present(new): bool or int count. True/non-zero iff at least one pixel in the read window matched the sentinel before masking. Lets the writer skip the unmask scan, and lets QA code answer "any nodata?" without touching the buffer.nodata_dtype_cast(new):Noneif the caller did not passdtype=, otherwise the target dtype as a string. Records that a post-mask cast happened. Combined withmasked_nodatathis distinguishes float-because-masked from float-because-promoted.A consumer reading these together can answer:
'nodata' in attrsattrs.get('masked_nodata') is Trueattrs.get('nodata_pixels_present')attrs.get('nodata_dtype_cast')Backward compatibility
masked_nodatais already documented inattrs_contract.rstand exercised by the writer's_should_restore_nan_sentinel. The post-#2092 semantics are the ones to keep ("masking step ran"), so no rename or deprecation for that key.The two new keys are additive. The writer's restore-on-write logic stays gated on
masked_nodata, but it can also checknodata_pixels_present is Falseto skip the scan.Recommend an additive change: ship
nodata_pixels_presentandnodata_dtype_castalongside the existing pair, document them inattrs_contract.rst, and let downstream adopt at its own pace.Read paths
Where each signal is computed in each backend:
__init__.py)_backends/dask.py)_backends/gpu.py)_backends/vrt.py)nodatageo_info.nodatanodata_attrgeo_info.nodatavrt.bands[].nodatamasked_nodatamask_nodata and arr.dtype.kind == 'f'(after #2092)mask_nodata and target_dtype.kind == 'f'mask_nodata and arr_gpu.dtype.kind == 'f'pre_cast_dtype.kind == 'f'nodata_pixels_present(new)any()result; free side-effect of_apply_nodata_mask_vrt._read_datanodata_dtype_cast(new)dtype=kwarg passed toarr.astype(target)dtype=kwarg on the graph castFor dask,
nodata_pixels_presentis the awkward one: a strict per-chunk reduction means the value is only known after.compute(). Two options: (a) compute eagerly as a delayed scalar attached to attrs, or (b) leave it unset on dask outputs and document that. Recommend (b) for the first cut; the writer can fall back to "scan if missing" on dask paths.Tie-in with the GeoTIFFMetadata dataclass proposal
The companion proposal to back the attrs dict with a
GeoTIFFMetadatadataclass gives a clean home for these fields. Onceattrsis a view over a dataclass, the four signals become typed fields rather than loose dict keys:metadata.nodata,metadata.masked_nodata,metadata.nodata_pixels_present,metadata.nodata_dtype_cast. The dict view stays for backward compatibility and serialization.The dataclass also resolves the "False vs missing" ambiguity that
_should_restore_nan_sentinelcurrently handles with a deliberate identity check onvalue is not Falseat_attrs.py:244. With typed fields,NonevsFalsevsTrueis unambiguous.Tests
Matrix to cover, per backend (eager, dask, GPU, VRT):
mask_nodatadtype=nodatamasked_nodatanodata_pixels_presentnodata_dtype_castExtend
test_nodata_semantics_split_1988.pyandtest_masked_nodata_attr_2092.pyto assert on the new keys.VRT also needs the multi-band case from #1611 (per-band sentinels) so
nodata_pixels_presentis computed per band whenband=None.References
masked_nodatafrom dtype inference to the actual mask decision.mask_nodata=Falseon a float file with a non-NaN sentinel left the attr lying.nodatafrommasked_nodata.nodatafrom the base IFD; the new attrs need to inherit too.band_nodata=kwarg;nodata_pixels_presentshould be per-band consistent.dtype=on int sentinel sources, the same cast path that motivatesnodata_dtype_cast.