Skip to content

geotiff: split overloaded masked_nodata into separate nodata lifecycle signals #2135

@brendancol

Description

@brendancol

Problem

attrs['masked_nodata'] is a single boolean trying to describe a multi-stage process. The fix in 478dbd5 (PR #2127, closing #2092) tightened it so the attr now means "the reader replaced sentinel pixels with NaN," which is better than the pre-fix "final dtype is float." Callers still cannot answer four distinct questions from one bit:

  1. Did the source file declare a sentinel?
  2. Did the read path run the sentinel-to-NaN masking step?
  3. Were sentinel pixels actually found in the read window?
  4. Did a user-requested dtype cast happen after the masking decision (which can promote int to float without masking)?

Today (1) is encoded by the presence of attrs['nodata'], (2) is attrs['masked_nodata'], and (3) and (4) are not surfaced. Where each one matters in practice:

(1) declared sentinel exists. A round-trip writer needs the file-declared sentinel so it can preserve it on write even if the in-memory array has no missing pixels. attrs['nodata'] already covers this.

(2) masking step ran. A consumer deciding between np.isnan(arr) and arr == attrs['nodata'] needs this. That is what masked_nodata means after #2092.

(3) sentinel pixels present. A QA function asking "does this tile contain any nodata" currently has to scan the buffer. With a bool set at read time, the answer is free. Same problem on the writer: if no sentinel pixels were touched, the unmask-before-write scan can be skipped. Today the writer pays for it every time.

(4) dtype cast after masking. Caller passes dtype=float64 on an integer source with mask_nodata=False. Post-mask the buffer is integer with literal sentinels; post-cast it is float64 with literal sentinels (no NaNs). The #2092 follow-up reads pre-cast dtype so masked_nodata reports False here, but the cast itself is invisible downstream. A consumer wanting to know "float because promoted, or float because masked" cannot tell.

Proposed attribute scheme

Split the lifecycle into separate signals. Keep nodata and masked_nodata for backward compatibility, add two new attrs:

  • nodata (unchanged): declared file sentinel as a scalar of the source dtype. None/absent means no declared sentinel.
  • masked_nodata (unchanged after Bug: attrs['masked_nodata'] reports True when masking was disabled #2092): bool. True iff the reader replaced sentinel pixels with NaN. Only emitted when nodata is not None.
  • nodata_pixels_present (new): bool or int count. True/non-zero iff at least one pixel in the read window matched the sentinel before masking. Lets the writer skip the unmask scan, and lets QA code answer "any nodata?" without touching the buffer.
  • nodata_dtype_cast (new): None if the caller did not pass dtype=, otherwise the target dtype as a string. Records that a post-mask cast happened. Combined with masked_nodata this distinguishes float-because-masked from float-because-promoted.

A consumer reading these together can answer:

  • "Is there a declared sentinel?" -> 'nodata' in attrs
  • "Is the buffer NaN-aware?" -> attrs.get('masked_nodata') is True
  • "Were any sentinel pixels in this window?" -> attrs.get('nodata_pixels_present')
  • "Was this buffer cast after the read?" -> attrs.get('nodata_dtype_cast')

Backward compatibility

masked_nodata is already documented in attrs_contract.rst and exercised by the writer's _should_restore_nan_sentinel. The post-#2092 semantics are the ones to keep ("masking step ran"), so no rename or deprecation for that key.

The two new keys are additive. The writer's restore-on-write logic stays gated on masked_nodata, but it can also check nodata_pixels_present is False to skip the scan.

Recommend an additive change: ship nodata_pixels_present and nodata_dtype_cast alongside the existing pair, document them in attrs_contract.rst, and let downstream adopt at its own pace.

Read paths

Where each signal is computed in each backend:

Signal Eager (__init__.py) Dask (_backends/dask.py) GPU (_backends/gpu.py) VRT (_backends/vrt.py) HTTP / COG overview
nodata from geo_info.nodata from nodata_attr from geo_info.nodata from vrt.bands[].nodata inherited from base IFD (#1739)
masked_nodata mask_nodata and arr.dtype.kind == 'f' (after #2092) mask_nodata and target_dtype.kind == 'f' mask_nodata and arr_gpu.dtype.kind == 'f' pre_cast_dtype.kind == 'f' same as eager (overview reads use the eager path)
nodata_pixels_present (new) bool from the mask kernel's any() result; free side-effect of _apply_nodata_mask per-chunk reduction, aggregated bool on the task graph bool from the GPU mask kernel bool from the VRT integer-sentinel mask helper plus the inline float-NaN-mask in _vrt._read_data same
nodata_dtype_cast (new) the dtype= kwarg passed to arr.astype(target) the dtype= kwarg on the graph cast same same same

For dask, nodata_pixels_present is the awkward one: a strict per-chunk reduction means the value is only known after .compute(). Two options: (a) compute eagerly as a delayed scalar attached to attrs, or (b) leave it unset on dask outputs and document that. Recommend (b) for the first cut; the writer can fall back to "scan if missing" on dask paths.

Tie-in with the GeoTIFFMetadata dataclass proposal

The companion proposal to back the attrs dict with a GeoTIFFMetadata dataclass gives a clean home for these fields. Once attrs is a view over a dataclass, the four signals become typed fields rather than loose dict keys: metadata.nodata, metadata.masked_nodata, metadata.nodata_pixels_present, metadata.nodata_dtype_cast. The dict view stays for backward compatibility and serialization.

The dataclass also resolves the "False vs missing" ambiguity that _should_restore_nan_sentinel currently handles with a deliberate identity check on value is not False at _attrs.py:244. With typed fields, None vs False vs True is unambiguous.

Tests

Matrix to cover, per backend (eager, dask, GPU, VRT):

Source mask_nodata dtype= sentinel in window Expected nodata Expected masked_nodata Expected nodata_pixels_present Expected nodata_dtype_cast
float, sentinel declared True None yes sentinel True True None
float, sentinel declared False None yes sentinel False True None
float, sentinel declared True None no sentinel True False None
int, sentinel declared True None yes sentinel True (post-promotion) True None
int, sentinel declared False None yes sentinel False True None
int, sentinel declared False float64 yes sentinel False True "float64"
int, sentinel out of range True None n/a sentinel False (no cast happened) False None
no sentinel declared True None n/a absent absent absent None

Extend test_nodata_semantics_split_1988.py and test_masked_nodata_attr_2092.py to assert on the new keys.

VRT also needs the multi-band case from #1611 (per-band sentinels) so nodata_pixels_present is computed per band when band=None.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    apiAPI design and consistencyenhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions