Skip to content

geotiff: introduce internal GeoTIFFMetadata dataclass; convert to/from attrs at API boundary #2139

@brendancol

Description

@brendancol

Problem

Every read path in the geotiff module funnels through _populate_attrs_from_geo_info, and every writer through _resolve_nodata_attr / _extract_rich_tags. The shared surface is still a plain dict[str, Any]. Each backend builds and reads that dict by hand, and we have a tail of drift bugs plus a tail of helpers that exist only to paper over the dict shape. A small internal dataclass would let readers, writers, validators, and the VRT path pass a typed record around, and only marshal to/from attrs at the public API boundary.

Concrete drift cases from the current tree:

  1. Two read paths bypass _populate_attrs_from_geo_info and build the attrs dict inline. xrspatial/geotiff/_backends/vrt.py:275 and xrspatial/geotiff/_backends/vrt.py:697 open attrs = {'_xrspatial_geotiff_contract': _ATTRS_CONTRACT_VERSION} and then set attrs['crs'], attrs['crs_wkt'], attrs['transform'], attrs['raster_type'], attrs[_NO_GEOREF_KEY], and attrs['vrt_holes'] directly. The eager, dask, and GPU paths funnel through the helper instead (__init__.py:633, _backends/dask.py:342, _backends/gpu.py:437, :816, :1369). Two of the four read paths (and any future fifth) must remember to stamp the contract version, set crs only when an EPSG resolves, set crs_wkt whenever any CRS is known, set raster_type='point' only on point rasters, and apply the no-georef marker on missing transform tags. Nothing in the type system enforces that list.

  2. The reader's _populate_attrs_from_geo_info emits attrs['crs'] only when geo_info.crs_epsg is not None and attrs['crs_wkt'] only when geo_info.crs_wkt is not None (_attrs.py:364-367). The two VRT read paths run epsg = _wkt_to_epsg(vrt.crs_wkt) inline and only emit attrs['crs'] when that returns non-None (_backends/vrt.py:285-288, :715-718). Three writers re-implement the resolve-from-attrs step from scratch (_writers/eager.py:574-592, _writers/eager.py:913-931, _writers/gpu.py:438-456), each duplicating the "attrs['crs'] may be int EPSG or WKT string, attrs['crs_wkt'] only carries WKT" branch tree. The duplicated logic has already drifted: _writers/eager.py:585 validates crs_attr via _validate_crs_arg before coercing to int, while _backends/vrt.py:287 does not validate at all (the VRT path constructs the int from _wkt_to_epsg and never sees a user-supplied attrs['crs']). Any future tightening of that validator has to land in three places.

  3. The nodata / masked_nodata pair is split across _set_nodata_attrs and a per-backend masked= computation. The eager path sets masked=(mask_nodata and arr.dtype.kind == 'f') against the final array (__init__.py:697-700). The dask path sets masked=(mask_nodata and target_dtype.kind == 'f') against the declared graph dtype (_backends/dask.py:350-353). The VRT eager path captures a pre_cast_dtype before the optional user cast and uses masked=(pre_cast_dtype.kind == 'f') (_backends/vrt.py:341, :366-369). The VRT chunked path uses masked=(declared_dtype.kind == 'f') for the same reason (_backends/vrt.py:743-746). All four were introduced independently under issue Bug: attrs['masked_nodata'] reports True when masking was disabled #2092 because the pre-fix code inferred masked from the final dtype, which lied when mask_nodata=False left literal sentinel values in a float buffer. A typed record would put the "masked is the actual mask-decision the read path made" invariant in one place rather than four.

  4. Pass-through attrs ride on the same dict. attrs['vrt_holes'] exists only on the VRT path (_backends/vrt.py:300, :774). attrs['extra_tags'] and attrs['colormap'] exist on every path except VRT (_attrs.py:412-413, :438). attrs['gdal_metadata_xml'] and attrs['gdal_metadata'] exist on every path except VRT (_attrs.py:407-410). A typed record makes the per-backend coverage explicit and turns silent omissions into type-checker errors.

  5. Writers resolve nodata from attrs via _resolve_nodata_attr (_attrs.py:442-513), which accepts three aliases (nodata, nodatavals, _FillValue). The companion validator _validate_attrs_nodata_consistency (_validation.py:636) refuses writes where attrs['nodata'] and attrs['nodatavals'] disagree. The same logic is duplicated by hand in three writers via if nodata is None: nodata = _resolve_nodata_attr(data.attrs) (_writers/eager.py:593-594, _writers/eager.py:932-941, _writers/gpu.py:457-458). A typed record built once at the API boundary lets those three writers take a GeoTIFFMetadata parameter and drop the resolve step.

The dataclass shape

from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class GeoTIFFMetadata:
    # Spatial reference
    transform: tuple[float, ...] | None = None   # rasterio 6-tuple
    crs_epsg: int | None = None                  # surfaces as attrs['crs']
    crs_wkt: str | None = None                   # surfaces as attrs['crs_wkt']
    raster_type: str = 'area'                    # 'area' | 'point'
    has_georef: bool = True                      # False -> stamps _NO_GEOREF_KEY

    # NoData semantics (issue #1988 / #2092)
    nodata: int | float | None = None            # declared file sentinel
    masked_nodata: bool | None = None            # True iff sentinel-to-NaN ran

    # Pass-through TIFF tags
    extra_tags: list[tuple] | None = None
    image_description: str | None = None
    extra_samples: tuple[int, ...] | None = None
    colormap: tuple[int, ...] | None = None

    # GDAL_METADATA
    gdal_metadata: dict | None = None
    gdal_metadata_xml: str | None = None

    # Resolution tags
    x_resolution: float | None = None
    y_resolution: float | None = None
    resolution_unit: str | None = None           # 'none' | 'inch' | 'centimeter'

    # VRT-only
    vrt_holes: list[dict] | None = None

    # Contract version
    contract_version: int = 2

The keys map 1:1 onto the existing attrs contract in _attrs.py:18-122, so external behaviour is unchanged. Field notes:

  • transform, crs_epsg, crs_wkt, raster_type, and has_georef are the spatial reference fields. has_georef=False is the in-memory form of attrs[_NO_GEOREF_KEY] = True (_attrs.py:394); making it a bool means the writer's "did this come from a real georef" check is md.has_georef instead of _is_no_georef_array(da).
  • nodata and masked_nodata follow the issue geotiff: split declared nodata from masked-data state in DataArray attrs #1988 split. nodata is the declared file sentinel. masked_nodata is True iff the in-memory array went through the sentinel-to-NaN step. masked_nodata is None (not False) when no sentinel was declared, mirroring the current "absence of attrs['masked_nodata'] means no declared sentinel" rule (_attrs.py:248-278).
  • extra_tags, image_description, extra_samples, and colormap are the pass-through tags that _extract_rich_tags folds back into the writer kwargs (_attrs.py:573-621). Keeping them on the dataclass means _merge_friendly_extra_tags can take a GeoTIFFMetadata and return one.
  • gdal_metadata and gdal_metadata_xml stay as separate fields because _extract_rich_tags already prefers the XML form over the dict form (_attrs.py:599-604); merging them would lose that precedence.
  • vrt_holes is VRT-only today and stays VRT-only on the dataclass.
  • contract_version lives on the record so a future v3 reader can produce a v3 record and the conversion layer can decide what to stamp.

Conversion functions

Two functions at the API boundary, both in _attrs.py:

def metadata_to_attrs(md: GeoTIFFMetadata) -> dict[str, Any]:
    """Build the public attrs dict from a metadata record.
    Mirrors the current _populate_attrs_from_geo_info field-by-field."""

def attrs_to_metadata(attrs: dict[str, Any]) -> GeoTIFFMetadata:
    """Parse a (possibly user-supplied) attrs dict into a metadata record.
    Honours the alias resolution _resolve_nodata_attr already implements."""

metadata_to_attrs replaces the seven manual attrs[...] = ... sites in _populate_attrs_from_geo_info (_attrs.py:362-439) and the two inline blocks at _backends/vrt.py:275-300 and :697-720.

attrs_to_metadata replaces the three duplicated "read crs / crs_wkt / nodata off attrs" blocks in the writers (_writers/eager.py:574-611, :913-941, _writers/gpu.py:438-460).

Internal call sites

Functions that stop touching attrs directly and take/return GeoTIFFMetadata:

Read side:

  • _populate_attrs_from_geo_info(attrs, geo_info, *, window=None) becomes geo_info_to_metadata(geo_info, *, window=None) -> GeoTIFFMetadata. Callers in __init__.py:633, _backends/dask.py:342, _backends/gpu.py:437, :816, and :1369 switch to the new function and call metadata_to_attrs(md) immediately before constructing the DataArray.
  • _set_nodata_attrs(attrs, nodata, *, masked) becomes a builder method on GeoTIFFMetadata, e.g. md = md.with_nodata(nodata, masked=...). Callers in __init__.py:697, _backends/dask.py:350, _backends/gpu.py:453, :817, _backends/vrt.py:366, and :743 switch to the builder.
  • The two VRT inline blocks (_backends/vrt.py:275-300, :697-720) build a GeoTIFFMetadata directly from vrt.crs_wkt / vrt.geo_transform / vrt.bands / vrt.holes and call metadata_to_attrs at the DataArray-construction step.

Write side:

  • _resolve_nodata_attr(attrs) becomes a no-op when the writer takes a GeoTIFFMetadata: the field is already resolved.
  • _extract_rich_tags(attrs) becomes _extract_rich_tags(md). The three writers (_writers/eager.py:605, :932-941 via _resolve_nodata_attr, _writers/gpu.py:458) call md = attrs_to_metadata(data.attrs) once at entry and read fields off md instead of re-resolving from data.attrs.
  • _should_restore_nan_sentinel(attrs) becomes a one-liner on the record (md.masked_nodata is not False).

Validators:

  • validate_write_metadata (_validation.py) already takes a dict literal of crs_kwarg, attrs_crs, attrs_crs_wkt, nodata_kwarg, attrs_nodata, attrs_nodatavals, coord_y, coord_x. That dict is built in three call sites (_writers/eager.py:298-307, :856-865, _writers/gpu.py:318-327); they all collapse to validate_write_metadata(md, crs_kwarg=crs, nodata_kwarg=nodata, coord_y=..., coord_x=...).

Migration plan

Internal refactor; the public attrs surface does not change. Each step is independently testable:

  1. Land GeoTIFFMetadata and the two conversion functions in _attrs.py. Add round-trip tests asserting metadata_to_attrs(attrs_to_metadata(x)) == x for representative attrs dicts (eager numpy, dask, GPU, VRT, no-georef, point raster, user-defined CRS WKT).

  2. Rewrite _populate_attrs_from_geo_info as a thin wrapper: attrs.update(metadata_to_attrs(geo_info_to_metadata(geo_info, window=window))). The existing helper signature is preserved so every reader call site keeps working. Tests in test_round_trip_invariants.py and test_nodata_semantics_split_1988.py cover the read paths; they should pass unchanged.

  3. Replace the inline attrs blocks in _backends/vrt.py:275-300 and :697-720 with the new builder. That is the highest-drift surface today, so landing it second clears the bulk of the consistency gap.

  4. Switch the four read-side _set_nodata_attrs callers to the builder method. The four masked= computations stay where they are; the dataclass only changes the assignment shape.

  5. Land attrs_to_metadata on the write side. The three writers gain a md = attrs_to_metadata(data.attrs) call at entry, and the duplicated CRS / nodata / rich-tag resolve blocks collapse to field lookups on md. validate_write_metadata switches to taking md directly.

  6. Once all read and write paths use GeoTIFFMetadata internally, deprecate the attrs parameter on _resolve_nodata_attr, _extract_rich_tags, _should_restore_nan_sentinel, and _set_nodata_attrs. The compatibility shims stay until one minor release after the dataclass lands.

Each step is a separate PR so the diff stays reviewable.

Relation to sibling issues

The dataclass is where three companion proposals on the tracker live:

  • georef_status (the proposal to replace attrs[_NO_GEOREF_KEY] = True with an enum-valued attr) becomes md.has_georef, or md.georef_status: Literal['present', 'absent', 'placeholder'] if the enum form wins. Either way the field is named once, set once by the readers, and consumed once by the writers and _is_no_georef_array.
  • _xrspatial_no_georef: the underscore-prefix attr stays on the public surface for back-compat but is no longer the in-memory signal. Read paths construct md.has_georef = False; the conversion function emits the legacy key for round-trip compatibility.
  • Tightened nodata semantics (item 5 of the parent attrs-contract review): md.nodata and md.masked_nodata are the canonical pair. The alias attrs (nodatavals, _FillValue) are resolved inside attrs_to_metadata and never re-surface on a write.

Landing the dataclass first turns the sibling issues into short follow-ups (one field rename or one builder method each).

Risk

Internal refactor:

  • Public attrs surface does not change. The attrs contract documented in _attrs.py:18-122 continues to hold key-for-key.
  • The round-trip tests in xrspatial/geotiff/tests/test_round_trip_invariants.py and the parity tests in test_round_trip_parity_rasterio_zarr_1961.py enforce the contract on every read/write path; they should pass unmodified at each step of the migration.
  • The four masked_nodata computations stay at their current call sites; the dataclass only changes the assignment target. Issue Bug: attrs['masked_nodata'] reports True when masking was disabled #2092's tests in test_masked_nodata_attr_2092.py and test_nodata_semantics_split_1988.py continue to gate the masking semantics.
  • The two VRT inline attrs blocks (_backends/vrt.py:275-300, :697-720) are the only place where behaviour could observably change, because they currently emit a slightly different field set than _populate_attrs_from_geo_info (no extra_tags, no gdal_metadata, no resolution tags). Routing them through metadata_to_attrs brings them into line. The alignment is what we want, but it is a behaviour change in the sense that a VRT read will start carrying tags it did not before. Gated behind a follow-up so this PR can land as a pure refactor first.

Out of scope

  • Public renames of attrs['crs'], attrs['crs_wkt'], attrs['nodata'], or attrs['masked_nodata'].
  • Bumping _ATTRS_CONTRACT_VERSION to 3.
  • Adding new fields to the attrs contract.

Metadata

Metadata

Assignees

No one assigned

    Labels

    apiAPI design and consistencyenhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions