You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Every read path in the geotiff module funnels through _populate_attrs_from_geo_info, and every writer through _resolve_nodata_attr / _extract_rich_tags. The shared surface is still a plain dict[str, Any]. Each backend builds and reads that dict by hand, and we have a tail of drift bugs plus a tail of helpers that exist only to paper over the dict shape. A small internal dataclass would let readers, writers, validators, and the VRT path pass a typed record around, and only marshal to/from attrs at the public API boundary.
Concrete drift cases from the current tree:
Two read paths bypass _populate_attrs_from_geo_info and build the attrs dict inline. xrspatial/geotiff/_backends/vrt.py:275 and xrspatial/geotiff/_backends/vrt.py:697 open attrs = {'_xrspatial_geotiff_contract': _ATTRS_CONTRACT_VERSION} and then set attrs['crs'], attrs['crs_wkt'], attrs['transform'], attrs['raster_type'], attrs[_NO_GEOREF_KEY], and attrs['vrt_holes'] directly. The eager, dask, and GPU paths funnel through the helper instead (__init__.py:633, _backends/dask.py:342, _backends/gpu.py:437, :816, :1369). Two of the four read paths (and any future fifth) must remember to stamp the contract version, set crs only when an EPSG resolves, set crs_wkt whenever any CRS is known, set raster_type='point' only on point rasters, and apply the no-georef marker on missing transform tags. Nothing in the type system enforces that list.
The reader's _populate_attrs_from_geo_info emits attrs['crs'] only when geo_info.crs_epsg is not None and attrs['crs_wkt'] only when geo_info.crs_wkt is not None (_attrs.py:364-367). The two VRT read paths run epsg = _wkt_to_epsg(vrt.crs_wkt) inline and only emit attrs['crs'] when that returns non-None (_backends/vrt.py:285-288, :715-718). Three writers re-implement the resolve-from-attrs step from scratch (_writers/eager.py:574-592, _writers/eager.py:913-931, _writers/gpu.py:438-456), each duplicating the "attrs['crs'] may be int EPSG or WKT string, attrs['crs_wkt'] only carries WKT" branch tree. The duplicated logic has already drifted: _writers/eager.py:585 validates crs_attr via _validate_crs_arg before coercing to int, while _backends/vrt.py:287 does not validate at all (the VRT path constructs the int from _wkt_to_epsg and never sees a user-supplied attrs['crs']). Any future tightening of that validator has to land in three places.
The nodata / masked_nodata pair is split across _set_nodata_attrs and a per-backend masked= computation. The eager path sets masked=(mask_nodata and arr.dtype.kind == 'f') against the final array (__init__.py:697-700). The dask path sets masked=(mask_nodata and target_dtype.kind == 'f') against the declared graph dtype (_backends/dask.py:350-353). The VRT eager path captures a pre_cast_dtype before the optional user cast and uses masked=(pre_cast_dtype.kind == 'f') (_backends/vrt.py:341, :366-369). The VRT chunked path uses masked=(declared_dtype.kind == 'f') for the same reason (_backends/vrt.py:743-746). All four were introduced independently under issue Bug: attrs['masked_nodata'] reports True when masking was disabled #2092 because the pre-fix code inferred masked from the final dtype, which lied when mask_nodata=False left literal sentinel values in a float buffer. A typed record would put the "masked is the actual mask-decision the read path made" invariant in one place rather than four.
Pass-through attrs ride on the same dict. attrs['vrt_holes'] exists only on the VRT path (_backends/vrt.py:300, :774). attrs['extra_tags'] and attrs['colormap'] exist on every path except VRT (_attrs.py:412-413, :438). attrs['gdal_metadata_xml'] and attrs['gdal_metadata'] exist on every path except VRT (_attrs.py:407-410). A typed record makes the per-backend coverage explicit and turns silent omissions into type-checker errors.
Writers resolve nodata from attrs via _resolve_nodata_attr (_attrs.py:442-513), which accepts three aliases (nodata, nodatavals, _FillValue). The companion validator _validate_attrs_nodata_consistency (_validation.py:636) refuses writes where attrs['nodata'] and attrs['nodatavals'] disagree. The same logic is duplicated by hand in three writers via if nodata is None: nodata = _resolve_nodata_attr(data.attrs) (_writers/eager.py:593-594, _writers/eager.py:932-941, _writers/gpu.py:457-458). A typed record built once at the API boundary lets those three writers take a GeoTIFFMetadata parameter and drop the resolve step.
The keys map 1:1 onto the existing attrs contract in _attrs.py:18-122, so external behaviour is unchanged. Field notes:
transform, crs_epsg, crs_wkt, raster_type, and has_georef are the spatial reference fields. has_georef=False is the in-memory form of attrs[_NO_GEOREF_KEY] = True (_attrs.py:394); making it a bool means the writer's "did this come from a real georef" check is md.has_georef instead of _is_no_georef_array(da).
nodata and masked_nodata follow the issue geotiff: split declared nodata from masked-data state in DataArray attrs #1988 split. nodata is the declared file sentinel. masked_nodata is True iff the in-memory array went through the sentinel-to-NaN step. masked_nodata is None (not False) when no sentinel was declared, mirroring the current "absence of attrs['masked_nodata'] means no declared sentinel" rule (_attrs.py:248-278).
extra_tags, image_description, extra_samples, and colormap are the pass-through tags that _extract_rich_tags folds back into the writer kwargs (_attrs.py:573-621). Keeping them on the dataclass means _merge_friendly_extra_tags can take a GeoTIFFMetadata and return one.
gdal_metadata and gdal_metadata_xml stay as separate fields because _extract_rich_tags already prefers the XML form over the dict form (_attrs.py:599-604); merging them would lose that precedence.
vrt_holes is VRT-only today and stays VRT-only on the dataclass.
contract_version lives on the record so a future v3 reader can produce a v3 record and the conversion layer can decide what to stamp.
Conversion functions
Two functions at the API boundary, both in _attrs.py:
defmetadata_to_attrs(md: GeoTIFFMetadata) ->dict[str, Any]:
"""Build the public attrs dict from a metadata record. Mirrors the current _populate_attrs_from_geo_info field-by-field."""defattrs_to_metadata(attrs: dict[str, Any]) ->GeoTIFFMetadata:
"""Parse a (possibly user-supplied) attrs dict into a metadata record. Honours the alias resolution _resolve_nodata_attr already implements."""
metadata_to_attrs replaces the seven manual attrs[...] = ... sites in _populate_attrs_from_geo_info (_attrs.py:362-439) and the two inline blocks at _backends/vrt.py:275-300 and :697-720.
attrs_to_metadata replaces the three duplicated "read crs / crs_wkt / nodata off attrs" blocks in the writers (_writers/eager.py:574-611, :913-941, _writers/gpu.py:438-460).
Internal call sites
Functions that stop touching attrs directly and take/return GeoTIFFMetadata:
Read side:
_populate_attrs_from_geo_info(attrs, geo_info, *, window=None) becomes geo_info_to_metadata(geo_info, *, window=None) -> GeoTIFFMetadata. Callers in __init__.py:633, _backends/dask.py:342, _backends/gpu.py:437, :816, and :1369 switch to the new function and call metadata_to_attrs(md) immediately before constructing the DataArray.
_set_nodata_attrs(attrs, nodata, *, masked) becomes a builder method on GeoTIFFMetadata, e.g. md = md.with_nodata(nodata, masked=...). Callers in __init__.py:697, _backends/dask.py:350, _backends/gpu.py:453, :817, _backends/vrt.py:366, and :743 switch to the builder.
The two VRT inline blocks (_backends/vrt.py:275-300, :697-720) build a GeoTIFFMetadata directly from vrt.crs_wkt / vrt.geo_transform / vrt.bands / vrt.holes and call metadata_to_attrs at the DataArray-construction step.
Write side:
_resolve_nodata_attr(attrs) becomes a no-op when the writer takes a GeoTIFFMetadata: the field is already resolved.
_extract_rich_tags(attrs) becomes _extract_rich_tags(md). The three writers (_writers/eager.py:605, :932-941 via _resolve_nodata_attr, _writers/gpu.py:458) call md = attrs_to_metadata(data.attrs) once at entry and read fields off md instead of re-resolving from data.attrs.
_should_restore_nan_sentinel(attrs) becomes a one-liner on the record (md.masked_nodata is not False).
Validators:
validate_write_metadata (_validation.py) already takes a dict literal of crs_kwarg, attrs_crs, attrs_crs_wkt, nodata_kwarg, attrs_nodata, attrs_nodatavals, coord_y, coord_x. That dict is built in three call sites (_writers/eager.py:298-307, :856-865, _writers/gpu.py:318-327); they all collapse to validate_write_metadata(md, crs_kwarg=crs, nodata_kwarg=nodata, coord_y=..., coord_x=...).
Migration plan
Internal refactor; the public attrs surface does not change. Each step is independently testable:
Land GeoTIFFMetadata and the two conversion functions in _attrs.py. Add round-trip tests asserting metadata_to_attrs(attrs_to_metadata(x)) == x for representative attrs dicts (eager numpy, dask, GPU, VRT, no-georef, point raster, user-defined CRS WKT).
Rewrite _populate_attrs_from_geo_info as a thin wrapper: attrs.update(metadata_to_attrs(geo_info_to_metadata(geo_info, window=window))). The existing helper signature is preserved so every reader call site keeps working. Tests in test_round_trip_invariants.py and test_nodata_semantics_split_1988.py cover the read paths; they should pass unchanged.
Replace the inline attrs blocks in _backends/vrt.py:275-300 and :697-720 with the new builder. That is the highest-drift surface today, so landing it second clears the bulk of the consistency gap.
Switch the four read-side _set_nodata_attrs callers to the builder method. The four masked= computations stay where they are; the dataclass only changes the assignment shape.
Land attrs_to_metadata on the write side. The three writers gain a md = attrs_to_metadata(data.attrs) call at entry, and the duplicated CRS / nodata / rich-tag resolve blocks collapse to field lookups on md. validate_write_metadata switches to taking md directly.
Once all read and write paths use GeoTIFFMetadata internally, deprecate the attrs parameter on _resolve_nodata_attr, _extract_rich_tags, _should_restore_nan_sentinel, and _set_nodata_attrs. The compatibility shims stay until one minor release after the dataclass lands.
Each step is a separate PR so the diff stays reviewable.
Relation to sibling issues
The dataclass is where three companion proposals on the tracker live:
georef_status (the proposal to replace attrs[_NO_GEOREF_KEY] = True with an enum-valued attr) becomes md.has_georef, or md.georef_status: Literal['present', 'absent', 'placeholder'] if the enum form wins. Either way the field is named once, set once by the readers, and consumed once by the writers and _is_no_georef_array.
_xrspatial_no_georef: the underscore-prefix attr stays on the public surface for back-compat but is no longer the in-memory signal. Read paths construct md.has_georef = False; the conversion function emits the legacy key for round-trip compatibility.
Tightened nodata semantics (item 5 of the parent attrs-contract review): md.nodata and md.masked_nodata are the canonical pair. The alias attrs (nodatavals, _FillValue) are resolved inside attrs_to_metadata and never re-surface on a write.
Landing the dataclass first turns the sibling issues into short follow-ups (one field rename or one builder method each).
Risk
Internal refactor:
Public attrs surface does not change. The attrs contract documented in _attrs.py:18-122 continues to hold key-for-key.
The round-trip tests in xrspatial/geotiff/tests/test_round_trip_invariants.py and the parity tests in test_round_trip_parity_rasterio_zarr_1961.py enforce the contract on every read/write path; they should pass unmodified at each step of the migration.
The four masked_nodata computations stay at their current call sites; the dataclass only changes the assignment target. Issue Bug: attrs['masked_nodata'] reports True when masking was disabled #2092's tests in test_masked_nodata_attr_2092.py and test_nodata_semantics_split_1988.py continue to gate the masking semantics.
The two VRT inline attrs blocks (_backends/vrt.py:275-300, :697-720) are the only place where behaviour could observably change, because they currently emit a slightly different field set than _populate_attrs_from_geo_info (no extra_tags, no gdal_metadata, no resolution tags). Routing them through metadata_to_attrs brings them into line. The alignment is what we want, but it is a behaviour change in the sense that a VRT read will start carrying tags it did not before. Gated behind a follow-up so this PR can land as a pure refactor first.
Out of scope
Public renames of attrs['crs'], attrs['crs_wkt'], attrs['nodata'], or attrs['masked_nodata'].
Problem
Every read path in the geotiff module funnels through
_populate_attrs_from_geo_info, and every writer through_resolve_nodata_attr/_extract_rich_tags. The shared surface is still a plaindict[str, Any]. Each backend builds and reads that dict by hand, and we have a tail of drift bugs plus a tail of helpers that exist only to paper over the dict shape. A small internal dataclass would let readers, writers, validators, and the VRT path pass a typed record around, and only marshal to/fromattrsat the public API boundary.Concrete drift cases from the current tree:
Two read paths bypass
_populate_attrs_from_geo_infoand build the attrs dict inline.xrspatial/geotiff/_backends/vrt.py:275andxrspatial/geotiff/_backends/vrt.py:697openattrs = {'_xrspatial_geotiff_contract': _ATTRS_CONTRACT_VERSION}and then setattrs['crs'],attrs['crs_wkt'],attrs['transform'],attrs['raster_type'],attrs[_NO_GEOREF_KEY], andattrs['vrt_holes']directly. The eager, dask, and GPU paths funnel through the helper instead (__init__.py:633,_backends/dask.py:342,_backends/gpu.py:437,:816,:1369). Two of the four read paths (and any future fifth) must remember to stamp the contract version, setcrsonly when an EPSG resolves, setcrs_wktwhenever any CRS is known, setraster_type='point'only on point rasters, and apply the no-georef marker on missing transform tags. Nothing in the type system enforces that list.The reader's
_populate_attrs_from_geo_infoemitsattrs['crs']only whengeo_info.crs_epsg is not Noneandattrs['crs_wkt']only whengeo_info.crs_wkt is not None(_attrs.py:364-367). The two VRT read paths runepsg = _wkt_to_epsg(vrt.crs_wkt)inline and only emitattrs['crs']when that returns non-None (_backends/vrt.py:285-288,:715-718). Three writers re-implement the resolve-from-attrs step from scratch (_writers/eager.py:574-592,_writers/eager.py:913-931,_writers/gpu.py:438-456), each duplicating the "attrs['crs'] may be int EPSG or WKT string, attrs['crs_wkt'] only carries WKT" branch tree. The duplicated logic has already drifted:_writers/eager.py:585validatescrs_attrvia_validate_crs_argbefore coercing to int, while_backends/vrt.py:287does not validate at all (the VRT path constructs the int from_wkt_to_epsgand never sees a user-suppliedattrs['crs']). Any future tightening of that validator has to land in three places.The
nodata/masked_nodatapair is split across_set_nodata_attrsand a per-backendmasked=computation. The eager path setsmasked=(mask_nodata and arr.dtype.kind == 'f')against the final array (__init__.py:697-700). The dask path setsmasked=(mask_nodata and target_dtype.kind == 'f')against the declared graph dtype (_backends/dask.py:350-353). The VRT eager path captures apre_cast_dtypebefore the optional user cast and usesmasked=(pre_cast_dtype.kind == 'f')(_backends/vrt.py:341,:366-369). The VRT chunked path usesmasked=(declared_dtype.kind == 'f')for the same reason (_backends/vrt.py:743-746). All four were introduced independently under issue Bug: attrs['masked_nodata'] reports True when masking was disabled #2092 because the pre-fix code inferredmaskedfrom the final dtype, which lied whenmask_nodata=Falseleft literal sentinel values in a float buffer. A typed record would put the "masked is the actual mask-decision the read path made" invariant in one place rather than four.Pass-through attrs ride on the same dict.
attrs['vrt_holes']exists only on the VRT path (_backends/vrt.py:300,:774).attrs['extra_tags']andattrs['colormap']exist on every path except VRT (_attrs.py:412-413,:438).attrs['gdal_metadata_xml']andattrs['gdal_metadata']exist on every path except VRT (_attrs.py:407-410). A typed record makes the per-backend coverage explicit and turns silent omissions into type-checker errors.Writers resolve
nodatafrom attrs via_resolve_nodata_attr(_attrs.py:442-513), which accepts three aliases (nodata,nodatavals,_FillValue). The companion validator_validate_attrs_nodata_consistency(_validation.py:636) refuses writes whereattrs['nodata']andattrs['nodatavals']disagree. The same logic is duplicated by hand in three writers viaif nodata is None: nodata = _resolve_nodata_attr(data.attrs)(_writers/eager.py:593-594,_writers/eager.py:932-941,_writers/gpu.py:457-458). A typed record built once at the API boundary lets those three writers take aGeoTIFFMetadataparameter and drop the resolve step.The dataclass shape
The keys map 1:1 onto the existing attrs contract in
_attrs.py:18-122, so external behaviour is unchanged. Field notes:transform,crs_epsg,crs_wkt,raster_type, andhas_georefare the spatial reference fields.has_georef=Falseis the in-memory form ofattrs[_NO_GEOREF_KEY] = True(_attrs.py:394); making it a bool means the writer's "did this come from a real georef" check ismd.has_georefinstead of_is_no_georef_array(da).nodataandmasked_nodatafollow the issue geotiff: split declared nodata from masked-data state in DataArray attrs #1988 split.nodatais the declared file sentinel.masked_nodataisTrueiff the in-memory array went through the sentinel-to-NaN step.masked_nodataisNone(notFalse) when no sentinel was declared, mirroring the current "absence ofattrs['masked_nodata']means no declared sentinel" rule (_attrs.py:248-278).extra_tags,image_description,extra_samples, andcolormapare the pass-through tags that_extract_rich_tagsfolds back into the writer kwargs (_attrs.py:573-621). Keeping them on the dataclass means_merge_friendly_extra_tagscan take aGeoTIFFMetadataand return one.gdal_metadataandgdal_metadata_xmlstay as separate fields because_extract_rich_tagsalready prefers the XML form over the dict form (_attrs.py:599-604); merging them would lose that precedence.vrt_holesis VRT-only today and stays VRT-only on the dataclass.contract_versionlives on the record so a future v3 reader can produce a v3 record and the conversion layer can decide what to stamp.Conversion functions
Two functions at the API boundary, both in
_attrs.py:metadata_to_attrsreplaces the seven manualattrs[...] = ...sites in_populate_attrs_from_geo_info(_attrs.py:362-439) and the two inline blocks at_backends/vrt.py:275-300and:697-720.attrs_to_metadatareplaces the three duplicated "read crs / crs_wkt / nodata off attrs" blocks in the writers (_writers/eager.py:574-611,:913-941,_writers/gpu.py:438-460).Internal call sites
Functions that stop touching
attrsdirectly and take/returnGeoTIFFMetadata:Read side:
_populate_attrs_from_geo_info(attrs, geo_info, *, window=None)becomesgeo_info_to_metadata(geo_info, *, window=None) -> GeoTIFFMetadata. Callers in__init__.py:633,_backends/dask.py:342,_backends/gpu.py:437,:816, and:1369switch to the new function and callmetadata_to_attrs(md)immediately before constructing the DataArray._set_nodata_attrs(attrs, nodata, *, masked)becomes a builder method onGeoTIFFMetadata, e.g.md = md.with_nodata(nodata, masked=...). Callers in__init__.py:697,_backends/dask.py:350,_backends/gpu.py:453,:817,_backends/vrt.py:366, and:743switch to the builder._backends/vrt.py:275-300,:697-720) build aGeoTIFFMetadatadirectly fromvrt.crs_wkt/vrt.geo_transform/vrt.bands/vrt.holesand callmetadata_to_attrsat the DataArray-construction step.Write side:
_resolve_nodata_attr(attrs)becomes a no-op when the writer takes aGeoTIFFMetadata: the field is already resolved._extract_rich_tags(attrs)becomes_extract_rich_tags(md). The three writers (_writers/eager.py:605,:932-941via_resolve_nodata_attr,_writers/gpu.py:458) callmd = attrs_to_metadata(data.attrs)once at entry and read fields offmdinstead of re-resolving fromdata.attrs._should_restore_nan_sentinel(attrs)becomes a one-liner on the record (md.masked_nodata is not False).Validators:
validate_write_metadata(_validation.py) already takes a dict literal ofcrs_kwarg,attrs_crs,attrs_crs_wkt,nodata_kwarg,attrs_nodata,attrs_nodatavals,coord_y,coord_x. That dict is built in three call sites (_writers/eager.py:298-307,:856-865,_writers/gpu.py:318-327); they all collapse tovalidate_write_metadata(md, crs_kwarg=crs, nodata_kwarg=nodata, coord_y=..., coord_x=...).Migration plan
Internal refactor; the public attrs surface does not change. Each step is independently testable:
Land
GeoTIFFMetadataand the two conversion functions in_attrs.py. Add round-trip tests assertingmetadata_to_attrs(attrs_to_metadata(x)) == xfor representative attrs dicts (eager numpy, dask, GPU, VRT, no-georef, point raster, user-defined CRS WKT).Rewrite
_populate_attrs_from_geo_infoas a thin wrapper:attrs.update(metadata_to_attrs(geo_info_to_metadata(geo_info, window=window))). The existing helper signature is preserved so every reader call site keeps working. Tests intest_round_trip_invariants.pyandtest_nodata_semantics_split_1988.pycover the read paths; they should pass unchanged.Replace the inline attrs blocks in
_backends/vrt.py:275-300and:697-720with the new builder. That is the highest-drift surface today, so landing it second clears the bulk of the consistency gap.Switch the four read-side
_set_nodata_attrscallers to the builder method. The fourmasked=computations stay where they are; the dataclass only changes the assignment shape.Land
attrs_to_metadataon the write side. The three writers gain amd = attrs_to_metadata(data.attrs)call at entry, and the duplicated CRS / nodata / rich-tag resolve blocks collapse to field lookups onmd.validate_write_metadataswitches to takingmddirectly.Once all read and write paths use
GeoTIFFMetadatainternally, deprecate theattrsparameter on_resolve_nodata_attr,_extract_rich_tags,_should_restore_nan_sentinel, and_set_nodata_attrs. The compatibility shims stay until one minor release after the dataclass lands.Each step is a separate PR so the diff stays reviewable.
Relation to sibling issues
The dataclass is where three companion proposals on the tracker live:
georef_status(the proposal to replaceattrs[_NO_GEOREF_KEY] = Truewith an enum-valued attr) becomesmd.has_georef, ormd.georef_status: Literal['present', 'absent', 'placeholder']if the enum form wins. Either way the field is named once, set once by the readers, and consumed once by the writers and_is_no_georef_array._xrspatial_no_georef: the underscore-prefix attr stays on the public surface for back-compat but is no longer the in-memory signal. Read paths constructmd.has_georef = False; the conversion function emits the legacy key for round-trip compatibility.md.nodataandmd.masked_nodataare the canonical pair. The alias attrs (nodatavals,_FillValue) are resolved insideattrs_to_metadataand never re-surface on a write.Landing the dataclass first turns the sibling issues into short follow-ups (one field rename or one builder method each).
Risk
Internal refactor:
_attrs.py:18-122continues to hold key-for-key.xrspatial/geotiff/tests/test_round_trip_invariants.pyand the parity tests intest_round_trip_parity_rasterio_zarr_1961.pyenforce the contract on every read/write path; they should pass unmodified at each step of the migration.masked_nodatacomputations stay at their current call sites; the dataclass only changes the assignment target. Issue Bug: attrs['masked_nodata'] reports True when masking was disabled #2092's tests intest_masked_nodata_attr_2092.pyandtest_nodata_semantics_split_1988.pycontinue to gate the masking semantics._backends/vrt.py:275-300,:697-720) are the only place where behaviour could observably change, because they currently emit a slightly different field set than_populate_attrs_from_geo_info(noextra_tags, nogdal_metadata, no resolution tags). Routing them throughmetadata_to_attrsbrings them into line. The alignment is what we want, but it is a behaviour change in the sense that a VRT read will start carrying tags it did not before. Gated behind a follow-up so this PR can land as a pure refactor first.Out of scope
attrs['crs'],attrs['crs_wkt'],attrs['nodata'], orattrs['masked_nodata']._ATTRS_CONTRACT_VERSIONto 3.