geotiff: add internal GeoTIFFMetadata dataclass and migrate readers (#2139)#2144
Conversation
…2139) Introduce a frozen GeoTIFFMetadata dataclass with two boundary functions (metadata_to_attrs, attrs_to_metadata) plus a geo_info_to_metadata builder so the four read paths and three writers stop building / parsing the same attrs dict by hand. The public attrs surface is unchanged. Migrated in this PR (reader-first slice from the issue's plan): - _populate_attrs_from_geo_info becomes a thin wrapper that builds a GeoTIFFMetadata and folds metadata_to_attrs back into the caller's dict. Field-for-field equivalent to the prior implementation, so the three non-VRT read paths see no change. - The two VRT inline attrs blocks (eager + chunked) now build a GeoTIFFMetadata and run it through metadata_to_attrs, removing the duplicated 'stamp contract version / set crs / set crs_wkt / set raster_type / mark no-georef / surface vrt_holes' sequences. The VRT path keeps emitting the same key subset it always did (no extra_tags / gdal_metadata / resolution tags); aligning that surface with the non-VRT readers is a separate follow-up. Writers stay on the dict for now to keep the diff small and avoid overlap with the in-flight attrs-contract work on the same files (#2133 / #2135 / #2136). The dataclass already exposes the fields the writers need (with_nodata builder, attrs_to_metadata) so the writer-side migration is the natural next PR. Adds 25 tests in test_geotiff_metadata_2139.py covering: - geo_info_to_metadata field-by-field copy for transform / point / no-georef / resolution-unit / colormap cases. - attrs_to_metadata for nodata aliases, CRS-as-WKT-in-crs-field, and the bool-EPSG guard. - Round-trip stability for representative attrs dicts (eager / point / no-georef / user-WKT / VRT with holes). - with_nodata builder semantics. All 4066 geotiff tests pass.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: geotiff: add internal GeoTIFFMetadata dataclass and migrate readers (#2139)
Blockers
None.
Suggestions
-
attrs_to_metadatasilently dropsattrs['crs']=Truetocrs_epsg=None(_attrs.py:391-398). The writers today raise via_validate_crs_argon a bool. Nothing in this PR callsattrs_to_metadatafrom a writer, so the difference is invisible right now, buttest_crs_arg_validation_1971.pyexpects the writer to reject. When the writer-side migration lands the validator needs to either move ontoattrs_to_metadataor stay in the writer with the parsed record still flagged enough to be rejected. A one-line comment recording the boundary contract ("parse leniently here; validate at the writer") would save the next person five minutes. -
attrs_to_metadatacoercestransformviatuple(transform)(_attrs.py:404) without checking length or numeric type._transform_from_attrdoes that check today. Same shape of problem as the CRS gate: harmless until a writer callsattrs_to_metadata. A note in the docstring about the deferred check is enough.
Nits
-
metadata_to_attrs(_attrs.py:319-322) has a three-state branch on(transform, has_georef):transform is not None and has_georef-> emittransformnot has_georef-> emit_NO_GEOREF_KEYtransform is None and has_georef-> emit nothing
The third state is what the VRT eager path relies on (record has no transform yet; the existing block stamps it later). It's correct, but at first read the third case looks like a missed branch rather than the intended path. A one-line comment ("transform-less but georef'd record: caller stamps transform separately, eager VRT path") would make the intent obvious.
-
The VRT eager builder uses
vrt.crs_wkt or None(_backends/vrt.py:286) to convert an empty string to None. The originalif vrt.crs_wkt:gate did the same thing. Reading the dataclass site without the original nearby, the empty-string normalisation looks accidental. Inline comment or a tiny_nullify_emptyhelper would help. -
test_geotiff_metadata_2139.py:288-294asserts every input key is present in the round-tripped output but doesn't assert the symmetric direction. Full equality would fail for the no-georef case becausemetadata_to_attrsalways stamps_xrspatial_geotiff_contract, but a sibling test asserting "no unexpected keys other than the contract version" would catch future field additions slipping in unannounced. -
_RESOLUTION_UNIT_NAMES(_attrs.py:185) duplicates the inline{1: 'none', 2: 'inch', 3: 'centimeter'}literal that used to live in_populate_attrs_from_geo_info. Worth deriving it from_RESOLUTION_UNIT_IDS({v: k for k, v in _RESOLUTION_UNIT_IDS.items()}) so the two maps can't drift.
What looks good
- Reader-side migration matches the issue's PR-1 plan: dataclass + two boundary functions + thin-wrapper
_populate_attrs_from_geo_info+ VRT inline-block replacement. Writers stay on the dict, which both follows the issue's per-step PR guidance and avoids the open #2133 / #2135 / #2136 work on the same files. - The two VRT inline blocks (
_backends/vrt.py:275,:697) now build aGeoTIFFMetadataand run throughmetadata_to_attrs, removing the duplicated contract-version stamp and the duplicated CRS / raster_type / no-georef setters. with_nodatareturns a new frozen instance and short-circuits onnodata is None, mirroring_set_nodata_attrs's no-op rule so absence still signals "no declared sentinel".- 25 new tests cover the builder, the parser, alias resolution, the no-georef path, the bool-EPSG guard, and round trips across five representative attrs dicts.
- Full geotiff test sweep: 4268 passed, 25 skipped. No backend-parity regression.
Checklist
- Algorithm matches reference/paper -- N/A, pure refactor.
- All implemented backends produce consistent results -- 4268 geotiff tests pass.
- NaN handling is correct -- unchanged; the
_set_nodata_attrscall sites carry the samemasked=computation as before. - Edge cases are covered by tests -- no-georef, point raster, user-defined WKT, VRT holes, bool-EPSG.
- Dask chunk boundaries handled correctly -- N/A, attrs construction is per-DataArray, not per-chunk.
- No premature materialization or unnecessary copies -- the dataclass is frozen and built once per read.
- Benchmark exists or is not needed -- not needed; refactor is hot-path neutral.
- README feature matrix updated -- N/A, no new public function.
- Docstrings present and accurate -- all new functions carry docstrings tying back to issue #2139.
* Document the lenient-parse / writer-validates contract on attrs_to_metadata so the dropped-bool-EPSG and unchecked-tuple transform are obvious as intentional boundary behaviour. * Annotate the three-state branch in metadata_to_attrs so the transform-less-but-georef'd case (eager VRT path) reads as intentional rather than a missed branch. * Comment the empty-string crs_wkt normalisation in the VRT builder. * Derive _RESOLUTION_UNIT_NAMES from _RESOLUTION_UNIT_IDS so the forward and reverse maps cannot drift. * Add test_round_trip_emits_no_unexpected_keys to lock the marshalling step against silent field additions slipping in unannounced. All 4071 geotiff tests pass.
brendancol
left a comment
There was a problem hiding this comment.
Review follow-up
All review items addressed in 5161385:
Suggestions (both fixed):
attrs_to_metadatanow documents the lenient-parse / writer-validates boundary contract in its docstring, calling out the dropped-bool-EPSG and the unchecked-tuple-transform cases explicitly.
Nits (all fixed):
- Three-state branch in
metadata_to_attrshas a block comment explaining thetransform=None, has_georef=Truecase is the eager VRT path's contract, not a missed branch. - VRT eager builder comments why
vrt.crs_wkt or Noneis the right normalisation (empty-string SRS). _RESOLUTION_UNIT_NAMESderives from_RESOLUTION_UNIT_IDSvia dict comprehension so the two maps can't drift.- Added
test_round_trip_emits_no_unexpected_keysto lock the marshalling step. A future field added tometadata_to_attrswithout an attrs-contract update will now fail this test rather than appearing silently on every read.
Final test count: 4071 geotiff tests pass (5 new; was 4066).
# Conflicts: # xrspatial/geotiff/_attrs.py
#2146 (nodata lifecycle) and #2150 (tier surface) had landed cleanly on main without touching _populate_attrs_from_geo_info, but #2125's "drop attrs['crs'] on rotated reads" introduced a _vrt_is_rotated gate on the VRT branches that our pre-merge dataclass-based VRT code did not honour. Resolved by: * _attrs.py: kept the GeoTIFFMetadata shim. The dataclass already produces the same attrs surface as the legacy field-by-field path. * _backends/vrt.py: kept the dataclass build on both VRT branches and threaded _vrt_is_rotated through to drop crs/crs_wkt and set has_georef=False on rotated reads (so metadata_to_attrs stamps the no-georef marker). Verified test_allow_rotated_no_crs_2122, test_geotiff_metadata_2139, test_nodata_lifecycle_attrs_2135, test_supported_features_tiers_2137, test_backend_parity_matrix, and the broader VRT/attrs/nodata suite (1309 tests) all pass.
Summary
Closes #2139.
Adds an internal
GeoTIFFMetadatafrozen dataclass plus two boundary functions (metadata_to_attrs,attrs_to_metadata) and ageo_info_to_metadatabuilder. The four read paths and three writers all built/parsed the sameattrsdict by hand; this PR consolidates the read side onto the typed record. Writers stay on the dict for now to keep the diff small and avoid overlap with the open attrs-contract work on the same files (#2133 / #2135 / #2136).Reader-side migration:
_populate_attrs_from_geo_infobecomes a thin wrapper that builds aGeoTIFFMetadataand foldsmetadata_to_attrsback into the caller's dict. Field-for-field equivalent to the prior implementation, so the three non-VRT read paths see no change.GeoTIFFMetadataand run it throughmetadata_to_attrs, removing the duplicated stamp/set/mark sequences at_backends/vrt.py:275and:697. The VRT path emits the same key set it did before; aligning the VRT field set with the non-VRT readers is the issue's separate follow-up.Public attrs surface is unchanged.
Test plan
test_geotiff_metadata_2139.pycoveringgeo_info_to_metadata,attrs_to_metadata, round trips, and thewith_nodatabuilder.test_attrs_contract_*_1984.py,test_round_trip_invariants.py,test_nodata_semantics_split_1988.py,test_masked_nodata_attr_2092.py,test_vrt_holes_attr_1734.py,test_nodata_attr_aliases_1582.py.