Skip to content

geotiff: add internal GeoTIFFMetadata dataclass and migrate readers (#2139)#2144

Merged
brendancol merged 4 commits into
mainfrom
issue-2139
May 20, 2026
Merged

geotiff: add internal GeoTIFFMetadata dataclass and migrate readers (#2139)#2144
brendancol merged 4 commits into
mainfrom
issue-2139

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Closes #2139.

Adds an internal GeoTIFFMetadata frozen dataclass plus two boundary functions (metadata_to_attrs, attrs_to_metadata) and a geo_info_to_metadata builder. The four read paths and three writers all built/parsed the same attrs dict by hand; this PR consolidates the read side onto the typed record. Writers stay on the dict for now to keep the diff small and avoid overlap with the open attrs-contract work on the same files (#2133 / #2135 / #2136).

Reader-side migration:

  • _populate_attrs_from_geo_info becomes a thin wrapper that builds a GeoTIFFMetadata and folds metadata_to_attrs back into the caller's dict. Field-for-field equivalent to the prior implementation, so the three non-VRT read paths see no change.
  • The two VRT inline attrs blocks (eager + chunked) now build a GeoTIFFMetadata and run it through metadata_to_attrs, removing the duplicated stamp/set/mark sequences at _backends/vrt.py:275 and :697. The VRT path emits the same key set it did before; aligning the VRT field set with the non-VRT readers is the issue's separate follow-up.

Public attrs surface is unchanged.

Test plan

  • 25 new tests in test_geotiff_metadata_2139.py covering geo_info_to_metadata, attrs_to_metadata, round trips, and the with_nodata builder.
  • Existing tests pass: test_attrs_contract_*_1984.py, test_round_trip_invariants.py, test_nodata_semantics_split_1988.py, test_masked_nodata_attr_2092.py, test_vrt_holes_attr_1734.py, test_nodata_attr_aliases_1582.py.
  • Full geotiff sweep: 4066 passed, 25 skipped.

…2139)

Introduce a frozen GeoTIFFMetadata dataclass with two boundary
functions (metadata_to_attrs, attrs_to_metadata) plus a
geo_info_to_metadata builder so the four read paths and three writers
stop building / parsing the same attrs dict by hand. The public attrs
surface is unchanged.

Migrated in this PR (reader-first slice from the issue's plan):

- _populate_attrs_from_geo_info becomes a thin wrapper that builds a
  GeoTIFFMetadata and folds metadata_to_attrs back into the caller's
  dict. Field-for-field equivalent to the prior implementation, so the
  three non-VRT read paths see no change.
- The two VRT inline attrs blocks (eager + chunked) now build a
  GeoTIFFMetadata and run it through metadata_to_attrs, removing the
  duplicated 'stamp contract version / set crs / set crs_wkt / set
  raster_type / mark no-georef / surface vrt_holes' sequences. The
  VRT path keeps emitting the same key subset it always did
  (no extra_tags / gdal_metadata / resolution tags); aligning that
  surface with the non-VRT readers is a separate follow-up.

Writers stay on the dict for now to keep the diff small and avoid
overlap with the in-flight attrs-contract work on the same files
(#2133 / #2135 / #2136). The dataclass already exposes the fields the
writers need (with_nodata builder, attrs_to_metadata) so the
writer-side migration is the natural next PR.

Adds 25 tests in test_geotiff_metadata_2139.py covering:
- geo_info_to_metadata field-by-field copy for transform / point /
  no-georef / resolution-unit / colormap cases.
- attrs_to_metadata for nodata aliases, CRS-as-WKT-in-crs-field, and
  the bool-EPSG guard.
- Round-trip stability for representative attrs dicts (eager / point /
  no-georef / user-WKT / VRT with holes).
- with_nodata builder semantics.

All 4066 geotiff tests pass.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 19, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: geotiff: add internal GeoTIFFMetadata dataclass and migrate readers (#2139)

Blockers

None.

Suggestions

  • attrs_to_metadata silently drops attrs['crs']=True to crs_epsg=None (_attrs.py:391-398). The writers today raise via _validate_crs_arg on a bool. Nothing in this PR calls attrs_to_metadata from a writer, so the difference is invisible right now, but test_crs_arg_validation_1971.py expects the writer to reject. When the writer-side migration lands the validator needs to either move onto attrs_to_metadata or stay in the writer with the parsed record still flagged enough to be rejected. A one-line comment recording the boundary contract ("parse leniently here; validate at the writer") would save the next person five minutes.

  • attrs_to_metadata coerces transform via tuple(transform) (_attrs.py:404) without checking length or numeric type. _transform_from_attr does that check today. Same shape of problem as the CRS gate: harmless until a writer calls attrs_to_metadata. A note in the docstring about the deferred check is enough.

Nits

  • metadata_to_attrs (_attrs.py:319-322) has a three-state branch on (transform, has_georef):

    • transform is not None and has_georef -> emit transform
    • not has_georef -> emit _NO_GEOREF_KEY
    • transform is None and has_georef -> emit nothing
      The third state is what the VRT eager path relies on (record has no transform yet; the existing block stamps it later). It's correct, but at first read the third case looks like a missed branch rather than the intended path. A one-line comment ("transform-less but georef'd record: caller stamps transform separately, eager VRT path") would make the intent obvious.
  • The VRT eager builder uses vrt.crs_wkt or None (_backends/vrt.py:286) to convert an empty string to None. The original if vrt.crs_wkt: gate did the same thing. Reading the dataclass site without the original nearby, the empty-string normalisation looks accidental. Inline comment or a tiny _nullify_empty helper would help.

  • test_geotiff_metadata_2139.py:288-294 asserts every input key is present in the round-tripped output but doesn't assert the symmetric direction. Full equality would fail for the no-georef case because metadata_to_attrs always stamps _xrspatial_geotiff_contract, but a sibling test asserting "no unexpected keys other than the contract version" would catch future field additions slipping in unannounced.

  • _RESOLUTION_UNIT_NAMES (_attrs.py:185) duplicates the inline {1: 'none', 2: 'inch', 3: 'centimeter'} literal that used to live in _populate_attrs_from_geo_info. Worth deriving it from _RESOLUTION_UNIT_IDS ({v: k for k, v in _RESOLUTION_UNIT_IDS.items()}) so the two maps can't drift.

What looks good

  • Reader-side migration matches the issue's PR-1 plan: dataclass + two boundary functions + thin-wrapper _populate_attrs_from_geo_info + VRT inline-block replacement. Writers stay on the dict, which both follows the issue's per-step PR guidance and avoids the open #2133 / #2135 / #2136 work on the same files.
  • The two VRT inline blocks (_backends/vrt.py:275, :697) now build a GeoTIFFMetadata and run through metadata_to_attrs, removing the duplicated contract-version stamp and the duplicated CRS / raster_type / no-georef setters.
  • with_nodata returns a new frozen instance and short-circuits on nodata is None, mirroring _set_nodata_attrs's no-op rule so absence still signals "no declared sentinel".
  • 25 new tests cover the builder, the parser, alias resolution, the no-georef path, the bool-EPSG guard, and round trips across five representative attrs dicts.
  • Full geotiff test sweep: 4268 passed, 25 skipped. No backend-parity regression.

Checklist

  • Algorithm matches reference/paper -- N/A, pure refactor.
  • All implemented backends produce consistent results -- 4268 geotiff tests pass.
  • NaN handling is correct -- unchanged; the _set_nodata_attrs call sites carry the same masked= computation as before.
  • Edge cases are covered by tests -- no-georef, point raster, user-defined WKT, VRT holes, bool-EPSG.
  • Dask chunk boundaries handled correctly -- N/A, attrs construction is per-DataArray, not per-chunk.
  • No premature materialization or unnecessary copies -- the dataclass is frozen and built once per read.
  • Benchmark exists or is not needed -- not needed; refactor is hot-path neutral.
  • README feature matrix updated -- N/A, no new public function.
  • Docstrings present and accurate -- all new functions carry docstrings tying back to issue #2139.

* Document the lenient-parse / writer-validates contract on
  attrs_to_metadata so the dropped-bool-EPSG and unchecked-tuple
  transform are obvious as intentional boundary behaviour.
* Annotate the three-state branch in metadata_to_attrs so the
  transform-less-but-georef'd case (eager VRT path) reads as
  intentional rather than a missed branch.
* Comment the empty-string crs_wkt normalisation in the VRT builder.
* Derive _RESOLUTION_UNIT_NAMES from _RESOLUTION_UNIT_IDS so the
  forward and reverse maps cannot drift.
* Add test_round_trip_emits_no_unexpected_keys to lock the marshalling
  step against silent field additions slipping in unannounced.

All 4071 geotiff tests pass.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review follow-up

All review items addressed in 5161385:

Suggestions (both fixed):

  • attrs_to_metadata now documents the lenient-parse / writer-validates boundary contract in its docstring, calling out the dropped-bool-EPSG and the unchecked-tuple-transform cases explicitly.

Nits (all fixed):

  • Three-state branch in metadata_to_attrs has a block comment explaining the transform=None, has_georef=True case is the eager VRT path's contract, not a missed branch.
  • VRT eager builder comments why vrt.crs_wkt or None is the right normalisation (empty-string SRS).
  • _RESOLUTION_UNIT_NAMES derives from _RESOLUTION_UNIT_IDS via dict comprehension so the two maps can't drift.
  • Added test_round_trip_emits_no_unexpected_keys to lock the marshalling step. A future field added to metadata_to_attrs without an attrs-contract update will now fail this test rather than appearing silently on every read.

Final test count: 4071 geotiff tests pass (5 new; was 4066).

# Conflicts:
#	xrspatial/geotiff/_attrs.py
#2146 (nodata lifecycle) and #2150 (tier surface) had landed cleanly on
main without touching _populate_attrs_from_geo_info, but #2125's
"drop attrs['crs'] on rotated reads" introduced a _vrt_is_rotated gate
on the VRT branches that our pre-merge dataclass-based VRT code did not
honour. Resolved by:

* _attrs.py: kept the GeoTIFFMetadata shim. The dataclass already
  produces the same attrs surface as the legacy field-by-field path.
* _backends/vrt.py: kept the dataclass build on both VRT branches and
  threaded _vrt_is_rotated through to drop crs/crs_wkt and set
  has_georef=False on rotated reads (so metadata_to_attrs stamps the
  no-georef marker).

Verified test_allow_rotated_no_crs_2122, test_geotiff_metadata_2139,
test_nodata_lifecycle_attrs_2135, test_supported_features_tiers_2137,
test_backend_parity_matrix, and the broader VRT/attrs/nodata suite
(1309 tests) all pass.
@brendancol brendancol merged commit 42d66c0 into main May 20, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: introduce internal GeoTIFFMetadata dataclass; convert to/from attrs at API boundary

1 participant