feat: add Forest Data Partnership (FDP) accessor#7
Open
jakenotjay wants to merge 15 commits into
Open
Conversation
Adds an accessor for the Forest Data Partnership commodity-probability COGs hosted at gs://earth-engine-public-requester-pays/forestdatapartnership/. Unlike AEF, FDP does not publish a parquet index — FDPIndex.build() lists the release/commodity/year hierarchy and writes a local GeoParquet, after which queries are served from the cache (same download → load → query shape as AEFIndex). - Tile bbox is derived from the filename (1°×1° EPSG:4326). Empirically the filename encodes the SW corner, not the NW corner as the public docs claim — verified against ModelTiepoint and rasterio bounds. - palm_2023 is filtered out (non-standard flat layout). - Releases, commodities, and years are discovered dynamically so future backfills (2017–2025 has been flagged as upcoming) need no code changes. Refactors AEF index/reader to share cloud helpers via a new aef_loader._cloud module (path parsing, requester-pays GCS store factory, GeoBox extraction), and moves the default cache to ~/.cache/aef-loader/ for both datasets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pixel-scale fallback in get_geobox_from_dataset previously used a literal positive y-step from ModelPixelScaleY, hedged by a TODO and a brief flip_y flag. The GeoTIFF spec (OGC 19-008r4) is unambiguous: ModelPixelScaleY is stored as a positive magnitude with the affine y-step implicitly negative (north-up). Datasets that are genuinely bottom-up — like AEF — carry ModelTransformationTag instead, which encodes the sign explicitly in the matrix. With both paths spec-compliant, AEF (ModelTransformationTag, e=+10.0) and FDP (ModelPixelScaleTag, north-up by spec) are auto-detected from the file tags with no per-dataset configuration. Drop the flip_y flag entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Make _year_range accept a bare string ("2024", "2024-06-01"), avoiding the
confusing 4-element unpack error a user would otherwise hit.
- query(limit=0) now returns an empty list instead of treating zero as
"no limit" — switch from truthy to is-not-None check, matching the other
filters in the same function.
- Replace the tautological test_default_cache_dir (the assertion's RHS
always passed) with two explicit cases covering XDG_CACHE_HOME set and
unset.
- Log skipped non-tile filenames at debug level so unexpected files in a
leaf directory are diagnosable when verbose logging is enabled.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors VirtualTiffReader for the AEF case but trimmed for FDP's much
simpler shape: single CRS (EPSG:4326), single 'probability' band, native
float32, no UTM zone grouping. Always returns a DataTree with one child
per commodity — even for single-commodity input — so callers writing
generic pipelines see a predictable type.
Tile mosaicking uses the same outer-join + per-year temporal concat as the
AEF reader, so missing 1° tiles inside a queried bbox surface as NaN
holes. NaN is the natural fill for FDP's float32 dtype and is stamped on
the output via set_aef_nodata. DataTree children are kept independent —
no cross-commodity coordinate alignment, since coverage differs by
commodity-year.
The obstore/registry bookkeeping is duplicated from VirtualTiffReader
rather than lifted to a shared base; if a third dataset arrives that
shares this surface, that's the moment to factor.
One CRS-specific subtlety: odc-geo's xr_coords returns longitude/latitude
keys for geographic CRSs and only x/y for projected ones. AEF (UTM) hits
the projected path so the AEF reader works as-is, but FDP needs an
explicit dims=("y", "x") to keep the dim names consistent with what
virtual-tiff produces.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xr.combine_by_coords is typed as `Dataset | DataArray`. We only feed it Datasets so the result is always a Dataset, but ty can't see that — assert it locally so time_slices stays narrowly typed as list[Dataset]. Drop a stale `# type: ignore[arg-type]` from the protocol-rejection test that ty now flags as unused. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds tests for the items triaged from the post-PR review as worth landing — net +12 unit tests (70 → 82), all production code unchanged: - Pin the GeoTIFF y-orientation fix at both the affine helper level and through get_geobox_from_dataset (new tests/test_cloud.py). - Synthetic on-disk COG written via rasterio, opened through FDPReader._open_tile against an obstore LocalStore — exercises the full virtual-tiff → xarray → geobox → set_aef_nodata pipeline that unit tests previously bypassed via monkeypatch. Pins north-up orientation end-to-end (ys[0] > ys[-1]) and round-trips tile.as_datetime through _combine_opened_datasets. - Multi-var error path in _open_tile (mocked open_zarr). - Off-pattern files in obs.list output are silently excluded. - obs.list batches are concatenated correctly across multiple yields. - Parametrized _year_range covering ISO strings, mixed tuples, and the inverted-range pass-through. - Repoint AEF test_get_store_caches patch from obstore.store.GCSStore to aef_loader.reader.make_gcs_store so the assert_not_called() check is no longer vacuous after the _cloud refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
virtualizarr.registry now re-exports ObjectStoreRegistry from obspec_utils with a DeprecationWarning. Switch the AEF and FDP readers to the canonical path. The remaining DeprecationWarnings in pytest output come from inside virtualizarr.parsers.dmrpp itself and need an upstream fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pre-existing inconsistencies a code review surfaced when FDP landed
— bring AEF in line:
- query(limit=0) now returns an empty list instead of treating zero as
"no limit". Switch from truthy to is-not-None check, matching the
same fix already applied to FDPIndex.query().
- _get_start_and_end_year accepts a bare string ("2024",
"2024-06-01") and parses the first 4 chars, matching the matching
helper FDPIndex._year_range. Previously a bare string fell through
to start, end = years and raised an unpack error. Type signature
widens accordingly to int | str | DateRange.
Adds three unit tests mirroring the FDP coverage: limit=0,
string-year, and a parametrized normalisation matrix covering ISO
strings, mixed tuples, and the inverted-range pass-through behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
virtual-tiff only attaches model_pixel_scale / model_tiepoint to IFD 0; opening any overview level (IFD > 0) yielded a dataset with no geo-attrs and get_geobox_from_dataset would raise. The slow integration test only covered IFD 0 so the regression slipped past CI. Each FDP tile covers exactly its filename's 1°×1° EPSG:4326 box (verified during initial bucket exploration) and the IFD pixel dimensions are always in the dataset, so the geobox is fully derivable from those two facts at any IFD without consulting the tags. Switch to that derivation; drop the now-unused get_geobox_from_dataset import from this module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rviews The previous fix went too far — it bypassed model_pixel_scale tag reading entirely, even at IFD 0 where the writer's sub-pixel-precision tiepoint is the most authoritative source for georeferencing. Restore the tag path at IFD 0 (matches gdalinfo / rasterio bounds) and only fall back to the tile-bbox derivation when the dataset has no geo-attrs (i.e. when virtual-tiff has dropped them at overview IFDs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the overview-fallback branch in _open_tile that derives the geobox from tile.bbox + ds.sizes when the dataset has no model_pixel_scale attrs. The original slow test only covered IFD 0 — that's how the overview-IFD bug slipped past CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from review #671: - Extract has_geo_attrs into _cloud so the FDP reader's "do we have GeoTIFF tags?" check uses the same any-data-var iteration that get_geobox_from_dataset uses internally. Previously the reader looked only at ds["probability"].attrs — benign today because we assert single-var, but the predicates were not aligned and could drift. - Add a direct unit test for _geobox_from_tile so a regression in the overview-IFD fallback is detectable without needing GCS. The slow integration test exercises the same path end-to-end at IFD 2 but is too coarse for catching small affine bugs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The same int|str|DateRange → (int, int) helper had landed on both AEFIndex._get_start_and_end_year and FDPIndex._year_range with identical behaviour and identical parametrized tests. Lift it to aef_loader._cloud.normalize_year_range so the two callers stay in sync, and consolidate the parametrization into tests/test_cloud.py. Also fixes a stale "default: /tmp" reference in AEFIndex.__init__ that survived the cache-dir default change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rker skip Two small follow-ups from the post-merge review: - assign_coords(xr_coords(geobox)) instead of unpacking to bare .values arrays. xr_coords already returns a dict[str, DataArray]; passing it through preserves the coord-DataArray attrs (axis, standard_name, the spatial_ref CRS coord) that odc-geo attaches. Same change in both readers. - Add a comment to the size==0 guard in FDPIndex._list_leaf_tiles explaining it's the GCS pseudo-directory placeholder, not a size-based filter on real tiles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace inline split/conditional in AEFIndex.query() with pyproj.CRS.from_user_input(...).to_epsg(); add pyproj as a direct dep (was already transitive via geopandas/rasterio/odc-geo). - Drop banner separators from tests/test_fdp_reader.py and restating-the-code comments across reader.py / index.py / fdp/index.py. Kept comments that explain why. - Replace org-specific GCP_PROJECT example with a generic placeholder. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a sibling accessor for the Forest Data Partnership commodity-probability COGs hosted at
gs://earth-engine-public-requester-pays/forestdatapartnership/. New public API:FDPIndex— builds a local GeoParquet catalogue by listing the bucket (build()→load()→query()), since FDP doesn't publish a parquet index. Discovers releases, commodities, and years dynamically; tile bboxes derived from the filename (1°×1° EPSG:4326).FDPReader.open(tiles)— always returns aDataTreekeyed by commodity. Mosaicks tiles within a year via outer-join (NaN-filled holes), concatenates across years ontime. Single CRS so no zone-grouping or reprojection needed.FDPTileInfo,Commodityliteral,COMMODITIEStuple — exported alongside the AEF types.Refactors the AEF index/reader to share cloud helpers via a new
aef_loader/_cloud.py(path parsing, requester-pays GCS store factory, GeoBox extraction). Default cache moves from/tmpto~/.cache/aef-loader/for both AEF and FDP — minor behaviour change worth flagging.Two correctness fixes along the way:
TODO: validateand used positiveModelPixelScaleYliterally. Per the GeoTIFF spec (OGC 19-008r4), positiveModelPixelScaleYimplies a negative affine y-step (north-up). AEF carriesModelTransformationTagwith the sign explicit; FDP usesModelPixelScaleTagand is north-up under the spec. Both now auto-detect with no per-dataset config.xr_coordsdim naming.odc-geo'sxr_coordsreturnslongitude/latitudefor geographic CRSs andx/yfor projected. AEF (UTM) hits thex/ypath; FDP (EPSG:4326) needs explicitdims=("y", "x")to match the dim names that virtual-tiff produces.Test plan
uv run pytest -m unit)GCP_PROJECT=… uv run pytest -m slow)FDPIndex.build()against the real bucket: 4.6s for 43,868 tiles → 853 KB parquetFDPIndex.query()against the cached parquet: sub-millisecondFDPReader.open()on two adjacent Cameroon coffee/2024 tilesNotable findings, recorded in commit messages
ModelTiepointTagand rasterio bounds.2025brelease with years2020and2024(despite docs claiming "2017 to latest"); discovery is dynamic so backfills will be picked up automatically.Behaviour changes
AEFIndexdefaultcache_dirchanges from/tmpto~/.cache/aef-loader/. Existing CI runners or ephemeral containers that relied on/tmpfor fresh cache between runs will now persist the parquet across runs. No code change required for callers; just disk-usage worth noting.Follow-ups (not in this PR)
README.mdandmkdocs.yml/docs/don't mention FDP at all yet. The cache-dir default change is also undocumented.AEFIndex.query()still usesif limit:(solimit=0returns everything) and doesn't accept a bare-string year. Pre-existing pattern; small drive-by for symmetry.virtualizarr.registrydeprecation: both readers importObjectStoreRegistryfrom a path that emitsDeprecationWarning: Importing ObjectStoreRegistry from VirtualiZarr is deprecated. Please use 'from obspec_utils.registry import ObjectStoreRegistry instead.. Pre-existing in AEF, propagated into FDP. ~2-line fix._open_tile: current FDP unit testsmonkeypatch_open_tile, which means a real bug in the IO path (KeyError: 'x', fixed in this PR) was caught only by the slow integration test. A small synthetic-COG fixture written via rasterio would close this gap with no GCS dependency.🤖 Generated with Claude Code