Skip to content

feat: add Forest Data Partnership (FDP) accessor#7

Open
jakenotjay wants to merge 15 commits into
mainfrom
jake/fdp-datasets
Open

feat: add Forest Data Partnership (FDP) accessor#7
jakenotjay wants to merge 15 commits into
mainfrom
jake/fdp-datasets

Conversation

@jakenotjay
Copy link
Copy Markdown
Owner

@jakenotjay jakenotjay commented May 6, 2026

Summary

Adds a sibling accessor for the Forest Data Partnership commodity-probability COGs hosted at gs://earth-engine-public-requester-pays/forestdatapartnership/. New public API:

  • FDPIndex — builds a local GeoParquet catalogue by listing the bucket (build()load()query()), since FDP doesn't publish a parquet index. Discovers releases, commodities, and years dynamically; tile bboxes derived from the filename (1°×1° EPSG:4326).
  • FDPReader.open(tiles) — always returns a DataTree keyed by commodity. Mosaicks tiles within a year via outer-join (NaN-filled holes), concatenates across years on time. Single CRS so no zone-grouping or reprojection needed.
  • FDPTileInfo, Commodity literal, COMMODITIES tuple — exported alongside the AEF types.

Refactors the AEF index/reader to share cloud helpers via a new aef_loader/_cloud.py (path parsing, requester-pays GCS store factory, GeoBox extraction). Default cache moves from /tmp to ~/.cache/aef-loader/ for both AEF and FDP — minor behaviour change worth flagging.

Two correctness fixes along the way:

  • GeoTIFF y-orientation auto-detection. The previous AEF helper hedged with a TODO: validate and used positive ModelPixelScaleY literally. Per the GeoTIFF spec (OGC 19-008r4), positive ModelPixelScaleY implies a negative affine y-step (north-up). AEF carries ModelTransformationTag with the sign explicit; FDP uses ModelPixelScaleTag and is north-up under the spec. Both now auto-detect with no per-dataset config.
  • xr_coords dim naming. odc-geo's xr_coords returns longitude/latitude for geographic CRSs and x/y for projected. AEF (UTM) hits the x/y path; FDP (EPSG:4326) needs explicit dims=("y", "x") to match the dim names that virtual-tiff produces.

Test plan

  • 70 unit tests pass (uv run pytest -m unit)
  • Slow integration test passes against real GCS (GCP_PROJECT=… uv run pytest -m slow)
  • FDPIndex.build() against the real bucket: 4.6s for 43,868 tiles → 853 KB parquet
  • FDPIndex.query() against the cached parquet: sub-millisecond
  • End-to-end FDPReader.open() on two adjacent Cameroon coffee/2024 tiles
  • Lint + format clean (ruff)

Notable findings, recorded in commit messages

  • FDP COG filenames encode the SW corner, not the NW corner as the public Google docs claim. Verified across multiple tiles via ModelTiepointTag and rasterio bounds.
  • Bucket has only 2025b release with years 2020 and 2024 (despite docs claiming "2017 to latest"); discovery is dynamic so backfills will be picked up automatically.
  • Coverage is not a clean Cartesian product — coffee defines the full 5,516-tile envelope; rubber/2024 has 155 holes vs that envelope. Reader handles this via outer-join NaN fill.

Behaviour changes

  • AEFIndex default cache_dir changes from /tmp to ~/.cache/aef-loader/. Existing CI runners or ephemeral containers that relied on /tmp for fresh cache between runs will now persist the parquet across runs. No code change required for callers; just disk-usage worth noting.

Follow-ups (not in this PR)

  • Docs: README.md and mkdocs.yml/docs/ don't mention FDP at all yet. The cache-dir default change is also undocumented.
  • Align AEF with FDP review fixes: AEFIndex.query() still uses if limit: (so limit=0 returns everything) and doesn't accept a bare-string year. Pre-existing pattern; small drive-by for symmetry.
  • virtualizarr.registry deprecation: both readers import ObjectStoreRegistry from a path that emits DeprecationWarning: Importing ObjectStoreRegistry from VirtualiZarr is deprecated. Please use 'from obspec_utils.registry import ObjectStoreRegistry instead.. Pre-existing in AEF, propagated into FDP. ~2-line fix.
  • Boundary-bug-resistant unit tests for _open_tile: current FDP unit tests monkeypatch _open_tile, which means a real bug in the IO path (KeyError: 'x', fixed in this PR) was caught only by the slow integration test. A small synthetic-COG fixture written via rasterio would close this gap with no GCS dependency.

🤖 Generated with Claude Code

jakenotjay and others added 15 commits May 5, 2026 15:17
Adds an accessor for the Forest Data Partnership commodity-probability COGs
hosted at gs://earth-engine-public-requester-pays/forestdatapartnership/.
Unlike AEF, FDP does not publish a parquet index — FDPIndex.build() lists the
release/commodity/year hierarchy and writes a local GeoParquet, after which
queries are served from the cache (same download → load → query shape as
AEFIndex).

- Tile bbox is derived from the filename (1°×1° EPSG:4326). Empirically the
  filename encodes the SW corner, not the NW corner as the public docs claim
  — verified against ModelTiepoint and rasterio bounds.
- palm_2023 is filtered out (non-standard flat layout).
- Releases, commodities, and years are discovered dynamically so future
  backfills (2017–2025 has been flagged as upcoming) need no code changes.

Refactors AEF index/reader to share cloud helpers via a new aef_loader._cloud
module (path parsing, requester-pays GCS store factory, GeoBox extraction),
and moves the default cache to ~/.cache/aef-loader/ for both datasets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pixel-scale fallback in get_geobox_from_dataset previously used a literal
positive y-step from ModelPixelScaleY, hedged by a TODO and a brief flip_y
flag. The GeoTIFF spec (OGC 19-008r4) is unambiguous: ModelPixelScaleY is
stored as a positive magnitude with the affine y-step implicitly negative
(north-up). Datasets that are genuinely bottom-up — like AEF — carry
ModelTransformationTag instead, which encodes the sign explicitly in the
matrix.

With both paths spec-compliant, AEF (ModelTransformationTag, e=+10.0) and
FDP (ModelPixelScaleTag, north-up by spec) are auto-detected from the file
tags with no per-dataset configuration. Drop the flip_y flag entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Make _year_range accept a bare string ("2024", "2024-06-01"), avoiding the
  confusing 4-element unpack error a user would otherwise hit.
- query(limit=0) now returns an empty list instead of treating zero as
  "no limit" — switch from truthy to is-not-None check, matching the other
  filters in the same function.
- Replace the tautological test_default_cache_dir (the assertion's RHS
  always passed) with two explicit cases covering XDG_CACHE_HOME set and
  unset.
- Log skipped non-tile filenames at debug level so unexpected files in a
  leaf directory are diagnosable when verbose logging is enabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors VirtualTiffReader for the AEF case but trimmed for FDP's much
simpler shape: single CRS (EPSG:4326), single 'probability' band, native
float32, no UTM zone grouping. Always returns a DataTree with one child
per commodity — even for single-commodity input — so callers writing
generic pipelines see a predictable type.

Tile mosaicking uses the same outer-join + per-year temporal concat as the
AEF reader, so missing 1° tiles inside a queried bbox surface as NaN
holes. NaN is the natural fill for FDP's float32 dtype and is stamped on
the output via set_aef_nodata. DataTree children are kept independent —
no cross-commodity coordinate alignment, since coverage differs by
commodity-year.

The obstore/registry bookkeeping is duplicated from VirtualTiffReader
rather than lifted to a shared base; if a third dataset arrives that
shares this surface, that's the moment to factor.

One CRS-specific subtlety: odc-geo's xr_coords returns longitude/latitude
keys for geographic CRSs and only x/y for projected ones. AEF (UTM) hits
the projected path so the AEF reader works as-is, but FDP needs an
explicit dims=("y", "x") to keep the dim names consistent with what
virtual-tiff produces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xr.combine_by_coords is typed as `Dataset | DataArray`. We only feed it
Datasets so the result is always a Dataset, but ty can't see that —
assert it locally so time_slices stays narrowly typed as list[Dataset].

Drop a stale `# type: ignore[arg-type]` from the protocol-rejection test
that ty now flags as unused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds tests for the items triaged from the post-PR review as worth
landing — net +12 unit tests (70 → 82), all production code unchanged:

- Pin the GeoTIFF y-orientation fix at both the affine helper level and
  through get_geobox_from_dataset (new tests/test_cloud.py).
- Synthetic on-disk COG written via rasterio, opened through
  FDPReader._open_tile against an obstore LocalStore — exercises the
  full virtual-tiff → xarray → geobox → set_aef_nodata pipeline that
  unit tests previously bypassed via monkeypatch. Pins north-up
  orientation end-to-end (ys[0] > ys[-1]) and round-trips
  tile.as_datetime through _combine_opened_datasets.
- Multi-var error path in _open_tile (mocked open_zarr).
- Off-pattern files in obs.list output are silently excluded.
- obs.list batches are concatenated correctly across multiple yields.
- Parametrized _year_range covering ISO strings, mixed tuples, and the
  inverted-range pass-through.
- Repoint AEF test_get_store_caches patch from obstore.store.GCSStore
  to aef_loader.reader.make_gcs_store so the assert_not_called() check
  is no longer vacuous after the _cloud refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
virtualizarr.registry now re-exports ObjectStoreRegistry from
obspec_utils with a DeprecationWarning. Switch the AEF and FDP readers
to the canonical path. The remaining DeprecationWarnings in pytest
output come from inside virtualizarr.parsers.dmrpp itself and need an
upstream fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pre-existing inconsistencies a code review surfaced when FDP landed
— bring AEF in line:

- query(limit=0) now returns an empty list instead of treating zero as
  "no limit". Switch from truthy to is-not-None check, matching the
  same fix already applied to FDPIndex.query().
- _get_start_and_end_year accepts a bare string ("2024",
  "2024-06-01") and parses the first 4 chars, matching the matching
  helper FDPIndex._year_range. Previously a bare string fell through
  to start, end = years and raised an unpack error. Type signature
  widens accordingly to int | str | DateRange.

Adds three unit tests mirroring the FDP coverage: limit=0,
string-year, and a parametrized normalisation matrix covering ISO
strings, mixed tuples, and the inverted-range pass-through behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
virtual-tiff only attaches model_pixel_scale / model_tiepoint to IFD 0;
opening any overview level (IFD > 0) yielded a dataset with no geo-attrs
and get_geobox_from_dataset would raise. The slow integration test only
covered IFD 0 so the regression slipped past CI.

Each FDP tile covers exactly its filename's 1°×1° EPSG:4326 box
(verified during initial bucket exploration) and the IFD pixel
dimensions are always in the dataset, so the geobox is fully derivable
from those two facts at any IFD without consulting the tags. Switch to
that derivation; drop the now-unused get_geobox_from_dataset import
from this module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rviews

The previous fix went too far — it bypassed model_pixel_scale tag reading
entirely, even at IFD 0 where the writer's sub-pixel-precision tiepoint
is the most authoritative source for georeferencing. Restore the tag
path at IFD 0 (matches gdalinfo / rasterio bounds) and only fall back to
the tile-bbox derivation when the dataset has no geo-attrs (i.e. when
virtual-tiff has dropped them at overview IFDs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the overview-fallback branch in _open_tile that derives the geobox
from tile.bbox + ds.sizes when the dataset has no model_pixel_scale
attrs. The original slow test only covered IFD 0 — that's how the
overview-IFD bug slipped past CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from review #671:

- Extract has_geo_attrs into _cloud so the FDP reader's "do we have
  GeoTIFF tags?" check uses the same any-data-var iteration that
  get_geobox_from_dataset uses internally. Previously the reader looked
  only at ds["probability"].attrs — benign today because we assert
  single-var, but the predicates were not aligned and could drift.
- Add a direct unit test for _geobox_from_tile so a regression in the
  overview-IFD fallback is detectable without needing GCS. The slow
  integration test exercises the same path end-to-end at IFD 2 but is
  too coarse for catching small affine bugs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The same int|str|DateRange → (int, int) helper had landed on both
AEFIndex._get_start_and_end_year and FDPIndex._year_range with
identical behaviour and identical parametrized tests. Lift it to
aef_loader._cloud.normalize_year_range so the two callers stay in
sync, and consolidate the parametrization into tests/test_cloud.py.

Also fixes a stale "default: /tmp" reference in AEFIndex.__init__
that survived the cache-dir default change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rker skip

Two small follow-ups from the post-merge review:

- assign_coords(xr_coords(geobox)) instead of unpacking to bare
  .values arrays. xr_coords already returns a dict[str, DataArray];
  passing it through preserves the coord-DataArray attrs (axis,
  standard_name, the spatial_ref CRS coord) that odc-geo attaches.
  Same change in both readers.
- Add a comment to the size==0 guard in FDPIndex._list_leaf_tiles
  explaining it's the GCS pseudo-directory placeholder, not a
  size-based filter on real tiles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace inline split/conditional in AEFIndex.query() with
  pyproj.CRS.from_user_input(...).to_epsg(); add pyproj as a direct
  dep (was already transitive via geopandas/rasterio/odc-geo).
- Drop banner separators from tests/test_fdp_reader.py and
  restating-the-code comments across reader.py / index.py / fdp/index.py.
  Kept comments that explain why.
- Replace org-specific GCP_PROJECT example with a generic placeholder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant