geotiff: support fsspec URIs in read_geotiff_dask (#1749)#1755

Merged
brendancol merged 2 commits into main from issue-1749
May 13, 2026

Conversation

@brendancol
Contributor

Summary

  • read_geotiff_dask failed on s3://, gs://, az://, memory:// and other fsspec URIs with FileNotFoundError. The eager path already handled cloud URIs via _read_to_array + _CloudSource, but the dask path's metadata-only step (_read_geo_info in xrspatial/geotiff/__init__.py) used a plain open(source, 'rb') call.
  • Per-chunk pixel reads in the dask graph were already cloud-aware (they go through _read_to_array which dispatches to _CloudSource), so only the upfront metadata read needed fixing.
  • Route fsspec URIs in _read_geo_info through _CloudSource.read_all(). The local-path mmap fast path is unchanged. The HTTP path (which uses _parse_cog_http_meta) is unchanged.
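The three-way split described above (local path, HTTP, fsspec URI) can be sketched roughly as follows. This is only an illustration: `classify_source` is a hypothetical helper, not the `_is_fsspec_uri` check used in xrspatial, and treating every non-HTTP `scheme://` string as fsspec is an assumption.

```python
from urllib.parse import urlparse


def classify_source(source):
    """Hypothetical classifier: decide which read path a source would take."""
    if not isinstance(source, str):
        return "file-like"
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        # HTTP COGs keep their own range-based metadata parser.
        return "http"
    if scheme and "://" in source:
        # Assumption: any other scheme:// URI (s3, gs, az, memory, ...) is fsspec.
        return "fsspec"
    # Plain paths, including Windows drive letters such as C:\data\a.tif.
    return "local"
```

A metadata reader would then branch on the result instead of calling `open(source, 'rb')` unconditionally, which is the call that raised `FileNotFoundError` for cloud URIs.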

Closes #1749.

Test plan

  • New test_dask_path_fsspec_uri_1749 in TestCloudStorage writes a small TIFF, copies it into the fsspec memory:// filesystem, and verifies open_geotiff('memory:///...', chunks=4) returns a dask-backed DataArray whose values match both the eager read and the original array.
  • Full TestCloudStorage class still passes (6/6).
  • The wider xrspatial/geotiff/tests/ suite shows the same pre-existing failures as main (GPU and matplotlib failures, unrelated to this change).

…1749)

read_geotiff_dask failed on s3://, gs://, az://, memory:// and other
fsspec URIs because the metadata-only step (_read_geo_info) used a
plain open(source, 'rb') call. The eager path already handled cloud
URIs via _read_to_array + _CloudSource, so the dask graph's per-chunk
pixel reads were already cloud-aware; only the upfront metadata read
broke.

Detect fsspec URIs in _read_geo_info and pull the file bytes via
_CloudSource.read_all(). Local-path mmap fast path is unchanged.
HTTP path is unchanged and continues to use _parse_cog_http_meta.

Closes #1749.
github-actions bot added the performance label (PR touches performance-sensitive code) on May 12, 2026
brendancol requested a review from Copilot on May 12, 2026 23:48
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes open_geotiff(..., chunks=...) / read_geotiff_dask failures for fsspec-backed URIs (e.g., s3://, gs://, az://, memory://) by making the upfront metadata read (_read_geo_info) cloud-aware, aligning dask reads with the existing eager read path.

Changes:

  • Route fsspec/cloud URIs in _read_geo_info through _CloudSource instead of open(..., 'rb').
  • Add a regression test that reads a GeoTIFF from memory:// using dask chunks and validates results against eager reads and the original array.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| xrspatial/geotiff/__init__.py | Adds an fsspec URI branch in _read_geo_info to avoid local open() for cloud/memory schemes. |
| xrspatial/geotiff/tests/test_features.py | Adds a regression test covering dask reads from an fsspec memory:// URI. |

Comment thread xrspatial/geotiff/__init__.py Outdated
Comment on lines +480 to +484
elif isinstance(source, str) and _is_fsspec_uri(source):
    # fsspec URI (s3://, gs://, az://, memory://, ...): pull the
    # whole file via _CloudSource for metadata parsing. Per-chunk
    # pixel reads in the dask graph go through _read_to_array
    # which opens its own _CloudSource, so this fetch is metadata-only.

The prior commit added an fsspec branch in _read_geo_info that called
_CloudSource.read_all() to parse metadata. For a large COG on S3 that
pulls the whole object into memory just to learn its shape/transform,
which defeats the O(1) memory intent of _read_geo_info.

Route fsspec URIs through _parse_cog_http_meta (same path as HTTP COGs).
It only requires a source that exposes read_range, which _CloudSource
does, and grows a bounded buffer (capped at MAX_HTTP_HEADER_BYTES)
until the IFD chain resolves. This applies to both _read_geo_info itself
(used by the .raster accessor) and the read_geotiff_dask metadata
prefetch. Per-chunk tile reads in the dask graph now also dispatch
through _fetch_decode_cog_http_tiles with a _CloudSource instead of
falling back to _read_to_array's read_all path.

Adds read_ranges and read_ranges_coalesced methods to _CloudSource so
the tiled COG decode path can drive a cloud source the same way it
drives an HTTP source. Extends the regression test to assert
_CloudSource.read_all is not called during dask graph construction or
chunk materialisation.

Addresses PR #1755 review comment.
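
For readers unfamiliar with the bounded-prefetch idea in the commit message above, here is a minimal sketch under stated assumptions. It is not the `_parse_cog_http_meta` implementation: `BytesRangeSource`, `first_ifd_entry_count`, `read_header_bounded`, and the `MAX_HEADER_BYTES` value are all hypothetical, and it only resolves the first IFD of a classic (non-BigTIFF) file rather than the whole IFD chain.

```python
import struct

# Hypothetical cap; the real MAX_HTTP_HEADER_BYTES value is not stated in this PR.
MAX_HEADER_BYTES = 1 << 20


class BytesRangeSource:
    """Stand-in for any source exposing read_range (not the real _CloudSource)."""

    def __init__(self, data):
        self._data = data
        self.bytes_fetched = 0

    def read_range(self, start, length):
        chunk = self._data[start:start + length]
        self.bytes_fetched += len(chunk)
        return chunk


def first_ifd_entry_count(buf):
    """Return the first IFD's entry count, or None if buf is still too short."""
    if len(buf) < 8:
        return None
    order = {b"II": "<", b"MM": ">"}[buf[:2]]
    magic, ifd_offset = struct.unpack(order + "HI", buf[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF")
    if len(buf) < ifd_offset + 2:
        return None
    (count,) = struct.unpack(order + "H", buf[ifd_offset:ifd_offset + 2])
    # Each IFD entry is 12 bytes; a 4-byte next-IFD offset follows the entries.
    if len(buf) < ifd_offset + 2 + 12 * count + 4:
        return None
    return count


def read_header_bounded(source, initial=16, cap=MAX_HEADER_BYTES):
    """Double the prefetch size until the first IFD fits, never exceeding cap."""
    size = initial
    while True:
        # Re-reads from offset 0 each round for simplicity.
        buf = source.read_range(0, size)
        count = first_ifd_entry_count(buf)
        if count is not None:
            return count, len(buf)
        if len(buf) < size or size >= cap:
            raise ValueError("IFD did not resolve within the byte cap")
        size = min(size * 2, cap)
```

The point of the design is that memory use depends on the header size and the cap, not on the size of the object in the bucket, which is what a `read_all()` call cannot guarantee.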
@brendancol
Contributor Author

Fixed in 16b76ea. fsspec URIs now go through _parse_cog_http_meta (the bounded range-prefetch parser used for HTTP COGs) instead of _CloudSource.read_all. Three changes:

  • _read_geo_info: fsspec branch switched from read_all() to _parse_cog_http_meta(_CloudSource(source)). Metadata reads stay capped at MAX_HTTP_HEADER_BYTES.
  • read_geotiff_dask: the per-graph metadata prefetch now treats fsspec URIs the same as HTTP URIs (one parse, wrapped in a dask.delayed so all chunk tasks share it).
  • _delayed_read_window._read: per-chunk reads for fsspec sources dispatch through _fetch_decode_cog_http_tiles with a _CloudSource, so each chunk issues range GETs (and coalesces neighbours) instead of pulling the whole file. Added read_ranges and read_ranges_coalesced to _CloudSource to match the _HTTPSource interface that the tile decoder expects.
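
As a rough illustration of what coalescing neighbouring range reads means, the sketch below merges byte ranges that touch or sit within a small gap of each other, so fewer GETs are issued. `coalesce_ranges` and its `max_gap` parameter are hypothetical; the actual `read_ranges_coalesced` signature and merge policy are not shown in this PR.

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (offset, length) ranges whose gaps are at most max_gap bytes.

    Hypothetical helper, not the _CloudSource.read_ranges_coalesced method.
    """
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]
```

A caller would fetch each merged range once and slice the individual tiles back out of the returned bytes.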

Test test_dask_path_fsspec_uri_1749 still passes, and a new test_dask_path_fsspec_uri_no_full_download_1749 monkeypatches _CloudSource.read_all to raise and confirms open_geotiff(..., chunks=4) plus .compute() succeeds, proving the bounded-prefetch path is taken end-to-end.
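
The "patch read_all to raise" technique can be sketched with the standard library alone. Everything here is a stand-in: `FakeCloudSource` and `read_magic` are hypothetical, and the real test patches `_CloudSource.read_all` with pytest's monkeypatch rather than `unittest.mock`.

```python
from unittest import mock


class FakeCloudSource:
    """Hypothetical source with both a full-download and a range-read method."""

    def __init__(self, data):
        self._data = data

    def read_all(self):
        return self._data

    def read_range(self, start, length):
        return self._data[start:start + length]


def read_magic(source):
    """Code under test: should only ever touch read_range."""
    return source.read_range(0, 4)


def check_no_full_download():
    src = FakeCloudSource(b"II*\x00" + bytes(1000))
    # Any call to read_all inside the block raises, failing the check loudly.
    with mock.patch.object(
        FakeCloudSource, "read_all", side_effect=AssertionError("full download")
    ) as guard:
        magic = read_magic(src)
    return magic, guard.call_count
```

If the code under test ever regressed to a full download, the patched method would raise instead of silently succeeding.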

brendancol merged commit ef5fc31 into main on May 13, 2026
11 checks passed

Labels

performance PR touches performance-sensitive code

Development

Successfully merging this pull request may close these issues.

geotiff: read_geotiff_dask does not support fsspec / cloud URIs

2 participants