geotiff: sidecar download honours max_cloud_bytes#2123
Conversation
`load_sidecar` downloaded the sibling `.tif.ovr` over HTTP and fsspec with no byte cap. The base-file `max_cloud_bytes` budget that `read_to_array` and `_CloudSource` enforce was bypassed entirely, so a hostile server could serve a tiny base TIFF that passes the cloud budget plus a multi-GB sidecar; opening with `overview_level >= 1` pulled the full sidecar body into memory and OOMed the process. Thread `max_cloud_bytes` through `load_sidecar` and apply it on both transports: - HTTP path forwards the cap to `_HTTPSource.read_all(max_bytes=...)`, reusing the existing streaming overshoot detector. - fsspec path stats the sidecar via `fs.size()` and raises `CloudSizeLimitError` when the declared size exceeds the budget, mirroring the `_CloudSource` guard at `_reader.py:3239-3260`. `read_to_array` now resolves the cloud budget once at the top of the function so both the base-file size guard and the sidecar fetch see the same effective cap. `max_cloud_bytes=None` preserves the legacy unbounded behaviour. The local-file mmap path is unchanged. The GPU and dask-metadata sidecar call sites only see local file paths (via `_FileSource` and the `_read_geo_info` local branch respectively), so they fall through to the mmap branch and inherit the default `max_cloud_bytes=None` for free. Tests cover the fsspec size guard, the HTTP streaming cap, the `max_cloud_bytes=None` unbounded path, the local mmap passthrough, and end-to-end propagation from `read_to_array` into `load_sidecar` with a sidecar inflated past the base file's size. Closes #2121.
brendancol
left a comment
There was a problem hiding this comment.
Domain-aware review (post-merge audit)
Automated self-review of the sidecar max_cloud_bytes fix.
Blockers
None.
Suggestions
-
Asymmetric exception type across transports. The fsspec path raises
CloudSizeLimitErrorfrom a pre-flightfs.size()check; the HTTP path
raisesOSErrorfrom the streaming overshoot detector inside
_HTTPSource.read_all. Both are valid for the threat model (fsspec
knows the size up front, HTTP may not), and the docstring now spells
the asymmetry out. Callers who catchCloudSizeLimitErrorwill miss
the HTTP path. Consider whether a thin wrapper around the HTTP read
should translateOSErrortoCloudSizeLimitErrorwhen the cause is
the streaming cap. Filing as a follow-up would be fine if the
asymmetry is intentional for now. -
fsspec.core.url_to_fsvsfsspec.open. The fsspec branch now
opens the filesystem twice (once for size, once for read). The change
fromfsspec.open(path)tourl_to_fs+fs.open(fs_path)is needed
to share the filesystem handle, but it slightly changes the open path.
The test suite passes forfile://URIs; the production fsspec
schemes (s3://,gs://,az://, ...) are exercised elsewhere but
not in this PR's tests. Worth a smoke test against at least one
non-file://backend (e.g.memory://) if available.
Nits
-
The end-to-end test inflates the sidecar by appending zero bytes to
forcesidecar_size > base_size. The TIFF parser tolerates trailing
garbage past the IFD chain, so this works, but a monkey-patched
fs.size()would be a more direct way to express the contract. The
current approach has the virtue of exercising the real disk path. -
The docstring's "Issue #2121" anchor appears in three places
(Parameters, HTTP comment, fsspec comment). Slight redundancy, but
consistent with the rest of the module's commenting style.
Coverage
- 8 new tests in
test_sidecar_max_cloud_bytes_2121.pycovering the
fsspec size guard, HTTP streaming cap,max_cloud_bytes=None
unbounded path, local-mmap passthrough, and end-to-end propagation
fromread_to_arrayintoload_sidecar. - 28 existing tests in
test_sidecar_ovr_2112.pypass unchanged. - 83 tests pass under
pytest -k "cloud or sidecar"with no
regressions.
Dispositions
The asymmetric-exception suggestion is being deferred to a follow-up
because aligning the two transports cleanly requires a small refactor
of _HTTPSource.read_all (currently raises raw OSError from inside
the urllib3 stream loop) that is outside this PR's scope. The
memory:// fsspec coverage nit is also being deferred -- the existing
test_sidecar_ovr_2112.py::test_find_sidecar_fsspec_probe_returns_uri_when_present
already exercises file:// end-to-end; broader fsspec backend coverage
belongs with the golden-corpus matrix (#1930).
Address review feedback on #2121: 1. ``load_sidecar`` now translates the ``OSError`` raised by ``_HTTPSource.read_all`` budget guards into ``CloudSizeLimitError`` so the HTTP and fsspec branches surface the same exception type for "sidecar exceeds the cap". The original ``OSError`` is preserved as ``__cause__``. Non-budget HTTP failures (connection reset, DNS, etc.) still propagate as ``OSError``. 2. Add ``test_env_var_propagates_to_sidecar`` to pin that ``XRSPATIAL_GEOTIFF_MAX_CLOUD_BYTES`` (no kwarg) caps the sidecar the same way an explicit ``max_cloud_bytes`` kwarg does. The existing HTTP overshoot test now asserts ``CloudSizeLimitError`` and that ``__cause__`` retains the underlying ``OSError`` detail.
PR Review: geotiff: sidecar download honours max_cloud_bytesBlockersNone. Suggestions (addressed in follow-up commit 77e5bb6)
Nits (left for a future cleanup, not blocking)
What looks good
Checklist
|
Closes #2121.
Summary
load_sidecardownloaded the sibling.tif.ovrover HTTP and fsspec withno byte cap. The base-file
max_cloud_bytesbudget thatread_to_arrayand
_CloudSourceenforce was bypassed entirely, so a hostile servercould serve a tiny base TIFF (which passes the cloud-budget check) plus a
multi-GB
<base>.tif.ovrsidecar; opening the file withoverview_level >= 1pulled the full sidecar body into memory and OOMed the process.
Changes
xrspatial/geotiff/_sidecar.py:load_sidecaracceptsmax_cloud_bytesand applies it on both remote transports.
_HTTPSource.read_all(max_bytes=...),reusing the existing streaming overshoot detector (which raises
OSErrorwhen the body exceeds the cap).fs.size()and raisesCloudSizeLimitErrorwhen the declared size exceeds the budget,mirroring the
_CloudSourceguard at_reader.py:3239-3260.xrspatial/geotiff/_reader.py:read_to_arrayresolves the cloudbudget once at the top so both the base-file size guard and the
sidecar fetch see the same cap. The sidecar load now passes the
resolved budget through.
max_cloud_bytes=Nonepreserves the legacy unbounded behaviour,matching the base-file semantics.
(via
_FileSourceand the_read_geo_infolocal branch), so they fallthrough to the mmap path and inherit the default
max_cloud_bytes=None.Backend coverage
read_to_array).cap).
sidecars, per the comment in
_backends/gpu.py).Test plan
pytest xrspatial/geotiff/tests/test_sidecar_max_cloud_bytes_2121.py(8 tests covering fsspec size guard, HTTP streaming cap,
max_cloud_bytes=Noneunbounded path, local mmap passthrough, andend-to-end propagation from
read_to_arraytoload_sidecar).pytest xrspatial/geotiff/tests/test_sidecar_ovr_2112.py(28existing sidecar tests pass unchanged).
pytest -k "cloud or sidecar" xrspatial/geotiff/tests/(83 testspass, no regressions).