geotiff: read external .tif.ovr sidecar overviews#2114
Conversation
GDAL/rasterio writes overview pyramids to a sibling .tif.ovr file when the source is not a COG. The reader walked only the in-file IFD chain, so any overview_level >= 1 on such a file raised "out of range" even though the data lived right next to the source. Add a _sidecar module that discovers the sibling file (local paths only -- cloud / HTTP / file-like sources fall back to base-only) and parses its IFDs as a continuation of the base file's pyramid. Each sidecar IFD is tagged with its (data, header) so the strip / tile reader resolves byte offsets against the right buffer. Both the eager read_to_array path and the metadata helper used by the dask graph builder pick up the merged pyramid, so dask + numpy reads of sidecar levels work for free. GPU and dask+GPU paths still walk the base IFD chain only -- threading sidecar bytes through nvCOMP is follow-up work. The corpus-level skip for the eager parity check is now empty; the dask+GPU skip stays.
brendancol
left a comment
There was a problem hiding this comment.
Self-review of PR #2114.
Coverage
- Local-file sidecar discovery only; cloud / HTTP / file-like sources skip the lookup (early return in
find_sidecar). - Each sidecar IFD is tagged with its origin
(data, header)viaobject.__setattr__so the strip / tile reader slices the right buffer. - Both
read_to_arrayand_read_geo_infosee the merged pyramid, so the dask graph builder gets correct metadata for sidecar levels too. Dask + numpy chunks read throughread_to_array, so dask + sidecar works without further changes. - Byte parity check against rasterio for both sidecar levels of the bundled fixture passes.
Risk areas worth a closer look
- GPU and dask + GPU paths still walk the base IFD chain. The corpus skip stays for those two backends. The GPU reader's separate
parse_all_ifdscallers in_backends/gpu.pywould need similar treatment to extend coverage. - HTTP / cloud sidecar discovery is non-trivial (you would need to know the sidecar URL exists). Not in scope.
- The sidecar mmap is closed in a
try / finallynext to the base file close; if the base parse fails before sidecar load, the sidecar isNoneand skipped.
Cleanup
- Removed the
_OVERVIEW_READER_GAPS["overview_external_ovr_uint16"]entry from the eager numpy parity test so the corpus exercises the sidecar path end-to-end.
brendancol
left a comment
There was a problem hiding this comment.
Follow-up to my self-review.
GPU and dask+GPU sidecar coverage -> deferred.
The GPU reader (_backends/gpu.py) and the dask+GPU pipeline both call parse_all_ifds directly. Threading the sidecar (data, header) swap through the nvCOMP tile decode would mean introducing a per-IFD source switch in the GPU kernel dispatcher, which is wider than this PR's scope. The corpus skip stays in place for those two backends and the gap is now narrowly described.
HTTP / cloud sidecar -> deferred.
The sidecar URL would need to be discoverable in the source's namespace (e.g. s3://bucket/path.tif.ovr exists alongside s3://bucket/path.tif). The _CloudSource layer already supports range reads but does not expose a "does object X exist" predicate without a HEAD round-trip. Leaving the discovery off entirely on remote URIs means the existing base-only behaviour is preserved and no spurious HEAD requests fire on every open.
Mmap close ordering -> verified.
Walked the failure modes:
- If
parse_header(data)raises,sidecaris stillNone(the load is after parse), so thefinallyblock only closes the base mmap. - If
find_sidecarreturns a path butload_sidecarraises (e.g. malformed.ovr), thesidecarlocal staysNoneand the base mmap closes normally; the partial sidecar file handle was already closed insideload_sidecar's owntry / finally. - If
attach_sidecar_originraises (shouldn't, but worth checking), thesidecarlocal is set, so the outer finally closes the sidecar mmap too.
No change needed.
Self-review on #2114 deferred three items. Address them. - HTTP sidecar discovery: ``find_sidecar`` now issues a HEAD on ``<url>.ovr`` for http:// / https:// sources and ``load_sidecar`` downloads the bytes via urllib. The HEAD probe is bounded by a 10s timeout and silently returns None on any error so a missing or unreachable sidecar falls back to base-only behaviour. - fsspec sidecar discovery: ``find_sidecar`` falls through to ``fsspec.AbstractFileSystem.exists`` for s3:// / gs:// / az:// / memory:// / file:// URIs and ``load_sidecar`` reads via fsspec. - GPU eager reader: ``read_geotiff_gpu`` parses the sidecar IFD chain alongside the base file and swaps the in-scope ``data`` / ``header`` to the sidecar's buffers when the selected overview level lives there. The GDS fast path (kvikio direct read from the base file path) is skipped for sidecar IFDs since their byte offsets index the ``.ovr``, not the base; the CPU-mmap GPU decode path slices the already-loaded sidecar buffer directly. Verified byte parity with the CPU eager path on both sidecar levels of the bundled fixture. Added ``close_sidecar`` helper to share the data-close convention across the three call sites. The GPU path relies on Python GC to reap the sidecar mmap since the GPU decode happens after the metadata try/finally exits. Chunked GPU + sidecar (``read_geotiff_gpu`` with chunks=...) still walks the base IFD only because the chunked GDS path needs separate plumbing for a different file path; dask+GPU corpus skip stays.
brendancol
left a comment
There was a problem hiding this comment.
Re-review after the GPU + HTTP + fsspec extension.
Suggestions (should fix, not blocking)
-
HEAD probe and HTTP sidecar load bypass the existing urllib3 pool.
_sidecar._probe_httpusesurllib.request.Request(method="HEAD")andload_sidecarHTTP branch usesurllib.request.urlopen. The eager reader at_reader.pyalready maintains a_get_http_poolurllib3 PoolManager with connection reuse, retries, and a shared user agent. Sidecar discovery and load should share that pool so cloud reads of many sidecars do not re-open TCP per probe. -
_probe_httpdoes not honour the private-host guard._reader.py:556rejects loopback / link-local / private hosts by default and only allows them withXRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1. The new sidecar probe fires a HEAD against any URL the caller passes, so a user opening a public URL whose.ovrhappens to live behind a private hostname (or whose 302 redirects to one) bypasses the guard. Route the probe through the same_validate_url_safecheck the reader uses. -
GPU sidecar mmap is reaped by Python GC rather than explicitly closed.
_backends/gpu.pydocuments this but Python is free to defer GC arbitrarily and the underlying file descriptor stays open until then. For long-running processes opening many GeoTIFFs in sequence this can accumulate file descriptors. A function-leveltry / finallythat wraps both the metadata parse and the GPU decode would letclose_sidecarrun deterministically. -
SidecarOverviewsis created and immediately mutated.attach_sidecar_originwrites back to the IFD instances viaobject.__setattr__. If a future caller reuses the same IFD list across requests, that mutation persists. Not a problem today, but a defensive option is to clone the IFD list (or capture the origin in a side dict) instead of mutating the parsed IFDs.
Nits (optional)
-
SidecarOverviews.data: objectcould narrow toUnion[mmap.mmap, bytes]so readers see the actual type. -
_is_fsspec_uriis duplicated in_sidecar.pyto dodge the circular import. A shared helper module (or moving both definitions into_paths.py/ similar) keeps the gates in sync. -
HEAD probe follows HTTP redirects by default.
urllib.request.urlopenfollows up to 10 redirects; a 302 from a public host to a private one would still satisfy the probe even after the safe-host fix above. SettingHTTPRedirectHandlerto refuse or restrict redirects keeps the probe predictable. -
Comment in
_backends/gpu.py:521-526explains why the close is deferred, which is helpful, but the doc reads as if the lifetime equivalence withdata(also touched after src.close()) is exact; it is approximate becausedatais a cached-mmap slice that survives release whereas the sidecar mmap is a fresh kernel mapping with its own lifecycle.
What looks good
- Discovery + load are layered: a cheap existence probe followed by an explicit
load_sidecar, with consistent silent-fail-to-base-only semantics on every transport. - GPU sidecar parity test (commit 55fb85c) confirms byte equality between CPU and GPU on both sidecar levels of the fixture.
- HTTP + fsspec tests use a loopback server /
file://URI so the coverage exercises real transport without external network dependency. close_sidecarhelper centralises the close convention and keeps the call sites symmetric.
Apply every item from the re-review pass.
- Route the HTTP sidecar probe and download through ``_HTTPSource`` so
both inherit the eager reader's defences: SSRF allow-list, IP
pinning across redirect re-validation, shared urllib3 PoolManager
with retries, and ``redirect=False`` with manual ``Location`` re-
check. The probe uses a 1-byte Range GET rather than HEAD because
HEAD support is inconsistent across CDNs / object stores. Removes
the duplicated 10s urllib timeout path.
- Close the sidecar mmap deterministically in ``read_geotiff_gpu`` by
moving the GPU decode into the same try/finally that owns the
metadata parse. The previous "rely on GC" path leaked file
descriptors on long-running processes opening many GeoTIFFs in
sequence; the close now runs after the decode has finished reading
``data``.
- Replace the IFD-mutating ``attach_sidecar_origin`` (``object.__set
attr__`` on every parsed IFD) with a return value -- a request-
scoped ``dict[id(ifd), (data, header)]`` -- so a future caller that
caches the parsed IFD list does not inherit stale ``_source_data`` /
``_source_header`` attributes from a previous request.
- Narrow ``SidecarOverviews.data`` from ``object`` to a new
``SidecarBuffer = Union[mmap.mmap, bytes]`` alias so the reader sees
the actual buffer variants. Carry the alias on
``attach_sidecar_origin``'s signature too.
- Drop the duplicate ``_is_fsspec_uri`` from ``_sidecar`` and import
the helper from ``_reader``. ``_reader`` imports ``_sidecar`` lazily
inside its function bodies, so the top-level reverse import does
not form a cycle at module load time.
- Tests:
- Add ``test_find_sidecar_http_probe_rejects_loopback_without_env_override``
to pin the SSRF guard's silent-fail-to-base contract.
- Existing loopback-server probes opt into ``XRSPATIAL_GEOTIFF_ALLOW_
PRIVATE_HOSTS=1`` (same pattern other HTTP tests use) so they
keep passing through the new SSRF-gated probe.
Verified: full ``xrspatial/geotiff/tests/`` suite 4046 passed, 25
skipped.
brendancol
left a comment
There was a problem hiding this comment.
Addressed the re-review (commit 722528e).
| Item | Disposition |
|---|---|
| HEAD probe / HTTP load bypass urllib3 pool | Fixed -- both _probe_http and load_sidecar HTTP branch now route through _HTTPSource. Sidecar fetches reuse the eager reader's _get_http_pool urllib3 PoolManager, share retries, and inherit the manual redirect re-validation. |
| HEAD probe missing private-host guard | Fixed -- _HTTPSource.__init__ runs _validate_http_url (SSRF allow-list + IP pinning) before the connection opens. A loopback probe with no XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1 now returns None silently, matching the rest of the discovery contract. New regression test test_find_sidecar_http_probe_rejects_loopback_without_env_override pins this. |
| HEAD probe follows redirects implicitly | Closed by the _HTTPSource routing -- the source uses redirect=False and re-runs _validate_http_url on every Location header, so a 302 from a public host to a private one is rejected the same way the eager reader rejects it. Also swapped HEAD for a 1-byte Range GET so the probe works on CDNs / object stores with inconsistent HEAD support. |
| GPU sidecar mmap relied on GC | Fixed -- restructured read_geotiff_gpu so the GPU decode runs inside the same try/finally that owns the metadata parse. close_sidecar now runs deterministically after the decode finishes touching the buffer. File-descriptor count is bounded for long-running processes. The "this is approximate equivalence with data's lifetime" comment is gone with the GC path. |
attach_sidecar_origin mutated IFDs via object.__setattr__ |
Fixed -- the helper now returns a dict[id(ifd), (data, header)] side map per call. Reader and GPU paths look up sidecar_origin.get(id(ifd), (data, header)) to resolve the buffer. No IFD instance is mutated; a future caller that caches parsed IFDs across requests inherits a clean object. |
SidecarOverviews.data: object |
Narrowed to SidecarBuffer = Union[mmap.mmap, bytes]. The new alias also annotates attach_sidecar_origin. |
_is_fsspec_uri duplicated |
Removed the local copy; _sidecar.py now imports the helper from _reader. _reader imports _sidecar lazily inside function bodies so the top-level reverse import does not form a cycle at module load. |
Outdated GC-lifetime comment in _backends/gpu.py |
Gone with the deterministic close. |
Full xrspatial/geotiff/tests/ suite: 4046 passed, 25 skipped.
Resolves the conflict in xrspatial/geotiff/_backends/gpu.py where #2113 (per-tile GPU TileByteCounts cap) and this branch both modify the ``read_geotiff_gpu`` decode setup region. Resolution: - Keep the per-tile cap loop from #2113. It enforces the same ``XRSPATIAL_COG_MAX_TILE_BYTES`` budget the CPU paths (``_read_tiles``, ``_fetch_decode_cog_http_tiles``) already apply per tile, so the GPU pipeline no longer accepts crafted ``TileByteCounts`` entries the CPU side would refuse. The loop runs inside the extended try/finally introduced for sidecar mmap cleanup, so a failure here still releases ``src`` and any ``sidecar`` mmap. - Keep the sidecar refactor's single function-end ``finally`` that closes ``src`` and (if present) the sidecar mmap together. Do not re-introduce the earlier mid-function ``finally: src.close()`` that #2113 carried, because the GPU decode below still slices ``data`` -- which is the sidecar mmap when the selected IFD lives there -- and closing early would invalidate the buffer. - Ordering: byte cap check first (safety guard), then ``has_sparse_tile`` / ``masked_fill`` setup. Both iterate ``byte_counts`` independently; keeping the safety check first means a malformed file fails before any decode-prep allocation. Verified: full ``xrspatial/geotiff/tests/`` suite passes (4208 passed, 25 skipped) including the new ``test_gpu_tile_byte_cap_2026_05_18.py`` from main and the existing sidecar regression tests.
Closes #2112.
Summary
_sidecar.pyfor local-file sidecar discovery (<source>.ovr) and IFD parsing.read_to_arrayand_read_geo_infosooverview_level >= 1resolves the GDAL/rasterio convention where the base file holds level 0 and the sidecar holds the rest.(data, header)so strip / tile reads slice the right buffer.overview_external_ovr_uint16on the eager numpy parity check.Backend coverage
read_to_array).Test plan
test_sidecar_ovr_2112.pycovers discovery (local / remote URIs / file-like / missing sidecar), parsing, end-to-end reads at each level, byte parity with rasterio, and the dask+numpy path.test_golden_corpus_eager_numpy_1930.py) now exercises the sidecar overview levels.xrspatial/geotiff/tests/suite: 4038 passed, 25 skipped.