Skip to content

geotiff: read external .tif.ovr sidecar overviews#2114

Merged
brendancol merged 4 commits into
mainfrom
issue-2112
May 19, 2026
Merged

geotiff: read external .tif.ovr sidecar overviews#2114
brendancol merged 4 commits into
mainfrom
issue-2112

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #2112.

Summary

  • Adds _sidecar.py for local-file sidecar discovery (<source>.ovr) and IFD parsing.
  • Wires sidecar IFDs into read_to_array and _read_geo_info so overview_level >= 1 resolves the GDAL/rasterio convention where the base file holds level 0 and the sidecar holds the rest.
  • Each sidecar IFD carries its origin (data, header) so strip / tile reads slice the right buffer.
  • Removes the corpus-level skip for overview_external_ovr_uint16 on the eager numpy parity check.
  • Cloud / HTTP / file-like sources skip the lookup and keep the base-only behaviour; sidecar support on those paths is follow-up.

Backend coverage

  • numpy: covered (direct).
  • cupy: not in scope; GPU reader walks the base IFD chain only. Corpus skip stays.
  • dask+numpy: covered (per-chunk read goes through read_to_array).
  • dask+cupy: not in scope; same reason as cupy.

Test plan

  • test_sidecar_ovr_2112.py covers discovery (local / remote URIs / file-like / missing sidecar), parsing, end-to-end reads at each level, byte parity with rasterio, and the dask+numpy path.
  • Golden corpus eager parity (test_golden_corpus_eager_numpy_1930.py) now exercises the sidecar overview levels.
  • Full xrspatial/geotiff/tests/ suite: 4038 passed, 25 skipped.

GDAL/rasterio writes overview pyramids to a sibling .tif.ovr file when
the source is not a COG. The reader walked only the in-file IFD chain,
so any overview_level >= 1 on such a file raised "out of range" even
though the data lived right next to the source.

Add a _sidecar module that discovers the sibling file (local paths
only -- cloud / HTTP / file-like sources fall back to base-only) and
parses its IFDs as a continuation of the base file's pyramid. Each
sidecar IFD is tagged with its (data, header) so the strip / tile
reader resolves byte offsets against the right buffer. Both the eager
read_to_array path and the metadata helper used by the dask graph
builder pick up the merged pyramid, so dask + numpy reads of sidecar
levels work for free.

GPU and dask+GPU paths still walk the base IFD chain only -- threading
sidecar bytes through nvCOMP is follow-up work. The corpus-level skip
for the eager parity check is now empty; the dask+GPU skip stays.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 19, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review of PR #2114.

Coverage

  • Local-file sidecar discovery only; cloud / HTTP / file-like sources skip the lookup (early return in find_sidecar).
  • Each sidecar IFD is tagged with its origin (data, header) via object.__setattr__ so the strip / tile reader slices the right buffer.
  • Both read_to_array and _read_geo_info see the merged pyramid, so the dask graph builder gets correct metadata for sidecar levels too. Dask + numpy chunks read through read_to_array, so dask + sidecar works without further changes.
  • Byte parity check against rasterio for both sidecar levels of the bundled fixture passes.

Risk areas worth a closer look

  • GPU and dask + GPU paths still walk the base IFD chain. The corpus skip stays for those two backends. The GPU reader's separate parse_all_ifds callers in _backends/gpu.py would need similar treatment to extend coverage.
  • HTTP / cloud sidecar discovery is non-trivial (you would need to know the sidecar URL exists). Not in scope.
  • The sidecar mmap is closed in a try / finally next to the base file close; if the base parse fails before sidecar load, the sidecar is None and skipped.

Cleanup

  • Removed the _OVERVIEW_READER_GAPS["overview_external_ovr_uint16"] entry from the eager numpy parity test so the corpus exercises the sidecar path end-to-end.

Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up to my self-review.

GPU and dask+GPU sidecar coverage -> deferred.
The GPU reader (_backends/gpu.py) and the dask+GPU pipeline both call parse_all_ifds directly. Threading the sidecar (data, header) swap through the nvCOMP tile decode would mean introducing a per-IFD source switch in the GPU kernel dispatcher, which is wider than this PR's scope. The corpus skip stays in place for those two backends and the gap is now narrowly described.

HTTP / cloud sidecar -> deferred.
The sidecar URL would need to be discoverable in the source's namespace (e.g. s3://bucket/path.tif.ovr exists alongside s3://bucket/path.tif). The _CloudSource layer already supports range reads but does not expose a "does object X exist" predicate without a HEAD round-trip. Leaving the discovery off entirely on remote URIs means the existing base-only behaviour is preserved and no spurious HEAD requests fire on every open.

Mmap close ordering -> verified.
Walked the failure modes:

  • If parse_header(data) raises, sidecar is still None (the load is after parse), so the finally block only closes the base mmap.
  • If find_sidecar returns a path but load_sidecar raises (e.g. malformed .ovr), the sidecar local stays None and the base mmap closes normally; the partial sidecar file handle was already closed inside load_sidecar's own try / finally.
  • If attach_sidecar_origin raises (shouldn't, but worth checking), the sidecar local is set, so the outer finally closes the sidecar mmap too.

No change needed.

Self-review on #2114 deferred three items. Address them.

- HTTP sidecar discovery: ``find_sidecar`` now issues a HEAD on
  ``<url>.ovr`` for http:// / https:// sources and ``load_sidecar``
  downloads the bytes via urllib. The HEAD probe is bounded by a 10s
  timeout and silently returns None on any error so a missing or
  unreachable sidecar falls back to base-only behaviour.

- fsspec sidecar discovery: ``find_sidecar`` falls through to
  ``fsspec.AbstractFileSystem.exists`` for s3:// / gs:// / az:// /
  memory:// / file:// URIs and ``load_sidecar`` reads via fsspec.

- GPU eager reader: ``read_geotiff_gpu`` parses the sidecar IFD chain
  alongside the base file and swaps the in-scope ``data`` / ``header``
  to the sidecar's buffers when the selected overview level lives
  there. The GDS fast path (kvikio direct read from the base file
  path) is skipped for sidecar IFDs since their byte offsets index the
  ``.ovr``, not the base; the CPU-mmap GPU decode path slices the
  already-loaded sidecar buffer directly. Verified byte parity with
  the CPU eager path on both sidecar levels of the bundled fixture.

Added ``close_sidecar`` helper to share the data-close convention
across the three call sites. The GPU path relies on Python GC to
reap the sidecar mmap since the GPU decode happens after the
metadata try/finally exits.

Chunked GPU + sidecar (``read_geotiff_gpu`` with chunks=...) still
walks the base IFD only because the chunked GDS path needs separate
plumbing for a different file path; dask+GPU corpus skip stays.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review after the GPU + HTTP + fsspec extension.

Suggestions (should fix, not blocking)

  • HEAD probe and HTTP sidecar load bypass the existing urllib3 pool. _sidecar._probe_http uses urllib.request.Request(method="HEAD") and load_sidecar HTTP branch uses urllib.request.urlopen. The eager reader at _reader.py already maintains a _get_http_pool urllib3 PoolManager with connection reuse, retries, and a shared user agent. Sidecar discovery and load should share that pool so cloud reads of many sidecars do not re-open TCP per probe.

  • _probe_http does not honour the private-host guard. _reader.py:556 rejects loopback / link-local / private hosts by default and only allows them with XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1. The new sidecar probe fires a HEAD against any URL the caller passes, so a user opening a public URL whose .ovr happens to live behind a private hostname (or whose 302 redirects to one) bypasses the guard. Route the probe through the same _validate_url_safe check the reader uses.

  • GPU sidecar mmap is reaped by Python GC rather than explicitly closed. _backends/gpu.py documents this but Python is free to defer GC arbitrarily and the underlying file descriptor stays open until then. For long-running processes opening many GeoTIFFs in sequence this can accumulate file descriptors. A function-level try / finally that wraps both the metadata parse and the GPU decode would let close_sidecar run deterministically.

  • SidecarOverviews is created and immediately mutated. attach_sidecar_origin writes back to the IFD instances via object.__setattr__. If a future caller reuses the same IFD list across requests, that mutation persists. Not a problem today, but a defensive option is to clone the IFD list (or capture the origin in a side dict) instead of mutating the parsed IFDs.

Nits (optional)

  • SidecarOverviews.data: object could narrow to Union[mmap.mmap, bytes] so readers see the actual type.

  • _is_fsspec_uri is duplicated in _sidecar.py to dodge the circular import. A shared helper module (or moving both definitions into _paths.py / similar) keeps the gates in sync.

  • HEAD probe follows HTTP redirects by default. urllib.request.urlopen follows up to 10 redirects; a 302 from a public host to a private one would still satisfy the probe even after the safe-host fix above. Setting HTTPRedirectHandler to refuse or restrict redirects keeps the probe predictable.

  • Comment in _backends/gpu.py:521-526 explains why the close is deferred, which is helpful, but the doc reads as if the lifetime equivalence with data (also touched after src.close()) is exact; it is approximate because data is a cached-mmap slice that survives release whereas the sidecar mmap is a fresh kernel mapping with its own lifecycle.

What looks good

  • Discovery + load are layered: a cheap existence probe followed by an explicit load_sidecar, with consistent silent-fail-to-base-only semantics on every transport.
  • GPU sidecar parity test (commit 55fb85c) confirms byte equality between CPU and GPU on both sidecar levels of the fixture.
  • HTTP + fsspec tests use a loopback server / file:// URI so the coverage exercises real transport without external network dependency.
  • close_sidecar helper centralises the close convention and keeps the call sites symmetric.

Apply every item from the re-review pass.

- Route the HTTP sidecar probe and download through ``_HTTPSource`` so
  both inherit the eager reader's defences: SSRF allow-list, IP
  pinning across redirect re-validation, shared urllib3 PoolManager
  with retries, and ``redirect=False`` with manual ``Location`` re-
  check. The probe uses a 1-byte Range GET rather than HEAD because
  HEAD support is inconsistent across CDNs / object stores. Removes
  the duplicated 10s urllib timeout path.

- Close the sidecar mmap deterministically in ``read_geotiff_gpu`` by
  moving the GPU decode into the same try/finally that owns the
  metadata parse. The previous "rely on GC" path leaked file
  descriptors on long-running processes opening many GeoTIFFs in
  sequence; the close now runs after the decode has finished reading
  ``data``.

- Replace the IFD-mutating ``attach_sidecar_origin`` (``object.__set
  attr__`` on every parsed IFD) with a return value -- a request-
  scoped ``dict[id(ifd), (data, header)]`` -- so a future caller that
  caches the parsed IFD list does not inherit stale ``_source_data`` /
  ``_source_header`` attributes from a previous request.

- Narrow ``SidecarOverviews.data`` from ``object`` to a new
  ``SidecarBuffer = Union[mmap.mmap, bytes]`` alias so the reader sees
  the actual buffer variants. Carry the alias on
  ``attach_sidecar_origin``'s signature too.

- Drop the duplicate ``_is_fsspec_uri`` from ``_sidecar`` and import
  the helper from ``_reader``. ``_reader`` imports ``_sidecar`` lazily
  inside its function bodies, so the top-level reverse import does
  not form a cycle at module load time.

- Tests:
  - Add ``test_find_sidecar_http_probe_rejects_loopback_without_env_override``
    to pin the SSRF guard's silent-fail-to-base contract.
  - Existing loopback-server probes opt into ``XRSPATIAL_GEOTIFF_ALLOW_
    PRIVATE_HOSTS=1`` (same pattern other HTTP tests use) so they
    keep passing through the new SSRF-gated probe.

Verified: full ``xrspatial/geotiff/tests/`` suite 4046 passed, 25
skipped.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the re-review (commit 722528e).

Item Disposition
HEAD probe / HTTP load bypass urllib3 pool Fixed -- both _probe_http and load_sidecar HTTP branch now route through _HTTPSource. Sidecar fetches reuse the eager reader's _get_http_pool urllib3 PoolManager, share retries, and inherit the manual redirect re-validation.
HEAD probe missing private-host guard Fixed -- _HTTPSource.__init__ runs _validate_http_url (SSRF allow-list + IP pinning) before the connection opens. A loopback probe with no XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1 now returns None silently, matching the rest of the discovery contract. New regression test test_find_sidecar_http_probe_rejects_loopback_without_env_override pins this.
HEAD probe follows redirects implicitly Closed by the _HTTPSource routing -- the source uses redirect=False and re-runs _validate_http_url on every Location header, so a 302 from a public host to a private one is rejected the same way the eager reader rejects it. Also swapped HEAD for a 1-byte Range GET so the probe works on CDNs / object stores with inconsistent HEAD support.
GPU sidecar mmap relied on GC Fixed -- restructured read_geotiff_gpu so the GPU decode runs inside the same try/finally that owns the metadata parse. close_sidecar now runs deterministically after the decode finishes touching the buffer. File-descriptor count is bounded for long-running processes. The "this is approximate equivalence with data's lifetime" comment is gone with the GC path.
attach_sidecar_origin mutated IFDs via object.__setattr__ Fixed -- the helper now returns a dict[id(ifd), (data, header)] side map per call. Reader and GPU paths look up sidecar_origin.get(id(ifd), (data, header)) to resolve the buffer. No IFD instance is mutated; a future caller that caches parsed IFDs across requests inherits a clean object.
SidecarOverviews.data: object Narrowed to SidecarBuffer = Union[mmap.mmap, bytes]. The new alias also annotates attach_sidecar_origin.
_is_fsspec_uri duplicated Removed the local copy; _sidecar.py now imports the helper from _reader. _reader imports _sidecar lazily inside function bodies so the top-level reverse import does not form a cycle at module load.
Outdated GC-lifetime comment in _backends/gpu.py Gone with the deterministic close.

Full xrspatial/geotiff/tests/ suite: 4046 passed, 25 skipped.

Resolves the conflict in xrspatial/geotiff/_backends/gpu.py where #2113
(per-tile GPU TileByteCounts cap) and this branch both modify the
``read_geotiff_gpu`` decode setup region.

Resolution:

- Keep the per-tile cap loop from #2113. It enforces the same
  ``XRSPATIAL_COG_MAX_TILE_BYTES`` budget the CPU paths
  (``_read_tiles``, ``_fetch_decode_cog_http_tiles``) already apply
  per tile, so the GPU pipeline no longer accepts crafted
  ``TileByteCounts`` entries the CPU side would refuse. The loop
  runs inside the extended try/finally introduced for sidecar mmap
  cleanup, so a failure here still releases ``src`` and any
  ``sidecar`` mmap.

- Keep the sidecar refactor's single function-end ``finally`` that
  closes ``src`` and (if present) the sidecar mmap together. Do not
  re-introduce the earlier mid-function ``finally: src.close()`` that
  #2113 carried, because the GPU decode below still slices ``data``
  -- which is the sidecar mmap when the selected IFD lives there --
  and closing early would invalidate the buffer.

- Ordering: byte cap check first (safety guard), then
  ``has_sparse_tile`` / ``masked_fill`` setup. Both iterate
  ``byte_counts`` independently; keeping the safety check first
  means a malformed file fails before any decode-prep allocation.

Verified: full ``xrspatial/geotiff/tests/`` suite passes
(4208 passed, 25 skipped) including the new
``test_gpu_tile_byte_cap_2026_05_18.py`` from main and the existing
sidecar regression tests.
@brendancol brendancol merged commit 68574fe into main May 19, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read external .tif.ovr sidecar overviews

1 participant