Skip to content

geotiff: read_vrt(chunks=...) materialises the full mosaic before chunking #1814

@brendancol

Description

@brendancol

Summary

read_vrt(chunks=...) reads the entire VRT mosaic into a numpy array first, then wraps the result in a dask array via .chunk(). The chunks kwarg provides no memory protection. Setting gpu=True, chunks=... is worse: the full mosaic is assembled on host RAM before cupy.asarray() copies it across to the device.

Where

xrspatial/geotiff/__init__.py:3879-3882 calls _read_vrt_internal(...) eagerly:

arr, vrt = _read_vrt_internal(
    source, window=window, band=band, max_pixels=max_pixels,
    missing_sources=missing_sources,
)

Then xrspatial/geotiff/__init__.py:4074-4079 chunks the already-materialised array:

if chunks is not None:
    ...
    result = result.chunk(chunk_dict)

read_geotiff_dask (regular TIFFs) takes a different path that constructs delayed per-tile reads, so the bug is specific to the VRT branch.

Repro

A VRT covering a few terabytes of source data with chunks=(1024, 1024) will OOM before any compute is requested, even though the intent of chunks= is exactly to avoid that.

Proposed fix

Build a real dask graph in the chunked path: for each destination chunk, schedule a delayed call into _read_vrt_internal with that chunk's destination window. Each task reads only the sources that intersect its window.

Implementation notes:

  • A task-count cap is needed. A small chunks value over a multi-million-pixel VRT would otherwise generate millions of tasks and stall the scheduler. Reuse the convention from read_geotiff_dask.
  • The eager path (chunks=None) stays as is.
  • gpu=True, chunks=... should produce a dask-cupy array where each block is materialised directly on the device, avoiding the host-side mosaic step.
  • The vrt.holes book-keeping needs to merge per-task hole lists into the final DataArray attrs.

Tests

  • Regression: build a VRT covering a region wider than the chunk size, call read_vrt(..., chunks=(64, 64)), assert result.data.dask has multiple tasks before any compute and that no full materialisation occurs (e.g. by checking that the underlying VRT reader is called once per chunk rather than once for the whole mosaic).
  • Cross-backend parity with the eager path: compute the chunked result and compare to the eager result for the same VRT.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingoomOut-of-memory risk with large datasetsperformancePR touches performance-sensitive code

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions