Reason or Problem
xrspatial/geotiff/__init__.py is over 4000 lines and mixes:
- public API entry points (
open_geotiff, read_vrt, read_geotiff_dask, read_geotiff_gpu, to_geotiff, write_vrt, write_geotiff_gpu)
- dask backend orchestration
- GPU backend orchestration and host/device transitions
- writer dispatch and CRS resolution
- fallback policy (
on_gpu_failure, missing-source handling)
- coordinate construction from
GeoTransform
- dtype / window / band validation
Several recent parity bugs trace back to drift between branches inside this file:
The same kwarg or behaviour needs to be wired into 3+ places that have grown apart. Each parity fix is small individually, but the rate at which they arrive shows the file is past the size where the structure helps.
Proposal
Split __init__.py into smaller modules organised by responsibility:
_coords.py -- GeoTransform to (x, y) coord arrays, half-pixel shift for AREA vs POINT, window-aware origin offsets, transform attr construction. Currently inlined separately in every backend.
_validation.py -- _validate_chunks_arg, _validate_dtype_cast, band/window/sample-count checks. Already partially extracted into _reader.py; finish the move and let every backend call the same checker.
_backends/eager.py -- the local-file synchronous path (the body of today's open_geotiff non-VRT branch).
_backends/dask.py -- read_geotiff_dask.
_backends/gpu.py -- read_geotiff_gpu.
_backends/vrt.py -- read_vrt, with the eager and dask paths inside.
_backends/http.py -- the COG-over-HTTP path currently in _reader.py.
_writers/ -- to_geotiff, write_geotiff_gpu, write_vrt, plus the streaming strip writer.
__init__.py -- thin dispatch only: open_geotiff routing on suffix, public re-exports, version constants.
Add a parity-test matrix: one fixture file, all read paths, assert byte-identical pixel output and identical attrs/coords. Wire it into CI so a future kwarg drift is caught immediately rather than as a follow-up bug report.
Value:
- Reduces the failure mode that produced the recent parity-bug cluster.
- Makes each backend reviewable in isolation.
- Lets
chunks=, missing_sources=, on_gpu_failure=, dtype= etc. be wired into one place rather than every backend's signature.
Stakeholders and Impacts
- All downstream callers of the public API. Public signatures stay identical; the split is internal.
- CI test runtime grows by the parity-matrix tests. Probably small (one fixture, ~5 backends).
Drawbacks
- A multi-PR refactor. Each module split lands separately, with the parity tests serving as the contract for "no regression".
- Git history for the moved code becomes a stack of moves, harder to
git blame across.
Alternatives
- Keep the current layout and rely on review discipline. The pattern of recent parity bugs is the argument against this.
- Auto-generate per-backend wrappers from a single source-of-truth signature. Heavier machinery than warranted.
Unresolved Questions
- Whether
_backends should be a private subpackage or a flat module per backend at xrspatial.geotiff root.
- Whether to migrate one backend at a time (preserving the old
__init__.py until the last backend is moved) or all at once.
Additional Notes or Context
This is a tracking issue, not a PR. The split is multi-PR scope; individual PRs should reference this issue.
Reason or Problem
xrspatial/geotiff/__init__.pyis over 4000 lines and mixes:open_geotiff,read_vrt,read_geotiff_dask,read_geotiff_gpu,to_geotiff,write_vrt,write_geotiff_gpu)on_gpu_failure, missing-source handling)GeoTransformSeveral recent parity bugs trace back to drift between branches inside this file:
crs_wktdeprecation needing to be reapplied per writer (geotiff: write_vrt's crs_wkt kwarg drifts from to_geotiff/write_geotiff_gpu's crs #1715)chunksvalidation missing on the VRT branch until geotiff: extend #1752 size-arg validation to write_geotiff_gpu, read_geotiff_gpu, read_vrt #1776missing_sourcesreachingread_vrtbut notopen_geotiff(geotiff: open_geotiff silently drops missing_sources when routing to read_vrt #1810)The same kwarg or behaviour needs to be wired into 3+ places that have grown apart. Each parity fix is small individually, but the rate at which they arrive shows the file is past the size where the structure helps.
Proposal
Split
__init__.pyinto smaller modules organised by responsibility:_coords.py--GeoTransformto (x, y) coord arrays, half-pixel shift for AREA vs POINT, window-aware origin offsets,transformattr construction. Currently inlined separately in every backend._validation.py--_validate_chunks_arg,_validate_dtype_cast, band/window/sample-count checks. Already partially extracted into_reader.py; finish the move and let every backend call the same checker._backends/eager.py-- the local-file synchronous path (the body of today'sopen_geotiffnon-VRT branch)._backends/dask.py--read_geotiff_dask._backends/gpu.py--read_geotiff_gpu._backends/vrt.py--read_vrt, with the eager and dask paths inside._backends/http.py-- the COG-over-HTTP path currently in_reader.py._writers/--to_geotiff,write_geotiff_gpu,write_vrt, plus the streaming strip writer.__init__.py-- thin dispatch only:open_geotiffrouting on suffix, public re-exports, version constants.Add a parity-test matrix: one fixture file, all read paths, assert byte-identical pixel output and identical attrs/coords. Wire it into CI so a future kwarg drift is caught immediately rather than as a follow-up bug report.
Value:
chunks=,missing_sources=,on_gpu_failure=,dtype=etc. be wired into one place rather than every backend's signature.Stakeholders and Impacts
Drawbacks
git blameacross.Alternatives
Unresolved Questions
_backendsshould be a private subpackage or a flat module per backend atxrspatial.geotiffroot.__init__.pyuntil the last backend is moved) or all at once.Additional Notes or Context
This is a tracking issue, not a PR. The split is multi-PR scope; individual PRs should reference this issue.