Skip to content

read_vrt drops integer-source nodata when VRT declares a float dataType #1616

@brendancol

Description

@brendancol

Describe the bug

When a VRT XML declares dataType="Float32" or "Float64" but the underlying source GeoTIFFs are integer with an in-range GDAL_NODATA sentinel, read_vrt leaves the sentinel pixels as literal float values instead of NaN-masking them. The result is a float array whose attrs['nodata'] advertises 65535.0, but pixels at the sentinel positions still contain 65535.0 rather than np.nan. NaN-aware downstream code (np.isnan, xr.where, slope/aspect, statistics) sees those pixels as valid data and gets wrong answers.

Repro

import numpy as np, tempfile, os
from xrspatial.geotiff import read_vrt
from xrspatial.geotiff._writer import write

band0 = np.array([[1, 2], [3, 65535]], dtype=np.uint16)
with tempfile.TemporaryDirectory() as td:
    p0 = os.path.join(td, 'b0.tif')
    write(band0, p0, nodata=65535, compression='none', tiled=False)

    vrt = os.path.join(td, 'mismatch.vrt')
    with open(vrt, 'w') as f:
        f.write(f'''<VRTDataset rasterXSize="2" rasterYSize="2">
  <GeoTransform>0.0, 1.0, 0.0, 0.0, 0.0, -1.0</GeoTransform>
  <VRTRasterBand dataType="Float32" band="1">
    <NoDataValue>65535</NoDataValue>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">{p0}</SourceFilename>
      <SourceBand>1</SourceBand>
      <SrcRect xOff="0" yOff="0" xSize="2" ySize="2"/>
      <DstRect xOff="0" yOff="0" xSize="2" ySize="2"/>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>''')

    r = read_vrt(vrt)
    print(r.dtype)                    # float32
    print(r.attrs['nodata'])          # 65535.0
    print(r.values[1, 1])             # 65535.0 -- should be nan
    print(np.isnan(r.values[1, 1]))   # False

Expected behavior

r.values[1, 1] should be np.nan after the read, matching the eager and dask paths for single-file int rasters with an in-range sentinel. attrs['nodata'] should still surface the original sentinel for downstream code that needs to round-trip on write.

Root cause

_vrt.read_vrt in xrspatial/geotiff/_vrt.py NaN-masks src_arr only when src_arr.dtype.kind == 'f' (lines 371-374). When the source is integer with a sentinel, the int sentinel pixels are assigned into the float result buffer at lines 393-398, where the cast turns them into literal float sentinel values. The outer read_vrt in xrspatial/geotiff/__init__.py then runs its int-to-float promotion only when arr.dtype.kind in ('u', 'i') (line 3059), so a float-typed VRT result never gets its sentinel pixels masked.

Int source + int (or wider int) dataType works fine because the outer promotion fires. Float source files get masked inline. Only the int source + float dataType combination leaks the sentinel.

Scope

  • numpy eager (read_vrt(...)): broken
  • dask (read_vrt(..., chunks=...)): broken; chunking happens after the eager decode so the bug propagates
  • gpu (read_vrt(..., gpu=True)): broken; cupy upload happens after the eager decode
  • write side: unaffected

Suggested fix

In _vrt.read_vrt, NaN-mask integer source arrays before placing them into the float result. Use the same in-range gating the outer read_vrt applies so out-of-range sentinels stay no-op rather than tripping OverflowError.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions