
geotiff: bytes(bytearray) at end of _assemble_*_layout doubles peak memory #1756

@brendancol

Description


What

_assemble_standard_layout and _assemble_cog_layout in xrspatial/geotiff/_writer.py build up the output TIFF file in a bytearray, then return bytes(output) at the end. The bytes(...) call copies the entire bytearray contents, transiently doubling peak memory.

Lines:

  • xrspatial/geotiff/_writer.py:1035 (_assemble_standard_layout)
  • xrspatial/geotiff/_writer.py:1118 (_assemble_cog_layout)
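The copy is easy to observe in isolation, independent of xrspatial. A minimal sketch (the 50 MB buffer is an arbitrary stand-in for the assembled TIFF):

```python
import tracemalloc

# Stand-in for the assembled output buffer.
buf = bytearray(50 * 1024 * 1024)  # 50 MB

tracemalloc.start()
copy = bytes(buf)  # allocates a second full-size buffer
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Peak traced allocation is ~50 MB: the bytes() copy, on top of buf itself.
print(f"extra peak from bytes(): {peak / (1024 * 1024):.1f} MB")
```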

Why it matters

The eager (non-streaming) writer materializes the full output buffer in memory. The bytearray-to-bytes conversion at the end of the assembly path doubles peak memory transiently. Measured:

output (bytearray) size: 95.37 MB
bytes(output) peak: 95.37 MB extra

So a 1 GB TIFF write adds 1 GB of peak memory beyond the bytearray itself. For users writing large COGs from memory, this can push a borderline-OK write into OOM territory.

Fix

Return the bytearray directly. _write_bytes calls f.write(file_bytes) which accepts any buffer-protocol object, and the post-write parse_header(file_bytes[:16]) validation also accepts bytearray slicing. The bytes(...) call is a pure copy with no contract benefit.
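A sketch of why dropping the conversion is safe. `_write_bytes` and `parse_header` are the issue's names for internals; here `io.BytesIO` stands in for the real file handle and the byte contents are placeholders:

```python
import io

# Stand-in for the assembled TIFF buffer (little-endian magic + padding).
output = bytearray(b"II*\x00" + b"\x00" * 1024)

f = io.BytesIO()
f.write(output)        # write() takes any buffer-protocol object; no copy needed
header = output[:16]   # slicing a bytearray works for the post-write validation

# The slice is itself a bytearray, which header-parsing code can index freely.
assert isinstance(header, bytearray)
```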

The streaming writer (write_streaming) already writes pixel data directly to the file handle and does not have this issue.
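For contrast, the streaming shape keeps peak memory at one chunk rather than the whole file. This is an illustrative sketch, not the actual write_streaming implementation:

```python
import io

def stream_tiles(f, tiles):
    """Write each tile directly to the file handle; only one tile is live at a time."""
    total = 0
    for tile in tiles:
        total += f.write(tile)  # tile is freed after writing; no whole-file buffer
    return total

f = io.BytesIO()
written = stream_tiles(f, (bytes(4096) for _ in range(16)))
print(written)  # 65536
```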

Reproduction

import io
import tracemalloc

import numpy as np
import xarray as xr

from xrspatial.geotiff import to_geotiff

arr = xr.DataArray(np.random.randint(0, 255, (10000, 10000), dtype=np.uint8), dims=['y', 'x'])
out = io.BytesIO()
tracemalloc.start()
to_geotiff(arr, out, compression='none', tiled=False)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"output size {len(out.getvalue())/(1024*1024):.1f} MB, peak {peak/(1024*1024):.1f} MB")

Observed: peak ~3x the output size (the bytearray, the bytes(...) copy, and the BytesIO buffer each hold a full copy). After the fix: peak ~2x the output size (bytearray plus BytesIO).

Severity

MEDIUM. Real measurable memory overhead, easy fix, narrow blast radius.

Metadata


    Labels

    bug (Something isn't working), oom (Out-of-memory risk with large datasets), performance (PR touches performance-sensitive code)
