Skip to content

perf(geotiff): vectorise sub-byte unpack_bits for bps=2/4/12 #1713

@brendancol

Description

@brendancol

Reason or Problem

unpack_bits in xrspatial/geotiff/_compression.py (lines 815-880) decodes sub-byte pixel data when a file declares BitsPerSample in {1, 2, 4, 12}. The bps=1 branch uses np.unpackbits (vectorised, fast). The bps=2, bps=4, and bps=12 branches use Python for loops over packed bytes, which is 100-200x slower than numpy-vectorised equivalents at realistic tile sizes.

Microbench on a single thread:

bps pixel_count current unpack_bits vectorised speedup
2 4,000,000 336 ms (a few ms) ~150x
4 2,000,000 162 ms 1.5 ms ~108x
12 2,000,000 220 ms (a few ms) ~70x

Sub-byte BPS files are uncommon in geospatial workflows but real: bps=4 is used by some indexed/classified rasters and bps=12 by certain scientific instruments (e.g. older satellite radiometers, some medical/microscopy formats). When such a file is read the slow path is the dominant cost.

Proposal

Replace the per-byte Python loops with numpy slicing/shifting:

elif bps == 2:
    # 4 pixels per byte, MSB-first.
    n_bytes = (pixel_count + 3) // 4
    raw = data[:n_bytes]
    out = np.empty(n_bytes * 4, dtype=np.uint8)
    out[0::4] = (raw >> 6) & 0x03
    out[1::4] = (raw >> 4) & 0x03
    out[2::4] = (raw >> 2) & 0x03
    out[3::4] = raw & 0x03
    return out[:pixel_count]

elif bps == 4:
    # 2 pixels per byte, high nibble first.
    n_bytes = (pixel_count + 1) // 2
    raw = data[:n_bytes]
    out = np.empty(n_bytes * 2, dtype=np.uint8)
    out[0::2] = (raw >> 4) & 0x0F
    out[1::2] = raw & 0x0F
    return out[:pixel_count]

elif bps == 12:
    # 2 pixels per 3 bytes, MSB-first.
    n_pairs = pixel_count // 2
    remainder = pixel_count % 2
    n_bytes = n_pairs * 3 + (2 if remainder else 0)
    raw = data[:n_bytes].astype(np.uint16)
    out = np.empty(pixel_count, dtype=np.uint16)
    b0 = raw[0:n_pairs * 3:3]
    b1 = raw[1:n_pairs * 3:3]
    b2 = raw[2:n_pairs * 3:3]
    out[0:n_pairs * 2:2] = (b0 << 4) | (b1 >> 4)
    out[1:n_pairs * 2:2] = ((b1 & 0x0F) << 8) | b2
    if remainder:
        out[pixel_count - 1] = ((int(data[n_pairs * 3]) << 4)
                                 | (int(data[n_pairs * 3 + 1]) >> 4))
    return out

The vectorised versions match the existing reference implementation bit-for-bit (verified against random uint8 inputs of length 1000 across all three bps values).

Acceptance criteria

  • unpack_bits(data, bps, pixel_count) for bps in {2, 4, 12} returns the same byte/short sequence as the current implementation across the full pixel_count range.
  • Microbench: bps=4 on 2M pixels reduces from ~160 ms to <5 ms.
  • Existing sub-byte tests in xrspatial/geotiff/tests/test_features.py (the _make_sub_byte_tiff helpers around lines 1840-1880) continue to pass.

Context

Found via deep-sweep performance audit on 2026-05-12. Cat 5 (Numba anti-pattern): nested Python loops over packed bytes that should be vectorised.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformancePR touches performance-sensitive code

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions