perf(geotiff): vectorise sub-byte unpack_bits for bps=2/4/12

## Reason or Problem

`unpack_bits` in `xrspatial/geotiff/_compression.py` (lines 815-880) decodes sub-byte pixel data when a file declares `BitsPerSample` in `{1, 2, 4, 12}`. The bps=1 branch uses `np.unpackbits` (vectorised, fast). The bps=2, bps=4, and bps=12 branches use Python `for` loops over packed bytes, which is 100-200x slower than numpy-vectorised equivalents at realistic tile sizes.

Microbench on a single thread:

| bps | pixel_count | current `unpack_bits` | vectorised | speedup |
|----:|------------:|----------------------:|-----------:|--------:|
| 2   | 4,000,000   | 336 ms                | (a few ms) | ~150x   |
| 4   | 2,000,000   | 162 ms                | 1.5 ms     | ~108x   |
| 12  | 2,000,000   | 220 ms                | (a few ms) | ~70x    |

Sub-byte BPS files are uncommon in geospatial workflows but real: `bps=4` is used by some indexed/classified rasters and `bps=12` by certain scientific instruments (e.g. older satellite radiometers, some medical/microscopy formats). When such a file is read the slow path is the dominant cost.

## Proposal

Replace the per-byte Python loops with numpy slicing/shifting:

```python
elif bps == 2:
    # 4 pixels per byte, MSB-first.
    n_bytes = (pixel_count + 3) // 4
    raw = data[:n_bytes]
    out = np.empty(n_bytes * 4, dtype=np.uint8)
    out[0::4] = (raw >> 6) & 0x03
    out[1::4] = (raw >> 4) & 0x03
    out[2::4] = (raw >> 2) & 0x03
    out[3::4] = raw & 0x03
    return out[:pixel_count]

elif bps == 4:
    # 2 pixels per byte, high nibble first.
    n_bytes = (pixel_count + 1) // 2
    raw = data[:n_bytes]
    out = np.empty(n_bytes * 2, dtype=np.uint8)
    out[0::2] = (raw >> 4) & 0x0F
    out[1::2] = raw & 0x0F
    return out[:pixel_count]

elif bps == 12:
    # 2 pixels per 3 bytes, MSB-first.
    n_pairs = pixel_count // 2
    remainder = pixel_count % 2
    n_bytes = n_pairs * 3 + (2 if remainder else 0)
    raw = data[:n_bytes].astype(np.uint16)
    out = np.empty(pixel_count, dtype=np.uint16)
    b0 = raw[0:n_pairs * 3:3]
    b1 = raw[1:n_pairs * 3:3]
    b2 = raw[2:n_pairs * 3:3]
    out[0:n_pairs * 2:2] = (b0 << 4) | (b1 >> 4)
    out[1:n_pairs * 2:2] = ((b1 & 0x0F) << 8) | b2
    if remainder:
        out[pixel_count - 1] = ((int(data[n_pairs * 3]) << 4)
                                 | (int(data[n_pairs * 3 + 1]) >> 4))
    return out
```

The vectorised versions match the existing reference implementation bit-for-bit (verified against random uint8 inputs of length 1000 across all three bps values).

## Acceptance criteria

- `unpack_bits(data, bps, pixel_count)` for `bps` in `{2, 4, 12}` returns the same byte/short sequence as the current implementation across the full `pixel_count` range.
- Microbench: bps=4 on 2M pixels reduces from ~160 ms to <5 ms.
- Existing sub-byte tests in `xrspatial/geotiff/tests/test_features.py` (the `_make_sub_byte_tiff` helpers around lines 1840-1880) continue to pass.

## Context

Found via deep-sweep performance audit on 2026-05-12. Cat 5 (Numba anti-pattern): nested Python loops over packed bytes that should be vectorised.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(geotiff): vectorise sub-byte unpack_bits for bps=2/4/12 #1713

Reason or Problem

Proposal

Acceptance criteria

Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

bps	pixel_count	current `unpack_bits`	vectorised	speedup
2	4,000,000	336 ms	(a few ms)	~150x
4	2,000,000	162 ms	1.5 ms	~108x
12	2,000,000	220 ms	(a few ms)	~70x

perf(geotiff): vectorise sub-byte unpack_bits for bps=2/4/12 #1713

Description

Reason or Problem

Proposal

Acceptance criteria

Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions