diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv
index ea927b19a..324935dd7 100644
--- a/.claude/sweep-performance-state.csv
+++ b/.claude/sweep-performance-state.csv
@@ -18,7 +18,7 @@ fire,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 flood,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 focal,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 geodesic,2026-03-31T18:00:00Z,N/A,compute-bound,0,,
-geotiff,2026-05-12,SAFE,IO-bound,1,1688,"Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
+geotiff,2026-05-12,SAFE,IO-bound,0,1713,"Pass 7 (2026-05-12): re-audit identified 4 MEDIUM findings, all real, all backed by microbenches. (1) unpack_bits sub-byte loops for bps=2/4/12 in _compression.py:836-878 were 100-200x slower than vectorised numpy (filed #1713, fixed in this branch: bps=4 2M pixels drops from 165ms to 3ms = 55x; bps=2/12 similar). (2) _write_vrt_tiled at __init__.py:1708 uses scheduler='synchronous' on independent tile writes; measured 33% slowdown on 256-tile zstd write vs threads scheduler (filed #1714, no fix yet). (3) _nvcomp_batch_compress at _gpu_decode.py:2522-2526 still does per-tile cupy.get().tobytes() despite #1552 / #1659 fixing the same pattern elsewhere; measured 45% reduction with concat+single get on n=1024 (filed #1712, no fix yet). (4) _nvcomp_batch_compress at _gpu_decode.py:2457 uses per-tile cupy.empty allocations; 1024 tiles 16KB drops from 4.7ms to 1.0ms with single contiguous + views (bundled into #1712). Cat 6 OOM verdict: SAFE/IO-bound holds -- read_geotiff_dask caps task count at _MAX_DASK_CHUNKS=50_000 and per-chunk memory is bounded by chunk size. _inflate_tiles_kernel resource usage on Ampere: 67 regs/thread, 2896B local/thread, 8192B shared/block (LZW kernel: 29 regs, 24576B shared) -- register pressure under control; high local memory in inflate is unavoidable (LZ77 state) but only thread 0 in each block uses it. | Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
 glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
 hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
 hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."
diff --git a/xrspatial/geotiff/_compression.py b/xrspatial/geotiff/_compression.py
index 95a123a11..3be9400c4 100644
--- a/xrspatial/geotiff/_compression.py
+++ b/xrspatial/geotiff/_compression.py
@@ -834,44 +834,89 @@ def unpack_bits(data: np.ndarray, bps: int, pixel_count: int) -> np.ndarray:
         out = np.unpackbits(data)[:pixel_count]
         return out.astype(np.uint8)
     elif bps == 2:
-        # 4 pixels per byte, MSB-first
+        # 4 pixels per byte, MSB-first. Vectorised across the packed
+        # byte array so a large tile decodes in one pass instead of N
+        # Python iterations (issue #1713). The strided assignments
+        # below mirror the original per-byte loop's write pattern,
+        # including the contract that positions beyond
+        # ``len(data) * 4`` are left as the ``np.empty`` initial
+        # garbage value -- matching the prior behaviour for short
+        # input buffers.
         out = np.empty(pixel_count, dtype=np.uint8)
-        for i in range(min(len(data), (pixel_count + 3) // 4)):
-            b = data[i]
-            base = i * 4
-            if base < pixel_count:
-                out[base] = (b >> 6) & 0x03
-            if base + 1 < pixel_count:
-                out[base + 1] = (b >> 4) & 0x03
-            if base + 2 < pixel_count:
-                out[base + 2] = (b >> 2) & 0x03
-            if base + 3 < pixel_count:
-                out[base + 3] = b & 0x03
+        n_full_bytes = min(len(data), (pixel_count + 3) // 4)
+        if n_full_bytes > 0:
+            raw = data[:n_full_bytes]
+            written = n_full_bytes * 4
+            limit = min(written, pixel_count)
+            # Compute each of the four bit-slots for every covered byte
+            # then write the prefix that fits into ``out``.
+            slot0 = (raw >> 6) & np.uint8(0x03)
+            slot1 = (raw >> 4) & np.uint8(0x03)
+            slot2 = (raw >> 2) & np.uint8(0x03)
+            slot3 = raw & np.uint8(0x03)
+            full_idx = limit - (limit % 4)
+            if full_idx > 0:
+                quart = full_idx // 4
+                out[0:full_idx:4] = slot0[:quart]
+                out[1:full_idx:4] = slot1[:quart]
+                out[2:full_idx:4] = slot2[:quart]
+                out[3:full_idx:4] = slot3[:quart]
+            # Handle the tail (1 to 3 elements past the last full quad)
+            tail = limit - full_idx
+            tail_byte = full_idx // 4
+            if tail >= 1:
+                out[full_idx] = slot0[tail_byte]
+            if tail >= 2:
+                out[full_idx + 1] = slot1[tail_byte]
+            if tail >= 3:
+                out[full_idx + 2] = slot2[tail_byte]
         return out
     elif bps == 4:
-        # 2 pixels per byte, high nibble first
+        # 2 pixels per byte, high nibble first. Vectorised across the
+        # packed byte array (issue #1713). Positions beyond
+        # ``len(data) * 2`` keep the ``np.empty`` initial value,
+        # matching the prior per-byte loop's contract for short
+        # input buffers.
         out = np.empty(pixel_count, dtype=np.uint8)
-        for i in range(min(len(data), (pixel_count + 1) // 2)):
-            b = data[i]
-            base = i * 2
-            if base < pixel_count:
-                out[base] = (b >> 4) & 0x0F
-            if base + 1 < pixel_count:
-                out[base + 1] = b & 0x0F
+        n_full_bytes = min(len(data), (pixel_count + 1) // 2)
+        if n_full_bytes > 0:
+            raw = data[:n_full_bytes]
+            written = n_full_bytes * 2
+            limit = min(written, pixel_count)
+            hi = (raw >> 4) & np.uint8(0x0F)
+            lo = raw & np.uint8(0x0F)
+            full_idx = limit - (limit % 2)
+            if full_idx > 0:
+                pairs = full_idx // 2
+                out[0:full_idx:2] = hi[:pairs]
+                out[1:full_idx:2] = lo[:pairs]
+            if limit - full_idx >= 1:
+                out[full_idx] = hi[full_idx // 2]
         return out
     elif bps == 12:
-        # 2 pixels per 3 bytes, MSB-first
+        # 2 pixels per 3 bytes, MSB-first. Vectorised across the
+        # packed byte array (issue #1713). Pairs whose third byte is
+        # past ``len(data)`` are skipped, preserving the prior
+        # ``np.empty`` semantics for short input buffers.
         out = np.empty(pixel_count, dtype=np.uint16)
         n_pairs = pixel_count // 2
         remainder = pixel_count % 2
-        for i in range(n_pairs):
-            off = i * 3
-            if off + 2 < len(data):
-                b0 = int(data[off])
-                b1 = int(data[off + 1])
-                b2 = int(data[off + 2])
-                out[i * 2] = (b0 << 4) | (b1 >> 4)
-                out[i * 2 + 1] = ((b1 & 0x0F) << 8) | b2
+        if n_pairs > 0:
+            # Only those triples that fit entirely within ``data`` get
+            # written. The prior code used ``off + 2 < len(data)``
+            # (strict less than), which is satisfied when ``3i + 2 <
+            # len(data)``. Solving for the largest valid ``i`` gives
+            # ``i_max = (len(data) - 3) // 3``, so the pair count cap
+            # is ``i_max + 1 = len(data) // 3``.
+            max_pairs = len(data) // 3
+            usable_pairs = min(n_pairs, max_pairs)
+            if usable_pairs > 0:
+                raw = data[:usable_pairs * 3].astype(np.uint16)
+                b0 = raw[0::3]
+                b1 = raw[1::3]
+                b2 = raw[2::3]
+                out[0:usable_pairs * 2:2] = (b0 << 4) | (b1 >> 4)
+                out[1:usable_pairs * 2:2] = ((b1 & 0x0F) << 8) | b2
         if remainder and n_pairs * 3 + 1 < len(data):
             off = n_pairs * 3
             out[pixel_count - 1] = (int(data[off]) << 4) | (int(data[off + 1]) >> 4)
diff --git a/xrspatial/geotiff/tests/test_unpack_bits_vectorised_1713.py b/xrspatial/geotiff/tests/test_unpack_bits_vectorised_1713.py
new file mode 100644
index 000000000..746fe9346
--- /dev/null
+++ b/xrspatial/geotiff/tests/test_unpack_bits_vectorised_1713.py
@@ -0,0 +1,197 @@
+"""Regression coverage for the vectorised ``unpack_bits`` (issue #1713).
+
+The pre-vectorisation implementation walked the packed input byte by byte
+in Python, which was ~100x slower than the numpy-strided equivalent on
+realistic tile sizes. These tests pin the bit-for-bit equivalence
+between the new code and an in-line reference implementation across the
+sub-byte BitsPerSample values the reader recognises (``{1, 2, 4, 12}``),
+including the short-input-buffer corner cases the original loops
+handled.
+"""
+import numpy as np
+import pytest
+
+from xrspatial.geotiff._compression import unpack_bits
+
+
+def _reference_unpack_bits(data: np.ndarray, bps: int,
+                           pixel_count: int) -> np.ndarray:
+    """Bit-for-bit copy of the original loop-based implementation.
+
+    Kept here (rather than imported) so the test survives any future
+    deletion of the loop-based code path.
+    """
+    if bps == 1:
+        out = np.unpackbits(data)[:pixel_count]
+        return out.astype(np.uint8)
+    if bps == 2:
+        out = np.empty(pixel_count, dtype=np.uint8)
+        for i in range(min(len(data), (pixel_count + 3) // 4)):
+            b = data[i]
+            base = i * 4
+            if base < pixel_count:
+                out[base] = (b >> 6) & 0x03
+            if base + 1 < pixel_count:
+                out[base + 1] = (b >> 4) & 0x03
+            if base + 2 < pixel_count:
+                out[base + 2] = (b >> 2) & 0x03
+            if base + 3 < pixel_count:
+                out[base + 3] = b & 0x03
+        return out
+    if bps == 4:
+        out = np.empty(pixel_count, dtype=np.uint8)
+        for i in range(min(len(data), (pixel_count + 1) // 2)):
+            b = data[i]
+            base = i * 2
+            if base < pixel_count:
+                out[base] = (b >> 4) & 0x0F
+            if base + 1 < pixel_count:
+                out[base + 1] = b & 0x0F
+        return out
+    if bps == 12:
+        out = np.empty(pixel_count, dtype=np.uint16)
+        n_pairs = pixel_count // 2
+        remainder = pixel_count % 2
+        for i in range(n_pairs):
+            off = i * 3
+            if off + 2 < len(data):
+                b0 = int(data[off])
+                b1 = int(data[off + 1])
+                b2 = int(data[off + 2])
+                out[i * 2] = (b0 << 4) | (b1 >> 4)
+                out[i * 2 + 1] = ((b1 & 0x0F) << 8) | b2
+        if remainder and n_pairs * 3 + 1 < len(data):
+            off = n_pairs * 3
+            out[pixel_count - 1] = (
+                (int(data[off]) << 4) | (int(data[off + 1]) >> 4)
+            )
+        return out
+    raise ValueError(f"Unsupported bps: {bps}")
+
+
+def _written_positions(bps: int, pixel_count: int, data_len: int) -> set:
+    """Return the indices the original loop *wrote to*.
+
+    The pre-vectorisation code used ``np.empty`` and then conditionally
+    skipped writes when the input buffer was too short. The new code
+    must agree on positions that the old code wrote; positions that
+    were never written are pure ``np.empty`` garbage and must not be
+    compared. This helper enumerates the written set so the test can
+    stick to it.
+    """
+    positions: set[int] = set()
+    if bps == 2:
+        n_bytes = min(data_len, (pixel_count + 3) // 4)
+        for i in range(n_bytes):
+            base = i * 4
+            for n in range(4):
+                if base + n < pixel_count:
+                    positions.add(base + n)
+    elif bps == 4:
+        n_bytes = min(data_len, (pixel_count + 1) // 2)
+        for i in range(n_bytes):
+            base = i * 2
+            for n in range(2):
+                if base + n < pixel_count:
+                    positions.add(base + n)
+    elif bps == 12:
+        n_pairs = pixel_count // 2
+        for i in range(n_pairs):
+            off = i * 3
+            if off + 2 < data_len:
+                positions.add(i * 2)
+                positions.add(i * 2 + 1)
+        rem = pixel_count % 2
+        if rem and n_pairs * 3 + 1 < data_len:
+            positions.add(pixel_count - 1)
+    elif bps == 1:
+        # bps=1 covers every position via np.unpackbits.
+        positions.update(range(pixel_count))
+    return positions
+
+
+# ----------------------------------------------------------------------
+# Equivalence to the reference implementation across all sub-byte bps
+# ----------------------------------------------------------------------
+
+@pytest.mark.parametrize("bps", [2, 4, 12])
+@pytest.mark.parametrize("pixel_count", [0, 1, 2, 3, 4, 7, 8, 100, 10_000])
+@pytest.mark.parametrize("data_factor", [0.0, 0.5, 1.0, 1.5, 2.0])
+def test_vectorised_matches_reference(bps, pixel_count, data_factor):
+    """Vectorised output equals the original for every covered position."""
+    if bps == 2:
+        bytes_per_pixel = 0.25
+    elif bps == 4:
+        bytes_per_pixel = 0.5
+    else:  # bps == 12
+        bytes_per_pixel = 1.5
+
+    required = int(np.ceil(pixel_count * bytes_per_pixel))
+    n_bytes = max(0, int(required * data_factor))
+    rng = np.random.default_rng(seed=bps * 10_000 + pixel_count)
+    data = rng.integers(0, 256, size=n_bytes, dtype=np.uint8)
+
+    ref = _reference_unpack_bits(data, bps, pixel_count)
+    new = unpack_bits(data, bps, pixel_count)
+
+    assert ref.shape == new.shape
+    assert ref.dtype == new.dtype
+
+    for p in _written_positions(bps, pixel_count, len(data)):
+        assert ref[p] == new[p], (
+            f"bps={bps} pc={pixel_count} data_factor={data_factor}: "
+            f"position {p} differs ref={ref[p]} new={new[p]}"
+        )
+
+
+# ----------------------------------------------------------------------
+# Spot-check that bps=1 still returns the unpacked bit stream
+# ----------------------------------------------------------------------
+
+def test_bps1_unchanged():
+    """bps=1 still routes through ``np.unpackbits`` and returns uint8."""
+    data = np.array([0b10101100, 0b00001111], dtype=np.uint8)
+    out = unpack_bits(data, 1, 16)
+    assert out.dtype == np.uint8
+    np.testing.assert_array_equal(
+        out,
+        np.array([1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
+                 dtype=np.uint8),
+    )
+
+
+# ----------------------------------------------------------------------
+# Boundary case the original loop's strict-less-than guard cared about
+# ----------------------------------------------------------------------
+
+def test_bps12_three_byte_buffer_decodes_one_pair():
+    """bps=12 with exactly 3 input bytes writes the single pair.
+
+    The original loop's guard was ``off + 2 < len(data)``, which is
+    satisfied for ``off=0, len(data)=3``. The vectorised implementation
+    must keep the same decision and not skip the pair (this was the
+    boundary case that surfaced during the rewrite).
+    """
+    # Two 12-bit values: 0x123 and 0x456 packed MSB-first into 3 bytes.
+    data = np.array([0x12, 0x34, 0x56], dtype=np.uint8)
+    out = unpack_bits(data, 12, 2)
+    assert tuple(int(v) for v in out) == (0x123, 0x456)
+
+
+def test_bps12_two_byte_buffer_no_pair_decoded():
+    """A 2-byte buffer cannot satisfy ``off+2 < 2`` so no pair is written.
+
+    Mirrors the original loop semantics. ``np.empty`` initial garbage
+    at those positions is fine -- the test only asserts that the
+    function does not crash and returns an array of the right shape.
+    """
+    data = np.array([0x12, 0x34], dtype=np.uint8)
+    out = unpack_bits(data, 12, 2)
+    assert out.shape == (2,)
+    assert out.dtype == np.uint16
+
+
+def test_unsupported_bps_raises():
+    """Unknown sub-byte bps still raises a clear ValueError."""
+    with pytest.raises(ValueError, match="Unsupported"):
+        unpack_bits(np.zeros(10, dtype=np.uint8), 3, 10)