bucket[N] transform computes incorrect hash for decimal values #1981

@yshcz

Apache Iceberg Rust version

None

Describe the bug

A value like -1 has multiple valid two's complement representations:

  • 1 byte: 0xFF
  • 2 bytes: 0xFF 0xFF
  • 3 bytes: 0xFF 0xFF 0xFF
  • and so on

and the spec requires using the minimum representation for hashing:

Decimal values are hashed using the minimum number of bytes required to hold the unscaled value as a two's complement big-endian

The current implementation converts i128 to bytes using https://doc.rust-lang.org/std/primitive.i128.html#method.to_be_bytes, which always returns a fixed [u8; 16] array. It then tries to find the minimum representation by skipping leading 0x00 bytes:

  let bytes = v.to_be_bytes();  // always 16 bytes
  if let Some(start) = bytes.iter().position(|&x| x != 0) {
      Self::hash_bytes(&bytes[start..])
  } ...

For -1, to_be_bytes() returns [0xFF; 16]. Since no byte is 0x00, all 16 bytes are hashed instead of just [0xFF].
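The slicing step can be reproduced in isolation. The sketch below is not the iceberg-rust code: `skip_leading_zeros` is a hypothetical name, it mirrors only the `position(|&x| x != 0)` step quoted above, and it returns `None` for the all-zero case because the snippet elides that branch. It also suggests that, under the same logic, a positive value with its top bit set (such as 128) would lose the 0x00 byte that marks it as positive.

```rust
/// Hypothetical helper: mirrors only the leading-zero skipping step from
/// the quoted snippet. Returns None when every byte is zero, because the
/// quoted code elides what happens in that branch.
fn skip_leading_zeros(bytes: &[u8; 16]) -> Option<&[u8]> {
    bytes
        .iter()
        .position(|&x| x != 0)
        .map(|start| &bytes[start..])
}

/// Number of bytes that would be handed to the hash for `v`
/// (0 when the all-zero branch is taken).
fn hashed_len(v: i128) -> usize {
    skip_leading_zeros(&v.to_be_bytes()).map_or(0, |s| s.len())
}
```

With this helper, `hashed_len(1)` is 1, while `hashed_len(-1)` is 16, matching the behavior described above. For 128 the slice is the single byte 0x80, which read as two's complement is -128 rather than 128.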

This causes hash mismatches with Java and Python implementations. For example, rows written by Spark with bucket[N] partitioning on a decimal column will be assigned to different buckets by iceberg-rust, causing incorrect partition pruning or writes to wrong files.

A possible fix is to use num_bigint:

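use num_bigint::BigInt;

// Inside `impl Bucket`, so that `Self::hash_bytes` resolves.
fn hash_decimal(v: i128) -> i32 {
    // to_signed_bytes_be() yields the minimal big-endian two's complement bytes.
    let bytes = BigInt::from(v).to_signed_bytes_be();
    Self::hash_bytes(&bytes)
}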

But this allocates a Vec<u8> on every call. It would be better to compute the minimum representation directly from the [u8; 16] array.
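One allocation-free way to do that is sketched below. It is an illustration under stated assumptions, not the project's implementation: `minimal_twos_complement` is a hypothetical name, and the hashing step is left out. A leading byte is dropped only when it is pure sign extension (0xFF for negative values, 0x00 otherwise) and the next byte's top bit still carries the same sign.

```rust
/// Hypothetical helper: returns the shortest big-endian two's complement
/// sub-slice of `bytes` that still decodes to the same signed value.
/// No heap allocation; the result borrows from the input array.
fn minimal_twos_complement(bytes: &[u8; 16]) -> &[u8] {
    // Sign-extension byte: 0xFF for negative values, 0x00 otherwise.
    let pad: u8 = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut start = 0;
    // Always keep at least one byte (so 0 -> [0x00] and -1 -> [0xFF]).
    // A leading byte is redundant only if it equals the padding byte and
    // the following byte's top bit matches the sign.
    while start < 15
        && bytes[start] == pad
        && (bytes[start + 1] & 0x80) == (pad & 0x80)
    {
        start += 1;
    }
    &bytes[start..]
}
```

For example, `minimal_twos_complement(&(-1i128).to_be_bytes())` is `[0xFF]` and `minimal_twos_complement(&128i128.to_be_bytes())` is `[0x00, 0x80]`. The resulting slice could then be passed to the existing `hash_bytes`, assuming it accepts a `&[u8]` as in the snippets above.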

To Reproduce

PyIceberg:

from decimal import Decimal
from pyiceberg.utils.decimal import decimal_to_bytes
import mmh3

print(mmh3.hash(decimal_to_bytes(Decimal(1))))   # -463810133
print(mmh3.hash(decimal_to_bytes(Decimal(-1))))  # -43192051

iceberg-rust:

#[test]
fn test_hash_decimal_with_negative_value() {
    assert_eq!(Bucket::hash_decimal(1), -463810133);  // passes
    assert_eq!(Bucket::hash_decimal(-1), -43192051);  // fails
}

Expected behavior

No response

Willingness to contribute

None
