Skip to content

feat: Implement Iceberg values#20

Merged
Fokko merged 42 commits intoapache:mainfrom
JanKaul:values
Aug 9, 2023
Merged

feat: Implement Iceberg values#20
Fokko merged 42 commits intoapache:mainfrom
JanKaul:values

Conversation

@JanKaul
Copy link
Copy Markdown
Collaborator

@JanKaul JanKaul commented Aug 2, 2023

This pull request defines the representation of iceberg values. Additionally the serialization/deserialization is implemented.

Copy link
Copy Markdown
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort. I think deriving serialization/deserialization for primitive types is ok. But I'm not sure we should implement ser/de for composite types in this way. When ser/de composite types, we should associate more information(such as schema, type) with it to avoid keeping them in memory.

Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
@liurenjie1024 liurenjie1024 requested a review from nastra August 3, 2023 03:14
Comment thread crates/iceberg/src/error.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @JanKaul, I left some comments

Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
Value::String(_) => Type::Primitive(PrimitiveType::String),
Value::UUID(_) => Type::Primitive(PrimitiveType::Uuid),
Value::Decimal(dec) => Type::Primitive(PrimitiveType::Decimal {
precision: 38,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to make the precision configurable as well.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we then might need to store the precision in another field. I don't know how to calculate the precision from the scale and I don't see a way to store it in the rust decimal.

Comment thread crates/iceberg/src/spec/values.rs
Comment thread crates/iceberg/src/spec/values.rs Outdated
@liurenjie1024
Copy link
Copy Markdown
Contributor

I would suggest two changes to this pr:

  1. Create a PrimitiveValue type so that we can be type safe when ser/de from bytes.
  2. Remove the ser/de implementation for composite types, I think we would need to have data types when implementing them, and we can do it later.

Copy link
Copy Markdown
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@liurenjie1024
Copy link
Copy Markdown
Contributor

Thanks @JanKaul I think we are almost there, just need to update date/time with long/int.

Comment thread crates/iceberg/src/spec/values.rs Outdated
@liurenjie1024 liurenjie1024 mentioned this pull request Aug 9, 2023
Copy link
Copy Markdown
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one question about the naming. A lot has happened since I last checked, and I'll leave it up to you to leave it as Literal or revert it to Value, it is up to you

PrimitiveLiteral::String(_) => Type::Primitive(PrimitiveType::String),
PrimitiveLiteral::UUID(_) => Type::Primitive(PrimitiveType::Uuid),
PrimitiveLiteral::Decimal(dec) => Type::Primitive(PrimitiveType::Decimal {
precision: 38,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't always 38. It actually depends on the number of bytes that are read. But we can break that out in a separate chore

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to store the precision in an extra field because I don't see a way to get the precision from the rust Decimal. I'm not sure if this can be somehow calculated depending on the mantissa and the scale.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we do it in Python:

MAX_PRECISION = tuple(math.floor(math.log10(math.fabs(math.pow(2, 8 * pos - 1) - 1))) for pos in range(24))
REQUIRED_LENGTH = tuple(next(pos for pos in range(24) if p <= MAX_PRECISION[pos]) for p in range(40))


def decimal_required_bytes(precision: int) -> int:
    """Compute the number of bytes required to store a precision.

    Args:
        precision: The number of digits to store.

    Returns:
        The number of bytes required to store a decimal with a certain precision.
    """
    if precision <= 0 or precision >= 40:
        raise ValueError(f"Unsupported precision, outside of (0, 40]: {precision}")

    return REQUIRED_LENGTH[precision]

And how we do it in Java: https://github.com/apache/iceberg/blob/f5f543a54ff7460648bb864f4f06a29eb28938b9/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L679-L715

// under the License.

/*!
* Value in iceberg
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of happened since I reviewed this PR the last time. I'm also okay with calling this a value instead of a literal.

fn avro_bytes_long() {
let bytes = vec![32u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8];

check_avro_bytes_serde(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we testing here? Keep in mind that Avro uses zig-zag encoding for integers.

Copy link
Copy Markdown
Collaborator Author

@JanKaul JanKaul Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically the upper and lower bounds are stored as avro "bytes" inside the manifest_entry. These bytes conform to the iceberg binary single value serialization. This is a test to check if bytes serialized into avro "bytes" can be deserialized into a literal value.

Comment thread crates/iceberg/src/error.rs Outdated
Comment thread crates/iceberg/src/spec/values.rs Outdated
}
}

fn to_optional_literal(value: Result<Literal, Error>) -> Result<Option<Literal>, Error> {

This comment was marked as duplicate.

Copy link
Copy Markdown
Contributor

@ZENOTME ZENOTME left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM!

Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JanKaul for working on this, and @liurenjie1024 @Xuanwo @ZENOTME for the reviews 🚀

@Fokko Fokko merged commit 4ee05b4 into apache:main Aug 9, 2023
@JanKaul JanKaul deleted the values branch August 9, 2023 14:00
@ZENOTME ZENOTME mentioned this pull request Oct 12, 2023
let reader = apache_avro::Reader::new(&*encoded).unwrap();

for record in reader {
let result = apache_avro::from_value::<ByteBuf>(&record.unwrap()).unwrap();
Copy link
Copy Markdown
Contributor

@ZENOTME ZENOTME Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I can't figure it out what is the difference between result and bytes? 🤔 cc @JanKaul

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! The literal is supposed to be written to the Avro Writer instead of the bytes. Like so:

writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();

I will create a PR to fix this.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a PR to fix it.

@Xuanwo Xuanwo mentioned this pull request Nov 11, 2023
xxchan pushed a commit to xxchan/iceberg-rust that referenced this pull request Mar 12, 2025
xxchan pushed a commit to xxchan/iceberg-rust that referenced this pull request Mar 25, 2025
xxchan pushed a commit to xxchan/iceberg-rust that referenced this pull request Mar 25, 2025
xxchan pushed a commit to xxchan/iceberg-rust that referenced this pull request Mar 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants