Conversation
liurenjie1024 left a comment
Thanks for the effort. I think deriving serialization/deserialization for primitive types is fine, but I'm not sure we should implement ser/de for composite types in this way. When serializing/deserializing composite types, we should associate more information (such as schema and type) with them at that point, to avoid keeping it in memory.
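A minimal sketch of what schema-guided serialization could look like (all names are hypothetical, not the PR's actual API): the type is passed alongside the value at serialization time instead of being stored inside every composite value.

```rust
// Hypothetical sketch, not the PR's actual API: the schema (`Type`) is
// supplied when serializing rather than stored inside each value, so
// composite values stay compact in memory.
#[derive(Debug)]
enum Type {
    Int,
    Struct(Vec<Type>),
}

#[derive(Debug)]
enum Value {
    Int(i32),
    Struct(Vec<Value>),
}

fn to_bytes(value: &Value, ty: &Type) -> Result<Vec<u8>, String> {
    match (value, ty) {
        // Primitive case: fixed-width little-endian encoding.
        (Value::Int(i), Type::Int) => Ok(i.to_le_bytes().to_vec()),
        // Composite case: recurse field by field, guided by the schema.
        (Value::Struct(fields), Type::Struct(types)) if fields.len() == types.len() => {
            let mut out = Vec::new();
            for (field, field_ty) in fields.iter().zip(types) {
                out.extend(to_bytes(field, field_ty)?);
            }
            Ok(out)
        }
        _ => Err("value does not match schema".to_string()),
    }
}
```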
```rust
Value::String(_) => Type::Primitive(PrimitiveType::String),
Value::UUID(_) => Type::Primitive(PrimitiveType::Uuid),
Value::Decimal(dec) => Type::Primitive(PrimitiveType::Decimal {
    precision: 38,
```
I think we want to make the precision configurable as well.
I think we then might need to store the precision in another field. I don't know how to calculate the precision from the scale, and I don't see a way to store it in the rust Decimal type.
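One possible shape, sketched below with hypothetical names (not the PR's actual types): keep the precision in its own field next to the unscaled value, and derive a lower bound for it from the digit count when needed.

```rust
// Hypothetical sketch: carry the precision explicitly, since the scale
// alone does not determine it.
#[allow(dead_code)]
struct DecimalLiteral {
    unscaled: i128, // unscaled value (mantissa)
    precision: u32, // total number of significant decimal digits
    scale: u32,     // digits to the right of the decimal point
}

/// Smallest precision that can represent the given unscaled value:
/// its decimal digit count (zero still needs one digit).
fn min_precision(unscaled: i128) -> u32 {
    unscaled.unsigned_abs().checked_ilog10().map_or(1, |d| d + 1)
}
```

This only gives a lower bound inferred from the stored value; the declared precision of the column would still have to come from the schema.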
I would suggest two changes to this PR:
Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>
Thanks @JanKaul, I think we are almost there; we just need to update date/time to use long/int.
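For reference, the Iceberg table spec represents these temporal types on top of int/long: date as days since 1970-01-01, time and timestamp in microseconds. A sketch of the mapping (variant and function names here are illustrative, not the PR's exact definitions):

```rust
// Iceberg's representation of temporal types per the table spec.
// Names are illustrative, not the PR's exact definitions.
#[derive(Debug, PartialEq)]
enum PrimitiveLiteral {
    Date(i32),      // days since the Unix epoch (int)
    Time(i64),      // microseconds since midnight (long)
    Timestamp(i64), // microseconds since the Unix epoch (long)
}

/// Build a timestamp literal from whole seconds since the epoch.
fn timestamp_from_seconds(secs: i64) -> PrimitiveLiteral {
    PrimitiveLiteral::Timestamp(secs * 1_000_000)
}

/// Build a date literal from days since the epoch.
fn date_from_days(days: i32) -> PrimitiveLiteral {
    PrimitiveLiteral::Date(days)
}
```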
Fokko left a comment
LGTM, one question about the naming. A lot has happened since I last checked; I'll leave it up to you whether to keep it as Literal or revert it to Value.
```rust
PrimitiveLiteral::String(_) => Type::Primitive(PrimitiveType::String),
PrimitiveLiteral::UUID(_) => Type::Primitive(PrimitiveType::Uuid),
PrimitiveLiteral::Decimal(dec) => Type::Primitive(PrimitiveType::Decimal {
    precision: 38,
```
This isn't always 38. It actually depends on the number of bytes that are read, but we can break that out into a separate chore.
We might need to store the precision in an extra field, because I don't see a way to get the precision from the rust Decimal. I'm not sure if this can somehow be calculated from the mantissa and the scale.
This is how we do it in Python:
```python
import math

MAX_PRECISION = tuple(
    math.floor(math.log10(math.fabs(math.pow(2, 8 * pos - 1) - 1))) for pos in range(24)
)
REQUIRED_LENGTH = tuple(
    next(pos for pos in range(24) if p <= MAX_PRECISION[pos]) for p in range(40)
)


def decimal_required_bytes(precision: int) -> int:
    """Compute the number of bytes required to store a precision.

    Args:
        precision: The number of digits to store.

    Returns:
        The number of bytes required to store a decimal with a certain precision.
    """
    if precision <= 0 or precision >= 40:
        raise ValueError(f"Unsupported precision, outside of (0, 40]: {precision}")
    return REQUIRED_LENGTH[precision]
```

And how we do it in Java: https://github.com/apache/iceberg/blob/f5f543a54ff7460648bb864f4f06a29eb28938b9/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L679-L715
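A Rust counterpart of that Python lookup could look like this (a sketch, not the PR's code): find the smallest two's-complement byte length whose maximum representable precision covers the requested one.

```rust
/// Number of bytes needed to store an unscaled decimal of the given
/// precision; mirrors the Python/Java lookup above. A sketch, not the
/// PR's actual code.
fn decimal_required_bytes(precision: u32) -> Result<usize, String> {
    if precision == 0 || precision >= 40 {
        return Err(format!("Unsupported precision: {precision}"));
    }
    // A two's-complement value of `pos` bytes can hold at most
    // floor(log10(2^(8*pos - 1) - 1)) decimal digits.
    (1..24usize)
        .find(|&pos| {
            let max_precision =
                (2f64.powi(8 * pos as i32 - 1) - 1.0).log10().floor() as u32;
            precision <= max_precision
        })
        .ok_or_else(|| format!("No byte length found for precision {precision}"))
}
```

Note that a precision of 38 fits in 16 bytes, which is why an `i128` can back the unscaled value.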
```rust
// under the License.

/*!
 * Value in iceberg
```
A lot has happened since I reviewed this PR the last time. I'm also okay with calling this a value instead of a literal.
```rust
fn avro_bytes_long() {
    let bytes = vec![32u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8];

    check_avro_bytes_serde(
```
What are we testing here? Keep in mind that Avro uses zig-zag encoding for integers.
Typically, the upper and lower bounds are stored as Avro "bytes" inside the manifest_entry. These bytes conform to the Iceberg binary single-value serialization. This test checks that bytes serialized into Avro "bytes" can be deserialized into a literal value.
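To make the distinction concrete: Avro int/long values go through zig-zag varint encoding, whereas the bound bytes are stored verbatim as Avro "bytes". A minimal sketch of the zig-zag mapping (helper names are illustrative):

```rust
/// Zig-zag encode a signed 64-bit integer the way Avro does before
/// varint-encoding it: values of small magnitude map to small
/// unsigned values (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
fn zigzag_encode(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

/// Inverse of `zigzag_encode`.
fn zigzag_decode(z: u64) -> i64 {
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}
```

So an Avro `long` of 16 is written as the varint for 32, while the eight raw bytes `[32, 0, 0, 0, 0, 0, 0, 0]` in an Avro `bytes` field stay untouched and decode to a little-endian long of 32.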
```rust
    }
}

fn to_optional_literal(value: Result<Literal, Error>) -> Result<Option<Literal>, Error> {
```
Fokko left a comment
Thanks @JanKaul for working on this, and @liurenjie1024 @Xuanwo @ZENOTME for the reviews 🚀
```rust
let reader = apache_avro::Reader::new(&*encoded).unwrap();

for record in reader {
    let result = apache_avro::from_value::<ByteBuf>(&record.unwrap()).unwrap();
```
Sorry, I can't figure out what the difference between result and bytes is 🤔 cc @JanKaul
Good catch! The literal is supposed to be written to the Avro Writer instead of the bytes. Like so:

```rust
writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();
```

I will create a PR to fix this.
This pull request defines the representation of Iceberg values. Additionally, their serialization/deserialization is implemented.