API: Remove source type from Transform #5601
Conversation
  public static <T> UnboundTerm<T> truncate(String name, int width) {
-   return new UnboundTransform<>(ref(name), Transforms.truncate(Types.LongType.get(), width));
+   return new UnboundTransform<>(ref(name), Transforms.truncate(width));
This is one of the places where the old API was awkward. Expressions with transforms needed to guess the source type before the expression was bound.
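For illustration, a minimal sketch of the caller-side code after this change (the column name "data" and the literal are made up for the example; the Expressions.equal overload that accepts a term is assumed from the existing API):

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

class TruncateExpressionSketch {
  // No source Type appears when the expression is built; the type is resolved
  // later, when the expression is bound against a schema.
  static Expression example() {
    return Expressions.equal(Expressions.truncate("data", 2), "ab");
  }
}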
 * @return an identity transform
 * @deprecated use {@link #identity()} instead; will be removed in 2.0.0
 */
@Deprecated
To ensure that everything works, I removed the deprecated methods in this class and made sure that all tests in core were passing. Then I added them back, marked deprecated (so that Spark 3.2 and other versions still work).
import org.apache.iceberg.relocated.com.google.common.hash.HashFunction;
import org.apache.iceberg.relocated.com.google.common.hash.Hashing;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.BucketUtil;
These tests instantiated bucket transforms to call the hash method. That method was moved to BucketUtil, so this now tests those functions directly.
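As a hedged sketch of what such a direct test can look like (assuming BucketUtil exposes overloaded static hash methods for int and long, matching what the bucket transforms use):

import org.apache.iceberg.util.BucketUtil;
import org.junit.Assert;
import org.junit.Test;

public class TestBucketUtilSketch {
  @Test
  public void testIntAndLongHashAgree() {
    // The spec requires numerically equal int and long values to produce the same
    // bucket hash, so the two overloads can be checked against each other without
    // constructing a Transform instance.
    Assert.assertEquals(BucketUtil.hash(34), BucketUtil.hash(34L));
  }
}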
  default String toHumanString(T value) {
-   return String.valueOf(value);
+   if (value instanceof ByteBuffer) {
+     return TransformUtil.base64encode(((ByteBuffer) value).duplicate());
why do we need to duplicate the ByteBuffer for base64encode?
That is so this method doesn't modify the original buffer. We could either add that here or in the base64encode method.
Got it. Looking at the Base64#encode method, it changes the position of the source buffer. Should we just save the position and restore it after the encoding is done?
Encodes all remaining bytes from the specified byte buffer into a newly-allocated
ByteBuffer using the Base64 encoding scheme. Upon return, the source
buffer's position will be updated to its limit; its limit will not have been
changed. The returned output buffer's position will be zero and its limit
will be the number of resulting encoded bytes.
nm. this is not on the critical code path. duplicate is simpler.
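To make the trade-off concrete, a small standalone illustration using plain java.util.Base64 (not the helper in this PR): encoding a duplicate advances only the duplicate's position, so the caller's buffer stays readable.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DuplicateEncodeSketch {
  public static void main(String[] args) {
    ByteBuffer original = ByteBuffer.wrap("abc".getBytes(StandardCharsets.UTF_8));

    // encode(ByteBuffer) moves the source buffer's position to its limit, so we
    // encode a duplicate; it shares content with the original but has its own position.
    Base64.getEncoder().encode(original.duplicate());

    System.out.println(original.position());  // prints 0: the original is untouched
  }
}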
I'll take a look tomorrow morning.
  @Override
  public boolean canTransform(Type type) {
    return type.isPrimitiveType();
Is this correct? If we call canTransform(StringType) on BucketLong, this will pass. Or does the caller always use the same type as the bucket type, so this is not a concern?
It seems like a valid point because we still allow constructing a Bucket for a particular type. Should we deprecate or prohibit that? Or still override this in the children?
static <T> Bucket<T> get(Type type, int numBuckets) {
I think this is correct because it is whether the transform can be applied to a type, not whether the current instance of that transform can be applied to a type. This is used to validate partition specs, sort orders, and bound functions that contain transforms. When we know the concrete type, we check whether the transform can handle it.
Before, we would get the final transform just before that check, when the type was known. That basically used Transform.fromString(actualType, transform.toString()) to rebuild the transform, and then it would call canTransform. Now the same generic transform can be used, and canTransform doesn't need a type-specific transform instance.
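As a hedged sketch of that check with the new API (assuming the no-arg Transforms.bucket(int) factory from this PR and the standard primitive type singletons):

import org.apache.iceberg.transforms.Transform;
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.types.Types;

class CanTransformSketch {
  static void example() {
    // One generic bucket transform can be validated against whatever concrete
    // source type shows up when the spec, sort order, or function is bound.
    Transform<Object, Integer> bucket = Transforms.bucket(16);
    bucket.canTransform(Types.StringType.get());  // expected: true, strings are bucketable
    bucket.canTransform(Types.DoubleType.get());  // expected: false, doubles are not
  }
}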
  MONTH(ChronoUnit.MONTHS, "month"),
  DAY(ChronoUnit.DAYS, "day");

  static class Apply implements Function<Integer, Integer>, Serializable {
nit: similarly, should this be called DatesFunction?
  private final int size;
  private final Object[] partitionTuple;
- private final Transform[] transforms;
+ private final Function[] transforms;
Should this be Function<?, ?> to avoid warnings about raw usage of parameterized types?
Well, it is probably not possible because of array covariance in Java. Forget about it.
api/src/main/java/org/apache/iceberg/expressions/BoundTransform.java (outdated comment, resolved)
  public Integer width() {
    return width;
  public Function<Long, Long> bind(Type type) {
    return this;
Is it different compared to how we handle integers above?
The only difference is that this function operates on longs rather than ints.
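For context, a self-contained, hypothetical sketch of the pattern shown in this diff (not the actual Iceberg class): the transform is itself the function, so once the source type is validated, bind can simply return this.

import java.io.Serializable;
import java.util.function.Function;

// Hypothetical stand-in for the long variant of truncate; the real class lives in
// org.apache.iceberg.transforms and validates the Type argument before returning this.
class TruncateLongSketch implements Function<Long, Long>, Serializable {
  private final int width;

  TruncateLongSketch(int width) {
    this.width = width;
  }

  Function<Long, Long> bind(Object typePlaceholder) {
    return this;  // already operates on longs, so nothing to specialize at bind time
  }

  @Override
  public Long apply(Long value) {
    if (value == null) {
      return null;
    }
    // floor the value to the nearest lower multiple of width, matching the spec's
    // truncate semantics for integral types
    return value - (((value % width) + width) % width);
  }
}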
Sorry for the delay. Let me take a look.
aokolnychyi left a comment
Looks great! I left some optional nits, feel free to skip them.
      nextFieldId(),
      targetName,
-     Transforms.day(sourceColumn.type()));
+ new PartitionField(sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.day());
The new line length is quite unfortunate.
  @SuppressWarnings("unchecked")
- static <T> Bucket<T> get(Type type, int numBuckets) {
+ static <T, B extends Bucket<T> & SerializableFunction<T, Integer>> B get(
Shall we deprecate this like we did in Truncate?
No, this is internal so we don't need to. Deprecating it now would just add warnings that we don't need.
  }

  @Override
  public boolean canTransform(Type type) {
I am not sure removing canTransform from each specific BucketXXX class was absolutely necessary. We have an unbound generic Bucket, which can transform all types, but BucketInteger can't transform String, for instance.
This conforms to the contract more closely. The functions are abstract and not tied to a type, so this should always be generic.
api/src/main/java/org/apache/iceberg/transforms/Timestamps.java (outdated comment, resolved)
  if (timestampMicros >= 0) {
    OffsetDateTime timestamp =
        Instant.ofEpochSecond(
Optional: I know it was copied from another place, but this block is a bit hard to read because of the formatting; I'd consider adding temp vars.
if (timestampMicros >= 0) {
  long epochSecond = Math.floorDiv(timestampMicros, 1_000_000);
  long nanoAdjustment = Math.floorMod(timestampMicros, 1_000_000) * 1000;
  Instant instant = Instant.ofEpochSecond(epochSecond, nanoAdjustment);
  return (int) granularity.between(EPOCH, instant.atOffset(ZoneOffset.UTC));
} else {
  // ...
  long epochSecond = Math.floorDiv(timestampMicros, 1_000_000);
  long nanoAdjustment = Math.floorMod(timestampMicros + 1, 1_000_000) * 1000;
  Instant instant = Instant.ofEpochSecond(epochSecond, nanoAdjustment);
  return (int) granularity.between(EPOCH, instant.atOffset(ZoneOffset.UTC)) - 1;
}
I decided not to modify this. I think it's a good suggestion, but messing this up with a cut & paste error just would not be worth it 😅.
Thanks for the reviews, @aokolnychyi and @stevenzwu!
(cherry picked from commit 223177f)
This refactors the Transform API so that transforms are generic and do not require a Type when they are created or loaded.

Initially, Transform exposed apply to run the transform on a value. This requires knowing the type of the value ahead of time. However, most Transform methods are generic and accept a type argument, so the API was mixed. In addition, working with Transform was more difficult because a Transform must always have a type when it is created, even though types may change or may not be known. For example, a sort order can reference transformed columns, but does not include the types of those source columns.

This PR is inspired by the Python implementation, which left Transform generic and uses a method to produce a function that applies the transform based on a type that is known later, when the schema is known. This PR introduces bind(Type), which returns a Function<S, T> to transform values.

This is a large PR that changes the uses in API, core, and Spark 3.3.
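As a rough, hedged illustration of that flow (method names follow this PR; exact signatures and the concrete function type returned by bind may differ slightly): create the transform without a type, then bind it once the schema is known.

import java.util.function.Function;
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.types.Types;

class BindFlowSketch {
  static void example() {
    // No source type when the transform is created or loaded; the type is supplied
    // at bind time, producing a function that transforms values.
    Function<Integer, Integer> dayFn = Transforms.<Integer>day().bind(Types.DateType.get());

    // The bound function is applied to values of the now-known source type; for a
    // date column the input is the day ordinal (days since 1970-01-01).
    Integer day = dayFn.apply(19000);
  }
}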