
Conversation

@rdblue rdblue commented Aug 21, 2022

This refactors the Transform API so that transforms are generic and do not require a Type when they are created or loaded.

Initially, Transform exposed apply to run the transform on a value, which requires knowing the value's type ahead of time. However, most Transform methods are generic and accept a type argument, so the API was inconsistent. In addition, working with Transform was more difficult because a Transform always had to have a type when it was created, even though types may change or may not be known. For example, a sort order can reference transformed columns, but does not include the types of those source columns.

This PR is inspired by the Python implementation, which keeps Transform generic and uses a method to produce the function that applies the transform once the type is known, when the schema is available. This PR introduces bind(Type), which returns a Function<S, T> to transform values.

This is a large PR that updates call sites across the API, core, and Spark 3.3 modules.
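To make the new shape concrete, here is a self-contained sketch of the bind pattern. The TypeId enum, Transform interface, and Truncate class below are simplified stand-ins for illustration, not the actual Iceberg classes; the integer truncation formula follows the Iceberg spec's floor-style truncate.

```java
import java.io.Serializable;
import java.util.function.Function;

public class BindSketch {
  public enum TypeId { INTEGER, LONG, STRING }

  // A transform carries no source type until bind(Type) is called with the
  // concrete type from the schema.
  public interface Transform<S, T> extends Serializable {
    Function<S, T> bind(TypeId type);
    boolean canTransform(TypeId type);
  }

  // A generic truncate transform: created with only a width, no source type.
  public static class Truncate implements Transform<Object, Object> {
    private final int width;

    public Truncate(int width) { this.width = width; }

    @Override
    public boolean canTransform(TypeId type) {
      return type == TypeId.INTEGER || type == TypeId.LONG || type == TypeId.STRING;
    }

    @Override
    public Function<Object, Object> bind(TypeId type) {
      switch (type) {
        case INTEGER:
          // floor-style truncation: v - (((v % W) + W) % W)
          return v -> { int i = (Integer) v; return i - (((i % width) + width) % width); };
        case LONG:
          return v -> { long l = (Long) v; return l - (((l % width) + width) % width); };
        case STRING:
          return v -> { String s = (String) v; return s.substring(0, Math.min(width, s.length())); };
        default:
          throw new IllegalArgumentException("Cannot truncate " + type);
      }
    }
  }

  public static void main(String[] args) {
    Transform<Object, Object> truncate = new Truncate(10);             // no type needed yet
    Function<Object, Object> forLongs = truncate.bind(TypeId.LONG);    // type known at bind time
    System.out.println(forLongs.apply(-3L));                           // -10 (floor-style truncation)
    System.out.println(truncate.bind(TypeId.STRING).apply("iceberg")); // "iceberg" (shorter than width)
  }
}
```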


public static <T> UnboundTerm<T> truncate(String name, int width) {
-  return new UnboundTransform<>(ref(name), Transforms.truncate(Types.LongType.get(), width));
+  return new UnboundTransform<>(ref(name), Transforms.truncate(width));
Contributor Author

This is one of the places where the old API was awkward. Expressions with transforms needed to guess the source type before the expression was bound.

* @return an identity transform
* @deprecated use {@link #identity()} instead; will be removed in 2.0.0
*/
@Deprecated
Contributor Author

To ensure that everything works, I removed the deprecated methods in this class and made sure that all tests in core were passing. Then I added them back, marked deprecated, so that Spark 3.2 and other versions would still work.

import org.apache.iceberg.relocated.com.google.common.hash.HashFunction;
import org.apache.iceberg.relocated.com.google.common.hash.Hashing;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.BucketUtil;
Contributor Author

These tests instantiated bucket transforms to call the hash method. That method was moved to BucketUtil, so this now tests those functions directly.
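For reference, the Iceberg spec describes the bucket transform as a positive modulo of a 32-bit Murmur3 hash. The sketch below shows that shape; Object.hashCode() is a stand-in for the real hash, and the class and method names are mine, not BucketUtil's actual API.

```java
public class BucketSketch {
  // Shape of the bucket transform per the Iceberg spec:
  //   bucket(v, N) = (hash(v) & Integer.MAX_VALUE) % N
  // where hash is 32-bit Murmur3 over the value's byte representation.
  // Object.hashCode() stands in for Murmur3 here.
  public static int bucket(Object value, int numBuckets) {
    int hash = value.hashCode(); // stand-in for the Murmur3 hash
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is
    // non-negative even when the hash is negative.
    return (hash & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    System.out.println(bucket("iceberg", 16));
  }
}
```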

default String toHumanString(T value) {
-  return String.valueOf(value);
+  if (value instanceof ByteBuffer) {
+    return TransformUtil.base64encode(((ByteBuffer) value).duplicate());
Contributor

why do we need to duplicate the ByteBuffer for base64encode?

Contributor Author

That is so this method doesn't modify the original buffer. We could either add that here or in the base64encode method.

@stevenzwu stevenzwu Aug 23, 2022

Got it. Looking at the Base64#encode method, it changes the position of the source buffer. Should we just save the position and restore it after the encoding is done?

Encodes all remaining bytes from the specified byte buffer into a newly-allocated 
ByteBuffer using the Base64 encoding scheme. Upon return, the source 
buffer's position will be updated to its limit; its limit will not have been 
changed. The returned output buffer's position will be zero and its limit 
will be the number of resulting encoded bytes.

Contributor

nm. this is not on the critical code path. duplicate is simpler.
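The javadoc quoted above is easy to demonstrate: encoding a duplicate() advances only the duplicate's position, while encoding the original consumes it. A self-contained stdlib-only demo:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DuplicateDemo {
  public static void main(String[] args) {
    ByteBuffer original = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));

    // duplicate() shares the content but has an independent position/limit,
    // so encoding the duplicate leaves the original untouched:
    Base64.getEncoder().encode(original.duplicate());
    System.out.println(original.remaining()); // 5

    // Encoding the original directly moves its position to its limit:
    Base64.getEncoder().encode(original);
    System.out.println(original.remaining()); // 0
  }
}
```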

@aokolnychyi
Contributor

I'll take a look tomorrow morning.


@Override
public boolean canTransform(Type type) {
return type.isPrimitiveType();
Contributor

Is this correct? If we call canTransform(StringType) on BucketLong, this will pass. Or do callers always use the same type as the bucket type, so this is not a concern?

@aokolnychyi aokolnychyi Aug 23, 2022

It seems like a valid point because we still allow constructing Bucket for a particular type. Should we deprecate or prohibit that? Or still override canTransform in the children?

static <T> Bucket<T> get(Type type, int numBuckets) {

Contributor Author

I think this is correct because it is whether the transform can be applied to a type, not whether the current instance of that transform can be applied to a type. This is used to validate partition specs, sort orders, and bound functions that contain transforms. When we know the concrete type, we check whether the transform can handle it.

Before, we would construct the final transform just before that check, when the type was known: basically rebuilding it with Transform.fromString(actualType, transform.toString()) and then calling canTransform on the result. Now the same generic transform can be used, and canTransform doesn't need a type-specific transform instance.

MONTH(ChronoUnit.MONTHS, "month"),
DAY(ChronoUnit.DAYS, "day");

static class Apply implements Function<Integer, Integer>, Serializable {
Contributor

nit: similarly, should this be called DatesFunction?

private final int size;
private final Object[] partitionTuple;
-  private final Transform[] transforms;
+  private final Function[] transforms;
Contributor

Should this be Function<?, ?> to avoid warnings about raw usage of parameterized types?

Contributor

Well, it is probably not possible because of array covariance in Java. Forget about it.

public Integer width() {
return width;
public Function<Long, Long> bind(Type type) {
return this;
Contributor

Is it different compared to how we handle integers above?

Contributor Author

The only difference is that this function operates on longs rather than ints.

@rdblue rdblue force-pushed the remove-type-from-transform branch 2 times, most recently from 908cb2f to 9786d2c Compare August 24, 2022 22:16
@aokolnychyi
Contributor

Sorry for the delay. Let me take a look.

@aokolnychyi aokolnychyi left a comment

Looks great! I left some optional nits, feel free to skip them.

-        nextFieldId(),
-        targetName,
-        Transforms.day(sourceColumn.type()));
+    new PartitionField(sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.day());
Contributor

The new line length is quite unfortunate.


@SuppressWarnings("unchecked")
-  static <T> Bucket<T> get(Type type, int numBuckets) {
+  static <T, B extends Bucket<T> & SerializableFunction<T, Integer>> B get(
Contributor

Shall we deprecate this like we did in Truncate?

Contributor Author

No, this is internal so we don't need to. Deprecating it now would just add warnings that we don't need.

}

@Override
public boolean canTransform(Type type) {
Contributor

I am not sure removing canTransform from each specific BucketXXX class was absolutely necessary. We have an unbound generic Bucket, which can transform all types, but BucketInteger can't transform String, for instance.

Contributor Author

This conforms to the contract more closely. The functions are abstract and not tied to a type, so this should always be generic.


if (timestampMicros >= 0) {
OffsetDateTime timestamp =
Instant.ofEpochSecond(
Contributor

Optional: I know it was copied from another place but it is a bit hard to read this block because of formatting, I'd consider adding temp vars.

if (timestampMicros >= 0) {
  long epochSecond = Math.floorDiv(timestampMicros, 1_000_000);
  int nanoAdjustment = Math.floorMod(timestampMicros, 1_000_000) * 1000;
  Instant instant = Instant.ofEpochSecond(epochSecond, nanoAdjustment);
  return (int) granularity.between(EPOCH, instant.atOffset(ZoneOffset.UTC));
} else {
  // ...
  long epochSecond = Math.floorDiv(timestampMicros, 1_000_000);
  int nanoAdjustment = Math.floorMod(timestampMicros + 1, 1_000_000) * 1000;
  Instant instant = Instant.ofEpochSecond(epochSecond, nanoAdjustment);
  return (int) granularity.between(EPOCH, instant.atOffset(ZoneOffset.UTC)) - 1;
}

Contributor Author

I decided not to modify this. I think it's a good suggestion, but messing this up with a cut & paste error just would not be worth it 😅.
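The negative branch is the subtle part of that snippet: ChronoUnit.between truncates toward zero, so without the micros + 1 / subtract-one adjustment, a timestamp just before the epoch would land in day 0 instead of day -1. Below is a self-contained version of the day-granularity case following the same floorDiv/floorMod pattern; the class and method names are mine, not Iceberg's.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

public class TimestampDaySketch {
  private static final OffsetDateTime EPOCH = Instant.EPOCH.atOffset(ZoneOffset.UTC);

  // Day ordinal (days from 1970-01-01) for a microsecond timestamp.
  public static int daysFromMicros(long timestampMicros) {
    if (timestampMicros >= 0) {
      Instant instant =
          Instant.ofEpochSecond(
              Math.floorDiv(timestampMicros, 1_000_000),
              Math.floorMod(timestampMicros, 1_000_000) * 1000);
      return (int) ChronoUnit.DAYS.between(EPOCH, instant.atOffset(ZoneOffset.UTC));
    } else {
      // Nudge the sub-second part by one microsecond and subtract a day, so
      // that between(), which truncates toward zero, yields floor semantics.
      Instant instant =
          Instant.ofEpochSecond(
              Math.floorDiv(timestampMicros, 1_000_000),
              Math.floorMod(timestampMicros + 1, 1_000_000) * 1000);
      return (int) ChronoUnit.DAYS.between(EPOCH, instant.atOffset(ZoneOffset.UTC)) - 1;
    }
  }

  public static void main(String[] args) {
    System.out.println(daysFromMicros(0L));              // 0
    System.out.println(daysFromMicros(86_400_000_000L)); // 1  (exactly one day after the epoch)
    System.out.println(daysFromMicros(-1L));             // -1 (one microsecond before the epoch)
  }
}
```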

@rdblue rdblue force-pushed the remove-type-from-transform branch from 97149fb to 7e176fe Compare September 2, 2022 16:40
@rdblue rdblue merged commit 223177f into apache:master Sep 2, 2022
rdblue commented Sep 2, 2022

Thanks for the reviews, @aokolnychyi and @stevenzwu!
