
Conversation

@rdblue rdblue commented Aug 21, 2022

This refactors the Transform API so that transforms are generic and do not require a Type when they are created or loaded.

Initially, Transform exposed apply to run the transform on a value, which requires knowing the value's type ahead of time. However, most Transform methods are generic and accept a type argument, so the API was inconsistent. In addition, working with Transform was more difficult because a Transform always had to have a type when it was created, even though types may change or may not be known. For example, a sort order can reference transformed columns, but does not include the types of those source columns.

This PR is inspired by the Python implementation, which keeps Transform generic and uses a method to produce the function that applies the transform once the type is known, when the schema is available. This PR introduces bind(Type), which returns a Function<S, T> to transform values.

This is a large PR that updates call sites across the API, core, and Spark 3.3 modules.
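To make the new shape concrete, here is a self-contained sketch of the bind pattern. The TypeId enum, Transform interface, and Truncate class below are simplified stand-ins for illustration, not the actual Iceberg classes; the integer truncation formula follows the Iceberg spec's floor-style truncate.

```java
import java.io.Serializable;
import java.util.function.Function;

public class BindSketch {
  public enum TypeId { INTEGER, LONG, STRING }

  // A transform carries no source type until bind(Type) is called with the
  // concrete type from the schema.
  public interface Transform<S, T> extends Serializable {
    Function<S, T> bind(TypeId type);
    boolean canTransform(TypeId type);
  }

  // A generic truncate transform: created with only a width, no source type.
  public static class Truncate implements Transform<Object, Object> {
    private final int width;

    public Truncate(int width) { this.width = width; }

    @Override
    public boolean canTransform(TypeId type) {
      return type == TypeId.INTEGER || type == TypeId.LONG || type == TypeId.STRING;
    }

    @Override
    public Function<Object, Object> bind(TypeId type) {
      switch (type) {
        case INTEGER:
          // floor-style truncation: v - (((v % W) + W) % W)
          return v -> { int i = (Integer) v; return i - (((i % width) + width) % width); };
        case LONG:
          return v -> { long l = (Long) v; return l - (((l % width) + width) % width); };
        case STRING:
          return v -> { String s = (String) v; return s.substring(0, Math.min(width, s.length())); };
        default:
          throw new IllegalArgumentException("Cannot truncate " + type);
      }
    }
  }

  public static void main(String[] args) {
    Transform<Object, Object> truncate = new Truncate(10);             // no type needed yet
    Function<Object, Object> forLongs = truncate.bind(TypeId.LONG);    // type known at bind time
    System.out.println(forLongs.apply(-3L));                           // -10 (floor-style truncation)
    System.out.println(truncate.bind(TypeId.STRING).apply("iceberg")); // "iceberg" (shorter than width)
  }
}
```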


public static <T> UnboundTerm<T> truncate(String name, int width) {
-  return new UnboundTransform<>(ref(name), Transforms.truncate(Types.LongType.get(), width));
+  return new UnboundTransform<>(ref(name), Transforms.truncate(width));
Contributor Author

This is one of the places where the old API was awkward. Expressions with transforms needed to guess the source type before the expression was bound.

* @return an identity transform
* @deprecated use {@link #identity()} instead; will be removed in 2.0.0
*/
@Deprecated
Contributor Author

To ensure that everything works, I removed the deprecated methods in this class and made sure that all tests in core were passing. Then I added them back, marked deprecated, so that Spark 3.2 and other versions would still work.

import org.apache.iceberg.relocated.com.google.common.hash.HashFunction;
import org.apache.iceberg.relocated.com.google.common.hash.Hashing;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.BucketUtil;
Contributor Author

These tests instantiated bucket transforms to call the hash method. That method was moved to BucketUtil, so this now tests those functions directly.
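For reference, the Iceberg spec describes the bucket transform as a positive modulo of a 32-bit Murmur3 hash. The sketch below shows that shape; Object.hashCode() is a stand-in for the real hash, and the class and method names are mine, not BucketUtil's actual API.

```java
public class BucketSketch {
  // Shape of the bucket transform per the Iceberg spec:
  //   bucket(v, N) = (hash(v) & Integer.MAX_VALUE) % N
  // where hash is 32-bit Murmur3 over the value's byte representation.
  // Object.hashCode() stands in for Murmur3 here.
  public static int bucket(Object value, int numBuckets) {
    int hash = value.hashCode(); // stand-in for the Murmur3 hash
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is
    // non-negative even when the hash is negative.
    return (hash & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    System.out.println(bucket("iceberg", 16));
  }
}
```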

default String toHumanString(T value) {
-  return String.valueOf(value);
+  if (value instanceof ByteBuffer) {
+    return TransformUtil.base64encode(((ByteBuffer) value).duplicate());
Contributor

why do we need to duplicate the ByteBuffer for base64encode?

Contributor Author

That is so this method doesn't modify the original buffer. We could either add that here or in the base64encode method.

@stevenzwu stevenzwu Aug 23, 2022

Got it. Looking at the Base64#encode method, it changes the position of the source buffer. Should we just save the position and restore it after the encoding is done?

Encodes all remaining bytes from the specified byte buffer into a newly-allocated 
ByteBuffer using the Base64 encoding scheme. Upon return, the source 
buffer's position will be updated to its limit; its limit will not have been 
changed. The returned output buffer's position will be zero and its limit 
will be the number of resulting encoded bytes.

Contributor

nm. this is not on the critical code path. duplicate is simpler.
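The javadoc quoted above is easy to demonstrate: encoding a duplicate() advances only the duplicate's position, while encoding the original consumes it. A self-contained stdlib-only demo:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DuplicateDemo {
  public static void main(String[] args) {
    ByteBuffer original = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));

    // duplicate() shares the content but has an independent position/limit,
    // so encoding the duplicate leaves the original untouched:
    Base64.getEncoder().encode(original.duplicate());
    System.out.println(original.remaining()); // 5

    // Encoding the original directly moves its position to its limit:
    Base64.getEncoder().encode(original);
    System.out.println(original.remaining()); // 0
  }
}
```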

@aokolnychyi
Contributor

I'll take a look tomorrow morning.


@Override
public boolean canTransform(Type type) {
return type.isPrimitiveType();
Contributor

Is this correct? If we call canTransform(StringType) on BucketLong, this will pass. Or do callers always use the same type as the bucket type, so this is not a concern?

@aokolnychyi aokolnychyi Aug 23, 2022

It seems like a valid point because we still allow constructing Bucket for a particular type. Should we deprecate or prohibit that? Or still override canTransform in the children?

static <T> Bucket<T> get(Type type, int numBuckets) {

Contributor Author

I think this is correct because it is whether the transform can be applied to a type, not whether the current instance of that transform can be applied to a type. This is used to validate partition specs, sort orders, and bound functions that contain transforms. When we know the concrete type, we check whether the transform can handle it.

Before, we would construct the final transform just before that check, when the type was known: basically rebuilding it with Transform.fromString(actualType, transform.toString()) and then calling canTransform on the result. Now the same generic transform can be used, and canTransform doesn't need a type-specific transform instance.

MONTH(ChronoUnit.MONTHS, "month"),
DAY(ChronoUnit.DAYS, "day");

static class Apply implements Function<Integer, Integer>, Serializable {
Contributor

nit: similarly, should this be called DatesFunction?

private final int size;
private final Object[] partitionTuple;
-  private final Transform[] transforms;
+  private final Function[] transforms;
Contributor

Should this be Function<?, ?> to avoid warnings about raw usage of parameterized types?

Contributor

Well, it is probably not possible because of array covariance in Java. Forget about it.

public Integer width() {
return width;
public Function<Long, Long> bind(Type type) {
return this;
Contributor

Is it different compared to how we handle integers above?

Contributor Author

The only difference is that this function operates on longs rather than ints.

@rdblue rdblue force-pushed the remove-type-from-transform branch 2 times, most recently from 908cb2f to 9786d2c Compare August 24, 2022 22:16
@aokolnychyi
Contributor

Sorry for the delay. Let me take a look.

@aokolnychyi aokolnychyi left a comment

Looks great! I left some optional nits, feel free to skip them.

-        nextFieldId(),
-        targetName,
-        Transforms.day(sourceColumn.type()));
+    new PartitionField(sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.day());
Contributor

The new line length is quite unfortunate.


@SuppressWarnings("unchecked")
-  static <T> Bucket<T> get(Type type, int numBuckets) {
+  static <T, B extends Bucket<T> & SerializableFunction<T, Integer>> B get(
Contributor

Shall we deprecate this like we did in Truncate?

Contributor Author

No, this is internal so we don't need to. Deprecating it now would just add warnings that we don't need.

}

@Override
public boolean canTransform(Type type) {
Contributor

I am not sure removing canTransform from each specific BucketXXX class was absolutely necessary. We have an unbound generic Bucket, which can transform all types, but BucketInteger can't transform String, for instance.

Contributor Author

This conforms to the contract more closely. The functions are abstract and not tied to a type, so this should always be generic.


if (timestampMicros >= 0) {
OffsetDateTime timestamp =
Instant.ofEpochSecond(
Contributor

Optional: I know it was copied from another place but it is a bit hard to read this block because of formatting, I'd consider adding temp vars.

if (timestampMicros >= 0) {
  long epochSecond = Math.floorDiv(timestampMicros, 1_000_000);
  int nanoAdjustment = Math.floorMod(timestampMicros, 1_000_000) * 1000;
  Instant instant = Instant.ofEpochSecond(epochSecond, nanoAdjustment);
  return (int) granularity.between(EPOCH, instant.atOffset(ZoneOffset.UTC));
} else {
  // ...
  long epochSecond = Math.floorDiv(timestampMicros, 1_000_000);
  int nanoAdjustment = Math.floorMod(timestampMicros + 1, 1_000_000) * 1000;
  Instant instant = Instant.ofEpochSecond(epochSecond, nanoAdjustment);
  return (int) granularity.between(EPOCH, instant.atOffset(ZoneOffset.UTC)) - 1;
}

Contributor Author

I decided not to modify this. I think it's a good suggestion, but messing this up with a cut & paste error just would not be worth it 😅.
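The negative branch is the subtle part of that snippet: ChronoUnit.between truncates toward zero, so without the micros + 1 / subtract-one adjustment, a timestamp just before the epoch would land in day 0 instead of day -1. Below is a self-contained version of the day-granularity case following the same floorDiv/floorMod pattern; the class and method names are mine, not Iceberg's.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

public class TimestampDaySketch {
  private static final OffsetDateTime EPOCH = Instant.EPOCH.atOffset(ZoneOffset.UTC);

  // Day ordinal (days from 1970-01-01) for a microsecond timestamp.
  public static int daysFromMicros(long timestampMicros) {
    if (timestampMicros >= 0) {
      Instant instant =
          Instant.ofEpochSecond(
              Math.floorDiv(timestampMicros, 1_000_000),
              Math.floorMod(timestampMicros, 1_000_000) * 1000);
      return (int) ChronoUnit.DAYS.between(EPOCH, instant.atOffset(ZoneOffset.UTC));
    } else {
      // Nudge the sub-second part by one microsecond and subtract a day, so
      // that between(), which truncates toward zero, yields floor semantics.
      Instant instant =
          Instant.ofEpochSecond(
              Math.floorDiv(timestampMicros, 1_000_000),
              Math.floorMod(timestampMicros + 1, 1_000_000) * 1000);
      return (int) ChronoUnit.DAYS.between(EPOCH, instant.atOffset(ZoneOffset.UTC)) - 1;
    }
  }

  public static void main(String[] args) {
    System.out.println(daysFromMicros(0L));              // 0
    System.out.println(daysFromMicros(86_400_000_000L)); // 1  (exactly one day after the epoch)
    System.out.println(daysFromMicros(-1L));             // -1 (one microsecond before the epoch)
  }
}
```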

@rdblue rdblue force-pushed the remove-type-from-transform branch from 97149fb to 7e176fe Compare September 2, 2022 16:40
@rdblue rdblue merged commit 223177f into apache:master Sep 2, 2022
rdblue commented Sep 2, 2022

Thanks for the reviews, @aokolnychyi and @stevenzwu!
