[BEAM-7274] Implement the Protobuf schema provider #8690
Closed
alexvanboxel wants to merge 2 commits into apache:master from alexvanboxel:feature/BEAM-7274-proto-schema
114 changes: 114 additions & 0 deletions
sdks/java/core/src/main/java/org/apache/beam/sdk/values/RowWithGettersCachedCollections.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,114 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| package org.apache.beam.sdk.values; | ||
|
|
||
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.Objects; | ||
| import javax.annotation.Nullable; | ||
| import org.apache.beam.sdk.schemas.Factory; | ||
| import org.apache.beam.sdk.schemas.FieldValueGetter; | ||
| import org.apache.beam.sdk.schemas.Schema; | ||
| import org.apache.beam.sdk.schemas.Schema.FieldType; | ||
| import org.apache.beam.sdk.schemas.Schema.TypeName; | ||
| import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Lists; | ||
| import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Maps; | ||
|
|
||
| /** | ||
| * A concrete subclass of {@link Row} that delegates to a set of provided {@link FieldValueGetter}s. | ||
| * This is a special version of {@link RowWithGetters} that caches the map and list collections. | ||
| * | ||
| * <p>This allows us to have {@link Row} objects for which the actual storage is in another object. | ||
| * For example, the user's type may be a POJO, in which case the provided getters will simply read | ||
| * the appropriate fields from the POJO. | ||
| */ | ||
| public class RowWithGettersCachedCollections extends RowWithGetters { | ||
| private final Map<Integer, List> cachedLists = Maps.newHashMap(); | ||
| private final Map<Integer, Map> cachedMaps = Maps.newHashMap(); | ||
|
|
||
| RowWithGettersCachedCollections( | ||
| Schema schema, Factory<List<FieldValueGetter>> getterFactory, Object getterTarget) { | ||
| super(schema, getterFactory, getterTarget); | ||
| } | ||
|
|
||
| private List getListValue(FieldType elementType, Object fieldValue) { | ||
| Iterable iterable = (Iterable) fieldValue; | ||
| List<Object> list = Lists.newArrayList(); | ||
| for (Object o : iterable) { | ||
| list.add(getValue(elementType, o, null)); | ||
| } | ||
| return list; | ||
| } | ||
|
|
||
| private Map<?, ?> getMapValue(FieldType keyType, FieldType valueType, Map<?, ?> fieldValue) { | ||
| Map returnMap = Maps.newHashMap(); | ||
| for (Map.Entry<?, ?> entry : fieldValue.entrySet()) { | ||
| returnMap.put( | ||
| getValue(keyType, entry.getKey(), null), getValue(valueType, entry.getValue(), null)); | ||
| } | ||
| return returnMap; | ||
| } | ||
|
|
||
| @SuppressWarnings({"TypeParameterUnusedInFormals", "unchecked"}) | ||
| @Override | ||
| protected <T> T getValue(FieldType type, Object fieldValue, @Nullable Integer cacheKey) { | ||
| if (type.getTypeName().equals(TypeName.ROW)) { | ||
| return (T) | ||
| new RowWithGettersCachedCollections( | ||
| type.getRowSchema(), fieldValueGetterFactory, fieldValue); | ||
| } else if (type.getTypeName().equals(TypeName.ARRAY)) { | ||
| return cacheKey != null | ||
| ? (T) | ||
| cachedLists.computeIfAbsent( | ||
| cacheKey, i -> getListValue(type.getCollectionElementType(), fieldValue)) | ||
| : (T) getListValue(type.getCollectionElementType(), fieldValue); | ||
| } else if (type.getTypeName().equals(TypeName.MAP)) { | ||
| Map map = (Map) fieldValue; | ||
| return cacheKey != null | ||
| ? (T) | ||
| cachedMaps.computeIfAbsent( | ||
| cacheKey, i -> getMapValue(type.getMapKeyType(), type.getMapValueType(), map)) | ||
| : (T) getMapValue(type.getMapKeyType(), type.getMapValueType(), map); | ||
| } else { | ||
| return (T) fieldValue; | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object o) { | ||
| if (this == o) { | ||
| return true; | ||
| } | ||
| if (o == null) { | ||
| return false; | ||
| } | ||
| if (o instanceof RowWithGettersCachedCollections) { | ||
| RowWithGettersCachedCollections other = (RowWithGettersCachedCollections) o; | ||
| return Objects.equals(getSchema(), other.getSchema()) | ||
| && Objects.equals(getterTarget, other.getterTarget); | ||
| } else if (o instanceof Row) { | ||
| return super.equals(o); | ||
| } | ||
| return false; | ||
| } | ||
|
|
||
| @Override | ||
| public int hashCode() { | ||
| return Objects.hash(getSchema(), getterTarget); | ||
| } | ||
| } |
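The per-field caching in `getValue` above can be tried in isolation. The following is a minimal sketch, assuming nothing from Beam: `CachedListSketch`, `getList`, and the `conversions` counter are hypothetical names used only for illustration. It mirrors the pattern in the diff, where a non-null cache key triggers `computeIfAbsent` and a null key converts without caching.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for the caching idea in the diff; not a Beam class. */
final class CachedListSketch {
  // Converted lists, keyed by field index, as in cachedLists above.
  private final Map<Integer, List<Object>> cachedLists = new HashMap<>();
  private final Function<Object, Object> elementConverter;
  // Counts full list conversions so the effect of the cache is observable.
  int conversions = 0;

  CachedListSketch(Function<Object, Object> elementConverter) {
    this.elementConverter = elementConverter;
  }

  // Converts every element eagerly into a new list.
  private List<Object> convert(Iterable<?> raw) {
    conversions++;
    List<Object> out = new ArrayList<>();
    for (Object o : raw) {
      out.add(elementConverter.apply(o));
    }
    return out;
  }

  /** A null cacheKey means "do not cache", as for nested elements in the diff. */
  List<Object> getList(Iterable<?> raw, Integer cacheKey) {
    return cacheKey != null
        ? cachedLists.computeIfAbsent(cacheKey, k -> convert(raw))
        : convert(raw);
  }
}
```

With a cache key, repeated reads of the same field return the same list instance and convert only once; without one, every read converts again. Like the `HashMap` fields in the diff, this sketch is not thread-safe.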
I'm having trouble understanding what this is for. Can you explain?
OK, maybe the name could be better, but it means that the FieldValueGetters also handle collections like ARRAY, ROW, and MAP. If you look at the original implementation ( https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/RowWithGetters.java#L90 ), you'll see that RowWithGetters handles ARRAY, ROW, and MAP naively; for protobuf you need more context (the descriptor) to handle them. That's why I need to disable the naive mapping and let the FieldValueGetters handle ARRAY, ROW, and MAP.
Feel free to suggest a better name though.
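To make the distinction concrete, here is a minimal sketch of a row that either copies collections itself or trusts the getter's result. Everything in it is hypothetical and independent of Beam: `GetterRowSketch` and its fields are illustrative names, and the flag name `collectionHandledByGetter` is borrowed from this thread rather than from any confirmed API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch; the names are illustrative and are not Beam APIs. */
final class GetterRowSketch {
  private final List<Function<Object, Object>> getters;
  private final Object target;
  private final boolean collectionHandledByGetter;

  GetterRowSketch(
      List<Function<Object, Object>> getters,
      Object target,
      boolean collectionHandledByGetter) {
    this.getters = getters;
    this.target = target;
    this.collectionHandledByGetter = collectionHandledByGetter;
  }

  Object getValue(int fieldIdx) {
    Object value = getters.get(fieldIdx).apply(target);
    if (!collectionHandledByGetter && value instanceof Iterable) {
      // Naive path: the row copies elements without source-specific context.
      List<Object> copy = new ArrayList<>();
      for (Object o : (Iterable<?>) value) {
        copy.add(o);
      }
      return copy;
    }
    // Getter-handled path: the getter already produced the final value,
    // for example by using context such as a protobuf descriptor.
    return value;
  }
}
```

In the getter-handled mode the row returns exactly what the getter produced, so any context-dependent conversion stays inside the getter.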
@reuvenlax is this the last remaining concern you have for this PR?
#8690 (comment) is a good reference for the motivation of this as well.
I'm not crazy about the way this is implemented since it's adding state to `RowWithGetters` that will get checked every time a collection field is accessed. I can't think of a better way to do it without some non-trivial refactoring though. Some ideas:
- Instead of adding `collectionHandledByGetter` in `RowWithGetters` and changing behavior based on it, have two alternate `RowWithGetters` implementations, one with the special handling for collections and one without. I think this is still an improvement over the separate `ProtoRow` class that was rejected since it's not explicitly tied to a particular `SchemaProvider`; in fact it sounds like it could be re-used for Avro `GenericRecord` instances.
- Move the special handling into the `SchemaProvider`/`FieldValueGetter` implementations that need it, and make `RowWithGetters` always behave as if `collectionHandledByGetter` is true. I think this could be a cleaner approach? But it's challenging because that special logic stores cached values that wouldn't be appropriate to move into the `FieldValueGetter` implementations. Maybe a `SchemaProvider` could have some way to indicate that a certain row is expensive to access and should be cached in `RowWithGetters`?
I've refactored based on your input: I've created a RowWithGettersCachedCollections class that inherits from RowWithGetters. The cached variant is the default.
Can you help me understand this a bit more? Why does it not work to cache lists for protocol buffers? We saw repeated array conversion to be a big problem (which is why we cache them). I'm wondering if we could instead cache a lazy array like we do with iterables.
I'll take a closer look at this code to figure it out.
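One possible reading of the "lazy array" idea is a list view that converts each element on first access and remembers the result. The sketch below assumes a source list that is not modified after wrapping; `LazyListSketch` is a hypothetical name and is not how Beam implements its lazy iterables.

```java
import java.util.AbstractList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical lazy, memoizing list view; not a Beam class. */
final class LazyListSketch<S, T> extends AbstractList<T> {
  private final List<S> source;
  private final Function<S, T> converter;
  // Converted elements, filled in on demand.
  private final Object[] converted;
  // Tracks which slots are filled, since a converted value may be null.
  private final boolean[] done;

  LazyListSketch(List<S> source, Function<S, T> converter) {
    this.source = source;
    this.converter = converter;
    this.converted = new Object[source.size()];
    this.done = new boolean[source.size()];
  }

  @Override
  @SuppressWarnings("unchecked")
  public T get(int index) {
    if (!done[index]) {
      // Convert on first access only; later reads reuse the stored value.
      converted[index] = converter.apply(source.get(index));
      done[index] = true;
    }
    return (T) converted[index];
  }

  @Override
  public int size() {
    return source.size();
  }
}
```

Caching such a view per field would avoid both repeated conversion and converting elements that are never read, at the cost of holding a reference to the source list.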
Should we make a ticket out of this, or is this blocking?