Skip to content

Conversation

@rahul8383
Copy link
Contributor

@rahul8383 rahul8383 commented May 5, 2020

schema.logicaltypes.FixedBytes logical type expects an argument - the length of the byte[].

When an invalid input value (with length < expectedLength) is provided while building the Row with FixedBytes logical type, IllegalArgumentException is expected. But, the Exception is not thrown. The below code illustrates the behaviour:

 Schema schema = Schema.builder().addLogicalTypeField("char", FixedBytes.of(10)).build();
 byte[] byteArray = {1, 2, 3, 4, 5};
 Row row = Row.withSchema(schema).withFieldValue("char", byteArray).build();
 System.out.println(Arrays.toString(row.getLogicalTypeValue("char", byte[].class)));

The above code prints "[1, 2, 3, 4, 5]" with length 5 to the console, whereas the expected length of FixedBytes, is 10.

This PR fixes the issue.
Negative and Positive Test for FixedBytes Logical type are added.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@rahul8383
Copy link
Contributor Author

R: @reuvenlax @TheNeuralBit

@rahul8383
Copy link
Contributor Author

I have a couple of questions regarding the behaviour of logical types:

@iemejia iemejia changed the title [BEAM-9887] Expected Exception when building Row with logical types w… [BEAM-9887] Throw IllegalArgumentException when building Row with logical types with Invalid input May 5, 2020
@iemejia iemejia requested review from TheNeuralBit and reuvenlax May 5, 2020 07:54
@rahul8383 rahul8383 force-pushed the build_row_with_logicalTypes_using_withFieldValue branch from eab9420 to 5687c87 Compare May 5, 2020 12:20
@TheNeuralBit
Copy link
Member

Hm so there are several ways of manually building a Row instance that provide different levels of runtime type-checking. Row#addValues explicitly validates everything, and Row#attachValues explicitly does not, for performance reasons. In SQL we have an option to switch between the two:

Row row =
verifyRowValues
? Row.withSchema(outputSchema).addValues(fieldValues).build()
: Row.withSchema(outputSchema).attachValues(fieldValues);

So we can have runtime type-checking for debugging, but then turn it off for performance.

I'm not sure how withFieldValue is intended to work. I'm not sure if the missing toInputType(toBaseType(value)) for that code path is intentional or an oversight. Can you clarify @reuvenlax?

@rahul8383
Copy link
Contributor Author

rahul8383 commented May 5, 2020

@TheNeuralBit
schemas.logicaltypes.LogicalTypesTest.testFixedBytesIllegalArgument Test will fail even if addValue() method is used instead of withFieldValue() method.

@reuvenlax
Copy link
Contributor

reuvenlax commented May 5, 2020

@TheNeuralBit withFieldValue should replace addValues for most users. addValues is difficult and error prone and withFieldValues allows building a row based on named fields instead of positional fields.

@reuvenlax
Copy link
Contributor

LGTM

@rahul8383
Copy link
Contributor Author

@reuvenlax @TheNeuralBit
Thanks for the review.
Could you please clarify the questions that I have asked above?
I wanted to understand more about logical types for my PR #11581

@TheNeuralBit
Copy link
Member

Will this line ever get hit?

We always convert logical types to their base type when serializing with SchemaCoder, and convert back to the input type when deserializing. Other than that I think the only time it should get called is when constructing a Row instance (unless you use attachValues).

Can we consider that the input value provided is of BaseType, which we can convert to InputType and store in memory?

Would this just be so that we're guaranteed to call toInputType whenever setting a value on Row? This PR accomplishes the same thing right?

Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM other than a couple of minor comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these tests are really checking Row's verification, I think they would be better in RowTest. Could you move them there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the tests to RowTest.java

case in point!
How can I write FixedBytes test which tests the behaviour of appending zeros? To test this behaviour, the input value should have length < expectedLength. But, if the input value's length is less than expected length, an IllegalArgumentException is thrown while building the Row.

@rahul8383
Copy link
Contributor Author

We always convert logical types to their base type when serializing with SchemaCoder, and convert back to the input type when deserializing. Other than that I think the only time it should get called is when constructing a Row instance (unless you use attachValues).

In that case, there is no need to handle this else case right? as we are making sure that the input has expected length while building the Row.


Even if attachValues is used while building the Row and the provided input value is invalid(invalid length), during serialization in SchemaCoder, the input value cannot be converted to base type as it doesn't have expected length and an IllegalArgumentException will be thrown.

Would this just be so that we're guaranteed to call toInputType whenever setting a value on Row? This PR accomplishes the same thing right?

Can we support this feature: depending on the type of the input value provided while building the Row, we can call toInputType(toBaseType(inputValue)) or toInputType(inputValue) i.e. support for providing base value while building the Row. If both the InputType and BaseType are one and the same, we can directly call toInputType(inputValue). I am thinking that this might be helpful for logical types like FixedBytes or FixedLengthString.

@rahul8383 rahul8383 force-pushed the build_row_with_logicalTypes_using_withFieldValue branch from 5687c87 to 7e4e266 Compare May 6, 2020 00:22
@rahul8383 rahul8383 force-pushed the build_row_with_logicalTypes_using_withFieldValue branch from 7e4e266 to 4566de4 Compare May 6, 2020 00:33
@TheNeuralBit
Copy link
Member

In that case, there is no need to handle this else case right? as we are making sure that the input has expected length while building the Row.

checkArgument ensures that the condition is true, so we are rejecting byte arrays that are too short, and accepting anything longer than the fixed size. I don't think there's any capability that pads zeroes for short arrays, this logic just allows longer byte arrays, and copies out the appropriate portion.

@TheNeuralBit
Copy link
Member

Whoops sorry I completely misread that. We accept short byte arrays, and zero-pad them. We do not accept long byte arrays so that we don't have to truncate them.

@TheNeuralBit
Copy link
Member

Ahh ok. I'm sorry for being so dense, I see what you're saying now. In toBaseType we validate that the array's length == the fixed length:

public byte[] toBaseType(byte[] input) {
checkArgument(input.length == byteArraySize);
return input;
}

This is where the exception checked in your new tests is thrown.

So we never actually get into toInputType where we could accept a shorter byte array and zero-pad. And there doesn't seem to be any other way to set a value by base type with a shorter byte array and pad it.

This seems to be a holdover. Previously Row stored logical type values as their base type, so we probably called toBaseType(toInputType(x)).

@TheNeuralBit
Copy link
Member

I'd be +1 for just dropping the padding logic. I don't think it should be the responsibility of the LogicalType to coerce values like this. What do you think @reuvenlax?

@rahul8383
Copy link
Contributor Author

This seems to be a holdover. Previously Row stored logical type values as their base type, so we probably called toBaseType(toInputType(x)).

Even before the code which changed Row to store logical type values instead of base values in memory, while building the row, the input value is expected to be of correct length.

private Object verifyLogicalType(Object value, LogicalType logicalType, String fieldName) {
return verify(logicalType.toBaseType(value), logicalType.getBaseType(), fieldName);
}

I think this issue is present since the introduction of FixedBytes.

@rahul8383
Copy link
Contributor Author

If there are no comments, can we close this PR?

@rahul8383
Copy link
Contributor Author

I'd be +1 for just dropping the padding logic. I don't think it should be the responsibility of the LogicalType to coerce values like this. What do you think @reuvenlax?

@TheNeuralBit @reuvenlax I have taken care of this for FixedBytes and other standard logical types in #11581

@TheNeuralBit TheNeuralBit merged commit b0932f2 into apache:master May 11, 2020
@rahul8383
Copy link
Contributor Author

cc: @ibzib

@ibzib can this fix be cherry-picked for 2.21.0 release?

yirutang pushed a commit to yirutang/beam that referenced this pull request Jul 23, 2020
…ical types with Invalid input (apache#11609)

* [BEAM-9887] Expected Exception when building Row with logical types with Invalid input

* Fix failed BeamComplexTypeTest.testNullDatetimeFields Test to handle null values
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants