Skip to content

Conversation

@omarismail94
Copy link
Contributor

@omarismail94 omarismail94 commented Nov 24, 2020

Please add a meaningful description for your change here

Replace Charset.defaultCharset() with StandardCharsets.UTF_8 so that reliance on encoding set in locale is not used.

The defaultCharset() method relies on underlying OS system default.

Files that are NOT test files that are being changed:

  • runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPortableClientEntryPoint.java
  • runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/logging/JulHandlerPrintStreamAdapterFactory.java
  • sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java
  • sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/GceMetadataUtil.java
  • sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/KeyPairUtils.java
  • sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/crosslanguage/ReadBuilder.java
  • sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/services/SnowflakeBatchServiceImpl.java

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@omarismail94
Copy link
Contributor Author

omarismail94 commented Nov 24, 2020

R: @TheNeuralBit @nielsbasjes

this.buffer = new StringBuilder();
this.decoder =
Charset.defaultCharset()
StandardCharsets.UTF_8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scwhittle I was wondering if you could comment on this since you added this line in #11096. Do you know if this change is safe? Is there a reason this needs to use defaultCharset?

As it is, testLogRawBytes below can fail if the system encoding isn't UTF-8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is going to be used to replace the default System.out/System.err which appears to be specified to use the default charset.
So it seems better to keep it consistent IMO to avoid surprises when someone expects that charset writing to System.out, not UTF-8. If the test is incorrect, it should be fixed (by examining or setting the defaultCharset)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks, I thought that may be the case

How about this - we can make this class (the private constructor and the create() method below) parameterized by Charset. The places where its used as the default System.out/System.err in the Dataflow worker can pass in Charset.defaultCharset(), but the test can pass in Charsets.UTF_8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scwhittle does that work for you? I can make the change if so!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go ahead and do that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was suggesting is that we should add a new argument to the private constructor and the static create() method: Charset charset. Then in the constructor we'll use charset instead of Charset.defaultCharset() to create encoder.

Currently the test is using StandardCharsets.UTF_8, but only to create the expected outputs. It also needs to create a JulHandlerPrintStreamAdapterFactory that uses UTF_8 explicitly, so we need a way to pass that through. I think as it is this test could still fail if the default charset isn't UTF-8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, OK! And because the constructor is private, it can only be accessed from create() method inside the class, meaning this is the only file I have to change 🤯

Ill give this a shot!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you'll also want to add an argument to create(), and update the places that call it, for example in the test:

Would become return JulHandlerPrintStreamAdapterFactory.create(handler, LOGGER_NAME, Level.INFO, StandardCharsets.UTF_8);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheNeuralBit I had to create a new PR for this as this got merged:

New PR: #13701

@nielsbasjes
Copy link
Contributor

A few weeks ago a colleague of mine mentioned he ran into a character encoding issue when building a Beam application.
I just spoke him and he pointed me to this comment that simply documents what he ran into: https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L335

// We use Woodstox because the StAX implementation provided by OpenJDK reports
// character locations incorrectly. Note that Woodstox still currently reports *byte*
// locations incorrectly when parsing documents that contain multi-byte characters.

Apparently even with this fix in place the project as a whole still has character encoding problems.

@TheNeuralBit
Copy link
Member

Apparently even with this fix in place the project as a whole still has character encoding problems.

That issue is very much constrained to XmlIO, let's not throw out the whole project over it ;)

That comment dates back almost to the original contribution of Beam as the Dataflow SDK! We're using woodstox 4.4.1 which came out in Sep 2014. Maybe this is a simple as upgrading woodstox? It looks like BEAM-10883 is tracking that issue, I'll comment over there.

@iemejia
Copy link
Member

iemejia commented Dec 24, 2020

I wonder if we could add a rule to avoid this use in the future maybe via checkstyle, otherwise we will be fixing this continuously.

@TheNeuralBit
Copy link
Member

@iemejia We do have a rule that stops you from using APIs that use the default charset implicitly (e.g. String.getBytes()), but it doesn't have a problem with the explicit String.getBytes(Charset.defaultCharset()). Maybe we could have a rule that blocks even Charset.defaultCharset() and require it to be suppressed

@iemejia
Copy link
Member

iemejia commented Dec 24, 2020

Yes it seems a bit radical but this PR proves that the issue is around and I remember to have fixed this too in the past.

@omarismail94
Copy link
Contributor Author

I wonder if we could add a rule to avoid this use in the future maybe via checkstyle, otherwise we will be fixing this continuously.

I don't know how checkstyle works, but I can try to figure it out!

Worst case is just brute-forcing current occurrences of Charset.defaultCharset()

@TheNeuralBit
Copy link
Member

I don't know how checkstyle works, but I can try to figure it out!

If you have the bandwidth it would be great to do it as part of this PR! It could be left as a follow-up if you don't though.

We'd probably want to add another entry like this one, but with a regex for Charset.defaultCharset():

-->
<module name="RegexpSinglelineJava">
<property name="id" value="ForbidNonVendoredGuava"/>
<property name="format" value="(\scom\.google\.common\.(?!testing))|(\scom\.google\.thirdparty)"/>
<property name="severity" value="error"/>
<property name="message" value="You are using raw guava, please use vendored guava classes."/>
</module>

You can test it out locally by running ./gradlew checkstyleMain checkstyleTest

Since there are some rare cases where we need the default charset we'll need to be able to suppress the check. Based on https://stackoverflow.com/questions/27688426/ignoring-of-checkstyle-warnings-with-annotation-suppresswarnings it looks like this can be done with @SuppressWarnings

@omarismail94
Copy link
Contributor Author

Thanks Brian, I'll give it a shot!

@omarismail94
Copy link
Contributor Author

retest this please

@omarismail94
Copy link
Contributor Author

retest this please

@iemejia
Copy link
Member

iemejia commented Jan 5, 2021

Run Java PreCommit

1 similar comment
@TheNeuralBit
Copy link
Member

Run Java PreCommit

@TheNeuralBit
Copy link
Member

@omarismail94
Copy link
Contributor Author

Third time the charm!

private byte[] carryOverByteArray;

@SuppressWarnings({
"unchecked" // [BEAM-11327] Replace Charset.defaultCharset() with StandardCharsets.UTF_8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be ForbidDefaultCharset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Changed to Suppress ForbidDefaultCharset checkstyle

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's concerning that the precommit didn't break when this was suppressing the wrong warning. Have you seen the new checkstyle rule break checkstyleMain or checkstyleTest?

I tried to get it to fail locally, but even with the SuppressWarnings removed its not triggering on Charset.defaultCharset

Copy link
Contributor Author

@omarismail94 omarismail94 Jan 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaaah, I finally found out what causes that. checkstyleMain.enabled is set to false in the build.gradle for the worker project. I set it to true, then removed the SuppressionWarning, and when I run checkstyleMain, I can see the rule gets triggered. I also didn't know you can set the ID used as the label in, SuppressionWarnings, so I fixed that as well!

If you want to test, I think if you set the flag to true in the build.gradle file, and comment out the SuppressionWarnings("ForbidDefaultCharset"), then run the checkstyleMain, it should work

@omarismail94
Copy link
Contributor Author

Run Python_PVR_Flink PreCommit

Copy link
Member

@iemejia iemejia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM All green I am merging now to get this finally in. If there are any extra comments @TheNeuralBit or the others please note them below.

@iemejia iemejia merged commit d2f76f9 into apache:master Jan 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants