[BEAM-11327] Replace Charset.defaultCharset() with StandardCharsets.UTF_8 #13410
this.buffer = new StringBuilder();
this.decoder =
- Charset.defaultCharset()
+ StandardCharsets.UTF_8
@scwhittle I was wondering if you could comment on this since you added this line in #11096. Do you know if this change is safe? Is there a reason this needs to use defaultCharset?
As it is, testLogRawBytes below can fail if the system encoding isn't UTF-8.
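To illustrate the concern, here is a minimal sketch of how the same bytes decode differently under two charsets. The class name `DefaultCharsetPitfall` and the sample string are made up for illustration, and `ISO_8859_1` merely stands in for whatever non-UTF-8 default a given machine might have; this is not the Beam test itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Beam.
final class DefaultCharsetPitfall {
  // Decodes bytes with an explicitly supplied charset.
  static String decode(byte[] bytes, Charset charset) {
    return new String(bytes, charset);
  }

  public static void main(String[] args) {
    // "h\u00e9llo" contains one non-ASCII character (e with acute accent).
    String original = "h\u00e9llo";
    byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

    // Decoding with the charset used for encoding round-trips the text.
    System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8).equals(original));

    // Assumption: ISO_8859_1 plays the role of a non-UTF-8 platform default.
    // The two-byte UTF-8 sequence is read as two separate characters.
    System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1).equals(original));
  }
}
```

A test that builds its expected output with UTF-8 while the code under test decodes with the platform default can hit the second case on a machine whose default is not UTF-8.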
This class is going to be used to replace the default System.out/System.err, which appear to be specified to use the default charset.
So it seems better to keep it consistent, IMO, to avoid surprises when someone writing to System.out expects the default charset rather than UTF-8. If the test is incorrect, it should be fixed (by examining or setting the defaultCharset).
OK thanks, I thought that may be the case.
How about this: we can make this class (the private constructor and the create() method below) parameterized by Charset. The places where it's used as the default System.out/System.err in the Dataflow worker can pass in Charset.defaultCharset(), but the test can pass in Charsets.UTF_8.
@scwhittle does that work for you? I can make the change if so!
Let's go ahead and do that
What I was suggesting is that we should add a new argument to the private constructor and the static create() method: Charset charset. Then in the constructor we'll use charset instead of Charset.defaultCharset() to create the decoder.
Currently the test is using StandardCharsets.UTF_8, but only to create the expected outputs. It also needs to create a JulHandlerPrintStreamAdapterFactory that uses UTF_8 explicitly, so we need a way to pass that through. I think that, as it is, this test could still fail if the default charset isn't UTF-8.
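The suggestion above can be sketched roughly as follows. `CharsetAwareAdapter` is a hypothetical stand-in, not the real JulHandlerPrintStreamAdapterFactory; the `decode` helper and the choice of `CodingErrorAction.REPLACE` are assumptions made only to keep the example self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical stand-in for a class whose decoder charset is chosen by the caller.
final class CharsetAwareAdapter {
  private final CharsetDecoder decoder;

  // Private constructor: the charset is passed in instead of
  // being read from Charset.defaultCharset().
  private CharsetAwareAdapter(Charset charset) {
    this.decoder =
        charset
            .newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
  }

  // Static factory: the only way to construct the adapter.
  static CharsetAwareAdapter create(Charset charset) {
    return new CharsetAwareAdapter(charset);
  }

  // Decodes a complete byte array with the configured charset.
  String decode(byte[] bytes) {
    try {
      return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    } catch (CharacterCodingException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

Under this shape, production code that replaces System.out/System.err could call `create(Charset.defaultCharset())`, while a test could call `create(StandardCharsets.UTF_8)` so its expectations do not depend on the machine it runs on.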
Oooh, OK! And because the constructor is private, it can only be accessed from the create() method inside the class, meaning this is the only file I have to change 🤯
I'll give this a shot!
Well you'll also want to add an argument to create(), and update the places that call it, for example in the test (line 168 in df74d74):
return JulHandlerPrintStreamAdapterFactory.create(handler, LOGGER_NAME, Level.INFO);
would become
return JulHandlerPrintStreamAdapterFactory.create(handler, LOGGER_NAME, Level.INFO, StandardCharsets.UTF_8);
Will do, thanks!
@TheNeuralBit I had to create a new PR for this as this got merged:
New PR: #13701
A few weeks ago a colleague of mine mentioned he ran into a character encoding issue when building a Beam application. Apparently even with this fix in place the project as a whole still has character encoding problems.
That issue is very much constrained to XmlIO, let's not throw out the whole project over it ;) That comment dates back almost to the original contribution of Beam as the Dataflow SDK! We're using woodstox 4.4.1, which came out in Sep 2014. Maybe this is as simple as upgrading woodstox? It looks like BEAM-10883 is tracking that issue, I'll comment over there.
I wonder if we could add a rule to avoid this use in the future, maybe via checkstyle; otherwise we will be fixing this continuously.
@iemejia We do have a rule that stops you from using APIs that use the default charset implicitly (e.g. |
Yes, it seems a bit radical, but this PR proves that the issue is around, and I remember having fixed this too in the past.
I don't know how checkstyle works, but I can try to figure it out! Worst case is just brute-forcing current occurrences of |
If you have the bandwidth it would be great to do it as part of this PR! It could be left as a follow-up if you don't though. We'd probably want to add another entry like this one, but with a regex for beam/sdks/java/build-tools/src/main/resources/beam/checkstyle.xml Lines 114 to 120 in 245cf2b
You can test it out locally by running Since there are some rare cases where we need the default charset we'll need to be able to suppress the check. Based on https://stackoverflow.com/questions/27688426/ignoring-of-checkstyle-warnings-with-annotation-suppresswarnings it looks like this can be done with |
Thanks Brian, I'll give it a shot!
retest this please
retest this please
Run Java PreCommit
Run Java PreCommit
https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/2953/ flake looks like BEAM-8101
Third time the charm!
private byte[] carryOverByteArray;

@SuppressWarnings({
"unchecked" // [BEAM-11327] Replace Charset.defaultCharset() with StandardCharsets.UTF_8
Shouldn't this be ForbidDefaultCharset?
Yes! Changed to Suppress ForbidDefaultCharset checkstyle
It's concerning that the precommit didn't break when this was suppressing the wrong warning. Have you seen the new checkstyle rule break checkstyleMain or checkstyleTest?
I tried to get it to fail locally, but even with the SuppressWarnings removed it's not triggering on Charset.defaultCharset.
Aaaah, I finally found out what causes that. checkstyleMain.enabled is set to false in the build.gradle for the worker project. I set it to true, then removed the SuppressWarnings, and when I run checkstyleMain, I can see the rule gets triggered. I also didn't know you can set the ID used as the label in SuppressWarnings, so I fixed that as well!
If you want to test, I think if you set the flag to true in the build.gradle file, comment out the SuppressWarnings("ForbidDefaultCharset"), and then run checkstyleMain, it should work.
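For reference, a suppressed use of the default charset might look like the sketch below. The class and method names are hypothetical. Whether checkstyle honors the annotation depends on the project's checkstyle configuration (annotation-based suppression has to be enabled, and the rule has to carry the ID `ForbidDefaultCharset`), which is assumed here rather than shown; javac itself ignores annotation values it does not recognize.

```java
import java.nio.charset.Charset;

// Hypothetical example class, not taken from the Beam code base.
final class ConsoleCharset {
  // Intentional use of the platform default charset. The annotation value is
  // the checkstyle rule ID discussed in this thread (assumed configuration).
  @SuppressWarnings("ForbidDefaultCharset")
  static Charset forConsole() {
    return Charset.defaultCharset();
  }
}
```

Keeping the annotation on the narrowest possible element (a single method or field) limits how much code escapes the check.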
Run Python_PVR_Flink PreCommit
iemejia
left a comment
LGTM. All green, so I am merging now to finally get this in. If there are any extra comments, @TheNeuralBit or the others, please note them below.
Replace Charset.defaultCharset() with StandardCharsets.UTF_8 so that the code does not rely on the encoding set in the locale. The defaultCharset() method relies on the underlying OS default.
Files that are NOT test files that are being changed:
runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPortableClientEntryPoint.java
runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/logging/JulHandlerPrintStreamAdapterFactory.java
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java
sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/GceMetadataUtil.java
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/KeyPairUtils.java
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/crosslanguage/ReadBuilder.java
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/services/SnowflakeBatchServiceImpl.java