Skip to content

Conversation

@scwhittle
Copy link
Contributor

…he original System.err

Currently such errors are logged to System.err which is a PrintStream that
publishes to the handler. This is perhaps unlikely to work if earlier publishing
failed and additionally removes a potential deadlock between the PrintStream
object sychronization and the Handler object synchronization. This was attempted
to be fixed earlier by dissallowing the PrintStream object to be synchronized
when calling into the handler. However this is possible to be triggered by
external synchronization on the PrintStream, such as that performed by
Throwable.printStackTrace. Changing the PrintStream to use separate synchronization
for buffering works in most cases but not for cases where the stream is externally
synchronized.

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status --- Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@scwhittle
Copy link
Contributor Author

@lukecwik Luke can you review as you have context from last time?

Changing to synchronizing on buffer instead of the PrintStream changes the precondition to just enforce the invariant on when publish is called within our custom PrintStream. That means that the original deadlock can still occur if this happens:

T1: synchronizes on System.err (Throwable.printStackTrace for example), publishes to handler
T2: synchronized within handler, tries to report error to System.err using Throwable.printStackTrace which synchronizes on the PrintStream

I removed that case by using the custom ErrorManager to print to the original stderr stream, an alternative would be to change the ErrorManager still use our custom PrintStream but remove use of Throwable.printStackTrace so as not to sychronize on the PrintStream.

@kennknowles
Copy link
Member

I believe the trouble here is that it mixes all log levels on stderr, while the log reporting in the UI reports all stderr lines at the same severity. I cannot recall if they are all INFO or all WARN in the UI, but either way it is incorrect and confusing. So we need the corresponding change to make sure that doesn't happen. Just drive by comment because I've encountered this.

@scwhittle
Copy link
Contributor Author

I am only changing how errors from within the DataflowWorkerLoggingHandler itself are reported, for example an error publishing to stackdriver. I agree that logging to stderr is difficult to view in the UI, but that seems separate from the deadlock we originally fixed (the Jira has more details) and the erroneous precondition introduced when that was added. Kenn, does the limited scope of this change address your concerns? I am not changing it to use the original System.err generally

@lukecwik
Copy link
Member

@scwhittle
I see how the deadlock occurs but why introduce the synchronized(buffer) if we are swapping out the error manager?

Also, wouldn't we solve the locking problem if we always held the same "flush" lock when interacting with the error manager?
We could do this by sharing a single lock object across the error manager, print stream, ...

@lukecwik
Copy link
Member

@kennknowles system.err is always intended to log at ERROR level. Are you saying that logging is being misconfigured somehow and logs for other severities are being sent to system.err?

@kennknowles
Copy link
Member

@lukecwik that is how I feel about System.err as well. Java disagrees and writes all logs to stderr unless you disconnect that. IIRC I discovered through some other work that it is not disconnected here.

@kennknowles
Copy link
Member

Had a chat and we confirmed that the worker logging setup already handles my concern appropriately. Please disregard all comments.

@scwhittle
Copy link
Contributor Author

@lukecwik I think to have a single lock, we would need to use the PrintStream lock itself, as that is what is synchronized on by Throwable.printStackTrace and which can be synchronize outside and inside other synchronization blocks if we choose another lock to guard the buffer, ErrorManager, handler.
However using the PrintStream as a lock within the DataflowWorkerLoggingHandler seemed perhaps error prone as that extends a Handler which might have it's own synchronized methods (on itself not the PrintStream) or callers that synchronize on the Handler (and not on the PrintStream).
Additionally it seemed a bit gross since there is not always a PrintStream for a handler, though we could just allow injecting a lock Object into the handler to hide that.
If you think that approach is still preferrable, I can put that together.

Regarding on why to synchronize on buffer, that allows us to keep the preconditioncheck to sanity check our implementation. But that seems overkill, so I will just remove it.

@aaltay
Copy link
Member

aaltay commented Sep 24, 2020

What is the next step for this PR?

@scwhittle
Copy link
Contributor Author

scwhittle commented Sep 25, 2020 via email

…he original System.err

Currently such errors are logged to System.err which is a PrintStream that
publishes to the handler.  This is perhaps unlikely to work if earlier publishing
failed and additionally removes a potential deadlock between the PrintStream
object sychronization and the Handler object synchronization. This was attempted
to be fixed earlier by dissallowing the PrintStream object to be synchronized
when calling into the handler.  However this is possible to be triggered by
external synchronization on the PrintStream, such as that performed by
Throwable.printStackTrace. Changing the PrintStream to use separate synchronization
for buffering works in most cases but not for cases where the stream is externally
synchronized.
@scwhittle
Copy link
Contributor Author

@lukecwik PTAL, see above for why I think having a single lock would be hard/possibly error-prone. Let me know if you think it's worth pursuing that.

@lukecwik
Copy link
Member

lukecwik commented Oct 2, 2020

Run Java PreCommit

@lukecwik
Copy link
Member

lukecwik commented Oct 2, 2020

We can leave as is since the migration to using the Java SDK harness is upcoming and the Java SDK harness doesn't override system.out/system.err due to the long list of issues we have hit with this override to capture system.out/system.err logs in Dataflow.

Copy link
Member

@lukecwik lukecwik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please work with @robinyqiu to get this cherry picked into the 2.25 release branch.

@aaltay
Copy link
Member

aaltay commented Oct 2, 2020

Please work with @robinyqiu to get this cherry picked into the 2.25 release branch.

Should https://issues.apache.org/jira/browse/BEAM-9399 be marked as a release blocker?

@lukecwik lukecwik merged commit 0113e4b into apache:master Oct 2, 2020
@scwhittle scwhittle deleted the logging branch December 8, 2020 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants