Skip to content

Conversation

@sollhui
Copy link
Contributor

@sollhui sollhui commented Feb 19, 2024

Bug

When replay journal, FE meets NullPointerException and crash.

Operation Type 201
java.lang.NullPointerException: null at org.apache.doris.load.routineload.RoutineLoadManager.replayChangeRoutineLoadJob(RoutineLoadManager.java:726) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:669) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.catalog.Env.replayJournal(Env.java:2513) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.catalog.Env.transferToMaster(Env.java:1285) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.catalog.Env.access$1200(Env.java:282) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.catalog.Env$4.runOneCycle(Env.java:2429) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.2-SNAPSHOT] 

Why this happened

It may cause the editLog out of order in the following scenarios:
thread A: create job and record job meta
thread B: change job state and persist in editlog according to meta
thread A: persist in editlog

which will cause the null pointer exception when replaying editLog

How to solve this problem in this pr

Use the lock to make sure the record meta in memory and write editlog is atomicity, when pausing or stopping the job, these two states must be completed simultaneously when the job is visible.

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

@sollhui
Copy link
Contributor Author

sollhui commented Feb 19, 2024

run buildall

@sollhui
Copy link
Contributor Author

sollhui commented Feb 19, 2024

run buildall

dataroaring
dataroaring previously approved these changes Feb 19, 2024
Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 19, 2024
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@sollhui
Copy link
Contributor Author

sollhui commented Feb 19, 2024

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Feb 19, 2024
Copy link
Contributor

@liaoxin01 liaoxin01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 19, 2024
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@xiaokang xiaokang added the p0_b label Feb 19, 2024
Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BitoAgent
Copy link

PR Run Status

  • AI Based Review: Successful
  • Static Analysis: Failure - Failed to execute static code analysis using FBinfer.

PR Analysis

  • Main theme: Implementing thread-safety in routine load job operations to avoid editLog out of order issues.
  • PR summary: This PR addresses a critical bug where concurrent updates to routine load jobs could lead to a NullPointerException due to out-of-order editLog entries. By introducing locks around job state changes, the PR aims to ensure atomicity in job metadata recording and editLog persistence, thus preventing the crash.
  • Type of PR: Bug fix
  • Score: 85
  • Relevant tests added: No
  • Estimated effort to review: 3
    The changes are straightforward with the introduction of locking mechanisms around critical sections. However, the impact of these changes on performance and potential deadlock scenarios needs careful consideration.

PR Feedback

  • General suggestions: The approach taken to address the concurrency issue by adding locks around state-changing operations is sound. However, it's crucial to consider the performance implications of introducing locks, especially in high-throughput environments. Additionally, the PR lacks tests that specifically cover the scenarios of concurrent job updates, which are essential to ensure the robustness of the fix. It would be beneficial to include performance benchmarks and tests that simulate concurrent updates to validate the solution comprehensively.

Code feedback

file: fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java

  • Suggestions:
  1. Consider using more granular locks or lock-free mechanisms to mitigate potential performance bottlenecks. [important] Relevant line:In fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java at line 317
+ readLock();
  1. Add specific unit tests to simulate concurrent job updates and ensure that the locking mechanism effectively prevents out-of-order editLogs. [important] Relevant line:In fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java at line 317
+ readLock();
  1. Ensure that all paths that acquire locks are paired with a finally block to guarantee the release of the lock, preventing potential deadlocks. [important] Relevant line:In fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java at line 317
+ readLock();
  1. Evaluate the necessity of read-write locks versus mutual exclusion locks based on the actual read/write ratio to optimize concurrency control. [medium] Relevant line:In fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java at line 317
+ readLock();
  1. Review and possibly increase the test coverage for the RoutineLoadManager class to include scenarios that were not previously considered, such as concurrent access and potential race conditions. [important] Relevant line:In fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java at line 317
+ readLock();
  1. Consider the impact of locking on the overall system performance, especially if the routine load job operations are frequent and concurrent. Profiling and load testing should be conducted to assess the impact. [medium] Relevant line:In fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java at line 317
+ readLock();
  1. Investigate alternative concurrency control mechanisms that could offer better performance or simplicity, such as optimistic concurrency controls or using concurrent data structures. [medium] Relevant line:In fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java at line 317
+ readLock();

mongo360 pushed a commit to mongo360/doris that referenced this pull request Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.0.5-merged p0_b reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants