Conversation

@lhotari (Member) commented Apr 7, 2022

Fixes #15151

Motivation and modifications

The #15031 changes fix a race condition where the lastMarkDeleteEntry field could be updated back to a previous value.

While looking into the details and attempting to add a unit test that reproduces the issue, I came up with additional changes to ensure the consistency of the persistentMarkDeletePosition field.

It seems that the current handling of persistentMarkDeletePosition includes a guard to ensure that the value doesn't move backwards:

if (persistentMarkDeletePosition == null
        || mdEntry.newPosition.compareTo(persistentMarkDeletePosition) > 0) {
    persistentMarkDeletePosition = mdEntry.newPosition;
}

This is wrong since the meaning of the field is not what one would expect it to be.
This PR fixes that issue.
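
For illustration, here is a minimal toy sketch (simplified positions as longs, not Pulsar's actual code, and an assumption about the intended semantics rather than the exact change in this PR) contrasting the forward-only guard with a plain assignment that keeps the field equal to the position that was last actually persisted:

class PersistedPositionPolicies {
    // Existing behavior: never move backwards, so the field can go stale after a cursor reset.
    static long guardedUpdate(long current, long justPersisted) {
        return justPersisted > current ? justPersisted : current;
    }

    // Behavior argued for above: track whatever position was actually persisted last.
    static long directUpdate(long current, long justPersisted) {
        return justPersisted;
    }

    public static void main(String[] args) {
        long previouslyPersisted = 100;   // cursor previously persisted position 100
        long persistedAfterReset = 10;    // cursor was reset backwards and position 10 was persisted
        System.out.println(guardedUpdate(previouslyPersisted, persistedAfterReset));  // 100 (stale)
        System.out.println(directUpdate(previouslyPersisted, persistedAfterReset));   // 10 (matches storage)
    }
}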

getPersistentMarkDeletedPosition() is used in MLPendingAckStore:

&& subManagedCursor.getPersistentMarkDeletedPosition() != null
&& metadataPositions.firstEntry().getValue()
.compareTo((PositionImpl) subManagedCursor.getPersistentMarkDeletedPosition()) <= 0) {

In addition, this PR fixes an issue related to the fact that the persistentMarkDeletePosition and lastMarkDeleteEntry fields must be able to move backwards: when the cursor is reset, these fields should be cleared.

In the #15151 case, the relevant line, lastMarkDeleteEntry = new MarkDeleteEntry(markDeletePosition, getProperties(), null, null);, was missing from the initializeCursorPosition method.
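
For context, a simplified sketch (toy types, not ManagedCursorImpl itself) of the reset that initializeCursorPosition needs to perform; the key addition is the lastMarkDeleteEntry assignment described above, together with clearing persistentMarkDeletePosition so that the mark-delete state can legitimately move backwards after a reset:

class CursorResetSketch {
    record MarkDeleteEntry(long newPosition) { }   // stand-in for the real MarkDeleteEntry

    long markDeletePosition;
    long readPosition;
    Long persistentMarkDeletePosition;             // null until something is persisted again
    MarkDeleteEntry lastMarkDeleteEntry;

    void initializeCursorPosition(long position) {
        markDeletePosition = position;
        readPosition = position + 1;               // simplified: next position after the mark-delete point
        persistentMarkDeletePosition = null;       // cleared so it is allowed to move backwards
        lastMarkDeleteEntry = new MarkDeleteEntry(markDeletePosition);  // the reset that was missing
    }
}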

github-actions bot commented Apr 7, 2022

@lhotari: Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@lhotari changed the title to [ML] Follow up on race condition fixes in ManagedCursorImpl #15031 Apr 7, 2022
@lhotari requested review from congbobo184 and eolivelli April 7, 2022 11:56
@lhotari added the type/bug, area/broker, and doc-not-needed labels Apr 7, 2022
github-actions bot commented Apr 7, 2022

@lhotari: Thanks for providing doc info!

lhotari referenced this pull request Apr 8, 2022
## Motivation
In order to handle pending ack persistence.

## Implementation
1. Add the transaction pending ack store; it will handle the pending ack metadata store.
2. When the sub is unloaded, we will replay the pendingAckHandle.
3. We use one managedLedger per sub to store the pending ack metadata, and replay it through a cursor opened on this managedLedger.
4. When we commit or abort the transaction, we append the marker to the pendingAckStore and then modify the in-memory state in pendingAckHandle.
5. We also modify the in-memory state when the append fails, because we don't know the persistent state; when we replay it, it would produce the wrong operation. So when the append fails, we should wait for the TC timeout or for the client to abort this transaction.
6. When we append to the pending ack log, we compare whether the biggest topic position stored in the log is bigger than the persistent topic's markDeletePosition; if it is smaller, the position will be deleted.

### Verifying this change
Add tests for it.
@lhotari force-pushed the lh-fix-race-in-persisting-mark-delete-position branch from 255ec63 to 9344d99 on April 12, 2022 14:29
lhotari added a commit to lhotari/pulsar that referenced this pull request Apr 12, 2022
@lhotari (Member, Author) commented Apr 12, 2022

The periodic flushing added in #8634 will cause problems unless this fix is applied.

@michaeljmarshall (Member) commented

This is wrong since the meaning of the field is not what one would expect it to be.

@lhotari - are you able to describe the meaning of the field more?

lhotari added a commit that referenced this pull request Apr 14, 2022
…15067)

- follow up on #15031
* [ML] Fix race in persisting mark delete position
* [ML] Resetting should reset lastMarkDeleteEntry
* [ML] Reset fields in initializeCursorPosition method

(cherry picked from commit a19a30a)
@lhotari added the cherry-picked/branch-2.7, cherry-picked/branch-2.8, cherry-picked/branch-2.9, and cherry-picked/branch-2.10 labels Apr 14, 2022
lhotari added a commit to datastax/pulsar that referenced this pull request Apr 14, 2022
 (apache#15067)

- follow up on apache#15031
* [ML] Fix race in persisting mark delete position
* [ML] Resetting should reset lastMarkDeleteEntry
* [ML] Reset fields in initializeCursorPosition method

(cherry picked from commit a19a30a)
@congbobo184 (Contributor) left a comment

I don't think this PR can fix the race condition. The lastMarkDeleteEntry field can still get updated to a previous value as a result of races: two threads can update LAST_MARK_DELETE_ENTRY_UPDATER at the same time.
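
For illustration, a toy sketch (not Pulsar's code) of the concern: if two mark-delete callbacks write the field concurrently without an ordering check, the final value depends on thread scheduling and an older entry can overwrite a newer one.

import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

public class LastWriterWinsRace {
    static class Entry {
        final long position;
        Entry(long position) { this.position = position; }
    }

    volatile Entry lastMarkDeleteEntry = new Entry(0);

    private static final AtomicReferenceFieldUpdater<LastWriterWinsRace, Entry> UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(LastWriterWinsRace.class, Entry.class, "lastMarkDeleteEntry");

    public static void main(String[] args) throws InterruptedException {
        LastWriterWinsRace state = new LastWriterWinsRace();
        // Two callbacks complete concurrently: one for position 10, one for position 5.
        Thread t1 = new Thread(() -> UPDATER.set(state, new Entry(10)));
        Thread t2 = new Thread(() -> UPDATER.set(state, new Entry(5)));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Depending on scheduling, the field can end up at 5 even though 10 was also written,
        // i.e. an older mark-delete entry can overwrite a newer one.
        System.out.println("lastMarkDeleteEntry.position = " + state.lastMarkDeleteEntry.position);
    }
}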

@codelipenghui (Contributor) commented Apr 14, 2022

@lhotari Sorry for the late response.

Looks like this PR will not fix the persistentMarkDeletePosition consistency issue?

Assume 2 threads, thread 0 and thread 1: thread 0 reaches

PENDING_MARK_DELETED_SUBMITTED_COUNT_UPDATER.incrementAndGet(this);

and thread 1 reaches another point; the persistentMarkDeletePosition might apply the value from either thread 0 or thread 1. It's uncertain, and I think we should provide consistent behavior here.

And inProgressLatest is a local variable, and mdEntry is not shared between threads, so I think they are not related to the race conditions.

@lhotari (Member, Author) commented Apr 14, 2022

the persistentMarkDeletePosition might apply the value from either thread 0 or thread 1. It's uncertain, and I think we should provide consistent behavior here.
And inProgressLatest is a local variable, and mdEntry is not shared between threads, so I think they are not related to the race conditions.

@codelipenghui That's true. However, I don't think that the problem has become worse than before.
It would be good to address the problems with a single-threaded approach, using an actor-like model where there would be no concurrency.
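
A minimal sketch of that idea (simplified positions as longs, and only an assumption about how it could look, not a concrete proposal for ManagedCursorImpl): all mutations of the mark-delete state are funneled through one single-threaded executor, so updates are applied in submission order and no field is ever written by two threads at once.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MarkDeleteActor {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private long markDeletePosition;   // only ever touched from the executor thread

    void submitMarkDelete(long newPosition) {
        executor.execute(() -> {
            if (newPosition > markDeletePosition) {
                markDeletePosition = newPosition;
            }
            // persistence would also be triggered here, from this single thread
        });
    }
}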

@codelipenghui (Contributor) commented

@lhotari got it. Thanks.

aparajita89 pushed a commit to aparajita89/pulsar that referenced this pull request Apr 18, 2022
 (apache#15067)

- follow up on apache#15031 
* [ML] Fix race in persisting mark delete position
* [ML] Resetting should reset lastMarkDeleteEntry
* [ML] Reset fields in initializeCursorPosition method
Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
 (apache#15067)

- follow up on apache#15031 
* [ML] Fix race in persisting mark delete position
* [ML] Resetting should reset lastMarkDeleteEntry
* [ML] Reset fields in initializeCursorPosition method
@congbobo184 (Contributor) commented

@lhotari Hi, do you have a plan to address my comment? If not, I will take over the task.

@lhotari (Member, Author) commented Apr 22, 2022

@lhotari Hi, do you have a plan to address my comment? If not, I will take over the task.

Yes, I'm planning to address it.

@Technoboy- (Contributor) commented

This patch fixes this problem #15151

Could you explain why this patch fixes this issue?

@Technoboy- (Contributor) commented

the persistentMarkDeletePosition might apply the value from either thread 0 or thread 1. It's uncertain, and I think we should provide consistent behavior here.
And inProgressLatest is a local variable, and mdEntry is not shared between threads, so I think they are not related to the race conditions.

@codelipenghui That's true. However, I don't think that the problem has become worse than before. It would be good to address the problems with a single-threaded approach, using an actor-like model where there would be no concurrency.

I don't agree with this. This will add complexity to the logic. We should find a way to fix this completely, not merge it with other ideas.

@lhotari (Member, Author) commented Apr 22, 2022

This patch fixes this problem #15151

Could you explain why this patch fixes this issue?

This line was missing:

lastMarkDeleteEntry = new MarkDeleteEntry(markDeletePosition, getProperties(), null, null);

Since #15031 added a solution to prevent race conditions in updating the lastMarkDeleteEntry field, that solution prevents the value from moving backwards. The lastMarkDeleteEntry field should therefore be reset in the initializeCursorPosition method for this logic to work correctly when using the earliest initial position:

cursor.initialize(position, properties, new VoidCallback() {
    @Override
    public void operationComplete() {
        log.info("[{}] Opened new cursor: {}", name, cursor);
        cursor.setActive();
        // Update the ack position (ignoring entries that were written while the cursor was being created)
        cursor.initializeCursorPosition(initialPosition == InitialPosition.Latest
                ? getLastPositionAndCounter()
                : getFirstPositionAndCounter());

What ended up happening was that #15031 prevented updating the lastMarkDeleteEntry field, so when the scheduled flush call happened, it updated the markDelete position to the latest position.

void flush() {
    if (!isDirty) {
        return;
    }
    isDirty = false;
    asyncMarkDelete(lastMarkDeleteEntry.newPosition, lastMarkDeleteEntry.properties, new MarkDeleteCallback() {

That caused the cursor to move the readPosition to the last entry when the flush call happened, resulting in entries (messages) being skipped.
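
To make that sequence concrete, here is a toy model (simplified types, not the actual Pulsar classes) of the interaction described above: the cursor is initialized to the earliest position, but lastMarkDeleteEntry keeps pointing at the latest position, so the periodic flush re-persists the stale value and entries are skipped.

class CursorFlushSketch {
    long markDeletePosition;
    long lastMarkDeleteEntryPosition;   // stands in for lastMarkDeleteEntry.newPosition
    boolean dirty = true;

    void initializeCursorPosition(long firstPosition) {
        markDeletePosition = firstPosition;
        // The bug described above: lastMarkDeleteEntryPosition is NOT reset here,
        // so it keeps pointing at the latest position from cursor creation.
    }

    void flush() {
        if (!dirty) {
            return;
        }
        dirty = false;
        // flush persists whatever lastMarkDeleteEntry holds; with the stale value
        // this moves the mark-delete (and read) position past unread entries.
        markDeletePosition = lastMarkDeleteEntryPosition;
    }
}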

I'm planning to add a test case to reproduce this, and I'll also use it when refactoring the solution to fix the remaining ordering issue / race condition.

@lhotari (Member, Author) commented Apr 22, 2022

the persistentMarkDeletePosition might apply the value from either thread 0 or thread 1. It's uncertain, and I think we should provide consistent behavior here.
And inProgressLatest is a local variable, and mdEntry is not shared between threads, so I think they are not related to the race conditions.

@codelipenghui That's true. However, I don't think that the problem has become worse than before. It would be good to address the problems with a single-threaded approach, using an actor-like model where there would be no concurrency.

I don't agree with this. This will add complexity to the logic. We should find a way to fix this completely, not merge it with other ideas.

@Technoboy- Please be more specific about your feedback. I don't really get the point of your sentence. Please rephrase.


Labels

area/broker, cherry-picked/branch-2.7, cherry-picked/branch-2.8, cherry-picked/branch-2.9, cherry-picked/branch-2.10, doc-not-needed, release/2.7.5, release/2.8.4, release/2.9.3, release/2.10.1, type/bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stuck Subscription while using JClouds Offloaders

7 participants