HDDS-5358. Incorrect cache entry invalidation causes intermittent failure in testGetS3SecretAndRevokeS3Secret #2518
…er remains in secret table until batch jobs are processed to remove the entry from the table. A window of time exists when the secret is revoked yet still exists in the s3 table; the patch checks the table repeatedly for the stricken entry or a timeout, whichever comes first.
@neils-dev
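The "check repeatedly until the entry is gone or a timeout occurs" idea in the commit message above can be sketched as follows. This is only an illustration, not the code from the commit: `RevokePollSketch`, `waitUntil`, and the map standing in for the s3 table are hypothetical names, and the interval and timeout values are arbitrary.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

public class RevokePollSketch {

  /**
   * Polls the condition until it holds or timeoutMillis elapses.
   * Returns true if the condition held, false on timeout or interruption.
   */
  static boolean waitUntil(BooleanSupplier condition, long intervalMillis, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() >= deadline) {
        return false;
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical stand-in for the persisted s3 table.
    Map<String, String> s3Table = new ConcurrentHashMap<>();
    s3Table.put("user", "secret");

    // Hypothetical stand-in for the background batch worker that removes the entry later.
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      s3Table.remove("user");
    });
    worker.start();

    boolean removed = waitUntil(() -> !s3Table.containsKey("user"), 10, 2000);
    worker.join();
    System.out.println("removed before timeout: " + removed);
  }
}
```

Polling of this kind only masks the window between revoke and flush; the discussion below moves to fixing the cache invalidation itself.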
Thanks @neils-dev for finding the inconsistency between cache and table, and @bharatviswa504 for the insight on the cache lookup. Based on these, I think I found the problem:
Line 116 in fa7dc30:
omMetadataManager.getKeyTable().addCacheEntry(
The "absent" cache entry is injected into the wrong table: the key table's cache rather than the S3 secret table's.
I have uploaded a repro patch to the Jira issue. Test failure can be consistently reproduced with that.
I have also uploaded the fix, please feel free to use it.
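To make the effect of the misplaced cache entry concrete, here is a minimal, self-contained model. It is a sketch under assumptions, not Ozone code: `CachedTable`, `revoke`, and the use of an empty `Optional` as the "absent" marker are hypothetical simplifications of a table with a cache that is consulted before the persisted store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class TableCacheSketch {

  /** Simplified table: an empty Optional in the cache marks the key as deleted. */
  static class CachedTable {
    final Map<String, String> store = new HashMap<>();           // stand-in for the persisted table
    final Map<String, Optional<String>> cache = new HashMap<>(); // stand-in for the table cache

    void addCacheEntry(String key, Optional<String> value) {
      cache.put(key, value);
    }

    /** Cache first; fall back to the store only on a cache miss. */
    String get(String key) {
      Optional<String> cached = cache.get(key);
      if (cached != null) {
        return cached.orElse(null);
      }
      return store.get(key);
    }
  }

  /** Marks the secret absent in the cache of whichever table is passed in. */
  static void revoke(CachedTable tableToInvalidate, String accessId) {
    tableToInvalidate.addCacheEntry(accessId, Optional.empty());
  }

  public static void main(String[] args) {
    CachedTable keyTable = new CachedTable();
    CachedTable s3SecretTable = new CachedTable();
    s3SecretTable.store.put("alice", "secret");

    // Buggy variant: the marker lands in the key table's cache.
    revoke(keyTable, "alice");
    System.out.println("after buggy revoke: " + s3SecretTable.get("alice"));

    // Fixed variant: the marker lands in the S3 secret table's cache.
    revoke(s3SecretTable, "alice");
    System.out.println("after fixed revoke: " + s3SecretTable.get("alice"));
  }
}
```

In this model the lookup after the buggy revoke misses the secret table's cache and falls through to the store, so the revoked secret is still returned; only the fixed variant hides it immediately.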
…eSecretRequest. Passed the integration test in CI, where the error was originally found, over repeated runs without failure.
Thanks @bharatviswa504 for the review and comments. Thanks @adoroszlai for spotting the error in invalidating the cache entry. I have updated the commit to invalidate the correct cache in revoke. Q. In S3RevokeSecretRequest, incorrectly trying to strike the s3 secret from the cache would ensure that the s3 key still exists in the s3 secret cache. Any subsequent 'get' or 'lookup' for the s3 key would then always 'hit' and retrieve the s3 key the user revoked. How are we observing intermittent errors in this case? Wouldn't the s3 secret integration test always fail then? Does the batch delete from the s3 table somehow also invalidate the cache in the double buffer thread in the background, hence the intermittent failure observed?
Thanks @neils-dev for updating the patch.
bharatviswa504
left a comment
+1 LGTM
We incorrectly update the wrong table's cache, but if the double buffer flush completes before the assert check happens, the check succeeds; otherwise it fails.
Ok. Assert check with double buffer flush invalidates the cache entry -
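The timing dependence described above can be shown with a small deterministic model. This is a hedged sketch only: `FlushTimingSketch`, its queue standing in for the double buffer, and the explicit `flush()` call are hypothetical; in the real system the flush runs on a background thread, which is why the test outcome varied from run to run.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class FlushTimingSketch {
  final Map<String, String> s3Table = new HashMap<>();     // stand-in for the persisted S3 secret table
  final Queue<Runnable> doubleBuffer = new ArrayDeque<>(); // stand-in for queued batch operations

  /** Buggy revoke: the S3 secret cache is never updated, so only the queued delete removes the secret. */
  void revokeWithoutCacheInvalidation(String accessId) {
    doubleBuffer.add(() -> s3Table.remove(accessId));
  }

  /** Applies all queued operations, as the background flush would. */
  void flush() {
    Runnable op;
    while ((op = doubleBuffer.poll()) != null) {
      op.run();
    }
  }

  /** With no cache entry for the secret, the lookup falls through to the table. */
  String getSecret(String accessId) {
    return s3Table.get(accessId);
  }

  public static void main(String[] args) {
    FlushTimingSketch om = new FlushTimingSketch();
    om.s3Table.put("alice", "secret");
    om.revokeWithoutCacheInvalidation("alice");

    // Check racing ahead of the flush: the revoked secret is still found.
    System.out.println("check before flush: " + om.getSecret("alice"));

    om.flush();

    // Check after the flush: the secret is gone.
    System.out.println("check after flush: " + om.getSecret("alice"));
  }
}
```

Whether the assertion lands before or after the flush decides the result, which matches the intermittent failure; invalidating the correct cache removes the dependence on flush timing.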
Thanks @neils-dev @adoroszlai @bharatviswa504 for catching this bug.
Merged. Thanks @neils-dev for the PR and others for reviewing. I have changed the jira title a bit to clarify the issue.
* master:
  HDDS-5358. Incorrect cache entry invalidation causes intermittent failure in testGetS3SecretAndRevokeS3Secret (apache#2518)
  HDDS-5608. Fix wrong command in ugrade doc (apache#2524)
  HDDS-5000. Run CI checks selectively (apache#2479)
  HDDS-4929. Select target datanodes and containers to move for Container Balancer (apache#2441)
  HDDS-5283. getStorageSize cast to int can cause issue (apache#2303)
  HDDS-5449 Recon namespace summary 'du' information should return replicated size of a key (apache#2489)
  HDDS-5558. vUnit invocation unit() may produce NPE (apache#2513)
  HDDS-5531. For Link Buckets avoid showing metadata. (apache#2502)
  HDDS-5549. Add 1.1 to supported versions in security policy (apache#2519)
  HDDS-5555. remove pipeline manager v1 code (apache#2511)
  HDDS-5546.OM Service ID change causes OM startup failure. (apache#2512)
  HDDS-5360. DN failed to process all delete block commands in one heartbeat interval (apache#2420)
  HDDS-5021. dev-support Dockerfile is badly outdated (apache#2480)
What changes were proposed in this pull request?
Problem: Intermittent failure with testGetS3SecretAndRevokeS3Secret. Occasionally, the s3 secret is found after it has been revoked.
On S3 secret revoke, specifically on the call to S3RevokeSecretRequest, the s3 secret is meant to be stricken immediately from the s3 secret cache, while the removal from the s3 table is done through a transaction log batch request. These transaction log batch requests are handled by a separate worker. Prior to this fix, the s3 secret is incorrectly invalidated in the wrong table's cache.
Due to this, there are times when the cache and the s3 table are inconsistent: the revoke request has been issued but has not yet propagated to the s3 table. When a key is not found in the cache, it is looked up from the s3 table, hence the intermittent integration test failure.
This PR initially proposed a patch that, within S3RevokeSecretRequest, repeatedly checks the s3 table entry until it is removed or a timeout occurs. The patch as provided corrects the invalidation of the wrong table cache in S3RevokeSecretRequest.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5358
How was this patch tested?
Patch tested with the integration test in the CI environment through a GitHub Actions workflow:
https://github.com/neils-dev/ozone/actions/runs/1115101396
hadoop-ozone/dev-support/checks/integration.sh with environment variables
$ITERATIONS=60, $MAVEN_OPTS: -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.http.retryHandler.class=standard -Dmaven.wagon.http.retryHandler.count=3
hadoop-ozone/dev-support/checks/integration.sh -Dtest=TestSecureOzoneCluster#testGetS3SecretAndRevokeS3Secret