
Conversation

Contributor

@neils-dev neils-dev commented Aug 10, 2021

What changes were proposed in this pull request?

Problem: Intermittent failure in testGetS3SecretAndRevokeS3Secret. Occasionally, the s3 secret is still found after it has been revoked.
On s3 secret revoke, specifically in the call to S3RevokeSecretRequest, the s3 secret is immediately removed from the s3 secret cache; the removal from the s3 table, however, is performed through a transaction log batch request handled by a separate worker. Prior to this fix, the s3 secret is incorrectly invalidated in the wrong table. Because of this, there are times when the cache and the s3 table are inconsistent: the cache reflects the revoke request, but the change has not yet propagated to the s3 table. When a key is not found in the cache, it is looked up from the s3 table, hence the intermittent integration test failures observed.

This PR initially proposed a patch that, within S3RevokeSecretRequest, repeatedly checks the s3 table entry until it is removed or a timeout occurs.

The updated patch instead corrects the invalidation of the wrong table's cache in S3RevokeSecretRequest.
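
To make the described inconsistency concrete, below is a minimal, self-contained sketch in Java. It is not Ozone code: the class, table, and key names are invented for illustration (the real OM uses TypedTable/TableCache and a double-buffer flush rather than these toy maps). It shows a per-table cache that is updated immediately, a table that is only updated by a later background flush, and a read path that falls back to the table on a cache miss; if the delete marker lands in the wrong table's cache, the secret stays readable until the flush catches up.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal illustration (not Ozone code) of a per-table cache whose mutations
// are applied to the backing table later, by a background flush.
class TableWithCache<K, V> {
  // Cache entry semantics: Optional.empty() means "marked for delete".
  private final Map<K, Optional<V>> cache = new ConcurrentHashMap<>();
  // Stand-in for the persisted table (RocksDB in OM).
  private final Map<K, V> table = new ConcurrentHashMap<>();

  void addCacheEntry(K key, Optional<V> value) {
    cache.put(key, value);
  }

  void putToTable(K key, V value) {
    table.put(key, value);
  }

  // Background flush: apply the cached mutation to the table, then evict it.
  void flush(K key) {
    Optional<V> cached = cache.remove(key);
    if (cached != null && !cached.isPresent()) {
      table.remove(key);                        // the delayed delete
    } else if (cached != null) {
      table.put(key, cached.get());
    }
  }

  // Read path: cache first, fall back to the table on a cache miss.
  Optional<V> get(K key) {
    Optional<V> cached = cache.get(key);
    if (cached != null) {
      return cached;                            // cached value or delete marker
    }
    return Optional.ofNullable(table.get(key)); // cache miss: read the table
  }
}

class RevokeRaceDemo {
  public static void main(String[] args) {
    TableWithCache<String, String> s3SecretTable = new TableWithCache<>();
    TableWithCache<String, String> someOtherTable = new TableWithCache<>();
    s3SecretTable.putToTable("user@EXAMPLE.COM", "secret");

    // Buggy revoke: the delete marker goes to the wrong table's cache,
    // so the secret is still served from the s3 secret table.
    someOtherTable.addCacheEntry("user@EXAMPLE.COM", Optional.empty());
    System.out.println(s3SecretTable.get("user@EXAMPLE.COM")); // Optional[secret]

    // Correct revoke: mark the delete in the s3 secret table's cache;
    // reads now see the revoke even before the background flush runs.
    s3SecretTable.addCacheEntry("user@EXAMPLE.COM", Optional.empty());
    System.out.println(s3SecretTable.get("user@EXAMPLE.COM")); // Optional.empty
    s3SecretTable.flush("user@EXAMPLE.COM");                   // delayed DB delete
    System.out.println(s3SecretTable.get("user@EXAMPLE.COM")); // Optional.empty
  }
}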

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5358

How was this patch tested?

Patch tested through the integration test in a CI environment via a GitHub Actions workflow:
https://github.com/neils-dev/ozone/actions/runs/1115101396

hadoop-ozone/dev-support/checks/integration.sh
with environment variables ITERATIONS=60 and MAVEN_OPTS="-Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.http.retryHandler.class=standard -Dmaven.wagon.http.retryHandler.count=3"

hadoop-ozone/dev-support/checks/integration.sh -Dtest=TestSecureOzoneCluster#testGetS3SecretAndRevokeS3Secret

@bharatviswa504
Contributor

@neils-dev
The scenario you describe is already handled: if the cache explicitly holds null for a key, we consider it deleted, since the actual delete from the table is done by the double buffer thread in the background.

  public CacheResult<CACHEVALUE> lookup(CACHEKEY cachekey) {

    CACHEVALUE cachevalue = cache.get(cachekey);
    if (cachevalue == null) {
      return new CacheResult<>(CacheResult.CacheStatus.MAY_EXIST,
            null);
    } else {
      if (cachevalue.getCacheValue() != null) {
        return new CacheResult<>(CacheResult.CacheStatus.EXISTS, cachevalue);
      } else {
        // When entity is marked for delete, cacheValue will be set to null.
        return new CacheResult<>(CacheResult.CacheStatus.NOT_EXIST, null);
      }
    }
  }

Contributor

@adoroszlai adoroszlai left a comment


Thanks @neils-dev for finding the inconsistency between cache and table, and @bharatviswa504 for the insight on the cache lookup. Based on these, I think I found the problem:

"Absent" cache item is injected into the wrong table.

I have uploaded a repro patch to the Jira issue. Test failure can be consistently reproduced with that.

I have also uploaded the fix, please feel free to use it.
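
For readers following the code, the shape of this fix, as described above, is to add the "absent" cache entry to the s3 secret table's cache rather than to the table it was mistakenly applied to. A rough sketch of that pattern follows; the variable and accessor names here are assumptions chosen for illustration, not a copy of the actual patch:

// Sketch only (assumed names, not the literal patch): inside
// S3RevokeSecretRequest#validateAndUpdateCache, mark the secret as deleted
// in the s3 secret table's cache, keyed by the kerberos/access ID.
omMetadataManager.getS3SecretTable().addCacheEntry(
    new CacheKey<>(kerberosID),
    new CacheValue<>(Optional.absent(), transactionLogIndex));
// The actual delete from the RocksDB table is applied later by the
// double-buffer flush in the background.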

…eSecretRequest. Passed integration test in CI where error was found repetitively without failure.
@neils-dev
Contributor Author

Thanks @bharatviswa504 for the review and comments. Thanks @adoroszlai for spotting the error in invalidating the cache entry. I have updated the commit to invalidate the correct cache in the revoke request.

Q. In S3RevokeSecretRequest, incorrectly striking the s3 secret from the wrong cache means the s3 key still exists in the s3 secret cache, so any subsequent 'get' or 'lookup' for the s3 key would always 'hit' and retrieve the s3 key the user revoked. How are we observing intermittent errors in this case? Would the s3 secret integration test always fail then? Does the batch delete from the s3 table somehow also invalidate the cache in the background double buffer thread, and is that why the failure is only intermittent?

@adoroszlai adoroszlai dismissed their stale review August 11, 2021 15:21

Thanks @neils-dev for updating the patch.

Contributor

@bharatviswa504 bharatviswa504 left a comment


+1 LGTM

@bharatviswa504
Contributor

How are we observing intermittent errors in this case? Would the s3 secret integration test always fail then?

We incorrectly update the wrong table's cache, but if the double buffer flush completes before the assert check runs, the check will succeed; otherwise it will fail.

@neils-dev
Contributor Author

How are we observing intermittent errors in this case? Would the s3 secret integration test always fail then?

We incorrectly update the wrong table's cache, but if the double buffer flush completes before the assert check runs, the check will succeed; otherwise it will fail.

Ok. So the assert check succeeds once the double buffer flush has invalidated the cache entry (cleanupCache runs on the committed transactions during the flush). Thanks @bharatviswa504, @adoroszlai.
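
The timing being discussed can be illustrated with a small, self-contained sketch (toy code, not the OM double buffer; names are invented): cache entries remember the epoch/transaction index at which they were added, and cleanupCache(flushedEpoch) evicts every entry whose epoch has already been flushed to the DB. Whether the stale entry is still present when the assertion runs is exactly what makes the test pass or fail intermittently.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy model (not Ozone code) of epoch-based cache cleanup after a flush.
class EpochCacheDemo {

  // A cache entry remembers the transaction epoch that created it.
  static final class Entry {
    final Optional<String> value;   // empty = marked for delete
    final long epoch;
    Entry(Optional<String> value, long epoch) {
      this.value = value;
      this.epoch = epoch;
    }
  }

  static final Map<String, Entry> CACHE = new ConcurrentHashMap<>();

  // Evict every cache entry whose epoch has already been flushed to the DB.
  static void cleanupCache(long flushedEpoch) {
    CACHE.entrySet().removeIf(e -> e.getValue().epoch <= flushedEpoch);
  }

  public static void main(String[] args) {
    // Epoch 1: GetS3Secret caches the secret.
    CACHE.put("user@EXAMPLE.COM", new Entry(Optional.of("secret"), 1L));

    // Epoch 2: the buggy revoke marks a different table's cache, so this
    // cache still serves the stale secret and the assertion fails...
    System.out.println(CACHE.containsKey("user@EXAMPLE.COM")); // true: stale hit

    // ...unless the double-buffer flush of epoch 2 wins the race: it removes
    // the row from the DB and calls cleanupCache, which also evicts the stale
    // epoch-1 entry; the lookup then falls through to the now-empty table and
    // the test passes. Hence the intermittent result.
    cleanupCache(2L);
    System.out.println(CACHE.containsKey("user@EXAMPLE.COM")); // false: cache miss
  }
}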

Contributor

smengcl commented Aug 11, 2021

Thanks @neils-dev @adoroszlai @bharatviswa504 for catching this bug.

@smengcl smengcl changed the title HDDS-5358. Intermittent failure in testGetS3SecretAndRevokeS3Secret HDDS-5358. Incorrect cache entry invalidation causes intermittent failure in testGetS3SecretAndRevokeS3Secret Aug 11, 2021
@smengcl smengcl merged commit 44be763 into apache:master Aug 11, 2021
Contributor

smengcl commented Aug 11, 2021

Merged. Thanks @neils-dev for the PR and others for reviewing. I have changed the jira title a bit to clarify the issue.

errose28 added a commit to errose28/ozone that referenced this pull request Aug 12, 2021
* master:
  HDDS-5358. Incorrect cache entry invalidation causes intermittent failure in testGetS3SecretAndRevokeS3Secret (apache#2518)
  HDDS-5608. Fix wrong command in ugrade doc (apache#2524)
  HDDS-5000. Run CI checks selectively (apache#2479)
  HDDS-4929. Select target datanodes and containers to move for Container Balancer (apache#2441)
  HDDS-5283. getStorageSize cast to int can cause issue (apache#2303)
  HDDS-5449 Recon namespace summary 'du' information should return replicated size of a key (apache#2489)
  HDDS-5558. vUnit invocation unit() may produce NPE (apache#2513)
  HDDS-5531. For Link Buckets avoid showing metadata. (apache#2502)
  HDDS-5549. Add 1.1 to supported versions in security policy (apache#2519)
  HDDS-5555. remove pipeline manager v1 code (apache#2511)
  HDDS-5546.OM Service ID change causes OM startup failure. (apache#2512)
  HDDS-5360. DN failed to process all delete block commands in one heartbeat interval (apache#2420)
  HDDS-5021. dev-support Dockerfile is badly outdated (apache#2480)