
Conversation

Contributor

@neils-dev neils-dev commented Aug 10, 2021

What changes were proposed in this pull request?

Problem: Intermittent failure in testGetS3SecretAndRevokeS3Secret. Occasionally, the s3 secret is still found after it has been revoked.
On s3 secret revoke, specifically in the call to S3RevokeSecretRequest, the s3 secret is immediately removed from the s3 secret cache; the removal from the s3 table, however, is performed through a transaction log batch request handled by a separate worker. Prior to this fix, the s3 secret is incorrectly invalidated in the wrong table. Because of this, there are times when the cache and the s3 table are inconsistent: the cache reflects the revoke request, but the change has not yet propagated to the s3 table. When a key is not found in the cache, it is looked up from the s3 table, hence the intermittent integration test failures observed.

This PR initially proposed a patch that, within S3RevokeSecretRequest, repeatedly checks the s3 table entry until it is removed or a timeout occurs.

The updated patch instead corrects the invalidation of the wrong table's cache in S3RevokeSecretRequest.
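
To make the described inconsistency concrete, below is a minimal, self-contained sketch in Java. It is not Ozone code: the class, table, and key names are invented for illustration (the real OM uses TypedTable/TableCache and a double-buffer flush rather than these toy maps). It shows a per-table cache that is updated immediately, a table that is only updated by a later background flush, and a read path that falls back to the table on a cache miss; if the delete marker lands in the wrong table's cache, the secret stays readable until the flush catches up.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal illustration (not Ozone code) of a per-table cache whose mutations
// are applied to the backing table later, by a background flush.
class TableWithCache<K, V> {
  // Cache entry semantics: Optional.empty() means "marked for delete".
  private final Map<K, Optional<V>> cache = new ConcurrentHashMap<>();
  // Stand-in for the persisted table (RocksDB in OM).
  private final Map<K, V> table = new ConcurrentHashMap<>();

  void addCacheEntry(K key, Optional<V> value) {
    cache.put(key, value);
  }

  void putToTable(K key, V value) {
    table.put(key, value);
  }

  // Background flush: apply the cached mutation to the table, then evict it.
  void flush(K key) {
    Optional<V> cached = cache.remove(key);
    if (cached != null && !cached.isPresent()) {
      table.remove(key);                        // the delayed delete
    } else if (cached != null) {
      table.put(key, cached.get());
    }
  }

  // Read path: cache first, fall back to the table on a cache miss.
  Optional<V> get(K key) {
    Optional<V> cached = cache.get(key);
    if (cached != null) {
      return cached;                            // cached value or delete marker
    }
    return Optional.ofNullable(table.get(key)); // cache miss: read the table
  }
}

class RevokeRaceDemo {
  public static void main(String[] args) {
    TableWithCache<String, String> s3SecretTable = new TableWithCache<>();
    TableWithCache<String, String> someOtherTable = new TableWithCache<>();
    s3SecretTable.putToTable("user@EXAMPLE.COM", "secret");

    // Buggy revoke: the delete marker goes to the wrong table's cache,
    // so the secret is still served from the s3 secret table.
    someOtherTable.addCacheEntry("user@EXAMPLE.COM", Optional.empty());
    System.out.println(s3SecretTable.get("user@EXAMPLE.COM")); // Optional[secret]

    // Correct revoke: mark the delete in the s3 secret table's cache;
    // reads now see the revoke even before the background flush runs.
    s3SecretTable.addCacheEntry("user@EXAMPLE.COM", Optional.empty());
    System.out.println(s3SecretTable.get("user@EXAMPLE.COM")); // Optional.empty
    s3SecretTable.flush("user@EXAMPLE.COM");                   // delayed DB delete
    System.out.println(s3SecretTable.get("user@EXAMPLE.COM")); // Optional.empty
  }
}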

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5358

How was this patch tested?

Patch tested through the integration test in a CI environment via a GitHub Actions workflow:
https://github.com/neils-dev/ozone/actions/runs/1115101396

hadoop-ozone/dev-support/checks/integration.sh
with environment variables ITERATIONS=60 and MAVEN_OPTS="-Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.http.retryHandler.class=standard -Dmaven.wagon.http.retryHandler.count=3"

hadoop-ozone/dev-support/checks/integration.sh -Dtest=TestSecureOzoneCluster#testGetS3SecretAndRevokeS3Secret

@bharatviswa504
Contributor

@neils-dev
The scenario you describe is already handled: if the cache explicitly holds null for a key, we consider it deleted, since the actual delete from the table is done by the double buffer thread in the background.

  public CacheResult<CACHEVALUE> lookup(CACHEKEY cachekey) {

    CACHEVALUE cachevalue = cache.get(cachekey);
    if (cachevalue == null) {
      return new CacheResult<>(CacheResult.CacheStatus.MAY_EXIST,
            null);
    } else {
      if (cachevalue.getCacheValue() != null) {
        return new CacheResult<>(CacheResult.CacheStatus.EXISTS, cachevalue);
      } else {
        // When entity is marked for delete, cacheValue will be set to null.
        return new CacheResult<>(CacheResult.CacheStatus.NOT_EXIST, null);
      }
    }
  }

Contributor

@adoroszlai adoroszlai left a comment


Thanks @neils-dev for finding the inconsistency between cache and table, and @bharatviswa504 for the insight on the cache lookup. Based on these, I think I found the problem:

"Absent" cache item is injected into the wrong table.

I have uploaded a repro patch to the Jira issue. Test failure can be consistently reproduced with that.

I have also uploaded the fix, please feel free to use it.
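
For readers following the code, the shape of this fix, as described above, is to add the "absent" cache entry to the s3 secret table's cache rather than to the table it was mistakenly applied to. A rough sketch of that pattern follows; the variable and accessor names here are assumptions chosen for illustration, not a copy of the actual patch:

// Sketch only (assumed names, not the literal patch): inside
// S3RevokeSecretRequest#validateAndUpdateCache, mark the secret as deleted
// in the s3 secret table's cache, keyed by the kerberos/access ID.
omMetadataManager.getS3SecretTable().addCacheEntry(
    new CacheKey<>(kerberosID),
    new CacheValue<>(Optional.absent(), transactionLogIndex));
// The actual delete from the RocksDB table is applied later by the
// double-buffer flush in the background.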

…eSecretRequest. Passed integration test in CI where error was found repetitively without failure.
@neils-dev
Contributor Author

Thanks @bharatviswa504 for the review and comments. Thanks @adoroszlai for spotting the error in invalidating the cache entry. I have updated the commit to invalidate the correct cache in the revoke request.

Q. In S3RevokeSecretRequest, incorrectly striking the s3 secret from the wrong cache means the s3 key still exists in the s3 secret cache, so any subsequent 'get' or 'lookup' for the s3 key would always 'hit' and retrieve the s3 key the user revoked. How are we observing intermittent errors in this case? Would the s3 secret integration test always fail then? Does the batch delete from the s3 table somehow also invalidate the cache in the background double buffer thread, and is that why the failure is only intermittent?

@adoroszlai adoroszlai dismissed their stale review August 11, 2021 15:21

Thanks @neils-dev for updating the patch.

Contributor

@bharatviswa504 bharatviswa504 left a comment


+1 LGTM

@bharatviswa504
Contributor

How are we observing intermittent errors in this case? Would the s3 secret integration test always fail then?

We incorrectly update the wrong table's cache, but if the double buffer flush completes before the assert check runs, the check will succeed; otherwise it will fail.

@neils-dev
Contributor Author

How are we observing intermittent errors in this case? Would the s3 secret integration test always fail then?

We incorrectly update the wrong table's cache, but if the double buffer flush completes before the assert check runs, the check will succeed; otherwise it will fail.

Ok. So the assert check succeeds once the double buffer flush has invalidated the cache entry (cleanupCache runs on the committed transactions during the flush). Thanks @bharatviswa504, @adoroszlai.
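
The timing being discussed can be illustrated with a small, self-contained sketch (toy code, not the OM double buffer; names are invented): cache entries remember the epoch/transaction index at which they were added, and cleanupCache(flushedEpoch) evicts every entry whose epoch has already been flushed to the DB. Whether the stale entry is still present when the assertion runs is exactly what makes the test pass or fail intermittently.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy model (not Ozone code) of epoch-based cache cleanup after a flush.
class EpochCacheDemo {

  // A cache entry remembers the transaction epoch that created it.
  static final class Entry {
    final Optional<String> value;   // empty = marked for delete
    final long epoch;
    Entry(Optional<String> value, long epoch) {
      this.value = value;
      this.epoch = epoch;
    }
  }

  static final Map<String, Entry> CACHE = new ConcurrentHashMap<>();

  // Evict every cache entry whose epoch has already been flushed to the DB.
  static void cleanupCache(long flushedEpoch) {
    CACHE.entrySet().removeIf(e -> e.getValue().epoch <= flushedEpoch);
  }

  public static void main(String[] args) {
    // Epoch 1: GetS3Secret caches the secret.
    CACHE.put("user@EXAMPLE.COM", new Entry(Optional.of("secret"), 1L));

    // Epoch 2: the buggy revoke marks a different table's cache, so this
    // cache still serves the stale secret and the assertion fails...
    System.out.println(CACHE.containsKey("user@EXAMPLE.COM")); // true: stale hit

    // ...unless the double-buffer flush of epoch 2 wins the race: it removes
    // the row from the DB and calls cleanupCache, which also evicts the stale
    // epoch-1 entry; the lookup then falls through to the now-empty table and
    // the test passes. Hence the intermittent result.
    cleanupCache(2L);
    System.out.println(CACHE.containsKey("user@EXAMPLE.COM")); // false: cache miss
  }
}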

Contributor

smengcl commented Aug 11, 2021

Thanks @neils-dev @adoroszlai @bharatviswa504 for catching this bug.

@smengcl smengcl changed the title HDDS-5358. Intermittent failure in testGetS3SecretAndRevokeS3Secret HDDS-5358. Incorrect cache entry invalidation causes intermittent failure in testGetS3SecretAndRevokeS3Secret Aug 11, 2021
@smengcl smengcl merged commit 44be763 into apache:master Aug 11, 2021
Contributor

smengcl commented Aug 11, 2021

Merged. Thanks @neils-dev for the PR and others for reviewing. I have changed the jira title a bit to clarify the issue.

errose28 added a commit to errose28/ozone that referenced this pull request Aug 12, 2021
* master:
  HDDS-5358. Incorrect cache entry invalidation causes intermittent failure in testGetS3SecretAndRevokeS3Secret (apache#2518)
  HDDS-5608. Fix wrong command in ugrade doc (apache#2524)
  HDDS-5000. Run CI checks selectively (apache#2479)
  HDDS-4929. Select target datanodes and containers to move for Container Balancer (apache#2441)
  HDDS-5283. getStorageSize cast to int can cause issue (apache#2303)
  HDDS-5449 Recon namespace summary 'du' information should return replicated size of a key (apache#2489)
  HDDS-5558. vUnit invocation unit() may produce NPE (apache#2513)
  HDDS-5531. For Link Buckets avoid showing metadata. (apache#2502)
  HDDS-5549. Add 1.1 to supported versions in security policy (apache#2519)
  HDDS-5555. remove pipeline manager v1 code (apache#2511)
  HDDS-5546.OM Service ID change causes OM startup failure. (apache#2512)
  HDDS-5360. DN failed to process all delete block commands in one heartbeat interval (apache#2420)
  HDDS-5021. dev-support Dockerfile is badly outdated (apache#2480)