@aswinshakil
Member

@aswinshakil aswinshakil commented Feb 2, 2023

What changes were proposed in this pull request?

This patch implements SnapshotDeletingService. The service goes through the deletedTable of each deleted snapshot and, for every entry, does one of the following:

  1. Moves it to the next non-deleted snapshot or, if there is none, to the active object store DB.
  2. Updates the deletedTable of the current snapshot.
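
To make the two options above concrete, here is a minimal sketch of the per-entry decision using plain Java collections. It is not the code from this patch: the class `DeletedKeyRouting`, the `Target` enum, and the `route` method are hypothetical names, and the rule for choosing between the options (whether the previous snapshot still references the key) is an assumption that the description does not spell out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the per-key decision; not the Ozone code. */
public class DeletedKeyRouting {

  /** Where an entry from the deleted snapshot's deletedTable ends up. */
  public enum Target { NEXT_SNAPSHOT, ACTIVE_DB, CURRENT_SNAPSHOT }

  /**
   * Assumption (not stated in the description): an entry is moved on only
   * when the previous snapshot still references the key; otherwise it stays
   * in the current snapshot's deletedTable.
   */
  public static Target route(String key, Set<String> previousSnapshotKeys,
      boolean hasNextActiveSnapshot) {
    if (previousSnapshotKeys.contains(key)) {
      // Option 1: hand the entry to the next non-deleted snapshot,
      // or to the active object store DB when no such snapshot exists.
      return hasNextActiveSnapshot ? Target.NEXT_SNAPSHOT : Target.ACTIVE_DB;
    }
    // Option 2: keep the entry in the current snapshot's deletedTable.
    return Target.CURRENT_SNAPSHOT;
  }

  public static void main(String[] args) {
    Set<String> previous = new HashSet<>(Arrays.asList("/vol/bucket/k1"));
    System.out.println(route("/vol/bucket/k1", previous, true));
    System.out.println(route("/vol/bucket/k1", previous, false));
    System.out.println(route("/vol/bucket/k2", previous, true));
  }
}
```

Keeping the decision in a pure function like this would also make it easy to unit test separately from RocksDB and the snapshot chain.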

Follow-up TODO

SnapshotDeletingService does not yet handle the following; these items will be addressed in the next patch, tracked in HDDS-7883:

  1. Cleaning up renamed keys between snapshots.
  2. Accommodating keys from FSO buckets.
  3. Cleaning up the SnapshotChain and the checkpoint directory. OMSnapshotPurgeRequest will do these cleanups.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7740

How was this patch tested?

The patch was verified with unit tests (UTs) and manual testing.

@prashantpogde prashantpogde added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Feb 2, 2023
@umamaheswararao
Contributor

@hemantk-12 @sumitagrawl

Change-Id: I6fe3de2e9409757bb2871bfbcbac8abd7e11dd53
Contributor

@sumitagrawl sumitagrawl left a comment

@aswinshakil Thanks for working on this. I have left a few comments.

@smengcl
Contributor

smengcl commented Feb 10, 2023

Relevant UT failure:

Error:  Errors: 
Error:    TestOMSnapshotDeleteRequest.testEntryExists:272 » NullPointer

Contributor

@hemantk-12 hemantk-12 left a comment

Thanks for the patch @aswinshakil

omMetadataReader = new OmMetadataReader(keyManager, prefixManager,
    this, LOG, AUDIT, metrics);
omSnapshotManager = new OmSnapshotManager(this);
snapshotChainManager = new SnapshotChainManager(metadataManager);
Contributor

@neils-dev mentioned this should be placed in OmMetadataManagerImpl

@prashantpogde
Contributor

We could also create an instance of the Keydeletion/DirectoryDeletionService that operates on either a snapshot or the active object store (OS) instance. That way we can create as many instances as needed to scale.


if (nextSnapshot != null) {
omNextSnapshot = (OmSnapshot) omSnapshotManager
.checkForSnapshot(nextSnapshot.getVolumeName(),
Contributor

My understanding is that validateAndUpdateCache() runs within the applyTransaction() method of the OzoneManager Ratis state machine, and therefore needs to run quickly. Is that not correct?

@GeorgeJahad
Contributor

GeorgeJahad commented Mar 8, 2023

I know we are in a hurry, but snapshot delete is the most complicated and dangerous part of the snapshot system, so I'd like to see more tests for this subsystem.

If you are too busy, we can create a separate PR and have someone on my team write the tests, in particular unit tests for the following methods:

getNextActiveSnapshot(SnapshotInfo snapInfo,
createRepeatedOmKeyInfo(List<KeyInfo> keyInfoList)
splitRepeatedOmKeyInfo(SnapshotMoveKeyInfos.Builder toActiveDb,
getPreviousSnapshot(SnapshotInfo snapInfo)
checkKeyExistInPreviousTable(

In addition, I'd like a unit test that confirms that we are correctly starting and stopping within the bucket scope.

@aswinshakil
Member Author

@GeorgeJahad I agree we need more tests. I'll add them eventually along with my other PRs. SnapshotDeletingService is not fully complete yet; the items not addressed here will be handled in the follow-up patch, tracked in HDDS-7883.

@aswinshakil
Member Author

I'm disabling the tests for this PR and will update them in another patch. An open PR and my upcoming patch would both break these tests when merged to master.

Contributor

@smengcl smengcl left a comment

The latest change, where we no longer move keys into the active DB, lgtm. (Instead, those reclaimable keys are retained in the current snapshot checkpoint DB and would be cleaned up by SDT.)

Follow-up TODOs from this PR:

  1. Re-enable tests
  2. Add more tests
  3. Finish the rename logic
  4. Revisit the locking
  5. Implement block reclamation in SDT

@smengcl smengcl merged commit 0ebb555 into apache:master Mar 15, 2023
@smengcl
Contributor

smengcl commented Mar 15, 2023

Thanks @aswinshakil for the main logic implementation. Thanks @GeorgeJahad @sumitagrawl @neils-dev @DaveTeng0 @hemantk-12 @prashantpogde for reviewing this.

errose28 added a commit to errose28/ozone that referenced this pull request Mar 16, 2023
* master: (262 commits)
  HDDS-8153. Integrate ContainerBalancer with MoveManager (apache#4391)
  HDDS-8090. When getBlock from a datanode fails, retry other datanodes. (apache#4357)
  HDDS-8163 Use try-with-resources to ensure close rockdb connection in SstFilteringService (apache#4402)
  HDDS-8065. Provide GNU long options (apache#4394)
  HDDS-7930. [addendum] input stream does not refresh expired block token.
  HDDS-7930. input stream does not refresh expired block token. (apache#4378)
  HDDS-7740. [Snapshot] Implement SnapshotDeletingService (apache#4244)
  HDDS-8076. Use container cache in Key listing API. (apache#4346)
  HDDS-8091. [addendum] Generate list of config tags from ConfigTag enum - Hadoop 3.1 compatibility fix (apache#4374)
  HDDS-8144. TestDefaultCertificateClient#testTimeBeforeExpiryGracePeriod fails as we approach DST. (apache#4382)
  HDDS-8151. Support fine grained lifetime for root CA certificate (apache#4386)
  HDDS-8150. RpcClientTest and ConfigurationSourceTest not run due to naming convention (apache#4388)
  HDDS-8131. Add Configuration for OM Ratis Log Purge Tuning Parameters. (apache#4371)
  HDDS-8133. Create ozone sh key checksum command (apache#4375)
  HDDS-8142. Check if no entries in Block DB for a container on container delete (apache#4379)
  HDDS-8118. Fail container delete on non empty chunks dir (apache#4367)
  HDDS-8028. JNI for RocksDB SST Dump tool (apache#4315)
  HDDS-8129. ContainerStateMachine allows two different tasks with the same container id running in parallel. (apache#4370)
  HDDS-8119. Remove loosely related AutoCloseable from SendContainerOutputStream (apache#4368)
  close db connection (apache#4366)
  ...
private SnapshotInfo getNextActiveSnapshot(SnapshotInfo snapInfo,
    SnapshotChainManager chainManager, OmSnapshotManager omSnapshotManager)
    throws IOException {
  while (chainManager.hasNextPathSnapshot(snapInfo.getSnapshotPath(),
Contributor

@hemantk-12 hemantk-12 May 24, 2023

I think it might cause an infinite loop because snapInfo is not getting reset.
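
The concern can be illustrated with a small, self-contained model of a snapshot chain. This is only a sketch under stated assumptions: `NextActiveSnapshotSketch`, `Snap`, and `nextActive` are hypothetical stand-ins, a list index replaces the real chain lookups, and the quoted method is truncated, so the actual loop body may differ. The point is that the loop variable must advance on every iteration; if the "has next" check were always evaluated against the starting snapshot, a deleted successor would keep the condition true forever.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of walking a snapshot chain; not the Ozone code. */
public class NextActiveSnapshotSketch {

  /** Stand-in for SnapshotInfo: a name plus an "active" flag. */
  public static final class Snap {
    final String name;
    final boolean active;

    public Snap(String name, boolean active) {
      this.name = name;
      this.active = active;
    }
  }

  /**
   * Returns the name of the first active snapshot after index start,
   * or null when every later snapshot is deleted.
   */
  public static String nextActive(List<Snap> chain, int start) {
    int cursor = start;
    // Plays the role of hasNextPathSnapshot(...), evaluated on the cursor.
    while (cursor + 1 < chain.size()) {
      // Advance the cursor on every iteration so the condition is re-checked
      // against the snapshot just visited, not the starting one.
      cursor++;
      Snap candidate = chain.get(cursor);
      if (candidate.active) {
        return candidate.name;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<Snap> chain = Arrays.asList(
        new Snap("s1", false), new Snap("s2", false), new Snap("s3", true));
    System.out.println(nextActive(chain, 0));
  }
}
```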
