Conversation

@xichen01
Contributor

@xichen01 xichen01 commented Dec 9, 2024

What changes were proposed in this pull request?

Design doc for find Block missing Key tool

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11891

How was this patch tested?

N/A

@sumitagrawl
Contributor

@xichen01 I have a few questions ...

  1. What is the trigger point to check a file for blocks at the DN? Will all keys in the system be verified?
  2. Recon already has a job to check for missing containers, so IMO checking the container state to check for blocks may not be required. But if a container is not in a healthy state (quasi-closed, deleted, ...), where there is a chance of a missing block, that can be reported as additional information by querying SCM.
  3. How about having a similar thing in Recon as well?

@xichen01
Contributor Author

@xichen01 I have a few questions ...

  1. What is the trigger point to check a file for blocks at the DN? Will all keys in the system be verified?
  2. Recon already has a job to check for missing containers, so IMO checking the container state to check for blocks may not be required. But if a container is not in a healthy state (quasi-closed, deleted, ...), where there is a chance of a missing block, that can be reported as additional information by querying SCM.
  3. How about having a similar thing in Recon as well?

@sumitagrawl Thanks for your questions.

  1. What is the trigger point...

Keys, buckets, volumes, or whole clusters can be checked based on the "OzoneAddress".

  1. Recon already have job to check...

The output of the command will be all the missing keys; if we skip the container state check, we may need to get this information from Recon.
Also, we have encountered cases where the container of a missing key seems to have never existed in the cluster: there is no record of it in either SCM or Recon. Can Recon detect this kind of scenario?

  1. How about having similar ...

In the long run it is possible, but it may require more development; a command-line tool will be more flexible and simpler.

@errose28
Contributor

This seems similar to the ozone debug read-replicas tool, except that it is expected to run faster because it is only checking block existence, not block data. Could we just add a flag to that command to tell it to only pull block metadata and accomplish the same result? Also how does the proposed headBlock differ from the existing getBlock request?

@xichen01
Contributor Author

@errose28 Thanks for your questions.

except that it is expected to run faster because it is only checking block existence

Yes, it performs better: in our internal version, with 6 buckets checked in parallel, the total QPS can reach around 70k. The main bottleneck is OM's ListKeys.

Could we just add a flag to that command to tell it to only pull block metadata and accomplish the same result ?

Do you mean we would only check the DN block in RocksDB, and not check the block file on disk? I think that's possible.

how does the proposed headBlock differ from the existing getBlock request

headBlock (which may be called something else) checks a number of blocks at a time instead of one, and its return value can be simpler: since it only checks for existence, it returns only the blocks that are abnormal.
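To make the proposed semantics concrete, here is a minimal sketch (Python, with hypothetical names; plain sets stand in for the DN's metadata DB and on-disk block files) of a batch existence check that returns only the abnormal blocks:

```python
def head_blocks(db_blocks, disk_blocks, requested):
    """Batch existence check: return only the abnormal blocks, tagged
    with the kind of problem, instead of echoing back healthy ones.

    db_blocks   - block IDs recorded in the DN's metadata DB (stand-in)
    disk_blocks - block IDs whose block file exists on disk (stand-in)
    requested   - block IDs the client asked to verify
    """
    results = []
    for block_id in requested:
        if block_id not in db_blocks:
            results.append((block_id, "missing metadata"))
        elif block_id not in disk_blocks:
            results.append((block_id, "missing block file"))
    return results
```

A fully healthy batch yields an empty result, which keeps the response small when most blocks are fine.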

@xichen01
Contributor Author

@sumitagrawl @errose28 Is there any update?

Comment on lines 2 to 3
title: Erasure Coding in Ozone
summary: Use Erasure Coding algorithm for efficient storage
Contributor


please update title and summary.

Contributor Author


Updated.

@sumitagrawl
Contributor

sumitagrawl commented Jan 7, 2025

The output of the command will be all the missing keys; if we skip the container state check, we may need to get this information from Recon.
Also, we have encountered cases where the container of a missing key seems to have never existed in the cluster: there is no record of it in either SCM or Recon. Can Recon detect this kind of scenario?

Yes, Recon already has the capability to identify missing containers and report them. It monitors all keys and verifies whether any container is missing for the keys.

But at the block level, if a block is deleted on the physical disk, there is no direct mechanism for Recon to identify this until the data is read. However:

There is a DN container scanner task that verifies whether the container metadata and the on-disk data are in a consistent state; otherwise it marks the container unhealthy so that replication can re-replicate it. (I remember this is disabled by default; need to recheck.)

cc: @errose28

@xichen01
Contributor Author

xichen01 commented Jan 7, 2025

Recon already has a job to check for missing containers, so IMO checking the container state to check for blocks may not be required. But if a container is not in a healthy state (quasi-closed, deleted, ...), where there is a chance of a missing block, that can be reported as additional information by querying SCM.


The output of the command will be all the missing keys; if we skip the container state check, we may need to get this information from Recon.
Also, we have encountered cases where the container of a missing key seems to have never existed in the cluster: there is no record of it in either SCM or Recon. Can Recon detect this kind of scenario?

Yes, Recon already has the capability to identify missing containers and report them. It monitors all keys and verifies whether any container is missing for the keys.

But at the block level, if a block is deleted on the physical disk, there is no direct mechanism for Recon to identify this until the data is read. However:

There is a DN container scanner task that verifies whether the container metadata and the on-disk data are in a consistent state; otherwise it marks the container unhealthy so that replication can re-replicate it. (I remember this is disabled by default; need to recheck.)

cc: @errose28

Thanks for the information.
Recon can handle keys with container exceptions, but if we don't list those keys in the output, then our result will report only part of the "block missing" keys, which may cause ambiguity. So, in order to report the "block missing" keys completely, I think the container state check is necessary.
Also, if we want to check the blocks on the datanode, the container state check is hard to bypass.

There is a DN container scanner task that verifies whether the container metadata and the on-disk data are in a consistent state.

This relies on the block having been correctly placed in the container, and on the block not having been incorrectly deleted by the DN (i.e., outside the normal deletion process), which is not guaranteed for a cluster that has been upgraded many times and has run for a long time.

@kerneltime
Contributor

Can you include the text from https://www.notion.so/meeting-room-Conference-Room-d17916fda32244f2b5edfec93c165cee?pvs=21 here directly? I tried to access it but I do not have access to it.

1. Retrieve Key metadata:
- Gather metadata such as the Key name and BlockID (consisting of ContainerID and localID).
- After collecting sufficient Key metadata, organize it for further processing.
- There are three approaches to retrieve metadata (detailed in [Three Approaches to Retrieve Metadata](https://www.notion.so/meeting-room-Conference-Room-d17916fda32244f2b5edfec93c165cee?pvs=21)).
Contributor


The Notion doc should be markdown; we should include it here in the PR itself.

Contributor Author


updated

@errose28
Contributor

errose28 commented Feb 5, 2025

I like the idea of being able to quickly check block existence without downloading all the data. We are working on a similar set of tools that might be useful in HDDS-12206. The final implementation will look something like this:

  • Command ozone debug replicas verify <URI>... [--checksums] [--metadata] [--padding] where:
    • ozone debug replicas verify does one iteration through the namespace provided in <URI>. It may use multiple threads to work through all the keys.
    • For each key under <URI>, the client uses the block locations from OM without consulting SCM
    • The client uses the block info to run checks on the replicas as specified by the flags:
      • --checksums: Download data from the datanode to do client-side checksum verification. Replaces the read-replicas command.
      • --padding: Check for EC missing padding. Replaces the find-missing-padding command and is a no-op for non-EC keys.
      • --metadata (or other name): Only check block existence by GetBlock calls to the datanodes without downloading data.
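A rough sketch of how the flag-driven per-key dispatch could look (Python; the field names are purely illustrative and this is not the actual Ozone client API):

```python
def verify_key(key_info, checksums=False, padding=False, metadata=False):
    """Run only the checks selected by the flags against one key."""
    failures = []
    for block in key_info["blocks"]:
        if metadata and not block.get("exists", True):
            # Existence-only check: a GetBlock-style metadata call,
            # no data download.
            failures.append((block["id"], "missing block metadata"))
        if checksums and block.get("checksum") != block.get("expected"):
            # Client-side checksum verification over downloaded data.
            failures.append((block["id"], "checksum mismatch"))
    if padding and key_info.get("ec") and key_info.get("padding_missing"):
        # No-op for non-EC keys.
        failures.append((key_info["name"], "missing EC padding"))
    return failures
```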

We could add another flag like --container-state or similar to check the SCM container state for each container referenced by the keys and raise an error if it is deleted or has no replicas. This check would not involve the datanodes. I think this would encapsulate the requirements in this doc, although the underlying implementation is different. Using SCM as described in this doc may have some issues:

  • The client will need block tokens to get datanode block metadata, and these are currently only generated by the OM, and then by datanodes for verifications.
    • We would need to get either block or container tokens to the client from SCM if we want to use SCM's container data to query the datanodes.
  • Building a reverse container -> key mapping is an expensive task. On a large namespace this will not fit in memory and need to spill to disk somewhere. It will likely make the command more expensive than just calling GetBlock on the datanodes for each block in the keys.

Let me know what you think of the approach in HDDS-12206 and if it seems like it could meet the requirements in this doc.

@xichen01
Contributor Author

xichen01 commented Feb 7, 2025

Command ozone debug replicas verify ... [--checksums] [--metadata] [--padding] where
.....

  • --metadata (or other name): Only check block existence by GetBlock calls to the datanodes without downloading data.

We could add another flag like --container-state or similar to check the SCM container state for each container reference...
.....

This can be a similar method to check if the key is readable without downloading the key.

  • The container status check cannot be bypassed: you need to execute getContainer before executing GetBlock, since only normal containers can serve GetBlock.
  • GetBlock does not confirm the existence of the block file itself; it only fetches the block metadata from the DB, so GetBlock cannot check a block's existence well.
  • The headBlocks call mentioned in the document will check both the DB and the block files, and its return value will clearly indicate whether metadata or block files are missing, or whether there is some other exception. We can also let it check whether the size of the block file matches the record in the OM (this has a high probability of discovering abnormal overwrites, such as SCM assigning duplicate BlockIDs; see HDDS-8973 / #5018, "Ozone SCM HA should not allocate duplicate IDs when transferring leadership").

Building a reverse container -> key mapping is an expensive task. On a large namespace this will not fit in memory and need to spill to disk somewhere. It will likely make the command more expensive than just calling GetBlock on the datanodes for each block in the keys.

  • We only need to keep the container -> key mapping for the keys currently being checked in memory; we don't need the mapping for all keys. We can control the number of keys being checked via parameters, so this memory consumption is bounded.
  • The batch check (headBlock) mentioned in this document can achieve very efficient checking. The internal version we are currently using (which obtains metadata from OM via listKeys) reaches around 70k keys/s with no obvious impact on online users, and the bottleneck is OM's listKeys. If listKeysLight were used, it could reach a higher speed.
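The memory-bounding idea can be sketched as follows (Python; the key and block shapes are hypothetical). Only one listing batch's container -> key mapping is materialized at a time, so memory use is governed by the batch size rather than the namespace size:

```python
from collections import defaultdict

def batched_container_mappings(keys, batch_size=1000):
    """Yield one container -> [(local_id, key_name)] mapping per batch
    of keys; a batch's mapping becomes garbage once its checks finish,
    so memory stays proportional to batch_size, not to the namespace."""
    for start in range(0, len(keys), batch_size):
        mapping = defaultdict(list)
        for key in keys[start:start + batch_size]:
            for container_id, local_id in key["blocks"]:
                mapping[container_id].append((local_id, key["name"]))
        yield mapping
```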

@errose28
Contributor

This can be a similar method to check if the key is readable without downloading the key.

Yes I agree this tool would fit well under the ozone debug replicas verify proposal, I think we just need to work out some of the implementation details.

  • The container status check cannot be bypassed. You need to execute getContainer before executing GetBlock. Only normal containers can perform GetBlock.

I don't follow this. GetBlock only needs a block token generated by the OM or datanode. Calling SCM's getContainer is not required, nor is it done on the read or reconstruct paths where this API is currently used. I am also not sure what is meant by a normal container here. Does this mean one where replicas are not missing or deleted? In that case reading the block metadata would fail as expected.

The headBlocks mentioned in the document will check the DB and Block files, and its return value will clearly indicate whether it is missing metadata or Block files, or other exceptions

This sounds like a good addition. I think the new API should be called something like verifyBlocks instead of headBlock though. Head requests usually retrieve metadata in the same way that getBlock already does, but the description here returns results about metadata verification, not the actual metadata.

We only need to save the mapping of container -> key of the key being checked in memory. We don't need to save the mapping of all keys. We can control the number of keys being checked through parameters. 

Got it. I don't think it's clear in the doc whether the steps are applied to a single key or a whole section of the namespace. For the proposed ozone debug replicas verify command we would like it to operate on a specified portion of the namespace. This allows the command to parallelize listing commands across buckets and run verifications for a list batch concurrently while the next listing batch is being fetched.

It sounds like the block verification API is looking to batch requests for multiple blocks in one request to the DN. In this case the per-key mapping should probably be datanode to block, not container to block, since blocks for the key may be under the same datanode but in different containers.

The internal version we are currently using (using listKeys to obtain metadata from OM) can reach a speed of 70k/s (online users have no obvious perception), and the bottleneck here is OM's listKeys. If listKeysLight is used, it can reach a higher speed.

I don't see how this would work, because listKeysLight does not include block information; that is what makes it fast. However, we need both the block locations in the cluster and the block tokens to read them in order to run the verification checks. ListKeysResponse returns KeyInfo objects which, if you drill down, have the block token present. The BasicKeyInfo in ListKeysLightResponse has no information that can be used to locate or read the blocks.

Since the list keys API is paginated, the bottleneck on list keys would likely be alleviated by starting the block checks for one listing batch while concurrently fetching the next listing batch as mentioned above.
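The overlap described above can be sketched as a simple two-stage pipeline (Python; `list_batches` and `verify_batch` are hypothetical stand-ins for the paginated listing and the per-batch block checks):

```python
import concurrent.futures

def verify_namespace(list_batches, verify_batch):
    """Verify the current listing batch on a worker thread while the
    next batch is fetched on this thread, hiding listing latency."""
    failures = []
    batches = iter(list_batches)
    current = next(batches, None)
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        while current is not None:
            future = pool.submit(verify_batch, current)  # check this batch
            current = next(batches, None)                # fetch next batch
            failures.extend(future.result())
    return failures
```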

- Disadvantages: Requires API enhancements to include Block information.
2. Using `listStatus` and `listKeys`:
- Advantages: Can be used directly without additional development.
- Disadvantages: Lower performance and may become a bottleneck.
Contributor


The lower performance comes from the block information that is passed in. If we use listKeysLight and add block information, it will be the same as listKeys.

Contributor Author


Probably just add the BlockID (ContainerID + LocalID), which should have less of an impact.

@xichen01
Contributor Author

xichen01 commented Feb 15, 2025

I don't follow this. GetBlock only needs a block token generated by the OM or datanode. Calling SCM's getContainer is not required, nor is it done on the read or reconstruct paths where this API is currently used. I am also not sure what is meant by a normal container here. Does this mean one where replicas are not missing or deleted? In that case reading the block metadata would fail as expected.

  • Generally, users use this command to check whether a key is readable and, if not, what the reason is. Possible reasons include: the container does not exist, the container is DELETED, or the block file does not exist. So getContainer is necessary as a step in checking the readability of a key.
  • A simple process can be summarized as: list -> getContainer -> getContainerReplicas -> getBlock/verifyBlocks.
  • For clusters with security turned on, the process may be different: listKeys will not return a block token, so we may need to add a getKeyInfo step to obtain the token.
  • If we want to check the readability of a key, we need to call getContainerReplicas so that we can check each block's replicas.
  • A "normal container" is one that exists, is not in the DELETED or DELETING state, and can provide service.

This sounds like a good addition. I think the new API should be called something like verifyBlocks instead of headBlock though. Head requests usually retrieve metadata in the same way that getBlock already does, but the description here returns results about metadata verification, not the actual metadata.

Yes, I think verifyBlocks is a better name.

Got it. I don't think it's clear in the doc whether the steps are applied to a single key or a whole section of the namespace. For the proposed ozone debug replicas verify command we would like it to operate on a specified portion of the namespace. This allows the command to parallelize listing commands across buckets and run verifications for a list batch concurrently while the next listing batch is being fetched.

Checking at the volume / bucket / key level will be supported; I will improve the documentation.

It sounds like the block verification API is looking to batch requests for multiple blocks in one request to the DN. In this case the per-key mapping should probably be datanode to block, not container to block, since blocks for the key may be under the same datanode but in different containers.

It's actually a two-level mapping: ContainerID -> LocalID -> Key. This allows you to look up from a BlockID (ContainerID + LocalID) back to its key.
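A minimal sketch of that two-level index (Python; the key shape is hypothetical), resolving a reported BlockID back to its key:

```python
def build_block_index(keys):
    """Build the two-level mapping ContainerID -> LocalID -> key name."""
    index = {}
    for key in keys:
        for container_id, local_id in key["blocks"]:
            index.setdefault(container_id, {})[local_id] = key["name"]
    return index

def key_for_block(index, container_id, local_id):
    """Resolve a BlockID (ContainerID + LocalID) back to its key name."""
    return index.get(container_id, {}).get(local_id)
```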

I don't see how this would work because listKeysLight does not include block information. That is what makes it fast. However we need both the block locations in the cluster and the block tokens to read them to run the verification checks. ListKeysResponse returns KeyInfo objects which if you drill down have the block token present. The BasicKeyInfo in ListKeysLightResponse has no informtation that can be used to locate or read the blocks.

We need to modify listKeysLight to add at least the BlockID information.

Since the list keys API is paginated, the bottleneck on list keys would likely be alleviated by starting the block checks for one listing batch while concurrently fetching the next listing batch as mentioned above.

For clusters without security turned on, getting the key list directly from the DB would perhaps give the best performance; however, this requires the command to be executed on the OM node.

@xichen01 xichen01 requested a review from errose28 February 15, 2025 09:34
@xichen01
Contributor Author

@errose28 PTAL.

@github-actions

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions bot added the stale label Nov 12, 2025
@github-actions

Thank you for your contribution. This PR is being closed due to inactivity. If needed, feel free to reopen it.

@github-actions github-actions bot closed this Nov 19, 2025
