@sarvekshayr
Contributor

What changes were proposed in this pull request?

Added Byteman rules that inject the container states UNHEALTHY, DELETED, and INVALID so that the output of ozone debug replicas verify --container-state can be tested and validated. The tests confirm that ContainerStateVerifier detects these states and reports the check as failed.
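
As a rough illustration of the behaviour these tests exercise, here is a minimal Java sketch. It does not reproduce the real ContainerStateVerifier: the class name, the State enum, and the passes method are hypothetical. That every state other than the three injected ones passes is an assumption made for the sketch, not something taken from the patch.

```java
// Hypothetical sketch; not the actual ContainerStateVerifier code.
final class ContainerStateCheckSketch {
    // Illustrative subset of replica states; the names are assumptions for this sketch.
    enum State { OPEN, CLOSED, UNHEALTHY, DELETED, INVALID }

    // Returns true when the check passes. Only the three injected states are
    // treated as failures; that any other state passes is an assumption.
    static boolean passes(State state) {
        switch (state) {
            case UNHEALTHY:
            case DELETED:
            case INVALID:
                return false;
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        // Print a PASS/FAIL line for every state in the sketch.
        for (State s : State.values()) {
            System.out.println(s + " -> " + (passes(s) ? "PASS" : "FAIL"));
        }
    }
}
```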

What is the link to the Apache JIRA?

HDDS-13326

How was this patch tested?

CI: https://github.com/sarvekshayr/ozone/actions/runs/16200808599/job/45739884752#step:13:193

==============================================================================
Container-State-Verifier :: Test container state on a UNHEALTHY, DELETED an...
==============================================================================
Verify Container State With Unhealthy Container Replica               | PASS |
------------------------------------------------------------------------------
Verify Container State With Deleted Container Replica                 | PASS |
------------------------------------------------------------------------------
Verify Container State With Invalid Container Replica                 | PASS |
------------------------------------------------------------------------------
Container-State-Verifier :: Test container state on a UNHEALTHY, D... | PASS |
3 tests, 3 passed, 0 failed
==============================================================================

@Tejaskriya Tejaskriya self-requested a review July 14, 2025 08:46
Contributor

@Tejaskriya Tejaskriya left a comment

Thanks for working on this, @sarvekshayr. I left one comment below.

It would be nice to reduce the duplication between the three added rules so that a single rule, or at least a single rule file, is enough, but I am not aware of a way to do that. @ssulav, do you have any ideas?

@Tejaskriya Tejaskriya requested review from dombizita and ssulav July 14, 2025 08:52
@ssulav
Contributor

ssulav commented Jul 14, 2025

Yes, we can merge all the rules in a single file if we want, and apply the rule file only once.

Actually, the rules all target the same method signature, so merging them isn't possible.

I have an idea, though I haven't tested it: we may be able to achieve this with an environment variable and a conditional check. Set
export OZONE_CONTAINER_OVERRIDE_STATE=
before applying the rule, and keep only the rule below.

# Untested sketch: return the state named by OZONE_CONTAINER_OVERRIDE_STATE, if it is set.
RULE Override getState() using environment variable
CLASS org.apache.hadoop.ozone.container.common.impl.ContainerData
METHOD getState
AT ENTRY
# BIND declares the rule-local variable used by the condition and the actions.
BIND state:String = System.getenv("OZONE_CONTAINER_OVERRIDE_STATE")
IF state != null
DO
  traceln("BYTEMAN RULE: Overriding getState() using env var to return " + state);
  return org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos$ContainerDataProto$State.valueOf(state)
ENDRULE
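
For readers less familiar with Byteman, the lookup this rule performs can be sketched in plain Java. This is only an illustration under stated assumptions: StateOverrideSketch, its State enum, and resolve are hypothetical names, and the enum is a stand-in for ContainerDataProto.State rather than the real type. Unlike the rule, the sketch also treats an empty value as "not set", because Enum.valueOf throws IllegalArgumentException for a name that matches no constant.

```java
// Hypothetical sketch of the env-var lookup; not Ozone or Byteman code.
final class StateOverrideSketch {
    // Stand-in for ContainerDataProto.State; values are illustrative only.
    enum State { OPEN, CLOSED, UNHEALTHY, DELETED, INVALID }

    // Returns the overriding state when one is configured, otherwise the actual state.
    // An unknown name makes State.valueOf throw IllegalArgumentException.
    static State resolve(String override, State actual) {
        if (override == null || override.isEmpty()) {
            return actual;
        }
        return State.valueOf(override);
    }

    public static void main(String[] args) {
        // System.getenv reads the environment of the JVM that executes this call.
        String override = System.getenv("OZONE_CONTAINER_OVERRIDE_STATE");
        System.out.println(resolve(override, State.CLOSED));
    }
}
```

Because System.getenv reads the environment of the process it runs in, the variable would have to be visible to the JVM in which the rule executes, not only to the shell that applies the rule.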

@ssulav
Contributor

ssulav commented Jul 15, 2025

Also, right now we use a Byteman fault to make org.apache.hadoop.ozone.container.common.impl.ContainerData.getState() return whatever state we want. The debug verify CLI then reports that state, because it calls the same API.

I would say this is not the best implementation. It would be better if we could somehow make the state persistent on the datanode rather than just overriding the API response.

@dombizita dombizita added the tools label Jul 16, 2025
Contributor

@dombizita dombizita left a comment

Thank you for working on this @sarvekshayr! Overall it looks good to me. I see that you reduced code duplication, but I think @ssulav meant for the byteman files as well, right? That they could be merged to one rule, instead of the 3 files; but I'm not sure how that would work

@dombizita
Contributor

> Also, right now we use a Byteman fault to make org.apache.hadoop.ozone.container.common.impl.ContainerData.getState() return whatever state we want. The debug verify CLI then reports that state, because it calls the same API.
>
> I would say this is not the best implementation. It would be better if we could somehow make the state persistent on the datanode rather than just overriding the API response.

I think actually creating these unhealthy containers on the datanodes would require a lot of pre-test setup, and it would test the container health handling on the datanodes more than this tool's functionality. I get that overriding the very method this tool uses is not ideal, but in this case it makes sense to me, since the same method is used in other places in the code for testing container-health-related things. What do you think: is there another place we could hook into that would not require setting up scenarios on the DN side to produce the unhealthy containers?
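
The trade-off discussed here can be sketched in Java. This is a hedged illustration only: GetterOverrideSketch and its nested classes are hypothetical, and a subclass stands in for the Byteman rule, which in reality rewrites the method at load time instead of subclassing. The point is that every caller of getState() sees the injected value, while the stored value stays unchanged.

```java
// Hypothetical sketch; a subclass stands in for the injected Byteman rule.
final class GetterOverrideSketch {
    enum State { CLOSED, UNHEALTHY }

    // Stand-in for a container record whose state is stored on the datanode.
    static class ContainerRecord {
        private final State stored;
        ContainerRecord(State stored) { this.stored = stored; }
        State getState() { return stored; }
        // Represents what is actually persisted, independent of getState().
        final State storedState() { return stored; }
    }

    // Overrides only the accessor, as the fault injection does.
    static class FaultInjected extends ContainerRecord {
        private final State forced;
        FaultInjected(State stored, State forced) {
            super(stored);
            this.forced = forced;
        }
        @Override
        State getState() { return forced; }
    }

    // Any caller that goes through getState() reports the injected state.
    static String report(ContainerRecord record) {
        return "state=" + record.getState();
    }

    public static void main(String[] args) {
        ContainerRecord record = new FaultInjected(State.CLOSED, State.UNHEALTHY);
        System.out.println(report(record));
        System.out.println("stored=" + record.storedState());
    }
}
```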

@sarvekshayr
Contributor Author

> Thank you for working on this @sarvekshayr! Overall it looks good to me. I see that you reduced code duplication, but I think @ssulav meant for the byteman files as well, right? That they could be merged to one rule, instead of the 3 files; but I'm not sure how that would work

I tested the Byteman rule that overrides getState() using an environment variable, but it didn't seem to work.

@dombizita dombizita requested review from ssulav and removed request for ssulav July 28, 2025 11:38
Contributor

@ssulav ssulav left a comment

LGTM. Thanks for working on the review comments.

@Tejaskriya Tejaskriya merged commit 9987f6a into apache:master Aug 4, 2025
69 of 70 checks passed
@Tejaskriya
Contributor

Thanks for the test, @sarvekshayr, and for the reviews, @ssulav and @dombizita.
