HDDS-6976. 0GB data moved by container balancer after successful iteration #3604

siddhantsangwan · 2022-07-18T13:52:28Z

What changes were proposed in this pull request?

In some cases, the balancer reports the size of data moved as 0GB even after scheduling moves and successfully completing an iteration; see the Jira for one such example. This happens even though containers have actually been moved, which implies that the problem lies in how the size is calculated.

A likely cause is the division of two long values in ContainerBalancer#moveContainer(), which truncates the fractional part:

metrics.incrementDataSizeMovedGBInLatestIteration(
    containerInfo.getUsedBytes() / OzoneConsts.GB);

For any container smaller than 1GB, the result is 0.

This PR sets the metric once at the end of the iteration instead of incrementing it after every move. The result is logged as a double value.
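To illustrate the truncation and the fix direction, here is a minimal, self-contained sketch. It is not the code from this PR: the class `MovedSizeDemo`, its methods, and the local `GB` constant (a stand-in for `OzoneConsts.GB`, assumed to be 2^30 bytes) are hypothetical names used only for illustration.

```java
import java.util.Locale;

// Hypothetical demo class; not part of ContainerBalancer.
final class MovedSizeDemo {
  // Stand-in for OzoneConsts.GB, assumed here to be 2^30 bytes.
  static final long GB = 1024L * 1024L * 1024L;

  // Per-move conversion: each long division drops the fractional part.
  static long sumOfTruncatedGB(long[] containerSizesInBytes) {
    long totalGB = 0;
    for (long bytes : containerSizesInBytes) {
      totalGB += bytes / GB; // 0 for any container smaller than 1 GB
    }
    return totalGB;
  }

  // Accumulate bytes first, then convert once in floating point.
  static double totalGB(long[] containerSizesInBytes) {
    long totalBytes = 0;
    for (long bytes : containerSizesInBytes) {
      totalBytes += bytes;
    }
    return (double) totalBytes / GB;
  }

  public static void main(String[] args) {
    // Three containers of 512 MB, 768 MB and 256 MB (1.5 GB in total).
    long[] sizes = {512L * 1024 * 1024, 768L * 1024 * 1024, 256L * 1024 * 1024};
    System.out.println(sumOfTruncatedGB(sizes));                             // 0
    System.out.println(String.format(Locale.ROOT, "%.2f", totalGB(sizes))); // 1.50
  }
}
```

With three sub-GB containers, the per-move sum stays at 0 even though 1.5 GB was moved, which matches the symptom described above.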

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6976

How was this patch tested?

Existing UTs.

siddhantsangwan · 2022-07-18T13:54:26Z

@symious can you also review, please?

symious · 2022-07-18T14:43:20Z

We have also seen cases where 0GB is reported as moved, but the root cause was not the integer division.
In our case, the move request was sent out but not executed by the datanodes; after the timeout, no container had been balanced, so the balancer showed 0GB moved.

kerneltime · 2022-07-19T00:47:54Z

Why not just log it in the units used for tracking the data (i.e. Bytes).

siddhantsangwan · 2022-07-19T05:38:31Z

> We have also seen cases where 0GB is reported as moved, but the root cause was not the integer division.
> In our case, the move request was sent out but not executed by the datanodes; after the timeout, no container had been balanced, so the balancer showed 0GB moved.

Were you able to find out why those commands weren't executed in the datanodes? Any metrics on how many copy/delete requests were sent to these datanodes in one iteration?

siddhantsangwan · 2022-07-19T05:52:53Z

> Why not just log it in the units used for tracking the data (i.e. Bytes).

It's an attempt to make the logs more readable. It's easier to read GBs, but of course bytes is more accurate.

symious · 2022-07-19T06:57:56Z

> Were you able to find out why those commands weren't executed in the datanodes? Any metrics on how many copy/delete requests were sent to these datanodes in one iteration?

You could check the replication queue size on the DN and the timeout counts of ContainerBalancer.

kerneltime · 2022-07-19T07:47:27Z

> > Why not just log it in the units used for tracking the data (i.e. Bytes).
>
> It's an attempt to make the logs more readable. It's easier to read GBs, but of course bytes is more accurate.

I would keep it in bytes and simplify the code. We can create a dashboard off the metrics that does the conversion correctly; logs are used by developers, who would want accuracy.

siddhantsangwan · 2022-07-19T09:58:25Z

> I would keep it in bytes and simplify the code. We can create a dashboard off the metrics that does the conversion correctly; logs are used by developers, who would want accuracy.

Makes sense. @mukul1987 suggested logging something like this:

Iteration Summary-
Number of datanodes involved: 4
Size moved: 1GB (1073741824 Bytes)
Number of container moves completed: 5

How does this look @kerneltime?

siddhantsangwan · 2022-07-19T10:31:00Z

Updated the PR. Let's fix the division issue in this one and debug timeout issues in another JIRA.

symious · 2022-07-20T02:22:33Z

Sure. Our issue was solved by PR #3497.
You can check whether it also fixes your problem.

kerneltime · 2022-07-20T05:24:39Z

> > I would keep it in bytes and simplify the code. We can create a dashboard off the metrics that does the conversion correctly; logs are used by developers, who would want accuracy.
>
> Makes sense. @mukul1987 suggested logging something like this:
>
> Iteration Summary-
> Number of datanodes involved: 4
> Size moved: 1GB (1073741824 Bytes)
> Number of container moves completed: 5
>
> How does this look @kerneltime?

Ideally, if you want to improve readability, you need to map the bytes to the right unit and print that (KB, MB, GB, TB, ...). This should be OK for now, unless the Hadoop ecosystem has a package that will do the string conversion with appropriate units for you.
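As a rough illustration of mapping a byte count to the right unit, here is a minimal sketch in plain Java. The class `ByteSizeFormatter` and its methods are hypothetical and are not what this PR merged; whether an existing Hadoop utility should be used instead is left open. Units are assumed to be binary (1 KB = 1024 bytes), consistent with the "1GB (1073741824 Bytes)" example in the suggested summary.

```java
import java.util.Locale;

// Hypothetical helper; names and output format are assumptions for illustration.
final class ByteSizeFormatter {
  private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB", "PB", "EB"};

  // Picks the largest binary unit that keeps the value at or above 1.
  static String format(long bytes) {
    if (bytes < 0) {
      throw new IllegalArgumentException("bytes must be non-negative: " + bytes);
    }
    double value = bytes;
    int unit = 0;
    while (value >= 1024 && unit < UNITS.length - 1) {
      value /= 1024;
      unit++;
    }
    // Plain bytes are printed exactly; larger units use two decimal places.
    String number = unit == 0
        ? Long.toString(bytes)
        : String.format(Locale.ROOT, "%.2f", value);
    return number + " " + UNITS[unit];
  }

  // Readable size plus the exact byte count, in the style of the summary above.
  static String sizeMovedLine(long bytes) {
    return "Size moved: " + format(bytes) + " (" + bytes + " Bytes)";
  }

  public static void main(String[] args) {
    System.out.println(sizeMovedLine(1073741824L)); // Size moved: 1.00 GB (1073741824 Bytes)
    System.out.println(sizeMovedLine(1536L));       // Size moved: 1.50 KB (1536 Bytes)
  }
}
```

Printing the exact byte count next to the rounded value keeps the log readable without giving up the accuracy asked for earlier in this thread.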

siddhantsangwan · 2022-07-20T06:42:01Z

> Ideally, if you want to improve readability, you need to map the bytes to the right unit and print that (KB, MB, GB, TB, ...). This should be OK for now, unless the Hadoop ecosystem has a package that will do the string conversion with appropriate units for you.

Thanks for the tip. Updated with appropriate units.

siddhantsangwan · 2022-07-21T06:27:25Z

Thanks for the reviews! I've merged this PR.

…cessful iteration (apache#3604) (cherry picked from commit 0886f62) Change-Id: I9756a9f21e81373d0011c9cdc16eab343a88fdd8

HDDS-6976. 0GB data moved by container balancer after successful iteration

7467c40

siddhantsangwan requested review from JacksonYao287 and lokeshj1703 July 18, 2022 13:52

log number of container moves and represent size in bytes at the end of iteration

cb5981d

do appropriate size unit conversion in logs

97359a2

kerneltime approved these changes Jul 21, 2022

siddhantsangwan merged commit 0886f62 into apache:master Jul 21, 2022
