Skip to content

Conversation

@siddhantsangwan
Copy link
Contributor

What changes were proposed in this pull request?

In some cases, balancer displays size of data moved as 0GB even after scheduling some moves and successfully completing an iteration. Check the Jira for one such example. This happens even though containers have actually been moved, implying that the problem is in calculating the size.

The cause could be division of two long values causing a truncation of the fractional part in ContainerBalancer#moveContainer():

metrics.incrementDataSizeMovedGBInLatestIteration(
                    containerInfo.getUsedBytes() / OzoneConsts.GB);

If container sizes are less than a GB, the result would be 0.

This PR sets this metric at the end of the iteration instead of incrementing it after every move. Result is logged as a double value.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6976

How was this patch tested?

Existing UTs.

@siddhantsangwan
Copy link
Contributor Author

@symious can you also review, please?

@symious
Copy link
Contributor

symious commented Jul 18, 2022

We also met some cases that 0GB is moved, but the root cause is not math division.
In our case, the move request was sent out but not executed by datanodes, then after timeout, no container was balanced, thus showing 0GB was moved.

@kerneltime
Copy link
Contributor

kerneltime commented Jul 19, 2022

Why not just log it in the units used for tracking the data (i.e. Bytes).

@siddhantsangwan
Copy link
Contributor Author

siddhantsangwan commented Jul 19, 2022

We also met some cases that 0GB is moved, but the root cause is not math division.
In our case, the move request was sent out but not executed by datanodes, then after timeout, no container was balanced, thus showing 0GB was moved.

Were you able to find out why those commands weren't executed in the datanodes? Any metrics on how many copy/delete requests were sent to these datanodes in one iteration?

@siddhantsangwan
Copy link
Contributor Author

Why not just log it in the units used for tracking the data (i.e. Bytes).

It's an attempt to make the logs more readable. It's easier to read GBs, but of course bytes is more accurate.

@symious
Copy link
Contributor

symious commented Jul 19, 2022

Were you able to find out why those commands weren't executed in the datanodes? Any metrics on how many copy/delete requests were sent to these datanodes in one iteration?

Can check the replication queue size on DN, and the timeout counts of ContainerBalancer?

@kerneltime
Copy link
Contributor

Why not just log it in the units used for tracking the data (i.e. Bytes).

It's an attempt to make the logs more readable. It's easier to read GBs, but of course bytes is more accurate.

I would keep it in bytes and simplify the code. We can create a dashboard off metrics which does the conversion correctly, logs are used by developers who would want accuracy.

@siddhantsangwan
Copy link
Contributor Author

I would keep it in bytes and simplify the code. We can create a dashboard off metrics which does the conversion correctly, logs are used by developers who would want accuracy.

Makes sense. @mukul1987 suggested logging something like this:

Iteration Summary-
Number of datanodes involved: 4
Size moved: 1GB (1073741824 Bytes)
Number of container moves completed: 5

How does this look @kerneltime?

@siddhantsangwan
Copy link
Contributor Author

Updated the PR. Let's fix the division issue in this one and debug timeout issues in another JIRA.

@symious
Copy link
Contributor

symious commented Jul 20, 2022

Sure, our issue is solved with PR: #3497.
Can check if it also fits your problem.

@kerneltime
Copy link
Contributor

kerneltime commented Jul 20, 2022

I would keep it in bytes and simplify the code. We can create a dashboard off metrics which does the conversion correctly, logs are used by developers who would want accuracy.

Makes sense. @mukul1987 suggested logging something like this:

Iteration Summary-
Number of datanodes involved: 4
Size moved: 1GB (1073741824 Bytes)
Number of container moves completed: 5

How does this look @kerneltime?

I would keep it in bytes and simplify the code. We can create a dashboard off metrics which does the conversion correctly, logs are used by developers who would want accuracy.

Makes sense. @mukul1987 suggested logging something like this:

Iteration Summary-
Number of datanodes involved: 4
Size moved: 1GB (1073741824 Bytes)
Number of container moves completed: 5

How does this look @kerneltime?

Ideally if you want to improve readability you need to be able to map the bytes to the right unit and print it (kb, mb, gb, tb..) This should be ok for now, unless hadoop ecosystem has a package that will do the string conversion with appropriate units for you.

@siddhantsangwan
Copy link
Contributor Author

Ideally if you want to improve readability you need to be able to map the bytes to the right unit and print it (kb, mb, gb, tb..) This should be ok for now, unless hadoop ecosystem has a package that will do the string conversion with appropriate units for you.

Thanks for the tip. Updated with appropriate units.

@siddhantsangwan siddhantsangwan merged commit 0886f62 into apache:master Jul 21, 2022
@siddhantsangwan
Copy link
Contributor Author

Thanks for the reviews! I've merged this PR.

duongkame pushed a commit to duongkame/ozone that referenced this pull request Aug 16, 2022
…cessful iteration (apache#3604)

(cherry picked from commit 0886f62)
Change-Id: I9756a9f21e81373d0011c9cdc16eab343a88fdd8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants