HDDS-6976. 0GB data moved by container balancer after successful iteration #3604
Conversation
@symious can you also review, please?
We have also seen cases where 0GB is reported as moved, but there the root cause was not the integer division.
Why not just log it in the units used for tracking the data (i.e. bytes)?
Were you able to find out why those commands weren't executed in the datanodes? Are there any metrics on how many copy/delete requests were sent to these datanodes in one iteration?
It's an attempt to make the logs more readable. It's easier to read GBs, but of course bytes is more accurate. |
Can you check the replication queue size on the DN and the timeout counts of ContainerBalancer?
I would keep it in bytes and simplify the code. We can create a dashboard off the metrics that does the conversion correctly; logs are used by developers, who would want accuracy.
Makes sense. @mukul1987 suggested logging something like this: How does this look @kerneltime? |
Updated the PR. Let's fix the division issue in this one and debug timeout issues in another JIRA.
Sure, our issue is solved with PR: #3497.
Ideally, if you want to improve readability, you need to map the bytes to the right unit and print that (KB, MB, GB, TB, ...). This should be OK for now, unless the Hadoop ecosystem has a package that does the string conversion with appropriate units for you.
Thanks for the tip. Updated with appropriate units. |
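For illustration, the unit mapping discussed above could look roughly like the following. This is a minimal standalone sketch, not the code merged in this PR: the class `ByteFormatSketch`, the method `formatBytes`, the binary (1024-based) units, and the two-decimal format are all assumptions made for the example.

```java
import java.util.Locale;

// Hypothetical helper; not part of Ozone or Hadoop.
public class ByteFormatSketch {
    private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB", "PB", "EB"};

    // Picks the largest binary unit in which the value is at least 1
    // and prints the value with two decimal places.
    static String formatBytes(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        // Plain bytes are printed as an exact integer.
        return unit == 0
            ? bytes + " B"
            : String.format(Locale.ROOT, "%.2f %s", value, UNITS[unit]);
    }

    public static void main(String[] args) {
        System.out.println(formatBytes(512L));                // 512 B
        System.out.println(formatBytes(1536L));               // 1.50 KB
        System.out.println(formatBytes(1792L * 1024 * 1024)); // 1.75 GB
    }
}
```

Keeping the tracked value in bytes and converting only when building the log string means the metric itself stays exact.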
Thanks for the reviews! I've merged this PR.
…cessful iteration (apache#3604) (cherry picked from commit 0886f62) Change-Id: I9756a9f21e81373d0011c9cdc16eab343a88fdd8
What changes were proposed in this pull request?
In some cases, the balancer reports the size of data moved as 0GB even after scheduling moves and successfully completing an iteration. See the Jira for one such example. This happens even though containers have actually been moved, which implies that the problem is in how the size is calculated.
The likely cause is the division of two long values in ContainerBalancer#moveContainer(), which truncates the fractional part. If a container's size is less than a GB, the result is 0.
This PR sets this metric once at the end of the iteration instead of incrementing it after every move. The result is logged as a double value.
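The truncation can be reproduced outside the balancer. The sketch below is an illustration only and does not reproduce the actual ContainerBalancer code: the class `BalancerSizeSketch` and its methods are hypothetical, and the container sizes are made-up values chosen to be below 1 GB each.

```java
// Hypothetical illustration; names and sizes are assumptions, not Ozone code.
public class BalancerSizeSketch {
    static final long GB = 1024L * 1024 * 1024;

    // Per-move increment: each size is converted to whole GB before adding.
    static long sumPerMoveInGb(long[] containerSizesInBytes) {
        long totalGb = 0;
        for (long size : containerSizesInBytes) {
            totalGb += size / GB; // long / long truncates; anything below 1 GB adds 0
        }
        return totalGb;
    }

    // Accumulate in bytes during the iteration.
    static long sumBytes(long[] containerSizesInBytes) {
        long total = 0;
        for (long size : containerSizesInBytes) {
            total += size;
        }
        return total;
    }

    // Convert once at the end, using floating-point division.
    static double toGb(long bytes) {
        return (double) bytes / GB;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long[] moved = {512 * mb, 512 * mb, 768 * mb};
        System.out.println(sumPerMoveInGb(moved));  // 0
        System.out.println(toGb(sumBytes(moved)));  // 1.75
    }
}
```

With three containers of 512 MB, 512 MB and 768 MB, the per-move conversion reports 0 while the end-of-iteration conversion reports 1.75, which matches the symptom described above.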
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6976
How was this patch tested?
Existing UTs.