Fixed equal distribution strategy when exist disabled middleManagers. #2472
drcrallen merged 1 commit into apache:master
|
@andresgomezfrr how will sorting by IP fix the issue? |
|
I thought that this code would filter out the disabled host since it is less than the |
|
Let me try to explain! It is true that this code filters out the disabled host using minWorkerVer. The problem is that when sortedWorkers is sorted using the compare method, if two or more workers have the same getCurrCapacityUsed(), the Comparator treats them as equal and only one of them is returned. So in the scenario that I explained, sometimes the only worker returned is AAA, and it is disabled. In my pull request, if two middleManagers have the same getCurrCapacityUsed(), I sort them by IP address (I use the middleManager IP address because I assume it is unique inside a Druid cluster). That way all the workers are returned in sortedWorkers, and later they are filtered using minWorkerVer. |
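To make the collapse concrete, here is a minimal sketch. It assumes the sorted collection behaves like a `java.util.TreeSet` built with the comparator, which drops any element that compares as equal to one already present. The `Worker` class and the host names are hypothetical stand-ins, not Druid's own types.

```java
import java.util.Comparator;
import java.util.TreeSet;

class CapacityTieDemo {
    // Hypothetical stand-in for a middleManager; not Druid's ZkWorker.
    static final class Worker {
        final String host;
        final int currCapacityUsed;

        Worker(String host, int currCapacityUsed) {
            this.host = host;
            this.currCapacityUsed = currCapacityUsed;
        }
    }

    // Compares on capacity only: workers with equal capacity compare as 0.
    static final Comparator<Worker> BY_CAPACITY =
        Comparator.comparingInt((Worker w) -> w.currCapacityUsed);

    // Same ordering, with the host as a tie-breaker so distinct workers never compare as 0.
    static final Comparator<Worker> BY_CAPACITY_THEN_HOST =
        BY_CAPACITY.thenComparing((Worker w) -> w.host);

    // Adds three workers with identical capacity and reports how many the set kept.
    static int distinctWorkers(Comparator<Worker> order) {
        TreeSet<Worker> sorted = new TreeSet<>(order);
        sorted.add(new Worker("AAA", 0));
        sorted.add(new Worker("BBB", 0));
        sorted.add(new Worker("CCC", 0));
        return sorted.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctWorkers(BY_CAPACITY));           // prints 1
        System.out.println(distinctWorkers(BY_CAPACITY_THEN_HOST)); // prints 3
    }
}
```

With the capacity-only comparator, only the first worker added survives, so if that one happens to be disabled there is nothing left to filter.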
so how about comparation = zkWorker.getWorker().getVersion().compareTo(zkWorker2.getWorker().getVersion());
FillCapacityWorkerSelectStrategy compares workers by capacity (asc) and then by host. The two comparators could be merged into one.
Both solutions work too. Another possibility is to make the second comparison on version or hostname. Yes, the solution in FillCapacityWorkerSelectStrategy is similar to what I did. I'm going to change IP address to host to make the two comparators more similar.
|
I updated my branch to the current druid master branch. |
|
👍 , nice catch |
|
@andresgomezfrr Can you add a comment in the code about why the host sorting is needed? The assumption that only one middle manager exists per ip address is one of the legacy assumptions that breaks in a lot of ways (like containerized services) and the "correct" assumption at the moment is that a middle manager is unique in its host/port combination. Since the legacy assumption of one worker per ip is "ok" for now, can you please clarify why it was added so that future developers can know if it is an essential assumption or simply a work-around for an issue? |
|
Yeah, you are right @drcrallen, the "correct" assumption is the host/port combination. I added some comments explaining why the host sorting is needed. What do you think about them? |
|
@andresgomezfrr as @drcrallen mentioned, using host is not the best option for the long term. Why not use the version as I suggested before? zkWorker.getWorker().getVersion().compareTo(zkWorker2.getWorker().getVersion()); |
|
@drcrallen told that |
|
@andresgomezfrr if one of the workers is disabled, its version is an empty string, so by construction it will differ from the other workers unless both are disabled. |
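A rough sketch of the version tie-break follows, under the same assumption that the workers end up in a `TreeSet`-like sorted set. The `Worker` class, the host names, and the version string `"v1"` are made up for illustration. Only the empty version for a disabled worker comes from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class VersionTieBreakDemo {
    // Hypothetical stand-in for a middleManager; a disabled one has an empty version.
    static final class Worker {
        final String host;
        final String version;
        final int currCapacityUsed;

        Worker(String host, String version, int currCapacityUsed) {
            this.host = host;
            this.version = version;
            this.currCapacityUsed = currCapacityUsed;
        }
    }

    // Capacity first, then version as the tie-breaker.
    static final Comparator<Worker> BY_CAPACITY_THEN_VERSION =
        Comparator.comparingInt((Worker w) -> w.currCapacityUsed)
                  .thenComparing((Worker w) -> w.version);

    // One disabled worker and two enabled ones, all with the same capacity used.
    static List<Worker> scenario() {
        return Arrays.asList(
            new Worker("AAA", "", 0),    // disabled
            new Worker("BBB", "v1", 0),
            new Worker("CCC", "v1", 0));
    }

    // Returns the hosts that remain after the sorted set drops "equal" workers.
    static List<String> survivingHosts(List<Worker> workers) {
        TreeSet<Worker> sorted = new TreeSet<>(BY_CAPACITY_THEN_VERSION);
        sorted.addAll(workers);
        List<String> hosts = new ArrayList<>();
        for (Worker w : sorted) {
            hosts.add(w.host);
        }
        return hosts;
    }

    public static void main(String[] args) {
        System.out.println(survivingHosts(scenario())); // prints [AAA, BBB]
    }
}
```

In this sketch CCC still collapses into BBB because both share capacity and version, but an enabled worker always remains alongside the disabled one, which is enough for the later minWorkerVer filter to find a candidate.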
|
Ohh, that's true @b-slim! So I think this is another possibility 👍 I suppose that we can change FillCapacityWorkerSelectStrategy too. What do you think @drcrallen? |
|
@andresgomezfrr sure you can use it in both cases ! |
|
@andresgomezfrr thanks for the contrib 👍 |
|
you're welcome! 😄 |
|
@andresgomezfrr I am not sure if you have filled out the CLA. General question to @fjy: where can we check who is on the CLA list? |
|
@b-slim I think only PMC has access to that list right now. For privacy reasons it might stay that way for right now. I'd like to get some sort of github hook in so manual checking isn't needed. But for now bugging one of the PMC members is the best we have. @andresgomezfrr is not on the list so please do fill out the CLA at the link provided. |
|
I have filled out the CLA. But @drcrallen, I think that my company signed the Corporate CLA. |
|
@andresgomezfrr individual CLA found. Please squash and I think this should be good to go. |
…th same currCapacityUsed.
e6e6c67 to 07d714b
|
Ready!! 😄 |
|
Cool, thanks for the contrib! 👍 |
Fixed equal distribution strategy when exist disabled middleManagers.
Hi all,
This week, while working on our cluster, we detected a small issue. When we have, for example, 5 middleManagers that all have the same currentCapacityUsed and some of them are disabled, new indexing tasks may fail to start.
In this scenario, the current equal distribution strategy may return a single middleManager. If that middleManager is disabled, the new tasks can't start and go to pending status.
To fix the issue I changed the logic inside equal distribution: it now sorts the middleManagers by currentCapacityUsed, but if two middleManagers have the same currentCapacityUsed, they are sorted by IP address.
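The whole scenario can be sketched end to end as below. This is an illustration only, not the code in the pull request: the `Worker` class, `selectHost`, the host names, and the version strings are hypothetical. It also assumes that a worker is valid when its version is not less than `minWorkerVer` and that the sorted collection is a `TreeSet`.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.TreeSet;

class EqualDistributionSketch {
    // Hypothetical stand-in for a middleManager; a disabled one has an empty version.
    static final class Worker {
        final String host;
        final String version;
        final int currCapacityUsed;

        Worker(String host, String version, int currCapacityUsed) {
            this.host = host;
            this.version = version;
            this.currCapacityUsed = currCapacityUsed;
        }
    }

    // Five middleManagers with the same capacity used; two of them are disabled.
    static List<Worker> fiveManagers() {
        return Arrays.asList(
            new Worker("AAA", "", 0),    // disabled
            new Worker("BBB", "", 0),    // disabled
            new Worker("CCC", "v1", 0),
            new Worker("DDD", "v1", 0),
            new Worker("EEE", "v1", 0));
    }

    // Sorts by capacity (optionally tie-breaking on host), then returns the first
    // worker whose version is not less than minWorkerVer (assumed validity rule).
    static Optional<String> selectHost(List<Worker> workers, String minWorkerVer, boolean tieBreakOnHost) {
        Comparator<Worker> order = Comparator.comparingInt((Worker w) -> w.currCapacityUsed);
        if (tieBreakOnHost) {
            order = order.thenComparing((Worker w) -> w.host);
        }
        TreeSet<Worker> sorted = new TreeSet<>(order);
        sorted.addAll(workers);
        for (Worker w : sorted) {
            if (w.version.compareTo(minWorkerVer) >= 0) {
                return Optional.of(w.host);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Without the tie-break only AAA survives the sort, and it is disabled.
        System.out.println(selectHost(fiveManagers(), "v1", false).orElse("none")); // prints none
        // With the tie-break all five survive and the first valid one is chosen.
        System.out.println(selectHost(fiveManagers(), "v1", true).orElse("none"));  // prints CCC
    }
}
```

The first call reproduces the pending-task situation described above, and the second shows how the tie-break lets the filter see every middleManager.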