-
Notifications
You must be signed in to change notification settings - Fork 485
DPL: process region callbacks as early as possible #6553
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In case we are not running, we can afford processing the callbacks as soon as possible.
|
@davidrohr this should anticipate the callbacks as early as possible. The mutex should guarantee we do not get overlap even when the callback is received before we switch to running but terminates once we are already in running. |
|
ok, thx, will have to try it in FST with DD to see that it doesn't break anything. Then we can merge it and once we have new RPMs we can try it with AliECS. |
|
I tried this, and it breaks the full system test. |
|
I guess it still comes too late? Does it fail without this PR? |
|
Before this pr it comes late but still before the first data. After the pr it comes after the first data only.
|
|
Ok, the issue is that if the event arrives to late, after the event loop is sitting in uv_run, then the next chance to run the callback is actually after the data has been processed. The latest commit should fix it. Still we need a way to stay in "init" until all the regions have been registered or maybe I need to awake the loop when a memory region event was triggered, so that they get processed immediately. @ironMann @davidrohr Can I assume that all the region events are fired before data starts to be sent? |
|
@ironMann ok, thanks, but then I do not understand why this PR (or the one before) does not work: if you do not transition to ready, DPL will not transition to START, and therefore all the events should be handled synchronously in the callback, unless they are delivered late. I can easily add something which wakes up the event loop when a region event is delivered, but I have the impression the problem is that FairMQ itself delivers the events too late. |
|
Region callback for a given region will fire shortly after its creation, theoretically even before These callbacks are not synchronized to states. |
|
Ok, but how is that possible that DPL is already in RUN state, waiting in the DPL event loop, when the notification happens? And that we know it's actually the case, because @davidrohr saw the crash with the previous incarnation of this PR. |
If one of the above is not true, then indeed it can happen easily that DPL only sees the events during Running. |
|
|
@ktf : With this fix, the callback is processed before the data. Can I interpret this as this is in the state transision going from ready to running? |
The callback is called while holding a shm mutex that is used for the condition variable (and for region creation/destruction). So when multiple devices register, their callback execution of multiple devices will be serialized. This would explain why you don't see them right away, at least not on each device. The idea behind the mutex here (besides the CV) is to avoid region creator destroying a region while a peer is still using it in his callback. |
|
@davidrohr indeed if that is the place where the callback is delivered, maybe the state is already "Running" and that would explain why your code is not executed as the event arrives. What I can improve on my side is making the callback handling awake the event loop in DPL, so that we do not need to wait for the first data before handling it. Still that does not explain why the first invocation of handleRegionCallbacks was skipped, as that happens immediately in ConditionalRun, unless the message |
|
@TimoWilken what's up with the CS8 tests? They are in "Expected — Waiting for status to be reported" since yesterday night. |
|
#6621 should improve the callback processing latency on DPL side. Not sure if that is enough. |
In case we are not running, we can afford processing the callbacks
as soon as possible.