Fix scheduling of notification interfaces #3290
Conversation
force-pushed 3b2e8f9 to 85ad71c
|
I'd rather try to implement this without
EDIT: @jason-bragg @xiazen, does that sound like a fine approach?
force-pushed 85ad71c to df4f3e1
|
@xiazen @jason-bragg please look at the updated PR. I removed the abstraction I had previously introduced, but more importantly: I changed
|
Looks technically correct, though I'd prefer we move to the shared state pattern (Shared state publisher prototype #3291), as the pattern in this PR relies on the implementer of a notification listener to 'do the right thing' whereas the shared state pattern prevents them from doing the wrong thing. As an immediate fix, this looks fine. As a longer term maintainable approach, more discussion may be needed. |
Happy to remove IAddressable from the interface. Never really liked it.
"This interface inherit from IAddressable for threading-safe concern"
It inherits from IAddressable so it can be implemented by a system target and the notification can be made as a grain call. The right and proper way to talk between components inside Orleans is via grain (system target) calls, except for a small number of limited cases where we break isolation. This is not such a case, so I don't see a reason to avoid grain calls for cross-component communication and insist on breaking isolation (I count "invoking a function directly, even if we then enqueue a Task" as breaking isolation).
This relies on implementers to know this is necessary. Not obvious to maintainers or those writing new listeners.
Agreed, it's not great. Sergey and I settled on this as a 'good enough' fix for a patch release, but it's definitely not ideal.
Task returned by this.ScheduleTask() is not awaited or Ignore'ed. I think it's best to wrap it in a SafeExecute(), so that potential exceptions are handled and logged. Alternatively, we could add another extension that would do that and return void instead of a Task.
Good point, @sergeybykov. I updated to use Ignore() and SafeExecute().
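For reference, the pattern discussed above looks roughly like this. This is a hypothetical listener, not code from the PR; it assumes the `.ScheduleTask(...)` extension this PR adds to `SystemTarget`, and Orleans' `Ignore()` task extension for observing otherwise-unawaited tasks:

```csharp
// Hypothetical ISiloStatusListener implemented as a SystemTarget.
// (SystemTarget constructor arguments elided for brevity.)
internal class SampleStatusListener : SystemTarget, ISiloStatusListener
{
    // May be invoked from an arbitrary thread by the notification producer.
    public void SiloStatusChangeNotification(SiloAddress updatedSilo, SiloStatus status)
    {
        // ScheduleTask queues the callback onto this SystemTarget's
        // single-threaded scheduling context; Ignore() marks the returned
        // Task as intentionally unawaited so faults are still observed
        // and logged rather than silently lost.
        this.ScheduleTask(() => this.OnSiloStatusChange(updatedSilo, status)).Ignore();
    }

    private Task OnSiloStatusChange(SiloAddress updatedSilo, SiloStatus status)
    {
        // Safe to make grain calls and touch single-threaded state here.
        return Task.CompletedTask;
    }
}
```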
While I understand why the above code is outside the task, it would be more maintainable if all the logic was in another function which this call scheduled, like in other instances in this PR. Especially the timer dispose, as that is state and the dispose call may not be thread safe. Also having the dispose outside the task does not save us much since if the code has gotten that far, the task will still be created, unlike the earlier fast exit.
|
LGTM |
|
Please hold off on merging, I might have comments. |
force-pushed d193412 to 81773b5
|
Sure thing, @gabikliot |
|
I am sorry, but I am not following. I don't understand the problem in #3273. That bug refers to another bug #3256. The main concern I have is about ordering of silo status change notifications. There was a reason to make them on the caller context - to guarantee ordering. This PR changes multi silo, which I don't know much about, but also queue balancer. |
|
We believe there are several issues here. No doubt it's confusing. The root cause of #3285, we believe, is the bug of using a regular, albeit safe, timer instead of registering a grain (system target) timer - 98e3910#diff-9bd4c404a42cdafba3be5d76bf5e71fdR64. Because the timer fires on a thread pool thread, that obviously violates the single-threading guarantee and makes the callback run with no scheduling context. We believe we see the same thing happening with

The next level of realization was that even if the caller runs as a system target, e.g.

#3256 is attempting to patch the immediate issues without a significant refactoring of the code, so that the limited fix could be included in 1.5.1. At the same time, @ReubenBond is entertaining a more general solution to help ensure calls to and between system targets are enqueued on correct scheduling contexts. That change would be part of the 2.0 release.
|
Ok, that is more or less what I thought. But re:
That is a bit more complicated. In some places, this was done by design. Fix the bugs, then let's open a new and clear issue about why MBR calls other system targets on its own context and not theirs, and let's discuss it there. We may decide in the end that this is wrong, or maybe that it is not. There was thinking and reasoning behind it.
That's roughly the plan. Although, @ReubenBond is looking beyond membership, for a general safe invocation option (a proxy or something). Then we can see where it makes sense to use it and where we are fine violating the safety for well understood and documented reasons. |
|
Sounds good. The general safe invocation option is to use system target grain interface. We do it in most cases already. No need for any new mechanism. |
|
I've updated the PR to address the comments. Thanks, @gabikliot. Having the

I wonder if we could extract this 'dead silo activation cleanup' code into a separate class, since Catalog is so huge and the method only needs to call
|
@ReubenBond, can we stick to the plan above?
I thought that is already what Sergey wrote and we are in agreement that 1+2 are unrelated to 3 and we should first discuss 3 before jumping into 4. |
|
What does that mean in concrete terms, @gabikliot? What I'm saying is that in the most recent update I've partially undone the original change, but removed the

The issue with

What's the bug in Service Fabric integration? I don't consider this to be a bug in SF integration, but an existing issue which is only exposed by SF integration. That's why the code changes are in the consumers of notifications and not the producer (SF integration).
|
It's not obvious that there should be ordering in |
|
I am not understanding what problem this PR addresses. |
|
@dotnet-bot test netstandard-win-functional |
|
@gabikliot check out the stack trace in #3273, nevermind the text of that issue. This PR prevents that exception from occurring by ensuring that notifications which may result in a grain call are scheduled on an addressable runtime context (a SystemTarget's context). Initially I wanted to use a SystemThread to fix this, but a SystemThread cannot make grain calls in its current state. |
|
Sorry, Reuben. If you are looking for my help, you will have to elaborate more. |
|
No worries, @gabikliot. I'll try to elaborate:
The decision - for the time being - is that consumers of notifications are responsible for ensuring that they handle the notification on the correct scheduling context. Producers are merely responsible for notifying the consumer - with no guarantee of what scheduling context that notification comes from. Is that clearer? If you're familiar with Reactive Extensions, the equivalent is that consumers are responsible for calling |
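As I understand the consumer-responsibility rule described above (analogous to Rx's `ObserveOn`), a plain .NET sketch of it might look like this. The class and method names are illustrative, not from the PR:

```csharp
// Hypothetical consumer of a synchronous notification interface. It captures
// its own TaskScheduler at construction time and marshals every notification
// onto it, so the producer may invoke the callback from any thread with no
// guarantee about the calling context.
internal class ObservingConsumer
{
    private readonly TaskScheduler scheduler = TaskScheduler.Current;

    // May be called from an arbitrary (e.g. thread pool or OS) thread.
    public void OnNotification(string payload)
    {
        Task.Factory.StartNew(
            () => this.HandleOnMyContext(payload),
            CancellationToken.None,
            TaskCreationOptions.None,
            this.scheduler);
    }

    private void HandleOnMyContext(string payload)
    {
        // Runs on the captured scheduler, where the consumer's
        // single-threaded state is safe to touch.
    }
}
```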
force-pushed d376853 to b64742c
|
Updated PR to reduce the impact of the changes - nothing superfluous is altered now. If we agree that these are the right changes to make to fix the linked issue, please squash & merge |
force-pushed b64742c to a290d7c
|
Re SF: I guess I am missing all the context about it. I thought so far that it was like an MBR oracle, thus yet another runtime component, with access to all the internals. I was not aware it is not. |
|
If SF had access to runtime internals I would have just made it an ST and been done with it, but on principle it does not have access. The intention was to ensure that such components could be built as outside contributions. |
|
Maybe we should have started this whole issue by saying that SF is special, very different from other runtime components. I think what you are saying is that it is an internal runtime component, but with limited access, and you are trying to come up with a model for that: allow others to write components, but without full access. We had a similar problem in the past, with streaming: the pulling agent and so on. We eventually came up with such a model: all the complex runtime stuff is in the pulling agent, and extensions are via the cache, balancer, ...

Specifically, with SF what you could do is: I guess now it calls directly on all listeners. Instead you could pass it a "listener adapter" that will marshal its calls to the right context. The adapter will have full access to the runtime. Or maybe SF is indeed yet another exception in the list above.

But I was under the impression that you are pushing for a general pattern of the callee marshaling to its context and the caller calling from wherever. And that (this new general principle) is what I objected to. If you are only looking to patch the SF case with whatever hack or exception, then go ahead.

What are your thoughts on my 4 cases explanation? Does it make sense, from the general standpoint of a design principle for runtime component communication? Unrelated to the SF issue. |
I mentioned that SF bits don't have access to SystemTarget earlier: #3290 (comment)
What's an internal component? SF lives in the main tree, but it could live in a separate repo (it started in a separate repo). SF does not have
Your impression is correct: I believe that the general pattern should be: if you subscribe to notifications/callbacks on a local (non-IAddressable) interface then you are responsible for marshalling calls onto your own context. You know what you are (poco, grain, ST) and the caller does not and should not. You also don't know what the caller is and you should not. This keeps implementation decisions local rather than having considerations for them throughout the code. It covers every one of your 4 cases above without the caller or callee knowing how the other is implemented.
Point 4 assumes internal access. If it weren't for that I would still be against that pattern for the reason above (caller/callee shouldn't know each other's impl details).
Maybe this is the best approach in the long run. We could create hand-crafted adaptors for each interface, based on the caller's impl (and thereby assume the callee is an oblivious poco as in my current impl). |
|
OK, I was not aware SF is in its own repo without internal access. So how did this all work, with liveness notifications going out on no context at all? Yes, that is wrong.
Yes, agree, IF that is what you do. But I was suggesting to keep balancerListener IAddressable, as now, and then my point 4 stands. Maybe indeed for this case you need to use GrainClient and come up with an abstraction of remotable subscriptions? |
It's in this repo now, but it started life in OrleansContrib. We want to allow for external implementations of various APIs, so we need a pattern which can be applied without internal access.
I mean that the thing you're subscribing to is not IAddressable, eg: balancerListener subscribes to balancer: balancer is not IAddressable.
That can't be done in a patch release with a straight face. There is a strong desire for a 'silo-local client', e.g. for the HTTPS case. GrainReferences can be re-targeted to a different

We could probably also modify
The first version of this PR had something like this:

```csharp
/// <summary>
/// Provides services for executing actions on a single-threaded, addressable context.
/// </summary>
public interface IAddressableContextScheduler
{
    /// <summary>
    /// Schedules the provided <paramref name="action"/> on this context.
    /// </summary>
    /// <param name="action">The action.</param>
    /// <returns>A <see cref="Task"/> which completes when the <paramref name="action"/> has completed.</returns>
    Task Schedule(Action action);

    /// <summary>
    /// Schedules the provided <paramref name="action"/> on this context.
    /// </summary>
    /// <param name="action">The action.</param>
    /// <returns>A <see cref="Task"/> which completes when the <paramref name="action"/> has completed.</returns>
    Task Schedule(Func<Task> action);

    /// <summary>
    /// Schedules the provided <paramref name="action"/> on this context.
    /// </summary>
    /// <param name="action">The action.</param>
    /// <returns>A <see cref="Task"/> which completes when the <paramref name="action"/> has completed.</returns>
    Task<T> Schedule<T>(Func<Task<T>> action);
}

internal class AddressableSchedulingContext : SystemTarget, IAddressableContextScheduler
{
    private readonly OrleansTaskScheduler scheduler;

    public AddressableSchedulingContext(
        OrleansTaskScheduler scheduler,
        ILocalSiloDetails localSiloDetails) : base(GrainId.GetGrainId(UniqueKey.NewKey()), localSiloDetails.SiloAddress)
    {
        this.scheduler = scheduler;
    }

    /// <inheritdoc />
    public Task Schedule(Action action)
    {
        return this.scheduler.RunOrQueueAction(action, this.SchedulingContext);
    }

    /// <inheritdoc />
    public Task Schedule(Func<Task> action)
    {
        return this.scheduler.RunOrQueueTask(action, this.SchedulingContext);
    }

    /// <inheritdoc />
    public Task<T> Schedule<T>(Func<Task<T>> action)
    {
        return this.scheduler.RunOrQueueTask(action, this.SchedulingContext);
    }

    internal static Factory<IAddressableContextScheduler> GetFactory(IServiceProvider sp)
    {
        var siloProviderRuntime = sp.GetRequiredService<SiloProviderRuntime>();
        return GetFactory(sp, siloProviderRuntime);
    }

    internal static Factory<IAddressableContextScheduler> GetFactory(IServiceProvider sp, SiloProviderRuntime siloProviderRuntime)
    {
        var contextFactory = FactoryUtility.Create<AddressableSchedulingContext>(sp);
        return () =>
        {
            var result = contextFactory();
            siloProviderRuntime.RegisterSystemTarget(result);
            return result;
        };
    }
}
```

How do you feel about that approach? The
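To make the prototype above concrete, a consumer might have used it roughly as follows. This is hypothetical usage, not code from the PR; it assumes Orleans' `Factory<T>` delegate type and `Ignore()` task extension:

```csharp
// Hypothetical listener that stays a plain class but borrows an addressable
// context from the prototype scheduler, so its handler can make grain calls.
internal class SampleRingRangeListener : IRingRangeListener
{
    private readonly IAddressableContextScheduler scheduler;

    public SampleRingRangeListener(Factory<IAddressableContextScheduler> schedulerFactory)
    {
        // Each factory invocation creates and registers a fresh
        // SystemTarget-backed scheduling context.
        this.scheduler = schedulerFactory();
    }

    // May be invoked from any context by the producer.
    public void RangeChangeNotification(IRingRange old, IRingRange now, bool increased)
    {
        // Hop onto the addressable context; Ignore() keeps the unawaited
        // Task's faults observable.
        this.scheduler.Schedule(() => this.OnRangeChange(old, now, increased)).Ignore();
    }

    private Task OnRangeChange(IRingRange old, IRingRange now, bool increased)
    {
        // Grain calls are legal here because we run on an addressable context.
        return Task.CompletedTask;
    }
}
```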
That's why I've been advocating separating a patch-like fix from the bigger refactoring. Is there a reason the current state of the PR cannot be the patch? It fixes the two cases that have been reported, and is the last fix we are holding 1.5.1 for. |
|
I also advocated for patch only. The major refactoring is in one line:

It makes my case 4 from the comment above non-existent: it will mean that ANY MBR oracle (including our standard one with the MBR table) will not call the queue balancer listener as a grain call, but rather directly, thus violating the basic principle of isolation between STs in Orleans. This is done to help with patching the SF oracle, I get it, but we are also throwing the baby out with the bathwater.

=== ADDED:
|
It doesn't matter to me how this issue is fixed in the short term, as long as the fix follows sound reasoning and does not introduce too much entropy for a patch release. I do not consider this an issue with the SF membership provider; it's an issue with consumers of synchronous notification interfaces. How would we fix this issue by only making changes to the SF provider? If we can't do that cleanly, then maybe the issue actually lies elsewhere (which is what I'm arguing).

Calling a local

That previously

Let's say that interface does remain

EDIT:
|
|
LGTM for the changes. Looks like you went with the provider runtime. LGTM. ISiloStatusListener is a bit different from any others, since there was a need to send notifications directly as sync calls (neither grain call nor direct call plus queue will work). Sounds like if we want to take it further, we would need to talk over Skype to discuss it in person. |
|
Actually, how does this later commit address the original issue? How would the SF oracle send notifications on the scheduling context? It still wouldn't now, right? |
|
I'm confused. How does the latest version solve the issue of marshalling SF notifications that originate on external threads to the subscribers? Just running

I can see the fix in the aa35113 version, but am struggling with this one.
|
Let's Skype call so we can discuss it, Gabi. I should have added a comment when I pushed that commit. It works because of a change made back in ~June, where FabricMembershipOracle steals TaskScheduler.Current in its Start method and uses it to spin up a task which awaits notifications in ProcessNotifications. The OnUpdate method, which is called by the OS thread from SF, pushes notifications onto a queue and pulses the AutoResetEvent which the former is awaiting. ProcessNotifications performs the actual SiloStatusChangeNotification call to subscribers, so it is called on that captured scheduler. It works; I tested.

Honestly, though, I feel it's a dodgy fix and I definitely prefer the previous revision. I'm uncomfortable with it - it's a hack. I spoke with Sergey this morning, since he's going to be out for the next fortnight. He also prefers the previous revision. Let me know when you've got time so we can speak.
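As I read the mechanism described above, it boils down to the following shape. This is a simplified sketch, not the actual FabricMembershipOracle code; the names are illustrative, and it substitutes an awaitable SemaphoreSlim for the AutoResetEvent mentioned above:

```csharp
// Sketch of the producer-side fix: capture the caller's TaskScheduler at
// Start() and pump all notifications onto it.
internal class NotificationPump<T>
{
    private readonly ConcurrentQueue<T> queue = new ConcurrentQueue<T>();
    private readonly SemaphoreSlim signal = new SemaphoreSlim(0);
    private readonly Action<T> notifySubscribers;

    public NotificationPump(Action<T> notifySubscribers)
    {
        this.notifySubscribers = notifySubscribers;
    }

    // Called on the oracle's scheduling context; captures TaskScheduler.Current
    // so ProcessNotifications (and its await continuations) run on it.
    public void Start()
    {
        Task.Factory.StartNew(
            this.ProcessNotifications,
            CancellationToken.None,
            TaskCreationOptions.None,
            TaskScheduler.Current).Unwrap();
    }

    // Called by an external (e.g. Service Fabric OS) thread.
    public void OnUpdate(T notification)
    {
        this.queue.Enqueue(notification);
        this.signal.Release();
    }

    private async Task ProcessNotifications()
    {
        while (true)
        {
            await this.signal.WaitAsync();
            while (this.queue.TryDequeue(out var item))
            {
                // Runs on the captured scheduler, so subscribers are
                // notified on the producer's own context.
                this.notifySubscribers(item);
            }
        }
    }
}
```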
|
@dotnet-bot test netfx-bvt |
|
I'm testing it using this branch: https://github.com/ReubenBond/orleans/tree/upgrade-sd-sample |
force-pushed 88ed713 to 091b8c1
|
Gabi and I had a discussion about it and we're on the same page regarding the principles and ideals. In the end, I feel we should essentially take both the 'new' and 'old' fix.
Verified fix locally using the branch linked above (I build

Assuming tests pass, could someone please give it a final review and merge?
|
Great work @ReubenBond ! |
* Ensure that ISiloStatusListener handlers correctly schedule handler code
* Services without a SchedulingContext are executed on the ProviderManager context
Fixes #3273
Implementations of ISiloStatusListener and IRingRangeListener must correctly marshal calls onto their own execution context in order to preserve correct scheduling semantics.

* Add .ScheduleTask(...) extension methods to SystemTarget so that implementations can easily schedule these callbacks.
* Add IAddressableContextScheduler, which provides methods for executing actions on an addressable scheduling context (so that grain calls can be made). It can be injected via Factory<IAddressableContextScheduler>.
* Remove IAddressable from IStreamQueueBalanceListener and have PersistentStreamPullingManager subscribe to ring range change notifications directly (instead of via a reference, .AsReference<IStreamQueueBalanceListener>()).