
Fix scheduling of notification interfaces#3290

Merged
gabikliot merged 2 commits into dotnet:master from ReubenBond:fix-isilostatuslistener-scheduling on Aug 24, 2017

Conversation

@ReubenBond
Member

@ReubenBond ReubenBond commented Aug 9, 2017

Fixes #3273

Implementations of ISiloStatusListener and IRingRangeListener must correctly marshal calls onto their own execution context in order to preserve correct scheduling semantics.

  • Add .ScheduleTask(...) extension methods to SystemTarget so that implementations can easily schedule these callbacks.
  • Add IAddressableContextScheduler which provides methods for executing actions on an addressable scheduling context (so that grain calls can be made). It can be injected via Factory<IAddressableContextScheduler>.
  • Remove IAddressable from IStreamQueueBalanceListener and have PersistentStreamPullingManager subscribe to ring range change notifications directly (instead of via a ref, .AsReference<IStreamQueueBalanceListener>())
  • Verified locally using Service Fabric Sample (modified to include streams)
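The discipline behind these changes is language-agnostic: a notification callback may arrive on any thread, so the listener itself must hop onto its own single-threaded context before touching state or making further calls. Below is a minimal sketch of that idea in Python asyncio, purely for illustration — `SingleThreadedListener` and its method names are hypothetical, not Orleans APIs; the actual fix uses the `ScheduleTask(...)` extension in C#:

```python
import asyncio
import threading

class SingleThreadedListener:
    """Stands in for an Orleans SystemTarget: all real work must run on
    its own single-threaded context (here, one asyncio event loop)."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.events = []
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    def silo_status_change_notification(self, silo, status):
        # Called by the producer on an arbitrary thread. Do nothing here
        # except marshal the actual handling onto our own context.
        self.loop.call_soon_threadsafe(self._handle, silo, status)

    def _handle(self, silo, status):
        # Runs on self.loop only, so no extra locking is needed and any
        # further calls are made from the right context.
        self.events.append((silo, status))
```

The entry point stays trivial on purpose: the caller makes no scheduling promises, and the callee alone decides where its work runs.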

@ReubenBond ReubenBond force-pushed the fix-isilostatuslistener-scheduling branch from 3b2e8f9 to 85ad71c Compare August 10, 2017 00:04
@ReubenBond
Member Author

ReubenBond commented Aug 10, 2017

I'd rather try to implement this without IAddressableContextScheduler, since I dislike it. I think that's achievable by removing IAddressable from IStreamQueueBalanceListener and therefore not casting PersistentStreamPullingManager when it subscribes to ring change notifications.

EDIT: @jason-bragg @xiazen does that sound like a fine approach?

@ReubenBond ReubenBond force-pushed the fix-isilostatuslistener-scheduling branch from 85ad71c to df4f3e1 Compare August 10, 2017 03:02
@ReubenBond
Member Author

@xiazen @jason-bragg please look at the updated PR. I removed the abstraction I had previously introduced, but more importantly: I changed IStreamQueueBalanceListener as mentioned in the above comment.

@jason-bragg
Contributor

Looks technically correct, though I'd prefer we move to the shared state pattern (Shared state publisher prototype #3291), as the pattern in this PR relies on the implementer of a notification listener to 'do the right thing' whereas the shared state pattern prevents them from doing the wrong thing.

As an immediate fix, this looks fine. As a longer term maintainable approach, more discussion may be needed.

Contributor

Happy to remove IAddressable from the interface. Never really liked it.

Contributor

"This interface inherit from IAddressable for threading-safe concern"

It inherits from IAddressable so it can be implemented by a system target and the notification can be made as a grain call. The right and proper way to talk between components inside Orleans is via grain (system target) calls, except for a small number of limited cases where we break isolation. This is not such a case, so I don't see a reason to avoid using grain calls for cross-component communication and to insist on breaking isolation (I count "invoking a function directly, even if we then enqueue a Task directly" as breaking isolation).

Comment thread src/OrleansRuntime/Catalog/Catalog.cs Outdated
Contributor

This relies on implementers to know this is necessary. Not obvious to maintainers or those writing new listeners.

Member Author

Agreed, it's not great. Sergey and I settled on this as a 'good enough' fix for a patch release, but it's definitely not ideal.

Contributor

Task returned by this.ScheduleTask() is not awaited or Ignore'ed. I think it's best to wrap it in a SafeExecute(), so that potential exceptions are handled and logged. Alternatively, we could add another extension that would do that and return void instead of a Task.
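The hazard described here is generic: a fire-and-forget task that nobody awaits silently drops its exceptions. A hypothetical `safe_execute` helper, sketched in Python rather than the actual Orleans SafeExecute, that logs a failed callback instead of losing it:

```python
import asyncio
import logging

def safe_execute(coro_factory, log=logging.getLogger("scheduler")):
    """Fire-and-forget wrapper: run the coroutine and log any exception,
    so failures are neither silently swallowed nor left unobserved."""
    async def runner():
        try:
            await coro_factory()
        except Exception:
            log.exception("scheduled callback failed")
    return asyncio.ensure_future(runner())
```

The wrapper returns the wrapping task, so a caller can still await it, but the exception is handled either way.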

Member Author

Good point, @sergeybykov. I updated to use Ignore() and SafeExecute().

Contributor

While I understand why the above code is outside the task, it would be more maintainable if all of the logic lived in another function which this call schedules, as in other instances in this PR. Especially the timer dispose, as that is state and the dispose call may not be thread safe. Also, having the dispose outside the task does not save us much, since if the code has gotten that far, the task will still be created, unlike the earlier fast exit.

Member Author

Sounds good to me

Comment thread src/OrleansRuntime/Catalog/Catalog.cs Outdated
Contributor

Task returned by this.ScheduleTask() is not awaited or Ignore'ed. I think it's best to wrap it in a SafeExecute(), so that potential exceptions are handled and logged. Alternatively, we could add another extension that would do that and return void instead of a Task.

@xiazen
Contributor

xiazen commented Aug 10, 2017

LGTM

@gabikliot
Contributor

Please hold off on merging, I might have comments.

@ReubenBond ReubenBond force-pushed the fix-isilostatuslistener-scheduling branch from d193412 to 81773b5 Compare August 10, 2017 23:38
@ReubenBond
Member Author

Sure thing, @gabikliot

@gabikliot
Contributor

I am sorry, but I am not following.

I don't understand the problem in #3273. That bug refers to another bug, #3256.
All are about scheduling context, I see that, but how are they all related?
How does #3273 happen only during network partitions? How do we escape the context during partitions only?
How are the TypeManager timers in #3256 related?
And how does this PR address the other ones?

The main concern I have is about the ordering of silo status change notifications. There was a reason to make them on the caller's context - to guarantee ordering. This PR changes multi silo, which I don't know much about, but also the queue balancer.

@sergeybykov
Contributor

We believe there are several issues here. No doubt it's confusing.

The root cause of #3285 we believe is the bug of using a regular, albeit safe, timer instead of registering a grain (system target) timer - 98e3910#diff-9bd4c404a42cdafba3be5d76bf5e71fdR64.

Because the timer fires on a thread pool thread, that obviously violates the single-threading guarantee and makes the callback run with no scheduling context.

We believe we see the same thing happening with FabricMembershipOracle that, unlike MembershipOracle, is not a system target. SF invokes it with notifications on its own thread, and that causes the same issue of multi-threading and no scheduling context. I don't remember if it's #3273 or another one where SF is used for hosting.

The next level of realization was that even if the caller runs as a system target, e.g. MembershipOracle, with a scheduling context and all, in some places we make direct calls to other system targets, which violates their single-threading guarantee and causes their grain calls to be made in the wrong (caller's) context.

#3256 is attempting to patch the immediate issues without a significant refactoring of the code, so that the limited fix could be included in 1.5.1. At the same time, @ReubenBond is entertaining a more general solution to help ensure calls to and between system targets are enqueued on correct scheduling contexts. That change would be part of 2.0 release.

@gabikliot
Contributor

Ok, that is more or less what I thought.
The 1st and 2nd issues - timer and SF - sure, fix them as you said. Those are just bugs. I would not call it "patch the immediate issues", but rather: fix bugs.

But re:

in some places we make direct calls to other system targets, which violates their single-threading guarantee and causes their grain calls to be made in the wrong (caller's) context.

That is a bit more complicated. In some places, this was done by design.
So let's not jump to realizations too quickly.

Fix the bugs, then let's open a new and clear issue about why MBR calls other system targets on its own context and not their context, and let's discuss it there. We may decide in the end that this is wrong, or maybe that it is not. There was thinking and reasoning behind it.

@sergeybykov
Contributor

Fix the bugs, then let's open a new and clear issue about why MBR calls other system targets on its own context and not their context, and let's discuss it there.

That's roughly the plan, although @ReubenBond is looking beyond membership, for a general safe invocation option (a proxy or something). Then we can see where it makes sense to use it and where we are fine violating the safety for well-understood and documented reasons.

@gabikliot
Contributor

Sounds good. The general safe invocation option is to use system target grain interface. We do it in most cases already. No need for any new mechanism.

@ReubenBond
Member Author

ReubenBond commented Aug 14, 2017

I've updated the PR to address the comments. Thanks, @gabikliot. Having the Catalog implement ISiloStatusListener was confusing since it never subscribed to the ISiloStatusOracle, but was instead directly called via LocalGrainDirectory. I reworked the flow used to register the Catalog callback with the LocalGrainDirectory so that it registers itself in its constructor instead of it being registered via a property setter in Silo.InjectDependencies().

I wonder if we could extract this 'dead silo activation cleanup' code into a separate class, since Catalog is so huge and the method only needs to call Catalog.DeactivateActivation once, outside of the lock on ActivationDirectory. ActivationDirectory itself is threadsafe for enumeration and is an injected singleton already shared between multiple classes.

@gabikliot
Contributor

@ReubenBond, can we stick to the plan above?

  1. fix all bugs in the timer in the type manager
  2. fix all bugs in the Service Fabric integration
  3. open a new issue describing the problem we are finding with system targets calling other system targets directly: where and why is this a problem.
  4. only after we discuss 3, let's implement (or not) what is decided.

It looks like you guys are confusing 1+2 with 3 (at least this PR points to 1 as the problem it is trying to solve), which is totally unrelated, and jumping into 4.
It's possible that you already discussed it all internally and for you it is not "jumping", but it is for me.

I thought that is already what Sergey wrote and we are in agreement that 1+2 are unrelated to 3 and we should first discuss 3 before jumping into 4.

@ReubenBond
Member Author

ReubenBond commented Aug 14, 2017

What does that mean in concrete terms, @gabikliot? What I've done in the most recent update is partially undo the original change, while still removing ISiloStatusListener from Catalog and (for cleanliness) changing who is responsible for subscribing Catalog to LocalGrainDirectory. This is just to avoid future confusion.

The issue with TypeManager is addressed in PR #3256. They're the same category of issue (scheduling in system-level components), but there isn't any crossover between the fixes (there doesn't need to be).

What's the bug in Service Fabric integration? I don't consider this to be a bug in SF integration, but an existing issue which is only exposed by SF integration. That's why the code changes are in the consumers of notifications and not in the producer (SF integration).

@ReubenBond
Member Author

ReubenBond commented Aug 14, 2017

It's not obvious that there should be ordering in SiloStatusChangeNotification calls or that those calls must ensure that they fully handle the notification before returning to the caller. With this subtle change, those subtle requirements are gone.

@gabikliot
Contributor

I am not understanding what problem this PR addresses.

@sergeybykov
Contributor

@dotnet-bot test netstandard-win-functional

@ReubenBond
Member Author

@gabikliot check out the stack trace in #3273, nevermind the text of that issue. This PR prevents that exception from occurring by ensuring that notifications which may result in a grain call are scheduled on an addressable runtime context (a SystemTarget's context).

Initially I wanted to use a SystemThread to fix this, but a SystemThread cannot make grain calls in its current state.

@gabikliot
Contributor

Sorry, Reuben. If you are looking for my help, you will have to elaborate more.
I simply don't see how this is related, or how your fix is doing the right thing for the queue balancer. It used to make a grain call; now it calls directly and then queues. Looks like the wrong pattern to me.

@ReubenBond
Member Author

ReubenBond commented Aug 15, 2017

No worries, @gabikliot. I'll try to elaborate:

  • Previously, the IStreamQueueBalanceListener was IAddressable for the purposes of having calls scheduled on its context - for single-threadedness - it was never a remote call. This is a divergence from the pattern already in use in the codebase when locally calling SystemTargets. That pattern is making a direct call, scheduled using OrleansTaskScheduler, on the SystemTarget's ISchedulingContext. This call via a GrainReference to the SystemTarget required that the call was made from an addressable context already - the fabric oracle does not have an addressable context.
  • PersistentStreamPullingManager makes remote calls to agents when the cluster changes - so notifications must be executed on an addressable context.
  • The change makes the existing pattern (scheduling calls on the SystemTarget's context) simpler to use by adding an extension method to SystemTarget called ScheduleTask.
  • The change implements the existing patterns in places where it was previously not being used << this is the bug fix.

The decision - for the time being - is that consumers of notifications are responsible for ensuring that they handle the notification on the correct scheduling context. Producers are merely responsible for notifying the consumer - with no guarantee of what scheduling context that notification comes from.

Is that clearer?

If you're familiar with Reactive Extensions, the equivalent is that consumers are responsible for calling .ObserveOn(consumerScheduler) and the producer is responsible for ensuring that it executes on its own scheduler (via .SubscribeOn(producerScheduler)). Maybe the correct pattern would be to use Rx for notifications, then the contract would be clear.
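The ObserveOn contract can be made concrete with a toy example — sketched in Python with hypothetical names rather than a real Rx library. The producer calls subscribers on whatever thread it happens to be on; a consumer that cares about context wraps its callback so the work is marshalled onto its own single-threaded executor:

```python
from concurrent.futures import ThreadPoolExecutor

class Subject:
    """Minimal producer: invokes subscribers directly, making no
    promises about which thread or context the call arrives on."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, fn):
        self.subscribers.append(fn)

    def on_next(self, value):
        for fn in self.subscribers:
            fn(value)

def observe_on(executor):
    """Analogue of Rx ObserveOn: returns a decorator that re-dispatches
    the callback onto `executor`, the consumer's own context."""
    def wrap(fn):
        return lambda value: executor.submit(fn, value)
    return wrap
```

A single-worker executor preserves ordering, which mirrors the single-threaded scheduling context of a SystemTarget.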

@ReubenBond ReubenBond force-pushed the fix-isilostatuslistener-scheduling branch 2 times, most recently from d376853 to b64742c Compare August 17, 2017 00:34
@ReubenBond
Member Author

ReubenBond commented Aug 17, 2017

Updated PR to reduce the impact of the changes - nothing superfluous is altered now.

If we agree that these are the right changes to make to fix the linked issue, please squash & merge

@ReubenBond ReubenBond force-pushed the fix-isilostatuslistener-scheduling branch from b64742c to a290d7c Compare August 17, 2017 00:39
@gabikliot
Contributor

Re SF: I guess I am missing all the context about it. I thought so far that it was like the MBR oracle, thus yet another runtime component with access to all the internals. I was not aware that it is not.

@ReubenBond
Member Author

ReubenBond commented Aug 19, 2017

If SF had access to runtime internals, I would have just made it an ST and been done with it, but on principle it does not have access. The intention was to ensure that such components could be written as outside contributions.

@gabikliot
Contributor

Maybe we should have started this whole issue by saying that SF is special, very different from other runtime components. I think what you are saying is that it is an internal runtime component, but with limited access, and you are trying to come up with a model for that: allow others to write components, but without full access.

We had a similar problem in the past, with streaming: the pulling agent, etc... We indeed finally came up with such a model: all the complex runtime stuff is in the pulling agent, and extensions are via the cache, balancer, ...
I think the jury is still out on the question of whether this was a good or bad model. I think @jason-bragg actually did not support it much. Now we have quite complex interactions between all of those. Perhaps, perhaps, it would have been better to give them the full power of ST.
At this point, it's up to you guys to decide how to make progress on those questions.

Specifically, with SF, what you could do is: I guess right now it calls directly on all listeners. Instead you could pass it a "listener adaptor" that will marshal its calls to the right context. The adaptor would have full access to the runtime.

Or maybe indeed SF is yet another exception in the list above.

But I was under the impression that you are pushing for a general pattern of the callee marshaling to its context and the caller calling on whatever. And that (this new general principle) is what I objected to. If you are only looking to patch the SF case with whatever hack or exception, then go ahead.

What are your thoughts on my 4-cases explanation? Does it make sense, from the general standpoint of design principles for runtime component communication? Unrelated to the SF issue.

@ReubenBond
Member Author

Maybe we should have started this whole issue by saying that SF is special, very different from other runtime components.

I mentioned that SF bits don't have access to SystemTarget earlier: #3290 (comment)

I think what you are saying is that it is an internal runtime component, but with limited access, and you are trying to come up with a model for that: allow others to write components, but without full access.

What's an internal component? SF lives in the main tree, but it could live in a separate repo (it started in a separate repo). SF does not have [InternalsVisibleTo] access.

But I was under the impression that you are pushing for a general pattern of the callee marshaling to its context and the caller calling on whatever. And that (this new general principle) is what I objected to.

Your impression is correct: I believe that the general pattern should be: if you subscribe to notifications/callbacks on a local (non-IAddressable) interface then you are responsible for marshalling calls onto your own context. You know what you are (poco, grain, ST) and the caller does not and should not. You also don't know what the caller is and you should not. This keeps implementation decisions local rather than having considerations for them throughout the code. It covers every one of your 4 cases above without the caller or callee knowing how the other is implemented.

What are your thoughts on my 4-cases explanation? Does it make sense, from the general standpoint of design principles for runtime component communication? Unrelated to the SF issue.

Point 4 assumes internal access. If it weren't for that I would still be against that pattern for the reason above (caller/callee shouldn't know each other's impl details).

Instead you could pass it a "listener adaptor" that will marshal its calls to the right context. The adaptor would have full access to the runtime.

Maybe this is the best approach in the long run. We could create hand-crafted adaptors for each interface, based on the caller's impl (and thereby assume the callee is an oblivious poco as in my current impl).

@gabikliot
Contributor

OK, I was not aware SF is in its own repo without Internal access.
Hmm....

So how did this all work, with liveness notifications not going on any context? Yes, that is wrong.

if you subscribe to notifications/callbacks on a local (non-IAddressable) interface then you are responsible for marshalling calls onto your own context

Yes, agree, IF that is what you do. But I was suggesting to keep balancerListener IAddressable, as now, and then my point 4 stands.

Maybe indeed for this case you need to use GrainClient and come up with an abstraction of remotable subscriptions?
Or use the adaptor idea; the adaptor can be general and public.

@ReubenBond
Member Author

OK, I was not aware SF is in its own repo without Internal access.

It's in this repo now, but it started life in OrleansContrib. We want to allow for external implementations of various APIs, so we need a pattern which can be applied without internal access.

if you subscribe to notifications/callbacks on a local (non-IAddressable) interface then you are responsible for marshalling calls onto your own context

Yes, agree, IF that is what you do. But I was suggesting to keep balancerListener IAddressable, as now, and then my point 4 stands.

I mean that the thing you're subscribing to is not IAddressable, eg: balancerListener subscribes to balancer: balancer is not IAddressable.

Maybe indeed for this case you need to use GrainClient and come up with an abstraction of remotable subscriptions?

That can't be done in a patch release with a straight face. There is a strong desire for a 'silo-local client', eg for HTTPS case. GrainReferences can be re-targeted to a different IRuntimeClient using the IGrainFactory.BindGrainReference(ref) method - I was wrong that we would need to round-trip through serialization. So if we did have a silo-local client, then that could be injected into the interface implementation (eg, FabricMembershipOracle) and incoming subscriptions could be re-bound to it if they are IAddressable. This has a flaw, though, because what happens if the subscriber is not IAddressable but it proxies calls to another subscriber which is? This is the case of IStreamQueueBalancer, which is a POCO that IStreamQueueBalanceListener (IAddressable) subscribes to, and it subscribes to ISiloStatusOracle. So in that case do IStreamQueueBalancer impls need to inject IClusterClient so that it can rebind the IStreamQueueBalancerListener instances which subscribe to it?

We could probably also modify InsideRuntimeClient to support context-free calls like OutsideRuntimeClient does; then we don't need rebinding or context marshalling, but you have even fewer guarantees about the scheduling of notifications, and you lose the safety of the exception message catching unintended calls.

or use the adaptor idea. adaptor can be general and public.

The first version of this PR had something like this:

/// <summary>
/// Provides services for executing actions on a single-threaded, addressable context.
/// </summary>
public interface IAddressableContextScheduler
{
    /// <summary>
    /// Schedules the provided <paramref name="action"/> on this context.
    /// </summary>
    /// <param name="action">The action.</param>
    /// <returns>A <see cref="Task"/> which completes when the <paramref name="action"/> has completed.</returns>
    Task Schedule(Action action);

    /// <summary>
    /// Schedules the provided <paramref name="action"/> on this context.
    /// </summary>
    /// <param name="action">The action.</param>
    /// <returns>A <see cref="Task"/> which completes when the <paramref name="action"/> has completed.</returns>
    Task Schedule(Func<Task> action);

    /// <summary>
    /// Schedules the provided <paramref name="action"/> on this context.
    /// </summary>
    /// <param name="action">The action.</param>
    /// <returns>A <see cref="Task"/> which completes when the <paramref name="action"/> has completed.</returns>
    Task<T> Schedule<T>(Func<Task<T>> action);
}

internal class AddressableSchedulingContext : SystemTarget, IAddressableContextScheduler
{
    private readonly OrleansTaskScheduler scheduler;

    /// <inheritdoc />
    public Task Schedule(Action action)
    {
        return this.scheduler.RunOrQueueAction(action, this.SchedulingContext);
    }

    /// <inheritdoc />
    public Task Schedule(Func<Task> action)
    {
        return this.scheduler.RunOrQueueTask(action, this.SchedulingContext);
    }

    /// <inheritdoc />
    public Task<T> Schedule<T>(Func<Task<T>> action)
    {
        return this.scheduler.RunOrQueueTask(action, this.SchedulingContext);
    }

    internal static Factory<IAddressableContextScheduler> GetFactory(IServiceProvider sp)
    {
        var siloProviderRuntime = sp.GetRequiredService<SiloProviderRuntime>();
        return GetFactory(sp, siloProviderRuntime);
    }

    internal static Factory<IAddressableContextScheduler> GetFactory(IServiceProvider sp, SiloProviderRuntime siloProviderRuntime)
    {
        var contextFactory = FactoryUtility.Create<AddressableSchedulingContext>(sp);
        return () =>
        {
            var result = contextFactory();
            siloProviderRuntime.RegisterSystemTarget(result);
            return result;
        };
    }

    public AddressableSchedulingContext(
        OrleansTaskScheduler scheduler,
        ILocalSiloDetails localSiloDetails) : base(GrainId.GetGrainId(UniqueKey.NewKey()), localSiloDetails.SiloAddress)
    {
        this.scheduler = scheduler;
    }
}

How do you feel about that approach? The FabricMembershipOracle would receive a Factory<IAddressableContextScheduler> schedulingContextFactory in its constructor and create a context there which it can use to marshal notifications to its subscribers.

@sergeybykov
Contributor

That can't be done in a patch release with a straight face.

That's why I've been advocating separating a patch-like fix from the bigger refactoring. Is there a reason the current state of the PR cannot be the patch? It fixes the two cases that've been reported, and is the last fix we are holding 1.5.1 for.

@gabikliot
Contributor

gabikliot commented Aug 23, 2017

I also advocated for patch only.
BUT: I see this PR not as a patch but rather as a major refactoring.

The major refactoring is in one line:
removing IAddressable from IStreamQueueBalanceListener :IAddressable

It makes my case 4 from the comment above nonexistent: it now means that ANY MBR oracle (including our standard one with the MBR table) will not call the Queue Balancer Listener via a grain call, but rather directly, thus violating the basic principle of isolation between STs in Orleans.

This is done to help with patching the SF oracle, I get it, but we are also throwing the baby out with the bathwater.

===

ADDED:
I mean, there could be 10 ways to patch/fix the SF issue (use the provider runtime, write an adapter layer, give it internal access to the runtime, similar to Silo) without fundamentally changing how existing components work. Why are we insisting on changing them?

@ReubenBond
Member Author

ReubenBond commented Aug 23, 2017

It doesn't matter to me how this issue is fixed in the short term as long as the fix follows sound reasoning and does not introduce too much entropy for a patch release.

I do not consider this an issue with the SF membership provider; it's an issue with consumers of synchronous notification interfaces. How would we fix this issue by making changes only to the SF provider? If we can't do that cleanly, then maybe the issue actually lies elsewhere (which is what I'm arguing).

Calling a local SystemTarget via a GrainReference is the exception, not the norm. I provided the data here. You said you doubted it. Please search the code and see - maybe I'm wrong, show me how.

That previously-IAddressable interface is subscribing to something which might not have an addressable context. How do you propose it should work? I already showed how the implied strategy from the 4 points is not feasible without larger changes in this comment and suggested a more isolated solution in this comment. Maybe you have an opinion on that?

Let's say that interface does remain IAddressable. In that case, what about ISiloStatusListener? That currently is not IAddressable. Should it be? If it was, how about the interaction between LocalGrainDirectory and Catalog: are we breaking isolation when LocalGrainDirectory calls directly into Catalog using its own context? Catalog also has a scheduling context, it's also a SystemTarget. Should LocalGrainDirectory call Catalog via a GrainReference?

EDIT:

I mean, there could be 10 ways to patch/fix the SF issue (use the provider runtime, write an adapter layer, give it internal access to the runtime, similar to Silo) without fundamentally changing how existing components work. Why are we insisting on changing them?

  • Write adapter layer
    • I proposed something like this above. If that kind of thing is insufficient, then maybe you can post pseudocode for what you believe this would look like.
  • Use provider runtime
    • Could you elaborate? You mean that we share the provider runtime's scheduling context with the fabric oracle and capture it? That could work. I'm not sure if it's the right solution, but I can implement that instead of this.
  • Give it internal access to the runtime, similar to Silo
    • The intention is to allow outside developers to provide implementations of these interfaces. If we're forced to use internals then clearly that's not a realized goal.
  • Why are we insisting on changing them?
    • I'm not insisting on changing things. This PR makes things more uniform without changing much (which is why the code changes are minimal; it changes things along the path of least resistance).

@gabikliot
Contributor

LGTM for the changes.

Looks like you went with the provider runtime. LGTM.

ISiloStatusListener is a bit different from any of the others, since there was a need to send notifications directly as sync calls (neither a grain call nor a direct call plus queue will work).
But for others I thought we should keep using grain calls.

Sounds like if we want to take it further, we would need to talk over Skype to discuss it in person.

@gabikliot
Contributor

Actually, how does this later commit address the original issue? How would the SF oracle send notifications on the scheduling context? It still wouldn't now, right?

@sergeybykov
Contributor

I'm confused. How does the latest version solve the issue of marshalling SF notifications that originate on external threads to the subscribers? Just running Start or BecomeActive on the ProviderManager's context is not enough, is it?

I can see the fix in the aa35113 version, but am struggling with this one.

@ReubenBond
Member Author

Let's skype call so we can discuss it, Gabi.

I should have added a comment when I pushed that commit. It works because of a change made back in ~June, where FabricMembershipOracle steals TaskScheduler.Current in its Start method and uses it to spin up a task which awaits notifications in ProcessNotifications. The OnUpdate method, which is called by the OS thread from SF, pushes notifications onto a queue and pulses the AutoResetEvent which the former is awaiting. ProcessNotifications performs the actual SiloStatusChangeNotification call to subscribers, so it is called on that captured scheduler.
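The queue-and-pump mechanism described above — an external thread enqueues notifications and pulses an event, while a single captured context drains the queue and invokes subscribers — can be sketched as follows. This is an illustrative Python analogue (queue plus a dedicated pump thread), not the actual C#, which captures TaskScheduler.Current and awaits an AutoResetEvent; all names here are hypothetical:

```python
import queue
import threading

class NotificationPump:
    """External threads call on_update(); a single pump thread plays the
    role of the captured scheduling context and delivers in order."""

    def __init__(self):
        self.pending = queue.Queue()
        self.delivered = []
        self.pump = threading.Thread(target=self._process, daemon=True)
        self.pump.start()

    def on_update(self, notification):
        # Called by the external runtime (SF in this PR) on its own
        # thread: just enqueue and return, never run subscriber code here.
        self.pending.put(notification)

    def _process(self):
        # Single consumer: every subscriber callback runs on this one
        # context, restoring the single-threading guarantee.
        while True:
            item = self.pending.get()
            if item is None:  # sentinel used by stop()
                return
            self.delivered.append(item)

    def stop(self):
        self.pending.put(None)
        self.pump.join()
```

Because there is exactly one consumer, notifications are delivered in the order they were enqueued, regardless of which threads produced them.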

It works, I tested.

Honestly, though, I feel it's a dodgy fix and I definitely prefer the previous revision. I'm uncomfortable with it - it's a hack. I spoke with Sergey this morning, since he's going to be out for the next fortnight. He also prefers the previous revision.

Let me know when you've got time so we can speak.

@ReubenBond
Member Author

@dotnet-bot test netfx-bvt
@dotnet-bot test netstandard-win-functional

@ReubenBond
Member Author

ReubenBond commented Aug 24, 2017

I'm testing it using this branch: https://github.com/ReubenBond/orleans/tree/upgrade-sd-sample
Run it, kill one of the spawned processes, observe the failure (search the logs for "RuntimeContext").
Then build this PR from cmdline to produce nuget packages, upgrade the sample using those packages, and repeat the repro process.

@ReubenBond ReubenBond force-pushed the fix-isilostatuslistener-scheduling branch from 88ed713 to 091b8c1 Compare August 24, 2017 13:00
@ReubenBond
Member Author

Gabi and I had a discussion about it and we're on the same page regarding the principles and ideals.

In the end, I feel we should essentially take both the 'new' and 'old' fix.

  • FabricMembershipOracle will execute on the SchedulingContext of the ProviderManager SystemTarget. This also affects implementations of IReminderService and IMultiClusterOracle - if they aren't implemented by SystemTarget then they get a chance to capture the ProviderManager's TaskScheduler.
  • Interfaces which are not IAddressable are responsible for marshalling onto their own context. In addition to this, we removed IAddressable from IStreamQueueBalanceListener because the classes which call it are not IAddressable. Eventually, we should implement either "silo local client" or IAddressableContextScheduler (see prev comments) so that we can handle these scenarios better (using GrainReferences).

Verified the fix locally using the branch linked above (I built the netfx version of the repo to make that work).

Assuming tests pass, could someone please give it a final review and merge?

@gabikliot gabikliot merged commit ee33c7c into dotnet:master Aug 24, 2017
@gabikliot
Contributor

Great work @ReubenBond !

@ReubenBond ReubenBond deleted the fix-isilostatuslistener-scheduling branch August 24, 2017 21:13
jdom pushed a commit to jdom/orleans that referenced this pull request Aug 24, 2017
* Ensure that ISiloStatusListener handlers correctly schedule handler code

* Services without a SchedulingContext are executed on the ProviderManager context
ReubenBond added a commit that referenced this pull request Aug 24, 2017
* Ensure that ISiloStatusListener handlers correctly schedule handler code

* Services without a SchedulingContext are executed on the ProviderManager context
@github-actions github-actions Bot locked and limited conversation to collaborators Dec 9, 2023