-
Notifications
You must be signed in to change notification settings - Fork 97
async_hooks-based tracing doesn't work for gRPC server (in Node 8) #661
Description
From #656
@mrene wrote:
I have a similar issue (node 8.9.4, async/await with
GCLOUD_TRACE_NEW_CONTEXT, grpc 1.7.3 server) - I get valid traces for the first call and then I start seeinggrpc: Cannot create nested root spans.
I then added some more tracing inside my own methods and I started seeingCustom Span API: creating child for an already closed span ... grpc:/.... After a while, when the traces get pushed, I get another valid trace then the same errors repeat themselves.The errors relating to the custom spans are reported because
cls.getRootContext()returns a span that is closed (the span from the last call). Debugging this intocls-ahrevealed some weird behaviour betweengrpc-native-coreandasync_hooks. Every root span gets created while both theexecutionAsyncIdandtriggerAsyncIdare 0; thebeforecallback also does not execute before handling a new call and therefore thecontextvariable withincls-ahpoints to the last used context.cls-ah seems to do the following:
- maintain a global context object which is used as cls
- maintain a hash of context objects, indexed by
executionAsyncId- whenever an
initevent occurs:
- copy the current global context object into the hash (regardless of the parent id specified by the callback)
- whenever an
enterevent occurs:
- if the hash contains the entered id, set the global context object to its content
In this scenario, because we're lacking an
enter, the first grpc call trace is valid because the context object is initially initialized to{}and we process to creating a root span because the context doesn't contain any previous root span. Upon the 2nd call, the context still contains the previous root span (in a closed state), only errors arise. When the traces get flushed, anenterevent occurs and sets the globalcontextobject back to its initial value (because the trace writer context was created before the first grpc call, so it copied the initial value).At this point this looks more like a grpc bug although it could either be handled in the plugin (using the
AsyncResourceAPI or withingrpc-native-core- I'm sure there's a param to carry over execution context ids when switching from native->js).
It seems the
AsyncHooksembedder's api hasn't landed in NAN yet (nodejs/node#13254) so a JS fix is the only possible option.I've got a hacked up version of the plugin to validate the entire hypothesis and it patches the problem.
I'd like to get some of your opinions before opening a PR, got some comments/questions here:
- Copying the current context from
initmight not be the right thing, sinceinitcan happen on its own withoutenter
- Part of the problem is that contexts created after the first span directly inherit it even if they don't share a parent context
- Every async context will probably point to the same object
contextis set back to its original value inrunAndReturnbut as soon as more get created, they will be set back (throughbeforeandinit) and the original value will not be restored.- This could be mitigated by implementing
after()and resetting values accordingly.- This feels like it
emitDestroy()could be called byspan#endSpan()since it basically knows whenever it's donerunAndReturn()should probably setcontexts[uid]as well ascontextWhat do you think?
@kjin wrote:
@mrene Thanks for investigating the
async_hooks-based tracing issue! Please go ahead and open the PR.I've been thinking through the observations you had, here are my thoughts so far:
Copying the current context from
initmight not be the right thing, sinceinitcan happen on its own withoutenterCopying the current context in
initis necessary to preserve it across asynchronous boundaries. Maybe I'm misunderstanding what you're saying?Part of the problem is that contexts created after the first span directly inherit it even if they don't share a parent context
Every async context will probably point to the same objectI think you were correct in creating a new
AsyncResourcefor each request being handled, so that subsequent requests after the first one don't point to the same object.
contextis set back to its original value inrunAndReturnbut as soon as more get created, they will be set back (throughbeforeandinit) and the original value will not be restored.
This could be mitigated by implementingafter()and resetting values accordingly.
I believe implementingEDIT: See new comment.after()to reset context would effectively be a no-op, sincebefore()will run anyway for the next async resource. Would you be able to further articulate what the potential problem could be here?This feels like it
emitDestroy()could be called byspan#endSpan()since it basically knows whenever it's doneI'm open to (and for this and other reasons, actively thinking about) the idea of creating our own
AsyncResourceobject for root spans and emitting the lifecycle events correspondingly. Let's talk about it in a separate issue (please feel free to open it if I haven't already).
runAndReturn()should probably setcontexts[uid]as well ascontextWhat should
contexts[uid]be set to?
@mrene After looking into it more, I believe your reasoning is correct that
aftershould be implemented to reset the previous context.I'll open a PR to implement
aftersoon-ish, probably tomorrow, and gRPC server tracing should just work (with NEW_CONTEXT-based tracing, at least) without need for additional changes to the plugin. That being said, it might still make sense for you to open a PR against gRPC to add async_hooks embedder support. (if you feel motivated; I can do it as well 😄 )
@mrene wrote:
Copying the current context in init is necessary to preserve it across asynchronous boundaries. Maybe I'm misunderstanding what you're saying?
I mean
initshouldn't be constructed asbefore- the runtime isn't switching to it yet. one of its argument is thetriggerAsyncId. I agree that the context should be copied, just that it should be copied fromcontexts[triggerAsyncId]rather than thecontextglobal var being set at that moment. (In most cases those will be the same but it's possible to pass another parent in the async_hooks API)What should contexts[uid] be set to?
The same thing, basically
context = contexts[uid] = {};whenever it is being initialized. This would make sure that the context is preserved if anenterevent to the same context occurs afterwards. It'll give that context its own object, which it doesn't have now (only context initialized while this one is active will be correctly linked).Thinking about this more makes me realize that running this without creating another
AsyncResourcecould cause the same id to be re-used twice, so it might not be advisable until this is implemented.I noticed a small difference in behaviour with
continuation-local-storage, it does not store any data unless the current context has been activated, whereascls-ahinitializes an empty object so things can technically be saved/fetched even ifrunAndReturnhasn't been called. Nothing critical but it might hide subtle bugs.