-
Notifications
You must be signed in to change notification settings - Fork 752
Implement A Thread Time View for Universal Traces #2320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement A Thread Time View for Universal Traces #2320
Conversation
These will be required later when implementing offline symbol resolution for ELF images.
Starting with the V6 file format, the processor number is always set. This enables consumers to build tools based on processor usage. For example, this enables tools such as ThreadTime and Processor stacks to be implemented for universal traces.
marklio
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, this looks reasonable. I have some question/suggestions (some of which) I think merit discussion, so I'm gonna mark this "comment" instead of approve for now
| // The value is the amount of time that thread was switched out. | ||
|
|
||
| TraceThread thread = data.Thread(); | ||
| if (thread == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this hardening against a plausible scenario? Or is this representative of a "do not crash" principle in this codebase? That is, would a null thread here be indicative of a bug elsewhere that we are making hard to diagnose by simply writing to Debug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For universal traces this would be a bug. We create all of the threads that we know about when opening the trace for the first time. It's important to check for null here since we don't want to break the tool if we introduce a bug elsewhere. That said, we could potentially do something better here. From a layering perspective, there's not a good log to write to, but that's something that we could fix going forward. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a hard call when a tool is so widely used on a diversity of inputs we can't hope to reasonably cover in validation. Do we use Watson to look at crashes? If this were my tool, I'd want to be set up to get the very best diagnostic artifacts. What is the experience of this "log and ignore" approach? Is it actually less annoying than an actionable crash for the type of user that runs the tool? Would be sweet to have the UI be isolated from this processing in a way that allowed actionable diagnostics to flow and provide a good UI experience, but that's a lot of refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Watson - I don't usually look at it too much, but on occasion yes. I would say that yes, the current behavior is less annoying than the crash. I usually see this fall into one of two buckets:
- There is a single or very small number of events that get ignored because the thread for whatever reason didn't get created.
- We're missing something important in the trace and when you open the view it's effectively empty.
In the first case, the user's analysis is almost always unaffected. In the second case, it's very obvious something went wrong and the user will re-collect and/or report the issue.
I do agree with you - there is a pattern missing here around how we egress diagnostics. Some of the very core pieces of TraceEvent have logging capabilities that PerfView just has to turn on. That pattern likely just needs to be extended here (and frankly to many other places).
| // If the thread wasn't previously marked as blocked, then mark it. | ||
| if (!m_threadState[(int)thread.ThreadIndex].ThreadBlocked) | ||
| { | ||
| double startTimeStampRelMsec = m_eventLog.QPCTimeToRelMSec(data.TimeStampQPC - (long)data.Value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below, you use data.TimeStampRelativeMSec. This is calculation different? The answer to this may answer my question above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TraceEvent does most of its calculations in milliseconds, though the data type is a double so we get fractional milliseconds. The reason this one is in QPC is because the entity that writes the event is writing it in QPC and not milliseconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure I'm missing something. On line 731 (with the same data), it uses .TimeStampRelativeMSec. And further below on 754, which seems like the same "mark as blocked" scenario, it also uses .TimeStampRelativeMSec. What is different about THIS callsite that it does the calculation from QPC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should block this particular improvement. I'm more just curious in what our story is for reacting to customer pain with this tool.
|
Thank you for the review @marklio. Answered your questions and pushed some changes. |
Traces generated by https://github.com/microsoft/one-collect can provide a comprehensive view of system behavior through CPU and context switch events.
Modify the
ThreadTimeStackComputerto support aggregation of CPU and blocked time from theUniversal.Eventsprovider defined in one-collect.Expose the new view for nettrace files that contain these events.