Skip to content

Conversation

@brianrob
Copy link
Member

Traces generated by https://github.com/microsoft/one-collect can provide a comprehensive view of system behavior through CPU and context switch events.

Modify the ThreadTimeStackComputer to support aggregation of CPU and blocked time from the Universal.Events provider defined in one-collect.

Expose the new view for nettrace files that contain these events.

These will be required later when implementing offline symbol resolution
for ELF images.
Starting with the V6 file format, the processor number is always set.
This enables consumers to build tools based on processor usage.  For
example, this enables tools such as ThreadTime and Processor stacks to
be implemented for universal traces.
@brianrob brianrob marked this pull request as ready for review October 17, 2025 00:01
Copy link
Collaborator

@marklio marklio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, this looks reasonable. I have some question/suggestions (some of which) I think merit discussion, so I'm gonna mark this "comment" instead of approve for now

// The value is the amount of time that thread was switched out.

TraceThread thread = data.Thread();
if (thread == null)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this hardening against a plausible scenario? Or is this representative of a "do not crash" principle in this codebase? That is, would a null thread here be indicative of a bug elsewhere that we are making hard to diagnose by simply writing to Debug?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For universal traces this would be a bug. We create all of the threads that we know about when opening the trace for the first time. It's important to check for null here since we don't want to break the tool if we introduce a bug elsewhere. That said, we could potentially do something better here. From a layering perspective, there's not a good log to write to, but that's something that we could fix going forward. Thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a hard call when a tool is so widely used on a diversity of inputs we can't hope to reasonably cover in validation. Do we use Watson to look at crashes? If this were my tool, I'd want to be set up to get the very best diagnostic artifacts. What is the experience of this "log and ignore" approach? Is it actually less annoying than an actionable crash for the type of user that runs the tool? Would be sweet to have the UI be isolated from this processing in a way that allowed actionable diagnostics to flow and provide a good UI experience, but that's a lot of refactoring.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Watson - I don't usually look at it too much, but on occasion yes. I would say that yes, the current behavior is less annoying than the crash. I usually see this fall into one of two buckets:

  1. There is a single or very small number of events that get ignored because the thread for whatever reason didn't get created.
  2. We're missing something important in the trace and when you open the view it's effectively empty.

In the first case, the user's analysis is almost always unaffected. In the second case, it's very obvious something went wrong and the user will re-collect and/or report the issue.

I do agree with you - there is a pattern missing here around how we egress diagnostics. Some of the very core pieces of TraceEvent have logging capabilities that PerfView just has to turn on. That pattern likely just needs to be extended here (and frankly to many other places).

// If the thread wasn't previously marked as blocked, then mark it.
if (!m_threadState[(int)thread.ThreadIndex].ThreadBlocked)
{
double startTimeStampRelMsec = m_eventLog.QPCTimeToRelMSec(data.TimeStampQPC - (long)data.Value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below, you use data.TimeStampRelativeMSec. This is calculation different? The answer to this may answer my question above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TraceEvent does most of its calculations in milliseconds, though the data type is a double so we get fractional milliseconds. The reason this one is in QPC is because the entity that writes the event is writing it in QPC and not milliseconds.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure I'm missing something. On line 731 (with the same data), it uses .TimeStampRelativeMSec. And further below on 754, which seems like the same "mark as blocked" scenario, it also uses .TimeStampRelativeMSec. What is different about THIS callsite that it does the calculation from QPC?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should block this particular improvement. I'm more just curious in what our story is for reacting to customer pain with this tool.

@brianrob
Copy link
Member Author

Thank you for the review @marklio. Answered your questions and pushed some changes.

@brianrob brianrob merged commit 338bf05 into microsoft:main Oct 17, 2025
5 checks passed
@brianrob brianrob deleted the dev/brianrob/universal-threadtime branch October 17, 2025 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants