@pentp
Contributor

@pentp pentp commented Aug 4, 2021

  • Hide write latency for memAccessKind == PERFSCORE_MEMORY_READ_WRITE like for PERFSCORE_MEMORY_WRITE.
  • Fix memory access latencies for many instructions that previously didn't add the instruction latency to memory access latency or overwrote memory latency with register access latency.
  • Adjust some instruction latencies for YMM register size.
  • Fix latencies for a lot of instructions by using more precise uops.info data.

Fixes #49647

@ghost ghost added labels area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and community-contribution (Indicates that the PR has been added by a community member) Aug 4, 2021
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Aug 16, 2021
@kunalspathak
Contributor

@dotnet/jit-contrib, @briansull

@kunalspathak
Contributor

                            result.insLatency = PERFSCORE_LATENCY_3C;

Need += here?


Refers to: src/coreclr/jit/emitxarch.cpp:14921 in 51c7ec6.

@kunalspathak
Copy link
Contributor

            result.insLatency = PERFSCORE_LATENCY_23C;

+= ?


Refers to: src/coreclr/jit/emitxarch.cpp:15266 in 51c7ec6.

Contributor

@kunalspathak kunalspathak left a comment

Thank you for fixing this. I have added a few questions.

assert(latency >= 0.0);

- if (memAccessKind == PERFSCORE_MEMORY_WRITE)
+ if (memAccessKind == PERFSCORE_MEMORY_WRITE || memAccessKind == PERFSCORE_MEMORY_READ_WRITE)
Contributor

Why are we hiding latency for PERFSCORE_MEMORY_READ_WRITE as well? Does the assumption in the comment below hold true even for PERFSCORE_MEMORY_READ_WRITE?

Contributor Author

We're only hiding the write part of the read/write access here, so that add [mem], x doesn't end up with a higher perfscore than the equivalent sequence mov reg, [mem]; add reg, x; mov [mem], reg.
The assumption in the comment below is now taken to hold for both kinds of write (PERFSCORE_MEMORY_WRITE and PERFSCORE_MEMORY_READ_WRITE), and I've adjusted some of the instruction latencies to reflect that.
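
As a rough illustration of the point above, here is a toy model (not the JIT's implementation) that compares the latency charged to add [mem], x with the latency charged to the three-instruction sequence. The cycle counts, the MemAccess enum, and the function names are all hypothetical and chosen only to make the arithmetic visible.

```cpp
#include <cassert>

// Hypothetical cycle counts for illustration; not the JIT's actual PERFSCORE_* values.
constexpr double kLoadLatency  = 3.0; // assumed cost of reading memory
constexpr double kStoreLatency = 3.0; // assumed cost of writing memory
constexpr double kAddLatency   = 1.0; // assumed cost of a register add

// Hypothetical stand-in for the memory access kinds discussed in the thread.
enum class MemAccess { None, Read, Write, ReadWrite };

// Latency charged to a single instruction in this toy model.
// Pure writes always have their store latency hidden; read/write accesses
// have it hidden only when hideWriteForReadWrite is true.
double chargedLatency(double insLatency, MemAccess kind, bool hideWriteForReadWrite)
{
    assert(insLatency >= 0.0);
    switch (kind)
    {
        case MemAccess::None:
            return insLatency;
        case MemAccess::Read:
            return kLoadLatency + insLatency;
        case MemAccess::Write:
            return insLatency; // store latency hidden
        case MemAccess::ReadWrite:
            return kLoadLatency + insLatency + (hideWriteForReadWrite ? 0.0 : kStoreLatency);
    }
    return insLatency;
}

// add [mem], x as one read-modify-write instruction.
double fusedAdd(bool hideWriteForReadWrite)
{
    return chargedLatency(kAddLatency, MemAccess::ReadWrite, hideWriteForReadWrite);
}

// mov reg, [mem]; add reg, x; mov [mem], reg as three instructions.
double splitAdd(bool hideWriteForReadWrite)
{
    return chargedLatency(0.0, MemAccess::Read, hideWriteForReadWrite) +
           chargedLatency(kAddLatency, MemAccess::None, hideWriteForReadWrite) +
           chargedLatency(0.0, MemAccess::Write, hideWriteForReadWrite);
}
```

Under these assumed numbers, the fused form is charged more than the split form when the write part is not hidden, and the two are charged the same once it is hidden.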

case INS_setle:
case INS_setg:
- result.insLatency = PERFSCORE_LATENCY_1C;
+ result.insLatency += PERFSCORE_LATENCY_1C;
Contributor

Good catch.

Contributor

Similarly, we also need to do this for xchg, call, and fstp.

Contributor Author

I'm not sure how much memory latency affects call [mem]; uops.info doesn't have any data on this. It's probably less than the 3C already implied by the throughput, so it isn't worth changing here.

Contributor Author

For fld and fstp, I'm also not sure what the correct value is (again, there is no data at uops.info).

{
// ins reg, mem
result.insThroughput = PERFSCORE_THROUGHPUT_2X;
// insLatency is set above (see - Model the memory latency)
Contributor

Do we need to update the "Model the memory latency" code so that this isn't counted twice?

Contributor Author

This isn't counted twice; the 2C/3C latency here is in addition to the modeled base latency of the memory access (i.e., the mov latency). To put it another way, mov rax, [mem] has lower latency than movdqa ymm, [mem].
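
A minimal sketch of the difference between overwriting and accumulating the latency, assuming a simplified two-step model: a base memory cost is seeded first, and the per-instruction cost is applied afterwards. The InsCharacteristics struct, the function names, and the 3-cycle base are hypothetical stand-ins, not the actual code in emitxarch.cpp.

```cpp
#include <cassert>

// Hypothetical cycle count for illustration; not the JIT's actual PERFSCORE_* value.
constexpr double kMemReadBaseLatency = 3.0; // assumed cost of a plain load such as mov rax, [mem]

// Hypothetical stand-in for the result structure filled in by the emitter.
struct InsCharacteristics
{
    double insLatency = 0.0;
};

// Stand-in for the "Model the memory latency" step: seed the result with the
// base memory access cost when the instruction has a memory operand.
InsCharacteristics modelMemoryLatency(bool hasMemOperand)
{
    InsCharacteristics result;
    if (hasMemOperand)
    {
        result.insLatency = kMemReadBaseLatency;
    }
    return result;
}

// Overwriting (=) discards the memory component seeded above.
double latencyOverwritten(bool hasMemOperand, double insLatency)
{
    assert(insLatency >= 0.0);
    InsCharacteristics result = modelMemoryLatency(hasMemOperand);
    result.insLatency = insLatency;
    return result.insLatency;
}

// Accumulating (+=) keeps the memory component and adds the instruction's own cost.
double latencyAccumulated(bool hasMemOperand, double insLatency)
{
    assert(insLatency >= 0.0);
    InsCharacteristics result = modelMemoryLatency(hasMemOperand);
    result.insLatency += insLatency;
    return result.insLatency;
}
```

With an assumed extra cost of 2 cycles for a vector load, latencyAccumulated(true, 2.0) comes out above the plain-load base, which matches the ordering described above. Overwriting is still the right choice for an instruction like lea, which only computes an address and never touches memory.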

@pentp
Contributor Author

pentp commented Sep 7, 2021

                            result.insLatency = PERFSCORE_LATENCY_3C;

Need += here?

Refers to: src/coreclr/jit/emitxarch.cpp:14921 in 51c7ec6.

This is for lea, which should ignore (overwrite) the memory access latency.

            result.insLatency = PERFSCORE_LATENCY_23C;

+= ?

Refers to: src/coreclr/jit/emitxarch.cpp:15266 in 51c7ec6.

The latency of xchg [mem], reg is complicated (and bad): it could be anywhere between 10 and 45 depending on the exact instruction form, the surrounding sequence, and the CPU. This value is probably just an average, and the instruction should be avoided if possible.

Contributor

@kunalspathak kunalspathak left a comment

LGTM. Thanks for your responses.

@kunalspathak kunalspathak merged commit 404a89f into dotnet:main Sep 7, 2021
@pentp pentp deleted the perfscore-fixes branch September 7, 2021 20:41
@ghost ghost locked as resolved and limited conversation to collaborators Oct 7, 2021
Development

Successfully merging this pull request may close these issues.

PerfScore inconcistencies with memory accesses