You note the effect for Skylake on the wiki ("Minimum store-forwarding latency is 3 on new(ish) chips, but the load has to arrive at exactly the right time to achieve this"), but I'm seeing a similar effect on earlier CPUs too; tested with Sandybridge and Ivybridge so far.
loop:
mov [rsp], edi
dec ecx
mov edi, [rsp]
jnz loop
runs at 6.3 cyc/iter
loop:
mov edi, [rsp]
dec ecx
mov [rsp], edi
jnz loop
runs at 5 cyc/iter
loop:
mov [rsp], edi
dec ecx
nop
nop
nop
nop
nop
nop
mov edi, [rsp]
jnz loop
runs at 5 cyc/iter
I'm also seeing stores apparently being re-issued a lot (note the port 4 count). For the second variant:
5003448550      cycles:u
4000194964      instructions:u                   # 0.80 insn per cycle
 666535816      uops_dispatched_port.port_0:u
 333645348      uops_dispatched_port.port_1:u
 833721986      uops_dispatched_port.port_2:u
1166399269      uops_dispatched_port.port_3:u
4976661537      uops_dispatched_port.port_4:u
1000143676      uops_dispatched_port.port_5:u
8977107631      uops_dispatched.core:u
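To make the re-issue visible, it helps to normalize these counters per loop iteration. Below is a minimal Python sketch of that arithmetic. It assumes the loop body is exactly 4 instructions (mov, dec, mov, jnz) and that startup overhead is negligible, so the iteration count is roughly instructions / 4. The names `counters`, `INSNS_PER_ITER` and `per_iteration` are made up for illustration.

```python
# Counter values copied from the perf output above (second variant).
counters = {
    "cycles": 5003448550,
    "instructions": 4000194964,
    "port_0": 666535816,
    "port_1": 333645348,
    "port_2": 833721986,
    "port_3": 1166399269,
    "port_4": 4976661537,
    "port_5": 1000143676,
    "uops_dispatched.core": 8977107631,
}

# Assumption: 4 instructions per iteration (mov, dec, mov, jnz),
# with negligible overhead outside the loop.
INSNS_PER_ITER = 4


def per_iteration(counts, insns_per_iter=INSNS_PER_ITER):
    """Divide every counter by the estimated number of loop iterations."""
    iters = counts["instructions"] / insns_per_iter
    return {name: value / iters for name, value in counts.items()}


if __name__ == "__main__":
    rates = per_iteration(counters)
    for name, value in rates.items():
        print(f"{name:>22}: {value:6.2f} per iteration")
    # Ports 2 and 3 together handle the load and the store-address uop.
    print(f"{'port_2 + port_3':>22}: {rates['port_2'] + rates['port_3']:6.2f} per iteration")
```

Under those assumptions, ports 2 and 3 together come out at about 2 uops per iteration (one load plus one store-address uop), and ports 0, 1 and 5 together also at about 2 (dec and jnz). Port 4 (store data) comes out at nearly 5 per iteration, although the loop contains only one store.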
What are your thoughts on this? Have you seen instruction replay explained somewhere? Did you see variable forwarding latency mentioned anywhere?