store forwarding does not have fixed latency #81

@amonakov

You note the effect for Skylake on the wiki ("Minimum store-forwarding latency is 3 on new(ish) chips, but the load has to arrive at exactly the right time to achieve this"), but I'm seeing a similar effect on earlier CPUs too; so far I have tested Sandy Bridge and Ivy Bridge.

loop:
mov [rsp], edi
dec ecx
mov edi, [rsp]
jnz loop

runs at 6.3 cyc/iter

loop: 
mov edi, [rsp] 
dec ecx 
mov [rsp], edi 
jnz loop

runs at 5 cyc/iter

loop:
mov [rsp], edi
dec ecx
nop
nop
nop
nop
nop
nop
mov edi, [rsp]
jnz loop

runs at 5 cyc/iter

I'm also seeing stores apparently being re-issued a lot. For the second variant (load first, 5 cyc/iter), perf stat reports:

        5003448550      cycles:u                                                    
        4000194964      instructions:u            #    0.80  insn per cycle         
         666535816       uops_dispatched_port.port_0:u                                   
         333645348       uops_dispatched_port.port_1:u                                   
         833721986       uops_dispatched_port.port_2:u                                   
        1166399269       uops_dispatched_port.port_3:u                                   
        4976661537       uops_dispatched_port.port_4:u                                   
        1000143676       uops_dispatched_port.port_5:u                                   
        8977107631      uops_dispatched.core:u                                      
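
As a rough sanity check, the counters above can be normalized per loop iteration. This is only a sketch: the iteration count is not measured separately but inferred by assuming the second variant retires exactly 4 instructions per iteration, and the dictionary keys are just shortened labels for the perf events. The port grouping in the comments reflects the usual Sandy Bridge / Ivy Bridge port roles.

```python
# Counter values copied from the perf output above (second variant).
counters = {
    "cycles": 5003448550,
    "instructions": 4000194964,
    "port_0": 666535816,
    "port_1": 333645348,
    "port_2": 833721986,
    "port_3": 1166399269,
    "port_4": 4976661537,
    "port_5": 1000143676,
    "uops_dispatched.core": 8977107631,
}

# Assumption: the loop body is 4 instructions (mov, dec, mov, jnz),
# so the iteration count is approximately instructions / 4.
INSNS_PER_ITER = 4
iterations = counters["instructions"] / INSNS_PER_ITER

# Normalize every counter to "events per loop iteration".
per_iter = {name: value / iterations for name, value in counters.items()}

for name, value in per_iter.items():
    print(f"{name:>22}: {value:6.3f} per iteration")

# Group ports by their role on Sandy Bridge / Ivy Bridge:
#   p0, p1, p5: ALU and branch uops (here: dec and jnz)
#   p2, p3:     load and store-address uops
#   p4:         store-data uops
alu_uops = sum(per_iter[p] for p in ("port_0", "port_1", "port_5"))
agu_uops = per_iter["port_2"] + per_iter["port_3"]
store_data_uops = per_iter["port_4"]

print(f"ALU/branch (p0+p1+p5): {alu_uops:.3f}")
print(f"load + store-address (p2+p3): {agu_uops:.3f}")
print(f"store-data (p4): {store_data_uops:.3f}")
```

Under that assumption, p0+p1+p5 and p2+p3 each come out at about 2 uops per iteration, which matches one dec, one jnz, one load and one store-address uop. Port 4 comes out at nearly 5 per iteration, where a single store would be expected to need only one store-data uop, so the extra dispatches all appear to be on the store-data side.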

What are your thoughts on this? Have you seen instruction replay explained somewhere? Did you see variable forwarding latency mentioned anywhere?
