@stas00 I used to get ~66 msec/token (batch size = 1) for DS-inference with fp16. Can you confirm whether you are also observing performance drops?
I'm not sure we are using the same hardware. I'm getting pretty similar performance, please see: and the diff from before is 40 msec => 44 msec. int8 itself is of course slower than fp16.
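(As an aside, a minimal sketch of how a msec/token figure like the ones above can be measured; `model` and `input_ids` are hypothetical placeholders for the actual DS-inference model and prompt, and a Hugging Face-style `generate` API is assumed.)

```python
import time
import torch

def msec_per_token(model, input_ids, new_tokens=100):
    torch.cuda.synchronize()            # flush any pending GPU work first
    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()            # wait for generation to finish
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / new_tokens  # average latency in msec per token
```

Greedy decoding (`do_sample=False`) keeps runs comparable; at batch size 1 this per-token average is the kind of number the 40/44/66 msec figures above describe.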
So you are saying your latency only went up by 4 msec? The int8 numbers match mine exactly. Also, if possible, please let me know your PyTorch and CUDA versions.
No idea; you're not using the same machine, so it's normal to have different results. Even the GPUs could be slightly different, I think, or perhaps the PCIe type/channels. Specs:
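(The PyTorch/CUDA details requested above can be gathered with standard PyTorch calls, for example:)

```python
import torch

print("PyTorch:", torch.__version__)              # installed PyTorch version
print("CUDA   :", torch.version.cuda)             # CUDA version PyTorch was built with
print("GPU    :", torch.cuda.get_device_name(0))  # model of the first visible GPU
```

`python -m torch.utils.collect_env` prints a fuller environment report, including the GPU models and driver version.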
This PR is adding: