Removes as much expression evaluation as possible when KernelExecutor is
called with matching inputs. Results for the llama-based latency tests on
H200-DGX are tracked in #3813. That PR also
includes CPU-based profiling results showing how much KernelExecutor
latency has improved. This PR does not add any new functionality.
Graph 0:
Total time: 12ms -> 3.1ms
KernelExecutor::runFusion: 35.5us -> 10.5us
Graph 1:
Total time: 19.8ms -> 10.6ms
KernelExecutor::runFusion: 36.4us -> 13.1us
Graph 2:
Total time: 18.9ms -> 18.8ms
KernelExecutor::runFusion: 28.6us -> 11.1us
For Graph 2, further improvement would require optimizing ExprEvalExecutor,
which takes up 60% of the runtime, while KernelExecutor takes up only 20%.
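
As a rough cross-check of the numbers above, the speedups can be recomputed with a few lines of Python. This is only an illustrative sketch: the `results` table, `speedup`, and `max_total_speedup` are hypothetical names introduced here, not part of this PR, and the bound assumes the 60%/20% shares describe a single profile of Graph 2.

```python
# Timings copied from the results above as (before, after) pairs.
# "total" is in ms, "run_fusion" (KernelExecutor::runFusion) is in us.
results = {
    "Graph 0": {"total": (12.0, 3.1), "run_fusion": (35.5, 10.5)},
    "Graph 1": {"total": (19.8, 10.6), "run_fusion": (36.4, 13.1)},
    "Graph 2": {"total": (18.9, 18.8), "run_fusion": (28.6, 11.1)},
}


def speedup(before: float, after: float) -> float:
    """Ratio of the old time to the new time (values above 1 mean faster)."""
    return before / after


def max_total_speedup(share: float) -> float:
    """Upper bound on total speedup if a component that takes `share` of the
    runtime became free (Amdahl's law with an infinite local speedup)."""
    return 1.0 / (1.0 - share)


# Per-graph speedups for total time and for runFusion.
speedups = {
    graph: {name: speedup(*pair) for name, pair in timings.items()}
    for graph, timings in results.items()
}

for graph, values in speedups.items():
    print(f"{graph}: total {values['total']:.2f}x, "
          f"runFusion {values['run_fusion']:.2f}x")

# Assumed shares for Graph 2, taken from the description above.
print(f"Bound from KernelExecutor (20%): {max_total_speedup(0.2):.2f}x")
print(f"Bound from ExprEvalExecutor (60%): {max_total_speedup(0.6):.2f}x")
```

Under those assumptions, removing KernelExecutor's 20% share entirely could speed up Graph 2 by at most 1.25x, whereas ExprEvalExecutor's 60% share leaves room for up to 2.5x, which is consistent with the total time for Graph 2 barely moving.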