API calls in cuda-bindings currently are made through 3 layers.
As an experiment to measure the performance impact of calling through these layers, I "flattened" the call so the top layer just directly calls the C function pointer in the library (currently handled by the bottom layer). The overhead of each of these layers is pretty small, by design, but there is still some Python exception handling, as well as our library initialization check (cuPythonInit()) along the way.
While we lose some safety and version independence doing this, it is useful as an experiment to see what the cost of that flexibility is.
My changes
Measuring this with the benchmark in #659, I do not see any measurable change. Branch predictors must be pretty good these days.
Before: Mean +- std dev: 2.77 us +- 0.37 us
After: Mean +- std dev: 2.76 us +- 0.21 us
API calls in
cuda-bindingscurrently are made through 3 layers.As an experiment to measure the performance impact of calling through these layers, I "flattened" the call so the top layer just directly calls the C function pointer in the library (currently handled by the bottom layer). The overhead of each of these layers is pretty small, by design, but there is still some Python exception handling, as well as our library initialization check (
cuPythonInit()) along the way.While we lose some safety and version independence doing this, it is useful as an experiment to see what the cost of that flexibility is.
My changes
Measuring this with the benchmark in #659, I do not see any measurable change. Branch predictors must be pretty good these days.