[PERF]: Measure the performance impact of the "layered design" in cuda-bindings

API calls in `cuda-bindings` currently pass through three layers.

As an experiment to measure the performance impact of calling through these layers, I "flattened" the call so that the top layer directly calls the C function pointer in the library (currently handled by the bottom layer). The overhead of each layer is small by design, but there is still some Python exception handling, as well as our library initialization check (`cuPythonInit()`), along the way.

While we lose some safety and version independence doing this, it is useful as an experiment to see what the cost of that flexibility is.

[My changes](https://github.com/NVIDIA/cuda-python/compare/main...mdboom:cuda-python:flatten-layers?expand=1)

Measuring this with the benchmark in #659, I do not see any measurable change: the difference in means is far smaller than the standard deviation of either run. Branch predictors must be pretty good these days.

```
Before: Mean +- std dev: 2.77 us +- 0.37 us
After: Mean +- std dev: 2.76 us +- 0.21 us
```
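As a rough sanity check on "no measurable change", the reported numbers can be compared directly. This sketch assumes the two standard deviations describe independent run-to-run spread. The sample counts are not given here, so it stops short of a formal significance test.

```python
import math

# Values copied from the benchmark output above (microseconds).
before_mean, before_std = 2.77, 0.37
after_mean, after_std = 2.76, 0.21

# Difference between the two means.
diff = before_mean - after_mean

# Combined spread of the two measurements, assuming independence.
combined_std = math.sqrt(before_std**2 + after_std**2)

# How large the difference is relative to the noise.
ratio = diff / combined_std

print(f"difference: {diff:.2f} us")
print(f"combined std dev: {combined_std:.2f} us")
print(f"difference / noise: {ratio:.3f}")
```

The difference of about 0.01 us is a few percent of the combined spread, which is consistent with the two variants being indistinguishable in this benchmark.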


