The primary considerations when implementing a solution for benchmarking model accuracy are as follows:
- How do benchmark authors include their accuracy result in the benchmarking reports (what APIs or tools do they use?)
- How are the accuracy results displayed?
  - in the pytest output
  - in the ASV output
- How are accuracy results stored?
Since accuracy results require developers to compute them (i.e., they can't be measured independently like time or resource usage), the standard pytest fixture won't be able to capture accuracy without the developer getting involved.
A typical benchmark in Python using the `rapids-pytest-benchmark` fixture is written like this:

```python
def bench_jaccard(gpubenchmark, graphWithAdjListComputed):
    gpubenchmark(cugraph.jaccard, graphWithAdjListComputed)
```
The above example will run time and memory usage measurements on the function call `cugraph.jaccard(graphWithAdjListComputed)`. A common accuracy check for this is to ensure the result (in this case, the similarity coefficient for each src/dst vertex pair) is "close enough" to a reference implementation (in this case, the `jaccard` algo in NetworkX). To do this, the developer clearly needs to be involved in order to write the comparison code.
Stepping back a little, what does a developer need to see in a benchmark report of model accuracy, using the example above? Ideally they might want to see the comparison of coefficients for every vertex pair, but that won't display well, and also won't provide an at-a-glance view of accuracy. Instead, a single value could describe the fraction of coefficients that fall within a certain tolerance of the reference implementation's coefficients. Stepping back further, we can say that, just like time and resource utilization, model accuracy can be consolidated down to a single number.
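For illustration, that reduction can be a one-liner; a minimal sketch, assuming two equal-length, identically ordered coefficient arrays (the names here are hypothetical, not part of any proposed API):

```python
import numpy as np

def accuracyScore(refCoeffs, testCoeffs, tol=1.0e-6):
    """Return the fraction of coefficients within `tol` of the reference."""
    refCoeffs = np.asarray(refCoeffs)
    testCoeffs = np.asarray(testCoeffs)
    # The mean of a boolean mask is the fraction of "close enough" comparisons
    return float(np.mean(np.abs(testCoeffs - refCoeffs) <= tol))
```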
Given that, how do we modify the existing benchmarking API so devs can supply a new value that has to be computed by the developer?
We can extend the example above as follows:
```python
import cugraph
import networkx as nx


def getAccuracyComputeFunc(nxPreds):
    """
    Create a function object using the Nx result (a closure) that accepts
    a cuGraph result and returns the overall accuracy (assuming Nx is
    the standard).
    """
    def accuracyCompare(result):
        # decompose Nx vals into arrays
        nxSrc = []
        nxDst = []
        nxCoeff = []
        for u, v, p in nxPreds:
            nxSrc.append(u)
            nxDst.append(v)
            nxCoeff.append(p)
        # decompose cuGraph vals into arrays
        cuSrc = result["source"].to_array()
        cuDst = result["destination"].to_array()
        cuCoeff = result["jaccard_coeff"].to_array()
        # compare - NOTE: this assumes both results list the src/dst pairs
        # in the same order
        err = 0
        tol = 1.0e-06
        numCoeffs = len(cuCoeff)
        assert numCoeffs == len(nxCoeff)
        for i in range(numCoeffs):
            if abs(cuCoeff[i] - nxCoeff[i]) > tol:
                err += 1
        # single number: the fraction of coefficients within tolerance
        return (numCoeffs - err) / numCoeffs

    return accuracyCompare


def bench_jaccard(gpubenchmark, precomputedGraph):
    # Get results for Nx jaccard
    preds = nx.jaccard_coefficient(precomputedGraph.forNx, precomputedGraph.edges)
    # Assign a callable to be used for an additional metric. The callable will
    # be called with the result of the benchmarked function and must return a
    # value to be included in reports under the name "bench_jaccard_accuracy"
    gpubenchmark.addMetric(getAccuracyComputeFunc(preds), suffix="_accuracy")
    # Run the benchmark to get time, memory, and accuracy
    gpubenchmark(cugraph.jaccard, precomputedGraph.forCuGraph)
```
`gpubenchmark.addMetric(callable, suffix)` is the key addition here, and allows for any number of arbitrary metrics to be added. This is a nice, general-purpose approach which scales to many other potential metrics we may want to report.
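To illustrate that generality, here is a sketch of a completely different metric reported through the same proposed `addMetric` call (the metric itself, a result-row count, is made up for this example):

```python
import cugraph


def bench_pagerank(gpubenchmark, precomputedGraph):
    # Hypothetical extra metric: the number of rows in the result DataFrame,
    # reported under the name "bench_pagerank_result_size"
    gpubenchmark.addMetric(lambda result: len(result), suffix="_result_size")
    gpubenchmark(cugraph.pagerank, precomputedGraph.forCuGraph)
```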
The accuracy results would be added to the ASV reports as a separate benchmark - just like `_gpumem` and `_gputil` currently are, with no changes needed to `asvdb` or ASV.
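As a rough sketch of what that separate-benchmark entry could look like when written with `asvdb` (this follows `asvdb`'s documented `BenchmarkInfo`/`BenchmarkResult`/`ASVDb` usage; the machine, commit, dataset, and result values below are placeholders):

```python
from asvdb import ASVDb, BenchmarkInfo, BenchmarkResult

# Environment info, normally gathered from the machine and repo being benchmarked
bInfo = BenchmarkInfo(machineName="my_machine",
                      cudaVer="10.2",
                      osType="linux",
                      pythonVer="3.7",
                      commitHash="0123abc",
                      commitTime=1589322352000)

# The accuracy metric becomes its own benchmark entry, named with the suffix
bResult = BenchmarkResult(funcName="bench_jaccard_accuracy",
                          argNameValuePairs=[("dataset", "karate")],
                          result=0.998)

db = ASVDb(dbDir="/path/to/asv_db",
           repo="https://github.com/rapidsai/cugraph",
           branches=["branch-0.15"])
db.addResult(bInfo, bResult)
```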
The pytest reports that are printed to the console should also be updated to include the new metric(s). This is possibly the hardest part of this feature, since the current console report generation is very ugly code (taken from `pytest-benchmark`, made worse by `rapids-pytest-benchmark`).