The primary considerations when implementing a solution for benchmarking model accuracy are as follows:
- How do benchmark authors include their accuracy result in the benchmarking reports (what APIs or tools do they use?)
- How are the accuracy results displayed?
  - in the pytest output
  - in the ASV output
- How are accuracy results stored?
Since accuracy results require developers to compute them (i.e., they can't be measured independently like time or resource usage), the standard pytest fixture won't be able to capture accuracy without the developer getting involved.
A typical benchmark in Python using the `rapids-pytest-benchmark` fixture is written like this:

```python
def bench_jaccard(gpubenchmark, graphWithAdjListComputed):
    gpubenchmark(cugraph.jaccard, graphWithAdjListComputed)
```
The above example will run time and memory usage measurements on the function call `cugraph.jaccard(graphWithAdjListComputed)`. A common accuracy check for this is to ensure the result (in this case, the similarity coefficient for each src/dst vertex pair) is "close enough" to a reference implementation (in this case, the `jaccard` algo in NetworkX). To do this, the developer clearly needs to be involved in order to write the comparison code.
Stepping back a little, what does a developer need to see in a benchmark report of model accuracy, using the example above? Ideally they might want to see the comparison of coefficients for every vertex pair, but that won't display well, and also won't provide an at-a-glance view of accuracy. Instead, a single value could describe the fraction of coefficients that fall within a certain tolerance of the reference implementation's coefficients. Stepping back further, we can say that, just like time and resource utilization, model accuracy can be consolidated down to a single number.
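For illustration, that reduction can be a one-liner; a minimal sketch, assuming two equal-length, identically ordered coefficient arrays (the names here are hypothetical, not part of any proposed API):

```python
import numpy as np

def accuracyScore(refCoeffs, testCoeffs, tol=1.0e-6):
    """Return the fraction of coefficients within `tol` of the reference."""
    refCoeffs = np.asarray(refCoeffs)
    testCoeffs = np.asarray(testCoeffs)
    # The mean of a boolean mask is the fraction of "close enough" comparisons
    return float(np.mean(np.abs(testCoeffs - refCoeffs) <= tol))
```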
Given that, how do we modify the existing benchmarking API so devs can supply a new value that has to be computed by the developer?
We can extend the example above as follows:
```python
import cugraph
import networkx as nx


def getAccuracyComputeFunc(nxPreds):
    """
    Create a function object using the Nx result (a closure) that accepts
    a cuGraph result and returns the overall accuracy (assuming Nx is
    the standard).
    """
    def accuracyCompare(result):
        # decompose Nx vals into arrays
        nxSrc = []
        nxDst = []
        nxCoeff = []
        for u, v, p in nxPreds:
            nxSrc.append(u)
            nxDst.append(v)
            nxCoeff.append(p)
        # decompose cuGraph vals into arrays
        cuSrc = result["source"].to_array()
        cuDst = result["destination"].to_array()
        cuCoeff = result["jaccard_coeff"].to_array()
        # compare - NOTE: this assumes both results list the src/dst pairs
        # in the same order
        err = 0
        tol = 1.0e-06
        numCoeffs = len(cuCoeff)
        assert numCoeffs == len(nxCoeff)
        for i in range(numCoeffs):
            if abs(cuCoeff[i] - nxCoeff[i]) > tol:
                err += 1
        # single number: the fraction of coefficients within tolerance
        return (numCoeffs - err) / numCoeffs

    return accuracyCompare


def bench_jaccard(gpubenchmark, precomputedGraph):
    # Get results for Nx jaccard
    preds = nx.jaccard_coefficient(precomputedGraph.forNx, precomputedGraph.edges)
    # Assign a callable to be used for an additional metric. The callable will
    # be called with the result of the benchmarked function and must return a
    # value to be included in reports under the name "bench_jaccard_accuracy"
    gpubenchmark.addMetric(getAccuracyComputeFunc(preds), suffix="_accuracy")
    # Run the benchmark to get time, memory, and accuracy
    gpubenchmark(cugraph.jaccard, precomputedGraph.forCuGraph)
```
`gpubenchmark.addMetric(callable, suffix)` is the key addition here, and allows for any number of arbitrary metrics to be added. This is a nice, general-purpose approach which scales to many other potential metrics we may want to report.
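To illustrate that generality, here is a sketch of a completely different metric reported through the same proposed `addMetric` call (the metric itself, a result-row count, is made up for this example):

```python
import cugraph


def bench_pagerank(gpubenchmark, precomputedGraph):
    # Hypothetical extra metric: the number of rows in the result DataFrame,
    # reported under the name "bench_pagerank_result_size"
    gpubenchmark.addMetric(lambda result: len(result), suffix="_result_size")
    gpubenchmark(cugraph.pagerank, precomputedGraph.forCuGraph)
```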
The accuracy results would be added to the ASV reports as a separate benchmark - just like `_gpumem` and `_gputil` currently are, with no changes needed to `asvdb` or ASV.
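As a rough sketch of what that separate-benchmark entry could look like when written with `asvdb` (this follows `asvdb`'s documented `BenchmarkInfo`/`BenchmarkResult`/`ASVDb` usage; the machine, commit, dataset, and result values below are placeholders):

```python
from asvdb import ASVDb, BenchmarkInfo, BenchmarkResult

# Environment info, normally gathered from the machine and repo being benchmarked
bInfo = BenchmarkInfo(machineName="my_machine",
                      cudaVer="10.2",
                      osType="linux",
                      pythonVer="3.7",
                      commitHash="0123abc",
                      commitTime=1589322352000)

# The accuracy metric becomes its own benchmark entry, named with the suffix
bResult = BenchmarkResult(funcName="bench_jaccard_accuracy",
                          argNameValuePairs=[("dataset", "karate")],
                          result=0.998)

db = ASVDb(dbDir="/path/to/asv_db",
           repo="https://github.com/rapidsai/cugraph",
           branches=["branch-0.15"])
db.addResult(bInfo, bResult)
```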
The pytest reports that are printed to the console should also be updated to include the new metric(s). This is possibly the hardest part of this feature, since the current console report generation is very ugly code (taken from `pytest-benchmark`, made worse by `rapids-pytest-benchmark`).