feat(stats_tests): implement KS test #329
Implements versions of the one-sample and two-sample KS test
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #329      +/-   ##
==========================================
+ Coverage   94.26%   94.72%   +0.45%
==========================================
  Files          58       60       +2
  Lines       12943    14238    +1295
==========================================
+ Hits        12201    13487    +1286
- Misses        742      751       +9
statrs/src/stats_tests/ks_test.rs Line 246 in 5fad268
My implementation to calculate the exact p-value for a one-sample KS test requires nalgebra to do some matrix multiplication. I was wondering if there were suggestions for how to add the feature gate. I'm not too familiar with feature gating, but the two ideas that came to my mind were to
Thanks for this! Feature gating is conditional compilation, so gating the enum variant alone won't be enough: the variant is reached through a specific code path. If you walk down that code path and feature-gate everything on it, the crate will still compile with the feature flag disabled. I'll use the review tools to make the suggestions so they sit next to the code.
src/stats_tests/ks_test.rs
Outdated
use core::f64;
use std::iter::zip;

use nalgebra::DMatrix;
either gate this or write the use statement near to usage
/// `n`s and will error with if there are ties in the input data or the input data is too
/// large. The threshold for too large is data with length 170 lining up with the
/// implementation of [`factorial`] being used.
TwoSidedExact,
gate this like you mentioned
I believe it can go before or after the docstring.
    2.0 * sum
}

fn onesample_marsaglia_et_al_twosided_pvalue(d: f64, n: f64) -> Result<f64, KSTestError> {
this one would be gated
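The gating suggested in these review comments could look roughly like the sketch below. It is a simplified illustration, not the PR's code: `TwoSidedExact` and the `nalgebra` feature name come from the thread, while `TwoSidedAsymptotic`, `exact_path`, and `run` are hypothetical names, and the function bodies are placeholders. It also assumes `nalgebra` is declared as an optional dependency in `Cargo.toml`, which by default gives the crate a feature of the same name.

```rust
/// Which p-value computation to use. `TwoSidedExact` is the variant from the
/// PR; `TwoSidedAsymptotic` is a hypothetical always-available variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PValueMethod {
    TwoSidedAsymptotic,
    /// Only exists when the `nalgebra` feature is enabled.
    #[cfg(feature = "nalgebra")]
    TwoSidedExact,
}

/// Placeholder for the matrix-based exact computation (hypothetical body).
/// The whole function disappears when the feature is off.
#[cfg(feature = "nalgebra")]
fn exact_path(m: usize) -> &'static str {
    // The import sits next to its only use, so it is gated with the function.
    use nalgebra::DMatrix;
    let _mm = DMatrix::<f64>::zeros(m, m);
    "exact"
}

/// Dispatches on the method. The match arm for the gated variant is gated
/// too, so every step of the code path compiles out together.
#[allow(unused_variables)]
pub fn run(method: PValueMethod, m: usize) -> &'static str {
    match method {
        PValueMethod::TwoSidedAsymptotic => "asymptotic",
        #[cfg(feature = "nalgebra")]
        PValueMethod::TwoSidedExact => exact_path(m),
    }
}
```

Built without the feature, `run(PValueMethod::TwoSidedAsymptotic, 4)` still compiles and works, and `PValueMethod::TwoSidedExact` simply does not exist, which is the behavior the review is asking for.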
src/stats_tests/ks_test.rs
Outdated
let mut mm = DMatrix::<f64>::zeros(m, m);

// PERF: definitely a better way to fill the matrix. Also could cache the
There's certainly another way to fill this. I think readability would improve significantly, but I doubt performance would. I don't see any overwrites in my review. Some of the evaluations could use SIMD, though, since you have a few instances of "set the first column by this expression" and "set these antidiagonals by this expression", so it would be interesting to see how large the gains could be.
Bigger gains could come from improving how a single matrix element of M^n is calculated, where M=mm, especially where the number of bins is large (matmul is O(N^3)), as nalgebra is primarily targeted at graphics. faer may be a better choice for this, or nalgebra-lapack to connect to battle-tested Fortran. But let's keep that decision out of scope here. For computing a matrix element:
If u is the unit vector along dimension k, then matrix element k,k of M^n is u^T M^n u. For even n this can be written as (u^T M^(n/2)) (M^(n/2) u), where n/2 is floor division,
and for odd n, we could raise M to the floor of half the power and add an explicit multiply by M, giving (u^T M^(n/2)) M (M^(n/2) u), again with n/2 being floor division.
Diagonalization would be useful here as well. I'm unsure at what size the tradeoff favors n/2 matmuls versus diagonalizing, since both are O(N^3), but matmul seems like less bookkeeping to parallelize.
And the factorial function is cached prior, so you're good there for less than 170!
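A dependency-free sketch of the half-power idea from this comment is given below. It is not the PR's implementation: the PR uses nalgebra's `DMatrix`, whereas this uses a plain `Vec<Vec<f64>>`, and the helper names (`matmul`, `mat_pow`, `dot`, `diag_element_of_power`) are hypothetical. Computing the half power by repeated squaring is also an assumption of this sketch, since the thread does not say how M^(n/2) is formed.

```rust
type Matrix = Vec<Vec<f64>>;

/// Plain O(N^3) product of two square matrices of the same size.
fn matmul(a: &Matrix, b: &Matrix) -> Matrix {
    let m = a.len();
    let mut out = vec![vec![0.0; m]; m];
    for i in 0..m {
        for j in 0..m {
            for l in 0..m {
                out[i][j] += a[i][l] * b[l][j];
            }
        }
    }
    out
}

/// base^exp by repeated squaring (an assumption of this sketch).
fn mat_pow(base: &Matrix, mut exp: u32) -> Matrix {
    let m = base.len();
    // Start from the identity matrix.
    let mut result: Matrix = (0..m)
        .map(|i| (0..m).map(|j| if i == j { 1.0 } else { 0.0 }).collect())
        .collect();
    let mut sq = base.clone();
    while exp > 0 {
        if exp & 1 == 1 {
            result = matmul(&result, &sq);
        }
        exp >>= 1;
        if exp > 0 {
            sq = matmul(&sq, &sq);
        }
    }
    result
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum::<f64>()
}

/// Element (k, k) of mm^n without ever forming mm^n itself.
fn diag_element_of_power(mm: &Matrix, n: u32, k: usize) -> f64 {
    let half = mat_pow(mm, n / 2); // floor division
    let row: Vec<f64> = half[k].clone(); // u^T M^(n/2)
    let mut col: Vec<f64> = half.iter().map(|r| r[k]).collect(); // M^(n/2) u
    if n % 2 == 1 {
        // Odd n: one extra matrix-vector multiply by M.
        col = mm.iter().map(|r| dot(r, &col)).collect::<Vec<f64>>();
    }
    dot(&row, &col)
}
```

The last step replaces a full O(N^3) product with O(N^2) vector work. As a sanity check, for `[[1, 1], [1, 0]]` the powers contain Fibonacci numbers, so element (0, 0) of the fifth power should be 8.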
Implemented. Thanks for the suggestion. Some quick benchmarking below
| Case | Original | Updated |
|---|---|---|
| d=0.15; n=8.0 | 638ns | 521ns |
| d=0.15; n=135.0 | 928μs | 60μs |
| d=0.62; n=8.0 | 4.4μs | 3.0μs |
| d=0.62; n=135.0 | 44.3ms | 2.7ms |
KS test one-sample method requires the nalgebra feature to do matrix multiplication
improved process to find the k, k matrix element after raising the matrix to a power.
The test failure is related to something we're fixing in #325, so this is good to go. Thanks for the quick bench. If you've got that runner available, would you be able to share it? I'd use it to test if another optimization is significantly worthwhile.
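It was just some basic code I pulled together with criterion, so I'm not sure how helpful it is, but see below:
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// `original` and `better_matmul` (not shown) are the two exact p-value implementations being compared.
fn bench_exacts(c: &mut Criterion) {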
let mut group = c.benchmark_group("Exact");
for i in [(0.15, 8.0), (0.15, 135.0), (0.62, 8.0), (0.62, 135.0)].iter() {
group.bench_with_input(
BenchmarkId::new("Original", format!("{:?}", i)),
i,
|b, (d, n)| b.iter(|| original(*d, *n)),
);
group.bench_with_input(
BenchmarkId::new("Alternative", format!("{:?}", i)),
i,
|b, (d, n)| b.iter(|| better_matmul(*d, *n)),
);
}
group.finish();
}
criterion_group!(benches, bench_exacts);
criterion_main!(benches);