Skip to content

feat(stats_tests): implement KS test#329

Merged
YeungOnion merged 4 commits intostatrs-dev:masterfrom
mdahlin:ks_test
Feb 24, 2025
Merged

feat(stats_tests): implement KS test#329
YeungOnion merged 4 commits intostatrs-dev:masterfrom
mdahlin:ks_test

Conversation

@mdahlin
Copy link
Contributor

@mdahlin mdahlin commented Feb 16, 2025

Implements versions of the one-sample and two-sample KS test

Implements versions of the one-sample and two-sample KS test
@codecov
Copy link

codecov bot commented Feb 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.72%. Comparing base (f4136d5) to head (fa6c3f5).
Report is 13 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #329      +/-   ##
==========================================
+ Coverage   94.26%   94.72%   +0.45%     
==========================================
  Files          58       60       +2     
  Lines       12943    14238    +1295     
==========================================
+ Hits        12201    13487    +1286     
- Misses        742      751       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mdahlin
Copy link
Contributor Author

mdahlin commented Feb 16, 2025

KSOneSampleAlternativeMethod::TwoSidedExact => {

My implementation to calculate the exact p-value for a one sample KS test requires the use of nalgebra to do some matrix multiplication. I was wondering if there were suggestions for how to add the feature gate. I'm not too familiar with feature gating, but the two ideas that came to my mind were to

  1. separate ks_onesample and ks_twosample and feature gate all of ks_onesample
  2. feature gate specifically the enum variant (though this seems like it could cause some issues)

@YeungOnion
Copy link
Contributor

Thanks for this!

Since feature gating is for conditional compilation, we won't be able to just bar the enum, since that's from a specific code path, but if you walked down that code path and feature gated the whole path that would compile with the feature flag disabled.

I'll use the review tools to make the suggestions so they're near to the code.

use core::f64;
use std::iter::zip;

use nalgebra::DMatrix;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either gate this or write the use statement near to usage

/// `n`s and will error with if there are ties in the input data or the input data is too
/// large. The threshold for too large is data with length 170 lining up with the
/// implementation of [`factorial`] being used.
TwoSidedExact,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gate this like you mentioned
I believe it can go before or after the docstring.

2.0 * sum
}

fn onesample_marsaglia_et_al_twosided_pvalue(d: f64, n: f64) -> Result<f64, KSTestError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one would be gated


let mut mm = DMatrix::<f64>::zeros(m, m);

// PERF: definitely a better way to fill the matrix. Also could cache the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's certainly another way to fill this, but I think the readability would improve significantly, but I don't think the performance would significantly. I don't see any overwrites in my review. But some of the evaluations could be SIMD as you do have a few instances of, "set the first column by this expressions" and "set these antidiagonals by this expression" so it would be interesting to see how much the gains could be.

Some bigger gains could be improving calculating a matrix element of M^n, where M=mm especially where the number of bins is large (matmul is O(N^3)) as nalgebra is primarily targeted for graphics. faer may be better choice for this or using nalgebra-lapack to connect to battle-tested fortran. But let's take that decision out of this scope. For computing a matrix element,

If we have the unit vector along dimension k to be u, then we can compute matrix element k,k as u^T M u, which could be expressed as below, where n/2 is floor division.

$$ u^T M^n u = \left(u^T M^{n/2}\right) \left(M^{n/2}u\right) = v^T v, \quad n \textrm{ mod } 2 == 0 $$

and for odd n, we could raise M to the floor of half the power and add the explicit multiply of M, again with n / 2 being floor divide,

$$ v^T \left(M^{n/2 + 1}u\right) $$

diagonalization would be useful here as well, unsure at what size the tradeoff will benefit to do n/2 matmul vs diagonalize since both are O(N^3), but matmul seems like less bookkeeping to parallelize.

And the factorial function is cached prior, so you're good there for less than 170!

Copy link
Contributor Author

@mdahlin mdahlin Feb 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented. Thanks for the suggestion. Some quick benchmarking below

Original Updated
d=0.15; n=8.0 638ns 521ns
d=0.15; n=135.0 928μs 60μs
d=0.62; n=8.0 4.4μs 3.0μs
d=0.62; n=135.0 44.3ms 2.7ms

KS test one-sample method requires the nalgebra feature to do matrix
multiplication
improved process to find the k, k matrix element after raising the
matrix to a power.
@YeungOnion
Copy link
Contributor

The test failure is related to something we're fixing in #325, so this is good to go. Thanks for the quick bench. If you've got that runner available, would you be able to share it? I'd use it to test if another optimization is significantly worthwhile.

@YeungOnion YeungOnion merged commit 7ef8595 into statrs-dev:master Feb 24, 2025
9 of 10 checks passed
@mdahlin
Copy link
Contributor Author

mdahlin commented Mar 1, 2025

If you've got that runner available, would you be able to share it?

It was just some basic code I pulled together with criterion, so not sure how helpful it is, but see below

fn bench_exacts(c: &mut Criterion) {
    let mut group = c.benchmark_group("Exact");
    for i in [(0.15, 8.0), (0.15, 135.0), (0.62, 8.0), (0.62, 135.0)].iter() {
        group.bench_with_input(
            BenchmarkId::new("Original", format!("{:?}", i)),
            i,
            |b, (d, n)| b.iter(|| original(*d, *n)),
        );
        group.bench_with_input(
            BenchmarkId::new("Alternative", format!("{:?}", i)),
            i,
            |b, (d, n)| b.iter(|| better_matmul(*d, *n)),
        );
    }
    group.finish();
}
criterion_group!(benches, bench_exacts);
criterion_main!(benches);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants