We can start with the maths in negative binomial models.
In negative binomial distribution, variance is related to the mean $\mu$ and dispersion parameter $\theta$, which gives:
$$Var(Y) = \mu + \frac{\hat\mu^2}{\theta}$$
Rearranging the variance formula gives:
$$\hat\theta = \frac{\hat\mu^2}{Var(Y)-\hat\mu}$$
If the empirical variance ≤ μ (no overdispersion), θ is set to a large value (≈ Poisson model).
// Estimate theta using: theta = mu_hat^2 / (variance - mu_hat) if variance > mu_hat.
double theta_est;
if (variance > mu_hat) {
theta_est = (mu_hat * mu_hat) / (variance - mu_hat);
} else {
// If variance is not greater than mean, approximate Poisson behavior.
theta_est = 1e12; // large number to approximate theta -> infinity
}
$$\hat\mu = \frac{\Sigma(y)}{n}$$
double mu_hat = sum_y / n_cols; // Fitted mean under the intercept-only model (MLE).
For a random variable $Y$, the variance defined as:
$$Var(Y) = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2$$
which can be define as:
$$Var(Y) = \frac{\Sigma(y^2)}{n} - {\mu} ^2$$
// Estimate the variance as (mean of squares) - (square of the mean).
double variance = (sum_y2 / n_cols) - (mu_hat * mu_hat);
The per-observation deviance for NB is:
$$d(y,\hat\mu) = 2\left[ ylog(\frac{y}{\hat\mu}) - (y + \hat\theta)log(\frac{y+\hat\theta}{\hat\mu +\hat \theta}) \right]$$
Summed over all cells to get the total deviance per gene. The implementation in code comes as follows:
double dev_row = 0.0;
// Second pass: compute deviance contribution from each cell.
for (int j = 0; j < n_cols; j++){
double y = gene_expression(i, j);
double term = 0.0;
// Add y * log(y/mu_hat); define 0*log(0) as 0.
if (y > 0)
term = y * std::log(y / mu_hat);
// Subtract (y + theta_est)*log((y + theta_est)/(mu_hat + theta_est)).
term -= (y + theta_est) * std::log((y + theta_est) / (mu_hat + theta_est));
// Multiply by 2 as per the deviance definition.
dev_row += 2 * term;
}
dev[i] = dev_row;
}
Note
This might be a bad idea and will not work as well as Seurat's implementation for filtering high-variable genes since it uses standardized variance instead of deviance. We used the previous code to calculate the deviance in the splicing modality with minimal changes. The motivation of using deviance using the intercept-only method comes from this paper Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model.
We can start with the maths in negative binomial models.$\mu$ and dispersion parameter $\theta$ , which gives:
In negative binomial distribution, variance is related to the mean
Rearranging the variance formula gives:
If the empirical variance ≤ μ (no overdispersion), θ is set to a large value (≈ Poisson model).
For a random variable$Y$ , the variance defined as:
which can be define as:
The per-observation deviance for NB is:
Summed over all cells to get the total deviance per gene. The implementation in code comes as follows: