Applying the deviance calculation for feature selection in gene expression modality

> [!NOTE]
This might be a bad idea and will not work as well as Seurat's implementation for filtering high-variable genes since it uses **standardized variance** instead of deviance. We used the previous code to calculate the deviance in the splicing modality with minimal changes. The motivation of using deviance using the intercept-only method comes from this  paper [Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model](
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6#Tab).


We can start with the maths in negative binomial models.
In negative binomial distribution, variance is related to the mean $\mu$ and dispersion parameter $\theta$, which gives:
```math
Var(Y) = \mu + \frac{\hat\mu^2}{\theta} 
```
Rearranging the variance formula gives:
```math
\hat\theta = \frac{\hat\mu^2}{Var(Y)-\hat\mu}
```
If the empirical variance ≤ μ (no overdispersion), θ is set to a large value (≈ Poisson model).

```cpp
    // Estimate theta using: theta = mu_hat^2 / (variance - mu_hat) if variance > mu_hat.
    double theta_est;
    if (variance > mu_hat) {
      theta_est = (mu_hat * mu_hat) / (variance - mu_hat);
    } else {
      // If variance is not greater than mean, approximate Poisson behavior.
      theta_est = 1e12;  // large number to approximate theta -> infinity
    }
```


```math
\hat\mu = \frac{\Sigma(y)}{n} 
```
```cpp
double mu_hat = sum_y / n_cols;  // Fitted mean under the intercept-only model (MLE).
```



For a random variable $Y$, the variance defined as:
```math
Var(Y) = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2 
```
which can be define as:
```math
Var(Y) = \frac{\Sigma(y^2)}{n} - {\mu} ^2
```

```cpp
// Estimate the variance as (mean of squares) - (square of the mean).
double variance = (sum_y2 / n_cols) - (mu_hat * mu_hat);
```  

The per-observation deviance for NB is:
```math
d(y,\hat\mu) = 2\left[ ylog(\frac{y}{\hat\mu}) - (y + \hat\theta)log(\frac{y+\hat\theta}{\hat\mu +\hat \theta}) \right]
```
Summed over all cells to get the total deviance per gene. The implementation in code comes as follows:
```cpp
    double dev_row = 0.0;
    // Second pass: compute deviance contribution from each cell.
    for (int j = 0; j < n_cols; j++){
      double y = gene_expression(i, j);
      double term = 0.0;
      
      // Add y * log(y/mu_hat); define 0*log(0) as 0.
      if (y > 0)
        term = y * std::log(y / mu_hat);
      
      // Subtract (y + theta_est)*log((y + theta_est)/(mu_hat + theta_est)).
      term -= (y + theta_est) * std::log((y + theta_est) / (mu_hat + theta_est));
      
      // Multiply by 2 as per the deviance definition.
      dev_row += 2 * term;
    }
    
    dev[i] = dev_row;
  }
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Applying the deviance calculation for feature selection in gene expression modality #3

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Applying the deviance calculation for feature selection in gene expression modality #3

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions