Skip to content

Applying the deviance calculation for feature selection in gene expression modality #3

@Arshammik

Description

@Arshammik

Note

This might be a bad idea and will not work as well as Seurat's implementation for filtering high-variable genes since it uses standardized variance instead of deviance. We used the previous code to calculate the deviance in the splicing modality with minimal changes. The motivation of using deviance using the intercept-only method comes from this paper Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model.

We can start with the maths in negative binomial models.
In negative binomial distribution, variance is related to the mean $\mu$ and dispersion parameter $\theta$, which gives:

$$Var(Y) = \mu + \frac{\hat\mu^2}{\theta}$$

Rearranging the variance formula gives:

$$\hat\theta = \frac{\hat\mu^2}{Var(Y)-\hat\mu}$$

If the empirical variance ≤ μ (no overdispersion), θ is set to a large value (≈ Poisson model).

    // Estimate theta using: theta = mu_hat^2 / (variance - mu_hat) if variance > mu_hat.
    double theta_est;
    if (variance > mu_hat) {
      theta_est = (mu_hat * mu_hat) / (variance - mu_hat);
    } else {
      // If variance is not greater than mean, approximate Poisson behavior.
      theta_est = 1e12;  // large number to approximate theta -> infinity
    }
$$\hat\mu = \frac{\Sigma(y)}{n}$$
double mu_hat = sum_y / n_cols;  // Fitted mean under the intercept-only model (MLE).

For a random variable $Y$, the variance defined as:

$$Var(Y) = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2$$

which can be define as:

$$Var(Y) = \frac{\Sigma(y^2)}{n} - {\mu} ^2$$
// Estimate the variance as (mean of squares) - (square of the mean).
double variance = (sum_y2 / n_cols) - (mu_hat * mu_hat);

The per-observation deviance for NB is:

$$d(y,\hat\mu) = 2\left[ ylog(\frac{y}{\hat\mu}) - (y + \hat\theta)log(\frac{y+\hat\theta}{\hat\mu +\hat \theta}) \right]$$

Summed over all cells to get the total deviance per gene. The implementation in code comes as follows:

    double dev_row = 0.0;
    // Second pass: compute deviance contribution from each cell.
    for (int j = 0; j < n_cols; j++){
      double y = gene_expression(i, j);
      double term = 0.0;
      
      // Add y * log(y/mu_hat); define 0*log(0) as 0.
      if (y > 0)
        term = y * std::log(y / mu_hat);
      
      // Subtract (y + theta_est)*log((y + theta_est)/(mu_hat + theta_est)).
      term -= (y + theta_est) * std::log((y + theta_est) / (mu_hat + theta_est));
      
      // Multiply by 2 as per the deviance definition.
      dev_row += 2 * term;
    }
    
    dev[i] = dev_row;
  }

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions