Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis
CARMANIA is a self-supervised genomic language model framework that augments next-token prediction with a transition-matrix regularization loss. The added loss aligns the model's predicted transitions with empirical bigram (2-mer) statistics, which improves biological sequence modeling and supports better long-range dependency modeling and functional interpretation.
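To make the "empirical bigram statistics" concrete, here is a minimal sketch of how a 4x4 nucleotide transition matrix can be estimated from a sequence and compared with a predicted one. This is an illustration only, not the CARMANIA implementation: the function names `bigram_transition_matrix` and `transition_mismatch` are hypothetical, and the mean squared difference is a stand-in for the regularization term, whose exact form is defined in the paper.

```python
NUCLEOTIDES = "ACGT"
INDEX = {base: i for i, base in enumerate(NUCLEOTIDES)}


def bigram_transition_matrix(sequence):
    """Row-normalized 4x4 matrix of empirical transition frequencies.

    Entry [i][j] estimates P(next base = j | current base = i).
    """
    counts = [[0] * 4 for _ in range(4)]
    for prev, nxt in zip(sequence, sequence[1:]):
        # Skip pairs containing ambiguous symbols such as N.
        if prev in INDEX and nxt in INDEX:
            counts[INDEX[prev]][INDEX[nxt]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observations stay all-zero instead of dividing by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def transition_mismatch(predicted, empirical):
    """Illustrative penalty (assumption): mean squared difference of entries."""
    diffs = [
        (p - e) ** 2
        for pred_row, emp_row in zip(predicted, empirical)
        for p, e in zip(pred_row, emp_row)
    ]
    return sum(diffs) / len(diffs)


empirical = bigram_transition_matrix("ACGTAGGCTA")
uniform = [[0.25] * 4 for _ in range(4)]  # a model with no transition preference
print(empirical)
print(transition_mismatch(uniform, empirical))
```

In a training setup, a penalty of this kind would be added to the next-token loss so that the transitions implied by the model's predictions stay close to those observed in the input.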
The following models are available on the Hugging Face Hub:
- 🦠🧬 MsAlEhR/carmania-big-10k-prok-genome
- 🦠🧬 MsAlEhR/carmania-4k-scp-gene-taxa
- 👤🧬 MsAlEhR/carmania-160k-seqlen-human
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained(
    "MsAlEhR/carmania-160k-seqlen-human",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # fixed dtype (or autocast)
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "MsAlEhR/carmania-160k-seqlen-human",
    trust_remote_code=True,
    model_max_length=160000,
)

inputs = tokenizer("ACGTAGGCTA", return_tensors="pt").to("cuda")
outputs = model(**inputs)

An experimental notebook exploring CARMANIA-driven sequence optimization using Enformer scores is now available.
This lightweight module perturbs input DNA sequences and uses Enformer’s predicted regulatory signals as a scoring function to iteratively generate variants with improved activity.
📄 Notebook: carmania_enformer_guided_generation.ipynb
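The perturb-and-score loop described above can be sketched roughly as follows. This is not the notebook's code: `mutate`, `optimize`, and `gc_content` are hypothetical names, random single-base substitutions stand in for whatever proposal mechanism the notebook uses, and `gc_content` is only a placeholder for a scoring function backed by Enformer predictions.

```python
import random

NUCLEOTIDES = "ACGT"


def mutate(sequence, rng):
    """Return a copy of `sequence` with one position changed to a different base."""
    pos = rng.randrange(len(sequence))
    choices = [b for b in NUCLEOTIDES if b != sequence[pos]]
    return sequence[:pos] + rng.choice(choices) + sequence[pos + 1:]


def optimize(sequence, score_fn, n_rounds=50, n_candidates=8, seed=0):
    """Greedy hill climbing: keep a variant only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = sequence, score_fn(sequence)
    for _ in range(n_rounds):
        # Propose several single-base variants of the current best sequence.
        candidates = [mutate(best, rng) for _ in range(n_candidates)]
        top = max(candidates, key=score_fn)
        top_score = score_fn(top)
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score


def gc_content(sequence):
    """Placeholder score (assumption); swap in a model-based scorer here."""
    return sum(base in "GC" for base in sequence) / len(sequence)


start = "ACGTAGGCTA"
best, score = optimize(start, gc_content)
print(start, gc_content(start))
print(best, score)
```

Because a variant is accepted only when it strictly improves the score, the returned score is never lower than that of the starting sequence. With an expensive scorer such as Enformer, `n_rounds` and `n_candidates` control how many model evaluations the search performs.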
@article{refahi2026context,
title={Context-aware regularization with markovian integration for attention-based nucleotide analysis},
author={Refahi, Mohammad Saleh and Abavisani, Mahdi and Sokhansanj, Bahrad and Brown, James R and Rosen, Gail},
journal={Advances in Neural Information Processing Systems},
volume={38},
pages={108518--108544},
year={2026}
}
