A Foundation Language Model for RNA Regulation
This repository contains the code for LAMAR, an RNA foundation language model, together with links to its pretrained weights. LAMAR outperformed benchmark models on a range of RNA regulation tasks, helping to decipher the rules of RNA regulation.
LAMAR was developed by Rnasys Lab and Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health (SINH), Chinese Academy of Sciences (CAS).

https://www.biorxiv.org/content/10.1101/2024.10.12.617732v2
The environment can be created with LAMAR_requirements.txt.
git clone https://github.com/zhw-e8/LAMAR.git
cd ./LAMAR
conda create -n lamar python==3.11
conda activate lamar
pip install -r LAMAR_requirements.txt
Pretraining was conducted on A800 80GB GPUs, and fine-tuning was conducted on Sugon Z-100 16GB and Tesla V100 32GB GPU clusters.
The environments differ slightly between devices, and a unified environment is now provided.
Pretraining environment:
A800: environment_A800_pretrain.yml
Fine-tuning environment:
Sugon Z-100: environment_Z100_finetune.yml
V100 (ppc64le): environment_V100_finetune.yml
accelerate >= 0.26.1
torch >= 1.13
transformers >= 4.32.1
datasets >= 2.12.0
pandas >= 2.0.3
safetensors >= 0.4.1
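The minimum versions listed above can be checked from Python before running any notebook. The following is a minimal sketch, not part of the LAMAR repository: the helper names `parse_version` and `check_requirements` are hypothetical, and the version parsing is deliberately simple (it compares only the leading numeric components and ignores pre-release and local build tags).

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
MIN_VERSIONS = {
    "accelerate": "0.26.1",
    "torch": "1.13",
    "transformers": "4.32.1",
    "datasets": "2.12.0",
    "pandas": "2.0.3",
    "safetensors": "0.4.1",
}


def parse_version(text, width=4):
    """Turn a version string such as '2.1.0+cu118' into a padded tuple of ints."""
    numbers = []
    # Drop any local build tag ('+cu118'), then read the leading digits of each part.
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    numbers = numbers[:width]
    return tuple(numbers + [0] * (width - len(numbers)))


def check_requirements(minimums=MIN_VERSIONS):
    """Return a {package: status} report for the installed distributions."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = f"ok ({installed})"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

Running the script inside the activated environment prints one status line per package, which makes a missing or outdated dependency visible before a long job starts.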
After creating the environment, the LAMAR package can be installed.
pip install .
The pretrained weights can be downloaded from https://huggingface.co/zhw-e8/LAMAR/tree/main.
Note: The tokenizer, config, and pretrained weights must be loaded from local files, so we encourage users to specify absolute paths or to make sure that the relative paths are correct.
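Because everything is loaded from local files, a wrong relative path is a likely failure. One way to catch it early is to resolve each path to an absolute one and check that it exists before handing it to `from_pretrained` or `load_file`. The sketch below is an illustration only: `resolve_local_path` is a hypothetical helper that does not exist in the LAMAR package, and `base_dir` is assumed to be the directory in which you cloned the repository or stored the downloaded files.

```python
from pathlib import Path


def resolve_local_path(path, base_dir=None):
    """Return an absolute path, raising early if nothing exists at that location."""
    candidate = Path(path).expanduser()
    if not candidate.is_absolute():
        # Relative paths are interpreted against base_dir (default: current directory).
        base = Path(base_dir) if base_dir is not None else Path.cwd()
        candidate = base / candidate
    candidate = candidate.resolve()
    if not candidate.exists():
        raise FileNotFoundError(f"Expected a local file or directory at {candidate}")
    return candidate
```

For example, `resolve_local_path("tokenizer/single_nucleotide/", base_dir=repo_dir)` returns an absolute path that can be passed on as `str(...)`, where `repo_dir` stands for your own checkout location.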
from LAMAR.modeling_nucESM2 import EsmModel
from transformers import AutoConfig, AutoTokenizer
from safetensors.torch import load_file, load_model
import torch
seq = "ATACGATGCTAGCTAGTGACTAGCTGATCGTAGCTG"
model_max_length = 1026
device = torch.device("cuda:0")
# Instantiate the tokenizer and config
tokenizer = AutoTokenizer.from_pretrained("tokenizer/single_nucleotide/", model_max_length=model_max_length)
config = AutoConfig.from_pretrained(
    "config/config_150M.json", vocab_size=len(tokenizer), pad_token_id=tokenizer.pad_token_id,
    mask_token_id=tokenizer.mask_token_id, token_dropout=False, positional_embedding_type='rotary',
    hidden_size=768, intermediate_size=3072, num_attention_heads=12, num_hidden_layers=12
)
# Instantiate the model and load the pretrained weights
model = EsmModel(config)
weights = load_file('pretrain/saving_model/mammalian80D_4096len1mer1sw_80M/checkpoint-250000/model.safetensors')
weights_dict = {}
for k, v in weights.items():
    # Drop the 'esm.' prefix from the checkpoint keys
    new_k = k.replace('esm.', '') if 'esm' in k else k
    weights_dict[new_k] = v
model.load_state_dict(weights_dict, strict=False)
model = model.to(device)
# Compute embeddings
model.eval()
with torch.no_grad():
    inputs = tokenizer(seq, return_tensors="pt")
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    # Keep the per-nucleotide embeddings, dropping the first and last token positions
    embedding = outputs.last_hidden_state[0, 1 : -1, :]
In our paper, we compared the embeddings of nucleotides, functional elements, and transcripts from pretrained and untrained LAMAR. The scripts are located at:
- Compute embeddings of nucleotides: src/embedding/NucleotideEmbeddingMultipleTimes.ipynb
- Compute embeddings of functional elements: src/embedding/FunctionalElementEmbedding.ipynb
- Compute embeddings of transcripts: src/embedding/RNAEmbedding.ipynb
- Compute embeddings of splice sites: src/embedding/SpliceSiteEmbedding.ipynb
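Transcripts are often longer than the `model_max_length` used in the embedding example. One simple way to handle this is to cut the sequence into non-overlapping windows that each fit the model. The sketch below is an assumption-laden illustration, not the procedure used in the notebooks above: `split_into_windows` is a hypothetical helper, and it assumes the tokenizer emits one token per nucleotide plus two special tokens (consistent with the `[1 : -1]` slice in the example).

```python
def split_into_windows(seq, model_max_length=1026, n_special_tokens=2):
    """Split a sequence into non-overlapping (start, chunk) windows.

    Each chunk holds at most model_max_length - n_special_tokens nucleotides,
    leaving room for the special tokens the tokenizer is assumed to add.
    """
    window = model_max_length - n_special_tokens
    if window <= 0:
        raise ValueError("model_max_length must exceed n_special_tokens")
    return [(start, seq[start:start + window]) for start in range(0, len(seq), window)]


if __name__ == "__main__":
    # A 2500-nt toy sequence yields windows of 1024, 1024 and 452 nucleotides.
    for start, chunk in split_into_windows("ACGU" * 625):
        print(start, len(chunk))
```

Each chunk can then be embedded separately and the per-nucleotide embeddings concatenated in order of `start`. Non-overlapping windows lose context at the boundaries, so overlapping windows may be preferable for some analyses.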
Scripts for the SpliceSitePred task:
- Tokenization: src/SpliceSitePred/tokenize_data.ipynb
- Fine-tune: src/SpliceSitePred/finetune.ipynb
- Evaluate: src/SpliceSitePred/evaluation.ipynb
Scripts for the UTR5TEPred task:
- Tokenization: src/UTR5TEPred/tokenize_data.ipynb
- Fine-tune: src/UTR5TEPred/finetune.ipynb
- Evaluate: src/UTR5TEPred/evaluate.ipynb
Scripts for the IRESPred task:
- Tokenization: src/IRESPred/tokenize_data.ipynb
- Fine-tune: src/IRESPred/finetune.ipynb
- Evaluate: src/IRESPred/evaluate.ipynb
Scripts for the UTR3DegPred task:
- Tokenization: src/UTR3DegPred/tokenize_data.ipynb
- Fine-tune: src/UTR3DegPred/finetune.ipynb
- Evaluate: src/UTR3DegPred/evaluate.ipynb
The performance of LAMAR was compared with that of baseline methods. The baseline scripts are available at https://github.com/zhw-e8/LAMAR_baselines