LAMAR

A Foundation Language Model for RNA Regulation

This repository contains the code and links to the pre-trained weights for the RNA foundation language model LAMAR. LAMAR outperformed benchmark models on various RNA regulation tasks, helping to decipher the rules of RNA regulation.

LAMAR was developed by Rnasys Lab and Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health (SINH), Chinese Academy of Sciences (CAS).

Citation

https://www.biorxiv.org/content/10.1101/2024.10.12.617732v2

Create environment

The environment can be created with LAMAR_requirements.txt.

git clone https://github.com/zhw-e8/LAMAR.git
cd ./LAMAR

conda create -n lamar python==3.11
conda activate lamar
pip install -r LAMAR_requirements.txt

Pretraining was conducted on A800 80GB GPUs, and fine-tuning was conducted on Sugon Z-100 16GB and Tesla V100 32GB GPU clusters.
The environments differ slightly between devices, so the device-specific environment files are listed below. A unified environment is now provided as well.
Pretraining environment:
A800: environment_A800_pretrain.yml
Fine-tuning environment:
Sugon Z-100: environment_Z100_finetune.yml
V100 (ppc64le): environment_V100_finetune.yml

Required packages

accelerate >= 0.26.1
torch >= 1.13
transformers >= 4.32.1
datasets >= 2.12.0
pandas >= 2.0.3
safetensors >= 0.4.1
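
As a convenience, the minimum versions above can be checked against the active environment. The following is a minimal sketch using only the Python standard library; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are hypothetical and not part of LAMAR, and the version comparison is deliberately simple (it ignores pre-release and local-build suffixes).

```python
from importlib import metadata

# Minimum versions copied from the "Required packages" list.
MIN_VERSIONS = {
    "accelerate": "0.26.1",
    "torch": "1.13",
    "transformers": "4.32.1",
    "datasets": "2.12.0",
    "pandas": "2.0.3",
    "safetensors": "0.4.1",
}


def version_tuple(version):
    """Turn '2.1.0+cu118' into (2, 1, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    n = max(len(a), len(b))
    a += (0,) * (n - len(a))
    b += (0,) * (n - len(b))
    return a >= b


def check_environment(min_versions=MIN_VERSIONS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in min_versions.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        print(f"{name:<14}{installed or 'missing':<16}{'ok' if ok else 'CHECK'}")
```

Run inside the activated `lamar` environment, the script prints one line per package; any line marked `CHECK` points to a missing or outdated package.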

Usage

Install package

After creating the environment, install the LAMAR package from the repository root:

pip install .

Download pretrained weights

The pretrained weights can be downloaded from https://huggingface.co/zhw-e8/LAMAR/tree/main.
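
Instead of downloading files through the browser, the repository can also be fetched programmatically. The sketch below assumes the `huggingface_hub` package is available (it is installed as a dependency of `transformers`) and uses its `snapshot_download` function; the function name `download_lamar_files` and the default target directory `lamar_pretrained` are hypothetical choices for illustration.

```python
def download_lamar_files(local_dir="lamar_pretrained", repo_id="zhw-e8/LAMAR"):
    """Download the Hugging Face repository into `local_dir` and return its path.

    `repo_id` is taken from the URL above; `local_dir` is an arbitrary
    local folder name chosen for this example.
    """
    # Imported lazily so that defining the helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example (requires network access):
# weights_dir = download_lamar_files()
# print(weights_dir)
```

The returned directory can then be used as the local location of the weights when loading the model.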

Compute embeddings

Notice: The tokenizer, config and pretrained weights must be loaded from local files. We recommend specifying absolute paths, or making sure that relative paths resolve correctly from your working directory.
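
One way to avoid path mistakes is to resolve every relative path against a known root directory and fail early if something is missing. The following is a small sketch using `pathlib`; the helper `resolve_local_paths` is hypothetical and not part of the LAMAR package, and the root directory in the commented example is a placeholder.

```python
from pathlib import Path


def resolve_local_paths(root, **relative_paths):
    """Resolve each relative path against `root` and verify that it exists.

    Returns a dict of absolute path strings; raises FileNotFoundError
    listing every missing entry.
    """
    root = Path(root).expanduser().resolve()
    resolved = {name: root / rel for name, rel in relative_paths.items()}
    missing = [f"{name}: {path}" for name, path in resolved.items() if not path.exists()]
    if missing:
        raise FileNotFoundError("Missing local files:\n" + "\n".join(missing))
    return {name: str(path) for name, path in resolved.items()}


# Example (the root directory is a placeholder for your local clone):
# paths = resolve_local_paths(
#     "/path/to/LAMAR",
#     tokenizer="tokenizer/single_nucleotide",
#     config="config/config_150M.json",
# )
# paths["tokenizer"] and paths["config"] can then be passed to from_pretrained().
```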

from LAMAR.modeling_nucESM2 import EsmModel
from transformers import AutoConfig, AutoTokenizer
from safetensors.torch import load_file
import torch


seq = "ATACGATGCTAGCTAGTGACTAGCTGATCGTAGCTG"
model_max_length = 1026
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Instantiate the tokenizer and config
tokenizer = AutoTokenizer.from_pretrained("tokenizer/single_nucleotide/", model_max_length=model_max_length)
config = AutoConfig.from_pretrained(
    "config/config_150M.json", vocab_size=len(tokenizer), pad_token_id=tokenizer.pad_token_id,
    mask_token_id=tokenizer.mask_token_id, token_dropout=False, positional_embedding_type='rotary', 
    hidden_size=768, intermediate_size=3072, num_attention_heads=12, num_hidden_layers=12
)
# Instantiate the model and load the pretrained weights
model = EsmModel(config)
weights = load_file('pretrain/saving_model/mammalian80D_4096len1mer1sw_80M/checkpoint-250000/model.safetensors')
weights_dict = {}
for k, v in weights.items():
    new_k = k.replace('esm.', '') if 'esm' in k else k
    weights_dict[new_k] = v
model.load_state_dict(weights_dict, strict=False)
model = model.to(device)
# Compute embeddings
model.eval()
with torch.no_grad():
    inputs = tokenizer(seq, return_tensors="pt")
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    outputs = model(
        input_ids=input_ids, 
        attention_mask=attention_mask
    )
    # Drop the first and last positions (assumed to be the special tokens added by the tokenizer)
    embedding = outputs.last_hidden_state[0, 1 : -1, :]
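
The key-renaming loop in the example above can be factored into a small reusable helper. The sketch below is a slightly stricter variant that removes the `esm.` prefix only when a key starts with it, rather than replacing the substring wherever it occurs; the function name `strip_prefix` is hypothetical, and the keys in the toy dictionary are illustrative placeholders rather than a listing of the real checkpoint.

```python
def strip_prefix(state_dict, prefix="esm."):
    """Return a new dict with `prefix` removed from the start of matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy example with placeholder values instead of tensors.
example = {"esm.embeddings.word_embeddings.weight": 1, "lm_head.bias": 2}
renamed = strip_prefix(example)
print(renamed)
```

Because the weights are loaded with `strict=False`, mismatched keys are skipped silently. `load_state_dict` returns an object with `missing_keys` and `unexpected_keys` attributes, so printing both after loading is a quick way to confirm that the renaming matched the model's parameter names.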

In our paper, we compared the embeddings of nucleotides, functional elements and transcripts from pretrained and untrained LAMAR. The scripts are located at the following paths:

Compute embeddings of nucleotides: src/embedding/NucleotideEmbeddingMultipleTimes.ipynb
Compute embeddings of functional elements: src/embedding/FunctionalElementEmbedding.ipynb
Compute embeddings of transcripts: src/embedding/RNAEmbedding.ipynb
Compute embeddings of splice sites: src/embedding/SpliceSiteEmbedding.ipynb

Predict splice sites from pre-mRNA sequences

The paths of scripts:

Tokenization: src/SpliceSitePred/tokenize_data.ipynb
Fine-tune: src/SpliceSitePred/finetune.ipynb
Evaluate: src/SpliceSitePred/evaluation.ipynb

Predict the translation efficiencies of mRNAs based on 5' UTRs (HEK293 cell line)

The paths of scripts:

Tokenization: src/UTR5TEPred/tokenize_data.ipynb
Fine-tune: src/UTR5TEPred/finetune.ipynb
Evaluate: src/UTR5TEPred/evaluate.ipynb

Annotate the IRES

The paths of scripts:

Tokenization: src/IRESPred/tokenize_data.ipynb
Fine-tune: src/IRESPred/finetune.ipynb
Evaluate: src/IRESPred/evaluate.ipynb

Predict the half-lives of mRNAs based on 3' UTRs (BEAS-2B cell line)

The paths of scripts:

Tokenization: src/UTR3DegPred/tokenize_data.ipynb
Fine-tune: src/UTR3DegPred/finetune.ipynb
Evaluate: src/UTR3DegPred/evaluate.ipynb

Baseline methods

The performance of LAMAR was compared with that of baseline methods. The scripts are available at https://github.com/zhw-e8/LAMAR_baselines
