MPore

MPore is a pipeline for database-driven identification and activity assessment of methyltransferases in prokaryotic genomes using Oxford Nanopore R10 sequencing data.

MPore is designed to be user-friendly: it includes automated basecalling, motif extraction, statistical modeling, and visualization of methylation patterns.

For detailed methodology and benchmarking, see [Publication reference, DOI].

Features

Performs basecalling from POD5 files using Dorado for high-accuracy modification calls
Identifies candidate methyltransferases via homology search against REBASE using PROKKA and BLASTP
Generates genome-wide methylation signals from Nanopore sequencing data
Evaluates MTase activity using L1-regularized logistic regression
Produces visualizations of enzymes, their recognition motifs, and site-specific methylation patterns

External Tools

MPore relies on external tools that must be installed before running the pipeline.
Please ensure that the following software is available on your system.

Dorado

Dorado is required for basecalling and methylation detection.

Verify installation

Make sure Dorado is available in your PATH:

ls -l "$(command -v dorado)"

If this command does not return a valid path, Dorado is not correctly installed.

Model availability

If Dorado was installed using the self-contained release, ensure that the required methylation detection models are present in:

<dorado_directory>/lib

The following models must be available:

dna_r10.4.1_e8.2_400bps_hac@v5.0.0
dna_r10.4.1_e8.2_400bps_hac@v5.0.0_6mA@v1
dna_r10.4.1_e8.2_400bps_hac@v5.0.0_4mC_5mC@v1

If you installed Dorado as a container, you can locate your Dorado installation with:

find $HOME -type d -name "dorado-*"
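
Once the installation directory is known, the presence of the three required models can be checked with a small script. The following is a minimal sketch, not part of MPore or Dorado: missing_models is a hypothetical helper name, and it assumes that each model is stored as a directory named after the model inside the directory you pass in.

```shell
# Sketch: print the required Dorado models that are NOT found in a directory.
# missing_models is a hypothetical helper, not part of MPore or Dorado.
# Assumption: each model is stored as a directory named after the model.
missing_models() {
  model_dir=$1
  for model in \
    dna_r10.4.1_e8.2_400bps_hac@v5.0.0 \
    dna_r10.4.1_e8.2_400bps_hac@v5.0.0_6mA@v1 \
    dna_r10.4.1_e8.2_400bps_hac@v5.0.0_4mC_5mC@v1
  do
    # Report the model name only when its directory is absent.
    [ -d "$model_dir/$model" ] || echo "$model"
  done
}
```

Calling missing_models "<dorado_directory>/lib" prints nothing when all three models are present; any printed name is a model that still needs to be downloaded.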

Download models

To download all available models:

dorado download --model all

Alternatively, you can download only the required models:

dorado download --model \
dna_r10.4.1_e8.2_400bps_hac@v5.0.0 \
dna_r10.4.1_e8.2_400bps_hac@v5.0.0_6mA@v1 \
dna_r10.4.1_e8.2_400bps_hac@v5.0.0_4mC_5mC@v1

For detailed installation instructions, please refer to the official Dorado documentation.

Installation

Create a Conda environment
MPore is designed to run within a dedicated Conda environment.

conda create -c conda-forge -c bioconda -n Bacterial_context1 snakemake=8.20.5
conda activate Bacterial_context1
snakemake --help

Clone the MPore repository

git clone https://github.com/DiltheyLab/MPore.git
cd MPore

Initialization

Prepare the CSV file and user motif list
After installation, navigate to the MPore directory and create a CSV file containing the following columns:

File_name
Reference_path
pod5_path

The File_name column is used as the isolate identifier throughout the downstream analysis.

For visualization and compatibility purposes, it is recommended to:

avoid overly long File_name entries
avoid whitespace characters
use continuous, descriptive strings instead

An example input CSV file is shown below:

File_name,Reference_path,pod5_path
12256U,/home/azlan/Myco_Data/ref/12256U.fasta,/home/azlan/Myco_Data/pod5s/12256U
8958VA,/home/azlan/M_hominis/ref/8958VA.fasta,/home/azlan/Myco_Data/pod5s/8958VA

In this example, File_name corresponds to the isolates 12256U and 8958VA, each linked to their respective reference genome and POD5 directory.
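
Because a malformed sample sheet may only surface as an error late in a run, it can help to check the CSV before starting. The sketch below is not part of MPore: check_input_csv is a hypothetical helper, and treating duplicate File_name entries as a problem is an assumption based on File_name being used as the isolate identifier. It does not check whether the listed paths exist.

```shell
# Sketch: basic sanity checks for the MPore input CSV.
# check_input_csv is a hypothetical helper, not part of MPore.
# It prints one message per problem and prints nothing if no problem is found.
check_input_csv() {
  awk -F',' '
    NR == 1 {
      # The header must match the three documented column names.
      if ($0 != "File_name,Reference_path,pod5_path")
        print "line 1: unexpected header: " $0
      next
    }
    NF != 3      { print "line " NR ": expected 3 columns, found " NF; next }
    $1 ~ /[ \t]/ { print "line " NR ": File_name contains whitespace" }
    # Assumption: File_name identifies an isolate, so it should be unique.
    seen[$1]++   { print "line " NR ": duplicate File_name " $1 }
  ' "$1"
}
```

For example, check_input_csv Data_Test.csv produces no output for the two-isolate example above.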

Reference contig name formatting

It is strongly recommended to verify and, if necessary, adjust contig names in the reference FASTA files. Whitespace characters or additional annotations in contig headers can cause issues with several tools used in the pipeline. A simple check of contig headers can be performed using:

grep ">" 12256U.fasta

Example output:

>12256U Mycoplasma hominis, complete genome

A properly formatted contig header should contain only a single identifier.
An example of how to reformat the FASTA file is shown below:

zcat 12256U.fa.gz \
  | awk '/^>/{print $1; next} {print}' \
  > 12256U_formatted.fasta

Verification of the reformatted file:

grep ">" 12256U_formatted.fasta

Expected output:

>12256U
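
When several reference genomes are listed in the input CSV, the same header check can be applied to all of them at once. The following is a minimal sketch under the assumption that the references are uncompressed FASTA files; headers_needing_cleanup is a hypothetical helper and not part of MPore.

```shell
# Sketch: list FASTA headers that contain more than a single identifier.
# headers_needing_cleanup is a hypothetical helper, not part of MPore.
headers_needing_cleanup() {
  # A header line is flagged when it splits into more than one
  # whitespace-separated field, i.e. something follows the identifier.
  awk '/^>/ && NF > 1 { print FILENAME ": " $0 }' "$@"
}
```

Running headers_needing_cleanup on one or more FASTA files prints each offending header prefixed with its file name, and prints nothing if all headers consist of a single identifier.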

Motif list

In addition to the input CSV file, the user must provide a text file containing DNA motifs of interest, with one motif per line. If no specific motifs are of interest, a dummy motif file can be provided, for example:

GATC

In this case, GATC will be used as a placeholder motif. All motifs associated with candidate methyltransferases are included in the analysis in addition to the user-defined motifs.

Note: An empty motif file must not be used as input.
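
The two requirements above (a non-empty file with one motif per line) can be verified before running the pipeline. The sketch below is not part of MPore: check_motif_file is a hypothetical helper, and restricting motifs to uppercase IUPAC nucleotide codes is an assumption about the accepted alphabet.

```shell
# Sketch: verify that a motif file is non-empty and has one motif per line.
# check_motif_file is a hypothetical helper, not part of MPore.
# Assumption: motifs use uppercase IUPAC nucleotide codes only.
check_motif_file() {
  # -s is true only if the file exists and is not empty.
  if [ ! -s "$1" ]; then
    echo "motif file is missing or empty: $1"
    return 1
  fi
  # Print (with line numbers) every line that is blank or contains
  # characters outside the IUPAC nucleotide alphabet.
  if grep -n -v -E '^[ACGTRYSWKMBDHVN]+$' "$1"; then
    return 1
  fi
  return 0
}
```

The function returns a non-zero exit status if the file is unusable, so it can be combined with other commands, for example check_motif_file Motifs.txt && echo "motif file OK".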

Set up environment variables

Next, configure the required paths and runtime options via the config.yaml file located in the MPore directory. Navigate to the MPore folder and open the configuration file, for example using nano:

cd /MPore
nano config.yaml

The config.yaml file contains all variables required to run MPore and may look like the following:

input_csv: Data_Test.csv
output_dir: /mnt/azlan/Nanomotif_data/Outpu
dorado_path: /home/azlan/Tools/dorado-0.8.0-linux-x64
user_motif_list: Motifs.txt
tsv_data: TSV_Enzyme.csv
tsv_rebase_data: TSV_REBASE_data.tsv
split: true
log_analysis: true
mode: 1
heatmap: true
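
A quick way to catch a misspelled or missing entry is to compare the file against the keys shown in the example. This is a minimal sketch, not part of MPore: missing_config_keys is a hypothetical helper, and it assumes that all ten keys from the example are written as top-level, unindented entries.

```shell
# Sketch: print the keys from the example config.yaml that a file lacks.
# missing_config_keys is a hypothetical helper, not part of MPore.
# Assumption: every key is a top-level entry starting at column 1.
missing_config_keys() {
  for key in input_csv output_dir dorado_path user_motif_list \
             tsv_data tsv_rebase_data split log_analysis mode heatmap
  do
    # A key counts as present if a line starts with "<key>:".
    grep -q -E "^${key}:" "$1" || echo "$key"
  done
}
```

Calling missing_config_keys config.yaml prints nothing when every key from the example is present.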

Configuration parameters

Below is an overview and explanation of each configuration parameter:

input_csv
Path to the input CSV file created in Step 1.
output_dir
Directory where all MPore results will be written.
dorado_path
Path to the Dorado installation directory.
user_motif_list
Text file containing user-defined DNA motifs of interest.
tsv_data
REBASE-derived file listing methyltransferases, their recognition sites, and associated methylation types (should not be modified).
tsv_rebase_data
Concatenated REBASE file containing methyltransferases, recognition sites, and methylation types (should not be modified).
split
Enables a memory-efficient workflow at the cost of increased runtime. When enabled, intermediate and result files are split into smaller chunks.
log_analysis
Enables MPore’s statistical modeling and activity assessment for candidate methyltransferases.
mode
Analysis mode selection:
- mode: 1 (default): cross-isolate analysis, fitting a regularized logistic regression model across isolates
- mode: 2: isolate-specific analysis, fitting a regularized logistic regression model per isolate
heatmap
Enables heatmap generation for summary figures (see Step 5).

Recommended settings

It is recommended to enable:
log_analysis: true
to activate MPore’s activity assessment for candidate methyltransferases.
split: true
if available RAM is limited or unknown, to reduce memory usage at the expense of longer runtime.

Run the pipeline

MPore is executed using Snakemake.
The optional prefix /usr/bin/time -v -o snakemake_resource_usage.txt records the total runtime and resource usage of the workflow.

This prefix can be removed if runtime and memory profiling is not required.
For routine inspection and debugging, the log files located in MPore/logs are recommended, as they provide detailed information on when each individual pipeline step started and finished.

/usr/bin/time -v -o snakemake_resource_usage.txt \
snakemake -s Snakemake_entire_thing \
  --configfile config.yaml \
  --conda-frontend conda \
  --cores all \
  --use-conda \
  --resources dorado=1 \
  > snakemake_pipeline_out.txt \
  2> snakemake_pipeline_err.txt

Command options

-s Snakemake_entire_thing
Specifies the main Snakemake workflow file.
--configfile config.yaml
Loads all configuration variables defined in Step 2.
--conda-frontend conda
Uses Conda (instead of Mamba) to create and manage environments located in MPore/Environments.
--cores all
Allows Snakemake to use all available CPU cores.
--use-conda
Enables automatic creation and activation of rule-specific Conda environments.
--resources dorado=1
Limits Dorado execution to a single job at a time. This is recommended for GPU- and memory-intensive basecalling steps.
> snakemake_pipeline_out.txt
Redirects standard output (e.g. executed rules, echo statements) to a file.
2> snakemake_pipeline_err.txt
Redirects standard error messages to a separate file for debugging.

Logging and debugging

Snakemake log files are written to the MPore/logs directory.
These logs provide detailed, rule-specific execution information and are highly recommended for debugging and performance assessment.

Output

The output generated by MPore consists of the following files and directories:

BAM and BED files
Generated during Dorado basecalling
BED files contain site-specific methylation information, including coverage and the number of modified reads
PROKKA annotations
Predicted coding sequences (CDS) for each isolate
Annotation files are stored within the corresponding file_name output directories
BLASTP result files
Text files containing BLASTP alignment results of predicted CDS against the REBASE database
All_isolates_gene_loci.csv
Lists all enzymes with an e-value < 1e-25, including candidate methyltransferases used for downstream analyses and their gene loci
Beta_coef_p_values_{methyltype}.csv
Contains enzymes and their beta coefficient estimates derived from L1-regularized logistic regression. {methyltype} can be one of 4mC, 5mC, or 6mA
The file also includes the originating gene loci for each enzyme
Context_influence_{methyltype}.xlsx
Summarizes the influence of flanking genomic context on average genome-wide methylation levels
MTase_presence_e_25_values.csv
Provides an overview of identified methyltransferases (MTases) and their corresponding e-values across all analyzed isolates
Sample_DF_{file_name}_{methyltype}.csv
Summarizes all analyzed motifs (both user-defined and REBASE-derived), including average methylation scores per motif
Sample_DF_detailed_{file_name}_{methyltype}.csv
Contains per-site methylation scores for each analyzed motif
Plots_{methyltype}/
Directory containing boxplots comparing methylation scores across isolates
Associated enzyme information is displayed beneath each plot when applicable
Multipanel_plot_{file_name}.png
Combined visualization showing all relevant plots for a given isolate (see Section 5)
Heatmap_methylation_score_{context}.png
Heatmap summarizing the global methylation signal across motifs of identified methyltransferases for a given genomic context

Workflow and multipanel

This figure illustrates the overall workflow of MPore and the data structure used for L1-regularized logistic regression. The bar plot summarizes methyltransferase detection results for the benchmark dataset ([Link]). In addition, a representative multipanel plot generated by MPore for Mycoplasma hominis is shown. For a detailed description of the methodology and analysis workflow, please refer to the MPore application note ([Link]).

Contact: Azlan@uni-duesseldorf.de
