MPore is a pipeline for database-driven identification and activity assessment of methyltransferases in prokaryotic genomes using Oxford Nanopore R10 sequencing data.
MPore is designed to be user-friendly: it includes automated basecalling, motif extraction, statistical modeling, and visualization of methylation patterns.
For detailed methodology and benchmarking, see [Publication reference, DOI].
- Performs basecalling from POD5 files using Dorado for high-accuracy modification calls
- Identifies candidate methyltransferases via homology search against REBASE using PROKKA and BLASTP
- Generates genome-wide methylation signals from Nanopore sequencing data
- Evaluates MTase activity using L1-regularized logistic regression
- Produces visualizations of enzymes, their recognition motifs, and site-specific methylation patterns
MPore relies on external tools that must be installed before running the pipeline.
Please ensure that the following software is available on your system.
Dorado is required for basecalling and methylation detection.
Make sure Dorado is available in your PATH:
ls -l $(command -v dorado)
If this command does not return a valid path, Dorado is not correctly installed.
If Dorado was installed using the self-contained release, ensure that the required methylation detection models are present in:
<dorado_directory>/lib
The following models must be available:
dna_r10.4.1_e8.2_400bps_hac@v5.0.0
dna_r10.4.1_e8.2_400bps_hac@v5.0.0_6mA@v1
dna_r10.4.1_e8.2_400bps_hac@v5.0.0_4mC_5mC@v1
If you installed Dorado as a container, you can locate your Dorado installation with:
find $HOME -type d -name "dorado-*"
To download all available models:
dorado download --model all
Alternatively, you can download only the required models:
dorado download --model \
dna_r10.4.1_e8.2_400bps_hac@v5.0.0 \
dna_r10.4.1_e8.2_400bps_hac@v5.0.0_6mA@v1 \
dna_r10.4.1_e8.2_400bps_hac@v5.0.0_4mC_5mC@v1
For detailed installation instructions, please refer to the official Dorado documentation.
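The model check described above can be scripted. The following is a minimal sketch, not part of MPore: the function name `check_dorado_models` is hypothetical, and it assumes that each model is stored as an entry named exactly like the model inside the directory you pass in.

```shell
# Hypothetical helper (not part of MPore or Dorado).
# Assumption: every model is an entry named after the model inside
# the given directory. Adjust the path to your installation.
check_dorado_models() {
    local model_dir="$1"
    local missing=0
    local model
    for model in \
        "dna_r10.4.1_e8.2_400bps_hac@v5.0.0" \
        "dna_r10.4.1_e8.2_400bps_hac@v5.0.0_6mA@v1" \
        "dna_r10.4.1_e8.2_400bps_hac@v5.0.0_4mC_5mC@v1"
    do
        if [ -e "$model_dir/$model" ]; then
            echo "found:   $model"
        else
            echo "missing: $model"
            missing=1
        fi
    done
    # Exit status 0 if all three models are present, 1 otherwise.
    return "$missing"
}
```

Call it with the model directory as the only argument, for example `check_dorado_models /path/to/dorado/lib`. A non-zero exit status means at least one model still has to be downloaded.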
- Create a Conda environment
MPore is designed to run within a dedicated Conda environment.
conda create -c conda-forge -c bioconda -n Bacterial_context1 snakemake=8.20.5
conda activate Bacterial_context1
snakemake --help
- Clone the MPore repository
git clone https://github.com/DiltheyLab/MPore.git
cd MPore
- Prepare the CSV file and user motif list
After installation, navigate to the MPore directory and create a CSV file containing the following columns:
- File_name
- Reference_path
- pod5_path
The File_name column is used as the isolate identifier throughout the downstream analysis.
For visualization and compatibility purposes, it is recommended to:
- avoid overly long File_name entries
- avoid whitespace characters
- use continuous, descriptive strings instead
An example input CSV file is shown below:
File_name,Reference_path,pod5_path
12256U,/home/azlan/Myco_Data/ref/12256U.fasta,/home/azlan/Myco_Data/pod5s/12256U
8958VA,/home/azlan/M_hominis/ref/8958VA.fasta,/home/azlan/Myco_Data/pod5s/8958VA
In this example, File_name corresponds to the isolates 12256U and 8958VA, each linked to its respective reference genome and POD5 directory.
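The recommendations for the input CSV can be checked before starting the pipeline. The sketch below is an illustration only: `validate_input_csv` is a hypothetical name, and the checks (exact header, three columns per row, no spaces or tabs in File_name) are an interpretation of the guidance above rather than MPore's own validation.

```shell
# Hypothetical helper (not part of MPore).
# Checks the header, the column count, and whitespace in File_name.
validate_input_csv() {
    local csv="$1"
    awk -F',' '
        NR == 1 {
            if ($0 != "File_name,Reference_path,pod5_path") {
                print "line 1: unexpected header: " $0
                bad = 1
            }
            next
        }
        $0 == "" { next }    # ignore blank lines
        NF != 3 {
            print "line " NR ": expected 3 columns, found " NF
            bad = 1
            next
        }
        $1 == "" || $1 ~ /[ \t]/ {
            print "line " NR ": File_name is empty or contains whitespace"
            bad = 1
        }
        END { exit bad + 0 }
    ' "$csv"
}
```

Running `validate_input_csv Data_Test.csv` prints one message per problem and returns a non-zero exit status if any row fails. It does not test whether the reference or POD5 paths exist.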
It is strongly recommended to verify and, if necessary, adjust contig names in the reference FASTA files. Whitespace characters or additional annotations in contig headers can cause issues with several tools used in the pipeline. A simple check of contig headers can be performed using:
grep ">" 12256U.fasta
Example output:
>12256U Mycoplasma hominis, complete genome
A properly formatted contig header should contain only a single identifier.
An example of how to reformat the FASTA file is shown below:
zcat 12256U.fa.gz \
  | awk '/^>/{print $1; next} {print}' \
  > 12256U_formatted.fasta
Verification of the reformatted file:
grep ">" 12256U_formatted.fasta
Expected output:
>12256U
In addition to the input CSV file, the user must provide a text file containing DNA motifs of interest, with one motif per line. If no specific motifs are of interest, a dummy motif file can be provided, for example:
GATC
In this case, GATC will be used as a placeholder motif.
All motifs associated with candidate methyltransferases are included in the analysis in addition to the user-defined motifs.
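A quick sanity check of the motif file can catch stray whitespace or empty lines. This is a sketch under assumptions: `check_motif_file` is a hypothetical name, and the rule that motifs consist only of upper-case IUPAC nucleotide codes is an assumption, since the accepted alphabet is not stated here.

```shell
# Hypothetical helper (not part of MPore).
# Assumption: a valid motif is one or more upper-case IUPAC
# nucleotide codes, one motif per line.
check_motif_file() {
    local motif_file="$1"
    if [ ! -s "$motif_file" ]; then
        echo "motif file is missing or empty: $motif_file" >&2
        return 1
    fi
    # -v selects lines that do NOT match, -n prefixes line numbers.
    if grep -n -v -E '^[ACGTRYSWKMBDHVN]+$' "$motif_file"; then
        echo "the lines listed above are not plain IUPAC motifs" >&2
        return 1
    fi
    return 0
}
```

For the dummy file shown above, `check_motif_file Motifs.txt` prints nothing and returns 0.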
- Set up environment variables
Next, configure the required paths and runtime options via the config.yaml file located in the MPore directory.
Navigate to the MPore folder and open the configuration file, for example using nano:
cd /MPore
nano config.yaml
The config.yaml file contains all variables required to run MPore and may look like the following:
input_csv: Data_Test.csv
output_dir: /mnt/azlan/Nanomotif_data/Outpu
dorado_path: /home/azlan/Tools/dorado-0.8.0-linux-x64
user_motif_list: Motifs.txt
tsv_data: TSV_Enzyme.csv
tsv_rebase_data: TSV_REBASE_data.tsv
split: true
log_analysis: true
mode: 1
heatmap: true
Below is an overview and explanation of each configuration parameter:
input_csv
Path to the input CSV file created in Step 1.
output_dir
Directory where all MPore results will be written.
dorado_path
Path to the Dorado installation directory.
user_motif_list
Text file containing user-defined DNA motifs of interest.
tsv_data
REBASE-derived file listing methyltransferases, their recognition sites, and associated methylation types (should not be modified).
tsv_rebase_data
Concatenated REBASE file containing methyltransferases, recognition sites, and methylation types (should not be modified).
split
Enables a memory-efficient workflow at the cost of increased runtime. When enabled, intermediate and result files are split into smaller chunks.
log_analysis
Enables MPore's statistical modeling and activity assessment for candidate methyltransferases.
mode
Analysis mode selection:
- mode: 1 (default) runs the cross-isolate analysis, fitting a regularized logistic regression model across isolates
- mode: 2 runs the isolate-specific analysis, fitting a regularized logistic regression model per isolate
heatmap
Enables heatmap generation for summary figures (see Step 5).
It is recommended to enable:
log_analysis: true
to activate MPore’s activity assessment for candidate methyltransferases.
split: true
if available RAM is limited or unknown, to reduce memory usage at the expense of longer runtime.
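Before launching the workflow, it can help to confirm that config.yaml still contains every key from the example configuration. The following sketch is not part of MPore: `check_config_keys` is a hypothetical name, and it only tests that each key appears at the start of a line, not that its value is valid.

```shell
# Hypothetical helper (not part of MPore).
# Reports keys from the example config.yaml that are absent.
check_config_keys() {
    local config="$1"
    local missing=0
    local key
    for key in input_csv output_dir dorado_path user_motif_list \
               tsv_data tsv_rebase_data split log_analysis mode heatmap
    do
        if ! grep -q -E "^${key}:" "$config"; then
            echo "missing key: $key"
            missing=1
        fi
    done
    # Exit status 0 if every key was found, 1 otherwise.
    return "$missing"
}
```

Use it as `check_config_keys config.yaml` from the MPore directory.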
MPore is executed using Snakemake.
The optional prefix /usr/bin/time -v -o snakemake_resource_usage.txt records the total runtime and resource usage of the workflow.
This prefix can be removed if runtime and memory profiling is not required.
For routine inspection and debugging, the log files located in MPore/logs are recommended, as they provide detailed information on when each individual pipeline step started and finished.
/usr/bin/time -v -o snakemake_resource_usage.txt \
snakemake -s Snakemake_entire_thing \
--configfile config.yaml \
--conda-frontend conda \
--cores all \
--use-conda \
--resources dorado=1 \
> snakemake_pipeline_out.txt \
2> snakemake_pipeline_err.txt
The individual parts of this command are:
-s Snakemake_entire_thing
Specifies the main Snakemake workflow file.
--configfile config.yaml
Loads all configuration variables defined in Step 2.
--conda-frontend conda
Uses Conda (instead of Mamba) to create and manage environments located in MPore/Environments.
--cores all
Allows Snakemake to use all available CPU cores.
--use-conda
Enables automatic creation and activation of rule-specific Conda environments.
--resources dorado=1
Limits Dorado execution to a single job at a time. This is recommended for GPU- and memory-intensive basecalling steps.
> snakemake_pipeline_out.txt
Redirects standard output (e.g. executed rules, echo statements) to a file.
2> snakemake_pipeline_err.txt
Redirects standard error messages to a separate file for debugging.
- Snakemake log files are written to the MPore/logs directory
- These logs provide detailed, rule-specific execution information and are highly recommended for debugging and performance assessment
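If the run was started with the /usr/bin/time -v prefix, the resulting report is long. The sketch below pulls out the two lines that are usually of most interest. `summarize_resource_usage` is a hypothetical name, and the matched labels assume the output format of GNU time in verbose mode.

```shell
# Hypothetical helper (not part of MPore).
# Assumption: the report was written by GNU time with -v, which
# labels wall-clock time and peak memory as matched below.
summarize_resource_usage() {
    local report="$1"
    grep -E 'Elapsed \(wall clock\) time|Maximum resident set size' "$report" \
        | sed 's/^[[:space:]]*//'    # strip the leading tab
}
```

For example, `summarize_resource_usage snakemake_resource_usage.txt` prints the elapsed wall-clock time and the maximum resident set size in kilobytes.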
The output generated by MPore consists of the following files and directories:
- BAM and BED files
  Generated during Dorado basecalling.
  BED files contain site-specific methylation information, including coverage and the number of modified reads.
- PROKKA annotations
  Predicted coding sequences (CDS) for each isolate.
  Annotation files are stored within the corresponding file_name output directories.
- BLASTP result files
  Text files containing BLASTP alignment results of predicted CDS against the REBASE database.
- All_isolates_gene_loci.csv
  Lists all enzymes with an e-value <1e-25, including candidate methyltransferases used for downstream analyses and their gene loci.
- Beta_coef_p_values_{methyltype}.csv
  Contains enzymes and their beta coefficient estimates derived from L1-regularized logistic regression.
  {methyltype} can be one of 4mC, 5mC, or 6mA.
  The file also includes the originating gene loci for each enzyme.
- Context_influence_{methyltype}.xlsx
  Summarizes the influence of flanking genomic context on average genome-wide methylation levels.
- MTase_presence_e_25_values.csv
  Provides an overview of identified methyltransferases (MTases) and their corresponding e-values across all analyzed isolates.
- Sample_DF_{file_name}_{methyltype}.csv
  Summarizes all analyzed motifs (both user-defined and REBASE-derived), including average methylation scores per motif.
- Sample_DF_detailed_{file_name}_{methyltype}.csv
  Contains per-site methylation scores for each analyzed motif.
- Plots_{methyltype}/
  Directory containing boxplots comparing methylation scores across isolates.
  Associated enzyme information is displayed beneath each plot when applicable.
- Multipanel_plot_{file_name}.png
  Combined visualization showing all relevant plots for a given isolate (see Section 5).
- Heatmap_methylation_score_{context}.png
  Heatmap summarizing the global methylation signal across motifs of identified methyltransferases for a given genomic context.
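To locate the regression result tables after a run, the output directory can be searched by file name. This is a minimal sketch with a hypothetical function name, `list_regression_outputs`. It searches recursively because the exact position of these files inside the output directory is not specified here.

```shell
# Hypothetical helper (not part of MPore).
# Lists Beta_coef_p_values_{methyltype}.csv for each methylation type.
list_regression_outputs() {
    local output_dir="$1"
    local methyltype
    for methyltype in 4mC 5mC 6mA; do
        # find searches recursively, so no directory layout is assumed.
        find "$output_dir" -type f -name "Beta_coef_p_values_${methyltype}.csv"
    done
}
```

Pass the directory configured as output_dir. Methylation types without a result file simply produce no line.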
- Workflow and multipanel
This figure illustrates the overall workflow of MPore and the data structure used for L1-regularized logistic regression. The bar plot summarizes methyltransferase detection results for the benchmark dataset ([Link]). In addition, a representative multipanel plot generated by MPore for Mycoplasma hominis is shown. For a detailed description of the methodology and analysis workflow, please refer to the MPore application note ([Link]).
Contact: Azlan@uni-duesseldorf.de