2M-EC_platform

(For subsequent updates, suggestions, or inquiries, please contact the author by email: oceanddl@sina.com)

2M-EC Platform Study Documentation

This EC (endometrial cancer) repository includes data preprocessing, model construction/validation, and 2M-EC platform deployment. The repository also incorporates our in-house developed tools for molecular matching and ROMA data collection.

Repository Structure

2M-EC_platform

Stores the trained models, data processors, and deployment programs. Includes 4 test samples for platform validation.

Gridsearch

Used to optimize analyte-algorithm combinations. The model training methodology follows the published literature (Osipov, A., et al. The Molecular Twin artificial-intelligence platform integrates multi-omic data to predict outcomes for pancreatic adenocarcinoma patients. Nat Cancer 5, 299-314 (2024)).
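
As a rough illustration of what optimizing analyte-algorithm combinations can look like, the sketch below enumerates every non-empty subset of analytes and pairs it with every algorithm, keeping the best-scoring pair. It is not the code in the Gridsearch folder. The analyte labels, the algorithm names, and the `evaluate` callback (which would typically return something like a cross-validated score) are all assumptions made for illustration.

```python
import itertools

# Hypothetical labels for the four profile types described in this README.
ANALYTES = ["cervical_metabolic", "uterine_metabolic",
            "plasma_metabolic", "plasma_peptidomic"]
# Hypothetical algorithm names; the repository's actual list may differ.
ALGORITHMS = ["logistic_regression", "random_forest", "svm"]


def analyte_subsets(analytes):
    """Yield every non-empty combination of analytes as a tuple."""
    for size in range(1, len(analytes) + 1):
        yield from itertools.combinations(analytes, size)


def grid_search(analytes, algorithms, evaluate):
    """Score every (analyte subset, algorithm) pair.

    `evaluate(subset, algorithm)` is a user-supplied placeholder that must
    return a number where higher is better.
    Returns (best_result, all_results), each result being
    (score, subset, algorithm).
    """
    results = []
    for subset in analyte_subsets(analytes):
        for algorithm in algorithms:
            results.append((evaluate(subset, algorithm), subset, algorithm))
    best = max(results, key=lambda r: r[0])
    return best, results
```

With four analytes and three algorithms this evaluates 15 x 3 = 45 combinations, so an exhaustive search stays cheap at this scale.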

modeling

Training, saving and validation of mass spectrometry models and fusion models.

ML (machine learning)

New sample data requires spectral preprocessing followed by binning.

Identification

Protein molecular weight calculation and matching.
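
For orientation, molecular weight calculation and matching can be sketched as follows. This is a minimal illustration, not the tool in the Identification folder. It assumes average (not monoisotopic) residue masses, singly protonated [M+H]+ ions, and an absolute tolerance in Da; the function names and the tolerance value are hypothetical.

```python
# Standard average residue masses in Da for the 20 common amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528   # one water is added for the free N- and C-termini
PROTON_MASS = 1.00728   # assumption: ions are singly protonated, [M+H]+


def average_mass(sequence):
    """Average molecular mass of an unmodified peptide or protein."""
    try:
        return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS
    except KeyError as exc:
        raise ValueError(f"unknown residue: {exc.args[0]}") from None


def match_peaks(observed_mz, candidates, tol_da=0.5):
    """Match observed m/z values to candidate sequences.

    `candidates` maps a name to an amino acid sequence.
    Returns a list of (observed m/z, name, theoretical m/z) tuples.
    """
    theoretical = {name: average_mass(seq) + PROTON_MASS
                   for name, seq in candidates.items()}
    matches = []
    for mz in observed_mz:
        for name, theo in theoretical.items():
            if abs(mz - theo) <= tol_da:
                matches.append((mz, name, theo))
    return matches
```

A relative tolerance in ppm is a common alternative to a fixed Da window, particularly across a wide mass range such as 3-30 kDa.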

ROMA.py

Web-based clinical cohort input and result collection for ROMA.

Maldi spectrum preprocessing.R

Prior to new sample prediction, spectrum data must undergo preprocessing and spectral binning, with the binning methodology following a published protocol (Weis, C., et al. Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning. Nat Med 28, 164-174 (2022)).

Dataset of EC

This repository provides all datasets supporting the project in four archives: Cervical2.zip, Uterine2.zip, Plasma-PM2.zip, and Plasma-PP2.zip. Each archive contains both training datasets for model development and independent validation datasets for performance evaluation, enabling assessment of the complete pipeline from training to validation. All files are accessible via Baidu Netdisk (https://pan.baidu.com/s/1Bxc0J5z0vab9ggsvZ2s_RQ?pwd=sf1b, extraction code: sf1b).

Database of 1160 raw multi-omics profiles from uterine secretions, cervical secretions and plasma acquired by MALDI-TOF mass spectrometry, with the processed dataset and clinical metadata

For each site, the data consist of MALDI-TOF mass spectra as .txt files and an aggregated metadata file (Metadata_CI.csv) containing the clinical information used for modeling and alignment.

The EC folder structure obtained after download is the following:

EC
├── Cervical
│   ├── raw_192CM
│   ├── M1/2
│   ├── TS1
│   └── vali_C
│
├── Uterine
│   ├── raw_246UM
│   ├── M2
│   ├── TS2
│   ├── NSMP
│   ├── p53
│   └── vali_U
│
├── Plasma
│   ├── raw_361PM
│   ├── raw_361PP
│   ├── M1/2 (PM)
│   ├── M1/2 (PP)
│   ├── TS1
│   ├── vali_PM
│   └── vali_PP
│
├── Metadata_CI.csv
│
└── README.md

Sites where MALDI-TOF MS profiles were collected

Fudan University, China

For details on dataset extraction and preprocessing, please refer to the Methods section of the article at https://www.nature.com/articles/s41591-021-01619-9.

Conversion to Dataset

Raw MALDI-TOF MS spectra were converted to .txt format using flexAnalysis software (Bruker, Germany). Spectral preprocessing was performed with the R packages MALDIquant and MALDIquantForeign, implementing: 1) square root transformation for variance stabilization, 2) Savitzky-Golay filtering (15-point window) for smoothing, and 3) the SNIP algorithm (20 iterations) for baseline correction. Following established protocols, we binned the mass-to-charge ratio (m/z) axis at sample-specific resolutions: 9,000 evenly distributed bins for uterine/cervical metabolic profiles (100-1,000 Da), 900 evenly distributed bins for plasma metabolic profiles (100-1,000 Da), and 675 evenly distributed bins for plasma peptidomic profiles (3-30 kDa). The binning approach was adjusted to the resolution of the mass spectrometer. All processed data retain intensity values normalized to the total intensity of the spectrum.
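
The steps above can be sketched in plain Python to make the arithmetic concrete. This is an illustrative approximation, not the repository's R pipeline: Savitzky-Golay smoothing is omitted, the SNIP routine is a simple decreasing-window variant, and summing intensities within each bin before normalizing is an assumption. All function names are hypothetical. The bin settings come from the text and correspond to bin widths of 0.1 Da, 1 Da, and 40 Da.

```python
import math

# (low m/z, high m/z, number of bins) as stated in the text; keys are hypothetical.
BIN_SETTINGS = {
    "uterine_cervical_metabolic": (100.0, 1000.0, 9000),
    "plasma_metabolic": (100.0, 1000.0, 900),
    "plasma_peptidomic": (3000.0, 30000.0, 675),
}


def sqrt_transform(intensity):
    """Square-root transform for variance stabilization."""
    return [math.sqrt(max(v, 0.0)) for v in intensity]


def snip_baseline(intensity, iterations=20):
    """Estimate the baseline with a simple SNIP variant (decreasing window)."""
    base = list(intensity)
    n = len(base)
    for p in range(iterations, 0, -1):
        prev = base[:]
        for i in range(p, n - p):
            # Clip each point to the mean of its neighbours at distance p.
            base[i] = min(prev[i], (prev[i - p] + prev[i + p]) / 2.0)
    return base


def remove_baseline(intensity, iterations=20):
    """Subtract the SNIP baseline, clamping negative values to zero."""
    base = snip_baseline(intensity, iterations)
    return [max(v - b, 0.0) for v, b in zip(intensity, base)]


def bin_spectrum(mz, intensity, low, high, n_bins):
    """Sum intensities into n_bins equal-width bins over [low, high]."""
    width = (high - low) / n_bins
    bins = [0.0] * n_bins
    for m, v in zip(mz, intensity):
        if m < low or m > high:
            continue  # peaks outside the mass range are dropped
        idx = min(int((m - low) / width), n_bins - 1)
        bins[idx] += v
    return bins


def tic_normalize(bins):
    """Scale the binned intensities so that they sum to 1."""
    total = sum(bins)
    return [v / total for v in bins] if total > 0 else list(bins)
```

A plasma metabolic spectrum would then be processed as `tic_normalize(bin_spectrum(mz, remove_baseline(sqrt_transform(intensity)), *BIN_SETTINGS["plasma_metabolic"]))`, which yields a 900-element feature vector.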

We recommend maldi-learn (https://github.com/BorgwardtLab/maldi-learn), a Python package for MALDI-TOF MS preprocessing and machine learning analysis, to load and analyse EC data.

The GitHub package comes with a detailed README.md file that covers installation and usage examples. The code tools for data processing can be found at https://github.com/lmsac/2M-EC.git.

A note on structure of EC Dataset

We implemented a stratified cohort design comprising a modeling cohort and an external test cohort. The modeling cohort (n=436) was partitioned into three M sub-cohorts: M1 (cervical and plasma, n=314), M2 (cervical, plasma and uterine, n=436), and M3 (uterine annotated with subtyping, n=77). The dataset processed from the raw spectra is uploaded in tabular form, stratified by cohort. The raw data for p53 and NSMP were obtained from the Uterine samples, and the corresponding processed dataset is also uploaded. The raw data files can be aligned with the tables by the Samples column.

The first column, "Samples," represents the clinical identification numbers of the patients, and "target" indicates the patient grouping, where 1 denotes the endometrial cancer group and 0 denotes the non-endometrial cancer group. Additionally, for the "p53" and "nsmp" groupings, 1 represents the specific molecular subtype group of the cancer, while 0 represents the other three coded molecular subtype groups combined.
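
As a minimal sketch of how a processed table with this layout might be read, the snippet below splits a CSV into sample identifiers, labels, and feature vectors, and aligns two sample lists by the Samples column. The feature column names (`bin_0`, `bin_1`), the toy rows, and the function names are placeholders; only the "Samples" and "target" columns come from the description above. Whether the subtype tables also name their label column "target" is not stated here, so it is left as a parameter.

```python
import csv
import io


def load_table(csv_text, id_column="Samples", label_column="target"):
    """Split a processed table into ids, labels, features, and feature names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    feature_names = [c for c in reader.fieldnames
                     if c not in (id_column, label_column)]
    samples, labels, features = [], [], []
    for row in reader:
        samples.append(row[id_column])
        labels.append(int(row[label_column]))  # 1 = EC, 0 = non-EC
        features.append([float(row[c]) for c in feature_names])
    return samples, labels, features, feature_names


def align_by_samples(samples_a, samples_b):
    """Return the ids present in both lists, in the order of samples_a."""
    in_b = set(samples_b)
    return [s for s in samples_a if s in in_b]


# Toy table with made-up ids and values, for illustration only.
TOY_TABLE = """Samples,target,bin_0,bin_1
S001,1,0.25,0.75
S002,0,0.60,0.40
"""

samples, labels, features, feature_names = load_table(TOY_TABLE)
```

For a file on disk, the same function can be fed with the text returned by `open(path, newline="").read()`.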
