nextflow andersenlab/nemascan-sim-nf --strainfile /path/to/strainfile --vcf /path/to/vcf -output-dir my-results
Mandatory argument (General):
--strainfile File A TSV file with two columns: the first is a name for the strain set and the second is a comma-separated strain list without spaces
--vcf File Generally a CaeNDR release date (i.e. 20231213). Can also provide a user-specified VCF with index in same folder
Optional arguments (General):
--nqtl File A CSV file with the number of QTL to simulate per phenotype, one value per line (Default is located: data/simulate_nqtl.csv)
--h2 File A CSV file with phenotype heritability, one value per line (Default is located: data/simulate_h2.csv)
--reps Integer The number of replicates to simulate per number of QTL and heritability (Default: 2)
--maf File A CSV file where each line is a minor allele frequency threshold to test for simulations (Default: data/simulate_maf.csv)
--effect File A CSV file where each line is an effect size range (e.g. 0.2-0.3) to test for simulations (Default: data/simulate_effect_sizes.csv)
--qtlloc File A BED file with three columns: chromosome name (numeric 1-6), start postion, end postion. The genomic range specified is where markers will be pulled from to simulate QTL (Default: null [which defaults to using the whole genome to randomly simulate a QTL])
--sthresh String Significance threshold for QTL - Options: BF - for bonferroni correction, EIGEN - for SNV eigen value correction, or another number e.g. 4
--group_qtl Integer If two QTL are less than this distance from each other, combine the QTL into one, (DEFAULT = 1000)
--ci_size Integer Number of SNVs to the left and right of the peak marker used to define the QTL confidence interval, (DEFAULT = 150)
--sparse_cut Decimal Any off-diagonal value in the genetic relatedness matrix greater than this is set to 0 (Default: 0.05)
--simulate_qtlloc Boolean Whether to simulate QTLs in specific genomic regions (Default: false)
-output-dir String Name of folder that will contain the results (Default: Simulations_{date})
The process PLINK_RECODE_VCF runs two plink commands.
First, it takes the VCF file and performs LD pruning with the --indep-pairwise 50 10 0.8 command. This outputs a set of plink files, of which is the plink.prune.in file used by the second plink command and used in the final step to create the markers.txt file used to generate the genotype matrix.
Other filtering parameters specified in the first command:
--snps-only:
--biallelic-only
--maf
--set-missing-var-ids
--geno
--not-chr: SPECIFIC TO FIRST PLINK COMMAND
All the files generated by the first plink command should be stored with the plink.prune.in prefix.
- Per the plink documentation the
--indep-pairwisecommand produces: "a pruned subset of of markers that are in appximate linkage equilibrium with each other, writing the IDs toplink.prune.in(and the IDs of all excluded variants toplink.prune.out). (Results may be slightly different from PLINK 1.07, due to a minor bugfix in the r2 computation when missing data is present, and more systematic handling of multicollinearity.) Output files are valid input for --extract/--exclude in a future PLINK run.
Second, the process runs a plink command to generate a set of plink files for downstream processing with the --recode command
Trait simulation involves the following sequential steps:
- Select causal variants.
- Simulate phenotypes based on these variants.
- Update PLINK fileset with simulated phenotypes.
Causal variants are chosen from the available marker set by the PYTHON_SIMULATE_EFFECTS_GLOBAL process, which executes the bin/create_causal_vars.py script.
This script takes the following inputs:
- A
.bimfile (PLINK binary marker information). - The desired number of causal variants (
nQTL). - The effect range, specified either as a numeric range (e.g.,
0.4-0.9) or asgamma.
The selection process involves two main steps:
- Variant Selection: The
select_variants()function randomly choosesnQTLvariants without replacement from the markers listed in the.bimfile. - Effect Size Assignment:
- If the effect range is
gamma, thesimulate_effect_gamma()function assigns effect sizes. These sizes are drawn from a gamma distribution (gamma(effect_shape=0.4, effect_scale=1.66)), and each variant is randomly assigned a direction (positive or negative effect, i.e.,1or-1). - If a numeric range (e.g.,
0.4-0.9) is provided, thesimulate_effect_uniform()function assigns effect sizes. These are drawn from a uniform distribution spanning the specifiedlow_endtohigh_end, and a direction (1or-1) is also randomly assigned.
- If the effect range is
The script outputs a file named causal_variants.txt in the designated output directory. This file lists the selected causal variant IDs and their assigned effect sizes.
For our example replicate 5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05 (assuming nQTL=5), the causal_variants.txt file is generated by the PYTHON_SIMULATE_EFFECTS_GLOBAL process. An example of this file, located at data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/PYTHON_SIMULATE_EFFECTS_GLOBAL/causal_variants.txt, would look like this:
// filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/PYTHON_SIMULATE_EFFECTS_GLOBAL/causal_variants.txt
2:14378543 0.38901223435837223
2:7459092 -4.532985639400811
3:1537674 -0.0063686130105886415
4:16212978 0.47805937866664927
2:13733845 -0.03046281285149475
The causal_variants.txt file (generated in the previous step) is used by the GCTA_SIMULATE_PHENOTYPES process. This process employs the gcta64 --simu-qt command to simulate quantitative traits based on the selected causal variants. (Refer to the GCTA GWAS Simulation documentation for more details).
Key parameters for gcta64 --simu-qt:
--simu-causal-loci: This parameter takes thecausal_variants.txtfile, which provides the SNP IDs and their effect sizes for the simulation.--simu-hsq: This specifies the target heritability (h²) of the trait. The value for this parameter is taken from the file provided to the--h2pipeline parameter.
This process generates two primary output files:
{prefix}.par: A par file with a header, detailing:QTL: SNP ID of the causal variant.RefAllele: Reference allele.Frequency: Allele frequency.Effect size: The effect size used in the simulation for that QTL.- For our example replicate
5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05(assuming nQTL=5, h2=0.2, MAF=0.05, effect=gamma, and strain setce.96.allout15_irrepressible.grosbeak), the GCTA simulation step produces a*.parfile. An example, named5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.parand located indata/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_SIMULATE_PHENOTYPES/, is shown below:// filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_SIMULATE_PHENOTYPES/5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.par QTL RefAllele Frequency Effect 2:7459092 A 0.0520833 -4.53299 2:13733845 A 0.125 -0.0304628 2:14378543 A 0.104167 0.389012 3:1537674 G 0.125 -0.00636861 4:16212978 T 0.0625 0.478059
{prefix}.phen: A phenotype file without a header, containing:- Column 1: Family ID.
- Column 2: Individual ID.
- Column 3: Simulated phenotype value.
- Similarly, the
*.phenfile, named5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.phenand found indata/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_SIMULATE_PHENOTYPES/, contains the simulated phenotype values: -
// filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_SIMULATE_PHENOTYPES/5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.phen MY2713 MY2713 -6.04369 JU323 JU323 11.1615 XZ1734 XZ1734 4.15316 QG4080 QG4080 9.589 DL226 DL226 -4.79251 ECA2533 ECA2533 8.34439 XZ1672 XZ1672 5.00596 ED3046 ED3046 -2.37417 NIC1786 NIC1786 11.8277 NIC274 NIC274 -22.5104
This step, handled by the PLINK_UPDATE_BY_H2 process, integrates the simulated phenotypes (from {prefix}.phen) into the primary PLINK fileset used for simulation (referred to as the "TO_SIMS" fileset). This associates the newly generated phenotype data with the existing genetic data for each individual, preparing it for downstream analyses such as association mapping.
The filtering parameters that are applied should remain consistent across all plink commands
After trait simulation, the pipeline maps simulated phenotypes using GCTA to evaluate GWA power and precision. The mapping phase involves constructing a genetic relatedness matrix (GRM), verifying phenotypic variance, and performing association mapping. These processes are parameterized by two variables defined in main.nf:
- mode:
"inbred"or"loco"— determines the type of GRM and the GWA algorithm - type:
"pca"or"nopca"— determines whether the first principal component eigenvector is included as a covariate
Each simulation replicate is mapped under all four combinations (inbred x pca, inbred x nopca, loco x pca, loco x nopca), driven by Nextflow channel combinations:
ch_mode = Channel.of(
["inbred", "fastGWA"],
["loco", "mlma"]
)
ch_type = Channel.of(
"pca",
"nopca"
)Source: modules/gcta/make_grm/main.nf
The GCTA_MAKE_GRM process performs two sequential GCTA operations: building the GRM and estimating variance via REML.
The GRM construction method depends on the mode parameter:
if [[ ${mode} == "inbred" ]]; then
GRM_OPTION="--make-grm-inbred"
else
GRM_OPTION="--make-grm"
fi
gcta64 --bfile TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group} \
--autosome --maf ${maf} ${GRM_OPTION} \
--out TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_gcta_grm_${mode} \
--thread-num ${task.cpus}mode == "inbred": Uses--make-grm-inbred, which constructs a kinship matrix tailored for populations of inbred organisms. The inbred GRM adjusts relatedness estimates to account for the reduced heterozygosity characteristic of inbred lines.mode == "loco": Uses--make-grm, which constructs a standard genome-wide relatedness matrix using all autosomal markers.
Both modes filter to autosomal markers (--autosome) above the minor allele frequency threshold (--maf). Because main.nf defines ch_mode with both ["inbred", "fastGWA"] and ["loco", "mlma"], each simulation replicate produces two GRMs — one inbred and one standard.
gcta64 --grm TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_gcta_grm_${mode} \
--pheno ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sims.phen \
--reml --out check_vp \
--thread-num ${task.cpus}This estimates the phenotypic variance (Vp) from the simulated phenotype data using restricted maximum likelihood (REML). (See GCTA GREML analysis documentation).
The output of the gcta64 --reml analysis is a plain text file with the *.hsq extension containing variance component estimates. For our example replicate 5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05, here's an example of its content:
// filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_MAKE_GRM/check_vp.hsq
Source Variance SE
V(G) 46.564277 40.308593
V(e) 158.873235 39.887704
Vp 205.437512 30.661982
V(G)/Vp 0.226659 0.185504
logL -301.643
logL0 -302.603
LRT 1.920
df 1
Pval 8.2935e-02
n 96
Outputs: GRM binary files (.grm.bin, .grm.N.bin, .grm.id), the REML estimates (check_vp.hsq), and all input PLINK/phenotype files passed through for downstream use.
Source: bin/check_vp.py
The PYTHON_CHECK_VP process inspects the estimated phenotypic variance (Vp) from the *.hsq file (produced by GCTA REML). This step ensures that the simulated phenotypes exhibit sufficient variance, preventing near-zero variance from causing numerical issues in the association mapping step.
Inputs to bin/check_vp.py:
- The
*.hsqfile (containing Vp estimates). - The current phenotype file (e.g.,
{prefix}.phenfrom the GCTA simulation step).
Script Logic:
- The script parses the
*.hsqfile to extract theVpvalue. - It then checks the
Vpagainst a threshold:- If
Vpis less than0.000001: The script scales up the phenotype values for all individuals by multiplying them by1000. This adjustment aims to increase the phenotypic variance. - If
Vpis greater than or equal to0.000001: The original phenotype values are retained without modification.
- If
- The script writes the (potentially modified) phenotype data to a temporary file named
new_phenos.temp. - Finally, this
new_phenos.tempfile is renamed to reflect the specific simulation parameters, following a pattern like${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sims.pheno. This becomes the final phenotype file for this simulation iteration.// filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/PYTHON_CHECK_VP/5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.pheno MY2713 MY2713 -6.04369 JU323 JU323 11.1615 XZ1734 XZ1734 4.15316 QG4080 QG4080 9.589
Source: modules/gcta/perform_gwa/main.nf
This process receives the GRM and variance-checked phenotype data, then performs three sub-steps controlled by the mode and type parameters.
This step always runs regardless of mode or type:
gcta64 --grm TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_gcta_grm_${mode} \
--make-bK-sparse ${sparse_cut} \
--out ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sparse_grm_${mode} \
--thread-num ${task.cpus}The --make-bK-sparse command sets all off-diagonal GRM values with absolute value less than or equal to sparse_cut (default: 0.05) to zero. This produces a sparse representation of the GRM.
The type parameter controls whether PCA eigenvectors are extracted as covariates:
if [[ ${type} == "pca" ]]; then
gcta64 --grm TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_gcta_grm_${mode} \
--pca 1 \
--out ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sparse_grm_${mode} \
--thread-num ${task.cpus}
COVAR="--qcovar ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sparse_grm_${mode}.eigenvec"
else
COVAR=""
fitype == "pca": Extracts the first principal component from the GRM (--pca 1), producing an.eigenvecfile. This eigenvector is passed as a quantitative covariate (--qcovar) to the mapping command to control for population structure.type == "nopca": No PCA is performed. TheCOVARvariable is set to an empty string, so no covariate is included in the mapping command.
The mode parameter determines both the GCTA mapping algorithm and the GRM input flag:
if [[ ${mode} == "inbred" ]]; then
GRM_OPTION='--grm-sparse'
COMMAND='--fastGWA-mlm-exact'
else
GRM_OPTION="--grm"
COMMAND="--mlma-loco"
fi
gcta64 ${COMMAND} \
--bfile TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group} \
${GRM_OPTION} ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sparse_grm_${mode} \
${COVAR} \
--out ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_lmm-exact_${mode}_${type} \
--pheno ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sims.pheno \
--maf ${maf} \
--thread-num ${task.cpus}Inbred mode (mode == "inbred"):
- Uses
--fastGWA-mlm-exactto perform an exact mixed linear model (MLM) association analysis - Uses
--grm-sparseto input the sparse GRM created from the inbred relatedness matrix - Output format:
.fastGWA(columns: CHR, SNP, POS, A1, A2, N, AF1, BETA, SE, P)
LOCO mode (mode == "loco"):
- Uses
--mlma-locoto perform MLM-based association using a leave-one-chromosome-out approach, where the GRM excludes the chromosome of the tested variant - Uses
--grmto input the sparse GRM created from the standard relatedness matrix - Output format:
.loco.mlma(columns: Chr, SNP, bp, A1, A2, Freq, b, se, p), renamed to.mlmaafter completion
if [[ ${mode} == "loco" ]]; then
mv "${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_lmm-exact_${mode}_${type}.loco.mlma" \
"${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_lmm-exact_${mode}_${type}.mlma"
fi| Mode | Type | GRM Construction | GWA Command | GRM Flag | Covariate | Output Format |
|---|---|---|---|---|---|---|
| inbred | pca | --make-grm-inbred |
--fastGWA-mlm-exact |
--grm-sparse |
--qcovar .eigenvec |
.fastGWA |
| inbred | nopca | --make-grm-inbred |
--fastGWA-mlm-exact |
--grm-sparse |
none | .fastGWA |
| loco | pca | --make-grm |
--mlma-loco |
--grm |
--qcovar .eigenvec |
.mlma |
| loco | nopca | --make-grm |
--mlma-loco |
--grm |
none | .mlma |
- QTL regions are defined with the NF process
R_GET_GCTA_INTERVALS - This process runs an Rscript
bin/Get_GCTA_Intervals.Rwhich defines QTL intervals from the mapping outputs ofGCTA_PERFORM_GWAand several pipeline parameters - Essentially the script takes the raw mapping results, applies significance criteria, groups significant markers into QTLs, defines confidence intervals for those QTL regions and estimates their effect size.
- The script has several processing steps
- Load the libraries and command line arguments given by the pipeline
- Load the input data
- Phenotype Data
- GCTA Mapping Data
- Genotype matrix
- Set the significance threshold
- The script accepts a commandline argument (argument #10) which specifies the significance threshold to be applied to markers.
- This argument can one of the following:
BF- Bonferroni thresholdEIGEN- Defined by the number of independent tests from Eigen decomposition of the genotype matrix- A user defined numeric value that is used as the threshold.
- The
--sthreshargument supplied to the pipeline sets the input for this argument to the process.- The default setting for the pipeline
--sthreshargument is theBFthreshold in thenextflow.configfile.
- The default setting for the pipeline
- Process Mapping Data w/
process_mapping_df()function
This is the core function of the script and performs many operations to define QTL intervals. The function returns the variable Processed which contains the original mapping data and these additional columns
strainvalueallelevar.expstartPOS: the starting position of the QTL intervalpeakPOS: The position of the peak marker of the QTL intervalendPOS: the end position of the QTL intervalpeak_id: the id of the QTL interval. Is1if there is just one QTL identified for the trait or2..Infif there are multiple QTL identified for the trait.interval_size: The number of bases spanned by the QTL interval
- Threshold application
- Step calculates the significance threshold and identifies marker SNPs exceeding that threshold.
- First the mapping df is grouped by trait (in the case that multiple mappings of different traits occurred)
- Depending on the threshold set, SNPs are flagged as being above
1or below0the significance threshold in a newly created columnaboveBF
- Note: The function uses an externally defined variable
QTL_cutoffwhich is not passed as an argument to the script. - The column to denote if a SNP is above the significance threshold is named
aboveBFregardless of the significance threshold that is applied. This is likely required so that the outputs have standard formatting for later processing steps.
- Step calculates the significance threshold and identifies marker SNPs exceeding that threshold.
- Filtering
- After applying the significance threshold to flag SNPs as either above (
1) or below (0) the significance threshold in the columnaboveBFthere are three possible next steps
- If more than 15% of the total SNPs are above the significance threshold all columns added by the mapping function
process_mapping_df()are set toNA. - If there are no significant SNPs the columns added by the
process_mapping_df()are also set toNA
- After applying the significance threshold to flag SNPs as either above (
- Variant effect calculation for significant SNPs
- This step adds the
var.expcolumn to the processed mapping result by correlating phenotype values with genotype values at the significant SNPs. - Uses pearsons correlation R2 between the phenotype values and allelic state (REF/ALT).
- This step adds the
- QTL interval definition
- Identifies the most significant SNP in a QTL region of interest The output is a processed dataframe containing the original mapping data augmented with QTL interval information (start, peak, end positions, peak ID, interval size, and Variance explained)
The process R_ASSESS_SIMS runs the Rscript Assess_Sims.R to evaluate the performance of GWAS simulations.
It loads the simulated trait outputs, mapping outputs, and a number of simulation pipeline parameters.
This final process outputs a simulation_assessment_results.tsv to the analysis directory. Each row represents a QTL simulated or Detected with the following columns:
QTL: Peak marker ID for the QTL intervalSimulated: TRUE/FALSE if the QTL was simulatedDetected: TRUE/FALSE if the QTL was detected in mappingCHROM: Chromosome of the QTL (numeric ID e.g., 1 = I, 2 = II, etc.)POS: Position of the markerRefAllele: Reference allele for the markerFrequency: Allele frequency of the markerEffect: Effect size of the markerSimulated.QTL.Var.Exp: Variance explained by the simulated QTLlog10p: -log10(p-value) of the marker from mappingaboveBF: TRUE/FALSE if the peak marker is above the significance threshold (seealgorithm_idcolumn)startPOS: Start position of the QTL intervalpeakPOS: Peak position of the QTL intervalendPOS: End position of the QTL intervaldetected.peak: TRUE/FALSE if the marker is the detected peak in mappinginterval.Frequency: Allele frequency of the peak marker in the QTL intervalBETA: Effect size estimate from mapping for the peak markerinterval.log10p: -log10(p-value) of the peak marker in the QTL intervalpeak_id: Numeric ID of the QTL interval (e.g 1, 2, ... n, where N is the total number of QTL detected)interval_size: Size of the QTL interval in base pairsinterval.var.exp: Variance explained by the peak marker in the QTL intervaltop.hit: TRUE/FALSE if the marker is the top hit in QTL intervalnQTL: Number of QTL simulated for the traitsimREP: Replicate number of the simulationh2: Heritability of the simulated traitmaf: Minor allele frequency threshold used in simulationeffect_distribution: Effect size range used in simulationstrain_set_id: Name of the strain set used in simulationalgorithm_id: Mapping method (Inbred, Loco, Inbred + PCA, LOCO + PCA) and significance threshold (e.g.inbred_pca_EIGEN, orinbred_pca_BF)