Skip to content

AndersenLab/nemascan-sims-nf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

180 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Usage

nextflow andersenlab/nemascan-sim-nf --strainfile /path/to/strainfile --vcf /path/to/vcf -output-dir my-results

Mandatory argument (General):

--strainfile      File               A TSV file with two columns: the first is a name for the strain set and the second is a comma-separated strain list without spaces
--vcf             File               Generally a CaeNDR release date (i.e. 20231213). Can also provide a user-specified VCF with index in same folder

Optional arguments (General):

--nqtl            File               A CSV file with the number of QTL to simulate per phenotype, one value per line (Default is located: data/simulate_nqtl.csv)
--h2              File               A CSV file with phenotype heritability, one value per line (Default is located: data/simulate_h2.csv)
--reps             Integer            The number of replicates to simulate per number of QTL and heritability (Default: 2)
--maf             File               A CSV file where each line is a minor allele frequency threshold to test for simulations (Default: data/simulate_maf.csv)
--effect          File               A CSV file where each line is an effect size range (e.g. 0.2-0.3) to test for simulations (Default: data/simulate_effect_sizes.csv)
--qtlloc          File               A BED file with three columns: chromosome name (numeric 1-6), start postion, end postion. The genomic range specified is where markers will be pulled from to simulate QTL (Default: null [which defaults to using the whole genome to randomly simulate a QTL])
--sthresh         String             Significance threshold for QTL - Options: BF - for bonferroni correction, EIGEN - for SNV eigen value correction, or another number e.g. 4
--group_qtl       Integer            If two QTL are less than this distance from each other, combine the QTL into one, (DEFAULT = 1000)
--ci_size         Integer            Number of SNVs to the left and right of the peak marker used to define the QTL confidence interval, (DEFAULT = 150)
--sparse_cut      Decimal            Any off-diagonal value in the genetic relatedness matrix greater than this is set to 0 (Default: 0.05)
--simulate_qtlloc Boolean            Whether to simulate QTLs in specific genomic regions (Default: false)
-output-dir       String             Name of folder that will contain the results (Default: Simulations_{date})

Simulations

Preparing marker sets

PLINK_RECODE_VCF

The process PLINK_RECODE_VCF runs two plink commands.

First, it takes the VCF file and performs LD pruning with the --indep-pairwise 50 10 0.8 command. This outputs a set of plink files, of which is the plink.prune.in file used by the second plink command and used in the final step to create the markers.txt file used to generate the genotype matrix.

Other filtering parameters specified in the first command: --snps-only: --biallelic-only --maf --set-missing-var-ids --geno --not-chr: SPECIFIC TO FIRST PLINK COMMAND

All the files generated by the first plink command should be stored with the plink.prune.in prefix.

  • Per the plink documentation the --indep-pairwise command produces: "a pruned subset of of markers that are in appximate linkage equilibrium with each other, writing the IDs to plink.prune.in (and the IDs of all excluded variants to plink.prune.out). (Results may be slightly different from PLINK 1.07, due to a minor bugfix in the r2 computation when missing data is present, and more systematic handling of multicollinearity.) Output files are valid input for --extract/--exclude in a future PLINK run.

Second, the process runs a plink command to generate a set of plink files for downstream processing with the --recode command

Simulated traits

Trait simulation involves the following sequential steps:

  1. Select causal variants.
  2. Simulate phenotypes based on these variants.
  3. Update PLINK fileset with simulated phenotypes.

1. Selecting Causal Variants

Causal variants are chosen from the available marker set by the PYTHON_SIMULATE_EFFECTS_GLOBAL process, which executes the bin/create_causal_vars.py script.

This script takes the following inputs:

  • A .bim file (PLINK binary marker information).
  • The desired number of causal variants (nQTL).
  • The effect range, specified either as a numeric range (e.g., 0.4-0.9) or as gamma.

The selection process involves two main steps:

  1. Variant Selection: The select_variants() function randomly chooses nQTL variants without replacement from the markers listed in the .bim file.
  2. Effect Size Assignment:
    • If the effect range is gamma, the simulate_effect_gamma() function assigns effect sizes. These sizes are drawn from a gamma distribution (gamma(effect_shape=0.4, effect_scale=1.66)), and each variant is randomly assigned a direction (positive or negative effect, i.e., 1 or -1).
    • If a numeric range (e.g., 0.4-0.9) is provided, the simulate_effect_uniform() function assigns effect sizes. These are drawn from a uniform distribution spanning the specified low_end to high_end, and a direction (1 or -1) is also randomly assigned.

The script outputs a file named causal_variants.txt in the designated output directory. This file lists the selected causal variant IDs and their assigned effect sizes.

For our example replicate 5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05 (assuming nQTL=5), the causal_variants.txt file is generated by the PYTHON_SIMULATE_EFFECTS_GLOBAL process. An example of this file, located at data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/PYTHON_SIMULATE_EFFECTS_GLOBAL/causal_variants.txt, would look like this:

// filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/PYTHON_SIMULATE_EFFECTS_GLOBAL/causal_variants.txt
2:14378543 0.38901223435837223
2:7459092 -4.532985639400811
3:1537674 -0.0063686130105886415
4:16212978 0.47805937866664927
2:13733845 -0.03046281285149475

2. Simulating Phenotypes with GCTA_SIMULATE_PHENOTYPES

The causal_variants.txt file (generated in the previous step) is used by the GCTA_SIMULATE_PHENOTYPES process. This process employs the gcta64 --simu-qt command to simulate quantitative traits based on the selected causal variants. (Refer to the GCTA GWAS Simulation documentation for more details).

Key parameters for gcta64 --simu-qt:

  • --simu-causal-loci: This parameter takes the causal_variants.txt file, which provides the SNP IDs and their effect sizes for the simulation.
  • --simu-hsq: This specifies the target heritability (h²) of the trait. The value for this parameter is taken from the file provided to the --h2 pipeline parameter.

This process generates two primary output files:

  • {prefix}.par: A par file with a header, detailing:
    • QTL: SNP ID of the causal variant.
    • RefAllele: Reference allele.
    • Frequency: Allele frequency.
    • Effect size: The effect size used in the simulation for that QTL.
    • For our example replicate 5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05 (assuming nQTL=5, h2=0.2, MAF=0.05, effect=gamma, and strain set ce.96.allout15_irrepressible.grosbeak), the GCTA simulation step produces a *.par file. An example, named 5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.par and located in data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_SIMULATE_PHENOTYPES/, is shown below:
      // filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_SIMULATE_PHENOTYPES/5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.par
      QTL           RefAllele   Frequency    Effect
      2:7459092     A           0.0520833    -4.53299
      2:13733845    A           0.125        -0.0304628
      2:14378543    A           0.104167     0.389012
      3:1537674     G           0.125        -0.00636861
      4:16212978    T           0.0625       0.478059
      
  • {prefix}.phen: A phenotype file without a header, containing:
    • Column 1: Family ID.
    • Column 2: Individual ID.
    • Column 3: Simulated phenotype value.
    • Similarly, the *.phen file, named 5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.phen and found in data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_SIMULATE_PHENOTYPES/, contains the simulated phenotype values:
    • // filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_SIMULATE_PHENOTYPES/5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.phen
      MY2713 MY2713 -6.04369 
      JU323 JU323 11.1615 
      XZ1734 XZ1734 4.15316 
      QG4080 QG4080 9.589 
      DL226 DL226 -4.79251 
      ECA2533 ECA2533 8.34439 
      XZ1672 XZ1672 5.00596 
      ED3046 ED3046 -2.37417 
      NIC1786 NIC1786 11.8277 
      NIC274 NIC274 -22.5104 
      

3. Updating PLINK Fileset with Simulated Phenotypes (PLINK_UPDATE_BY_H2)

This step, handled by the PLINK_UPDATE_BY_H2 process, integrates the simulated phenotypes (from {prefix}.phen) into the primary PLINK fileset used for simulation (referred to as the "TO_SIMS" fileset). This associates the newly generated phenotype data with the existing genetic data for each individual, preparing it for downstream analyses such as association mapping.

The filtering parameters that are applied should remain consistent across all plink commands

GWAS Mappings

After trait simulation, the pipeline maps simulated phenotypes using GCTA to evaluate GWA power and precision. The mapping phase involves constructing a genetic relatedness matrix (GRM), verifying phenotypic variance, and performing association mapping. These processes are parameterized by two variables defined in main.nf:

  • mode: "inbred" or "loco" — determines the type of GRM and the GWA algorithm
  • type: "pca" or "nopca" — determines whether the first principal component eigenvector is included as a covariate

Each simulation replicate is mapped under all four combinations (inbred x pca, inbred x nopca, loco x pca, loco x nopca), driven by Nextflow channel combinations:

ch_mode = Channel.of(
    ["inbred", "fastGWA"],
    ["loco", "mlma"]
)

ch_type = Channel.of(
    "pca",
    "nopca"
)

Creating GRM and Estimating Phenotypic Variance (GCTA_MAKE_GRM)

Source: modules/gcta/make_grm/main.nf

The GCTA_MAKE_GRM process performs two sequential GCTA operations: building the GRM and estimating variance via REML.

Step 1: Build the GRM

The GRM construction method depends on the mode parameter:

if [[ ${mode} == "inbred" ]]; then
    GRM_OPTION="--make-grm-inbred"
else
    GRM_OPTION="--make-grm"
fi

gcta64 --bfile TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group} \
        --autosome --maf ${maf} ${GRM_OPTION} \
        --out TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_gcta_grm_${mode} \
        --thread-num ${task.cpus}
  • mode == "inbred": Uses --make-grm-inbred, which constructs a kinship matrix tailored for populations of inbred organisms. The inbred GRM adjusts relatedness estimates to account for the reduced heterozygosity characteristic of inbred lines.
  • mode == "loco": Uses --make-grm, which constructs a standard genome-wide relatedness matrix using all autosomal markers.

Both modes filter to autosomal markers (--autosome) above the minor allele frequency threshold (--maf). Because main.nf defines ch_mode with both ["inbred", "fastGWA"] and ["loco", "mlma"], each simulation replicate produces two GRMs — one inbred and one standard.

Step 2: REML Variance Estimation

gcta64 --grm TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_gcta_grm_${mode} \
        --pheno ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sims.phen \
        --reml --out check_vp \
        --thread-num ${task.cpus}

This estimates the phenotypic variance (Vp) from the simulated phenotype data using restricted maximum likelihood (REML). (See GCTA GREML analysis documentation).

The output of the gcta64 --reml analysis is a plain text file with the *.hsq extension containing variance component estimates. For our example replicate 5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05, here's an example of its content:

// filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/GCTA_MAKE_GRM/check_vp.hsq
Source	Variance	SE
V(G)	46.564277	40.308593
V(e)	158.873235	39.887704
Vp	205.437512	30.661982
V(G)/Vp	0.226659	0.185504
logL	-301.643
logL0	-302.603
LRT	1.920
df	1
Pval	8.2935e-02
n	96

Outputs: GRM binary files (.grm.bin, .grm.N.bin, .grm.id), the REML estimates (check_vp.hsq), and all input PLINK/phenotype files passed through for downstream use.

Verify and Adjust Phenotypic Variance (PYTHON_CHECK_VP)

Source: bin/check_vp.py

The PYTHON_CHECK_VP process inspects the estimated phenotypic variance (Vp) from the *.hsq file (produced by GCTA REML). This step ensures that the simulated phenotypes exhibit sufficient variance, preventing near-zero variance from causing numerical issues in the association mapping step.

Inputs to bin/check_vp.py:

  • The *.hsq file (containing Vp estimates).
  • The current phenotype file (e.g., {prefix}.phen from the GCTA simulation step).

Script Logic:

  1. The script parses the *.hsq file to extract the Vp value.
  2. It then checks the Vp against a threshold:
    • If Vp is less than 0.000001: The script scales up the phenotype values for all individuals by multiplying them by 1000. This adjustment aims to increase the phenotypic variance.
    • If Vp is greater than or equal to 0.000001: The original phenotype values are retained without modification.
  3. The script writes the (potentially modified) phenotype data to a temporary file named new_phenos.temp.
  4. Finally, this new_phenos.temp file is renamed to reflect the specific simulation parameters, following a pattern like ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sims.pheno. This becomes the final phenotype file for this simulation iteration.
    // filepath: data/test_data/5_1_gamma_ce.96.allout15_irrepressible.grosbeak_0.05/PYTHON_CHECK_VP/5_1_0.2_0.05_gamma_ce.96.allout15_irrepressible.grosbeak_sims.pheno
    MY2713 MY2713 -6.04369
    JU323 JU323 11.1615
    XZ1734 XZ1734 4.15316
    QG4080 QG4080 9.589
    

Mappings with GCTA_PERFORM_GWA

Source: modules/gcta/perform_gwa/main.nf

This process receives the GRM and variance-checked phenotype data, then performs three sub-steps controlled by the mode and type parameters.

Step 1: Create Sparse GRM

This step always runs regardless of mode or type:

gcta64 --grm TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_gcta_grm_${mode} \
    --make-bK-sparse ${sparse_cut} \
    --out ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sparse_grm_${mode} \
    --thread-num ${task.cpus}

The --make-bK-sparse command sets all off-diagonal GRM values with absolute value less than or equal to sparse_cut (default: 0.05) to zero. This produces a sparse representation of the GRM.

Step 2: Conditional PCA

The type parameter controls whether PCA eigenvectors are extracted as covariates:

if [[ ${type} == "pca" ]]; then
    gcta64 --grm TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_gcta_grm_${mode} \
        --pca 1 \
        --out ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sparse_grm_${mode} \
        --thread-num ${task.cpus}

    COVAR="--qcovar ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sparse_grm_${mode}.eigenvec"
else
    COVAR=""
fi
  • type == "pca": Extracts the first principal component from the GRM (--pca 1), producing an .eigenvec file. This eigenvector is passed as a quantitative covariate (--qcovar) to the mapping command to control for population structure.
  • type == "nopca": No PCA is performed. The COVAR variable is set to an empty string, so no covariate is included in the mapping command.

Step 3: Association Mapping

The mode parameter determines both the GCTA mapping algorithm and the GRM input flag:

if [[ ${mode} == "inbred" ]]; then
    GRM_OPTION='--grm-sparse'
    COMMAND='--fastGWA-mlm-exact'
else
    GRM_OPTION="--grm"
    COMMAND="--mlma-loco"
fi

gcta64 ${COMMAND} \
    --bfile TO_SIMS_${nqtl}_${rep}_${h2}_${maf}_${effect}_${group} \
    ${GRM_OPTION} ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sparse_grm_${mode} \
    ${COVAR} \
    --out ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_lmm-exact_${mode}_${type} \
    --pheno ${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_sims.pheno \
    --maf ${maf} \
    --thread-num ${task.cpus}

Inbred mode (mode == "inbred"):

  • Uses --fastGWA-mlm-exact to perform an exact mixed linear model (MLM) association analysis
  • Uses --grm-sparse to input the sparse GRM created from the inbred relatedness matrix
  • Output format: .fastGWA (columns: CHR, SNP, POS, A1, A2, N, AF1, BETA, SE, P)

LOCO mode (mode == "loco"):

  • Uses --mlma-loco to perform MLM-based association using a leave-one-chromosome-out approach, where the GRM excludes the chromosome of the tested variant
  • Uses --grm to input the sparse GRM created from the standard relatedness matrix
  • Output format: .loco.mlma (columns: Chr, SNP, bp, A1, A2, Freq, b, se, p), renamed to .mlma after completion
if [[ ${mode} == "loco" ]]; then
    mv "${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_lmm-exact_${mode}_${type}.loco.mlma" \
       "${nqtl}_${rep}_${h2}_${maf}_${effect}_${group}_lmm-exact_${mode}_${type}.mlma"
fi

Summary of the Four Mapping Conditions

Mode Type GRM Construction GWA Command GRM Flag Covariate Output Format
inbred pca --make-grm-inbred --fastGWA-mlm-exact --grm-sparse --qcovar .eigenvec .fastGWA
inbred nopca --make-grm-inbred --fastGWA-mlm-exact --grm-sparse none .fastGWA
loco pca --make-grm --mlma-loco --grm --qcovar .eigenvec .mlma
loco nopca --make-grm --mlma-loco --grm none .mlma

Define QTL Regions of Interest

  • QTL regions are defined with the NF process R_GET_GCTA_INTERVALS
  • This process runs an Rscript bin/Get_GCTA_Intervals.R which defines QTL intervals from the mapping outputs of GCTA_PERFORM_GWA and several pipeline parameters
  • Essentially the script takes the raw mapping results, applies significance criteria, groups significant markers into QTLs, defines confidence intervals for those QTL regions and estimates their effect size.
  • The script has several processing steps
    1. Load the libraries and command line arguments given by the pipeline
    2. Load the input data
      • Phenotype Data
      • GCTA Mapping Data
      • Genotype matrix
    3. Set the significance threshold
      • The script accepts a commandline argument (argument #10) which specifies the significance threshold to be applied to markers.
      • This argument can one of the following:
        • BF - Bonferroni threshold
        • EIGEN - Defined by the number of independent tests from Eigen decomposition of the genotype matrix
        • A user defined numeric value that is used as the threshold.
      • The --sthresh argument supplied to the pipeline sets the input for this argument to the process.
        • The default setting for the pipeline --sthresh argument is the BF threshold in the nextflow.config file.
    4. Process Mapping Data w/ process_mapping_df() function

process_mapping_df()

This is the core function of the script and performs many operations to define QTL intervals. The function returns the variable Processed which contains the original mapping data and these additional columns

  • strain
  • value
  • allele
  • var.exp
  • startPOS: the starting position of the QTL interval
  • peakPOS: The position of the peak marker of the QTL interval
  • endPOS: the end position of the QTL interval
  • peak_id: the id of the QTL interval. Is 1 if there is just one QTL identified for the trait or 2..Inf if there are multiple QTL identified for the trait.
  • interval_size: The number of bases spanned by the QTL interval
  1. Threshold application
    • Step calculates the significance threshold and identifies marker SNPs exceeding that threshold.
      1. First the mapping df is grouped by trait (in the case that multiple mappings of different traits occurred)
      2. Depending on the threshold set, SNPs are flagged as being above 1 or below 0 the significance threshold in a newly created column aboveBF
    • Note: The function uses an externally defined variable QTL_cutoff which is not passed as an argument to the script.
    • The column to denote if a SNP is above the significance threshold is named aboveBF regardless of the significance threshold that is applied. This is likely required so that the outputs have standard formatting for later processing steps.
  2. Filtering
    • After applying the significance threshold to flag SNPs as either above (1) or below (0) the significance threshold in the column aboveBF there are three possible next steps
    1. If more than 15% of the total SNPs are above the significance threshold all columns added by the mapping function process_mapping_df() are set to NA.
    2. If there are no significant SNPs the columns added by the process_mapping_df() are also set to NA
  3. Variant effect calculation for significant SNPs
    • This step adds the var.exp column to the processed mapping result by correlating phenotype values with genotype values at the significant SNPs.
    • Uses pearsons correlation R2 between the phenotype values and allelic state (REF/ALT).
  4. QTL interval definition
    • Identifies the most significant SNP in a QTL region of interest The output is a processed dataframe containing the original mapping data augmented with QTL interval information (start, peak, end positions, peak ID, interval size, and Variance explained)

Assessing Mapping Performance

The process R_ASSESS_SIMS runs the Rscript Assess_Sims.R to evaluate the performance of GWAS simulations.

It loads the simulated trait outputs, mapping outputs, and a number of simulation pipeline parameters.

This final process outputs a simulation_assessment_results.tsv to the analysis directory. Each row represents a QTL simulated or Detected with the following columns:

  • QTL: Peak marker ID for the QTL interval
  • Simulated: TRUE/FALSE if the QTL was simulated
  • Detected: TRUE/FALSE if the QTL was detected in mapping
  • CHROM: Chromosome of the QTL (numeric ID e.g., 1 = I, 2 = II, etc.)
  • POS: Position of the marker
  • RefAllele: Reference allele for the marker
  • Frequency: Allele frequency of the marker
  • Effect: Effect size of the marker
  • Simulated.QTL.Var.Exp: Variance explained by the simulated QTL
  • log10p: -log10(p-value) of the marker from mapping
  • aboveBF: TRUE/FALSE if the peak marker is above the significance threshold (see algorithm_id column)
  • startPOS: Start position of the QTL interval
  • peakPOS: Peak position of the QTL interval
  • endPOS: End position of the QTL interval
  • detected.peak: TRUE/FALSE if the marker is the detected peak in mapping
  • interval.Frequency: Allele frequency of the peak marker in the QTL interval
  • BETA: Effect size estimate from mapping for the peak marker
  • interval.log10p: -log10(p-value) of the peak marker in the QTL interval
  • peak_id: Numeric ID of the QTL interval (e.g 1, 2, ... n, where N is the total number of QTL detected)
  • interval_size: Size of the QTL interval in base pairs
  • interval.var.exp: Variance explained by the peak marker in the QTL interval
  • top.hit: TRUE/FALSE if the marker is the top hit in QTL interval
  • nQTL: Number of QTL simulated for the trait
  • simREP: Replicate number of the simulation
  • h2: Heritability of the simulated trait
  • maf: Minor allele frequency threshold used in simulation
  • effect_distribution: Effect size range used in simulation
  • strain_set_id: Name of the strain set used in simulation
  • algorithm_id: Mapping method (Inbred, Loco, Inbred + PCA, LOCO + PCA) and significance threshold (e.g. inbred_pca_EIGEN, or inbred_pca_BF)

About

Nextflow workflow for running GWA simulations

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors