This repository provides scripts to run Hail workflows on the Imperial College High Performance Computing (HPC) cluster, specifically for:
- Converting multiple single-sample gVCF files into a multi-sample VDS (Variant Dataset) using the Hail Combiner.
Clone the repository and move into it:

```bash
git clone https://your.repo.url/here.git
cd Hail  # or the name of your cloned directory
```

This repository uses a predefined Conda environment (`conda_env.yml`) to ensure all dependencies are consistent. The environment must be created on the login node of the Imperial HPC.
See the Imperial HPC Conda guide if needed.
```bash
# Enable conda in your shell (adjust the path if needed)
eval "$(~/anaconda3/bin/conda shell.bash hook)"

# Remove existing environment (if it exists)
conda env remove -n hail

# Create the environment from the provided file
conda env create --file conda_env.yml
```
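Before submitting any jobs, it may be worth confirming on the login node that the environment built correctly. This quick check assumes the environment is named `hail` as above:

```bash
# Activate the new environment and confirm that Hail imports and reports its version.
conda activate hail
python -c "import hail as hl; print(hl.version())"
```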
This section describes how to convert a list of single-sample gVCF files into a Hail Variant Dataset (VDS).

Create a text file where each line is the absolute path to a `.gvcf.gz` file you want to include in the multi-sample dataset.
Example (`my_gvcf_list.txt`):

```
/rds/general/project/example/data/sample01.gvcf.gz
/rds/general/project/example/data/sample02.gvcf.gz
```
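If all your gVCFs sit under one directory, a `find` one-liner can generate this list for you. The directory and file suffix below mirror the example paths and should be adjusted to your project:

```bash
# Collect the absolute path of every gVCF under the data directory into the list file.
find /rds/general/project/example/data -name '*.gvcf.gz' | sort > my_gvcf_list.txt
```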
Edit the `set_variables.sh` file to define:
- File paths (gVCF list, output VDS, logs, etc.)
- Runtime parameters (threads, memory, etc.)
Each variable is documented in the file to help guide configuration.
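For orientation, here is a hypothetical sketch of what such a file might contain; the actual variable names and defaults are the ones documented in `set_variables.sh` itself:

```bash
# Hypothetical example values; the real variable names are documented
# in set_variables.sh and may differ.
GVCF_LIST=/rds/general/project/example/my_gvcf_list.txt    # list of input gVCFs (see above)
OUTPUT_VDS=/rds/general/project/example/output/cohort.vds  # destination for the combined VDS
LOG_DIR=/rds/general/project/example/logs                  # directory for job and Hail logs
N_THREADS=8                                                # CPU threads requested for the job
MEMORY=64gb                                                # memory requested for the job
```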
Use the provided submission script to run the pipeline. You must pass the absolute path to your `set_variables.sh` file:

```bash
bash scripts/submit_gVCF_to_VDS.sh /full/path/to/set_variables.sh
```
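Assuming the script submits a batch job to the cluster's PBS scheduler (as its name suggests), you can check the job's status afterwards:

```bash
# List your queued and running jobs on the Imperial HPC.
qstat -u $USER
```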