- Build name: RSV Washington Focused Build
- Pathogen/Strain: RSV
- Scope: Whole genome builds for both RSV A and RSV B
- Purpose: This repository contains the Nextstrain build for Washington State genomic surveillance of RSV.
- Nextstrain Build Location: https://nextstrain.org/groups/wadoh#rsv
- Getting Started
- Running the Build
- Repository File Structure Overview
- Expected Outputs and Interpretation
- Scientific Decisions
- Adapting for Another State
- Contributing
- License
- Acknowledgements
- Update Example Data
- Sending data to the nextclade_data repo
Some high-level features and capabilities specific to this build include:
- Scope: This repository will generate two views, a whole genome build for RSV A and RSV B showing the evolution of the entire genome.
- Subsampling: The RSV Washington Focused Build uses a tiered subsampling strategy which allows for filtering NCBI data based on geographic location. The subsampling criteria is set to select all sequences from Washington State. For contextual data, up to 10,000 sequences are selected from all other non-Washington locations. These criteria can be modified as needed.
During the Ingest pipeline, this build pulls all RSV genomes that are publicly available from NCBI.
Input metadata and sequences for RSV-A and RSV-B are also available via https://data.nextstrain.org:
-
-
Sequence and Metadata: NCBI
-
Expected Inputs:
data/a/sequences.fastadata/b/sequences.fastadata/a/metadata.tsvdata/b/metadata.tsv
-
RSV sequences and metadata can be downloaded in the /ingest folder using
nextstrain build --cpus 1 ingest or nextstrain build --cpus 1 . if running directly from the /ingest directory.
Follow the standard installation instructions for Nextstrain's suite of software tools.
To check that Nextstrain is installed:
nextstrain check-setup
git clone https://github.com/NW-PaGe/rsv.git
cd rsv
From within the repository the build can be run with:
nextstrain build . --configfile profiles/wadoh/configfile_wadoh.yaml
To test the pipeline with the provided example data located in example_data/, you will need to copy over the contents of this folder, including the a/ and b/ subfolders, into the data/ folder. The Snakefile will pull ingest the contents of the data/ folder into the build.
The file structure of the repository is as follows with *" folders denoting folders that are the build's expected outputs.
.
├── config
── auspice_config.json
── configfile.yaml
── description.md
── outliers.txt
├── example_data
├── ingest
├── logs
├── nextclade
├── profiles
├── scripts
├── workflow
├── README.md
├── auspice*
├── results*
├── data*
└── Snakefile
config/: In addition to the files listed the config folder contains the sequence data for the A and B reference sequences, clade information, and coloring data.auspice_config.json: Config file pertaining to how data is displayed in Auspiceconfigfile.yaml: Config file used in creating the build, including filtering and subsamplingdescription.md: Markdown file for the description displayed under the visualizations in Auspiceoutliers.txt: Text file of sequences to be excluded from the build
ingest: Contains files for ingest pipeline. See following section.example_data: Contains example data to be run through build for testing.nextclade: This folder contains the workflow used to make the nextclade datasets for RSV. More information on nextclade datasets can be found in thenextclade repo.profiles: Contains profiles that can be specified during pipelinedefault/: Default profile fileswadoh/: contains files used for Washington-specific pipelineauspice_config_wa.json/: Customized auspice configfileconfigfile_wadoh.yaml/: Customized configfilewa_filter.smk/: Snakefile with customized rules
scripts/: A set of necessary python scripts used in pipelineworkflow/:snakemake_rules/: contains a variety of snakemake rules. Most notably:core.smk: The sequence of rules that will get followed in running the build, this includes indexing and aligning sequences, and filtering based on the criteria found in the config files.
data/:Snakefile: A list of rules in the order they will be run in when the workflow is run.
The ingest pipeline is based on the Nextstrain mpox ingest workflow (https://github.com/nextstrain/mpox/tree/master/ingest).
Running the ingest pipeline produces ingest/data/{a,b}/metadata.tsv and ingest/data/{a,b}/sequences.fasta.
This repository uses git subrepo to manage copies of ingest scripts in ingest/vendored, from nextstrain/ingest. To pull new changes from the central ingest repository, first install git subrepo, then run:
See ingest/vendored/README.md for instructions on how to update the vendored scripts.
After successfully running the build there will be two output folders containing the build results.
auspice/folder contains two jsons each for both RSV A and RSV B builds. One contains the root sequence used in building the tree. The second all of the visualization information that will be used by Auspice in displaying the data, including gene annotaions and coloring information.results/folder contains nested folders for both RSV A and RSV B trees. These folders contain any other outputs created in the running of the build, including sequence alignments and the tree.json.
- Subsampling: All Washington sequences are included in this build while maintaining national/global context. Outside of Washington, sequences are sampled evenly across countries and years of collection. This decision was made after it was observed that Washington sequences grouped on the trees with sequences from very disparate locations, indicating that samples from neighboring regions were no more likely to influence what was coming into Washington that samples from farther away.
- Root Selection: No root was specified for the trees in this build, allowing for Nextstrain default of using least squares to determine the best root based on the data.
- Reference Sequences: - RSV A: These builds use reference sequence A/England/397/2017 (EPI_ISL_412866, GISAID ID: PP109421.1), a virus collected in England on October 30th, 2017, as the reference sequence for RSV A. This sequence has the duplication in the G-protein shared by all currently circulating variants. - RSV B: These builds use reference sequence B/Australia/VIC-RCH056/2019 (EPI_ISL_1653999, GISAID ID: OP975389) as the reference sequence for RSV B. This sequence has the duplication in the G-protein shared by all currently circulating variants.
- Parent Node: No sequence is specified as a parent node in this build. Mutations in the tree are therefore called against the inferred root node sequence.
- Molecular clock IQD range: IQD range was of 4 is consistent with the Nextrain global RSV build
- Other adjustments:
config/includes.txt: These sequences are always included into our sampling strategy as they are relevant to our epidemiological investigations.config/excludes.txt: These sequences are always excluded from our subsampling and filtering due to duplication, known data errors or based on epidemiological linkage knowledge.
This build can be customized for use by other demes, including as states, cities, counties, or countries.
For any questions please submit them to our Discussions page otherwise software issues and requests can be logged as a Git Issue.
This project is licensed under a modified GPL-3.0 License. You may use, modify, and distribute this work, but commercial use is strictly prohibited without prior written permission.
These data are generously shared by labs around the world and deposited in NCBI Genbank by the authors. Please contact these labs first if you plan to publish using these data. We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there should be free license to publish on this data. Data generators should be cited where possible and collaborations should be sought in some circumstances.
We also gratefully acknowledge the work done by the Bedford lab and Nextstrain team who were the original authors of this build.
Example data is used by CI. It can also be used as a small subset of real-world data.
Example data should be updated every time metadata schema is changed. To update, run:
nextstrain build --docker . update_example_data -FFrom within the destination directory, run
rsync -a <path-to>/rsv/nextclade/datasets/ .