| title | Technical Specification |
|---|---|
| date | 2024-09-11 |
| authors | |

seqspec is an open-source file format specification and command-line tool for annotating sequencing libraries that uses YAML for data representation. This document outlines the specification and explains various use cases.
The seqspec schema is designed to annotate sequencing libraries through three main Pydantic models: Assay, Region, and Read. An Assay contains the library_spec (a tree of Region objects, possibly nested) and the sequence_spec (a list of Read objects). Files (e.g., FASTQ/BAM/SRA) can be associated with individual reads via a list of File objects.
Each seqspec file is associated with a sequencing run and documents the designed library structure and the designed read structure. A simple (but incomplete) example looks like the following:
library_protocol: 10xv3 Chromium scRNAseq
library_kit: Truseq dual index
sequence_protocol: Illumina Novaseq 6000
sequence_kit: Illumina Novaseq 6000 v1.5 kit
modalities:
  - Modality1
  - Modality2
sequence_spec:
  - read_id: Read1
    modality: Modality1
    primer_id: Region2
    strand: pos
    min_len: 10
    max_len: 100
    files:
      - file_id: R1.fastq.gz
        ...
library_spec:
  - region_id: Modality1
    regions:
      - region_id: Region1
        ...
      - region_id: Region2
        ...
  - region_id: Modality2

Each object has clearly defined fields and helpful input variants (e.g., ReadInput, RegionInput) used by tools. The full JSON Schema is in seqspec/schema/seqspec.schema.json.
The Assay object contains overall metadata for the sequencing run.
Fields:
- `seqspec_version`: String specifying the version of the seqspec specification, adhering to semantic versioning.
- `assay_id`: Identifier for the assay.
- `name`: The name of the assay.
- `doi`: The DOI of the paper that describes the assay.
- `date`: The seqspec creation date.
- `description`: A short description of the assay.
- `modalities`: The modalities the assay targets, e.g. "dna", "rna", "tag", "protein", "atac", "crispr".
- `lib_struct`: Optional link to Teichmann lab library structure page.
- `library_protocol`: The protocol/machine/tool used to generate the library insert (can be a modality-specific list).
- `library_kit`: The kit used to make the library compatible with the `sequence_protocol` (can be a modality-specific list).
- `sequence_protocol`: The protocol/machine/tool used to generate sequences (can be a modality-specific list).
- `sequence_kit`: The kit used with the protocol to sequence the library (can be a modality-specific list).
- `sequence_spec`: The spec for the sequence structure, an array of `Read` objects.
- `library_spec`: The spec for the library structure, an array of `Region` objects.

:::{note}
For library_protocol, library_kit, sequence_protocol, and sequence_kit, values can be a single string or a list of typed objects (LibProtocol, LibKit, SeqProtocol, SeqKit) where each entry can be scoped by modality.
:::
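
The note above can be illustrated with a small lookup. This is a minimal sketch, not part of the seqspec API: `resolve_for_modality` is a hypothetical helper, plain dicts stand in for the typed objects, and the keys (`kit_id`, `name`, `modality`) are taken from the `SeqKit` entry in the Assay example. Treating an entry without a `modality` as applying to every modality is an assumption.

```python
from typing import Optional, Union

# Hypothetical helper, not part of seqspec: plain dicts stand in for typed
# objects such as SeqKit, using the keys shown in the Assay example.
KitValue = Union[str, list]


def resolve_for_modality(value: KitValue, modality: str) -> Optional[str]:
    """Return the kit identifier that applies to `modality`, if any."""
    if isinstance(value, str):
        # A single string applies to every modality.
        return value
    for entry in value:
        # Assumption: an entry without a modality applies to all modalities.
        if entry.get("modality") in (None, modality):
            return entry["kit_id"]
    return None


# Illustrative value; the kit name is shortened from the Assay example.
sequence_kit = [
    {"kit_id": "NovaSeq 6000 S2 Reagent Kit v1.5", "name": "illumina", "modality": "rna"},
]

print(resolve_for_modality(sequence_kit, "rna"))
print(resolve_for_modality("Custom", "atac"))
```

The same lookup would work for the other three fields only if their typed objects expose a comparable identifier key, which this sketch does not verify.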
Example:
!Assay
seqspec_version: 0.3.0
assay_id: SPLiT-seq/Illumina
name: SPLiT-seq
doi: https://doi.org/10.1126/science.aam8999
date: 15 March 2018
description: split-pool ligation-based transcriptome sequencing
modalities:
  - rna
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html
library_protocol: SPLiT-seq
library_kit: Custom
sequence_protocol: Illumina NovaSeq 6000 (EFO:0008637)
sequence_kit:
  - !SeqKit
    kit_id: "NovaSeq 6000 S2 Reagent Kit v1.5 (100\u2009cycles)"
    name: illumina
    modality: rna
sequence_spec: ...
library_spec: ...

The library_spec contains a list of (possibly nested) Region objects that detail individual segments within the sequencing library molecule, specifying types, sequences, and relationships between segments. The order of the Regions in the library_spec (top to bottom) corresponds to their linear order in the library molecule from the 5' end to the 3' end.
modalities:
  - rna
library_spec:
  - region_id: rna # <-- must be a "modality" region
    regions: # <-- a list containing the linear ordering of the "regions" for the "rna" library molecule
      - region_id: illumina_p5
        ...
      - region_id: read1_primer
        ...
      - region_id: cell_bc
        ...
      - region_id: umi
        ...

:::{important}
The top-most Region object must be a "modality" Region and contain nested Regions describing the library structure for that modality.
:::
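
Because the top-to-bottom order of nested Regions mirrors the 5' to 3' order of the molecule, a depth-first walk recovers the linear layout. The following is a minimal sketch under stated assumptions: nested dicts stand in for Region objects, `leaf_regions` and `joined_sequence` are hypothetical helpers (not seqspec functions), and the short sequences are made-up placeholders.

```python
# Hypothetical sketch: nested dicts stand in for Region objects.
def leaf_regions(region: dict) -> list:
    """Return the innermost regions of `region` in 5' -> 3' order."""
    children = region.get("regions") or []
    if not children:
        return [region]
    leaves = []
    for child in children:
        # Depth-first, top-to-bottom traversal preserves the linear order.
        leaves.extend(leaf_regions(child))
    return leaves


def joined_sequence(region: dict) -> str:
    """Concatenate the leaf sequences of a region, 5' -> 3'."""
    return "".join(leaf["sequence"] for leaf in leaf_regions(region))


# Made-up placeholder sequences, for illustration only.
rna = {
    "region_id": "rna",
    "regions": [
        {"region_id": "read1_primer", "sequence": "ACGT", "regions": None},
        {"region_id": "cell_bc", "sequence": "NNNN", "regions": None},
        {"region_id": "umi", "sequence": "XX", "regions": None},
    ],
}

print([r["region_id"] for r in leaf_regions(rna)])
print(joined_sequence(rna))
```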
Each Region has the following properties, which annotate an element of the library molecule:
- `region_id` is a free-form string and must be unique across all regions in the `seqspec` file.
  - If the assay contains multiple regions of the same `region_type`, it may be useful to append an integer to the end of the `region_id` to differentiate those regions. For example, if the assay had four `barcodes` then each of the individual `barcode` regions could have the `region_id`s `barcode-1`, `barcode-2`, `barcode-3`, `barcode-4`.
- `region_type` can be one of the following:
  - `atac`: The modality for chromatin accessibility capture.
  - `barcode`: A region corresponding to a synthetic barcode sequence often associated with samples or cells.
  - `cdna`: Complementary DNA generated from an RNA product.
  - `crispr`: The modality for barcode-based CRISPR assays.
  - `custom_primer`: A synthesized segment of nucleic acid used to initiate DNA synthesis.
  - `dna`: Deoxyribonucleic acid, targets often generated for MPRA assays.
  - `fastq`: A region corresponding to a FASTQ file.
  - `fastq_link`: A region corresponding to a FASTQ file that is stored remotely (via url).
  - `gdna`: Genomic DNA, targets often obtained with ATACseq.
  - `hic`: The modality corresponding to high-throughput chromosome conformation capture, a technique for studying the three-dimensional structure of genomes.
  - `illumina_p5`: A sequencing primer specific to Illumina platforms, used to bind the library molecule to the flow cell.
  - `illumina_p7`: A sequencing primer specific to Illumina platforms, used to bind the library molecule to the flow cell.
  - `index5`: A barcode sequence used for multiplexing and sample identification in sequencing, associated with the P5 end.
  - `index7`: A barcode sequence used for multiplexing and sample identification in sequencing, associated with the P7 end.
  - `linker`: A short, synthetic DNA sequence used to connect two molecules or fragments.
  - `ME1`: Mosaic end 1, used in the Nextera Library kit for library preparation.
  - `ME2`: Mosaic end 2, used in the Nextera Library kit for library preparation.
  - `methyl`: The modality for methylation sequencing which assays the presence of a methyl group.
  - `named`: A custom named region for grouping other regions.
  - `meta`: A top-level modality placeholder used by `seqspec init`.
  - `nextera_read1`: A read sequence obtained from the first end in paired-end Nextera library sequencing.
  - `nextera_read2`: A read sequence obtained from the second end in paired-end Nextera library sequencing.
  - `poly_A`: A sequence of multiple adenine nucleotides.
  - `poly_G`: A sequence of multiple guanine nucleotides.
  - `poly_T`: A sequence of multiple thymine nucleotides.
  - `poly_C`: A sequence of multiple cytosine nucleotides.
  - `protein`: The modality corresponding to assaying cell-surface proteins.
  - `rna`: The modality corresponding to assaying RNA.
  - `s5`: A sequencing primer or adaptor typically used in the Nextera kit in conjunction with ME1.
  - `s7`: A sequencing primer or adaptor typically used in the Nextera kit in conjunction with ME2.
  - `sgrna_target`: A sequence corresponding to the guide RNA spacer region that determines the genomic target of CRISPR-based perturbations.
  - `tag`: A short sequence of DNA or RNA used to label or identify a sample, protein, or other grouping.
  - `truseq_read1`: The first read primer in a paired-end sequencing run using the Illumina TruSeq Library preparation kit.
  - `truseq_read2`: The second read primer in a paired-end sequencing run using the Illumina TruSeq Library preparation kit.
  - `umi`: Unique Molecular Identifier, a short nucleotide sequence used to tag individual molecules.
- `sequence_type` can be one of the following:
  - `fixed`: indicates that the sequence string is known and fixed in length and nucleotide composition (if specified, then `sequence` must contain the fixed nucleotide sequence).
  - `joined`: indicates that the sequence is created (joined) from nested regions (if specified, then the `regions` property for that `Region` must contain `Region`s, i.e. must be non-null).
  - `onlist`: indicates that the sequence is derived from an onlist (if specified, then `onlist` must be non-null and `sequence` must consist entirely of `N`s).
  - `random`: indicates that the sequence is not known a priori (if specified, then `sequence` must consist entirely of `X`s).
- `sequence`: a representation of the sequence, must match the pattern `^[ACGTRYMKSWHBVDNX]+$`
  - if the `sequence_type` is `fixed` then the actual sequence string is provided
  - if the `sequence_type` is `joined` then the field must be the concatenation of the nested regions
  - if the `sequence_type` is `onlist` then the field must be an `N` string whose length equals that of the shortest sequence on the onlist
  - if the `sequence_type` is `random` then the field must be an `X` string
- `min_len`: an integer greater than or equal to 0 and less than or equal to 2048. It represents the minimum possible length of the `sequence`.
- `max_len`: an integer greater than or equal to 0 and less than or equal to 2048. It represents the maximum length of the `sequence`.
- `onlist`: can be `null` or contain a `File` object (see the `File` object section below)
  - `file_id`: a freeform string that uniquely identifies the file.
  - `filename`: a freeform string that matches the name of the file being annotated.
  - `filesize`: an integer that represents the size of the compressed file (in bytes).
  - `filetype`: a freeform string that specifies the file type (usually the extension of the `filename`, e.g. R1.fastq.gz has `filetype: fastq`).
  - `url`: a freeform string that specifies either the url location of the file, or the local path of the file (relative to this seqspec file).
  - `urltype`: one of ["local", "ftp", "http", "https"]; specifies the type of the `url`.
  - `md5`: the md5sum of the uncompressed file in `filename`, must match the pattern `^[a-f0-9]{32}$`
- `regions` can either be `null` or contain a list of `regions` as specified above.

Example:
!Region
region_id: barcode-1
region_type: barcode
sequence_type: onlist
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist: !Onlist
  file_id: barcode-1_onlist.txt
  filename: barcode-1_onlist.txt
  filetype: txt
  filesize: 120
  url: ./
  urltype: local
  md5: 5b62453df2771f5aa856f78797f16591
regions: null

For more information about the various fields, please see the JSON schema specification (seqspec/schema/seqspec.schema.json). For consistency across assays I suggest following a standard naming convention for common regions. I've made a collection of "named" regions available; please see docs/examples/regions for a list of example regions.
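
The `sequence_type` rules listed in this section lend themselves to a mechanical check. The sketch below is one way to express them, assuming a Region is available as a plain dict; `check_region` is a hypothetical helper and not a seqspec function, and the real validation lives in the JSON schema and the tool itself.

```python
import re

# Pattern and bounds are taken from the field descriptions in this section.
SEQUENCE_PATTERN = re.compile(r"^[ACGTRYMKSWHBVDNX]+$")
MAX_LEN = 2048


def check_region(region: dict) -> list:
    """Return a list of problems found in a Region dict (empty if none)."""
    problems = []
    seq = region.get("sequence") or ""
    seq_type = region.get("sequence_type")

    if not SEQUENCE_PATTERN.fullmatch(seq):
        problems.append("sequence does not match the allowed alphabet")

    if seq_type == "onlist":
        if region.get("onlist") is None:
            problems.append("onlist must be non-null")
        if set(seq) != {"N"}:
            problems.append("onlist sequence must be all N")
    elif seq_type == "random":
        if set(seq) != {"X"}:
            problems.append("random sequence must be all X")
    elif seq_type == "joined":
        if not region.get("regions"):
            problems.append("joined region must contain nested regions")

    for key in ("min_len", "max_len"):
        value = region.get(key)
        if not isinstance(value, int) or not 0 <= value <= MAX_LEN:
            problems.append(f"{key} must be an integer in [0, {MAX_LEN}]")
    return problems


# Mirrors the barcode-1 example, with the onlist reduced to one key.
barcode = {
    "region_id": "barcode-1",
    "region_type": "barcode",
    "sequence_type": "onlist",
    "sequence": "NNNNNNNN",
    "min_len": 8,
    "max_len": 8,
    "onlist": {"file_id": "barcode-1_onlist.txt"},
    "regions": None,
}

print(check_region(barcode))
print(check_region(dict(barcode, sequence="NNNNACGT")))
```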
The sequence_spec contains a list of Read objects which describe the sequencing "reads" that are generated from sequencing the molecule described in the library_spec. A crucial concept is that each Read object contains a primer_id which maps to a single region_id in the library_spec. Importantly, Reads can contain Files, which I describe in a later section.
sequence_spec:
  - read_id: Read1
    modality: Modality1
    primer_id: Region2
    strand: pos
    min_len: 10
    max_len: 100
    files:
      - file_id: R1.fastq.gz
        ...

A Read object is annotated with the following attributes:
- `read_id`: A freeform string that functions as a unique identifier for the read.
- `name`: A freeform string that functions as the name of the read.
- `modality`: A string that matches the modality of the assay generating the read.
- `primer_id`: A string that matches the region id of the primer used to generate the read (in the `library_spec`).
- `min_len`: An integer greater than or equal to zero specifying the minimum length of the read.
- `max_len`: An integer greater than or equal to zero specifying the maximum length of the read.
- `strand`: One of ["pos", "neg"], denotes the strandedness of the read.
- `files`: A list of `File` objects that contain sequences that match the structure of the parent `Read`.

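
The link between a Read's `primer_id` and a `region_id` in the library_spec can be checked by collecting every region id, including nested ones. This is a minimal sketch with plain dicts standing in for Read and Region objects; `collect_region_ids` and `check_reads` are hypothetical helpers, not seqspec functions.

```python
def collect_region_ids(regions) -> set:
    """Gather region_id values from a (possibly nested) list of regions."""
    ids = set()
    for region in regions or []:
        ids.add(region["region_id"])
        ids |= collect_region_ids(region.get("regions"))
    return ids


def check_reads(sequence_spec: list, library_spec: list) -> list:
    """Return a list of problems found in the reads (empty if none)."""
    region_ids = collect_region_ids(library_spec)
    problems = []
    for read in sequence_spec:
        read_id = read["read_id"]
        primer_id = read["primer_id"]
        if primer_id not in region_ids:
            problems.append(
                f"{read_id}: primer_id {primer_id} is not a region_id in library_spec"
            )
        if read["strand"] not in ("pos", "neg"):
            problems.append(f"{read_id}: strand must be 'pos' or 'neg'")
        if read["min_len"] < 0 or read["max_len"] < 0:
            problems.append(f"{read_id}: min_len and max_len must be >= 0")
    return problems


# Data mirrors the simple example near the top of this document.
library_spec = [
    {"region_id": "Modality1", "regions": [
        {"region_id": "Region1", "regions": None},
        {"region_id": "Region2", "regions": None},
    ]},
    {"region_id": "Modality2", "regions": None},
]
sequence_spec = [
    {"read_id": "Read1", "modality": "Modality1", "primer_id": "Region2",
     "strand": "pos", "min_len": 10, "max_len": 100},
]

print(check_reads(sequence_spec, library_spec))
```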
Example:
- !Read
  read_id: read_001
  name: Read 1 of Sample A
  modality: rna
  primer_id: primer_25
  min_len: 50
  max_len: 300
  strand: pos
  files:
    - !File
      file_id: read_001.fastq.gz
      ...

Files are annotated with the File object. Files can be local or remote and of various types (e.g., FASTQ, BAM, POD5, TXT, SRA). File objects contain the following attributes:
- `file_id`: a freeform string that uniquely identifies the file.
- `filename`: a freeform string that matches the name of the file being annotated.
- `filesize`: an integer that represents the size of the compressed file (in bytes).
- `filetype`: a freeform string that specifies the file type (usually the extension of the `filename`, e.g. R1.fastq.gz has `filetype: fastq`).
- `url`: a freeform string that specifies either the url location of the file, or the local path of the file (relative to this seqspec file).
- `urltype`: one of ["local", "ftp", "http", "https"]; specifies the type of the `url`.
- `md5`: the md5sum of the uncompressed file in `filename`, must match the pattern `^[a-f0-9]{32}$`

File objects are used in the Onlist object within "onlist" Regions. They are also used in the Read objects as a list of File objects.
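
The constraints on `md5` and `urltype` are simple enough to check directly. The following is a rough sketch, assuming a File is available as a plain dict; `check_file` and `md5_of_bytes` are hypothetical helpers rather than seqspec functions, and `md5_of_bytes` only shows how a digest in the expected format can be produced with the standard library.

```python
import hashlib
import re

# Pattern and allowed values are taken from the File attribute list.
MD5_PATTERN = re.compile(r"^[a-f0-9]{32}$")
URL_TYPES = {"local", "ftp", "http", "https"}


def check_file(file: dict) -> list:
    """Return a list of problems found in a File dict (empty if none)."""
    problems = []
    if not MD5_PATTERN.fullmatch(str(file.get("md5") or "")):
        problems.append("md5 must be 32 lowercase hexadecimal characters")
    if file.get("urltype") not in URL_TYPES:
        problems.append("urltype must be one of local, ftp, http, https")
    if not isinstance(file.get("filesize"), int):
        problems.append("filesize must be an integer")
    return problems


def md5_of_bytes(data: bytes) -> str:
    """Hex digest of already-uncompressed content."""
    return hashlib.md5(data).hexdigest()


# Values copied from the Onlist entry in the Region example.
onlist = {
    "file_id": "barcode-1_onlist.txt",
    "filename": "barcode-1_onlist.txt",
    "filetype": "txt",
    "filesize": 120,
    "url": "./",
    "urltype": "local",
    "md5": "5b62453df2771f5aa856f78797f16591",
}

print(check_file(onlist))
print(check_file(dict(onlist, urltype="s3")))
```

For a gzipped FASTQ, the content would need to be decompressed before hashing, since the `md5` field refers to the uncompressed file.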
:::{important}
The order of the File objects within the Read objects is extremely important. If you have sets of FASTQ files that are paired by lane, then they must be ordered in the same way within each Read object.
The following illustrates a "correct" ordering (note that the reads are paired by lane):
- !Read
  read_id: Read 1
  ...
  files:
    - !File
      file_id: R1_L001.fastq.gz
      ...
    - !File
      file_id: R1_L002.fastq.gz
      ...
- !Read
  read_id: Read 2
  ...
  files:
    - !File
      file_id: R2_L001.fastq.gz
      ...
    - !File
      file_id: R2_L002.fastq.gz
      ...

And an "incorrect" ordering:
- !Read
  read_id: Read 1
  ...
  files:
    - !File
      file_id: R1_L001.fastq.gz # <-- incorrect
      ...
    - !File
      file_id: R1_L002.fastq.gz # <-- incorrect
      ...
- !Read
  read_id: Read 2
  ...
  files:
    - !File
      file_id: R2_L002.fastq.gz # <-- incorrect
      ...
    - !File
      file_id: R2_L001.fastq.gz # <-- incorrect
      ...
:::
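
The lane-ordering requirement can be checked by comparing the sequence of lanes across Reads. This is a minimal sketch under a loud assumption: lanes are encoded in `file_id` as `_L` followed by digits, as in the examples above, which real file names may not follow. `lane_order` and `files_consistently_ordered` are hypothetical helpers, not seqspec functions.

```python
import re

# Assumption: the lane appears in file_id as "_L<digits>" (e.g. R1_L001).
LANE_PATTERN = re.compile(r"_L(\d+)")


def lane_order(read: dict) -> list:
    """Return the lane token of each file, in the order the files are listed."""
    lanes = []
    for file in read["files"]:
        match = LANE_PATTERN.search(file["file_id"])
        lanes.append(match.group(1) if match else None)
    return lanes


def files_consistently_ordered(reads: list) -> bool:
    """True when every read lists its files in the same lane order."""
    orders = [lane_order(read) for read in reads]
    return all(order == orders[0] for order in orders)


correct = [
    {"read_id": "Read 1", "files": [{"file_id": "R1_L001.fastq.gz"},
                                    {"file_id": "R1_L002.fastq.gz"}]},
    {"read_id": "Read 2", "files": [{"file_id": "R2_L001.fastq.gz"},
                                    {"file_id": "R2_L002.fastq.gz"}]},
]
incorrect = [
    correct[0],
    {"read_id": "Read 2", "files": [{"file_id": "R2_L002.fastq.gz"},
                                    {"file_id": "R2_L001.fastq.gz"}]},
]

print(files_consistently_ordered(correct))
print(files_consistently_ordered(incorrect))
```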
seqspec files can be loaded into Python as Python objects. Manipulation becomes straightforward with dot notation:
from seqspec.utils import load_spec
spec = load_spec("seqspec/assays/10x-RNA-v3/spec.yaml")
print(spec.get_libspec("RNA").sequence)
# AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNXAGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
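
A sequence string like the one printed above can be summarized by run type, since onlist regions are written as runs of `N` and random regions as runs of `X`. The sketch below is a heuristic, not a seqspec function: `segment` is a hypothetical helper, it works on a plain string, and it would misclassify a fixed sequence that legitimately contains the IUPAC code `N`.

```python
from itertools import groupby


def classify(base: str) -> str:
    """Map a single character to the kind of region it most likely belongs to."""
    return {"N": "onlist", "X": "random"}.get(base, "fixed")


def segment(sequence: str) -> list:
    """Collapse a sequence into (kind, run_length) pairs, 5' -> 3'."""
    return [(kind, len(list(run))) for kind, run in groupby(sequence, key=classify)]


# Short made-up sequence, for illustration only.
print(segment("ACGTNNNNXXAC"))
```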