A program for viewing QC statistics of .fasta and .bam files.
TQC is currently running on CLIMB, uploading new data every day. It has QC data going back to the published_date of 2022-03-02.
The tqc-client can retrieve data from TQC with the use of the get command, which will be detailed further below.
$ git clone https://github.com/CLIMB-COVID/tqc-client.git
$ conda env create -f environment.yml
$ conda activate tqc-client
The environment variables needed to use TQC are:
TQC_IP: this is the hostname for TQC.TQC_PORT: this is the port number for TQC.
These can be provided by @tombch.
We are now ready to use tqc-client.
Data cannot be added to TQC by users. It can only be retrieved, using get, which will be detailed further below.
To run TQC and look at its options for retrieving data:
$ python tqc.py get --help
To run tqc-client from any directory, this function can be put into your .bashrc file with <PATH-TO-DIR> replaced by the directory for tqc-client.
function tqc() {
python "<PATH-TO-TQC-DIR>/tqc.py" "$@"
}
export -f tqc
central_sample_id: this field is required in all TQC data.run_name: this field is required in all TQC data.pag_name: this field is required in all TQC data.sequencing_org_codefoel_producerphe_private_providerphe_sitecollection_datereceived_datesequencing_org_received_datesequencing_submission_datepublished_datefasta_path: path to the.fastafile on CLIMB.bam_path: path to the.bamfile on CLIMB.library_primers: formatted information on library primers.library_primers_reported: information on library primers, before formatting.pag_suppressed: whether the PAG has been suppressed or not.- Each entry in this field is either
VALIDorSUPPRESSED. - NOTE: At the moment, TQC is not regularly reupdating its data.
- This means that PAGs that have been recently suppressed may still be marked as 'VALID' in TQC.
- I am working on implementing regular scans for changed metadata!
- Each entry in this field is either
pag_basic_qc: whether the PAG passed Majora basic QC.
num_bases: number of characters in the sequence.pc_acgt: the percentage of the charactersA,C,GandTin the sequence.pc_masked: the percentage of the charactersNandXin the sequence.pc_ambiguous: the percentage of the charactersW,S,M,K,R,Y,B,D,HandVin the sequence.pc_invalid: the percentage of all other characters in the sequence that are not ACGT, masked or ambiguous.longest_gap: the length of the longest subsequence of masked and/or invalid bases in the sequence.longest_ungap: the length of the longest subsequence of ACGT and/or ambiguous bases in the sequence.
num_pos: number of positions (the number of bases in the reference genome).mean_cov: mean coverage - this is the sum of coverages, divided bynum_pos.pc_pos_cov_gteX: percentage of positions with coverage greater than or equal toX.- there are fields for
X = 1,X = 5,X = 10,X = 20,X = 50,X = 100, andX = 200.
- there are fields for
If library_primers information was given (that could be mapped to a .bed file):
pc_tiles_medcov_gteX: the percentage of tiles with a median coverage greater than or equal toX.- there are fields for
X = 1,X = 5,X = 10,X = 20,X = 50,X = 100, andX = 200.
- there are fields for
tile_n: the number of tiles.tile_vector: vector containing the median coverage for each tile.
Data is retrieved and written to stdout as a tab-separated table using the get command of TQC.
The TQC database can be filtered by various arguments provided after get.
Passing multiple arguments to TQC will return only the rows that match all the constraints from the arguments.
For example, the command:
$ python tqc.py --published-date 2022-03-24 --library-primers 3
will return all rows that satisfy the following condition:
(published_date = 2022-03-22) AND (library_primers = 3)
Some arguments can be passed any number of values, and this will return rows that match any of the values within the argument.
For example, the command:
$ python tqc.py --published-date 2022-03-22 --library-primers 2 3 4
will return all rows that satisfy the following condition:
(published_date = 2022-03-22) AND ( (library_primers = 2) OR (library_primers = 3) OR (library_primers = 4) )
and the command:
$ python tqc.py --sequencing-org-code ABCD --published-date 2022-03-22 2022-03-24 --library-primers 2 3 4
will return all rows that satisfy the following condition:
(sequencing_org_code = ABCD) AND ( (published_date = 2022-03-22) OR (published_date = 2022-03-24) ) AND ( (library_primers = 2) OR (library_primers = 3) OR (library_primers = 4) )
By default, TQC returns data only on unsuppressed PAGs that passed basic QC. The types of PAGs returned can be explicitly determined by passing VALID and/or SUPPRESSED to the --pag-suppressed argument, and passing PASS and/or FAIL to the --pag-basic-qc argument.
To return all PAGs, add --all to the command. For example:
$ python tqc.py --all
will return everything.
String fields with empty cells can be matched by passing _ to an argument, in place of a value.
For example, the command:
$ python tqc.py --published-date 2022-03-22 --library-primers 3 4 _
will return all rows that satisfy the following condition:
(published_date = 2022-03-22) AND ( (library_primers = 3) OR (library_primers = 4) OR (library_primers = EMPTY) )
where EMPTY denotes a cell in the library_primers column with the empty string as its value.
Some fields store data as integers or floats, and these can be filtered using various operators. These operators are:
lt: less thangt: greater thanleq: less than or equalgeq: greater than or equaleq: equalneq: not equal
To see which fields can be filtered using these operators, see python tqc.py --help.
For example, the command:
$ python tqc.py --published-date 2022-03-22 --pc-acgt leq 90 --pc-ambiguous gt 0
will return all rows that satisfy the following condition:
(published_date = 2022-03-22) AND (pc_acgt <= 90) AND (pc_ambiguous > 0)
The same numeric field can be passed more than once, unlike non-numeric fields.
For example, the command:
$ python tqc.py --pc-acgt geq 70 --pc-acgt leq 90 --pc-ambiguous gt 0
will return all rows that satisfy the following condition:
(70 <= pc_acgt <= 90) AND (pc_ambiguous > 0)
There are five fields that store date information in TQC: collection_date, received_date, sequencing_org_received_date, sequencing_submission_date and published_date.
Rows can be matched by any number of individual dates:
$ python tqc.py --published-date 2022-03-02 2022-03-05 2022-03-06
this will return all rows that satisfy the following condition:
(published_date = 2022-03-02) OR (published_date = 2022-03-05) OR (published_date = 2022-03-06)
or between two dates:
$ python tqc.py --published-date-range 2022-03-02 2022-03-09
this will return all rows that satisfy the following condition:
2022-03-02 <= published_date <= 2022-03-09
or by any number of individual ISO weeks (given in a YYYY-WW format):
$ python tqc.py --published-iso-week 2022-01 2022-04 2022-05
this will return all rows that satisfy the following condition:
(published_iso_week = 2022-01) OR (published_iso_week = 2022-04) OR (published_iso_week = 2022-05)
or between two ISO weeks:
$ python tqc.py --published-iso-week-range 2022-01 2022-05
this will return all rows that satisfy the following condition:
2022-01 <= published_iso_week <= 2022-05
The data returned by TQC can also be merged with additional metadata by giving the path to a .tsv file to the --metadata argument. This will display a left join between the TQC data and the given metadata table.
For reference.
$ python -m client.python tqc.py --help
usage: tqc.py get [-h] [--central-sample-id CENTRAL_SAMPLE_ID [CENTRAL_SAMPLE_ID ...]]
[--run-name RUN_NAME [RUN_NAME ...]] [--pag-name PAG_NAME [PAG_NAME ...]]
[--pag-suppressed PAG_SUPPRESSED [PAG_SUPPRESSED ...]]
[--pag-basic-qc PAG_BASIC_QC [PAG_BASIC_QC ...]] [--all]
[--sequencing-org-code SEQUENCING_ORG_CODE [SEQUENCING_ORG_CODE ...]]
[--foel-producer FOEL_PRODUCER [FOEL_PRODUCER ...]]
[--phe-private-provider PHE_PRIVATE_PROVIDER [PHE_PRIVATE_PROVIDER ...]]
[--phe-site PHE_SITE [PHE_SITE ...]] [--collection-pillar COLLECTION_PILLAR [COLLECTION_PILLAR ...]]
[--collection-date YYYY-MM-DD [YYYY-MM-DD ...] | --collection-date-range YYYY-MM-DD YYYY-MM-DD |
--collection-iso-week YYYY-WW [YYYY-WW ...] | --collection-iso-week-range YYYY-WW YYYY-WW]
[--received-date YYYY-MM-DD [YYYY-MM-DD ...] | --received-date-range YYYY-MM-DD YYYY-MM-DD |
--received-iso-week YYYY-WW [YYYY-WW ...] | --received-iso-week-range YYYY-WW YYYY-WW]
[--sequencing-org-received-date YYYY-MM-DD [YYYY-MM-DD ...] | --sequencing-org-received-date-range
YYYY-MM-DD YYYY-MM-DD | --sequencing-org-received-iso-week YYYY-WW [YYYY-WW ...] |
--sequencing-org-received-iso-week-range YYYY-WW YYYY-WW]
[--sequencing-submission-date YYYY-MM-DD [YYYY-MM-DD ...] | --sequencing-submission-date-range
YYYY-MM-DD YYYY-MM-DD | --sequencing-submission-iso-week YYYY-WW [YYYY-WW ...] |
--sequencing-submission-iso-week-range YYYY-WW YYYY-WW] [--published-date YYYY-MM-DD [YYYY-MM-DD ...]
| --published-date-range YYYY-MM-DD YYYY-MM-DD | --published-iso-week YYYY-WW [YYYY-WW ...] |
--published-iso-week-range YYYY-WW YYYY-WW] [--fasta-path FASTA_PATH [FASTA_PATH ...]]
[--bam-path BAM_PATH [BAM_PATH ...]] [--library-primers LIBRARY_PRIMERS [LIBRARY_PRIMERS ...]]
[--library-primers-reported LIBRARY_PRIMERS_REPORTED [LIBRARY_PRIMERS_REPORTED ...]]
[--num-bases OPERATOR VALUE] [--pc-acgt OPERATOR VALUE] [--pc-masked OPERATOR VALUE]
[--pc-invalid OPERATOR VALUE] [--pc-ambiguous OPERATOR VALUE] [--longest-gap OPERATOR VALUE]
[--longest-ungap OPERATOR VALUE] [--num-pos OPERATOR VALUE] [--mean-cov OPERATOR VALUE]
[--pc-pos-cov-gte1 OPERATOR VALUE] [--pc-pos-cov-gte5 OPERATOR VALUE]
[--pc-pos-cov-gte10 OPERATOR VALUE] [--pc-pos-cov-gte20 OPERATOR VALUE]
[--pc-pos-cov-gte50 OPERATOR VALUE] [--pc-pos-cov-gte100 OPERATOR VALUE]
[--pc-pos-cov-gte200 OPERATOR VALUE] [--pc-tiles-medcov-gte1 OPERATOR VALUE]
[--pc-tiles-medcov-gte5 OPERATOR VALUE] [--pc-tiles-medcov-gte10 OPERATOR VALUE]
[--pc-tiles-medcov-gte20 OPERATOR VALUE] [--pc-tiles-medcov-gte50 OPERATOR VALUE]
[--pc-tiles-medcov-gte100 OPERATOR VALUE] [--pc-tiles-medcov-gte200 OPERATOR VALUE]
[--tile-n OPERATOR VALUE] [--metadata TSV_PATH] [--host HOST] [--port PORT]
operators: lt, gt, leq, geq, eq, neq
options:
-h, --help show this help message and exit
--central-sample-id CENTRAL_SAMPLE_ID [CENTRAL_SAMPLE_ID ...]
--run-name RUN_NAME [RUN_NAME ...]
--pag-name PAG_NAME [PAG_NAME ...]
--pag-suppressed PAG_SUPPRESSED [PAG_SUPPRESSED ...]
Default: valid PAGs only
--pag-basic-qc PAG_BASIC_QC [PAG_BASIC_QC ...]
Default: passed PAGs only
--all Ignore defaults regarding PAG suppression and basic QC passing
--sequencing-org-code SEQUENCING_ORG_CODE [SEQUENCING_ORG_CODE ...]
--foel-producer FOEL_PRODUCER [FOEL_PRODUCER ...]
--phe-private-provider PHE_PRIVATE_PROVIDER [PHE_PRIVATE_PROVIDER ...]
--phe-site PHE_SITE [PHE_SITE ...]
--collection-pillar COLLECTION_PILLAR [COLLECTION_PILLAR ...]
--collection-date YYYY-MM-DD [YYYY-MM-DD ...]
--collection-date-range YYYY-MM-DD YYYY-MM-DD
--collection-iso-week YYYY-WW [YYYY-WW ...]
--collection-iso-week-range YYYY-WW YYYY-WW
--received-date YYYY-MM-DD [YYYY-MM-DD ...]
--received-date-range YYYY-MM-DD YYYY-MM-DD
--received-iso-week YYYY-WW [YYYY-WW ...]
--received-iso-week-range YYYY-WW YYYY-WW
--sequencing-org-received-date YYYY-MM-DD [YYYY-MM-DD ...]
--sequencing-org-received-date-range YYYY-MM-DD YYYY-MM-DD
--sequencing-org-received-iso-week YYYY-WW [YYYY-WW ...]
--sequencing-org-received-iso-week-range YYYY-WW YYYY-WW
--sequencing-submission-date YYYY-MM-DD [YYYY-MM-DD ...]
--sequencing-submission-date-range YYYY-MM-DD YYYY-MM-DD
--sequencing-submission-iso-week YYYY-WW [YYYY-WW ...]
--sequencing-submission-iso-week-range YYYY-WW YYYY-WW
--published-date YYYY-MM-DD [YYYY-MM-DD ...]
--published-date-range YYYY-MM-DD YYYY-MM-DD
--published-iso-week YYYY-WW [YYYY-WW ...]
--published-iso-week-range YYYY-WW YYYY-WW
--fasta-path FASTA_PATH [FASTA_PATH ...]
--bam-path BAM_PATH [BAM_PATH ...]
--library-primers LIBRARY_PRIMERS [LIBRARY_PRIMERS ...]
--library-primers-reported LIBRARY_PRIMERS_REPORTED [LIBRARY_PRIMERS_REPORTED ...]
--num-bases OPERATOR VALUE
--pc-acgt OPERATOR VALUE
--pc-masked OPERATOR VALUE
--pc-invalid OPERATOR VALUE
--pc-ambiguous OPERATOR VALUE
--longest-gap OPERATOR VALUE
--longest-ungap OPERATOR VALUE
--num-pos OPERATOR VALUE
--mean-cov OPERATOR VALUE
--pc-pos-cov-gte1 OPERATOR VALUE
--pc-pos-cov-gte5 OPERATOR VALUE
--pc-pos-cov-gte10 OPERATOR VALUE
--pc-pos-cov-gte20 OPERATOR VALUE
--pc-pos-cov-gte50 OPERATOR VALUE
--pc-pos-cov-gte100 OPERATOR VALUE
--pc-pos-cov-gte200 OPERATOR VALUE
--pc-tiles-medcov-gte1 OPERATOR VALUE
--pc-tiles-medcov-gte5 OPERATOR VALUE
--pc-tiles-medcov-gte10 OPERATOR VALUE
--pc-tiles-medcov-gte20 OPERATOR VALUE
--pc-tiles-medcov-gte50 OPERATOR VALUE
--pc-tiles-medcov-gte100 OPERATOR VALUE
--pc-tiles-medcov-gte200 OPERATOR VALUE
--tile-n OPERATOR VALUE
--metadata TSV_PATH
--host HOST
--port PORT