To help analyze the spread and evolution of the virus, this repo collated and analyzed data from GISAID and GenBank on the viral genome, its variations, and their locations in time and space. Information from the Wikipedia web page and published research papers was categorized and mined to extract epidemiological data, which were then integrated with the public dataset. Genomic and epidemiological data were matched with public information, and data quality was verified by manual curation.
This repo contains the code used in the article "Linking genomic and epidemiologic information to advance the study of COVID-19". For further information, please refer to the publication.
This script is used for quality control of the epidemiological data.
Some input files can be downloaded; the others that support the findings of this study are available from the corresponding author upon reasonable request.
This script is used for quality control of the genome data.
All raw input data that support the findings of this study are available from the corresponding author upon reasonable request.
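This README does not describe which checks the genome quality control applies, so the following is only a minimal sketch of what such a filter could look like, not the repo's actual method. It assumes FASTA input and two illustrative criteria, a minimum sequence length and a maximum fraction of ambiguous `N` bases. The function names, the thresholds, and the toy sequences are all hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def passes_qc(seq, min_length, max_n_fraction):
    """Return True if seq is long enough and has few enough ambiguous bases.

    Both criteria are assumptions made for illustration only.
    """
    if not seq or len(seq) < min_length:
        return False
    return seq.count("N") / len(seq) <= max_n_fraction


# Toy input, made up for this example.
fasta_text = ">a\nACGTACGTAC\n>b\nACNNNNNNNN\n>c\nACG\n"
sequences = parse_fasta(fasta_text)
kept = [
    name
    for name, seq in sequences.items()
    if passes_qc(seq, min_length=5, max_n_fraction=0.2)
]
print(kept)
```

For real genomes the thresholds would be far larger than in this toy input, and should be taken from the publication rather than from this sketch.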
This script is used to match the epidemiological data with the genome data.
Both input files are generated by the first two steps (case.py and genome.py).
One manually curated file (curated_case_genome.tsv), which supports the findings of this study, is available from the corresponding author upon reasonable request.
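The matching step combines two generated tables with a manually curated one. The sketch below illustrates one way such a join could work; it is not the repo's actual logic. It assumes that curated pairs take priority and that remaining cases fall back to an exact match on location and date. All column names (`case_id`, `genome_id`, `location`, `date`) and the sample rows are hypothetical, since the real file layouts are not documented here.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def match_cases_to_genomes(cases, genomes, curated):
    """Pair case rows with genome rows.

    Curated pairs are used first; otherwise a case is matched to a genome
    with the same (location, date). Column names are assumptions. If several
    genomes share a (location, date) key, the last one read is used.
    """
    genome_by_id = {g["genome_id"]: g for g in genomes}
    genome_by_key = {(g["location"], g["date"]): g for g in genomes}
    curated_map = {c["case_id"]: c["genome_id"] for c in curated}

    matched = []
    for case in cases:
        genome = None
        if case["case_id"] in curated_map:
            genome = genome_by_id.get(curated_map[case["case_id"]])
        if genome is None:
            genome = genome_by_key.get((case["location"], case["date"]))
        if genome is not None:
            matched.append((case["case_id"], genome["genome_id"]))
    return matched


# Toy tables, made up for this example.
cases = read_tsv(
    "case_id\tlocation\tdate\n"
    "C1\tCityA\t2020-01-05\n"
    "C2\tCityB\t2020-01-10\n"
    "C3\tCityC\t2020-01-20\n"
)
genomes = read_tsv(
    "genome_id\tlocation\tdate\n"
    "G1\tCityA\t2020-01-05\n"
    "G2\tCityB\t2020-01-11\n"
)
curated = read_tsv("case_id\tgenome_id\nC2\tG2\n")

pairs = match_cases_to_genomes(cases, genomes, curated)
print(pairs)
```

In this toy data, `C1` matches `G1` automatically, `C2` is linked to `G2` only through the curated table because the dates differ by one day, and `C3` stays unmatched.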
This script is used to collect mutation information and perform quality control on it.
Every plot contains exactly 6 lines, which may overlap with each other.