Fixup: Add date annotations for rare genotypes #38
Open
kimandrews wants to merge 1 commit into main
joverlee521
reviewed
Jun 10, 2024
genehack
approved these changes
Jun 10, 2024
Six of the samples that are force-included in the Nextclade dataset tree have empty collection date fields in the metadata output from NCBI Datasets. This results in the samples being removed downstream by the TreeTime clock filter. This commit adds collection dates (which were manually extracted from the strain names in the NCBI metadata) for these samples so that they will be included in the Nextclade dataset tree.
cd15009 to
7c2776d
genehack
approved these changes
Jun 14, 2024
#
# Strains with rare genotypes
# Dates are retrieved from epi-weeks reported within strain names on NCBI
# Dates are defined as the first day of the epi-week
Contributor
non-blocking
"first day" is somewhat ambiguous — could be Sunday, could be Monday… Better be explicit.
Suggested change
- # Dates are defined as the first day of the epi-week
+ # Dates are defined as the Monday of the epi-week
Contributor
Author
I agree this needs to be more explicit. There are many different definitions of epi-weeks, so the most precise wording for what I did would be "Dates are defined as the first day of the ISO epi-week, which is always a Monday". I can add this to the annotations.tsv file. It may also be worth discussing whether there is a better approach for defining dates from epi-weeks reported in measles strain names. I started a discussion about this in Slack.
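For reference, the "first day of the ISO epi-week" rule can be sketched in Python as below. This is only an illustration of the convention described in the comment, not the procedure used in this PR (the dates here were extracted manually); the function name `iso_week_start` and the example week are hypothetical.

```python
from datetime import date


def iso_week_start(iso_year: int, iso_week: int) -> date:
    """Return the Monday that begins the given ISO 8601 week."""
    # date.fromisocalendar(year, week, day) numbers days 1-7, with 1 = Monday.
    # Requires Python 3.8 or later.
    return date.fromisocalendar(iso_year, iso_week, 1)


if __name__ == "__main__":
    # Hypothetical input: ISO week 12 of 2019.
    start = iso_week_start(2019, 12)
    print(start.isoformat(), start.strftime("%A"))
```

Note that other epi-week definitions start the week on a Sunday, so the same week number can map to a different date under a different convention.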
Description of proposed changes
This PR adds collection dates to the ingest metadata output for six samples.
These samples were force-included in the Nextclade dataset tree to increase the representation of rare genotypes in the tree. However, these samples have empty date fields in the metadata output from NCBI Datasets. This results in the samples being removed by the TreeTime clock filter.
Fortunately, the NCBI metadata includes strain names for these six samples, and the collection dates can be extracted from the strain names.
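A rough sketch of how an epi-week could be read from a strain name is shown below. The PR does not show the strain names themselves, so the layout is an assumption: a slash-delimited field of the form `<week>.<two-digit year>`, as in WHO-style measles names. The function name `parse_epi_week`, the century `pivot`, and the example strain are all hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed layout: a slash-delimited field "<week>.<two-digit year>", e.g. ".../7.12/...".
WEEK_YEAR = re.compile(r"/(\d{1,2})\.(\d{2})(?:/|$)")


def parse_epi_week(strain: str, pivot: int = 50) -> Optional[Tuple[int, int]]:
    """Return (year, week) parsed from a strain name, or None if no valid field is found."""
    match = WEEK_YEAR.search(strain)
    if match is None:
        return None
    week, yy = int(match.group(1)), int(match.group(2))
    if not 1 <= week <= 53:
        return None
    # Assumption: two-digit years at or above the pivot belong to the 1900s.
    year = 1900 + yy if yy >= pivot else 2000 + yy
    return year, week


if __name__ == "__main__":
    # Made-up strain name used only to exercise the parser.
    print(parse_epi_week("MVs/ExampleCity.XXX/7.12/"))
```

The two-digit year is inherently ambiguous, which is one reason a manual check of each extracted date remains sensible for a small set of samples like this one.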
This PR adds the collection dates (which were extracted manually from the strain names) for the six samples to ingest/defaults/annotations.tsv, which results in collection dates being included in the ingest metadata output, and also results in the samples being included by TreeTime in the Nextclade dataset tree.
Related issue(s)
#28
Checklist