Implement sample variant annotation dataflow pipeline #37
@deflaux let me know if you'd prefer I fork into my own options at this point. Not sure how much we want to jam into this object.
I think this is a fine place for it for now.
4dded6d to 6798954
To make sure I'm understanding this correctly, are these assumptions true?
- By default, this pipeline will yield an output record for every alternate allele in 1,000 Genomes within BRCA1 that is a SNP and has an effect other than synonymous.
- For 1,000 Genomes, restricting to sample HG00261 has no bearing on the output of this pipeline, since all samples have calls for all variants (and we are also not retrieving or looking at the genotype within the call).
- If we change the job parameters to run on Platinum Genomes and a callSetId within it, we will only annotate the variants that the specified callSetId has.
That's correct. People typically run a variant annotation program on a single VCF, so I think the behavior is reasonably well aligned with a user's expectations.
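The confirmed behavior can be illustrated with a minimal sketch in plain Python. It is not the pipeline's actual code: the `Variant` record, the `effect_of` lookup, and the `"SYNONYMOUS"` label are hypothetical stand-ins, and skipping alleles with no known effect is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified stand-in for a variant record."""
    name: str
    reference_bases: str
    alternate_bases: Tuple[str, ...]
    call_set_ids: FrozenSet[str]  # call sets that have a call for this variant


def is_snp(ref: str, alt: str) -> bool:
    # A SNP replaces exactly one base with a different single base.
    return len(ref) == 1 and len(alt) == 1 and ref != alt


def alleles_to_annotate(
    variants: Iterable[Variant],
    effect_of: Callable[[Variant, str], Optional[str]],
    call_set_id: Optional[str] = None,
) -> Iterator[Tuple[str, str, str]]:
    """Yield (variant name, alternate allele, effect) for each reported allele."""
    for variant in variants:
        # Restricting to a call set only matters when that call set lacks
        # a call for the variant; genotypes are never inspected.
        if call_set_id is not None and call_set_id not in variant.call_set_ids:
            continue
        for alt in variant.alternate_bases:
            if not is_snp(variant.reference_bases, alt):
                continue
            effect = effect_of(variant, alt)
            # Assumption: alleles with no known effect are skipped.
            if effect is not None and effect != "SYNONYMOUS":
                yield (variant.name, alt, effect)
```

When every call set has a call for every variant, as described for 1,000 Genomes, passing `call_set_id` does not change the output; it only narrows the result when the chosen call set is missing calls for some variants.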
CH is right, but you still want to keep track of what you're annotating, since metadata remains important if you combine datasets or compare them. If you can cache them, that will save you time later on.
This looks good to me - merge it at your convenience.
Is this necessary? Why not convert the list of Contigs to a PCollection directly?
64aa02f to 6b46e28
Rebased, made some performance changes, and added some timing information. The end result is that it currently works well on small regions but performs quite poorly on whole variant sets because of SearchVariants throughput. This should improve over time.
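The comment does not say how the timing information is collected. As a rough illustration only, per-stage wall-clock totals could be accumulated as below; the `timed` helper and the stage label are hypothetical and not taken from the pipeline.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, totals: Dict[str, float]) -> Iterator[None]:
    """Add the wall-clock time spent inside the block to totals[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Accumulate so repeated calls under one label sum up.
        totals[label] = totals.get(label, 0.0) + (time.perf_counter() - start)


totals: Dict[str, float] = {}
with timed("search_variants", totals):
    # Placeholder for the slow step, e.g. paging through variant results.
    sum(range(1000))
```

Comparing the totals for each stage on a small region against a whole variant set would show whether the variant search step dominates the run time.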
Implement sample variant annotation dataflow pipeline
Nice sample, CH!
…tation Implement sample variant annotation dataflow pipeline
See various caveats and disclaimers in comments: this is a limited sample application.
One thing that may need revision is the output; right now it's really only human-readable (at best). Open to suggestions on a better output format.