The data.Genotypes class only really supports bi-allelic, phased variants. (There is some limited support for unphased genotypes, but it isn't really something we care too much about at the moment.)
I've designed the class so that it could easily be extended to support multi-allelic variants, so it would be great if we could add formal support for tandem repeats (TRs). Probably the best way to do this would be to use the existing tr_harmonizer library in trtools.
A potential implementation strategy
We could extend (sub-class) the data.Genotypes class to create a new data.GenotypesTR class.
- First, we would just need to override the
data.Genotypes._variant_arr() method, which takes a cyvcf2.Variant instance as input and creates a np array containing ID, CHROM, POS, REF, and ALT fields. Instead, we could convert the cyvcf2.Variant instance to a trtools.utils.tr_harmonizer.TRRecord before calling that method. And within the method, we can store the alleles in a new column with dtype "object" in the np array, where the object is just a tuple of allele strings containing REF, ALT1, ALT2, etc.
- Next, we would have to somehow inherit the way that the
data.Genotypes.data property is filled. Basically, we would want to store STR lengths instead of 0s and 1s here. Our plan is to do this conversion after we call the parent function. We might also be able to call tr_harmonizer's GetLengthGenotypes() method to get the lengths.
- Lastly, we should copy all of our test cases and make sure all of the other methods in the
data.Genotypes class still work in data.GenotypesTR
- We might need to override or disable the
check_biallelic and check_maf() methods
Additional methods
If we ever want to merge TRs and SNPs into one data.Genotypes instance, we might want to develop a merge_variants() method which can merge one instance with another. Assuming both are sorted, we might want to develop the method so that it keeps the records sorted in the end.
PLINK2 PGEN support
PLINK2 PGEN files support multi-allelic variants, but the pgenlib python library responsible for reading and writing PGEN files does not yet support them. Quoting from the documentation,
Multiallelic variants aren't fully supported yet. Instead, all ALT alleles are effectively collapsed into one.
Once support for multi-allelic variants is added to the pgenlib python library, support for reading and writing TRs to PGEN files should come with it, essentially automatically. We would just need to copy (and slightly adapt) our code from the data.GenotypesTR._variant_arr() method. We might also need to create an equivalent method to do the reverse, too.
Compatability
In theory, once we've made these changes, our existing code should work out-of-the-box with the new data.GenotypesTR class because they utilize the same interface. To confirm, we can copy our tests and just change the input.
The
data.Genotypesclass only really supports bi-allelic, phased variants. (There is some limited support for unphased genotypes, but it isn't really something we care too much about at the moment.)I've designed the class so that it could easily be extended to support multi-allelic variants, so it would be great if we could add formal support for tandem repeats (TRs). Probably the best way to do this would be to use the existing
tr_harmonizerlibrary intrtools.A potential implementation strategy
We could extend (sub-class) the
data.Genotypesclass to create a newdata.GenotypesTRclass.data.Genotypes._variant_arr()method, which takes acyvcf2.Variantinstance as input and creates a np array containing ID, CHROM, POS, REF, and ALT fields. Instead, we could convert thecyvcf2.Variantinstance to atrtools.utils.tr_harmonizer.TRRecordbefore calling that method. And within the method, we can store the alleles in a new column with dtype "object" in the np array, where the object is just a tuple of allele strings containing REF, ALT1, ALT2, etc.data.Genotypes.dataproperty is filled. Basically, we would want to store STR lengths instead of 0s and 1s here. Our plan is to do this conversion after we call the parent function. We might also be able to calltr_harmonizer'sGetLengthGenotypes()method to get the lengths.data.Genotypesclass still work indata.GenotypesTRcheck_biallelicandcheck_maf()methodsAdditional methods
If we ever want to merge TRs and SNPs into one
data.Genotypesinstance, we might want to develop amerge_variants()method which can merge one instance with another. Assuming both are sorted, we might want to develop the method so that it keeps the records sorted in the end.PLINK2 PGEN support
PLINK2 PGEN files support multi-allelic variants, but the
pgenlibpython library responsible for reading and writing PGEN files does not yet support them. Quoting from the documentation,Once support for multi-allelic variants is added to the
pgenlibpython library, support for reading and writing TRs to PGEN files should come with it, essentially automatically. We would just need to copy (and slightly adapt) our code from thedata.GenotypesTR._variant_arr()method. We might also need to create an equivalent method to do the reverse, too.Compatability
In theory, once we've made these changes, our existing code should work out-of-the-box with the new
data.GenotypesTRclass because they utilize the same interface. To confirm, we can copy our tests and just change the input.