Skip to content

support for a TR-based GenotypesPLINK class #73

@aryarm

Description

@aryarm

The data.Genotypes class only really supports bi-allelic, phased variants. (There is some limited support for unphased genotypes, but it isn't really something we care too much about at the moment.)

I've designed the class so that it could easily be extended to support multi-allelic variants, so it would be great if we could add formal support for tandem repeats (TRs). Probably the best way to do this would be to use the existing tr_harmonizer library in trtools.

A potential implementation strategy

We could extend (sub-class) the data.Genotypes class to create a new data.GenotypesTR class.

  1. First, we would just need to override the data.Genotypes._variant_arr() method, which takes a cyvcf2.Variant instance as input and creates a np array containing ID, CHROM, POS, REF, and ALT fields. Instead, we could convert the cyvcf2.Variant instance to a trtools.utils.tr_harmonizer.TRRecord before calling that method. And within the method, we can store the alleles in a new column with dtype "object" in the np array, where the object is just a tuple of allele strings containing REF, ALT1, ALT2, etc.
  2. Next, we would have to somehow inherit the way that the data.Genotypes.data property is filled. Basically, we would want to store STR lengths instead of 0s and 1s here. Our plan is to do this conversion after we call the parent function. We might also be able to call tr_harmonizer's GetLengthGenotypes() method to get the lengths.
  3. Lastly, we should copy all of our test cases and make sure all of the other methods in the data.Genotypes class still work in data.GenotypesTR
  4. We might need to override or disable the check_biallelic and check_maf() methods

Additional methods

If we ever want to merge TRs and SNPs into one data.Genotypes instance, we might want to develop a merge_variants() method which can merge one instance with another. Assuming both are sorted, we might want to develop the method so that it keeps the records sorted in the end.

PLINK2 PGEN support

PLINK2 PGEN files support multi-allelic variants, but the pgenlib python library responsible for reading and writing PGEN files does not yet support them. Quoting from the documentation,

Multiallelic variants aren't fully supported yet. Instead, all ALT alleles are effectively collapsed into one.

Once support for multi-allelic variants is added to the pgenlib python library, support for reading and writing TRs to PGEN files should come with it, essentially automatically. We would just need to copy (and slightly adapt) our code from the data.GenotypesTR._variant_arr() method. We might also need to create an equivalent method to do the reverse, too.

Compatability

In theory, once we've made these changes, our existing code should work out-of-the-box with the new data.GenotypesTR class because they utilize the same interface. To confirm, we can copy our tests and just change the input.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions