Conversation
dunno why I named them like that originally - must not have been thinking
|
@gymreklab, it's ready for you! I would appreciate any comments you might have, but specifically, it would be helpful if you could look at the |
|
I think it is looking good! Some quick comments:
|
Ok, this has been fixed. It now says "If not provided, it will be computed from the sum of the squared effect sizes"
The other version should be implemented now!
I'm thinking of doing this within the haptools-paper repo instead of as part of our tests here, since I'm not sure how to define "similar" in an automated test. The haptools-paper repo is meant for some of the pipeline/analysis/sanity check work. Would that be ok? |
|
@gymreklab, regarding your third point about sanity checks, here's the plot you asked for: |
|
In any case, I think we're ready to merge now! |

Overview
This PR adapts the simphenotype subcommand to work with the new .hap file format.
It adds a new PhenoSimulator class that uses the data.Genotypes, data.Phenotypes, and data.Haplotype classes from the data module to create a PLINK2-compatible .pheno file.
Usage and docs
The simphenotype command docs can be found here.
Usage of the PhenoSimulator class is documented in the API docs here. The class uses a subclass of the data.Haplotype class which is documented here.
I've also made some changes to the other docs. The biggest change is that the file format section is now more similar to PLINK's documentation: each type of input has its own page within the section.
Details
The PhenoSimulator class
The new class is initialized with a data.Genotypes instance containing transformed haplotypes or regular variants. At the time of initialization, it creates an internal data.Phenotypes instance into which it stores any phenotypes that it generates. To create phenotypes, one need only call the data.PhenoSimulator.run() method. Here's its signature.
The run() method generates phenotypes from a list of sim_phenotype.Haplotype objects, where each haplotype is encoded as an independent causal variable in the linear model. The noise term in the model depends on heritability $h^2$, which is an input to the model (with a default of 1). Users can also specify a prevalence for the disease if it should be modeled as case/control.
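As a sketch only (the exact parameterization used by run() is an assumption here, not taken from the code), a linear model of this kind is commonly written as follows, with $\vec{Z_j}$ the standardized genotype vector of haplotype $j$, $\beta_j$ its effect size, and $\vec{\epsilon}$ the noise:

$$\vec{y} = \sum_j \beta_j \vec{Z_j} + \vec{\epsilon}, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad \sigma^2 = \mathrm{Var}\Big[\sum_j \beta_j \vec{Z_j}\Big] \left(\frac{1}{h^2} - 1\right)$$

Under this form, $h^2 = 1$ gives $\sigma^2 = 0$, so the default heritability produces phenotypes with no noise.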
The final phenotypes are returned as a numpy array. They are also stored in the internal phenotypes object for safe-keeping.
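To make the description above concrete, here is a minimal NumPy sketch of this kind of simulation. It is not the PhenoSimulator implementation: the function name simulate_phenotypes, its parameters, the standardization of each column, and the noise-variance formula are all assumptions made for illustration.

```python
import numpy as np


def simulate_phenotypes(genotypes, betas, heritability=1.0, prevalence=None, seed=None):
    """Hypothetical sketch: simulate phenotypes from a samples x haplotypes matrix.

    Assumes every column varies across samples (no monomorphic haplotypes)
    and that 0 < heritability <= 1.
    """
    rng = np.random.default_rng(seed)
    gts = np.asarray(genotypes, dtype=float)
    # Assumption: standardize each haplotype column to mean 0 and variance 1
    gts = (gts - gts.mean(axis=0)) / gts.std(axis=0)
    # Each haplotype contributes an independent additive effect
    genetic = gts @ np.asarray(betas, dtype=float)
    # Assumption: noise variance chosen so that Var(genetic) / Var(pheno) ~ heritability
    noise_var = genetic.var() * (1.0 / heritability - 1.0)
    pheno = genetic + rng.normal(0.0, np.sqrt(noise_var), size=genetic.shape)
    if prevalence is not None:
        # Case/control: label the top `prevalence` fraction of samples as cases
        cutoff = np.quantile(pheno, 1.0 - prevalence)
        return pheno > cutoff
    return pheno


# Example usage with made-up dosages for 6 samples and 2 haplotypes
example_gts = np.array([[0, 1], [1, 1], [2, 0], [1, 2], [0, 0], [2, 1]])
quantitative = simulate_phenotypes(example_gts, betas=[0.5, -0.25])
noisy = simulate_phenotypes(example_gts, betas=[0.5, -0.25], heritability=0.5, seed=0)
```

With a heritability of 1 the noise variance in this sketch is zero, so the output is a deterministic function of the genotypes and effect sizes.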
The data.Genotypes class
The most notable change is that this class now has index() and subset() methods. The subset() method allows for pandas-style subsetting of the entire class by variant and sample IDs. The index() method performs some indexing internally to improve the amortized cost of the subsetting operation. The first time subset() is called, it will index the instance using index(). Subsequent calls to subset() will automatically utilize the stored index.
The data.Haplotypes class
The transform() method of the data.Haplotypes class will now utilize the data.Genotypes.subset() method for improved speed.
The data.Phenotypes and data.Covariates classes
I rewrote much of the data.Phenotypes and data.Covariates classes to have them use PLINK2-compatible .pheno and .covar files. Perhaps most notably, the data.Covariates class is a subclass of data.Phenotypes now, and both classes can store more than one phenotype/covariate. There's also a new data.Phenotypes.append() method for adding another phenotype to an existing data.Phenotypes instance.
Testing
I added the following tests to a new TestSimPhenotype class in tests/test_simphenotype.py:
test_one_hap_zero_noise(): Try the run() method with a single haplotype.
test_one_hap_zero_noise_neg_beta(): Try the run() method with a single haplotype and a negative effect size.
test_two_haps_zero_noise(): Try the run() method twice, each with one haplotype.
test_combined_haps_zero_noise(): Try the run() method with two independent effects from two haplotypes.
test_noise(): The previous test used a heritability value of 1. This test does the same thing, but with decreasing heritability values.
test_case_control(): Perform the previous test but generate case/control phenotypes this time.
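As an illustration of what a noise check could look like, the sketch below simulates one haplotype at several heritability values and reports the fraction of phenotypic variance explained by the genetic component. It is not the code in tests/test_simphenotype.py; the helper name explained_variance, the effect size of 0.3, and the noise-variance formula are assumptions.

```python
import numpy as np


def explained_variance(heritability, n_samples=20000, seed=0):
    """Hypothetical helper: simulate one haplotype and return Var(genetic) / Var(pheno)."""
    rng = np.random.default_rng(seed)
    # Made-up dosages in {0, 1, 2} for a single haplotype
    hap = rng.integers(0, 3, size=n_samples).astype(float)
    # Standardize the haplotype and apply an arbitrary effect size of 0.3
    genetic = 0.3 * (hap - hap.mean()) / hap.std()
    # Assumption: noise variance scales with (1 / h^2 - 1)
    noise_sd = np.sqrt(genetic.var() * (1.0 / heritability - 1.0))
    pheno = genetic + rng.normal(0.0, noise_sd, size=n_samples)
    return genetic.var() / pheno.var()


# The ratio should track the requested heritability as it decreases
for h2 in (1.0, 0.8, 0.5, 0.2):
    print(h2, round(explained_variance(h2), 3))
```

A check like this needs a tolerance, since the ratio only approaches the requested heritability as the number of samples grows.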
I also added tests to the existing TestGenotypes class in tests/test_data.py:
test_subset_genotypes: Test the data.Genotypes.subset() method on various combinations of parameters.
Future work
I still need to verify that the phenotypes we generate look good in a Manhattan plot.
In the future, we don't want to generate transformed haplotypes within the simphenotype command. Instead, we will use the output of the transform command as input to the simphenotype command. And we'll utilize the local ancestry information in the transform command. (Currently, the local ancestry info is just ignored.)
We may want to implement an alternative way to simulate case/control phenotypes in the future. Currently, the user specifies a fraction of samples that should be positive (aka the prevalence parameter), and we just convert the quantitative phenotypes to booleans by thresholding the quantitative phenotypes accordingly. But we might want to consider using a logistic regression model to generate the phenotypes, instead. It's unclear to me how the resulting phenotypes might be different, but one thing to note is that errors in a logistic regression would be modeled via a binomial distribution and currently they are modeled via a normal distribution.
An extension of the Genotypes class could allow us to load STR genotypes using trtools. Phenotype simulation of STRs should come automatically with that change.
There are a lot of improvements I want to make to classes in the data module, as well. See #19 and #49 -- not to mention that I'd like to add full support for PLINK2 files after I merge #16.
Eventually, it might be nice to support interaction terms in the phenotype simulation (see #4).
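On the case/control item above, the two approaches can be put side by side in a short sketch. Both functions are hypothetical and are not part of this PR. In particular, the choice of intercept in the logistic version is an assumption: it sets the case probability to the prevalence for a sample whose genetic component is zero, which only approximates the overall case fraction.

```python
import numpy as np


def threshold_cases(liability, prevalence):
    """Current approach, sketched: the top `prevalence` fraction of samples become cases."""
    cutoff = np.quantile(liability, 1.0 - prevalence)
    return liability > cutoff


def logistic_cases(genetic, prevalence, rng):
    """Possible alternative, sketched: draw case status from a logistic model."""
    # Assumption: intercept is the log-odds of the prevalence
    intercept = np.log(prevalence / (1.0 - prevalence))
    prob = 1.0 / (1.0 + np.exp(-(intercept + genetic)))
    # Randomness enters through a binomial (Bernoulli) draw instead of normal noise
    return rng.binomial(1, prob).astype(bool)


rng = np.random.default_rng(0)
genetic = 0.5 * rng.normal(size=1000)  # made-up genetic component
by_threshold = threshold_cases(genetic + rng.normal(size=1000), prevalence=0.1)
by_logistic = logistic_cases(genetic, prevalence=0.1, rng=rng)
print(by_threshold.mean(), by_logistic.mean())
```

In this sketch the thresholding version fixes the fraction of cases by construction, while the logistic version lets the number of cases vary from run to run.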