dendrograms, correlation and marker genes filtering #425

Merged
falexwolf merged 25 commits into scverse:master from fidelram:tl.dendrogram
Mar 4, 2019

@fidelram
Collaborator

This PR makes the computation of the hierarchical clustering underlying the dendrograms more transparent. By default, the dendrograms are now computed from the PCA representation using sc.tl.dendrogram.
It is also now possible to plot a dendrogram directly, without any other data:

image

Since a correlation matrix is computed for the hierarchical clustering, I also added a visualization for it (mostly borrowing code from https://github.com/deeptools/deepTools). The new plotting function is called sc.pl.correlation.

image

I also added a function to filter the results from sc.tl.rank_genes_groups by fold change and by the fraction of cells expressing the gene within and outside each group.

For example,

image

The first image shows the case without filtering.
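
The filtering criteria described above can be illustrated with a minimal standalone sketch. It is not the code from this PR: the function name `filter_markers`, its parameters, and the threshold values are assumptions chosen for illustration, and a gene counts as "expressed" in a cell when its value is above zero.

```python
import numpy as np

def filter_markers(X, in_group, log2_fold_changes,
                   min_fold_change=2.0,
                   min_in_group_fraction=0.25,
                   max_out_group_fraction=0.5):
    """Return a boolean mask over genes that pass all three filters.

    X is a cells x genes expression matrix, in_group is a boolean mask
    over cells, and log2_fold_changes holds one value per gene.
    All names and defaults here are hypothetical.
    """
    expressed = X > 0
    # fraction of cells expressing each gene inside and outside the group
    frac_in = expressed[in_group].mean(axis=0)
    frac_out = expressed[~in_group].mean(axis=0)
    return (
        (log2_fold_changes >= np.log2(min_fold_change))
        & (frac_in >= min_in_group_fraction)
        & (frac_out <= max_out_group_fraction)
    )

# Toy data: 6 cells x 3 genes; the first three cells form the group.
X = np.array([
    [5, 1, 0],
    [4, 2, 0],
    [6, 1, 3],
    [0, 1, 0],
    [0, 2, 0],
    [1, 1, 0],
], dtype=float)
in_group = np.array([True, True, True, False, False, False])
log2_fc = np.array([3.0, 0.0, 2.0])  # made-up fold changes, one per gene

mask = filter_markers(X, in_group, log2_fc, min_in_group_fraction=0.5)
print(mask)
```

In this toy case only the first gene survives: the second fails the fold-change threshold, and the third is expressed in too few cells of the group.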

@falexwolf
Member

Sorry for the late response! This seems to have come just after I went through the issues last weekend...

It looks great! 😄

Some small notes:

  • sc.pl.correlation should be sc.pl.correlation_matrix (there will be other "correlation plots", just think of the typical bivariate scatter plot...)
  • sc.tl.dendrogram suggests it is a function that can be generically applied to any hierarchical clustering of observations. We could even have dendrograms of variables, right? I'm fine with putting it into the API with just that generic name, but it would be good to have a .. note:: in the docstring, which states that this does a very specific thing: computing hierarchical clustering on predefined groups using Pearson correlation as a distance metric. I know that this is super standard in the field, but we should nonetheless be very clear about it. In particular, as Scanpy grows and we extend its functionality to other methods for grouping observations and structuring their relations (e.g. hierarchical clustering with another distance metric, or something that we don't think of at this stage), I fear that people might start to get confused. Even now, they don't know what, for instance, the relation of tl.dendrogram to PAGA is: instead of correlating cluster medoid vectors, PAGA computes the connectivity between clusters in the underlying graph. Also, it is not restricted to a tree. It would be great to have a note like that (I can also add it; I wanted to rewrite the PAGA docstring anyway, and I'll make a link to tl.dendrogram...).
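
The "very specific thing" described in the second note can be sketched as follows. This is a minimal illustration under stated assumptions, not Scanpy's implementation: the helper `group_dendrogram` is hypothetical, groups are summarized by their mean vector, `1 - r` is used as the distance derived from the Pearson correlation `r`, and average linkage is an arbitrary choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def group_dendrogram(rep, labels):
    """Cluster predefined groups by the correlation of their mean vectors.

    rep is a cells x features array (e.g. a PCA representation),
    labels assigns one group name to each cell. Hypothetical helper.
    """
    labels = np.asarray(labels)
    categories = sorted(set(labels.tolist()))
    # one summary vector per predefined group
    means = np.vstack([rep[labels == c].mean(axis=0) for c in categories])
    corr = np.corrcoef(means)            # groups x groups Pearson correlation
    dist = 1.0 - corr                    # assumption: correlation distance
    dist = (dist + dist.T) / 2.0         # remove floating-point asymmetry
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    order = [categories[i] for i in leaves_list(Z)]
    return corr, Z, order

# Toy data: groups A and B have similar profiles, C is very different.
rng = np.random.default_rng(0)
centers = {
    "A": np.array([1.0, 2.0, 3.0, 4.0]),
    "B": np.array([1.1, 2.1, 2.9, 4.2]),
    "C": np.array([4.0, 3.0, 2.0, 1.0]),
}
labels = [name for name in centers for _ in range(5)]
rep = np.vstack([centers[name] + rng.normal(scale=0.05, size=4) for name in labels])

corr, Z, order = group_dendrogram(rep, labels)
print(np.round(corr, 2))
print(order)
```

The first merge in `Z` joins A and B, the two groups whose mean vectors are most strongly correlated.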

Thanks again!

@fidelram
Collaborator Author

@falexwolf thanks for the feedback. :)

I agree with your comments on sc.tl.dendrogram. Similar reasoning originally motivated me to separate and expose the implementation of the function. I expect that it is now easier to extend the creation of a correlation matrix to other methods and groupings, as you suggest.

Currently, sc.tl.dendrogram uses PCA by default, recycling the function used by sc.tl.neighbors (tools._utils.choose_representation()). Any other embedding in .obsm can be used (as is the case for sc.tl.neighbors). Also, any group of genes can be given as a parameter.

What tl.dendrogram does not do is use the underlying network to compute a distance matrix, as I think Seurat does and apparently you also do in PAGA.

For me, what is important is that the plotting functions get the dendrogram data from .uns and thus the generation of the hierarchical clustering is separated and can be computed by any other method.
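
The separation described here, where one step stores the clustering result and the plotting side only reads it back, can be sketched with a plain dict standing in for .uns. The key names (`"dendrogram"`, `"linkage"`, `"categories_ordered"`) and both helper functions are assumptions made for this illustration; they are not claimed to match what the PR stores.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

def compute_dendrogram(uns, group_means, categories, key="dendrogram"):
    """Compute a hierarchical clustering and store it under uns[key]."""
    # metric="correlation" is SciPy's 1 - Pearson r distance
    Z = linkage(group_means, method="average", metric="correlation")
    info = dendrogram(Z, labels=list(categories), no_plot=True)
    uns[key] = {"linkage": Z, "categories_ordered": info["ivl"]}

def ordered_categories(uns, key="dendrogram"):
    """What a plotting function would need: it reads, but never computes."""
    if key not in uns:
        raise KeyError(f"no dendrogram stored under {key!r}; compute one first")
    return uns[key]["categories_ordered"]

uns = {}  # stand-in for the .uns mapping
group_means = np.array([
    [1.0, 2.0, 3.0, 4.0],   # group A
    [1.1, 2.1, 2.9, 4.2],   # group B
    [4.0, 3.0, 2.0, 1.0],   # group C
])
compute_dendrogram(uns, group_means, ["A", "B", "C"])
default_order = ordered_categories(uns)

# Any other method can fill the same slot; the reader does not care how.
uns["dendrogram"] = {"categories_ordered": ["C", "A", "B"]}
custom_order = ordered_categories(uns)
print(default_order, custom_order)
```

Because the reader depends only on the stored entry, swapping in a different clustering method requires no change on the plotting side.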

@falexwolf
Member

Great! 😄

> What tl.dendrogram does not do is to use the underlying network to compute a distance matrix as I think seurat does and apparently you also do in PAGA.

That is fine: if you just compute distances on the medians/medoids, that's a small-scale problem that won't cost any time. If Seurat uses medians, they will not use the underlying network, because medians aren't contained in it. If they use medoids, they might use the underlying network, but the savings in compute time will be negligible... So don't worry about that.

@fidelram
Collaborator Author

I will get back to this as soon as I can. Meanwhile, let's keep the PR open.

@falexwolf
Member

Just let me know when you think this is ready to go!

@fidelram
Collaborator Author

fidelram commented Feb 3, 2019

I am getting an error elsewhere that I want to look into before submitting a final version. Hopefully tomorrow.

@fidelram
Collaborator Author

fidelram commented Feb 5, 2019

I think it is ok to merge now.

I also updated some of the plotting functions to accept a gene_symbol column:

image

What is still missing is support in sc.pl.rank_genes_groups and sc.pl.violin. Any volunteers?

@aopisco
Contributor

aopisco commented Feb 12, 2019

@fidelram how can I try your sc.pl.correlation_matrix?
Great work, by the way!

@fidelram
Collaborator Author

@falexwolf Can we merge this branch?

@flying-sheep flying-sheep force-pushed the master branch 2 times, most recently from 3efb194 to fc84096 Compare February 12, 2019 11:38
@flying-sheep
Member

Ah, sorry for being in the way here with the unrelated logging changes. Alex is currently a bit ill, I learned, which is probably why he hasn’t done it yet. I didn’t have time to review the whole thing, but if y’all want I can do that too.

@fidelram
Collaborator Author

fidelram commented Feb 13, 2019 via email

@aopisco
Contributor

aopisco commented Feb 21, 2019

@fidelram @flying-sheep reiterating my interest in trying this out 💯

@falexwolf
Member

Sorry about the terribly late response. I was super sick for weeks. Thank you so much, @fidelram.

Let's merge it. :)

@falexwolf falexwolf merged commit 77e34d7 into scverse:master Mar 4, 2019
@fidelram
Collaborator Author

fidelram commented Mar 4, 2019

Thanks!

@wangjiawen2013

Thanks a lot. All of these new features are what we need!

I notice that the tutorial has not been updated yet (for example, for sc.tl.filter_rank_genes_groups() and the RNA velocity function in https://github.com/theislab/scanpy/tree/master/scanpy/tools). I only came across these features by chance. Could you add them to the Scanpy tutorial?

@fidelram
Collaborator Author

fidelram commented Mar 18, 2019 via email

awnimo pushed a commit to dpeerlab/scanpy that referenced this pull request Dec 17, 2019
dendrograms, correlation and marker genes filtering