
Miscellaneous Fixes#7

Merged
vikramalva merged 79 commits intoproteinevolution:masterfrom
MartinGuehmann:ReduceMemoryFuerther
Dec 9, 2020

@MartinGuehmann
Contributor

@MartinGuehmann MartinGuehmann commented Dec 2, 2020

This PR reduces memory needs further and fixes the convex clustering algorithm. There was a condition that would make it crash and the variance calculation did not consider the edges between the sequences where we have no data.

If accepted, this pull request will also bring in the changes of pull requests #1, #2, #4, #5, and #6.

HashMap requires a key and a value, and both must be Objects. That means both
are stored as pointers in the HashMap and require 8 bytes each in 64-bit Java.

On top of that comes the memory for the Objects themselves. If we use the same
object for key and value, we can save that memory.

To achieve that, we need to be able to use MinimalHsp as the key in a HashMap.
Since we only want to use query and hit of MinimalHsp, the overridden methods
hashCode and equals should depend only on those.
In addition, query and hit should be final so that they cannot be changed once
MinimalHsp is in a HashMap, since this would corrupt the HashMap.
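
A minimal sketch of the self-keyed map idea described above, not the actual code of this PR. `HspSketch` is a hypothetical stand-in for MinimalHsp: only the field names query and hit come from the commit message, while the `val` payload and the helper `intern` are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for MinimalHsp (assumption of this sketch).
final class HspSketch {
    final int query;  // final: must not change while the object is used as a key
    final int hit;    // final for the same reason
    double val;       // payload; deliberately ignored by equals and hashCode

    HspSketch(int query, int hit, double val) {
        this.query = query;
        this.hit = hit;
        this.val = val;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof HspSketch)) return false;
        HspSketch other = (HspSketch) o;
        return query == other.query && hit == other.hit;
    }

    @Override
    public int hashCode() {
        return 31 * query + hit;
    }
}

final class SelfKeyedMap {
    // Stores hsp under itself unless an entry with the same (query, hit)
    // already exists; returns the instance that is stored in the map.
    static HspSketch intern(Map<HspSketch, HspSketch> map, HspSketch hsp) {
        HspSketch existing = map.putIfAbsent(hsp, hsp);
        return existing != null ? existing : hsp;
    }
}
```

Looking up an entry then only needs a temporary probe object with the same query and hit; the stored instance carries the payload, so no separate key object is kept alive per entry.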
In particular, declare local variables as close to where they are used.
Especially keep them in a local scope.
… key and value in the HashTable to save memory
…trings

Strings need a lot of memory for representing two numbers separated by an
underscore. However, the value for the key is already contained in the
MinimalAttractionValue object itself.

To use a MinimalAttractionValue as a key to itself, its equals and hashCode
functions must depend on its values of query and hit. Since the att field is
supposed to be the value part in the HashTable, it is ignored by equals and hashCode.

This is a bit weird, but HashMap does not allow primitive types as keys. Otherwise
a long would be the natural choice for the key, or a pair of two ints if Java
allowed storing the object itself rather than a pointer to it.
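
For illustration only, this is how a query/hit pair could be packed into a single long, the alternative mentioned above that HashMap cannot use without boxing. The class name `PairKey` and its methods are hypothetical and not part of CLANS.

```java
// Hypothetical helper: packs two ints into one long and unpacks them again.
final class PairKey {
    // query goes into the upper 32 bits, hit into the lower 32 bits.
    static long pack(int query, int hit) {
        return ((long) query << 32) | (hit & 0xFFFFFFFFL);
    }

    // Arithmetic shift restores the sign of query.
    static int query(long key) {
        return (int) (key >> 32);
    }

    // Narrowing cast keeps exactly the lower 32 bits.
    static int hit(long key) {
        return (int) key;
    }
}
```

Such a key would only pay off in a map specialised for primitive long keys; with java.util.HashMap it would be boxed into a Long object again.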
…mory

Internally the HashSet also uses a HashMap, which is filled with a pointer to
a static dummy object, so we save memory on the value object, but not on the
pointer itself, which is not a smart design.
Temporary variables should not be members of a class, especially if they are only used locally
…f it contains a gap

String.replaceAll can be implemented in a way that it returns a new String even if the original
String does not contain the gap character. This wastes time and memory for allocating the new
String.
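
A small sketch of the guard described above, assuming the gap character is '-' (the commit message does not name it). `GapStripper.removeGaps` is a hypothetical helper, not a method of CLANS.

```java
// Hypothetical helper illustrating the "check before replacing" idea.
final class GapStripper {
    // Returns seq unchanged (the same instance) when it contains no gap,
    // so no new String is allocated in the common case.
    static String removeGaps(String seq) {
        if (seq.indexOf('-') < 0) {
            return seq;
        }
        // String.replace does a literal replacement, no regex involved.
        return seq.replace("-", "");
    }
}
```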
…ction.java to reduce memory usage

However, since hashkeys[i] = i, this looks to be superfluous.
This wasn't really a map; it just used memory without need
and made the code harder to read.
… it is used as such anyway

This simplifies the code and saves the memory for the HashMap and for the wrappers it requires for
primitives. In fact, the Integer objects were basically used as indices.
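
A hedged sketch of the idea: when the Integer keys of a map are just the dense indices 0..n-1, a plain array holds the same information without boxed keys, boxed values, or map entries. `DenseIndexSketch` and the float payload are assumptions for illustration, not the data structures touched in this PR.

```java
import java.util.Map;

// Hypothetical helper: converts an index-keyed map into a primitive array.
final class DenseIndexSketch {
    // Assumes the keys of byIndex are exactly 0 .. byIndex.size() - 1.
    static float[] toArray(Map<Integer, Float> byIndex) {
        float[] values = new float[byIndex.size()];
        for (Map.Entry<Integer, Float> e : byIndex.entrySet()) {
            values[e.getKey()] = e.getValue(); // auto-unboxing of key and value
        }
        return values;
    }
}
```

After the conversion, `values[i]` replaces `byIndex.get(i)` at every call site.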
…mplexity

This avoids adding dummy objects or fields. In principle, this could reduce memory
needs; however, HashSet internally uses a HashMap and fills the value part with a
static dummy Object. That is not a very nice implementation.

However, this is now out of sight of the programmer so that other code issues get clearer.
…ith a capital "F" in line with all the other menu items
…uster.java

Don't use "booleanVar == false"; use "!booleanVar" instead.
…lue and ClusterMethods.computeComplexAttractionValue
… and computeComplexAttractionValue

This value is already provided by the ClusterData object.
…_values

The attraction values for the same edge seem to be supposed to be averaged.
However, what the code did was something other than averaging.

If there was only an edge between node A and B but not between B and A, then
the attraction value was only half of what it would have been if both edges existed.

In fact, it is questionable whether the attraction values should be treated differently
if they come from two different HSPs than if they come from the same one.
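
One possible reading of "averaging" here, as a sketch rather than the change actually made in this PR: divide by the number of directions for which a value exists instead of always dividing by two. The class `AttractionAverage` and the use of NaN to mark a missing direction are assumptions of this example.

```java
// Hypothetical helper: averages the attraction values of the two directions
// of an edge, counting only the directions that are actually present.
final class AttractionAverage {
    // Double.NaN marks a direction without data (assumption of this sketch).
    static double average(double ab, double ba) {
        boolean hasAb = !Double.isNaN(ab);
        boolean hasBa = !Double.isNaN(ba);
        if (hasAb && hasBa) {
            return (ab + ba) / 2.0;
        }
        if (hasAb) {
            return ab; // a single value is its own average, not half of it
        }
        if (hasBa) {
            return ba;
        }
        return Double.NaN; // no data in either direction
    }
}
```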
The question is whether something is above a limit, not the other way around.
…for each cluster, it is always the same anyway
@vikramalva
Member

@MartinGuehmann we are currently working on an entirely new implementation of CLANS and are therefore not maintaining this version anymore.

@MartinGuehmann
Contributor Author

Thanks for letting me know, @vikramalva. Actually, that is something your colleague could have told me when he gave me the link to this version.

In fact, I agree the code is quite a mess, even though this is already better than the previous version I could get my hands on. This version has been heavily refactored, but it is still something other than object-oriented programming, and thus not very nice to work with.

So what is it? Some implementation in C++? I thought about taking out the data class and the algorithm and converting them to C++. Memory management is much better there; Java just asks for more memory even though it could free some memory first. And then these default data structures: seriously, to store an int of 4 bytes in an ArrayList, I need another 8 bytes in 64-bit Java for a pointer. That's just insane.

For now I have got CLANS to do what I want: it can load my data, basically a graph with 90000 nodes and 600 million edges, run 5000 rounds, and save the result to disk, all from the command line without user interaction. That way I can run it on the cluster computer of my university and put it into my pipeline.

I am no longer interested in the clustering algorithm, since I do this step with another program.

@vikramalva vikramalva merged commit 27db6c0 into proteinevolution:master Dec 9, 2020
@vikramalva
Member

Thanks a lot for the work you put into this version of CLANS @MartinGuehmann. I am testing it out presently and will definitely use it for my own work. I look forward to loading large graphs that I was previously unable to load. I am not directly involved with the development of the new version of CLANS and therefore I do not have any information on how it is being implemented.

@MartinGuehmann MartinGuehmann deleted the ReduceMemoryFuerther branch December 21, 2021 02:03