Miscellaneous Fixes by MartinGuehmann · Pull Request #7 · proteinevolution/clans

MartinGuehmann · 2020-12-02T01:45:36Z

This PR further reduces memory needs and fixes the convex clustering algorithm. A condition could make it crash, and the variance calculation did not account for the edges between sequences for which we have no data.

If accepted, this pull request also brings in pull requests #1, #2, #4, #5, and #6.

HashMap requires a key and a value, and both must be Objects. That means both are stored as pointers in the HashMap and require 8 bytes each in 64-bit Java. On top of that comes the memory for the objects themselves; if we use the same object for key and value, we can save that memory. To achieve that, we need to be able to use MinimalHsp as a key in a HashMap. Since we only want to use query and hit of MinimalHsp, the overridden methods hashCode and equals should depend only on those. In addition, query and hit should be final so that they cannot be changed once a MinimalHsp is in a HashMap, since that would corrupt the HashMap.
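The idea above can be sketched as follows. This is a minimal illustration, not the actual MinimalHsp class: only the names MinimalHsp, query, and hit come from the description, while the val field, its type, and the wrapper class SelfKeyedHspSketch are assumptions made for the example.

```java
import java.util.HashMap;

class SelfKeyedHspSketch {
    // Hypothetical stand-in for MinimalHsp; the val payload is an assumption.
    static final class MinimalHsp {
        final int query; // final: must not change while the object is used as a key
        final int hit;   // final for the same reason
        double val;      // payload, deliberately ignored by equals and hashCode

        MinimalHsp(int query, int hit, double val) {
            this.query = query;
            this.hit = hit;
            this.val = val;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof MinimalHsp)) return false;
            MinimalHsp other = (MinimalHsp) o;
            return query == other.query && hit == other.hit;
        }

        @Override
        public int hashCode() {
            return 31 * query + hit;
        }
    }

    public static void main(String[] args) {
        // The same object serves as key and value, so no separate key object is allocated.
        HashMap<MinimalHsp, MinimalHsp> hsps = new HashMap<>();
        MinimalHsp stored = new MinimalHsp(3, 7, 1e-20);
        hsps.put(stored, stored);

        // A throwaway probe with the same query and hit finds the stored object.
        MinimalHsp found = hsps.get(new MinimalHsp(3, 7, 0.0));
        System.out.println(found == stored);
    }
}
```

Because equals and hashCode ignore the payload, the payload may still be updated after insertion without disturbing the map.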

…trings Strings need a lot of memory for representing two numbers separated by an underscore. However, the value for the key is already contained in the MinimalAttractionValue object itself. To use a MinimalAttractionValue as a key to itself, its equals and hashCode functions must depend on its query and hit values. Since the att field is supposed to be the value part in the HashTable, it is ignored by equals and hashCode. This is a bit weird, but HashMap does not allow primitive types as keys; otherwise a long would be the choice, or a pair of two ints if Java allowed passing the object itself rather than a pointer to it.
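For comparison, the long-based alternative mentioned above could look like the sketch below. It is only an illustration of the trade-off, not code from this PR; the class PairKeySketch and its methods are hypothetical. Note that the key still has to be boxed into a Long object, which is why the self-keyed object was preferred.

```java
import java.util.HashMap;

class PairKeySketch {
    // Pack two 32-bit ids into one 64-bit key: query in the high half, hit in the low half.
    static long pack(int query, int hit) {
        return ((long) query << 32) | (hit & 0xFFFFFFFFL);
    }

    static int queryOf(long key) {
        return (int) (key >>> 32);
    }

    static int hitOf(long key) {
        return (int) key;
    }

    public static void main(String[] args) {
        // HashMap cannot hold a primitive long, so every key is boxed into a Long object.
        HashMap<Long, Float> attraction = new HashMap<>();
        attraction.put(pack(12, 345), 0.5f);

        long key = pack(12, 345);
        System.out.println(queryOf(key) + "_" + hitOf(key) + " -> " + attraction.get(key));
    }
}
```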

…f it contains a gap String.replaceAll can be implemented in a way that it returns a new String even though the original String does not contain the gap character. This wastes time and memory on allocating the new String.
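A minimal sketch of such a guard is shown below. The helper name removeGaps and the choice of '-' as the gap character are assumptions for illustration; the actual code in the PR may differ.

```java
class GapStripSketch {
    // Only call replaceAll when a gap is actually present; otherwise return the same instance.
    static String removeGaps(String sequence) {
        if (sequence.indexOf('-') < 0) {
            return sequence; // no gap: no regex work, no new String
        }
        return sequence.replaceAll("-", "");
    }

    public static void main(String[] args) {
        String clean = "MKTAYIAK";
        System.out.println(removeGaps(clean) == clean);
        System.out.println(removeGaps("MK-TAY--IAK"));
    }
}
```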

…mplexity This avoids adding dummy objects or fields. In principle, this could reduce memory needs; however, HashSet internally uses a HashMap and fills the value part with a static dummy Object. That is not a very nice implementation. However, it is now out of the programmer's sight, so other code issues become clearer.
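The kind of change described here can be sketched as follows. The class and method names are hypothetical and the example does not mirror any specific CLANS class; it only contrasts a map filled with dummy values against a set.

```java
import java.util.HashMap;
import java.util.HashSet;

class MembershipSketch {
    // Before: a map used only for membership tests, with a meaningless dummy value.
    static HashMap<Integer, Object> asDummyMap(int[] ids) {
        Object dummy = new Object();
        HashMap<Integer, Object> map = new HashMap<>();
        for (int id : ids) {
            map.put(id, dummy);
        }
        return map;
    }

    // After: a set expresses the intent directly; the dummy value is hidden inside HashSet.
    static HashSet<Integer> asSet(int[] ids) {
        HashSet<Integer> set = new HashSet<>();
        for (int id : ids) {
            set.add(id);
        }
        return set;
    }

    public static void main(String[] args) {
        int[] selected = {4, 8, 15, 8};
        System.out.println(asDummyMap(selected).containsKey(15));
        System.out.println(asSet(selected).contains(15));
    }
}
```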

…_values The attraction values for the same edges seem to be meant to be averaged. However, what was computed was something other than an average. If there was only an edge between node A and B but not between B and A, then the attraction value would be only half of what it would be if both edges existed. In fact, it is questionable whether the attraction values should be treated differently if they come from two different HSPs than if they come from the same one.
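One way to read this issue is sketched below. It is an interpretation under stated assumptions, not the code from this PR: a missing direction is modelled as null, and the class and method names are hypothetical. Always dividing by two halves the value when only one direction has data, whereas averaging over the directions that are present does not.

```java
class AttractionAverageSketch {
    // Assumed old behaviour: always divide by two, so a missing direction counts as zero.
    static double halfSum(Double aToB, Double bToA) {
        double sum = (aToB == null ? 0.0 : aToB) + (bToA == null ? 0.0 : bToA);
        return sum / 2.0;
    }

    // Average only over the directions for which a value exists.
    static double averagePresent(Double aToB, Double bToA) {
        double sum = 0.0;
        int count = 0;
        if (aToB != null) { sum += aToB; count++; }
        if (bToA != null) { sum += bToA; count++; }
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        // Only A -> B has data.
        System.out.println(halfSum(0.5, null));
        System.out.println(averagePresent(0.5, null));
    }
}
```

When both directions are present, the two methods agree; they differ only for one-sided edges.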

vikramalva · 2020-12-02T09:49:40Z

@MartinGuehmann we are currently working on an entirely new implementation of CLANS and are therefore not maintaining this version anymore.

MartinGuehmann · 2020-12-02T20:02:00Z

Thanks for letting me know, @vikramalva. Actually, that is something your colleague could have told me when he gave me the link to this version.

In fact, I agree that the code is quite a mess, even though this is already better than the previous version I could get my hands on. This version has been heavily refactored, but it is still something other than object-oriented programming, and thus not very nice to work with.

So what is it? Some implementation in C++? I thought about taking out the data class and the algorithm and converting them to C++. Much better memory management; Java just asks for more memory even though it could free some memory first. And then these default data structures: seriously, to store an int of 4 bytes in an ArrayList, I need another 8 bytes in 64-bit Java for a pointer. That's just insane.

For now, I got CLANS to do what I want. It can load my data, basically a graph with 90000 nodes and 600 million edges, run 5000 rounds, and save the result to disk, all from the command line without user interaction. That way I can run it on the cluster computer of my university and put it into my pipeline.

I am no longer interested in the clustering algorithm, since I do this step with another program.

vikramalva · 2020-12-09T14:39:19Z

Thanks a lot for the work you put into this version of CLANS @MartinGuehmann. I am testing it out presently and will definitely use it for my own work. I look forward to loading large graphs that I was previously unable to load. I am not directly involved with the development of the new version of CLANS and therefore I do not have any information on how it is being implemented.

MartinGuehmann added 30 commits October 7, 2020 02:26

Add a bash script for making and jaring Clans from the command line

07d9f14

Cleanup places for better HashMap usage, before replace

be2e352

In particular, declare local variables as close to where they are used. Especially keep them in a local scope.

Make ClusterDataLoadHelper.parse_hsp_block use the same MinimalHsp as…

3f3485c

… key and value in the HashTable to save memory

Reduce memory by using the same MinimalHsp object for key and value i…

ce0fbb4

…n the HashMaps in BlastVersion2.gethits

Reduce memory by using the same MinimalHsp object for key and value i…

9bac3cb

…n the HashMaps in FileHandling2.blast

Cleanup code: initialize a local HashMap as late as possible, removed a…

3892993

…n unused parameter from a function

Use of MinimalAttractionValue itself as key in the HashTable instead …

fc22539

…a String to save memory

Use HashSets instead of HashMaps in SelectedSubsetHandling to save me…

92b77f1

…mory Internally the HashSet also uses a HashMap, which is filled with a pointer to a static dummy object, so we save memory on the value object, but not on the pointer itself, which is not a smart design.

Remove tmp members from IterationsComputerThread

9942c68

Temporary variables should not be members of a class, especially if they are only used locally

Simplify code: Use the two argument constructor of AminoAcidSequence t…

b808005

…o set the members on construction

Use Integer instead of String as type for the HashMaps in ClusterDete…

28cdf87

…ction.java to reduce memory usage However, since hashkeys[i] = i, this looks to be superfluous

Remove hashkeys from ClusterDetection.java, since hashkeys[i] = i

d646435

This wasn't really a map and just used memory without need. And made the code harder to read.

White space cleanup in ClusterDetection.java

d6ad91d

Remove unused parameter of ClusterDetection.getconnecteds

71f6358

Turn clusterhash in ClusterDetection.multilinkage into a 2D-array, as…

a394ea5

… it is used as such anyway This simplifies the code and saves memory for the HashMap and the wrappers it requires for primitives. In fact, the Integer objects were basically used as indices.

Use existing Integer objects for HashMaps and HashSets in ClusterDete…

5d82e11

…ction.java

Accelerate loading with many HSPs, by presetting the capacity of a Ha…

bf613bc

…shMap

Rename basevec to remainingSeqIDs in ClusterDetection.java

f313999

Rename the methods in ClusterDetection.java for convex clustering

fa2059c

Rename currvec to newClusterSeqIDs in ClusterDetection.java

84140a6

Having the "Find clusters" menu item in the "Windows" menu starting w…

b336f7d

…ith a capital "F" in line with all the other menu items

Clean up sort code in ClusterDetection.java

bcd7387

Rename retvec to returnClusters in ClusterDetection.getConvex

9743c8f

Clean up white space and add camelCasing for variables in SequenceCl…

c5265ec

…uster.java Don't use "booleanVar == false"; use "!booleanVar" instead

Add also camelCasing and fix naming for member variables in SequenceC…

1d295c5

…luster

Simplify code by moving into the SequenceCluster constructors

9faab95

MartinGuehmann added 25 commits October 11, 2020 02:10

Cleanup white space in WindowClusterDetectionResults.java

f1c00f0

Cleanup variable names in WindowClusterDetectionResults further: Name…

150477c

…s and premature declaration

Do not add sequences if the cancel button was pressed in WindowCluste…

0666831

…rDetectionResults

Better label the buttons with what they do in WindowClusterDetectionR…

e096511

…esults

Add the sequences to the sequence groups as they show up in WindowClu…

3c079c4

…sterDetectionResults

Flexibilize the naming of the new sequence groups and conserve the in…

b3a489a

…dex shown in the cluster result window

Use Integer as key in HashMap for only moving selected sequences to s…

e9e4b0b

…ave memory

Remove unused parameter from ClusterMethods.computeSimpleAttractionVa…

b1dd26b

…lue and ClusterMethods.computeComplexAttractionValue

Cleanup white space in ClusterMethods.java, no code changes

44dc625

Further code cleanup in ClusterMethods.java

bfa9844

Remove minpal from the argument lists of computeSimpleAttractionValue…

eed57af

… and computeComplexAttractionValue This value is provided by the ClusterData object already.

Prepare to merge duplicated code in compute_attraction_values() by ma…

c7dcedb

…king the copies more similar

Cleanup white space in ClusterData.java

08e6da3

Merge duplicated code in ClusterData.compute_attraction_values

15189d5

This way it is easier to maintain

Merge more duplicated code in ClusterData.compute_attraction_values

bec46f2

Provide -rounds synonym for the -dorounds command line parameter, sin…

a801c31

…ce the help calls it so

Add convex clustering to the no graphical user interface options

7fe6d05

Reduce memory, just use the HashMap to store things in ClusterData.co…

0f9d813

…mpute_attraction_values

Remove a not needed HashMap

1bc9182

Fix the variance calculation in ConvexClustering

501ce67

Make an above limit check more readable

f3479e8

The question is whether something is above a limit not the other way around

Calculate the attraction limit in ConvexClustering only once instead …

305fee0

…for each cluster, it is always the same anyway

Fix an ArrayIndexOutOfBoundsException in ConvexClustering

1b26079

Correct the variance calculation in ConvexClustering further

c69f68b

vikramalva merged commit 27db6c0 into proteinevolution:master Dec 9, 2020

MartinGuehmann deleted the ReduceMemoryFuerther branch December 21, 2021 02:03

vikramalva merged 79 commits into proteinevolution:master from MartinGuehmann:ReduceMemoryFuerther
