15 Oct 12:06

nromashchenko

30bec12

2.1.2 Latest

Important bugfix for the search function, which produced invalid results in 2.1.1, plus other minor bugfixes: #53, #58

06 Oct 10:10

nromashchenko

v2.1.1

16f2068

2.1.1

Performance improvements to the mkdb command with orthoxml input
Added a check for non-unique protein IDs in the input fasta files, which now produces a more informative error message
Fixed #49

25 Nov 11:04

nromashchenko

v2.1.0

ff5e459

2.1.0

What's Changed

This release contains various performance improvements for classification, with a focus on single-thread speed and parallel scaling.

13 Nov 10:40

alpae

v2.0.5

ca69493

2.0.5

Full Changelog: v2.0.4...v2.0.5

01 Jul 06:56

alpae

v2.0.4

9140bde

2.0.4

What's Changed

[FIX] freeze numpy dependency to <2 (issue #34)
[ADD] experimental support to build omamer databases from orthoxml/fasta files
Bump pypa/gh-action-pypi-publish from 1.8.12 to 1.9.0 by @dependabot in #33

Full Changelog: v2.0.3...v2.0.4

Contributors

dependabot

28 Mar 16:32

alpae

v2.0.3

7116115

2.0.3

v2.0.3

10 Nov 19:20

alex-wave

v2.0.2

bf62ea0

2.0.2

Changed the method for hiding taxa in the build process: it now takes a file listing the taxa to hide, one per line.
Added checks and improved feedback for the root taxon and the requested taxa to hide.
The root taxon now defaults to the root level in speciestree.nwk (previously hard-coded to default to LUCA).

31 Oct 13:53

alpae

v2.0.1

a5fa800

2.0.1

What's Changed

remove dependency on the filehash library
return a better error message if build dependencies are not met when trying to build an omamer database
minor fixes
Bump actions/checkout from 3 to 4 by @dependabot in #24

Full Changelog: v2.0.0...v2.0.1

Contributors

dependabot

20 Oct 10:51

alex-wave

v2.0.0

2c295d2

2.0.0

Major update of database format and search code to improve overall memory usage. Most standard runs with a LUCA-level database will run on a machine with 16GB RAM.
Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that, for each family that has a match to a given query, we can compute the probability of matching by chance at least as many k-mers as we observed.
UX improvements - more feedback during interactive search runs, whilst maintaining small log files.

Brief overview of major changes to OMAmer

The OMAmer placement algorithm consists of two steps: placing a query sequence into a protein family (root-level HOG in OMA), then placing it into a sub-family. The original OMAmer publication focused on providing better and faster subfamily-level assignments than closest-sequence methods. Recently, the group has developed OMArk, a software package for proteome (protein-coding gene repertoire) quality assessment. The original OMAmer method was developed using a smaller taxonomic range than required for OMArk, which meant that the largest gene families were much smaller and less diverse in k-mer content. The largest HOG in OMA (November 2022 release) contains over 101,000 proteins and represents 53.9% of the k-mer index, based on the 6-mers that OMAmer uses by default. This means that a random protein sequence is very likely to be associated with this HOG.

To account for this, we developed a scoring mechanism based on the binomial distribution. For each family, we estimated the probability $P_{\textrm{family}}$ of a random k-mer matching it. For each family with matches, we can then model the number of matching k-mers as $\textrm{Binomial}(N_{\textrm{query}}, P_{\textrm{family}})$, where the number of draws $N_{\textrm{query}}$ is the number of k-mers in the query sequence. The complementary CDF (survival function) of this distribution gives the probability of matching, by chance, at least as many k-mers as we observed. Note: the results of this test are computed in negative-log units (natural log) for accuracy.
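As an illustration only, the test described above can be sketched in plain Python. This is not OMAmer's implementation: the function names `binom_log_pmf` and `neg_log_pvalue` are hypothetical, and summing the tail term by term in log space is just one simple way to obtain a survival function in negative natural-log units.

```python
import math


def binom_log_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    # log of the binomial coefficient via log-gamma, to avoid huge integers
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)


def neg_log_pvalue(observed, n_query, p_family):
    """Return -ln P(X >= observed) for X ~ Binomial(n_query, p_family).

    observed: number of query k-mers matching the family
    n_query:  number of k-mers in the query sequence
    p_family: probability that a random k-mer matches the family
    """
    if observed <= 0:
        return 0.0  # P(X >= 0) = 1, so -ln(1) = 0
    terms = [binom_log_pmf(k, n_query, p_family)
             for k in range(observed, n_query + 1)]
    peak = max(terms)
    if peak == -math.inf:
        return math.inf  # the tail has zero probability
    # log-sum-exp keeps very small tail probabilities representable
    log_tail = peak + math.log(sum(math.exp(t - peak) for t in terms))
    return max(0.0, -log_tail)
```

Working in log units means a p-value far below the smallest positive float can still be compared against a threshold. With the default `--family-alpha` of $10^{-6}$, the corresponding cut-off in these units is $-\ln(10^{-6}) \approx 13.8$.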

This p-value is then used to filter the list of families that overlap with the query sequence (argument “--family-alpha”, default $10^{-6}$), giving a list of candidate families. Candidates are then ordered by a normalised k-mer count, in the same way as the original algorithm. The expected count is now computed using the binomial approximation, with any ties broken based on the proportion of the query sequence covered by matching k-mers, then by the p-value computed above. By default, only the top family is taken forward. Sub-family placement is as in the original manuscript. Further software optimisation was performed, but did not affect the underlying method. As an example, it is now possible to run with the LUCA database in under 12GB of memory, whereas this previously required in excess of 40GB.
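The filtering and ordering described above could look roughly like the following sketch. It is an assumption-laden illustration, not OMAmer's code: the `Candidate` class, its field names, and `rank_families` are hypothetical, the normalised count is taken as a precomputed input because its exact formula is not given here, and treating the threshold as inclusive is a guess.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    family: str              # hypothetical family identifier
    neg_log_p: float         # -ln(p-value) from the binomial test
    normalised_count: float  # normalised k-mer count (precomputed, assumed)
    coverage: float          # proportion of the query covered by matching k-mers


def rank_families(candidates, family_alpha=1e-6, top_n=1):
    """Filter candidates by significance, then order them.

    Ordering: normalised count first, ties broken by coverage,
    then by significance (larger -ln p is more significant).
    """
    threshold = -math.log(family_alpha)
    # assumption: a family passes when p <= alpha, i.e. -ln p >= -ln alpha
    kept = [c for c in candidates if c.neg_log_p >= threshold]
    kept.sort(
        key=lambda c: (c.normalised_count, c.coverage, c.neg_log_p),
        reverse=True,
    )
    return kept[:top_n]
```

Calling `rank_families(candidates)` with the defaults returns at most one family, mirroring the statement that only the top family is taken forward by default.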

14 Jun 07:43

alpae

v0.2.6

2ce5251

0.2.6

What's Changed

support for numpy>1.23

Full Changelog: v0.2.5...v0.2.6

Releases: DessimozLab/omamer
