Releases: DessimozLab/omamer

2.1.2

15 Oct 12:06

Important bugfix for the search function, which produced invalid results in 2.1.1, plus other minor bugfixes: #53, #58

2.1.1

06 Oct 10:10

  • Performance improvements to the mkdb command with orthoxml input
  • Added a check for non-unique protein IDs in the input fasta files, which now produces a more informative error message
  • Fixed #49

2.1.0

25 Nov 11:04

What's Changed

This release contains various performance improvements for classification, with a focus on single-thread speed and parallel scaling.

2.0.5

13 Nov 10:40

Full Changelog: v2.0.4...v2.0.5

2.0.4

01 Jul 06:56

What's Changed

  • [FIX] freeze numpy dependency to <2 (issue #34)
  • [ADD] experimental support to build omamer databases from orthoxml/fasta files
  • Bump pypa/gh-action-pypi-publish from 1.8.12 to 1.9.0 by @dependabot in #33

Full Changelog: v2.0.3...v2.0.4

2.0.3

28 Mar 16:32

v2.0.3

2.0.2

10 Nov 19:20

  • Changed the method for hiding taxa in the build process: it now takes a file listing the taxa to hide, one per line.
  • Added checks and improved feedback for the root taxon and the requested taxa to hide.
  • The root taxon now defaults to the root level in speciestree.nwk (previously hard-coded to default to LUCA).

2.0.1

31 Oct 13:53

What's Changed

  • remove dependency for filehash library
  • return a better error message when trying to build an omamer database without the build dependencies installed
  • minor fixes
  • Bump actions/checkout from 3 to 4 by @dependabot in #24

Full Changelog: v2.0.0...v2.0.1

2.0.0

20 Oct 10:51

  • Major update of database format and search code to improve overall memory usage. Most standard runs with a LUCA-level database will run on a machine with 16GB RAM.
  • Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that we can compute the probability of matching at least as many k-mers as we have observed by chance, for each family that has a match to a given query.
  • UX improvements - more feedback during interactive search runs, whilst maintaining small log files.

Brief overview of major changes to OMAmer

The OMAmer placement algorithm consists of two steps: first placing a query sequence into a protein family (root-level HOG in OMA), then placing it into a sub-family. The original OMAmer publication focused on providing better and faster subfamily-level assignments than closest-sequence methods. Recently, the group has developed OMArk, a software package for proteome (protein-coding gene repertoire) quality assessment. The original OMAmer method was developed using a smaller taxonomic range than OMArk requires, which meant that the largest gene families were much smaller and less diverse in k-mer content. The largest HOG in OMA (November 2022 release) contains over 101,000 proteins and represents 53.9% of the k-mer index, based on the 6-mers that OMAmer uses by default. This means that a random protein sequence is very likely to be associated with this HOG.

To account for this, we developed a scoring mechanism based on the binomial distribution. For each family, we estimate the probability $P_{\textrm{family}}$ of a random k-mer matching it. For each family with matches, we can then compute the $\textrm{Binomial}(N_{\textrm{query}}, P_{\textrm{family}})$ distribution, where the number of draws $N_{\textrm{query}}$ is the number of k-mers in the query sequence. Using the complementary CDF (survival function), we compute the probability of matching, by chance, at least as many k-mers as we have observed, for each family that has a match. Note: the results of this test are computed in negative-log units (natural log) for accuracy.
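The test described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not OMAmer's actual implementation: the function names `log_binom_pmf` and `neg_log_pvalue` are hypothetical, and the tail probability is summed term by term in log space rather than with whatever numerical method OMAmer uses internally.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k (0 < p < 1)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def neg_log_pvalue(k_observed, n_query, p_family):
    """Return -ln P(X >= k_observed) for X ~ Binomial(n_query, p_family).

    n_query is the number of k-mers in the query; p_family is the
    per-family probability of a random k-mer matching (assumed given).
    """
    if k_observed <= 0:
        return 0.0  # P(X >= 0) = 1, so -ln(1) = 0
    # Sum the upper tail with log-sum-exp so tiny probabilities do not underflow.
    terms = [log_binom_pmf(k, n_query, p_family)
             for k in range(k_observed, n_query + 1)]
    largest = max(terms)
    log_sf = largest + math.log(sum(math.exp(t - largest) for t in terms))
    return -log_sf


# Illustrative (made-up) numbers: a query with 300 k-mers, 200 of which
# match a family whose random-match probability is 0.539.
score = neg_log_pvalue(200, 300, 0.539)
```

Working in negative-log units keeps very small p-values representable: a larger score means a less likely chance match, and a significance level such as $10^{-6}$ corresponds to a score of `-math.log(1e-6)`.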

This p-value is then used to filter the list of families which have an overlap with the query sequence (argument “--family-alpha”, default $10^{-6}$), to give us a list of candidate families. Candidates are then ordered by a normalised k-mer count, in the same way as the original algorithm. The expected count is now computed using the binomial approximation, with any ties broken based on the proportion of the query sequence covered by matching k-mers, then by the p-value computed above. By default, only the top family is taken forward. Sub-family placement is as in the original manuscript. Further software optimisation was performed, but did not affect the underlying method. As an example, it is now possible to run using the LUCA database in under 12GB of memory, whereas this previously required in excess of 40GB.
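The filter-then-rank step can be sketched as follows. This is a hedged illustration, not OMAmer's code: the `Candidate` class, its field names, and `select_families` are hypothetical, the per-family scores are assumed to be precomputed, and breaking the final tie in favour of the smaller p-value is an assumption. The example values are made up.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    family: str        # family identifier (hypothetical label)
    neg_log_p: float   # -ln(p-value) from the binomial test
    norm_count: float  # normalised k-mer count (assumed precomputed)
    coverage: float    # proportion of the query covered by matching k-mers


def select_families(candidates, family_alpha=1e-6, top_n=1):
    """Keep families significant at family_alpha, then rank them."""
    # p <= alpha is equivalent to -ln(p) >= -ln(alpha).
    threshold = -math.log(family_alpha)
    significant = [c for c in candidates if c.neg_log_p >= threshold]
    # Order by normalised count, then coverage, then p-value
    # (a larger neg_log_p is a smaller p-value).
    significant.sort(key=lambda c: (-c.norm_count, -c.coverage, -c.neg_log_p))
    return significant[:top_n]


hits = [
    Candidate("HOG_A", neg_log_p=50.0, norm_count=0.40, coverage=0.60),
    Candidate("HOG_B", neg_log_p=30.0, norm_count=0.40, coverage=0.75),
    Candidate("HOG_C", neg_log_p=5.0, norm_count=0.90, coverage=0.95),
]
best = select_families(hits)
```

Here `HOG_C` has the highest normalised count but fails the significance filter, and the tie between `HOG_A` and `HOG_B` is decided by coverage, so `HOG_B` is the single family taken forward.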

0.2.6

14 Jun 07:43

What's Changed

  • support for numpy>1.23

Full Changelog: v0.2.5...v0.2.6