Skip to content

tiny-count: selective performance improvements via Cython#218

Merged
taimontgomery merged 30 commits intomasterfrom
issue-215
Oct 4, 2022
Merged

tiny-count: selective performance improvements via Cython#218
taimontgomery merged 30 commits intomasterfrom
issue-215

Conversation

@AlexTate
Copy link
Member

Currently, this PR offers Cython optimizations for HTSeq's StepVector. It represents a working Cython integration with an embedded package. I intend to address several other targets for Cython optimization before removing the draft status from this PR.

Builds on Linux will be supported in later commits. For now, these changes are only compatible with macOS systems on which the xcode-select --install command has been executed. Support for Linux will be offered in a later commit.

AlexTate added 4 commits July 29, 2022 18:17
…ementation. Linux compatability requires essentially removing the extra_*_args kwds, but further testing is needed on both Linux and macOS
… changes are actually pretty minimal. There's some extra fluff here to make switching between Cython and HTSeq StepVector easier
… last commit, but it has already been pushed to origin.
@AlexTate AlexTate requested a review from taimontgomery July 31, 2022 02:43
@AlexTate AlexTate changed the title Counter: selective performance improvements via Cython tiny-count: selective performance improvements via Cython Aug 4, 2022
After doing a lot of profiling and fiddling with the routine in FeatureSelector.choose(), I'm once again back to flattening the list of Stage 2 candidates via a list comprehension. 1/2 of Stage 2 is performed in the comp., and step 2/2 (calculating min_rank) is now performed with one call on the resulting list. While runtimes are better with this approach, elimination by hierarchy is still flawed.

I've also removed the inner generator expression in FeatureCounter.assign_features(). After the recent switch to calling union(*) on StepVector steps, it really doesn't make sense to screen for zero length step sets. This is much more efficient.

Strands are now represented with a boolean value which simplifies some things. Strand selection is now performed via a chain of bitwise XOR operations. This prevents us from having to form and pass a tuple for each evaluation.
By slightly adjusting the way case-insensitive dictionary entries are constructed, the construction time of these dictionaries has been substantially reduced which makes GFF parsing faster. During a few test runs I saw reductions in the ballpark of 50% for the GFF parsing stage.
…nition of a templated class. Cython doesn't particularly care, but this is useful for C++ to C++ direct testing for easier debugging. This has also led to some much needed improvements to the PyRef wrapper for PyObjects. This refactoring has also fixed a major issue with the StepVector following recent changes which caused massive accumulations of features. It has also led to a slight reduction in memory.

At the end of the day, the procedure and design of our specialized Cython StepVector has been substantially improved.
…toring keys/values rather than NamedTuples. The overhead for the convenience of having named indices was far too high, and definitely not worth it for this particular feature
…en our Cython or HTSeq's StepVector.

setup.py: refactored args for Cython.Extension, added '-O3', and added an extension entry for test_stepvec.pyx

_stepvector.pyx: misc. improvements
# Conflicts:
#	setup.py
#	tiny/rna/counter/features.py
#	tiny/rna/counter/hts_parsing.py
#	tiny/rna/counter/matching.py
#	tiny/rna/counter/statistics.py
…le args aren't used on other platforms. Also cleaning up some boilerplate code
…s. This has the advantage of 1) skipping the condition checks if not built with debugging enabled, and 2) the debug member variable in the PyRef class was bad design, and this allows it to be removed.
… been built on the user's machine. If not, tiny-count produces an error and exists. This should only be relevant if the user is running tinyRNA from the project directory, so the error message has been written accordingly
…tation files can be used for counting and they may be heterogenous in type, this option will need to be defined for each file if we decide to include it.

ReferenceTables has been changed to also use "gene_id" as a feature's ID attribute. "ID" is still considered the primary attribute if both are present. This should make things easier for users with GFF2/GTF files
…thon under tests/cython_tests/stepvector would otherwise cause it to be included.
…from its contents. Since subdependencies aren't version locked, these changes include many, many updates that have nothing to do with Cython. Testing shows expected outputs on macOS and Linux using these lockfiles. Unfortunately I don't have a better solution for this at the moment.
Conda will print extra newlines before "done" in the "Executing transaction... done" message under certain conditions. The installation script checks for this "success string" to confirm installation/update since Conda doesn't always return a non-zero exit code on failure. The presence of newlines means the pattern fails to match the success string. To remedy this, the setup script now removes newlines from before checking for the success string.

setup.py:
Very minor cleanup to group Cython Extension record creation under a function
…not necessary to patch it in order to use the StepVector, as I had once thought.

Updated unit_tests_stepvector.py accordingly and also added the Cython native test, since there isn't a clean way I'm aware of for doing this with the unittest module.
@AlexTate AlexTate changed the base branch from master to issue-3 October 2, 2022 00:58
@AlexTate AlexTate changed the base branch from issue-3 to master October 2, 2022 00:58
…ll grammar corrections in the table in README.md, and a correction for the "unassigned" class explanation in tiny-plot.md
@AlexTate
Copy link
Member Author

AlexTate commented Oct 2, 2022

Updates 10/1:

After further research, I found that the other targets for Cythonization did not yield performance improvements that I felt were sufficient to offset the maintenance costs. In other cases these targets represented sections of the project that I suspect are liable to large changes in the near future, so it doesn't yet make sense to invest the time. For now we'll limit Cythonization to the StepVector.

Object and reference count management in the StepVector's PyObject wrapper (PyRef) has been improved. This is a much better design and has resulted in a small reduction of memory usage. Debug statements in PyRef are now surrounded by preprocessor directives so that the "if (debug)" condition isn't incessantly checked if the StepVector hasn't been specifically compiled for debugging.

Performance improvements have been made in GFF parsing routines and in FeatureCounter.assign_features()

Unit tests have been added for all languages involved:

  • Tests in C++ for the wrapped C++ StepVector core
  • Tests in Cython for the C++ <-> Python interface
  • Tests in Python for direct usage of the StepVector, as well as the integrated usage via HTSeq's GenomicArray

Conda dependencies had to be updated following the addition of the Cython dependency. Unfortunately we don't currently have a good way of doing this without updating the entire subdependency tree, since they are not pinned in environment.yml. Testing on macOS and Linux with the new dependency configuration produces expected outputs. I'll be thinking about better ways of doing this in the future.

Additional command line arguments have been added to tiny-count and the Run Config to allow users to select between the HTSeq and Cython StepVector, and to allow features which have multiple ID values.

A feature's ID can now be defined by either an ID or gene_id attribute. If both attributes are present, only the value for ID is used.

The installation script has also been corrected so that it can handle the arbitrary newlines that Conda now inserts between "Executing transaction ..." and "done". This is used for determining if the environment installation/upgrade was successful, so this was causing an occasional error when the arbitrary newlines were present.

@AlexTate AlexTate marked this pull request as ready for review October 2, 2022 01:40
…encies on macOS.

setup.py:
Now treats the Cython StepVector and tests as optional extensions so that installation doesn't error out when there are unforeseen system incompatibilities. Installation also determines the path to the macOS SDK in a way that will be compatible with conda-build pipelines (if/when we have a bioconda recipe for tinyRNA). Also: better condition checks for determining when we need to search for the SDK, and better comments/docstrings.

setup.sh:
CommandLineTools is now checked for and installed if not present because it is required for the C++ compiler on macOS. This marks another curious exception to the Conda isolation model. As far as I can tell, build dependencies on Linux are already covered by Conda subdependencies.
…ard errors, and instead falls back to using HTSeq's StepVector. This is because setup.py has been changed to skip installation of extensions that produce build errors.

Minor correctness change to the --step-vector command line option (is now --stepvector). Also cleaned up get_macos_sdk_path() in setup.py
@AlexTate
Copy link
Member Author

AlexTate commented Oct 3, 2022

Updates 10/3:
Resilience improvements for Cython components.

  1. setup.py: now treats the current Cython components as "optional." For example, if there are build errors due to the host config, the installer will emit a warning and continue with the installation rather than halting. The routine for sourcing the macOS SDK has been preemptively updated so that it will be compatible with the conda-forge/bioconda build pipeline.
  2. setup.sh: now checks for Xcode's command line tools and installs them if they're missing. This marks another curious exception to the Conda isolation model. As far as I can tell, build dependencies on Linux are already covered by Conda subdependencies.
  3. tiny-count will attempt to use the Cython StepVector if instructed, but if importing fails (for example, due to a build issue during installation) then it will emit a warning and continue counting with HTSeq's implementation rather than halting.

Misc.: tiny-count's new --step-vector command line option has been changed to --stepvector for consistency

@taimontgomery
Copy link
Collaborator

taimontgomery commented Oct 4, 2022

Tested as new installation on Mojave with a non-sudo user, as a reinstallation on Monterey with sudo user, and as a reinstallation on Linux server with sudo user. All installed Cython step vector and ran successfully. Both Mac installations already had Xcode command line tools installed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants