tiny-count: selective performance improvements via Cython#218
tiny-count: selective performance improvements via Cython#218taimontgomery merged 30 commits intomasterfrom
Conversation
…ementation. Linux compatability requires essentially removing the extra_*_args kwds, but further testing is needed on both Linux and macOS
… changes are actually pretty minimal. There's some extra fluff here to make switching between Cython and HTSeq StepVector easier
… last commit, but it has already been pushed to origin.
After doing a lot of profiling and fiddling with the routine in FeatureSelector.choose(), I'm once again back to flattening the list of Stage 2 candidates via a list comprehension. 1/2 of Stage 2 is performed in the comp., and step 2/2 (calculating min_rank) is now performed with one call on the resulting list. While runtimes are better with this approach, elimination by hierarchy is still flawed. I've also removed the inner generator expression in FeatureCounter.assign_features(). After the recent switch to calling union(*) on StepVector steps, it really doesn't make sense to screen for zero length step sets. This is much more efficient. Strands are now represented with a boolean value which simplifies some things. Strand selection is now performed via a chain of bitwise XOR operations. This prevents us from having to form and pass a tuple for each evaluation.
By slightly adjusting the way case-insensitive dictionary entries are constructed, the construction time of these dictionaries has been substantially reduced which makes GFF parsing faster. During a few test runs I saw reductions in the ballpark of 50% for the GFF parsing stage.
…nition of a templated class. Cython doesn't particularly care, but this is useful for C++ to C++ direct testing for easier debugging. This has also led to some much needed improvements to the PyRef wrapper for PyObjects. This refactoring has also fixed a major issue with the StepVector following recent changes which caused massive accumulations of features. It has also led to a slight reduction in memory. At the end of the day, the procedure and design of our specialized Cython StepVector has been substantially improved.
…toring keys/values rather than NamedTuples. The overhead for the convenience of having named indices was far too high, and definitely not worth it for this particular feature
…en our Cython or HTSeq's StepVector. setup.py: refactored args for Cython.Extension, added '-O3', and added an extension entry for test_stepvec.pyx _stepvector.pyx: misc. improvements
…in CaseInsensitiveDict
# Conflicts: # setup.py # tiny/rna/counter/features.py # tiny/rna/counter/hts_parsing.py # tiny/rna/counter/matching.py # tiny/rna/counter/statistics.py
…le args aren't used on other platforms. Also cleaning up some boilerplate code
…s. This has the advantage of 1) skipping the condition checks if not built with debugging enabled, and 2) the debug member variable in the PyRef class was bad design, and this allows it to be removed.
… been built on the user's machine. If not, tiny-count produces an error and exists. This should only be relevant if the user is running tinyRNA from the project directory, so the error message has been written accordingly
…tation files can be used for counting and they may be heterogenous in type, this option will need to be defined for each file if we decide to include it. ReferenceTables has been changed to also use "gene_id" as a feature's ID attribute. "ID" is still considered the primary attribute if both are present. This should make things easier for users with GFF2/GTF files
…thon under tests/cython_tests/stepvector would otherwise cause it to be included.
…from its contents. Since subdependencies aren't version locked, these changes include many, many updates that have nothing to do with Cython. Testing shows expected outputs on macOS and Linux using these lockfiles. Unfortunately I don't have a better solution for this at the moment.
Conda will print extra newlines before "done" in the "Executing transaction... done" message under certain conditions. The installation script checks for this "success string" to confirm installation/update since Conda doesn't always return a non-zero exit code on failure. The presence of newlines means the pattern fails to match the success string. To remedy this, the setup script now removes newlines from before checking for the success string. setup.py: Very minor cleanup to group Cython Extension record creation under a function
…not necessary to patch it in order to use the StepVector, as I had once thought. Updated unit_tests_stepvector.py accordingly and also added the Cython native test, since there isn't a clean way I'm aware of for doing this with the unittest module.
…ll grammar corrections in the table in README.md, and a correction for the "unassigned" class explanation in tiny-plot.md
|
Updates 10/1: After further research, I found that the other targets for Cythonization did not yield performance improvements that I felt were sufficient to offset the maintenance costs. In other cases these targets represented sections of the project that I suspect are liable to large changes in the near future, so it doesn't yet make sense to invest the time. For now we'll limit Cythonization to the StepVector. Object and reference count management in the StepVector's PyObject wrapper (PyRef) has been improved. This is a much better design and has resulted in a small reduction of memory usage. Debug statements in PyRef are now surrounded by preprocessor directives so that the "if (debug)" condition isn't incessantly checked if the StepVector hasn't been specifically compiled for debugging. Performance improvements have been made in GFF parsing routines and in FeatureCounter.assign_features() Unit tests have been added for all languages involved:
Conda dependencies had to be updated following the addition of the Cython dependency. Unfortunately we don't currently have a good way of doing this without updating the entire subdependency tree, since they are not pinned in environment.yml. Testing on macOS and Linux with the new dependency configuration produces expected outputs. I'll be thinking about better ways of doing this in the future. Additional command line arguments have been added to tiny-count and the Run Config to allow users to select between the HTSeq and Cython StepVector, and to allow features which have multiple ID values. A feature's ID can now be defined by either an The installation script has also been corrected so that it can handle the arbitrary newlines that Conda now inserts between "Executing transaction ..." and "done". This is used for determining if the environment installation/upgrade was successful, so this was causing an occasional error when the arbitrary newlines were present. |
…encies on macOS. setup.py: Now treats the Cython StepVector and tests as optional extensions so that installation doesn't error out when there are unforeseen system incompatibilities. Installation also determines the path to the macOS SDK in a way that will be compatible with conda-build pipelines (if/when we have a bioconda recipe for tinyRNA). Also: better condition checks for determining when we need to search for the SDK, and better comments/docstrings. setup.sh: CommandLineTools is now checked for and installed if not present because it is required for the C++ compiler on macOS. This marks another curious exception to the Conda isolation model. As far as I can tell, build dependencies on Linux are already covered by Conda subdependencies.
…ard errors, and instead falls back to using HTSeq's StepVector. This is because setup.py has been changed to skip installation of extensions that produce build errors. Minor correctness change to the --step-vector command line option (is now --stepvector). Also cleaned up get_macos_sdk_path() in setup.py
|
Updates 10/3:
Misc.: tiny-count's new |
|
Tested as new installation on Mojave with a non-sudo user, as a reinstallation on Monterey with sudo user, and as a reinstallation on Linux server with sudo user. All installed Cython step vector and ran successfully. Both Mac installations already had Xcode command line tools installed. |
Currently, this PR offers Cython optimizations for HTSeq's StepVector. It represents a working Cython integration with an embedded package. I intend to address several other targets for Cython optimization before removing the draft status from this PR.
Builds on Linux will be supported in later commits. For now, these changes are only compatible with macOS systems on which the
xcode-select --installcommand has been executed. Support for Linux will be offered in a later commit.