tiny-count: support for sequence-based read counting by AlexTate · Pull Request #279 · MontgomeryLab/tinyRNA

AlexTate · 2023-01-27T22:24:34Z

tiny-count can now perform sequence-based counting as an alternative to feature-based counting. This is useful to users who don't have GFF annotations for their experiment, or users who want to count reads against a set of known sequences.

tiny-count automatically switches to this counting mode when a user's Paths File doesn't have any GFFs listed.
tinyRNA no longer requires GFF files at pipeline startup.
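The automatic mode switch described above can be sketched roughly as follows. This is a minimal illustration, not tiny-count's actual code: the function name `select_counting_mode` is hypothetical, and the assumed Paths File layout (a `gff_files` list of mappings, each with a `path` key) is an assumption made for the example.

```python
def select_counting_mode(paths_config: dict) -> str:
    """Pick a counting mode from a parsed Paths File (hypothetical helper).

    Assumes gff_files is a list of mappings that each carry a "path" key;
    entries with an empty or missing path are skipped.
    """
    gff_entries = paths_config.get("gff_files") or []
    gff_paths = [
        entry["path"]
        for entry in gff_entries
        if isinstance(entry, dict) and entry.get("path")
    ]
    # With no usable GFF paths, fall back to sequence-based counting
    return "feature-based" if gff_paths else "sequence-based"
```

Under these assumptions, a Paths File whose `gff_files` key is missing, empty, or contains only blank paths selects sequence-based counting.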

Technical Details

In sequence-based counting mode, Stage 1 selectors (Select for..., with value..., Classify as..., Source Filter, and Type Filter) cannot be evaluated and are therefore ignored. Stage 2 and Stage 3 are evaluated as they would be for feature-based counting.

In sequence-based counting mode, "feature" intervals are defined by the @SQ headers of input SAM files. These headers only define a sequence identifier, which is used as the "Feature ID", and a length for each sequence. These headers correspond to the reference fasta that the reads were aligned against (in bowtie's case, this is the fasta input to bowtie-build). Reads are counted for alignments to each of these reference sequences on both strands. Unlike in feature-based counting, all rules are evaluated for all reference sequences in Stages 2 and 3.
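As a rough sketch of the idea above, each reference sequence can be turned into one whole-length interval per strand. The names `SeqInterval` and `intervals_from_sq` are hypothetical, and the 0-based half-open coordinates are an assumption; the PR itself builds on GenomicArray/StepVector structures, which this example does not use.

```python
from typing import Dict, List, NamedTuple


class SeqInterval(NamedTuple):
    """One whole-sequence "feature" interval (hypothetical type)."""
    seq_id: str   # taken from the SN field of an @SQ header; doubles as the Feature ID
    start: int    # assumed 0-based, inclusive
    end: int      # assumed exclusive; equals the LN field of the @SQ header
    strand: str   # "+" or "-"


def intervals_from_sq(seq_lengths: Dict[str, int]) -> List[SeqInterval]:
    """Build one interval per strand for every reference sequence."""
    intervals = []
    for seq_id, length in seq_lengths.items():
        for strand in ("+", "-"):
            intervals.append(SeqInterval(seq_id, 0, length, strand))
    return intervals
```

For example, `intervals_from_sq({"mir-1": 22})` yields two intervals spanning the full 22 bases of `mir-1`, one on each strand.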

SAM @SQ headers are validated to ensure that they are present in each file, that they contain the required fields, that no identifier appears more than once within a file, and that each identifier has the same length in the headers of all input SAM files.
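The checks listed above could look something like the following. This is a simplified sketch rather than the validator added in this PR: `SamHeaderError`, `parse_sq_line`, and `validate_sq_headers` are hypothetical names, and the field pattern is a loose stand-in for the stricter regex in the SAM specification.

```python
import re

# Loose TAG:VALUE pattern for illustration only; the SAM spec is stricter
SQ_FIELD = re.compile(r"^([A-Za-z][A-Za-z0-9]):(.+)$")


class SamHeaderError(ValueError):
    """Raised when @SQ headers fail validation (hypothetical exception)."""


def parse_sq_line(line):
    """Return (sequence_id, length) from a single @SQ header line."""
    fields = {}
    for token in line.rstrip("\n").split("\t")[1:]:
        match = SQ_FIELD.match(token)
        if match:
            fields[match.group(1)] = match.group(2)
    if "SN" not in fields or "LN" not in fields:
        raise SamHeaderError(f"Incomplete @SQ header: {line!r}")
    if not re.fullmatch(r"[0-9]+", fields["LN"]):
        raise SamHeaderError(f"Non-numeric LN in @SQ header: {line!r}")
    return fields["SN"], int(fields["LN"])


def validate_sq_headers(headers_by_file):
    """Validate {filename: [header lines]} and return {sequence_id: length}."""
    lengths = {}
    for filename, lines in headers_by_file.items():
        sq_lines = [ln for ln in lines if ln.startswith("@SQ\t")]
        if not sq_lines:
            raise SamHeaderError(f"{filename}: no @SQ headers found")
        seen = set()
        for line in sq_lines:
            seq_id, length = parse_sq_line(line)
            if seq_id in seen:
                raise SamHeaderError(f"{filename}: duplicate identifier {seq_id}")
            seen.add(seq_id)
            # setdefault records the first length seen; later files must agree
            if lengths.setdefault(seq_id, length) != length:
                raise SamHeaderError(f"{filename}: inconsistent length for {seq_id}")
    return lengths
```

Returning the parsed `{sequence_id: length}` mapping from validation means the headers do not need to be parsed a second time when the reference intervals are built.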

Closes #277

…les but for non-GFF read counting. It produces a GenomicArray of reference sequences, where there is one "chromosome" for each sequence, and each chromosome contains an entry for the sequence on the + and - strand. Tags and aliases don't apply in this mode, but they are returned nonetheless for compatibility

…gned during calls to .get(). This would allow us to create groups of GFF files, each with a distinct Stage 1 ruleset, that would be pooled in the same tables. Also changing the approach with validation/parsing of SAM headers for non-genomic counting. Since @SQ headers could be very abundant, I'm reusing the parsing results from validation rather than reparsing. NonGenomicAnnotations therefore is now constructed with a dictionary of {sequence_id: sequence_length} which is produced during successful validation.

…lidator. @SQ headers are first validated for syntax (error on missing or incomplete @SQ header). Next they are checked for duplicate entries under the same identifier in each file, and for inconsistent length definitions between files. This is intended to catch issues that might arise from standalone use of tiny-count with third party SAM files, which may have been produced by more than one alignment event using different indexes

the reference parsing object is constructed, and also where references are validated. It just made sense for these to be among the initial operations when running tiny-count. The routine for reorganizing the YAML representation of gff_files is now part of the PathsFile class. I'm happy with how this has cleaned up the code.

- Moving paths_config/paths_file bookkeeping out of ConfigBase. This is used by Configuration and Resume* classes, but not by PathsFile which also subclasses it. It is also better documented and consolidated in one function to make it clearer
- Streamlining path object type checking with is_path_dict(), is_path_str(), and is_path()
- The notice about "no GFF files provided" has been moved to get_gff_config()
- Comment improvements

AnnotationParsing -> ReferenceBase
ReferenceTables -> ReferenceFeatures
NonGenomicAnnotations -> ReferenceSeqs

Also made some corrections to comments:
- Uses of ReferenceTables that also apply to ReferenceSeqs have been changed to "reference parsers" for the most part
- A few spelling mistakes

…ferenceFeatures. No code changes and no functional changes. The previous order was mostly arbitrary and it has been bothering me for a while. The new order mostly prioritizes core functions/concepts and the rest is roughly by calling order. This should make it a little easier to browse.

AlexTate · 2023-02-04T20:44:43Z

The ReferenceFeatures class (formerly ReferenceTables) has many changes in this PR but it looks worse than it is.

The only functional changes involved copying a few methods to the base class, ReferenceBase, and refactoring the method signature of __init__() so that the selector argument is instead passed via get() (see 15e2795). All other changes are the result of reordering the class's methods to group them more logically and improve readability (see 3247ea6).

taimontgomery · 2023-02-07T21:27:37Z

Tested successfully with the ram1 and lib303 datasets, using either full feature sets or no feature sets. The lib303 dataset was aligned to cel_miRNAs.fa, and miRNA counts were consistent with the full genome alignment.

AlexTate added 7 commits January 27, 2023 14:00

Updating PathsFile class to no longer treat gff_files as required, and to skip empty paths under the gff_files key

8b1114a

Minor refactor to match the order of enumerate()'s (index, element) outputs since this code is essentially an enumerate() with an upper limit

0b3cc40

Updating tiny-count's load_gff_files() to treat GFF files as optional

bd168cc

Updating SAM_reader to properly store header data when multiple header lines start with the same flag

46ffd44

Refactoring ReferenceTables to use a base class for functionality that is shared between it and the new class for non-genomic references. The overlap between the two is pretty much the GenomicArray/StepVector and related functions

e7c6c7e

Updating FeatureCounter to switch between the two reference annotation classes based on the presence of GFF files. Also changing the order in which assign_features and count_reads occur in the class because my preference has changed since it was written

3f88265

AlexTate requested a review from taimontgomery January 27, 2023 22:24

AlexTate added 14 commits February 3, 2023 16:44

SAM file headers need some basic validation now that we're relying on them for non-genomic counting. The validating regex pattern was copied from the Aug. '22 version of the SAM v1 specification

f59ef1f

Updating the FeatureCounter constructor to use the new AnnotationParsing approach

e8f4173

Misc. cleanup in validation.py: empty results are more clearly indicated when printing reports

70783b2

Updating unit tests

ddf3bfe

Adding unit tests for SAM @SQ header validation

cb11076

Merge branch 'master' into issue-277

515f1ce

Bugfix: empty classifier fields are parsed from Features Sheet as an empty string. NonGenomicAnnotations has to emulate this.

a0df42a

Updating comments in the Paths File to indicate that gff_files is optional

ff79c98

AlexTate changed the title ~~tiny-count: support for non-GFF read counting~~ tiny-count: support for sequence-based read counting Feb 4, 2023

AlexTate marked this pull request as ready for review February 4, 2023 04:09

Adding a timeout to the sequence-based counting notice that is issued when users don't have GFF files listed. Also correcting instances of "non-genomic" to "sequence-based"

ba97628

taimontgomery merged commit c311418 into master Feb 7, 2023

taimontgomery merged 22 commits into master from issue-277
