tiny-count: support for sequence-based read counting #279

Merged
taimontgomery merged 22 commits into master from issue-277 on Feb 7, 2023

@AlexTate AlexTate commented Jan 27, 2023

tiny-count can now perform sequence-based counting as an alternative to feature-based counting. This is useful to users who don't have GFF annotations for their experiment, or users who want to count reads against a set of known sequences.

tiny-count automatically switches to this counting mode when a user's Paths File doesn't have any GFFs listed.
tinyRNA no longer requires GFF files at pipeline startup.
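A minimal sketch of the mode switch described above. The function name and the shape of each `gff_files` entry (a dict with a `path` key) are assumptions for illustration and are not taken from the tinyRNA source.

```python
def select_counting_mode(paths_config: dict) -> str:
    """Pick a counting mode from a parsed Paths File (hypothetical helper).

    Assumes GFF entries are dicts with a "path" key; entries with an
    empty path are skipped, as are missing or empty gff_files lists.
    """
    gff_files = paths_config.get("gff_files") or []
    listed = [g for g in gff_files if isinstance(g, dict) and g.get("path")]
    return "feature-based" if listed else "sequence-based"


if __name__ == "__main__":
    print(select_counting_mode({"gff_files": [{"path": "annotations.gff3"}]}))
    print(select_counting_mode({"gff_files": [{"path": ""}]}))
```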

Technical Details

In sequence-based counting mode, Stage 1 selectors (Select for..., with value..., Classify as..., Source Filter, and Type Filter) cannot be evaluated and are therefore ignored. Stage 2 and Stage 3 selectors are evaluated as they would be for feature-based counting.

In sequence-based counting mode, "feature" intervals are defined by the @SQ headers of input SAM files. These headers only define a sequence identifier, which is used as the "Feature ID", and a length for each sequence. These headers correspond to the reference fasta that the reads were aligned against (in bowtie's case, this is the fasta input to bowtie-build). Reads are counted for alignments to each of these reference sequences on both strands. Unlike in feature-based counting, all rules are evaluated for all reference sequences in Stages 2 and 3.
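To illustrate how @SQ headers can stand in for annotations, here is a rough sketch that reads the `SN` and `LN` fields from SAM header lines and produces one full-length interval per strand for each reference sequence. The function names and the tuple layout are hypothetical; the PR itself stores these intervals in a GenomicArray rather than a plain list.

```python
def read_sq_headers(sam_lines):
    """Return {sequence_id: length} from the @SQ lines of a SAM header."""
    refs = {}
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header section is over; alignments begin here
        if not line.startswith("@SQ\t"):
            continue  # other header record types (@HD, @PG, ...) are skipped
        tags = line.rstrip("\n").split("\t")[1:]
        fields = dict(tag.split(":", 1) for tag in tags)
        refs[fields["SN"]] = int(fields["LN"])
    return refs


def reference_intervals(refs):
    """One (feature_id, start, end, strand) tuple per sequence and strand."""
    return [(sn, 0, ln, strand) for sn, ln in refs.items() for strand in "+-"]


if __name__ == "__main__":
    header = ["@HD\tVN:1.0\n", "@SQ\tSN:mir-1\tLN:22\n", "@SQ\tSN:mir-2\tLN:23\n"]
    for interval in reference_intervals(read_sq_headers(header)):
        print(interval)
```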

SAM @SQ headers are validated to ensure that they are present in each file, that they contain the required fields, that no identifier appears more than once within a file, and that each identifier's length is consistent across the headers of all input SAM files.
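The four checks can be sketched roughly as follows. This is an illustration under assumptions, not the PR's validator: the function name, the input shape (`{filename: [header lines]}`), and the error messages are hypothetical, and the identifier syntax check against the SAM specification's pattern is omitted.

```python
def validate_sq_headers(headers_by_file):
    """Check @SQ headers across files and return {sequence_id: length}."""
    lengths = {}
    for filename, lines in headers_by_file.items():
        sq_lines = [line for line in lines if line.startswith("@SQ")]
        if not sq_lines:
            raise ValueError(f"{filename}: no @SQ headers found")

        seen = set()
        for line in sq_lines:
            tags = line.rstrip("\n").split("\t")[1:]
            fields = dict(tag.split(":", 1) for tag in tags if ":" in tag)
            if "SN" not in fields or "LN" not in fields:
                raise ValueError(f"{filename}: incomplete @SQ header")

            sn, ln = fields["SN"], int(fields["LN"])
            if sn in seen:
                raise ValueError(f"{filename}: duplicate identifier {sn}")
            seen.add(sn)

            # The first length seen for an identifier becomes the reference value
            if lengths.setdefault(sn, ln) != ln:
                raise ValueError(f"{filename}: inconsistent length for {sn}")
    return lengths
```

Returning the parsed lengths from validation means the headers only need to be parsed once, which matters when a file carries many @SQ records.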

Closes #277

…d to skip empty paths under the gff_files key
…utputs since this code is essentially an enumerate() with an upper limit
…t is shared between it and the new class for non-genomic references. The overlap between the two is pretty much the GenomicArray/StepVector and related functions
…les but for non-GFF read counting. It produces a GenomicArray of reference sequences, where there is one "chromosome" for each sequence, and each chromosome contains an entry for the sequence on the + and - strand.

Tags and aliases don't apply in this mode, but they are returned nonetheless for compatibility
…n classes based on the presence of GFF files. Also changing the order in which assign_features and count_reads occur in the class because my preference has changed since it was written
… them for non-genomic counting. The validating regex pattern was copied from the Aug. '22 version of the SAM v1 specification
…gned during calls to .get(). This would allow us to create groups of GFF files, each with a distinct Stage 1 ruleset, that would be pooled in the same tables.

Also changing the approach with validation/parsing of SAM headers for non-genomic counting. Since @SQ headers could be very abundant, I'm reusing the parsing results from validation rather than reparsing. NonGenomicAnnotations is therefore now constructed with a dictionary of {sequence_id: sequence_length}, which is produced during successful validation.
…lidator. @SQ headers are first validated for syntax (error on missing or incomplete @SQ header). Next they are checked for duplicate entries under the same identifier in each file, and for inconsistent length definitions between files. This is intended to catch issues that might arise from standalone use of tiny-count with third-party SAM files, which may have been produced by more than one alignment event using different indexes.
the reference parsing object is constructed, and also where references are validated. It just made sense for these to be among the initial operations when running tiny-count. The routine for reorganizing the YAML representation of gff_files is now part of the PathsFile class. I'm happy with how this has cleaned up the code.
- Moving paths_config/paths_file bookkeeping out of ConfigBase. This is used by Configuration and Resume* classes, but not by PathsFile which also subclasses it. It is also better documented and consolidated in one function to make it clearer
- Streamlining path object type checking with is_path_dict(), is_path_str(), and is_path()
- The notice about "no GFF files provided" has been moved to get_gff_config()
- Comment improvements
…empty string. NonGenomicAnnotations has to emulate this.
AnnotationParsing -> ReferenceBase
ReferenceTables -> ReferenceFeatures
NonGenomicAnnotations -> ReferenceSeqs

Also made some corrections to comments:
- Uses of ReferenceTables that also apply to ReferenceSeqs have been changed to "reference parsers" for the most part
- A few spelling mistakes
…ferenceFeatures. No code changes and no functional changes.

The previous order was mostly arbitrary and it has been bothering me for a while. The new order mostly prioritizes core functions/concepts and the rest is roughly by calling order. This should make it a little easier to browse.
@AlexTate changed the title from "tiny-count: support for non-GFF read counting" to "tiny-count: support for sequence-based read counting" on Feb 4, 2023
@AlexTate AlexTate marked this pull request as ready for review February 4, 2023 04:09

AlexTate commented Feb 4, 2023

The ReferenceFeatures class (formerly ReferenceTables) has many changes in this PR but it looks worse than it is.

The only functional changes involved copying a few methods to the base class, ReferenceBase, and refactoring the method signature of __init__() so that the selector argument is instead passed via get() (see 15e2795). All other changes are the result of reordering the class's methods to group them more logically and improve readability (see 3247ea6).

… when users don't have GFF files listed.

Also correcting instances of "non-genomic" to "sequence-based"
taimontgomery (Collaborator) commented

Tested successfully with the ram1 and lib303 datasets, using either full feature sets or no feature sets. The lib303 dataset was aligned to cel_miRNAs.fa, and the miRNA counts were consistent with those from the full genome alignment.

@taimontgomery taimontgomery merged commit c311418 into master Feb 7, 2023
Linked issue: tiny-count: support a sequence-based counting mode when GFF files aren't provided