tiny-count: support for sequence-based read counting #279
Merged
taimontgomery merged 22 commits into master on Feb 7, 2023
…d to skip empty paths under the gff_files key
…utputs since this code is essentially an enumerate() with an upper limit
…r lines start with the same flag
…t is shared between it and the new class for non-genomic references. The overlap between the two is pretty much the GenomicArray/StepVector and related functions
…les but for non-GFF read counting. It produces a GenomicArray of reference sequences, where there is one "chromosome" for each sequence, and each chromosome contains an entry for the sequence on the + and - strand. Tags and aliases don't apply in this mode, but they are returned nonetheless for compatibility
…n classes based on the presence of GFF files. Also changing the order in which assign_features and count_reads occur in the class because my preference has changed since it was written
… them for non-genomic counting. The validating regex pattern was copied from the Aug. '22 version of the SAM v1 specification
…gned during calls to .get(). This would allow us to create groups of GFF files, each with a distinct Stage 1 ruleset, that would be pooled in the same tables. Also changing the approach with validation/parsing of SAM headers for non-genomic counting. Since @SQ headers could be very abundant, I'm reusing the parsing results from validation rather than reparsing. NonGenomicAnnotations therefore is now constructed with a dictionary of {sequence_id: sequence_length} which is produced during successful validation.
…lidator. @SQ headers are first validated for syntax (error on missing or incomplete @SQ header). Next they are checked for duplicate entries under the same identifier in each file, and for inconsistent length definitions between files. This is intended to catch issues that might arise from standalone use of tiny-count with third party SAM files, which may have been produced by more than one alignment event using different indexes
the reference parsing object is constructed, and also where references are validated. It just made sense for these to be among the initial operations when running tiny-count. The routine for reorganizing the YAML representation of gff_files is now part of the PathsFile class. I'm happy with how this has cleaned up the code.
- Moving paths_config/paths_file bookkeeping out of ConfigBase. This is used by Configuration and Resume* classes, but not by PathsFile which also subclasses it. It is also better documented and consolidated in one function to make it clearer
- Streamlining path object type checking with is_path_dict(), is_path_str(), and is_path()
- The notice about "no GFF files provided" has been moved to get_gff_config()
- Comment improvements
…ted when printing reports
…empty string. NonGenomicAnnotations has to emulate this.
AnnotationParsing -> ReferenceBase
ReferenceTables -> ReferenceFeatures
NonGenomicAnnotations -> ReferenceSeqs
Also made some corrections to comments:
- Uses of ReferenceTables that also apply to ReferenceSeqs have been changed to "reference parsers" for the most part
- A few spelling mistakes
…ferenceFeatures. No code changes and no functional changes. The previous order was mostly arbitrary and it has been bothering me for a while. The new order mostly prioritizes core functions/concepts and the rest is roughly by calling order. This should make it a little easier to browse.
The ReferenceFeatures class (formerly ReferenceTables) has many changes in this PR, but it looks worse than it is. The only functional changes involved copying a few methods to the base class, ReferenceBase, and refactoring the method signature of __init__() so that the …
… when users don't have GFF files listed. Also correcting instances of "non-genomic" to "sequence-based"
Tested successfully with the ram1 and lib303 datasets, both with full feature sets and with no feature sets. The lib303 dataset was aligned to cel_miRNAs.fa, and miRNA counts were consistent with the full genome alignment.
tiny-count can now perform sequence-based counting as an alternative to feature-based counting. This is useful to users who don't have GFF annotations for their experiment, or users who want to count reads against a set of known sequences.
tiny-count automatically switches to this counting mode when a user's Paths File doesn't have any GFFs listed.
tinyRNA no longer requires GFF files at pipeline startup.
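The automatic mode switch described above can be sketched roughly as follows. This is a minimal illustration, not tiny-count's actual code: the function name `select_counting_mode` is hypothetical, and the sketch assumes the Paths File has already been loaded into a dictionary whose `gff_files` key holds a list of entries with a `path` field. Entries with an empty path are skipped, as one of the commits in this PR describes.

```python
def select_counting_mode(paths_config):
    """Pick a counting mode from a loaded Paths File (hypothetical helper).

    Assumes `paths_config` is a dict and that `gff_files`, if present,
    is a list of dicts that each carry a `path` entry.
    """
    gff_entries = paths_config.get("gff_files") or []

    # Keep only entries that actually name a file; empty paths are skipped
    gff_paths = [
        entry["path"]
        for entry in gff_entries
        if isinstance(entry, dict) and entry.get("path")
    ]

    # No usable GFF files means there are no features to count against,
    # so fall back to counting against the reference sequences themselves
    return "feature-based" if gff_paths else "sequence-based"
```

In this sketch a Paths File that lists only blank GFF entries is treated the same as one that lists none.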
Technical Details
In sequence-based counting mode, Stage 1 selectors cannot be evaluated and are therefore ignored (Select for..., with value..., Classify as..., Source Filter, and Type Filter). Stage 2 and Stage 3 are evaluated as they would be for feature-based counting.
In sequence-based counting mode, "feature" intervals are defined by the @SQ headers of input SAM files. These headers only define a sequence identifier, which is used as the "Feature ID", and a length for each sequence. These headers correspond to the reference fasta that the reads were aligned against (in bowtie's case, this is the fasta input to bowtie-build). Reads are counted for alignments to each of these reference sequences on both strands. Unlike in feature-based counting, all rules are evaluated for all reference sequences in Stages 2 and 3.
SAM @SQ headers are evaluated to ensure that they are present in each file, they contain the required fields, no identifier appears more than once in each file, and identifiers have a consistent length indicated in the headers of all input SAM files.
Closes #277
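The @SQ header checks described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the validator added in this PR: the function names `parse_sq_headers` and `validate_sq_headers` are hypothetical, and the two regular expressions are loose stand-ins for the stricter pattern that the PR copies from the SAM v1 specification. The result is a `{sequence_id: sequence_length}` dictionary of the kind the commit messages say is passed to the sequence-based reference parser.

```python
import re

# Loose, illustrative patterns (assumption); the PR uses the SAM v1 spec regex
SN_FIELD = re.compile(r"\tSN:([^\t]+)")
LN_FIELD = re.compile(r"\tLN:(\d+)")


def parse_sq_headers(sam_lines):
    """Return {sequence_id: length} from one SAM file's header lines."""
    seqs = {}
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header section is over; alignments follow
        if not line.startswith("@SQ\t"):
            continue  # other header types (@HD, @PG, ...) are not checked here
        line = line.rstrip("\n")
        sn, ln = SN_FIELD.search(line), LN_FIELD.search(line)
        if sn is None or ln is None:
            raise ValueError(f"Incomplete @SQ header: {line!r}")
        seq_id, length = sn.group(1), int(ln.group(1))
        if seq_id in seqs:
            raise ValueError(f"Duplicate @SQ identifier: {seq_id!r}")
        seqs[seq_id] = length
    if not seqs:
        raise ValueError("No @SQ headers found")
    return seqs


def validate_sq_headers(sam_files):
    """Check @SQ headers across files.

    `sam_files` maps a file name to an iterable of its lines.
    Returns the merged {sequence_id: length} dictionary.
    """
    merged = {}
    for name, lines in sam_files.items():
        for seq_id, length in parse_sq_headers(lines).items():
            # setdefault stores the first length seen, then returns it for comparison
            if merged.setdefault(seq_id, length) != length:
                raise ValueError(
                    f"Inconsistent length for {seq_id!r} in {name}: "
                    f"{length} != {merged[seq_id]}"
                )
    return merged
```

Reusing the parsed dictionary, rather than reparsing the headers later, follows the reasoning given in the commit messages: @SQ headers can be very abundant, so a single pass over them is preferable.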