tiny-count: GFF validation and reliability improvements by AlexTate · Pull Request #236 · MontgomeryLab/tinyRNA

AlexTate · 2022-10-07T22:33:05Z

GFF validation now takes place at the start of end-to-end pipeline runs, and at the start of tiny-count when it is called as a standalone step. This PR also introduces support for unstranded features, which are represented internally with the value None, as well as a new overlap selector: anchored.

Strandedness and an appropriate ID attribute are checked for on a per-feature basis. After parsing all GFF files, the total set of chromosome identifiers is checked against the user's sequence files. In order of priority, sequence files include bowtie indexes, reference genomes, and alignment SAM files. The first two options can state with certainty that there isn't chromosome overlap between GFF and sequence files, and if that is determined to be the case, an error is issued and the script quits. The third option uses a 50,000 line sample from each SAM file as a heuristic, and if it fails to find chromosome overlap, a warning is issued and the script continues with normal execution. A warning is issued for unstranded features before proceeding with counting.
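A minimal sketch of the priority order and the error/warning split described above. This is not the GFFValidator code; the function name, parameter names, and string return values are hypothetical and only illustrate the decision logic.

```python
from typing import Optional, Set


def check_chromosomes(gff_chroms: Set[str],
                      index_chroms: Optional[Set[str]] = None,
                      genome_chroms: Optional[Set[str]] = None,
                      sam_sample_chroms: Optional[Set[str]] = None) -> str:
    """Return "ok", "error", or "warning" for the GFF chromosome check.

    Sources are consulted in priority order (bowtie index, reference genome,
    SAM sample) and only the first one that was provided is used.
    """
    for definitive in (index_chroms, genome_chroms):
        if definitive is not None:
            # A complete chromosome list: a lack of overlap is certain,
            # so the caller should report it and quit.
            return "ok" if gff_chroms & definitive else "error"

    if sam_sample_chroms is not None:
        # Only a sample of each SAM file was read: a lack of overlap is
        # suggestive, so the caller should warn and continue.
        return "ok" if gff_chroms & sam_sample_chroms else "warning"

    return "ok"  # nothing to compare against
```

With this shape, a mismatch against an index or genome maps to a hard stop, while the same mismatch against sampled SAM files only produces a warning.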

Parent is now accepted as an ID attribute if ID and gene_id are missing. Features describing entire chromosomes are also skipped; Ensembl supplies these in their gff3 files and, in addition to not being useful as a selection target in tiny-count, they also lack strand information so they were throwing errors.

Closes #235

…tion from tiny-count. If multiple ID values are listed, they are now concatenated rather than selecting the first. I think this will be much more intuitive and it also releases ReferenceTables.get_figure_id() so that it can be used without a constructed ReferenceTables object. I've also converted the argparse output in tiny-count to a read-only dictionary. This prefs object is being passed around to a LOT of classes in tiny-count, and in doing so we risk accidentally changing preferences. This "bug" was previously leveraged by the StepVector routine; it has been refactored to no longer rely on the mutability of prefs.
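One way to get a read-only view of argparse output is the standard library's `types.MappingProxyType`. This is a sketch under that assumption; the PR may use a different mechanism, and the `--out-prefix` option below is hypothetical.

```python
import argparse
from types import MappingProxyType

parser = argparse.ArgumentParser()
parser.add_argument("--out-prefix", default="run1")  # hypothetical option

# Parse an explicit (empty) argument list so the example doesn't read sys.argv
args = parser.parse_args([])

# Copy the namespace into a dict, then expose it through a read-only proxy.
# Lookups work as usual; item assignment raises TypeError.
prefs = MappingProxyType(dict(vars(args)))
```

Any class that receives `prefs` can still read `prefs["out_prefix"]`, but an accidental `prefs["out_prefix"] = ...` fails immediately instead of silently changing preferences for every other holder of the object.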

…tup (configuration.py) and tiny-count startup (counter.py) GFF validation is treated as an optional step that must be specifically requested in configuration.py. This is because we will assume that resume runs are using inputs that have already been validated. GFF validation is skipped in tiny-count during pipeline runs. This is because we will assume that both end-to-end runs and resume runs are using inputs that have already been validated.

…cs will now read up to 50,000 lines of each SAM file (while checking every 10,000 lines for chromosome matches) because it is quite a bit faster than I assumed. For 9 library files this represents only ~0.4s of runtime
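The sampling strategy described here can be sketched as follows. The function name and signature are hypothetical, and skipping header lines when counting is an assumption rather than something the commit message states.

```python
from typing import Set, TextIO


def sam_shares_chroms(sam: TextIO, gff_chroms: Set[str],
                      max_lines: int = 50_000,
                      check_every: int = 10_000) -> bool:
    """Sample alignment lines and report whether any chromosome is shared."""
    seen: Set[str] = set()
    n = 0
    for line in sam:
        if line.startswith("@"):
            continue  # header line; not counted toward the sample (assumption)
        n += 1
        fields = line.split("\t", 3)
        if len(fields) > 2 and fields[2] != "*":
            seen.add(fields[2])  # RNAME is the third SAM column
        if n % check_every == 0 and seen & gff_chroms:
            return True  # early exit at a periodic checkpoint
        if n >= max_lines:
            break
    return bool(seen & gff_chroms)
```

Checking at intervals rather than on every line keeps the set intersection off the hot path while still allowing an early exit for files whose chromosomes match.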

…all 3 keys were queried with every function call, in reverse order from lowest to highest priority, even if the preferred key was present. Now the chain will check the highest priority keys first, and continue as soon as a match is found
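A short sketch of a highest-priority-first lookup that stops at the first match. The function name, the attribute container (a dict of tuples), and the comma joiner for multiple values are assumptions for illustration; the key order ID, gene_id, Parent follows the PR description.

```python
from typing import Dict, Optional, Tuple

ID_KEYS = ("ID", "gene_id", "Parent")  # highest priority first


def pick_feature_id(attrs: Dict[str, Tuple[str, ...]]) -> Optional[str]:
    """Return the ID from the highest-priority key present, else None."""
    for key in ID_KEYS:
        values = attrs.get(key)
        if values:
            # Multiple values are concatenated rather than taking the first.
            # The "," joiner is an assumption, not confirmed by the PR.
            return ",".join(values)
    return None
```

Because the loop returns on the first hit, lower-priority keys are never queried when a preferred key is present.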

…s mapped from +/- to True/False. Now, if a feature's strand is anything but +/-, it is mapped to None. The GFFValidator produces a warning about this but no longer treats it as a hard error. Per Tai, a strand type of None matches strand selectors for "sense", "antisense", and "both." 5' and 3' anchored selectors can also evaluate these features, but evaluation does not distinguish between 5' and 3' ends.

AlexTate · 2022-10-12T03:29:38Z

Update 10/11:

We are removing the requirement for stranded features. Currently strand is mapped from +/- to True/False. Now, if a feature's strand is anything but +/-, it is mapped to None. The GFFValidator produces a warning about this but no longer treats it as a hard error. Per Tai, a strand type of None matches strand selectors for "sense", "antisense", and "both," and 5'/3' anchored selectors can also evaluate these features, but evaluation does not distinguish between 5' and 3' ends.
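To make the None strand behavior concrete, here is a minimal sketch. Both function names are hypothetical, and the choice of True for "+" and False for "-" is an assumption about the internal representation.

```python
from typing import Optional


def map_strand(gff_strand: str) -> Optional[bool]:
    """Map a GFF strand column to True/False, or None if it isn't + or -."""
    return {"+": True, "-": False}.get(gff_strand)


def strand_matches(selector: str,
                   feature_strand: Optional[bool],
                   alignment_strand: bool) -> bool:
    """Evaluate a "sense", "antisense", or "both" strand selector."""
    if selector not in ("sense", "antisense", "both"):
        raise ValueError(f"unknown strand selector: {selector}")
    if selector == "both" or feature_strand is None:
        return True  # unstranded features match every strand selector
    same = feature_strand == alignment_strand
    return same if selector == "sense" else not same
```

For example, a feature with strand `.` maps to None and therefore passes a rule requiring "antisense", whereas a `+` feature with a `+` alignment would not.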

AlexTate · 2022-10-12T22:57:37Z

Update 10/12: introducing anchored as an overlap selector. If nothing else, this should help explain the behavior of 5'/3' anchored selectors with unstranded features. Documentation has been updated to reflect the changes described above.

…and values after they have been parsed. Note that this happens after comma separated values have been split. This means that value list items can contain URL encoded commas which are then preserved as part of the value (rather than being split on the encoded comma)
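The order of operations matters here, and a small sketch shows why. The function name is hypothetical; `urllib.parse.unquote` is the standard library call for percent-decoding.

```python
from urllib.parse import unquote


def parse_attr_values(raw: str) -> tuple:
    """Split a GFF attribute value list on commas, then URL-decode each item."""
    return tuple(unquote(value) for value in raw.split(","))


# Split first, decode second: the encoded comma stays inside its value
split_then_decode = parse_attr_values("a%2Cb,c")

# Decode first, split second: the encoded comma becomes a separator
decode_then_split = tuple(unquote("a%2Cb,c").split(","))
```

In the first case the result has two items, `a,b` and `c`; in the second, the same input is broken into three items.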

taimontgomery · 2022-10-16T04:18:22Z

Tested on ram and At data with parent attribute only and with and without strand info.

AlexTate added 9 commits October 6, 2022 18:08

If a feature lacks an ID/gene_id attribute, but has a Parent attribut…

d0de331

…e, then the value of Parent is used as the ID. It is no longer treated as an error.

The GFF parsing loop (and error handling) has been moved from Referen…

9f47f7c

…ceTables to its own standalone function. This allows parsing machinery to be shared with the new GFFValidation class.

Unit tests for the new GFFValidation class. Still needs tests for ali…

f83f2f1

…gnment_chroms_mismatch_heuristic()

Small corrections for configuration.py's usage of GFFValidator

71ac07a

Script termination via sys.exit() no longer results in a traceback be…

8743abb

…ing printed. Adding this exception so that we can call sys.exit() on validation failure and let the validation report speak for itself, rather than following the report with an unnecessary stacktrace

Final corrections for unit tests. Missing tests have been added and e…

3bd3ca3

…xisting tests have been updated.

AlexTate requested a review from taimontgomery October 9, 2022 22:51

AlexTate marked this pull request as ready for review October 9, 2022 22:51

AlexTate added 2 commits October 11, 2022 19:38

AlexTate marked this pull request as draft October 12, 2022 18:00

AlexTate added 5 commits October 12, 2022 15:41

Decided to add IntervalAnchorMatch to the list of available selectors…

f4e619d

… for the Overlap column. I think this makes it easier to explain how the 5'/3' anchored selectors behave with unstranded features

Updates for the input file requirements table. Removing stranded feat…

6a67b89

…ures and adding that Parent is now used as a fallback ID attribute

Adding the "anchored" overlap selector and updates for the behavior o…

08221c0

…f unstranded features. Also refined/simplified the Overlap explanation in Stage 2

Small correction/refinement of Stage 2 explanation

ba6659e

Update to support the new None strand type

6d21b23

AlexTate marked this pull request as ready for review October 12, 2022 22:58

AlexTate changed the base branch from master to issue-3 October 12, 2022 22:58

AlexTate changed the base branch from issue-3 to master October 12, 2022 22:58

AlexTate added 5 commits October 14, 2022 13:35

Bugfix to avoid circular references in ReferenceTables.parents when a…

b384d1b

… feature has a Parent= but no ID/gene_id=. This was causing an infinite loop when ReferenceTables later tried to find the root ancestor of these features.

Updating GFF file requirements to notify users that features listing …

12b3872

…multiple parents aren't supported.

Unrelated minor changes: correcting user facing error messages that r…

2bf00aa

…efer to tiny-collapse as Collapser

Added a test that downloads complete GFF/GTF genomes from Ensembl fo…

73ed931

…r C. elegans and Arabidopsis, then runs all four files through ReferenceTables.get(). Genomes need only be downloaded once. Nevertheless, this is a long-running test so I've set it for manual activation only

taimontgomery merged commit d9be169 into master Oct 16, 2022

AlexTate mentioned this pull request Oct 22, 2022

Introduce Validation classes #70

Closed

taimontgomery merged 21 commits into master from issue-235
