tiny-count: GFF validation and reliability improvements #236

Merged
taimontgomery merged 21 commits into master from issue-235
Oct 16, 2022
AlexTate commented Oct 7, 2022

GFF validation now takes place at the start of end-to-end pipeline runs, and at the start of tiny-count when it is called as a standalone step. This PR also introduces support for unstranded features, which are represented internally with the value None, as well as a new overlap selector: anchored.

Strandedness and an appropriate ID attribute are checked for on a per-feature basis. After parsing all GFF files, the total set of chromosome identifiers is checked against the user's sequence files. In order of priority, sequence files include bowtie indexes, reference genomes, and alignment SAM files. The first two options can state with certainty that there isn't chromosome overlap between GFF and sequence files, and if that is determined to be the case, an error is issued and the script quits. The third option uses a 50,000 line sample from each SAM file as a heuristic, and if it fails to find chromosome overlap, a warning is issued and the script continues with normal execution. A warning is issued for unstranded features before proceeding with counting.
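
As a rough illustration of the chromosome check described above, the sketch below compares two sets of identifiers and chooses between a hard error and a warning. The function name, its arguments, and the message strings are hypothetical and are not taken from the tiny-count source.

```python
def check_chrom_overlap(gff_chroms, seq_chroms, definitive):
    """Compare chromosome identifiers from GFF files and sequence files.

    `definitive` is True when seq_chroms came from a bowtie index or a
    reference genome (complete list), and False when it came from a
    sample of SAM alignments (heuristic only).

    Returns None if at least one identifier is shared, otherwise a
    (level, message) tuple. All names here are illustrative assumptions.
    """
    shared = set(gff_chroms) & set(seq_chroms)
    if shared:
        return None
    if definitive:
        # Complete chromosome lists: the mismatch is certain, so the caller should quit
        return ("error", "GFF files and sequence files share no chromosome identifiers.")
    # Sampled alignments: the mismatch is only suspected, so the caller should continue
    return ("warning", "No shared chromosome identifiers were found in the sampled alignments.")
```

A caller would print the message and exit on "error", or print it and continue on "warning".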

Parent is now accepted as an ID attribute if ID and gene_id are missing. Features describing entire chromosomes are also skipped; Ensembl supplies these in their gff3 files and, in addition to not being useful as a selection target in tiny-count, they also lack strand information so they were throwing errors.

Closes #235

…tion from tiny-count. If multiple ID values are listed, they are now concatenated rather than selecting the first. I think this will be much more intuitive and it also releases ReferenceTables.get_figure_id() so that it can be used without a constructed ReferenceTables object.

I've also converted the argparse output in tiny-count to a read-only dictionary. This prefs object is being passed around to a LOT of classes in tiny-count, and in doing so we risk accidentally changing preferences. This "bug" was previously leveraged by the StepVector routine; it has been refactored to no longer rely on the mutability of prefs.
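
One way to get a read-only view of parsed arguments with the standard library is `types.MappingProxyType`. This is only a sketch of the idea; the PR does not say which mechanism tiny-count uses, and the `--out-prefix` option and `get_prefs` name are made up for the example.

```python
import argparse
from types import MappingProxyType


def get_prefs(argv):
    """Parse arguments and return them as a read-only mapping."""
    parser = argparse.ArgumentParser()
    # Hypothetical option, used here only for illustration
    parser.add_argument("--out-prefix", default="tiny")
    args = parser.parse_args(argv)
    # Copy the namespace into a plain dict, then wrap it so that
    # item assignment raises TypeError instead of silently succeeding
    return MappingProxyType(dict(vars(args)))


prefs = get_prefs(["--out-prefix", "run1"])
try:
    prefs["out_prefix"] = "changed"
except TypeError:
    print("prefs is read-only")
```

Classes that receive `prefs` can still read any key, but an accidental write fails loudly at the point where it happens.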
…e, then the value of Parent is used as the ID. It is no longer treated as an error.
…ceTables to its own standalone function. This allows parsing machinery to be shared with the new GFFValidation class.
…tup (configuration.py) and tiny-count startup (counter.py)

GFF validation is treated as an optional step that must be specifically requested in configuration.py. This is because we will assume that resume runs are using inputs that have already been validated.

GFF validation is skipped in tiny-count during pipeline runs. This is because we will assume that both end-to-end runs and resume runs are using inputs that have already been validated.
…ing printed. Adding this exception so that we can call sys.exit() on validation failure and let the validation report speak for itself, rather than following the report with an unnecessary stacktrace
…cs will now read up to 50,000 lines of each SAM file (while checking every 10,000 lines for chromosome matches) because it is quite a bit faster than I assumed. For 9 library files this represents only ~0.4s of runtime
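
A minimal sketch of the sampling idea in this commit, assuming the reference name is read from the third tab-separated column of each alignment line (RNAME in the SAM format). The function name and signature are hypothetical, and skipping header lines without counting them toward the sample is a simplification of mine.

```python
def sam_shares_chroms(sam_lines, gff_chroms, max_lines=50_000, check_every=10_000):
    """Return True if sampled alignments reference any chromosome in gff_chroms."""
    gff_chroms = set(gff_chroms)
    seen = set()
    sampled = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment
        fields = line.split("\t")
        if len(fields) > 2 and fields[2] != "*":
            seen.add(fields[2])  # RNAME column; "*" means unmapped
        sampled += 1
        # Periodic check so that a match can end the scan early
        if sampled % check_every == 0 and seen & gff_chroms:
            return True
        if sampled >= max_lines:
            break
    return bool(seen & gff_chroms)
```

Passing an open file handle as `sam_lines` keeps memory use flat, since lines are consumed one at a time.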
@AlexTate AlexTate requested a review from taimontgomery October 9, 2022 22:51
@AlexTate AlexTate marked this pull request as ready for review October 9, 2022 22:51
…all 3 keys were queried with every function call, in reverse order from lowest to highest priority, even if the preferred key was present. Now the chain will check the highest priority keys first, and continue as soon as a match is found
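
The short-circuiting lookup described in this commit might look something like the following. It assumes the three keys are ID, gene_id, and Parent in that priority order (based on the PR description), and the function name is hypothetical.

```python
# Assumed priority order, highest first
ID_KEYS = ("ID", "gene_id", "Parent")


def first_id_attribute(attrs):
    """Return the value of the highest-priority ID key that is present."""
    for key in ID_KEYS:
        value = attrs.get(key)
        if value:
            # Stop at the first match; lower-priority keys are never queried
            return value
    return None
```

For a feature with only `Parent=gene1`, this returns `gene1` after two misses; for a feature with `ID`, only one lookup is made.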
…s mapped from True/False to +/-. Now, if a feature's strand is anything but +/-, it is mapped to None. The GFFValidator produces a warning about this but no longer treats it as a hard error.

Per Tai, a strand type of None matches strand selectors for "sense", "antisense", and "both." 5' and 3' anchored selectors can also evaluate these features, but evaluation does not distinguish between 5' and 3' ends.
@AlexTate

Update 10/11:

We are removing the requirement for stranded features. Currently strand is mapped from True/False to +/-. Now, if a feature's strand is anything but +/-, it is mapped to None. The GFFValidator produces a warning about this but no longer treats it as a hard error. Per Tai, a strand type of None matches strand selectors for "sense", "antisense", and "both," and 5'/3' anchored selectors can also evaluate these features, but evaluation does not distinguish between 5' and 3' ends.
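
To make the strand behavior concrete, here is a small sketch. It assumes "+" maps to True and "-" to False internally (the direction of that mapping is my assumption), and both function names are hypothetical rather than taken from tiny-count.

```python
def parse_strand(symbol):
    """Map a GFF strand column to True, False, or None for unstranded."""
    # Anything other than "+" or "-" (for example "." or "?") becomes None
    return {"+": True, "-": False}.get(symbol)


def strand_matches(selector, feature_strand, read_strand):
    """Evaluate a 'sense', 'antisense', or 'both' strand selector."""
    if selector == "both" or feature_strand is None:
        # Unstranded features match every strand selector
        return True
    same = feature_strand == read_strand
    return same if selector == "sense" else not same
```

With this shape, `strand_matches("antisense", parse_strand("."), True)` evaluates to True, which is the permissive behavior described for unstranded features.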

@AlexTate AlexTate marked this pull request as draft October 12, 2022 18:00
… for the Overlap column. I think this makes it easier to explain how the 5'/3' anchored selectors behave with unstranded features
…ures and adding that Parent is now used as a fallback ID attribute
…f unstranded features. Also refined/simplified the Overlap explanation in Stage 2
@AlexTate

AlexTate commented Oct 12, 2022

Update 10/12: introducing anchored as an overlap selector. If nothing else, this should help explain the behavior of 5'/3' anchored selectors with unstranded features. Documentation has been updated to reflect the changes described above.

@AlexTate AlexTate marked this pull request as ready for review October 12, 2022 22:58
@AlexTate AlexTate changed the base branch from master to issue-3 October 12, 2022 22:58
@AlexTate AlexTate changed the base branch from issue-3 to master October 12, 2022 22:58
… feature has a Parent= but no ID/gene_id=. This was causing an infinite loop when ReferenceTables later tried to find the root ancestor of these features.
…and values after they have been parsed. Note that this happens after comma separated values have been split. This means that value list items can contain URL encoded commas which are then preserved as part of the value (rather than being split on the encoded comma)
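
The order of operations in this commit matters, so here is a short sketch of it using `urllib.parse.unquote` from the standard library. The function name is hypothetical; only the split-then-decode ordering comes from the commit message.

```python
from urllib.parse import unquote


def parse_attr_values(raw):
    """Split a GFF attribute value list on literal commas, then URL-decode each item."""
    # Decoding after the split keeps an encoded comma (%2C) inside its item
    return tuple(unquote(item) for item in raw.split(","))


print(parse_attr_values("a%2Cb,c"))  # ('a,b', 'c')
```

Decoding before the split would instead yield three items, because the encoded comma would become a separator.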
…r C. elegans and Arabidopsis, then runs all four files through ReferenceTables.get(). Genomes need only be downloaded once. Nevertheless, this is a long-running test so I've set it for manual activation only
@taimontgomery

Tested on ram and At data with Parent attribute only, and with and without strand info.

@taimontgomery taimontgomery merged commit d9be169 into master Oct 16, 2022