tiny-count: GFF validation and reliability improvements #236
taimontgomery merged 21 commits into master
…tion from tiny-count. If multiple ID values are listed, they are now concatenated rather than selecting the first. I think this will be much more intuitive, and it also releases ReferenceTables.get_figure_id() so that it can be used without a constructed ReferenceTables object. I've also converted the argparse output in tiny-count to a read-only dictionary. This prefs object is being passed around to a LOT of classes in tiny-count, and in doing so we risk accidentally changing preferences. This "bug" was previously leveraged by the StepVector routine; it has been refactored to no longer rely on the mutability of prefs.
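The commit above does not say how the read-only dictionary is built. As a minimal sketch, assuming the standard library's `types.MappingProxyType` is an acceptable stand-in, parsed arguments can be frozen like this (the `get_prefs` helper and both command-line flags are hypothetical, used only for illustration):

```python
import argparse
from types import MappingProxyType


def get_prefs(argv):
    """Parse arguments and return them as a read-only mapping (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Both options below are illustrative placeholders, not tiny-count's real interface
    parser.add_argument("--out-prefix", default="tiny-count")
    parser.add_argument("--normalize-by-hits", action="store_true")
    args = parser.parse_args(argv)
    # Copy the namespace into a plain dict, then wrap it in a read-only view
    return MappingProxyType(dict(vars(args)))


prefs = get_prefs(["--out-prefix", "run1"])
print(prefs["out_prefix"])

try:
    prefs["out_prefix"] = "changed"  # any class receiving prefs would hit this error
except TypeError as err:
    print(f"prefs is read-only: {err}")
```

With this approach, a class that tries to modify shared preferences fails loudly instead of silently changing behavior elsewhere.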
…e, then the value of Parent is used as the ID. It is no longer treated as an error.
…ceTables to its own standalone function. This allows parsing machinery to be shared with the new GFFValidation class.
…gnment_chroms_mismatch_heuristic()
…tup (configuration.py) and tiny-count startup (counter.py) GFF validation is treated as an optional step that must be specifically requested in configuration.py. This is because we will assume that resume runs are using inputs that have already been validated. GFF validation is skipped in tiny-count during pipeline runs. This is because we will assume that both end-to-end runs and resume runs are using inputs that have already been validated.
…ing printed. Adding this exception so that we can call sys.exit() on validation failure and let the validation report speak for itself, rather than following the report with an unnecessary stacktrace
…cs will now read up to 50,000 lines of each SAM file (while checking every 10,000 lines for chromosome matches) because it is quite a bit faster than I assumed. For 9 library files this represents only ~0.4s of runtime
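The sampling strategy described above can be sketched roughly as follows. This is not the PR's implementation: the function name, the choice to skip header lines, and the use of the RNAME column alone are assumptions made for illustration.

```python
from itertools import islice


def chroms_overlap_heuristic(sam_lines, gff_chroms, max_lines=50_000, check_every=10_000):
    """Return True if any chromosome in the sampled SAM lines also appears in gff_chroms.

    sam_lines is any iterable of text lines, such as an open SAM file handle.
    """
    seen = set()
    for n, line in enumerate(islice(sam_lines, max_lines), start=1):
        if not line.startswith("@"):  # assumption: header lines are ignored
            fields = line.split("\t")
            if len(fields) > 2 and fields[2] != "*":  # column 3 is RNAME; "*" means unmapped
                seen.add(fields[2])
        # Check periodically so a match can end the scan early
        if n % check_every == 0 and not seen.isdisjoint(gff_chroms):
            return True
    # Final check covers files shorter than check_every and any trailing partial batch
    return not seen.isdisjoint(gff_chroms)


sample = [
    "@HD\tVN:1.0\n",
    "read1\t0\tI\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
]
print(chroms_overlap_heuristic(sample, {"I", "II"}))
print(chroms_overlap_heuristic(sample, {"chr1"}))
```

Because only a sample is read, a negative result is a hint rather than proof, which fits treating it as a warning rather than an error.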
…xisting tests have been updated.
…all 3 keys were queried with every function call, in reverse order from lowest to highest priority, even if the preferred key was present. Now the chain will check the highest priority keys first, and continue as soon as a match is found
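A priority lookup of this kind might look like the sketch below. The key order (ID, then gene_id, then Parent) is inferred from the PR description, the comma used to concatenate multiple values is an assumption, and `get_feature_id` is a hypothetical name rather than the function in tiny-count.

```python
# Assumed priority order, highest first
ID_KEYS = ("ID", "gene_id", "Parent")


def get_feature_id(attrs):
    """Return the ID for a feature from a dict of attribute key -> tuple of values.

    Stops at the first key that is present, so lower-priority keys are never queried.
    """
    for key in ID_KEYS:
        values = attrs.get(key)
        if values:
            # Multiple values are concatenated rather than taking only the first
            return ",".join(values)
    return None  # no usable ID attribute; a validator would report this


print(get_feature_id({"gene_id": ("WBGene0001",), "Parent": ("Gene:X",)}))
print(get_feature_id({"Parent": ("Gene:X",)}))
```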
…s mapped from True/False to +/-. Now, if a feature's strand is anything but +/-, it is mapped to None. The GFFValidator produces a warning about this but no longer treats it as a hard error. Per Tai, a strand type of None matches strand selectors for "sense", "antisense", and "both." 5' and 3' anchored selectors can also evaluate these features, but evaluation does not distinguish between 5' and 3' ends.
Update 10/11: We are removing the requirement for stranded features. Currently strand is mapped from True/False to +/-. Now, if a feature's strand is anything but +/-, it is mapped to None. The GFFValidator produces a warning about this but no longer treats it as a hard error. Per Tai, a strand type of None matches strand selectors for "sense", "antisense", and "both," and 5'/3' anchored selectors can also evaluate these features, but evaluation does not distinguish between 5' and 3' ends.
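To make the unstranded rule concrete, here is a small sketch of the behavior described in this update. The function names and the choice of True for "+" are assumptions; only the three-valued strand and the "None matches every strand selector" rule come from the text above.

```python
def parse_strand(column):
    """Map the GFF strand column to True ('+'), False ('-'), or None (anything else)."""
    return {"+": True, "-": False}.get(column)


def strand_matches(selector, feature_strand, read_strand):
    """Evaluate a strand selector ('sense', 'antisense', or 'both') for one read."""
    if selector not in ("sense", "antisense", "both"):
        raise ValueError(f"unknown strand selector: {selector}")
    # Unstranded features (None) match every strand selector
    if selector == "both" or feature_strand is None:
        return True
    same = feature_strand == read_strand
    return same if selector == "sense" else not same


unstranded = parse_strand(".")
print(unstranded)
print(strand_matches("antisense", unstranded, read_strand=True))
print(strand_matches("sense", parse_strand("+"), read_strand=False))
```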
… for the Overlap column. I think this makes it easier to explain how the 5'/3' anchored selectors behave with unstranded features
…ures and adding that Parent is now used as a fallback ID attribute
…f unstranded features. Also refined/simplified the Overlap explanation in Stage 2
Update 10/12: introducing …
… feature has a Parent= but no ID/gene_id=. This was causing an infinite loop when ReferenceTables later tried to find the root ancestor of these features.
…and values after they have been parsed. Note that this happens after comma separated values have been split. This means that value list items can contain URL encoded commas which are then preserved as part of the value (rather than being split on the encoded comma)
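The order of operations matters here, and a short sketch shows why. This example uses the standard library's `urllib.parse.unquote`; the helper name is hypothetical and the PR may decode values differently.

```python
from urllib.parse import unquote


def parse_attr_values(raw):
    """Split a GFF attribute value list on literal commas, then URL-decode each item."""
    return tuple(unquote(value) for value in raw.split(","))


# "%2C" is an encoded comma: it survives the split and is decoded afterwards,
# so it stays inside the first value instead of creating a third one
print(parse_attr_values("kinase%2C putative,protein_coding"))
```

Decoding before splitting would instead turn the encoded comma into a separator and break the first value in two.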
…multiple parents aren't supported.
…efer to tiny-collapse as Collapser
…r C. elegans and Arabidopsis, then runs all four files through ReferenceTables.get(). Genomes need only be downloaded once. Nevertheless, this is a long-running test so I've set it for manual activation only
Tested on ram and At data with Parent attribute only and without strand info.
GFF validation now takes place at the start of end-to-end pipeline runs, and at the start of tiny-count when it is called as a standalone step. This PR also introduces support for unstranded features, which are represented internally with the value None, as well as a new overlap selector: anchored.
Strandedness and an appropriate ID attribute are checked for on a per-feature basis. After parsing all GFF files, the total set of chromosome identifiers is checked against the user's sequence files. In order of priority, sequence files include bowtie indexes, reference genomes, and alignment SAM files. The first two options can state with certainty that there isn't chromosome overlap between GFF and sequence files, and if that is determined to be the case, an error is issued and the script quits. The third option uses a 50,000 line sample from each SAM file as a heuristic, and if it fails to find chromosome overlap, a warning is issued and the script continues with normal execution. A warning is issued for unstranded features before proceeding with counting.
Parent is now accepted as an ID attribute if ID and gene_id are missing. Features describing entire chromosomes are also skipped; Ensembl supplies these in their gff3 files and, in addition to not being useful as a selection target in tiny-count, they also lack strand information so they were throwing errors.
Closes #235