tiny-count: expanded support for non-collapsed and 3rd-party-collapsed SAM files #217
Merged
taimontgomery merged 17 commits into master on Aug 15, 2022
…t produced by the pipeline. This commit simply changes how sequence counts are determined from each alignment's QNAME field. If this field does not have Collapser's signature "#_count=#" format, Counter next attempts to parse it as a fastx_collapse string. More broadly speaking, if Counter fails to parse the QNAME field, a default count of 1 is assumed.
… in addition to those made by tiny-collapse. If a SAM file contains fastx-style QNAMEs, its header must also report SO:queryname so that multiple alignments can be properly bundled
…been addressed. SAM file requirements have been relaxed to once again allow non-collapsed SAM files. When these files are encountered, SAM_reader cannot produce decollapsed outputs without parsing and counting the entire file, so a warning is produced and the decollapse routine is skipped
…cess and run_cwltool_native. This was prompted by the addition of another configurable cwltool argument: --tmpdir-prefix. This will allow users to specify the location for temporary/intermediate files if the default location is on a space-limited volume
…ing the bundled read count determination out of statistics.py and into the sam reader. I'm still a little on the fence about it. While this makes it much cleaner to support a variety of collapser utilities (used pre-alignment), it also doesn't make nearly as much sense in the new location. The change slightly increases the runtime, and while it isn't by a significant amount, the %time reduced in statistics.py is far less than the %time increased by having it in SAM_reader. May continue testing. For now, I'm going to reflect on whether this is worth fretting over any further.
…for non-collapsed SAM records
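The SO:queryname requirement mentioned in the commits above can be illustrated with a minimal sketch. It assumes plain-text SAM lines; the helper names `header_sort_order` and `bundle_by_qname` are hypothetical and are not taken from tiny-count's SAM_reader. When a file is sorted by query name, all alignments of one read sit on adjacent lines, so they can be bundled in a single pass.

```python
from itertools import groupby

def header_sort_order(header_lines):
    """Return the SO value from the @HD header line, or None if absent."""
    for line in header_lines:
        if line.startswith("@HD"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None

def bundle_by_qname(sam_lines):
    """Yield (qname, records) for runs of adjacent alignments sharing a QNAME.

    Only valid when the file is sorted by query name; otherwise alignments
    of the same read may be split across several bundles.
    """
    records = (
        line.rstrip("\n").split("\t")
        for line in sam_lines
        if not line.startswith("@")  # skip header lines
    )
    for qname, group in groupby(records, key=lambda rec: rec[0]):
        yield qname, list(group)

# Hypothetical input: one read aligned twice, another aligned once
example = [
    "@HD\tVN:1.0\tSO:queryname\n",
    "@SQ\tSN:I\tLN:1000\n",
    "1-12\t0\tI\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "1-12\t16\tI\t90\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "2-3\t0\tI\t40\t255\t4M\t*\t0\t0\tTTGA\tIIII\n",
]
header = [line for line in example if line.startswith("@")]
bundles = list(bundle_by_qname(example))
```

Checking the header first lets a reader fail early with a clear message instead of silently producing fragmented bundles.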
Member
Author
This PR looks much messier than it actually is because it is derived from issue-187. Please merge PR #214 first.
# Conflicts: # tiny/entry.py
Collaborator
I tested tiny-count on a non-collapsed SAM file (bowtie) for the sample data, and the results matched the pipeline output. Rather than -c CONFIGFILE, maybe use -f FEATURES to make it clear what the argument refers to. We also need to update the documentation to indicate that tiny-count is compatible with non-collapsed SAM files, as well as collapsed SAM files from some external sources, but that runtime and resource usage (memory and storage) are greatly increased for non-collapsed read data, and that we strongly recommend collapsing reads before alignment.
…AM inputs are now accepted by tiny-count. Requested changes have been made to tiny-count's command line arguments. The helpstring has also been updated to indicate that the --decollapse option will be ignored for non-collapsed inputs.
…e description now addresses the effect these files will have on sequence-related stats
Counter is now able to handle non-collapsed SAM files, and files produced from fastx_collapse outputs. SAM files with QNAME fields which do not conform to the tiny-collapse or fastx_collapse format are assumed to represent an original read count of 1.
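As a rough illustration of the fallback order described here, the sketch below tries the tiny-collapse format first, then a fastx-style format, and otherwise assumes a count of 1. The regular expressions and the function name `read_count_from_qname` are assumptions for illustration only (the fastx pattern assumes the `rank-count` header style written by fastx_collapser); this is not Counter's actual parsing code.

```python
import re

# Assumed QNAME shapes (illustrative, not copied from tiny-count):
#   tiny-collapse:   "<id>_count=<n>"   e.g. "0_count=5"
#   fastx_collapser: "<rank>-<n>"       e.g. "3-120"
TINY_COLLAPSE = re.compile(r"^\d+_count=(\d+)$")
FASTX_COLLAPSE = re.compile(r"^\d+-(\d+)$")

def read_count_from_qname(qname: str) -> int:
    """Return the original read count encoded in a QNAME, defaulting to 1."""
    for pattern in (TINY_COLLAPSE, FASTX_COLLAPSE):
        match = pattern.match(qname)
        if match:
            return int(match.group(1))
    # Unrecognized QNAME: treat the alignment as a single, non-collapsed read
    return 1

counts = [read_count_from_qname(q) for q in ("0_count=5", "3-120", "SRR000001.42")]
```

One caveat of any signature-based approach: a non-collapsed read whose name happens to look like `12-34` would be misread as a collapsed count, so the patterns are best kept as strict as the upstream tools allow.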
Preliminary support for BioSeqZip-collapsed inputs has also been added, but it currently cannot be activated, either automatically or manually. The delimiter BioSeqZip uses when appending counts to FASTQ/FASTA headers is not unique enough to serve as a signature, so it will need to be set manually by the user rather than detected automatically.
I've also included a new tmp_directory parameter in the Paths Sheet. During testing I ran into issues with non-collapsed bowtie outputs consuming all available space on the primary volume, so it made sense to me (and likely to others) to be able to specify a temporary/intermediate directory other than the one specified via $TMP.
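For context, a setting like this could be forwarded to cwltool through its `--tmpdir-prefix` option. The sketch below only builds the argument list; the function name `cwltool_args` and its parameters are hypothetical, and the pipeline's real subprocess and native cwltool code paths may handle this differently.

```python
import os

def cwltool_args(workflow, config, tmp_directory=None):
    """Build a cwltool command line, optionally redirecting temp files.

    tmp_directory stands in for a value read from the Paths Sheet
    (an assumption for illustration).
    """
    args = ["cwltool"]
    if tmp_directory:
        # --tmpdir-prefix is a path *prefix*, so a trailing separator is
        # added to make temp directories land inside tmp_directory
        prefix = os.path.join(tmp_directory, "")
        args += ["--tmpdir-prefix", prefix]
    args += [workflow, config]
    return args

default_cmd = cwltool_args("workflow.cwl", "config.yml")
custom_cmd = cwltool_args("workflow.cwl", "config.yml", tmp_directory="scratch_tmp")
```

The resulting list can be passed to `subprocess.run` as-is. Leaving `tmp_directory` unset keeps cwltool's default behavior, which follows the usual temporary-directory environment variables.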
Closes #186