tiny-count: expanded support for non-collapsed and 3rd-party-collapsed SAM files #217

Merged
taimontgomery merged 17 commits into master from issue-186
Aug 15, 2022
@AlexTate AlexTate commented Jul 31, 2022

Counter can now handle non-collapsed SAM files as well as SAM files produced from fastx_collapse outputs. Alignments whose QNAME fields do not conform to the tiny-collapse or fastx_collapse format are assumed to represent an original read count of 1.
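
For readers unfamiliar with the fallback order, here is a minimal sketch of the idea, not the code in this PR. The function name and both regular expressions are assumptions for illustration: the first pattern follows the "#_count=#" signature mentioned in the commit messages, and the second assumes the "rank-count" naming that fastx-style collapsers write.

```python
import re

# Assumed QNAME layouts (illustrative only, not taken from the PR's code):
#   tiny-collapse style:  "<index>_count=<count>"  e.g. "0_count=5"
#   fastx style:          "<rank>-<count>"         e.g. "1-46724"
TINY_COLLAPSE = re.compile(r"^\d+_count=(\d+)$")
FASTX_COLLAPSE = re.compile(r"^\d+-(\d+)$")


def read_count_from_qname(qname: str) -> int:
    """Return the original read count encoded in a QNAME, or 1 if none is found."""
    # Try each known collapser format in order of preference.
    for pattern in (TINY_COLLAPSE, FASTX_COLLAPSE):
        match = pattern.match(qname)
        if match:
            return int(match.group(1))
    # Unrecognized QNAME: treat the alignment as a single, non-collapsed read.
    return 1
```

Under these assumptions, a QNAME such as "0_count=5" yields 5, while an ordinary sequencer read name falls through to the default of 1.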

Preliminary support for BioSeqZip-collapsed inputs has also been added, but this functionality cannot currently be activated, either automatically or manually. The delimiter BioSeqZip uses when appending counts to FASTQ/FASTA headers is not distinctive enough to serve as a signature, so it will need to be set manually by the user rather than detected automatically.

I've also included a new parameter for tmp_directory in the Paths Sheet. During testing I ran into issues with non-collapsed bowtie outputs consuming all available space on the primary volume, so it made sense to me (and likely others) to be able to specify a temp/intermediary directory other than that specified via $TMP.
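
As a rough illustration of how a tmp_directory setting could be passed through to cwltool (a sketch only; the helper name and its arguments are hypothetical and do not reflect how the pipeline actually assembles its command):

```python
import os


def build_cwltool_args(workflow, job, tmp_directory=None):
    """Assemble a cwltool command line, optionally redirecting temp files."""
    args = ["cwltool"]
    if tmp_directory:
        # --tmpdir-prefix is a path *prefix*, so a trailing separator is added
        # to make temporary directories land inside tmp_directory.
        prefix = os.path.join(tmp_directory, "")
        args += ["--tmpdir-prefix", prefix]
    args += [workflow, job]
    return args
```

When tmp_directory is left unset, no extra argument is added and cwltool falls back to its default temporary location.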

Closes #186

AlexTate added 10 commits April 7, 2022 13:35
…t produced by the pipeline. This commit simply changes how sequence counts are determined from each alignment's QNAME field. If this field does not have Collapser's signature "#_count=#" format, it next attempts to parse it as a fastx_collapse string. If that fails the alignment defaults to a count of 1.

More broadly speaking, if Counter fails to parse the QNAME field, a default count of 1 is assumed.
… in addition to those made by tiny-collapse. If a SAM file contains fastx-style QNAMEs, then the SAM file's header must also report SO:queryname so that multiple alignments can be properly bundled.
…been addressed. SAM file requirements have been relaxed to once again allow non-collapsed SAM files. When these files are encountered, SAM_reader cannot produce decollapsed outputs without parsing and counting the entire file, so a warning is produced and the decollapse routine is skipped
…cess and run_cwltool_native. This was prompted by the addition of another configurable cwltool argument: --tmpdir-prefix. This will allow users to specify the location for temporary/intermediate files if the default location is on a space-limited volume
…ing the bundled read count determination out of statistics.py and into the sam reader. I'm still a little on the fence about it. While this makes it much cleaner to support a variety of collapser utilities (used pre-alignment), it also doesn't make nearly as much sense in the new location. The change slightly increases the runtime, and while it isn't by a significant amount, the %time reduced in statistics.py is far less than the %time increased by having it in SAM_reader. May continue testing. For now, I'm going to reflect on whether this is worth fretting over any further.
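
The SO:queryname requirement mentioned in the commit messages above could be checked along these lines. This is a minimal sketch that assumes the header lines have already been read as text; the function names are hypothetical and are not those used in SAM_reader.

```python
def header_sort_order(header_lines):
    """Return the SO value from the @HD header line, or None if it is absent."""
    for line in header_lines:
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs after the record type.
            for field in line.rstrip("\n").split("\t")[1:]:
                tag, _, value = field.partition(":")
                if tag == "SO":
                    return value
    return None


def is_queryname_sorted(header_lines):
    """True only if the header explicitly reports SO:queryname."""
    return header_sort_order(header_lines) == "queryname"
```

A header without an @HD line, or with any other SO value, is treated as not queryname-sorted, which matches the conservative reading of the requirement.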
@AlexTate AlexTate requested a review from taimontgomery July 31, 2022 02:32
@AlexTate AlexTate marked this pull request as ready for review July 31, 2022 19:38
@AlexTate
Member Author

This PR looks much messier than it actually is because it is derived from issue-187. Please merge PR #214 first.

@AlexTate AlexTate marked this pull request as draft August 1, 2022 19:15
@AlexTate AlexTate changed the title Counter: expanded support for non-collapsed and 3rd-party-collapsed SAM files tiny-count: expanded support for non-collapsed and 3rd-party-collapsed SAM files Aug 4, 2022
@AlexTate AlexTate changed the base branch from master to classify-counter August 4, 2022 17:01
@AlexTate AlexTate changed the base branch from classify-counter to master August 4, 2022 17:01
@AlexTate AlexTate marked this pull request as ready for review August 4, 2022 18:16
@taimontgomery
Collaborator

I tested tiny-count on a non-collapsed SAM file (bowtie) for the sample data, and the results matched the pipeline output. Rather than -c CONFIGFILE, maybe use -f FEATURES to make it clear what this refers to. We also need to update the documentation to indicate that tiny-count is compatible with non-collapsed SAM files as well as collapsed SAM files from some external sources, but that runtime and resource usage (memory and storage) are greatly increased for non-collapsed read data, and that we strongly recommend collapsing reads before alignment.

…AM inputs are now accepted by tiny-count.

Requested changes have been made to tiny-count's command line arguments. The helpstring has also been updated to indicate that the --decollapse option will be ignored for non-collapsed inputs.
…e description now addresses the effect these files will have on sequence-related stats
@taimontgomery taimontgomery merged commit 384a30a into master Aug 15, 2022
Development

Successfully merging this pull request may close these issues.

Counter: add basic support for SAM files containing non-collapsed or fastx collapsed reads