Merged
18 commits
113ae1c
Updating the SamplesSheet class so that it can be used in tiny-count …
AlexTate Apr 23, 2023
3c3a458
Updating load_samples() to use the SamplesSheet class. Cleaner, more …
AlexTate Apr 23, 2023
6662696
Changing usages of is_pipeline to in_pipeline to be consistent with t…
AlexTate Apr 23, 2023
fdf2c21
Renaming SAM_reader to AlignmentReader and adding better checks for p…
AlexTate Apr 24, 2023
7f805be
Renaming SamSqValidator to AlignmentSqValidator, and uses of "sam" to…
AlexTate Apr 24, 2023
7a7a374
Updating infer_strandedness() (even though it isn't currently in use)
AlexTate Apr 24, 2023
59f8bd0
Restructuring testdata folder for tiny-count
AlexTate Apr 29, 2023
78a7f87
Correcting load_config to use the CSVReader class' row_num attribute …
AlexTate Apr 29, 2023
51d2657
Corrections in SamplesSheet class. Validation methods already report …
AlexTate Apr 29, 2023
9350f13
Bringing some outdated and inconsistent section comments up to date i…
AlexTate Apr 29, 2023
cb9aeb6
Renaming the FASTQ/SAM Files column in Samples Sheet to Input Files
AlexTate May 1, 2023
4764dd3
More mass renaming of SAM -> alignment
AlexTate May 1, 2023
1c112f8
Updating unit tests in accordance with changes to both the SamplesShe…
AlexTate May 1, 2023
deb2425
Minor docstring update and lifting state for compatible alignment too…
AlexTate May 1, 2023
c309ffa
Adding relevant updates to tiny-count documentation and adding test f…
AlexTate May 1, 2023
1c1d062
Bugfix for decollapsed outputs: proper linebreaks between buffered wr…
AlexTate May 1, 2023
7f22621
Upgrading GFFValidator.alignment_chroms_mismatch_heuristic() to use P…
AlexTate May 1, 2023
eb0287e
Minor streamlining of info about BAM file requirements
AlexTate May 1, 2023
9 changes: 6 additions & 3 deletions START_HERE/paths.yml
Original file line number Diff line number Diff line change
@@ -1,14 +1,17 @@
############################## MAIN INPUT FILES FOR ANALYSIS ##############################
#
# Edit this section to provide the path to your Samples and Features sheets. Relative and
# absolute paths are both allowed. All relative paths are relative to THIS config file.
# Relative and absolute paths are both allowed.
# All relative paths are evaluated relative to THIS config file.
#
# Directions:
# 1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv]
# 2. Fill out the Features Sheet with selection rules [features.csv]
# 3. Set samples_csv and features_csv (below) to point to these files
# 3. Set samples_csv and features_csv to point to these files
# 4. Add annotation files and per-file alias preferences to gff_files (optional)
#
# If using the tinyRNA workflow, additionally set ebwt and/or reference_genome_files
# in the BOWTIE-BUILD section.
#
######-------------------------------------------------------------------------------######

##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --##
5 changes: 2 additions & 3 deletions START_HERE/run_config.yml
@@ -41,9 +41,8 @@ run_native: false

######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------######
#
# If you do not already have bowtie indexes, they can be built for you by setting
# run_bowtie_build (above) to true and adding your reference genome file(s) to your
# paths_config file.
# If you do not already have bowtie indexes, they can be built for you
# (see the BOWTIE-BUILD section in the Paths File)
#
# We have specified default parameters for small RNA data based on our own "best practices".
# You can change the parameters here.
2 changes: 1 addition & 1 deletion START_HERE/samples.csv
@@ -1,4 +1,4 @@
FASTQ/SAM Files,Sample/Group Name,Replicate number,Control,Normalization
Input Files,Sample/Group Name,Replicate number,Control,Normalization
./fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
./fastq_files/cond1_rep2.fastq.gz,condition1,2,,
./fastq_files/cond1_rep3.fastq.gz,condition1,3,,
4 changes: 2 additions & 2 deletions START_HERE/tiny-count_TUTORIAL.md
@@ -11,7 +11,7 @@ Alternatively, if you have already installed tinyRNA, you can use the `tiny-coun

## Your Data Files
Gather the following files for the analysis:
1. **SAM files** containing small RNA reads aligned to a reference genome, one file per sample
1. **SAM or BAM files** containing small RNA reads aligned to a reference genome, one file per sample
2. **GFF3 or GFF2/GTF file(s)** containing annotations for features that you want to assign reads to

## Configuration Files
@@ -24,7 +24,7 @@ tiny-count --get-templates
Next, fill out the configuration files that were copied:

### 1. The Samples Sheet (samples.csv)
Edit this file to add the paths to your SAM files, and to define the group name, replicate number, etc. for each sample.
Edit this file to add the paths to your SAM or BAM files, and to define the group name, replicate number, etc. for each sample.

### 2. The Paths File (paths.yml)
Edit this file to add the paths to your GFF annotation(s) under the `gff_files` key. You can leave the `alias` key as-is for now. All other keys in this file are used in the tinyRNA workflow.
2 changes: 1 addition & 1 deletion doc/Configuration.md
@@ -119,7 +119,7 @@ The final output directory name has three components:
The `run_directory` suffix in the Paths File supports subdirectories; if provided, the final output directory will be named as indicated above, but the subdirectory structure specified in `run_directory` will be retained.

## Samples Sheet Details
| _Column:_ | FASTQ/SAM Files | Sample/Group Name | Replicate Number | Control | Normalization |
| _Column:_ | Input Files | Sample/Group Name | Replicate Number | Control | Normalization |
|-----------:|---------------------|-------------------|------------------|---------|---------------|
| _Example:_ | cond1_rep1.fastq.gz | condition1 | 1 | True | RPM |

4 changes: 2 additions & 2 deletions doc/Parameters.md
@@ -101,7 +101,7 @@ A custom Cython implementation of HTSeq's StepVector is used for finding feature
### Is Pipeline
| Run Config Key | Commandline Argument |
|----------------|----------------------|
| | `--is-pipeline` |
| | `--in-pipeline` |

This commandline argument tells tiny-count that it is running as a workflow step rather than a standalone/manual run. Under these conditions tiny-count will look for all input files in the current working directory regardless of the paths defined in the Samples Sheet and Features Sheet.
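
The lookup behavior described above can be pictured with a short sketch. The helper name `resolve_input` and its parameters are hypothetical and are not taken from tiny-count's source; this only illustrates the idea that, inside a workflow step, the directory portion of a listed path is ignored.

```python
from pathlib import Path


def resolve_input(listed_path: str, in_pipeline: bool, cwd: str = ".") -> Path:
    """Hypothetical helper: decide where to look for an input file."""
    if in_pipeline:
        # Workflow step: keep only the file name and look for it
        # in the current working directory.
        return Path(cwd) / Path(listed_path).name
    # Standalone run: honor the path as it was listed.
    # (Resolution of relative paths against the config file is omitted here.)
    return Path(listed_path)
```

With these assumptions, `resolve_input("./alignments/cond1_rep1.sam", in_pipeline=True, cwd="/work")` points at `/work/cond1_rep1.sam`, while the same call with `in_pipeline=False` returns the listed path unchanged.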

@@ -152,7 +152,7 @@ Optional arguments:
-sv {Cython,HTSeq}, --stepvector {Cython,HTSeq}
Select which StepVector implementation is used to find
features overlapping an interval. (default: Cython)
-p, --is-pipeline Indicates that tiny-count was invoked as part of a
-p, --in-pipeline Indicates that tiny-count was invoked as part of a
pipeline run and that input files should be sourced as
such. (default: False)
-d, --report-diags Produce diagnostic information about
17 changes: 14 additions & 3 deletions doc/tiny-count.md
@@ -7,7 +7,16 @@ For an explanation of tiny-count's parameters in the Run Config and by commandli
tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run to save time. See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequisites.

## Running as a Standalone Tool
If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM files rather than FASTQ files in the `FASTQ/SAM Files` column. SAM files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts.
Skip to [Feature Selection](#feature-selection) if you are using the tinyRNA workflow.

If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM or BAM alignment files rather than FASTQ files in the `Input Files` column. Alignment files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts.

#### Input File Requirements
The SAM/BAM files provided during standalone runs _must_ be ordered so that multi-mapping read alignments are listed adjacent to one another. This adjacency convention is required for proper normalization by genomic hits. For this reason, files with ambiguous order will be rejected unless they were produced by an alignment tool that we recognize for following the adjacency convention. At this time, this includes Bowtie, Bowtie2, and STAR (an admittedly incomplete list).
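
If you are unsure whether a SAM file follows the adjacency convention, a brute-force check is easy to write. The sketch below is independent of whatever validation tiny-count performs internally; the function name `multimappers_are_adjacent` is hypothetical, and it assumes plain-text SAM lines whose first tab-separated field is the QNAME.

```python
def multimappers_are_adjacent(sam_lines):
    """Return True if each QNAME's alignments form one contiguous run."""
    seen = set()      # QNAMEs whose run has already started
    previous = None   # QNAME of the last alignment line read
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        qname = line.split("\t", 1)[0]
        if qname != previous:
            if qname in seen:
                # This QNAME appeared earlier, then was interrupted
                # by a different read: the convention is violated.
                return False
            seen.add(qname)
            previous = qname
    return True
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `multimappers_are_adjacent(open("sample.sam"))`. Memory use grows with the number of distinct QNAMEs.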

#### BAM File Tips
- Use the `--no-PG` option with `samtools view` when converting alignments
- Pysam will issue two warnings about missing index files; they can be ignored

#### Using Non-collapsed Sequence Alignments
While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still carry a sequence count of 1).
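
To make the difference between collapsed and non-collapsed inputs concrete, here is a minimal sketch of how a read count might be recovered from a QNAME. The `_count=N` suffix is inferred from the test records in this changeset (for example `0_count=5`); the function name and regular expression are hypothetical, and header formats used by other collapsers are not handled.

```python
import re

# Assumed QNAME suffix, as seen in the test data: "<id>_count=<N>"
COUNT_PATTERN = re.compile(r"_count=(\d+)$")


def read_count_from_qname(qname: str) -> int:
    """Return the collapsed read count, or 1 if the QNAME carries none."""
    match = COUNT_PATTERN.search(qname)
    return int(match.group(1)) if match else 1
```

Under this assumption a collapsed record such as `0_count=5` contributes five reads, while a non-collapsed QNAME falls back to a count of 1.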
@@ -139,8 +148,10 @@ Examples:

## Count Normalization
Small RNA reads passing selection will receive a normalized count increment. By default, read counts are normalized twice before being assigned to a feature. Both normalization steps can be disabled in `run_config.yml` if desired. Counts for each small RNA sequence are divided:
1. By the number of loci it aligns to in the genome.
2. By the number of _selected_ features for each of its alignments.
1. By the number of loci it aligns to in the genome (genomic hits).
2. By the number of _selected_ features for each of its alignments (feature hits).

>**Important**: For proper normalization by genomic hits, input files must be ordered such that multi-mapping read alignments are listed adjacent to one another.
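
The two division steps can be worked through with a small sketch. This is an illustration of the arithmetic only, not tiny-count's implementation; the function name and data layout are hypothetical, and it assumes that loci with no selected features still count toward genomic hits.

```python
from collections import defaultdict


def normalized_increments(read_count, alignments):
    """Split a sequence's count across features.

    alignments: one list of selected feature IDs per genomic locus.
    """
    genomic_hits = len(alignments)  # assumption: every locus counts
    totals = defaultdict(float)
    for selected_features in alignments:
        if not selected_features:
            continue  # nothing selected at this locus
        # Step 1: divide by the number of loci (genomic hits)
        per_alignment = read_count / genomic_hits
        # Step 2: divide by the selected features at this locus (feature hits)
        per_feature = per_alignment / len(selected_features)
        for feature in selected_features:
            totals[feature] += per_feature
    return dict(totals)
```

For a sequence with 12 reads aligning to three loci, where the first locus selects feature `A`, the second selects `A` and `B`, and the third selects nothing, each alignment carries 12 / 3 = 4 counts. Feature `A` receives 4 + 2 = 6 and feature `B` receives 2; the remaining 4 counts are not assigned to any feature.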

## The Details
You may encounter the following cases when you have more than one unique GFF file listed in your Paths File:
9 changes: 6 additions & 3 deletions tests/testdata/config_files/paths.yml
@@ -1,14 +1,17 @@
############################## MAIN INPUT FILES FOR ANALYSIS ##############################
#
# Edit this section to provide the path to your Samples and Features sheets. Relative and
# absolute paths are both allowed. All relative paths are relative to THIS config file.
# Relative and absolute paths are both allowed.
# All relative paths are evaluated relative to THIS config file.
#
# Directions:
# 1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv]
# 2. Fill out the Features Sheet with selection rules [features.csv]
# 3. Set samples_csv and features_csv (below) to point to these files
# 3. Set samples_csv and features_csv to point to these files
# 4. Add annotation files and per-file alias preferences to gff_files (optional)
#
# If using the tinyRNA workflow, additionally set ebwt and/or reference_genome_files
# in the BOWTIE-BUILD section.
#
######-------------------------------------------------------------------------------######

##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --##
5 changes: 2 additions & 3 deletions tests/testdata/config_files/run_config_template.yml
@@ -41,9 +41,8 @@ run_native: false

######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------######
#
# If you do not already have bowtie indexes, they can be built for you by setting
# run_bowtie_build (above) to true and adding your reference genome file(s) to your
# paths_config file.
# If you do not already have bowtie indexes, they can be built for you
# (see the BOWTIE-BUILD section in the Paths File)
#
# We have specified default parameters for small RNA data based on our own "best practices".
# You can change the parameters here.
2 changes: 1 addition & 1 deletion tests/testdata/config_files/samples.csv
@@ -1,4 +1,4 @@
FASTQ/SAM Files,Sample/Group Name,Replicate Number,Control,Normalization
Input Files,Sample/Group Name,Replicate Number,Control,Normalization
../../../START_HERE/fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
../../../START_HERE/fastq_files/cond1_rep2.fastq.gz,condition1,2,,
../../../START_HERE/fastq_files/cond1_rep3.fastq.gz,condition1,3,,
Binary file added tests/testdata/counter/bam/Lib304_test.bam
Binary file not shown.
Binary file added tests/testdata/counter/bam/single.bam
Binary file not shown.
@@ -1,2 +1,4 @@
@HD SO:unsorted
@SQ SN:I LN:21
@PG ID:bowtie
NON_COLLAPSED_QNAME 16 I 15064570 255 21M * 0 0 CAAGACAGAGCTTCACCGTTC IIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:21 NM:i:0 XM:i:2
@@ -1,2 +1,4 @@
@HD SO:unsorted
@SQ SN:I LN:21
@PG ID:bowtie
0_count=5 16 I 15064570 255 21M * 0 0 CAAGACAGAGCTTCACCGTTC IIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:21 NM:i:0 XM:i:2
2 changes: 1 addition & 1 deletion tests/unit_test_helpers.py
@@ -131,7 +131,7 @@ def get_dir_checksum_tree(root_path: str) -> dict:
    return dir_tree


def make_parsed_sam_record(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand=True, NM=0):
def make_parsed_alignment(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand=True, NM=0):
    return {
        "Name": Name,
        "Length": len(Seq),