diff --git a/START_HERE/paths.yml b/START_HERE/paths.yml index 0246ca0e..ca45c9f6 100644 --- a/START_HERE/paths.yml +++ b/START_HERE/paths.yml @@ -1,14 +1,17 @@ ############################## MAIN INPUT FILES FOR ANALYSIS ############################## # -# Edit this section to provide the path to your Samples and Features sheets. Relative and -# absolute paths are both allowed. All relative paths are relative to THIS config file. +# Relative and absolute paths are both allowed. +# All relative paths are evaluated relative to THIS config file. # # Directions: # 1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv] # 2. Fill out the Features Sheet with selection rules [features.csv] -# 3. Set samples_csv and features_csv (below) to point to these files +# 3. Set samples_csv and features_csv to point to these files # 4. Add annotation files and per-file alias preferences to gff_files (optional) # +# If using the tinyRNA workflow, additionally set ebwt and/or reference_genome_files +# in the BOWTIE-BUILD section. +# ######-------------------------------------------------------------------------------###### ##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --## diff --git a/START_HERE/run_config.yml b/START_HERE/run_config.yml index 8d7d5bc5..260b146a 100644 --- a/START_HERE/run_config.yml +++ b/START_HERE/run_config.yml @@ -41,9 +41,8 @@ run_native: false ######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------###### # -# If you do not already have bowtie indexes, they can be built for you by setting -# run_bowtie_build (above) to true and adding your reference genome file(s) to your -# paths_config file. +# If you do not already have bowtie indexes, they can be built for you +# (see the BOWTIE-BUILD section in the Paths File) # # We have specified default parameters for small RNA data based on our own "best practices". # You can change the parameters here. 
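The Paths File comments above state that relative paths are evaluated relative to the config file itself, not the working directory. A minimal sketch of that resolution rule; `resolve_config_path` is a hypothetical helper for illustration, not tinyRNA's actual code:

```python
import os

def resolve_config_path(config_file: str, entry: str) -> str:
    """Resolve a path entry from a config file: absolute paths pass
    through unchanged; relative paths are taken relative to the
    directory containing the config file."""
    if os.path.isabs(entry):
        return entry
    return os.path.normpath(os.path.join(os.path.dirname(config_file), entry))

# A samples_csv entry of "samples.csv" next to the Paths File:
print(resolve_config_path("/project/START_HERE/paths.yml", "samples.csv"))
# /project/START_HERE/samples.csv
```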
diff --git a/START_HERE/samples.csv b/START_HERE/samples.csv index 6455c9ae..47caddd8 100755 --- a/START_HERE/samples.csv +++ b/START_HERE/samples.csv @@ -1,4 +1,4 @@ -FASTQ/SAM Files,Sample/Group Name,Replicate number,Control,Normalization +Input Files,Sample/Group Name,Replicate number,Control,Normalization ./fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE, ./fastq_files/cond1_rep2.fastq.gz,condition1,2,, ./fastq_files/cond1_rep3.fastq.gz,condition1,3,, diff --git a/START_HERE/tiny-count_TUTORIAL.md b/START_HERE/tiny-count_TUTORIAL.md index 1a8b6072..1fae5f30 100644 --- a/START_HERE/tiny-count_TUTORIAL.md +++ b/START_HERE/tiny-count_TUTORIAL.md @@ -11,7 +11,7 @@ Alternatively, if you have already installed tinyRNA, you can use the `tiny-coun ## Your Data Files Gather the following files for the analysis: -1. **SAM files** containing small RNA reads aligned to a reference genome, one file per sample +1. **SAM or BAM files** containing small RNA reads aligned to a reference genome, one file per sample 2. **GFF3 or GFF2/GTF file(s)** containing annotations for features that you want to assign reads to ## Configuration Files @@ -24,7 +24,7 @@ tiny-count --get-templates Next, fill out the configuration files that were copied: ### 1. The Samples Sheet (samples.csv) -Edit this file to add the paths to your SAM files, and to define the group name, replicate number, etc. for each sample. +Edit this file to add the paths to your SAM or BAM files, and to define the group name, replicate number, etc. for each sample. ### 2. The Paths File (paths.yml) Edit this file to add the paths to your GFF annotation(s) under the `gff_files` key. You can leave the `alias` key as-is for now. All other keys in this file are used in the tinyRNA workflow. 
diff --git a/doc/Configuration.md b/doc/Configuration.md index 26faf28a..7d9c663e 100644 --- a/doc/Configuration.md +++ b/doc/Configuration.md @@ -119,7 +119,7 @@ The final output directory name has three components: The `run_directory` suffix in the Paths File supports subdirectories; if provided, the final output directory will be named as indicated above, but the subdirectory structure specified in `run_directory` will be retained. ## Samples Sheet Details -| _Column:_ | FASTQ/SAM Files | Sample/Group Name | Replicate Number | Control | Normalization | +| _Column:_ | Input Files | Sample/Group Name | Replicate Number | Control | Normalization | |-----------:|---------------------|-------------------|------------------|---------|---------------| | _Example:_ | cond1_rep1.fastq.gz | condition1 | 1 | True | RPM | diff --git a/doc/Parameters.md b/doc/Parameters.md index d085c17a..94ea2bef 100644 --- a/doc/Parameters.md +++ b/doc/Parameters.md @@ -101,7 +101,7 @@ A custom Cython implementation of HTSeq's StepVector is used for finding feature ### Is Pipeline | Run Config Key | Commandline Argument | |----------------|----------------------| -| | `--is-pipeline` | +| | `--in-pipeline` | This commandline argument tells tiny-count that it is running as a workflow step rather than a standalone/manual run. Under these conditions tiny-count will look for all input files in the current working directory regardless of the paths defined in the Samples Sheet and Features Sheet. @@ -152,7 +152,7 @@ Optional arguments: -sv {Cython,HTSeq}, --stepvector {Cython,HTSeq} Select which StepVector implementation is used to find features overlapping an interval. (default: Cython) - -p, --is-pipeline Indicates that tiny-count was invoked as part of a + -p, --in-pipeline Indicates that tiny-count was invoked as part of a pipeline run and that input files should be sourced as such. 
(default: False) -d, --report-diags Produce diagnostic information about diff --git a/doc/tiny-count.md b/doc/tiny-count.md index 79f80f0e..8f367aa9 100644 --- a/doc/tiny-count.md +++ b/doc/tiny-count.md @@ -7,7 +7,16 @@ For an explanation of tiny-count's parameters in the Run Config and by commandli tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run to save time. See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequisites. ## Running as a Standalone Tool -If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM files rather than FASTQ files in the `FASTQ/SAM Files` column. SAM files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts. +Skip to [Feature Selection](#feature-selection) if you are using the tinyRNA workflow. + +If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM or BAM alignment files rather than FASTQ files in the `Input Files` column. Alignment files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts. 
+ +#### Input File Requirements +The SAM/BAM files provided during standalone runs _must_ be ordered so that multi-mapping read alignments are listed adjacent to one another. This adjacency convention is required for proper normalization by genomic hits. For this reason, files with ambiguous order will be rejected unless they were produced by an alignment tool that we recognize as following the adjacency convention. At this time, this includes Bowtie, Bowtie2, and STAR (an admittedly incomplete list). + +#### BAM File Tips +- Use the `--no-PG` option with `samtools view` when converting alignments +- Pysam will issue two warnings about missing index files; they can be ignored #### Using Non-collapsed Sequence Alignments While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still carry a sequence count of 1). @@ -139,8 +148,10 @@ Examples: ## Count Normalization Small RNA reads passing selection will receive a normalized count increment. By default, read counts are normalized twice before being assigned to a feature. Both normalization steps can be disabled in `run_config.yml` if desired. Counts for each small RNA sequence are divided: -1. By the number of loci it aligns to in the genome. -2. By the number of _selected_ features for each of its alignments. +1. By the number of loci it aligns to in the genome (genomic hits). +2. By the number of _selected_ features for each of its alignments (feature hits). + +>**Important**: For proper normalization by genomic hits, input files must be ordered such that multi-mapping read alignments are listed adjacent to one another.
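The two normalization steps described above can be sketched as follows. This is only an illustrative reading of the arithmetic with hypothetical names (`normalize_counts`, the bundle structure), not tiny-count's implementation; each bundle pairs one sequence's read count with the set of selected features per alignment:

```python
from collections import defaultdict

def normalize_counts(bundles):
    """Distribute each sequence's read count across its selected features.

    `bundles` is a list of (alignments, read_count) pairs, where `alignments`
    holds, for each genomic locus the sequence aligned to, the set of
    features selected for that alignment.
    """
    feature_counts = defaultdict(float)
    for alignments, read_count in bundles:
        genomic_hits = len(alignments)                 # step 1: loci in the genome
        for selected_features in alignments:
            if not selected_features:
                continue
            feature_hits = len(selected_features)      # step 2: selected features
            increment = read_count / genomic_hits / feature_hits
            for feat in selected_features:
                feature_counts[feat] += increment
    return dict(feature_counts)

# A sequence with 5 reads aligning to 2 loci; the second alignment selects 2 features
bundles = [([{"featA"}, {"featB", "featC"}], 5)]
counts = normalize_counts(bundles)  # featA gets 2.5; featB and featC get 1.25 each
```

This also makes the adjacency requirement concrete: the bundle for a sequence can only be formed (and `genomic_hits` counted) if all of its multi-mapped alignments appear together in the input.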
## The Details You may encounter the following cases when you have more than one unique GFF file listed in your Paths File: diff --git a/tests/testdata/config_files/paths.yml b/tests/testdata/config_files/paths.yml index 46c14bf2..2dc39b9a 100644 --- a/tests/testdata/config_files/paths.yml +++ b/tests/testdata/config_files/paths.yml @@ -1,14 +1,17 @@ ############################## MAIN INPUT FILES FOR ANALYSIS ############################## # -# Edit this section to provide the path to your Samples and Features sheets. Relative and -# absolute paths are both allowed. All relative paths are relative to THIS config file. +# Relative and absolute paths are both allowed. +# All relative paths are evaluated relative to THIS config file. # # Directions: # 1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv] # 2. Fill out the Features Sheet with selection rules [features.csv] -# 3. Set samples_csv and features_csv (below) to point to these files +# 3. Set samples_csv and features_csv to point to these files # 4. Add annotation files and per-file alias preferences to gff_files (optional) # +# If using the tinyRNA workflow, additionally set ebwt and/or reference_genome_files +# in the BOWTIE-BUILD section. 
+# ######-------------------------------------------------------------------------------###### ##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --## diff --git a/tests/testdata/config_files/run_config_template.yml b/tests/testdata/config_files/run_config_template.yml index e8195a36..04fda2f8 100644 --- a/tests/testdata/config_files/run_config_template.yml +++ b/tests/testdata/config_files/run_config_template.yml @@ -41,9 +41,8 @@ run_native: false ######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------###### # -# If you do not already have bowtie indexes, they can be built for you by setting -# run_bowtie_build (above) to true and adding your reference genome file(s) to your -# paths_config file. +# If you do not already have bowtie indexes, they can be built for you +# (see the BOWTIE-BUILD section in the Paths File) # # We have specified default parameters for small RNA data based on our own "best practices". # You can change the parameters here. 
diff --git a/tests/testdata/config_files/samples.csv b/tests/testdata/config_files/samples.csv index d96b524e..f4ed8522 100755 --- a/tests/testdata/config_files/samples.csv +++ b/tests/testdata/config_files/samples.csv @@ -1,4 +1,4 @@ -FASTQ/SAM Files,Sample/Group Name,Replicate Number,Control,Normalization +Input Files,Sample/Group Name,Replicate Number,Control,Normalization ../../../START_HERE/fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE, ../../../START_HERE/fastq_files/cond1_rep2.fastq.gz,condition1,2,, ../../../START_HERE/fastq_files/cond1_rep3.fastq.gz,condition1,3,, diff --git a/tests/testdata/counter/bam/Lib304_test.bam b/tests/testdata/counter/bam/Lib304_test.bam new file mode 100644 index 00000000..48e2b97e Binary files /dev/null and b/tests/testdata/counter/bam/Lib304_test.bam differ diff --git a/tests/testdata/counter/bam/single.bam b/tests/testdata/counter/bam/single.bam new file mode 100644 index 00000000..6170ec70 Binary files /dev/null and b/tests/testdata/counter/bam/single.bam differ diff --git a/tests/testdata/counter/discontinuous.gff3 b/tests/testdata/counter/gff/discontinuous.gff3 similarity index 100% rename from tests/testdata/counter/discontinuous.gff3 rename to tests/testdata/counter/gff/discontinuous.gff3 diff --git a/tests/testdata/counter/identity_choice_test.gff3 b/tests/testdata/counter/gff/identity_choice_test.gff3 similarity index 100% rename from tests/testdata/counter/identity_choice_test.gff3 rename to tests/testdata/counter/gff/identity_choice_test.gff3 diff --git a/tests/testdata/counter/single.gff3 b/tests/testdata/counter/gff/single.gff3 similarity index 100% rename from tests/testdata/counter/single.gff3 rename to tests/testdata/counter/gff/single.gff3 diff --git a/tests/testdata/counter/single2.gff3 b/tests/testdata/counter/gff/single2.gff3 similarity index 100% rename from tests/testdata/counter/single2.gff3 rename to tests/testdata/counter/gff/single2.gff3 diff --git a/tests/testdata/counter/Lib304_test.sam 
b/tests/testdata/counter/sam/Lib304_test.sam similarity index 100% rename from tests/testdata/counter/Lib304_test.sam rename to tests/testdata/counter/sam/Lib304_test.sam diff --git a/tests/testdata/counter/identity_choice_test.sam b/tests/testdata/counter/sam/identity_choice_test.sam similarity index 100% rename from tests/testdata/counter/identity_choice_test.sam rename to tests/testdata/counter/sam/identity_choice_test.sam diff --git a/tests/testdata/counter/non-collapsed.sam b/tests/testdata/counter/sam/non-collapsed.sam similarity index 81% rename from tests/testdata/counter/non-collapsed.sam rename to tests/testdata/counter/sam/non-collapsed.sam index 94e2d107..48368cde 100644 --- a/tests/testdata/counter/non-collapsed.sam +++ b/tests/testdata/counter/sam/non-collapsed.sam @@ -1,2 +1,4 @@ +@HD SO:unsorted @SQ SN:I LN:21 +@PG ID:bowtie NON_COLLAPSED_QNAME 16 I 15064570 255 21M * 0 0 CAAGACAGAGCTTCACCGTTC IIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:21 NM:i:0 XM:i:2 diff --git a/tests/testdata/counter/single.sam b/tests/testdata/counter/sam/single.sam similarity index 80% rename from tests/testdata/counter/single.sam rename to tests/testdata/counter/sam/single.sam index 6d8dc69b..ccf1b730 100644 --- a/tests/testdata/counter/single.sam +++ b/tests/testdata/counter/sam/single.sam @@ -1,2 +1,4 @@ +@HD SO:unsorted @SQ SN:I LN:21 +@PG ID:bowtie 0_count=5 16 I 15064570 255 21M * 0 0 CAAGACAGAGCTTCACCGTTC IIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:21 NM:i:0 XM:i:2 \ No newline at end of file diff --git a/tests/unit_test_helpers.py b/tests/unit_test_helpers.py index 0b6aaf45..b750d10e 100644 --- a/tests/unit_test_helpers.py +++ b/tests/unit_test_helpers.py @@ -131,7 +131,7 @@ def get_dir_checksum_tree(root_path: str) -> dict: return dir_tree -def make_parsed_sam_record(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand=True, NM=0): +def make_parsed_alignment(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand=True, NM=0): return { 
"Name": Name, "Length": len(Seq), diff --git a/tests/unit_tests_configuration.py b/tests/unit_tests_configuration.py index b9da1a9a..a10c6f38 100644 --- a/tests/unit_tests_configuration.py +++ b/tests/unit_tests_configuration.py @@ -10,6 +10,14 @@ from tiny.rna.util import r_reserved_keywords +def patch_isfile(): + return patch('tiny.rna.configuration.os.path.isfile', return_value=True) + + +def patch_open(read_data): + return patch('tiny.rna.configuration.open', mock_open(read_data=read_data)) + + class BowtieIndexesTest(unittest.TestCase): @classmethod def setUpClass(self): @@ -150,37 +158,60 @@ def test_verify_bowtie_build_outputs(self): class SamplesSheetTest(unittest.TestCase): + @classmethod + def setUpClass(cls): + cls.contexts = ["Pipeline Start", "Pipeline Step", "Standalone Run"] + """Does SamplesSheet catch multi-assignment of control condition?""" def test_validate_control_group(self): - sheet = csv_factory("samples.csv", [ - {'File': '1.fastq', 'Group': 'G1', 'Replicate': '1', 'Control': True, 'Normalization': ''}, # Good - {'File': '2.fastq', 'Group': 'G1', 'Replicate': '2', 'Control': True, 'Normalization': ''}, # Good - {'File': '3.fastq', 'Group': 'G2', 'Replicate': '1', 'Control': True, 'Normalization': ''} # Bad - ]) # ^^^ - - exp_contains = r".*(multiple control conditions).*" - with self.assertRaisesRegex(AssertionError, exp_contains), \ - patch('tiny.rna.configuration.open', mock_open(read_data=sheet)), \ - patch('tiny.rna.configuration.os.path.isfile', return_value=True): - SamplesSheet('mock_filename') + rows = [ + {'File': '1', 'Group': 'G1', 'Replicate': '1', 'Control': True, 'Normalization': ''}, # Good + {'File': '2', 'Group': 'G1', 'Replicate': '2', 'Control': True, 'Normalization': ''}, # Good + {'File': '3', 'Group': 'G2', 'Replicate': '1', 'Control': True, 'Normalization': ''} # Bad + ] # ^^^^ ^^^^ + + start_sheet = csv_factory("samples.csv", [dict(r, File=f"{i}.fastq") for i, r in enumerate(rows)]) + step_sheet = 
csv_factory("samples.csv", [dict(r, File=f"{i}.sam") for i, r in enumerate(rows)]) + run_sheet = step_sheet + + with patch_isfile(): + # Control condition should not be evaluated in Pipeline Step or Standalone Run context + with patch_open(step_sheet): + SamplesSheet('mock_filename', context="Pipeline Step") + with patch_open(run_sheet): + SamplesSheet('mock_filename', context="Standalone Run") + + # Control condition should be evaluated in Pipeline Start context + exp_contains = r".*(multiple control conditions).*" + with self.assertRaisesRegex(AssertionError, exp_contains), patch_open(start_sheet): + SamplesSheet('mock_filename', context="Pipeline Start") """Does SamplesSheet catch duplicate entries for the same group and rep?""" + def test_validate_group_rep(self): - sheet = csv_factory("samples.csv", [ - {'File': '1.fastq', 'Group': 'G1', 'Replicate': '1', 'Control': True, 'Normalization': ''}, # Good - {'File': '2.fastq', 'Group': 'G1', 'Replicate': '2', 'Control': True, 'Normalization': ''}, # Good - {'File': '3.fastq', 'Group': 'G1', 'Replicate': '2', 'Control': True, 'Normalization': ''} # Bad - ]) # ^^^ ^^^ + rows = [ + {'File': '1', 'Group': 'G1', 'Replicate': '1', 'Control': True, 'Normalization': ''}, # Good + {'File': '2', 'Group': 'G1', 'Replicate': '2', 'Control': True, 'Normalization': ''}, # Good + {'File': '3', 'Group': 'G1', 'Replicate': '2', 'Control': True, 'Normalization': ''} # Bad + ] # ^^^^ ^^^ + + start_sheet = csv_factory("samples.csv", [dict(r, File=f"{i}.fastq") for i, r in enumerate(rows)]) + step_sheet = csv_factory("samples.csv", [dict(r, File=f"{i}.sam") for i, r in enumerate(rows)]) + run_sheet = step_sheet + contexts = list(zip(self.contexts, [start_sheet, step_sheet, run_sheet])) exp_contains = r".*(same group and replicate).*" - with self.assertRaisesRegex(AssertionError, exp_contains), \ - patch('tiny.rna.configuration.open', mock_open(read_data=sheet)), \ - patch('tiny.rna.configuration.os.path.isfile', return_value=True): - 
SamplesSheet('mock_filename') - """Does SamplesSheet catch fastq files that don't exist, have a bad file extension, or are listed more than once?""" + # Validation should take place in all contexts + for context, sheet in contexts: + with self.assertRaisesRegex(AssertionError, exp_contains), patch_open(sheet), patch_isfile(): + SamplesSheet('mock_filename', context=context) + + """Does SamplesSheet catch bad fastq entries in Pipeline Start context?""" + def test_validate_fastq_filepath(self): + context = "Pipeline Start" csv_rows = [ {'File': '1.fastq', 'Group': 'G1', 'Replicate': '1', 'Control': True, 'Normalization': ''}, # Good {'File': '1.fastq', 'Group': 'G1', 'Replicate': '2', 'Control': True, 'Normalization': ''}, # Bad @@ -189,28 +220,50 @@ def test_validate_fastq_filepath(self): sheet = csv_factory("samples.csv", csv_rows) # File doesn't exist - exp_contains = r".*(was not found).*" - with self.assertRaisesRegex(AssertionError, exp_contains), \ - patch('tiny.rna.configuration.open', mock_open(read_data=sheet)): - SamplesSheet('mock_filename') + exp_contains = r".*(fastq file on row 1 of mock_filename was not found).*" + with self.assertRaisesRegex(AssertionError, exp_contains), patch_open(sheet): + SamplesSheet('mock_filename', context=context) # Duplicate filename - exp_contains = r".*(listed more than once).*" - with self.assertRaisesRegex(AssertionError, exp_contains), \ - patch('tiny.rna.configuration.open', mock_open(read_data=sheet)), \ - patch('tiny.rna.configuration.os.path.isfile', return_value=True): - SamplesSheet('mock_filename') + exp_contains = r".*(listed more than once).*\(row 2\)" + with self.assertRaisesRegex(AssertionError, exp_contains), patch_open(sheet), patch_isfile(): + SamplesSheet('mock_filename', context=context) # Bad file extension - exp_contains = r".*(\.fastq\(\.gz\) extension).*" - csv_rows.pop(0) + sheet = csv_factory("samples.csv", csv_rows[1:]) # remove duplicate entry + exp_contains = r".*(\.fastq\(\.gz\) 
extension).*\(row 2\)" + with self.assertRaisesRegex(AssertionError, exp_contains), patch_open(sheet), patch_isfile(): + SamplesSheet('mock_filename', context=context) + + """Does SamplesSheet catch bad alignment file entries in Standalone Run context?""" + + def test_validate_alignments_filepath(self): + context = "Standalone Run" + csv_rows = [ + {'File': '1.sam', 'Group': 'G1', 'Replicate': '1', 'Control': True, 'Normalization': ''}, # Good + {'File': '1.sam', 'Group': 'G1', 'Replicate': '2', 'Control': True, 'Normalization': ''}, # Good + {'File': '1.bam', 'Group': 'G1', 'Replicate': '3', 'Control': True, 'Normalization': ''}, # Bad + {'File': '2.xyz', 'Group': 'G2', 'Replicate': '1', 'Control': True, 'Normalization': ''} # Bad + ] # ^^^^^^^ sheet = csv_factory("samples.csv", csv_rows) - with self.assertRaisesRegex(AssertionError, exp_contains), \ - patch('tiny.rna.configuration.open', mock_open(read_data=sheet)), \ - patch('tiny.rna.configuration.os.path.isfile', return_value=True): - SamplesSheet('mock_filename') - """Does validate_r_safe_sample_groups detect group names that will cause namespace collisions in R?""" + # File doesn't exist + exp_contains = r".*(file on row 1 of mock_filename was not found).*" + with self.assertRaisesRegex(AssertionError, exp_contains), patch_open(sheet): + SamplesSheet('mock_filename', context=context) + + # Duplicate filename + exp_contains = r".*(listed more than once).*\(row 2\)" + with self.assertRaisesRegex(AssertionError, exp_contains), patch_open(sheet), patch_isfile(): + SamplesSheet('mock_filename', context=context) + + # Bad file extension + sheet = csv_factory("samples.csv", csv_rows[1:]) # remove duplicate entry + exp_contains = r".*(\.sam or \.bam extension).*\(row 3\)" + with self.assertRaisesRegex(AssertionError, exp_contains), patch_open(sheet), patch_isfile(): + SamplesSheet('mock_filename', context=context) + + """Does validate_r_safe_sample_groups() detect group names that will cause namespace collisions in 
R?""" def test_validate_r_safe_sample_groups(self): non_alphanum_chars = [bad.join(('a', 'b')) for bad in "~!@#$%^&*()+-=`<>?/,:;\"'[]{}\| \t\n\r\f\v"] @@ -227,6 +280,45 @@ def test_validate_r_safe_sample_groups(self): with self.assertRaisesRegex(AssertionError, msg): SamplesSheet.validate_r_safe_sample_groups(dict.fromkeys(bad)) + """Does validate_normalization() do what it should?""" + + def test_validate_normalization(self): + good = [ + {'File': '1.fastq', 'Group': 'G1', 'Replicate': '1', 'Control': False, 'Normalization': 'rpm'}, # Good + {'File': '2.fastq', 'Group': 'G1', 'Replicate': '2', 'Control': False, 'Normalization': 'RPM'}, # Good + {'File': '3.fastq', 'Group': 'G1', 'Replicate': '3', 'Control': False, 'Normalization': ' RPM '}, # Good + {'File': '4.fasta', 'Group': 'G2', 'Replicate': '1', 'Control': False, 'Normalization': ' 1'}, # Good + {'File': '5.fasta', 'Group': 'G2', 'Replicate': '2', 'Control': False, 'Normalization': '1.1'}, # Good + {'File': '6.fasta', 'Group': 'G2', 'Replicate': '3', 'Control': False, 'Normalization': ''}, # Good + ] # ^^^^ + + start_sheet = csv_factory("samples.csv", [dict(r, File=f"{i}.fastq") for i, r in enumerate(good)]) + step_sheet = csv_factory("samples.csv", [dict(r, File=f"{i}.sam") for i, r in enumerate(good)]) + run_sheet = step_sheet + + # These SHOULD NOT throw an error + for context, sheet in zip(self.contexts, [start_sheet, step_sheet, run_sheet]): + with patch_open(sheet), patch_isfile(): + SamplesSheet("mock_filename", context=context) + + bad = [ + {'File': '1.fastq', 'Group': 'G1', 'Replicate': '1', 'Control': False, 'Normalization': 'abc'}, # Bad + {'File': '2.fastq', 'Group': 'G1', 'Replicate': '2', 'Control': False, 'Normalization': '123.rpm'}, # Bad + {'File': '3.fastq', 'Group': 'G1', 'Replicate': '3', 'Control': False, 'Normalization': '.'}, # Bad + {'File': '1.fastq', 'Group': 'G2', 'Replicate': '1', 'Control': False, 'Normalization': '_'}, # Bad + ] + + start_sheet = csv_factory("samples.csv", 
[dict(r, File=f"{i}.fastq") for i, r in enumerate(bad)]) + step_sheet = csv_factory("samples.csv", [dict(r, File=f"{i}.sam") for i, r in enumerate(bad)]) + run_sheet = step_sheet + + exp_contains = r".*(Invalid normalization value).*" + + # These SHOULD throw an error + for context, sheet in zip(self.contexts, [start_sheet, step_sheet, run_sheet]): + with self.assertRaisesRegex(AssertionError, exp_contains), patch_open(sheet), patch_isfile(): + SamplesSheet("mock_filename", context=context) + class PathsFileTest(unittest.TestCase): @@ -344,11 +436,11 @@ def test_backward_incompatibility(self): with self.assertRaisesRegex(AssertionError, r".*(check the release notes).*"): config.check_backward_compatibility() - """Does PathsFile map all paths to the working directory when in_pipeline=True?""" + """Does PathsFile map all paths to the working directory when context=True?""" def test_pipeline_mapping(self): # There are entries in the template Paths File that will cause FileNotFound - # when in_pipeline=True (they're not in the template directory). Since + # when context=True (they're not in the template directory). Since # file existence isn't relevant for this test, we patch that out. 
with patch('tiny.rna.configuration.os.path.isfile', return_value=True): config = make_paths_file(in_pipeline=True) diff --git a/tests/unit_tests_counter.py b/tests/unit_tests_counter.py index e32a0fa6..9702e2b0 100644 --- a/tests/unit_tests_counter.py +++ b/tests/unit_tests_counter.py @@ -7,7 +7,7 @@ import unit_test_helpers as helpers import tiny.rna.counter.counter as counter -from tiny.rna.util import from_here +from tiny.rna.configuration import ConfigBase resources = "./testdata/counter" @@ -16,12 +16,12 @@ class CounterTests(unittest.TestCase): @classmethod def setUpClass(self): - self.gff_file = f"{resources}/identity_choice_test.gff3" - self.short_gff_file = f"{resources}/single.gff3" + self.gff_file = f"{resources}/gff/identity_choice_test.gff3" + self.short_gff_file = f"{resources}/gff/single.gff3" self.short_gff = helpers.read(self.short_gff_file) - self.sam_file = f"{resources}/identity_choice_test.sam" - self.short_sam_file = f"{resources}/single.sam" + self.sam_file = f"{resources}/sam/identity_choice_test.sam" + self.short_sam_file = f"{resources}/sam/single.sam" self.short_sam = helpers.read(self.short_sam_file) self.strand = {'sense': tuple('+'), 'antisense': tuple('-'), 'both': ('+', '-')} @@ -96,13 +96,13 @@ def get_loaded_samples_row(self, row, exp_file): def test_load_samples_single_cmd(self): mock_samp_sheet_path = '/dev/null' inp_file = "test.fastq" - exp_file = from_here(mock_samp_sheet_path, "test_aligned_seqs.sam") + exp_file = ConfigBase.joinpath(mock_samp_sheet_path, "test_aligned_seqs.sam") row = dict(self.csv_samp_row_dict, **{'File': inp_file}) csv = self.csv("samples.csv", [row]) with patch('tiny.rna.configuration.open', mock_open(read_data=csv)): - inputs_step = counter.load_samples(mock_samp_sheet_path, is_pipeline=False) + inputs_step = counter.load_samples(mock_samp_sheet_path, in_pipeline=False) expected_result = self.get_loaded_samples_row(row, exp_file) self.assertEqual(inputs_step, expected_result) @@ -118,7 +118,7 @@ def 
test_load_samples_single_pipeline(self): csv = self.csv("samples.csv", [row]) with patch('tiny.rna.configuration.open', mock_open(read_data=csv)): - inputs_pipeline = counter.load_samples(mock_samp_sheet_path, is_pipeline=True) + inputs_pipeline = counter.load_samples(mock_samp_sheet_path, in_pipeline=True) expected_result = self.get_loaded_samples_row(row, exp_file) self.assertEqual(inputs_pipeline, expected_result) @@ -144,25 +144,11 @@ def test_load_samples_sam(self): with patch('tiny.rna.configuration.open', mock_open(read_data=csv)): dummy_file = '/dev/null' - inputs = counter.load_samples(dummy_file, is_pipeline=False) + inputs = counter.load_samples(dummy_file, in_pipeline=False) expected_result = self.get_loaded_samples_row(row, sam_filename) self.assertEqual(inputs, expected_result) - """Does load_samples throw ValueError if a non-absolute path to a SAM file is provided?""" - - def test_load_samples_nonabs_path(self): - bad = "./dne.sam" - row = dict(self.csv_samp_row_dict, **{'File': bad}) - csv = self.csv("samples.csv", [row]) - - expected_error = "The following file must be expressed as an absolute path:\n" + bad - - with patch('tiny.rna.configuration.open', mock_open(read_data=csv)): - with self.assertRaisesRegex(ValueError, expected_error): - dummy_file = '/dev/null' - counter.load_samples(dummy_file, False) - """Does load_samples throw ValueError if sample filename does not have a .fastq or .sam extension?""" def test_load_samples_bad_extension(self): @@ -187,7 +173,7 @@ def test_load_config_single_cmd(self): with patch('tiny.rna.configuration.open', mock_open(read_data=csv)): dummy_file = '/dev/null' - ruleset = counter.load_config(dummy_file, is_pipeline=False) + ruleset = counter.load_config(dummy_file, in_pipeline=False) expected_ruleset = self.parsed_feat_rule self.assertEqual(ruleset, expected_ruleset) @@ -201,7 +187,7 @@ def test_load_config_single_pipeline(self): with patch('tiny.rna.configuration.open', mock_open(read_data=csv)): dummy_file 
= '/dev/null' - ruleset = counter.load_config(dummy_file, is_pipeline=True) + ruleset = counter.load_config(dummy_file, in_pipeline=True) expected_ruleset = self.parsed_feat_rule self.assertEqual(ruleset, expected_ruleset) diff --git a/tests/unit_tests_features.py b/tests/unit_tests_features.py index 236be93e..9f2cd3e0 100644 --- a/tests/unit_tests_features.py +++ b/tests/unit_tests_features.py @@ -3,20 +3,21 @@ from unittest.mock import patch, call from tiny.rna.counter.features import * -from unit_test_helpers import read, make_parsed_sam_record, rules_template, strand_to_bool +from unit_test_helpers import read, make_parsed_alignment, rules_template, strand_to_bool resources = "./testdata/counter" + class FeaturesTests(unittest.TestCase): @classmethod def setUpClass(self): - self.gff_file = f"{resources}/identity_choice_test.gff3" - self.short_gff_file = f"{resources}/single.gff3" + self.gff_file = f"{resources}/gff/identity_choice_test.gff3" + self.short_gff_file = f"{resources}/gff/single.gff3" self.short_gff = read(self.short_gff_file) - self.sam_file = f"{resources}/identity_choice_test.sam" - self.short_sam_file = f"{resources}/single.sam" + self.sam_file = f"{resources}/sam/identity_choice_test.sam" + self.short_sam_file = f"{resources}/sam/single.sam" self.short_sam = read(self.short_sam_file) """Helper functions""" @@ -115,7 +116,7 @@ def test_assign_features_no_match(self): htsgas[iv_none] += "Should not match" # Create mock SAM alignment with non-overlapping interval - sam_aln = make_parsed_sam_record(**{'Start': 2, 'Chrom': chrom, 'Strand': strand_to_bool(strand)}) + sam_aln = make_parsed_alignment(**{'Start': 2, 'Chrom': chrom, 'Strand': strand_to_bool(strand)}) """ iv_none: 0 |--| 2 @@ -143,7 +144,7 @@ def test_assign_features_single_base_overlap(self): htsgas[iv_none] += "Non-overlapping feature" # Create mock SAM alignment which overlaps iv_olap by one base - sam_aln = make_parsed_sam_record(**{'Chrom': chrom, 'Strand': strand_to_bool(strand), 
'Start': 0, 'Seq': 'AT'}) + sam_aln = make_parsed_alignment(**{'Chrom': chrom, 'Strand': strand_to_bool(strand), 'Start': 0, 'Seq': 'AT'}) """ iv_none: 2 |-| 3 @@ -170,7 +171,7 @@ def test_count_reads_generic(self, mock): mock_bundle = ["mock_alignment"] mock_read_count = 1 - instance.sam_reader.bundle_multi_alignments.return_value = iter([(mock_bundle, mock_read_count)]) + instance.alignment_reader.bundle_multi_alignments.return_value = iter([(mock_bundle, mock_read_count)]) instance.assign_features.return_value = ({"mock_feat"}, 1) expected_calls_to_stats = [ @@ -195,11 +196,11 @@ def test_feature_selector_nested_interval(self): feat, fs = self.make_feature_for_interval_test(iv_rule, "Nested Overlap", chrom, strand, start, stop) aln_base = {'Seq': 'ATGC', 'Chrom': chrom, 'Strand': strand_to_bool(strand)} - aln_spill_lo = make_parsed_sam_record(**dict(aln_base, Start=start - 1, Name="spill")) - aln_spill_hi = make_parsed_sam_record(**dict(aln_base, Start=start + 2, Name="spill")) - aln_contained = make_parsed_sam_record(**dict(aln_base, Seq="N", Start=7, Name="contained")) - aln_contained_lo = make_parsed_sam_record(**dict(aln_base, Start=start, Name="contained")) - aln_contained_hi = make_parsed_sam_record(**dict(aln_base, Start=start + 1, Name="contained")) + aln_spill_lo = make_parsed_alignment(**dict(aln_base, Start=start - 1, Name="spill")) + aln_spill_hi = make_parsed_alignment(**dict(aln_base, Start=start + 2, Name="spill")) + aln_contained = make_parsed_alignment(**dict(aln_base, Seq="N", Start=7, Name="contained")) + aln_contained_lo = make_parsed_alignment(**dict(aln_base, Start=start, Name="contained")) + aln_contained_hi = make_parsed_alignment(**dict(aln_base, Start=start + 1, Name="contained")) """ aln: |ATGC| @@ -225,10 +226,10 @@ def test_feature_selector_partial_interval(self): feat, fs = self.make_feature_for_interval_test(iv_rule, "Partial Overlap", chrom, strand, start, stop) aln_base = {'Seq': 'ATGC', 'Chrom': chrom, 'Strand': 
strand_to_bool(strand)} - aln_spill_lo = make_parsed_sam_record(**dict(aln_base, Start=start - 1, Name="spill")) - aln_spill_hi = make_parsed_sam_record(**dict(aln_base, Start=start + 2, Name="spill")) - aln_contained_lo = make_parsed_sam_record(**dict(aln_base, Start=start, Name="contained")) - aln_contained_hi = make_parsed_sam_record(**dict(aln_base, Start=start + 1, Name="contained")) + aln_spill_lo = make_parsed_alignment(**dict(aln_base, Start=start - 1, Name="spill")) + aln_spill_hi = make_parsed_alignment(**dict(aln_base, Start=start + 2, Name="spill")) + aln_contained_lo = make_parsed_alignment(**dict(aln_base, Start=start, Name="contained")) + aln_contained_hi = make_parsed_alignment(**dict(aln_base, Start=start + 1, Name="contained")) """ aln: |ATGC| @@ -253,11 +254,11 @@ def test_feature_selector_exact_interval(self): aln_match = {'Seq': 'ATGC', 'Chrom': chrom, 'Strand': strand_to_bool(strand)} aln_short = {'Seq': 'NNN', 'Chrom': chrom, 'Strand': strand_to_bool(strand)} - aln_exact = make_parsed_sam_record(**dict(aln_match, Start=start, Name="exact")) - aln_short_lo = make_parsed_sam_record(**dict(aln_short, Start=start, Name="match lo")) - aln_short_hi = make_parsed_sam_record(**dict(aln_short, Start=start + 1, Name="match hi")) - aln_spill_lo = make_parsed_sam_record(**dict(aln_match, Start=start - 1, Name="spill lo")) - aln_spill_hi = make_parsed_sam_record(**dict(aln_match, Start=start + 1, Name="spill hi")) + aln_exact = make_parsed_alignment(**dict(aln_match, Start=start, Name="exact")) + aln_short_lo = make_parsed_alignment(**dict(aln_short, Start=start, Name="match lo")) + aln_short_hi = make_parsed_alignment(**dict(aln_short, Start=start + 1, Name="match hi")) + aln_spill_lo = make_parsed_alignment(**dict(aln_match, Start=start - 1, Name="spill lo")) + aln_spill_hi = make_parsed_alignment(**dict(aln_match, Start=start + 1, Name="spill hi")) """ aln_match: |ATGC| @@ -300,10 +301,10 @@ def test_feature_selector_5p_interval(self): feat_A, fs = 
self.make_feature_for_interval_test(iv_rule, "5' Anchored Overlap (+)", chrom, '+', start, end) aln_base = {'Start': start, 'Chrom': chrom, 'Strand': True} aln = { - 'aln_none': make_parsed_sam_record(**dict(aln_base, Start=start + 1, Seq="ATGC")), - 'aln_long': make_parsed_sam_record(**dict(aln_base, Seq="ATGCNN")), - 'aln_exact': make_parsed_sam_record(**dict(aln_base, Seq="ATGCN")), - 'aln_short': make_parsed_sam_record(**dict(aln_base, Seq="ATGC")), + 'aln_none': make_parsed_alignment(**dict(aln_base, Start=start + 1, Seq="ATGC")), + 'aln_long': make_parsed_alignment(**dict(aln_base, Seq="ATGCNN")), + 'aln_exact': make_parsed_alignment(**dict(aln_base, Seq="ATGCN")), + 'aln_short': make_parsed_alignment(**dict(aln_base, Seq="ATGC")), } self.assertEqual(fs.choose(feat_A, aln['aln_none']), {}) @@ -347,10 +348,10 @@ def test_feature_selector_3p_interval(self): feat_A, fs = self.make_feature_for_interval_test(iv_rule, "3' Anchored Overlap (+)", chrom, '+', start, end) aln_base = {'Start': start, 'Chrom': chrom, 'Strand': True} aln = { - 'aln_none': make_parsed_sam_record(**dict(aln_base, Seq="ATGC")), - 'aln_long': make_parsed_sam_record(**dict(aln_base, Start=start - 1, Seq="ATGCNN")), - 'aln_exact': make_parsed_sam_record(**dict(aln_base, Seq="ATGCN")), - 'aln_short': make_parsed_sam_record(**dict(aln_base, Start=start + 1, Seq="ATGC")), + 'aln_none': make_parsed_alignment(**dict(aln_base, Seq="ATGC")), + 'aln_long': make_parsed_alignment(**dict(aln_base, Start=start - 1, Seq="ATGCNN")), + 'aln_exact': make_parsed_alignment(**dict(aln_base, Seq="ATGCN")), + 'aln_short': make_parsed_alignment(**dict(aln_base, Start=start + 1, Seq="ATGC")), } self.assertEqual(fs.choose(feat_A, aln['aln_none']), {}) diff --git a/tests/unit_tests_hts_parsing.py b/tests/unit_tests_hts_parsing.py index 4af12415..6f06c066 100644 --- a/tests/unit_tests_hts_parsing.py +++ b/tests/unit_tests_hts_parsing.py @@ -22,12 +22,13 @@ unittest.main() -class SamReaderTests(unittest.TestCase): +class 
AlignmentReaderTests(unittest.TestCase): @classmethod def setUpClass(self): - self.sam_file = f"{resources}/identity_choice_test.sam" - self.short_sam_file = f"{resources}/single.sam" + self.sam_file = f"{resources}/sam/identity_choice_test.sam" + self.short_sam_file = f"{resources}/sam/single.sam" self.short_sam = helpers.read(self.short_sam_file) + self.short_bam_file = f"{resources}/bam/single.bam" @staticmethod def exhaust_iterator(it): @@ -35,10 +36,10 @@ def exhaust_iterator(it): # === TESTS === - """Did SAM_reader correctly skip header values and parse all pertinent info from a single record SAM file?""" + """Did AlignmentReader correctly skip header values and parse all pertinent info from a single record SAM file?""" - def test_sam_reader(self): - sam_bundle, read_count = next(SAM_reader().bundle_multi_alignments(self.short_sam_file)) + def test_AlignmentReader_single_sam(self): + sam_bundle, read_count = next(AlignmentReader().bundle_multi_alignments(self.short_sam_file)) sam_record = sam_bundle[0] self.assertEqual(sam_record['Chrom'], "I") @@ -50,7 +51,22 @@ def test_sam_reader(self): self.assertEqual(sam_record['Length'], 21) self.assertEqual(sam_record['nt5end'], 'G') - """Does our custom SAM parser produce the same pertinent info as HTSeq's BAM_reader? 
+ """Did AlignmentReader correctly skip header values and parse all pertinent info from a single record BAM file?""" + + def test_AlignmentReader_single_bam(self): + bam_bundle, read_count = next(AlignmentReader().bundle_multi_alignments(self.short_bam_file)) + bam_record = bam_bundle[0] + + self.assertEqual(bam_record['Chrom'], "I") + self.assertEqual(bam_record['Start'], 15064569) + self.assertEqual(bam_record['End'], 15064590) + self.assertEqual(bam_record['Strand'], False) + self.assertEqual(bam_record['Name'], "0_count=5") + self.assertEqual(bam_record['Seq'], "CAAGACAGAGCTTCACCGTTC") + self.assertEqual(bam_record['Length'], 21) + self.assertEqual(bam_record['nt5end'], 'G') + + """Does our AlignmentReader produce the same pertinent info from a SAM file as HTSeq's BAM_reader? A note on SAM files: reads are always stored 5' to 3', so antisense reads are actually recorded in reverse complement. HTSeq automatically performs this conversion, but we @@ -59,8 +75,8 @@ def test_sam_reader(self): """ def test_sam_parser_comparison(self): - file = f"{resources}/Lib304_test.sam" - ours = SAM_reader().bundle_multi_alignments(file) + file = f"{resources}/sam/Lib304_test.sam" + ours = AlignmentReader().bundle_multi_alignments(file) theirs = HTSeq.bundle_multiple_alignments(HTSeq.BAM_Reader(file)) for (our_bundle, _), their_bundle in zip(ours, theirs): @@ -77,23 +93,44 @@ def test_sam_parser_comparison(self): else: self.assertEqual(our['Seq'], their.read.seq.decode()) - """Does SAM_reader._get_decollapsed_filename() create an appropriate filename?""" + """Does our AlignmentReader produce the same pertinent info from a BAM file as HTSeq's BAM_reader?""" + + def test_bam_parser_comparison(self): + file = f"{resources}/bam/Lib304_test.bam" + ours = AlignmentReader().bundle_multi_alignments(file) + theirs = HTSeq.bundle_multiple_alignments(HTSeq.BAM_Reader(file)) - def test_SAM_reader_get_decollapsed_filename(self): - reader = SAM_reader() + for (our_bundle, _), their_bundle in 
zip(ours, theirs): + self.assertEqual(len(our_bundle), len(their_bundle)) + for our, their in zip(our_bundle, their_bundle): + self.assertEqual(our['Chrom'], their.iv.chrom) + self.assertEqual(our['Start'], their.iv.start) + self.assertEqual(our['End'], their.iv.end) + self.assertEqual(our['Name'], their.read.name) + self.assertEqual(our['nt5end'], chr(their.read.seq[0])) # See note above + self.assertEqual(our['Strand'], helpers.strand_to_bool(their.iv.strand)) + if our['Strand'] is False: # See note above + self.assertEqual(our['Seq'][::-1].translate(helpers.complement), their.read.seq.decode()) + else: + self.assertEqual(our['Seq'], their.read.seq.decode()) + + """Does AlignmentReader._get_decollapsed_filename() create an appropriate filename?""" + + def test_AlignmentReader_get_decollapsed_filename(self): + reader = AlignmentReader() reader.file = "~/path/to/input/sam_file.sam" sam_out = reader._get_decollapsed_filename() self.assertEqual(sam_out, "sam_file_decollapsed.sam") - """Does SAM_reader._new_bundle report the correct read count for different collapser types?""" + """Does AlignmentReader._new_bundle report the correct read count for different collapser types?""" - def test_SAM_reader_new_bundle(self): + def test_AlignmentReader_new_bundle(self): qnames = ["0_count=3", "seq0_x5", "non-collapsed"] counts = [3, 5, 1] - reader = SAM_reader() + reader = AlignmentReader() reader._header_dict = {'HD': {'SO': 'queryname'}} for qname, expected in zip(qnames, counts): @@ -101,37 +138,44 @@ def test_SAM_reader_new_bundle(self): _, read_count = reader._new_bundle({'Name': qname}) self.assertEqual(read_count, expected) - """Does SAM_reader._gather_metadata() correctly identify metadata and write the decollapsed file header?""" + """Does AlignmentReader._gather_metadata() correctly identify metadata and write the decollapsed file header?""" - def test_SAM_reader_gather_metadata(self): - reader = SAM_reader(decollapse=True) + def 
test_AlignmentReader_gather_metadata(self): + reader = AlignmentReader(decollapse=True) reader._decollapsed_filename = "mock_outfile_name.sam" - + reader._assign_library(self.short_sam_file) sam_in = pysam.AlignmentFile(self.short_sam_file) + with patch('builtins.open', mock_open()) as sam_out: reader._gather_metadata(sam_in) expected_writelines = [ call('mock_outfile_name.sam', 'w'), call().__enter__(), - call().write("@SQ\tSN:I\tLN:21\n"), + call().write("@HD\tSO:unsorted\n@SQ\tSN:I\tLN:21\n@PG\tID:bowtie\n"), call().__exit__(None, None, None) ] + expected_header_dict = { + 'HD': {'SO': 'unsorted'}, + 'SQ': [{'LN': 21, 'SN': 'I'}], + 'PG': [{'ID': 'bowtie'}] + } + sam_out.assert_has_calls(expected_writelines) self.assertEqual(reader.collapser_type, 'tiny-collapse') - self.assertDictEqual(reader._header_dict, {'SQ': [{'SN': "I", 'LN': 21}]}) + self.assertDictEqual(reader._header_dict, expected_header_dict) self.assertEqual(reader.references, ('I',)) self.assertTrue(reader.has_nm) - """Does SAM_reader._write_decollapsed_sam() write the correct number of duplicates to the decollapsed file?""" + """Does AlignmentReader._write_decollapsed_sam() write the correct number of duplicates to the decollapsed file?""" - def test_SAM_reader_write_decollapsed_sam(self): + def test_AlignmentReader_write_decollapsed_sam(self): header = pysam.AlignmentHeader() alignment = pysam.AlignedSegment(header) alignment.query_name = "0_count=5" - reader = SAM_reader(decollapse=True) + reader = AlignmentReader(decollapse=True) reader.collapser_type = "tiny-collapse" reader._collapser_token = "=" reader._decollapsed_reads = [alignment] @@ -150,12 +194,12 @@ def test_SAM_reader_write_decollapsed_sam(self): outfile.assert_has_calls(expected_writelines) self.assertTrue(len(reader._decollapsed_reads) == 0) - """Does SAM_reader._parse_alignments() save lines and write them to the decollapsed file when appropriate?""" + """Does AlignmentReader._parse_alignments() save lines and write them to the 
decollapsed file when appropriate?""" - def test_SAM_reader_parse_alignments_decollapse(self): - with patch.object(SAM_reader, "_write_decollapsed_sam") as write_fn: - # Set up SAM_reader class - reader = SAM_reader(decollapse=True) + def test_AlignmentReader_parse_alignments_decollapse(self): + with patch.object(AlignmentReader, "_write_decollapsed_sam") as write_fn: + # Set up AlignmentReader class + reader = AlignmentReader(decollapse=True) reader.collapser_type = "tiny-collapse" reader._decollapsed_reads = buffer = [0] * 99999 # At 100,001, expect buffer to be written reader.file = self.short_sam_file # File with single alignment @@ -178,13 +222,13 @@ def test_SAM_reader_parse_alignments_decollapse(self): """Are decollapsed outputs skipped when non-collapsed SAM files are supplied?""" - def test_SAM_reader_no_decollapse_non_collapsed_SAM_files(self): + def test_AlignmentReader_no_decollapse_non_collapsed_SAM_files(self): stdout_capture = io.StringIO() - with patch.object(SAM_reader, "_write_decollapsed_sam") as write_sam, \ - patch.object(SAM_reader, "_write_header_for_decollapsed_sam") as write_header: + with patch.object(AlignmentReader, "_write_decollapsed_sam") as write_sam, \ + patch.object(AlignmentReader, "_write_header_for_decollapsed_sam") as write_header: with contextlib.redirect_stderr(stdout_capture): - reader = SAM_reader(decollapse=True) - records = reader.bundle_multi_alignments(f"{resources}/non-collapsed.sam") + reader = AlignmentReader(decollapse=True) + records = reader.bundle_multi_alignments(f"{resources}/sam/non-collapsed.sam") self.exhaust_iterator(records) write_sam.assert_not_called() @@ -194,13 +238,88 @@ def test_SAM_reader_no_decollapse_non_collapsed_SAM_files(self): "Alignments do not appear to be derived from a supported collapser input. 
" "Decollapsed SAM files will therefore not be produced.\n") + """Is incompatible alignment file ordering correctly identified from @HD header values?""" + + def test_AlignmentReader_incompatible_HD_header(self): + # Valid SO values: ["unknown", "unsorted", "queryname", "coordinate"] + # Valid GO values: ["none", "query", "reference"] + + strictly_compatible = [ + {'HD': {'SO': "queryname"}}, + {'HD': {'GO': "query"}}, + ] + + # Should not throw error + for header in strictly_compatible: + reader = AlignmentReader() + reader._header_dict = header + reader._assign_library("mock_infile.sam") + reader._check_for_incompatible_order() + + strictly_incompatible = [ + ({'HD': {'SO': "coordinate"}}, "by coordinate"), + ({'HD': {'GO': "reference"}}, "by reference"), + ({'HD': {}}, "sorting/grouping couldn't be determined"), + ({}, "sorting/grouping couldn't be determined"), + ] + + # Should throw error + for header, message in strictly_incompatible: + reader = AlignmentReader() + reader._header_dict = header + reader._assign_library("mock_infile.sam") + with self.assertRaisesRegex(ValueError, message): + reader._check_for_incompatible_order() + + """Is incompatible alignment file ordering correctly identified from @PG header values?""" + + def test_AlignmentReader_incompatible_PG_header(self): + # SO = ["unknown", "unsorted", "queryname", "coordinate"] + # GO = ["none", "query", "reference"] + + self.assertEqual(AlignmentReader.compatible_unordered, ("bowtie", "bowtie2", "star")) + SO_unordered = [{'SO': "unknown"}, {'SO': "unsorted"}] + GO_unordered = [{'GO': "none"}] + compatible = [ + {'PG': [{'ID': tool}], 'HD': un_so, 'GO': un_go} + for tool in AlignmentReader.compatible_unordered + for un_so in SO_unordered + for un_go in GO_unordered + ] + + # Only the last reported PG ID matters + multiple_tools = [{'ID': "INCOMPATIBLE"}, {'ID': "bowtie"}] + compatible += [{'PG': multiple_tools, 'HD': {'SO': "unsorted"}}] + + # Should not throw error + for header in compatible: + 
reader = AlignmentReader() + reader._header_dict = header + reader._assign_library("mock_infile.sam") + reader._check_for_incompatible_order() + + incompatible = [ + {'PG': [{'ID': "INCOMPATIBLE"}], 'HD': un_so, 'GO': un_go} + for un_so in SO_unordered + for un_go in GO_unordered + ] + + # Should throw error + expected_error = "adjacency couldn't be determined" + for header in incompatible: + reader = AlignmentReader() + reader._header_dict = header + reader._assign_library("mock_infile.sam") + with self.assertRaisesRegex(ValueError, expected_error): + reader._check_for_incompatible_order() + class ReferenceFeaturesTests(unittest.TestCase): @classmethod def setUpClass(self): - self.gff_file = f"{resources}/identity_choice_test.gff3" - self.short_gff_file = f"{resources}/single.gff3" + self.gff_file = f"{resources}/gff/identity_choice_test.gff3" + self.short_gff_file = f"{resources}/gff/single.gff3" self.short_gff = helpers.read(self.short_gff_file) self.maxDiff = None @@ -365,7 +484,7 @@ def test_ref_tables_alias_multisource_concat_all_features_false(self): def test_ref_tables_identity_matches_multisource_concat(self): feature_source = { self.short_gff_file: ["ID"], - f"{resources}/single2.gff3": ["ID"] + f"{resources}/gff/single2.gff3": ["ID"] } kwargs = {'all_features': True} @@ -392,7 +511,7 @@ def test_ref_tables_identity_matches_multisource_concat(self): def test_ref_tables_discontinuous_aliases(self): kwargs = {'all_features': True} - feature_source = {f"{resources}/discontinuous.gff3": ["Name"]} + feature_source = {f"{resources}/gff/discontinuous.gff3": ["Name"]} mock_selector = self.selector_with_template(helpers.rules_template) _, alias, _ = ReferenceFeatures(feature_source, **kwargs).get(mock_selector) @@ -408,7 +527,7 @@ def test_ref_tables_discontinuous_aliases(self): def test_ref_tables_discontinuous_no_match_all_features_false(self): kwargs = {'all_features': False} - feature_source = {f"{resources}/discontinuous.gff3": ["Name"]} + feature_source = 
{f"{resources}/gff/discontinuous.gff3": ["Name"]} mock_selector = self.selector_with_template([{'Identity': ('No', 'Match')}]) expected_err = "No features were retained while parsing your GFF file.\n" \ @@ -425,7 +544,7 @@ def test_ref_tables_discontinuous_no_match_all_features_false(self): def test_ref_tables_discontinuous_features(self): kwargs = {'all_features': True} - feature_source = {f"{resources}/discontinuous.gff3": ["Name"]} + feature_source = {f"{resources}/gff/discontinuous.gff3": ["Name"]} feature_selector = self.selector_with_template([{'Identity': ('No', 'Match')}]) # Features that fail to match on identity are not added to the StepVector, @@ -440,7 +559,7 @@ def test_ref_tables_discontinuous_features(self): def test_ref_tables_discontinuous_intervals(self): kwargs = {'all_features': True} - feature_source = {f"{resources}/discontinuous.gff3": ["Name"]} + feature_source = {f"{resources}/gff/discontinuous.gff3": ["Name"]} feature_selector = self.selector_with_template(helpers.rules_template) grandparent_iv = HTSeq.GenomicInterval('I', 0, 10, '-') @@ -467,7 +586,7 @@ def test_ref_tables_discontinuous_intervals(self): performed for intervals in this test.""" def test_ref_tables_discontinuous_identity_matches(self): - feature_source = {f"{resources}/discontinuous.gff3": ["Name"]} + feature_source = {f"{resources}/gff/discontinuous.gff3": ["Name"]} feature_selector = self.selector_with_template([ {'Identity': ('Class', 'NA'), 'Hierarchy': 2}, # Rule 1 {'Identity': ('Name', 'Sibling3'), 'Hierarchy': 3}, # Rule 2 @@ -511,7 +630,7 @@ def test_ref_tables_discontinuous_identity_matches(self): def test_ref_tables_source_filter(self): kwargs = {'all_features': False} - feature_source = {f"{resources}/discontinuous.gff3": ["Name"]} + feature_source = {f"{resources}/gff/discontinuous.gff3": ["Name"]} feature_selector = self.selector_with_template([{'Filter_s': "Source2Name"}]) child2_iv = HTSeq.GenomicInterval('I', 39, 50, '-') @@ -536,7 +655,7 @@ def 
test_ref_tables_source_filter(self): def test_ref_tables_type_filter(self): kwargs = {'all_features': False} - feature_source = {f"{resources}/discontinuous.gff3": ["Name"]} + feature_source = {f"{resources}/gff/discontinuous.gff3": ["Name"]} feature_selector = self.selector_with_template([{'Filter_t': "CDS"}]) child1_iv = HTSeq.GenomicInterval('I', 29, 40, '-') @@ -561,7 +680,7 @@ def test_ref_tables_type_filter(self): def test_ref_tables_both_filter(self): kwargs = {'all_features': False} - feature_source = {f"{resources}/discontinuous.gff3": ["Name"]} + feature_source = {f"{resources}/gff/discontinuous.gff3": ["Name"]} feature_selector = self.selector_with_template([{'Filter_s': "SourceName", 'Filter_t': "gene"}]) rt = ReferenceFeatures(feature_source, **kwargs) @@ -577,7 +696,7 @@ def test_ref_tables_both_filter(self): def test_ref_tables_tagged_match_single(self): kwargs = {'all_features': False} feat_id = "Gene:WBGene00023193" - feature_source = {f"{resources}/single.gff3": ["sequence_name"]} + feature_source = {f"{resources}/gff/single.gff3": ["sequence_name"]} feature_selector = self.selector_with_template([ {'Identity': ("ID", feat_id), 'Class': "tagged_match", 'Hierarchy': 1}, {'Identity': ("ID", feat_id), 'Class': "", 'Hierarchy': 2} @@ -604,7 +723,7 @@ def test_ref_tables_tagged_match_single(self): """Does ReferenceFeatures.get() correctly merge records for discontinuous features matching multiple tagged rules?""" def test_ref_tables_tagged_match_merging(self): - feature_source = {f"{resources}/discontinuous.gff3": ['Name']} + feature_source = {f"{resources}/gff/discontinuous.gff3": ['Name']} # All rules match the same root feature feature_selector = self.selector_with_template([ diff --git a/tests/unit_tests_validation.py b/tests/unit_tests_validation.py index 32a0845f..807f4c28 100644 --- a/tests/unit_tests_validation.py +++ b/tests/unit_tests_validation.py @@ -6,7 +6,13 @@ from glob import glob from unittest.mock import patch, mock_open -from 
tiny.rna.counter.validation import GFFValidator, ReportFormatter, SamSqValidator +from tiny.rna.counter.validation import GFFValidator, ReportFormatter, AlignmentSqValidator + +resources = "./testdata/counter" + +# To run all test suites +if __name__ == '__main__': + unittest.main() class GFFValidatorTest(unittest.TestCase): @@ -104,7 +110,7 @@ def test_gff_id_validation(self): def test_ebwt_chroms(self): validator = self.make_gff_validator() - ebwt_prefix = "./testdata/counter/validation/ebwt/ram1" + ebwt_prefix = f"{resources}/validation/ebwt/ram1" # Chroms are shared validator.chrom_set = {'ram1'} @@ -122,7 +128,7 @@ def test_ebwt_chroms(self): def test_genome_chroms(self): validator = self.make_gff_validator() - fasta_file = "./testdata/counter/validation/genome/genome.fasta" + fasta_file = f"{resources}/validation/genome/genome.fasta" # Chroms are shared validator.chrom_set = {'chr1', 'chr2', 'chr3'} @@ -140,8 +146,8 @@ def test_genome_chroms(self): def test_alignments_heuristic(self): validator = self.make_gff_validator() - sam_files = ['./testdata/counter/identity_choice_test.sam', - './testdata/counter/single.sam'] + sam_files = [f'{resources}/sam/identity_choice_test.sam', + f'{resources}/sam/single.sam'] sam_chroms = { sam_files[0]: {'I', 'V', 'MtDNA'}, @@ -231,7 +237,7 @@ def test_chrom_heuristics_report_output(self): exp_warnings = '\n'.join([ "GFF files and sequence files might not contain the same chromosome identifiers.", "This is determined from a subset of each sequence file, so false positives may be reported.", - "\t" + f"{validator.targets['sam files']}: ", + "\t" + f"{validator.targets['alignment files']}: ", "\t\t" + "sam1: ", "\t\t\t" + f"Chromosomes sampled: {', '.join(sorted(suspect_files['sam1']))}", "\t\t" + "sam2: ", @@ -298,7 +304,7 @@ def test_chrom_heuristics_runtime(self): class SamSqValidatorTest(unittest.TestCase): @classmethod def setUpClass(self): - self.syntax_header = "Every SAM file must have complete @SQ headers with SN and 
LN\n" \ + self.syntax_header = "Every alignment file must have complete @SQ headers with SN and LN\n" \ "fields when performing sequence-based counting.\n" self.identifier_header = "Sequence identifiers must be unique and have consistent length definitions.\n" @@ -327,12 +333,12 @@ def test_read_sq_headers(self): ], } - validator = SamSqValidator([sam_file]) + validator = AlignmentSqValidator([sam_file]) validator.read_sq_headers() self.assertDictEqual(validator.sq_headers, expected) - """Does SamSqValidator correctly identify missing @SQ headers?""" + """Does AlignmentSqValidator correctly identify missing @SQ headers?""" def test_missing_sq_headers(self): mock_files = ['sam1', 'sam2'] @@ -343,18 +349,18 @@ def test_missing_sq_headers(self): expected = '\n'.join([ self.syntax_header, - "\t" + f"{SamSqValidator.targets['missing sq']}: ", + "\t" + f"{AlignmentSqValidator.targets['missing sq']}: ", "\t\t" + mock_files[1] ]) - validator = SamSqValidator(mock_files) + validator = AlignmentSqValidator(mock_files) validator.sq_headers = mock_sam_headers validator.validate_sq_headers() self.assertListEqual(validator.report.errors, [expected]) self.assertListEqual(validator.report.warnings, []) - """Does SamSqValidator correctly identify incomplete @SQ headers?""" + """Does AlignmentSqValidator correctly identify incomplete @SQ headers?""" def test_incomplete_sq_headers(self): mock_files = ['sam1', 'sam2', 'sam3'] @@ -366,19 +372,19 @@ def test_incomplete_sq_headers(self): expected = '\n'.join([ self.syntax_header, - "\t" + f"{SamSqValidator.targets['incomplete sq']}: ", + "\t" + f"{AlignmentSqValidator.targets['incomplete sq']}: ", "\t\t" + mock_files[1], "\t\t" + mock_files[2], ]) - validator = SamSqValidator(mock_files) + validator = AlignmentSqValidator(mock_files) validator.sq_headers = mock_sam_headers validator.validate_sq_headers() self.assertListEqual(validator.report.errors, [expected]) self.assertListEqual(validator.report.warnings, []) - """Does SamSqValidator 
correctly identify duplicate identifiers?""" + """Does AlignmentSqValidator correctly identify duplicate identifiers?""" def test_duplicate_identifiers(self): mock_files = ['sam1', 'sam2', 'sam3'] @@ -390,18 +396,18 @@ def test_duplicate_identifiers(self): expected = '\n'.join([ self.identifier_header, - "\t" + f"{SamSqValidator.targets['intra sq']}: ", + "\t" + f"{AlignmentSqValidator.targets['intra sq']}: ", "\t\t" + mock_files[2], ]) - validator = SamSqValidator(mock_files) + validator = AlignmentSqValidator(mock_files) validator.sq_headers = mock_sam_headers validator.validate_sq_headers() self.assertListEqual(validator.report.errors, [expected]) self.assertListEqual(validator.report.warnings, []) - """Does SamSqValidator correctly identify identifiers with inconsistent length definitions?""" + """Does AlignmentSqValidator correctly identify identifiers with inconsistent length definitions?""" def test_inconsistent_identifier_length(self): mock_files = ['sam1', 'sam2', 'sam3'] @@ -413,12 +419,12 @@ def test_inconsistent_identifier_length(self): expected = '\n'.join([ self.identifier_header, - "\t" + f"{SamSqValidator.targets['inter sq']}: ", + "\t" + f"{AlignmentSqValidator.targets['inter sq']}: ", "\t\t" + 'chr1', "\t\t" + 'chr2', ]) - validator = SamSqValidator(mock_files) + validator = AlignmentSqValidator(mock_files) validator.sq_headers = mock_sam_headers validator.validate_sq_headers() diff --git a/tiny/cwl/tools/tiny-count.cwl b/tiny/cwl/tools/tiny-count.cwl index aeb2e694..f2a695a6 100644 --- a/tiny/cwl/tools/tiny-count.cwl +++ b/tiny/cwl/tools/tiny-count.cwl @@ -55,10 +55,10 @@ inputs: inputBinding: prefix: --all-features - is_pipeline: + in_pipeline: type: boolean? inputBinding: - prefix: --is-pipeline + prefix: --in-pipeline diagnostics: type: boolean? 
diff --git a/tiny/cwl/workflows/tinyrna_wf.cwl b/tiny/cwl/workflows/tinyrna_wf.cwl index 0a877b86..582ded97 100644 --- a/tiny/cwl/workflows/tinyrna_wf.cwl +++ b/tiny/cwl/workflows/tinyrna_wf.cwl @@ -82,7 +82,7 @@ inputs: features_csv: File gff_files: File[]? aligned_seqs: File[]? - is_pipeline: boolean? + in_pipeline: boolean? counter_diags: boolean? counter_decollapse: boolean? counter_stepvector: string? @@ -223,7 +223,7 @@ steps: valueFrom: $(String(self)) # convert boolean -> string decollapse: counter_decollapse stepvector: counter_stepvector - is_pipeline: {default: true} + in_pipeline: {default: true} diagnostics: counter_diags fastp_logs: preprocessing/json_report_file collapsed_fa: preprocessing/uniq_seqs diff --git a/tiny/rna/configuration.py b/tiny/rna/configuration.py index 53b7e0fc..b317fcb4 100644 --- a/tiny/rna/configuration.py +++ b/tiny/rna/configuration.py @@ -10,12 +10,12 @@ from pkg_resources import resource_filename from collections import Counter, OrderedDict, defaultdict -from typing import Union, Any, List, Dict +from typing import Union, Any, List, Dict, Optional from glob import glob from tiny.rna.compatibility import RunConfigCompatibility from tiny.rna.counter.validation import GFFValidator -from tiny.rna.util import get_timestamp, get_r_safename +from tiny.rna.util import get_timestamp, get_r_safename, append_to_exception class ConfigBase: @@ -265,13 +265,13 @@ def absorb_paths_file(self): def process_samples_sheet(self): samples_sheet_path = self.paths['samples_csv'] - samples_sheet = SamplesSheet(samples_sheet_path) + samples_sheet = SamplesSheet(samples_sheet_path, context="Pipeline Start") self['sample_basenames'] = samples_sheet.sample_basenames self['control_condition'] = samples_sheet.control_condition self['run_deseq'] = samples_sheet.is_compatible_df - self['in_fq'] = [self.cwl_file(fq, verify=False) for fq in samples_sheet.fastq_files] + self['in_fq'] = [self.cwl_file(fq, verify=False) for fq in samples_sheet.hts_samples] 
         self['fastp_report_titles'] = [f"{g}_rep_{r}" for g, r in samples_sheet.groups_reps]
 
     def process_features_sheet(self) -> List[dict]:
@@ -641,47 +641,95 @@ def append_to(self, key: str, val: Any):
 
 
 class SamplesSheet:
-    def __init__(self, file):
+    def __init__(self, file, context):
         self.csv = CSVReader(file, "Samples Sheet")
         self.basename = os.path.basename(file)
         self.dir = os.path.dirname(file)
         self.file = file
 
-        self.fastq_files = []
+        self.hts_samples = []
         self.groups_reps = []
+        self.normalizations = []
         self.sample_basenames = []
         self.control_condition = None
         self.is_compatible_df = False
 
+        map_path, validate_path = self.get_context_methods(context).values()
+        self.validate_sample_path = validate_path
+        self.map_path = map_path
+        self.context = context
+
         self.read_csv()
 
+    def get_context_methods(self, context):
+        return {
+            "Standalone Run": {'map_path': self.from_here, 'validate_path': self.validate_alignments_filepath},
+            "Pipeline Start": {'map_path': self.from_here, 'validate_path': self.validate_fastq_filepath},
+            "Pipeline Step":  {'map_path': self.from_pipeline, 'validate_path': lambda _: True},  # skip validation
+        }[context]
+
     def read_csv(self):
         reps_per_group = Counter()
 
-        for row in self.csv.rows():
-            fastq_file = Configuration.joinpath(self.dir, row['File'])
-            group_name = row['Group']
-            rep_number = row['Replicate']
-            is_control = row['Control'].lower() == 'true'
-            basename = self.get_sample_basename(fastq_file)
+        try:
+            for row in self.csv.rows():
+                hts_sample = self.map_path(row['File'])
+                group_name = row['Group']
+                rep_number = row['Replicate']
+                is_control = row['Control'].lower() == 'true'
+                norm_prefs = row['Normalization']
+                basename = self.get_sample_basename(hts_sample)
+
+                self.validate_sample_path(hts_sample)
+                self.validate_group_rep(group_name, rep_number)
+                self.validate_control_group(is_control, group_name)
+                self.validate_normalization(norm_prefs)
+
+                self.hts_samples.append(hts_sample)
+                self.sample_basenames.append(basename)
+                self.groups_reps.append((group_name, rep_number))
+                self.normalizations.append(norm_prefs)
+                reps_per_group[group_name] += 1
+
+                if is_control: self.control_condition = group_name
+        except Exception as e:
+            if not isinstance(e, AssertionError):
+                # Validation steps already indicate the row; this exception likely came from elsewhere
+                msg = f"Error occurred on line {self.csv.row_num} of {self.basename}"
+                append_to_exception(e, msg)
+            raise
 
-            self.validate_fastq_filepath(fastq_file)
-            self.validate_group_rep(group_name, rep_number)
-            self.validate_control_group(is_control, group_name)
+        self.is_compatible_df = self.validate_deseq_compatibility(reps_per_group)
 
-            self.fastq_files.append(fastq_file)
-            self.sample_basenames.append(basename)
-            self.groups_reps.append((group_name, rep_number))
-            reps_per_group[group_name] += 1
+    def from_here(self, input_file):
+        """Resolves input paths relative to the Samples Sheet (pipeline startup and standalone contexts)"""
 
-            if is_control: self.control_condition = group_name
+        return ConfigBase.joinpath(self.dir, input_file)
 
-        self.is_compatible_df = self.validate_deseq_compatibility(reps_per_group)
+    def from_pipeline(self, input_file):
+        """Maps a .fastq(.gz) entry to its pipeline-produced alignment file (pipeline step context)"""
+
+        basename_root, ext = os.path.splitext(os.path.basename(input_file))
+        return f"{basename_root}_aligned_seqs.sam"
+
+    def validate_alignments_filepath(self, file: str):
+        """Checks file existence, extension, and duplicate entries in standalone context"""
+
+        root, ext = os.path.splitext(file)
+
+        assert os.path.isfile(file), \
+            "The file on row {row_num} of {selfname} was not found:\n\t{file}" \
+            .format(row_num=self.csv.row_num, selfname=self.basename, file=file)
+
+        assert ext in (".sam", ".bam"), \
+            "Files in {selfname} must have a .sam or .bam extension (row {row_num})" \
+            .format(selfname=self.basename, row_num=self.csv.row_num)
+
+        assert file not in self.hts_samples, \
+            "Alignment files cannot be listed more than once in {selfname} (row {row_num})" \
+            .format(selfname=self.basename, row_num=self.csv.row_num)
 
     def validate_fastq_filepath(self, file: str):
-        """Checks file existence, extension, and duplicate entries.
-
-        Args:
-            file: fastq file path, which has already been resolved relative to self.dir
-        """
+        """Checks file existence, extension, and duplicate entries in pipeline startup context"""
 
         root, ext = os.path.splitext(file)
 
@@ -690,30 +738,41 @@ def validate_fastq_filepath(self, file: str):
             .format(row_num=self.csv.row_num, selfname=self.basename, file=file)
 
         assert ext in (".fastq", ".gz"), \
-            "Files in {selfname} must have a .fastq(.gz) extension (row {row_num})"\
+            "Files in {selfname} must have a .fastq(.gz) extension (row {row_num})" \
             .format(selfname=self.basename, row_num=self.csv.row_num)
 
-        assert file not in self.fastq_files, \
-            "Fastq files cannot be listed more than once in {selfname} (row {row_num})"\
+        assert file not in self.hts_samples, \
+            "Fastq files cannot be listed more than once in {selfname} (row {row_num})" \
            .format(selfname=self.basename, row_num=self.csv.row_num)
 
     def validate_group_rep(self, group:str, rep:str):
         assert (group, rep) not in self.groups_reps, \
             "The same group and replicate number cannot appear on " \
-            "more than one row in {selfname} (row {row_num})"\
+            "more than one row in {selfname} (row {row_num})" \
             .format(selfname=self.basename, row_num=self.csv.row_num)
 
     def validate_control_group(self, is_control: bool, group: str):
-        if not is_control: return
+
+        if not is_control or self.context != "Pipeline Start":
+            return
+
         assert self.control_condition in (group, None), \
-            "tinyRNA does not support multiple control conditions " \
+            "Experiments with multiple control conditions aren't supported " \
             "(row {row_num} in {selfname}).\nHowever, if the control condition " \
             "is unspecified, all possible comparisons will be made and this " \
-            "should accomplish your goal."\
+            "should accomplish your goal." \
             .format(row_num=self.csv.row_num, selfname=self.basename)
 
-    @staticmethod
-    def validate_deseq_compatibility(sample_groups: Counter) -> bool:
+    def validate_normalization(self, norm):
+        assert re.fullmatch(r"\s*((?:\d+(?:\.\d*)?|\.\d+)|(rpm))\s*", norm, re.IGNORECASE) or not norm, \
+            "Invalid normalization value in {selfname} (row {row_num})" \
+            .format(selfname=self.basename, row_num=self.csv.row_num)
+
+    def validate_deseq_compatibility(self, sample_groups: Counter) -> Optional[bool]:
+
+        if self.context != "Pipeline Start":
+            return None
+
         SamplesSheet.validate_r_safe_sample_groups(sample_groups)
 
         total_samples = sum(sample_groups.values())
@@ -774,7 +833,7 @@ class CSVReader(csv.DictReader):
             "Mismatches": "Mismatch"
         }),
         "Samples Sheet": OrderedDict({
-            "FASTQ/SAM Files": "File",
+            "Input Files": "File",
             "Sample/Group Name": "Group",
             "Replicate Number": "Replicate",
             "Control": "Control",
@@ -901,6 +960,15 @@ def check_backward_compatibility(self, header_vals):
                 "the new column to your Features Sheet to avoid this error."
             ]))
 
+        if self.doctype == "Samples Sheet":
+            if 'fastq/sam files' in header_vals_lc:
+                compat_errors.append('\n'.join([
+                    "It looks like you're using a Samples Sheet from an earlier version of",
+                    'tinyRNA. The "FASTQ/SAM files" column has been renamed to "Input Files"',
+                    'due to the addition of BAM file support. Please rename the column in',
+                    "your Samples Sheet to avoid this error."
+                ]))
+
         if compat_errors:
             raise ValueError('\n\n'.join(compat_errors))
diff --git a/tiny/rna/counter/counter.py b/tiny/rna/counter/counter.py
index 4bb95f7d..5e577cdc 100644
--- a/tiny/rna/counter/counter.py
+++ b/tiny/rna/counter/counter.py
@@ -8,18 +8,17 @@
 
 from typing import List, Dict
 
-from tiny.rna.counter.validation import GFFValidator, SamSqValidator
+from tiny.rna.counter.validation import GFFValidator, AlignmentSqValidator
 from tiny.rna.counter.features import Features, FeatureCounter
 from tiny.rna.counter.statistics import MergedStatsManager
 from tiny.rna.counter.hts_parsing import ReferenceFeatures, ReferenceSeqs, ReferenceBase
-from tiny.rna.configuration import CSVReader, PathsFile, get_templates
+from tiny.rna.configuration import PathsFile, SamplesSheet, CSVReader, get_templates
 from tiny.rna.util import (
     report_execution_time,
     add_transparent_help,
     append_to_exception,
     get_timestamp,
-    ReadOnlyDict,
-    from_here
+    ReadOnlyDict
 )
 
 # Global variables for multiprocessing
@@ -60,7 +59,7 @@ def get_args():
     optional_args.add_argument('-vs', '--verify-stats', metavar='T/F', default='T',
                                help='Verify that all reported stats are internally consistent.')
     optional_args.add_argument('-dc', '--decollapse', action='store_true',
-                               help='Create a decollapsed copy of all SAM files listed in your Samples Sheet. '
+                               help='Create a decollapsed SAM copy of all files listed in your Samples Sheet. '
                                     'This option is ignored for non-collapsed inputs.')
     optional_args.add_argument('-sv', '--stepvector', choices=['Cython', 'HTSeq'], default='Cython',
                                help='Select which StepVector implementation is used to find '
@@ -68,7 +67,7 @@ def get_args():
     optional_args.add_argument('-a', '--all-features', action='store_true', help=argparse.SUPPRESS)
                               #help='Represent all features in output counts table, '
                               #     'even if they did not match in Stage 1 selection.')
-    optional_args.add_argument('-p', '--is-pipeline', action='store_true',
+    optional_args.add_argument('-p', '--in-pipeline', action='store_true',
                                help='Indicates that tiny-count was invoked as part of a pipeline run '
                                     'and that input files should be sourced as such.')
     optional_args.add_argument('-d', '--report-diags', action='store_true',
@@ -89,72 +88,44 @@ def get_args():
     return ReadOnlyDict(args_dict)
 
 
-def load_samples(samples_csv: str, is_pipeline: bool) -> List[Dict[str, str]]:
+def load_samples(samples_csv: str, in_pipeline: bool) -> List[Dict[str, str]]:
     """Parses the Samples Sheet to determine library names and alignment files for counting
 
-    Sample files may have a .fastq(.gz) extension (i.e. when tiny-count is called as part of a
-    pipeline run) or a .sam extension (i.e. when tiny-count is called as a standalone tool).
+    Sample files can have a .fastq(.gz) extension (i.e. when tiny-count is called as part of a
+    pipeline run) or a .sam/.bam extension (i.e. when tiny-count is called as a standalone tool).
 
     Args:
         samples_csv: a csv file which defines sample group, replicate, and file location
-        is_pipeline: helps locate sample SAM files. If true, files are assumed to reside
-            in the working directory. If false, files are assumed to reside in the same
-            directory as their source FASTQ files with '_aligned_seqs.sam' appended
-            (i.e. /dir/sample1.fastq -> /dir/sample1_aligned_seqs.sam).
+        in_pipeline: helps locate sample alignment files. If true, files are assumed to
+            reside in the working directory.
 
     Returns:
-        inputs: a list of dictionaries for each library, where each dictionary defines the
-            library name and the location of its SAM file for counting.
+        inputs: a list of dictionaries for each library. Each dictionary holds
+            the library's name, file location, and normalization preferences.
     """
 
-    def get_library_filename(csv_row_file: str, samples_csv: str) -> str:
-        """The input samples.csv may contain either fastq or sam files"""
-
-        file_ext = os.path.splitext(csv_row_file)[1].lower()
-
-        # If the sample file has a fastq(.gz) extension, infer the name of its pipeline-produced .sam file
-        if file_ext in [".fastq", ".gz"]:
-            # Fix relative paths to be relative to sample_csv's path, rather than relative to cwd
-            csv_row_file = os.path.basename(csv_row_file) if is_pipeline else from_here(samples_csv, csv_row_file)
-            csv_row_file = os.path.splitext(csv_row_file)[0] + "_aligned_seqs.sam"
-        elif file_ext == ".sam":
-            if not os.path.isabs(csv_row_file):
-                raise ValueError("The following file must be expressed as an absolute path:\n%s" % (csv_row_file,))
-        else:
-            raise ValueError("The filenames defined in your Samples Sheet must have a .fastq(.gz) or .sam extension.\n"
-                             "The following filename contained neither:\n%s" % (csv_row_file,))
-        return csv_row_file
-
+    context = "Pipeline Step" if in_pipeline else "Standalone Run"
+    samples = SamplesSheet(samples_csv, context=context)
     inputs = list()
 
-    sheet = CSVReader(samples_csv, "Samples Sheet")
-    try:
-        for row in sheet.rows():
-            library_name = f"{row['Group']}_rep_{row['Replicate']}"
-            library_file_name = get_library_filename(row['File'], samples_csv)
-            library_normalization = row['Normalization']
-
-            record = {
-                "Name": library_name,
-                "File": library_file_name,
-                "Norm": library_normalization
-            }
-
-            if record not in inputs: inputs.append(record)
-    except Exception as e:
-        msg = f"Error occurred on line {sheet.line_num} of {os.path.basename(samples_csv)}"
-        append_to_exception(e, msg)
-        raise
+    for file, group_rep, norm in zip(samples.hts_samples, samples.groups_reps, samples.normalizations):
+        record = {
+            "Name": "_rep_".join(group_rep),
+            "File": file,
+            "Norm": norm
+        }
+
+        if record not in inputs: inputs.append(record)
 
     return inputs
 
 
-def load_config(features_csv: str, is_pipeline: bool) -> List[dict]:
+def load_config(features_csv: str, in_pipeline: bool) -> List[dict]:
     """Parses the Features Sheet to provide inputs to FeatureSelector and reference parsers
 
     Args:
         features_csv: a csv file which defines feature sources and selection rules
-        is_pipeline: not currently used; helps properly locate input files
+        in_pipeline: not currently used; helps properly locate input files
 
     Returns:
         rules: a list of dictionaries, each representing a parsed row from input.
@@ -175,7 +146,7 @@ def load_config(features_csv: str, in_pipeline: bool) -> List[dict]:
             # Duplicate rule entries are not allowed
             if rule not in rules: rules.append(rule)
     except Exception as e:
-        msg = f"Error occurred on line {sheet.line_num} of {os.path.basename(features_csv)}"
+        msg = f"Error occurred on line {sheet.row_num} of {os.path.basename(features_csv)}"
         append_to_exception(e, msg)
         raise
 
@@ -183,7 +154,7 @@
 
 def load_references(paths: PathsFile, libraries: List[dict], rules: List[dict], prefs) -> ReferenceBase:
-    """Determines the reference source (GFF or SAM @SQ headers) and constructs the appropriate object
+    """Determines the reference source (GFF or alignment @SQ headers) and constructs the appropriate object
 
     Args:
         paths: a PathsFile object that optionally contains a `gff_files` configuration
@@ -197,13 +168,13 @@
     """
 
     gff_files = paths.get_gff_config()
-    sam_files = [lib['File'] for lib in libraries]
+    aln_files = [lib['File'] for lib in libraries]
 
     if gff_files:
-        GFFValidator(gff_files, rules, alignments=sam_files).validate()
+        GFFValidator(gff_files, rules, alignments=aln_files).validate()
         references = ReferenceFeatures(gff_files, **prefs)
     else:
-        sq_validator = SamSqValidator(sam_files)
+        sq_validator = AlignmentSqValidator(aln_files)
 
         # Reuse sq_validator's parsing results to save time
         sq_validator.validate()
@@ -220,7 +191,7 @@ def map_and_reduce(libraries, paths, prefs):
 
     # MergedStatsManager handles final output files, regardless of multiprocessing status
     summary = MergedStatsManager(Features, paths['features_csv'], prefs)
 
-    # Use a multiprocessing pool if multiple sam files were provided
+    # Use a multiprocessing pool if multiple alignment files were provided
     if len(libraries) > 1:
         mp.set_start_method("fork")
         with mp.Pool(len(libraries)) as pool:
@@ -239,13 +210,13 @@ def main():
     # Get command line arguments.
     args = get_args()
-    is_pipeline = args['is_pipeline']
+    in_pipeline = args['in_pipeline']
 
     try:
         # Load and validate config files and input files
-        paths = PathsFile(args['paths_file'], is_pipeline)
-        libraries = load_samples(paths['samples_csv'], is_pipeline)
-        selection = load_config(paths['features_csv'], is_pipeline)
+        paths = PathsFile(args['paths_file'], in_pipeline)
+        libraries = load_samples(paths['samples_csv'], in_pipeline)
+        selection = load_config(paths['features_csv'], in_pipeline)
         reference = load_references(paths, libraries, selection, args)
 
         # global for multiprocessing
@@ -260,7 +231,7 @@ def main():
     except Exception as e:
         if type(e) is SystemExit: return
         traceback.print_exception(*sys.exc_info())
-        if args['is_pipeline']:
+        if in_pipeline:
             print("\n\ntiny-count encountered an error. Don't worry! You don't have to start over.\n"
                   "You can resume the pipeline at tiny-count. To do so:\n\t"
                   "1. cd into your Run Directory\n\t"
diff --git a/tiny/rna/counter/features.py b/tiny/rna/counter/features.py
index 512292bd..8eca57dc 100644
--- a/tiny/rna/counter/features.py
+++ b/tiny/rna/counter/features.py
@@ -4,7 +4,7 @@
 from collections import defaultdict
 from typing import List, Tuple, Set, Dict, Mapping, Union
 
-from tiny.rna.counter.hts_parsing import ReferenceFeatures, ReferenceSeqs, SAM_reader
+from tiny.rna.counter.hts_parsing import ReferenceFeatures, ReferenceSeqs, AlignmentReader
 from .statistics import LibraryStats
 from .matching import *
 from ..util import append_to_exception
@@ -32,7 +32,7 @@ class FeatureCounter:
     def __init__(self, references, selection_rules, **prefs):
         self.stats = LibraryStats(**prefs)
-        self.sam_reader = SAM_reader(**prefs)
+        self.alignment_reader = AlignmentReader(**prefs)
         self.selector = FeatureSelector(selection_rules, **prefs)
 
         if isinstance(references, ReferenceFeatures):
@@ -48,10 +48,10 @@
     def count_reads(self, library: dict):
         """Collects statistics on features assigned to each alignment associated with each read"""
 
-        read_seq = self.sam_reader.bundle_multi_alignments(library["File"])
+        read_seq = self.alignment_reader.bundle_multi_alignments(library["File"])
         self.stats.assign_library(library)
 
-        # For each sequence in the sam file...
+        # For each sequence in the alignment file...
         for bundle, read_count in read_seq:
             bstat = self.stats.count_bundle(bundle, read_count)
@@ -85,7 +85,7 @@ class FeatureSelector:
     """Performs hierarchical selection given a set of candidate features for a locus
 
     Two sources of data serve as targets for selection: feature attributes (sourced from
-    input GFF files), and sequence attributes (sourced from input SAM files).
+    input GFF files), and sequence attributes (sourced from input alignment files).
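The `AlignmentReader` rename above centers on the reader's core job: bundling consecutive alignments that share a QNAME. A rough sketch of that idea using `itertools.groupby`; the tuple record format is hypothetical, and the real reader wraps pysam and also tracks read counts:

```python
from itertools import groupby

def bundle_by_qname(alignments):
    """Group consecutive (qname, locus) records into multi-alignment bundles.

    Correct only when multi-mapping alignments are adjacent in the input,
    which is exactly why the reader checks SO/GO/PG headers first.
    """
    for qname, group in groupby(alignments, key=lambda aln: aln[0]):
        yield list(group)

# Hypothetical records: read1 maps twice, read2 maps once
records = [("read1", "chrI:100"), ("read1", "chrII:55"), ("read2", "chrI:700")]
bundles = list(bundle_by_qname(records))
```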
 
     All candidate features are assumed to have matched at least one Identity selector,
     as determined by hts_parsing.ReferenceFeatures.get_matches_and_classes()
diff --git a/tiny/rna/counter/hts_parsing.py b/tiny/rna/counter/hts_parsing.py
index 469e184f..ec27df54 100644
--- a/tiny/rna/counter/hts_parsing.py
+++ b/tiny/rna/counter/hts_parsing.py
@@ -1,4 +1,5 @@
 import functools
+import itertools
 import os.path
 import HTSeq
 import pysam
@@ -19,21 +20,25 @@
 _re_attr_main = re.compile(r"\s*([^\s=]+)[\s=]+(.*)")
 _re_attr_empty = re.compile(r"^\s*$")
 
-# For SAM_reader
+# For AlignmentReader
 AlignmentDict = Dict[str, Union[str, int]]
 Bundle = Tuple[List[AlignmentDict], int]
 _re_tiny = r"\d+_count=\d+"
 _re_fastx = r"seq\d+_x(\d+)"
 
 
-class SAM_reader:
-    """A minimal SAM reader that bundles multiple-alignments and only parses data relevant to the workflow."""
+class AlignmentReader:
+    """A SAM/BAM wrapper for Pysam that bundles multiple-alignments and only retrieves data relevant to the workflow."""
+
+    compatible_unordered = ("bowtie", "bowtie2", "star")  # unordered files from these tools don't require sorting
 
     def __init__(self, **prefs):
         self.decollapse = prefs.get("decollapse", False)
         self.out_prefix = prefs.get("out_prefix", None)
         self.collapser_type = None
         self.references = []
+        self.file_type = None
+        self.basename = None
         self.has_nm = None
         self.file = None
@@ -50,8 +55,8 @@
     def bundle_multi_alignments(self, file: str) -> Iterator[Bundle]:
         """Bundles multi-alignments (determined by a shared QNAME) and reports the associated read's count"""
 
         pysam_reader = pysam.AlignmentFile(file)
-        self.file = file
+        self._assign_library(file)
         self._gather_metadata(pysam_reader)
         self._iter = AlignmentIter(pysam_reader, self.has_nm, self._decollapsed_callback, self._decollapsed_reads)
 
         bundle, read_count = self._new_bundle(next(self._iter))
@@ -78,18 +83,25 @@ def _new_bundle(self, aln: AlignmentDict) -> Bundle:
         return [aln], count
 
+    def _assign_library(self, file):
+        self.file_type = file.rsplit('.')[-1].lower()
+        self.basename = os.path.basename(file)
+        self.file = file
+
     def _gather_metadata(self, reader: pysam.AlignmentFile) -> None:
-        """Saves header information, examines the first alignment to determine which collapser utility was used (if any), and whether the alignment has a query sequence and NM tag.
+        """Saves header information, examines the first alignment to determine which collapser utility was used (if any)
+        and whether the alignment has a query sequence and NM tag.
 
         The input file's header is written to the decollapsed output file if necessary."""
 
         header = reader.header
         first_aln = next(reader.head(1))
 
         if first_aln.query_sequence is None:
-            raise ValueError("SAM alignments must include the read sequence.")
+            raise ValueError(f"Alignments must include the read sequence ({self.basename}).")
 
         self._header = header
-        self._header_dict = header.to_dict()  # performs validation
+        self._header_dict = header.to_dict()  # performs header validation
+        self._check_for_incompatible_order()
         self._determine_collapser_type(first_aln.query_name)
         self.has_nm = first_aln.has_tag("NM")
         self.references = header.references
@@ -97,6 +109,44 @@ def _gather_metadata(self, reader: pysam.AlignmentFile) -> None:
         if self.decollapse:
             self._write_header_for_decollapsed_sam(str(header))
 
+    def _check_for_incompatible_order(self):
+        """Inspects headers to determine if alignments of multi-mapping reads are in adjacent order
+
+        Headers that report unsorted/unknown sorting order are likely still compatible, but
+        since the SAM/BAM standard doesn't address alignment adjacency, we have to assume
+        that it is implementation dependent. Therefore, if all other checks for strict
+        compatibility/incompatibility pass, we check the last reported @PG
+        header against a short (incomplete) list of alignment tools that are known to
+        follow the adjacency convention."""
+
+        hd_header = self._header_dict.get('HD', {})
+        so_header = hd_header.get('SO', "")
+        go_header = hd_header.get('GO', "")
+
+        # Check for strictly compatible order
+        if "queryname" in so_header or "query" in go_header:
+            return
+
+        error = self.basename + " is incompatible because {reason}.\n" \
+                "Please ensure that alignments for multi-mapping reads are ordered so " \
+                "that they are adjacent to each other. One way to achieve this is to " \
+                'sort your files by QNAME with the command "samtools sort -n"'
+
+        # Check for strictly incompatible order
+        if not ({'SO', 'GO'} & hd_header.keys()):
+            raise ValueError(error.format(reason="its sorting/grouping couldn't be determined from headers"))
+        if "coordinate" in so_header:
+            raise ValueError(error.format(reason="it is sorted by coordinate"))
+        if "reference" in go_header:
+            raise ValueError(error.format(reason="it is grouped by reference"))
+
+        # If SO is unknown/unsorted or GO is none, check to see if the last program
+        # to handle the alignments is one that produces adjacent alignments by default
+        pg_header = self._header_dict.get('PG', [{}])
+        last_prog = pg_header[-1].get('ID', "")
+        if last_prog.lower() not in self.compatible_unordered:
+            raise ValueError(error.format(reason="multi-mapping adjacency couldn't be determined"))
+
     def _determine_collapser_type(self, qname: str) -> None:
         """Attempts to determine the collapsing utility that was used before producing
         the input alignment file, then checks basic requirements for the utility's outputs."""
@@ -108,11 +158,6 @@ def _determine_collapser_type(self, qname: str) -> None:
         elif re.match(_re_fastx, qname) is not None:
             self.collapser_type = "fastx"
             self._collapser_token = "_x"
-
-            sort_order = self._header_dict.get('HD', {}).get('SO', None)
-            if sort_order is None or sort_order != "queryname":
-                raise ValueError("SAM files from fastx collapsed outputs must be sorted by queryname\n"
-                                 "(and the @HD [...] SO header must be set accordingly).")
         else:
             self.collapser_type = None
@@ -148,23 +193,23 @@ def _write_decollapsed_sam(self) -> None:
 
             if name != prev_name:
                 seq_count = int(name.rsplit(self._collapser_token, 1)[1])
-                aln_out.extend([aln.to_string()] * seq_count)
+                aln_out.extend([aln.to_string() + '\n'] * seq_count)
                 prev_name = name
 
         with open(self._get_decollapsed_filename(), 'a') as sam_o:
-            sam_o.write('\n'.join(aln_out))
+            sam_o.writelines(aln_out)
 
         self._decollapsed_reads.clear()
 
 
-def infer_strandedness(sam_file: str, intervals: dict) -> str:
-    """Infers strandedness from a sample SAM file and intervals from a parsed GFF file
+def infer_strandedness(alignment_file: str, intervals: dict) -> str:
+    """Infers strandedness from a sample alignment file and intervals from a parsed GFF file
 
     Credit: this technique is an adaptation of those in RSeQC's infer_experiment.py.
     It has been modified to accept a GFF reference file rather than a BED file,
-    and to use HTSeq rather than bx-python.
+    and to use our AlignmentReader rather than bx-python.
 
     Args:
-        sam_file: the path of the SAM file to evaluate
+        alignment_file: the path of the alignment file to evaluate
         intervals: the intervals instance attribute of ReferenceFeatures, populated by .get()
     """
 
@@ -175,18 +220,14 @@ def infer_strandedness
         iv_convert.strand = '.'
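The new header-ordering check can be exercised on plain header dicts like those returned by pysam's `header.to_dict()`. A standalone sketch of the same decision table (the function name and boolean return are illustrative; the real method raises `ValueError` instead):

```python
# Aligners whose unordered output is known to keep multi-mapping alignments adjacent
COMPATIBLE_UNORDERED = ("bowtie", "bowtie2", "star")

def order_is_compatible(header_dict):
    """Decide whether multi-mapping alignments can be assumed adjacent."""
    hd = header_dict.get('HD', {})
    so, go = hd.get('SO', ""), hd.get('GO', "")

    if "queryname" in so or "query" in go:
        return True                       # strictly compatible: query-sorted/grouped
    if not ({'SO', 'GO'} & hd.keys()):
        return False                      # ordering can't be determined from headers
    if "coordinate" in so or "reference" in go:
        return False                      # strictly incompatible ordering

    # SO is unknown/unsorted (or GO is none): trust the last @PG program
    last_prog = header_dict.get('PG', [{}])[-1].get('ID', "")
    return last_prog.lower() in COMPATIBLE_UNORDERED
```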
         unstranded[iv_convert] = orig_iv.strand
 
-    # Assumes read_SAM() defaults to non-reverse strandedness
-    sample_read = read_SAM(sam_file)
+    # Assumes AlignmentReader defaults to non-reverse strandedness
+    sample_read = AlignmentReader().bundle_multi_alignments(alignment_file)
     gff_sam_map = Counter()
-    for count in range(1, 20000):
-        try:
-            rec = next(sample_read)
-            strandless = HTSeq.GenomicInterval(rec['Chrom'], rec['Start'], rec['End'])
-            sam_strand = rec['strand']
-            gff_strand = ':'.join(unstranded[strandless].get_steps())
-            gff_sam_map[sam_strand + gff_strand] += 1
-        except StopIteration:
-            break
+    for rec, _ in zip(itertools.chain(*sample_read), range(20000)):
+        strandless = HTSeq.GenomicInterval(rec['Chrom'], rec['Start'], rec['End'])
+        sam_strand = rec['strand']
+        gff_strand = ':'.join(unstranded[strandless].get_steps())
+        gff_sam_map[sam_strand + gff_strand] += 1
 
     non_rev = (gff_sam_map['++'] + gff_sam_map['--']) / sum(gff_sam_map.values())
     reverse = (gff_sam_map['+-'] + gff_sam_map['-+']) / sum(gff_sam_map.values())
diff --git a/tiny/rna/counter/statistics.py b/tiny/rna/counter/statistics.py
index 41fbcfa7..a3afd0fd 100644
--- a/tiny/rna/counter/statistics.py
+++ b/tiny/rna/counter/statistics.py
@@ -634,7 +634,7 @@ def record_alignment_details(self, assignments, aln, bundle, n_candidates):
 
         This is called once per locus per read (every alignment) when the user elects
         to save diagnostics. The recorded information is later written to {library['Name']}_aln_table.txt
-        after the entire SAM file has been processed."""
+        after the entire alignment file has been processed."""
 
         # Map internal strand representation to +/-/.
         strand = self.map_strand[aln['Strand']]
diff --git a/tiny/rna/counter/validation.py b/tiny/rna/counter/validation.py
index a55b488e..a068be6e 100644
--- a/tiny/rna/counter/validation.py
+++ b/tiny/rna/counter/validation.py
@@ -7,7 +7,7 @@
 from collections import Counter, defaultdict
 from typing import List, Dict
 
-from tiny.rna.counter.hts_parsing import parse_gff, ReferenceFeatures, SAM_reader
+from tiny.rna.counter.hts_parsing import parse_gff, ReferenceFeatures
 from tiny.rna.counter.features import FeatureSelector
 from tiny.rna.util import ReportFormatter, sorted_natural, gzip_open
@@ -18,7 +18,7 @@ class GFFValidator:
 
     targets = {
         "ID attribute": "Features missing a valid identifier attribute",
-        "sam files": "Potentially incompatible SAM alignment files",
+        "alignment files": "Potentially incompatible alignment files",
         "seq chromosomes": "Chromosomes present in sequence files",
         "gff chromosomes": "Chromosomes present in GFF files",
         "strand": "Features missing strand information",
@@ -171,20 +171,19 @@ def chroms_shared_with_genomes(self, genome_fastas):
 
     def alignment_chroms_mismatch_heuristic(self, sam_files: List[str], subset_size=50000) -> Dict[str, set]:
         """Since alignment files can be very large, we only check that there's at least one
         shared chromosome identifier and only the first subset_size lines are read from each file. The
-        returned dictionary contains only the SAM files whose sampled chromosome set failed to
+        returned dictionary contains only the alignment files whose sampled chromosome set failed to
         intersect with the chromosomes parsed from GFF files.
 
         Returns:
-            a dictionary of {SAM filename: SAM chromosomes sampled}"""
+            a dictionary of {alignment filename: alignment chromosomes sampled}"""
 
         files_wo_overlap = {}
         for file in sam_files:
             file_chroms = set()
-            with open(file, 'rb') as f:
-                for i, line in zip(range(subset_size), f):
-                    if line[0] == ord('@'): continue
-                    file_chroms.add(line.split(b'\t')[2].strip().decode())
-                    if i % 10000 == 0 and len(file_chroms & self.chrom_set): break
+            reader = pysam.AlignmentFile(file)
+            for i, aln in enumerate(reader.head(subset_size)):
+                file_chroms.add(aln.reference_name)
+                if i % 10000 == 0 and len(file_chroms & self.chrom_set): break
 
             if not len(file_chroms & self.chrom_set):
                 files_wo_overlap[file] = file_chroms
@@ -210,7 +209,7 @@ def generate_chrom_heuristics_report(self, suspect_files):
                   for file, chroms in suspect_files.items()}
 
         summary = {
-            "sam files": chroms,
+            "alignment files": chroms,
             "gff chromosomes": sorted(self.chrom_set)
         }
@@ -229,24 +228,24 @@ def build_column_filters(rules):
         return FeatureSelector.build_selectors([selector_defs])[0]
 
 
-class SamSqValidator:
+class AlignmentSqValidator:
     """Validates @SQ headers for tiny-count's sequence-based counting mode"""
 
     targets = {
         "inter sq": "Sequence identifiers with inconsistent lengths",
-        "intra sq": "SAM files with repeated sequence identifiers",
-        "incomplete sq": "SAM files with incomplete @SQ headers",
-        "missing sq": "SAM files that lack @SQ headers"
+        "intra sq": "alignment files with repeated sequence identifiers",
+        "incomplete sq": "alignment files with incomplete @SQ headers",
+        "missing sq": "alignment files that lack @SQ headers"
     }
 
-    def __init__(self, sam_files):
+    def __init__(self, alignment_files):
         self.report = ReportFormatter(self.targets)
-        self.sam_files = sam_files
+        self.alignment_files = alignment_files
         self.reference_seqs = {}
         self.sq_headers = {}
 
     def validate(self):
-        print("Validating sequence identifiers in SAM files... ", end='')
+        print("Validating sequence identifiers in alignment files... ", end='')
         self.read_sq_headers()
         self.validate_sq_headers()
         print("done.")
@@ -279,7 +278,7 @@ def get_ambiguous_lengths(self) -> List[str]:
 
         Sequences with consistent length definitions are added to self.reference_seqs"""
 
         seq_lengths = defaultdict(set)
-        for sam in self.sam_files:
+        for sam in self.alignment_files:
             for sq in self.sq_headers[sam]:
                 seq_id = sq['SN']
                 seq_len = int(sq['LN'])
@@ -296,10 +295,10 @@ def get_ambiguous_lengths(self) -> List[str]:
         return bad_seqs
 
     def get_duplicate_identifiers(self) -> Dict[str, List[str]]:
-        """Returns a dictionary of SAM files that contain duplicate sequence identifiers"""
+        """Returns a dictionary of alignment files that contain duplicate sequence identifiers"""
 
         bad_files = {}
-        for file in self.sam_files:
+        for file in self.alignment_files:
             sq = self.sq_headers[file]
             id_count = Counter(seq['SN'] for seq in sq)
             duplicates = [seq_id for seq_id, count in id_count.items() if count > 1]
@@ -308,13 +307,13 @@ def get_duplicate_identifiers(self) -> Dict[str, List[str]]:
         return bad_files
 
     def get_incomplete_sq_headers(self) -> List[str]:
-        """Returns a list of SAM files that have incomplete @SQ headers"""
+        """Returns a list of alignment files that have incomplete @SQ headers"""
 
         return [file for file, sqs in self.sq_headers.items()
                 if not all("SN" in sq and "LN" in sq for sq in sqs)]
 
     def get_missing_headers(self) -> List[str]:
-        """Returns a list of SAM files that lack @SQ headers"""
+        """Returns a list of alignment files that lack @SQ headers"""
 
         return [file for file, sqs in self.sq_headers.items()
                 if len(sqs) == 0]
@@ -326,7 +325,7 @@ def generate_header_syntax_report(self, missing, incomplete):
         if incomplete:
             report['incomplete sq'] = sorted_natural(incomplete)
 
-        header = "Every SAM file must have complete @SQ headers with SN and LN\n" \
+        header = "Every alignment file must have complete @SQ headers with SN and LN\n" \
                  "fields when performing sequence-based counting.\n"
 
         self.report.add_error_section(header, report)
@@ -341,6 +340,6 @@ def generate_identifier_report(self, duplicate, ambiguous):
         self.report.add_error_section(header, report)
 
     def read_sq_headers(self):
-        for file in self.sam_files:
+        for file in self.alignment_files:
             pr = pysam.AlignmentFile(file, check_sq=False)
             self.sq_headers[file] = pr.header.to_dict().get('SQ', [])
diff --git a/tiny/rna/util.py b/tiny/rna/util.py
index 1d8700df..1670ef3b 100644
--- a/tiny/rna/util.py
+++ b/tiny/rna/util.py
@@ -148,17 +148,6 @@ def _fill_text(self, text, width, indent):
     return '\n\n'.join(output_lines)
 
 
-def from_here(config_file, input_file):
-    """Calculates paths relative to the config file which contains them"""
-
-    config_file, input_file = (os.path.expanduser(p) for p in [config_file, input_file])
-
-    if not os.path.isabs(input_file):
-        from_here = os.path.dirname(config_file)
-        input_file = os.path.normpath(os.path.join(from_here, input_file))
-
-    return input_file
-
-
 def make_filename(args, ext='.csv'):
     return '_'.join([str(chnk) for chnk in args if chnk is not None]) + ext
diff --git a/tiny/templates/paths.yml b/tiny/templates/paths.yml
index a05f83fc..c72eebc0 100644
--- a/tiny/templates/paths.yml
+++ b/tiny/templates/paths.yml
@@ -1,14 +1,17 @@
 ############################## MAIN INPUT FILES FOR ANALYSIS ##############################
 #
-# Edit this section to provide the path to your Samples and Features sheets. Relative and
-# absolute paths are both allowed. All relative paths are relative to THIS config file.
+# Relative and absolute paths are both allowed.
+# All relative paths are evaluated relative to THIS config file.
 #
 # Directions:
 #   1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv]
 #   2. Fill out the Features Sheet with selection rules [features.csv]
-#   3. Set samples_csv and features_csv (below) to point to these files
+#   3. Set samples_csv and features_csv to point to these files
 #   4. Add annotation files and per-file alias preferences to gff_files (optional)
 #
+# If using the tinyRNA workflow, additionally set ebwt and/or reference_genome_files
+# in the BOWTIE-BUILD section.
+#
 ######-------------------------------------------------------------------------------######
 
 ##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --##
diff --git a/tiny/templates/run_config_template.yml b/tiny/templates/run_config_template.yml
index 5eb18add..b8eb7c22 100644
--- a/tiny/templates/run_config_template.yml
+++ b/tiny/templates/run_config_template.yml
@@ -41,9 +41,8 @@ run_native: false
 
 ######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------######
 #
-# If you do not already have bowtie indexes, they can be built for you by setting
-# run_bowtie_build (above) to true and adding your reference genome file(s) to your
-# paths_config file.
+# If you do not already have bowtie indexes, they can be built for you
+# (see the BOWTIE-BUILD section in the Paths File)
 #
 # We have specified default parameters for small RNA data based on our own "best practices".
 # You can change the parameters here.
diff --git a/tiny/templates/samples.csv b/tiny/templates/samples.csv
index 32305007..9d6a8c00 100755
--- a/tiny/templates/samples.csv
+++ b/tiny/templates/samples.csv
@@ -1,2 +1,2 @@
-FASTQ/SAM Files,Sample/Group Name,Replicate Number,Control,Normalization
+Input Files,Sample/Group Name,Replicate Number,Control,Normalization
 ,,,,
\ No newline at end of file
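The `Normalization` column in the renamed samples.csv template accepts a number, `rpm`, or an empty value. The fullmatch pattern added in `SamplesSheet.validate_normalization` can be checked in isolation (`is_valid_normalization` is an illustrative wrapper, not part of tinyRNA's API):

```python
import re

# Pattern copied from SamplesSheet.validate_normalization: a bare number
# (integer or decimal) or the literal "rpm", optionally padded with whitespace
NORM_RE = re.compile(r"\s*((?:\d+(?:\.\d*)?|\.\d+)|(rpm))\s*", re.IGNORECASE)

def is_valid_normalization(value: str) -> bool:
    # An empty cell is also accepted (no normalization preference)
    return bool(NORM_RE.fullmatch(value)) or not value
```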