diff --git a/README.md b/README.md
index 91b7d4df..f3903cc6 100644
--- a/README.md
+++ b/README.md
@@ -90,12 +90,12 @@ tiny get-template
### Requirements for User-Provided Input Files
-| Input Type | File Extension | Requirements |
-|------------|----------------|--------------|
-| Reference annotations [(example)](START_HERE/reference_data/ram1.gff3) | GFF3 / GFF2 / GTF | Column 9 attributes (defined as `tag=value` or `tag "value"`): <br/>- Each feature must have an `ID` or `gene_id` tag. <br/>- Feature classes can be defined with the `Class` tag. If undefined, the default value \_\_UNKNOWN\_\_ will be used. <br/>- Discontinuous features must be defined with the `Parent` tag whose value is the logical parent's `ID`, or by sharing the same `ID`. <br/>- Attribute values containing commas must represent lists. <br/>- All features must be stranded. <br/>- See the example link (left) for col. 9 formatting. |
-| Sequencing data [(example)](START_HERE/fastq_files) | FASTQ(.gz) | Files must be demultiplexed. |
-| Reference genome [(example)](START_HERE/reference_data/ram1.fa) | FASTA | Chromosome identifiers (e.g. Chr1): <br/>- Must match your reference annotation file chromosome identifiers <br/>- Are case sensitive |
-| Bowtie indexes (optional) 1 | ebwt | Must be small indexes (.ebwtl indexes are not supported) |
+| Input Type | File Extension | Requirements |
+|------------|----------------|--------------|
+| Reference annotations [(example)](START_HERE/reference_data/ram1.gff3) | GFF3 / GFF2 / GTF | Column 9 attributes (defined as `tag=value` or `tag "value"`): <br/>- Each feature must have an `ID`, `gene_id`, or `Parent` tag (referred to as `ID` henceforth). <br/>- Feature classes can be defined with the `Class` tag. If undefined, the default value \_\_UNKNOWN\_\_ will be used. <br/>- Discontinuous features must be defined with the `Parent` tag whose value is the logical parent's `ID`, or by sharing the same `ID`. <br/>- Attribute values containing commas must represent lists. <br/>- `Parent` tags with multiple values are not yet supported. <br/>- See the example link (left) for col. 9 formatting. |
+| Sequencing data [(example)](START_HERE/fastq_files) | FASTQ(.gz) | Files must be demultiplexed. |
+| Reference genome [(example)](START_HERE/reference_data/ram1.fa) | FASTA | Chromosome identifiers (e.g. Chr1): <br/>- Must match your reference annotation file chromosome identifiers <br/>- Are case sensitive |
+| Bowtie indexes (optional) 1 | ebwt | Must be small indexes (.ebwtl indexes are not supported) |
1 Bowtie indexes can be created for you. See the [configuration file documentation](doc/Configuration.md#building-bowtie-indexes).
diff --git a/START_HERE/run_config.yml b/START_HERE/run_config.yml
index 9de0e9cc..fc48ef7c 100644
--- a/START_HERE/run_config.yml
+++ b/START_HERE/run_config.yml
@@ -247,10 +247,6 @@ counter_type_filter: []
##-- Select the StepVector implementation that is used. Options: HTSeq or Cython --##
counter_stepvector: 'Cython'
-##-- If False: a feature with multiple values in its ID attribute is treated as an error --##
-##-- If True: multiple ID values are allowed, but only the first value is used --##
-counter_allow_multi_id: True
-
##-- If True: produce diagnostic logs to indicate what was eliminated and why --##
counter_diags: False
diff --git a/doc/Configuration.md b/doc/Configuration.md
index 464fadc5..4dbf5869 100644
--- a/doc/Configuration.md
+++ b/doc/Configuration.md
@@ -132,7 +132,7 @@ The Features Sheet allows you to define selection rules that determine how featu
Rules apply to features parsed from **all** Feature Sources, with the exception of "Alias by..." which only applies to the Feature Source on the same row. Selection first takes place against feature attributes (GFF column 9), and is directed by defining the attribute you want to be considered (Select for...) and the acceptable values for that attribute (with value...).
-Rules that match features in the first stage of selection will be used in a second stage which performs elimination by hierarchy and interval overlap characteristics. Remaining candidates pass to the third and final stage of selection which examines characteristics of the alignment itself: strand relative to the feature of interest, 5' end nucleotide, and length.
+Rules that match features in the first stage of selection are carried into a second stage, which evaluates the interval overlap between the alignment and each candidate feature. These matches are sorted by hierarchy value and passed to the third and final stage of selection, which examines characteristics of the alignment itself: strand relative to the feature of interest, 5' end nucleotide, and length.
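The staged ordering described above can be sketched in miniature. This is hypothetical illustration code, not tiny-count's implementation; `overlaps` and `passes_stage3` are placeholder callables standing in for the Stage 2 and Stage 3 checks:

```python
# Hypothetical sketch of the three-stage selection order described above.
# `candidates` are (rule-matched) features that passed Stage 1 identity selection.

def select(candidates, alignment, overlaps, passes_stage3):
    # Stage 2: keep candidates whose overlap requirement is satisfied
    stage2 = [c for c in candidates if overlaps(alignment, c)]

    # Sort by hierarchy value; smaller values take precedence
    stage2.sort(key=lambda c: c["hierarchy"])

    # Stage 3: the first candidate (by hierarchy) whose alignment
    # attributes match wins; later candidates are excluded
    for cand in stage2:
        if passes_stage3(alignment, cand):
            return cand
    return None
```

Candidates that survive Stage 2 are only ranked, not eliminated, by their hierarchy values; elimination happens once a lower-hierarchy candidate matches in Stage 3.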
See [tiny-count's documentation](tiny-count.md#feature-selection) for an explanation of each column.
diff --git a/doc/tiny-count.md b/doc/tiny-count.md
index 1439c635..ad5e1314 100644
--- a/doc/tiny-count.md
+++ b/doc/tiny-count.md
@@ -46,16 +46,24 @@ You can optionally specify a tag for each rule. Feature assignments resulting fr
| _features.csv columns:_ | Hierarchy | Overlap |
|-------------------------|-----------|---------|
-This stage of selection is concerned with the interval overlap between alignments and features. **Overlap is determined in a strandless fashion.** See the [Strand](#strand) section in Stage 3 for refinement of selections by strand.
+Features overlapping a read alignment are selected based on their overlap characteristics. These matches are then sorted by hierarchy value before proceeding to Stage 3.
### Overlap
-This column allows you to specify which read alignments should be assigned based on how their start and end points overlap with candidate features. Candidates for each matched rule can be selected using the following options:
+This column allows you to specify the extent of overlap required for candidate feature selection:
- `partial`: alignment overlaps feature by at least one base
- `full`: alignment does not extend beyond either terminus of the feature
- `exact`: alignment termini are equal to the feature's
+- `anchored`: alignment's start and/or end is equal to the feature's
- `5' anchored`: alignment's 5' end is equal to the corresponding terminus of the feature
- `3' anchored`: alignment's 3' end is equal to the corresponding terminus of the feature
+In order to be a candidate, a feature must match a rule in Stage 1, reside on the same chromosome as the alignment, and must overlap the alignment by at least 1 nucleotide.
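The strand-agnostic selectors above can be expressed as simple interval comparisons. This is a hypothetical sketch, not tiny-count's implementation; the `5' anchored`/`3' anchored` variants additionally need strand information to orient termini, so they are omitted here:

```python
# Hypothetical sketch of the strand-agnostic Overlap selectors.
# Intervals are 0-based, half-open [start, end).

def overlap_match(selector, aln, feat):
    """Return True if alignment interval `aln` satisfies `selector`
    with respect to feature interval `feat`."""
    (a_start, a_end), (f_start, f_end) = aln, feat

    if a_start >= f_end or a_end <= f_start:
        return False                                  # no shared base: never a candidate
    if selector == "partial":
        return True                                   # >= 1 shared base suffices
    if selector == "full":
        return a_start >= f_start and a_end <= f_end  # contained within the feature
    if selector == "exact":
        return (a_start, a_end) == (f_start, f_end)   # both termini equal
    if selector == "anchored":
        return a_start == f_start or a_end == f_end   # either terminus shared
    raise ValueError(f"Unknown selector: {selector}")
```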
+
+#### Strandedness and the Overlap Selector
+A feature does not have to be on the same strand as the alignment in order to be a candidate. See the [Strand](#strand) section in Stage 3 for selection by strand. Unstranded features will have `5' anchored` and `3' anchored` overlap selectors downgraded to `anchored` selectors. Alignments overlapping these features are evaluated for shared start and/or end coordinates, but 5'/3' ends are not distinguished.
+
+#### Selector Demonstration
+
-The following diagrams demonstrate the strand semantics of these interval selectors. The first two options show separate illustrations for features on each strand for emphasis. All matches shown in the remaining three options apply to features on either strand.
+The following diagrams demonstrate the strand semantics of these interval selectors. The first two options show separate illustrations for features on each strand for emphasis. All matches shown in the remaining options apply to features on either strand.


@@ -79,10 +87,14 @@ You can use larger hierarchy values to exclude features that are not of interest
The final stage of selection is concerned with the small RNA attributes of each alignment locus. Candidates are evaluated in order of hierarchy value where smaller values take precedence. Once a match has been found, reads are excluded from remaining candidates with larger hierarchy values.
### Strand
+This selector defines requirements for the alignment's strand relative to the feature's strand. Here, "sense" and "antisense" don't describe the strand of the feature or the alignment alone, but whether the alignment is sense or antisense relative to the feature.
- `sense`: the alignment strand must match the feature's strand for a match
- `antisense`: the alignment strand must not match the feature's strand for a match
- `both`: strand is not evaluated
+#### Unstranded Features
+These features will match all strand selectors regardless of the alignment's strand.
+
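The strand rules above, including the unstranded-feature behavior, can be sketched as follows. This is hypothetical illustration code, not tiny-count's implementation; `None` denotes an unstranded feature, mirroring the `strand_to_bool` helper elsewhere in this changeset:

```python
# Hypothetical sketch of the Strand selector. Unstranded features
# (strand=None) match every selector regardless of alignment strand.

def strand_match(selector, aln_strand, feat_strand):
    if feat_strand is None:                  # unstranded feature: always matches
        return True
    if selector == "both":
        return True                          # strand is not evaluated
    if selector == "sense":
        return aln_strand == feat_strand
    if selector == "antisense":
        return aln_strand != feat_strand
    raise ValueError(f"Unknown selector: {selector}")
```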
### 5' End Nucleotide and Length
| Parameter | Single | List | Range | Wildcard |
diff --git a/tests/testdata/counter/validation/ebwt/ram1.1.ebwt b/tests/testdata/counter/validation/ebwt/ram1.1.ebwt
new file mode 100644
index 00000000..67de2b90
Binary files /dev/null and b/tests/testdata/counter/validation/ebwt/ram1.1.ebwt differ
diff --git a/tests/testdata/counter/validation/ebwt/ram1.2.ebwt b/tests/testdata/counter/validation/ebwt/ram1.2.ebwt
new file mode 100644
index 00000000..a6a7d1e4
Binary files /dev/null and b/tests/testdata/counter/validation/ebwt/ram1.2.ebwt differ
diff --git a/tests/testdata/counter/validation/ebwt/ram1.3.ebwt b/tests/testdata/counter/validation/ebwt/ram1.3.ebwt
new file mode 100644
index 00000000..72d40b2d
Binary files /dev/null and b/tests/testdata/counter/validation/ebwt/ram1.3.ebwt differ
diff --git a/tests/testdata/counter/validation/ebwt/ram1.4.ebwt b/tests/testdata/counter/validation/ebwt/ram1.4.ebwt
new file mode 100644
index 00000000..ae543bb9
Binary files /dev/null and b/tests/testdata/counter/validation/ebwt/ram1.4.ebwt differ
diff --git a/tests/testdata/counter/validation/ebwt/ram1.rev.1.ebwt b/tests/testdata/counter/validation/ebwt/ram1.rev.1.ebwt
new file mode 100644
index 00000000..d1fbdec9
Binary files /dev/null and b/tests/testdata/counter/validation/ebwt/ram1.rev.1.ebwt differ
diff --git a/tests/testdata/counter/validation/ebwt/ram1.rev.2.ebwt b/tests/testdata/counter/validation/ebwt/ram1.rev.2.ebwt
new file mode 100644
index 00000000..f7851afe
Binary files /dev/null and b/tests/testdata/counter/validation/ebwt/ram1.rev.2.ebwt differ
diff --git a/tests/testdata/counter/validation/genome/genome.fasta b/tests/testdata/counter/validation/genome/genome.fasta
new file mode 100644
index 00000000..b6bb89fe
--- /dev/null
+++ b/tests/testdata/counter/validation/genome/genome.fasta
@@ -0,0 +1,6 @@
+>chr1
+ATGC
+>chr2
+CGTA
+>chr3
+ATATAT
\ No newline at end of file
diff --git a/tests/unit_test_helpers.py b/tests/unit_test_helpers.py
index 317dbbad..0084cfc1 100644
--- a/tests/unit_test_helpers.py
+++ b/tests/unit_test_helpers.py
@@ -135,7 +135,8 @@ def reassemble_gz_w(mock_calls):
# Converts strand character to a boolean value
def strand_to_bool(strand):
- assert strand in ['+', '-']
+ assert strand in ('+', '-', None)
+ if strand is None: return None
return strand == '+'
diff --git a/tests/unit_tests_collapser.py b/tests/unit_tests_collapser.py
index 3c8045ce..41d032e2 100644
--- a/tests/unit_tests_collapser.py
+++ b/tests/unit_tests_collapser.py
@@ -35,9 +35,9 @@ def setUpClass(self):
self.output = {"file": {
"out": {"exists": "mockPrefixExists_collapsed.fa", "dne": "mockPrefixDNE_collapsed.fa"},
"low": {"exists": "mockPrefixExists_collapsed_lowcounts.fa", "dne": "mockPrefixDNE_collapsed_lowcounts.fa"}}}
- self.output["msg"] = {k: "Collapser critical error: "+v['exists']+" already exists.\n"
+ self.output["msg"] = {k: "tiny-collapse critical error: "+v['exists']+" already exists.\n"
for k,v in self.output['file'].items()}
- self.prefix_required_msg = "Collapser critical error: an output file must be specified.\n"
+ self.prefix_required_msg = "tiny-collapse critical error: an output file must be specified.\n"
# Min-length fastq/fasta (single record)
self.min_seq = "GTTTTGTTGGGCTTTCGCGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCT"
@@ -351,7 +351,7 @@ def test_collapser_command(self):
with ShellCapture(f'tiny-collapse -i /dev/null -o {prefix}') as test:
test()
self.assertEqual(test.get_stdout(), '')
- self.assertIn(f"Collapser critical error: {expected_out_file} already exists.\n", test.get_stderr())
+ self.assertIn(f"tiny-collapse critical error: {expected_out_file} already exists.\n", test.get_stderr())
# (Very) roughly tests that the output file of the last test (same prefix) was not modified by this call
self.assertEqual(test_collapsed_fa_size, os.path.getsize(expected_out_file))
finally:
diff --git a/tests/unit_tests_hts_parsing.py b/tests/unit_tests_hts_parsing.py
index 1360631d..fe741193 100644
--- a/tests/unit_tests_hts_parsing.py
+++ b/tests/unit_tests_hts_parsing.py
@@ -204,7 +204,7 @@ def test_ref_tables_missing_id_attribute(self):
gff_row_without_id = helpers.read(self.short_gff_file).replace('ID=Gene:WBGene00023193;', '')
mock_reader = mock_open(read_data=gff_row_without_id)
- expected_err = f"Feature WBGene00023193 does not contain an ID attribute.\n"
+ expected_err = f"Feature WBGene00023193 does not have an ID attribute.\n"
expected_err += f"Error occurred on line 1 of {self.short_gff_file}"
with patch('tiny.rna.counter.hts_parsing.HTSeq.utils.open', new=mock_reader):
@@ -732,5 +732,59 @@ def test_CaseInsensitiveAttrs_contains_ident_wildcard(self):
self.assertFalse(cia.contains_ident(("attrkey4", Wildcard())))
self.assertFalse(cia.contains_ident((Wildcard(), "attrval7")))
+
+@unittest.skip("Long-running test, execute manually")
+class GenomeParsingTests(unittest.TestCase):
+ """Runs full-scale, unmodified GFF3/GTF genomes for select species through the ReferenceTables class"""
+
+ @classmethod
+ def setUpClass(self):
+ import requests
+ release = "54"
+ baseurl = "http://ftp.ensemblgenomes.org/pub/release-"
+ self.data_dir = "./testdata/local_only/gff/"
+ if not os.path.exists(self.data_dir): os.makedirs(self.data_dir)
+
+ self.urls = {
+ 'Arabidopsis':
+ baseurl + '%s/plants/{}/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.%s.{}.gz' % (release, release),
+ 'C. elegans':
+ baseurl + '%s/metazoa/{}/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.%s.{}.gz' % (release, release)
+ }
+
+ self.genomes = {
+ species: os.path.join(self.data_dir, os.path.basename(url))
+ for species, url in self.urls.items()
+ }
+
+ # Download genome files if they aren't already in the data dir
+ for ftype in ('gff3', 'gtf'):
+ for species, file_template in self.genomes.items():
+ file = file_template.format(ftype)
+ if os.path.isfile(file): continue
+
+ print(f"Downloading {ftype} genome for {species} from Ensembl...")
+ url = self.urls[species].format(ftype, ftype)
+ with requests.get(url, stream=True) as r:
+ r.raise_for_status()
+ with open(file, 'wb') as f:
+ for chunk in r.iter_content(chunk_size=16384):
+ f.write(chunk)
+
+ """Does ReferenceTables.get() process all genomes without throwing any errors?"""
+ def test_gff_megazord(self):
+ print("Running GFF Megazord test. This will take a long time...")
+
+ # Single rule with all wildcard selectors, but only Identity is actually relevant within ReferenceTables
+ rules = [{'Identity': ('', ''), 'Tag': '', 'Hierarchy': 0, 'Overlap': 'partial', 'Strand': '', 'nt5end': '', 'Length': ''}]
+ files = {gff.format(ftype): [] for gff in self.genomes.values() for ftype in ('gff3', 'gtf')}
+
+ fs = FeatureSelector(rules, LibraryStats())
+ rt = ReferenceTables(files, fs)
+
+ # The test is passed if this command
+ # completes without throwing errors.
+ rt.get()
+
if __name__ == '__main__':
unittest.main()
diff --git a/tests/unit_tests_validation.py b/tests/unit_tests_validation.py
new file mode 100644
index 00000000..a692c216
--- /dev/null
+++ b/tests/unit_tests_validation.py
@@ -0,0 +1,300 @@
+import contextlib
+import unittest
+import time
+import io
+
+from glob import glob
+from unittest.mock import patch, mock_open
+
+from tiny.rna.counter.validation import GFFValidator, ReportFormatter
+
+
+class ValidationTests(unittest.TestCase):
+
+ """======== Helper functions =========================== """
+
+ @classmethod
+ def setUpClass(self):
+ self.strand_ext_header = \
+ 'Unstranded features are allowed, but they can lead to potentially unexpected results.\n' \
+ 'These features will match "sense", "antisense", and "both" strand selectors. 5\'/3\' anchored\n' \
+ "overlap selectors for these features will evaluate for termini shared with the alignment,\n" \
+ "but will not distinguish between the alignment's 5' and 3' ends."
+
+ def make_gff_row(self, **cols):
+ template = {
+ 'seqid': "I",
+ 'source': "Wormbase",
+ 'type': "gene",
+ 'start': 3747,
+ 'end': 3909,
+ 'score': ".",
+ 'strand': "+",
+ 'phase': ".",
+ 'attrs': {
+ 'ID': 'featid',
+ 'gene_id': 'featid',
+ 'Parent': 'parentid'
+ }
+ }
+
+ new_gff = dict(template, **cols)
+ new_gff['attrs'] = ';'.join([f"{k}={v}" for k, v in new_gff['attrs'].items()])
+ return '\t'.join(map(str, new_gff.values()))
+
+ def make_gff_validator(self) -> GFFValidator:
+ return GFFValidator({}, {})
+
+ def mock_gff_open(self, contents):
+ mo = mock_open(read_data=contents)
+ return patch('tiny.rna.counter.hts_parsing.HTSeq.utils.open', mo)
+
+ """Does GFFValidator correctly validate strand information?"""
+
+ def test_gff_strand_validation(self):
+ validator = self.make_gff_validator()
+ mock_filename = '/dev/null'
+ mock_gff = '\n'.join([
+ self.make_gff_row(strand="+"), # Valid
+ self.make_gff_row(strand="-"), # Valid
+ self.make_gff_row(strand=".", type="chromosome"), # Valid
+ self.make_gff_row(strand=".") # Invalid
+ ])
+
+ expected = '\n'.join([
+ "The following issues were found in the GFF files provided. ",
+ self.strand_ext_header,
+ "\t" + f"{validator.targets['strand']}: ",
+ "\t\t" + f"1 missing in {mock_filename}"
+ ])
+
+ with self.mock_gff_open(mock_gff) as p:
+ validator.parse_and_validate_gffs({mock_filename: []})
+
+ self.assertListEqual(validator.report.warnings, [expected])
+ self.assertListEqual(validator.report.errors, [])
+
+ """Does GFFValidator correctly validate ID attributes?"""
+
+ def test_gff_id_validation(self):
+ validator = self.make_gff_validator()
+ mock_filename = '/dev/null'
+ mock_gff = '\n'.join([
+ self.make_gff_row(attrs={'ID': 'feat1'}), # Valid
+ self.make_gff_row(attrs={'gene_id': 'feat3'}), # Valid
+ self.make_gff_row(attrs={'Parent': 'feat5'}), # Valid
+ self.make_gff_row(attrs={'Gene id': 'feat6'}), # Invalid
+ self.make_gff_row(attrs={'other': 'feat7'}), # Invalid
+ self.make_gff_row(attrs={}), # Invalid
+ ])
+
+ expected = '\n'.join([
+ "The following issues were found in the GFF files provided. ",
+ "\t" + f"{validator.targets['ID attribute']}: ",
+ "\t\t" + f"3 missing in {mock_filename}"
+ ])
+
+ with patch('tiny.rna.counter.hts_parsing.HTSeq.utils.open', mock_open(read_data=mock_gff)) as p:
+ validator.parse_and_validate_gffs({mock_filename: []})
+
+ self.assertListEqual(validator.report.errors, [expected])
+ self.assertListEqual(validator.report.warnings, [])
+
+ """Does GFFValidator.chroms_shared_with_ebwt() correctly identify ebwt chromosomes?"""
+
+ def test_ebwt_chroms(self):
+ validator = self.make_gff_validator()
+ ebwt_prefix = "./testdata/counter/validation/ebwt/ram1"
+
+ # Chroms are shared
+ validator.chrom_set = {'ram1'}
+ shared, ebwt_chroms = validator.chroms_shared_with_ebwt(ebwt_prefix)
+ self.assertSetEqual(shared, validator.chrom_set)
+ self.assertSetEqual(shared, ebwt_chroms)
+
+ # Chroms aren't shared
+ validator.chrom_set = {'chr1', 'chr2', 'chr3'}
+ shared, ebwt_chroms = validator.chroms_shared_with_ebwt(ebwt_prefix)
+ self.assertSetEqual(shared, set())
+ self.assertSetEqual(ebwt_chroms, {'ram1'})
+
+ """Does GFFValidator.chroms_shared_with_genomes() correctly identify genome chromosomes?"""
+
+ def test_genome_chroms(self):
+ validator = self.make_gff_validator()
+ fasta_file = "./testdata/counter/validation/genome/genome.fasta"
+
+ # Chroms are shared
+ validator.chrom_set = {'chr1', 'chr2', 'chr3'}
+ shared, genome_chroms = validator.chroms_shared_with_genomes([fasta_file])
+ self.assertSetEqual(shared, validator.chrom_set)
+ self.assertSetEqual(shared, genome_chroms)
+
+ # Chroms aren't shared
+ validator.chrom_set = {'ram1'}
+ shared, genome_chroms = validator.chroms_shared_with_genomes([fasta_file])
+ self.assertSetEqual(shared, set())
+ self.assertSetEqual(genome_chroms, {'chr1', 'chr2', 'chr3'})
+
+ """Does GFFValidator's alignments heuristic identify potentially incompatible SAM files?"""
+
+ def test_alignments_heuristic(self):
+ validator = self.make_gff_validator()
+ sam_files = ['./testdata/counter/identity_choice_test.sam',
+ './testdata/counter/single.sam']
+
+ sam_chroms = {
+ sam_files[0]: {'I', 'V', 'MtDNA'},
+ sam_files[1]: {'I'}
+ }
+
+ # Some chroms are shared
+ validator.chrom_set = {'I', 'not_shared', 'also_not_shared'}
+ bad_sams = validator.alignment_chroms_mismatch_heuristic(sam_files)
+ self.assertDictEqual(bad_sams, {})
+
+ # Some chroms aren't shared
+ validator.chrom_set = {'V', 'not_shared', 'also_not_shared'}
+ bad_sams = validator.alignment_chroms_mismatch_heuristic(sam_files)
+ self.assertDictEqual(bad_sams, {sam_files[1]: sam_chroms[sam_files[1]]})
+
+ # Chroms aren't shared
+ validator.chrom_set = {'not_shared', 'also_not_shared'}
+ bad_sams = validator.alignment_chroms_mismatch_heuristic(sam_files)
+ self.assertDictEqual(bad_sams, sam_chroms)
+
+ # Chroms aren't shared, single SAM file input
+ validator.chrom_set = {'V'}
+ sam_file = sam_files[1]
+ bad_sams = validator.alignment_chroms_mismatch_heuristic([sam_file])
+ self.assertDictEqual(bad_sams, {sam_file: sam_chroms[sam_file]})
+
+ """Does GFFValidator.generate_gff_report() correctly process an infractions report?"""
+
+ def test_gff_report_output(self):
+ validator = self.make_gff_validator()
+ infractions = {
+ 'ID attribute': {'gff1': 10, 'gff2': 1},
+ 'strand': {'gff1': 5, 'gff3': 1}
+ }
+
+ gff_sections_header = "The following issues were found in the GFF files provided. "
+ expected_errors = '\n'.join([
+ gff_sections_header,
+ "\t" + f"{validator.targets['ID attribute']}: ",
+ "\t\t" + "10 missing in gff1",
+ "\t\t" + "1 missing in gff2"
+ ])
+
+ expected_warnings = '\n'.join([
+ '\n'.join([gff_sections_header, self.strand_ext_header]),
+ "\t" + f"{validator.targets['strand']}: ",
+ "\t\t" + "5 missing in gff1",
+ "\t\t" + "1 missing in gff3"
+ ])
+
+ validator.generate_gff_report(infractions)
+ self.assertEqual(validator.report.errors[0], expected_errors)
+ self.assertEqual(validator.report.warnings[0], expected_warnings)
+
+ """Does GFFValidator.generate_chrom_report() correctly process an infractions report?"""
+
+ def test_chrom_report_output(self):
+ validator = self.make_gff_validator()
+ validator.chrom_set = {'chr1', 'chr2'}
+ seq_chroms = {'chr3', 'chr4'}
+ shared_chroms = validator.chrom_set & seq_chroms
+
+ exp_errors = '\n'.join([
+ "GFF files and sequence files don't share any chromosome identifiers.",
+ "\t" + f"{validator.targets['gff chromosomes']}: ",
+ "\t\t" + "chr1",
+ "\t\t" + "chr2",
+ "\t" + f"{validator.targets['seq chromosomes']}: ",
+ "\t\t" + "chr3",
+ "\t\t" + "chr4"
+ ])
+
+ validator.generate_chrom_report(shared_chroms, seq_chroms)
+ self.assertEqual(validator.report.errors[0], exp_errors)
+
+ """Does GFFValidator.generate_chrom_heuristics_report() correctly process an infractions report?"""
+
+ def test_chrom_heuristics_report_output(self):
+ validator = self.make_gff_validator()
+ validator.chrom_set = {'chr1', 'chr2'}
+ suspect_files = {
+ 'sam1': {'chr3', 'chr4'},
+ 'sam2': {'chr5', 'chr6'}
+ }
+
+ exp_warnings = '\n'.join([
+ "GFF files and sequence files might not contain the same chromosome identifiers.",
+ "This is determined from a subset of each sequence file, so false positives may be reported.",
+ "\t" + f"{validator.targets['sam files']}: ",
+ "\t\t" + "sam1: ",
+ "\t\t\t" + f"Chromosomes sampled: {', '.join(sorted(suspect_files['sam1']))}",
+ "\t\t" + "sam2: ",
+ "\t\t\t" + f"Chromosomes sampled: {', '.join(sorted(suspect_files['sam2']))}",
+ "\t" + f"{validator.targets['gff chromosomes']}: ",
+ "\t\t" + "chr1",
+ "\t\t" + "chr2"
+ ])
+
+ validator.generate_chrom_heuristics_report(suspect_files)
+ self.assertEqual(validator.report.warnings[0], exp_warnings)
+
+ """Does ReportFormatter add and print all sections with correct formatting?"""
+
+ def test_report_multi_section(self):
+ key_mapper = {'short': "long description"}
+ formatter = ReportFormatter(key_mapper)
+
+ formatter.add_warning_section("Header only")
+ formatter.add_warning_section("Header 2", {'short': 5, 'short2': [1,2,3]})
+ formatter.add_error_section("Header 3", {'short': {'level2a': {4,5,6}, 'level2b': 'msg'}})
+
+ expected_report = '\n'.join([
+ ReportFormatter.error_header,
+ "Header 3",
+ "\t" + "long description: ",
+ "\t\t" + "level2a: ",
+ *["\t\t\t" + str(i) for i in [4,5,6]],
+ "\t\t" + "level2b: msg",
+ "",
+ ReportFormatter.warning_header,
+ "Header only",
+ "",
+ ReportFormatter.warning_header,
+ "Header 2",
+ "\t" + "long description: 5",
+ "\t" + "short2: ",
+ *["\t\t" + str(i) for i in [1,2,3]],
+ "",
+ ""
+ ])
+
+ stdout = io.StringIO()
+ with contextlib.redirect_stdout(stdout):
+ formatter.print_report()
+
+ self.assertEqual(stdout.getvalue(), expected_report)
+
+ """Do chromosome heuristics run in 2 seconds or less for a full-size test dataset?"""
+
+ def test_chrom_heuristics_runtime(self):
+ validator = self.make_gff_validator()
+ validator.chrom_set = {'none match'}
+ files = glob("./testdata/local_only/sam/full/*.sam")
+
+ start = time.time()
+ _ = validator.alignment_chroms_mismatch_heuristic(files)
+ end = time.time()
+
+ print(f"Chromosome heuristics runtime: {end-start:.2f}s")
+ self.assertLessEqual(end-start, 2)
+
+
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/tiny/cwl/tools/tiny-count.cwl b/tiny/cwl/tools/tiny-count.cwl
index cfaf77b4..419aa730 100644
--- a/tiny/cwl/tools/tiny-count.cwl
+++ b/tiny/cwl/tools/tiny-count.cwl
@@ -58,11 +58,6 @@ inputs:
inputBinding:
prefix: -sv
- multi_id:
- type: boolean?
- inputBinding:
- prefix: -md
-
all_features:
type: boolean?
inputBinding:
diff --git a/tiny/cwl/tools/tiny-plot.cwl b/tiny/cwl/tools/tiny-plot.cwl
index b4490b77..732f7579 100644
--- a/tiny/cwl/tools/tiny-plot.cwl
+++ b/tiny/cwl/tools/tiny-plot.cwl
@@ -87,10 +87,36 @@ inputs:
doc: "A list of desired plot types to produce"
outputs:
- plots:
- type: File[]?
+
+ len_dist:
+ type: Directory?
+ outputBinding:
+ glob: len_dist
+
+ rule_chart:
+ type: Directory?
+ outputBinding:
+ glob: rule_chart
+
+ class_chart:
+ type: Directory?
+ outputBinding:
+ glob: class_chart
+
+ replicate_scatter:
+ type: Directory?
+ outputBinding:
+ glob: replicate_scatter
+
+ sample_avg_scatter_by_dge:
+ type: Directory?
+ outputBinding:
+ glob: scatter_by_dge
+
+ sample_avg_scatter_by_dge_class:
+ type: Directory?
outputBinding:
- glob: "*.pdf"
+ glob: scatter_by_dge_class
console_output:
type: stdout
\ No newline at end of file
diff --git a/tiny/cwl/workflows/tinyrna_wf.cwl b/tiny/cwl/workflows/tinyrna_wf.cwl
index ccd59926..06fb79b1 100644
--- a/tiny/cwl/workflows/tinyrna_wf.cwl
+++ b/tiny/cwl/workflows/tinyrna_wf.cwl
@@ -87,7 +87,6 @@ inputs:
counter_all_features: boolean?
counter_type_filter: string[]?
counter_source_filter: string[]?
- counter_allow_multi_id: boolean?
counter_normalize_by_hits: boolean?
# deseq inputs
@@ -215,7 +214,6 @@ steps:
source: counter_normalize_by_hits
valueFrom: $(String(self)) # convert boolean -> string
decollapse: counter_decollapse
- multi_id: counter_allow_multi_id
stepvector: counter_stepvector
is_pipeline: {default: true}
diagnostics: counter_diags
@@ -263,7 +261,14 @@ steps:
out_prefix: run_name
plot_requests: plot_requests
vector_scatter: plot_vector_points
- out: [plots, console_output]
+ out:
+ - console_output
+ - len_dist
+ - rule_chart
+ - class_chart
+ - replicate_scatter
+ - sample_avg_scatter_by_dge
+ - sample_avg_scatter_by_dge_class
organize_bt_indexes:
run: ../tools/make-subdir.cwl
@@ -327,7 +332,9 @@ steps:
run: ../tools/make-subdir.cwl
in:
dir_files:
- source: [ plotter/plots, plotter/console_output, dge/pca_plot ]
+ source: [plotter/len_dist, plotter/rule_chart, plotter/class_chart, plotter/replicate_scatter,
+ plotter/sample_avg_scatter_by_dge, plotter/sample_avg_scatter_by_dge_class,
+ dge/pca_plot, plotter/console_output]
pickValue: all_non_null
dir_name: dir_name_plotter
out: [ subdir ]
diff --git a/tiny/entry.py b/tiny/entry.py
index 9db63c74..feccf61d 100644
--- a/tiny/entry.py
+++ b/tiny/entry.py
@@ -87,7 +87,7 @@ def run(tinyrna_cwl_path: str, config_file: str) -> None:
print("Running the end-to-end analysis...")
# First get the configuration file set up for this run
- config_object = Configuration(config_file)
+ config_object = Configuration(config_file, validate_inputs=True)
run_directory = config_object.create_run_directory()
config_object.save_run_profile()
diff --git a/tiny/rna/collapser.py b/tiny/rna/collapser.py
index aff1eb32..4e3504ff 100644
--- a/tiny/rna/collapser.py
+++ b/tiny/rna/collapser.py
@@ -159,7 +159,7 @@ def seq2fasta(seqs: dict, out_prefix: str, thresh: int = 0, gz: bool = False) ->
Returns: None
"""
- assert out_prefix is not None, "Collapser critical error: an output file prefix must be specified."
+ assert out_prefix is not None, "tiny-collapse critical error: an output file prefix must be specified."
assert thresh >= 0, "An invalid threshold was specified."
writer, encoder, mode = fasta_interface(gz)
@@ -189,7 +189,7 @@ def look_before_you_leap(out_prefix: str, gz: bool) -> (str, str):
candidates = tuple(f"{out_prefix}{file}{ext}" for file in ["_collapsed", "_collapsed_lowcounts"])
for file in candidates:
if os.path.isfile(file):
- raise FileExistsError(f"Collapser critical error: {file} already exists.")
+ raise FileExistsError(f"tiny-collapse critical error: {file} already exists.")
return candidates
diff --git a/tiny/rna/configuration.py b/tiny/rna/configuration.py
index ebaf0c10..dde86823 100644
--- a/tiny/rna/configuration.py
+++ b/tiny/rna/configuration.py
@@ -12,6 +12,8 @@
from datetime import datetime
from typing import Union, Any
+from tiny.rna.counter.validation import GFFValidator
+
timestamp_format = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}")
@@ -160,7 +162,7 @@ class Configuration(ConfigBase):
appropriately if 'run_bowtie_index' is set to 'true'
"""
- def __init__(self, config_file: str):
+ def __init__(self, config_file: str, validate_inputs=False):
# Parse YAML configuration file
super().__init__(config_file)
@@ -172,6 +174,7 @@ def __init__(self, config_file: str):
self.setup_ebwt_idx()
self.process_sample_sheet()
self.process_feature_sheet()
+ if validate_inputs: self.validate_inputs()
def load_paths_config(self):
"""Constructs a sub-configuration object containing workflow file preferences"""
@@ -309,6 +312,21 @@ def setup_ebwt_idx(self):
# preserve the prefix path. What the workflow "sees" is the ebwt files at working dir root
self["ebwt"] = os.path.basename(self["ebwt"])
+ def validate_inputs(self):
+ """For now, only GFF files are validated here"""
+
+ gff_files = {gff['path']: [] for gff in self['gff_files']}
+ prefs = {x: self[f'{x}_filter'] for x in ['source', 'type']}
+ ebwt = self.paths['ebwt'] if not self['run_bowtie_build'] else None
+
+ GFFValidator(
+ gff_files,
+ prefs=prefs,
+ ebwt=ebwt,
+ genomes=self.paths['reference_genome_files'],
+ alignments=None # Used in tiny-count standalone runs
+ ).validate()
+
def save_run_profile(self, config_file_name=None) -> str:
"""Saves Samples Sheet and processed run config to the Run Directory for record keeping"""
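As an aside, the arguments assembled for `GFFValidator` in `validate_inputs()` above are plain containers. Their shapes can be sketched in isolation with hypothetical config values (the keys below are illustrative stand-ins, not the project's full schema):

```python
# Hypothetical configuration values, for illustration only
config = {
    'gff_files': [{'path': 'a.gff3'}, {'path': 'b.gff3'}, {'path': 'a.gff3'}],
    'source_filter': [],
    'type_filter': ['miRNA'],
}

# Repeated paths collapse into a single key, mirroring the dict
# comprehension in validate_inputs()
gff_files = {gff['path']: [] for gff in config['gff_files']}

# Stage 1 filter preferences keyed by filter name
prefs = {x: config[f'{x}_filter'] for x in ['source', 'type']}

print(gff_files)  # {'a.gff3': [], 'b.gff3': []}
print(prefs)      # {'source': [], 'type': ['miRNA']}
```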
diff --git a/tiny/rna/counter/counter.py b/tiny/rna/counter/counter.py
index c81336d9..c7f48f0b 100644
--- a/tiny/rna/counter/counter.py
+++ b/tiny/rna/counter/counter.py
@@ -13,9 +13,10 @@
from collections import defaultdict
from typing import Tuple, List, Dict
+from tiny.rna.counter.validation import GFFValidator
from tiny.rna.counter.features import Features, FeatureCounter
from tiny.rna.counter.statistics import MergedStatsManager
-from tiny.rna.util import report_execution_time, from_here
+from tiny.rna.util import report_execution_time, from_here, ReadOnlyDict
from tiny.rna.configuration import CSVReader
# Global variables for multiprocessing
@@ -54,9 +55,6 @@ def get_args():
optional_args.add_argument('-sv', '--stepvector', choices=['Cython', 'HTSeq'], default='Cython',
help='Select which StepVector implementation is used to find '
'features overlapping an interval.')
- optional_args.add_argument('-md', '--multi-id', action='store_true',
- help="Don't treat features with multiple ID values as an error. "
- "Only the first value will be used as the feature's ID.")
optional_args.add_argument('-a', '--all-features', action='store_true',
help='Represent all features in output counts table, '
'even if they did not match a Select for / with value.')
@@ -70,7 +68,7 @@ def get_args():
args = arg_parser.parse_args()
setattr(args, 'normalize_by_hits', args.normalize_by_hits.lower() in ['t', 'true'])
- return args
+ return ReadOnlyDict(vars(args))
def load_samples(samples_csv: str, is_pipeline: bool) -> List[Dict[str, str]]:
@@ -160,6 +158,12 @@ def load_config(features_csv: str, is_pipeline: bool) -> Tuple[List[dict], Dict[
return rules, gff_files
+def validate_inputs(gffs, libraries, prefs):
+ if prefs.get('is_pipeline'): return
+ libraries = [lib['File'] for lib in libraries]
+ GFFValidator(gffs, prefs, alignments=libraries).validate()
+
+
@report_execution_time("Counting and merging")
def map_and_reduce(libraries, prefs):
"""Assigns one worker process per library and merges the statistics they report"""
@@ -189,23 +193,25 @@ def main():
try:
# Determine SAM inputs and their associated library names
- libraries = load_samples(args.samples_csv, args.is_pipeline)
+ libraries = load_samples(args['samples_csv'], args['is_pipeline'])
# Load selection rules and feature sources from the Features Sheet
- selection_rules, gff_file_set = load_config(args.features_csv, args.is_pipeline)
+ selection_rules, gff_file_set = load_config(args['features_csv'], args['is_pipeline'])
+ validate_inputs(gff_file_set, libraries, args)
# global for multiprocessing
global counter
- counter = FeatureCounter(gff_file_set, selection_rules, **vars(args))
+ counter = FeatureCounter(gff_file_set, selection_rules, **args)
# Assign and count features using multiprocessing and merge results
- merged_counts = map_and_reduce(libraries, vars(args))
+ merged_counts = map_and_reduce(libraries, args)
# Write final outputs
merged_counts.write_report_files()
- except:
+    except BaseException as e:
+        # SystemExit (raised by input validation) propagates without printing a traceback
+        if isinstance(e, SystemExit): raise
traceback.print_exception(*sys.exc_info())
- if args.is_pipeline:
+ if args['is_pipeline']:
print("\n\ntiny-count encountered an error. Don't worry! You don't have to start over.\n"
"You can resume the pipeline at tiny-count. To do so:\n\t"
"1. cd into your Run Directory\n\t"
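Swapping argparse's `Namespace` for a `ReadOnlyDict` changes call sites from attribute access (`args.samples_csv`) to key access (`args['samples_csv']`) while blocking accidental mutation. A minimal sketch of the pattern, with a stand-in `ReadOnlyDict` mirroring the one added to `tiny/rna/util.py` and a hypothetical argument:

```python
import argparse

class ReadOnlyDict(dict):
    """Raises on item assignment; all other dict methods are inherited as-is"""
    def __setitem__(self, *_):
        raise RuntimeError("Attempted to modify read-only dictionary after construction.")

parser = argparse.ArgumentParser()
parser.add_argument('--samples-csv', default='samples.csv')  # hypothetical argument

# The same conversion performed by get_args(): Namespace -> read-only dict
args = ReadOnlyDict(vars(parser.parse_args([])))

print(args['samples_csv'])    # key access replaces args.samples_csv
try:
    args['samples_csv'] = 'other.csv'
except RuntimeError as e:
    print(e)                  # mutation is rejected
```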
diff --git a/tiny/rna/counter/features.py b/tiny/rna/counter/features.py
index 22bb7038..675d5c45 100644
--- a/tiny/rna/counter/features.py
+++ b/tiny/rna/counter/features.py
@@ -9,7 +9,8 @@
from .matching import *
# Type aliases for human readability
-match_tuple = Tuple[int, int, IntervalSelector] # (rank, rule, interval selector)
+match_tuple = Tuple[int, int, IntervalSelector] # (rank, rule, IntervalSelector)
+unbuilt_match_tuple = Tuple[int, int, str] # (rank, rule, interval selector keyword)
feature_record_tuple = Tuple[str, str, Tuple[match_tuple]] # (feature ID, strand, match tuple)
@@ -146,7 +147,7 @@ def choose(cls, candidates: Set[feature_record_tuple], alignment: dict) -> Mappi
rule_def = FeatureSelector.rules_table[rule]
if alignment['nt5end'] not in rule_def["nt5end"]: continue
if alignment['Length'] not in rule_def["Length"]: continue
- if alignment['Strand'] ^ strand not in rule_def["Strand"]: continue
+ if (alignment['Strand'], strand) not in rule_def["Strand"]: continue
selections[feat].add(rule)
min_rank = rank
@@ -188,7 +189,7 @@ def build_selectors(rules_table) -> List[dict]:
return rules_table
@staticmethod
- def build_interval_selectors(iv: 'HTSeq.GenomicInterval', match_tuples: List[Tuple]):
+ def build_interval_selectors(iv: 'HTSeq.GenomicInterval', match_tuples: List[unbuilt_match_tuple]):
"""Builds partial/full/exact/3' anchored/5' anchored interval selectors
Unlike build_selectors() and build_inverted_identities(), this function
@@ -208,8 +209,9 @@ def build_interval_selectors(iv: 'HTSeq.GenomicInterval', match_tuples: List[Tup
'full': lambda: IntervalFullMatch(iv),
'exact': lambda: IntervalExactMatch(iv),
'partial': lambda: IntervalPartialMatch(iv),
- "5' anchored": lambda: Interval5pMatch(iv),
- "3' anchored": lambda: Interval3pMatch(iv),
+ 'anchored': lambda: IntervalAnchorMatch(iv),
+ "5' anchored": lambda: Interval5pMatch(iv) if iv.strand in ('+', '-') else IntervalAnchorMatch(iv),
+ "3' anchored": lambda: Interval3pMatch(iv) if iv.strand in ('+', '-') else IntervalAnchorMatch(iv),
}
for i in range(len(match_tuples)):
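The selector table in `build_interval_selectors()` above is a keyword-to-factory dispatch: each entry is a lambda so that only the requested selector gets constructed, and the 5'/3' anchored entries fall back to an unstranded anchor when the feature lacks strand information. A stripped-down sketch of that pattern (the selector classes here are toy stand-ins, not the project's implementations):

```python
class Anchor:
    """Stand-in for the unstranded anchored selector"""
    def __init__(self, iv): self.iv = iv

class FivePrime:
    """Stand-in for the 5' anchored selector"""
    def __init__(self, iv): self.iv = iv

def make_selector(keyword, iv, strand):
    factories = {
        'anchored': lambda: Anchor(iv),
        # Unstranded features can't distinguish 5' from 3', so fall back to Anchor
        "5' anchored": lambda: FivePrime(iv) if strand in ('+', '-') else Anchor(iv),
    }
    return factories[keyword]()  # built lazily: only the requested selector is constructed

print(type(make_selector("5' anchored", (10, 20), '.')).__name__)  # Anchor
```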
diff --git a/tiny/rna/counter/hts_parsing.py b/tiny/rna/counter/hts_parsing.py
index cccbdce4..4e577285 100644
--- a/tiny/rna/counter/hts_parsing.py
+++ b/tiny/rna/counter/hts_parsing.py
@@ -1,10 +1,12 @@
+import functools
import os.path
import HTSeq
import sys
import re
from collections import Counter, defaultdict
-from typing import Tuple, List, Dict, Iterator, Optional, DefaultDict, Set, Union, IO
+from typing import Tuple, List, Dict, Iterator, Optional, DefaultDict, Set, Union, IO, Callable
+from urllib.parse import unquote
from inspect import stack
from tiny.rna.counter.matching import Wildcard
@@ -257,10 +259,10 @@ def infer_strandedness(sam_file: str, intervals: dict) -> str:
def parse_GFF_attribute_string(attrStr, extra_return_first_value=False, gff_version=2):
"""Parses a GFF attribute string and returns it as a dictionary.
- This is a slight modification of the same method found in HTSeq.features.
- It has been adapted to parse comma separated attribute values as separate values.
- Values are stored in a set for ease of handling in ReferenceTables and because
- duplicate values don't make sense in this context.
+ This slight modification of the same method found in HTSeq.features includes
+ the following for improved compliance with the GFF format:
+ - Attribute values containing commas are tokenized.
+ - Attribute keys are URL decoded. Values are URL decoded after tokenization.
Per the original HTSeq docstring:
"If 'extra_return_first_value' is set, a pair is returned: the dictionary
@@ -293,9 +295,9 @@ def parse_GFF_attribute_string(attrStr, extra_return_first_value=False, gff_vers
if (gff_version == 2) and val.startswith('"') and val.endswith('"'):
val = val[1:-1]
# Modification: allow for comma separated attribute values
- attribute_dict[key] = (val,) \
+ attribute_dict[unquote(key)] = (unquote(val),) \
if ',' not in val \
- else tuple(c.strip() for c in val.split(','))
+ else tuple(unquote(c.strip()) for c in val.split(','))
if extra_return_first_value and i == 0:
first_val = val
if extra_return_first_value:
@@ -406,6 +408,25 @@ def not_implemented(self):
raise NotImplementedError(f"CaseInsensitiveAttrs does not support {stack()[1].function}")
+def parse_gff(file, row_fn: Callable, alias_keys=None):
+    """Applies row_fn to each row of the GFF file, adding line number context to any errors"""
+
+    if alias_keys is not None:
+        row_fn = functools.partial(row_fn, alias_keys=alias_keys)
+
+ gff = HTSeq.GFF_Reader(file)
+ try:
+ for row in gff:
+ row_fn(row)
+ except Exception as e:
+ # Append to error message while preserving exception provenance and traceback
+ extended_msg = f"Error occurred on line {gff.line_no} of {file}"
+ if type(e) is KeyError:
+ e.args += (extended_msg,)
+ else:
+ primary_msg = "%s\n%s" % (str(e.args[0]), extended_msg)
+ e.args = (primary_msg,) + e.args[1:]
+ raise
+
+
# Type aliases for human readability
ClassTable = AliasTable = DefaultDict[str, Tuple[str]]
StepVector = HTSeq.GenomicArrayOfSets
@@ -441,25 +462,12 @@ class ReferenceTables:
def __init__(self, gff_files: Dict[str, list], feature_selector, **prefs):
self.all_features = prefs.get('all_features', False)
- self.allow_multi_id = prefs.get('multi_id', False)
+ self.stepvector = prefs.get('stepvector', 'HTSeq')
self.selector = feature_selector
self._set_filters(**prefs)
self.gff_files = gff_files
# ----------------------------------------------------------- Primary Key:
- if prefs['stepvector'] == 'Cython':
- try:
- from tiny.rna.counter.stepvector import StepVector
- setattr(HTSeq.StepVector, 'StepVector', StepVector)
- self.feats = HTSeq.GenomicArray("auto", stranded=False) # Root Match ID
- except ModuleNotFoundError:
- prefs['stepvector'] = 'HTSeq'
- print("Failed to import Cython StepVector\n"
- "Falling back to HTSeq's StepVector",
- file=sys.stderr)
-
- if prefs['stepvector'] == 'HTSeq':
- self.feats = HTSeq.GenomicArrayOfSets("auto", stranded=False) # Root Match ID
-
+ self.feats = self._init_genomic_array() # Root Match ID
self.parents, self.filtered = {}, set() # Original Feature ID
self.intervals = defaultdict(list) # Root Feature ID
self.matches = defaultdict(set) # Root Match ID
@@ -475,34 +483,28 @@ def get(self) -> Tuple[StepVector, AliasTable, ClassTable, dict]:
"""Initiates GFF parsing and returns the resulting reference tables"""
for file, alias_keys in self.gff_files.items():
- gff = HTSeq.GFF_Reader(file)
- try:
- for row in gff:
- if row.iv.strand == ".":
- raise ValueError(f"Feature {row.name} in {file} has no strand information.")
- if not self.filter_match(row):
- self.exclude_row(row)
- continue
-
- # Grab the primary key for this feature
- feature_id = self.get_feature_id(row)
- # Get feature's classes and identity match tuples
- matches, classes = self.get_matches_and_classes(row.attr)
- # Only add features with identity matches if all_features is False
- if not self.all_features and not len(matches):
- self.exclude_row(row)
- continue
- # Add feature data to root ancestor in the reference tables
- root_id = self.add_feature(feature_id, row, matches, classes)
- # Add alias to root ancestor if it is unique
- self.add_alias(root_id, alias_keys, row.attr)
- except Exception as e:
- # Append to error message while preserving exception provenance and traceback
- e.args = (str(e.args[0]) + "\nError occurred on line %d of %s" % (gff.line_no, file),)
- raise e.with_traceback(sys.exc_info()[2]) from e
+ parse_gff(file, self.parse_row, alias_keys=alias_keys)
return self.finalize_tables()
+ def parse_row(self, row, alias_keys=None):
+ if row.type.lower() == "chromosome" or not self.filter_match(row):
+ self.exclude_row(row)
+ return
+
+ # Grab the primary key for this feature
+ feature_id = self.get_feature_id(row)
+ # Get feature's classes and identity match tuples
+ matches, classes = self.get_matches_and_classes(row.attr)
+ # Only add features with identity matches if all_features is False
+ if not self.all_features and not len(matches):
+ self.exclude_row(row)
+ return
+ # Add feature data to root ancestor in the reference tables
+ root_id = self.add_feature(feature_id, row, matches, classes)
+ # Add alias to root ancestor if it is unique
+ self.add_alias(root_id, alias_keys, row.attr)
+
def get_root_feature(self, lineage: list) -> str:
"""Returns the highest feature ID in the ancestral tree which passed stage 1 selection.
The original feature ID is returned if there are no valid ancestors."""
@@ -588,7 +590,7 @@ def get_row_parent(self, feature_id: str, row_attrs: CaseInsensitiveAttrs) -> st
if len(parent_attr) > 1:
raise ValueError(f"{feature_id} defines multiple parents which is unsupported at this time.")
- if len(parent_attr) == 0 or parent is None:
+ if parent in (None, feature_id):
return feature_id
if (parent not in self.tags # If parent is not a root feature
and parent not in self.parents # If parent doesn't have a parent itself
@@ -648,7 +650,8 @@ def _finalize_features(self):
for sub_iv in merged_sub_ivs:
finalized_match_tuples = self.selector.build_interval_selectors(sub_iv, sorted_matches.copy())
- self.feats[sub_iv] += (tagged_id, sub_iv.strand == '+', tuple(finalized_match_tuples))
+ strand = self.map_strand(sub_iv.strand)
+ self.feats[sub_iv] += (tagged_id, strand, tuple(finalized_match_tuples))
@staticmethod
def _merge_adjacent_subintervals(unmerged_sub_ivs: List[HTSeq.GenomicInterval]) -> list:
@@ -687,6 +690,15 @@ def get_feats_table_size(self) -> int:
return total_feats
+ def map_strand(self, gff_value: str):
+ """Maps HTSeq's strand representation (+/-/.) to
+ tinyRNA's strand representation (True/False/None)"""
+
+ return {
+ '+': True,
+ '-': False,
+ }.get(gff_value, None)
+
def was_matched(self, untagged_id):
"""Checks if the feature ID previously matched on identity, regardless of whether
the matching rule was tagged or untagged."""
@@ -701,20 +713,36 @@ def chrom_vector_setdefault(self, chrom):
if chrom not in self.feats.chrom_vectors:
self.feats.add_chrom(chrom)
- def get_feature_id(self, row):
- id_collection = row.attr.get('ID', default=
- row.attr.get('gene_id', default=None))
+ @staticmethod
+ def get_feature_id(row):
+ id_collection = row.attr.get('ID') \
+ or row.attr.get('gene_id') \
+ or row.attr.get('Parent')
if id_collection is None:
- raise ValueError(f"Feature {row.name} does not contain an ID attribute.")
+ raise ValueError(f"Feature {row.name} does not have an ID attribute.")
if len(id_collection) == 0:
raise ValueError("A feature's ID attribute cannot be empty. This value is required.")
- if len(id_collection) > 1 and not self.allow_multi_id:
- err_msg = "A feature's ID attribute cannot contain multiple values. Only one ID per feature is allowed."
- raise ValueError(err_msg)
+ if len(id_collection) > 1:
+ return ','.join(id_collection)
return id_collection[0]
+ def _init_genomic_array(self):
+ if self.stepvector == 'Cython':
+ try:
+ from tiny.rna.counter.stepvector import StepVector
+ setattr(HTSeq.StepVector, 'StepVector', StepVector)
+ return HTSeq.GenomicArray("auto", stranded=False)
+ except ModuleNotFoundError:
+ self.stepvector = 'HTSeq'
+ print("Failed to import Cython StepVector\n"
+ "Falling back to HTSeq's StepVector",
+ file=sys.stderr)
+
+ if self.stepvector == 'HTSeq':
+ return HTSeq.GenomicArrayOfSets("auto", stranded=False)
+
@classmethod
def _set_filters(cls, **kwargs):
"""Assigns inclusive filter values"""
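The comma tokenization and URL decoding described in the `parse_GFF_attribute_string()` docstring can be illustrated with a toy attribute parser. This is a simplification (the real method also handles GTF-style quoting and the `extra_return_first_value` flag), but the ordering matters the same way: values are split on commas first, then each token is URL decoded, so an encoded comma (`%2C`) survives as a literal character inside a single token.

```python
from urllib.parse import unquote

def parse_attrs(attr_str):
    """Toy GFF3 column 9 parser: tag=value pairs, comma lists, URL decoding"""
    attrs = {}
    for field in attr_str.strip().split(';'):
        if not field: continue
        key, _, val = field.strip().partition('=')
        # Tokenize on commas before decoding, so %2C stays within its token
        attrs[unquote(key)] = tuple(unquote(c.strip()) for c in val.split(','))
    return attrs

attrs = parse_attrs("ID=gene1;Note=foo%2Cbar,baz")
print(attrs['Note'])  # ('foo,bar', 'baz')
```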
diff --git a/tiny/rna/counter/matching.py b/tiny/rna/counter/matching.py
index 5382b9a6..a7e7ecc9 100644
--- a/tiny/rna/counter/matching.py
+++ b/tiny/rna/counter/matching.py
@@ -1,7 +1,9 @@
"""Selector classes which determine a match via __contains__()"""
-import re
import HTSeq
+import re
+
+from typing import Optional, Tuple
from tiny.rna.util import Singleton
@@ -27,8 +29,11 @@ def __init__(self, strand):
self.strand = strand
self.select = (self.strand == 'sense')
- def __contains__(self, x: bool):
- return self.select ^ x
+ def __contains__(self, strand_relationship: Tuple[bool, Optional[bool]]):
+ """If feature strand is None, a match is not possible for 'sense' or 'antisense'"""
+
+ aln_strand, feat_strand = strand_relationship
+ return feat_strand is None or aln_strand ^ feat_strand ^ self.select
def __repr__(self): return str(self.strand)
@@ -156,6 +161,34 @@ def __repr__(self):
        return f"<{type(self).__name__}: {self.start}-{self.end}>"
+class IntervalAnchorMatch(IntervalSelector):
+ """Evaluates whether an alignment's start matches the feature's start, and vice versa for end."""
+
+ def __init__(self, iv: HTSeq.GenomicInterval):
+ super().__init__(iv)
+ assert iv.strand not in ('+', '-')
+
+ def __contains__(self, alignment: dict):
+ """The following diagram demonstrates unstranded anchored matching semantics.
+
+ Match | <------->|
+ Match |<-----------|-->
+ Match |<---------->|
+ No match | <---------|->
+ <-----------|<==feat_A==>|----------->
+
+ Args:
+ alignment: An alignment dictionary containing strand, start, and end
+
+        Returns: True if either end of the alignment coincides with the corresponding
+            terminus of this feature's interval.
+ """
+
+ return alignment['Start'] == self.start or alignment['End'] == self.end
+
+ def __repr__(self):
+        return f"<{type(self).__name__}: {self.start}-{self.end}>"
+
class Interval5pMatch(IntervalSelector):
"""Evaluates whether an alignment's 5' end is anchored to the corresponding terminus of the feature"""
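The two-XOR expression in `StrandMatch.__contains__()` reads like a truth table: `aln ^ feat` is True exactly when the strands differ, and XORing with `select` flips the result for 'sense' vs 'antisense'. A stand-in class mirroring that logic (names here are illustrative) makes it easy to spot-check:

```python
class StrandMatch:
    """Stand-in mirroring the selector above: select is True for 'sense'"""
    def __init__(self, strand):
        self.select = (strand == 'sense')

    def __contains__(self, strand_relationship):
        aln_strand, feat_strand = strand_relationship
        # Unstranded features (None) match any strand selector
        return feat_strand is None or bool(aln_strand ^ feat_strand ^ self.select)

sense, antisense = StrandMatch('sense'), StrandMatch('antisense')
print((True, True) in sense)       # True:  same strands are sense
print((True, False) in sense)      # False
print((True, False) in antisense)  # True:  opposite strands are antisense
print((True, None) in antisense)   # True:  unstranded features always match
```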
diff --git a/tiny/rna/counter/statistics.py b/tiny/rna/counter/statistics.py
index 70c9a7e8..8d625564 100644
--- a/tiny/rna/counter/statistics.py
+++ b/tiny/rna/counter/statistics.py
@@ -375,7 +375,7 @@ def write_output_logfile(self):
if len(self.missing_collapser_outputs):
missing = '\n\t'.join(self.missing_collapser_outputs)
self.add_warning("The Unique Sequences stat could not be determined for the following libraries because "
- "their Collapser outputs were not found in the working directory:\n\t" + missing)
+ "their tiny-collapse outputs were not found in the working directory:\n\t" + missing)
# Only display conditional categories if they were collected for at least one library
empty_rows = self.pipeline_stats_df.loc[self.conditional_categories].isna().all(axis='columns')
diff --git a/tiny/rna/counter/validation.py b/tiny/rna/counter/validation.py
new file mode 100644
index 00000000..40485e25
--- /dev/null
+++ b/tiny/rna/counter/validation.py
@@ -0,0 +1,251 @@
+import functools
+import subprocess
+import sys
+
+from collections import Counter, defaultdict
+
+from tiny.rna.util import sorted_natural
+from tiny.rna.counter.hts_parsing import parse_gff, ReferenceTables
+
+
+class ReportFormatter:
+ error_header = "========[ ERROR ]=============================================="
+ warning_header = "========[ WARNING ]============================================"
+
+ def __init__(self, key_mapper):
+ self.key_mapper = key_mapper
+ self.warnings = []
+ self.errors = []
+
+ def print_report(self):
+ for error_section in self.errors:
+ print(self.error_header)
+ print(error_section)
+ print()
+
+ for warning_section in self.warnings:
+ print(self.warning_header)
+ print(warning_section)
+ print()
+
+ def add_error_section(self, header, body=None):
+ self.add_section(header, body, self.errors)
+
+ def add_warning_section(self, header, body=None):
+ self.add_section(header, body, self.warnings)
+
+ def add_section(self, header, body, dest):
+ if isinstance(body, dict):
+ body = self.nested_dict(body)
+ elif isinstance(body, list):
+ body = '\n'.join(body)
+ if body:
+ dest.append('\n'.join([header, body]))
+ else:
+ dest.append(header)
+
+ def nested_dict(self, summary, indent='\t'):
+ report_lines = self.recursive_indent(summary, indent)
+ return '\n'.join(report_lines)
+
+ def recursive_indent(self, mapping: dict, indent: str):
+ """Converts a nested dictionary into a properly indented list of lines
+
+ Args:
+ mapping: the nested dictionary to be formatted
+ indent: the indent string to prepend to lines on the current level
+ """
+
+ lines = []
+ for key, val in mapping.items():
+ if not val: continue
+ key_header = f"{indent}{self.key_mapper.get(key, key)}: "
+ if isinstance(val, dict):
+ lines.append(key_header)
+ lines.extend(self.recursive_indent(val, indent + '\t'))
+ elif isinstance(val, (list, set)):
+ lines.append(key_header)
+ lines.extend([indent + '\t' + line for line in map(str, val)])
+ else:
+ lines.append(key_header + str(val))
+ return lines
+
+
+class GFFValidator:
+ """Validates GFF files based on their contents and the contents of sequencing files to which
+ the GFF files are expected to be applied."""
+
+ targets = {
+ "ID attribute": "Features missing a valid identifier attribute",
+ "sam files": "Potentially incompatible SAM alignment files",
+ "seq chromosomes": "Chromosomes present in sequence files",
+ "gff chromosomes": "Chromosomes present in GFF files",
+ "strand": "Features missing strand information",
+ }
+
+ def __init__(self, gff_files, prefs, ebwt=None, genomes=None, alignments=None):
+ self.ReferenceTables = ReferenceTables(gff_files, None, **prefs)
+ self.report = ReportFormatter(self.targets)
+ self.chrom_set = set()
+
+ self.seq_files = [ebwt, genomes, alignments]
+ self.gff_files = gff_files
+ self.prefs = prefs
+
+ def validate(self):
+ print("Validating annotation files...")
+ self.parse_and_validate_gffs(self.gff_files)
+ self.validate_chroms(*self.seq_files)
+ self.report.print_report()
+ if self.report.errors:
+ sys.exit(1)
+
+ def parse_and_validate_gffs(self, file_set):
+ gff_infractions = defaultdict(Counter)
+ for file, *_ in file_set.items():
+ row_fn = functools.partial(self.validate_gff_row, issues=gff_infractions, file=file)
+ parse_gff(file, row_fn=row_fn)
+
+ if gff_infractions:
+ self.generate_gff_report(gff_infractions)
+
+ def validate_gff_row(self, row, issues, file):
+        if row.type.lower() == "chromosome": return      # Rows defining whole chromosomes are always skipped
+        if not self.ReferenceTables.filter_match(row): return  # Apply source/type filters before validation
+
+ if row.iv.strand not in ('+', '-'):
+ issues['strand'][file] += 1
+
+ try:
+ self.ReferenceTables.get_feature_id(row)
+        except ValueError:
+ issues['ID attribute'][file] += 1
+
+ self.chrom_set.add(row.iv.chrom)
+
+ def generate_gff_report(self, infractions):
+ header = "The following issues were found in the GFF files provided. "
+
+ if "strand" in infractions:
+ strand_issues = {"strand":
+ sorted_natural([
+ f"{count} missing in {file}"
+ for file, count in infractions['strand'].items()
+ ], reverse=True)
+ }
+
+            ext_header = 'Unstranded features are allowed, but they can lead to unexpected results.\n' \
+ 'These features will match "sense", "antisense", and "both" strand selectors. 5\'/3\' anchored\n' \
+ "overlap selectors for these features will evaluate for termini shared with the alignment,\n" \
+ "but will not distinguish between the alignment's 5' and 3' ends."
+
+ self.report.add_warning_section('\n'.join([header, ext_header]), strand_issues)
+
+ if "ID attribute" in infractions:
+ idattr_issues = {"ID attribute":
+ sorted_natural([
+ f"{count} missing in {file}"
+ for file, count in infractions['ID attribute'].items()
+ ], reverse=True)
+ }
+
+ self.report.add_error_section(header, idattr_issues)
+
+ def validate_chroms(self, ebwt=None, genomes=None, alignments=None):
+ # First search bowtie indexes if they are available
+ if ebwt is not None:
+ try:
+                shared, chroms = self.chroms_shared_with_ebwt(ebwt)
+                self.generate_chrom_report(shared, chroms)
+ return
+ except Exception:
+ pass # Fallback to other input options
+
+ # Next search the genome fasta(s) if available
+ if genomes is not None:
+            shared, chroms = self.chroms_shared_with_genomes(genomes)
+            self.generate_chrom_report(shared, chroms)
+ return
+
+ # Preferred inputs aren't available; continue testing with heuristic options
+ if alignments is not None:
+ self.validate_chroms_heuristic(alignments)
+ else:
+ self.report.add_warning_section("Shared chromosome identifiers could not be validated.")
+
+ def validate_chroms_heuristic(self, alignments):
+ suspect_files = self.alignment_chroms_mismatch_heuristic(alignments)
+ self.generate_chrom_heuristics_report(suspect_files)
+
+ def chroms_shared_with_ebwt(self, index_prefix):
+ """Returns the set intersection between parsed GFF chromosomes and those in the bowtie index"""
+
+ summary = subprocess.check_output(['bowtie-inspect', index_prefix, '-s']).decode('latin1').splitlines()
+ ebwt_chroms = set()
+
+ for line in summary:
+ if line.startswith("Sequence-"):
+ try:
+ ebwt_chroms.add(line.split('\t')[1])
+ except IndexError:
+ pass
+
+ shared = ebwt_chroms & self.chrom_set
+ return shared, ebwt_chroms
+
+ def chroms_shared_with_genomes(self, genome_fastas):
+ """Returns the set intersection between parsed GFF chromosomes and those in the reference genome"""
+
+ genome_chroms = set()
+ for fasta in genome_fastas:
+ with open(fasta, 'rb') as f:
+ for line in f:
+ if line[0] == ord('>'):
+ genome_chroms.add(line[1:].strip().decode())
+
+ shared = genome_chroms & self.chrom_set
+ return shared, genome_chroms
+
+ def alignment_chroms_mismatch_heuristic(self, sam_files, subset_size=50000):
+        """Since alignment files can be very large, we check only that there is at least one
+        shared chromosome identifier, reading at most the first subset_size lines of each file."""
+
+ files_wo_overlap = {}
+
+ for file in sam_files:
+ file_chroms = set()
+ with open(file, 'rb') as f:
+ for line, i in zip(f, range(subset_size)):
+ if line[0] == ord('@'): continue
+ file_chroms.add(line.split(b'\t')[2].strip().decode())
+ if i % 10000 == 0 and len(file_chroms & self.chrom_set): break
+
+ if not len(file_chroms & self.chrom_set):
+ files_wo_overlap[file] = file_chroms
+
+ return files_wo_overlap
+
+ def generate_chrom_report(self, shared, chroms):
+ if shared: return
+ header = "GFF files and sequence files don't share any chromosome identifiers."
+ summary = {
+ "gff chromosomes": sorted(self.chrom_set),
+ "seq chromosomes": sorted(chroms)
+ }
+
+ self.report.add_error_section(header, summary)
+
+ def generate_chrom_heuristics_report(self, suspect_files):
+ if not suspect_files: return
+
+ header = "GFF files and sequence files might not contain the same chromosome identifiers.\n" \
+ "This is determined from a subset of each sequence file, so false positives may be reported."
+ chroms = {file: [f"Chromosomes sampled: {', '.join(sorted(chroms))}"]
+ for file, chroms in suspect_files.items()}
+
+ summary = {
+ "sam files": chroms,
+ "gff chromosomes": sorted(self.chrom_set)
+ }
+
+ self.report.add_warning_section(header, summary)
\ No newline at end of file
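At its core, the genome-based check in `chroms_shared_with_genomes()` is a set intersection between FASTA header identifiers and the chromosomes collected while parsing GFF files. A self-contained sketch of that idea, using an in-memory stand-in for the FASTA file and hypothetical chromosome names:

```python
import io

def fasta_chroms(handle):
    """Collects chromosome identifiers from FASTA header lines"""
    return {line[1:].strip() for line in handle if line.startswith('>')}

# Hypothetical file contents, for illustration only
genome = io.StringIO(">Chr1\nACGT\n>Chr2\nTTAA\n")
gff_chroms = {'Chr1', 'chr3'}  # note: identifiers are case sensitive

shared = fasta_chroms(genome) & gff_chroms
print(shared)  # {'Chr1'}; an empty set here would trigger the error report
```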
diff --git a/tiny/rna/plotter.py b/tiny/rna/plotter.py
index 7ab2ca14..e31e8310 100644
--- a/tiny/rna/plotter.py
+++ b/tiny/rna/plotter.py
@@ -120,7 +120,7 @@ def len_dist_plots(matrices: dict, out_prefix:str, vmin: int = None, vmax: int =
# Save plot
pdf_name = make_filename([out_prefix, condition_and_rep, "len_dist"], ext='.pdf')
- plot.figure.savefig(pdf_name)
+ save_plot(plot, "len_dist", pdf_name)
def get_len_dist_dict(files_list: list) -> DefaultDict[str, Dict[str, pd.DataFrame]]:
@@ -175,7 +175,7 @@ def class_charts(raw_class_counts: pd.DataFrame, mapped_reads: pd.Series, out_pr
# Save the plot
pdf_name = make_filename([out_prefix, library, 'class_chart'], ext='.pdf')
- chart.figure.savefig(pdf_name)
+ save_plot(chart, "class_chart", pdf_name)
def rule_charts(rule_counts: pd.DataFrame, out_prefix: str, scale=2, **kwargs):
@@ -203,7 +203,7 @@ def rule_charts(rule_counts: pd.DataFrame, out_prefix: str, scale=2, **kwargs):
# Save the plot
pdf_name = make_filename([out_prefix, library, 'rule_chart'], ext='.pdf')
- chart.figure.savefig(pdf_name)
+ save_plot(chart, "rule_chart", pdf_name)
def get_proportions_df(counts_df: pd.DataFrame, mapped_totals: pd.Series, un: str, scale=2):
@@ -299,7 +299,7 @@ def scatter_replicates(count_df: pd.DataFrame, samples: dict, output_prefix: str
rscat.set_xlabel("Log$_{2}$ normalized reads in replicate " + rep1)
rscat.set_ylabel("Log$_{2}$ normalized reads in replicate " + rep2)
pdf_name = make_filename([output_prefix, samp, 'replicates', rep1, rep2, 'scatter'], ext='.pdf')
- rscat.figure.savefig(pdf_name)
+ save_plot(rscat, "replicate_scatter", pdf_name)
def load_dge_tables(comparisons: list) -> pd.DataFrame:
@@ -376,7 +376,7 @@ def scatter_dges(count_df, dges, output_prefix, view_lims, classes=None, show_un
sscat.set_ylabel("Log$_{2}$ normalized reads in " + p2)
sscat.get_legend().set_bbox_to_anchor((1, 1))
pdf_name = make_filename([output_prefix, pair, 'scatter_by_dge_class'], ext='.pdf')
- sscat.figure.savefig(pdf_name)
+ save_plot(sscat, "scatter_by_dge_class", pdf_name)
else:
for pair in dges:
@@ -392,7 +392,7 @@ def scatter_dges(count_df, dges, output_prefix, view_lims, classes=None, show_un
sscat.set_xlabel("Log$_{2}$ normalized reads in " + p1)
sscat.set_ylabel("Log$_{2}$ normalized reads in " + p2)
pdf_name = make_filename([output_prefix, pair, 'scatter_by_dge'], ext='.pdf')
- sscat.figure.savefig(pdf_name)
+ save_plot(sscat, 'scatter_by_dge', pdf_name)
def load_raw_counts(raw_counts_file: str) -> pd.DataFrame:
@@ -527,6 +527,21 @@ def get_class_counts(raw_counts_df: pd.DataFrame) -> pd.DataFrame:
return class_counts
+def save_plot(plot, dir_name, filename):
+    """Saves a finished plot, applying output conventions shared by all plot types
+
+    Args:
+        plot: the finished axes object
+        dir_name: plots of this type are saved in a subdirectory by this name
+        filename: the file basename for the plot, including extension
+    """
+
+ if not os.path.isdir(dir_name):
+ os.mkdir(dir_name)
+
+ final_save_path = os.path.join(dir_name, filename)
+ plot.figure.savefig(final_save_path)
+
+
def validate_inputs(args: argparse.Namespace) -> None:
"""Determines if the necessary input files have been provided for the requested plots
This is necessary because we allow users to run the tool with only the files necessary
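The subdirectory-per-plot-type convention that `save_plot()` introduces is an ensure-directory-then-join pattern. A generic sketch without matplotlib, where `write_fn` stands in for `figure.savefig` and the names are hypothetical:

```python
import os
import tempfile

def save_to_subdir(dir_name, filename, write_fn):
    """Places output files in a per-category subdirectory, creating it on first use"""
    os.makedirs(dir_name, exist_ok=True)   # idempotent, unlike bare os.mkdir
    final_save_path = os.path.join(dir_name, filename)
    write_fn(final_save_path)
    return final_save_path

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "len_dist")
    path = save_to_subdir(target, "lib1_len_dist.pdf", lambda p: open(p, 'w').close())
    print(os.path.isfile(path))  # True
```

Using `os.makedirs(..., exist_ok=True)` avoids the race between the `isdir` check and `mkdir` when multiple plots of the same type are saved in quick succession.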
diff --git a/tiny/rna/util.py b/tiny/rna/util.py
index ab5431b3..ac94fd0c 100644
--- a/tiny/rna/util.py
+++ b/tiny/rna/util.py
@@ -68,4 +68,29 @@ def get_r_safename(name: str) -> str:
leading_char = lambda x: re.sub(r"^(?=[^a-zA-Z.]+|\.\d)", "X", x)
special_char = lambda x: re.sub(r"[^a-zA-Z0-9_.]", ".", x)
- return special_char(leading_char(name))
\ No newline at end of file
+ return special_char(leading_char(name))
+
+
+class ReadOnlyDict(dict):
+    """A very simple "read-only" wrapper for dictionaries. It is primarily used for
+    passing argparse's command line args around as a dictionary while ensuring that
+    preferences can't be accidentally modified on their way to a menagerie of class
+    constructors. All methods except __setitem__() are deferred to the base class."""
+
+ def __init__(self, rw_dict):
+ super().__init__(rw_dict)
+
+ def __setitem__(self, *_):
+ raise RuntimeError("Attempted to modify read-only dictionary after construction.")
+
+def sorted_natural(lines, reverse=False):
+ """Sorts alphanumeric strings with entire numbers considered in the sorting order,
+ rather than the default behavior which is to sort by the individual ASCII values
+ of the given number. Returns a sorted copy of the list, just like sorted().
+
+ Not sure who to credit... it seems this snippet has been floating around for quite
+ some time. Strange that there isn't something in the standard library for this."""
+
+ convert = lambda text: int(text) if text.isdigit() else text.lower()
+ alphanum_key = lambda key: [convert(c) for c in re.split(r'(\d+)', key)]
+ return sorted(lines, key=alphanum_key, reverse=reverse)
\ No newline at end of file
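`sorted_natural()` is small enough to exercise directly; copied below with a quick comparison against plain `sorted()` on chromosome-style names, where lexicographic order puts "Chr10" before "Chr2":

```python
import re

def sorted_natural(lines, reverse=False):
    """Natural sort: embedded integers compare numerically, text case-insensitively"""
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split(r'(\d+)', key)]
    return sorted(lines, key=alphanum_key, reverse=reverse)

chroms = ['Chr10', 'Chr2', 'Chr1']
print(sorted(chroms))          # ['Chr1', 'Chr10', 'Chr2']  (lexicographic)
print(sorted_natural(chroms))  # ['Chr1', 'Chr2', 'Chr10']  (numeric-aware)
```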
diff --git a/tiny/templates/run_config_template.yml b/tiny/templates/run_config_template.yml
index fd6f1817..71d85e99 100644
--- a/tiny/templates/run_config_template.yml
+++ b/tiny/templates/run_config_template.yml
@@ -247,10 +247,6 @@ counter_type_filter: []
##-- Select the StepVector implementation that is used. Options: HTSeq or Cython --##
counter_stepvector: 'Cython'
-##-- If False: a feature with multiple values in its ID attribute is treated as an error --##
-##-- If True: multiple ID values are allowed, but only the first value is used --##
-counter_allow_multi_id: True
-
##-- If True: produce diagnostic logs to indicate what was eliminated and why --##
counter_diags: False