diff --git a/START_HERE/features.csv b/START_HERE/features.csv
index fba532fa..0e206cae 100644
--- a/START_HERE/features.csv
+++ b/START_HERE/features.csv
@@ -1,7 +1,7 @@
Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Strand,5' End Nucleotide,Length,Overlap
Class,mask,masked,,,1,both,all,all,Partial
Class,miRNA,miRNA,,,2,sense,all,16-24,5' anchored
-Class,piRNA,piRNA-5'A,,,2,both,A,24-32,Full
-Class,piRNA,piRNA-5'T,,,2,both,T,24-32,Full
-Class,siRNA,siRNA,,,2,both,all,15-22,Full
-Class,unk,unknown,,,3,both,all,all,Full
+Class,piRNA,piRNA-5'A,,,2,both,A,24-32,Nested
+Class,piRNA,piRNA-5'T,,,2,both,T,24-32,Nested
+Class,siRNA,siRNA,,,2,both,all,15-22,Nested
+Class,unk,unknown,,,3,both,all,all,Nested
diff --git a/START_HERE/tinyRNA_TUTORIAL.md b/START_HERE/tinyRNA_TUTORIAL.md
index 783ae5ff..3ca92a08 100644
--- a/START_HERE/tinyRNA_TUTORIAL.md
+++ b/START_HERE/tinyRNA_TUTORIAL.md
@@ -43,6 +43,7 @@ Expected runtime: ~10-60 minutes (expect longer runtimes if a bowtie index must
2. Move your GFF and genome sequence files into the reference_data directory.
3. Edit features.csv and samples.csv file for your datasets and selection parameters.
4. Edit paths.yml as follows:
- - line 46: `ebwt: ''` (no value)
- - line 50: `reference_genome_files: your-fasta-formatted-DNA-sequence-file`
+ - line 20: change the value after `path:` to point to your GFF or GTF file
+ - line 46: delete the value after `ebwt:`
+ - line 51: change the value after `- ` to point to your fasta formatted DNA sequence file
5. Run the pipeline with the command: `tiny run --config run_config.yml`
\ No newline at end of file
diff --git a/doc/tiny-count.md b/doc/tiny-count.md
index 184fab49..c23300f8 100644
--- a/doc/tiny-count.md
+++ b/doc/tiny-count.md
@@ -53,24 +53,25 @@ Wildcard values (`all`, `*`, or an empty cell) can be used in the `Select for...
Features overlapping a read alignment are selected based on their overlap characteristics. These matches are then sorted by hierarchy value before proceeding to Stage 3.
### Overlap
-This column allows you to specify the extent of overlap required for candidate feature selection:
-- `partial`: alignment overlaps feature by at least one base
-- `full`: alignment does not extend beyond either terminus of the feature
-- `exact`: alignment termini are equal to the feature's
-- `anchored`: alignment's start and/or end is equal to the feature's
-- `5' anchored`: alignment's 5' end is equal to the corresponding terminus of the feature
-- `3' anchored`: alignment's 3' end is equal to the corresponding terminus of the feature
+This column allows you to specify the extent of overlap required for candidate feature selection. In order to be a candidate, a feature must reside on the same chromosome as the alignment and overlap its interval by at least 1 nucleotide. A shared strand is not required. See the [Strand](#strand) section in Stage 3 for selection by strand.
-In order to be a candidate, a feature must match a rule in Stage 1, reside on the same chromosome as the alignment, and must overlap the alignment by at least 1 nucleotide.
-
-#### Strandedness and the Overlap Selector
-A feature does not have to be on the same strand as the alignment in order to be a candidate. See the [Strand](#strand) section in Stage 3 for selection by strand. Unstranded features will have `5' anchored` and `3' anchored` overlap selectors downgraded to `anchored` selectors. Alignments overlapping these features are evaluated for shared start and/or end coordinates, but 5'/3' ends are not distinguished.
+#### Unstranded Features
+ If these features match rules with `5' anchored` and `3' anchored` overlap selectors, they will be downgraded to `anchored` selectors. Alignments overlapping these features are evaluated for shared start and/or end coordinates, but 5' and 3' ends are not distinguished.
#### Selector Demonstration
-The following diagrams demonstrate the strand semantics of these interval selectors. The first two options show separate illustrations for features on each strand for emphasis. All matches shown in the remaining three options apply to features on either strand.
-
-
+The following table provides a description and illustration of the available overlap selectors. All matches apply to features on either strand, i.e. matches shown below the antisense strand also apply, as shown, to the feature on the sense strand, and vice versa.
+
+| Keyword and Description | Illustration |
+|:--------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------:|
+| `partial`: alignment overlaps feature by at least one base |
|
+| `nested`: alignment does not extend beyond either terminus of the feature |
|
+| `exact`: alignment termini are equal to the feature's |
|
+| `anchored`: alignment is nested with start and/or end equal to the feature's |
|
+| `5' anchored`: alignment is nested with 5' end equal to the corresponding terminus of the feature |
|
+| `3' anchored`: alignment is nested with 3' end equal to the corresponding terminus of the feature |
|
+
+:people_holding_hands: Illustration colors have been selected for colorblindness accessibility.
### Hierarchy
Each rule must be assigned a hierarchy value. This value is used to sort Stage 2 matches so that matches with smaller hierarchy values take precedence in Stage 3.
diff --git a/images/3'_anchored_5'_anchored_interval.png b/images/3'_anchored_5'_anchored_interval.png
deleted file mode 100644
index e0bffcbb..00000000
Binary files a/images/3'_anchored_5'_anchored_interval.png and /dev/null differ
diff --git a/images/full_exact_partial_interval.png b/images/full_exact_partial_interval.png
deleted file mode 100644
index e22265c1..00000000
Binary files a/images/full_exact_partial_interval.png and /dev/null differ
diff --git a/images/overlap_selectors/3'_anchored.png b/images/overlap_selectors/3'_anchored.png
new file mode 100644
index 00000000..b0b538d7
Binary files /dev/null and b/images/overlap_selectors/3'_anchored.png differ
diff --git a/images/overlap_selectors/5'_anchored.png b/images/overlap_selectors/5'_anchored.png
new file mode 100644
index 00000000..6c6c84df
Binary files /dev/null and b/images/overlap_selectors/5'_anchored.png differ
diff --git a/images/overlap_selectors/anchored.png b/images/overlap_selectors/anchored.png
new file mode 100644
index 00000000..29f38393
Binary files /dev/null and b/images/overlap_selectors/anchored.png differ
diff --git a/images/overlap_selectors/exact.png b/images/overlap_selectors/exact.png
new file mode 100644
index 00000000..8a30e917
Binary files /dev/null and b/images/overlap_selectors/exact.png differ
diff --git a/images/overlap_selectors/nested.png b/images/overlap_selectors/nested.png
new file mode 100644
index 00000000..a0282aed
Binary files /dev/null and b/images/overlap_selectors/nested.png differ
diff --git a/images/overlap_selectors/partial.png b/images/overlap_selectors/partial.png
new file mode 100644
index 00000000..b386faea
Binary files /dev/null and b/images/overlap_selectors/partial.png differ
diff --git a/images/tiny-count_selection.png b/images/tiny-count_selection.png
index 6f92d6f6..5cde511f 100644
Binary files a/images/tiny-count_selection.png and b/images/tiny-count_selection.png differ
diff --git a/tests/testdata/config_files/features.csv b/tests/testdata/config_files/features.csv
index 66977d07..8a6e12d1 100755
--- a/tests/testdata/config_files/features.csv
+++ b/tests/testdata/config_files/features.csv
@@ -1,7 +1,7 @@
Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Strand,5' End Nucleotide,Length,Overlap
Class,mask,,,,1,both,all,all,Partial
-Class,miRNA,,,,2,sense,all,16-22,Full
-Class,piRNA,5pA,,,2,both,A,24-32,Full
-Class,piRNA,5pT,,,2,both,T,24-32,Full
-Class,siRNA,,,,2,both,all,15-22,Full
-Class,unk,,,,3,both,all,all,Full
\ No newline at end of file
+Class,miRNA,,,,2,sense,all,16-22,Nested
+Class,piRNA,5pA,,,2,both,A,24-32,Nested
+Class,piRNA,5pT,,,2,both,T,24-32,Nested
+Class,siRNA,,,,2,both,all,15-22,Nested
+Class,unk,,,,3,both,all,all,Nested
\ No newline at end of file
diff --git a/tests/unit_tests_features.py b/tests/unit_tests_features.py
index 65e776f2..55c41410 100644
--- a/tests/unit_tests_features.py
+++ b/tests/unit_tests_features.py
@@ -24,7 +24,7 @@ def setUpClass(self):
def make_feature_for_interval_test(self, iv_rule, feat_id, chrom, strand: str, start, stop):
feat_iv = HTSeq.GenomicInterval(chrom, start, stop, strand)
rules = [dict(deepcopy(rules_template[0]), Overlap=iv_rule, Identity=('*', '*'), nt5end='*', Length='*')]
- fs = FeatureSelector(rules, LibraryStats(normalize_by_hits=True))
+ fs = FeatureSelector(rules)
# Feature with specified coordinates, matching rule 0 with hierarchy 0 and the appropriate selector for iv_rule
selectors = fs.build_interval_selectors(feat_iv, [(0, 0, iv_rule)])
@@ -187,12 +187,12 @@ def test_count_reads_generic(self, mock):
instance.assign_features.assert_called_once_with("mock_alignment")
instance.stats.assert_has_calls(expected_calls_to_stats)
- """Does FeatureSelector.choose() correctly select features defining `full` interval matching rules?"""
+ """Does FeatureSelector.choose() correctly select features defining `nested` interval matching rules?"""
- def test_feature_selector_full_interval(self):
- iv_rule = 'full'
+ def test_feature_selector_nested_interval(self):
+ iv_rule = 'nested'
chrom, strand, start, stop = "n/a", "+", 5, 10
- feat, fs = self.make_feature_for_interval_test(iv_rule, "Full Overlap", chrom, strand, start, stop)
+ feat, fs = self.make_feature_for_interval_test(iv_rule, "Nested Overlap", chrom, strand, start, stop)
aln_base = {'Seq': 'ATGC', 'Chrom': chrom, 'Strand': strand_to_bool(strand)}
aln_spill_lo = make_parsed_sam_record(**dict(aln_base, Start=start - 1, Name="spill"))
@@ -213,9 +213,9 @@ def test_feature_selector_full_interval(self):
self.assertEqual(fs.choose(feat, aln_spill_lo), {})
self.assertEqual(fs.choose(feat, aln_spill_hi), {})
- self.assertEqual(fs.choose(feat, aln_contained), {"Full Overlap": {0}})
- self.assertEqual(fs.choose(feat, aln_contained_lo), {"Full Overlap": {0}})
- self.assertEqual(fs.choose(feat, aln_contained_hi), {"Full Overlap": {0}})
+ self.assertEqual(fs.choose(feat, aln_contained), {"Nested Overlap": {0}})
+ self.assertEqual(fs.choose(feat, aln_contained_lo), {"Nested Overlap": {0}})
+ self.assertEqual(fs.choose(feat, aln_contained_hi), {"Nested Overlap": {0}})
"""Does FeatureSelector.choose() correctly select features with `partial` interval matching rules?"""
@@ -284,7 +284,7 @@ def test_feature_selector_5p_interval(self):
"""
No match | 6 -------->| 10 aln_none
- Match 5 |------------|--> 11 aln_long
+ No match 5 |------------|--> 11 aln_long
Match 5 |----------->| 10 aln_exact
Match 5 |--------> 9 | aln_short
(+) 5' -------------|==feat_A===>|-----------> 3'
@@ -292,7 +292,7 @@ def test_feature_selector_5p_interval(self):
(-) 3' <------------|<===feat_B==|------------ 5'
| 6 <--------| 10 Match
5 |<-----------| 10 Match
- 4 <--|------------| 10 Match
+ 4 <--|------------| 10 No match
5 |<-------- 9 | No match
"""
@@ -307,7 +307,7 @@ def test_feature_selector_5p_interval(self):
}
self.assertEqual(fs.choose(feat_A, aln['aln_none']), {})
- self.assertEqual(fs.choose(feat_A, aln['aln_long']), {"5' Anchored Overlap (+)": {0}})
+ self.assertEqual(fs.choose(feat_A, aln['aln_long']), {})
self.assertEqual(fs.choose(feat_A, aln['aln_exact']), {"5' Anchored Overlap (+)": {0}})
self.assertEqual(fs.choose(feat_A, aln['aln_short']), {"5' Anchored Overlap (+)": {0}})
@@ -319,7 +319,7 @@ def test_feature_selector_5p_interval(self):
aln['aln_none'].update({'Start': 5, 'End': 9, 'Strand': False})
self.assertEqual(fs.choose(feat_B, aln['aln_none']), {})
- self.assertEqual(fs.choose(feat_B, aln['aln_long']), {"5' Anchored Overlap (-)": {0}})
+ self.assertEqual(fs.choose(feat_B, aln['aln_long']), {})
self.assertEqual(fs.choose(feat_B, aln['aln_exact']), {"5' Anchored Overlap (-)": {0}})
self.assertEqual(fs.choose(feat_B, aln['aln_short']), {"5' Anchored Overlap (-)": {0}})
@@ -330,17 +330,17 @@ def test_feature_selector_3p_interval(self):
chrom, start, end = "n/a", 5, 10
"""
- No match 5 |--------> 9 | aln_none
- Match 4 --|----------->| 10 aln_long
- Match 5 |----------->| 10 aln_exact
- Match | 6 -------->| 10 aln_short
- (+) 5' -------------|==feat_A===>|-----------> 3'
- start = 5 end = 10
- (-) 3' <------------|<===feat_B==|------------ 5'
- 5 |<-------- 9 | Match
- 5 |<-----------| 10 Match
- 5 |<-----------|-- 11 Match
- | 6 <--------| 10 No match
+ No match 5 |--------> 9 | aln_none
+ No match 4 --|----------->| 10 aln_long
+ Match 5 |----------->| 10 aln_exact
+ Match | 6 -------->| 10 aln_short
+ (+) 5' --------------|==feat_A===>|-----------> 3'
+ start = 5 end = 10
+ (-) 3' <-------------|<===feat_B==|------------ 5'
+ 5 |<-------- 9 | Match
+ 5 |<-----------| 10 Match
+ 5 |<-----------|-- 11 No match
+ | 6 <--------| 10 No match
"""
# Test feat_A on (+) strand
@@ -354,7 +354,7 @@ def test_feature_selector_3p_interval(self):
}
self.assertEqual(fs.choose(feat_A, aln['aln_none']), {})
- self.assertEqual(fs.choose(feat_A, aln['aln_long']), {"3' Anchored Overlap (+)": {0}})
+ self.assertEqual(fs.choose(feat_A, aln['aln_long']), {})
self.assertEqual(fs.choose(feat_A, aln['aln_exact']), {"3' Anchored Overlap (+)": {0}})
self.assertEqual(fs.choose(feat_A, aln['aln_short']), {"3' Anchored Overlap (+)": {0}})
@@ -366,7 +366,7 @@ def test_feature_selector_3p_interval(self):
aln['aln_none'].update({'Start': 6, 'End': 10, 'Strand': False})
self.assertEqual(fs.choose(feat_B, aln['aln_none']), {})
- self.assertEqual(fs.choose(feat_B, aln['aln_long']), {"3' Anchored Overlap (-)": {0}})
+ self.assertEqual(fs.choose(feat_B, aln['aln_long']), {})
self.assertEqual(fs.choose(feat_B, aln['aln_exact']), {"3' Anchored Overlap (-)": {0}})
self.assertEqual(fs.choose(feat_B, aln['aln_short']), {"3' Anchored Overlap (-)": {0}})
@@ -382,7 +382,7 @@ def test_wildcard_identities(self):
rules = [*one, *two, *non, *dup]
- actual = FeatureSelector(rules, LibraryStats(normalize_by_hits=True)).inv_ident
+ actual = FeatureSelector(rules).inv_ident
expected = {
(Wildcard(), non_wild): [0],
(non_wild, Wildcard()): [1],
@@ -395,17 +395,17 @@ def test_wildcard_identities(self):
"""Does FeatureSelector build both shifted and unshifted selectors and group them by resulting interval?"""
def test_build_interval_selectors_grouping(self):
- fs = FeatureSelector(deepcopy(rules_template), LibraryStats())
+ fs = FeatureSelector(deepcopy(rules_template))
iv = HTSeq.GenomicInterval('I', 10, 20, '+')
match_tuples = [('n/a', 'n/a', 'partial'),
- ('n/a', 'n/a', 'full'),
+ ('n/a', 'n/a', 'nested'),
('n/a', 'n/a', 'exact'),
('n/a', 'n/a', "5' anchored"),
('n/a', 'n/a', "3' anchored"),
# iv_shifted_1 Shift values:
('n/a', 'n/a', 'partial, -5, 5'), # 5': -5 3': 5
- ('n/a', 'n/a', 'full, -5, 5'), # 5': -5 3': 5
+ ('n/a', 'n/a', 'nested, -5, 5'), # 5': -5 3': 5
# iv_shifted_2
('n/a', 'n/a', 'exact, -10, 10'), # 5': -10 3': 10
# iv_shifted_3
diff --git a/tests/unit_tests_hts_parsing.py b/tests/unit_tests_hts_parsing.py
index 74de20d7..3f8e9ea6 100644
--- a/tests/unit_tests_hts_parsing.py
+++ b/tests/unit_tests_hts_parsing.py
@@ -18,13 +18,6 @@
resources = "./testdata/counter"
-class MockFeatureSelector:
- def __init__(self, rules_table):
- self.rules_table = FeatureSelector.build_selectors(rules_table)
- self.inv_ident = FeatureSelector.build_inverted_identities(self.rules_table)
- self.build_interval_selectors = FeatureSelector.build_interval_selectors
-
-
class MyTestCase(unittest.TestCase):
@classmethod
@@ -54,7 +47,7 @@ def selector_with_template(self, updates_list):
rules = [deepcopy(helpers.rules_template[0]) for _ in range(len(updates_list))]
for changes, template in zip(updates_list, rules):
template.update(changes)
- return MockFeatureSelector(rules)
+ return FeatureSelector(rules)
def exhaust_iterator(self, it):
collections.deque(it, maxlen=0)
@@ -137,7 +130,7 @@ def test_ref_tables_single_feature(self):
feature_selector = self.selector_with_template([
# Fails to match due to Identity selector
{'Identity': ("Class", "CSR"), 'Strand': "sense", 'Hierarchy': 1, 'Class': 'none', 'nt5end': "all",
- 'Overlap': 'full', 'Length': "20"},
+ 'Overlap': 'nested', 'Length': "20"},
# Match
{'Identity': ("biotype", "snoRNA"), 'Strand': "antisense", 'Hierarchy': 2, 'Class': 'tag', 'nt5end': "all",
'Overlap': 'partial', 'Length': "30"}
@@ -161,7 +154,7 @@ def test_ref_tables_single_feature_all_features_false(self):
feature_selector = self.selector_with_template([
# Fails to match due to Identity selector
{'Identity': ("Class", "CSR"), 'Strand': "sense", 'Hierarchy': 1, 'Class': 'none', 'nt5end': "all",
- 'Overlap': 'full', 'Length': "20"},
+ 'Overlap': 'nested', 'Length': "20"},
# Match
{'Identity': ("biotype", "snoRNA"), 'Strand': "antisense", 'Hierarchy': 2, 'Class': 'tag', 'nt5end': "all",
'Overlap': 'partial', 'Length': "30"}
@@ -183,7 +176,7 @@ def test_ref_tables_missing_name_attribute_all_features_false(self):
kwargs = {'all_features': False}
bad = "bad_name_attribute"
feature_source = {self.short_gff_file: [bad]}
- feature_selector = MockFeatureSelector([])
+ feature_selector = FeatureSelector([])
expected_err = "No features were retained while parsing your GFF file.\n" \
"This may be due to a lack of features matching 'Select for...with value...'"
@@ -221,7 +214,7 @@ def test_ref_tables_alias_multisource_concat(self):
# Notice: screening for "ID" name attribute happens earlier in counter.load_config()
expected_alias = {"Gene:WBGene00023193": ("additional_class", "Gene:WBGene00023193", "unknown")}
- _, alias, _ = ReferenceFeatures(feature_source, **kwargs).get(MockFeatureSelector([]))
+ _, alias, _ = ReferenceFeatures(feature_source, **kwargs).get(FeatureSelector([]))
self.assertDictEqual(alias, expected_alias)
@@ -236,7 +229,7 @@ def test_ref_tables_alias_multisource_concat_all_features_false(self):
with self.assertRaisesRegex(ValueError, expected_err):
# No aliases saved due to all_features=False and the lack of identity matches
- _, alias, _ = ReferenceFeatures(feature_source, **kwargs).get(MockFeatureSelector([]))
+ _, alias, _ = ReferenceFeatures(feature_source, **kwargs).get(FeatureSelector([]))
"""Does ReferenceFeatures.get() properly concatenate identity match tuples when multiple GFF files define
matches for a feature?"""
@@ -762,7 +755,7 @@ def test_gff_megazord(self):
'Overlap': 'partial', 'Strand': '', 'nt5end': '', 'Length': ''}]
files = {gff.format(ftype): [] for gff in self.genomes.values() for ftype in ('gff3', 'gtf')}
- fs = FeatureSelector(rules, LibraryStats())
+ fs = FeatureSelector(rules)
rt = ReferenceFeatures(files)
# The test is passed if this command
diff --git a/tests/unit_tests_matching.py b/tests/unit_tests_matching.py
index d979e688..25f618dd 100644
--- a/tests/unit_tests_matching.py
+++ b/tests/unit_tests_matching.py
@@ -16,26 +16,30 @@ def setUpClass(self):
"""Does FeatureSelector build the proper interval selectors?"""
def test_feature_selector_interval_build(self):
- fs = FeatureSelector(deepcopy(rules_template), LibraryStats())
+ fs = FeatureSelector(deepcopy(rules_template))
iv = HTSeq.GenomicInterval('I', 0, 10, '+')
# Match tuples formed during GFF parsing
match_tuples = [('n/a', 'n/a', 'partial'),
- ('n/a', 'n/a', 'full'),
+ ('n/a', 'n/a', 'nested'),
('n/a', 'n/a', 'exact'),
+ ('n/a', 'n/a', 'anchored'),
('n/a', 'n/a', "5' anchored"),
('n/a', 'n/a', "3' anchored"),
('n/a', 'n/a', "5'anchored"), # spaces should be optional
('n/a', 'n/a', "3'anchored")]
+ match_tuples += [('n/a', 'n/a', kwd) for kwd in Wildcard.kwds]
result = fs.build_interval_selectors(iv, match_tuples)
self.assertEqual(len(result), 1)
self.assertIsInstance(result[iv][0][2], IntervalPartialMatch)
- self.assertIsInstance(result[iv][1][2], IntervalFullMatch)
+ self.assertIsInstance(result[iv][1][2], IntervalNestedMatch)
self.assertIsInstance(result[iv][2][2], IntervalExactMatch)
- self.assertIsInstance(result[iv][3][2], Interval5pMatch)
- self.assertIsInstance(result[iv][4][2], Interval3pMatch)
+ self.assertIsInstance(result[iv][3][2], IntervalAnchorMatch)
+ self.assertIsInstance(result[iv][4][2], Interval5pMatch)
+ self.assertIsInstance(result[iv][5][2], Interval3pMatch)
+ self.assertTrue(isinstance(o, Wildcard) for o in result[iv][6:])
"""Are interval selectors hashable and properly compare for equality?
This is important for storing feature records in GenomicArraysOfSets"""
@@ -46,7 +50,7 @@ def test_interval_selector_equality(self):
shared_iv = HTSeq.GenomicInterval('I', 0, 10)
class_name_test = {
IntervalPartialMatch(shared_iv),
- IntervalFullMatch(shared_iv),
+ IntervalNestedMatch(shared_iv),
IntervalExactMatch(shared_iv),
Interval5pMatch(shared_iv),
Interval3pMatch(shared_iv)
diff --git a/tiny/rna/counter/features.py b/tiny/rna/counter/features.py
index 045945b8..e5da0c00 100644
--- a/tiny/rna/counter/features.py
+++ b/tiny/rna/counter/features.py
@@ -30,7 +30,7 @@ class FeatureCounter:
def __init__(self, references, selection_rules, **prefs):
self.stats = LibraryStats(**prefs)
self.sam_reader = SAM_reader(**prefs)
- self.selector = FeatureSelector(selection_rules, self.stats, **prefs)
+ self.selector = FeatureSelector(selection_rules, **prefs)
if isinstance(references, ReferenceFeatures):
self.mode = "by feature"
@@ -101,12 +101,11 @@ class FeatureSelector:
rules_table: List[dict]
inv_ident: Dict[tuple, List[int]]
- def __init__(self, rules: List[dict], libstats: 'LibraryStats', report_diags=False, **kwargs):
+ def __init__(self, rules: List[dict], **kwargs):
FeatureSelector.rules_table = self.build_selectors(rules)
FeatureSelector.inv_ident = self.build_inverted_identities(FeatureSelector.rules_table)
-
- self.report_eliminations = report_diags
- if report_diags: self.elim_stats = libstats.diags.selection_diags
+ self.warnings = defaultdict(set)
+ self.overlap_cache = {}
@classmethod
def choose(cls, candidates: Set[feature_record_tuple], alignment: dict) -> Mapping[str, set]:
@@ -199,9 +198,8 @@ def build_selectors(rules_table) -> List[dict]:
return rules_table
- @staticmethod
- def build_interval_selectors(iv: 'HTSeq.GenomicInterval', match_tuples: List[unbuilt_match_tuple]):
- """Builds partial/full/exact/5'anchored/3'anchored interval selectors
+ def build_interval_selectors(self, iv: 'HTSeq.GenomicInterval', match_tuples: List[unbuilt_match_tuple], feat_id=None):
+ """Builds partial/wildcard/nested/exact/5'anchored/3'anchored interval selectors
Unlike build_selectors() and build_inverted_identities(), this function
is not called at construction time. Instead, it is called when finalizing
@@ -218,39 +216,51 @@ def build_interval_selectors(iv: 'HTSeq.GenomicInterval', match_tuples: List[unb
Args:
iv: The interval of the feature from which each selector is built
match_tuples: A list of tuples representing the feature's Stage 1 matches
+ feat_id: The tagged feature ID associated with these matches (for error logging)
+
+ Returns:
+ matches_by_interval: a dictionary of shifted intervals and their associated
+ match tuples, where each match tuple now contains a complete IntervalSelector
"""
- cache = {}
+ cache = self.overlap_cache
selector_factory = {
- 'full': lambda x: IntervalFullMatch(x),
'exact': lambda x: IntervalExactMatch(x),
+ 'full': lambda x: IntervalNestedMatch(x), # temporary backward compatibility
+ 'nested': lambda x: IntervalNestedMatch(x),
'partial': lambda x: IntervalPartialMatch(x),
'anchored': lambda x: IntervalAnchorMatch(x),
"5'anchored": lambda x: Interval5pMatch(x) if iv.strand in ('+', '-') else IntervalAnchorMatch(x),
- "3'anchored": lambda x: Interval3pMatch(x) if iv.strand in ('+', '-') else IntervalAnchorMatch(x),
- }
+ "3'anchored": lambda x: Interval3pMatch(x) if iv.strand in ('+', '-') else IntervalAnchorMatch(x)}
matches_by_interval = defaultdict(list)
- for i, match in enumerate(match_tuples):
- try:
- # Split optional shift parameters from definition
- defn = re.split(r'\s*,\s*', match[2], 1)
- selector, shift = defn[0], defn[1:]
- selector = selector.replace(' ', '')
-
- # Shift the interval before constructing if shift parameters were provided
- match_iv = IntervalSelector.get_shifted_interval(shift[0], iv) if shift else iv
- cache_key = (selector, match_iv)
-
- # Cache instances to prevent duplicates for the same match type on the same iv
- selector_obj = cache.setdefault(cache_key, selector_factory[selector](match_iv))
-
- built_match_tuple = (match[0], match[1], selector_obj)
- matches_by_interval[match_iv].append(built_match_tuple)
- except KeyError:
- raise ValueError(f'Invalid overlap selector: "{match_tuples[i][2]}"')
- except IllegalShiftError:
- pass # Drop offending match tuple
+ for match in match_tuples:
+ # Split optional shift parameters from definition
+ defn = re.split(r'\s*,\s*', match[2], 1)
+ selector, shift = defn[0], defn[1:]
+ selector = selector.replace(' ', '')
+
+ if selector in Wildcard.kwds:
+ selector_obj = Wildcard()
+ match_iv = iv
+ else:
+ try:
+ # Shift the interval before constructing if shift parameters were provided
+ match_iv = IntervalSelector.get_shifted_interval(shift[0], iv) if shift else iv
+
+ # Cache instances to prevent duplicates for the same match type on the same iv
+ cache_key = (selector, match_iv.chrom, match_iv.start, match_iv.end)
+ selector_obj = cache.setdefault(cache_key, selector_factory[selector](match_iv))
+ except KeyError:
+ raise ValueError(f'Invalid overlap selector: "{match[2]}"')
+ except IllegalShiftError as e:
+ bad_shift_msg = f'{feat_id} and rule {match[0]} ({match[2]})'
+ self.warnings[e.subtype].add(bad_shift_msg)
+ continue
+
+ # Replace the match tuple's definition with selector
+ built_match_tuple = (match[0], match[1], selector_obj)
+ matches_by_interval[match_iv].append(built_match_tuple)
return matches_by_interval
diff --git a/tiny/rna/counter/hts_parsing.py b/tiny/rna/counter/hts_parsing.py
index 71fa5f6a..c3adabe8 100644
--- a/tiny/rna/counter/hts_parsing.py
+++ b/tiny/rna/counter/hts_parsing.py
@@ -11,7 +11,7 @@
from inspect import stack
from tiny.rna.counter.matching import Wildcard
-from tiny.rna.util import report_execution_time, make_filename
+from tiny.rna.util import report_execution_time, make_filename, ReportFormatter
# For parse_GFF_attribute_string()
_re_attr_main = re.compile(r"\s*([^\s=]+)[\s=]+(.*)")
@@ -436,12 +436,15 @@ def __init__(self, **prefs):
def get(self, feature_selector): pass
def _init_genomic_array(self):
+ """The Cython StepVector is more efficient but requires extra setup steps.
+ If these fail, we want to fall back to HTSeq's StepVector and carry on."""
+
if self.stepvector == 'Cython':
try:
from tiny.rna.counter.stepvector import StepVector
setattr(HTSeq.StepVector, 'StepVector', StepVector)
return HTSeq.GenomicArray("auto", stranded=False)
- except ModuleNotFoundError:
+ except:
self.stepvector = 'HTSeq'
print("Failed to import Cython StepVector\n"
"Falling back to HTSeq's StepVector",
@@ -468,6 +471,23 @@ def get_feats_table_size(self) -> int:
return total_feats
+ def print_selector_warnings(self):
+ """Warnings accumulate in FeatureSelector and are printed here using ReportFormatter"""
+
+ if self.selector is None: return
+
+ header = "Incompatible feature intervals were produced as a result of overlap shift parameters."
+ bad_shift = "The following matches were omitted from selection because they result in"
+ descriptions = {
+ "null": f"{bad_shift} a zero-width interval",
+ "inverted": f"{bad_shift} an inverted interval (start > end)",
+ "negative start": f"{bad_shift} a negative start coordinate"
+ }
+
+ formatter = ReportFormatter(descriptions)
+ formatter.add_warning_section(header, self.selector.warnings)
+ formatter.print_report()
+
@staticmethod
def map_strand(htseq_value: str):
"""Maps HTSeq's strand representation (+/-/.) to
@@ -557,6 +577,7 @@ def get(self, feature_selector) -> Tuple[GenomicArray, AliasTable, TagTable]:
self._finalize_aliases()
self._finalize_features()
+ if self.selector.warnings: self.print_selector_warnings()
if self.get_feats_table_size() == 0 and self.all_features is False:
raise ValueError("No features were retained while parsing your GFF file.\n"
"This may be due to a lack of features matching 'Select for...with value...'")
@@ -741,6 +762,9 @@ def _finalize_features(self):
tagged_matches = self.matches[tagged_id]
self._add_subinterval_matches(tagged_id, merged_sub_ivs, tagged_matches)
+ # GenomicArray is built. Clear cache...
+ self.selector.overlap_cache.clear()
+
def _add_subinterval_matches(self, tagged_id: tuple, sub_ivs: list, matches: set):
"""Adds the classified group of matches to the GenomicArray under each of the
feature's sub-intervals
@@ -753,7 +777,7 @@ def _add_subinterval_matches(self, tagged_id: tuple, sub_ivs: list, matches: set
for sub_iv in sub_ivs:
# Build interval selectors for this
- matches_by_shifted_iv = self.selector.build_interval_selectors(sub_iv, matches)
+ matches_by_shifted_iv = self.selector.build_interval_selectors(sub_iv, matches, tagged_id)
strand = self.map_strand(sub_iv.strand)
for shifted_iv, built_matches in matches_by_shifted_iv.items():
@@ -761,7 +785,6 @@ def _add_subinterval_matches(self, tagged_id: tuple, sub_ivs: list, matches: set
sorted_matches = sorted(built_matches, key=lambda x: x[1])
self.feats[shifted_iv] += (tagged_id, strand, tuple(sorted_matches))
-
@staticmethod
def _merge_adjacent_subintervals(unmerged_sub_ivs: List[HTSeq.GenomicInterval]) -> list:
"""Combines overlapping and adjacent elements in the provided list of subintervals"""
@@ -805,6 +828,12 @@ def get(self, selector):
for seq_id, seq_len in self.seqs.items():
self.add_reference_seq(seq_id, seq_len, match_tuples)
+ # GenomicArray is built. Clear cache...
+ self.selector.overlap_cache.clear()
+
+ # If the user provided shift parameters, there will likely be warnings
+ if self.selector.warnings: self.print_selector_warnings()
+
# Aliases are irrelevant for non-GFF references
aliases = {seq_id: () for seq_id in self.tags}
return self.feats, aliases, self.tags
@@ -829,10 +858,10 @@ def add_reference_seq(self, seq_id, seq_len, matches):
for strand in ('+', '-'):
iv = HTSeq.GenomicInterval(seq_id, 0, seq_len, strand)
- matches_by_shifted_iv = self.selector.build_interval_selectors(iv, matches)
+ matches_by_shifted_iv = self.selector.build_interval_selectors(iv, matches, tagged_id)
strand = self.map_strand(strand)
for shifted_iv, built_matches in matches_by_shifted_iv.items():
# Sort match tuples by rank for more efficient feature selection
sorted_matches = sorted(built_matches, key=lambda x: x[1])
- self.feats[shifted_iv] += (tagged_id, strand, tuple(sorted_matches))
\ No newline at end of file
+ self.feats[shifted_iv] += (tagged_id, strand, tuple(sorted_matches))
diff --git a/tiny/rna/counter/matching.py b/tiny/rna/counter/matching.py
index fdafbca5..86915bea 100644
--- a/tiny/rna/counter/matching.py
+++ b/tiny/rna/counter/matching.py
@@ -202,6 +202,7 @@ class IntervalPartialMatch(IntervalSelector):
at least one base, no further evaluation is required when this filter
is used by FeatureSelector.choose()"""
+ __slots__ = () # Base class holds attributes
__contains__ = Wildcard.__contains__
def __init__(self, iv: HTSeq.GenomicInterval):
@@ -211,7 +212,8 @@ def __repr__(self):
return f""
-class IntervalFullMatch(IntervalSelector):
+class IntervalNestedMatch(IntervalSelector):
+ __slots__ = () # Base class holds attributes
def __init__(self, iv: HTSeq.GenomicInterval):
super().__init__(iv)
@@ -224,6 +226,7 @@ def __repr__(self):
class IntervalExactMatch(IntervalSelector):
+ __slots__ = () # Base class holds attributes
def __init__(self, iv: HTSeq.GenomicInterval):
super().__init__(iv)
@@ -236,35 +239,39 @@ def __repr__(self):
class IntervalAnchorMatch(IntervalSelector):
- """Evaluates whether an alignment's start matches the feature's start, and vice versa for end."""
+ """Evaluates whether either end of the alignment's interval is anchored to
+ the feature with the non-anchored end nested within the feature's interval"""
+
+ __slots__ = () # Base class holds attributes
def __init__(self, iv: HTSeq.GenomicInterval):
super().__init__(iv)
- assert iv.strand not in ('+', '-')
def __contains__(self, alignment: dict):
"""The following diagram demonstrates unstranded anchored matching semantics.
Match | <------->|
- Match |<-----------|-->
+ Match |<-------> |
Match |<---------->|
No match | <---------|->
<-----------|<==feat_A==>|----------->
Args:
- alignment: An alignment dictionary containing strand, start, and end
-
- Returns: True if the alignment's 5' end is anchored to the strand-appropriate
- terminus of this feature's interval.
+ alignment: An alignment dictionary containing `Start` and `End` keys
"""
- return alignment['Start'] == self.start or alignment['End'] == self.end
+ return (alignment['Start'] == self.start and alignment['End'] <= self.end) \
+ or (alignment['End'] == self.end and alignment['Start'] >= self.start)
def __repr__(self):
return f""
+
class Interval5pMatch(IntervalSelector):
- """Evaluates whether an alignment's 5' end is anchored to the corresponding terminus of the feature"""
+ """Evaluates whether an alignment's 5' end is anchored to the corresponding terminus
+ of the feature, and the alignment's 3' end is nested within the feature's interval."""
+
+ __slots__ = () # Base class holds attributes
def __init__(self, iv: HTSeq.GenomicInterval):
super().__init__(iv)
@@ -274,7 +281,7 @@ def __contains__(self, alignment: dict):
Each "(no) match" label applies to BOTH feat_A AND feat_B.
No match | -------->|
- Match |------------|-->
+ No Match |------------|-->
Match |----------->|
Match |--------> |
(+) 5' ------------|==feat_A===>|-----------> 3'
@@ -282,26 +289,27 @@ def __contains__(self, alignment: dict):
(-) 3' <-----------|<===feat_B==|------------ 5'
| <--------| Match
|<-----------| Match
- <--|------------| Match
+ <--|------------| No Match
|<-------- | No match
Args:
- alignment: An alignment dictionary containing strand, start, and end
-
- Returns: True if the alignment's 5' end is anchored to the strand-appropriate
- terminus of this feature's interval.
+ alignment: An alignment dictionary containing `Strand`, `Start` and `End` keys
"""
if alignment['Strand'] is True:
- return alignment['Start'] == self.start
+ return alignment['Start'] == self.start and alignment['End'] <= self.end
else:
- return alignment['End'] == self.end
+ return alignment['End'] == self.end and alignment['Start'] >= self.start
def __repr__(self):
- return f"<5' anchored to {self.start} on (+) / {self.end} on (-)>"
+ return f"<5' anchored to {self.start} nested in {self.end} on (+) / " \
+ f"anchored to {self.end} nested in {self.start} on (-)>"
class Interval3pMatch(IntervalSelector):
- """Evaluates whether an alignment's 5' end is anchored to the corresponding terminus of the feature"""
+ """Evaluates whether an alignment's 3' end is anchored to the corresponding terminus
+ of the feature, and the alignment's 5' end is nested within the feature's interval."""
+
+ __slots__ = () # Base class holds attributes
def __init__(self, iv: HTSeq.GenomicInterval):
super().__init__(iv)
@@ -311,7 +319,7 @@ def __contains__(self, alignment):
Each "(no) match" label applies to BOTH feat_A AND feat_B.
No match |--------> |
- Match --|----------->|
+ No Match --|----------->|
Match |----------->|
Match | -------->|
(+) 5' ------------|==feat_A===>|-----------> 3'
@@ -319,19 +327,17 @@ def __contains__(self, alignment):
(-) 3' <-----------|<===feat_B==|------------ 5'
|<-------- | Match
|<-----------| Match
- |<-----------|-- Match
+ |<-----------|-- No Match
| <--------| No match
Args:
- alignment: An alignment dictionary containing strand, start, and end
-
- Returns: True if the alignment's 3' end is anchored to the strand-appropriate
- terminus of this feature's interval.
+ alignment: An alignment dictionary containing `Strand`, `Start` and `End` keys
"""
if alignment['Strand'] is True:
- return alignment['End'] == self.end
+ return alignment['End'] == self.end and alignment['Start'] >= self.start
else:
- return alignment['Start'] == self.start
+ return alignment['Start'] == self.start and alignment['End'] <= self.end
def __repr__(self):
- return f"<3' anchored to {self.end} on (+) / {self.start} on (-)>"
+ return f"<3' anchored to {self.end} nested in {self.start} on (+) / " \
+ f"anchored to {self.start} nested in {self.end} on (-)>"
diff --git a/tiny/rna/counter/statistics.py b/tiny/rna/counter/statistics.py
index f0d7793a..a3ebe0f9 100644
--- a/tiny/rna/counter/statistics.py
+++ b/tiny/rna/counter/statistics.py
@@ -519,7 +519,7 @@ def add_library(self, other: LibraryStats):
def write_output_logfile(self):
self.write_alignment_diags()
- self.write_selection_diags()
+ # self.write_selection_diags() Not currently collected
self.write_alignment_tables()
def write_alignment_diags(self):
diff --git a/tiny/rna/counter/validation.py b/tiny/rna/counter/validation.py
index 1605b62b..45f331f8 100644
--- a/tiny/rna/counter/validation.py
+++ b/tiny/rna/counter/validation.py
@@ -8,75 +8,7 @@
from tiny.rna.counter.hts_parsing import parse_gff, ReferenceFeatures, SAM_reader
from tiny.rna.counter.features import FeatureSelector
-from tiny.rna.util import sorted_natural, gzip_open, report_execution_time
-
-
-class ReportFormatter:
- error_header = "========[ ERROR ]=============================================="
- warning_header = "========[ WARNING ]============================================"
-
- def __init__(self, key_mapper):
- self.key_mapper = key_mapper
- self.warnings = []
- self.errors = []
-
- def print_report(self):
- for error_section in self.errors:
- print(self.error_header)
- print(error_section)
- print()
-
- for warning_section in self.warnings:
- print(self.warning_header)
- print(warning_section)
- print()
-
- def add_error_section(self, header, body=None):
- self.add_section(header, body, self.errors)
-
- def add_warning_section(self, header, body=None):
- self.add_section(header, body, self.warnings)
-
- def add_section(self, header, body, dest):
- if isinstance(body, dict):
- body = self.nested_dict(body)
- elif isinstance(body, list):
- body = '\n'.join(body)
- if body:
- dest.append('\n'.join([header, body]))
- else:
- dest.append(header)
-
- def nested_dict(self, summary, indent='\t'):
- report_lines = self.recursive_indent(summary, indent)
- return '\n'.join(report_lines)
-
- def recursive_indent(self, mapping: dict, indent: str):
- """Converts a nested dictionary into a properly indented list of lines
-
- Args:
- mapping: the nested dictionary to be formatted
- indent: the indent string to prepend to lines on the current level
- """
-
- lines = []
- for key, val in mapping.items():
- key_header = f"{indent}{self.key_mapper.get(key, key)}: "
- if isinstance(val, dict):
- lines.append(key_header)
- if len(val):
- lines.extend(self.recursive_indent(val, indent + '\t'))
- else:
- lines.append(indent + "\t" + "{ NONE }")
- elif isinstance(val, (list, set)):
- lines.append(key_header)
- if len(val):
- lines.extend([indent + '\t' + line for line in map(str, val)])
- else:
- lines.append(indent + "\t" + "[ NONE ]")
- else:
- lines.append(key_header + str(val))
- return lines
+from tiny.rna.util import ReportFormatter, sorted_natural, gzip_open
class GFFValidator:
diff --git a/tiny/rna/util.py b/tiny/rna/util.py
index 954ea668..cd0e4b6b 100644
--- a/tiny/rna/util.py
+++ b/tiny/rna/util.py
@@ -34,6 +34,74 @@ def wrapper(*args, **kwargs):
return timer
+class ReportFormatter:
+ error_header = "========[ ERROR ]=============================================="
+ warning_header = "========[ WARNING ]============================================"
+
+ def __init__(self, key_mapper):
+ self.key_mapper = key_mapper
+ self.warnings = []
+ self.errors = []
+
+ def print_report(self):
+ for error_section in self.errors:
+ print(self.error_header)
+ print(error_section)
+ print()
+
+ for warning_section in self.warnings:
+ print(self.warning_header)
+ print(warning_section)
+ print()
+
+ def add_error_section(self, header, body=None):
+ self.add_section(header, body, self.errors)
+
+ def add_warning_section(self, header, body=None):
+ self.add_section(header, body, self.warnings)
+
+ def add_section(self, header, body, dest):
+ if isinstance(body, dict):
+ body = self.nested_dict(body)
+ elif isinstance(body, list):
+ body = '\n'.join(body)
+ if body:
+ dest.append('\n'.join([header, body]))
+ else:
+ dest.append(header)
+
+ def nested_dict(self, summary, indent='\t'):
+ report_lines = self.recursive_indent(summary, indent)
+ return '\n'.join(report_lines)
+
+ def recursive_indent(self, mapping: dict, indent: str):
+ """Converts a nested dictionary into a properly indented list of lines
+
+ Args:
+ mapping: the nested dictionary to be formatted
+ indent: the indent string to prepend to lines on the current level
+ """
+
+ lines = []
+ for key, val in mapping.items():
+ key_header = f"{indent}{self.key_mapper.get(key, key)}: "
+ if isinstance(val, dict):
+ lines.append(key_header)
+ if len(val):
+ lines.extend(self.recursive_indent(val, indent + '\t'))
+ else:
+ lines.append(indent + "\t" + "{ NONE }")
+ elif isinstance(val, (list, set)):
+ lines.append(key_header)
+ if len(val):
+ lines.extend([indent + '\t' + line for line in map(str, val)])
+ else:
+ lines.append(indent + "\t" + "[ NONE ]")
+ else:
+ lines.append(key_header + str(val))
+ return lines
+
+
class SmartFormatter(argparse.HelpFormatter):
"""Properly formats argparse help string fields that contain newlines.