Merged
17 commits
7985ee4
Changing semantics of *anchored overlap selectors to require the non-…
AlexTate Feb 17, 2023
5dc9b10
Updating the IntervalAnchorMatch class to no longer require unstrande…
AlexTate Feb 18, 2023
64569e5
Renaming the "full" selector to "nested" to be more descriptive
AlexTate Feb 18, 2023
fe5132b
Updating documentation for overlap selectors to include the new nesti…
AlexTate Feb 18, 2023
3d1814a
Changing red text and lines to magenta, and existing magenta text to …
AlexTate Feb 19, 2023
8bd4061
Updating the Features Sheet to use "nested" instead of "full" (missed…
AlexTate Feb 19, 2023
7180a94
Adding wildcard support to the overlap selector. Wildcard keywords ('…
AlexTate Feb 19, 2023
e1b53ea
Changing the constructor signature for FeatureSelector to no longer r…
AlexTate Feb 19, 2023
3554fdf
Moving the ReportFormatter class out of validation.py and into util.p…
AlexTate Feb 19, 2023
b5848d2
FeatureSelector now logs the tagged feature ID, rule, and overlap def…
AlexTate Feb 19, 2023
a221d29
The Reference* classes now report matches that were dropped due to sh…
AlexTate Feb 19, 2023
3750563
Tuning the color and transparency of the larger shapes in the selecti…
AlexTate Feb 20, 2023
41d1d83
Clarifying edits to the line numbers and descriptions of Paths File c…
AlexTate Feb 20, 2023
11dd2f5
Correcting overlap illustration width for GitHub compatibility and ce…
AlexTate Feb 20, 2023
d116376
Overlap selectors are now cached (and retrieved) from the FeatureSele…
AlexTate Feb 21, 2023
940d09b
Dynamic attribute creation has been removed from overlap selectors, a…
AlexTate Feb 21, 2023
85fa791
Correcting usage of "full" instead of "nested" overlap selector, and …
AlexTate Feb 23, 2023
8 changes: 4 additions & 4 deletions START_HERE/features.csv
@@ -1,7 +1,7 @@
Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Strand,5' End Nucleotide,Length,Overlap
Class,mask,masked,,,1,both,all,all,Partial
Class,miRNA,miRNA,,,2,sense,all,16-24,5' anchored
- Class,piRNA,piRNA-5'A,,,2,both,A,24-32,Full
- Class,piRNA,piRNA-5'T,,,2,both,T,24-32,Full
- Class,siRNA,siRNA,,,2,both,all,15-22,Full
- Class,unk,unknown,,,3,both,all,all,Full
+ Class,piRNA,piRNA-5'A,,,2,both,A,24-32,Nested
+ Class,piRNA,piRNA-5'T,,,2,both,T,24-32,Nested
+ Class,siRNA,siRNA,,,2,both,all,15-22,Nested
+ Class,unk,unknown,,,3,both,all,all,Nested
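
Existing Features Sheets that still use `Full` would need the same substitution as the rows in this file. The following is a minimal sketch of that rename, not part of this PR. The helper name `rename_full_to_nested` is hypothetical, and it assumes the `Overlap` cell may carry comma-separated shift parameters after the keyword (as in the `nested, -5, 5` values used in this PR's unit tests). Whether `full` is still accepted by tiny-count after this change is not stated here.

```python
import csv
import io


def rename_full_to_nested(csv_text: str) -> str:
    """Return a copy of a Features Sheet with the `full` overlap keyword renamed to `Nested`.

    Hypothetical helper for illustration only; it is not provided by tinyRNA.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()

    for row in reader:
        # Split off the keyword so that any shift parameters after it are preserved
        keyword, sep, rest = (row.get("Overlap") or "").partition(",")
        if keyword.strip().lower() == "full":
            row["Overlap"] = "Nested" + sep + rest
        writer.writerow(row)

    return out.getvalue()
```

Rows with other overlap keywords, such as `Partial` or `5' anchored`, pass through unchanged.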
5 changes: 3 additions & 2 deletions START_HERE/tinyRNA_TUTORIAL.md
@@ -43,6 +43,7 @@ Expected runtime: ~10-60 minutes (expect longer runtimes if a bowtie index must
2. Move your GFF and genome sequence files into the reference_data directory.
3. Edit features.csv and samples.csv file for your datasets and selection parameters.
4. Edit paths.yml as follows:
- - line 46: `ebwt: ''` (no value)
- - line 50: `reference_genome_files: your-fasta-formatted-DNA-sequence-file`
+ - line 20: change the value after `path:` to point to your GFF or GTF file
+ - line 46: delete the value after `ebwt:`
+ - line 51: change the value after `- ` to point to your fasta formatted DNA sequence file
5. Run the pipeline with the command: `tiny run --config run_config.yml`
29 changes: 15 additions & 14 deletions doc/tiny-count.md
@@ -53,24 +53,25 @@ Wildcard values (`all`, `*`, or an empty cell) can be used in the `Select for...
Features overlapping a read alignment are selected based on their overlap characteristics. These matches are then sorted by hierarchy value before proceeding to Stage 3.

### Overlap
- This column allows you to specify the extent of overlap required for candidate feature selection:
- - `partial`: alignment overlaps feature by at least one base
- - `full`: alignment does not extend beyond either terminus of the feature
- - `exact`: alignment termini are equal to the feature's
- - `anchored`: alignment's start and/or end is equal to the feature's
- - `5' anchored`: alignment's 5' end is equal to the corresponding terminus of the feature
- - `3' anchored`: alignment's 3' end is equal to the corresponding terminus of the feature
+ This column allows you to specify the extent of overlap required for candidate feature selection. In order to be a candidate, a feature must reside on the same chromosome as the alignment and overlap its interval by at least 1 nucleotide. A shared strand is not required. See the [Strand](#strand) section in Stage 3 for selection by strand.

- In order to be a candidate, a feature must match a rule in Stage 1, reside on the same chromosome as the alignment, and must overlap the alignment by at least 1 nucleotide.
-
- #### Strandedness and the Overlap Selector
- A feature does not have to be on the same strand as the alignment in order to be a candidate. See the [Strand](#strand) section in Stage 3 for selection by strand. Unstranded features will have `5' anchored` and `3' anchored` overlap selectors downgraded to `anchored` selectors. Alignments overlapping these features are evaluated for shared start and/or end coordinates, but 5'/3' ends are not distinguished.
+ #### Unstranded Features
+ If these features match rules with `5' anchored` and `3' anchored` overlap selectors, they will be downgraded to `anchored` selectors. Alignments overlapping these features are evaluated for shared start and/or end coordinates, but 5' and 3' ends are not distinguished.

#### Selector Demonstration

- The following diagrams demonstrate the strand semantics of these interval selectors. The first two options show separate illustrations for features on each strand for emphasis. All matches shown in the remaining three options apply to features on either strand.
- ![3'_anchored_5'_anchored](../images/3'_anchored_5'_anchored_interval.png)
- ![Full_Exact_Partial](../images/full_exact_partial_interval.png)
+ The following table provides a description and illustration of the available overlap selectors. All matches apply to features on either strand, i.e. matches shown below the antisense strand also apply, as shown, to the feature on the sense strand, and vice versa.
+
+ | Keyword and Description | Illustration |
+ |:--------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------:|
+ | `partial`: alignment overlaps feature by at least one base | <img src="../images/overlap_selectors/partial.png" alt="Partial" width=400 /> |
+ | `nested`: alignment does not extend beyond either terminus of the feature | <img src="../images/overlap_selectors/nested.png" alt="Nested" width=400 /> |
+ | `exact`: alignment termini are equal to the feature's | <img src="../images/overlap_selectors/exact.png" alt="Exact" width=400 /> |
+ | `anchored`: alignment is nested with start and/or end equal to the feature's | <img src="../images/overlap_selectors/anchored.png" alt="Anchored" width=400 /> |
+ | `5' anchored`: alignment is nested with 5' end equal to the corresponding terminus of the feature | <img src="../images/overlap_selectors/5'_anchored.png" alt="5' anchored" width=400 /> |
+ | `3' anchored`: alignment is nested with 3' end equal to the corresponding terminus of the feature | <img src="../images/overlap_selectors/3'_anchored.png" alt="3' anchored" width=400 /> |
+
+ :people_holding_hands: Illustration colors have been selected for colorblindness accessibility.

### Hierarchy
Each rule must be assigned a hierarchy value. This value is used to sort Stage 2 matches so that matches with smaller hierarchy values take precedence in Stage 3.
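
The selector semantics documented in the updated `doc/tiny-count.md` table can be sketched as plain coordinate comparisons. This is a minimal illustration under stated assumptions, not tinyRNA's implementation (the PR's selector classes are not shown here). The function name `overlap_matches`, the half-open `[start, end)` coordinates, and the use of the feature's strand alone to decide which terminus is the 5' end are all assumptions made for this sketch.

```python
from typing import Optional


def overlap_matches(selector: str, feat_start: int, feat_end: int,
                    feat_strand: Optional[str], aln_start: int, aln_end: int) -> bool:
    """Return True if the alignment satisfies `selector` for the feature.

    Hypothetical helper. Coordinates are assumed half-open [start, end);
    feat_strand is '+', '-', or None for an unstranded feature.
    """
    partial = aln_start < feat_end and feat_start < aln_end    # at least one shared base
    nested = feat_start <= aln_start and aln_end <= feat_end   # no spill past either terminus
    same_start = aln_start == feat_start
    same_end = aln_end == feat_end

    # Unstranded features: 5'/3' anchored selectors are downgraded to `anchored`
    if feat_strand is None and selector in ("5' anchored", "3' anchored"):
        selector = "anchored"

    if selector == "partial":
        return partial
    if selector == "nested":
        return nested
    if selector == "exact":
        return same_start and same_end
    if selector == "anchored":
        return nested and (same_start or same_end)
    if selector == "5' anchored":
        # Assumption: the feature's 5' terminus is its start on (+) and its end on (-)
        return nested and (same_start if feat_strand == "+" else same_end)
    if selector == "3' anchored":
        return nested and (same_end if feat_strand == "+" else same_start)
    raise ValueError(f"Unknown overlap selector: {selector}")
```

With a (+) feature spanning 5 to 10, an alignment spanning 5 to 11 shares the feature's 5' terminus but is not nested, so `overlap_matches("5' anchored", 5, 10, "+", 5, 11)` is `False`. This is the behavior change that the updated 5' and 3' anchored unit tests in this PR assert.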
Binary file removed images/3'_anchored_5'_anchored_interval.png
Binary file not shown.
Binary file removed images/full_exact_partial_interval.png
Binary file not shown.
Binary file added images/overlap_selectors/3'_anchored.png
Binary file added images/overlap_selectors/5'_anchored.png
Binary file added images/overlap_selectors/anchored.png
Binary file added images/overlap_selectors/exact.png
Binary file added images/overlap_selectors/nested.png
Binary file added images/overlap_selectors/partial.png
Binary file modified images/tiny-count_selection.png
10 changes: 5 additions & 5 deletions tests/testdata/config_files/features.csv
@@ -1,7 +1,7 @@
Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Strand,5' End Nucleotide,Length,Overlap
Class,mask,,,,1,both,all,all,Partial
- Class,miRNA,,,,2,sense,all,16-22,Full
- Class,piRNA,5pA,,,2,both,A,24-32,Full
- Class,piRNA,5pT,,,2,both,T,24-32,Full
- Class,siRNA,,,,2,both,all,15-22,Full
- Class,unk,,,,3,both,all,all,Full
+ Class,miRNA,,,,2,sense,all,16-22,Nested
+ Class,piRNA,5pA,,,2,both,A,24-32,Nested
+ Class,piRNA,5pT,,,2,both,T,24-32,Nested
+ Class,siRNA,,,,2,both,all,15-22,Nested
+ Class,unk,,,,3,both,all,all,Nested
58 changes: 29 additions & 29 deletions tests/unit_tests_features.py
@@ -24,7 +24,7 @@ def setUpClass(self):
def make_feature_for_interval_test(self, iv_rule, feat_id, chrom, strand: str, start, stop):
feat_iv = HTSeq.GenomicInterval(chrom, start, stop, strand)
rules = [dict(deepcopy(rules_template[0]), Overlap=iv_rule, Identity=('*', '*'), nt5end='*', Length='*')]
- fs = FeatureSelector(rules, LibraryStats(normalize_by_hits=True))
+ fs = FeatureSelector(rules)

# Feature with specified coordinates, matching rule 0 with hierarchy 0 and the appropriate selector for iv_rule
selectors = fs.build_interval_selectors(feat_iv, [(0, 0, iv_rule)])
@@ -187,12 +187,12 @@ def test_count_reads_generic(self, mock):
instance.assign_features.assert_called_once_with("mock_alignment")
instance.stats.assert_has_calls(expected_calls_to_stats)

- """Does FeatureSelector.choose() correctly select features defining `full` interval matching rules?"""
+ """Does FeatureSelector.choose() correctly select features defining `nested` interval matching rules?"""

- def test_feature_selector_full_interval(self):
- iv_rule = 'full'
+ def test_feature_selector_nested_interval(self):
+ iv_rule = 'nested'
chrom, strand, start, stop = "n/a", "+", 5, 10
- feat, fs = self.make_feature_for_interval_test(iv_rule, "Full Overlap", chrom, strand, start, stop)
+ feat, fs = self.make_feature_for_interval_test(iv_rule, "Nested Overlap", chrom, strand, start, stop)

aln_base = {'Seq': 'ATGC', 'Chrom': chrom, 'Strand': strand_to_bool(strand)}
aln_spill_lo = make_parsed_sam_record(**dict(aln_base, Start=start - 1, Name="spill"))
@@ -213,9 +213,9 @@ def test_feature_selector_full_interval(self):

self.assertEqual(fs.choose(feat, aln_spill_lo), {})
self.assertEqual(fs.choose(feat, aln_spill_hi), {})
- self.assertEqual(fs.choose(feat, aln_contained), {"Full Overlap": {0}})
- self.assertEqual(fs.choose(feat, aln_contained_lo), {"Full Overlap": {0}})
- self.assertEqual(fs.choose(feat, aln_contained_hi), {"Full Overlap": {0}})
+ self.assertEqual(fs.choose(feat, aln_contained), {"Nested Overlap": {0}})
+ self.assertEqual(fs.choose(feat, aln_contained_lo), {"Nested Overlap": {0}})
+ self.assertEqual(fs.choose(feat, aln_contained_hi), {"Nested Overlap": {0}})

"""Does FeatureSelector.choose() correctly select features with `partial` interval matching rules?"""

@@ -284,15 +284,15 @@ def test_feature_selector_5p_interval(self):

"""
No match | 6 -------->| 10 aln_none
- Match 5 |------------|--> 11 aln_long
+ No match 5 |------------|--> 11 aln_long
Match 5 |----------->| 10 aln_exact
Match 5 |--------> 9 | aln_short
(+) 5' -------------|==feat_A===>|-----------> 3'
start = 5 end = 10
(-) 3' <------------|<===feat_B==|------------ 5'
| 6 <--------| 10 Match
5 |<-----------| 10 Match
- 4 <--|------------| 10 Match
+ 4 <--|------------| 10 No match
5 |<-------- 9 | No match
"""

@@ -307,7 +307,7 @@ def test_feature_selector_5p_interval(self):
}

self.assertEqual(fs.choose(feat_A, aln['aln_none']), {})
- self.assertEqual(fs.choose(feat_A, aln['aln_long']), {"5' Anchored Overlap (+)": {0}})
+ self.assertEqual(fs.choose(feat_A, aln['aln_long']), {})
self.assertEqual(fs.choose(feat_A, aln['aln_exact']), {"5' Anchored Overlap (+)": {0}})
self.assertEqual(fs.choose(feat_A, aln['aln_short']), {"5' Anchored Overlap (+)": {0}})

@@ -319,7 +319,7 @@ def test_feature_selector_5p_interval(self):
aln['aln_none'].update({'Start': 5, 'End': 9, 'Strand': False})

self.assertEqual(fs.choose(feat_B, aln['aln_none']), {})
- self.assertEqual(fs.choose(feat_B, aln['aln_long']), {"5' Anchored Overlap (-)": {0}})
+ self.assertEqual(fs.choose(feat_B, aln['aln_long']), {})
self.assertEqual(fs.choose(feat_B, aln['aln_exact']), {"5' Anchored Overlap (-)": {0}})
self.assertEqual(fs.choose(feat_B, aln['aln_short']), {"5' Anchored Overlap (-)": {0}})

@@ -330,17 +330,17 @@ def test_feature_selector_3p_interval(self):
chrom, start, end = "n/a", 5, 10

"""
- No match 5 |--------> 9 | aln_none
- Match 4 --|----------->| 10 aln_long
- Match 5 |----------->| 10 aln_exact
- Match | 6 -------->| 10 aln_short
- (+) 5' -------------|==feat_A===>|-----------> 3'
- start = 5 end = 10
- (-) 3' <------------|<===feat_B==|------------ 5'
- 5 |<-------- 9 | Match
- 5 |<-----------| 10 Match
- 5 |<-----------|-- 11 Match
- | 6 <--------| 10 No match
+ No match 5 |--------> 9 | aln_none
+ No match 4 --|----------->| 10 aln_long
+ Match 5 |----------->| 10 aln_exact
+ Match | 6 -------->| 10 aln_short
+ (+) 5' --------------|==feat_A===>|-----------> 3'
+ start = 5 end = 10
+ (-) 3' <-------------|<===feat_B==|------------ 5'
+ 5 |<-------- 9 | Match
+ 5 |<-----------| 10 Match
+ 5 |<-----------|-- 11 No match
+ | 6 <--------| 10 No match
"""

# Test feat_A on (+) strand
@@ -354,7 +354,7 @@ def test_feature_selector_3p_interval(self):
}

self.assertEqual(fs.choose(feat_A, aln['aln_none']), {})
- self.assertEqual(fs.choose(feat_A, aln['aln_long']), {"3' Anchored Overlap (+)": {0}})
+ self.assertEqual(fs.choose(feat_A, aln['aln_long']), {})
self.assertEqual(fs.choose(feat_A, aln['aln_exact']), {"3' Anchored Overlap (+)": {0}})
self.assertEqual(fs.choose(feat_A, aln['aln_short']), {"3' Anchored Overlap (+)": {0}})

@@ -366,7 +366,7 @@ def test_feature_selector_3p_interval(self):
aln['aln_none'].update({'Start': 6, 'End': 10, 'Strand': False})

self.assertEqual(fs.choose(feat_B, aln['aln_none']), {})
- self.assertEqual(fs.choose(feat_B, aln['aln_long']), {"3' Anchored Overlap (-)": {0}})
+ self.assertEqual(fs.choose(feat_B, aln['aln_long']), {})
self.assertEqual(fs.choose(feat_B, aln['aln_exact']), {"3' Anchored Overlap (-)": {0}})
self.assertEqual(fs.choose(feat_B, aln['aln_short']), {"3' Anchored Overlap (-)": {0}})

@@ -382,7 +382,7 @@ def test_wildcard_identities(self):

rules = [*one, *two, *non, *dup]

- actual = FeatureSelector(rules, LibraryStats(normalize_by_hits=True)).inv_ident
+ actual = FeatureSelector(rules).inv_ident
expected = {
(Wildcard(), non_wild): [0],
(non_wild, Wildcard()): [1],
@@ -395,17 +395,17 @@ def test_wildcard_identities(self):
"""Does FeatureSelector build both shifted and unshifted selectors and group them by resulting interval?"""

def test_build_interval_selectors_grouping(self):
- fs = FeatureSelector(deepcopy(rules_template), LibraryStats())
+ fs = FeatureSelector(deepcopy(rules_template))
iv = HTSeq.GenomicInterval('I', 10, 20, '+')

match_tuples = [('n/a', 'n/a', 'partial'),
- ('n/a', 'n/a', 'full'),
+ ('n/a', 'n/a', 'nested'),
('n/a', 'n/a', 'exact'),
('n/a', 'n/a', "5' anchored"),
('n/a', 'n/a', "3' anchored"),
# iv_shifted_1 Shift values:
('n/a', 'n/a', 'partial, -5, 5'), # 5': -5 3': 5
- ('n/a', 'n/a', 'full, -5, 5'), # 5': -5 3': 5
+ ('n/a', 'n/a', 'nested, -5, 5'), # 5': -5 3': 5
# iv_shifted_2
('n/a', 'n/a', 'exact, -10, 10'), # 5': -10 3': 10
# iv_shifted_3
21 changes: 7 additions & 14 deletions tests/unit_tests_hts_parsing.py
@@ -18,13 +18,6 @@
resources = "./testdata/counter"


- class MockFeatureSelector:
- def __init__(self, rules_table):
- self.rules_table = FeatureSelector.build_selectors(rules_table)
- self.inv_ident = FeatureSelector.build_inverted_identities(self.rules_table)
- self.build_interval_selectors = FeatureSelector.build_interval_selectors
-
-
class MyTestCase(unittest.TestCase):

@classmethod
@@ -54,7 +47,7 @@ def selector_with_template(self, updates_list):
rules = [deepcopy(helpers.rules_template[0]) for _ in range(len(updates_list))]
for changes, template in zip(updates_list, rules):
template.update(changes)
- return MockFeatureSelector(rules)
+ return FeatureSelector(rules)

def exhaust_iterator(self, it):
collections.deque(it, maxlen=0)
@@ -137,7 +130,7 @@ def test_ref_tables_single_feature(self):
feature_selector = self.selector_with_template([
# Fails to match due to Identity selector
{'Identity': ("Class", "CSR"), 'Strand': "sense", 'Hierarchy': 1, 'Class': 'none', 'nt5end': "all",
- 'Overlap': 'full', 'Length': "20"},
+ 'Overlap': 'nested', 'Length': "20"},
# Match
{'Identity': ("biotype", "snoRNA"), 'Strand': "antisense", 'Hierarchy': 2, 'Class': 'tag', 'nt5end': "all",
'Overlap': 'partial', 'Length': "30"}
@@ -161,7 +154,7 @@ def test_ref_tables_single_feature_all_features_false(self):
feature_selector = self.selector_with_template([
# Fails to match due to Identity selector
{'Identity': ("Class", "CSR"), 'Strand': "sense", 'Hierarchy': 1, 'Class': 'none', 'nt5end': "all",
- 'Overlap': 'full', 'Length': "20"},
+ 'Overlap': 'nested', 'Length': "20"},
# Match
{'Identity': ("biotype", "snoRNA"), 'Strand': "antisense", 'Hierarchy': 2, 'Class': 'tag', 'nt5end': "all",
'Overlap': 'partial', 'Length': "30"}
@@ -183,7 +176,7 @@ def test_ref_tables_missing_name_attribute_all_features_false(self):
kwargs = {'all_features': False}
bad = "bad_name_attribute"
feature_source = {self.short_gff_file: [bad]}
- feature_selector = MockFeatureSelector([])
+ feature_selector = FeatureSelector([])

expected_err = "No features were retained while parsing your GFF file.\n" \
"This may be due to a lack of features matching 'Select for...with value...'"
@@ -221,7 +214,7 @@ def test_ref_tables_alias_multisource_concat(self):

# Notice: screening for "ID" name attribute happens earlier in counter.load_config()
expected_alias = {"Gene:WBGene00023193": ("additional_class", "Gene:WBGene00023193", "unknown")}
- _, alias, _ = ReferenceFeatures(feature_source, **kwargs).get(MockFeatureSelector([]))
+ _, alias, _ = ReferenceFeatures(feature_source, **kwargs).get(FeatureSelector([]))

self.assertDictEqual(alias, expected_alias)

@@ -236,7 +229,7 @@ def test_ref_tables_alias_multisource_concat_all_features_false(self):

with self.assertRaisesRegex(ValueError, expected_err):
# No aliases saved due to all_features=False and the lack of identity matches
- _, alias, _ = ReferenceFeatures(feature_source, **kwargs).get(MockFeatureSelector([]))
+ _, alias, _ = ReferenceFeatures(feature_source, **kwargs).get(FeatureSelector([]))

"""Does ReferenceFeatures.get() properly concatenate identity match tuples when multiple GFF files define
matches for a feature?"""
@@ -762,7 +755,7 @@ def test_gff_megazord(self):
'Overlap': 'partial', 'Strand': '', 'nt5end': '', 'Length': ''}]
files = {gff.format(ftype): [] for gff in self.genomes.values() for ftype in ('gff3', 'gtf')}

- fs = FeatureSelector(rules, LibraryStats())
+ fs = FeatureSelector(rules)
rt = ReferenceFeatures(files)

# The test is passed if this command