Merged
21 commits
c4b5ff4
The SAM_reader class now uses pysam for the initial parsing step. The…
AlexTate Mar 24, 2023
50af6e3
Adding some optimizations to the pure-python implementation of _parse…
AlexTate Mar 31, 2023
7edae91
The _parse_alignments() method of SAM_reader has been replaced by a d…
AlexTate Mar 31, 2023
e8b897f
The Mismatches selector has been added to tiny-count, the CSVReader, …
AlexTate Mar 31, 2023
34801f2
Fixing package import issues for tiny.rna.counter.parsing and adding …
AlexTate Apr 3, 2023
f2ad1a9
- Small refactors to make unit testing a little easier
AlexTate Apr 3, 2023
4934793
Adding a check to _gather_metadata(). The first alignment is inspecte…
AlexTate Apr 3, 2023
60bdbda
Updating SamSqValidator.read_sq_headers() to use pysam directly (not …
AlexTate Apr 3, 2023
035f4b5
Backtracking on a recent change to CSVReader.check_backward_compatibi…
AlexTate Apr 3, 2023
80aac9d
Updates for type hints and docstrings
AlexTate Apr 3, 2023
0140684
Updating Features Sheets in testdata and template folders. Mismatches…
AlexTate Apr 3, 2023
d6fd8f4
Updating unit tests for the new Features Sheet and match tuple formats
AlexTate Apr 3, 2023
b6857f8
Adding a unit test for SamSqValidator.read_sq_headers()
AlexTate Apr 3, 2023
ee74985
Updating non-SAM_reader tests to use the new match tuple format
AlexTate Apr 3, 2023
fc7039e
Updating SAM_reader tests for the new alignment dictionary format (by…
AlexTate Apr 3, 2023
414991b
Updating SAM_reader tests for the new class design
AlexTate Apr 3, 2023
a941c63
Adding sq_headers.sam testfile for unit_tests_validation.py
AlexTate Apr 3, 2023
c145954
Bugfixes for decollapsed outputs and sequence-based counting mode
AlexTate Apr 4, 2023
2a1c80e
Final polish updates to AlignmentIter and a related unit test
AlexTate Apr 5, 2023
ee1bfcb
Documentation updates for the Mismatches selector
AlexTate Apr 5, 2023
4d87ff1
Typo correction
AlexTate Apr 5, 2023
14 changes: 7 additions & 7 deletions START_HERE/features.csv
@@ -1,7 +1,7 @@
-Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Strand,5' End Nucleotide,Length,Overlap
-Class,mask,masked,,,1,both,all,all,Partial
-Class,miRNA,miRNA,,,2,sense,all,16-24,5' anchored
-Class,piRNA,piRNA-5'A,,,2,both,A,24-32,Nested
-Class,piRNA,piRNA-5'T,,,2,both,T,24-32,Nested
-Class,siRNA,siRNA,,,2,both,all,15-22,Nested
-Class,unk,unknown,,,3,both,all,all,Nested
+Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Overlap,Mismatches,Strand,5' End Nucleotide,Length
+Class,mask,masked,,,1,Partial,,both,all,all
+Class,miRNA,miRNA,,,2,5' anchored,,sense,all,16-24
+Class,piRNA,piRNA-5'A,,,2,Nested,,both,A,24-32
+Class,piRNA,piRNA-5'T,,,2,Nested,,both,T,24-32
+Class,siRNA,siRNA,,,2,Nested,,both,all,15-22
+Class,unk,unknown,,,3,Nested,,both,all,all
1 change: 1 addition & 0 deletions doc/Configuration.md
@@ -157,6 +157,7 @@ Selectors in the Features Sheet can be specified as a single value, a list of co
| `Type Filter` | ✓ | ✓ | ✓ | |
| `Hierarchy` | | ✓ | | |
| `Overlap` | ✓ | ✓ | | |
| `Mismatches` | ✓ | ✓ | ✓ | ✓ |
| `Strand` | ✓ | ✓ | | |
| `5' nt` | ✓ | ✓ | ✓ | |
| `Length` | ✓ | ✓ | ✓ | ✓ |
6 changes: 6 additions & 0 deletions doc/tiny-count.md
@@ -86,6 +86,12 @@ selector, M, N
#### Unstranded Features
If these features match rules with `5' anchored` and `3' anchored` overlap selectors, they will be downgraded to `anchored` selectors. Alignments overlapping these features are evaluated for shared start and/or end coordinates, but 5' and 3' ends are not distinguished.

### Mismatches
The Mismatches column allows you to place constraints on the edit distance, i.e., the number of mismatches and indels, between the alignment and the reference. The Mismatches definition is explicit: a value of 3 means exactly 3, not 3 or fewer. Definitions support ranges (e.g., 0-3), lists (e.g., 1, 3), wildcards, and single values.
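
A minimal sketch of how a Mismatches definition could be interpreted, assuming that an empty cell or `all` acts as the wildcard, that ranges are inclusive, and that lists may mix single values and ranges. `parse_mismatches` is a hypothetical name and the `Wildcard` class here is a simplified stand-in; neither is taken from tiny-count's implementation.

```python
class Wildcard:
    """Simplified stand-in: every edit distance is considered a match."""

    def __contains__(self, value):
        return True


def parse_mismatches(definition: str):
    """Return an object that supports `edit_distance in result`.

    Assumed syntax: "" or "all" is a wildcard, "3" is a single value,
    "0-3" is an inclusive range, and "1, 3" is a list.
    """
    text = definition.strip().lower()
    if text in ("", "all"):
        return Wildcard()

    allowed = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            allowed.update(range(low, high + 1))  # inclusive upper bound
        else:
            allowed.add(int(part))
    return allowed


# A value of 3 means exactly 3, so an alignment with 2 mismatches is rejected
print(3 in parse_mismatches("3"), 2 in parse_mismatches("3"))
```

Returning a plain set keeps the membership test explicit: only the listed edit distances pass, which mirrors the "exactly 3" behavior described above.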

#### Edit Distance Determination
An alignment's edit distance is determined from its NM tag. If the first alignment in a SAM file doesn't have an NM tag, then the edit distance is calculated from the CIGAR string for all subsequent alignments in the file. If the first alignment has an NM tag, then any subsequent alignments missing the tag are assigned a default edit distance of 0.
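
The decision rule above can be sketched in plain Python, without pysam. All function names are hypothetical, and records are modeled as `(cigar, optional_fields)` pairs purely for illustration. One caveat applies to the CIGAR fallback: an `M` operation covers both matches and mismatches, so this sketch can only count bases in `I`, `D`, and `X` operations. How tiny-count handles that case is not shown in this diff.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def nm_from_tags(tags):
    """Return the NM value from optional fields such as 'NM:i:2', else None."""
    for tag in tags:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def distance_from_cigar(cigar):
    """Sum the lengths of I, D, and X operations (assumed fallback rule)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "IDX")


def edit_distances(records):
    """Yield one edit distance per (cigar, tags) record.

    The first record decides the strategy for the whole file:
    NM tags if it has one, otherwise the CIGAR string.
    """
    use_nm = None
    for cigar, tags in records:
        nm = nm_from_tags(tags)
        if use_nm is None:
            use_nm = nm is not None
        if use_nm:
            yield nm if nm is not None else 0  # missing tag defaults to 0
        else:
            yield distance_from_cigar(cigar)


with_nm = [("21M", ["NM:i:1"]), ("21M", []), ("10M1I10M", ["NM:i:1"])]
print(list(edit_distances(with_nm)))
```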

### Hierarchy
Each rule must be assigned a hierarchy value. This value is used to sort Stage 2 matches so that matches with smaller hierarchy values take precedence in Stage 3.
- Each feature can have multiple hierarchy values if it matched more than one rule during Stage 1 selection
12 changes: 8 additions & 4 deletions setup.py
@@ -1,6 +1,7 @@
#!/usr/bin/env python
import os
import sys
+import pysam
import setuptools

from setuptools.command.install import install
@@ -40,9 +41,10 @@ def get_cython_extension_defs():
error out if there are build issues, and therefore must be used as optional imports."""

pyx_files = [
-# (file path, optional)
-('tiny/rna/counter/stepvector/_stepvector.pyx', True),
-('tests/cython_tests/stepvector/test_cython.pyx', True)
+# (file path, optional, include)
+('tiny/rna/counter/stepvector/_stepvector.pyx', True, []),
+('tests/cython_tests/stepvector/test_cython.pyx', True, []),
+('tiny/rna/counter/parsing/alignments.pyx', False, pysam.get_include())
]

cxx_extension_args = {
@@ -61,9 +63,10 @@ def get_cython_extension_defs():
return [setuptools.Extension(
pyx_filename.replace('./', '').replace('/', '.').rstrip('.pyx'),
sources=[pyx_filename],
+include_dirs=include,
optional=optional,
**cxx_extension_args)
-for pyx_filename, optional in pyx_files]
+for pyx_filename, optional, include in pyx_files]


def get_macos_sdk_path():
@@ -120,6 +123,7 @@ def get_macos_sdk_path():
ext_modules=cythonize(
get_cython_extension_defs(),
compiler_directives={'language_level': '3'},
+include_path=pysam.get_include(),
gdb_debug=False
),
scripts=scripts,
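
A small sketch of the path-to-module-name mapping used when building the extension list, assuming each entry is a `(file path, optional, include dirs)` tuple as in the hunk above. `module_name` is a hypothetical helper and the include directory is a placeholder. It strips the `.pyx` extension as a suffix because `str.rstrip('.pyx')` removes any trailing run of the characters `.`, `p`, `y`, and `x`. The paths in this PR are unaffected by that, but a file such as `proxy.pyx` would be shortened to `pro`.

```python
import os


def module_name(pyx_path: str) -> str:
    """Convert 'pkg/sub/mod.pyx' into the dotted module name 'pkg.sub.mod'."""
    stem, _ext = os.path.splitext(pyx_path)  # drops only the final extension
    if stem.startswith("./"):
        stem = stem[2:]
    return stem.replace("/", ".")


# (file path, optional, include dirs); the include directory is a placeholder
pyx_files = [
    ("tiny/rna/counter/stepvector/_stepvector.pyx", True, []),
    ("tiny/rna/counter/parsing/alignments.pyx", False, ["/path/to/pysam/include"]),
]

# Keyword arguments that could be passed on to setuptools.Extension
extension_kwargs = [
    {"name": module_name(path), "sources": [path],
     "include_dirs": include, "optional": optional}
    for path, optional, include in pyx_files
]

for kwargs in extension_kwargs:
    print(kwargs["name"], kwargs["optional"])
```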
14 changes: 7 additions & 7 deletions tests/testdata/config_files/features.csv
@@ -1,7 +1,7 @@
-Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Strand,5' End Nucleotide,Length,Overlap
-Class,mask,,,,1,both,all,all,Partial
-Class,miRNA,,,,2,sense,all,16-22,Nested
-Class,piRNA,5pA,,,2,both,A,24-32,Nested
-Class,piRNA,5pT,,,2,both,T,24-32,Nested
-Class,siRNA,,,,2,both,all,15-22,Nested
-Class,unk,,,,3,both,all,all,Nested
+Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Overlap,Mismatches,Strand,5' End Nucleotide,Length
+Class,mask,,,,1,Partial,0,both,all,all
+Class,miRNA,,,,2,Nested,0,sense,all,16-22
+Class,piRNA,5pA,,,2,Nested,0,both,A,24-32
+Class,piRNA,5pT,,,2,Nested,0,both,T,24-32
+Class,siRNA,,,,2,Nested,0,both,all,15-22
+Class,unk,,,,3,Nested,0,both,all,all
4 changes: 4 additions & 0 deletions tests/testdata/counter/validation/sam/sq_headers.sam
@@ -0,0 +1,4 @@
@HD VN:1.0 SO:unsorted
@SQ SN:I LN:123
@SQ SN:II LN:456
@SQ SN:III
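
The third `@SQ` line omits `LN`, a field the SAM specification requires, so this file presumably exercises handling of incomplete headers. Below is a pure-Python sketch of collecting `@SQ` fields from tab-separated header lines while tolerating a missing `LN`. `parse_sq_lines` is a hypothetical helper; the PR's `SamSqValidator.read_sq_headers()` uses pysam, and its return format is not shown here.

```python
def parse_sq_lines(lines):
    """Collect @SQ header fields as dicts, e.g. {'SN': 'I', 'LN': 123}."""
    headers = []
    for line in lines:
        if not line.startswith("@"):
            break  # the header section ends at the first alignment line
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue
        record = dict(field.split(":", 1) for field in fields[1:])
        if "LN" in record:
            record["LN"] = int(record["LN"])
        headers.append(record)
    return headers


sam_text = (
    "@HD\tVN:1.0\tSO:unsorted\n"
    "@SQ\tSN:I\tLN:123\n"
    "@SQ\tSN:II\tLN:456\n"
    "@SQ\tSN:III\n"
)
headers = parse_sq_lines(sam_text.splitlines())
print(headers)
```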
12 changes: 7 additions & 5 deletions tests/unit_test_helpers.py
@@ -22,9 +22,10 @@
'Filter_t': "",
'Strand': "both",
'Hierarchy': 0,
+'Overlap': "partial",
+'Mismatch': "",
'nt5end': "all",
-'Length': "all", # A string is expected by FeatureSelector due to support for lists and ranges
-'Overlap': "partial"}]
+'Length': "all",}] # A string is expected by FeatureSelector due to support for lists and ranges


def csv_factory(type: str, rows: List[dict], header=()):
@@ -130,7 +131,7 @@ def get_dir_checksum_tree(root_path: str) -> dict:
return dir_tree


-def make_parsed_sam_record(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand:bool = True):
+def make_parsed_sam_record(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand=True, NM=0):
return {
"Name": Name,
"Length": len(Seq),
@@ -139,7 +140,8 @@ def make_parsed_sam_record(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom=
"Chrom": Chrom,
"Start": Start,
"End": Start + len(Seq),
-"Strand": Strand
+"Strand": Strand,
+"NM": NM
}


@@ -186,7 +188,7 @@ def strand_to_bool(strand):


# For performing reverse complement
-complement = bytes.maketrans(b'ACGTacgt', b'TGCAtgca')
+complement = str.maketrans('ACGTacgt', 'TGCAtgca')


class ShellCapture:
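
The switch to `str.maketrans` suggests that the test helpers now handle sequences as `str` rather than `bytes`. A brief sketch of how such a table is typically used for reverse complementing; `reverse_complement` is a hypothetical helper name, not necessarily the one used in the test suite.

```python
# Translation table for str sequences; unmapped characters (e.g. N) pass through
complement = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement(seq: str) -> str:
    """Reverse the sequence, then complement each base."""
    return seq[::-1].translate(complement)


print(reverse_complement("AACGt"))
```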
6 changes: 4 additions & 2 deletions tests/unit_tests_counter.py
@@ -36,10 +36,11 @@ def setUpClass(self):
'Filter_s': "",
'Filter_t': "",
'Hierarchy': "1",
+'Overlap': "Partial",
+'Mismatch': "",
'Strand': "antisense",
"nt5end": '"C,G,U"', # Needs to be double-quoted due to commas
'Length': "all",
-'Overlap': "Partial"
}

# Represents the parsed Features Sheet row above
@@ -51,10 +52,11 @@ def setUpClass(self):
'Filter_s': _row['Filter_s'],
'Filter_t': _row['Filter_t'],
'Hierarchy': int(_row['Hierarchy']),
+'Overlap': _row['Overlap'].lower(),
+'Mismatch': _row['Mismatch'],
'Strand': _row['Strand'],
'nt5end': _row["nt5end"].upper().translate({ord('U'): 'T'}),
'Length': _row['Length'],
-'Overlap': _row['Overlap'].lower()
}]

# Represents an unparsed Samples Sheet row
22 changes: 11 additions & 11 deletions tests/unit_tests_features.py
@@ -27,7 +27,7 @@ def make_feature_for_interval_test(self, iv_rule, feat_id, chrom, strand: str, s
fs = FeatureSelector(rules)

# Feature with specified coordinates, matching rule 0 with hierarchy 0 and the appropriate selector for iv_rule
-selectors = fs.build_interval_selectors(feat_iv, [(0, 0, iv_rule)])
+selectors = fs.build_interval_selectors(feat_iv, [(0, 0, iv_rule, Wildcard())])
match_tuple = (selectors[feat_iv][0],)
feat = {(feat_id, strand_to_bool(strand), match_tuple)}

@@ -398,19 +398,19 @@ def test_build_interval_selectors_grouping(self):
fs = FeatureSelector(deepcopy(rules_template))
iv = HTSeq.GenomicInterval('I', 10, 20, '+')

-match_tuples = [('n/a', 'n/a', 'partial'),
-('n/a', 'n/a', 'nested'),
-('n/a', 'n/a', 'exact'),
-('n/a', 'n/a', "5' anchored"),
-('n/a', 'n/a', "3' anchored"),
+match_tuples = [('n/a', 'n/a', 'partial', 'n/a'),
+('n/a', 'n/a', 'nested', 'n/a'),
+('n/a', 'n/a', 'exact', 'n/a'),
+('n/a', 'n/a', "5' anchored", 'n/a'),
+('n/a', 'n/a', "3' anchored", 'n/a'),
# iv_shifted_1 Shift values:
-('n/a', 'n/a', 'partial, -5, 5'), # 5': -5 3': 5
-('n/a', 'n/a', 'nested, -5, 5'), # 5': -5 3': 5
+('n/a', 'n/a', 'partial, -5, 5', 'n/a'), # 5': -5 3': 5
+('n/a', 'n/a', 'nested, -5, 5', 'n/a'), # 5': -5 3': 5
# iv_shifted_2
-('n/a', 'n/a', 'exact, -10, 10'), # 5': -10 3': 10
+('n/a', 'n/a', 'exact, -10, 10', 'n/a'), # 5': -10 3': 10
# iv_shifted_3
-('n/a', 'n/a', "5' anchored, -1, -1"), # 5': -1 3': -1
-('n/a', 'n/a', "3' anchored, -1, -1")] # 5': -1 3': -1
+('n/a', 'n/a', "5' anchored, -1, -1", 'n/a'), # 5': -1 3': -1
+('n/a', 'n/a', "3' anchored, -1, -1", 'n/a')] # 5': -1 3': -1

iv_shifted_1 = HTSeq.GenomicInterval('I', 5, 25, '+')
iv_shifted_2 = HTSeq.GenomicInterval('I', 0, 30, '+')
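
The shifted intervals in this test follow from the shift values in the overlap definitions: on the `+` strand, `-5, 5` turns `[10, 20)` into `[5, 25)` and `-10, 10` turns it into `[0, 30)`. A minimal sketch of that arithmetic, using plain integers instead of `HTSeq.GenomicInterval`. `parse_overlap` and `shift_interval` are hypothetical names, and the minus-strand branch is an assumption that is not confirmed by the lines shown here.

```python
def parse_overlap(definition: str):
    """Split 'nested, -5, 5' into ('nested', -5, 5); shifts default to 0."""
    parts = [p.strip() for p in definition.split(",")]
    selector = parts[0]
    if len(parts) == 3:
        shift_5p, shift_3p = int(parts[1]), int(parts[2])
    else:
        shift_5p, shift_3p = 0, 0
    return selector, shift_5p, shift_3p


def shift_interval(start, end, strand, shift_5p, shift_3p):
    """Apply 5'/3' shifts to a half-open interval."""
    if strand == "+":
        return start + shift_5p, end + shift_3p
    # Assumption: on the minus strand the 5' end is the higher coordinate
    return start - shift_3p, end - shift_5p


selector, s5, s3 = parse_overlap("partial, -5, 5")
print(selector, shift_interval(10, 20, "+", s5, s3))
```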