Merged
18 commits
e77c243
Changing the internal and external names of the Tag column
AlexTate Oct 16, 2022
bc2adac
First draft changes in tiny-count
AlexTate Oct 16, 2022
5f24e6c
tiny-deseq.r has been updated to handle feature counts using the new …
AlexTate Oct 16, 2022
dace53d
class_charts and scatter_by_dge_class have been updated to treat the …
AlexTate Oct 18, 2022
eef300a
Renaming the "tags" attribute in the Features class to "classes" for …
AlexTate Oct 18, 2022
138ad24
Adding two new command line options to tiny-plot:
AlexTate Oct 18, 2022
e0748a0
Including new parameters for labelling unknown/unassigned classes
AlexTate Oct 18, 2022
2f4da09
Updated the Features Sheet example for the new column configuration
AlexTate Oct 18, 2022
1c142dc
Description of class counting via the Class= attribute has been remov…
AlexTate Oct 18, 2022
34e51fd
The Tagged Counting section has been rewritten to describe the new "c…
AlexTate Oct 18, 2022
ee7ca26
Removed references to Class= counting in the class_charts description…
AlexTate Oct 18, 2022
d240b02
Adding new parameters for unassigned/unknown class labels to the CWL …
AlexTate Oct 18, 2022
436b875
Updating column config in the features.csv template
AlexTate Oct 18, 2022
1d95e19
Added a brief check in the CSV reader to catch usage of Features Shee…
AlexTate Oct 18, 2022
79cdfe5
Merge branch 'master' into issue-240
AlexTate Oct 18, 2022
fe49acb
Unit tests have been updated for the new class/tagged counting approach
AlexTate Oct 19, 2022
a2fd955
Slight refactor to move ReferenceTables.finalize_tables() back into R…
AlexTate Oct 22, 2022
6894cc1
Merge branch 'master' into issue-240
AlexTate Oct 22, 2022
25 changes: 14 additions & 11 deletions README.md
@@ -90,11 +90,11 @@ tiny get-template

### Requirements for User-Provided Input Files

| Input Type | File Extension | Requirements |
|----------------------------------------------------------------------------|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Reference annotations<br/>[(example)](START_HERE/reference_data/ram1.gff3) | GFF3 / GFF2 / GTF | Column 9 attributes (defined as "tag=value" or "tag "):<ul><li>Each feature must have an `ID` or `gene_id` or `Parent` tag (referred to as `ID` henceforth).</li><li>Feature classes can be defined with the `Class` tag. If undefined, the default value \__UNKNOWN_\_ will be used.</li><li>Discontinuous features must be defined with the `Parent` tag whose value is the logical parent's `ID`, or by sharing the same `ID`.</li><li>Attribute values containing commas must represent lists.</li><li>`Parent` tags with multiple values are not yet supported.</li><li>See the example link (left) for col. 9 formatting.</li></ul> |
| Sequencing data<br/>[(example)](START_HERE/fastq_files) | FASTQ(.gz) | Files must be demultiplexed. |
| Reference genome<br/>[(example)](START_HERE/reference_data/ram1.fa) | FASTA | Chromosome identifiers (e.g. Chr1): <ul><li>Must match your reference annotation file chromosome identifiers</li><li>Are case sensitive</li></ul> |
| Input Type | File Extension | Requirements |
|----------------------------------------------------------------------------|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Reference annotations<br/>[(example)](START_HERE/reference_data/ram1.gff3) | GFF3 / GFF2 / GTF | Column 9 attributes (defined as "tag=value" or "tag "):<ul><li>Each feature must have an `ID` or `gene_id` or `Parent` tag (referred to as `ID` henceforth).</li><li>Discontinuous features must be defined with the `Parent` tag whose value is the logical parent's `ID`, or by sharing the same `ID`.</li><li>Attribute values containing commas must represent lists.</li><li>`Parent` tags with multiple values are not yet supported.</li><li>See the example link (left) for col. 9 formatting.</li></ul> |
| Sequencing data<br/>[(example)](START_HERE/fastq_files) | FASTQ(.gz) | Files must be demultiplexed. |
| Reference genome<br/>[(example)](START_HERE/reference_data/ram1.fa) | FASTA | Chromosome identifiers (e.g. Chr1): <ul><li>Should match your reference annotation file chromosome identifiers</li><li>Are case sensitive</li></ul> |



@@ -174,17 +174,20 @@ A "collapsed" FASTA contains unique reads found in fastp's quality filtered FAST
The tiny-count step produces a variety of outputs

#### Feature Counts
Custom Python scripts and HTSeq are used to generate a single table of feature counts that includes columns for each library analyzed. A feature's _Feature ID_ and _Feature Class_ are simply the values of its `ID` and `Class` attributes. Features lacking a Class attribute will be assigned class `_UNKNOWN_`. We have also included a _Feature Name_ column which displays aliases of your choice, as specified in the _Alias by..._ column of the Features Sheet. If _Alias by..._ is set to`ID`, the _Feature Name_ column is left empty.
Custom Python scripts and HTSeq are used to generate a single table of feature counts which includes each counted library. Each matched feature is represented with the following metadata columns:
- **_Feature ID_** is determined, in order of preference, by one of the following GFF column 9 attributes: `ID`, `gene_id`, `Parent`.
- **_Classifier_** is determined by the rules in your Features Sheet. It is the _Classify as..._ value of each matching rule. Since multiple rules can match a feature, some Feature IDs will be listed multiple times with different classifiers.
- **_Feature Name_** displays aliases of your choice, as specified in the _Alias by..._ column of the Features Sheet. If _Alias by..._ is set to `ID`, the _Feature Name_ column is left empty.

For example, if your Features Sheet has a rule which specifies _Alias by..._ `sequence_name` and the GFF entry for this feature has the following attributes column:
For example, if your Features Sheet has a rule which specifies _Alias by..._ `sequence_name`, _Classify as..._ `miRNA`, and the GFF entry for this feature has the following attributes column:
```
... ID=406904;sequence_name=mir-1,hsa-miR-1;Class=miRNA; ...
... ID=406904;sequence_name=mir-1,hsa-miR-1; ...
```
The row for this feature in the feature counts table would read:

| Feature ID | Feature Name | Feature Class | Group1_rep_1 | Group1_rep_2 | ... |
|------------|------------------|---------------|--------------|--------------|-----|
| 406904 | mir-1, hsa-miR-1 | miRNA | 1234 | 999 | ... |
| Feature ID | Classifier | Feature Name | Group1_rep_1 | Group1_rep_2 | ... |
|------------|------------|------------------|--------------|--------------|-----|
| 406904 | miRNA | mir-1, hsa-miR-1 | 1234 | 999 | ... |
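The way these three metadata columns are derived can be sketched in Python. This is only an illustration of the README's description, not tiny-count's actual code: the function names `parse_gff_attributes`, `feature_id`, and `counts_row_metadata` are hypothetical, and the parser assumes simple GFF3-style `tag=value;` attributes.

```python
def parse_gff_attributes(column9: str) -> dict:
    """Parse a GFF3-style attribute string ("tag=value;...") into a dict of value lists."""
    attrs = {}
    for field in column9.strip().strip(";").split(";"):
        if not field.strip():
            continue
        key, _, value = field.strip().partition("=")
        # Comma-separated values are treated as a list
        attrs[key] = [v.strip() for v in value.split(",")]
    return attrs


def feature_id(attrs: dict) -> str:
    """Return the Feature ID using the preference order ID, gene_id, Parent."""
    for key in ("ID", "gene_id", "Parent"):
        if key in attrs:
            return attrs[key][0]
    raise ValueError("Feature lacks an ID, gene_id, or Parent attribute")


def counts_row_metadata(column9: str, alias_by: str, classify_as: str) -> tuple:
    """Build the (Feature ID, Classifier, Feature Name) cells for one matched rule."""
    attrs = parse_gff_attributes(column9)
    # Per the README, Feature Name is left empty when "Alias by..." is ID
    name = "" if alias_by == "ID" else ", ".join(attrs.get(alias_by, []))
    return feature_id(attrs), classify_as, name


row = counts_row_metadata(
    "ID=406904;sequence_name=mir-1,hsa-miR-1;", "sequence_name", "miRNA"
)
print(row)  # ('406904', 'miRNA', 'mir-1, hsa-miR-1')
```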

#### Normalized Counts
If your Samples Sheet has settings for Normalization, an additional copy of the Feature Counts table is produced with the specified per-library normalizations applied.
2 changes: 1 addition & 1 deletion START_HERE/features.csv
@@ -1,4 +1,4 @@
Select for...,with value...,Alias by...,Tag,Hierarchy,Strand,5' End Nucleotide,Length,Overlap,Feature Source
Select for...,with value...,Alias by...,Classify as...,Hierarchy,Strand,5' End Nucleotide,Length,Overlap,Feature Source
Class,mask,Alias,,1,both,all,all,Partial,./reference_data/ram1.gff3
Class,miRNA,Alias,,2,sense,all,16-22,Full,./reference_data/ram1.gff3
Class,piRNA,Alias,5pA,2,both,A,24-32,Full,./reference_data/ram1.gff3
6 changes: 6 additions & 0 deletions START_HERE/run_config.yml
@@ -293,6 +293,12 @@ plot_vector_points: False
plot_len_dist_min:
plot_len_dist_max:

##-- Use this label in class plots for counts assigned by rules lacking a classifier --##
plot_unknown_class: "_UNKNOWN_"

##-- Use this label in class plots for unassigned counts --##
plot_unassigned_class: "_UNASSIGNED_"


######----------------------------- OUTPUT DIRECTORIES ------------------------------######
#
6 changes: 3 additions & 3 deletions doc/Configuration.md
@@ -123,9 +123,9 @@ Supported values are:
DESeq2 requires that your experiment design has at least one degree of freedom. If your experiment doesn't include at least one sample group with more than one replicate, tiny-deseq.r will be skipped and DGE related plots will not be produced.

## Features Sheet Details
| _Column:_ | Select for... | with value... | Alias by... | Tag | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap | Feature Source |
|------------|---------------|---------------|-------------|-----|-----------|--------|-------------------|--------|-------------|----------------|
| _Example:_ | Class | miRNA | Name | | 1 | sense | all | all | 5' anchored | ram1.gff3 |
| _Column:_ | Select for... | with value... | Alias by... | Classify as... | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap | Feature Source |
|------------|---------------|---------------|-------------|----------------|-----------|--------|-------------------|--------|-------------|----------------|
| _Example:_ | Class | miRNA | Name | miRNA | 1 | sense | all | all | 5' anchored | ram1.gff3 |

The Features Sheet allows you to define selection rules that determine how features are chosen when multiple features are found to overlap an alignment locus. Selected features are "assigned" a portion of the reads associated with the alignment.

15 changes: 15 additions & 0 deletions doc/Parameters.md
@@ -254,6 +254,14 @@ The scatter plots produced by tiny-plot have rasterized points by default. This

The min and/or max bounds for plotted lengths can be set with this option. See [tiny-plot's documentation](tiny-plot.md#length-bounds) for more information about how these values are determined if they aren't set.

### Labels for Class-related Plots
| Run Config Key | Commandline Argument |
|------------------------|----------------------|
| plot_unknown_class: | `--unknown-class` |
| plot_unassigned_class: | `--unassigned-class` |

These options set the labels used for special groups in `class_charts` and `sample_avg_scatter_by_dge_class` plots. The "unknown" class group represents counts assigned by a Features Sheet rule that lacks a "Classify as..." label. The "unassigned" class group represents counts that weren't assigned to any feature.
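As a rough illustration of how two such label options could be declared, here is a minimal `argparse` sketch. It is not tiny-plot's actual parser: the program name `tiny-plot-sketch` is hypothetical, and using the Run Config values `_UNKNOWN_` and `_UNASSIGNED_` as commandline defaults is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a small parser with only the two class-label options (illustrative)."""
    parser = argparse.ArgumentParser(prog="tiny-plot-sketch")
    parser.add_argument(
        "-unk", "--unknown-class", metavar="LABEL",
        default="_UNKNOWN_",  # assumed default, mirroring plot_unknown_class
        help="Label for counts assigned by rules lacking a 'Classify as...' value",
    )
    parser.add_argument(
        "-una", "--unassigned-class", metavar="LABEL",
        default="_UNASSIGNED_",  # assumed default, mirroring plot_unassigned_class
        help="Label for unassigned counts",
    )
    return parser


# Pass an explicit argument list so the sketch does not read sys.argv
args = build_parser().parse_args(["--unknown-class", "other"])
print(args.unknown_class, args.unassigned_class)  # other _UNASSIGNED_
```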

### Full tiny-plot Help String
```
tiny-plot [-rc RAW_COUNTS] [-nc NORM_COUNTS] [-uc RULE_COUNTS]
@@ -318,4 +326,11 @@ Optional arguments:
len_dist plots will start at this value
-lda VALUE, --len-dist-max VALUE
len_dist plots will end at this value
-una LABEL, --unassigned-class LABEL
Use this label in class-related plots for unassigned
counts
-unk LABEL, --unknown-class LABEL
Use this label in class-related plots for counts which
were assigned by rules lacking a "Classify as..."
value
```
10 changes: 5 additions & 5 deletions doc/tiny-count.md
@@ -26,11 +26,14 @@ Selection occurs in three stages, with the output of each stage as input to the
3. Finally, features are selected for read assignment based on the small RNA attributes of the alignment locus. Once reads are assigned to a feature, they are excluded from matches with larger hierarchy values.

## Stage 1: Feature Attribute Parameters
| _features.csv columns:_ | Select for... | with value... | Tag |
|-------------------------|---------------|---------------|-----|
| _features.csv columns:_ | Select for... | with value... | Classify as... |
|-------------------------|---------------|---------------|----------------|

Each feature's column 9 attributes are searched for the key-value combinations defined in the `Select for...` and `with value...` columns. Features, and the rules they matched, are retained for later evaluation at alignment loci in Stages 2 and 3.

#### Feature Classification
You can optionally specify a classifier for each rule. These classifiers are later used to group and label counts in class-related plots. Features that match rules with a classifier are counted separately; the classifier becomes part of the feature's ID to create a distinct "sub-feature", and these sub-features continue to be treated as distinct in downstream DGE analysis. Classified features receive counts exclusively from the rule(s) which hold the same `Classify as...` value. Counts from multiple rules can be pooled by using the same classifier. In the final counts table, this value is displayed in the Classifier column of features matching the rule, and each feature-classifier pair is shown on its own row.
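The pooling behavior described above can be pictured as a tally keyed on the (Feature ID, classifier) pair. The sketch below is a simplified model rather than tiny-count's implementation: `tally_assignments` and the sample data are hypothetical, and an empty string stands in for a rule without a classifier.

```python
from collections import Counter


def tally_assignments(assignments):
    """Sum counts per (Feature ID, Classifier) pair.

    `assignments` is an iterable of (feature_id, classifier, count) tuples,
    one per rule-based assignment. Rules sharing a classifier pool into the
    same pair; each distinct pair becomes its own row in the counts table.
    """
    table = Counter()
    for feature_id, classifier, count in assignments:
        table[(feature_id, classifier)] += count
    return table


assignments = [
    ("406904", "miRNA", 10.0),  # hypothetical rule 1, classified as miRNA
    ("406904", "miRNA", 2.5),   # hypothetical rule 2, same classifier: pooled
    ("406904", "", 4.0),        # rule without a classifier: counted separately
]
table = tally_assignments(assignments)
for (feature_id, classifier), count in table.items():
    print(feature_id, repr(classifier), count)
```

Keying on the pair rather than the Feature ID alone is what makes each "sub-feature" appear on its own row.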

#### Value Lists
Attribute keys are allowed to have multiple comma separated values, and these values are treated as a list; only one of the listed values needs to match the `with value...` to be considered a valid match to the rule. For example, if a rule contained `Class` and `WAGO` in these columns, then a feature with attributes `... ;Class=CSR,WAGO; ...` would be considered a match for the rule.
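A minimal sketch of this list matching follows. The helper name `value_matches` is hypothetical, and exact, case-sensitive comparison is an assumption rather than documented behavior.

```python
def value_matches(attribute_value: str, wanted: str) -> bool:
    """Return True if `wanted` is one of the comma-separated attribute values."""
    values = [v.strip() for v in attribute_value.split(",")]
    return wanted in values


# A rule with `Class` / `WAGO` matches a feature with Class=CSR,WAGO
print(value_matches("CSR,WAGO", "WAGO"))  # True
print(value_matches("CSR,WAGO", "miRNA"))  # False
```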

@@ -39,9 +42,6 @@ Attribute keys are allowed to have multiple comma separated values, and these va
#### Wildcard Support
Wildcard values (`all`, `*`, or an empty cell) can be used in the `Select for...` / `with value...` fields. With this functionality you can evaluate features for the presence of an attribute key without regarding its values, or you can check all attribute keys for the presence of a specific value, or you can skip Stage 1 selection altogether to permit the evaluation of the complete feature set in Stage 2. In the latter case, feature-rule matching pairs still serve as the basis for selection; each rule still applies only to its matching subset from previous Stages.
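The three wildcard cases can be sketched as a single matching function. This is an illustrative model only: `rule_matches` and `WILDCARDS` are hypothetical names, attributes are assumed to be pre-parsed into a dict of value lists, and treating wildcards as case-sensitive is an assumption.

```python
WILDCARDS = {"all", "*", ""}


def rule_matches(attrs: dict, select_for: str, with_value: str) -> bool:
    """Check a feature's attributes against one Stage 1 rule, honoring wildcards.

    `attrs` maps each attribute key to its list of values.
    """
    key_wild = select_for.strip() in WILDCARDS
    val_wild = with_value.strip() in WILDCARDS
    if key_wild and val_wild:
        return True  # Stage 1 selection is effectively skipped
    if key_wild:
        # Any key may hold the wanted value
        return any(with_value in values for values in attrs.values())
    if select_for not in attrs:
        return False
    # Key is present: accept any value, or require the wanted one
    return val_wild or with_value in attrs[select_for]


attrs = {"ID": ["F1"], "Class": ["CSR", "WAGO"]}
print(rule_matches(attrs, "Class", "*"))     # True: key present, any value
print(rule_matches(attrs, "all", "WAGO"))    # True: value found under some key
print(rule_matches(attrs, "", ""))           # True: both wildcards
print(rule_matches(attrs, "Class", "miRNA"))  # False
```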

#### Tagged Counting (advanced)
You can optionally specify a tag for each rule. Feature assignments resulting from tagged rules will have reads counted separately from those assigned by non-tagged rules. This essentially creates a new "sub-feature" for each feature that a tagged rule matches, and these "sub-features" are treated as distinct during downstream DGE analysis. Additionally, these counts subsets can be pooled across any number of rules by specifying the same tag. We recommend using tag names which _do not_ pertain to the `Select for...` / `with value...` in order to avoid potentially confusing results in class-related plots.

## Stage 2: Overlap and Hierarchy Parameters
| _features.csv columns:_ | Hierarchy | Overlap |
|-------------------------|-----------|---------|
10 changes: 6 additions & 4 deletions doc/tiny-plot.md
@@ -64,19 +64,21 @@ Percentage label darkness and bar colors reflect the magnitude of the rule's con


## class_charts
Features can have multiple classes associated with them, so it is useful to see the proportions of counts by class. The class_charts plot type shows the percentage of _mapped_ reads that were assigned to features by class. Each feature's associated classes are determined by the `Class=` attribute in your GFF files.
Features can have multiple classes associated with them, so it is useful to see the proportions of counts by class. The class_charts plot type shows the percentage of _mapped_ reads that were assigned to features by class. Each feature's associated classes are determined by the rules that it matched during Stage 1 selection, and are therefore determined by its GFF annotations.

<p float="left" align="center">
<img src="../images/plots/class_chart.jpg" width="80%" alt="class_chart with 8 classes"/>
</p>

#### Class \_UNASSIGNED_
This category represents the percentage of mapped reads that were unassigned. Sources of unassigned reads include:
This category represents the percentage of mapped reads that weren't assigned to any features. Sources of unassigned reads include:
- A lack of features passing selection at alignment loci
- Alignments which do not overlap with any features

#### Count Normalization
A feature with multiple associated classes will have its counts split evenly across these classes before being grouped and summed.
You can customize this label using the [unassigned class parameter.](Parameters.md#labels-for-class-related-plots)

#### Class \_UNKNOWN_
This category represents the percentage of mapped reads that matched rules which did not have a specified `Classify as...` value. You can customize this label using the [unknown class parameter.](Parameters.md#labels-for-class-related-plots)
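To make the two special categories concrete, the sketch below computes per-class percentages of mapped reads. It is a simplified model, not tiny-plot's code: `class_percentages` is a hypothetical helper, an empty classifier stands in for rules without a `Classify as...` value, and computing the unassigned share as mapped reads minus assigned counts is an assumption.

```python
def class_percentages(class_counts, mapped_reads,
                      unknown_label="_UNKNOWN_",
                      unassigned_label="_UNASSIGNED_"):
    """Return each class's share of mapped reads, as a percentage.

    `class_counts` maps classifier -> assigned counts; the empty string
    represents counts from rules that lacked a classifier.
    """
    labelled = {}
    for classifier, count in class_counts.items():
        label = classifier if classifier else unknown_label
        labelled[label] = labelled.get(label, 0) + count
    # Assumed: whatever was mapped but not assigned is "unassigned"
    labelled[unassigned_label] = mapped_reads - sum(class_counts.values())
    return {label: 100 * count / mapped_reads for label, count in labelled.items()}


percentages = class_percentages({"miRNA": 60, "piRNA": 20, "": 10}, mapped_reads=100)
print(percentages)
```

With these illustrative numbers, 10% of mapped reads fall under `_UNKNOWN_` and the remaining 10% under `_UNASSIGNED_`.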

#### Class Chart Styles
Proportions in rule_charts and class_charts are plotted using the same function. Styles are the same between the two. See [rule chart styles](#rule-chart-styles) for more info.
2 changes: 1 addition & 1 deletion tests/unit_test_helpers.py
@@ -19,7 +19,7 @@
rules_template = [{'Identity': ("Name", "N/A"),
'Strand': "both",
'Hierarchy': 0,
'Tag': '',
'Class': '',
'nt5end': "all",
'Length': "all", # A string is expected by FeatureSelector due to support for lists and ranges
'Overlap': "partial"}]
4 changes: 2 additions & 2 deletions tests/unit_tests_counter.py
@@ -33,7 +33,7 @@ def setUpClass(self):
'Key': "Class",
'Value': "CSR",
'Name': "Alias",
'Tag': "",
'Class': "",
'Hierarchy': "1",
'Strand': "antisense",
"nt5end": '"C,G,U"', # Needs to be double-quoted due to commas
@@ -47,7 +47,7 @@ def setUpClass(self):
_row = self.csv_feat_row_dict
self.parsed_feat_rule = [{
'Identity': (_row['Key'], _row['Value']),
'Tag': _row['Tag'],
'Class': _row['Class'],
'Hierarchy': int(_row['Hierarchy']),
'Strand': _row['Strand'],
'nt5end': _row["nt5end"].upper().translate({ord('U'): 'T'}),