Pipeline: new location for configuring GFF files and aliases #245

Merged
taimontgomery merged 26 commits into master from issue-234 on Nov 2, 2022

@AlexTate AlexTate commented Oct 27, 2022

Config file changes:
The Alias by... and Feature Source columns have been removed from the Features Sheet. This is a healthy change because, within each rule, these two columns were coupled only to each other and not to any of the other columns, which understandably led to some confusion.

GFF file inputs are now defined in the Paths File, where all other non-sample file inputs reside. Its YAML data type is a list of mappings, where each list item holds the path to the file and an optional list of alias attributes for the file. When the Paths File is parsed, only unique GFF files are retained, and if there are duplicate entries for the same path but different aliases, the aliases are merged with duplicates removed and order preserved.
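The merge behavior described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the function name `merge_gff_entries` and the `path` and `alias` key names are assumptions, and paths are compared as plain strings without being resolved first.

```python
def merge_gff_entries(entries):
    """Collapse duplicate GFF paths and merge their aliases.

    Hypothetical helper: `path` and `alias` are assumed key names.
    Aliases are de-duplicated while keeping first-seen order.
    """
    merged = {}  # dicts preserve insertion order, so file order is kept
    for entry in entries:
        aliases = merged.setdefault(entry["path"], [])
        for alias in entry.get("alias") or []:  # alias list is optional
            if alias not in aliases:
                aliases.append(alias)
    return [{"path": path, "alias": aliases} for path, aliases in merged.items()]


# Two entries for the same file with overlapping aliases
entries = [
    {"path": "a.gff3", "alias": ["ID", "Name"]},
    {"path": "b.gff3"},
    {"path": "a.gff3", "alias": ["Name", "sequence_name"]},
]
result = merge_gff_entries(entries)
```

Here `result` holds one record per unique path, and the record for `a.gff3` carries the union of both alias lists in the order they were first seen.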

Command line argument changes:
The command line arguments for tiny-count have been updated accordingly. Rather than adding the Paths File to the two existing inputs (Samples Sheet and Features Sheet), users need only pass the Paths File which contains the locations of all required file inputs.

Codebase improvements:
A new class, PathsFile, has been added to configuration.py to act as an API to tiny-count and the Configuration class. It validates the config file at construction and automatically resolves relative paths upon lookup. This is true in both "pipeline" mode and standalone mode.
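To illustrate the idea of validating at construction and resolving relative paths on lookup, here is a minimal sketch. It is not the PathsFile implementation: the class name `PathsFileSketch`, the constructor arguments, and the required key names are all assumptions made for the example.

```python
import os


class PathsFileSketch:
    """Illustrative stand-in for a paths config wrapper (not the real PathsFile)."""

    # Hypothetical required keys; the real class may check different ones
    required = ("samples_csv", "features_csv")

    def __init__(self, file, config):
        # Relative paths are interpreted relative to the config file's directory
        self.dir = os.path.dirname(os.path.abspath(file))
        missing = [key for key in self.required if not config.get(key)]
        if missing:
            raise ValueError(f"Missing required entries: {', '.join(missing)}")
        self.config = dict(config)

    def __getitem__(self, key):
        value = self.config[key]
        if isinstance(value, str) and value:
            # os.path.join() leaves an absolute `value` unchanged
            return os.path.normpath(os.path.join(self.dir, value))
        return value  # non-string or empty values are returned as-is


paths = PathsFileSketch(
    "/project/paths.yml",
    {"samples_csv": "samples.csv", "features_csv": "/data/features.csv"},
)
```

With this sketch, `paths["samples_csv"]` resolves against the directory of `paths.yml`, while the already absolute `features_csv` value passes through unchanged.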

Misc. changes and bugfixes:

  • The joinpath() and from_here() functions (used in all configuration file classes) have been hardened to more reliably handle a wider variety of inputs.
  • CSV files containing a greater than expected number of columns are now handled properly in CSVReader.validate_csv_header().
  • If GFFValidator was unable to parse any chromosomes from reference genome files, it now continues its search with the next best option (SAM files, currently). Previously this was treated as an indication of chromosome non-overlap.
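The fallback described in the last bullet amounts to trying chromosome sources in order of preference and moving on when one yields nothing. A rough sketch, assuming a hypothetical helper `first_nonempty_chroms` and hypothetical source callables (none of these names come from GFFValidator):

```python
def first_nonempty_chroms(sources):
    """Return (name, chroms) for the first source that yields any chromosomes.

    `sources` is an ordered list of (name, callable) pairs, best option first.
    An empty result means "nothing could be parsed here", so the search
    continues instead of reporting chromosome non-overlap.
    """
    for name, get_chroms in sources:
        chroms = get_chroms()
        if chroms:
            return name, chroms
    return None, set()


# Genome parsing yields nothing, so the SAM files are consulted next
sources = [
    ("genomes", lambda: set()),
    ("sam_files", lambda: {"I", "II", "III"}),
]
source_name, chroms = first_nonempty_chroms(sources)
```

Only after a source actually provides chromosomes would it make sense to compare them against the GFF files for overlap.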

Closes #234

…e and reliability improvements.

The routine for determining the final output directory name has been cleaned up in setup_file_groups()

Removed "Alias by..." and "Feature Source" from the list of expected Features Sheet columns.
… args for Features Sheet and Samples Sheet have been replaced by a required arg for the Paths File.

load_config() and load_samples() have been modified to use the Paths File to determine their inputs.

Handling of GFFs and aliases has been removed from load_config(). This is now handled by a new function: load_annotations()
… in the Paths File. Also corrected some rare usages of "Paths Sheet" to "Paths File"
…nput. I still think it's a good idea to create a standalone PathsFile class that tiny-config and tiny-count can use.
… handle CWL file objects. Required inputs are checked against None in Configuration.process_paths_sheet(). CSVReader.validate_csv_header() now handles cases where more than the expected number of columns are provided (not sure how I missed that before).
# Conflicts:
#	tiny/entry.py
#	tiny/rna/configuration.py
…s that are Nonetype.

Updates to ConfigBase.from_here() allow it to handle inputs that are already CWL file objects, as well as inputs that are None or an empty string. Previously, this function would return the config file's directory if destination was "" or None, which isn't very helpful in this context. Now it returns an empty string.
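The new behavior for empty inputs can be sketched like this. It is a standalone approximation rather than the actual ConfigBase.from_here() method: the signature is hypothetical, and resolving the inner `path` of a CWL File object is an assumption about what "handling" such inputs means.

```python
import os


def from_here(config_dir, destination):
    """Resolve `destination` relative to `config_dir` (illustrative only)."""
    if not destination:
        # None or "": return an empty string rather than the config directory
        return ""
    if isinstance(destination, dict) and destination.get("class") == "File":
        # Assumed handling of a CWL File object: resolve its inner path
        return dict(destination, path=from_here(config_dir, destination.get("path")))
    # os.path.join() leaves an absolute destination unchanged
    return os.path.normpath(os.path.join(config_dir, destination))


empty = from_here("/project", None)
resolved = from_here("/project", "refs/genes.gff3")
cwl = from_here("/project", {"class": "File", "path": "refs/genes.gff3"})
```

Returning an empty string for empty inputs lets callers tell "no path given" apart from a path that happens to equal the config directory.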
… performing validation of the Paths File, it also includes some convenience functions like automatic "from_here" path resolution on key lookup and a function that converts any of the contained parameter types to a CWL file object.
… the benefit of validating Paths File contents when tiny-count is run as a standalone step. Eventually GFFValidator will be modified so that we can use genome and ebwt inputs as validation targets during standalone runs too (currently only the SAM files are used). For now, I need to keep things moving with other tasks.

GFF loading and validation now take place in the same function.
…cessarily a bugfix because the order was reversed twice (hence why it passed testing).

chroms_shared_with_ebwt() now silences stderr output from bowtie-inspect if there was an error. chroms_shared_with_genomes() now skips genome files that don't exist, and validate_chroms() will continue searching for "sequence file chromosomes" if no chromosomes were retained from genomes. These changes were made in preparation for including ebwt and genomes in tiny-count's call to GFFValidator.
…silient.

A CWL file object entry is now included for "paths_file" in the Run Config. This means there will be two keys containing the path to the Paths File; one of them is at the top and is a regular string for user friendliness, and the other is in the automatic section for the workflow runner.

Replaced a missing "_" from the run_directory name that had been recently removed.
…path resolution. This commit also reinstates the old guarantees that each GFF is parsed only once, and that "ID" alias values are not shown in the Feature Name column.

Tried to make it pretty resilient. If multiple entries exist for the same GFF file with unique alias keys in each entry, the alias keys are merged under the single resulting record for the file
…rs so that it can be used by unit_tests_counter
…ntains updates for PathsFile's new automatic path resolution, bugfixes for chained assignment using PathsFile (BAD!), and so forms the default prefix from the given test's PathsFile object (previously a new timestamp was created at minute boundaries, causing intermittent test failures)
# Conflicts:
#	README.md
#	START_HERE/features.csv
#	doc/Configuration.md
#	tests/unit_tests_counter.py
#	tiny/rna/configuration.py
#	tiny/templates/features.csv
@AlexTate

Merge conflicts have been resolved

@taimontgomery

Tested successfully on ram1 data.

@taimontgomery taimontgomery merged commit b17b2b3 into master Nov 2, 2022
Development

Successfully merging this pull request may close these issues.

Pipeline: major changes to Features Sheet and Paths File