Pipeline: new location for configuring GFF files and aliases #245
Merged
taimontgomery merged 26 commits into master on Nov 2, 2022
…e and reliability improvements. The routine for determining the final output directory name has been cleaned up in setup_file_groups(). Removed "Alias by..." and "Feature Source" from the list of expected Features Sheet columns.
… args for Features Sheet and Samples Sheet have been replaced by a required arg for the Paths File. load_config() and load_samples() have been modified to use the Paths File to determine their inputs. Handling of GFFs and aliases has been removed from load_config(). This is now handled by a new function: load_annotations().
… in the Paths File. Also corrected some rare usages of "Paths Sheet" to "Paths File"
…nput. I still think it's a good idea to create a standalone PathsFile class that tiny-config and tiny-count can use.
… handle CWL file objects. Required inputs are checked against None in Configuration.process_paths_sheet(). CSVReader.validate_csv_header() now handles cases where more than the expected number of columns are provided (not sure how I missed that before).
# Conflicts: # tiny/entry.py # tiny/rna/configuration.py
…s that are Nonetype. Updates to ConfigBase.from_here() to allow it to handle inputs that are already CWL file classes, and inputs that are Nonetype or empty string. Previously, this function would return the config file's directory if destination="" or None. In this context this isn't very helpful. Now it will return an empty string.
… performing validation of the Paths File, it also includes some convenience functions like automatic "from_here" path resolution on key lookup and a function that converts any of the contained parameter types to a CWL file object.
…calling self.paths
… the benefit of validating Paths File contents when tiny-count is run as a standalone step. Eventually GFFValidator will be modified so that we can use genome and ebwt inputs as validation targets during standalone runs too (currently only the SAM files are used). For now, I need to keep things moving with other tasks. GFF loading and validation now take place in the same function.
…cessarily a bugfix because the order was reversed twice (hence why it passed testing). chroms_shared_with_ebwt() now silences stderr output from bowtie-inspect if there was an error. chroms_shared_with_genomes() now skips genome files that don't exist, and validate_chroms() will continue searching for "sequence file chromosomes" if no chromosomes were retained from genomes. These changes were made in preparation for including ebwt and genomes in tiny-count's call to GFFValidator.
…the new location for gff_files
…silient. A CWL file object entry is now included for "paths_file" in the Run Config. This means there will be two keys containing the path to the Paths File: one is at the top and is a regular string for user friendliness, and the other is in the automatic section for the workflow runner. Restored a missing "_" in the run_directory name that had recently been removed.
…path resolution. This commit also reinstates the old guarantees that each GFF is parsed only once, and that "ID" alias values are not shown in the Feature Name column. Tried to make it pretty resilient. If multiple entries exist for the same GFF file with unique alias keys in each entry, the alias keys are merged under the single resulting record for the file.
…rs so that it can be used by unit_tests_counter
…ntains updates for PathsFile's new automatic path resolution, bugfixes for chained assignment using PathsFile (BAD!), and so forms the default prefix from the given test's PathsFile object (previously a new timestamp was created at minute boundaries, causing intermittent test failures)
…ific tests for loading GFF entries
# Conflicts: # README.md # START_HERE/features.csv # doc/Configuration.md # tests/unit_tests_counter.py # tiny/rna/configuration.py # tiny/templates/features.csv
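The commit notes above describe the hardened from_here() behavior: None or an empty string now yields an empty string instead of the config file's directory, and inputs that are already CWL file objects are accepted. The following is only a rough sketch of that behavior, not the project's code. The standalone function signature, the config_dir parameter, and the modeling of a CWL file object as a dict with "class" and "path" keys are all assumptions made for illustration.

```python
import os

def from_here(config_dir, destination):
    """Hypothetical stand-in for ConfigBase.from_here(); an illustrative sketch only."""
    # None and "" (or any other empty value) produce "" rather than config_dir
    if not destination:
        return ""
    # Assumption: a CWL file object is modeled as a dict with class "File"
    if isinstance(destination, dict) and destination.get("class") == "File":
        resolved = from_here(config_dir, destination.get("path"))
        return {**destination, "path": resolved}
    # Absolute paths are returned unchanged
    if os.path.isabs(destination):
        return destination
    # Relative paths are resolved against the config file's directory
    return os.path.normpath(os.path.join(config_dir, destination))
```

Returning an empty string for missing inputs lets the caller distinguish "no file given" from "the config directory itself", which is the motivation stated in the commit message.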
Comment (Member, Author): Merge conflicts have been resolved.
Comment (Collaborator): Tested successfully on ram1 data.
Config file changes:
- The "Alias by..." and "Feature Source" columns have been removed from the Features Sheet. This is a healthy change because, within each rule, these two columns were coupled only to each other and to none of the other columns. This understandably led to some confusion.
- GFF file inputs are now defined in the Paths File, where all other non-sample file inputs reside. They are given as a YAML list of mappings, where each list item holds the "path" to the file and an optional list of "alias" attributes for the file. When the Paths File is parsed, only unique GFF files are retained. If there are duplicate entries for the same path but different aliases, the aliases are merged with duplicates removed and order preserved.

Command line argument changes:
The command line arguments for tiny-count have been updated accordingly. Rather than adding the Paths File to the two existing inputs (Samples Sheet and Features Sheet), users need only pass the Paths File, which contains the locations of all required file inputs.

Codebase improvements:
A new class, PathsFile, has been added to configuration.py to act as an API to tiny-count and the Configuration class. It validates the config file at construction and automatically resolves relative paths upon lookup. This is true in both "pipeline" mode and standalone mode.

Misc. changes and bugfixes:
- joinpath() and from_here() functions (used in all configuration file classes) have been hardened to more reliably handle a wider variety of inputs
- CSVReader.validate_csv_header() now handles cases where more than the expected number of columns are provided

Closes #234
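The merging of duplicate GFF entries described under "Config file changes" might look roughly like the sketch below. It is not the PathsFile implementation. The function name merge_gff_entries and the example file paths are hypothetical, and the input is written as the Python list of dicts that a YAML parser would produce for a list of mappings with "path" and "alias" keys. Path resolution and the special handling of "ID" aliases mentioned in the commit notes are not modeled.

```python
def merge_gff_entries(gff_files):
    """Collapse duplicate GFF paths, merging aliases in first-seen order (sketch)."""
    merged = {}  # regular dicts preserve insertion order in Python 3.7+
    for entry in gff_files:
        aliases = merged.setdefault(entry["path"], [])
        # "alias" is optional, so a missing or empty value means no aliases
        for alias in entry.get("alias") or []:
            if alias not in aliases:
                aliases.append(alias)
    return [{"path": path, "alias": aliases} for path, aliases in merged.items()]

# Hypothetical parsed contents of the Paths File's GFF list
gff_files = [
    {"path": "ref/genes.gff3", "alias": ["ID", "Name"]},
    {"path": "ref/transposons.gff3"},
    {"path": "ref/genes.gff3", "alias": ["Name", "sequence_name"]},
]
result = merge_gff_entries(gff_files)
```

Here the two entries for ref/genes.gff3 collapse into one record whose aliases are ID, Name, and sequence_name, in that order, so each file would only need to be parsed once.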