Description
This issue started out as a discussion of how to support nextstrain run in augur subsample config. The solution there is clear: pass file paths in the subsample config through resolve_config_path before dumping the config variable in Snakemake.
The conversation then moved on to whether and how to apply resolve_config_path to all values at the start of Snakemake (instead of within each individual rule), and from there to when and how to modify Snakemake config. Options:
- Before running Snakemake.
  - Example: using CUE to generate a phylogenetic/defaults/config.yaml.
  - We don't do this anywhere currently.
- At Snakemake startup (i.e. before evaluating any rules).
  - Example: modifying a value in config.
  - Example: modifying the structure of config to resolve wildcards.
- At Snakemake runtime.
  - Example: modifying a value from config while using it in a rule.

        rule foo:
            input:
                reference = resolve_config_path(config["files"]["reference"])

  - Most usage of resolve_config_path does this currently, and it's the only way to leverage Snakemake's built-in wildcards functionality.
  - This is not an option for augur subsample, which reads from the dump of config at startup.
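To make the startup vs. runtime distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical: `lookup` stands in for whatever does the actual file search, a plain dict stands in for Snakemake's wildcards object, and neither function is the real resolve_config_path. The point is only that a startup-time call needs a concrete path, while a runtime-style call can defer until wildcard values are known.

```python
def lookup(path):
    """Hypothetical file search: here it just prefixes a workflow directory."""
    return f"/workflow/{path}"


def resolve_now(path):
    """Startup-style: resolve a concrete path once, before any rule runs."""
    return lookup(path)


def resolve_later(path_template):
    """Runtime-style: return a function that fills in wildcards, then resolves."""
    def _resolve(wildcards):
        # A plain dict is used here in place of Snakemake's wildcards object.
        return lookup(path_template.format(**wildcards))
    return _resolve


startup_value = resolve_now("defaults/reference.gb")
runtime_value = resolve_later("defaults/{lineage}/include.txt")
print(startup_value)
print(runtime_value({"lineage": "all-lineages"}))
```

A tool that only sees the dumped config gets whatever `resolve_now`-style values were written at startup, which is why the deferred form is not available to it.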
Original issue: filter/subsample config compatibility with nextstrain run
I ran into this issue while updating the wnv repo to support nextstrain run (draft).
The issue stems from a fundamental change in working directory behavior between nextstrain build and nextstrain run:
- nextstrain build: Executes in the workflow directory
- nextstrain run: Executes in a user-specified analysis directory
This is a breaking change for all file paths in config, and the updated repos have addressed this using the resolve_config_path() helper function, which searches for files in both directories. This works well for rules where file paths are passed directly to individual parameters, allowing each path to be wrapped with resolve_config_path().
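For readers unfamiliar with the helper, the "search both directories" idea can be sketched roughly as below. This is not the actual resolve_config_path implementation: the function name `find_config_file`, its parameters, and the search order (analysis directory first, then workflow directory) are all assumptions made for illustration.

```python
from pathlib import Path


def find_config_file(path, analysis_dir, workflow_dir):
    """Hypothetical stand-in for a resolve_config_path-style lookup.

    Assumed order: prefer a file in the analysis directory, and fall back
    to the copy shipped in the workflow directory.
    """
    for base in (Path(analysis_dir), Path(workflow_dir)):
        candidate = base / path
        if candidate.exists():
            return str(candidate)
    raise FileNotFoundError(
        f"{path!r} not found in {analysis_dir!r} or {workflow_dir!r}"
    )
```

With this shape, a config value like defaults/exclude.txt keeps working regardless of which directory the command executes in, as long as each path goes through the lookup individually.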
augur filter
The WNV repo follows the flexible pattern of generalized subsampling where all augur filter arguments are stored in a literal string:
    subsampling:
      region: >-
        --query "is_lab_host != 'true'"
        --query-columns is_lab_host:str
        --min-length '8200'
        --group-by region year
        --subsample-max-sequences 3000
        --exclude defaults/exclude.txt
        --include defaults/all-lineages/include.txt
Since file paths like defaults/exclude.txt are embedded within the literal string, they cannot be individually processed by resolve_config_path(). When the workflow is executed from the analysis directory, these relative paths are looked up there, where the files don't exist.
I can't think of a solution that doesn't involve breaking apart the config value into separate strings so that resolve_config_path can be used on the file paths. This is the pattern used in rule filter by other nextstrain run-compatible repos, but goes against the generalized subsampling pattern.
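To illustrate what "breaking apart" the string would involve, here is a rough Python sketch that tokenizes the literal string, rewrites the values that follow path-taking options, and joins it back together. It is only a sketch under assumptions: `PATH_OPTIONS` is my guess at which options carry file paths, `resolve` is a placeholder for whatever lookup is used, and option values are assumed never to start with `--`.

```python
import shlex

# Assumption: the options whose values are file paths.
PATH_OPTIONS = {"--exclude", "--include"}


def resolve_paths_in_args(arg_string, resolve):
    """Rewrite file-path values inside a literal argument string."""
    out = []
    in_path_option = False
    for token in shlex.split(arg_string):
        if token.startswith("--"):
            # A new option starts; remember whether its values are paths.
            in_path_option = token in PATH_OPTIONS
            out.append(token)
        elif in_path_option:
            out.append(resolve(token))
        else:
            out.append(token)
    # shlex.join re-quotes tokens so the result is safe to pass to a shell.
    return shlex.join(out)
```

This keeps the config value as a single string, but it means the workflow has to know which options take paths, which is part of why it sits awkwardly with the generalized subsampling pattern.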
augur subsample
augur subsample has a similar situation. While file paths are accessed directly by config key, the access happens within augur subsample and not Snakemake, so resolve_config_path is not applicable.
A possible solution is to apply resolve_config_path to file paths in subsampling config before writing the config YAML that is then used by augur subsample (draft).
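A minimal sketch of that idea in Python, run before the config is written out. The names here are assumptions: `PATH_KEYS` is a guess at which keys hold file paths (the real augur subsample config schema may differ), and `resolve` is a placeholder for the lookup function. The sketch returns a resolved copy and leaves the original config untouched; writing the YAML is left out.

```python
import copy

# Assumption: keys whose values are file paths. The real schema may differ.
PATH_KEYS = {"exclude", "include"}


def resolve_config_paths(config, resolve, path_keys=PATH_KEYS):
    """Return a deep copy of config with path-valued keys resolved."""
    resolved = copy.deepcopy(config)

    def _walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in path_keys and isinstance(value, str):
                    node[key] = resolve(value)
                elif key in path_keys and isinstance(value, list):
                    # Assumes a list under a path key contains only path strings.
                    node[key] = [resolve(item) for item in value]
                else:
                    _walk(value)
        elif isinstance(node, list):
            for item in node:
                _walk(item)

    _walk(resolved)
    return resolved
```

Working on a copy means rules that still read the original config at runtime see the unresolved values, so the two approaches do not interfere with each other.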