When/how to modify Snakemake config? #23

@victorlin

Description

This issue started as a discussion of how to support nextstrain run in augur subsample config. The solution there is clear: pass file paths in the subsample config through resolve_config_path before dumping the config variable in Snakemake.

The conversation then moved to whether and how to apply resolve_config_path to all values at the start of Snakemake (instead of within each individual rule), and from there to the broader question of when and how to modify Snakemake config. Options:

  1. Before running Snakemake.
    • Example: using CUE to generate a phylogenetic/defaults/config.yaml.
    • We don't do this anywhere currently.
  2. At Snakemake startup (i.e. before evaluating any rules).
    • Example: modifying a value in config.
    • Example: modifying the structure of config to resolve wildcards.
  3. At Snakemake runtime.
    • Example: modifying a value from config while using it in a rule.

      rule foo:
          input:
              reference = resolve_config_path(config["files"]["reference"])
    • Most usage of resolve_config_path does this currently, and it's the only way to leverage Snakemake's built-in wildcards functionality.

    • This is not an option for augur subsample, which reads from the dump of config at startup.
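To make the difference between options 2 and 3 concrete, here is a minimal sketch. It is not the actual resolve_config_path implementation: `resolve` is a hypothetical stand-in for whatever path lookup is used, `deferred` is a made-up helper name, and a plain dict stands in for Snakemake's wildcards object.

```python
def deferred(path_template, resolve):
    """Return an input function to be called later with a rule's wildcards.

    Hypothetical helper: `resolve` is any callable mapping a relative path
    to a resolved path. The template is only filled in once wildcards are
    known, which is why this cannot happen at startup.
    """
    def _input(wildcards):
        # dict(...items()) accepts a plain dict or any object exposing items()
        filled = path_template.format(**dict(wildcards.items()))
        return resolve(filled)
    return _input


def fake_resolve(path):
    # Stand-in resolver for illustration only: prefix a made-up workflow dir.
    return "/workflow/" + path


# Option 2 (startup): the path has no wildcards, so it can be resolved eagerly.
eager = fake_resolve("defaults/exclude.txt")

# Option 3 (runtime): the path depends on a wildcard, so resolution is deferred.
include_input = deferred("defaults/{lineage}/include.txt", fake_resolve)
late = include_input({"lineage": "all-lineages"})
```

The eager form is the only one that helps a tool reading a dump of config, since the dump happens before any wildcards exist.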

Original issue: filter/subsample config compatibility with nextstrain run

I ran into this issue while updating the wnv repo to support nextstrain run (draft).

The issue stems from a fundamental change in working directory behavior between nextstrain build and nextstrain run:

  • nextstrain build: Executes in the workflow directory
  • nextstrain run: Executes in a user-specified analysis directory

This is a breaking change for all relative file paths in config. The updated repos address it with the resolve_config_path() helper function, which searches for files in both directories. This works well for rules where file paths are passed directly to individual parameters, allowing each path to be wrapped with resolve_config_path().
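As a rough illustration of the "search both directories" idea, the sketch below checks two base directories in turn. It is an assumption-laden stand-in, not the real helper: the function name `find_in_dirs`, its parameters, and the search order (analysis directory first, so user-provided files win) are all hypothetical.

```python
from pathlib import Path


def find_in_dirs(path, analysis_dir, workflow_dir):
    """Return the first existing location of `path`.

    Assumed search order: analysis directory first, then workflow directory.
    Raises FileNotFoundError if the file is in neither place.
    """
    for base in (Path(analysis_dir), Path(workflow_dir)):
        candidate = base / path
        if candidate.exists():
            return str(candidate)
    raise FileNotFoundError(
        f"{path!r} not found in {analysis_dir!r} or {workflow_dir!r}"
    )
```

With this shape, a path like defaults/exclude.txt that only exists in the workflow directory still resolves when Snakemake is executed from the analysis directory.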

augur filter

The WNV repo follows the flexible pattern of generalized subsampling where all augur filter arguments are stored in a literal string:

subsampling:
  region: >-
    --query "is_lab_host != 'true'"
    --query-columns is_lab_host:str
    --min-length '8200'
    --group-by region year
    --subsample-max-sequences 3000
    --exclude defaults/exclude.txt
    --include defaults/all-lineages/include.txt

Since file paths like defaults/exclude.txt are embedded within the literal string, they cannot be individually processed by resolve_config_path(). When executed from the analysis directory, these paths don't exist.

I can't think of a solution that doesn't involve breaking apart the config value into separate strings so that resolve_config_path can be used on the file paths. This is the pattern used in rule filter by other nextstrain run-compatible repos, but it goes against the generalized subsampling pattern.
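For illustration only, one way the "breaking apart" could be done programmatically is to tokenize the literal string, rewrite the values that follow path-taking options, and join it back together. This is a sketch, not something any repo currently does: `resolve_paths_in_args` and `PATH_OPTIONS` are hypothetical names, `resolve` is a stand-in for the path lookup, and the set of path-taking options is an assumption that would need to be kept in sync with augur filter.

```python
import shlex

# Assumption: the options whose values are file paths. Not exhaustive.
PATH_OPTIONS = {"--exclude", "--include"}


def resolve_paths_in_args(arg_string, resolve):
    """Apply `resolve` to every value following an option in PATH_OPTIONS.

    Tokens are split and re-joined with shell quoting, so quoted values
    such as a --query expression survive as single arguments.
    """
    resolved = []
    in_path_option = False
    for token in shlex.split(arg_string):
        if token.startswith("--"):
            # A new option starts; remember whether its values are paths.
            in_path_option = token in PATH_OPTIONS
            resolved.append(token)
        elif in_path_option:
            resolved.append(resolve(token))
        else:
            resolved.append(token)
    return shlex.join(resolved)  # requires Python 3.8+
```

The trade-off is that the workflow has to know which options take paths, which is part of what the generalized subsampling pattern was meant to avoid.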

augur subsample

augur subsample has a similar situation. While file paths are accessed directly by config key, the access happens within augur subsample and not Snakemake, so resolve_config_path is not applicable.

A possible solution is to apply resolve_config_path to file paths in subsampling config before writing the config YAML that is then used by augur subsample (draft).
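A minimal sketch of that idea follows: walk the subsampling config and resolve values under path-like keys, producing a new structure that can then be written out. The key names in `PATH_KEYS`, the function names, and the shape of the example config are assumptions for illustration, and `resolve` is a stand-in for resolve_config_path.

```python
# Assumption: config keys whose values are file paths (a string or a list).
PATH_KEYS = {"exclude", "include"}


def _resolve_value(value, resolve):
    """Resolve a single path or a list of paths; leave other types alone."""
    if isinstance(value, str):
        return resolve(value)
    if isinstance(value, list):
        return [resolve(item) for item in value]
    return value


def resolve_config_paths(node, resolve):
    """Return a copy of `node` with values under PATH_KEYS resolved."""
    if isinstance(node, dict):
        return {
            key: _resolve_value(value, resolve)
            if key in PATH_KEYS
            else resolve_config_paths(value, resolve)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [resolve_config_paths(item, resolve) for item in node]
    return node
```

The returned dict would then be dumped to YAML (for example with a YAML library's safe dump function) and that file passed to augur subsample. Because a new structure is returned, the original config is left unmodified for any rule that still resolves paths at runtime.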
