Configuring augur subsample in Snakemake workflows #27

@victorlin

Background

augur subsample takes a YAML config file as input. In a Snakemake workflow, there are at least 3 ways to provide this config:

  1. Store the config file in the repo and pass it to the command as --config.
  2. Store the config as a section of the Snakemake workflow config. Write the workflow config to a file (e.g. results/run_config.yaml) and pass that file to the command as --config, with --config-section selecting the section of the workflow config to use as the augur subsample config.
  3. Generate the config file from a separate templated config file and pass it to the command as --config.
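
To make options (1) and (2) concrete, here is a minimal Python sketch of how a workflow might assemble the command in each case. Only --config and --config-section come from the description above; the helper names, the `subsample` key, and the file paths are hypothetical, and I'm assuming --config-section takes the key(s) leading to the section. The input/output arguments a real invocation needs are left out, and JSON is used for the dump only to keep the sketch dependency-free (JSON documents are also valid YAML 1.2).

```python
import json
from pathlib import Path


def dump_workflow_config(config, path):
    """Write the whole workflow config to a file, as option (2) does."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # JSON keeps this sketch stdlib-only; a real workflow would likely use a YAML library.
    path.write_text(json.dumps(config, indent=2) + "\n")
    return path


def subsample_command(config_path, section=None):
    """Build the config-related part of an `augur subsample` argument list."""
    cmd = ["augur", "subsample", "--config", str(config_path)]
    if section:
        # Assumption: the section is addressed by the key(s) leading to it.
        cmd += ["--config-section", *section]
    return cmd
```

With option (1), the call would be something like `subsample_command("config/subsample.yaml")`. With option (2), it would be `subsample_command(dump_workflow_config(config, "results/run_config.yaml"), ["subsample"])`.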

Current approach

  • Option (1) was not chosen because:
    • Moving the filtering/subsampling config out of workflow config would be a big change to users. Keeping it in the workflow config would make the switch feel more like an implementation detail.
    • For workflows that run augur subsample multiple times (e.g. once per build or output dataset), this becomes less maintainable with a separate YAML file per augur subsample invocation.
  • Option (3) was not chosen because:
    • The first reason given for option (1) applies here too.
    • It requires an extra step to be done before the workflow is run.
  • … and thus option (2) it was.
    • It worked fine for the WNV repo which runs augur subsample 3 times – the switch from augur filter was straightforward and I went with it.
    • When I tried it next in the rsv repo, the issues with this approach became obvious:
      • Since the rsv repo runs augur subsample 18 times, I turned to YAML anchors and aliases to avoid excessive duplication. This isn't great, and we shouldn't expect users to do the same.
      • By storing the config within Snakemake workflow config, we are subject to Snakemake's config merging behavior which results in a poor experience for external users providing additional config with --configfile.

It may be time to reconsider whether this is the right pattern going forwards.
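
The config merging problem can be illustrated with a small Python sketch. It assumes Snakemake combines config files by recursively updating nested mappings; `recursive_update` is my own stand-in for that behavior, not Snakemake's code, and the config keys are illustrative only.

```python
import copy


def recursive_update(base, overlay):
    """Merge overlay into a copy of base.

    Nested dicts are merged key by key; any other value is replaced.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = recursive_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Defaults shipped with the workflow (hypothetical keys and values).
defaults = {
    "subsample": {
        "samples": {
            "regional": {"max_sequences": 1000},
            "global": {"max_sequences": 500},
        }
    }
}

# An external user supplies what they intend to be the full set of samples.
user = {"subsample": {"samples": {"focal": {"max_sequences": 200}}}}

merged = recursive_update(defaults, user)
print(sorted(merged["subsample"]["samples"]))  # ['focal', 'global', 'regional']
```

Under this assumption the user's samples are added to the defaults rather than replacing them, so there is no way to drop a default sample from an additional config file alone.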

Alternatives

  1. (1) from the "Background" section.
    • Prototype for measles.
    • Note: this requires augur subsample to search for filepaths relative to the config file location (issue).
  2. (2) from the "Background" section, but store a templated config in the workflow config and pre-process it.
    • This would require workflow-level code to expand the templated config into per-invocation config.
    • This would need a new schema for the concise templated config. I think it can be made flexible enough to be pathogen-agnostic, but I need to do some prototyping first.
    • This doesn't solve the problem with Snakemake's config merging behavior for external users. That could be solved by a custom config merge.
    • Consider dumping just the specific subsample section to a YAML file to take advantage of Snakemake's input file change detection. Dumping the entire workflow config, as is done currently, means augur subsample reruns on every config change, however small.
  3. (3) from the "Background" section.
    • Prototype for rsv.
    • Note: for nextstrain run compatibility, this would require augur subsample to search for filepaths relative to workflow analysis directory (issue).
  4. ?
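
For alternative (2), the expansion and per-section dump could look roughly like the Python sketch below. This is not a proposed schema: the `{build}` placeholder, the function names, and the config keys are all hypothetical. It also assumes that leaving an unchanged file untouched is enough to keep Snakemake from rerunning the rule, and it writes JSON (also valid YAML 1.2) only to stay dependency-free.

```python
import json
from pathlib import Path


def expand_template(template, builds):
    """Return one config per build, filling '{build}' in every string value."""

    def fill(node, build):
        if isinstance(node, dict):
            return {key: fill(value, build) for key, value in node.items()}
        if isinstance(node, list):
            return [fill(value, build) for value in node]
        if isinstance(node, str):
            # str.replace leaves any other braces in the value untouched.
            return node.replace("{build}", build)
        return node

    return {build: fill(template, build) for build in builds}


def write_if_changed(section, path):
    """Write one expanded section to its own file, skipping unchanged content.

    Returns True if the file was written, False if it was already up to date.
    """
    text = json.dumps(section, indent=2, sort_keys=True) + "\n"
    path = Path(path)
    if path.exists() and path.read_text() == text:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return True
```

A workflow could call `expand_template(config["subsample_template"], builds)` once, then `write_if_changed(expanded[build], f"results/{build}/subsample_config.yaml")` for each build, so that each augur subsample invocation depends only on its own file.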

Links
