augur subsample: merging of external & default configs #106

@victorlin

#103 works fine for the automated Nextstrain builds, but it requires unusual workarounds for external users with custom config. Example on NW-PaGe:

    # FIXME: this should be named 'contextual' but 'all-time' is necessary to override defaults.
    all-time:
      exclude: config/outliers_ppx.txt
      exclude_where:
        - qc.overallStatus=bad
        - qc.overallStatus=mediocre
      group_by:
        - year
        - country
      min_date: 1975-01-01
      max_sequences: 3000
      min_length: 10000
      query: genome_coverage>0.3 & missing_data<1000 & division != 'Washington'

The issue is that subsample.'genome/all-time'.samples.all-time is added by default, so any external config must either accept that sample as part of the final output or, at the least, redefine it with custom config (which is what I did above).

This is a config merging issue that technically applies to WNV too, where augur subsample is already used. NW-PaGe/WNV's external config defines a Washington-focused build named wa. The custom name is an unintentional workaround for the issue: wa is not present in the default config, so it does not inherit any build-specific defaults. If NW-PaGe/WNV wanted to re-use any of the existing build names from nextstrain/WNV, it would encounter the same config merging issue.

Using a different build name is not a valid workaround for rsv because the workflow is only meant to handle certain build names based on wildcards build_name=genome|G|F and resolution=all-time|6y|3y.
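
To make the merging behavior concrete, here is a minimal sketch in Python. It assumes the workflow config and the external config are combined by a recursive dict merge; deep_merge is a simplified stand-in written for this example, not Snakemake's actual code, and the max_sequences and min_length values in both configs are placeholders.

```python
import copy


def deep_merge(base, override):
    """Recursively merge override into a copy of base.

    Nested dicts are merged key by key; any other value in override
    replaces the value in base.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical, heavily trimmed stand-in for the default workflow config.
default_config = {
    "subsample": {
        "genome/all-time": {
            "samples": {"all-time": {"max_sequences": 3000}},
        }
    }
}

# Hypothetical external config that only wants a sample named 'contextual'.
external_config = {
    "subsample": {
        "genome/all-time": {
            "samples": {"contextual": {"max_sequences": 3000, "min_length": 10000}},
        }
    }
}

merged = deep_merge(default_config, external_config)
samples = merged["subsample"]["genome/all-time"]["samples"]
print(sorted(samples))  # ['all-time', 'contextual']
```

Under that assumption, the default all-time sample survives the merge alongside contextual, which is why the external config above has to reuse the name all-time to take control of it.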

Possible solutions

Options 1-7 were my initial ideas but they aren't great. Option 8 seems to be best so far. Open to any other ideas.

  1. Use the current workaround of re-defining existing samples.

    • [con] This wouldn't work if there are more default samples than custom samples to be defined.
    • [con] This merges the existing sample config in, which could be considered a feature but is likely to cause confusing behavior if not overridden.
    • [con] New or renamed default samples would break the workaround.
  2. Update augur subsample to ignore "empty" samples.

    • This allows external config to "remove" default samples by overwriting them with null:

      samples:
        all-time: null
        wa: 
        contextual: 
    • [con] New or renamed default samples would break the workaround.

  3. Trick Snakemake into removing subsample.'genome/all-time'.samples.all-time (or even better, all of subsample) for external configs.

    • I've tried various approaches, but haven't found one that works.
  4. Move Nextstrain-specific defaults out of the default workflow config file.

    • This was discussed when reworking WNV config, and the decision was to keep them in defaults.
  5. Adjust the workflow to allow custom build names.

    • This would allow for the same workaround that NW-PaGe/WNV uses.
    • I haven't tried yet, but my guess is it will complicate the workflow.
  6. Separate augur subsample config from Snakemake config.

  7. Ignore all defaults with Snakemake's --replace-workflow-config.

    • [con] There are some useful defaults. At least in this case, it's mainly subsample that doesn't need to be inherited.
  8. Add a custom_subsample config param to override the default subsample config. (ref)

    • [con] Two different config params for the same rule.
    • There's precedent with inputs + additional_inputs.
  9. Custom workflow config merging public#28
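
As a rough illustration of how options 2 and 8 could behave, here is a sketch in Python. Both helpers are hypothetical: drop_null_samples is not an existing augur subsample feature, and the "replace wholesale" semantics of custom_subsample in resolve_subsample are an assumption about how that param might work. All numeric values are placeholders.

```python
def drop_null_samples(samples):
    """Option 2 sketch: treat a sample set to null (None in Python) as removed."""
    return {name: cfg for name, cfg in samples.items() if cfg is not None}


def resolve_subsample(config):
    """Option 8 sketch: custom_subsample, when present, replaces subsample wholesale."""
    if "custom_subsample" in config:
        return config["custom_subsample"]
    return config["subsample"]


# Option 2: samples as they might look after merging an external config
# that nulls out the default 'all-time' sample.
merged_samples = {
    "all-time": None,
    "wa": {"max_sequences": 1000},
    "contextual": {"max_sequences": 3000},
}
print(list(drop_null_samples(merged_samples)))  # ['wa', 'contextual']

# Option 8: the default subsample config is still present after merging,
# but it is ignored because custom_subsample is defined.
config = {
    "subsample": {
        "genome/all-time": {"samples": {"all-time": {"max_sequences": 3000}}},
    },
    "custom_subsample": {
        "genome/all-time": {"samples": {"contextual": {"max_sequences": 3000}}},
    },
}
print(list(resolve_subsample(config)["genome/all-time"]["samples"]))  # ['contextual']
```

In this sketch, option 2 still requires the external config to know the name of every default sample, while option 8 does not depend on the default sample names at all.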
