Use augur subsample by victorlin · Pull Request #103 · nextstrain/rsv

victorlin · 2025-09-22T22:04:53Z

Description of proposed changes

This PR contains 2 prep commits + 1 main commit. Message from main commit:

The previous subsampling implementation was fixed to a two-sample recent+background split with some hardcoded parameters. Replacing it with augur subsample allows for more flexible configuration.

To keep the workflow config schema concise, we generate each augur subsample config dynamically using a matrix defined in the config and merging code in config.smk.

This is a breaking change and the old configuration will no longer work.

Related issue(s)

Closes #101

Checklist

Draft necessary changes on NW-PaGe fork (branch)
Checks pass
~~Update changelog~~

old implementations

YAML anchors and aliases: b0d6728
matrix: d24a8d1
dataset-first glob syntax: 5741f0f

joverlee521 · 2025-10-06T21:00:33Z

+  genome/all-time:
+    samples:
+      all-time:
+        <<: [*subsample_genome, *subsample_all-time, *subsample_defaults]


In general, I'm not sure this pattern of YAML anchors/aliases is very user friendly. If we do use anchors/aliases, then we should probably disable them in the dumped config so that users can at least see the fully expanded config in results/run_config.yaml (see suggested workaround in yaml/pyyaml#103).

I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu. These could be expanded at start of Snakemake (or whatever comes of discussion in nextstrain/public#23).

If we do use anchors/aliases, then we should probably disable them in the dumped config

Yes, agreed. (I'd go further: we should write out small-multiples with lots of duplication, but that's beyond this PR.)

I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu.

Across the 9 builds almost all of the difference is the subsampling parameters, so at some level it is going to be either complex or verbose. I don't think glob-like syntax will help here with the structure as it is now, but it could if you moved towards something more like:

subsample: samples: min_length: "genome/*": 10_000 "G/*": 600 "F/*": 1_200 group_by: "year country" min_coverage: 0.3 resolutions: "*/all-time": {"min_date": "1795-01-01"} "*/6y": {"min_date": "6Y", "background_min_date": "1975-01-01"} "*/3y": {"min_date": "3Y", "background_min_date": "1975-01-01"}

I also don't think we should use YAML anchors and aliases, and would like to decide on a good alternative in nextstrain/public#27. I'll consider something like the above with config pre-processing to translate into augur subsample-ready config.

For my ability to follow this thread, the reason against using YAML anchors and aliases:

not sure if the pattern of YAML anchors/aliases is very user friendly (for which users? A google search is returning several articles recommending YAML anchors (example) to avoid YAML duplication, but maybe these are not our user group or feel free to add comments from WA-DOH/others etc.)

requires resolving the anchors explicitly in the dumped config (similar lift as config pre-processing to translate into augur subsample-ready config)

Feel free to add to the above list, mostly looking for a summary statement

not sure if the pattern of YAML anchors/aliases is very user friendly

I don't think we're advising against anchors per-se -- they can be very useful! -- rather I think the pushback was against the specific usage of anchors in this config. (The pushback wasn't from me, so I won't elaborate.)

requires resolving the anchors explicitly in the dumped config

Anchors are resolved automatically, i.e. yaml.dump(yaml.load(...)) will not preserve anchors.

Pushed up another version of glob syntax in bff7e53. This is the same structure as avian-flu with some additional features.

rsv/config/configfile.yaml

Lines 69 to 77 in 08eb257

max_sequences:

'*/*/all-time': 3000

'*/F-antibody-escape/all-time': 2000

'*/*/(6y|3y)':

recent: 3000

background: 300

'*/F-antibody-escape/(6y|3y)':

recent: 2000

background: 200

Also added a draft of the same syntax for refine config: 08eb257.

I'm biased but I really like the "option-first" syntax for refine (and other non-subsample rules) shown in 08eb257 etc

One minor note (not to be implemented here, just for reference), in avian-flu you can supply a scalar value directly if all wildcard combinations use the same value. E.g. the following are identical:

coalescent: '*/*/*': opt coalescent: opt

For subsample - reading the vaious YAMLs (and only reading them, not actually testing them), I feel there's too much magic. I couldn't confidently tell you how each dataset will be filtered! I'm sure I could learn, but from a first impression it feels hard! To riff on various approaches, have you explored something like this:

subsample: # Define which samples each dataset uses # (I couldn't work this out from just reading bff7e5 or e54f10, so contents may be wrong!) - dataset: '*/*/all-time': - all-time '*/*/(6y|3y)': - recent - background - samples: all-time: exclude: config/outliers_ppx.txt include: 'a/*/*': config/include_a.txt 'b/*/*': config/include_b.txt min_date: 1975-01-01 max_sequences: '*/*/*': 3000 '*/F-antibody-escape/*': 2000 ...

Yeah, I like the "option-first" approach in bff7e53. I think it makes it easier to stitch together which options are used per build. However, I'm a little unsure about the concatenation of query and list options. This keeps the config concise, but makes it's more difficult to understand how the override of the default config works for these options.

At first glance, the glob syntax being a little daunting for users to grok and edit. They can read the results/**_config.yaml to see what the expanded config will be, but that does not help with editing or overriding the config...I wonder if we should also support directly providing the subsample config, i.e.

subsample: "a/genome/6y": samples: recent: ... background: ...

Then they can copy the expanded config into their workflow config and edit the values as needed.

Thanks both, let's stick with the option-first idea and make adjustments from there.

[@jameshadfield] One minor note (not to be implemented here, just for reference), in avian-flu you can supply a scalar value directly if all wildcard combinations use the same value.

This works nicely for backwards compatibility. I'll look into adding it here.

[@jameshadfield] For subsample… I feel there's too much magic.

If I understand correctly, the "magic" lies in the relationship between datasets and samples. As Jover said, results/**_config.yaml files should help, but yes it's not very obvious from the workflow config alone.

[@jameshadfield] To riff on various approaches, have you explored something like this …

These are the differences I see in your example, let me know if I got anything wrong:

List instead of dict under subsample

A new declaration block

Updated config-defining syntax of samples > options > datasets

(1) is fine. Earlier I advocated for list with two reasons: (a) to have an intuitive merge order and (b) to avoid Snakemake's config merging issues. I've since realized that (a) is not much of an issue because dict keys are guaranteed to be ordered in Python. (b) is still an issue, but we're already using the separate key workaround elsewhere (custom_subsample, additional_inputs).

(2) seems reasonable. It's extra verbosity, but helps readability and could also be used for validation. We'd have to consider how it fits with the rest of the config. Something like this could work:

subsample: samples: '*/*/(6y|3y)': [recent, background] # no need for all-time; config processing automatically creates a single sample options: query: '*/*/*': missing_data<1000 …

(3) I'm not sure about this. The "samples > options > datasets" ordering feels backwards. Where would the global defaults go (*/*/*)?

[@joverlee521] At first glance, the glob syntax being a little daunting for users to grok and edit...I wonder if we should also support directly providing the subsample config

Good idea. I suspect most users won't be producing very many datasets, and the direct subsample config will be easier to work with. I'll look into adding this support.

let's stick with the option-first idea and make adjustments from there.

I find the subsampling section in bff7e53 much easier to read than e54f105. Although I further find the "samples > options > datasets" ordering more natural, as I tend to think of samples as the most important concept.

If I understand correctly, the "magic" lies in the relationship between datasets and samples.

For me the magic is two-fold, although they're inter-related:

not knowing which samples a dataset is using. In e54f105 or bff7e53, do any datasets just use the default sample, or do they all use either {all-time} or {recent, background}? I think bff7e53 is the former and e54f105 is the latter.

What's the behaviour of the default sample? It seems to be both a sample on its own (for datasets which don't explicitly define sample names) and also merged into each and every explicitly defined sample?

(I know "default" is the word to use, but from the YAML structure it's more like the "implicit/unspecified" sample at the moment.)

As Jover said, results/**_config.yaml files should help, but yes it's not very obvious from the workflow config alone.

I think we're never going to find a dead-simple config structure, but I think we can find one where you can figure it out by reading the YAML a couple of times, perhaps cross-referencing results/**_config.yaml to double-check. I think I understand the YAMLs in e54f105 or bff7e53 but only after reading them many many times!

(1) List instead of dict under subsample

This is me being bad at YAML. 'dataset' and 'samples' were intended to be keys in the main 'subsample' dict, to better support merging. Merging is important in our configs more generally so we try to keep this possibility. That being said, the subsampling YAMLs are hard enough as they are, having easy-to-reason-with overlay YAMLs seems.... hard!

(2) seems reasonable. It's extra verbosity, but helps readability and could also be used for validation.

Making the association between datasets and samples clearer is a big one for me. If we go with the option-first idea something like your example in (2) is good, but... the "no need for all-time; config processing automatically creates a single sample" is more magic! We should find some way of explicitly saying that '*/*/all-time' uses the defaults. I came up with samples['*/*/all-time'] = [], but that's also a little magic, or samples['*/*/all-time'] = ['DEFAULT'], but that's odd because default doesn't appear in the defined samples for */*/(6y|3y) even though it will be used.

(3) ... Where would the global defaults go (*/*/*)?

I generally find the concept of defaults hard to reason with in all of these ideas. However I see that without it we have verbose YAMLs, even if we use anchors. (And of course you can write YAMLs which don't make any use of the defaults if you want to!)

I think clarity would be improved if we made it explicit. For bff7e53 this would look like:

options: include: DEFAULT: 'a/*/*': config/include_a.txt 'b/*/*': config/include_b.txt

For my "samples > options > datasets" configuration it would look like

samples: default: include: 'a/*/*': config/include_a.txt 'b/*/*': config/include_b.txt

victorlin · 2025-10-07T02:18:29Z

I'll wait for a decision in nextstrain/public#27 before continuing here.

Preparing to use build_dir in config.smk, and I figured it'd be good to move both auspice_dir with it.

Similar to "Add separate frequencies config" (0b22185), this rule shouldn't rely on config from another rule.

To be used for validating augur subsample config.

The previous subsampling implementation was fixed to a two-sample recent+background split with some hardcoded parameters. Replacing it with augur subsample allows for more flexible configuration. To keep the workflow config schema concise, we generate each augur subsample config dynamically using a patterns defined in the config helper functions in config.smk. This is a breaking change and the old configuration will no longer work.

victorlin self-assigned this Sep 22, 2025

victorlin force-pushed the victorlin/use-augur-subsample branch from e869e31 to 18232af Compare September 22, 2025 22:24

victorlin mentioned this pull request Sep 22, 2025

Config improvements #102

Merged

2 tasks

victorlin linked an issue Sep 22, 2025 that may be closed by this pull request

Use augur subsample #101

Open

victorlin mentioned this pull request Sep 23, 2025

augur subsample: merging of external & default configs #106

Closed

victorlin force-pushed the victorlin/use-augur-subsample branch from 18232af to b0d6728 Compare September 24, 2025 20:20

joverlee521 reviewed Oct 6, 2025

View reviewed changes

victorlin mentioned this pull request Oct 7, 2025

Use augur subsample nextstrain/mumps#45

Draft

3 tasks

victorlin marked this pull request as draft October 7, 2025 02:18

victorlin mentioned this pull request Oct 14, 2025

Disable aliases in dumped config nextstrain/shared#62

Merged

3 tasks

victorlin force-pushed the victorlin/use-augur-subsample branch from b0d6728 to 5ee2efa Compare February 21, 2026 03:04

victorlin force-pushed the victorlin/update-filter-config branch from 3340439 to 7dd63f2 Compare February 21, 2026 03:04

victorlin force-pushed the victorlin/use-augur-subsample branch from 5ee2efa to 5490644 Compare February 26, 2026 01:04

victorlin force-pushed the victorlin/use-augur-subsample branch from 5490644 to 42aa5f6 Compare March 6, 2026 01:48

victorlin force-pushed the victorlin/update-filter-config branch from e243d2f to 0b22185 Compare March 6, 2026 01:48

Base automatically changed from victorlin/update-filter-config to master March 6, 2026 18:55

victorlin force-pushed the victorlin/use-augur-subsample branch from 42aa5f6 to b437074 Compare March 7, 2026 02:28

victorlin force-pushed the victorlin/use-augur-subsample branch from a727f0c to 9b071fb Compare March 24, 2026 00:01

victorlin force-pushed the victorlin/use-augur-subsample branch from 9b071fb to 627a61b Compare April 8, 2026 00:28

victorlin and others added 2 commits April 8, 2026 13:59

Move constants up in file

b7cb625

Preparing to use build_dir in config.smk, and I figured it'd be good to move both auspice_dir with it.

Add separate config section for filter_for_pre_subsample_alignment

e052166

Similar to "Add separate frequencies config" (0b22185), this rule shouldn't rely on config from another rule.

victorlin force-pushed the victorlin/use-augur-subsample branch 2 times, most recently from e54f105 to e48b466 Compare April 9, 2026 18:20

victorlin added 2 commits April 9, 2026 11:36

Define valid dataset parts

6b44998

To be used for validating augur subsample config.

victorlin force-pushed the victorlin/use-augur-subsample branch from e48b466 to 5741f0f Compare April 9, 2026 18:37

joverlee521 mentioned this pull request Apr 10, 2026

phylo: Support different build parameters for augur refine nextstrain/measles#118

Open

victorlin added 2 commits April 16, 2026 15:00

🚧 option-first glob syntax

bff7e53

🚧 glob syntax for refine

08eb257

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use augur subsample#103

Use augur subsample#103
victorlin wants to merge 6 commits intomasterfrom
victorlin/use-augur-subsample

victorlin commented Sep 22, 2025 •

edited

Loading

Uh oh!

joverlee521 Oct 6, 2025

Uh oh!

jameshadfield Oct 6, 2025 •

edited

Loading

Uh oh!

victorlin Oct 7, 2025

Uh oh!

j23414 Oct 7, 2025

Uh oh!

jameshadfield Oct 8, 2025

Uh oh!

victorlin Apr 16, 2026

Uh oh!

jameshadfield Apr 17, 2026 •

edited

Loading

Uh oh!

joverlee521 Apr 17, 2026

Uh oh!

victorlin Apr 20, 2026

Uh oh!

jameshadfield Apr 20, 2026

Uh oh!

victorlin commented Oct 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	max_sequences:
	'//all-time': 3000
	'*/F-antibody-escape/all-time': 2000
	'//(6y\|3y)':
	recent: 3000
	background: 300
	'*/F-antibody-escape/(6y\|3y)':
	recent: 2000
	background: 200

Conversation

victorlin commented Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of proposed changes

Related issue(s)

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameshadfield Oct 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameshadfield Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

victorlin commented Oct 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

victorlin commented Sep 22, 2025 •

edited

Loading

jameshadfield Oct 6, 2025 •

edited

Loading

jameshadfield Apr 17, 2026 •

edited

Loading