dvc.yaml: future of foreach stages #5440

@jorgeorpinel

Sorry in advance for the long text. As a disclaimer, I plan to tag others here and will probably not be too involved in this discussion for now (if it happens), so please feel free to manage the issue as you see fit.


First a Q: What is the main use case or reason for foreach stages? I haven't been able to find an independent issue where this is explained. It seems to have spawned naturally out of the parameterization feature (OP #3633). The usage is explained in its PR (#4734) but again I can't find the official motivation (what problem gets solved).
Edit: I found some motivation in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471/2 towards the idea of "generalizing stages", OK let's go with that for now.

The reason I ask is that while it's a well-engineered feature, the current syntax may encourage a "misuse" of DVC. Specifically, it seems rather difficult to connect the stages defined inside foreach/do to one another (let alone across foreach clauses). For example:

stages:
  mystages:
    foreach: ${mylist}
    do:
      cmd: ./${item.exec} in ${item.in} out ${item.out}

First, the command in each stage should probably change (thus ${item.exec} above).
Second, these stages would form a pipeline only if the ${item.out} of item i equals the ${item.in} of item i+1 (for i = 1 ... len(mylist) - 1). This kind of pattern implies a very careful construction of params.yaml (or embedded vars:).
While quite doable, I doubt all this will be obvious to users.
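
To make the chaining condition concrete, here is a minimal sketch of what such a carefully constructed params.yaml could look like. The script and file names are hypothetical; the only point is that each item's out matches the next item's in.

```yaml
# params.yaml (hypothetical values, for illustration only)
mylist:
  - exec: prepare.sh
    in: raw.csv
    out: clean.csv       # same path as the next item's "in"
  - exec: featurize.sh
    in: clean.csv
    out: features.csv    # same path as the next item's "in"
  - exec: train.sh
    in: features.csv
    out: model.pkl
```

Note that DVC connects stages through their declared deps and outs, not through the command string, so the example above would also need those fields for the generated stages to actually link up. A sketch under that assumption:

```yaml
# dvc.yaml (sketch; deps/outs are added here and are not in the example above)
stages:
  mystages:
    foreach: ${mylist}
    do:
      cmd: ./${item.exec} in ${item.in} out ${item.out}
      deps:
        - ${item.exec}
        - ${item.in}
      outs:
        - ${item.out}
```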

We can document it (maybe transfer this to dvc.org if that is the conclusion), but most users tend to jump into usage first and ask questions later. And it seems to me that by far the most intuitive way to use this now is to create a bunch of completely disconnected stages (which earlier led me to discuss #5181). Those are really one stage that partitions data inputs/outputs for batching or parallel processing, which we don't support yet.
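
For comparison, the "disconnected stages" usage might look roughly like the sketch below. The partition names, the process.sh script, and the directory layout are all hypothetical. Every generated stage runs the same command on its own partition, and none of them depends on another.

```yaml
# dvc.yaml (sketch with hypothetical names; no generated stage feeds another)
stages:
  process:
    foreach:
      - part1
      - part2
      - part3
    do:
      cmd: ./process.sh data/${item}.csv results/${item}.csv
      deps:
        - process.sh
        - data/${item}.csv
      outs:
        - results/${item}.csv
```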

So we may end up running into support cases like https://discuss.dvc.org/t/versioning-predictions/656 very often. BTW that case does have a nice usage for this feature: facilitate per-file data provenance.

Per some earlier discussions (example) I know we don't want foreach stages to be considered "groups", so I assume this is a misuse.

Questions:

Thanks!

See also #5172

Labels: discussion, enhancement