Sorry in advance for the long text. As a disclaimer, I plan to tag others here and will probably not be too involved in this discussion for now (if it happens), so please feel free to manage the issue as you see fit.
First a Q: What is the main use case or reason for foreach stages? I haven't been able to find an independent issue where this is explained. It seems to have spawned naturally out of the parameterization feature (OP #3633). The usage is explained in its PR (#4734) but again I can't find the official motivation (what problem gets solved).
Edit: I found some motivation in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471/2 towards the idea of "generalizing stages", OK let's go with that for now.
The reason I ask is that, while it's a well-engineered feature, the current syntax may encourage a "misuse" of DVC. Specifically, it seems rather difficult to connect the stages defined inside foreach/do to one another (let alone across foreach clauses). For example:
    stages:
      mystages:
        foreach: ${mylist}
        do:
          cmd: ./${item.exec} in ${item.in} out ${item.out}
First, the command in each stage should probably change (thus ${item.exec} above).
Second, these stages would form a pipeline only if the ${item.out} of stage i equals the ${item.in} of stage i+1 (for i = 1 ... len(mylist) - 1). This kind of pattern implies a very careful construction of params.yaml (or embedded vars:).
While quite doable, I doubt all this will be obvious to users.
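To make the "careful construction" concrete, here is a rough sketch of a params.yaml whose entries would chain. The script and file names are made up for illustration; the point is only that each entry's `out` has to be repeated by hand as the next entry's `in`:

```yaml
# params.yaml (hypothetical values, not from a real project)
mylist:
  - exec: prepare.sh
    in: raw.csv
    out: clean.csv      # must match the next entry's "in"
  - exec: featurize.sh
    in: clean.csv
    out: features.csv   # must match the next entry's "in"
  - exec: train.sh
    in: features.csv
    out: model.pkl
```

The example above only has a cmd, so I'm also assuming the do: block would need to declare deps and outs for DVC to link the generated stages at all, something like:

```yaml
# dvc.yaml (sketch; the deps/outs entries are my assumption, not part of the example above)
stages:
  mystages:
    foreach: ${mylist}
    do:
      cmd: ./${item.exec} in ${item.in} out ${item.out}
      deps:
        - ${item.in}
      outs:
        - ${item.out}
```

If any `out`/`in` pair in the list doesn't match, the corresponding stages end up disconnected again.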
We can document it (maybe transfer this to dvc.org if that is the conclusion), but most users tend to jump into usage first and ask questions later. And it seems to me that by far the most intuitive way to use this now is to create a bunch of completely disconnected stages — which earlier led me to discuss #5181 — that are really one stage that partitions data inputs/outputs for batching or parallel processing (which we don't support yet).
Per some earlier discussions (example) I know we don't want foreach stages to be considered "groups", so I assume this is a misuse.
Questions:
Is there an alternative syntax that doesn't encourage disconnected stage sets? (perhaps a 2nd substitution expression like $${list.val} instead of foreach/do, as suggested originally).
… vars:?
Thanks!