docs: repro: add --pull #1841
corresponding pipelines, including the target stages themselves. This option
has no effect if `targets` are not provided.

- `--pull` - try automatically pulling cached outputs if they are not present in
okay, a few questions:
1. `try` - what happens if it fails?
2. `pulling` - make it a link to the `dvc pull` probably
3. `cached outputs` - here not sure if it's better to use "DVC-tracked outputs" (otherwise it's a bit hard to follow when you read it: how are they cached but not present in cache?).
WDYT?
1. As before, it will simply not restore from run-cache.
Addressed 2 and 3. Thank you.
Yep, got it. I guess it's fine for now. No reason to improve this further since we don't have run-cache documented anywhere. So we can keep it as is, as an advanced option.
> we don't have run-cache documented anywhere
BTW the run-cache is already mentioned in 6 cmd refs (published) and in the Data Pipelines page of the GS, which I only just noticed. I thought we were not going to include any info about experiments until it's more stable? Should we remove these mentions or prioritize documenting run-cache? Thanks
@jorgeorpinel Let's not remove it. It is in a semi-official state, and people already use it because of CML and other sources. We are on our way to cleaning up the UI overall and publishing experiments.
Run-cache doc by itself doesn't really mean anything to the users, which is why I didn't write it in the summer. It only makes sense in particular commands, so the doc about run-cache internals could wait for the high-level commands.
> Let's not remove it. It is in a semi-official state
OK, I agree it's best to keep, but it could be problematic that the run-cache mentions are completely out of context (no explanation of the concept anywhere).
> Run-cache doc by itself doesn't really mean anything to the users... the doc about run-cache internals could wait...
Much disagree 🐶 I mean it's not so important whether it's a stand-alone doc or a new section in existing doc(s), but the basics about run-cache seem like quite an important thing to document to me.
> It only makes sense in particular commands
Yeah, anywhere we want to put it would be great, as long as it's published, since this is already semi-official.
@jorgeorpinel Agreed, I've added the run-cache doc ticket to next sprint, just preliminarily. Thanks 🙂
- `--pull` - try automatically [pulling](/doc/command-reference/pull) missing
  cache for outputs restored from run-cache.
Back on this. Per treeverse/dvc#4538 (comment):
> `dvc repro --pull` pulls regular files, hashes for which might've been restored from the existing run-cache, so kinda like regular `dvc pull`
Unfortunately I don't understand either one of the explanations. What's the relationship between run-cache and `repro --pull`? Maybe a step-by-step explanation like:
1. Use `repro --pull`;
2. run-cache is checked before executing commands (default `repro` behavior I think);
3. Some output hashes are found? (but not the actual files? This is the confusing part);
4. Hashes are looked for in the cache but not found;
5. The files are looked for in remote storage.
Something like that.
Please @efiop! Thanks in advance
@jorgeorpinel Even if we leave the run-cache out, `repro --pull` would still try to `dvc pull` outputs that are missing when the pipeline hasn't changed. E.g. when you forgot to `dvc pull` beforehand and you are trying to `dvc repro` an otherwise up-to-date pipeline, `dvc repro --pull` will just pull the outputs for such stages instead of trying to reproduce them.
Run-cache is then just a special source of lock files, and repro --pull works the same way as explained above.
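To make the behavior described above concrete, here is a rough sketch of the per-stage decision in Python. It is only a toy model for discussion, not DVC's actual implementation: the `Stage` class, the `plan` function, and the action names are all made up for illustration, and the cache and remote are represented as plain sets of hashes.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical stand-in for a pipeline stage; not a DVC class."""
    name: str
    changed: bool  # True if deps or the command differ from the lock file
    out_hashes: list = field(default_factory=list)  # output hashes from the lock file


def plan(stage, local_cache, remote, pull=False):
    """Return the action this toy model takes for one stage."""
    if stage.changed:
        return "reproduce"
    missing = [h for h in stage.out_hashes if h not in local_cache]
    if not missing:
        return "skip"  # up to date, outputs are available locally
    if pull and all(h in remote for h in missing):
        return "pull"  # fetch the missing outputs instead of re-running
    return "reproduce"  # no --pull, or the pull would fail


# The "forgot to dvc pull" case: pipeline unchanged, outputs missing locally.
train = Stage("train", changed=False, out_hashes=["abc123"])
print(plan(train, local_cache=set(), remote={"abc123"}))             # reproduce
print(plan(train, local_cache=set(), remote={"abc123"}, pull=True))  # pull
```

In this model the only difference `pull=True` makes is for an unchanged stage whose outputs are missing from the local cache, which matches the scenario described in the comment.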
Want to point out again that `--pull` is still a temporary solution that was needed to improve `pull --run-cache`, which is also not complete in a product sense. So I would recommend not spending much time on this, as the product scenario is WIP and there is no reason to optimize the docs for it too much.
OK it makes more sense now, thanks.
In this case I do feel like I need to spend enough time understanding what's going on so that when the coming bulk of docs related to new features hits, I'm better prepared. So thanks again for bearing with me!
Last Q @efiop. Does this only check the default remote (if one is set)? Or all remotes?
Actually, 2 more questions...
- Does it check only the local run-cache? Or also the remote run-cache for possible dep/out hashes?
- What happens if you do `repro --pull --no-run-cache`? Is the run-cache check skipped?
Thanks!
> Does this only check the default remote (if one is set)? Or all remotes?
Only the default remote right now.
> Does it check only the local run-cache? Or also the remote run-cache for possible dep/out hashes?
Yes, only local run-cache.
> What happens if you do `repro --pull --no-run-cache`? Is the run-cache check skipped?
Correct. It will only pull if your lock file is complete (so the hashes are already there, just the outputs are missing from cache), but won't try to use run-cache.
Please feel free to ask any questions, I do understand that this incomplete feature is a bit confusing.
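Putting the three answers above together, the lookup order could be sketched roughly like this in Python. Again, this is a toy model under stated assumptions, not how DVC is implemented: `repro_pull`, its parameters, and the returned strings are hypothetical, the lock file is reduced to a `(deps_key, out_hashes)` pair, and the local run-cache is a plain dict keyed by a dependency hash.

```python
def repro_pull(deps_key, lock, run_cache, local_cache, default_remote,
               no_run_cache=False):
    """Toy model of `repro --pull` for one stage (all names are hypothetical).

    lock           -- (deps_key, out_hashes) recorded in the lock file, or None
    run_cache      -- local run-cache only: dict of deps_key -> out_hashes
    default_remote -- set of hashes in the default remote (no other remote is checked)
    """
    out_hashes = None
    if lock is not None and lock[0] == deps_key:
        out_hashes = lock[1]  # lock file is complete and still matches
    elif not no_run_cache:
        out_hashes = run_cache.get(deps_key)  # run-cache as a source of lock data
    if out_hashes is None:
        return "reproduce"

    missing = [h for h in out_hashes if h not in local_cache]
    if not missing:
        return "restore from cache"
    if all(h in default_remote for h in missing):
        return "pull from default remote"
    return "reproduce"


lock = ("deps-v1", ["out-v1"])
run_cache = {"deps-v2": ["out-v2"]}
remote = {"out-v1", "out-v2"}

# Deps changed to v2: hashes come from the local run-cache, files from the remote.
print(repro_pull("deps-v2", lock, run_cache, set(), remote))
# Same situation with --no-run-cache: the run-cache is not consulted.
print(repro_pull("deps-v2", lock, run_cache, set(), remote, no_run_cache=True))
```

The point of the sketch is just the ordering: hashes come from the lock file or (optionally) the local run-cache, and only then are the missing files looked up in the default remote.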
That's all I can think of for now. Thanks @efiop! Updated in https://github.com/iterative/dvc.org/pull/1881/files#diff-6c1f3192f09e2722ba169e9fa219b3b5158bbafa470b382c2d6135db7aa1e20d.
Per treeverse/dvc#4538