Skip to content

docs: repro: add --pull#1841

Merged
shcheklein merged 2 commits into
masterfrom
repro_pull
Oct 6, 2020
Merged

docs: repro: add --pull#1841
shcheklein merged 2 commits into
masterfrom
repro_pull

Conversation

@efiop
Copy link
Copy Markdown
Contributor

@efiop efiop commented Oct 5, 2020

Per treeverse/dvc#4538

You may disregard these recommendations if you used the Edit on GitHub button from dvc.org to improve a doc in place.

❗ Please read the guidelines in the Contributing to the Documentation list if you make any substantial changes to the documentation or JS engine.

🐛 Please make sure to mention Fix #issue (if applicable) in the description of the PR. This causes GitHub to close it automatically when the PR is merged.

Please choose to allow us to edit your branch when creating the PR.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Comment thread content/docs/command-reference/repro.md Outdated
Comment thread content/docs/command-reference/repro.md Outdated
corresponding pipelines, including the target stages themselves. This option
has no effect if `targets` are not provided.

- `--pull` - try automatically pulling cached outputs if they are not present in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, a few questions:

  1. try - what happens if it fails?
  2. pulling - make it a link to the dvc pull probably
  3. cached outputs - here not sure if it's better to use DVC-tracked outputs. (otherwise when you read it is bit hard mentally since you are they cached but not present in cache).

WDYT?

@jorgeorpinel ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. as it was before, it will simply not restore from run-cache.

Addressed 2 and 3. Thank you.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, got it. I guess it's fine for now. No reason to further improve this since we don't have run-cache documented anywhere. So we can keep as is -an advanced option.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have run-cache documented anywhere

BTW the run-cache is already mentioned in 6 cmd refs (published) and in the Data Pipelines page of the GS, which I just noticed/realized just now. I thought we were not going to include any info about experiments until it's more stable? Should we remove these mentions or prioritize documenting run-cache? Thanks

Copy link
Copy Markdown
Contributor Author

@efiop efiop Oct 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel Let's not remove it. It is in a semi-official state, people already use it due to cml and other sources. We are on our way to cleaning up the ui overall and publishing experiments.

Run-cache doc by itself doesn't really mean anything to the users, which is why I didn't write it in the summer. It only makes sense in particular commands, so the doc about run-cache internals could wait for the high-level commands.

Copy link
Copy Markdown
Contributor

@jorgeorpinel jorgeorpinel Oct 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not remove remove it. It is in a semi-official state

OK, I agree it's best too keep, but it could be problematic that the run-cache mentions are completely out of context (no explanation of the concept anywhere).

Run-cache doc by itself doesn't really mean anything to the users... the doc about run-cache internals could wait...

Much disagree 🐶 I mean it's not so important whether it's a stand-alone doc or a new section in existing doc(s), but the basics about run-cache seem like a quite important thing to document to me.

It only makes sense in particular commands

Yeah anywhere we want to put it as long as it's published would be great since this is already semi-official.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel Agreed, I've added the run-cache doc ticket to next sprint, just preliminarily. Thanks 🙂

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thank! 🦊

@shcheklein shcheklein temporarily deployed to dvc-landing-repro-pull-uuofdyr October 6, 2020 00:46 Inactive
@shcheklein shcheklein merged commit 15aa177 into master Oct 6, 2020
@efiop efiop deleted the repro_pull branch October 6, 2020 01:05
Comment thread content/docs/command-reference/repro.md
Comment on lines +158 to +159
- `--pull` - try automatically [pulling](/doc/command-reference/pull) missing
cache for outputs restored from run-cache.
Copy link
Copy Markdown
Contributor

@jorgeorpinel jorgeorpinel Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back on this. Per treeverse/dvc#4538 (comment):

dvc repro --pull pulls regular files, hashes for which might've been restored from the existing run-cache, so kinda like regular dvc pull

Unfortunately I don't understand either one of the explanations. What's the relationship between run-cache and repro --pull? Maybe a step-by-step explanation like 1. Use repro --pull; 2. run-cache is checked before executing commands (default repro behavior I think); 3. Some output hashes are found? (but not the actual files? This is the confusing part); 4. Hashes are looked for in the cache but not found; 5. The files are looked for in remote storage. Something like that

Please @efiop ! Thanks in advance

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel Even if we leave the run-cache out, repro --pull would still try to dvc pull outputs that are missing, but the pipeline didn't change. E.g. when you forgot to dvc pull beforehand and you are trying to dvc repro otherwise up-to-date pipeline, so dvc repro --pull will just pull the outputs for such stages instead of trying to reproduce them.

Run-cache is then just a special source of lock files, and repro --pull works the same way as explained above.

Want to point out again that --pull is still a temporary solution that was needed to improve pull --run-cache that is also not complete in a product sense. So I would recommend not spending much time on this, as the product scenario is WIP and there is no reason to optimize the docs for it too much.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK it makes more sense now, thanks.

In this case I do feel like need to spend enough time understanding what's going on so that when the coming bulk of docs related to new features hit, I'm better prepared. So thanks again for baring with me!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last Q @efiop. Does this only check the default remote (if one is set)? Or all remotes?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, 2 more questions...

  • Does it check only the local run-cache? Or also the remote run-cache for possible dep/out hashes?
  • What happens if you do repro --pull --no-run-cache? Is the run-cache check skipped?

Thanks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only check the default remote (if one is set)? Or all remotes?

Only the default remote right now.

Does it check only the local run-cache? Or also the remote run-cache for possible dep/out hashes?

Yes, only local run-cache.

What happens if you do repro --pull --no-run-cache? Is the run-cache check skipped?

Correct. It will only pull if you have your lock file complete (so hashes are already there, just the outputs are missing from cache), but won't try to use run-cache.

Please feel free to ask any questions, I do understand that this incomplete feature is a bit confusing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel jorgeorpinel mentioned this pull request Oct 20, 2020
jorgeorpinel added a commit that referenced this pull request Oct 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants