Skip to content

reproduce: Update --pull#9375

Closed
daavoo wants to merge 4 commits into
mainfrom
4742-make-that-dvc-repro-pull-pulls-all-missing-files
Closed

reproduce: Update --pull#9375
daavoo wants to merge 4 commits into
mainfrom
4742-make-that-dvc-repro-pull-pulls-all-missing-files

Conversation

@daavoo
Copy link
Copy Markdown
Contributor

@daavoo daavoo commented Apr 27, 2023

Closes #4742

@daavoo daavoo added feature is a feature A: pipelines Related to the pipelines feature labels Apr 27, 2023
@daavoo daavoo self-assigned this Apr 27, 2023
@daavoo daavoo linked an issue Apr 27, 2023 that may be closed by this pull request
daavoo

This comment was marked as outdated.

@dberenbaum

This comment was marked as outdated.

@dberenbaum

This comment was marked as outdated.

@daavoo

This comment was marked as outdated.

@dberenbaum

This comment was marked as outdated.

@dberenbaum

This comment was marked as outdated.

@daavoo

This comment was marked as outdated.

@dberenbaum

This comment was marked as outdated.

Comment thread dvc/repo/reproduce.py Outdated
try:
if kwargs.get("pull", False) and stage.is_data_source or stage.is_import:
logger.debug("Pulling '%s'", stage.addressing)
stage.repo.pull(stage.addressing)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm -1 on using high-level API in reproduce().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry Could you elaborate more?

We currently call stage.repo.cloud.pull as part of reproduce (when --pull is passed), in the restore of the run cache which is way more down the rabbit hole:

https://github.com/iterative/dvc/blob/2a04263f021db190641ee2ca077b70563c74cdea/dvc/stage/cache.py#L201-L203

Is it better to "hide" it?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataCloud.pull() is an internal API and only concerns with pulling objects that you want, so it's a good fit to use here.

The only problem with it is that it does not support cloud versioning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only concerns with pulling objects that you want

In the current P.R. I am using repo.pull(dep.addressing) so would only pull what I want.
Should I change to cloud.pull API?
Still don't understand the benefit/reason, though

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.repo.cloud.pull() has much smaller scope than repo.pull(), so it'll be faster and more reliable, and goes through less code. Also has less chances of getting broken in the future.

@daavoo daavoo force-pushed the 4742-make-that-dvc-repro-pull-pulls-all-missing-files branch from 0dccb45 to 22eaf57 Compare April 28, 2023 12:32
@daavoo daavoo changed the title reproduce: Update pull flag to also pull run cache, data_sources and imports. reproduce: Update --pull Apr 28, 2023
@daavoo daavoo force-pushed the 4742-make-that-dvc-repro-pull-pulls-all-missing-files branch from 22eaf57 to 99dc272 Compare April 28, 2023 12:51
@daavoo daavoo force-pushed the 4742-make-that-dvc-repro-pull-pulls-all-missing-files branch from 99dc272 to 61332f7 Compare May 1, 2023 10:55
@daavoo
Copy link
Copy Markdown
Contributor Author

daavoo commented May 1, 2023

@dberenbaum Wanted to double-check again the expected high-level behavior from a product perspective before getting into implementation / UX discussion (and fixing a couple of bugs).

I pushed a new version with a flag --pull-missing-deps which I believe behaves as expected.

I pushed a couple of runs to the example-get-started remote and tested with the following script:

Details
export AWS_PROFILE=iterative-sandbox

echo "FRESH CLONE"
git clone git@github.com:iterative/example-get-started.git
cd example-get-started
dvc remote add -d --local storage s3://dvc-public/remote/get-started
dvc exp run --pull --pull-missing-deps
cd ..
rm -rf example-get-started

echo "RUN CACHE TRAIN & UNCHANGED EVALUATE"
git clone git@github.com:iterative/example-get-started.git
cd example-get-started
dvc remote add -d --local storage s3://dvc-public/remote/get-started
dvc exp run -S train.n_est=60 --pull-missing-deps
cd ..
rm -rf example-get-started

echo "RUN TRAIN AND EVALUATE"
git clone git@github.com:iterative/example-get-started.git
cd example-get-started
dvc remote add -d --local storage s3://dvc-public/remote/get-started
dvc exp run -S train.n_est=66 --pull-missing-deps
cd ..
rm -rf example-get-started

echo "RUN-CACHE FEATURIZE & RUN TRAIN AND EVALUATE"
git clone git@github.com:iterative/example-get-started.git
cd example-get-started
dvc remote add -d --local storage s3://dvc-public/remote/get-started
dvc exp run -S featurize.max_features=150 -S train.n_est=77 --pull-missing-deps
cd ..
rm -rf example-get-started

Unless I am missing something, this behaves as you described in the issue.

If you agree, now the problem is that the code is a mess as it breaks many previous assumptions that we should discuss.

@daavoo daavoo force-pushed the 4742-make-that-dvc-repro-pull-pulls-all-missing-files branch from 61332f7 to 77f8209 Compare May 1, 2023 11:37
@dberenbaum
Copy link
Copy Markdown
Contributor

dberenbaum commented May 1, 2023

@daavoo Yep, the behavior looks good and matches what's described in the issue, but I think it maybe goes further than needed, and it would be better to break this down into smaller pieces. AFAICT, there are 3 main features combined here:

  1. Avoid pulling outputs that will be overwritten by the run. This is what your initial PR did. I think it makes sense to merge that since it's already useful and seems relatively straightforward. Edit: it also fixes the initial request in Make that dvc repro --pull pulls all missing files. #4742.
  2. Avoid pulling dependencies for stages that have not changed from what is in dvc.lock. This is the most important next step IMO since it is a common scenario (I either made changes only to downstream changes or want to verify that nothing changed).
  3. Avoid pulling dependencies for stages that match a run-cache entry. This is very cool but not as necessary IMO (sorry, I should have left run-cache out of the discussion in Make that dvc repro --pull pulls all missing files. #4742). I don't think it's as commonly needed, and it seems more confusing to users (changing the dvc.lock without pulling anything might be hard to understand).

@daavoo
Copy link
Copy Markdown
Contributor Author

daavoo commented May 2, 2023

@dberenbaum Thanks for the synthesis, makes sense.

My thoughts:

  • 1

Makes sense as a non-breaking change for the current --pull flag.

  • 2

Should we enable that behavior as part of --pull or add a new flag for it?

It changes existing assumptions of reproduce, as missing dependencies are currently considered a change and will cause repro to fail.
This is why these P.R. adds some continue statements in the reproduce_stages loop.

  • 3

I don't think it's as commonly needed,

To me, it is actually what would add value to the current run cache, but I am ok with cutting it off or postponing it

daavoo added 4 commits May 2, 2023 15:35
After #8738 , run cache is not saved for a stage if any output has `cache: false`.
It is a redundant check and just prevents using `StageCache._load` in cases where a dependency doesn't exists locally.
@daavoo daavoo force-pushed the 4742-make-that-dvc-repro-pull-pulls-all-missing-files branch from 77f8209 to b754444 Compare May 2, 2023 13:38
@dberenbaum
Copy link
Copy Markdown
Contributor

Makes sense as a non-breaking change for the current --pull flag.

Agreed. WDYT about doing that and merging as one PR?

Should we enable that behavior as part of --pull or add a new flag for it?

Yes, makes sense as its own flag, at least for now. Maybe --allow-missing (like we have for dvc pull)?

  • If the stage is unchanged other than missing/deleted, --allow-missing would skip it.
  • If the stage has missing/deleted files and other changes, --allow-missing would still fail.
  • To automatically skip unchanged stages and pull for changed stages, use --allow-missing --pull.

Does it make sense? Do you think behavior is more clear that way?

To me, it is actually what would add value to the current run cache, but I am ok with cutting it off or postponing it

Sure, it's useful and powerful, just maybe not as obvious. I think it makes sense as part of --allow-missing above (if you don't want that behavior, you could do --allow-missing --no-run-cache), so including it is about how much additional effort it would take.

@daavoo daavoo closed this May 3, 2023
daavoo added a commit that referenced this pull request May 10, 2023
daavoo added a commit that referenced this pull request May 10, 2023
daavoo added a commit that referenced this pull request May 10, 2023
daavoo added a commit that referenced this pull request May 12, 2023
daavoo added a commit that referenced this pull request May 12, 2023
daavoo added a commit that referenced this pull request May 16, 2023
daavoo added a commit that referenced this pull request May 16, 2023
daavoo added a commit that referenced this pull request May 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A: pipelines Related to the pipelines feature feature is a feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make that dvc repro --pull pulls all missing files.

3 participants