From 34217b8e2b7e11d178053b844a3b5a2a1a4105d6 Mon Sep 17 00:00:00 2001
From: sarthakforwet
Date: Fri, 24 Jul 2020 23:07:27 +0530
Subject: [PATCH 01/17] cmd: rewrite Downstream example and added info for sequential execution of stages

---
 content/docs/command-reference/repro.md | 65 +++++++++++++++----------
 1 file changed, 38 insertions(+), 27 deletions(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 9314a82d35..7422d14c6a 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -54,8 +54,7 @@ other options.
 
 It saves all the data files, intermediate or final results into the DVC cache
 (unless the `--no-commit` option is used), and updates the hash
-values of changed dependencies and outputs in the DVC files (`dvc.lock` and
-`.dvc`).
+values of changed dependencies and outputs in the `dvc.lock` and `.dvc` files.
 
 ### Parallel stage execution
 
@@ -83,11 +82,12 @@ $ dvc dag
 ```
 
 This pipeline consists of two parallel branches (`A` and `B`), and the final
-"result" stage, where the branches merge. To reproduce both branches at the same
-time, you could run `dvc repro A2` and `dvc repro B2` at the same time (e.g. in
-separate terminals). After both finish successfully, you can then run
-`dvc repro train`: DVC will know that both branches are already up-to-date and
-only execute the final stage.
+"result" stage, where the branches merge. If you run `dvc repro` at this point,
+it would reproduce the complete pipeline with all stages executing sequentially.
+To reproduce both branches at the same time, you could run `dvc repro A2` and
+`dvc repro B2` at the same time (e.g. in separate terminals). After both finish
+successfully, you can then run `dvc repro train`: DVC will know that both
+branches are already up-to-date and only execute the final stage.
 
 ## Options
 
@@ -151,7 +151,8 @@ only execute the final stage.
   each execution, meaning the cache cannot be trusted for such stages.
 
 - `--downstream` - only execute the stages after the given `targets` in their
-  corresponding pipelines, including the target stages themselves.
+  corresponding pipelines, including the target stages themselves. This option
+  doesn't have any effect if no `targets` are provided.
 
 - `-h`, `--help` - prints the usage/help message, and exit.
 
@@ -262,31 +263,41 @@ The answer to universe is 42
 - The Hitchhiker's Guide to the Galaxy
 ```
 
-Now, using the `--downstream` option results in the following output:
+And add a new stage to the pipeline:
 
 ```dvc
-$ dvc repro --downstream
+$ dvc run -n final -d count.txt -o alphabet.txt \
+  "cat count.txt | egrep -o '[a-zA-Z]+' > alphabet.txt"
+```
+
+Now, using the `--downstream` option with `count` as a target stage results in
+the following output:
+
+```dvc
+$ dvc repro --downstream count
 Data and pipelines are up to date.
 ```
 
-The reason being that the `text.txt` file is not a dependency in the last stage
-of the pipeline, used as the default target by `dvc repro`. `text.txt` is a
-dependency of the `filter` stage, which happens earlier (shown in the figure
-below), so it's skipped given the `--downstream` option.
+The reason being that the `text.txt` file is a dependency in the `filter` stage
+of the pipeline which happens before the `count` stage (shown in the following
+figure) and hence did not get updated.
 
 ```dvc
 $ dvc dag
 
-  .------------.
-  |   filter   |
-  `------------'
-         *
-         *
-         *
-    .---------.
-    |  count  |
-    `---------'
-```
-> Note that using `dvc repro --downstream` without a target will always have a
-> similar effect, where all previous stages are ignored — only if the last stage
-> is changed will it have any effect.
+  +--------+
+  | filter |
+  +--------+
+      *
+      *
+      *
+  +-------+
+  | count |
+  +-------+
+      *
+      *
+      *
+  +-------+
+  | final |
+  +-------+
+```

From a071b7d83c95cbd011ae5e4bb6cc19bc7799c998 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 24 Jul 2020 15:15:09 -0500
Subject: [PATCH 02/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 7422d14c6a..fd1c0902c7 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -82,7 +82,7 @@ $ dvc dag
 ```
 
 This pipeline consists of two parallel branches (`A` and `B`), and the final
-"result" stage, where the branches merge. If you run `dvc repro` at this point,
+`train` stage, where the branches merge. If you run `dvc repro` at this point,
 it would reproduce the complete pipeline with all stages executing sequentially.
 To reproduce both branches at the same time, you could run `dvc repro A2` and
 `dvc repro B2` at the same time (e.g. in separate terminals). After both finish

From edff33eb8ab738a20c862d027218f4801f2a1809 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 24 Jul 2020 15:16:34 -0500
Subject: [PATCH 03/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index fd1c0902c7..97699311ef 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -84,7 +84,7 @@ $ dvc dag
 This pipeline consists of two parallel branches (`A` and `B`), and the final
 `train` stage, where the branches merge. If you run `dvc repro` at this point,
 it would reproduce the complete pipeline with all stages executing sequentially.
-To reproduce both branches at the same time, you could run `dvc repro A2` and
+To reproduce both branches simultaneously, you could run `dvc repro A2` and
 `dvc repro B2` at the same time (e.g. in separate terminals). After both finish
 successfully, you can then run `dvc repro train`: DVC will know that both
 branches are already up-to-date and only execute the final stage.

From 71a5088371ea48ce19fb9de0db8004a9d7fb7284 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 24 Jul 2020 15:19:55 -0500
Subject: [PATCH 04/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 97699311ef..1da6e31047 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -152,7 +152,7 @@ branches are already up-to-date and only execute the final stage.
 
 - `--downstream` - only execute the stages after the given `targets` in their
   corresponding pipelines, including the target stages themselves. This option
-  doesn't have any effect if no `targets` are provided.
+  has no effect if no `targets` are provided.
 
 - `-h`, `--help` - prints the usage/help message, and exit.
 

From 163ed1982c41a99b0f1756f02a5a9017bfab20db Mon Sep 17 00:00:00 2001
From: sarthakforwet
Date: Sat, 25 Jul 2020 12:41:34 +0530
Subject: [PATCH 05/17] cmd: Updated Downstream example

---
 content/docs/command-reference/repro.md | 37 +++++++++++++------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 1da6e31047..abd8351af3 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -83,11 +83,11 @@ $ dvc dag
 
 This pipeline consists of two parallel branches (`A` and `B`), and the final
 `train` stage, where the branches merge. If you run `dvc repro` at this point,
-it would reproduce the complete pipeline with all stages executing sequentially.
-To reproduce both branches simultaneously, you could run `dvc repro A2` and
-`dvc repro B2` at the same time (e.g. in separate terminals). After both finish
-successfully, you can then run `dvc repro train`: DVC will know that both
-branches are already up-to-date and only execute the final stage.
+it would reproduce each branch sequentially before train. To reproduce both
+branches simultaneously, you could run `dvc repro A2` and `dvc repro B2` at the
+same time (e.g. in separate terminals). After both finish successfully, you can
+then run `dvc repro train`: DVC will know that both branches are already
+up-to-date and only execute the final stage.
 
 ## Options
 
@@ -152,7 +152,7 @@ branches are already up-to-date and only execute the final stage.
 
 - `--downstream` - only execute the stages after the given `targets` in their
   corresponding pipelines, including the target stages themselves. This option
-  has no effect if no `targets` are provided.
+  has no effect if `targets` are not provided.
 
 - `-h`, `--help` - prints the usage/help message, and exit.
 
@@ -263,19 +263,26 @@ The answer to universe is 42
 - The Hitchhiker's Guide to the Galaxy
 ```
 
-And add a new stage to the pipeline:
+And update the `process.py` file to count the number of digits.
 
-```dvc
-$ dvc run -n final -d count.txt -o alphabet.txt \
-  "cat count.txt | egrep -o '[a-zA-Z]+' > alphabet.txt"
+```python
+import sys
+num_digits = 0
+with open(sys.argv[1], 'r') as f:
+    for number in f:
+        num_digits += len(number) - 1
+print("Number of digits:",end=" ")
+print(num_digits)
 ```
 
-Now, using the `--downstream` option with `count` as a target stage results in
+Now, using the `--downstream` option with `count` as a target stage, results in
 the following output:
 
 ```dvc
 $ dvc repro --downstream count
-Data and pipelines are up to date.
+Running stage 'count' with command:
+        python3 process.py numbers.txt > count.txt
+Updating lock file 'dvc.lock'
 ```
 
 The reason being that the `text.txt` file is a dependency in the `filter` stage
@@ -293,11 +300,5 @@ $ dvc dag
       *
   +-------+
   | count |
-  +-------+
-      *
-      *
-      *
-  +-------+
-  | final |
   +-------+
 ```

From cf873a480b7f0aceedee02c6cf01f75179101ce4 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Mon, 27 Jul 2020 17:38:03 -0500
Subject: [PATCH 06/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index abd8351af3..f1657c63da 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -83,7 +83,7 @@ $ dvc dag
 
 This pipeline consists of two parallel branches (`A` and `B`), and the final
 `train` stage, where the branches merge. If you run `dvc repro` at this point,
-it would reproduce each branch sequentially before train. To reproduce both
+it would reproduce each branch sequentially before `train`. To reproduce both
 branches simultaneously, you could run `dvc repro A2` and `dvc repro B2` at the
 same time (e.g. in separate terminals). After both finish successfully, you can
 then run `dvc repro train`: DVC will know that both branches are already

From ca04fb0becba7dc250c2f292f32f5ebc7d4172e9 Mon Sep 17 00:00:00 2001
From: sarthakforwet
Date: Tue, 28 Jul 2020 12:00:03 +0530
Subject: [PATCH 07/17] repro: Updated Downstream example

---
 content/docs/command-reference/repro.md | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index abd8351af3..a251f1955c 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -240,7 +240,7 @@ If we now run `dvc repro`, we should see this:
 $ dvc repro
 Stage 'filter' didn't change, skipping
 Running stage 'count' with command:
-        python3 process.py numbers.txt > count.txt
+        python process.py numbers.txt > count.txt
 Updating lock file 'dvc.lock'
 ```
 
@@ -263,16 +263,12 @@ The answer to universe is 42
 - The Hitchhiker's Guide to the Galaxy
 ```
 
-And update the `process.py` file to count the number of digits.
+Let's say we want to print the filename also in the description and so we update
+the `process.py` as:
 
 ```python
-import sys
-num_digits = 0
-with open(sys.argv[1], 'r') as f:
-    for number in f:
-        num_digits += len(number) - 1
-print("Number of digits:",end=" ")
-print(num_digits)
+print('Number of lines in %s:'%(sys.argv[1]))
+print(num_lines)
 ```
 
 Now, using the `--downstream` option with `count` as a target stage, results in
@@ -281,13 +277,14 @@ the following output:
 ```dvc
 $ dvc repro --downstream count
 Running stage 'count' with command:
-        python3 process.py numbers.txt > count.txt
+        python process.py numbers.txt > count.txt
 Updating lock file 'dvc.lock'
 ```
 
-The reason being that the `text.txt` file is a dependency in the `filter` stage
-of the pipeline which happens before the `count` stage (shown in the following
-figure) and hence did not get updated.
+The change in the `text.txt` file is ignored as it is a dependency in the
+`filter` stage which did not get updated in the above command. This is because
+the `filter` stage happens before the `count` stage in the pipeline (shown in
+the following figure).
 
 ```dvc
 $ dvc dag

From 30ce7bb7967432391a12685aa23cee820906c5d0 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Wed, 29 Jul 2020 02:13:25 -0500
Subject: [PATCH 08/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index b5e69ef132..c0388751e6 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -299,3 +299,5 @@ $ dvc dag
   | count |
   +-------+
 ```
+
+> Refer to `dvc dag` for more details on that command.

From 66e0603c960af396429e5510f4f360e4dffccedc Mon Sep 17 00:00:00 2001
From: sarthakforwet
Date: Thu, 30 Jul 2020 00:31:02 +0530
Subject: [PATCH 09/17] cmd: updated last para for the description of --downstream and improved formatting

---
 content/docs/command-reference/repro.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index c0388751e6..22b03c6878 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -267,7 +267,7 @@ Let's say we want to print the filename also in the description and so we update
 the `process.py` as:
 
 ```python
-print('Number of lines in %s:'%(sys.argv[1]))
+print(f'Number of lines in {sys.argv[1]}:')
 print(num_lines)
 ```
 
@@ -281,10 +281,11 @@ Running stage 'count' with command:
 Updating lock file 'dvc.lock'
 ```
 
-The change in the `text.txt` file is ignored as it is a dependency in the
-`filter` stage which did not get updated in the above command. This is because
-the `filter` stage happens before the `count` stage in the pipeline (shown in
-the following figure).
+The change in the `text.txt` file is ignored because that file is a dependency
+in the `filter` stage, which did not get updated in the above command. This is
+because `filter` happens before `count` in the pipeline (shown below) and the
+`--downstream` option only execute the stages after a given target stage
+(`count` in this case), including the target stage itself.
 
 ```dvc
 $ dvc dag

From 8597f533b21a853ad64bedd6aa266344b8546b1f Mon Sep 17 00:00:00 2001
From: sarthakforwet
Date: Fri, 31 Jul 2020 01:02:25 +0530
Subject: [PATCH 10/17] repro.md: updated Downstream example

---
 content/docs/command-reference/repro.md | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 22b03c6878..a273651d0d 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -271,8 +271,8 @@ print(f'Number of lines in {sys.argv[1]}:')
 print(num_lines)
 ```
 
-Now, using the `--downstream` option with `count` as a target stage, results in
-the following output:
+Now, using the `--downstream` option with `dvc repro`, results in the execution
+of stages after the target stage (`count` in this case) in the pipeline.
 
 ```dvc
 $ dvc repro --downstream count
@@ -281,11 +281,9 @@ Running stage 'count' with command:
 Updating lock file 'dvc.lock'
 ```
 
-The change in the `text.txt` file is ignored because that file is a dependency
-in the `filter` stage, which did not get updated in the above command. This is
-because `filter` happens before `count` in the pipeline (shown below) and the
-`--downstream` option only execute the stages after a given target stage
-(`count` in this case), including the target stage itself.
+The change in `text.txt` is ignored because that file is a dependency in the
+`filter` stage, which did not get updated in the above command. This is because
+`filter` happens before `count` in the pipeline (shown below).
 
 ```dvc
 $ dvc dag

From 70b7d2a3a22a2ab94a5f6c550e6f97d115e44516 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 31 Jul 2020 15:05:23 -0500
Subject: [PATCH 11/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index a273651d0d..613a6fb048 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -271,8 +271,8 @@ print(f'Number of lines in {sys.argv[1]}:')
 print(num_lines)
 ```
 
-Now, using the `--downstream` option with `dvc repro`, results in the execution
-of stages after the target stage (`count` in this case) in the pipeline.
+Now, using the `--downstream` option with `dvc repro` results in the execution
+only of stages after the target (`count`).
 
 ```dvc
 $ dvc repro --downstream count

From 73499f2950e17fc1a04a7dc050b5cc043b089c56 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 31 Jul 2020 15:05:41 -0500
Subject: [PATCH 12/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 613a6fb048..c3292d4197 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -272,7 +272,7 @@ print(num_lines)
 ```
 
 Now, using the `--downstream` option with `dvc repro` results in the execution
-only of stages after the target (`count`).
+only of stages after the target (`count`):
 
 ```dvc
 $ dvc repro --downstream count

From e40402ca83e2b392328fd066bede2742894c45fc Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 31 Jul 2020 15:11:36 -0500
Subject: [PATCH 13/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index c3292d4197..5aa5a7e923 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -272,7 +272,7 @@ print(num_lines)
 ```
 
 Now, using the `--downstream` option with `dvc repro` results in the execution
-only of stages after the target (`count`):
+only of the target stage (and any following ones):
 
 ```dvc
 $ dvc repro --downstream count

From 1696951338434baf2fe91668521f087c7cacd071 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 31 Jul 2020 15:11:44 -0500
Subject: [PATCH 14/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 5aa5a7e923..77056c1d48 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -282,8 +282,8 @@ Updating lock file 'dvc.lock'
 ```
 
 The change in `text.txt` is ignored because that file is a dependency in the
-`filter` stage, which did not get updated in the above command. This is because
-`filter` happens before `count` in the pipeline (shown below).
+`filter` stage, which wasn't executed by the `dvc repro` above. This is because
+`filter` happens before the target (`count`) in the pipeline, as shown below:
 
 ```dvc
 $ dvc dag

From bfe6800d16804fb26f22747e39f0c949562d6c75 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 31 Jul 2020 15:12:49 -0500
Subject: [PATCH 15/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 77056c1d48..e6aeba6946 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -272,7 +272,7 @@ print(num_lines)
 ```
 
 Now, using the `--downstream` option with `dvc repro` results in the execution
-only of the target stage (and any following ones):
+only of the target stage, and following ones (none in these case):
 
 ```dvc
 $ dvc repro --downstream count

From e83bc5d4aad670ed41a8fe1ddc33250b36bc9368 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 31 Jul 2020 16:58:07 -0500
Subject: [PATCH 16/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index e6aeba6946..04b878fba6 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -272,7 +272,7 @@ print(num_lines)
 ```
 
 Now, using the `--downstream` option with `dvc repro` results in the execution
-only of the target stage, and following ones (none in these case):
+only of the target stage, and following ones (none in this case):
 
 ```dvc
 $ dvc repro --downstream count

From 012b72fe97b1719330745df36c3992becf104bd7 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 31 Jul 2020 16:59:22 -0500
Subject: [PATCH 17/17] Update content/docs/command-reference/repro.md

---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 04b878fba6..8d3d12dd0e 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -263,7 +263,7 @@ The answer to universe is 42
 - The Hitchhiker's Guide to the Galaxy
 ```
 
-Let's say we want to print the filename also in the description and so we update
+Let's say we also want to print the filename in the description, and so we update
 the `process.py` as:
 
 ```python