From e8aa34a11a43a32863e4a97e6730df3cd873181f Mon Sep 17 00:00:00 2001 From: sarthakforwet Date: Sat, 27 Jun 2020 12:10:06 +0530 Subject: [PATCH 1/8] Changed repro.md such that it now do not use Dvcfile as a default stage --- content/docs/command-reference/repro.md | 56 +++++++++++-------------- 1 file changed, 24 insertions(+), 32 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index ea924c7a82..6bc7ade554 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -13,7 +13,7 @@ usage: dvc repro [-h] [-q | -v] [-f] [-s] [-c ] [-m] [--dry] [-i] [--no-commit] [--downstream] [targets [targets ...]] positional arguments: - targets Stage or .dvc file to reproduce. 'Dvcfile' by default. + targets Stage to reproduce. ``` ## Description @@ -24,6 +24,8 @@ the dependency graph (a by the [stage files](/doc/command-reference/run) (DVC-files with dependencies) that are found in the project. The commands defined in these stages can then be executed in the correct order, reproducing pipeline results. +`dvc repro` relies on the DAG definition that it reads from `dvc.yaml`, and uses +`dvc.lock` to determine what exactly needs to be run. > Pipeline stages are typically defined using the `dvc run` command, while > initial data dependencies can be registered by the `dvc add` command. @@ -40,9 +42,6 @@ There's a few ways to restrict the stages that will be regenerated by this command: by specifying stage file `targets`, or by using the `--single-item`, `--cwd`, or other options. -If specific [DVC-files](/doc/user-guide/dvc-files-and-directories) (`targets`) -are omitted, `Dvcfile` will be assumed. - `dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data files, intermediate or final results. @@ -101,8 +100,7 @@ only execute the final stage. (non-recursively) if multiple stage files are given as `targets`. - `-c `, `--cwd ` - directory within the project to reproduce from. - If no `targets` are given, it attempts to use `Dvcfile` in the specified - directory. Instead of using `--cwd`, one can alternately specify a target in a + Instead of using `--cwd`, one can alternately specify a target in a subdirectory as `path/to/target.dvc`. This option can be useful for example with subdirectories containing a separate pipeline that can either be reproduced as part of the pipeline in the parent directory, or as an @@ -169,7 +167,7 @@ only execute the final stage. ## Examples For simplicity, let's build a pipeline defined below. (If you want get your -hands-on something more real, see this shot +hands-on something more real, see this short [pipeline tutorial](/doc/tutorials/pipelines)). It takes this `text.txt` file: ``` @@ -184,17 +182,15 @@ best And runs a few simple transformations to filter and count numbers: ```dvc -$ dvc run -f filter.dvc -d text.txt -o numbers.txt \ +$ dvc run -n filter -d text.txt -o numbers.txt \ "cat text.txt | egrep '[0-9]+' > numbers.txt" -$ dvc run -f Dvcfile -d numbers.txt -d process.py -M count.txt \ +$ dvc run -n count -d numbers.txt -d process.py -M count.txt \ "python process.py numbers.txt > count.txt" ``` -> Note that using `-f Dvcfile` with `dvc run` above is optional, the stage file -> name would otherwise default to `count.txt.dvc`. We use `Dvcfile` in this -> example because that's the default stage file name `dvc repro` will read -> without having to provide any `targets`. +> Note that a stage name is required when executing `dvc run`. It can be +> specified with `-n` (`--name`) option as we did above. Where `process.py` is a script that, for simplicity, just prints the number of lines: @@ -213,23 +209,23 @@ The result of executing these `dvc run` commands should look like this: ```dvc $ tree . -├── Dvcfile <---- second stage with a default DVC name ├── count.txt <---- result: "2" -├── filter.dvc <---- first stage +├── dvc.lock <---- file to record pipeline state +├── dvc.yaml <---- file containing list of stages. ├── numbers.txt <---- intermediate result of the first stage ├── process.py <---- code that implements data transformation └── text.txt <---- text file to process ``` -You may want to check the contents of `Dvcfile` and `count.txt` for later +You may want to check the contents of `dvc.lock` and `count.txt` for later reference. -Ok, now, let's run the `dvc repro` command (remember, by default it reproduces -outputs tracked in `Dvcfile`, in this case `count.txt`): +Ok, now, let's run the `dvc repro` command: ```dvc $ dvc repro -WARNING: assuming default target 'Dvcfile'. +Stage 'filter' didn't change, skipping +Stage 'count' didn't change, skipping Data and pipelines are up to date. ``` @@ -247,16 +243,13 @@ If we now run `dvc repro`, we should see this: ```dvc $ dvc repro -WARNING: assuming default target 'Dvcfile'. -Stage 'Dvcfile' changed. -Reproducing 'Dvcfile' -Running command: - python process.py numbers.txt > count.txt -Output 'count.txt' doesn't use cache. Skipping saving. -Saving information to 'Dvcfile'. +Stage 'filter' didn't change, skipping +Running stage 'count' with command: + python3 process.py numbers.txt > count.txt +Updating lock file 'dvc.lock' ``` -You can now check that `Dvcfile` and `count.txt` have been updated with the new +You can now check that `dvc.lock` and `count.txt` have been updated with the new information and updated dependency/output file hash values, and a new result, respectively. @@ -277,14 +270,13 @@ Now, using the `--downstream` option results in the following output: ```dvc $ dvc repro --downstream -WARNING: assuming default target 'Dvcfile'. Data and pipelines are up to date. ``` The reason being that the `text.txt` file is a dependency in the target -[DVC-file](/doc/user-guide/dvc-files-and-directories) (`Dvcfile` by default). -This `Dvcfile` stage is dependent on `filter.dvc`, which happens first in this -pipeline (shown in the following figure): +[DVC-file](/doc/user-guide/dvc-files-and-directories). This `count` stage is +dependent on `filter` stage, which happens first in this pipeline (shown in the +following figure): ```dvc $ dvc dag @@ -296,6 +288,6 @@ $ dvc dag * * .---------. - | Dvcfile | + | count | `---------' ``` From 25ebe1db212ced3dcf2e6a3af55477c17a26e27b Mon Sep 17 00:00:00 2001 From: Sarthak khandelwal Date: Thu, 2 Jul 2020 03:39:13 +0530 Subject: [PATCH 2/8] added corrections to repro.md Co-authored-by: Jorge Orpinel --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 6bc7ade554..ffe86de49a 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -250,7 +250,7 @@ Updating lock file 'dvc.lock' ``` You can now check that `dvc.lock` and `count.txt` have been updated with the new -information and updated dependency/output file hash values, and a new result, +information: updated dependency/output file hash values, and a new result, respectively. ## Example: Downstream From dd6323beb2b4aa8e5b47c87f675e22d64d2a4c09 Mon Sep 17 00:00:00 2001 From: Sarthak khandelwal Date: Thu, 2 Jul 2020 03:41:23 +0530 Subject: [PATCH 3/8] updated dependency status of "text.txt" file. Co-authored-by: Jorge Orpinel --- content/docs/command-reference/repro.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index ffe86de49a..e289b86252 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -273,7 +273,8 @@ $ dvc repro --downstream Data and pipelines are up to date. ``` -The reason being that the `text.txt` file is a dependency in the target +The reason being that the `text.txt` file is a dependency in the last +stage of the pipeline (used by default by `dvc repro`), [DVC-file](/doc/user-guide/dvc-files-and-directories). This `count` stage is dependent on `filter` stage, which happens first in this pipeline (shown in the following figure): From 702711191e7d28081792cefa5e9b78b9633d87a2 Mon Sep 17 00:00:00 2001 From: Sarthak khandelwal Date: Thu, 2 Jul 2020 03:44:53 +0530 Subject: [PATCH 4/8] removed reference to DVC-file from downstream example. Co-authored-by: Jorge Orpinel --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index e289b86252..077d5c7e82 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -275,7 +275,7 @@ Data and pipelines are up to date. The reason being that the `text.txt` file is a dependency in the last stage of the pipeline (used by default by `dvc repro`), -[DVC-file](/doc/user-guide/dvc-files-and-directories). This `count` stage is +This last `count` stage is dependent on `filter` stage, which happens first in this pipeline (shown in the following figure): From 4b63e7fee029f2c70e9929ea7f0c1254b6b968b8 Mon Sep 17 00:00:00 2001 From: Sarthak khandelwal Date: Thu, 2 Jul 2020 03:48:17 +0530 Subject: [PATCH 5/8] updated the styling --- content/docs/command-reference/repro.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 077d5c7e82..d2a34baea5 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -273,12 +273,10 @@ $ dvc repro --downstream Data and pipelines are up to date. ``` -The reason being that the `text.txt` file is a dependency in the last -stage of the pipeline (used by default by `dvc repro`), -This last `count` stage is +The reason being that the `text.txt` file is a dependency in the last stage of +the pipeline (used by default by `dvc repro`), This last `count` stage is dependent on `filter` stage, which happens first in this pipeline (shown in the following figure): - ```dvc $ dvc dag From 70ad7105c648f798f40c9820c8afa5b18acae575 Mon Sep 17 00:00:00 2001 From: sarthakforwet Date: Thu, 2 Jul 2020 03:59:33 +0530 Subject: [PATCH 6/8] corrected styling --- content/docs/command-reference/repro.md | 1 + 1 file changed, 1 insertion(+) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index d2a34baea5..fc78250d05 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -277,6 +277,7 @@ The reason being that the `text.txt` file is a dependency in the last stage of the pipeline (used by default by `dvc repro`), This last `count` stage is dependent on `filter` stage, which happens first in this pipeline (shown in the following figure): + ```dvc $ dvc dag From 1c080259f0701d3e781a2d1175732f7452c1fd9b Mon Sep 17 00:00:00 2001 From: sarthakforwet Date: Thu, 2 Jul 2020 04:30:15 +0530 Subject: [PATCH 7/8] Temporarily removed out of PR scope stuff --- content/docs/command-reference/repro.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index fc78250d05..0064970ba6 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -24,8 +24,6 @@ the dependency graph (a by the [stage files](/doc/command-reference/run) (DVC-files with dependencies) that are found in the project. The commands defined in these stages can then be executed in the correct order, reproducing pipeline results. -`dvc repro` relies on the DAG definition that it reads from `dvc.yaml`, and uses -`dvc.lock` to determine what exactly needs to be run. > Pipeline stages are typically defined using the `dvc run` command, while > initial data dependencies can be registered by the `dvc add` command. From 64897ea432767cd5ea0840c9a0d2387c42a0e2cd Mon Sep 17 00:00:00 2001 From: sarthakforwet Date: Sat, 4 Jul 2020 14:25:44 +0530 Subject: [PATCH 8/8] Updated target information and removed unnecessary information --- content/docs/command-reference/repro.md | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 0064970ba6..8245f80434 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -13,7 +13,7 @@ usage: dvc repro [-h] [-q | -v] [-f] [-s] [-c ] [-m] [--dry] [-i] [--no-commit] [--downstream] [targets [targets ...]] positional arguments: - targets Stage to reproduce. + targets Stage or .dvc file to reproduce ``` ## Description @@ -187,9 +187,6 @@ $ dvc run -n count -d numbers.txt -d process.py -M count.txt \ "python process.py numbers.txt > count.txt" ``` -> Note that a stage name is required when executing `dvc run`. It can be -> specified with `-n` (`--name`) option as we did above. - Where `process.py` is a script that, for simplicity, just prints the number of lines: