With the introduction of the new multi-stage pipeline, we need a way to define variables in the pipeline. For example, the intermediate file name cleansed.csv is used by two stages in the following pipeline and should be defined as a variable:
    stages:
      process:
        cmd: "./process.bin --input data --output cleansed.csv"
        deps:
          - path: data/
        outs:
          - path: cleansed.csv
      train:
        cmd: "python train.py"
        deps:
          - path: cleansed.csv
          - path: train.py
          - path: params.yaml
            params:
              lr: 0.042
              layers: 8
              classes: 4
        outs:
          - path: model.pkl
          - path: log.csv
            cache: true
          - path: summary.json
We need to solve two problems here:
- Define a variable in one place and reuse it from multiple places/stages.
- Often users prefer to read file names from config files (like in the
train stage), not from the command line (like in the process stage).
We can solve both problems with a single abstraction, a parameters file variable:
    stages:
      process:
        cmd: ./process.bin
        outs:
          - path: "params.yaml:cleansed_file_name"
        ....
      train:
        cmd: "python train.py"
        deps:
          - path: "params.yaml:cleansed_file_name"
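To make the intended semantics concrete, here is a minimal Python sketch of how a `file:key` reference such as `params.yaml:cleansed_file_name` could be resolved to a real path. It is an illustration only, not DVC's implementation: the function names `resolve_path` and `resolve_stage_paths` are hypothetical, and the `PARAMS_FILES` dict stands in for the already-parsed contents of params.yaml so that the example needs no YAML library.

```python
from typing import Any, Dict, List, Mapping

# Stand-in for the parsed contents of params.yaml (assumed values for illustration).
PARAMS_FILES = {
    "params.yaml": {"cleansed_file_name": "cleansed.csv", "lr": 0.042},
}

# The pipeline from the proposal, reduced to the fields this sketch needs.
STAGES = {
    "process": {
        "cmd": "./process.bin",
        "outs": [{"path": "params.yaml:cleansed_file_name"}],
    },
    "train": {
        "cmd": "python train.py",
        "deps": [{"path": "params.yaml:cleansed_file_name"}, {"path": "train.py"}],
    },
}


def resolve_path(path: str, params_files: Mapping[str, Mapping[str, Any]]) -> str:
    """Return the concrete path for a 'file:key' reference, or the path unchanged."""
    file_name, sep, key = path.partition(":")
    # Treat the value as a reference only if the part before ':' is a known params file.
    if not sep or file_name not in params_files:
        return path
    if key not in params_files[file_name]:
        raise KeyError(f"{key} is not defined in {file_name}")
    return str(params_files[file_name][key])


def resolve_stage_paths(
    stages: Mapping[str, Mapping[str, Any]],
    params_files: Mapping[str, Mapping[str, Any]],
) -> Dict[str, Dict[str, List[str]]]:
    """Resolve every deps/outs path of every stage to a concrete file name."""
    resolved: Dict[str, Dict[str, List[str]]] = {}
    for name, stage in stages.items():
        resolved[name] = {
            section: [
                resolve_path(entry["path"], params_files)
                for entry in stage.get(section, [])
            ]
            for section in ("deps", "outs")
        }
    return resolved


RESOLVED = resolve_stage_paths(STAGES, PARAMS_FILES)
print(RESOLVED)
```

With this reading, the output of process and the dependency of train resolve to the same file, so the file name is defined once, and the scripts themselves can read it from the same params file.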
This feature is useful in the current DVC design as well. A script can read file names from the params file while the dependency and output are still declared properly, for example:

    dvc run -d params.yaml:input_file -o params.yaml:model.pkl