I see a lot of value in using DVC during the development phase of a DS project, especially having the ability to reproduce outputs only if dependencies have changed.
One of the problems we are trying to solve is how we can move a data scientist's code back and forth between development and production. Ideally, their local development experience would translate easily into production. I've created a toy project with DVC to see if we could use it for developing a multi-step pipeline which does data transfers between each step.
However, one thing is unclear when scheduling this same pipeline in Kubeflow/Airflow. Let's assume my pipeline is as follows:
- Get Data
- Transform Data
- Train Model
- Evaluate Model
If I do all of my local development (`dvc run`, `dvc repro`), everything works. But in a production setting I will have unique inputs to my pipeline: for example, the datetime stamp or other input variables will change. I can integrate this with DVC by making a `parameters` file a dependency of the Get Data step.
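For concreteness, the toy pipeline above could be wired up with something like the following; all file and script names here are hypothetical, not from my actual project:

```shell
# parameters is a dependency of the first stage, so changing it
# (e.g. a new datetime stamp) invalidates step 1 and everything downstream.
dvc run -f get_data.dvc -d parameters -d get_data.py -o data/raw \
    python get_data.py
dvc run -f transform.dvc -d data/raw -d transform.py -o data/clean \
    python transform.py
dvc run -f train.dvc -d data/clean -d train.py -o model.pkl \
    python train.py
dvc run -f evaluate.dvc -d model.pkl -d evaluate.py -M metrics.json \
    python evaluate.py
```

Each `dvc run` writes a `*.dvc` stage file recording the hashes of its dependencies and outputs, and `dvc repro evaluate.dvc` walks the graph and recomputes only stages whose dependency hashes changed.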
So when I run the pipeline on Airflow on different days, the dependencies for step 1 will differ, which means it will get recomputed.
The problem I have is that all of the steps in the graph have their hashes hardcoded from the local development environment. The production runs execute in an isolated environment and never commit the updated `*.dvc` files back to the project repo, so even if I rerun the whole pipeline multiple times with the same input parameters, every run starts from the stale hashes and everything is recomputed from scratch. So DVC loses its value when wrapped in a scheduler.
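To make the failure mode concrete, here is a minimal Python sketch of the rerun decision; this is my own toy model of the idea, not DVC internals:

```python
import hashlib


def md5(text):
    """Hash a dependency's content, as DVC does for files."""
    return hashlib.md5(text.encode()).hexdigest()


def needs_rerun(stage, current_deps):
    """A stage reruns iff any dependency's hash differs from the recorded one."""
    return any(stage["deps"].get(name) != md5(content)
               for name, content in current_deps.items())


# Toy model of a *.dvc stage file committed from the dev environment:
# it records each dependency's hash as of the last *development* run.
stage = {"deps": {"parameters": md5("date=2020-01-01")}}

# Day 1 in production: parameters differ from dev, so the stage reruns.
assert needs_rerun(stage, {"parameters": "date=2020-01-02"})

# After the run, DVC updates the hash in the local *.dvc file...
stage_after_run = {"deps": {"parameters": md5("date=2020-01-02")}}
assert not needs_rerun(stage_after_run, {"parameters": "date=2020-01-02"})

# ...but if that file is never committed back, the next scheduled run
# starts again from the stale repo copy and recomputes everything,
# even though the parameters are identical.
assert needs_rerun(stage, {"parameters": "date=2020-01-02"})
```

The last assertion is exactly the scheduler problem: without persisting the updated stage files (or the cache) between runs, identical inputs still look "changed" every time.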
Am I missing something, or is DVC primarily useful in local development only?