(I haven't found jonilaserson's discuss forum question as a feature request in this repo yet, but since I have the same need, I'm copying it verbatim below.)
Say I have a million files in the directory ./data/pre.
I have a Python script process_dir.py which goes over each file in ./data/pre, processes it, and creates a file of the same name in a directory ./data/post (if such a file already exists, it skips processing it).
I defined a pipeline:
dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py
Now let's say I remove one file from data/pre.
When I run dvc repro, it will still process all 999,999 remaining files again, because it removes the entire content of the ./data/post directory before running the process stage. Is there any way to manage the pipeline so that process_dir.py will not process the same file twice?
The only way I could think of is creating a process_file.py which handles a single file, and executing a million commands like this:
dvc run -n process_1 -d process_file.py -d data/pre/1.txt -o data/post/1.txt python process_file.py 1.txt
dvc run -n process_2 -d process_file.py -d data/pre/2.txt -o data/post/2.txt python process_file.py 2.txt
…
dvc run -n process_1000000 -d process_file.py -d data/pre/1000000.txt -o data/post/1000000.txt python process_file.py 1000000.txt
Is there a more elegant way?
End of verbatim quote
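For concreteness, here is a minimal sketch (my own, not the original poster's code) of what a process_dir.py like the one described above could look like; the actual transformation is a placeholder, and the skip-if-exists check is exactly the part that dvc repro defeats by clearing data/post before the stage runs:

from pathlib import Path

SRC = Path("data/pre")
DST = Path("data/post")

def process(text: str) -> str:
    # Placeholder for the real per-file transformation.
    return text.upper()

def main() -> None:
    DST.mkdir(parents=True, exist_ok=True)
    for src in sorted(SRC.iterdir()):
        dst = DST / src.name
        if dst.exists():
            # Skip files that already have an output -- this is the
            # incremental behavior the pipeline cannot take advantage of.
            continue
        dst.write_text(process(src.read_text()))

if __name__ == "__main__":
    main()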
I was thinking that the dvc run --outs-persist flag might be helpful as a workaround, but it only helps when new files are added (it keeps the old outputs around, so those files aren't processed again). It doesn't help in the scenario described above, where input files are deleted, and because the script skips files that already have outputs, changes to process_dir.py won't actually be reflected in the existing outputs either...
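For reference, that workaround is just the pipeline definition from the quote with -o swapped for --outs-persist, which tells DVC to leave data/post in place between runs instead of removing it:

dvc run -n process -d process_dir.py -d data/pre --outs-persist data/post python process_dir.py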
Would a flag that treats each file in a directory as an independent input, and then creates these mini-pipelines under the hood, make sense?
Something like dvc run -d process_dir.py --deps_independent data/pre --outs_independent data/post python process_file.py?
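Until something like that exists, the per-file stages from the quote can at least be generated in a loop instead of being written out by hand. A sketch (it only automates the typing; the resulting million stages would still bloat the pipeline file, which is what a built-in flag could avoid):

import subprocess
from pathlib import Path

# Assumes the layout from the quote: data/pre/<name>.txt is processed
# into data/post/<name>.txt by a per-file script process_file.py.
for i, src in enumerate(sorted(Path("data/pre").iterdir()), start=1):
    dst = Path("data/post") / src.name
    subprocess.run(
        ["dvc", "run", "-n", f"process_{i}",
         "-d", "process_file.py", "-d", str(src),
         "-o", str(dst),
         "python", "process_file.py", src.name],
        check=True,
    )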