(I haven't found jonilaserson's discuss forum question as a feature request in this repo yet, but since I have the same need, I'm copying it verbatim below.)
Say I have a million files in the directory ./data/pre.
I have a Python script process_dir.py which goes over each file in ./data/pre, processes it, and creates a file of the same name in a directory ./data/post (if such a file already exists, it skips processing it).
I defined a pipeline:
dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py
Now let's say I remove one file from data/pre.
When I run dvc repro, it will still process all 999,999 remaining files again, because it removes the entire content of the ./data/post directory before running the process stage. Is there any way to manage the pipeline so that process_dir.py will not process the same file twice?
The only way I could think of is creating a process_file.py which handles a single file, and executing a million commands like this:
dvc run -n process_1 -d process_file.py -d data/pre/1.txt -o data/post/1.txt python process_file.py 1.txt
dvc run -n process_2 -d process_file.py -d data/pre/2.txt -o data/post/2.txt python process_file.py 2.txt
…
dvc run -n process_1000000 -d process_file.py -d data/pre/1000000.txt -o data/post/1000000.txt python process_file.py 1000000.txt
Is there a more elegant way?
End of verbatim quote
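For concreteness, here is a minimal sketch (my own, not the original poster's code) of what a process_dir.py like the one described above could look like; the actual transformation is a placeholder, and the skip-if-exists check is exactly the part that dvc repro defeats by clearing data/post before the stage runs:

from pathlib import Path

SRC = Path("data/pre")
DST = Path("data/post")

def process(text: str) -> str:
    # Placeholder for the real per-file transformation.
    return text.upper()

def main() -> None:
    DST.mkdir(parents=True, exist_ok=True)
    for src in sorted(SRC.iterdir()):
        dst = DST / src.name
        if dst.exists():
            # Skip files that already have an output -- this is the
            # incremental behavior the pipeline cannot take advantage of.
            continue
        dst.write_text(process(src.read_text()))

if __name__ == "__main__":
    main()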
I was thinking that the dvc run --outs-persist flag might be helpful as a workaround, but it only helps when new files are added (it keeps the old outputs around, so those files aren't processed again). It doesn't help in the scenario described above, where input files are deleted, and because the script skips files that already have outputs, changes to process_dir.py won't actually be reflected in the existing outputs either...
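For reference, that workaround is just the pipeline definition from the quote with -o swapped for --outs-persist, which tells DVC to leave data/post in place between runs instead of removing it:

dvc run -n process -d process_dir.py -d data/pre --outs-persist data/post python process_dir.py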
Would a flag that treats each file in a directory as an independent input, and then creates these mini-pipelines under the hood, make sense?
Something like dvc run -d process_dir.py --deps_independent data/pre --outs_independent data/post python process_file.py?
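Until something like that exists, the per-file stages from the quote can at least be generated in a loop instead of being written out by hand. A sketch (it only automates the typing; the resulting million stages would still bloat the pipeline file, which is what a built-in flag could avoid):

import subprocess
from pathlib import Path

# Assumes the layout from the quote: data/pre/<name>.txt is processed
# into data/post/<name>.txt by a per-file script process_file.py.
for i, src in enumerate(sorted(Path("data/pre").iterdir()), start=1):
    dst = Path("data/post") / src.name
    subprocess.run(
        ["dvc", "run", "-n", f"process_{i}",
         "-d", "process_file.py", "-d", str(src),
         "-o", str(dst),
         "python", "process_file.py", src.name],
        check=True,
    )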