azure: support external file dependencies#3852
Conversation
0640ed2 to
f505594
Compare
|
I'm not really sure why the tests are failing on Travis. Locally, they pass having followed the instructions at https://dvc.org/doc/user-guide/contributing/core#testing-remotes. |
There was a problem hiding this comment.
@steffansluis, Are you sure it's running locally?
AzureRemote remote parses path via regex:
https://github.com/iterative/dvc/blob/865d355554b27be5ea49640c14556c46f7906327/dvc/remote/azure.py#L22-L27
And, on TestLocalDependency, the path is of the following type which fails on regex parsing:
https://github.com/iterative/dvc/blob/f5055940a2f806ccdf8be0193115641effd69416/tests/unit/dependency/test_local.py#L14
Same thing is happening on TestAzureOutput. Maybe, change the path that's getting passed to AzureRemote for tests?
There was a problem hiding this comment.
(Having done source tests/remotes_env).
python -m tests tests/unit/output/test_azure.py tests/unit/dependency/test_azure.py shows (among other things) the following output:
[gw1] [ 25%] PASSED tests/unit/output/test_azure.py::TestLocalOutput::test_save_missing
[gw3] [ 50%] PASSED tests/unit/dependency/test_azure.py::TestLocalDependency::test_save_missing
[gw0] [ 75%] PASSED tests/unit/output/test_azure.py::TestAzureOutput::test_save_missing
[gw2] [100%] PASSED tests/unit/dependency/test_azure.py::TestAzureDependency::test_save_missing
The tests are basic copies of the tests for S3: https://github.com/iterative/dvc/blob/f5055940a2f806ccdf8be0193115641effd69416/tests/unit/dependency/test_s3.py#L5-L7
And looking at the tests for the HDFS output (which also uses regexes) this should be fine?:
https://github.com/iterative/dvc/blob/f5055940a2f806ccdf8be0193115641effd69416/tests/unit/dependency/test_hdfs.py#L5-L7
There was a problem hiding this comment.
Does it stay true for multipart objects?
There was a problem hiding this comment.
Also, does the user set it or is it automatically set by azure?
There was a problem hiding this comment.
We need very good tests for this to get it accepted, as otherwise it might be too dangerous to use.
There was a problem hiding this comment.
For potential future reference: Content-MD5 appears to not behave as specified in the documentation of the REST API. Namely, large files (I tested with a 4GB file) do not get assigned a Content-MD5 property automatically. Hence, I've changed it back to use the etag instead
(see also: https://discordapp.com/channels/485586884165107732/565699007037571084/713462619469643796)
- Implement get_file_checksum with get_etag - Implement a DependencyAzure and OutputAzure classes - Add AzureOutput to OUTS, OUTS_MAP and CHECKSUMS_SCHEMA - Add AzureDependency to DEPS and DEPS_MAP - Add unit tests for AzureOutput and AzureDependency Fixes treeverse#3540.
The etag is different for the same file in different locations.
9a6679b to
5df9cb8
Compare
|
@steffansluis Btw, could you please update the docs in https://dvc.org/doc/command-reference/import-url and in https://dvc.org/doc/user-guide/external-dependencies ? |
be641b7 to
415934e
Compare
|
@efiop It's very basic but I think it should do: treeverse/dvc.org#1337 |
415934e to
4c10325
Compare
Due to inconsistencies with Azure's Content-MD5 property, it is not stable enough to be used at the moment. Using the etag offers enough reliability to support external dependencies. See also: https://discordapp.com/channels/485586884165107732/565699007037571084/713462619469643796
4c10325 to
f63add9
Compare
efiop
left a comment
There was a problem hiding this comment.
Thanks @steffansluis ! 🙏
Wasn't able to push to your branch, so I'll add a func a test for get_file_checksum on top.
Documention PR: treeverse/dvc.org#1337
Fixes #3540.
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)