Skip to content

azure: support external file dependencies#3852

Merged
efiop merged 3 commits into
treeverse:masterfrom
feedbackfruits:azure-external-dependencies-and-outputs
May 23, 2020
Merged

azure: support external file dependencies#3852
efiop merged 3 commits into
treeverse:masterfrom
feedbackfruits:azure-external-dependencies-and-outputs

Conversation

@steffansluis
Copy link
Copy Markdown
Contributor

@steffansluis steffansluis commented May 22, 2020

Documention PR: treeverse/dvc.org#1337

  • Implement get_file_checksum with get_etag
  • Implement AzureDependency
  • Add AzureDependency to DEPS and DEPS_MAP
  • Add unit tests for AzureDependency

Fixes #3540.

  • ❗ I have followed the Contributing to DVC checklist.

  • 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

@steffansluis
Copy link
Copy Markdown
Contributor Author

I'm not really sure why the tests are failing on Travis. Locally, they pass having followed the instructions at https://dvc.org/doc/user-guide/contributing/core#testing-remotes.

Comment thread tests/unit/dependency/test_azure.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steffansluis, Are you sure it's running locally?

AzureRemote remote parses path via regex:
https://github.com/iterative/dvc/blob/865d355554b27be5ea49640c14556c46f7906327/dvc/remote/azure.py#L22-L27

And, on TestLocalDependency, the path is of the following type which fails on regex parsing:
https://github.com/iterative/dvc/blob/f5055940a2f806ccdf8be0193115641effd69416/tests/unit/dependency/test_local.py#L14

Same thing is happening on TestAzureOutput. Maybe, change the path that's getting passed to AzureRemote for tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Having done source tests/remotes_env).

python -m tests tests/unit/output/test_azure.py tests/unit/dependency/test_azure.py shows (among other things) the following output:

[gw1] [ 25%] PASSED tests/unit/output/test_azure.py::TestLocalOutput::test_save_missing                                                                                                       
[gw3] [ 50%] PASSED tests/unit/dependency/test_azure.py::TestLocalDependency::test_save_missing                                                                                               
[gw0] [ 75%] PASSED tests/unit/output/test_azure.py::TestAzureOutput::test_save_missing                                                                                                       
[gw2] [100%] PASSED tests/unit/dependency/test_azure.py::TestAzureDependency::test_save_missing                                                                                               

The tests are basic copies of the tests for S3: https://github.com/iterative/dvc/blob/f5055940a2f806ccdf8be0193115641effd69416/tests/unit/dependency/test_s3.py#L5-L7

And looking at the tests for the HDFS output (which also uses regexes) this should be fine?:
https://github.com/iterative/dvc/blob/f5055940a2f806ccdf8be0193115641effd69416/tests/unit/dependency/test_hdfs.py#L5-L7

@efiop efiop changed the title azure: support external dependencies and outputs azure: support external file dependencies May 22, 2020
Comment thread dvc/remote/azure.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it stay true for multipart objects?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, does the user set it or is it automatically set by azure?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need very good tests for this to get it accepted, as otherwise it might be too dangerous to use.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For potential future reference: Content-MD5 appears to not behave as specified in the documentation of the REST API. Namely, large files (I tested with a 4GB file) do not get assigned a Content-MD5 property automatically. Hence, I've changed it back to use the etag instead

(see also: https://discordapp.com/channels/485586884165107732/565699007037571084/713462619469643796)

- Implement get_file_checksum with get_etag
- Implement a DependencyAzure and OutputAzure classes
- Add AzureOutput to OUTS, OUTS_MAP and CHECKSUMS_SCHEMA
- Add AzureDependency to DEPS and DEPS_MAP
- Add unit tests for AzureOutput and AzureDependency

Fixes treeverse#3540.
The etag is different for the same file in different locations.
@steffansluis steffansluis force-pushed the azure-external-dependencies-and-outputs branch from 9a6679b to 5df9cb8 Compare May 22, 2020 17:49
Comment thread dvc/remote/azure.py Outdated
@efiop
Copy link
Copy Markdown
Contributor

efiop commented May 22, 2020

@steffansluis steffansluis force-pushed the azure-external-dependencies-and-outputs branch from be641b7 to 415934e Compare May 22, 2020 20:36
@steffansluis
Copy link
Copy Markdown
Contributor Author

@efiop It's very basic but I think it should do: treeverse/dvc.org#1337

Comment thread dvc/dependency/azure.py Outdated
@steffansluis steffansluis force-pushed the azure-external-dependencies-and-outputs branch from 415934e to 4c10325 Compare May 22, 2020 21:02
Due to inconsistencies with Azure's Content-MD5 property, it is not
stable enough to be used at the moment. Using the etag offers enough
reliability to support external dependencies. See also:
https://discordapp.com/channels/485586884165107732/565699007037571084/713462619469643796
@steffansluis steffansluis force-pushed the azure-external-dependencies-and-outputs branch from 4c10325 to f63add9 Compare May 22, 2020 22:41
Copy link
Copy Markdown
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @steffansluis ! 🙏

Wasn't able to push to your branch, so I'll add a func a test for get_file_checksum on top.

@efiop efiop merged commit f9fbe97 into treeverse:master May 23, 2020
@efiop efiop mentioned this pull request May 23, 2020
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

azure: support external dependencies and outputs

3 participants