Skip to content

fix: migration of old datasets#639

Merged
jsam merged 1 commit into
SwissDataScienceCenter:masterfrom
jsam:636-fix-migration-00
Sep 12, 2019
Merged

fix: migration of old datasets#639
jsam merged 1 commit into
SwissDataScienceCenter:masterfrom
jsam:636-fix-migration-00

Conversation

@jsam
Copy link
Copy Markdown
Contributor

@jsam jsam commented Aug 23, 2019

closes #636
closes #646

@jsam jsam requested a review from a team as a code owner August 23, 2019 20:11
@jsam jsam changed the title fix: migration of old datasets [WIP] fix: migration of old datasets Aug 23, 2019
@rokroskar
Copy link
Copy Markdown
Member

can you make sure to check for any paths in a Dataset that are outside the repo root? Those should be migrated, see #646 which will fail with

stderr: 'fatal: ../../../data/zhbikes/2019_verkehrszaehlungen_werte_fussgaenger_velo.csv: '../../../data/zhbikes/2019_verkehrszaehlungen_werte_fussgaenger_velo.csv' is outside repository

even if the bug which causes the reported problem is fixed...

Comment thread renku/cli/_providers/zenodo.py Outdated
Comment thread renku/cli/_providers/zenodo.py Outdated
Comment thread renku/cli/migrate.py Outdated
rokroskar
rokroskar previously approved these changes Aug 27, 2019
@rokroskar rokroskar dismissed their stale review August 27, 2019 16:08

accidentally approved :oops: #sorry

@jsam
Copy link
Copy Markdown
Contributor Author

jsam commented Aug 29, 2019

can you make sure to check for any paths in a Dataset that are outside the repo root? Those should be migrated, see #646 which will fail with

stderr: 'fatal: ../../../data/zhbikes/2019_verkehrszaehlungen_werte_fussgaenger_velo.csv: '../../../data/zhbikes/2019_verkehrszaehlungen_werte_fussgaenger_velo.csv' is outside repository

even if the bug which causes the reported problem is fixed...

I'm trying to reproduce it, but still not fully understanding how that happened. Let's merge this as is until I can look more in-depth into #646 issue.

@jsam jsam changed the title [WIP] fix: migration of old datasets fix: migration of old datasets Aug 29, 2019
@rokroskar
Copy link
Copy Markdown
Member

Using https://dev.renku.ch/projects/lorenzo.cavazzi.tech/zurich-bikes-tutorial I get:

renku doctor
...
stderr: 'fatal: ../../../data/zhbikes/2019_verkehrszaehlungen_werte_fussgaenger_velo.csv: '../../../data/zhbikes/2019_verkehrszaehlungen_werte_fussgaenger_velo.csv' is outside repository

Copy link
Copy Markdown
Member

@rokroskar rokroskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this! I think checking for the old-style relative paths that point to outside the repo is the final thing to add.

Comment thread renku/api/repository.py
Comment thread renku/api/repository.py
Comment thread renku/api/repository.py
Comment thread renku/cli/_checks/migrate_datasets.py
Comment thread renku/cli/migrate.py Outdated
@rokroskar
Copy link
Copy Markdown
Member

Running through the toy example from #636 (comment) but using the code from this PR instead of renku 0.5.2 I end up in this state:

 git status
On branch master
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        deleted:    data/mydata/metadata.yml

and I still have a submodule in the repo:

cat .gitmodules
[submodule "test-migration"]
        path = .renku/vendors/local/private/tmp/test-migration
        url = /private/tmp/test-migration

checked out in the .renku/vendors/local directory:

tree .renku
.renku
├── datasets
│   └── 164bc1a7-3919-4851-8975-230242852f0e
│       └── metadata.yml
├── metadata.yml
├── refs
│   └── datasets
│       └── mydata -> ../../datasets/164bc1a7-3919-4851-8975-230242852f0e/metadata.yml
└── vendors
    └── local
        └── private
            └── tmp
                └── test-migration
                    ├── Dockerfile
                    ├── data
                    │   └── mydata
                    │       └── metadata.yml
                    ├── datafile
                    └── requirements.txt

@rokroskar
Copy link
Copy Markdown
Member

rokroskar commented Aug 30, 2019

argh... I'm finding now another issue where the file paths are absolute - so they can't be resolved when the repo is cloned somewhere else. Do you want to deal with that in the migration as well? Otherwise I need to check for this every time I instantiate a DatasetFile.

@jsam
Copy link
Copy Markdown
Contributor Author

jsam commented Aug 30, 2019

How did you test this? I tested it again this morning and it works for me and also I have a test which reproduces the report from #636 which is passing.

@jsam jsam requested a review from rokroskar September 3, 2019 08:07
@rokroskar
Copy link
Copy Markdown
Member

rokroskar commented Sep 4, 2019

I don't see any different behavior

looks good :D

@rokroskar
Copy link
Copy Markdown
Member

Trying to migrate the dataset from https://renkulab.io/gitlab/ACE-ASAID/meteorology (after deleting the new one) I get

renku dataset ls-files
Error: Broken resource of type `datasets` found.
Migration is required.
Please check `renku migrate --help` command to fix the issue.
Hint: `renku migrate datasets`

which I guess is because of absolute paths:

cat .renku/datasets/ce2e0f25-dff1-49f5-8d37-19aaa12c0e22/metadata.yml
'@context':
  _id: '@id'
  _label: rdfs:label
  _project: schema:isPartOf
  created: schema:dateCreated
  creator: schema:creator
  date_published: schema:datePublished
  description: schema:description
  files: schema:hasPart
  identifier: schema:identifier
  in_language: schema:inLanguage
  keywords: schema:keywords
  license: schema:license
  name: schema:name
  path: prov:atLocation
  prov: http://www.w3.org/ns/prov#
  schema: http://schema.org/
  url: schema:url
  version: schema:version
  wfprov: http://purl.org/wf4ever/wfprov#
'@type':
- prov:Entity
- schema:Dataset
- wfprov:Artifact
_id: ce2e0f25-dff1-49f5-8d37-19aaa12c0e22
_label: ce2e0f25-dff1-49f5-8d37-19aaa12c0e22
_project:
  '@context':
    _id: '@id'
    foaf: http://xmlns.com/foaf/0.1/
    prov: http://www.w3.org/ns/prov#
  '@type':
  - foaf:Project
  - prov:Location
  _id: https://renkulab.io/ACE-ASAID/meteorology
authors:
- '@type': dcterms:creator
  affiliation: null
  email: jenny_t152@yahoo.co.uk
  name: Jen Thomas
created: '2018-11-23T14:42:17.985570'
creator: []
date_published: null
description: null
files:
- '@type':
  - prov:Entity
  - schema:DigitalDocument
  - wfprov:Artifact
  _id: blob/UNCOMMITTED/MAWS__SMSAWS__20161123.txt
  _label: MAWS__SMSAWS__20161123.txt@UNCOMMITTED
  _project: null
  added: '2018-11-23T15:12:42.232297'
  creator: []
  dataset: null
  name: MAWS__SMSAWS__20161123.txt
  path: MAWS__SMSAWS__20161123.txt
  url: /home/jen/projects/ace_data_management/data_staging_for_renku/meteorology/MAWS__SMSAWS__20161123.txt
- '@type':
  - prov:Entity
  - schema:DigitalDocument
  - wfprov:Artifact
  _id: blob/UNCOMMITTED/MAWS__SMSAWS__20161124.txt
  _label: MAWS__SMSAWS__20161124.txt@UNCOMMITTED
  _project: null
  added: '2018-11-23T15:12:42.257562'
  creator: []
  dataset: null
  name: MAWS__SMSAWS__20161124.txt
  path: MAWS__SMSAWS__20161124.txt
  url: /home/jen/projects/ace_data_management/data_staging_for_renku/meteorology/MAWS__SMSAWS__20161124.txt
...

Copy link
Copy Markdown
Member

@rokroskar rokroskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking more closely at the migration code, there is a lot of looping through the datasets and files - this could get annoyingly expensive. Wouldn't it be better to loop once and define the necessary functions either at the Dataset or DatasetFile level?

I also have some issues with the tests, getting output like:

fatal: repository '/home/sam/Work/data_repos/old-datasets-v0.3.0' does not exist
fatal: clone of '/home/sam/Work/data_repos/old-datasets-v0.3.0' into submodule path '/private/var/folders/99/jng0jljd13db0h4jwbwtx4mh0000gn/T/pytest-of-rok/pytest-18/repo0/.renku/vendors/local/home/sam/Work/data_repos/old-datasets-v0.3.0' failed
Failed to clone '.renku/vendors/local/home/sam/Work/data_repos/old-datasets-v0.3.0'. Retry scheduled
fatal: repository '/home/sam/Work/data_repos/old-datasets-v0.3.0' does not exist
fatal: clone of '/home/sam/Work/data_repos/old-datasets-v0.3.0' into submodule path '/private/var/folders/99/jng0jljd13db0h4jwbwtx4mh0000gn/T/pytest-of-rok/pytest-18/repo0/.renku/vendors/local/home/sam/Work/data_repos/old-datasets-v0.3.0' failed
Failed to clone '.renku/vendors/local/home/sam/Work/data_repos/old-datasets-v0.3.0' a second time, aborting

Structurally, I don't understand why the migrations (or the tests) are lumped under "CLI"? In both contexts it should be under "models" imho.

@rokroskar
Copy link
Copy Markdown
Member

Another thing I noticed - authors is no longer migrated to creator - it looks like the model migrations are now skipped?

@rokroskar
Copy link
Copy Markdown
Member

apologies, my repo found itself in a state from which the only easy way out was a force-push...

Comment thread renku/cli/_checks/migrate_datasets.py
Comment thread renku/models/migrations/__init__.py
Comment thread renku/cli/_checks/migrate_datasets.py
Comment thread tests/cli/test_migrate.py
Copy link
Copy Markdown
Member

@rokroskar rokroskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost there - I think the lock file handling is the last small detail. Otherwise all the other problems seem to be fixed. Thanks!!

Comment thread renku/cli/_checks/migrate_datasets.py Outdated
Comment thread renku/cli/_checks/migrate_datasets.py Outdated
Comment thread renku/cli/_checks/migrate_datasets.py Outdated
Comment thread renku/models/migrations/__init__.py
Comment thread renku/cli/_checks/migrate_datasets.py
@jsam jsam requested a review from rokroskar September 12, 2019 13:15
Comment thread renku/errors.py
)


class NothingToCommit(RenkuException, click.ClickException):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be an error or a warning?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, empty commits should never happen from renku side. Now I agree that it's not the best UX, but currently, due to how things are implemented (via decorators) we don't have an effective way of capturing this exception and showing nice message to the user. 😞

Copy link
Copy Markdown
Member

@rokroskar rokroskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure about returning an error when there is nothing to commit, but otherwise it all seems to work now!! 🎉

@jsam jsam merged commit 4d4d7d2 into SwissDataScienceCenter:master Sep 12, 2019
@jsam jsam deleted the 636-fix-migration-00 branch September 12, 2019 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cli: missing entity commit migrating data results in wrong paths in the dataset metadata

2 participants