This is the first concrete dev issue being opened as part of the work on #4174, based on the discussion with the Scholars Portal team (see the notes in that issue).
The plan is to create an (external) tool that will allow the data owner to modify data variable metadata (for example: add weights to variables, change a variable's type, add categorical labels, etc.) in order to make the metadata more descriptive and/or accurate.
On the Dataverse end, we need to provide an API to accept the updated version of the metadata (in DDI/XML format) and save it in the database permanently.
Creating this API endpoint would be trivial; so is parsing variable-level DDI (we already have legacy code for this, inherited from the DVN project). The main technical challenge is that, as of now, our tabular (data variable) metadata are immutable. The DataVariable objects are created during tabular ingest; they are linked directly to DataFile objects (bypassing the versionable FileMetadata hierarchy), and no mechanism is provided for modifying the information in these objects.
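For illustration, here is a minimal standalone sketch of the variable-level parsing step (this is not the legacy DVN parser; the `var`/`labl` element names follow DDI Codebook conventions, and the class name is made up for this example):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class DdiVariableParser {
    /** Extracts variable name -> label pairs from a DDI codeBook fragment. */
    public static Map<String, String> parseLabels(String ddi) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(ddi.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> labels = new LinkedHashMap<>();
        NodeList vars = doc.getElementsByTagName("var");
        for (int i = 0; i < vars.getLength(); i++) {
            Element v = (Element) vars.item(i);
            labels.put(v.getAttribute("name"),
                       v.getElementsByTagName("labl").item(0).getTextContent());
        }
        return labels;
    }

    public static void main(String[] args) throws Exception {
        String ddi = "<codeBook><dataDscr>"
                + "<var name=\"age\"><labl>Age of respondent</labl></var>"
                + "<var name=\"wt\"><labl>Sampling weight</labl></var>"
                + "</dataDscr></codeBook>";
        System.out.println(parseLabels(ddi));
    }
}
```

The endpoint itself would just accept the uploaded XML body and hand it to a parser like this; as noted, that part is the easy bit.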
So the first task is to make the DataVariable metadata versionable. We may need to invest some thought into how we want to achieve this. A trivial solution would be to simply replicate what we are doing with all the other metadata: link the DataTable->DataVariable hierarchy to FileMetadata (instead of DataFile), thus allowing multiple versions of DataVariables to be associated with the same DataFile, the same way the same DataFile can have different file names in different versions.
HOWEVER, this system is fairly wasteful by design. If you have a dataset and you modify a single metadata field - fix a typo in the title, for example - we always create a new DatasetVersion that duplicates ALL THE METADATA fields that existed in the previous version (not just the changed title!), including all the FileMetadata information associated with every file in the dataset. Making all the variable-level metadata subject to this automatic duplication would increase the size of the database by an order of magnitude or worse, I believe. We probably want to avoid this, by designing some mechanism that creates new versions of these variable-level metadata only when there is an actual change. We should be able to reuse the same DataVariable objects between different versions, until an actual change is made - and then we create a new one. This would warrant a dedicated architecture discussion.
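The copy-on-write idea above could look roughly like this (a sketch only - the class and field names are hypothetical stand-ins, not the actual Dataverse entities, and real persistence via JPA is omitted):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: successive metadata versions share the same DataVariable
// instances until a variable-level edit actually occurs; only the edited
// variable gets a fresh object.
public class VariableVersioningDemo {
    static class DataVariable {
        final String name;
        final String label;
        DataVariable(String name, String label) { this.name = name; this.label = label; }
    }

    static class FileMetadataVersion {
        final List<DataVariable> variables;
        FileMetadataVersion(List<DataVariable> variables) { this.variables = variables; }

        // No variable-level change: the new version reuses the same objects.
        FileMetadataVersion nextVersionUnchanged() {
            return new FileMetadataVersion(variables);
        }

        // One variable edited: copy-on-write for that variable only.
        FileMetadataVersion nextVersionWithEdit(String varName, String newLabel) {
            List<DataVariable> copy = new ArrayList<>();
            for (DataVariable v : variables) {
                copy.add(v.name.equals(varName)
                        ? new DataVariable(varName, newLabel)
                        : v);
            }
            return new FileMetadataVersion(copy);
        }
    }

    public static void main(String[] args) {
        DataVariable age = new DataVariable("age", "Age");
        DataVariable wt  = new DataVariable("wt", "Weight");
        FileMetadataVersion v1 = new FileMetadataVersion(List.of(age, wt));

        FileMetadataVersion v2 = v1.nextVersionUnchanged();
        System.out.println(v2.variables.get(0) == age); // shared, not duplicated

        FileMetadataVersion v3 = v2.nextVersionWithEdit("wt", "Sampling weight");
        System.out.println(v3.variables.get(0) == age); // untouched variable still shared
        System.out.println(v3.variables.get(1) == wt);  // edited variable is a new object
    }
}
```

In database terms this would mean a join table between versions and DataVariable rows, so unchanged rows are referenced rather than copied - one of the things the architecture discussion would need to settle.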
Aside from that, this is all very doable.