Iqss/6485 - Multiple File Stores #6488

Merged

kcondon merged 37 commits into IQSS:develop from TexasDigitalLibrary:IQSS/6485 on Feb 19, 2020
Conversation

@qqmyers
Member

@qqmyers qqmyers commented Jan 7, 2020

What this PR does / why we need it: Implements the ability to configure multiple file stores.

Which issue(s) this PR closes:

Closes #6485

Special notes for your reviewer: The install script should set up a file store, backward compatible with an existing installation. The docs also describe how to do this.

Suggestions on how to test this: Multiple things that could be tested:
Basic install comes up with a single file store working as before (can upload/download files)
Ability to configure a second file store (e.g. with a different file path)
Ability to configure an additional s3 store
Ability to configure another s3 store using a separate aws profile
Ability to change the default store for new files (new store allows upload/download, files in other stores are still downloadable)
Ability to list stores available and use the admin API and/or superuser UI (both described in docs) to set/reset to default which store is used for a given dataverse.
Verify that a store set for a given dataverse works for datasets created/edited within that dataverse.
With a s3 store configured with the id 's3', verify that rsync still works
Without an s3 store configured with the id 's3', verify that rsync upload does not appear as an option.
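To make the second-file-store test case above concrete, configuration might look something like the following. This is a hedged sketch, not the authoritative procedure: the `file1` id and the directory path are examples, and the option names follow the `dataverse.files.<id>.*` pattern discussed later in this thread, so check the config guide for the exact names.

```shell
# Sketch: define a second file store with id "file1" and make it the
# default for newly uploaded files. Ids and paths here are examples.
./asadmin create-jvm-options "-Ddataverse.files.file1.type=file"
./asadmin create-jvm-options "-Ddataverse.files.file1.label=file1"
./asadmin create-jvm-options "-Ddataverse.files.file1.directory=/usr/local/glassfish4/glassfish/domains/domain1/files1"
# Select which store receives new files:
./asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=file1"
# A restart is required before new jvm-options take effect.
```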

Does this PR introduce a user interface change?: For superusers, adds an entry in the 'Edit'/'General Information' form for a dataverse to change the store for a given dataverse.

Is there a release notes update needed for this change?: Yes - existing installations will have to update some jvm options to move to this version. The config docs describe the changes needed for a file store or s3 store.

Additional documentation:
DCM/rsync has hardcoded using a vanilla s3 store. Since I don't have a good way to test any rsync changes, I didn't try to update it to support multiple stores, but did try to make changes to enable an existing instance to keep using DCM/rsync. Adding the jvm options to define an s3 store with id 's3', as described in the docs, should be enough to make DCM work with this update. It should be possible to make the rsync capability available per store, and support stores with specified endpoints, etc. in the future.
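Defining the s3-id store described above might look roughly like this. This is a hedged sketch: the `.bucket-name` option name and the placeholder bucket value are assumptions based on the `dataverse.files.<id>.*` pattern used elsewhere in this PR, so consult the config guide for the authoritative list of S3 options.

```shell
# Sketch: define an S3 store whose id is literally "s3" so that the
# hardcoded DCM/rsync path keeps working. Option names beyond .type and
# .label are assumptions; the bucket name is a placeholder.
./asadmin create-jvm-options "-Ddataverse.files.s3.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3.label=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3.bucket-name=your-bucket-name"
```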

qqmyers added 10 commits January 7, 2020 09:33
Conflicts:
	pom.xml
	src/main/java/edu/harvard/iq/dataverse/api/Datasets.java
	src/main/java/edu/harvard/iq/dataverse/ingest/IngestServiceBean.java
Note: DCM doesn't use any of the other S3 options in creating the Amazon
client, so it is somewhat hardcoded that way as well. A rewrite might
identify a specific s3 store to use, build the Amazon client from its
settings, and still use the general dcm-s3-bucket-name entry. If DCM
should be supported across multiple stores, a
dataverse.files.<id>.dcm-s3-bucket-name option could be created so that
everything is tied to a specific store.
manageable by superuser only

Conflicts:
	src/main/java/edu/harvard/iq/dataverse/EditDatafilesPage.java
	src/main/java/edu/harvard/iq/dataverse/dataaccess/DataAccess.java
	src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java
@coveralls

coveralls commented Jan 8, 2020

Coverage Status

Coverage increased (+0.01%) to 19.471% when pulling 034787b on TexasDigitalLibrary:IQSS/6485 into 1a2bca6 on IQSS:develop.

@IQSS IQSS deleted a comment from qqmyers Jan 9, 2020
Contributor

@carlosmcgregor carlosmcgregor left a comment


Looks good! I focused mainly on the Swift driver and the changes seem to be alright.

Only a side note: @pdurbin, I remember us having a conversation re: using - in configuration names. I noticed the S3 configuration entries still have -. Not sure if there is a plan to remove them in the future. It would also be nice if similar entries, like temporaryUrlExpiry in the Swift configuration, had the same name as the corresponding S3 entry.

@pdurbin
Member

pdurbin commented Jan 9, 2020

I remember us having a conversation re: - in the configuration file.

@carlosmcgregor sorry, can you please remind me if you wanted a different character?

@carlosmcgregor
Contributor

Quoting @poikilotherm from this comment:

At least pretty please avoid - and _ in the config names. - is an invalid char for environment variables, _ is used for substitution. See gdcc/dataverse-kubernetes#11 Best option is to use a "." for namespacing as used elsewhere plus camel / pascal case.

@pdurbin
Member

pdurbin commented Jan 10, 2020

At least pretty please avoid - and _ in the config names.

Ok. Thanks, @carlosmcgregor . @poikilotherm do you still feel this way?

@landreev
Contributor

@qqmyers This is looking very good!
A couple of questions/requests:

  • Is there a use case for having the storage driver selection menu on the dataverse page? - Somebody asked during our tech hour today if making it configurable via API only would be enough.

  • I very much like that your solution makes it possible to keep any already existing storage identifiers unchanged. But it would help to include an explicit migration instruction; something we could include in the release notes. It was very easy for me to figure out what settings needed to be changed/added for our existing S3 buckets here, by reading your config guide section. But for other installations deploying the release, we should at least warn them upfront that their S3 and/or swift storage volumes are going to stop working unless reconfigured. Something like doc/release-notes/6485-multiple-storage.md using our normal notation.

Thank you!

@qqmyers
Member Author

qqmyers commented Jan 14, 2020

@landreev - Thanks. The question of whether configuring the store for a dataverse needs a GUI option is probably best answered by TDL (@CCMumma et al.). With their instance supporting multiple universities with local admins, having some way for those admins to allow users to access a particular store (which may require approval for large data, a cost, etc.) is useful and probably important long-term. That said, if it turns out to be rare, an API option may be enough for now (and it's not clear that associating a store with a dataverse is the only/best way to manage this longer term). I will note that the control only shows for superusers right now, so the dataverse page is unchanged unless you have the privilege to make the change.

W.r.t. release notes - I agree that release notes are in order. I haven't kept up to date with where they should go before the final notes come out, but with your hint above, I'll go ahead and move the backward compatibility info from the docs (as @pdurbin wanted) and into some release notes.

@pdurbin
Member

pdurbin commented Jan 15, 2020

@qqmyers thanks. I'd feel remiss if I didn't mention that there are now some merge conflicts so if you would please resolve them at your convenience, it would be most appreciated! I assume they're from pull request #6517.

@landreev
Contributor

@qqmyers

I will note that it only shows for superusers right now

Yes, we saw that. And, to be clear, I personally don't have a problem with that selection being there. I believe our UI designer asked the question because our normal approach was to keep superuser-only functionality out of the main pages. But, there are exceptions of course, and if there is a use case for it, it should be ok, IMO.

@kcondon kcondon assigned qqmyers and unassigned kcondon Feb 11, 2020
@qqmyers
Member Author

qqmyers commented Feb 11, 2020

@kcondon - could be code, but to verify, you have
./asadmin create-jvm-options "-Ddataverse.files.file.type=file"
./asadmin create-jvm-options "-Ddataverse.files.file1.label=file1"
./asadmin create-jvm-options "-Ddataverse.files.file1.directory="

(first entry is still type file, others have the file1 id)

@kcondon
Contributor

kcondon commented Feb 11, 2020

@qqmyers I do not have what you described since one item appears incorrect:
./asadmin create-jvm-options "-Ddataverse.files.file.type=file" should be
./asadmin create-jvm-options "-Ddataverse.files.file1.type=file"

        <jvm-options>-Ddataverse.files.storage-driver-id=file1</jvm-options>
        <jvm-options>-Ddataverse.files.file1.directory=/usr/local/glassfish4/glassfish/domains/domain1/files1</jvm-options>
        <jvm-options>-Ddataverse.files.file1.type=file</jvm-options>
        <jvm-options>-Ddataverse.files.file1.label=file</jvm-options>

@qqmyers
Member Author

qqmyers commented Feb 11, 2020

@kcondon - any chance there's a second jvm-option that sets the type to file1? Or a lack of restart after an update to the jvm-options?
So far when I trace through the code, I quickly get to a
System.getProperty("dataverse.files." + driverId + ".type", "Undefined");
call whose result should be what shows up in the 'createDataAccessObject: Unsupported storage method file1' message.

@kcondon
Contributor

kcondon commented Feb 11, 2020

@qqmyers No, I had double checked the restart, good. Here is the result of grepping on file1:

grep file1 domain.xml
        <jvm-options>-Ddataverse.files.storage-driver-id=file1</jvm-options>
        <jvm-options>-Ddataverse.files.file1.directory=/usr/local/glassfish4/glassfish/domains/domain1/files1</jvm-options>
        <jvm-options>-Ddataverse.files.file1.type=file</jvm-options>
        <jvm-options>-Ddataverse.files.file1.label=file</jvm-options>

@landreev
Contributor

I don't know if it helps, but looking at the code in DataAccess.populateDrivers(), I'm seeing this:

   logger.info("Found Storage Driver: " + driverId + " for " + p.get(property).toString());

yet I'm not seeing the message above in the server logs on the system where @kcondon has been testing the branch.

avoid a doi: in the dataset storageidentifier
fix dataset and file storageidentifier assignment issues
update db datafile entries to have file:// if they don't have a prefix

Conflicts:
	src/main/resources/db/migration/V4.19.0.1__tbd_multistore.sql
https://github.com/TexasDigitalLibrary/dataverse.git into IQSS/6485

Conflicts:
	doc/release-notes/6485-multiple-stores.md
@qqmyers
Member Author

qqmyers commented Feb 14, 2020

@kcondon, @landreev - I've made some changes that should fix things. It looks like some (or all) ways of creating file store entries left the storageidentifier without a <driverId>:// prefix, unlike the s3 and swift stores, so things only worked sometimes, or only if one used the same directory for the default and the file1 store. The fix involves updating the db script so that all existing storageidentifiers for files that have no prefix get a file:// one (consistent with having to give the default store the id 'file' in the config options), plus a change to the FileAccessIO class to assign the driver prefix upon write (not sure that's the best place to do it, but it's where the s3 and swift stores do it). There was also a place in delete where the literal string 'file' was being prepended instead of using the driverId.
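The db-script change described here could be sketched roughly as follows. This is a hypothetical illustration, not the actual migration: the database name, table, and column filters are assumptions, and the real statement lives in the V4.19.0.1 migration script.

```shell
# Hypothetical sketch of the migration idea: give any datafile
# storageidentifier lacking a <driverId>:// prefix the default
# "file://" prefix. Names here are assumptions, not the real script.
psql dvndb -c "UPDATE dvobject SET storageidentifier = 'file://' || storageidentifier \
  WHERE dtype = 'DataFile' AND storageidentifier NOT LIKE '%://%';"
```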

With those, I think the test cases you had above should work.

I also realized that the datasets were getting an extra 'doi:' in their storageidentifiers from one path, so I updated that, and also found a couple places where things may not have been dealing with alt identifiers correctly.

W.r.t. not seeing the 'Found' driver statements - those are only populated/cached when someone wants to set the storage in the Edit/General Info Dataverse page, so I don't think that not seeing them was related. (That log could/should be .fine() at some point, unless documenting the set of storage drivers available once per startup makes sense.)

Hopefully good to go again. I've been trying to use multiple s3 and file stores on TDL's dev server and so far so good.

@djbrooke
Contributor

Thanks @qqmyers, I'll move this back over to QA.

@kcondon kcondon self-assigned this Feb 18, 2020
@kcondon
Contributor

kcondon commented Feb 18, 2020

@qqmyers Thanks for all the work! I'm testing this now. Btw, Jenkins reports the build as unstable due to tests failing. This is from the build console:

[ERROR] Failures: 
[ERROR]   FileAccessIOTest.testGetStorageLocation:214 expected:<[file]:///tmp/files/tmp/da...> but was:<[dummmy]:///tmp/files/tmp/da...>
[INFO] 
[ERROR] Tests run: 1121, Failures: 1, Errors: 0, Skipped: 6
[INFO] 
[ERROR] There are test failures.

Update: it turns out Jenkins won't execute post build shell when build is unstable so need to fix this before I can test war file.

@kcondon kcondon assigned qqmyers and unassigned kcondon Feb 18, 2020
@qqmyers
Member Author

qqmyers commented Feb 18, 2020

@kcondon - test is fixed

@kcondon kcondon self-assigned this Feb 19, 2020
@kcondon kcondon merged commit bf79e64 into IQSS:develop Feb 19, 2020
@kcondon
Contributor

kcondon commented Feb 19, 2020

@qqmyers Thanks for the work, was able to configure and use 2 s3 profiles, 2 local file stores. Did notice a warning when using local file store and publishing a dataset, which triggers export:

[2020-02-18T19:11:51.930-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand] [tid: _ThreadID=52 _ThreadName=jk-connector(2)] [timeMillis: 1582071111930] [levelValue: 900] [[
Dataset publication finalization: exception while exporting:IO Exception thrown exporting as export_dcterms.cached]]

I did not see an export_dcterms.cached file on the file system, but that did not appear to cause any problems. Any thoughts?

@djbrooke djbrooke added this to the 4.20 milestone Feb 19, 2020
@qqmyers
Member Author

qqmyers commented Feb 19, 2020

@kcondon - and thanks to you for great testing!

For the exporter error, it is hard to see how the changes here could affect one export and not others, but if dcterms is the first one (you're not seeing any cached exports), it could be related to multi-stores (the export code does a ~non-standard thing - it creates a fake 'placeholder' storageIO instance that it only uses to write the aux file - using two different methods in two different cases).

FWIW - I'm not sure the ExportService is doing what's expected either: if the code to create an outputStream from the storageIO works on line https://github.com/IQSS/dataverse/blame/develop/src/main/java/edu/harvard/iq/dataverse/export/ExportService.java#L258, it looks like tempFileRequired is false and the code at https://github.com/IQSS/dataverse/blame/develop/src/main/java/edu/harvard/iq/dataverse/export/ExportService.java#L269 should run, which stores to a local temp file only. It looks like an IOException has to occur in the block with the first line so that code at https://github.com/IQSS/dataverse/blame/develop/src/main/java/edu/harvard/iq/dataverse/export/ExportService.java#L275 runs instead.

If that analysis is correct (if you concur I can raise an issue), it would mean that proper functioning has required failure of the initial attempt to stream the aux file, raising the odd possibility that a fix in the multistore code might be exposing an error here. That said, it still seems like this would be an all-formats or none issue.

@pdurbin
Member

pdurbin commented Feb 24, 2020

I'm catching up but my take on the last comment is that we're using #6666 to track the issue with exports.


Development

Successfully merging this pull request may close these issues.

Allow multiple stores in Dataverse

9 participants