Description
Hello, I recently worked on a custom tool that uses the Dataverse installation Search API, and I found myself questioning some of the choices made in the code.
My need was to find all datasets that lack one custom field and have no files.
The custom field part is fine: /api/v1/search?q=-alternativeURL:*&type=dataset
But for the absence of files... I found that the fileCount property in the JSON response is built by Java code after the Solr result comes back.
The problem is that the search is really slow when you request a lot of data.
Example: /api/v1/search?q=*&type=dataset&subtree=root&sort=name&order=asc&per_page=1000 takes 37 seconds to return a 2.3 MB JSON file of 1000 datasets on our Dataverse installation (1991 datasets and 29844 datafiles).
I ended up doing it mostly with SQL queries.
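For illustration, the client-side workaround described above (query on the custom field, then filter on fileCount once the response is back) can be sketched in Python as follows. This is only a sketch under assumptions: the host name is a placeholder, the helper names are hypothetical, and the response layout (a data.items list whose dataset entries carry fileCount and global_id) is what I believe the Search API returns, so it should be checked against a real response.

```python
from urllib.parse import urlencode


def build_search_url(base_url, start=0, per_page=1000):
    """Build a Search API URL for datasets lacking the alternativeURL field.

    base_url is a placeholder for the installation's root URL.
    """
    params = {
        "q": "-alternativeURL:*",  # same query as in the issue text
        "type": "dataset",
        "start": start,
        "per_page": per_page,
    }
    return f"{base_url}/api/v1/search?{urlencode(params)}"


def datasets_without_files(response):
    """Keep only the items whose fileCount is 0.

    Assumes the decoded JSON looks like {"data": {"items": [...]}}.
    Items without a fileCount key are skipped rather than treated as empty.
    """
    items = response.get("data", {}).get("items", [])
    return [item for item in items if item.get("fileCount") == 0]


if __name__ == "__main__":
    # Hand-written sample standing in for a decoded API response.
    sample = {
        "data": {
            "items": [
                {"global_id": "doi:10.1234/AAA", "fileCount": 0},
                {"global_id": "doi:10.1234/BBB", "fileCount": 66},
            ]
        }
    }
    print(build_search_url("https://dataverse.example.org"))
    print([d["global_id"] for d in datasets_without_files(sample)])
```

The filter itself is trivial; the cost is that every page of results has to be fetched, and fileCount computed server side for each dataset, before any of them can be discarded.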
I think it would be interesting to have this information in the Solr document instead, for better performance and, above all, to be able to query
fileCount:0 in the Dataverse repository UI search bar or through the Search API.
Or maybe display it in the search result snippet of a Dataverse installation, or in a custom tool in my case (e.g. 66 files).
I have seen that the question and the idea already came up in the past during the fileCount implementation: Dataset file count in search results from API #6601, in Search API show fileCount for datasets #6601 #6623
I'm a bit troubled to see a simple Solr-based search query taking 37 seconds for only 1000 elements.
My guess is that too much SQL is executed after the Solr result comes back.
The point of using Solr is to get ultra-fast responses; in my humble opinion, it starts to lose its appeal here. It would be faster using 100% SQL or 100% Solr, don't you think?
As a Java developer, I do understand that in a complex application these choices are really complicated, whether because of the maintenance of SQL queries and criteria or to avoid overloading the Solr documents.
But there is a performance risk in running SQL queries for each document that Solr returns.
I might be off topic, feel free to correct me.
I look forward to reading your reply and, as always, thanks for your time and services to the Dataverse software community.
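To make the concern concrete, here is a small Python sketch of the pattern I suspect: one SQL query per document found, compared with a single grouped query for the whole page of results. It is not Dataverse's actual code or schema. The dataset and datafile tables, their columns and the function names are all hypothetical, and it uses an in-memory SQLite database only so that it runs on its own.

```python
import sqlite3


def make_db():
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset (id INTEGER PRIMARY KEY);
        CREATE TABLE datafile (id INTEGER PRIMARY KEY, dataset_id INTEGER);
        """
    )
    conn.executemany("INSERT INTO dataset (id) VALUES (?)", [(1,), (2,), (3,)])
    # Dataset 1 has two files, dataset 2 has none, dataset 3 has one.
    conn.executemany(
        "INSERT INTO datafile (dataset_id) VALUES (?)", [(1,), (1,), (3,)]
    )
    return conn


def counts_per_document(conn, dataset_ids):
    """One COUNT query per search hit: N hits means N round trips."""
    counts = {}
    for dataset_id in dataset_ids:
        row = conn.execute(
            "SELECT COUNT(*) FROM datafile WHERE dataset_id = ?", (dataset_id,)
        ).fetchone()
        counts[dataset_id] = row[0]
    return counts


def counts_batched(conn, dataset_ids):
    """A single grouped query for the whole page of hits."""
    ids = list(dataset_ids)
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT dataset_id, COUNT(*) FROM datafile "
        f"WHERE dataset_id IN ({placeholders}) GROUP BY dataset_id",
        ids,
    ).fetchall()
    # Datasets with no files do not appear in the grouped result, so default to 0.
    counts = {dataset_id: 0 for dataset_id in ids}
    counts.update(dict(rows))
    return counts


if __name__ == "__main__":
    conn = make_db()
    print(counts_per_document(conn, [1, 2, 3]))
    print(counts_batched(conn, [1, 2, 3]))
```

Both functions return the same counts; the difference is the number of queries, which grows with per_page in the first case and stays at one in the second. Storing the count in the Solr document at indexing time would avoid the lookup at search time altogether.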