Indexing: Support automatic reindex for objects created while Solr is down (Near Realtime Search) #702

@eaquigley

Description

Author Name: Philip Durbin (@pdurbin)
Original Redmine Issue: 4160, https://redmine.hmdc.harvard.edu/issues/4160
Original Date: 2014-06-30
Original Assignee: Philip Durbin


Dataverse 4.0 requires "near realtime search" because the moment dataverses, datasets, or files are added, updated, or deleted, the "cards" and facet counts must immediately reflect the change.

"Near realtime search means that documents are available for search almost immediately after being indexed - additions and updates to documents are seen in 'near' realtime." -- http://wiki.apache.org/solr/NearRealtimeSearch

In order to support near realtime search, we must handle indexing failure and re-try the indexing operation.
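
A minimal sketch of the retry idea, assuming the indexing call can be wrapped in a `Callable` that throws on failure. The names `RetrySketch`, `runWithRetry`, and `Result` are hypothetical and not part of Dataverse or Solr; the doubling delay between attempts is one possible policy, not a decision made in this issue.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an operation a bounded number of times. */
final class RetrySketch {

    /** Outcome of the retried operation. */
    static final class Result {
        final boolean succeeded;
        final int attempts;

        Result(boolean succeeded, int attempts) {
            this.succeeded = succeeded;
            this.attempts = attempts;
        }
    }

    /**
     * Calls the operation until it succeeds or maxAttempts is reached,
     * sleeping between attempts and doubling the delay each time.
     */
    static Result runWithRetry(Callable<Void> operation, int maxAttempts, long initialDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                operation.call();
                return new Result(true, attempt);
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    return new Result(false, attempt);
                }
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ie) {
                // Preserve the interrupt flag and give up early.
                Thread.currentThread().interrupt();
                return new Result(false, attempt);
            }
            delay *= 2;
        }
        // Only reached when maxAttempts is zero or negative.
        return new Result(false, 0);
    }
}
```

A caller would pass the actual indexing call as the operation and treat a `Result` with `succeeded == false` as the signal to leave the object marked for a later re-index.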

As we are designing this system, we should probably consider other cases where detecting failure of a network service and re-trying is desirable, such as:

  • registering DOIs
  • posting to Twitter

We should also consider using notifications for cases where re-indexing was attempted several times but continues to fail.
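
One way the notification idea could look, as a sketch only: count failed re-index attempts per object and fire a notification once when a threshold is reached. `ReindexFailureTracker`, its methods, and the `Consumer<String>` notifier are hypothetical names introduced for illustration; how notifications are actually delivered is left open here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical tracker: notifies once an object has failed to re-index too often. */
final class ReindexFailureTracker {
    private final int threshold;
    private final Consumer<String> notifier;
    private final Map<String, Integer> failures = new HashMap<>();

    ReindexFailureTracker(int threshold, Consumer<String> notifier) {
        this.threshold = threshold;
        this.notifier = notifier;
    }

    /** Records one failed attempt; notifies exactly once, when the count reaches the threshold. */
    void recordFailure(String objectId) {
        int count = failures.merge(objectId, 1, Integer::sum);
        if (count == threshold) {
            notifier.accept(objectId);
        }
    }

    /** Clears the failure count after a successful re-index. */
    void recordSuccess(String objectId) {
        failures.remove(objectId);
    }

    /** Returns the number of failures recorded since the last success. */
    int failureCount(String objectId) {
        return failures.getOrDefault(objectId, 0);
    }
}
```

Notifying only when the count equals the threshold avoids sending a new message on every later failure of the same object.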

In DVN 3.x there is a method called getUnindexedStudies at https://github.com/IQSS/dvn/blob/3.6.1/DVN-root/DVN-web/src/main/java/edu/harvard/iq/dvn/core/index/IndexServiceBean.java#L1061 that uses the following query to determine which studies need to be re-indexed:

List<Study> studies = (List<Study>) em.createQuery("SELECT s from Study s where s.lastIndexTime < s.lastUpdateTime OR s.lastIndexTime is NULL").getResultList();
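
The condition in that query can be restated in plain Java to make the rule explicit: an object needs re-indexing if it has never been indexed or if it was updated after it was last indexed. This is an in-memory sketch, not the DVN code; `StaleIndexSketch`, `Item`, and `findUnindexed` are hypothetical names, and only the field names `lastIndexTime` and `lastUpdateTime` come from the query above.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory version of the "needs re-index" check. */
final class StaleIndexSketch {

    static final class Item {
        final String id;
        final Instant lastUpdateTime;
        final Instant lastIndexTime; // null when the item was never indexed

        Item(String id, Instant lastUpdateTime, Instant lastIndexTime) {
            this.id = id;
            this.lastUpdateTime = lastUpdateTime;
            this.lastIndexTime = lastIndexTime;
        }
    }

    /** Mirrors: lastIndexTime < lastUpdateTime OR lastIndexTime is NULL. */
    static boolean needsReindex(Item item) {
        if (item.lastIndexTime == null) {
            return true;
        }
        // As in SQL, a missing update time never satisfies the comparison.
        return item.lastUpdateTime != null && item.lastIndexTime.isBefore(item.lastUpdateTime);
    }

    /** Returns the ids of all items that need re-indexing, in input order. */
    static List<String> findUnindexed(List<Item> items) {
        List<String> ids = new ArrayList<>();
        for (Item item : items) {
            if (needsReindex(item)) {
                ids.add(item.id);
            }
        }
        return ids;
    }
}
```

With this rule, objects created or changed while Solr was down are picked up by the next sweep, because their index timestamp was never written or is older than their update timestamp.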

Another approach could be to use a database table as a queue (though this approach could be problematic: https://blog.engineyard.com/2011/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you/).

See also:

http://lucene.472066.n3.nabble.com/strategies-for-managing-Solr-indexing-failures-and-retries-td4139186.html


Related issue(s): #229
Redmine related issue(s): 3643

