Skip to content

Make most SOLR fields ignore diacritics#599

Closed
cdrini wants to merge 1 commit intointernetarchive:masterfrom
cdrini:178/hotfix/normalize-author-name-in-solr
Closed

Make most SOLR fields ignore diacritics#599
cdrini wants to merge 1 commit intointernetarchive:masterfrom
cdrini:178/hotfix/normalize-author-name-in-solr

Conversation

@cdrini
Copy link
Copy Markdown
Collaborator

@cdrini cdrini commented Oct 24, 2017

Meant to address #178
ol normalized solr

Do note I'm not too experienced with solr, so please review carefully :)

The following fields will now map any ascii-able (see below) characters down to ascii before indexing and querying:

<field name="title" type="text"/>
<field name="title_suggest" type="textgen"/>
<field name="subtitle" type="text"/>
<field name="alternative_title" type="text" stored="false" multiValued="true"/>
<field name="alternative_subtitle" type="text" stored="false" multiValued="true"/>
<field name="by_statement" type="text" stored="false" multiValued="true"/>
<field name="publish_date" type="text" multiValued="true"/>
<field name="lccn" type="textgen" multiValued="true"/>
<field name="oclc" type="text" multiValued="true"/>
<field name="contributor" type="textgen" multiValued="true"/>
<field name="publish_place" type="text" multiValued="true"/>
<field name="publisher" type="text" multiValued="true"/>
<field name="first_sentence" type="text" multiValued="true"/>
<field name="author_name" type="textgen" multiValued="true"/>
<field name="author_alternative_name" type="textgen" multiValued="true"/>
<field name="subject" type="text" multiValued="true"/>
<field name="place" type="textgen" multiValued="true"/>
<field name="person" type="textgen" multiValued="true"/>
<field name="time" type="text" multiValued="true"/>
<field name="text" type="text" multiValued="true"/>
<field name="name" type="textgen" indexed="true" stored="true"/>
<field name="alternate_names" type="textgen" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="true"/>

Here are the ascii mappings that will be applied: https://www.apt-browse.org/browse/ubuntu/trusty/universe/all/solr-common/3.6.2+dfsg-2/file/etc/solr/conf/mapping-FoldToASCII.txt (this file might not be exactly the one we use depending on what version is in solr, but should be similar).

Note this will replace whatever file is currently at /etc/solr/conf/schema.xml with a symlink to the schema file in $OL_ROOT/conf/solr/conf/schema.xml.

Note: to fully reindex solr from the vagrant environment:

sudo /etc/init.d/tomcat6 restart # restart solr's server
sudo -u vagrant make reindex-solr

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented Oct 24, 2017

@mek Can you confirm the solr version on production? There are things in the code which make me fear it might still be 1.4, but not sure.

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented Oct 24, 2017

@mek @charles Could either of you confirm that this file exists on your local instances: /etc/solr/conf/mapping-FoldToASCII.txt. I can't remember if I had to add this or if it was already there :P

@tfmorris
Copy link
Copy Markdown
Contributor

Thanks for tackling an issue first reported in 2010 by the lead engineer, then reported again in 2011, by the (then) project leader. Although it doesn't affect US (aka SF) users much, it's a been a critical usability issue for the rest of the world for the better part of a decade.

I'll try to review more carefully in the next day or two, but I'm concerned by the "map down to ASCII" and "fold to ASCII" phraseology. The character encoding should be normalized, but in the Unicode domain, not the ASCII domain.

Do you really mean ASCII or is that just a convenient shorthand for NFKC?

@hornc
Copy link
Copy Markdown
Collaborator

hornc commented Oct 24, 2017

@mek you should take a backup of /etc/solr/conf/schema.xml before deploying, given @cdrini 's note above! Just in case :)

@hornc
Copy link
Copy Markdown
Collaborator

hornc commented Oct 24, 2017

@cdrini that mapping txt was there on my dev instance:

vagrant@ol-dev:~$ ls -l /etc/solr/conf/mapping-FoldToASCII.txt 
-rw-r--r-- 1 root root 78514 Dec 24  2012 /etc/solr/conf/mapping-FoldToASCII.txt

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented Oct 24, 2017

@mek actually, can you perform a diff of it with $OL_ROOT/conf/solr/conf/schema.xml? They should be identical, and I'd be curious to see what the differences are (if any)

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented Oct 24, 2017

@tfmorris Here's hoping it works :)
You are correct, sorry for the misleading wording! This will map to ascii where possible, leaving all other characters untouched. The final string will be some sort of Unicode (solr handles those details). I'll update this commit message accordingly.

@cdrini cdrini force-pushed the 178/hotfix/normalize-author-name-in-solr branch from e2ca745 to b8849aa Compare October 24, 2017 21:28
@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented Oct 24, 2017

Here are the ascii mappings that will be applied: https://www.apt-browse.org/browse/ubuntu/trusty/universe/all/solr-common/3.6.2+dfsg-2/file/etc/solr/conf/mapping-FoldToASCII.txt (this file might not be exactly the one we use depending on what version is in solr, but should be similar).

@tfmorris
Copy link
Copy Markdown
Contributor

tfmorris commented Nov 2, 2017

I had a look at the FoldToASCII mappings and I'm not feeling any more comfortable that this helps anyone other than English speakers.

What about normalization of all the non-Roman character sets in the world? Why can't NFKC be used?

The current search scheme prevents users from finding entries which differ only in trivial encoding details such as precomposed diacriticals vs non-spacing diacriticals (for non-Roman character sets).

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented Nov 2, 2017

Wow. I did some research into NFKC and unicode normalization, and that stuff is mind-blowing! Did not know all that stuff was happening under-the-hood; super neat!

Here's my quick summary for the uninitiated: https://gist.github.com/cdrini/ef398d918959444b282fbb566082bb7b (from reading http://unicode.org/reports/tr15/ ).
(Also a good read if you want to know pretty much the extent of my knowledge on this subject :P )

@tfmorris I think we've again come up with another case of misleading wording :P. "Normalization" here does not mean "unicode normalization" (that would be too clear) but basically just removing diacritics. As far as I can tell, no form of unicode normalization does this. I think normalizing to NFKC is a great idea, especially if there are inconsistencies in our data, but I would consider that a different issue from that which was originally posed in #178 (although that issue, again as a result of using the word "normalization", ended up touching on unicode normalization as well, it was originally about allowing searches to match with diacritics stripped).

To clarify: unicode normalization would ensure that the following searches all return the same results (which they currently do not; although they do if searching works for some reason 🤔):

Stripping diacritics ensures the following searches all return the same results:

Stripping diacritics would allow people to search without having to use the special keys for diacritic characters, which would be a boon for both English speakers searching for non-English works as well as for non-English speakers searching for non-English works.

If you got all the way to the end of this essaycomment, props to you—I owe you a 🍪 :)

@mekarpeles mekarpeles changed the title Normalize fields in solr Adding custom author/title SOLR fields to infer diacritics Nov 2, 2017
@cdrini cdrini changed the title Adding custom author/title SOLR fields to infer diacritics Make most SOLR fields ignore diacritics Nov 6, 2017
@mekarpeles mekarpeles self-requested a review November 8, 2017 22:04
Copy link
Copy Markdown
Member

@mekarpeles mekarpeles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving the changes but we should hold off merging until we figure out how reindexing will occur

@cdrini notes:
Probably hold off on merging the diacritics stripping PR Did some research and it looks like a full reindex will require a bit more planning to avoid any downtime (this site seems to have some good tips: http://www.ehabelgindy.com/how-to-achieve-zero-downtime-sitecore-deployments-part-ii/ )

@tfmorris
Copy link
Copy Markdown
Contributor

tfmorris commented Nov 8, 2017

Sorry! I conflated normalization and folding, leading you on a wild goose chase. I think using one of the normalized forms for our data would be an improvement, but it's separate from the searching indexing issue.

My objection to ASCII remains though. It's not an English speaking world. You want the ICU Folding Filter, at a minimum. If you wanted to get fancy, you could investigate the ICU Transform Filter to do stuff like transliteration:

Source | Transliteration
キャンパス | kyanpasu
Αλφαβητικός Κατάλογος | Alphabētikós Katálogos
биологическом | biologichyeskom

words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are English (only) stop words

@tfmorris
Copy link
Copy Markdown
Contributor

On the subject of stop words, the current list is English only which may not be appropriate. Additionally whatever is used to index personal names (e.g. authors) should use stop words at all, most likely. The current field type "textgen" includes English stop words which will zap things like A, An, etc.

@tfmorris
Copy link
Copy Markdown
Contributor

@tfmorris
Copy link
Copy Markdown
Contributor

@mek Speaking of reindexing, is there a staging server that can be used to test search changes? Search is a cornerstone of the user experience, so it's definitely something we want to be improving, not making worse -- which means we need to be able to test our changes.

@tfmorris
Copy link
Copy Markdown
Contributor

Internet Archive is using the ICU components for its index:

https://medium.com/@giovannidamiola/making-the-internet-archives-full-text-search-faster-30fb11574ea9

That post also has some good hints on how to replace a generic English only stopwords list with common words from the corpus.

@LeadSongDog
Copy link
Copy Markdown

Interesting read. Very ingenious handling of common words: Avoiding a complete ignore of "the", rather than turning "the" "quick" "brown" "fox" into "the" "quick" "brown" "fox", they used "the quick" "quick" "brown" "fox" instead. Slick!

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented Nov 19, 2017

@tfmorris Thanks for the info! And no worries, it was an informative 'goose chase' :) I like the looks of the ICU Folding Filter! It seems like it handles Unicode normalization, case folding, and diacritic stripping (i.e. win-win-win!). The Solr docs say it's better than "ASCII Folding Filter, Lower Case Filter, and ICU Normalizer 2 Filter.". The full foldings performed by this filter are here: https://lucene.apache.org/core/3_6_2/api/contrib-icu/org/apache/lucene/analysis/icu/ICUFoldingFilter.html (Info about each type of folding can be found here: http://www.unicode.org/reports/tr30/tr30-4.html#_Toc22 ).

Adding this is slightly more involved since we have to enable some solr plugins (which might take some time to figure out depending on how prod is configured). I think I'd like to setup a workflow for reindexing solr with minimal downtime first (since that will be necessary for most solr schema changes), but I was able to get the ICU folding filter running locally in ~1 day: https://github.com/cdrini/openlibrary/compare/178/hotfix/normalize-author-name-in-solr...cdrini:feature/solr-icu-folding-filter?w=1 (note this is mostly just me playing around until it worked and needs some more love to be production ready :) )

With regards to stopwords, that is very much outside the scope of this PR. Also, at least on the local dev environment, that stopwords file is empty :P Not sure if this is the case for prod, though.

@cdrini cdrini force-pushed the 178/hotfix/normalize-author-name-in-solr branch from b8849aa to 6fdb1c4 Compare November 19, 2017 04:10
@tfmorris
Copy link
Copy Markdown
Contributor

tfmorris commented Dec 9, 2017

@cdrini Cool! Let me know when you've got a revised (or different) PR that you'd like reviewed.

@mekarpeles It's probably obvious, but even though this is "approved" I don't think it should be merged in its current state.

Also, on a related note, any info on a staging server? (asked above, but I pinged the wrong mek)

Speaking of reindexing, is there a staging server that can be used to test search changes? Search is a cornerstone of the user experience, so it's definitely something we want to be improving, not making worse -- which means we need to be able to test our changes.

@mekarpeles
Copy link
Copy Markdown
Member

@cdrini let's bring this up for our Tuesday call to see what next steps are.
At some point in the future, I'd also love to hear your opinions on a move to ES

@tfmorris
Copy link
Copy Markdown
Contributor

What is needed to make some progress on this? Can we make a list of the blockers?

@mekarpeles
Copy link
Copy Markdown
Member

@cdrini as per @tfmorris's comment, is this something we can safely merge in without doing a full re-index (and let it become eventually consistent?)

@mekarpeles
Copy link
Copy Markdown
Member

@tfmorris are there any updates re: getting solr up to date for this merge?

@tfmorris
Copy link
Copy Markdown
Contributor

Sorry for the lack of progress on this. I'll try to get to it in the next few days.

@cdrini cdrini mentioned this pull request Sep 5, 2018
14 tasks
@LeadSongDog
Copy link
Copy Markdown

@mekarpeles mekarpeles added the State: Blocked Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed] label May 9, 2019
@mekarpeles
Copy link
Copy Markdown
Member

cc: @seabelis (so you are aware this fix-in-progress exists)

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented May 9, 2019

For clarity, this is blocked by #1067 . We have the fix, but no way to get in into production right now.

@tfmorris tfmorris added Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Theme: Internationalization Making OpenLibrary work for both foreign-language users and books. [managed] labels Jul 5, 2019
@tfmorris
Copy link
Copy Markdown
Contributor

PR #2246 includes support for the ICUFoldingFilter that I mentioned.

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented Jul 30, 2019

I'm going to go ahead and reject this; it's been open for a very long time, and this isn't the way I'd do it now, anyways. I can always look at the closed PR for reference :)

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented Apr 20, 2020

Re-opening as a result of discussion on #3290 (see #3290 (comment) ). We can use ASCII Folding as a temporary patch until we update to solr 8 and use the ICUFolding filtter.

@cdrini cdrini reopened this Apr 20, 2020
@cdrini cdrini removed the State: Blocked Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed] label Apr 20, 2020
@cdrini cdrini force-pushed the 178/hotfix/normalize-author-name-in-solr branch from 6fdb1c4 to 8b00120 Compare April 20, 2020 14:24
@cdrini cdrini added this to the Next Sprint (Proposed) milestone Apr 20, 2020
@mekarpeles mekarpeles self-assigned this Apr 27, 2020
@tfmorris
Copy link
Copy Markdown
Contributor

tfmorris commented May 1, 2020

As I mentioned at least a couple of times in 2017, this isn't an adequate solution. The real solution can be found in c7026ff and is only a few lines.

Since the bug has been open for over a decade, let's fix it right.

@cdrini
Copy link
Copy Markdown
Collaborator Author

cdrini commented May 4, 2020

Ok, cool, we're on the same page. Note c7026ff is off your solr8 branch, so is blocked by #3317 . I've been working off your branch taking the solr8 commits out and should hopefully have a prod ready version this month.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Theme: Internationalization Making OpenLibrary work for both foreign-language users and books. [managed]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants