Make most SOLR fields ignore diacritics #599
cdrini wants to merge 1 commit into internetarchive:master from
Conversation
|
@mek Can you confirm the solr version on production? There are things in the code which make me fear it might still be 1.4, but not sure. |
|
Thanks for tackling an issue first reported in 2010 by the lead engineer, then reported again in 2011 by the (then) project leader. Although it doesn't affect US (aka SF) users much, it's been a critical usability issue for the rest of the world for the better part of a decade. I'll try to review more carefully in the next day or two, but I'm concerned by the "map down to ASCII" and "fold to ASCII" phraseology. The character encoding should be normalized, but in the Unicode domain, not the ASCII domain. Do you really mean ASCII or is that just a convenient shorthand for NFKC? |
|
@cdrini that mapping txt was there on my dev instance: |
|
@mek actually, can you perform a diff of it with |
|
@tfmorris Here's hoping it works :) |
(force-pushed from e2ca745 to b8849aa)
|
Here are the ascii mappings that will be applied: https://www.apt-browse.org/browse/ubuntu/trusty/universe/all/solr-common/3.6.2+dfsg-2/file/etc/solr/conf/mapping-FoldToASCII.txt (this file might not be exactly the one we use depending on what version is in solr, but should be similar). |
|
I had a look at the FoldToASCII mappings and I'm not feeling any more comfortable that this helps anyone other than English speakers. What about normalization of all the non-Roman character sets in the world? Why can't NFKC be used? The current search scheme prevents users from finding entries which differ only in trivial encoding details such as precomposed diacriticals vs non-spacing diacriticals (for non-Roman character sets). |
|
Wow. I did some research into NFKC and unicode normalization, and that stuff is mind-blowing! Did not know all that stuff was happening under-the-hood; super neat! Here's my quick summary for the uninitiated: https://gist.github.com/cdrini/ef398d918959444b282fbb566082bb7b (from reading http://unicode.org/reports/tr15/ ).

@tfmorris I think we've again come up with another case of misleading wording :P. "Normalization" here does not mean "unicode normalization" (that would be too clear) but basically just removing diacritics. As far as I can tell, no form of unicode normalization does this.

I think normalizing to NFKC is a great idea, especially if there are inconsistencies in our data, but I would consider that a different issue from the one originally posed in #178 (although that issue, again as a result of using the word "normalization", ended up touching on unicode normalization as well, it was originally about allowing searches to match with diacritics stripped).

To clarify: unicode normalization would ensure that the following searches all return the same results (which they currently do not; although they do if searching works for some reason 🤔):
Stripping diacritics ensures the following searches all return the same results:
Stripping diacritics would allow people to search without having to use the special keys for diacritic characters, which would be a boon for both English speakers searching for non-English works as well as for non-English speakers searching for non-English works. If you got all the way to the end of this |
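To make the distinction above concrete, here is a small sketch using only Python's standard-library `unicodedata` module. It shows that NFKC unicode normalization unifies precomposed vs. combining-character encodings of the same text but keeps the accents, whereas diacritic stripping (decompose to NFD, then drop combining marks) removes them entirely:

```python
import unicodedata

precomposed = "caf\u00e9"   # "café" using U+00E9 (precomposed é)
combining = "cafe\u0301"    # "café" using "e" + U+0301 (combining acute accent)

# Unicode normalization: both encodings become the same string,
# but the accent itself survives.
assert unicodedata.normalize("NFKC", precomposed) == unicodedata.normalize("NFKC", combining)
assert unicodedata.normalize("NFKC", combining) == "caf\u00e9"

def strip_diacritics(s: str) -> str:
    """Decompose to NFD, then drop all combining marks (category Mn)."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", s)
        if unicodedata.category(ch) != "Mn"
    )

# Diacritic stripping goes further: the accent is gone from both forms.
assert strip_diacritics(precomposed) == "cafe"
assert strip_diacritics(combining) == "cafe"
```

This is just an illustration of the two concepts; it is not how Solr implements either behavior.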
mekarpeles
left a comment
I'm approving the changes but we should hold off merging until we figure out how reindexing will occur
@cdrini notes:
Probably hold off on merging the diacritics stripping PR. Did some research and it looks like a full reindex will require a bit more planning to avoid any downtime (this site seems to have some good tips: http://www.ehabelgindy.com/how-to-achieve-zero-downtime-sitecore-deployments-part-ii/ )
|
Sorry! I conflated normalization and folding, leading you on a wild goose chase. I think using one of the normalized forms for our data would be an improvement, but it's separate from the search indexing issue. My objection to ASCII remains, though: it's not an English-speaking world. You want the ICU Folding Filter, at a minimum. If you wanted to get fancy, you could investigate the ICU Transform Filter to do stuff like transliteration: Source | Transliteration |
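For reference, a hypothetical `schema.xml` field type using the ICU folding filter mentioned above might look like the sketch below (the factory class name is from the Lucene/Solr ICU contrib module; the ICU plugin jars have to be on Solr's classpath, and the field/tokenizer choices here are illustrative, not Open Library's actual configuration):

```xml
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- NFKC normalization + case folding + diacritic folding in one filter -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```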
conf/solr/conf/schema.xml
Outdated

```xml
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
```
These are English (only) stop words
|
On the subject of stop words, the current list is English only, which may not be appropriate. Additionally, whatever is used to index personal names (e.g. authors) most likely should not use stop words at all. The current field type "textgen" includes English stop words, which will zap things like "A", "An", etc. |
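A toy sketch of why English stop word removal is dangerous for names and titles (the stop word list here is illustrative, not Solr's actual `stopwords.txt`):

```python
# A hypothetical English stop word list, for illustration only.
ENGLISH_STOPWORDS = {"a", "an", "the", "of", "who", "to", "be", "or", "not"}

def remove_stopwords(tokens):
    """Drop any token that matches a stop word, case-insensitively."""
    return [t for t in tokens if t.lower() not in ENGLISH_STOPWORDS]

# Some titles and names collapse or vanish entirely:
assert remove_stopwords(["The", "Who"]) == []
assert remove_stopwords(["To", "Be", "or", "Not", "to", "Be"]) == []
assert remove_stopwords(["A", "Tale", "of", "Two", "Cities"]) == ["Tale", "Two", "Cities"]
```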
|
Here are some background links on Solr language processing: https://www.liip.ch/en/blog/search-get-past-english-with-solr |
|
@mek Speaking of reindexing, is there a staging server that can be used to test search changes? Search is a cornerstone of the user experience, so it's definitely something we want to be improving, not making worse -- which means we need to be able to test our changes. |
|
Internet Archive is using the ICU components for its index: That post also has some good hints on how to replace a generic English only stopwords list with common words from the corpus. |
|
Interesting read. Very ingenious handling of common words: instead of ignoring "the" completely (turning "the" "quick" "brown" "fox" into "quick" "brown" "fox"), they index "the quick" "quick" "brown" "fox" instead. Slick! |
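A sketch of the idea described in that post (conceptually similar to Solr's CommonGrams filtering, though this is a simplification, not the production filter chain):

```python
# Illustrative common-word list, not a real corpus-derived one.
COMMON_WORDS = {"the", "a", "an", "of"}

def common_grams(tokens):
    """Instead of discarding a common word, emit a bigram of the
    common word plus its successor; other tokens pass through."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() in COMMON_WORDS and i + 1 < len(tokens):
            out.append(f"{tok} {tokens[i + 1]}")
        else:
            out.append(tok)
    return out

# "the" still contributes to matching, via the "the quick" bigram:
assert common_grams(["the", "quick", "brown", "fox"]) == ["the quick", "quick", "brown", "fox"]
```

A query for "the quick" can then match on the bigram, while "quick brown fox" still matches the plain tokens.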
|
@tfmorris Thanks for the info! And no worries, it was an informative 'goose chase' :) I like the looks of the ICU Folding Filter! It seems like it handles Unicode normalization, case folding, and diacritic stripping (i.e. win-win-win!). The Solr docs say it's better than the "ASCII Folding Filter, Lower Case Filter, and ICU Normalizer 2 Filter". The full foldings performed by this filter are here: https://lucene.apache.org/core/3_6_2/api/contrib-icu/org/apache/lucene/analysis/icu/ICUFoldingFilter.html (info about each type of folding can be found here: http://www.unicode.org/reports/tr30/tr30-4.html#_Toc22 ).

Adding this is slightly more involved, since we have to enable some solr plugins (which might take some time to figure out depending on how prod is configured). I think I'd like to set up a workflow for reindexing solr with minimal downtime first (since that will be necessary for most solr schema changes), but I was able to get the ICU folding filter running locally in ~1 day: https://github.com/cdrini/openlibrary/compare/178/hotfix/normalize-author-name-in-solr...cdrini:feature/solr-icu-folding-filter?w=1 (note this is mostly just me playing around until it worked and needs some more love to be production ready :) )

With regards to stopwords, that is very much outside the scope of this PR. Also, at least on the local dev environment, that stopwords file is empty :P Not sure if this is the case for prod, though. |
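As a rough standard-library approximation of what the ICU Folding Filter does per token: case fold, apply compatibility decomposition (NFKD), drop the combining marks, and recompose. This is only a sketch of the concept; real ICU folding covers far more (width foldings, Hangul, etc.):

```python
import unicodedata

def fold(token: str) -> str:
    """Approximate ICU folding: case fold, NFKD-decompose,
    strip combining marks, recompose to NFC."""
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

assert fold("Éric") == "eric"
assert fold("Brontë") == "bronte"
assert fold("Ĳsland") == "ijsland"  # NFKD also expands compatibility ligatures
```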
(force-pushed from b8849aa to 6fdb1c4)
|
@cdrini Cool! Let me know when you've got a revised (or different) PR that you'd like reviewed.

@mekarpeles It's probably obvious, but even though this is "approved" I don't think it should be merged in its current state. Also, on a related note, any info on a staging server? (asked above, but I pinged the wrong mek)
|
|
@cdrini let's bring this up for our Tuesday call to see what next steps are. |
|
What is needed to make some progress on this? Can we make a list of the blockers? |
|
@tfmorris are there any updates re: getting solr up to date for this merge? |
|
Sorry for the lack of progress on this. I'll try to get to it in the next few days. |
|
@cdrini @tfmorris @hornc @mekarpeles Meanwhile, without the umlaut, both |
|
cc: @seabelis (so you are aware this fix-in-progress exists) |
|
For clarity, this is blocked by #1067 . We have the fix, but no way to get it into production right now. |
|
PR #2246 includes support for the ICUFoldingFilter that I mentioned. |
|
I'm going to go ahead and reject this; it's been open for a very long time, and this isn't the way I'd do it now, anyways. I can always look at the closed PR for reference :) |
|
Re-opening as a result of discussion on #3290 (see #3290 (comment) ). We can use ASCII Folding as a temporary patch until we update to solr 8 and use the ICUFolding filter.
(force-pushed from 6fdb1c4 to 8b00120)
|
As I mentioned at least a couple of times in 2017, this isn't an adequate solution. The real solution can be found in c7026ff and is only a few lines. Since the bug has been open for over a decade, let's fix it right. |
Meant to address #178

Do note I'm not too experienced with solr, so please review carefully :)
The following fields will now map any ascii-able (see below) characters down to ascii before indexing and querying:
Here are the ascii mappings that will be applied: https://www.apt-browse.org/browse/ubuntu/trusty/universe/all/solr-common/3.6.2+dfsg-2/file/etc/solr/conf/mapping-FoldToASCII.txt (this file might not be exactly the one we use depending on what version is in solr, but should be similar).
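For a sense of what's in that file, Solr `MappingCharFilter` rules look like the lines below (these particular entries are illustrative excerpts of the format; the real file covers thousands of codepoints, with the comments giving the character names):

```
# À  [LATIN CAPITAL LETTER A WITH GRAVE]
"\u00C0" => "A"

# é  [LATIN SMALL LETTER E WITH ACUTE]
"\u00E9" => "e"
```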
Note this will replace whatever file is currently at
`/etc/solr/conf/schema.xml` with a symlink to the schema file in `$OL_ROOT/conf/solr/conf/schema.xml`.

Note: to fully reindex solr from the vagrant environment:
```sh
sudo /etc/init.d/tomcat6 restart  # restart solr's server
sudo -u vagrant make reindex-solr
```