fix: ensure short strings of legitimate content are not excluded#867
Merged
gijsk merged 9 commits intomozilla:mainfrom May 20, 2024
Merged
fix: ensure short strings of legitimate content are not excluded#867gijsk merged 9 commits intomozilla:mainfrom
gijsk merged 9 commits intomozilla:mainfrom
Conversation
gijsk
requested changes
May 17, 2024
Contributor
gijsk
left a comment
There was a problem hiding this comment.
Thanks! Just one nit, but I think this otherwise looks good. We really appreciated the thorough evaluation. Like you, we feel it's a balancing act in terms of the impact, but we think it's generally a positive change so let's roll with it. :-)
6e44d6b to
f418741
Compare
mislav
added a commit
to mislav/go-readability
that referenced
this pull request
Jun 19, 2025
This ports "fix: ensure short strings of legitimate content are not excluded" (mozilla/readability#867)
mislav
added a commit
to mislav/go-readability
that referenced
this pull request
Jun 23, 2025
This ports "fix: ensure short strings of legitimate content are not excluded" (mozilla/readability#867)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR changes one of the
haveToRemovechecks to allow for short paragraphs, such as the written dialog in the linked issue, by addinglinkDensityto the primary "short content" check. The rationale being that short strings which don't contain any links are likely to be text the user would want to read, provided that the initial preprocessing has already removed the bulk of the page elements.I regenerated all test cases to check for regressions and added a couple of new checks to deal with those:
adWordsandloadingWordsregexes to help identify ad blocks and loading indicators.textDensity+ image count to exclude elements without useful content.Test case changes:
citylab-1now includes the published time in the Readability content.ehow-1lost an extraneous "Other People Are Reading" header but gained an extraneous "Found This Helpful" headerehow-2lost its "Other People Are Reading" header and gained the word "Save" previously used for a buttonengadgetnow shows the base price and score originally present in the product reviewfirefox-nightly-blogno longer shows "Leave a Reply"mercurialnow correctly shows a previously excluded code snippets and commandsqqnow shows the page header with published datetoc-missingnow incorrectly shows "Interactive Editor"wikipedianow shows image captions wrapped in<div><p>...</p></div>The changes which are definitely regressions seem like an acceptable trade for the improvements gained. I would like some input on whether the
adWordsandloadingWordsregexes are acceptable though. They are scoped so that they will only produce a match against a node's entireinnerTextstring, so it's unlikely to impact real content, but it still feels like a slippery slope.Closes #861