Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
196 changes: 104 additions & 92 deletions Readability.js
Original file line number Diff line number Diff line change
Expand Up @@ -118,6 +118,36 @@ Readability.prototype = {
this._fixRelativeUris(articleContent);
},

/**
* Iterate over a NodeList, which doesn't natively fully implement the Array
* interface.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeLists are iterable<node>.
This means they can be iterated over with for..of loops.
It could result in the following code:

function convertRelativeURIs(tagName, propName) {
    for(var elem of articleContent.getElementsByTagName(tagName)){
        var relativeURI = elem.getAttribute(propName);
        if (relativeURI != null)
            elem.setAttribute(propName, toAbsoluteURI(relativeURI));
    }
}

This would work in Firefox, but wouldn't in Node 0.10. It can be made to work in Node 0.12 and io.js with some experimental flags on I believe. The future (hopefully coming soon) is that ES6 will be deployed consistently everywhere, so it will eventually work everywhere.

I don't know what's best, but felt it's worth sharing options :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readability's code needs to run on iOS, so I suspect what you suggest is not an option (at least for now, sadly).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside of forEach used to be that the extra function calls produced overhead, so the performance is worse. I don't know if that's still (meaningfully) the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm always prepending test suite run calls with the time exec, I didn't notice any significant impact on performance. I think js engines are now optimizing this properly, and even if it doesn't perform as good I think more readable code is worth the (cheap) price to pay here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions calls are removed by the JIT compiler, especially if the collection type is stable.
If you notice a significant difference (which is unlikely), file a bug, that's the sort of things JS engine folks want to know about.
More on the topic http://rfrn.org/~shu/2013/03/20/two-reasons-functional-style-is-slow-in-spidermonkey.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions calls are removed by the JIT compiler, especially if the collection type is stable.

Which isn't the case here for real DOMs, right? :-)

I'm also confused because your link seems to imply that there really is still a perf penalty unless you heavily optimize the function(s).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guys, please note we're speaking milliseconds here. I think the argument of the impossibility to deal with DOM mutations within function based loops is much more worth spending our time.

*
* For convenience, the current object context is applied to the provided
* iterate function.
*
* @param NodeList nodeList The NodeList.
* @param Function fn The iterate function.
* @return void
*/
_forEachNode: function(nodeList, fn) {
return Array.prototype.forEach.call(nodeList, fn, this);
},

/**
* Iterate over a NodeList, return true if any of the provided iterate
* function calls returns true, false otherwise.
*
* For convenience, the current object context is applied to the
* provided iterate function.
*
* @param NodeList nodeList The NodeList.
* @param Function fn The iterate function.
* @return Boolean
*/
_someNode: function(nodeList, fn) {
return Array.prototype.some.call(nodeList, fn, this);
},

/**
* Converts each <a> and <img> uri in the given element to an absolute URI.
*
Expand Down Expand Up @@ -149,19 +179,18 @@ Readability.prototype = {

function convertRelativeURIs(tagName, propName) {
var elems = articleContent.getElementsByTagName(tagName);
for (var i = elems.length; --i >= 0;) {
var elem = elems[i];
this._forEachNode(elems, function(elem) {
var relativeURI = elem.getAttribute(propName);
if (relativeURI != null)
elems[i].setAttribute(propName, toAbsoluteURI(relativeURI));
}
elem.setAttribute(propName, toAbsoluteURI(relativeURI));
});
}

// Fix links.
convertRelativeURIs("a", "href");
convertRelativeURIs.call(this, "a", "href");

// Fix images.
convertRelativeURIs("img", "src");
convertRelativeURIs.call(this, "img", "src");
},

/**
Expand Down Expand Up @@ -217,19 +246,17 @@ Readability.prototype = {
var doc = this._doc;

// Remove all style tags in head
var styleTags = doc.getElementsByTagName("style");
for (var st = styleTags.length - 1; st >= 0; st -= 1) {
styleTags[st].parentNode.removeChild(styleTags[st]);
}
this._forEachNode(doc.getElementsByTagName("style"), function(styleNode) {
styleNode.parentNode.removeChild(styleNode);
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly here though, the issue I'd be worried about is that Array.forEach can't deal with removing items from the array while iterating. The reason this loop iterated backwards is to deal with NodeLists that are dynamic, ie when you call removeChild, the item will disappear from the doc.getElementsByTagName result, and therefore from the nodelist.

Your change does not handle that, so on compliant DOM implementations it would break this code. It doesn't with JSDOMParser, but I believe at least iOS will want to use Readability with a "real" DOM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var a = [1, 2, 3];
a.forEach(function(x, i) {
  if (x === 2)
    a.splice(i, 1)
});
console.log(a) // [1,3]

I agree this isn't the most easy code to deal with, but in my world still better than using traditional for loops (I've probably been too much into FP lately, trying to avoid mutable state changes as much as I can, or when I can't trying to limit side effects as much as possible — which is nearly impossible with the DOM :/)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you're probably right that dealing with a mutable DOM that way is probably not a good idea. I should just cancel this PR then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, no, actually the docs say that I am just wrong:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/forEach

The range of elements processed by forEach() is set before the first invocation of callback. Elements that are appended to the array after the call to forEach() begins will not be visited by callback. If the values of existing elements of the array are changed, the value passed to callback will be the value at the time forEach() visits them; elements that are deleted before being visited are not visited.

I don't think any of these loops care about added items?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any of these loops care about added items?

It doesn't seem so, though that means we won't ever be able to introduce such a behavior; on the other hand, appending items to a currently iterated collection is definitely asking for trouble imoe :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmpf. I thought this was all going to go OK, and then I realized we'll still need to fix the _setNodeTag stuff for non-JSDOMParser, which will involve reparenting stuff (ie, below, move everything from a font tag into a different span tag and then replace the font with the span). This is unfortunate because if there are nested tags, we will be adding + removing things, which will mess up the loop. For instance:

<font><font>Hello</font></font>

We'll hit the first font first with the forEach loop, which then means we'll reparent the second one, which might mean we won't iterate over the second node? Not sure... :-\

Tested in Firefox's devtools console:

>>> f = document.documentElement.appendChild(document.createElement('font'));
<font>
>>> f.appendChild(document.createElement('font'))
<font>
>>> document.getElementsByTagName("font")
HTMLCollection [ <font>, <font> ]
>>> document.getElementsByTagName("font").forEach
undefined
>>> Array.prototype.forEach.call(document.getElementsByTagName("font"), function(f) {
>>>   var x = document.createElement("span");
>>>   Array.prototype.forEach.call(f.childNodes, function(fc) {
>>>      x.appendChild(fc);
>>>   });
>>>   f.parentNode.replaceChild(x, f);
>>> });
undefined
>>> document.getElementsByTagName("font")
HTMLCollection [ <font> ]
>>> document.getElementsByTagName("font")[0].parentNode
<span>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn. Sounds so much complicated. Indirectly related, just out of curiosity: why do we want to replace font tags with spans? couldn't we just remove these font tags? Note that we currently keep the font tag attributes, which makes generated HTML looking a little silly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to avoid the styles of the readable page being messed up by the page's (font tag's) styles, I think. Like I said, we can leave it for a followup, it's not currently broken with this change.


if (doc.body) {
this._replaceBrs(doc.body);
}

var fonts = doc.getElementsByTagName("FONT");
for (var i = fonts.length; --i >=0;) {
this._setNodeTag(fonts[i], "SPAN");
}
this._forEachNode(doc.getElementsByTagName("font"), function(fontNode) {
this._setNodeTag(fontNode, "SPAN");
});
},

/**
Expand All @@ -255,9 +282,7 @@ Readability.prototype = {
* <div>foo<br>bar<p>abc</p></div>
*/
_replaceBrs: function (elem) {
var brs = elem.getElementsByTagName("br");
for (var i = 0; i < brs.length; i++) {
var br = brs[i];
this._forEachNode(elem.getElementsByTagName("br"), function(br) {
var next = br.nextSibling;

// Whether 2 or more <br> elements have been found and replaced with a
Expand Down Expand Up @@ -296,7 +321,7 @@ Readability.prototype = {
next = sibling;
}
}
}
});
},

_setNodeTag: function (node, tag) {
Expand Down Expand Up @@ -336,26 +361,21 @@ Readability.prototype = {
this._cleanConditionally(articleContent, "div");

// Remove extra paragraphs
var articleParagraphs = articleContent.getElementsByTagName('p');
for (var i = articleParagraphs.length - 1; i >= 0; i -= 1) {
var imgCount = articleParagraphs[i].getElementsByTagName('img').length;
var embedCount = articleParagraphs[i].getElementsByTagName('embed').length;
var objectCount = articleParagraphs[i].getElementsByTagName('object').length;

if (imgCount === 0 &&
embedCount === 0 &&
objectCount === 0 &&
this._getInnerText(articleParagraphs[i], false) === '')
articleParagraphs[i].parentNode.removeChild(articleParagraphs[i]);
}
this._forEachNode(articleContent.getElementsByTagName('p'), function(paragraph) {
var imgCount = paragraph.getElementsByTagName('img').length;
var embedCount = paragraph.getElementsByTagName('embed').length;
var objectCount = paragraph.getElementsByTagName('object').length;
var totalCount = imgCount + embedCount + objectCount;

if (totalCount === 0 && !this._getInnerText(paragraph, false))
paragraph.parentNode.removeChild(paragraph);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmpf. All of this is pretty inefficient, but that can be another followup...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh. You can't believe the number of wtf-per-minute I've been producing while initially the code. We'll address all this iteratively, if you don't mind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it was the same on my side. And yes, iteratively for sure. See also https://twitter.com/mleibovic/status/578676569582907392 for the state of some of the people here...

});

var brs = articleContent.getElementsByTagName("BR");
for (var i = brs.length; --i >= 0;) {
var br = brs[i];
this._forEachNode(articleContent.getElementsByTagName("br"), function(br) {
var next = this._nextElement(br.nextSibling);
if (next && next.tagName == "P")
br.parentNode.removeChild(br);
}
});
},

/**
Expand Down Expand Up @@ -522,16 +542,15 @@ Readability.prototype = {
elementsToScore.push(node);
} else {
// EXPERIMENTAL
for (var i = 0, il = node.childNodes.length; i < il; i += 1) {
var childNode = node.childNodes[i];
this._forEachNode(node.childNodes, function(childNode) {
if (childNode.nodeType === Node.TEXT_NODE) {
var p = doc.createElement('p');
p.textContent = childNode.textContent;
p.style.display = 'inline';
p.className = 'readability-styled';
node.replaceChild(p, childNode);
}
}
});
}
}
node = this._getNextNode(node);
Expand All @@ -544,17 +563,17 @@ Readability.prototype = {
* A score is determined by things like number of commas, class names, etc. Maybe eventually link density.
**/
var candidates = [];
for (var pt = 0; pt < elementsToScore.length; pt += 1) {
var parentNode = elementsToScore[pt].parentNode;
this._forEachNode(elementsToScore, function(elementToScore) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, here's something which I've been wondering about... are you trying to be consistent for its own sake? Because you could just call elementsToScore.forEach(...), right? Or update _forEachNode to check Array.isArray before doing the .call thing...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or update _forEachNode to check Array.isArray before doing the .call thing...

Would this be to address a possible performance penalty? Because applying the method from the prototype on an array works exactly the same way, and has nearly no impact on performances…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think I was just confused by the fact that we're using this helper function all the time, which says it's there to make things work with NodeList, but we use it for other things, too... :-)

I don't really mind either way, up to you!

var parentNode = elementToScore.parentNode;
var grandParentNode = parentNode ? parentNode.parentNode : null;
var innerText = this._getInnerText(elementsToScore[pt]);
var innerText = this._getInnerText(elementToScore);

if (!parentNode || typeof(parentNode.tagName) === 'undefined')
continue;
return;

// If this paragraph is less than 25 characters, don't even count it.
if (innerText.length < 25)
continue;
return;

// Initialize readability data for the parent.
if (typeof parentNode.readability === 'undefined') {
Expand Down Expand Up @@ -586,7 +605,7 @@ Readability.prototype = {

if (grandParentNode)
grandParentNode.readability.contentScore += contentScore / 2;
}
});

// After we've calculated scores, loop through all of the possible
// candidate nodes we found and find the one with the highest score.
Expand Down Expand Up @@ -797,7 +816,7 @@ Readability.prototype = {

/**
* Attempts to get excerpt and byline metadata for the article.
*
*
* @return Object with optional "excerpt" and "byline" properties
*/
_getArticleMetadata: function() {
Expand All @@ -813,14 +832,13 @@ Readability.prototype = {
var propertyPattern = /^\s*og\s*:\s*description\s*$/gi;

// Find description tags.
for (var i = 0; i < metaElements.length; i++) {
var element = metaElements[i];
this._forEachNode(metaElements, function(element) {
var elementName = element.getAttribute("name");
var elementProperty = element.getAttribute("property");

if (elementName === "author") {
metadata.byline = element.getAttribute("content");
continue;
return;
}

var name = null;
Expand All @@ -839,7 +857,7 @@ Readability.prototype = {
values[name] = content.trim();
}
}
}
});

if ("description" in values) {
metadata.excerpt = values["description"];
Expand All @@ -860,14 +878,13 @@ Readability.prototype = {
* @param Element
**/
_removeScripts: function(doc) {
var scripts = doc.getElementsByTagName('script');
for (var i = scripts.length - 1; i >= 0; i -= 1) {
scripts[i].nodeValue="";
scripts[i].removeAttribute('src');
this._forEachNode(doc.getElementsByTagName('script'), function(scriptNode) {
scriptNode.nodeValue = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we set nodeValue at all, and why we nullcheck the element's parentNode... Well, best leave that as-is for now, and file another followup, please? :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #57

scriptNode.removeAttribute('src');

if (scripts[i].parentNode)
scripts[i].parentNode.removeChild(scripts[i]);
}
if (scriptNode.parentNode)
scriptNode.parentNode.removeChild(scriptNode);
});
},

/**
Expand All @@ -877,49 +894,42 @@ Readability.prototype = {
*
* @param Element
**/
_hasSinglePInsideElement: function(e) {
_hasSinglePInsideElement: function(element) {
// There should be exactly 1 element child which is a P:
if (e.children.length != 1 || e.firstElementChild.tagName !== "P") {
if (element.children.length != 1 || element.firstElementChild.tagName !== "P") {
return false;
}
// And there should be no text nodes with real content
var childNodes = e.childNodes;
for (var i = childNodes.length; --i >= 0;) {
var node = childNodes[i];
if (node.nodeType == Node.TEXT_NODE &&
this.REGEXPS.hasContent.test(node.textContent)) {
return false;
}
}

return true;
// And there should be no text nodes with real content
return !this._someNode(element.childNodes, function(node) {
return node.nodeType === Node.TEXT_NODE &&
this.REGEXPS.hasContent.test(node.textContent);
});
},

/**
* Determine whether element has any children block level elements.
*
* @param Element
*/
_hasChildBlockElement: function (e) {
var length = e.children.length;
for (var i = 0; i < length; i++) {
var child = e.children[i];
if (this.DIV_TO_P_ELEMS.indexOf(child.tagName) !== -1 || this._hasChildBlockElement(child))
return true;
}
return false;
_hasChildBlockElement: function (element) {
return this._someNode(element.childNodes, function(node) {
return this.DIV_TO_P_ELEMS.indexOf(node.tagName) !== -1 ||
this._hasChildBlockElement(node);
});
},

/**
* Get the inner text of a node - cross browser compatibly.
* This also strips out any excess whitespace to be found.
*
* @param Element
* @param Boolean normalizeSpaces (default: true)
* @return string
**/
_getInnerText: function(e, normalizeSpaces) {
var textContent = e.textContent.trim();
normalizeSpaces = (typeof normalizeSpaces === 'undefined') ? true : normalizeSpaces;
var textContent = e.textContent.trim();

if (normalizeSpaces) {
return textContent.replace(this.REGEXPS.normalize, " ");
Expand Down Expand Up @@ -978,14 +988,17 @@ Readability.prototype = {
* @param Element
* @return number (float)
**/
_getLinkDensity: function(e) {
var links = e.getElementsByTagName("a");
var textLength = this._getInnerText(e).length;
_getLinkDensity: function(element) {
var textLength = this._getInnerText(element).length;
if (textLength === 0)
return;

var linkLength = 0;

for (var i = 0, il = links.length; i < il; i += 1) {
linkLength += this._getInnerText(links[i]).length;
}
// XXX implement _reduceNodeList?
this._forEachNode(element.getElementsByTagName("a"), function(linkNode) {
linkLength += this._getInnerText(linkNode).length;
});

return linkLength / textLength;
},
Expand Down Expand Up @@ -1398,28 +1411,27 @@ Readability.prototype = {
* @return void
**/
_clean: function(e, tag) {
var targetList = e.getElementsByTagName(tag);
var isEmbed = (tag === 'object' || tag === 'embed');

for (var y = targetList.length - 1; y >= 0; y -= 1) {
this._forEachNode(e.getElementsByTagName(tag), function(element) {
// Allow youtube and vimeo videos through as people usually want to see those.
if (isEmbed) {
var attributeValues = "";
for (var i = 0, il = targetList[y].attributes.length; i < il; i += 1) {
attributeValues += targetList[y].attributes[i].value + '|';
for (var i = 0, il = element.attributes.length; i < il; i += 1) {
attributeValues += element.attributes[i].value + '|';
}

// First, check the elements attributes to see if any of them contain youtube or vimeo
if (this.REGEXPS.videos.test(attributeValues))
continue;
return;

// Then check the elements inside this element for the same.
if (this.REGEXPS.videos.test(targetList[y].innerHTML))
continue;
if (this.REGEXPS.videos.test(element.innerHTML))
return;
}

targetList[y].parentNode.removeChild(targetList[y]);
}
element.parentNode.removeChild(element);
});
},

/**
Expand Down Expand Up @@ -1571,7 +1583,7 @@ Readability.prototype = {
if (!metadata.excerpt) {
var paragraphs = articleContent.getElementsByTagName("p");
if (paragraphs.length > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, can you file a followup issue to make this less dumb? If we only care about the first P, we should just do our own iteration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why this code is a problem. Hints?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets all the paragraphs (walks the entire tree), but then we only really want the first one. It could just stop looking after the first paragraph, ie the equivalent of querySelector. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #60

metadata.excerpt = paragraphs[0].textContent;
metadata.excerpt = paragraphs[0].textContent.trim();
}
}

Expand Down
5 changes: 5 additions & 0 deletions test/test-pages/basic-tags-cleaning/expected-metadata.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
{
"title": "Basic tag cleaning test",
"byline": null,
"excerpt": "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod\n tempor incididunt ut labore et dolore magna aliqua."
}
Loading