new_audit(blocked-from-indexing): page is blocked from indexing #3657
brendankenny merged 9 commits into GoogleChrome:master
patrickhulce left a comment
maybe I reviewed too early 😄
seems like there's some weirdness around the user agent bit?
    return false;
  }

  const date = Date.parse(parts[1]);
seems like we would want to do a parts.slice(1).join(':'), no? losing the time and timezone
Oh, good catch! Tests didn't catch that because Date.parse is very forgiving.
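To illustrate the point above, here is a minimal sketch of how splitting on every colon truncates the date, and how rejoining the tail restores it. The helper name `parseUnavailableAfter` is hypothetical and is not taken from the PR.

```javascript
// Hypothetical helper for illustration only; not the code from this PR.
function parseUnavailableAfter(directive) {
  const parts = directive.split(':');
  // Rejoin everything after the first colon so the time and timezone survive.
  const rawDate = parts.slice(1).join(':').trim();
  return {key: parts[0].trim(), rawDate};
}

const directive = 'unavailable_after: 25 Jun 2007 15:00:00 PST';

// Taking only parts[1] stops at the first colon inside the time.
const naive = directive.split(':')[1]; // ' 25 Jun 2007 15'

// Rejoining the tail keeps the full timestamp.
const fixed = parseUnavailableAfter(directive);
// fixed.rawDate === '25 Jun 2007 15:00:00 PST'
```

Because `Date.parse` accepts many partial formats, the truncated string can still parse to some date, which is why the tests did not flag the problem.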
 */
function hasUA(directives) {
  const parts = directives.split(':');
  return parts.length > 1 && parts[0] !== UNAVAILABLE_AFTER;
should it be looking for a GOOGLEBOT_USER_AGENT const maybe?
let's also rename this then, if it returns false when a UA is specified how about hasNoUserAgent or something
}

mainResource.responseHeaders
  .filter(h => h.name.toLowerCase() === ROBOTS_HEADER && !hasUA(h.value) &&
so to be clear, we're looking for the robots header that has a user agent specified and is specified to block?
seems like this might miss a few cases, but maybe I'm misunderstanding
can there not be multiple directives in the header?
we are looking for a robots header that doesn't have a UA specified (we don't support UA-specific headers) and that blocks indexing
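The filter described above could look roughly like the sketch below. `ROBOTS_HEADER` and `UNAVAILABLE_AFTER` appear in the diff; `hasUserAgent`, `isDirectiveBlocking`, `findBlockingHeaders`, and the `BLOCKING_DIRECTIVES` list are assumptions made for illustration, not the audit's actual code. Splitting directives on commas is a simplification that would break on date formats containing a comma.

```javascript
const ROBOTS_HEADER = 'x-robots-tag';
const UNAVAILABLE_AFTER = 'unavailable_after';
// Assumption: the directive names treated as blocking in this sketch.
const BLOCKING_DIRECTIVES = ['noindex', 'none'];

// True when the header value is scoped to one crawler, e.g. `googlebot: noindex`.
function hasUserAgent(value) {
  const parts = value.split(':');
  return parts.length > 1 && parts[0].trim().toLowerCase() !== UNAVAILABLE_AFTER;
}

// True when a single directive prevents the page from being indexed.
function isDirectiveBlocking(directive) {
  const [key, ...rest] = directive.split(':');
  const name = key.trim().toLowerCase();
  if (BLOCKING_DIRECTIVES.includes(name)) return true;
  if (name !== UNAVAILABLE_AFTER) return false;
  // Rejoin the tail so the time of day is not lost, then compare with now.
  const date = Date.parse(rest.join(':').trim());
  return !Number.isNaN(date) && date < Date.now();
}

// Keeps only generic (non UA-specific) robots headers that block indexing.
function findBlockingHeaders(responseHeaders) {
  return responseHeaders.filter(h =>
    h.name.toLowerCase() === ROBOTS_HEADER &&
    !hasUserAgent(h.value) &&
    h.value.split(',').some(isDirectiveBlocking));
}
```

With this shape, a `googlebot: noindex` header is skipped, while a plain `noindex` header is reported.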
@patrickhulce Thank you for the review! With all the tests I'm pretty sure it's working as intended, but I made the We want to ignore all user agent specific tags (we are only looking for I left an additional comment. Please let me know if that makes sense to you.
patrickhulce left a comment
ah ok makes a lot more sense now sorry for my confusion :)
}

/**
 * Returns false if robots header specifies user agent (e.g. `googlebot: noindex`)
Ah, I see. Is this comment a typo then? Returns *true* if robots header specifies a user agent?
Right, I forgot to update both comments. Thanks!
}

/**
 * Returns false if any of provided directives blocks page from being indexed
same here, doesn't this return true when any of the directives blocks the page from being indexed?
it('ignores UA specific directives', () => {
  const mainResource = {
    responseHeaders: [
      {name: 'x-robots-tag', value: 'googlebot: unavailable_after: 25 Jun 2007 15:00:00 PST'},
ah ok, I was confused about how multiple user agent + a default would be expressed. I didn't realize it'd be duplicate headers rather than a csv
would you mind adding a default value here that's valid just for future readers
i.e.
responseHeaders: [
{name: 'x-robots-tag', value: 'googlebot: unavailable_after: 25 Jun 2007 15:00:00 PST'},
{name: 'x-robots-tag', value: 'unavailable_after: 25 Jun 2027 15:00:00 PST'},
]
rviscomi left a comment
just one suggestion, otherwise LGTM 👍
 * @returns {boolean}
 */
function isUnavailable(directive) {
  const parts = directive.split(':');
Sometimes I find it easier to use array destructuring in cases like this:
const [key, value] = directive.split(':');
Yeah, I agree that'd be much more elegant, but in this case it won't work:
const [key, value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':');
value in this case would be 12 Jun 2017 12 instead of 12 Jun 2017 12:30:00. I could do:
const [key, ...value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':');
But then, value is an array and I still have to .join(':') it, so not much different from the current solution :(
Would split(':', 1) resolve your concern?
TIL about second parameter of .split!
This gives me access to unavailable_after but doesn't give me the date:
const [key, value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':', 1);
value will be empty. Am I missing something here? 🤔
Yeah you're right, it doesn't do what I thought it would. Carry on!
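As a short illustration of the exchange above: the second argument to `split` caps the number of returned pieces and drops the remainder rather than keeping it in the last piece. The `splitOnce` helper below is hypothetical and only shows one possible way to split on the first separator; it was not proposed in this thread.

```javascript
const directive = 'unavailable_after: 12 Jun 2017 12:30:00';

// A limit of 1 returns a single-element array; the rest of the string is dropped.
const [key, value] = directive.split(':', 1);
// key === 'unavailable_after', value === undefined

// Hypothetical alternative: split on the first occurrence of the separator only.
function splitOnce(str, sep) {
  const i = str.indexOf(sep);
  return i === -1 ? [str] : [str.slice(0, i), str.slice(i + sep.length)];
}

const [name, rest] = splitOnce(directive, ':');
// name === 'unavailable_after', rest === ' 12 Jun 2017 12:30:00'
```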
@patrickhulce I've addressed your comments 👍 PTAL
5bb9d4f to d1c2667
static get meta() {
  return {
    name: 'is-crawlable',
    description: 'Page isn’t blocked from indexing',
I'd prefer to word this more affirmatively (i.e. Page can be indexed or Page is indexable), but I'm guessing we can't because it'd be misleading and there are many other ways a page could be prevented from indexing?
@brendankenny there are smokehouse server changes FYI if you wanted to review :)
brendankenny left a comment
whoops, hit submit too soon. Just looking at the server changes, a suggestion and a request :)
extraHeaders = Array.isArray(extraHeaders) ? extraHeaders : [extraHeaders];

extraHeaders.forEach(header => {
  const parts = header.split(':');
I actually would prefer const [key, ...value], but that's just down to preference.
You could also do header.split(/:(.+)/);, which should give the correct split (captured groups also appear in the resulting array)
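A quick sketch of the regex variant mentioned above, using a made-up header string. Because the pattern contains a capturing group, the captured text is spliced into the result, and since the match runs to the end of the string, a trailing empty string is also produced.

```javascript
// Example header string chosen for illustration.
const header = 'x-robots-tag: unavailable_after: 25 Jun 2007 15:00:00 PST';

// The regex matches the first colon plus everything after it; the captured
// group (everything after that colon) appears in the resulting array.
const pieces = header.split(/:(.+)/);
// ['x-robots-tag', ' unavailable_after: 25 Jun 2007 15:00:00 PST', '']

const [headerName, headerValue] = pieces;

// With no colon present there is no match, so the value is undefined.
const [bareName, bareValue] = 'x-robots-tag'.split(/:(.+)/);
```

Destructuring only the first two elements sidesteps the trailing empty string.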

extraHeaders.forEach(header => {
  const parts = header.split(':');
  headers[parts[0]] = parts.slice(1).join(':');
this might be complete overkill, but can we make a set of allowed headers and only add to headers if found in there? We block hidden files and anything outside of the working directory, but you never know...
One can't be too careful!
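One way the safelist idea could be sketched is shown below. The set contents, the `HEADER_SAFELIST` name, and the `buildExtraHeaders` function are all assumptions for illustration; the list actually added to the smokehouse server is not shown in this thread. Unlike the snippet under review, this sketch also trims whitespace around names and values.

```javascript
// Hypothetical safelist; the real entries are not shown in this conversation.
const HEADER_SAFELIST = new Set(['x-robots-tag', 'link']);

// Builds a headers object from 'Name: value' strings, ignoring any header
// whose name is not on the safelist.
function buildExtraHeaders(extraHeaders) {
  const list = Array.isArray(extraHeaders) ? extraHeaders : [extraHeaders];
  const headers = {};

  list.forEach(header => {
    const [name, ...rest] = header.split(':');
    if (HEADER_SAFELIST.has(name.trim().toLowerCase())) {
      // Rejoin the tail so values that contain colons stay intact.
      headers[name.trim()] = rest.join(':').trim();
    }
  });

  return headers;
}

const result = buildExtraHeaders(['X-Robots-Tag: none', 'Set-Cookie: a=b']);
// result contains only the X-Robots-Tag entry
```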
@brendankenny header safelist added PTAL
brendankenny left a comment
thanks for your patience! LGTM
📃 🚫 🤖 🚼
@brendankenny thanks for merging 🙌
Closes #3182
Failing:

Passing:
