(For background see WICG/nav-speculation#259; in particular this is attempting to formalize the proposal at WICG/nav-speculation#259 (comment). cc @jeremyroman)
Problem
When constructing a URLPattern from a base URL, we have found it surprising how many components get inherited. The particularly problematic ones are search and hash. Consider code such as:
const pattern1 = new URLPattern("/wp-login.php", document.baseURI);
// or:
const pattern2 = new URLPattern({ pathname: "/wp-login.php", baseURL: document.baseURI });
if (!pattern1.test(userInput)) {
  // They're not targeting the login page for our blog. OK to proceed!
}
Can you spot the bug? There are actually two!
- If document.baseURI is something like https://example.com/, then this sort of check gets defeated by a userInput derived from a relative URL such as /wp-login.php?bypass or /wp-login.php#bypass.
- If document.baseURI is something like https://example.com/?utm_source=blah, then the pattern does not even match a userInput derived from plain /wp-login.php, so the check lets it through.
This is because these URLPattern instances inherited their search and hash components from the base URL (the empty string when the base URL has none), instead of defaulting to the wildcard "*":
const noBaseURL = new URLPattern("/wp-login.php");
console.assert(noBaseURL.search === "*");
console.assert(noBaseURL.hash === "*");
const withBaseURL1 = new URLPattern("/wp-login.php", "https://example.com/");
console.assert(withBaseURL1.search === "");
console.assert(withBaseURL1.hash === "");
const withBaseURL2 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah");
console.assert(withBaseURL2.search === "utm_source=blah");
console.assert(withBaseURL2.hash === "");
Non-path/search/hash cases
Another instance that some might find unexpected is the following:
const pattern = new URLPattern({ protocol: "https", baseURL: document.baseURI });
This pattern inherits all non-protocol components from the base URL, whereas some might expect that, since we overrode such an "early" component, we'd only get matching on protocol and not the rest.
This is especially problematic in environments where the base URL argument is implicit, e.g. speculation rules, service worker static routing, or any framework which attempts to specialize itself to "the current page", such as:
function addRouteHandlerForThisPage(urlPatternString, handler) {
  routerTable.set(new URLPattern(urlPatternString, document.baseURI), handler);
}
In such environments, you might provide a URL pattern such as
{
  "href_matches": { "protocol": "https" }
}
or
condition: {
  urlPattern: { protocol: "https" }
}
and expect this to match all https:// URLs. But instead, since it inherits all non-protocol components from the base URL, it essentially only matches the base URL itself.
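To make the implicit-base-URL case concrete, here is a rough model of the current inheritance for dictionary input. The helper name modelCurrentInheritance and the example base URL are assumptions for illustration. It uses the standard URL class rather than URLPattern itself, and it deliberately ignores details such as relative pathname resolution and escaping of pattern syntax.

```javascript
// Hypothetical helper: approximates how a dictionary input currently
// picks up every component it does not set from the base URL.
function modelCurrentInheritance(init, baseURL) {
  const base = new URL(baseURL);
  const fromBase = {
    protocol: base.protocol.slice(0, -1), // drop the trailing ":"
    username: base.username,
    password: base.password,
    hostname: base.hostname,
    port: base.port,
    pathname: base.pathname,
    search: base.search.slice(1), // drop the leading "?"
    hash: base.hash.slice(1), // drop the leading "#"
  };
  // Components given in the input win; everything else comes from the base.
  return { ...fromBase, ...init };
}

const modeled = modelCurrentInheritance(
  { protocol: "https" },
  "https://example.com/blog/post?utm_source=blah" // assumed current page
);
// modeled.hostname === "example.com"
// modeled.pathname === "/blog/post"
// modeled.search === "utm_source=blah"
```

In this model, the only thing { protocol: "https" } contributes is the protocol; every other component is pinned to the current page.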
Proposed solution
- Whenever a URLPattern input has a given component, do not (by default) inherit any components from the base URL that are "later" than that component. "Later" means mostly the usual order: protocol, hostname, port, pathname, search, hash. For example:
  - If the input has a protocol component, do not inherit any components
  - If the input has a pathname component, do not inherit the search and hash components
- For username and password, we define those to be "later" than protocol, hostname, and port (and define password to be "later" than username). For example:
  - If the input has a hostname component, do not inherit port, pathname, search, hash, username, and password
  - If the input has a username component, do not inherit pathname, search, hash, password
- Add an option, baseURLInheritance, which lets you control this behavior
  - The new default, per the above, is "auto"
  - "all" gives the current behavior
  - Maybe, you could also do an array of components to inherit, e.g. ["port", "search"]. We can leave that out for now minus compelling use cases.
Then:
const pattern1 = new URLPattern("/wp-login.php", "https://example.com/");
console.assert(pattern1.protocol === "https");
console.assert(pattern1.username === "");
console.assert(pattern1.password === "");
console.assert(pattern1.hostname === "example.com");
console.assert(pattern1.port === "");
console.assert(pattern1.pathname === "/wp-login.php");
console.assert(pattern1.search === "*");
console.assert(pattern1.hash === "*");
const pattern2 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah");
// Same as pattern1, including:
console.assert(pattern2.search === "*");
console.assert(pattern2.hash === "*");
const pattern3 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah#heading");
// Same as pattern1, including:
console.assert(pattern3.search === "*");
console.assert(pattern3.hash === "*");
const pattern4 = new URLPattern("/wp-login.php?user=foo", "https://example.com/");
// Same as pattern1 before search. Then:
console.assert(pattern4.search === "user=foo");
console.assert(pattern4.hash === "*");
const pattern5 = new URLPattern("/wp-login.php?user=foo#bar", "https://example.com/");
// Same as pattern1 before search. Then:
console.assert(pattern5.search === "user=foo");
console.assert(pattern5.hash === "bar");
const pattern6 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah#heading", { baseURLInheritance: "all" });
console.assert(pattern6.search === "utm_source=blah");
console.assert(pattern6.hash === "heading");
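The inheritance rule proposed above can be sketched as a small pure function. This is only one reading of the proposal, not spec text: the helper name inheritedComponents is hypothetical, and it assumes a single linear order in which username and password sit between port and pathname (the proposal does not say where password falls relative to pathname).

```javascript
// Assumed linear reading of the "later than" order from the proposal.
const COMPONENT_ORDER = [
  "protocol", "hostname", "port", "username",
  "password", "pathname", "search", "hash",
];

// Hypothetical helper: given the components present in the input, return
// the components that would be inherited from the base URL.
function inheritedComponents(presentComponents, mode = "auto") {
  const present = new Set(presentComponents);
  if (mode === "all") {
    // Current behavior: inherit everything the input does not set.
    return COMPONENT_ORDER.filter((c) => !present.has(c));
  }
  // "auto": inherit only components earlier than the earliest one present.
  const firstPresent = COMPONENT_ORDER.findIndex((c) => present.has(c));
  return firstPresent === -1
    ? [...COMPONENT_ORDER]
    : COMPONENT_ORDER.slice(0, firstPresent);
}

inheritedComponents(["pathname"]);
// ["protocol", "hostname", "port", "username", "password"]
inheritedComponents(["protocol"]);
// []
inheritedComponents(["pathname"], "all");
// ["protocol", "hostname", "port", "username", "password", "search", "hash"]
```

Any component that is neither present nor inherited would fall back to the wildcard, which is what gives pattern1 its "*" search and hash.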
Considerations
Compat
There might be some compat impact here. I suspect it is low, as we'd need to:
- Be using URLPattern
- ...with a base URL containing a given component
- ...and a pattern that does not contain that component
- ...executed against an input where it makes a difference
Usage of URLPattern is swinging between 0.02% and 0.05% of page views, so we're starting from a good place. We plan to add use counters to see how often the full conjunction above happens in the wild.
Alternatives considered
- We could make this new behavior opt-in instead of opt-out. I would strongly prefer to make it opt-out, because I think it's more sensible and less likely to lead to bugs, and because I think it's the best default for various web platform features, such as speculation rules or service worker scope matching, which hope to base themselves on URL patterns.
- We could just tell people to always write their patterns like "/wp-login.php?*#*". I think that would be a sad place to end up.
- We could restrict this logic to just pathname -> search -> hash. I think this is reasonable as those are the most problematic components. This would mean not solving the { protocol: "https" } problem, but maybe that's OK.
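For comparison, the "always write the wildcards yourself" alternative looks roughly like this in dictionary form. The helper withExplicitWildcards is hypothetical and only illustrates the boilerplate every caller would need under the current behavior; the base URL is an assumed example.

```javascript
// Hypothetical helper (not part of URLPattern): fill in explicit wildcards
// for search and hash so they are not inherited from the base URL.
function withExplicitWildcards(init) {
  // Values already set on the input take precedence over the wildcards.
  return { search: "*", hash: "*", ...init };
}

const explicitInit = withExplicitWildcards({
  pathname: "/wp-login.php",
  baseURL: "https://example.com/?utm_source=blah",
});

// Where URLPattern is available, the dictionary can be passed straight to it.
if (typeof URLPattern !== "undefined") {
  const loginPattern = new URLPattern(explicitInit);
  console.log(loginPattern.search, loginPattern.hash);
}
```

Every caller has to remember to do this, which is the ergonomic cost the proposed default is meant to remove.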
Server-side considerations
From my experience, most server-side uses of URLPattern are pathname-focused. They probably aren't overly impacted by this. Any confirmation from the server-side community would be helpful.
Who does the work?
Jeremy and I are happy to volunteer on the implementation/spec/test work. Jeremy is working on the use counter.