Base URL inheritance gives unintuitive results #179

@domenic

(For background see WICG/nav-speculation#259; in particular this is attempting to formalize the proposal at WICG/nav-speculation#259 (comment). cc @jeremyroman)

Problem

When constructing a URLPattern from a base URL, we have found it surprising how many components get inherited. The particularly problematic ones are search and hash. Consider code such as:

const pattern1 = new URLPattern("/wp-login.php", document.baseURI);

// or:

const pattern2 = new URLPattern({ pathname: "/wp-login.php", baseURL: document.baseURI });

if (!pattern1.test(userInput)) {
  // They're not targeting the login page for our blog. OK to proceed!
}

Can you spot the bug? There are actually two!

  • If document.baseURI is something like https://example.com/, then this sort of check gets defeated by a userInput derived from a relative URL such as /wp-login.php?bypass or /wp-login.php#bypass.
  • If document.baseURI is something like https://example.com/?utm_source=blah, then this sort of check fails even for userInput derived from /wp-login.php.

This is because these URLPattern instances inherit the search and hash components from the base URL, even when those components are the empty string:

const noBaseURL = new URLPattern("/wp-login.php");
console.assert(noBaseURL.search === "*");
console.assert(noBaseURL.hash === "*");

const withBaseURL1 = new URLPattern("/wp-login.php", "https://example.com/");
console.assert(withBaseURL1.search === "");
console.assert(withBaseURL1.hash === "");

const withBaseURL2 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah");
console.assert(withBaseURL2.search === "utm_source=blah");
console.assert(withBaseURL2.hash === "");

Non-path/search/hash cases

Another instance that some might find unexpected is the following:

const pattern = new URLPattern({ protocol: "https", baseURL: document.baseURI });

This pattern inherits all non-protocol components from the base URL, whereas some might expect that, since we overrode such an "early" component, we'd only get matching on protocol, not the rest.

This is especially problematic in environments where the base URL argument is implicit, e.g. speculation rules, service worker static routing, or any framework which attempts to specialize itself to "the current page", such as:

function addRouteHandlerForThisPage(urlPatternString, handler) {
  routerTable.set(new URLPattern(urlPatternString, document.baseURI), handler);
}

In such environments, you might provide a URL pattern such as

{
  "href_matches": { "protocol": "https" }
}

or

condition: {
  urlPattern: { protocol: "https" }
}

and expect this to match all https:// URLs. But instead, since it inherits all non-protocol components from the base URL, it essentially only matches the base URL itself.

Proposed solution

  • Whenever a URLPattern input has a given component, do not (by default) inherit any components from the base URL that are "later" than that component. "Later" means mostly the usual order: protocol, hostname, port, pathname, search, hash. For example:
    • If the input has a protocol component, do not inherit any components
    • If the input has a pathname component, do not inherit the search and hash components
  • For username and password, we define those to be "later" than protocol, hostname, and port (and define password to be "later" than username). For example:
    • If the input has a hostname component, do not inherit port, pathname, search, hash, username, and password
    • If the input has a username component, do not inherit pathname, search, hash, password
  • Add an option, baseURLInheritance, which lets you control this behavior
    • The new default, per the above, is "auto"
    • "all" gives the current behavior
    • Possibly, it could also accept an array of components to inherit, e.g. ["port", "search"]. We can leave that out for now, absent compelling use cases.
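
As a rough illustration of the rule in the bullets above, here is a minimal sketch that models it with plain objects rather than with URLPattern itself. The names ORDER, baseComponents, and resolveComponents are hypothetical, the single linear component order is just one reading of the bullets, and the sketch ignores relative-pathname resolution and pattern-syntax escaping:

```javascript
// One possible linear reading of "later": username and password sort after
// port (password after username) and before pathname. This is an assumption.
const ORDER = [
  "protocol", "hostname", "port", "username",
  "password", "pathname", "search", "hash",
];

// Split a base URL into component strings without their delimiters.
function baseComponents(baseURL) {
  const u = new URL(baseURL);
  return {
    protocol: u.protocol.replace(/:$/, ""),
    hostname: u.hostname,
    port: u.port,
    username: u.username,
    password: u.password,
    pathname: u.pathname,
    search: u.search.replace(/^\?/, ""),
    hash: u.hash.replace(/^#/, ""),
  };
}

// Hypothetical model of the proposed baseURLInheritance option.
// "auto": inherit only components earlier than the earliest one in `input`.
// "all": inherit every component that `input` does not specify.
function resolveComponents(input, baseURL, mode = "auto") {
  const base = baseComponents(baseURL);
  const specified = ORDER.filter((name) => name in input);
  const cutoff = mode === "all"
    ? ORDER.length
    : Math.min(ORDER.length, ...specified.map((name) => ORDER.indexOf(name)));

  const result = {};
  ORDER.forEach((name, index) => {
    if (name in input) {
      result[name] = input[name];   // taken from the pattern input
    } else if (index < cutoff) {
      result[name] = base[name];    // inherited from the base URL
    } else {
      result[name] = "*";           // not inherited: wildcard default
    }
  });
  return result;
}

const base = "https://example.com/?utm_source=blah#heading";
console.log(resolveComponents({ pathname: "/wp-login.php" }, base));
console.log(resolveComponents({ pathname: "/wp-login.php" }, base, "all"));
console.log(resolveComponents({ protocol: "https" }, base));
```

Under this model, an input with only a pathname keeps the base URL's origin but gets wildcard search and hash, and an input with only a protocol inherits nothing.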

Then:

const pattern1 = new URLPattern("/wp-login.php", "https://example.com/");
console.assert(pattern1.protocol === "https");
console.assert(pattern1.username === "");
console.assert(pattern1.password === "");
console.assert(pattern1.hostname === "example.com");
console.assert(pattern1.port === "");
console.assert(pattern1.pathname === "/wp-login.php");
console.assert(pattern1.search === "*");
console.assert(pattern1.hash === "*");

const pattern2 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah");
// Same as pattern1, including:
console.assert(pattern2.search === "*");
console.assert(pattern2.hash === "*");

const pattern3 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah#heading");
// Same as pattern1, including:
console.assert(pattern3.search === "*");
console.assert(pattern3.hash === "*");


const pattern4 = new URLPattern("/wp-login.php?user=foo", "https://example.com/");
// Same as pattern1 before search. Then:
console.assert(pattern4.search === "user=foo");
console.assert(pattern4.hash === "*");

const pattern5 = new URLPattern("/wp-login.php?user=foo#bar", "https://example.com/");
// Same as pattern1 before search. Then:
console.assert(pattern5.search === "user=foo");
console.assert(pattern5.hash === "bar");


const pattern6 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah#heading", { baseURLInheritance: "all" });
console.assert(pattern6.search === "utm_source=blah");
console.assert(pattern6.hash === "heading");

Considerations

Compat

There might be some compat impact here. I suspect it is low, as we'd need to:

  • Be using URLPattern
  • ...with a base URL containing a given component
  • ...and a pattern that does not contain that component
  • ...executed against an input where it makes a difference

"Using URLPattern" currently fluctuates between 0.02% and 0.05% of page views, so we're starting from a good place. We plan to add use counters to see how often this conjunction of conditions happens in the wild.

Alternatives considered

  • We could make this new behavior opt-in instead of opt-out. I would strongly prefer opt-out, for two reasons: I think it's more sensible and less likely to lead to bugs, and I think it's the best default for the various web platform features, such as speculation rules or service worker scope matching, which hope to base themselves on URL patterns.

  • We could just tell people to always write their patterns like "/wp-login.php?*#*". I think that would be a sad place to end up.

  • We could restrict this logic to just pathname -> search -> hash. I think this is reasonable as those are the most problematic components. This would mean not solving the { protocol: "https" } problem, but maybe that's OK.
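
To make the second alternative concrete, here is a minimal sketch of what "always write the wildcards" looks like if wrapped in a helper. It uses the dictionary form of the same idea; pathOnlyInit is a hypothetical name, not part of any URLPattern API, and the URLPattern call is guarded because not every runtime provides it:

```javascript
// Hypothetical helper: builds a URLPatternInit that spells out wildcard
// search and hash, so neither is taken from the base URL.
function pathOnlyInit(pathname, baseURL) {
  return { pathname, search: "*", hash: "*", baseURL };
}

const init = pathOnlyInit(
  "/wp-login.php",
  "https://example.com/?utm_source=blah"
);
console.log(init);

// Only exercise URLPattern where the runtime actually provides it.
if (typeof URLPattern !== "undefined") {
  const pattern = new URLPattern(init);
  console.log(pattern.search, pattern.hash);
  console.log(pattern.test("https://example.com/wp-login.php?bypass#x"));
}
```

Every call site would need to remember to go through a helper like this, which is the main reason this alternative seems unappealing as a default.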

Server-side considerations

From my experience, most server-side uses of URLPattern are pathname-focused. They probably aren't overly impacted by this. Any confirmation from the server-side community would be helpful.

Who does the work?

Jeremy and I are happy to volunteer on the implementation/spec/test work. Jeremy is working on the use counter.
