
should we attempt to URL canonicalize patterns? #33

@wanderview

The URL() constructor does a number of things to canonicalize the input string. For example:

  • Illegal characters are automatically URL encoded. For most components this is percent encoding, but for hostname it appears to use IDNA encoding, etc.
  • Some characters, like {, are normally percent encoded by URL encoding, but URLPattern uses them as special characters.
  • For some protocols (http/https/etc) the pathname is required to consist of at least /. The slash is added if missing.
  • For some protocols (http/https/etc) the pathname is flattened to remove . and .. segments.
  • If the port matches the default port for a given protocol, then it is coerced to the empty string.
  • Probably many others...
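
To make the list above concrete, here is a small sketch of what the URL() constructor does to plain URL strings. The helper name `canonicalParts` and the example URLs are made up for illustration; only the standard `URL` API is used.

```typescript
// Illustrative only: `canonicalParts` is a hypothetical helper for this sketch.
function canonicalParts(input: string): { hostname: string; port: string; pathname: string } {
  const url = new URL(input);
  return { hostname: url.hostname, port: url.port, pathname: url.pathname };
}

// Non-ASCII path characters are percent encoded as UTF-8 bytes.
console.log(canonicalParts("https://example.com/caf\u00e9").pathname); // "/caf%C3%A9"

// Non-ASCII hostnames are IDNA (punycode) encoded instead.
console.log(canonicalParts("https://b\u00fccher.example/").hostname); // "xn--bcher-kva.example"

// "{" and "}" are percent encoded in a URL path, even though URLPattern treats "{" as syntax.
console.log(canonicalParts("https://example.com/{id}").pathname); // "/%7Bid%7D"

// A missing pathname becomes "/" for http/https.
console.log(canonicalParts("https://example.com").pathname); // "/"

// "." and ".." segments are flattened.
console.log(canonicalParts("https://example.com/a/./b/../c").pathname); // "/a/c"

// The default port for the protocol is coerced to the empty string.
console.log(canonicalParts("https://example.com:443/").port); // ""
```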

So the question is can we and should we apply these transformations to component patterns. My original intent was to try to do so, but I have identified a number of problems:

  1. Often the transformations are not safe within the custom regexp pattern groupings. For example, if you try to percent encode a unicode character within a regexp [a-z] character list it won't work correctly since each character in the percent encoding is considered independently. Similar difficulties arise with the .. and . flattening since those characters may appear within a regexp. In general we don't want to have to duplicate the regexp parser in URLPattern, so solving this seems quite difficult.
  2. A URLPattern may often not have a fixed protocol. This makes it difficult to apply transformations that are conditional on protocol.
  3. Developers may be confused that some characters are URL encoded but others are not.
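
Problem 1 can be sketched with a plain RegExp. The helper `naivelyEncode` below is hypothetical and is not anything URLPattern does; it just percent encodes non-ASCII characters with no awareness of regexp syntax, which is the kind of transformation that goes wrong inside a character list.

```typescript
// Hypothetical helper: percent encodes every non-ASCII character in a string,
// with no awareness of regexp syntax.
function naivelyEncode(source: string): string {
  return Array.from(source, (ch) =>
    ch.charCodeAt(0) > 0x7f ? encodeURIComponent(ch) : ch
  ).join("");
}

// Intended meaning: "/" followed by the single character "é".
const original = "^/[\u00e9]$";
const encoded = naivelyEncode(original);
const encodedRegexp = new RegExp(encoded);

console.log(encoded); // "^/[%C3%A9]$"

// The canonical form of "/é" no longer matches, because the character list
// matches exactly one of "%", "C", "3", "A", "9" rather than the sequence "%C3%A9".
console.log(encodedRegexp.test("/%C3%A9")); // false

// Meanwhile an unrelated single character now matches.
console.log(encodedRegexp.test("/C")); // true
```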

Therefore I'm leaning towards not applying any canonicalization or automatic encoding for patterns. For example, if a non-ascii character is included in the pattern we would throw. In contrast, however, URL values passed to test() and exec() would be fully URL canonicalized. Therefore developers would be required to write patterns to match canonical URLs, but we would not fully enforce or automatically help with that at URLPattern construction time.
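
As a rough sketch of that split, assuming a pathname-only matcher built on a plain RegExp: the pattern is rejected if it contains non-ASCII characters and is otherwise used as written, while the input URL is canonicalized by the URL() constructor before matching. The function `testPathname` is hypothetical and is not the URLPattern implementation or syntax.

```typescript
// Hypothetical sketch: the pattern is NOT canonicalized, the input URL is.
function testPathname(patternSource: string, input: string): boolean {
  // Assumption for this sketch: any non-ASCII character in the pattern is an error.
  if (/[^\x00-\x7F]/.test(patternSource)) {
    throw new TypeError("pattern must contain only ASCII characters");
  }
  // The input value goes through full URL canonicalization.
  const canonicalPathname = new URL(input).pathname;
  return new RegExp("^" + patternSource + "$").test(canonicalPathname);
}

// The developer writes the pattern against the canonical form of the URL.
console.log(testPathname("/caf%C3%A9", "https://example.com/caf\u00e9")); // true
console.log(testPathname("/a/c", "https://example.com/a/./b/../c")); // true

// A non-ASCII pattern is rejected up front instead of being silently encoded.
try {
  testPathname("/caf\u00e9", "https://example.com/caf\u00e9");
} catch (e) {
  console.log(e instanceof TypeError); // true
}
```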

I intend to implement the above to start, but if it becomes a problem we could fall back to a half measure. URL canonicalization could be applied to patterns, but only outside of custom regexp groups. A unicode character inside a custom regexp group would cause URLPattern to throw, but anywhere else in the pattern it would be automatically URL encoded. Of course, pattern special characters like { would still need to be exempted from encoding, so it would not be quite equivalent to URL constructor behavior.

Generally it will be easier to move from the "no canonicalization" approach to "canonicalize outside of regexp" without breaking existing patterns than to move in the opposite direction.

Thoughts?
