Description
Escaping (escape, escapeURIComponent) and unescaping (unescape, unescapeURIComponent) work by interpreting the source String as binary and forcing the encoding onto the output string. However, this breaks completely for encodings that do not represent ASCII characters as single bytes, such as UTF-16. As far as I'm aware, all of this applies to every variant of UTF-16 and UTF-32.
For example, escaping does percent-encode each octet correctly, but the resulting characters are not encoded in UTF-16; instead, the encoding is forced onto the raw ASCII bytes:
"\uFEFF".encode(Encoding::UTF_16LE).bytes
=> [255, 254]
CGI.escape_uri_component("\uFEFF".encode(Encoding::UTF_16LE)).bytes
=> [37, 70, 70, 37, 70, 69] # This is %FF%FE in ASCII
CGI.escape_uri_component("\uFEFF".encode(Encoding::UTF_16LE))
=> "\u4625\u2546\u4546" # But this is just 3 unrelated characters!
CGI.escape_uri_component("\uFEFF".encode(Encoding::UTF_16LE)).encoding
=> #<Encoding:UTF-16LE>

On the other hand, unescaping tries to interpret sequential triplets of bytes as a percent-encoded octet, but with the characters in the string all being multibyte, there will always be extra bytes in between, making unescaping completely impossible:
"%FE%FF".encode(Encoding::UTF_16LE).bytes
=> [37, 0, 70, 0, 69, 0, 37, 0, 70, 0, 70, 0]
CGI.unescape_uri_component("%FE%FF".encode(Encoding::UTF_16LE))
=> "%\u0000F\u0000E\u0000%\u0000F\u0000F\u0000" # This is in UTF-8, hence the extra NULs
CGI.unescape_uri_component("%FE%FF".encode(Encoding::UTF_16LE), Encoding::UTF_16LE)
=> "%FE%FF" # This is UTF-16, but not very unescaped

But it may work by accident instead:
what = "┵〥㔷┴䔥㐵┴㐡".encode(Encoding::UTF_16BE)
=> "\u2535\u3025\u3537\u2534\u4525\u3435\u2534\u3421"
CGI.unescapeURIComponent(what)
=> "PWNED!"

Round-tripping appears to work, but mainly by virtue of the methods being inverses of each other. In reality, the "escaped" string is definitely not valid:
CGI.unescape_uri_component(CGI.escapeURIComponent("A %$\u1234".encode(Encoding::UTF_16LE)), Encoding::UTF_16LE)
=> "A %$\u1234"
CGI.escapeURIComponent(" %$\u1234".encode(Encoding::UTF_16LE))
=> "\u3225\u2530\u3030\u3225\u2535\u3030\u3225\u2534\u3030\u2534\u3231"
CGI.escapeURIComponent(" %$\u1234".encode(Encoding::UTF_16LE)).encode(Encoding::UTF_8)
=> "㈥┰〰㈥┵〰㈥┴〰┴㈱"

RFC 3986 specifies that data needs to be encoded as a sequence of characters, which in turn can use whatever encoding is suitable. So this method of escaping/unescaping is definitely wrong.
I believe that this issue completely breaks interoperability with wchar_t APIs.
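To make the expected behaviour concrete, here is a minimal sketch of character-level escaping and unescaping for wide encodings. `escape_wide` and `unescape_wide` are hypothetical helpers, not part of CGI, and they assume one particular reading of the above: the percent-encoded octets are the bytes of the string in its own encoding, and the escaped text is a sequence of ASCII characters expressed in that same encoding.

```ruby
# Hypothetical helpers for illustration only; these are not CGI methods.
# Bytes of the RFC 3986 "unreserved" characters, which are left unescaped.
UNRESERVED_BYTES =
  [*("A".."Z"), *("a".."z"), *("0".."9"), "-", ".", "_", "~"].join.bytes.freeze

# Percent-encode the octets of str, then transcode the resulting ASCII text
# into str's own encoding instead of forcing the encoding onto raw bytes.
def escape_wide(str)
  ascii = +""
  str.each_byte do |b|
    ascii << (UNRESERVED_BYTES.include?(b) ? b.chr : format("%%%02X", b))
  end
  ascii.encode(str.encoding)
end

# Transcode the escaped text to ASCII first (this raises if it contains
# non-ASCII characters), decode the %XX triplets into octets, and only then
# label the octets with the original encoding.
def unescape_wide(str)
  ascii = str.encode(Encoding::US_ASCII)
  octets = ascii.b.gsub(/%(\h\h)/) { [$1.hex].pack("C") }
  octets.force_encoding(str.encoding)
end

bom = "\uFEFF".encode(Encoding::UTF_16LE)
escaped = escape_wide(bom)          # the characters %FF%FE, stored as UTF-16LE
restored = unescape_wide(escaped)   # the original UTF-16LE string again
```

With this approach the escaped string stays readable in any encoding, because the transcoding step happens on the ASCII text rather than on the raw bytes.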
[un]escapeHTML seem to use a very different approach and do not appear to be susceptible to this:
CGI.escapeHTML("<html>".encode(Encoding::UTF_16LE))
=> "&lt;html&gt;"
CGI.unescapeHTML(CGI.escapeHTML("<html>".encode(Encoding::UTF_16LE)))
=> "<html>"
CGI.unescapeHTML(CGI.escapeHTML("<html>".encode(Encoding::UTF_16LE))).encoding
=> #<Encoding:UTF-16LE>

[un]escapeElement, on the other hand, use UTF-8 strings internally and do not work with wide encodings at all:
CGI.escapeElement("<html><body>".encode(Encoding::UTF_16LE), "body".encode(Encoding::UTF_16LE))
# <...>/lib/ruby/3.3.0/cgi/util.rb:187:in `escapeElement': incompatible character encodings: UTF-8 and UTF-16LE (Encoding::CompatibilityError)
CGI.unescapeElement("<html><body>".encode(Encoding::UTF_16LE), "body".encode(Encoding::UTF_16LE))
# <...>/lib/ruby/3.3.0/cgi/util.rb:207:in `unescapeElement': incompatible character encodings: UTF-8 and UTF-16LE (Encoding::CompatibilityError)

(This session was on Ruby 3.3, but the same happens in 3.4 and I believe that there is no difference in the extracted gem either.)
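Until this is fixed, one possible workaround for methods that assume ASCII-compatible input is to transcode to UTF-8, run the operation, and transcode the result back. The sketch below uses a hypothetical wrapper named `via_utf8`; the block body is a plain `gsub` standing in for the real call, which is an assumption made only to keep the example self-contained.

```ruby
# Hypothetical wrapper, not part of CGI: run a UTF-8-only operation on a
# string in a wide encoding and return the result in the original encoding.
def via_utf8(str)
  result = yield str.encode(Encoding::UTF_8)
  result.encode(str.encoding)
end

wide = "<html><body>".encode(Encoding::UTF_16LE)

# Stand-in for an element-escaping call; any operation that expects UTF-8
# input could go in the block instead.
escaped = via_utf8(wide) { |s| s.gsub("<body>", "&lt;body&gt;") }
```

With the cgi library loaded, the block could presumably call `CGI.escapeElement(s, "body")` instead of the stand-in `gsub`.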