Backend test for utf8 encoding - Ready for Review/Merge @muxator#3737
muxator merged 11 commits into ether:develop
@muxator ready for review :)
    if (!req.directDatabaseAccess) {
      text = await fsp_readFile(destFile, "utf8");
      let bytelike = unescape(encodeURIComponent(text));
There was a problem hiding this comment.
Quick question without having studied this code yet: is the unescape()/escape() roundtrip avoidable?
I ask because those functions seem to be on their way to deprecation (source on MDN):
All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification.
Programmers should not use or assume the existence of these features and behaviours when writing new ECMAScript code.
There was a problem hiding this comment.
escape & its sister aren't going anywhere any time soon. This gives us a few years (easily) to focus on handling the characters gracefully inside of the editor.
RE roundtrip, absolutely you could catch non-processable characters and drop the import entirely. For me, in this PR, it was about making import work. Import is a small amount of processing handled by Etherpad instances and doing a roundtrip here is not computationally expensive. That said, a DDoS attack using this as a vector is not ideal. Pushing 4 GB via POST to the upload endpoint and doing the roundtrip would not be enjoyable. But Etherpad has much larger and more exploitable DDoS endpoints than this (think pad export URIs) so it's IMHO not a huge concern/worry.
If you want it to reject "characters" for the currently implemented charset from settings.json we can do that, but IMHO this serves its purpose and our users well. Imagine the computational power requirements of processing an entire document beyond what we already are doing here though? escape and unescape are, despite being deprecated, optimized. We can use that to our advantage instead of trying to bake in our own unoptimized character-by-character check.
cc @rhelmer as he might know the schedule for deprecation of escape and unescape.
Again, let me emphasize, this issue is much bigger than Etherpad. I'm defo a noob at charsets and character handling but it looks like browsers will need to work closely with editors et al over the coming years to get this support implemented properly. <3
Finally, I imagine there are better fixes, I spent 99.9% of my time on this issue drilling down to the root cause and trying to get a vague understanding of what is a relatively nuanced topic that frankly, I have very little interest in. I <3 Etherpad users but for me, it's 2020, charset drama should be out of the way already and I don't want to go too far down that rabbit hole :D
There was a problem hiding this comment.
RE roundtrip, absolutely you could...
Whoa. When one is aiming to sanitize input, doing an encode/decode roundtrip is actually a sound strategy. I am not thinking about performance right now.
My question was about why use two functions (encode/decodeURIComponent + unescape/escape) instead of a single one. I am sure there is a reason. Maybe this covers more cases? If so, what are they?
There was a problem hiding this comment.
Good question, I don't know the answer! See what happens if you just do decode. I encoded then decoded because I figured it'd let less noise through and also create a saner output. Also that's what the SO answer did: https://stackoverflow.com/a/2670211
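For readers following the question about why two functions are chained, here is a minimal sketch of what the construct from the diff produces. It is an illustration under assumptions, not code from this PR: the helper names toByteLikeLegacy, toByteLikeBuffer and fromByteLike are hypothetical, and the Buffer variant is only one possible way to get the same mapping without the Annex B functions. For well-formed strings the two should agree; they differ on lone surrogates, where encodeURIComponent throws a URIError.

```javascript
// Sketch only: the three helper names below are hypothetical, not Etherpad code.

// The construct from the diff: percent-encode the string as UTF-8, then let
// unescape() turn every %XX into the character with that code (0-255).
function toByteLikeLegacy(text) {
  return unescape(encodeURIComponent(text));
}

// The same mapping with Node's Buffer: encode as UTF-8, then read each byte
// back as one latin1 character.
function toByteLikeBuffer(text) {
  return Buffer.from(text, 'utf8').toString('latin1');
}

// The inverse direction: escape() percent-encodes the byte-like characters,
// and decodeURIComponent() decodes the result as UTF-8.
function fromByteLike(byteLike) {
  return decodeURIComponent(escape(byteLike));
}

const sample = 'h\u00e9llo \u{1F3C7}'; // "héllo " plus the horse racing emoji
const legacy = toByteLikeLegacy(sample);
const viaBuffer = toByteLikeBuffer(sample);

console.log(sample.length);                   // 8 UTF-16 code units
console.log(legacy.length);                   // 11, one character per UTF-8 byte
console.log(legacy === viaBuffer);            // true
console.log(fromByteLike(legacy) === sample); // true
```

So the pair is not redundant: encodeURIComponent does the UTF-8 encoding and unescape collapses the percent escapes into one character per byte.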
There was a problem hiding this comment.
escape & its sister aren't going anywhere any time soon. This gives us a few years (easily) to focus on handling the characters gracefully inside of the editor.
[...]
cc @rhelmer as he might know the schedule for deprecation of escape and unescape.
Hm. I agree that it is probably not going anywhere soon, let me research a bit more though and see what the recommended way to do this is from standards-y folks.
Nice work, @JohnMcLear, thanks! I'll have a look ASAP.
@muxator bump :D
Tonight
7dc6785 to bcf63cc
I am having a lot of emojons looking at this 🤣
bcf63cc to 7cc6404
Wow, still going through...
Are you testing character by character or something? :D
Change by change of the original PR.
7cc6404 to ee0e4d3
Ok I am done. I do not know how to use the review interface so I am posting here.
If we are safe with 3 I'll merge this PR right away. Thanks.
ee0e4d3 to 3964ee8
I stripped point 3 (import/export) from this PR and moved it to #3796; please comment there, @JohnMcLear. All the other changes (the utf8 tests and the change from GET to POST) are merged.
<3 looks good.
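As a rough illustration of why the tests moved from GET to POST, the sketch below measures how large a query string becomes when emoji-heavy text is sent as a URL parameter. The parameter names padID and text and the value 'test' are assumptions made for the example, and actual URL or header limits depend on the server and client in use, so no specific limit is implied here.

```javascript
// Sketch only: parameter names and values are assumptions for illustration.
const text = '\u{1F3C7}'.repeat(5000); // 5000 emoji, 4 UTF-8 bytes each

// As a GET request the text travels in the URL, where every UTF-8 byte of a
// non-ASCII character becomes a three-character %XX escape.
const asQuery = new URLSearchParams({ padID: 'test', text }).toString();

console.log(Buffer.byteLength(text, 'utf8')); // 20000 bytes of raw UTF-8
console.log(asQuery.length);                  // 60016 characters in the URL
```

Sent as a POST, the same payload goes in the request body instead of the request line, so it is not subject to whatever URL length limit applies.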
This PR introduces better handling for databases that don't have proper UTF8 encoding.
It includes a backend test for UTF8 encoding which works by reading a properly formed HTML document that includes lots of weird characters, which are then imported into Etherpad using the API method setHTML. This can be merged separately from #3737, as one test is native to Etherpad and the other native to UeberDB.
If the database is improperly configured these tests should crash the Etherpad instance (intentionally). Along with ether/ueberDB@a5db6c4 // #3734 this will also provide a test at init of the database.
This PR also fixes setText and setHTML syntax in the pad backend tests. Previously if you used .get with setText or setHTML those tests would fail if the text or html variables were beyond the capacity of get.
This PR also (sorry!) should resolve the issue of uploading "emojis" through the UI. It does this by replacing the isAscii check with [...]
It works as well as copy/pasting and setHTML. Basically the following characters don't work in Etherpad at all.
🏋️♀️ 🏋🏻♀️ 🏋🏼♀️ 🏋🏽♀️ 🏋🏾♀️ 🏋🏿♀️ 🏋️♂️ 🏋🏻♂️ 🏋🏼♂️ 🏋🏽♂️ 🏋🏾♂️ 🏋🏿♂️ 🤼♀️ 🤼♂️ 🤸♀️ 🤸🏻♀️ 🤸🏼♀️ 🤸🏽♀️ 🤸🏾♀️ 🤸🏿♀️ 🤸♂️ 🤸🏻♂️ 🤸🏼♂️ 🤸🏽♂️ 🤸🏾♂️ 🤸🏿♂️ ⛹️♀️ ⛹🏻♀️ ⛹🏼♀️ ⛹🏽♀️ ⛹🏾♀️ ⛹🏿♀️ ⛹️♂️ ⛹🏻♂️ ⛹🏼♂️ ⛹🏽♂️ ⛹🏾♂️ ⛹🏿♂️ 🤺 🤾♀️ 🤾🏻♀️ 🤾🏼♀️ 🤾🏾♀️ 🤾🏾♀️ 🤾🏿♀️ 🤾♂️ 🤾🏻♂️ 🤾🏼♂️ 🤾🏽♂️ 🤾🏾♂️ 🤾🏿♂️ 🏌️♀️ 🏌🏻♀️ 🏌🏼♀️ 🏌🏽♀️ 🏌🏾♀️ 🏌🏿♀️ 🏌️♂️ 🏌🏻♂️ 🏌🏼♂️ 🏌🏽♂️ 🏌🏾♂️ 🏌🏿♂️ 🏇 🏇🏻 🏇🏼 🏇🏽 🏇🏾 🏇🏿 🧘♀️ 🧘🏻♀️ 🧘🏼♀️ 🧘🏽♀️ 🧘🏾♀️ 🧘🏿♀️ 🧘♂️ 🧘🏻♂️ 🧘🏼♂️ 🧘🏽♂️ 🧘🏾♂️ 🧘🏿♂️ 🏄♀️ 🏄🏻♀️ 🏄🏼♀️ 🏄🏽♀️ 🏄🏾♀️ 🏄🏿♀️ 🏄♂️ 🏄🏻♂️ 🏄🏼♂️ 🏄🏽♂️ 🏄🏾♂️ 🏄🏿♂️ 🏊♀️ 🏊🏻♀️ 🏊🏼♀️ 🏊🏽♀️ 🏊🏾♀️ 🏊🏿♀️ 🏊♂️ 🏊🏻♂️ 🏊🏼♂️ 🏊🏽♂️ 🏊🏾♂️ 🏊🏿♂️ 🤽♀️ 🤽🏻♀️ 🤽🏼♀️ 🤽🏽♀️ 🤽🏾♀️ 🤽🏿♀️ 🤽♂️ 🤽🏻♂️ 🤽🏼♂️ 🤽🏽♂️ 🤽🏾♂️ 🤽🏿♂️ 🚣♀️ 🚣🏻♀️ 🚣🏼♀️ 🚣🏽♀️ 🚣🏾♀️ 🚣🏿♀️ 🚣♂️ 🚣🏻♂️ 🚣🏼♂️ 🚣🏽♂️
Try copy/pasting them into any pad: it will format them incorrectly. This is related to how Etherpad "thinks" about characters as having a specific length, and tbh I feel right now it's outside of the scope of what I'm trying to achieve.
At first I thought it was Etherpad being a bit old-fashioned, then I had the same experience in Atom and Google Docs, and then in other document editors and IDEs, and discovered it's a bit of a mess out there...
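To make the "specific length" problem concrete, here is a small sketch that counts one of the emoji above in three different ways. It assumes the woman-lifting-weights emoji is the usual zero-width-joiner sequence (the joiner may not have survived in the list above), that Intl.Segmenter is available in the runtime, and countUnits is a hypothetical helper name, not Etherpad code.

```javascript
// Sketch only: countUnits is a hypothetical helper, not part of Etherpad.
function countUnits(text) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return {
    utf16: text.length,                                    // what String#length reports
    codePoints: Array.from(text).length,                   // Unicode code points
    graphemes: Array.from(segmenter.segment(text)).length, // user-perceived characters
  };
}

// Weight lifter + variation selector + zero-width joiner + female sign + variation selector.
const weightLifter = '\u{1F3CB}\uFE0F\u200D\u2640\uFE0F';
// Horse racing + light skin tone modifier.
const jockey = '\u{1F3C7}\u{1F3FB}';

console.log(countUnits(weightLifter)); // utf16: 6, codePoints: 5, one grapheme
console.log(countUnits(jockey));       // utf16: 4, codePoints: 2, one grapheme
```

Any code that treats one UTF-16 unit or one code point as one character will split these sequences, which is consistent with the broken formatting described above.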
So I think my fix is "good enough" but now we have a set of characters we know we can test against so another developer can continue the struggle further to support