Skip to content

Backend test for utf8 encoding - Ready for Review/Merge @muxator#3737

Merged
muxator merged 11 commits intoether:developfrom
JohnMcLear:backendTestForUTF8Encoding
Mar 31, 2020
Merged

Backend test for utf8 encoding - Ready for Review/Merge @muxator#3737
muxator merged 11 commits intoether:developfrom
JohnMcLear:backendTestForUTF8Encoding

Conversation

@JohnMcLear
Copy link
Copy Markdown
Member

@JohnMcLear JohnMcLear commented Mar 20, 2020

This PR introduces better handling for databases that don't have proper UTF8 encoding.

It includes a backend Test for UTF8 encoding which works by reading a properly formed HTML document that includes lots of weird characters which are then imported into Etherpad using The API method setHTML. This can be merged separate to #3737 as one test is native to Etherpad and the other native to UeberDB.

If the database is improperly configured these tests should crash the Etherpad instance (intentionally). Along with ether/ueberDB@a5db6c4 // #3734 this will also provide a test a init of database.

This PR also fixes setText and setHTML syntax in the pad backend tests. Previously if you used .get with setText or setHTML those tests would fail if the text or html variables were beyond the capacity of get

This PR also (sorry!) should resolve the issue of uploading "emojis" through the UI. It does this by replacing the isAscii check with

    let bytelike = unescape(encodeURIComponent(text));
    text = decodeURIComponent(escape(bytelike));

It works as well as copy/pasting and the setHTML. Basically the following characters don't work in Etherpad at all.

🏋️‍♀️ 🏋🏻‍♀️ 🏋🏼‍♀️ 🏋🏽‍♀️ 🏋🏾‍♀️ 🏋🏿‍♀️ 🏋️‍♂️ 🏋🏻‍♂️ 🏋🏼‍♂️ 🏋🏽‍♂️ 🏋🏾‍♂️ 🏋🏿‍♂️ 🤼‍♀️ 🤼‍♂️ 🤸‍♀️ 🤸🏻‍♀️ 🤸🏼‍♀️ 🤸🏽‍♀️ 🤸🏾‍♀️ 🤸🏿‍♀️ 🤸‍♂️ 🤸🏻‍♂️ 🤸🏼‍♂️ 🤸🏽‍♂️ 🤸🏾‍♂️ 🤸🏿‍♂️ ⛹️‍♀️ ⛹🏻‍♀️ ⛹🏼‍♀️ ⛹🏽‍♀️ ⛹🏾‍♀️ ⛹🏿‍♀️ ⛹️‍♂️ ⛹🏻‍♂️ ⛹🏼‍♂️ ⛹🏽‍♂️ ⛹🏾‍♂️ ⛹🏿‍♂️ 🤺 🤾‍♀️ 🤾🏻‍♀️ 🤾🏼‍♀️ 🤾🏾‍♀️ 🤾🏾‍♀️ 🤾🏿‍♀️ 🤾‍♂️ 🤾🏻‍♂️ 🤾🏼‍♂️ 🤾🏽‍♂️ 🤾🏾‍♂️ 🤾🏿‍♂️ 🏌️‍♀️ 🏌🏻‍♀️ 🏌🏼‍♀️ 🏌🏽‍♀️ 🏌🏾‍♀️ 🏌🏿‍♀️ 🏌️‍♂️ 🏌🏻‍♂️ 🏌🏼‍♂️ 🏌🏽‍♂️ 🏌🏾‍♂️ 🏌🏿‍♂️ 🏇 🏇🏻 🏇🏼 🏇🏽 🏇🏾 🏇🏿 🧘‍♀️ 🧘🏻‍♀️ 🧘🏼‍♀️ 🧘🏽‍♀️ 🧘🏾‍♀️ 🧘🏿‍♀️ 🧘‍♂️ 🧘🏻‍♂️ 🧘🏼‍♂️ 🧘🏽‍♂️ 🧘🏾‍♂️ 🧘🏿‍♂️ 🏄‍♀️ 🏄🏻‍♀️ 🏄🏼‍♀️ 🏄🏽‍♀️ 🏄🏾‍♀️ 🏄🏿‍♀️ 🏄‍♂️ 🏄🏻‍♂️ 🏄🏼‍♂️ 🏄🏽‍♂️ 🏄🏾‍♂️ 🏄🏿‍♂️ 🏊‍♀️ 🏊🏻‍♀️ 🏊🏼‍♀️ 🏊🏽‍♀️ 🏊🏾‍♀️ 🏊🏿‍♀️ 🏊‍♂️ 🏊🏻‍♂️ 🏊🏼‍♂️ 🏊🏽‍♂️ 🏊🏾‍♂️ 🏊🏿‍♂️ 🤽‍♀️ 🤽🏻‍♀️ 🤽🏼‍♀️ 🤽🏽‍♀️ 🤽🏾‍♀️ 🤽🏿‍♀️ 🤽‍♂️ 🤽🏻‍♂️ 🤽🏼‍♂️ 🤽🏽‍♂️ 🤽🏾‍♂️ 🤽🏿‍♂️ 🚣‍♀️ 🚣🏻‍♀️ 🚣🏼‍♀️ 🚣🏽‍♀️ 🚣🏾‍♀️ 🚣🏿‍♀️ 🚣‍♂️ 🚣🏻‍♂️ 🚣🏼‍♂️ 🚣🏽‍♂️

Try copy / pasting them into any pad, it will format them incorrectly. This is related to how Etherpad "thinks" about characters as a specific length and tbh I feel right now it's outside of the scope of what I'm trying to achieve.

At first I thought it was Etherpad being a bit old fashioned, then I had the same experience in Atom, Google Docs. And then other Document editors and IDEs, and discovered, it's a bit of a mess out there...

So I think my fix is "good enough" but now we have a set of characters we know we can test against so another developer can continue the struggle further to support

@JohnMcLear
Copy link
Copy Markdown
Member Author

@muxator ready for review :)

Comment thread src/node/handler/ImportHandler.js Outdated

if (!req.directDatabaseAccess) {
text = await fsp_readFile(destFile, "utf8");
let bytelike = unescape(encodeURIComponent(text));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question without having studied this code yet: is the unescape()/escape() roundtrip avoidable?

I ask this because those functions seem to be on the way of deprecation (source on MDN):

All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification.
Programmers should not use or assume the existence of these features and behaviours when writing new ECMAScript code.

Copy link
Copy Markdown
Member Author

@JohnMcLear JohnMcLear Mar 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

escape & it's sister aren't going anywhere any time soon. This gives us a few years (easily) to focus on handling the characters gracefully inside of the editor.

RE roundtrip, absolutely you could catch non processable characters and drop the import entirely. For me, in this PR, it was about making import work. Import is a small amount of processing handled by Etherpad instances and doing a roundtrip here is not computationally expensive.. That said, a DDoS attack using this as a vector is not ideal. Pushing 4Gb via post to the upload endpoint and doing the roundtrip would not be enjoyable. But Etherpad has much larger and more exploitable DDoS endpoints than this (think pad export URIs) so it's IMHO not a huge concern/worry.

If you want it to reject "characters" for the current implemented charset from settings.json we can do that but imho this serves it's purpose and our users well. Imagine the computational power requirements of processing an entire document beyond what we already are doing here though? .escape and .unescape are, despite being deprecated, optimized. We can use that to our advantage instead of trying to bake in our own unoptimized character by character check.

cc @rhelmer as he might know the schedule for deprecation of escape and unescape.

Again, let me emphasize, this issue is much bigger than Etherpad. I'm defo a noob at charsets and character handling but it looks like browsers will need to work closely with editors et al over the coming years to get this support implemented properly. <3

Finally, I imagine there are better fixes, I spent 99.9% of my time on this issue drilling down to the root cause and trying to get a vague understanding of what is a relatively nuanced topic that frankly, I have very little interest in. I <3 Etherpad users but for me, it's 2020, charset drama should be out of the way already and I don't want to go too far down that rabbit hole :D

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RE roundtrip, absolutely you could...

Whoa. When one is aiming to sanitize input doing an encode/decode roundtrip is actually a sound strategy. I am not thinking about performances right now.

My question was about why use two functions (encode/decodeURIComponent + unescape/escape) instead of a single one. I am sure there is a reason. Maybe this covers more cases? If so, what are they?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I don't know the answer! See what happens if you just do decode. I encoded then decoded because I figured it'd let less noise through and also create a saner output. Also that's what the SO article did.. https://stackoverflow.com/a/2670211

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

escape & it's sister aren't going anywhere any time soon. This gives us a few years (easily) to focus on handling the characters gracefully inside of the editor.
[...]
cc @rhelmer as he might know the schedule for deprecation of escape and unescape.

Hm. I agree that it is probably not going anywhere soon, let me research a bit more though and see what the recommended way to do this is from standards-y folks.

@muxator
Copy link
Copy Markdown
Contributor

muxator commented Mar 22, 2020

Nice work, @JohnMcLear, thanks! I'll have a look ASAP.

@JohnMcLear JohnMcLear changed the title Backend test for utf8 encoding Backend test for utf8 encoding - Ready for Review/Merge @muxator Mar 23, 2020
@JohnMcLear
Copy link
Copy Markdown
Member Author

@muxator bump :D

@muxator
Copy link
Copy Markdown
Contributor

muxator commented Mar 30, 2020

Tonight

@muxator muxator force-pushed the backendTestForUTF8Encoding branch from 7dc6785 to bcf63cc Compare March 30, 2020 20:56
@muxator
Copy link
Copy Markdown
Contributor

muxator commented Mar 30, 2020

I am having a lot of emojons looking at this

🤣

@muxator muxator force-pushed the backendTestForUTF8Encoding branch from bcf63cc to 7cc6404 Compare March 30, 2020 21:32
@muxator
Copy link
Copy Markdown
Contributor

muxator commented Mar 30, 2020

Wow, still going through...

@JohnMcLear
Copy link
Copy Markdown
Member Author

Are you testing character by character or something? :D

muxator and others added 2 commits March 30, 2020 23:53
…tHTML()

This is allowed starting from fc661ee ("core: allow URL parameters and POST
bodies to co-exist"), which landed in Etherpad 1.8.0. For the discussion, see
issue #3568.
@muxator
Copy link
Copy Markdown
Contributor

muxator commented Mar 30, 2020

Change by change of the original PR.

@muxator muxator force-pushed the backendTestForUTF8Encoding branch from 7cc6404 to ee0e4d3 Compare March 30, 2020 22:32
@muxator
Copy link
Copy Markdown
Contributor

muxator commented Mar 30, 2020

Ok I am done.

I do not know how to use the review interface so I am posting here.
The work can be reduced in three parts. I need an answer on part 3:

  1. 117deb3: tests: in backend tests, use POST instead of GET for setText() and setHTML()

    This switches some API calls from GET to POST on the client side.

    I see you were surprised by the fact that they were GETs in the first place, but this was just how the Etherpad API has always been. 😄

    For context: the change that supports both GET & POST in the API was introduced in fc661ee (see that commit message and HTTP API needs POST for large parameters in NodeJS 8.14+ #3568 for the reason behind it). It landed in 1.8.0.

    No problems here, just for your info.

  2. 3964ee8: tests: backend tests for utf8 & emojis support
    This is good work, much needed. Perfect.

  3. ee0e4d3: ImportHandler: instead of failing (badly) when a non Ascii code is uploaded just convert it then convert it back
    We already talked about encode/decodeURIComponent + unescape/escape, that the latter is deprecated, but that we are going to keep it because it helps with utf8 sanitization. I am ok with that.

    I have another doubt about this change, which may impact import/export: please open ee0e4d3 and view it in a side by side diff.

    Before the change, we were doing utf8 "sanitization" (actually, failing the upload) only if useConvertor was false.

    After this change, we replaced that block with a more forgiving one, but moved it inside a block where there is no condition on useConvertor (e.g., we are doing this even when using LibreOffice).

    Is it possible that this could cause problems when importing via LibreOffice, as we were seeing recently on another thread?

If we are safe with 3 I'll merge this PR right away.

Thanks.

@muxator
Copy link
Copy Markdown
Contributor

muxator commented Mar 31, 2020

I stripped point 3 (import/export) from this PR, and moved it in #3796, please comment there, @JohnMcLear.

All the other changes (the utf8 tests and the change from GET to POST) are merged.

@JohnMcLear
Copy link
Copy Markdown
Member Author

<3 looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants