Backend test for utf8 encoding - Ready for Review/Merge @muxator by JohnMcLear · Pull Request #3737 · ether/etherpad

JohnMcLear · 2020-03-20T21:21:00Z

This PR introduces better handling for databases that don't have proper UTF8 encoding.

It includes a backend Test for UTF8 encoding which works by reading a properly formed HTML document that includes lots of weird characters which are then imported into Etherpad using The API method setHTML. This can be merged separate to #3737 as one test is native to Etherpad and the other native to UeberDB.

If the database is improperly configured these tests should crash the Etherpad instance (intentionally). Along with ether/ueberDB@a5db6c4 // #3734 this will also provide a test a init of database.

This PR also fixes setText and setHTML syntax in the pad backend tests. Previously if you used .get with setText or setHTML those tests would fail if the text or html variables were beyond the capacity of get

This PR also (sorry!) should resolve the issue of uploading "emojis" through the UI. It does this by replacing the isAscii check with

    let bytelike = unescape(encodeURIComponent(text));
    text = decodeURIComponent(escape(bytelike));

It works as well as copy/pasting and the setHTML. Basically the following characters don't work in Etherpad at all.

🏋️‍♀️ 🏋🏻‍♀️ 🏋🏼‍♀️ 🏋🏽‍♀️ 🏋🏾‍♀️ 🏋🏿‍♀️ 🏋️‍♂️ 🏋🏻‍♂️ 🏋🏼‍♂️ 🏋🏽‍♂️ 🏋🏾‍♂️ 🏋🏿‍♂️ 🤼‍♀️ 🤼‍♂️ 🤸‍♀️ 🤸🏻‍♀️ 🤸🏼‍♀️ 🤸🏽‍♀️ 🤸🏾‍♀️ 🤸🏿‍♀️ 🤸‍♂️ 🤸🏻‍♂️ 🤸🏼‍♂️ 🤸🏽‍♂️ 🤸🏾‍♂️ 🤸🏿‍♂️ ⛹️‍♀️ ⛹🏻‍♀️ ⛹🏼‍♀️ ⛹🏽‍♀️ ⛹🏾‍♀️ ⛹🏿‍♀️ ⛹️‍♂️ ⛹🏻‍♂️ ⛹🏼‍♂️ ⛹🏽‍♂️ ⛹🏾‍♂️ ⛹🏿‍♂️ 🤺 🤾‍♀️ 🤾🏻‍♀️ 🤾🏼‍♀️ 🤾🏾‍♀️ 🤾🏾‍♀️ 🤾🏿‍♀️ 🤾‍♂️ 🤾🏻‍♂️ 🤾🏼‍♂️ 🤾🏽‍♂️ 🤾🏾‍♂️ 🤾🏿‍♂️ 🏌️‍♀️ 🏌🏻‍♀️ 🏌🏼‍♀️ 🏌🏽‍♀️ 🏌🏾‍♀️ 🏌🏿‍♀️ 🏌️‍♂️ 🏌🏻‍♂️ 🏌🏼‍♂️ 🏌🏽‍♂️ 🏌🏾‍♂️ 🏌🏿‍♂️ 🏇 🏇🏻 🏇🏼 🏇🏽 🏇🏾 🏇🏿 🧘‍♀️ 🧘🏻‍♀️ 🧘🏼‍♀️ 🧘🏽‍♀️ 🧘🏾‍♀️ 🧘🏿‍♀️ 🧘‍♂️ 🧘🏻‍♂️ 🧘🏼‍♂️ 🧘🏽‍♂️ 🧘🏾‍♂️ 🧘🏿‍♂️ 🏄‍♀️ 🏄🏻‍♀️ 🏄🏼‍♀️ 🏄🏽‍♀️ 🏄🏾‍♀️ 🏄🏿‍♀️ 🏄‍♂️ 🏄🏻‍♂️ 🏄🏼‍♂️ 🏄🏽‍♂️ 🏄🏾‍♂️ 🏄🏿‍♂️ 🏊‍♀️ 🏊🏻‍♀️ 🏊🏼‍♀️ 🏊🏽‍♀️ 🏊🏾‍♀️ 🏊🏿‍♀️ 🏊‍♂️ 🏊🏻‍♂️ 🏊🏼‍♂️ 🏊🏽‍♂️ 🏊🏾‍♂️ 🏊🏿‍♂️ 🤽‍♀️ 🤽🏻‍♀️ 🤽🏼‍♀️ 🤽🏽‍♀️ 🤽🏾‍♀️ 🤽🏿‍♀️ 🤽‍♂️ 🤽🏻‍♂️ 🤽🏼‍♂️ 🤽🏽‍♂️ 🤽🏾‍♂️ 🤽🏿‍♂️ 🚣‍♀️ 🚣🏻‍♀️ 🚣🏼‍♀️ 🚣🏽‍♀️ 🚣🏾‍♀️ 🚣🏿‍♀️ 🚣‍♂️ 🚣🏻‍♂️ 🚣🏼‍♂️ 🚣🏽‍♂️

Try copy / pasting them into any pad, it will format them incorrectly. This is related to how Etherpad "thinks" about characters as a specific length and tbh I feel right now it's outside of the scope of what I'm trying to achieve.

At first I thought it was Etherpad being a bit old fashioned, then I had the same experience in Atom, Google Docs. And then other Document editors and IDEs, and discovered, it's a bit of a mess out there...

So I think my fix is "good enough" but now we have a set of characters we know we can test against so another developer can continue the struggle further to support

sdf

update

JohnMcLear · 2020-03-21T19:16:32Z

@muxator ready for review :)

Updating

muxator · 2020-03-22T00:29:14Z


  if (!req.directDatabaseAccess) {
    text = await fsp_readFile(destFile, "utf8");
+    let bytelike = unescape(encodeURIComponent(text));


Quick question without having studied this code yet: is the unescape()/escape() roundtrip avoidable?

I ask this because those functions seem to be on the way of deprecation (source on MDN):

All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification.
Programmers should not use or assume the existence of these features and behaviours when writing new ECMAScript code.

escape & it's sister aren't going anywhere any time soon. This gives us a few years (easily) to focus on handling the characters gracefully inside of the editor.

RE roundtrip, absolutely you could catch non processable characters and drop the import entirely. For me, in this PR, it was about making import work. Import is a small amount of processing handled by Etherpad instances and doing a roundtrip here is not computationally expensive.. That said, a DDoS attack using this as a vector is not ideal. Pushing 4Gb via post to the upload endpoint and doing the roundtrip would not be enjoyable. But Etherpad has much larger and more exploitable DDoS endpoints than this (think pad export URIs) so it's IMHO not a huge concern/worry.

If you want it to reject "characters" for the current implemented charset from settings.json we can do that but imho this serves it's purpose and our users well. Imagine the computational power requirements of processing an entire document beyond what we already are doing here though? .escape and .unescape are, despite being deprecated, optimized. We can use that to our advantage instead of trying to bake in our own unoptimized character by character check.

cc @rhelmer as he might know the schedule for deprecation of escape and unescape.

Again, let me emphasize, this issue is much bigger than Etherpad. I'm defo a noob at charsets and character handling but it looks like browsers will need to work closely with editors et al over the coming years to get this support implemented properly. <3

Finally, I imagine there are better fixes, I spent 99.9% of my time on this issue drilling down to the root cause and trying to get a vague understanding of what is a relatively nuanced topic that frankly, I have very little interest in. I <3 Etherpad users but for me, it's 2020, charset drama should be out of the way already and I don't want to go too far down that rabbit hole :D

RE roundtrip, absolutely you could...

Whoa. When one is aiming to sanitize input doing an encode/decode roundtrip is actually a sound strategy. I am not thinking about performances right now.

My question was about why use two functions (encode/decodeURIComponent + unescape/escape) instead of a single one. I am sure there is a reason. Maybe this covers more cases? If so, what are they?

Good question, I don't know the answer! See what happens if you just do decode. I encoded then decoded because I figured it'd let less noise through and also create a saner output. Also that's what the SO article did.. https://stackoverflow.com/a/2670211

escape & it's sister aren't going anywhere any time soon. This gives us a few years (easily) to focus on handling the characters gracefully inside of the editor.
[...]
cc @rhelmer as he might know the schedule for deprecation of escape and unescape.

Hm. I agree that it is probably not going anywhere soon, let me research a bit more though and see what the recommended way to do this is from standards-y folks.

muxator · 2020-03-22T00:30:47Z

Nice work, @JohnMcLear, thanks! I'll have a look ASAP.

update

JohnMcLear · 2020-03-30T14:21:21Z

@muxator bump :D

muxator · 2020-03-30T15:39:01Z

Tonight

muxator · 2020-03-30T21:04:52Z

I am having a lot of emojons looking at this

🤣

muxator · 2020-03-30T21:33:46Z

Wow, still going through...

JohnMcLear · 2020-03-30T21:51:38Z

Are you testing character by character or something? :D

…tHTML() This is allowed starting from fc661ee ("core: allow URL parameters and POST bodies to co-exist"), which landed in Etherpad 1.8.0. For the discussion, see issue #3568.

muxator · 2020-03-30T22:24:50Z

Change by change of the original PR.

muxator · 2020-03-30T22:55:09Z

Ok I am done.

I do not know how to use the review interface so I am posting here.
The work can be reduced in three parts. I need an answer on part 3:

117deb3: tests: in backend tests, use POST instead of GET for setText() and setHTML()

This switches some API calls from GET to POST on the client side.

I see you were surprised by the fact that they were GETs in the first place, but this was just how the Etherpad API has always been. 😄

For context: the change that supports both GET & POST in the API was introduced in fc661ee (see that commit message and HTTP API needs POST for large parameters in NodeJS 8.14+ #3568 for the reason behind it). It landed in 1.8.0.

No problems here, just for your info.
3964ee8: tests: backend tests for utf8 & emojis support
This is good work, much needed. Perfect.
ee0e4d3: ImportHandler: instead of failing (badly) when a non Ascii code is uploaded just convert it then convert it back
We already talked about encode/decodeURIComponent + unescape/escape, that the latter is deprecated, but that we are going to keep it because it helps with utf8 sanitization. I am ok with that.

I have another doubt about this change, which may impact import/export: please open ee0e4d3 and view it in a side by side diff.

Before the change, we were doing utf8 "sanitization" (actually, failing the upload) only if useConvertor was false.

After this change, we replaced that block with a more forgiving one, but moved it inside a block where there is no condition on useConvertor (e.g., we are doing this even when using LibreOffice).

Is it possible that this could cause problems when importing via LibreOffice, as we were seeing recently on another thread?

If we are safe with 3 I'll merge this PR right away.

Thanks.

muxator · 2020-03-31T00:18:32Z

I stripped point 3 (import/export) from this PR, and moved it in #3796, please comment there, @JohnMcLear.

All the other changes (the utf8 tests and the change from GET to POST) are merged.

JohnMcLear · 2020-03-31T10:05:08Z

<3 looks good.

JohnMcLear added 3 commits March 15, 2020 09:53

Merge pull request #2 from ether/develop

9ea5ad9

sdf

Merge pull request #3 from ether/develop

c696165

update

tests: backend tests for utf8 & emojis support

3964ee8

Merge pull request #4 from ether/develop

a59d48c

Updating

muxator reviewed Mar 22, 2020

View reviewed changes

JohnMcLear changed the title ~~Backend test for utf8 encoding~~ Backend test for utf8 encoding - Ready for Review/Merge @muxator Mar 23, 2020

Merge branch 'develop' of github.com:ether/etherpad-lite into develop

0cf1692

JohnMcLear mentioned this pull request Mar 23, 2020

Good way to test Charset encoding of MySQL / Databases #3733

Closed

JohnMcLear added 3 commits March 24, 2020 22:50

Merge branch 'develop' of github.com:ether/etherpad-lite into develop

f5a47db

Merge branch 'develop' of github.com:ether/etherpad-lite into develop

a2400ec

Merge branch 'develop' of github.com:ether/etherpad-lite into develop

d1cb72e

snyk-bot mentioned this pull request Mar 29, 2020

[Snyk] Security upgrade mocha from 2.4.5 to 6.0.0 enterstudio/etherpad-lite#19

Open

Merge pull request #6 from ether/develop

099d186

update

muxator force-pushed the backendTestForUTF8Encoding branch from 7dc6785 to bcf63cc Compare March 30, 2020 20:56

muxator force-pushed the backendTestForUTF8Encoding branch from bcf63cc to 7cc6404 Compare March 30, 2020 21:32

muxator and others added 2 commits March 30, 2020 23:53

tests: typo in a comment in the backend tests

404c66a

tests: in backend tests, use POST instead of GET for setText() and se…

117deb3

…tHTML() This is allowed starting from fc661ee ("core: allow URL parameters and POST bodies to co-exist"), which landed in Etherpad 1.8.0. For the discussion, see issue #3568.

muxator force-pushed the backendTestForUTF8Encoding branch from 7cc6404 to ee0e4d3 Compare March 30, 2020 22:32

muxator force-pushed the backendTestForUTF8Encoding branch from ee0e4d3 to 3964ee8 Compare March 31, 2020 00:08

muxator merged commit 6c12a7f into ether:develop Mar 31, 2020

muxator mentioned this pull request Mar 31, 2020

ImportHandler: instead of failing when a non Ascii code is uploaded do a roundtrip conversion - Is this mergeable on 03/04/2020? Seems buggy #3796

Closed

Uh oh!

Conversation

JohnMcLear commented Mar 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohnMcLear commented Mar 21, 2020

Uh oh!

muxator Mar 22, 2020

Choose a reason for hiding this comment

Uh oh!

JohnMcLear Mar 22, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

muxator Mar 22, 2020

Choose a reason for hiding this comment

Uh oh!

JohnMcLear Mar 22, 2020

Choose a reason for hiding this comment

Uh oh!

rhelmer Apr 9, 2020

Choose a reason for hiding this comment

Uh oh!

muxator commented Mar 22, 2020

Uh oh!

JohnMcLear commented Mar 30, 2020

Uh oh!

muxator commented Mar 30, 2020

Uh oh!

muxator commented Mar 30, 2020

Uh oh!

muxator commented Mar 30, 2020

Uh oh!

JohnMcLear commented Mar 30, 2020

Uh oh!

muxator commented Mar 30, 2020

Uh oh!

muxator commented Mar 30, 2020

Uh oh!

muxator commented Mar 31, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohnMcLear commented Mar 31, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JohnMcLear commented Mar 20, 2020 •

edited

Loading

JohnMcLear Mar 22, 2020 •

edited

Loading

muxator commented Mar 31, 2020 •

edited

Loading