Allow online js_engine reconfiguration by nickva · Pull Request #5327 · apache/couchdb

nickva · 2024-10-30T00:54:16Z

This should make it possible to toggle between engines without having to restart the node.

We already had a config listener so make sure it also sets up the js_engine bits, as well.

Add a test to demonstrate it works both when setting a non-default value and reverting back to the default value.

jiahuili430 · 2024-10-30T02:36:28Z

+            _ -> ok
+        end
+    end,
+    case test_util:wait(WaitFun, 5000) of


The default timeout for an individual test is 5 seconds.
Running the test takes time, and the wait function can take up to 5 seconds.
Adding them together would exceed 5 seconds and cause timeout.
Shall we use a smaller number, like 4500?

Works as expected.

(node1@127.0.0.1)1> couch_server:get_js_engine(). <<"quickjs">> (node1@127.0.0.1)2> config:set("couchdb", "js_engine", "spidermonkey"). ok (node1@127.0.0.1)3> couch_server:get_js_engine(). <<"spidermonkey">> (node1@127.0.0.1)4> config:set("couchdb", "js_engine", "quickjs"). ok (node1@127.0.0.1)5> couch_server:get_js_engine(). <<"quickjs">>

Just curious what happens if I make a typo... Is that possible to avoid this error?

(node1@127.0.0.1)6> config:set("couchdb", "js_engine", "spider_monkey"). ok *** ERROR: Shell process terminated! (^G to start new job) *** $ curl $DB curl: (7) Failed to connect to 127.0.0.1 port 15984 after 0 ms: Couldn't connect to server

Good idea @jiahuili430 to lower the time a bit. It will blow up in both cases with an exception but the one in test_util will have a more helpful message

Currently it's somewhat intentional that it would crash on function clause on an invalid engine setting. But perhaps it's better to emit an error in the logs and then use the default? But I think that's for a separate PR.

Made a PR to handle it better : just stick with the default value #5329

jaydoane

Nice feature!

jaydoane · 2024-10-30T02:51:48Z

+            ?assertNotEqual(OldVal, get_proc_manager_default_js()),
+            NewVal = get_proc_manager_default_js(),
+
+            % Toggle back to the original default (test config:delete/3)


Suggested change

% Toggle back to the original default (test config:delete/3)

% Toggle back to the original default

Not sure if you want to leave that in?

Yeah, that's a note that we're testing the case of a user running config:delete/3 and returning to the set of defaults.

janl · 2024-10-30T09:31:30Z

I’m leaving this here by way of an official record for reviewing this.

In the old times, we loaded the Erlang MFAs from CouchDB config and then executed those. This allowed CouchDB admins to change what native code would be running as the couchdb operating system user. We closed a number of vulnerabilities[1] with refactoring this so that the native code definition comes from environment variables and not CouchDB config and thus be out of reach for just CouchDB admins. We also made it so that these env vars are only loaded on module initialisation and not dynamically. The desired security properties however come from moving the MFA definition to env vars (and also making them not strictly raw MFAs), the not-dynamic-loading was just an added bonus.

This patch makes loading those values dynamic again, but by the above analysis, that should not open us up to new (or old) attack vectors, as someone who can change the env of a running process is already further ahead than this vector would get them.

Can someone double check my logic here?

[1]:

rnewson · 2024-10-30T13:21:33Z

I don't see the pressing need for online reconfiguration here. As Jan pointed out we had it, and removed it (for security reasons) and with very little complaint (I don't recall any tbh).

There is some protection in not allowing an attacker that can manipulate a process's environment from magically changing what (potentially compromised) native code it executes. Sure, an attacker that can do that can probably cause a couchdb restart or crash, though that is at least a tangible event, it could be logged, the pid changes, etc. So the "from env" protection and the "no dynamic change after boot" protection are both providing something.

janl · 2024-10-30T14:13:11Z

could we put just configure_js_engine() into the config change handler and get this feature without losing any protections?

rnewson · 2024-10-30T14:19:52Z

e.g 1) couchdb boots with all the env vars it needs to know how to use spidermonkey and quicks 2) a runtime toggle as to which is used for new couchjs processes 3) profit?

nickva · 2024-10-30T14:28:00Z

I don't see the pressing need for online reconfiguration here.

@rnewson There is a need in deploying this change at a cloud db provider ;-) , where otherwise, it would require putting each node in maintenance mode, restarting it, putting it back in production and so on. It also makes it easy for any user to toggle the config.

@janl Thanks for the background. The analysis is correct that this doesn't allow adding any new MFA based options or changing arguments. Unless they attach a debugger to the beam process and update the OTP cached env red-black tree https://github.com/erlang/otp/blob/master/erts/emulator/sys/common/erl_osenv.c

nickva · 2024-10-30T14:34:08Z

could we put just configure_js_engine() into the config change handler and get this feature without losing any protections?

Hmm, that's what this PR does? There is already a config change handler, it reconfigures all the native (MFA) bits but didn't do the js engine switch.

e.g 1) couchdb boots with all the env vars it needs to know how to use spidermonkey and quicks 2) a runtime toggle as to which is used for new couchjs processes 3) profit?

That's effectively what's happening here. Erlang/OTPs' env is cached https://github.com/erlang/otp/blob/master/erts/emulator/sys/common/erl_osenv.c and not sure how it would be possible to alter it without already having high-enough access to the process and do a lot worse damage. We could cache it in the process again I suppose, but the danger is environment variables hold security bits, passwords, tokens, etc so a proc manager crashing would spill those into to the logs. Maybe cache the filtered prefix bits only? It still seems redundant?

janl · 2024-10-30T14:59:37Z

Hmm, that's what this PR does? There is already a config change handler, it reconfigures all the native (MFA) bits but didn't do the js engine switch.

as I read it, it moves the pull-from-env from init() into its own function that is passed into the config change handler. My suggestion would to just move the js engine switch into the config change handler and leave the rest in init.

That's effectively what's happening here. Erlang/OTPs' env is cached

gut feeling here, and no strong opinion, but the way we have it before is strictly better at expressing in our code what we mean to happen rather than relying on how OTP works today. Of course, if we did this every time, we reimplemented all of OTP and the is obviously silly.

nickva · 2024-10-30T15:08:55Z

@janl configure_js_engine/0 already does os:getenv(...) calls, but it does some of them only which leads to a different state if we initialize then just keep calling configure_js_engine/0 a few times. The way configure engine works is it has to be called after the main config is run. On change, the main config has to run again, then calls configure_js_engine(...).

but the way we have it before is strictly better at expressing in our code what we mean to happen

I can cache it I guess, even though I can't think of a way anyone would change the runtime env var settings of the process. Even a regular C process, how would changing its environment variable set work after it already started?

iilyak · 2024-10-30T16:04:41Z

 handle_config_change("query_server_config", _, _, _, _) ->
    gen_server:cast(?MODULE, reload_config),
    {ok, undefined};
+handle_config_change("couchdb", "js_engine", _, _, _) ->


Maybe out of scope of this PR, but we might want to handle handle_config_change("native_query_servers", _, _, _, _) in a similar way.

Maybe out of scope of this PR, but we might want to handle handle_config_change("native_query_servers", _, _, _, _) in a similar way.

We already do, it's just above in lines 283:285

This should make it possible to toggle between engines without having to restart the node. We already had a config listener so make sure it also sets up the js_engine bits, as well. Add a test to demonstrate it works both when setting a non-default value and reverting back to the default value.

In the previous PR [1] Jessica had discovered that an invalid engine setting crashes the server. Initially I had thought we would be strict about it and just let it crash, but that seems a bit harsh for this. Especially if there is a large cluster and configuration is pushed to all nodes. It would crash the whole cluster. Instead, since we have a default, stick with that. Especially now users can validate if a non-default js_engine is active from the welcome endpoint. [1] #5327

jiahuili430 reviewed Oct 30, 2024

View reviewed changes

jaydoane approved these changes Oct 30, 2024

View reviewed changes

nickva force-pushed the dynamic-quick-reconf branch 2 times, most recently from bb96489 to 240c2cc Compare October 30, 2024 14:07

iilyak reviewed Oct 30, 2024

View reviewed changes

nickva force-pushed the dynamic-quick-reconf branch from 240c2cc to d9c4ccd Compare October 30, 2024 16:23

nickva force-pushed the dynamic-quick-reconf branch from d9c4ccd to f77a170 Compare October 30, 2024 16:45

nickva merged commit 9cb958d into main Oct 30, 2024

nickva deleted the dynamic-quick-reconf branch October 30, 2024 17:18

nickva mentioned this pull request Oct 31, 2024

Do not crash the server on invalid js_engine setting #5329

Merged

	% Toggle back to the original default (test config:delete/3)
	% Toggle back to the original default

Conversation

nickva commented Oct 30, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nickva Oct 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jaydoane left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

janl commented Oct 30, 2024

Uh oh!

rnewson commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

janl commented Oct 30, 2024

Uh oh!

rnewson commented Oct 30, 2024

Uh oh!

nickva commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nickva commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

janl commented Oct 30, 2024

Uh oh!

nickva commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

nickva Oct 31, 2024 •

edited

Loading

rnewson commented Oct 30, 2024 •

edited

Loading

nickva commented Oct 30, 2024 •

edited

Loading

nickva commented Oct 30, 2024 •

edited

Loading

nickva commented Oct 30, 2024 •

edited

Loading