Skip to content

Allow online js_engine reconfiguration#5327

Merged
nickva merged 1 commit intomainfrom
dynamic-quick-reconf
Oct 30, 2024
Merged

Allow online js_engine reconfiguration#5327
nickva merged 1 commit intomainfrom
dynamic-quick-reconf

Conversation

@nickva
Copy link
Copy Markdown
Contributor

@nickva nickva commented Oct 30, 2024

This should make it possible to toggle between engines without having to restart the node.

We already had a config listener so make sure it also sets up the js_engine bits, as well.

Add a test to demonstrate it works both when setting a non-default value and reverting back to the default value.

_ -> ok
end
end,
case test_util:wait(WaitFun, 5000) of
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default timeout for an individual test is 5 seconds.
Running the test takes time, and the wait function can take up to 5 seconds.
Adding them together would exceed 5 seconds and cause timeout.
Shall we use a smaller number, like 4500?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works as expected.

(node1@127.0.0.1)1> couch_server:get_js_engine().
<<"quickjs">>
(node1@127.0.0.1)2> config:set("couchdb", "js_engine", "spidermonkey").
ok
(node1@127.0.0.1)3> couch_server:get_js_engine().
<<"spidermonkey">>
(node1@127.0.0.1)4> config:set("couchdb", "js_engine", "quickjs").
ok
(node1@127.0.0.1)5> couch_server:get_js_engine().
<<"quickjs">>

Just curious what happens if I make a typo... Is that possible to avoid this error?

(node1@127.0.0.1)6> config:set("couchdb", "js_engine", "spider_monkey").
ok
*** ERROR: Shell process terminated! (^G to start new job) ***

$ curl $DB
curl: (7) Failed to connect to 127.0.0.1 port 15984 after 0 ms: Couldn't connect to server

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea @jiahuili430 to lower the time a bit. It will blow up in both cases with an exception but the one in test_util will have a more helpful message

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it's somewhat intentional that it would crash on function clause on an invalid engine setting. But perhaps it's better to emit an error in the logs and then use the default? But I think that's for a separate PR.

Copy link
Copy Markdown
Contributor Author

@nickva nickva Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a PR to handle it better : just stick with the default value #5329

Copy link
Copy Markdown
Contributor

@jaydoane jaydoane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice feature!

?assertNotEqual(OldVal, get_proc_manager_default_js()),
NewVal = get_proc_manager_default_js(),

% Toggle back to the original default (test config:delete/3)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
% Toggle back to the original default (test config:delete/3)
% Toggle back to the original default

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you want to leave that in?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a note that we're testing the case of a user running config:delete/3 and returning to the set of defaults.

@janl
Copy link
Copy Markdown
Member

janl commented Oct 30, 2024

I’m leaving this here by way of an official record for reviewing this.

In the old times, we loaded the Erlang MFAs from CouchDB config and then executed those. This allowed CouchDB admins to change what native code would be running as the couchdb operating system user. We closed a number of vulnerabilities[1] with refactoring this so that the native code definition comes from environment variables and not CouchDB config and thus be out of reach for just CouchDB admins. We also made it so that these env vars are only loaded on module initialisation and not dynamically. The desired security properties however come from moving the MFA definition to env vars (and also making them not strictly raw MFAs), the not-dynamic-loading was just an added bonus.

This patch makes loading those values dynamic again, but by the above analysis, that should not open us up to new (or old) attack vectors, as someone who can change the env of a running process is already further ahead than this vector would get them.

Can someone double check my logic here?

[1]:

@rnewson
Copy link
Copy Markdown
Member

rnewson commented Oct 30, 2024

I don't see the pressing need for online reconfiguration here. As Jan pointed out we had it, and removed it (for security reasons) and with very little complaint (I don't recall any tbh).

There is some protection in not allowing an attacker that can manipulate a process's environment from magically changing what (potentially compromised) native code it executes. Sure, an attacker that can do that can probably cause a couchdb restart or crash, though that is at least a tangible event, it could be logged, the pid changes, etc. So the "from env" protection and the "no dynamic change after boot" protection are both providing something.

@nickva nickva force-pushed the dynamic-quick-reconf branch 2 times, most recently from bb96489 to 240c2cc Compare October 30, 2024 14:07
@janl
Copy link
Copy Markdown
Member

janl commented Oct 30, 2024

could we put just configure_js_engine() into the config change handler and get this feature without losing any protections?

@rnewson
Copy link
Copy Markdown
Member

rnewson commented Oct 30, 2024

e.g 1) couchdb boots with all the env vars it needs to know how to use spidermonkey and quicks 2) a runtime toggle as to which is used for new couchjs processes 3) profit?

@nickva
Copy link
Copy Markdown
Contributor Author

nickva commented Oct 30, 2024

I don't see the pressing need for online reconfiguration here.

@rnewson There is a need in deploying this change at a cloud db provider ;-) , where otherwise, it would require putting each node in maintenance mode, restarting it, putting it back in production and so on. It also makes it easy for any user to toggle the config.

@janl Thanks for the background. The analysis is correct that this doesn't allow adding any new MFA based options or changing arguments. Unless they attach a debugger to the beam process and update the OTP cached env red-black tree https://github.com/erlang/otp/blob/master/erts/emulator/sys/common/erl_osenv.c

@nickva
Copy link
Copy Markdown
Contributor Author

nickva commented Oct 30, 2024

could we put just configure_js_engine() into the config change handler and get this feature without losing any protections?

Hmm, that's what this PR does? There is already a config change handler, it reconfigures all the native (MFA) bits but didn't do the js engine switch.

e.g 1) couchdb boots with all the env vars it needs to know how to use spidermonkey and quicks 2) a runtime toggle as to which is used for new couchjs processes 3) profit?

That's effectively what's happening here. Erlang/OTPs' env is cached https://github.com/erlang/otp/blob/master/erts/emulator/sys/common/erl_osenv.c and not sure how it would be possible to alter it without already having high-enough access to the process and do a lot worse damage. We could cache it in the process again I suppose, but the danger is environment variables hold security bits, passwords, tokens, etc so a proc manager crashing would spill those into to the logs. Maybe cache the filtered prefix bits only? It still seems redundant?

@janl
Copy link
Copy Markdown
Member

janl commented Oct 30, 2024

Hmm, that's what this PR does? There is already a config change handler, it reconfigures all the native (MFA) bits but didn't do the js engine switch.

as I read it, it moves the pull-from-env from init() into its own function that is passed into the config change handler. My suggestion would to just move the js engine switch into the config change handler and leave the rest in init.

That's effectively what's happening here. Erlang/OTPs' env is cached

gut feeling here, and no strong opinion, but the way we have it before is strictly better at expressing in our code what we mean to happen rather than relying on how OTP works today. Of course, if we did this every time, we reimplemented all of OTP and the is obviously silly.

@nickva
Copy link
Copy Markdown
Contributor Author

nickva commented Oct 30, 2024

@janl configure_js_engine/0 already does os:getenv(...) calls, but it does some of them only which leads to a different state if we initialize then just keep calling configure_js_engine/0 a few times. The way configure engine works is it has to be called after the main config is run. On change, the main config has to run again, then calls configure_js_engine(...).

but the way we have it before is strictly better at expressing in our code what we mean to happen

I can cache it I guess, even though I can't think of a way anyone would change the runtime env var settings of the process. Even a regular C process, how would changing its environment variable set work after it already started?

handle_config_change("query_server_config", _, _, _, _) ->
gen_server:cast(?MODULE, reload_config),
{ok, undefined};
handle_config_change("couchdb", "js_engine", _, _, _) ->
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe out of scope of this PR, but we might want to handle handle_config_change("native_query_servers", _, _, _, _) in a similar way.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe out of scope of this PR, but we might want to handle handle_config_change("native_query_servers", _, _, _, _) in a similar way.

We already do, it's just above in lines 283:285

@nickva nickva force-pushed the dynamic-quick-reconf branch from 240c2cc to d9c4ccd Compare October 30, 2024 16:23
This should make it possible to toggle between engines without having to
restart the node.

We already had a config listener so make sure it also sets up the js_engine
bits, as well.

Add a test to demonstrate it works both when setting a non-default value and
reverting back to the default value.
@nickva nickva force-pushed the dynamic-quick-reconf branch from d9c4ccd to f77a170 Compare October 30, 2024 16:45
@nickva nickva merged commit 9cb958d into main Oct 30, 2024
@nickva nickva deleted the dynamic-quick-reconf branch October 30, 2024 17:18
nickva added a commit that referenced this pull request Oct 31, 2024
In the previous PR [1] Jessica had discovered that an invalid engine setting
crashes the server. Initially I had thought we would be strict about it and just
let it crash, but that seems a bit harsh for this. Especially if there is a
large cluster and configuration is pushed to all nodes. It would crash the
whole cluster.

Instead, since we have a default, stick with that. Especially now users can
validate if a non-default js_engine is active from the welcome endpoint.

[1] #5327
nickva added a commit that referenced this pull request Oct 31, 2024
In the previous PR [1] Jessica had discovered that an invalid engine setting
crashes the server. Initially I had thought we would be strict about it and just
let it crash, but that seems a bit harsh for this. Especially if there is a
large cluster and configuration is pushed to all nodes. It would crash the
whole cluster.

Instead, since we have a default, stick with that. Especially now users can
validate if a non-default js_engine is active from the welcome endpoint.

[1] #5327
nickva added a commit that referenced this pull request Mar 3, 2025
In the previous PR [1] Jessica had discovered that an invalid engine setting
crashes the server. Initially I had thought we would be strict about it and just
let it crash, but that seems a bit harsh for this. Especially if there is a
large cluster and configuration is pushed to all nodes. It would crash the
whole cluster.

Instead, since we have a default, stick with that. Especially now users can
validate if a non-default js_engine is active from the welcome endpoint.

[1] #5327
nickva added a commit that referenced this pull request Mar 4, 2025
In the previous PR [1] Jessica had discovered that an invalid engine setting
crashes the server. Initially I had thought we would be strict about it and just
let it crash, but that seems a bit harsh for this. Especially if there is a
large cluster and configuration is pushed to all nodes. It would crash the
whole cluster.

Instead, since we have a default, stick with that. Especially now users can
validate if a non-default js_engine is active from the welcome endpoint.

[1] #5327
nickva added a commit that referenced this pull request Mar 4, 2025
In the previous PR [1] Jessica had discovered that an invalid engine setting
crashes the server. Initially I had thought we would be strict about it and just
let it crash, but that seems a bit harsh for this. Especially if there is a
large cluster and configuration is pushed to all nodes. It would crash the
whole cluster.

Instead, since we have a default, stick with that. Especially now users can
validate if a non-default js_engine is active from the welcome endpoint.

[1] #5327
nickva added a commit that referenced this pull request Mar 4, 2025
In the previous PR [1] Jessica had discovered that an invalid engine setting
crashes the server. Initially I had thought we would be strict about it and just
let it crash, but that seems a bit harsh for this. Especially if there is a
large cluster and configuration is pushed to all nodes. It would crash the
whole cluster.

Instead, since we have a default, stick with that. Especially now users can
validate if a non-default js_engine is active from the welcome endpoint.

[1] #5327
nickva added a commit that referenced this pull request Mar 4, 2025
In the previous PR [1] Jessica had discovered that an invalid engine setting
crashes the server. Initially I had thought we would be strict about it and just
let it crash, but that seems a bit harsh for this. Especially if there is a
large cluster and configuration is pushed to all nodes. It would crash the
whole cluster.

Instead, since we have a default, stick with that. Especially now users can
validate if a non-default js_engine is active from the welcome endpoint.

[1] #5327
nickva added a commit that referenced this pull request Mar 4, 2025
In the previous PR [1] Jessica had discovered that an invalid engine setting
crashes the server. Initially I had thought we would be strict about it and just
let it crash, but that seems a bit harsh for this. Especially if there is a
large cluster and configuration is pushed to all nodes. It would crash the
whole cluster.

Instead, since we have a default, stick with that. Especially now users can
validate if a non-default js_engine is active from the welcome endpoint.

[1] #5327
nickva added a commit that referenced this pull request Mar 4, 2025
In the previous PR [1] Jessica had discovered that an invalid engine setting
crashes the server. Initially I had thought we would be strict about it and just
let it crash, but that seems a bit harsh for this. Especially if there is a
large cluster and configuration is pushed to all nodes. It would crash the
whole cluster.

Instead, since we have a default, stick with that. Especially now users can
validate if a non-default js_engine is active from the welcome endpoint.

[1] #5327
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants