@mskacelik
Contributor

/cc @darranl
issue: https://issues.redhat.com/browse/WFCORE-7335

This is only a draft PR to showcase a possible solution.

When the security manager subsystem configuration in standalone.xml is invalid, e.g.:

<subsystem xmlns="urn:jboss:domain:security-manager:1.0">
    <deployment-permissions>
        <maximum-set>
            <permission class="java.io.FilePermission" name="${badExpression}" actions="write,delete"/>
        </maximum-set>
    </deployment-permissions>
</subsystem>

Upon WF instance startup, ExpressionResolverImpl.java logs and throws an exception of type OperationClientException:

throw ControllerLogger.ROOT_LOGGER.cannotResolveExpression(initialValue);

OperationClientException means that:

This class implements {@link OperationClientException}, so if it is thrown during execution of an {@code OperationStepHandler}, the management kernel will adequately handle the exception as a user mistake, not a server fault.

This exception is then handled in the AbstractOperationContext:

} catch (Throwable t) {
    // If it doesn't implement OperationClientException marker interface, throw it on to outer catch block
    if (!(t instanceof OperationClientException)) {
        throw t;
    }
    // Handler threw OCE; that's equivalent to a request that we set the failure description
    final ModelNode failDesc = OperationClientException.class.cast(t).getFailureDescription();
    step.response.get(FAILURE_DESCRIPTION).set(failDesc);
    logStepFailure(step, true);
}

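So here is the fundamental problem: the exception is only turned into a failure description and logged; it is not propagated any further.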

Possible Solution

I have come up with a solution by wrapping the OperationClientException of the expression resolver in the SecurityManagerSubsystemAdd, which, from my understanding, is only executed during the start-up of WildFly (boot). This wrapped exception is not handled in the AbstractOperationContext, making the WildFly startup fail due to the exception.

So, in the current implementation with invalid configuration:

  • using CLI => roll back the configuration (with or without -secmgr)
  • booting without -secmgr => logs the error, but the boot won't fail
  • booting with -secmgr => logs the error, but boot will fail
  • --admin-only mode => both CLI and booting (invalid XML) won't fail (with or without -secmgr)

Note

  • Is the --admin-only mode behavior valid in this case?
  • The PR is missing tests, for two reasons. First, this is a draft PR and the solution might not be ideally implemented, so I have not written the tests yet. Second, I was not sure where to put them: in the wildfly-core repository or in the wildfly repository's integration tests.
  • Since this PR is a PoC, I only used RuntimeException; another exception type may be better suited.

@wildfly-ci

This comment was marked as outdated.

    maximumPermissionsNode = MAXIMUM_PERMISSIONS.resolveModelAttribute(context, deploymentPermissionsModel);
} catch (ExpressionResolver.ExpressionResolutionUserException ex) {
    // TODO: better exception choice ?
    throw (System.getSecurityManager() != null) ? new RuntimeException(ex) : ex;
Contributor

Just throw the original exception.

Don't use an exception type to try and influence the behavior of the operation; use OperationContext.setRollbackOnly().

I also don't think throwing a RuntimeException would abort the boot anyway. :) It would roll back a change after boot, but would not abort boot. OperationContext.setRollbackOnly() aborts the boot.

Please include a quick comment explaining the rationale for doing that; i.e. don't make people use git blame and research the JIRA.

Also, the current code at L84 throws an exception in a similar situation (detection of invalid config), so if the goal is to abort the boot, the same SM check + OperationContext.setRollbackOnly() seems applicable.
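
A minimal sketch of the suggestion above, assuming stand-in types rather than the real WildFly Core classes. `RollbackContext`, `ExpressionResolutionException`, and `BootTimeResolutionSketch` are hypothetical names used only for illustration; in the actual handler the context would be the `OperationContext` mentioned in the review, and the security manager check would be whatever check the subsystem already uses. The point it tries to show is the shape of the change: mark the operation rollback-only when the SM is enabled, then rethrow the original exception instead of wrapping it.

```java
import java.util.function.Supplier;

// Illustrative stand-in only: NOT the WildFly Core OperationContext.
interface RollbackContext {
    void setRollbackOnly();
}

// Illustrative stand-in for the expression resolution failure.
final class ExpressionResolutionException extends RuntimeException {
    ExpressionResolutionException(String message) {
        super(message);
    }
}

final class BootTimeResolutionSketch {

    /**
     * Resolves a value. If resolution fails while the security manager is
     * enabled, the operation is marked rollback-only before the ORIGINAL
     * exception is rethrown (no wrapping in RuntimeException).
     */
    static String resolveOrAbort(RollbackContext context,
                                 Supplier<String> resolver,
                                 boolean securityManagerEnabled) {
        try {
            return resolver.get();
        } catch (ExpressionResolutionException ex) {
            if (securityManagerEnabled) {
                // Rationale (per the review): an unresolvable permission
                // expression means the configured permissions cannot be
                // enforced, so the operation is marked rollback-only rather
                // than relying on the exception type to influence behavior.
                context.setRollbackOnly();
            }
            // Always rethrow the original exception unchanged.
            throw ex;
        }
    }
}
```

The boolean parameter is there only to keep the sketch testable without a running server; how the real code detects the security manager is not shown here.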

Contributor Author

Thanks for the info, I was not aware of the OperationContext.setRollbackOnly(), which seems to solve the issue.

With that said, yesterday I was trying to come up with an integration test, something similar to https://github.com/wildfly/wildfly/blob/68bc534d9ef0adf349c10b278d7a310d429763ab/testsuite/integration/vdx/src/test/java/org/wildfly/test/integration/vdx/standalone/SmokeStandaloneTestCase.java (a Groovy file for modifying the standalone.xml with an invalid config, checking the logs, and so on).

However, I had some issues setting up -secmgr and asserting whether the server halted or not. Do you think I am on the right track (vdx/standalone), or is there a better test solution that you can direct me to?

Contributor

A test wouldn't belong in full testsuite/vdx. That testsuite module is only about testing handling of reporting of XML errors. This isn't really a matter of an XML error.

This seems like something that can be tested in WildFly Core's testsuite/manualmode. It needs to be in manualmode because that module is the place for tests that need to control the entire lifecycle of the server. I assume you need to do that as the test will cause the server VM to exit.

Have a look at the org.jboss.as.test.shared.AssumeTestGroupUtil class, which allows a test to see if it is running with the security manager enabled. You don't need to worry about turning on the SM. WF Core runs different test jobs, and PRs get tested with a job that has the SM on. (See the status block below.) There are also nightly jobs with the SM enabled. So just write the test to always run and use AssumeTestGroupUtil to control the behavior of the test based on whether the SM is enabled. If there is nothing useful to test when the SM is not enabled, then throw an AssumptionViolatedException if AssumeTestGroupUtil.isSecurityManagerDisabled() returns true.

It's possible you don't need a broken config in the initial boot of the server. If not, that may give you more flexibility. The test can reload an existing server into admin-only mode, add a config that will fail in a normal boot, and then reload into normal mode. And then check what happened (plus clean up). What you are adding here are checks that will not execute in an admin-only server. The performBoottime method should not execute in admin-only.

See also the org.jboss.as.test.shared.logging package in testsuite/shared for utilities that can be used to capture logging data. I assume what the test would need to do is look at logging data. The reload will fail but I'm pretty sure you can't get useful data from any failure response. It would be in the log.

@mskacelik mskacelik changed the title WIP: [WFCORE-7335] Block deployments if the security manager subsyste… [WFCORE-7335] Block deployments if the security manager subsyste… Oct 21, 2025
@mskacelik mskacelik marked this pull request as ready for review October 21, 2025 15:10

Contributor

@yersan yersan left a comment

Managing the life cycle of the server in the test suite is a bit tricky sometimes, and it could influence other tests, so I've added some comments about what I think could mitigate it.

@yersan
Contributor

yersan commented Oct 23, 2025

It seems it did not help too much, but avoiding the thread.sleep is good. I have not looked closely, but I wonder if reading the logs to assert the error IDs gets confused because of traces added by other tests ...

@mskacelik
Contributor Author

It seems it did not help too much, but avoiding the thread.sleep is good. I have not looked closely, but I wonder if reading the logs to assert the error IDs gets confused because of traces added by other tests ...

Yes, that is very likely; the logs from the failed test do not contain the WFLYSRV0056 ID at all, but the server.log content says otherwise. I tried to use the TestLogHandlerSetupTask solution for the logs, but I had some problems with it. That could have been, and very likely is, a problem on my side :) so I'm going to investigate.

@mskacelik
Contributor Author

mskacelik commented Oct 24, 2025

Ok, updated the tests to use a custom log file instead of a shared server.log.

Initially, I wanted to search for the expression error and the fatal shutdown log. But for some odd reason, I could not get the FATAL shutdown log saved into the custom log file, regardless of whether I set LogHandlerSetup#getLevel to INFO, ALL, DEBUG, or TRACE.

So I had to try a different solution, and substitute the FATAL log checking with the WF server start log.

  • secmgr => reload fails => only one occurrence of WF server start log.
  • w/o secmgr => reload is successful => two occurrences of the WF server start log.
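
The occurrence check described above could be sketched roughly as follows. This is only an illustration: `LogOccurrenceCounter` and `countLinesContaining` are hypothetical names, and the marker string is left as a parameter because the exact message ID the test matched is not stated here (the CI output in this thread shows WFLYSRV0025 for the "started in" line, which would be one candidate).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class LogOccurrenceCounter {

    /**
     * Counts the lines of the given log file that contain the marker,
     * for example a message ID. The stream is closed before returning,
     * so this helper does not keep the log file open.
     */
    static long countLinesContaining(Path logFile, String marker) throws IOException {
        try (Stream<String> lines = Files.lines(logFile, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(marker)).count();
        }
    }
}
```

With such a helper, the two cases above would translate to expecting a count of 1 with secmgr and 2 without it.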

With that said, I am not 100% sure that the triggerFailedReloadToNormalMode method is written correctly for this check, since the secmgr and non-secmgr cases use a different API for reloading. I'm also still not sure that I understand container.stop() correctly.

Edit: it seems like one of the jobs failed due to concurrent file read between jobs (?)

@mskacelik
Contributor Author

@yersan, since there seems to be an issue with Windows FD handling (which I am not entirely sure how to fix), I am considering changing the test approach slightly: not testing the logs at all, but rather testing whether the reload failed, for example by using ServerReload.executeReloadAndWaitForCompletion with a smaller timeout bound.

WDYT?

Contributor

@yersan yersan left a comment

@mskacelik The testsuite is using the log facility in other tests, so the approach of looking at the logs should work.

Comment on lines 66 to 73
if (client != null) {
    logHandlerSetup.tearDown(client);
    client.close();
}
Contributor

@yersan yersan Oct 31, 2025

Regarding the problems on Windows, see this note in the TestLogHandlerSetupTask

// remove file, note this needs to be done after the operations have been executed as we need to ensure that
// no FD's are open. This can be an issue on Windows.
Files.deleteIfExists(logPath);

So, it looks like the test may not be tearing down the TestLogHandlerSetupTask correctly. Maybe the test is arriving here with an invalid client variable, which leads us to the code at L167:

client = TestSuiteEnvironment.getModelControllerClient();

You are modifying the local client reference, but not the one defined as an instance field of the class.

Not sure if that would resolve the issue you currently have, but it seems it could be related.

I suggest not passing a client variable to the cleanupSecurityManagerSubsystem method; always use the class's instance client field instead, to ensure that the client that arrives at UnresolvedExpressionSecurityManagerTestCase#tearDown() has the expected value so you can tear down the TestLogHandlerSetupTask.
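
To illustrate the point about the local reference, here is a small self-contained sketch. The class and method names are hypothetical, and a plain String stands in for the model controller client; it is not the code of the test under review. It only shows that assigning to a parameter named like the field leaves the field untouched.

```java
final class ClientHolderSketch {

    // Stand-in for the test class's client field (a String instead of a real client).
    private String client;

    // Problematic variant: the parameter shadows the field, so the
    // assignment below only changes the local reference.
    void reconnectViaParameter(String client) {
        if (client == null) {
            client = "new-connection";
        }
    }

    // Suggested variant: always work on the instance field, so a later
    // tearDown() sees the client that was actually created.
    void reconnectViaField() {
        if (this.client == null) {
            this.client = "new-connection";
        }
    }

    String client() {
        return client;
    }
}
```

After calling `reconnectViaParameter(holder.client())` the field is still null, whereas `reconnectViaField()` leaves it set, which is the difference the tearDown logic depends on.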

Contributor Author

@mskacelik mskacelik Nov 3, 2025

Oh yeah, that parameter should not be there; that is from a previous implementation using try-with-resources.

I also changed the test a bit (checking for WFLYCTL0193 instead of counting the occurrences of the start log).

Contributor

Sadly, it did not resolve the Windows issue.

Contributor

I've been taking a look, but I have not found anything at first glance that can give me a hint about why in your case, when the log is torn down, Windows complains about the log file being in use.

The InstallationManagerBootTestCase is using the same setup to configure the logs, and there are no issues there. This needs a closer look, maybe configuring and removing the log on each test instead of at the end of all the tests ... but I'm not sure why it is complaining on Windows.

Contributor Author

@mskacelik mskacelik Nov 4, 2025

maybe configuring and removing the log on each test instead of at the end of all the tests ...

Is that not what the current tests are doing? (i.e., JUnit4's @Before and @After).

But it seems that, besides the Windows tests, even the Linux tests failed, due to java.io.IOException: WFLYPRT0054: Channel closed.

The InstallationManagerBootTestCase is using the same setup to configure the logs, and there are no issues there

Maybe since the runtime is halting, reloading, and changing from admin to non-admin mode, it has some weird testing behaviors. Similarly, I find it weird that there is a need for setUpLogDirPath and tearLogDirPath (unlike the other tests).

But there is the new PropertyPermission("jboss.server.log.dir", "read") line in the note.

Contributor

Is that not what the current tests are doing? (i.e., JUnit4's @Before and @After)

Yes, that's right. I commented on it, but I did not realize that's what the test was already doing, so ignore my comment about that.

Contributor Author

I reworked the log handling mechanism a bit and changed the step order:
from: wrong expression config -> trigger boot -> verify logs -> fix config
to: wrong expression config -> trigger boot -> fix config -> verify logs

So let's see what it does.

Contributor Author

Well, at least all the other jobs passed, but the Windows job failure is rather unusual. Not really sure why it's happening and why it's not happening to other tests that are similar to this. To the best of my understanding, Windows tests are executed without a security manager, so there should be no unusual behavior in the test results...

@mskacelik mskacelik force-pushed the WFCORE-7335 branch 2 times, most recently from eba33e4 to 3267bbb Compare November 4, 2025 13:41
// remove file, note this needs to be done after the operations have been executed as we need to ensure that
// no FD's are open. This can be an issue on Windows.
Files.deleteIfExists(logPath);
// small retry/backoff loop to handle lingering OS file locks.
Contributor

It is a good try, but this is not going to work. It seems the main issue here is that the file is still in use by the running server. You are invoking logHandlerSetup.tearDown(client) before the container gets stopped, so delaying the deletion here is not going to have any effect. Calling it before the container gets stopped is expected; indeed, you need to connect to the server and roll back the changes. We need to understand why in other tests the file is deleted without issues even if the server is running, and in this test it is not.

If we try to work around it, we risk sweeping under the carpet problems that will resurface in the future, so I would avoid it.

I understand this is a test issue, and it is not related to the fix itself, so if we do not find the root cause, we could tolerate the error by catching the exception in your test case and creating a follow-up Jira with the intention of fixing it sooner rather than later.

Contributor Author

It was just an attempt, but yes, I agree with you. Going to investigate further.

@wildfly-ci

Core -> Full Integration Build 14797 outcome was FAILURE using a merge of 9307336
Summary: Tests failed: 1 (1 new), passed: 4541, ignored: 55 Build time: 02:49:41

Failed tests

org.jboss.as.test.clustering.cluster.singleton.SingletonDeploymentJBossAllTestCase.test: 	at org.jboss.as.arquillian.container.ArchiveDeployer.deployInternal(ArchiveDeployer.java:174)
	at org.jboss.as.arquillian.container.ArchiveDeployer.deployInternal(ArchiveDeployer.java:152)
	at org.jboss.as.arquillian.container.ArchiveDeployer.deploy(ArchiveDeployer.java:80)
	at org.jboss.as.arquillian.container.CommonDeployableContainer.deploy(CommonDeployableContainer.java:296)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at org.jboss.as.test.clustering.NodeUtil.deploy(NodeUtil.java:31)
	at org.jboss.as.test.clustering.cluster.AbstractClusteringTestCase.deploy(AbstractClusteringTestCase.java:256)
	at org.jboss.as.test.clustering.cluster.singleton.SingletonDeploymentTestCase.test(SingletonDeploymentTestCase.java:90)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
------- Stdout: -------
Warning! The CLI is running in a non-modular environment and cannot load commands from management extensions.
node-1 2025-11-25 16:21:32,011 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-8) WFLYUT0019: Host default-host stopping
node-1 2025-11-25 16:21:32,012 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-9) WFLYUT0008: Undertow HTTP listener default suspending
node-1 2025-11-25 16:21:32,014 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-9) WFLYUT0007: Undertow HTTP listener default stopped, was bound to [::1]:8080
node-1 2025-11-25 16:21:32,015 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-5) WFLYUT0004: Undertow 2.3.20.Final stopping
node-1 2025-11-25 16:21:32,038 INFO  [org.jboss.as] (MSC service thread 1-4) WFLYSRV0050: WildFly 39.0.0.Beta1-SNAPSHOT (WildFly Core 31.0.0.Beta3-SNAPSHOT) stopped in 29ms
node-1 2025-11-25 16:21:32,039 INFO  [org.jboss.as] (MSC service thread 1-8) WFLYSRV0049: WildFly 39.0.0.Beta1-SNAPSHOT (WildFly Core 31.0.0.Beta3-SNAPSHOT) starting
node-1 2025-11-25 16:21:32,230 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0039: Creating http management service using socket-binding (management-http)
node-1 2025-11-25 16:21:32,262 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 28) WFLYCLINF0001: Activating Infinispan subsystem.
node-1 2025-11-25 16:21:32,270 WARN  [org.wildfly.extension.elytron] (MSC service thread 1-2) WFLYELY00023: KeyStore file '/opt/buildAgent/work/e8e0dd9c7c4ba60/full/testsuite/integration/clustering/target/wildfly-clustering-singleton-ha-1/standalone/configuration/application.keystore' does not exist. Used blank.
node-1 2025-11-25 16:21:32,271 WARN  [org.wildfly.extension.elytron] (MSC service thread 1-2) WFLYELY01084: KeyStore /opt/buildAgent/work/e8e0dd9c7c4ba60/full/testsuite/integration/clustering/target/wildfly-clustering-singleton-ha-1/standalone/configuration/application.keystore not found, it will be auto-generated on first use with a self-signed certificate for host localhost
node-1 2025-11-25 16:21:32,273 INFO  [org.wildfly.extension.io] (ServerService Thread Pool -- 29) WFLYIO001: Worker 'default' has auto-configured to 8 IO threads with 64 max task threads based on your 4 available processors
node-1 2025-11-25 16:21:32,283 INFO  [org.jboss.as.jaxrs] (ServerService Thread Pool -- 30) WFLYRS0016: RESTEasy version 6.2.14.Final
node-1 2025-11-25 16:21:32,287 INFO  [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 31) WFLYCLJG0001: Activating JGroups subsystem. JGroups version 5.4.11
node-1 2025-11-25 16:21:32,304 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-2) WFLYUT0003: Undertow 2.3.20.Final starting
node-1 2025-11-25 16:21:32,310 WARN  [org.jboss.as.txn] (ServerService Thread Pool -- 38) WFLYTX0013: The node-identifier attribute on the /subsystem=transactions is set to the default value. This is a danger for environments running multiple servers. Please make sure the attribute value is unique.
node-1 2025-11-25 16:21:32,310 INFO  [org.jboss.as.naming] (ServerService Thread Pool -- 33) WFLYNAM0001: Activating Naming Subsystem
node-1 2025-11-25 16:21:32,328 INFO  [org.jboss.as.naming] (MSC service thread 1-2) WFLYNAM0003: Starting Naming Service
node-1 2025-11-25 16:21:32,336 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-7) WFLYUT0012: Started server default-server.
node-1 2025-11-25 16:21:32,337 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-6) WFLYUT0006: Undertow HTTP listener default listening on [::1]:8080
node-1 2025-11-25 16:21:32,338 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-8) Queuing requests.
node-1 2025-11-25 16:21:32,340 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-8) WFLYUT0018: Host default-host starting
node-1 2025-11-25 16:21:32,350 INFO  [org.jboss.as.server.deployment.scanner] (MSC service thread 1-4) WFLYDS0013: Started FileSystemDeploymentService for directory /opt/buildAgent/work/e8e0dd9c7c4ba60/full/testsuite/integration/clustering/target/wildfly-clustering-singleton-ha-1/standalone/deployments
node-1 2025-11-25 16:21:32,412 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server
node-1 2025-11-25 16:21:32,414 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://[::1]:9990/management
node-1 2025-11-25 16:21:32,414 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0054: Admin console is not enabled
node-1 2025-11-25 16:21:32,414 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: WildFly 39.0.0.Beta1-SNAPSHOT (WildFly Core 31.0.0.Beta3-SNAPSHOT) started in 374ms - Started 155 of 278 services (158 services are lazy, passive or on-demand) - Server configuration file in use: standalone-full-ha.xml - Minimum feature stability level: community


@darranl
Contributor

darranl commented Dec 10, 2025

Just to check this looks still blocked by the testsuite?

@mskacelik
Contributor Author

Just to check this looks still blocked by the testsuite?

Yes, because of the failed Windows job. I had a chat with @.yersan about it, and we agreed to try investigating the test manually in a VM, but because of PTOs (among other things) we haven't had a chance to do so yet.

5 participants