upsmon: fix SHUTDOWNEXIT behavior; tag sub-processes in debug log records #3086
jimklimov merged 30 commits into networkupstools:master
|
Re: |
|
I'm not a developer, but I play one. |
|
✅ Build nut 2.8.4.3514-master completed (commit 52ce507fcc by @jimklimov) |
|
@brucepleat : thanks for the heads-up. Here I'm following another loose thread - that a recently(ish) introduced feature might just have a bug, which is not platform-dependent. But if this hypothesis fails to solve the problem, my other options would be what you said, or something around the call to |
|
Testing in #3084 suggests this idea worked well, although specifically on Windows there are some surprises when running as a service:
|
|
Another idea that caught up now is that in non-monolithic mode of |
|
Some more work probably should be done, per #3084 (comment) |
…why/when it happened [networkupstools#3084] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…exitdelay handling into one clause, and ping the data server(s) while in the loop [networkupstools#3084] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…sed, so shutdownexitdelay does not apply currently [networkupstools#3084] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
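The commits above mention pinging the data server(s) while looping through the exit delay, so that the server keeps hearing from the waiting client. A minimal sketch of that idea follows; it is not the upsmon code, and `wait_with_keepalive()`, the callback types and the `fake_*` helpers are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: contact the server; returns 0 if it answered. */
typedef int (*ping_fn)(void *ctx);
/* Hypothetical callback: sleep (or pretend to) for `seconds`. */
typedef void (*sleep_fn)(unsigned int seconds, void *ctx);

/*
 * Wait for `total_delay` seconds in slices of at most `ping_interval`,
 * pinging before each slice so the server does not go a whole delay
 * without hearing from us. Returns the number of pings sent.
 */
unsigned int wait_with_keepalive(unsigned int total_delay,
                                 unsigned int ping_interval,
                                 ping_fn ping, sleep_fn nap, void *ctx)
{
    unsigned int waited = 0, pings = 0;

    assert(ping_interval > 0 && ping != NULL && nap != NULL);

    while (waited < total_delay) {
        unsigned int step = total_delay - waited;
        if (step > ping_interval)
            step = ping_interval;

        (void)ping(ctx);    /* result ignored: we only want to be heard */
        pings++;

        nap(step, ctx);
        waited += step;
    }
    return pings;
}

/* Stand-ins for a dry run: the "ping" always succeeds, the "sleep"
 * only accumulates the requested seconds into *ctx. */
int fake_ping(void *ctx) { (void)ctx; return 0; }
void fake_sleep(unsigned int seconds, void *ctx)
{
    *(unsigned int *)ctx += seconds;
}
```

With a 10-second delay and a 3-second interval this sends pings at 0, 3, 6 and 9 seconds; a real caller would pass callbacks that talk to the server and call `sleep()`.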
942d180 to 191f012
|
Saving snapshot of the day here; the interaction for parent to wait for child (e.g. on Linux) is not working yet, probably |
|
✅ Build nut 2.8.4.3538-master completed (commit 0609e871b5 by @jimklimov) |
191f012 to 5ab624b
5ab624b to e86a07f
|
This code looks quite ready to go; passed locally... will see what CI says. |
|
❌ Build nut 2.8.4.3540-master failed (commit 088effa9ae by @jimklimov) |
e86a07f to f7e284b
…ation of FINALDELAY [networkupstools#3084] Notably, both primary and secondary upsmon `doshutdown()` to handle FSD. Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…DOWN_HOSTSYNC` notification message [networkupstools#3084] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…UTDOWNEXIT` setting [networkupstools#3084, networkupstools#2133] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…tion to FSD signals (interrupt a sleep() if needed) [networkupstools#3084] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…ing the long work cycle [networkupstools#3084] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
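One of the commits above talks about reacting to FSD signals by interrupting a `sleep()` if needed. A rough POSIX sketch of that pattern is shown here, assuming a handler that only sets a flag and a sleep done in one-second slices; the function names are hypothetical and the signal number is left to the caller, since this is not taken from the upsmon sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set from the signal handler, polled by the sleeping loop. */
static volatile sig_atomic_t fsd_requested = 0;

static void on_fsd_signal(int sig)
{
    (void)sig;
    fsd_requested = 1;
}

/* Install the handler without SA_RESTART, so a pending sleep() is
 * interrupted when the signal arrives. Returns 0 on success. */
int install_fsd_handler(int signum)
{
    struct sigaction sa;

    assert(signum > 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_fsd_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signum, &sa, NULL);
}

/*
 * Sleep up to `seconds`, one second at a time, and stop early once the
 * flag is set. Returns the number of whole seconds left unslept.
 * Slicing bounds the window where a signal lands between the flag
 * check and the sleep() call to about one second.
 */
unsigned int sleep_unless_fsd(unsigned int seconds)
{
    unsigned int left = seconds;

    while (left > 0 && !fsd_requested) {
        if (sleep(1) == 0)
            left--;     /* a full slice elapsed */
    }
    return left;
}
```

A caller would install the handler once at startup and then use `sleep_unless_fsd()` wherever a long plain `sleep()` would otherwise delay the reaction to the signal.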
bcb9def to 005f84c
|
Addressed the "TODO ideas" above. New logs from the better reactive and better logged daemons (primary
|
|
@dvdesolve : can you please test if this works better for you now (also regarding "defunct" processes that should now be better reaped)? Hoping that CI does not complain substantially, and in case of Appveyor builds for Windows - does not overflow its 1hr allowance... |
|
...overflowed, but managed to upload the built tarball just before: |
|
CI: Agent out of space, irrelevant |
|
I will try to test fixes tomorrow. Should I also update primary controller to the test build as well as secondary systems? |
|
Probably better update both roles, changes should have impacted all sides (esp. the better visible logging, at least). |
…rify who normally runs as root or not [networkupstools#3084] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
|
✅ Build nut 2.8.4.3544-master completed (commit e59f879acf by @jimklimov) |
|
Just tested 2.8.4.3544 build on all related devices:
In total I've conducted 4 tests: 2 tests per secondary in which one shutdown process took longer than

Test 1: secondary = Arch Linux; timeout = 45; HOSTSYNC = 120
Log from primary (in reverse order):
Log from secondary (in reverse order + some info on PIDs)
Summary: both systems powered off, however the UPS wasn't turned off - actually, it wasn't turned off in any of the tests. I think something is wrong with my system for now, because it used to work earlier. In the log from the primary I can see these suspicious lines (taken from the latest run):
Anyway, the primary went off as soon as the secondary went away.

Test 2: secondary = Arch Linux; timeout = 45; HOSTSYNC = 20
Log from primary (in reverse order):
Log from secondary (in reverse order + some info on PIDs)
Summary: both systems powered off, but the primary went off earlier (as seen from the timestamps).
These two results look pretty good!

Test 3: secondary = Windows 7; timeout = 45; HOSTSYNC = 120
Log from primary (in reverse order):
Summary: the secondary system went off as it should, but for some reason the primary waited all the way until

Test 4: secondary = Windows 7; timeout = 45; HOSTSYNC = 20
Log from primary (in reverse order):
Summary: both systems acted as they should (behavior is identical to Test 2) |
|
@dvdesolve : thanks, great info and analysis!
In the first test, I see a gap between
Do you think this is a problem to pursue further? Not sure whether/how we even can reliably say goodbye in case we intentionally stay up until the end - the two concepts just don't fit together :)
The "defunct" PID 1394 is probably part of the forking stack (who called
Overall, the secondary did not live long enough for the primary to reach the forced shutdown after
This seems like the
Here the "Host sync timer expired, forcing shutdown" on the primary after 20 sec, as expected.
Had the UPS shutdown command worked on the primary, the secondary's life might have been cut short by this; more or less as expected (a matter of setting the timings right on end-user side).
As noted in Test 1, maybe more verbose logging of
This is actually interesting, do you know when it worked "earlier", with which NUT version? Notably, there was a change with NUT v2.8.3 via PR #2686 - which revised how drivers handle the different shutdown operations (because not all devices do it similarly, and because for testing sometimes we want to see that the device was talked to during shutdown, but do not want to actually lose power to the rack - so have it e.g. beep instead). Does this seem related (could that add a regression)? Otherwise, check that the https://github.com/networkupstools/nut/blob/master/scripts/systemd/nutshutdown.in or equivalent logic is in your primary's shutdown routine, and that the |
|
Regarding the lag between the secondary going off and the primary noticing it: IIRC I've faced such behavior even earlier during my tests. It seems to be some kind of "dead" connection handling - anyway, at this stage the secondary should already be in the "off" state (that's why the connection hung - maybe
Regarding powering off secondary systems during tests: for safety reasons I've connected them to a separate power source in case my setup was wrong (and it was, because of playing with
Test 3:
Regarding the UPS not going off: IIRC 2.8.2 and earlier versions worked great with Ippon Back Basic 1500/2200. Moreover, 2.8.3 on my CentOS Stream 9 (aarch64) controllers shuts down Ippon Innova RT 3/1 20K UPSes quite well (tested earlier). |
|
Tested again on stable 2.8.4. The UPS was not shut down. Excerpt from the journal (filtered by the words NUT and UPS): |
|
I am not sure there are logs written locally at the time
Probably you can tweak its copy in
Also, making sure the driver works to shut down the UPS in greenhouse conditions (like
But I guess this investigation belongs in another issue, not directly related to this PR, which I guess can be considered a success.
A possible follow-up could be to extend the NUT networked protocol with a keyword by which a client could hint to the server that it began the shutdown, so if there is any connection loss, it would be treated as logged off more quickly than it maybe happens today. Maybe the regular query to get client count as added in this and recent PRs, or some other looped query, could be treated as some sort of heartbeat in FSD mode - it already is used to fill the ether and avoid |
|
Thank you a lot for the efforts! |
… sort [networkupstools#3086] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
Closes: #3084
Follows-up from: #2133
Should also reap child processes better (e.g. notifiers while waiting to shut down).
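For readers unfamiliar with the "defunct" processes discussed in this thread: a child that has exited stays in the process table until its parent collects the exit status. A common way to reap such children without blocking the main loop is a `waitpid()` loop with `WNOHANG`. The sketch below illustrates that general POSIX technique only; `reap_finished_children()` and `spawn_and_reap_demo()` are hypothetical names, not functions from this PR.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Collect every child that has already exited, without blocking.
 * Returns how many were reaped in this call. */
int reap_finished_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {
            reaped++;           /* one zombie collected, look for more */
            continue;
        }
        if (pid == -1 && errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        break;  /* 0: children still running; -1 with ECHILD: none left */
    }
    return reaped;
}

/* Demo: fork a child that exits at once (standing in for a notifier),
 * then poll until it has been reaped. Returns the number reaped. */
int spawn_and_reap_demo(void)
{
    int reaped = 0, tries;
    pid_t pid = fork();

    assert(pid >= 0);
    if (pid == 0)
        _exit(0);               /* child: finish immediately */

    for (tries = 0; tries < 10 && reaped == 0; tries++) {
        reaped += reap_finished_children();
        if (reaped == 0)
            sleep(1);           /* child not finished yet, try again */
    }
    return reaped;
}
```

Calling such a reaper from inside a waiting loop is one way to keep short-lived helpers from lingering as "defunct" entries while the parent is otherwise idle.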
Added logging to help troubleshoot that issue, and proposing a fix based on a hypothesis (that `upsd` considers the looping `upsmon` to be dead after some time of not hearing from it) from #3084 (comment)

UPDATE: Added optional forked-process tagging (applied in `upsmon`, `upssched`, `upsdrvctl`) for `upsdebugx()` etc., to help make heads or tails of the multi-PID log messages from the daemons.
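To illustrate what tagging forked processes in log output can look like, here is a small stand-alone sketch. It is an assumption-laden illustration, not the NUT implementation: `demo_set_proc_tag()`, `demo_format_log()` and the `[pid tag] message` layout are all hypothetical and do not reflect the actual `upsdebugx()` output format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Per-process label; a forked child would set its own after fork(). */
static char proc_tag[32] = "";

/* Remember a short tag for this process (empty or NULL clears it). */
void demo_set_proc_tag(const char *tag)
{
    snprintf(proc_tag, sizeof(proc_tag), "%s", tag ? tag : "");
}

/* Format one log line with the PID and, when set, the process tag.
 * Returns what snprintf() returns. */
int demo_format_log(char *out, size_t outsz, long pid, const char *msg)
{
    assert(out != NULL && outsz > 0 && msg != NULL);

    if (proc_tag[0] != '\0')
        return snprintf(out, outsz, "[%ld %s] %s", pid, proc_tag, msg);
    return snprintf(out, outsz, "[%ld] %s", pid, msg);
}
```

With a tag set in each forked child, lines from different PIDs can be told apart by role rather than by number alone, which is the readability gain the update describes.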