Skip to content

upsmon: fix SHUTDOWNEXIT behavior; tag sub-processes in debug log records#3086

Merged
jimklimov merged 30 commits intonetworkupstools:masterfrom
jimklimov:issue-3084
Sep 30, 2025
Merged

upsmon: fix SHUTDOWNEXIT behavior; tag sub-processes in debug log records#3086
jimklimov merged 30 commits intonetworkupstools:masterfrom
jimklimov:issue-3084

Conversation

@jimklimov
Copy link
Copy Markdown
Member

@jimklimov jimklimov commented Sep 11, 2025

Closes: #3084
Follows-up from: #2133

Should also reap child processes better (e.g. notifiers while waiting to shut down).

Added logging to help troubleshoot that issue, and proposing a fix based on a hypothesis (that upsd considers the looping upsmon to be dead after some time of not hearing from it) from #3084 (comment)

UPDATE: Added optional forked-process tagging (applied in upsmon, upssched, upsdrvctl) for upsdebugx() etc., to help make out heads or tails of the multi-PID log messages from the daemons.

@jimklimov jimklimov added this to the 2.8.5 milestone Sep 11, 2025
@jimklimov jimklimov added bug upsmon Shutdowns and overrides and battery level triggers Issues and PRs about system shutdown, especially if battery charge/runtime remaining is involved impacts-release-2.8.1 Issues reported against NUT release 2.8.1 (maybe vanilla or with minor packaging tweaks) impacts-release-2.8.2 Issues reported against NUT release 2.8.2 (maybe vanilla or with minor packaging tweaks) impacts-release-2.8.3 Issues reported against NUT release 2.8.3 (maybe vanilla or with minor packaging tweaks) impacts-release-2.8.4 Issues reported against NUT release 2.8.4 (maybe vanilla or with minor packaging tweaks) labels Sep 11, 2025
@jimklimov
Copy link
Copy Markdown
Member Author

Re: slowbuild-run/MATRIX_TAG="openbsd-openbsd-6.5-make-shellcheck -- agent connectivity problem

@brucepleat
Copy link
Copy Markdown

I'm not a developer, but I play one.
I know Windows Shutdown has scenarios where it can be blocked from shutting down. Even WU can get hung. Not as common as it used to be, but..

@AppVeyorBot
Copy link
Copy Markdown

@jimklimov
Copy link
Copy Markdown
Member Author

@brucepleat : thanks for the heads-up. Here I'm following another loose thread - that a recently(ish) introduced feature might just have a bug, which is not platform-dependent. But if this hypothesis fails to solve the problem, my other options would be what you said, or something around the call to SHUTDOWNCMD not being synchronous (fire and continue instead of fire and wait.... and investigating which one we historically want here in fact), or the shell-scripting environment on that Windows system (MSYS2? CygWin? GitBash?...) being/behaving different from what we assume from Linux etc. experience (I dunno, broken "sleep" in PoC script?)

@jimklimov
Copy link
Copy Markdown
Member Author

Testing in #3084 suggests this idea worked well, although on specifically Windows there are some surprises when running as a service:

  • it is stopped as part of system shutdown, and then some processes die off, but the upsmon.exe process remains in its requested loop but Event Log entries are not saved;
  • system(shutdowncmd) seems to return quickly

@jimklimov
Copy link
Copy Markdown
Member Author

Another idea that caught up now is that in non-monolithic mode of upsmon this trick would probably not work - the root-owned parent should not do any networking and might even not know about loaded UPSes. At least something to check before merging that PR...

@jimklimov
Copy link
Copy Markdown
Member Author

Some more work probably should be done, per #3084 (comment)

…why/when it happened [networkupstools#3084]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…exitdelay handling into one clause, and ping the data server(s) while in the loop [networkupstools#3084]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…sed, so shutdownexitdelay does not apply currently [networkupstools#3084]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov jimklimov marked this pull request as draft September 24, 2025 18:23
@jimklimov
Copy link
Copy Markdown
Member Author

Saving snapshot of the day here; the interaction for parent to wait for child (e.g. on Linux) is not working yet, probably waitpid() would be better than openness of the pipe (select always returns 0, indefinitely).

Comment thread clients/upsmon.c Fixed
Comment thread common/common.c Fixed
@AppVeyorBot
Copy link
Copy Markdown

@jimklimov jimklimov marked this pull request as ready for review September 25, 2025 01:57
@jimklimov
Copy link
Copy Markdown
Member Author

This code looks quite ready to go; passed locally... will see what CI says.

@AppVeyorBot
Copy link
Copy Markdown

…ation of FINALDELAY [networkupstools#3084]

Notably, both primary and secondary upsmon `doshutdown()` to handle FSD.

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…DOWN_HOSTSYNC` notification message [networkupstools#3084]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…UTDOWNEXIT` setting [networkupstools#3084, networkupstools#2133]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…tion to FSD signals (interrupt a sleep() if needed) [networkupstools#3084]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…ing the long work cycle [networkupstools#3084]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
Comment thread clients/upsmon.c Dismissed
@jimklimov
Copy link
Copy Markdown
Member Author

Addressed the "TODO ideas" above. New logs from the better reactive and better logged daemons (primary HOSTSYNC 100, secondary SHUTDOWNEXIT 23):

  • primary:
    • "main-loop" got FSD signal at 21.9 sec
    • early sleep interruption works: "5-sec delay between main loop cycles finished, took 1.867021"
    • forceshutdown() began at 21.9s
    • sync_secondaries() notified with new SHUTDOWN_HOSTSYNC aka "Shutdown initiated; primary system is waiting for secondaries to log out or time out" at 21.9s
    • sync_secondaries() looping for HOSTSYNC since 22.1s (first report) to 53.9s (where UPS count becomes 1)
    • started doshutdown() and notified about it at 54s
    • reported "doshutdown: waiting for FINALDELAY=5 (to let notification handling complete)..." at 54s
    • "main-loop" proceeded to "doshutdown: call parent pipe for shutdown (async)" at 59s, exited immediately
    • "priv-parent" logged the powerdownflag activity (not created), "Calling shutdown command", and exited
...
  20.067979     [D2:74326:notify] a upsmon sub-process (notify) is exiting now

Network UPS Tools upsmon 2.8.4.77.4-81+gec25f25a3 (development iteration after 2.8.4)
   0.000000     [D1:74350] getprocname: determined process name for PID 74350: /home/jim/nut/clients/.libs/upsmon
   0.000000     [D1:74350] upscli_init_default_connect_timeout: upscli_default_connect_timeout=10.000000 sec assigned from: default_secs
   0.000393     [D1:74350:signaler] getprocname: determined process name for PID 73979: /home/jim/nut/clients/.libs/upsmon
   0.000473     [D1:74350:signaler] compareprocname: case-sensitive base name hit for PID 73979 of '/home/jim/nut/clients/.libs/upsmon'=>'upsmon' and checked 'upsmon'=>'upsmon'
   0.000575     [D1:74350:signaler] Signaled old daemon OK
   0.000598     Exiting upsmon instance after sending a signal to the running daemon successfully

  21.920909     Signal 10: User requested FSD
  21.921817     [D4:73979:main-loop] 5-sec delay between main loop cycles finished, took 1.867021
  21.922039     Got an FSD signal, initiating forced shutdown
  21.922118     [D1:73979:main-loop] Shutting down any UPSes in PRIMARY mode...
  21.922140     [D2:73979:main-loop] forceshutdown: For this system, timing options: SHUTDOWNEXIT=0 HOSTSYNC=100 FINALDELAY=5 DEADTIME=15; flags: USERFSD=1 EXIT_FLAG=0 USE_PIPE=1
  21.922158     [D2:73979:main-loop] forceshutdown: tell data server to setfsd(dummy@localhost)
  21.922180     [D2:73979:main-loop] Setting FSD on UPS dummy@localhost:12345
Client dummy-admin-m@127.0.0.1 set FSD on UPS [dummy]
  21.922822     [D1:73979:main-loop] This system is a primary for some device(s); waiting (up to HOSTSYNC=100 seconds) for secondaries to logout...
  21.922923     [D3:73979:main-loop] get_var: dummy@localhost:12345 / numlogins
  21.923445     [D3:73979:main-loop] sync_secondaries: Max data server logins per UPS still active: 2
  21.923522     [D2:73979:main-loop] do_notify: ntype 0x001a (SHUTDOWN_HOSTSYNC)
  21.923573     [D6:73979:main-loop] notify: sending notification for [upsmon itself]: type SHUTDOWN_HOSTSYNC with flags 0x0006: Shutdown initiated; primary system is waiting for secondaries to log out or time out
  21.923648     [D6:73979:main-loop] notify: NOTIFY_SYSLOG (as LOG_NOTICE)
  21.923670     Shutdown initiated; primary system is waiting for secondaries to log out or time out
  21.924422     [D2:73979:main-loop] notify (parent): forked a child to notify via subprocesses
  21.924552     [D3:73979:main-loop] do_notify: ntype 0x001a (SHUTDOWN_HOSTSYNC) finished
  21.924596     [D2:74364:notify] notify (grandchild): forked to notify via subprocesses
  21.924788     [D6:74364:notify] notify (grandchild): NOTIFY_WALL
  21.930605     [D6:74364:notify] notify (grandchild): exiting after notifications
  21.930681     [D1:74364:notify] close_sdbus_once: trying
  21.933048     [D2:74364:notify] a upsmon sub-process (notify) is exiting now
  22.174796     [D3:73979:main-loop] get_var: dummy@localhost:12345 / numlogins
  22.175240     [D3:73979:main-loop] sync_secondaries: Max data server logins per UPS still active: 2
  22.425486     [D3:73979:main-loop] get_var: dummy@localhost:12345 / numlogins
  22.426046     [D3:73979:main-loop] sync_secondaries: Max data server logins per UPS still active: 2
  22.676254     [D3:73979:main-loop] get_var: dummy@localhost:12345 / numlogins
  22.676576     [D3:73979:main-loop] sync_secondaries: Max data server logins per UPS still active: 2
  22.926781     [D3:73979:main-loop] get_var: dummy@localhost:12345 / numlogins
...
  53.748931     [D3:73979:main-loop] get_var: dummy@localhost:12345 / numlogins
  53.749261     [D3:73979:main-loop] sync_secondaries: Max data server logins per UPS still active: 2
  53.999465     [D3:73979:main-loop] get_var: dummy@localhost:12345 / numlogins
  53.999756     [D3:73979:main-loop] sync_secondaries: Max data server logins per UPS still active: 1
  53.999808     [D1:73979:main-loop] HOSTSYNC timeout expired or all the secondaries are gone, so shutting down this primary system now...
  53.999818     [D1:73979:main-loop] doshutdown: starting...
  53.999866     Executing automatic power-fail shutdown
  54.003781     [D2:73979:main-loop] do_notify: ntype 0x0006 (SHUTDOWN)
  54.003835     [D6:73979:main-loop] notify: sending notification for [upsmon itself]: type SHUTDOWN with flags 0x000a: Auto logout and shutdown proceeding
  54.003926     [D6:73979:main-loop] notify: NOTIFY_SYSLOG (as LOG_NOTICE)
  54.003987     Auto logout and shutdown proceeding
  54.004479     [D2:73979:main-loop] notify (parent): forked a child to notify via subprocesses
  54.004568     [D3:73979:main-loop] do_notify: ntype 0x0006 (SHUTDOWN) finished
  54.004621     [D1:73979:main-loop] doshutdown: waiting for FINALDELAY=5 (to let notification handling complete)...
  54.004604     [D2:74851:notify] notify (grandchild): forked to notify via subprocesses
  54.004683     [D6:74851:notify] notify (grandchild): NOTIFY_EXEC: calling NOTIFYCMD as 'TEMPDIR='/home/jim/nut/tests/NIT/tmp-issue-3084/run' /home/jim/nut/scripts/misc/notifyme-debug "Auto logout and shutdown proceeding"'
  54.015393     [D6:74851:notify] notify (grandchild): exiting after notifications
  54.015444     [D1:74851:notify] close_sdbus_once: trying
  54.017059     [D2:74851:notify] a upsmon sub-process (notify) is exiting now
  59.004812     doshutdown: call parent pipe for shutdown (async)
  59.004923     [D1:73979:main-loop] doshutdown: current exit_flag=0
  59.004937     [D1:73979:main-loop] Exiting upsmon immediately after initiating shutdown, by default
  59.004947     [D1:73979:main-loop] doshutdown: current exit_flag=0
  59.004966     [D1:73965:priv-parent] set_pdflag: SKIP creation of a POWERDOWNFLAG file: not configured
  59.004981     Exiting upsmon program after initiating shutdown
  59.004989     [D1:73965:priv-parent] runparent: upsmon parent: Calling shutdown command: echo "`date`: $$: TESTING_DUMMY_SHUTDOWN_NOW" | tee -a "/home/jim/nut/tests/NIT/tmp-issue-3084/run/upsmon.sd.log"
  59.005033     [D1:73979:main-loop] close_sdbus_once: trying
  59.006462     [D2:73979:main-loop] a upsmon sub-process (main-loop) is exiting now
Thu Sep 25 15:22:41 CEST 2025: 74970: TESTING_DUMMY_SHUTDOWN_NOW
  59.010509     upsmon parent: Exiting after calling (0) shutdown command: echo "`date`: $$: TESTING_DUMMY_SHUTDOWN_NOW" | tee -a "/home/jim/nut/tests/NIT/tmp-issue-3084/run/upsmon.sd.log"
  59.010567     [D2:73965:priv-parent] a upsmon sub-process (priv-parent) is exiting now
  • secondary:
    • "main-loop" detected FSD in ups.status at 20s, sent notifications, initiated shutdown as "not a primary for any device", notified about that
    • "doshutdown: waiting for FINALDELAY=5 (to let notification handling complete)..." at 20s
    • proceeded to "doshutdown: call parent pipe for shutdown (async)" at 25s
    • "priv-parent" logged the powerdownflag activity (not created), "Calling shutdown command", and began the loop to "wait for child" since 25s
    • "main-loop" reported "doshutdown: SHUTDOWNEXIT timeout expired" and exited at 48s
    • "priv-parent" reported "upsmon parent: child has exited" at 49s and also exited
...
  15.041416     [D4:74045:main-loop] Beginning 5-sec delay between main loop cycles
  15.054383     [D6:74344:notify] notify (grandchild): exiting after notifications
  15.054447     [D1:74344:notify] close_sdbus_once: trying
  15.056312     [D2:74344:notify] a upsmon sub-process (notify) is exiting now
  20.045729     [D4:74045:main-loop] 5-sec delay between main loop cycles finished, took 5.004302
  20.046335     [D5:74045:main-loop] sleep_inhibitor_status=-1
  20.047019     [D2:74045:main-loop] pollups: dummy@localhost:12345
  20.047097     [D3:74045:main-loop] get_var: dummy@localhost:12345 / status
  20.047548     [D3:74045:main-loop] get_var: dummy@localhost:12345 / buzzword
  20.047846     [D3:74045:main-loop] get_var: dummy@localhost:12345 / X-buzzword
  20.048133     [D2:74045:main-loop] parse_status: [FSD OL]
  20.048189     [D3:74045:main-loop] parsing: [FSD]
  20.048202     [D3:74045:main-loop] ups_fsd: dummy@localhost:12345 (first time)
  20.048210     [D2:74045:main-loop] do_notify: ntype 0x0003 (FSD)
  20.048265     [D6:74045:main-loop] notify: sending notification for [dummy@localhost:12345]: type FSD with flags 0x000a: UPS dummy@localhost:12345: forced shutdown in progress
  20.048332     [D6:74045:main-loop] notify: NOTIFY_SYSLOG (as LOG_NOTICE)
  20.048419     UPS dummy@localhost:12345: forced shutdown in progress
  20.048992     [D2:74045:main-loop] notify (parent): forked a child to notify via subprocesses
  20.049082     [D3:74045:main-loop] do_notify: ntype 0x0003 (FSD) finished
  20.049161     [D3:74045:main-loop] parsing: [OL]
  20.049165     [D2:74412:notify] notify (grandchild): forked to notify via subprocesses
  20.049229     [D6:74412:notify] notify (grandchild): NOTIFY_EXEC: calling NOTIFYCMD as 'TEMPDIR='/home/jim/nut/tests/NIT/tmp-issue-3084s/run' /home/jim/nut/scripts/misc/notifyme-debug "UPS dummy@localhost:12345: forced shutdown in progress"'
  20.049242     [D4:74045:main-loop] ups_on_line: dummy@localhost:12345 (no change)
  20.049317     [D3:74045:main-loop] Handled 2 status tokens
  20.049391     [D1:74045:main-loop] Critical UPS: dummy@localhost:12345
  20.049457     [D3:74045:main-loop] Current power value: 0
  20.049513     [D3:74045:main-loop] Minimum power value: 1
  20.049523     Too few UPS(es) are healthy (0<1), initiating forced shutdown
  20.049572     [D1:74045:main-loop] Shutting down any UPSes in PRIMARY mode...
  20.049584     [D2:74045:main-loop] forceshutdown: For this system, timing options: SHUTDOWNEXIT=23 HOSTSYNC=15 FINALDELAY=5 DEADTIME=15; flags: USERFSD=0 EXIT_FLAG=0 USE_PIPE=1
  20.049648     [D1:74045:main-loop] This system is not a primary for any device, shutting down now...
  20.049708     [D1:74045:main-loop] doshutdown: starting...
  20.049736     Executing automatic power-fail shutdown
  20.054879     [D2:74045:main-loop] do_notify: ntype 0x0006 (SHUTDOWN)
  20.054903     [D6:74045:main-loop] notify: sending notification for [upsmon itself]: type SHUTDOWN with flags 0x000a: Auto logout and shutdown proceeding
  20.054911     [D6:74045:main-loop] notify: NOTIFY_SYSLOG (as LOG_NOTICE)
  20.054965     Auto logout and shutdown proceeding
  20.055540     [D2:74045:main-loop] notify (parent): forked a child to notify via subprocesses
  20.055657     [D3:74045:main-loop] do_notify: ntype 0x0006 (SHUTDOWN) finished
  20.055731     [D1:74045:main-loop] doshutdown: waiting for FINALDELAY=5 (to let notification handling complete)...
  20.055763     [D2:74418:notify] notify (grandchild): forked to notify via subprocesses
  20.055874     [D6:74418:notify] notify (grandchild): NOTIFY_EXEC: calling NOTIFYCMD as 'TEMPDIR='/home/jim/nut/tests/NIT/tmp-issue-3084s/run' /home/jim/nut/scripts/misc/notifyme-debug "Auto logout and shutdown proceeding"'
  20.064346     [D6:74412:notify] notify (grandchild): exiting after notifications
  20.064534     [D1:74412:notify] close_sdbus_once: trying
  20.066964     [D2:74412:notify] a upsmon sub-process (notify) is exiting now
  20.071803     [D6:74418:notify] notify (grandchild): exiting after notifications
  20.071904     [D1:74418:notify] close_sdbus_once: trying
  20.075709     [D2:74418:notify] a upsmon sub-process (notify) is exiting now
  25.055905     doshutdown: call parent pipe for shutdown (async)
  25.056033     [D1:74045:main-loop] doshutdown: current exit_flag=0
  25.056056     Configured to only exit upsmon SHUTDOWNEXIT=23 sec after initiating shutdown
  25.056072     [D3:74045:main-loop] doshutdown: ping data server(s) for this client to remain not-timed-out
  25.056068     [D1:74031:priv-parent] set_pdflag: SKIP creation of a POWERDOWNFLAG file: not configured
  25.056097     [D1:74031:priv-parent] runparent: upsmon parent: Calling shutdown command: echo "`date`: $$: TESTING_DUMMY_SHUTDOWN_NOW" | tee -a "/home/jim/nut/tests/NIT/tmp-issue-3084s/run/upsmon.sd.log"
  25.056128     [D3:74045:main-loop] get_var: dummy@localhost:12345 / numlogins
Thu Sep 25 15:22:13 CEST 2025: 74538: TESTING_DUMMY_SHUTDOWN_NOW
  25.062135     [D1:74031:priv-parent] upsmon parent (exit_flag=0): waiting for child 74045 to exit, after calling (0) shutdown command: echo "`date`: $$: TESTING_DUMMY_SHUTDOWN_NOW" | tee -a "/home/jim/nut/tests/NIT/tmp-issue-3084s/run/upsmon.sd.log"
  25.062205     [D3:74031:priv-parent] upsmon parent: wait for child returned status=0, pid=0
  26.056726     [D3:74045:main-loop] doshutdown: ping data server(s) for this client to remain not-timed-out
  26.056833     [D3:74045:main-loop] get_var: dummy@localhost:12345 / numlogins
  26.062410     [D3:74031:priv-parent] upsmon parent: wait for child returned status=0, pid=0
  27.057365     [D3:74045:main-loop] doshutdown: ping data server(s) for this client to remain not-timed-out
  27.057487     [D3:74045:main-loop] get_var: dummy@localhost:12345 / numlogins
...
  46.067005     [D3:74045:main-loop] get_var: dummy@localhost:12345 / numlogins
  47.065677     [D3:74031:priv-parent] upsmon parent: wait for child returned status=0, pid=0
  47.067370     [D3:74045:main-loop] doshutdown: ping data server(s) for this client to remain not-timed-out
  47.067436     [D3:74045:main-loop] get_var: dummy@localhost:12345 / numlogins
  48.065926     [D3:74031:priv-parent] upsmon parent: wait for child returned status=0, pid=0
  48.067826     [D3:74045:main-loop] doshutdown: ping data server(s) for this client to remain not-timed-out
  48.067898     [D3:74045:main-loop] get_var: dummy@localhost:12345 / numlogins
  48.068226     [D3:74045:main-loop] doshutdown: SHUTDOWNEXIT timeout expired
  48.068277     [D1:74045:main-loop] doshutdown: current exit_flag=0
  48.068286     Exiting upsmon program after initiating shutdown
  48.068331     [D1:74045:main-loop] close_sdbus_once: trying
  48.069438     [D2:74045:main-loop] a upsmon sub-process (main-loop) is exiting now
  49.066213     [D3:74031:priv-parent] upsmon parent: wait for child returned status=0, pid=74045
  49.066334     [D1:74031:priv-parent] upsmon parent: child has exited
  49.066374     [D1:74031:priv-parent] upsmon parent: exit_flag=0
  49.066395     upsmon parent: Exiting after calling (0) shutdown command: echo "`date`: $$: TESTING_DUMMY_SHUTDOWN_NOW" | tee -a "/home/jim/nut/tests/NIT/tmp-issue-3084s/run/upsmon.sd.log"
  49.066433     [D2:74031:priv-parent] a upsmon sub-process (priv-parent) is exiting now

@jimklimov
Copy link
Copy Markdown
Member Author

@dvdesolve : can you please test if this works better for you now (also regarding "defunct" processes that should now be better reaped)?

Hoping that CI does not complain substantially, and in case of Appveyor builds for Windows - does not overflow its 1hr allowance...

@jimklimov
Copy link
Copy Markdown
Member Author

...overflowed, but managed to upload the built tarball just before:
https://ci.appveyor.com/api/buildjobs/w3b15almg85qdd3u/artifacts/NUT-for-Windows-x86_64-SNAPSHOT-2.8.4.3543-master.7z

@jimklimov
Copy link
Copy Markdown
Member Author

CI: Agent out of space, irrelevant

@dvdesolve
Copy link
Copy Markdown
Contributor

I will try to test fixes tomorrow. Should I also update primary controller to the test build as well as secondary systems?

@jimklimov
Copy link
Copy Markdown
Member Author

Probably better update both roles, changes should have impacted all sides (esp. the better visible logging, at least).

…rify who normally runs as root or not [networkupstools#3084]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@AppVeyorBot
Copy link
Copy Markdown

@dvdesolve
Copy link
Copy Markdown
Contributor

Just tested 2.8.4.3544 build on all related devices:

  • primary: Arch Linux 64-bit
  • secondary 1: Arch Linux 64-bit
  • secondary 2: Windows 7 Pro 64 bit

In total I've conducted 4 tests: 2 tests per secondary in which one shutdown process took longer than HOSTSYNC setting on primary (45 > 20) and other took lesser (45 < 120). Here are the summary of the results.

Test 1: secondary = Arch Linux; timeout = 45; HOSTSYNC = 120

Log from primary (in reverse order):

Sep 30 08:50:32 phit-user01 nut-monitor[965]: upsmon parent: Exiting after calling (0) shutdown command: /sbin/shutdown -h +0
Sep 30 08:50:32 phit-user01 nut-monitor[966]: Exiting upsmon program after initiating shutdown
Sep 30 08:50:32 phit-user01 nut-monitor[966]: doshutdown: call parent pipe for shutdown (async)
Sep 30 08:50:27 phit-user01 nut-monitor[966]: Auto logout and shutdown proceeding
Sep 30 08:50:27 phit-user01 nut-monitor[966]: Executing automatic power-fail shutdown
Sep 30 08:48:52 phit-user01 nut-server[1599]: Client upsmon@::1 set FSD on UPS [ups]
Sep 30 08:48:52 phit-user01 upsd[1599]: Client upsmon@::1 set FSD on UPS [ups]
Sep 30 08:48:52 phit-user01 nut-monitor[966]: Shutdown initiated; primary system is waiting for secondaries to log out or time out
Sep 30 08:48:52 phit-user01 nut-monitor[966]: Got an FSD signal, initiating forced shutdown
Sep 30 08:48:52 phit-user01 nut-monitor[966]: Signal 10: User requested FSD

Log from secondary (in reverse order + some info on PIDs)

Sep 30 08:49:47 desolve-laptop systemd-logind[708]: System is powering down.
Sep 30 08:49:47 desolve-laptop systemd-logind[708]: The system will power off now!
Sep 30 08:49:47 desolve-laptop nut-monitor[1396]: TEST SLEEP ENDED
Sep 30 08:49:02 desolve-laptop nut-monitor[1396]: TEST SLEEP STARTED
Sep 30 08:49:02 desolve-laptop nut-monitor[772]: Configured to not exit upsmon after initiating shutdown
Sep 30 08:49:02 desolve-laptop nut-monitor[772]: doshutdown: call parent pipe for shutdown (async)
Sep 30 08:48:57 desolve-laptop nut-monitor[772]: Auto logout and shutdown proceeding
Sep 30 08:48:57 desolve-laptop nut-monitor[772]: Executing automatic power-fail shutdown
Sep 30 08:48:57 desolve-laptop nut-monitor[772]: Too few UPS(es) are healthy (0<1), initiating forced shutdown
Sep 30 08:48:57 desolve-laptop nut-monitor[772]: UPS ups@user01.phit: forced shutdown in progress

PIDs before
nut          765  0.0  0.0  36468  6612 ?        Ss   08:47   0:00 /usr/bin/upsd -FF
root         771  0.0  0.0  11556  4148 ?        Ss   08:47   0:00 /usr/bin/upsmon -F
nut          772  0.6  0.1  13728  8288 ?        S    08:47   0:00 /usr/bin/upsmon -F
nut         1394  0.0  0.0      0     0 ?        Z    08:48   0:00 [upsmon] <defunct>

PIDs after
nut          765  0.0  0.0  36468  6612 ?        Ss   08:47   0:00 /usr/bin/upsd -FF
root         771  0.0  0.0  11556  4148 ?        Ss   08:47   0:00 /usr/bin/upsmon -F
nut          772  0.4  0.1  13728  8288 ?        S    08:47   0:00 /usr/bin/upsmon -F

Summary: both systems powered off, however UPS wasn't turned off - actually, it wasn't turned off in every test. I think something wrong with my system for now because it used to work earlier. In log from primary I can see these suspicious lines (took from latest run):

Sep 30 10:02:20 phit-user01 nutdrv_qx[704]: WARNING: send_to_all: write 29 bytes to socket 10 failed (ret=-1), disconnecting.: Broken pipe
Sep 30 10:02:20 phit-user01 nut-driver@ups[704]: WARNING: send_to_all: write 29 bytes to socket 10 failed (ret=-1), disconnecting.: Broken pipe

Anyway, primary went off as soon as secondary went away.

Test 2: secondary = Arch Linux; timeout = 45; HOSTSYNC = 20

Log from primary (in reverse order):

Sep 30 09:32:08 phit-user01 nut-monitor[969]: upsmon parent: Exiting after calling (0) shutdown command: /sbin/shutdown -h +0
Sep 30 09:32:08 phit-user01 nut-monitor[970]: Exiting upsmon program after initiating shutdown
Sep 30 09:32:08 phit-user01 nut-monitor[970]: doshutdown: call parent pipe for shutdown (async)
Sep 30 09:32:03 phit-user01 nut-monitor[970]: Auto logout and shutdown proceeding
Sep 30 09:32:03 phit-user01 nut-monitor[970]: Executing automatic power-fail shutdown
Sep 30 09:32:03 phit-user01 nut-monitor[970]: Host sync timer expired, forcing shutdown
Sep 30 09:31:42 phit-user01 nut-server[1869]: Client upsmon@::1 set FSD on UPS [ups]
Sep 30 09:31:42 phit-user01 upsd[1869]: Client upsmon@::1 set FSD on UPS [ups]
Sep 30 09:31:42 phit-user01 nut-monitor[970]: Shutdown initiated; primary system is waiting for secondaries to log out or time out
Sep 30 09:31:42 phit-user01 nut-monitor[970]: Got an FSD signal, initiating forced shutdown
Sep 30 09:31:42 phit-user01 nut-monitor[970]: Signal 10: User requested FSD

Log from secondary (in reverse order + some info on PIDs)

Sep 30 09:32:32 desolve-laptop systemd-logind[711]: System is powering down.
Sep 30 09:32:32 desolve-laptop systemd-logind[711]: The system will power off now!
Sep 30 09:32:32 desolve-laptop nut-monitor[1384]: TEST SLEEP ENDED
Sep 30 09:31:47 desolve-laptop nut-monitor[1384]: TEST SLEEP STARTED
Sep 30 09:31:47 desolve-laptop nut-monitor[775]: Configured to not exit upsmon after initiating shutdown
Sep 30 09:31:47 desolve-laptop nut-monitor[775]: doshutdown: call parent pipe for shutdown (async)
Sep 30 09:31:42 desolve-laptop nut-monitor[775]: Auto logout and shutdown proceeding
Sep 30 09:31:42 desolve-laptop nut-monitor[775]: Executing automatic power-fail shutdown
Sep 30 09:31:42 desolve-laptop nut-monitor[775]: Too few UPS(es) are healthy (0<1), initiating forced shutdown
Sep 30 09:31:42 desolve-laptop nut-monitor[775]: UPS ups@user01.phit: forced shutdown in progress

PIDs before
nut          768  0.0  0.0  36468  6784 ?        Ss   09:30   0:00 /usr/bin/upsd -FF
root         774  0.0  0.0  11556  4160 ?        Ss   09:30   0:00 /usr/bin/upsmon -F
nut          775  0.9  0.1  13728  8060 ?        S    09:30   0:00 /usr/bin/upsmon -F
nut         1381  0.0  0.0      0     0 ?        Z    09:31   0:00 [upsmon] <defunct>

PIDs after
nut          768  0.0  0.0  36468  6784 ?        Ss   09:30   0:00 /usr/bin/upsd -FF
root         774  0.0  0.0  11556  4160 ?        Ss   09:30   0:00 /usr/bin/upsmon -F
nut          775  0.5  0.1  13728  8060 ?        S    09:30   0:00 /usr/bin/upsmon -F

Summary: both systems powered off, but primary went off earlier (as seen from timestamps). These two results look pretty good!

Test 3: secondary = Windows 7; timeout = 45; HOSTSYNC = 120

Log from primary (in reverse order):

Sep 30 10:13:02 phit-user01 nut-monitor[975]: upsmon parent: Exiting after calling (0) shutdown command: /sbin/shutdown -h +0
Sep 30 10:13:02 phit-user01 nut-monitor[976]: Exiting upsmon program after initiating shutdown
Sep 30 10:13:02 phit-user01 nut-monitor[976]: doshutdown: call parent pipe for shutdown (async)
Sep 30 10:12:57 phit-user01 nut-monitor[976]: Auto logout and shutdown proceeding
Sep 30 10:12:57 phit-user01 nut-monitor[976]: Executing automatic power-fail shutdown
Sep 30 10:12:57 phit-user01 nut-monitor[976]: Host sync timer expired, forcing shutdown
Sep 30 10:10:56 phit-user01 nut-server[1572]: Client upsmon@::1 set FSD on UPS [ups]
Sep 30 10:10:56 phit-user01 upsd[1572]: Client upsmon@::1 set FSD on UPS [ups]
Sep 30 10:10:56 phit-user01 nut-monitor[976]: Shutdown initiated; primary system is waiting for secondaries to log out or time out
Sep 30 10:10:56 phit-user01 nut-monitor[976]: Got an FSD signal, initiating forced shutdown
Sep 30 10:10:56 phit-user01 nut-monitor[976]: Signal 10: User requested FSD

Summary: secondary system went off as it should but for some reason primary has waited all long till HOSTSYNC despite secondary being already asleep. nut.exe on secondary was terminated as soon as FSD was seen and upsmon.exe stayed until shutdown.exe /s /t 10 incantation in test script (added 10 more seconds to the 45 to recheck things before shutdown).

Test 4: secondary = Windows 7; timeout = 45; HOSTSYNC = 20

Log from primary (in reverse order):

Sep 30 10:02:20 phit-user01 nut-monitor[975]: upsmon parent: Exiting after calling (0) shutdown command: /sbin/shutdown -h +0
Sep 30 10:02:20 phit-user01 nut-monitor[976]: Exiting upsmon program after initiating shutdown
Sep 30 10:02:20 phit-user01 nut-monitor[976]: doshutdown: call parent pipe for shutdown (async)
Sep 30 10:02:15 phit-user01 nut-monitor[976]: Auto logout and shutdown proceeding
Sep 30 10:02:15 phit-user01 nut-monitor[976]: Executing automatic power-fail shutdown
Sep 30 10:02:15 phit-user01 nut-monitor[976]: Host sync timer expired, forcing shutdown
Sep 30 10:01:54 phit-user01 nut-server[3334]: Client upsmon@::1 set FSD on UPS [ups]
Sep 30 10:01:54 phit-user01 upsd[3334]: Client upsmon@::1 set FSD on UPS [ups]
Sep 30 10:01:54 phit-user01 nut-monitor[976]: Shutdown initiated; primary system is waiting for secondaries to log out or time out
Sep 30 10:01:54 phit-user01 nut-monitor[976]: Got an FSD signal, initiating forced shutdown
Sep 30 10:01:54 phit-user01 nut-monitor[976]: Signal 10: User requested FSD

Summary: both systems acted as they should (behavior is identical to the Test 2)

@jimklimov
Copy link
Copy Markdown
Member Author

@dvdesolve : thanks, great info and analysis!

  • Test 1

In the first test, I see a gap between Sep 30 08:49:47 desolve-laptop systemd-logind[708]: System is powering down. on the secondary, and Sep 30 08:50:27 phit-user01 nut-monitor[966]: Executing automatic power-fail shutdown. My guess is that this half a minute is somehow from the loss of connection (secondary was killed by system or maybe alive until networking or even power went down, so possibly without a proper closing of TCP connection - and the data server only knows it lost the client after a timeout, maybe higher debug verbosity on upsd could reveal when the secondary session was considered finished and why; at some level you would probably see the regular queries for number of connected clients, until there are no more).

Do you think this is a problem to pursue further? Not sure whether/how we even can reliably say goodbye in case we intentionally stay up until the end - the two concepts just don't fit together :)

The "defunct" PID 1394 is probably part of the forking stack (who called system()?) of the sleeping script in PID 1396, so waits for it to complete and does not respond to anything while in the syscall...

Overall, the secondary did not live long enough for the primary to reach the forced shutdown after HOSTSYNC=120, right?

In log from primary I can see these suspicious lines (took from latest run):
Sep 30 10:02:20 phit-user01 nut-driver@ups[704]: WARNING: send_to_all: write 29 bytes to socket 10 failed (ret=-1), disconnecting.: Broken pipe

This seems like the upsd (with its pipe/socket) went away before the driver did, so the driver could not post its regular update. Had this happened during normal life (data server crash/admin restart) the driver should have reconnected and posted the info later.

  • Test 2

Here the "Host sync timer expired, forcing shutdown" on the primary after 20 sec, as expected.
The secondary was "Configured to not exit upsmon after initiating shutdown" and so slept for 45 sec and the system went off, as expected.
Defunct PID 1381 may be due to script in PID 1384, as elaborated above.

both systems powered off, but primary went off earlier

Had the UPS shutdown command worked on the primary, the secondary's life might have been cut short by this; more or less as expected (a matter of setting the timings right on end-user side).

  • Test 3:

As noted in Test 1, maybe more verbose logging of upsd would reveal why/when it decided that the secondary logged off (or timed out), especially as the secondary client is configured to never exit so there is never a proper ending to the connection.
Also, when did the Windows box lose power - long before the 2 minutes elapsed, or closer to that? (Maybe upsmon.exe was alive through much of that shutdown routine).

  • PS:

...however UPS wasn't turned off - actually, it wasn't turned off in every test. I think something wrong with my system for now because it used to work earlier.

This is actually interesting, do you know when it worked "earlier", with which NUT version? Notably, there was a change with NUT v2.8.3 via PR #2686 - which revised how drivers handle the different shutdown operations (because not all devices do it similarly, and because for testing sometimes we want to see that the device was talked to during shutdown, but do not want to actually lose power to the rack - so have it e.g. beep instead). Does this seem related (could that add a regression)?

Otherwise, check that the https://github.com/networkupstools/nut/blob/master/scripts/systemd/nutshutdown.in or equivalent logic is in your primary's shutdown routine, and that the POWERDOWNFLAG file is configured and properly created, and later checked by this logic to actually call the driver program to tell the UPS to go off, and that the program can be launched (not too much of the file system is unmounted - you can read the program, libs, config, flag file at that point. Read-only FS should be OK and expected at that time.

@dvdesolve
Copy link
Copy Markdown
Contributor

Regarding lag between secondary goes off and primary sees it: IIRC I've faced such behavior even earlier during my tests. And it seems to be some kind of "dead" connection handling - anyway, at this stage secondary already should be in "off" state (that's why connection hung - may be systemd or other service manager killed networking earlier than upsmon been able to say goodbye) so primary can safely proceed with its further deals. In any case HOSTSYNC setting is specially designed for tackling such hangs. So I think this shouldn't be a problem at all - primary goal is to give a chance for secondaries to park them properly (for HOSTSYNC seconds at most); what happens afterwards - we can't be responsible for that.

Regarding powering off secondary systems during tests: for safety reason I've connected them to a separate power source if my setup was wrong (and it was because of playing with HOSTSYNC and waiting timeouts). So primary shut-offs haven't impacted secondary systems.

Test 3: upsmon.exe lived till the very shutdown: I've emulated long-wait with timeout /T 45 in test batch file followed by shutdown /s /t 10. Secondary waited 45 secs and then imminent shutdown popup window appeared, as it should. The only suspicious thing is why primary waited for HOSTSYNC seconds all the way long. May be 45 + 10 + actual shutdown time (quite laggy secondary!) + ~20 secs hung connection exceeded HOSTSYNC limit on primary.

Regarding UPS not going off: IIRC 2.8.2 and earlier versions worked great with Ippon Back Basic 1500/2200. Moreover, 2.8.3 on my CentOS Stream 9 (aarch64) controllers shuts down Ippon Innova RT 3/1 20K UPSes quite well (tested earlier).

@dvdesolve
Copy link
Copy Markdown
Contributor

Tested again on stable 2.8.4. UPS haven't been shutted down. Excerpt from the journal (filtered by NUT and UPS words):

сен 30 14:29:23 phit-user01 systemd[1]: Removed slice Slice /system/nut-driver.
сен 30 14:29:23 phit-user01 systemd[1]: Stopped Network UPS Tools - device driver for NUT device 'ups'.
сен 30 14:29:23 phit-user01 systemd[1]: nut-driver@ups.service: Deactivated successfully.
сен 30 14:29:22 phit-user01 nutdrv_qx[971]: Signal 15: exiting
сен 30 14:29:22 phit-user01 nut-driver@ups[2799]: Network UPS Tools upsdrvctl - UPS driver controller 2.8.4 release
сен 30 14:29:22 phit-user01 systemd[1]: Stopping Network UPS Tools - device driver for NUT device 'ups'...
сен 30 14:29:22 phit-user01 systemd[1]: Stopped target Network UPS Tools - target for power device drivers on this system.
сен 30 14:29:22 phit-user01 systemd[1]: Stopped Network UPS Tools - power devices information server.
сен 30 14:29:22 phit-user01 systemd[1]: nut-server.service: Deactivated successfully.
сен 30 14:29:22 phit-user01 upsd[975]: Signal 15: exiting
сен 30 14:29:22 phit-user01 upsd[975]: mainloop: Interrupted system call
сен 30 14:29:22 phit-user01 systemd[1]: Stopping Network UPS Tools - power devices information server...
сен 30 14:29:22 phit-user01 nut-server[975]: Signal 15: exiting
сен 30 14:29:22 phit-user01 nut-server[975]: mainloop: Interrupted system call
сен 30 14:29:22 phit-user01 systemd[1]: Stopped target Network UPS Tools - target for power device drivers, data server and monitoring client (if enabled) on this system.
сен 30 14:29:22 phit-user01 systemd[1]: Stopped Network UPS Tools - power device monitor and shutdown controller.
сен 30 14:29:22 phit-user01 systemd[1]: nut-monitor.service: Deactivated successfully.
сен 30 14:29:17 phit-user01 nut-monitor[981]: Auto logout and shutdown proceeding
сен 30 14:29:17 phit-user01 nut-monitor[981]: Executing automatic power-fail shutdown
сен 30 14:29:17 phit-user01 upsd[975]: Client upsmon@::1 set FSD on UPS [ups]
сен 30 14:29:17 phit-user01 nut-server[975]: Client upsmon@::1 set FSD on UPS [ups]
сен 30 14:29:15 phit-user01 nut-monitor[981]: Signal 10: User requested FSD

@jimklimov
Copy link
Copy Markdown
Member Author

jimklimov commented Sep 30, 2025

I am not sure there are logs written locally at the time systemd-shutdown hooks run, maybe the filesystems are read-only by that time. The current implementation at https://github.com/networkupstools/nut/blob/master/scripts/systemd/nutshutdown.in would try to write to /dev/console if it can, so if you can see it (on screen, serial console, UDP log sink via https://www.kernel.org/doc/Documentation/networking/netconsole.txt etc.) you may get at least the script's messages there.

Probably you can tweak its copy in /usr/lib/systemd/system-shutdown/nutshutdown to start the driver with debug as well (normally specify debug_min in /etc/nut/ups.conf - see https://github.com/networkupstools/nut/wiki/Changing-NUT-daemon-debug-verbosity for details) AND stream the driver's stdout+stderr to /dev/console as well, to better see how it goes or stumbles.

Also, making sure the driver works to shut down the UPS in greenhouse conditions (like systemctl stop nut-driver@ups ; nutdrv_qx -a ups -DDDDDD -k) as a first step would be helpful.

But I guess this investigation belongs in another issue, not directly related to this PR which I guess can be considered as a success.

A possible follow-up could be to extend the NUT networked protocol with a keyword that a client could hint to the server that it began the shutdown, so if there is any connection loss, it would be treated as logged off more quickly than it maybe happens today. Maybe the regular query to get client count as added in this and recent PRs, or some other looped query, could be treated as some sort of heartbeat in FSD mode - it already is used to fill the ether and avoid upsd session timeout, but not yet used to actually force such timeout when the heartbeat stops. But this would likely be a different issue, possibly with changes to RFC 9271 (protocol definition) etc. and so can take longer - especially if there are additions to the vocabulary of the protocol.

@dvdesolve
Copy link
Copy Markdown
Contributor

Thank you a lot for the efforts!

jimklimov added a commit to jimklimov/nut that referenced this pull request Sep 30, 2025
… sort [networkupstools#3086]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov jimklimov changed the title upsmon: fix SHUTDOWNEXIT behavior; tag sub-processes in log records upsmon: fix SHUTDOWNEXIT behavior; tag sub-processes in debug log records Sep 30, 2025
@jimklimov jimklimov added the ready / gonna merge The PR is in final cycles leading to merge unless someone logs an objection before we hit the button label Sep 30, 2025
@jimklimov jimklimov merged commit 90c3514 into networkupstools:master Sep 30, 2025
25 of 28 checks passed
@jimklimov jimklimov deleted the issue-3084 branch September 30, 2025 14:53
@jimklimov jimklimov added the service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug label Nov 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug impacts-release-2.8.1 Issues reported against NUT release 2.8.1 (maybe vanilla or with minor packaging tweaks) impacts-release-2.8.2 Issues reported against NUT release 2.8.2 (maybe vanilla or with minor packaging tweaks) impacts-release-2.8.3 Issues reported against NUT release 2.8.3 (maybe vanilla or with minor packaging tweaks) impacts-release-2.8.4 Issues reported against NUT release 2.8.4 (maybe vanilla or with minor packaging tweaks) ready / gonna merge The PR is in final cycles leading to merge unless someone logs an objection before we hit the button service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug Shutdowns and overrides and battery level triggers Issues and PRs about system shutdown, especially if battery charge/runtime remaining is involved upsmon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SHUTDOWNEXIT flag for upsmon doesn't work on Windows

5 participants