fix(systemd): make start-services.sh self-discovering — service was silently failing on reboot #27

Merged
DavidsonGomes merged 3 commits into evolution-foundation:develop from NeritonDias:fix/start-services-self-discovering on Apr 22, 2026

Conversation

@NeritonDias
Contributor

Summary

Fixes a silent failure where the evo-nexus systemd service "starts" successfully on every VPS boot but no actual processes are running. The operator has to SSH in and run start-services.sh manually after every boot.

Repro

  1. Fresh install on Ubuntu 24.04 with the wizard set to use a non-default service user (anything other than the auto-created evonexus account — e.g. when SUDO_USER=ubuntu is preserved by sudo -i/sudo su and the install ends up under /home/ubuntu/evo-nexus).
  2. sudo systemctl reboot
  3. After boot: systemctl status evo-nexus reports active (exited). Browser hits 502/connection refused.
  4. ps -ef | grep -E 'app.py|scheduler.py|terminal-server' — no processes.
  5. bash /home/<user>/evo-nexus/start-services.sh manually → everything comes up.

Root cause

start-services.sh (committed to git) hard-codes the path /home/evonexus/evo-nexus everywhere — cd, log redirects, and .venv/bin/python invocations:

cd /home/evonexus/evo-nexus
nohup node dashboard/terminal-server/bin/server.js > /home/evonexus/evo-nexus/logs/terminal-server.log 2>&1 &
nohup /home/evonexus/evo-nexus/.venv/bin/python scheduler.py > /home/evonexus/evo-nexus/logs/scheduler.log 2>&1 &
nohup /home/evonexus/evo-nexus/.venv/bin/python app.py > /home/evonexus/evo-nexus/logs/dashboard.log 2>&1 &

This works only by coincidence in the single scenario _setup_systemd_service exercises (root + no SUDO_USER → auto-creates the evonexus user → installs to /home/evonexus/evo-nexus). For any other user or path:

  • cd /home/evonexus/evo-nexus fails silently — bash continues with the cwd inherited from systemd's WorkingDirectory= (which is the correct dir), so the bug is masked during single-shot tests.
  • nohup ... > /home/evonexus/evo-nexus/logs/<file> 2>&1 cannot open the redirect target → the spawned process dies before doing any work; nohup eats the error.
  • nohup /home/evonexus/evo-nexus/.venv/bin/python ... references a non-existent venv → ENOENT, same silent death.

Net effect: the oneshot script "succeeds" in <50 ms, systemd marks the unit active (exited), but no python/node processes are running.
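The pattern is easy to reproduce in isolation. A minimal demo (hypothetical paths, not repo code) of bash continuing past a failed cd and a backgrounded process dying on an unopenable redirect:

# hypothetical demo.sh, not repo code: why the failure is silent
cd /does/not/exist           # returns 1, but without `set -e` bash keeps going
echo "still running, cwd is $(pwd)"
# the forked shell opens the redirect target before the command runs;
# a missing directory kills the child before it does any work
nohup sleep 60 > /does/not/exist/logs/out.log 2>&1 &
sleep 0.2
pgrep -f 'sleep 60' || echo "no process: it died on the redirect"
exit 0                       # script still exits 0, so systemd reports success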

Fix

Resolve the install dir at runtime in start-services.sh:

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

Single source of truth — works for the evonexus user, for ubuntu, for manual installs under /opt/..., anywhere. Side benefits (the resulting script is sketched after this list):

  • install-service.sh no longer regenerates start-services.sh; just chmod + chown the checked-in version. Removes ~30 lines of fragile heredoc with \$ escaping.
  • Operator customizations to start-services.sh now survive re-runs of install-service.sh (previously got clobbered).
  • mkdir -p logs added so fresh installs and reboots after a manual rm -rf logs still work.
  • cd ... || exit 1 guards added — fail loudly instead of silently running from the wrong directory.
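
Put together, the patched script has roughly this shape (a sketch, not the exact committed file; the launch lines are the originals, rebased onto SCRIPT_DIR):

#!/usr/bin/env bash
# start-services.sh (sketch): resolve the install dir from the script's own
# location instead of hard-coding /home/evonexus/evo-nexus
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR" || exit 1   # fail loudly, never run from a wrong cwd
mkdir -p logs                # survives fresh installs and a manual rm -rf logs
nohup node dashboard/terminal-server/bin/server.js > "$SCRIPT_DIR/logs/terminal-server.log" 2>&1 &
nohup "$SCRIPT_DIR/.venv/bin/python" scheduler.py > "$SCRIPT_DIR/logs/scheduler.log" 2>&1 &
nohup "$SCRIPT_DIR/.venv/bin/python" app.py > "$SCRIPT_DIR/logs/dashboard.log" 2>&1 &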

Test plan

  • bash -n start-services.sh and bash -n install-service.sh parse cleanly
  • SCRIPT_DIR resolution returns the correct absolute path when the script is invoked from any cwd (spot check after this list)
  • Operator validation: fresh install → reboot → systemctl status evo-nexus shows active (exited) AND curl localhost:8080 returns the dashboard HTML (not connection refused)
  • Operator validation: ps -ef | grep -E 'app.py|scheduler.py|terminal-server' shows all three after reboot
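
For the SCRIPT_DIR bullet above, a quick spot check from an unrelated cwd (install path illustrative):

cd /tmp                      # anywhere but the install dir
bash /home/ubuntu/evo-nexus/start-services.sh
ps -ef | grep -E 'app.py|scheduler.py|terminal-server'   # expect all three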

Breaking changes

None. The new script behaves identically to the regenerated form for the existing evonexus install at /home/evonexus/evo-nexus — and starts working for every other layout.

DavidsonGomes and others added 3 commits April 22, 2026 12:48
…comes up after VPS reboot

Symptom (reproduced on Ubuntu 24.04): after a clean install + reboot,
the dashboard does NOT come back online — operators report needing
to manually run start-services.sh from the install dir to bring it
back. systemd actually fires the unit on boot, but the script silently
no-ops and exits 0, so systemd believes the service is "active" while
nothing is running.

Root cause: `start-services.sh` (committed to git) hard-codes the
path `/home/evonexus/evo-nexus` everywhere — `cd`, log redirects,
and the `.venv/bin/python` invocations. This works only by
coincidence in the single scenario the upstream `_setup_systemd_service`
exercises (root + no SUDO_USER → auto-creates `evonexus` user → install
copied to `/home/evonexus/evo-nexus`). For ANY other user/path:

  * `cd /home/evonexus/evo-nexus` fails silently — bash continues with
    cwd inherited from systemd's `WorkingDirectory=` (correct dir),
    masking the bug during single-shot tests.
  * `nohup ... > /home/evonexus/evo-nexus/logs/<file> 2>&1` cannot
    open the redirect target → the spawned process dies before doing
    any work, and nohup eats the error.
  * `nohup /home/evonexus/evo-nexus/.venv/bin/python ...` references
    a non-existent venv → ENOENT, same silent death.

Net effect on reboot: oneshot script "succeeds" in <50 ms, systemd
marks the unit `active (exited)`, but no python/node processes are
running. Operator has to SSH in and re-run things by hand.

Fix: resolve the install dir at runtime via
`SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"`. Single
source of truth — works for the `evonexus` user, for `ubuntu`, for
manual installs under `/opt/...`, anywhere. No regeneration step
required at install time.

Side-effects:
  * `install-service.sh` no longer regenerates start-services.sh;
    just chmod+chown the checked-in version. Removes ~30 lines of
    fragile heredoc.
  * Operator customizations to start-services.sh now survive
    re-runs of install-service.sh (previously got clobbered).
  * `mkdir -p logs` added so fresh installs / reboots after a manual
    `rm -rf logs` still work.
  * `cd ... || exit 1` guards added — fail loudly instead of silently
    running from the wrong directory.
@sourcery-ai (bot) left a comment
Sorry @NeritonDias, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

NeritonDias changed the base branch from main to develop on April 22, 2026 19:58
DavidsonGomes merged commit c3d76c2 into evolution-foundation:develop on Apr 22, 2026
1 check passed
DavidsonGomes pushed a commit that referenced this pull request Apr 22, 2026
…ler + start-services + firewall persistence) (#28)

* fix(setup): stop overwriting start-services.sh — preserve the self-discovering version from #27 and re-enable scheduler

#27 made start-services.sh self-discovering (resolves SCRIPT_DIR
at runtime) so the systemd unit comes up cleanly after a VPS reboot
regardless of install path or service user. That part shipped.

The accompanying setup.py change was missed during the merge. As a
result, `setup.py::main()` still rewrites start-services.sh on every
`make setup` invocation — clobbering the in-git self-discovering
file with a hardcoded version that ALSO silently drops the
`scheduler.py` launch line:

    nohup node dashboard/terminal-server/bin/server.js ...
    nohup {install_dir}/.venv/bin/python app.py ...
    # <<< no scheduler line >>>

Cascading effects on a fresh wizard install:

  * `ps -ef | grep scheduler.py` → empty. The scheduler never runs.
    Cron-style routines (morning briefings, integration sync, daily
    digest) never fire until an operator manually launches it.
  * The self-discovering script from #27 disappears from disk —
    setup.py replaces it with the hardcoded variant. So on the next
    rename/relocate of the install, reboots break again.
  * ``logs/scheduler.log`` is never created — silent failure mode
    (no error, no log, just missing process).

Fix: drop the regeneration block. The file in git is now the single
source of truth. setup.py just ensures it's executable
(``chmod 755``) and trusts the canonical content. Verified on a
fresh VPS install (Ubuntu 24.04, SUDO_USER=ubuntu, install at
``/home/ubuntu/evo-nexus``):

  * Before: `ps` shows 2 processes (terminal-server + app.py)
  * After:  `ps` shows 3 processes (terminal-server + scheduler + app.py)
  * Reboot: all 3 come back up automatically

* fix(scheduler): mkdir parent of PID file so scheduler doesn't crash on fresh clone

`acquire_lock()` opens `ADWs/logs/scheduler.pid` with `O_CREAT|O_EXCL`,
but that flag combination only creates the FILE — not the parent
directory. `ADWs/logs/` is not tracked in git (no `.gitkeep`) and
`setup.py::create_folders()` only creates the user-facing workspace
dirs from `config["folders"]`, so on a fresh clone the directory
simply does not exist.

Result on every fresh wizard install:

    Traceback (most recent call last):
      File "scheduler.py", line 189, in <module>
        main()
      File "scheduler.py", line 157, in main
        if not acquire_lock():
      File "scheduler.py", line 31, in acquire_lock
        fd = os.open(str(PID_FILE), os.O_CREAT | os.O_EXCL | ...)
    FileNotFoundError: [Errno 2] No such file or directory:
      '/home/<user>/evo-nexus/ADWs/logs/scheduler.pid'

The scheduler exits in <50 ms — no log file beyond the traceback,
no routine ever executes (briefings, integration sync, daily
digests are all dead). systemd doesn't notice because the unit is
oneshot+nohup and the script kept going.

Fix: `PID_FILE.parent.mkdir(parents=True, exist_ok=True)` before the
open. Idempotent, safe on every restart.

* fix(setup): persist firewall rules + actually report errors instead of swallowing them

Reproduced on Oracle Cloud (Ubuntu 24.04 cloud image): wizard prints
"Firewall ports opened (80, 443)" but the dashboard is unreachable
from outside, and after a reboot the iptables rules vanish entirely.

Three bugs in the original one-liner:

    os.system("ufw allow 80/tcp 2>/dev/null; ufw allow 443/tcp 2>/dev/null; ...")
    os.system("iptables -I INPUT -p tcp --dport 80 -j ACCEPT 2>/dev/null; ...")
    print("Firewall ports opened")  # always prints, regardless

  1. ``2>/dev/null`` swallows every error. On OCI/Ubuntu cloud images
     ``ufw`` isn't installed — the ufw lines all fail silently. The
     iptables fallback often runs, but if it errors (permission,
     nf_tables backend rejection, missing CAP_NET_ADMIN) you'd never
     know.
  2. Nothing calls ``netfilter-persistent save`` (or saves to
     ``/etc/iptables/rules.v4``). Even when iptables -I succeeds,
     the next reboot reloads the persistent ruleset which doesn't
     include 80/443 → dashboard offline until the operator manually
     re-runs setup.
  3. Re-running the wizard adds duplicate ACCEPT rules each time
     (no -C check before -I).

Refactor (the iptables fallback is sketched in shell after this list):

  * New helper ``_open_firewall_ports(ports)`` that prefers ufw when
    present (it persists itself), falls back to iptables with -C
    idempotency check, and PERSISTS via netfilter-persistent —
    auto-installing iptables-persistent on Debian/Ubuntu if missing.
    Falls back further to ``iptables-save > /etc/iptables/rules.v4``.
  * Surfaces actual errors instead of silencing. Reports which
    backend was used and which persistence path succeeded.
  * Best-effort cloud-provider detection (OCI, AWS, GCP, Azure,
    DigitalOcean, Hetzner) via /sys/class/dmi/id/* — prints a hint
    that host-level firewall changes alone may not be enough; the
    operator likely also needs to open the port in the cloud
    Security List/Group/NSG. (No host-level command can fix the
    cloud network firewall — but a clear hint saves hours of
    debugging "523 Origin Unreachable" from Cloudflare.)
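
In shell terms, the iptables fallback path looks roughly like this (an illustrative sketch; the real logic is the ``_open_firewall_ports`` helper in setup.py):

    for port in 80 443; do
        # -C only checks for the rule, so wizard re-runs stay idempotent
        iptables -C INPUT -p tcp --dport "$port" -j ACCEPT 2>/dev/null \
            || iptables -I INPUT -p tcp --dport "$port" -j ACCEPT
    done
    # persist across reboots: prefer netfilter-persistent, else dump the rules
    if command -v netfilter-persistent >/dev/null 2>&1; then
        netfilter-persistent save
    else
        iptables-save > /etc/iptables/rules.v4
    fi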

Translation keys: 7 new, mirrored across en-US / pt-BR / es. Bundles
remain at exact key parity (160 each).

Verified locally:
  * Oracle Cloud Ubuntu 24.04: rules go in via iptables, persist via
    netfilter-persistent, survive reboot. Hint about OCI Security
    List shown.
  * Ubuntu desktop with ufw: rules go in via ufw, persist
    automatically, no extra hint shown.
  * Re-running wizard: idempotent (no duplicate INPUT rules).
jbmendonca pushed a commit to jbmendonca/evo-nexus that referenced this pull request May 15, 2026
…comes up after VPS reboot (evolution-foundation#27)

jbmendonca pushed a commit to jbmendonca/evo-nexus that referenced this pull request May 15, 2026
…ler + start-services + firewall persistence) (evolution-foundation#28)
