fix(systemd): make start-services.sh self-discovering — service was silently failing on reboot #27

Merged
DavidsonGomes merged 3 commits into evolution-foundation:develop from NeritonDias:fix/start-services-self-discovering on Apr 22, 2026

Conversation

@NeritonDias
Contributor

Summary

Fixes a silent failure where the evo-nexus systemd service "starts" successfully on every VPS boot but no actual processes are running. The operator has to SSH in and run start-services.sh manually after every boot.

Repro

  1. Fresh install on Ubuntu 24.04 with the wizard set to use a non-default service user (anything other than the auto-created evonexus account — e.g. when SUDO_USER=ubuntu is preserved by sudo -i/sudo su and the install ends up under /home/ubuntu/evo-nexus).
  2. sudo systemctl reboot
  3. After boot: systemctl status evo-nexus reports active (exited). Browser hits 502/connection refused.
  4. ps -ef | grep -E 'app.py|scheduler.py|terminal-server' — no processes.
  5. bash /home/<user>/evo-nexus/start-services.sh manually → everything comes up.

Root cause

start-services.sh (committed to git) hard-codes the path /home/evonexus/evo-nexus everywhere — cd, log redirects, and .venv/bin/python invocations:

cd /home/evonexus/evo-nexus
nohup node dashboard/terminal-server/bin/server.js > /home/evonexus/evo-nexus/logs/terminal-server.log 2>&1 &
nohup /home/evonexus/evo-nexus/.venv/bin/python scheduler.py > /home/evonexus/evo-nexus/logs/scheduler.log 2>&1 &
nohup /home/evonexus/evo-nexus/.venv/bin/python app.py > /home/evonexus/evo-nexus/logs/dashboard.log 2>&1 &

This works only by coincidence in the single scenario _setup_systemd_service exercises (root + no SUDO_USER → auto-creates the evonexus user → installs to /home/evonexus/evo-nexus). For any other user or path:

  • cd /home/evonexus/evo-nexus fails silently — bash continues with the cwd inherited from systemd's WorkingDirectory= (which is the correct dir), so the bug is masked during single-shot tests.
  • nohup ... > /home/evonexus/evo-nexus/logs/<file> 2>&1 cannot open the redirect target → the spawned process dies before doing any work; nohup eats the error.
  • nohup /home/evonexus/evo-nexus/.venv/bin/python ... references a non-existent venv → ENOENT, same silent death.

Net effect: the oneshot script "succeeds" in <50 ms, systemd marks the unit active (exited), but no python/node processes are running.
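The pattern is easy to reproduce in isolation. A minimal demo (hypothetical paths, not repo code) of bash continuing past a failed cd and a backgrounded process dying on an unopenable redirect:

# hypothetical demo.sh, not repo code: why the failure is silent
cd /does/not/exist           # returns 1, but without `set -e` bash keeps going
echo "still running, cwd is $(pwd)"
# the forked shell opens the redirect target before the command runs;
# a missing directory kills the child before it does any work
nohup sleep 60 > /does/not/exist/logs/out.log 2>&1 &
sleep 0.2
pgrep -f 'sleep 60' || echo "no process: it died on the redirect"
exit 0                       # script still exits 0, so systemd reports success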

Fix

Resolve the install dir at runtime in start-services.sh:

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

Single source of truth — works for the evonexus user, for ubuntu, for manual installs under /opt/..., anywhere. Side benefits (the resulting script is sketched after this list):

  • install-service.sh no longer regenerates start-services.sh; just chmod + chown the checked-in version. Removes ~30 lines of fragile heredoc with \$ escaping.
  • Operator customizations to start-services.sh now survive re-runs of install-service.sh (previously got clobbered).
  • mkdir -p logs added so fresh installs and reboots after a manual rm -rf logs still work.
  • cd ... || exit 1 guards added — fail loudly instead of silently running from the wrong directory.
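
Put together, the patched script has roughly this shape (a sketch, not the exact committed file; the launch lines are the originals, rebased onto SCRIPT_DIR):

#!/usr/bin/env bash
# start-services.sh (sketch): resolve the install dir from the script's own
# location instead of hard-coding /home/evonexus/evo-nexus
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR" || exit 1   # fail loudly, never run from a wrong cwd
mkdir -p logs                # survives fresh installs and a manual rm -rf logs
nohup node dashboard/terminal-server/bin/server.js > "$SCRIPT_DIR/logs/terminal-server.log" 2>&1 &
nohup "$SCRIPT_DIR/.venv/bin/python" scheduler.py > "$SCRIPT_DIR/logs/scheduler.log" 2>&1 &
nohup "$SCRIPT_DIR/.venv/bin/python" app.py > "$SCRIPT_DIR/logs/dashboard.log" 2>&1 &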

Test plan

  • bash -n start-services.sh and bash -n install-service.sh parse cleanly
  • SCRIPT_DIR resolution returns the correct absolute path when the script is invoked from any cwd (spot check after this list)
  • Operator validation: fresh install → reboot → systemctl status evo-nexus shows active (exited) AND curl localhost:8080 returns the dashboard HTML (not connection refused)
  • Operator validation: ps -ef | grep -E 'app.py|scheduler.py|terminal-server' shows all three after reboot
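
For the SCRIPT_DIR bullet above, a quick spot check from an unrelated cwd (install path illustrative):

cd /tmp                      # anywhere but the install dir
bash /home/ubuntu/evo-nexus/start-services.sh
ps -ef | grep -E 'app.py|scheduler.py|terminal-server'   # expect all three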

Breaking changes

None. The new script behaves identically to the regenerated form for the existing evonexus install at /home/evonexus/evo-nexus — and starts working for every other layout.

DavidsonGomes and others added 3 commits April 22, 2026 12:48
…comes up after VPS reboot

Symptom (reproduced on Ubuntu 24.04): after a clean install + reboot,
the dashboard does NOT come back online — operators report needing
to manually run start-services.sh from the install dir to bring it
back. systemd actually fires the unit on boot, but the script silently
no-ops and exits 0, so systemd believes the service is "active" while
nothing is running.

Root cause: `start-services.sh` (committed to git) hard-codes the
path `/home/evonexus/evo-nexus` everywhere — `cd`, log redirects,
and the `.venv/bin/python` invocations. This works only by
coincidence in the single scenario the upstream `_setup_systemd_service`
exercises (root + no SUDO_USER → auto-creates `evonexus` user → install
copied to `/home/evonexus/evo-nexus`). For ANY other user/path:

  * `cd /home/evonexus/evo-nexus` fails silently — bash continues with
    cwd inherited from systemd's `WorkingDirectory=` (correct dir),
    masking the bug during single-shot tests.
  * `nohup ... > /home/evonexus/evo-nexus/logs/<file> 2>&1` cannot
    open the redirect target → the spawned process dies before doing
    any work, and nohup eats the error.
  * `nohup /home/evonexus/evo-nexus/.venv/bin/python ...` references
    a non-existent venv → ENOENT, same silent death.

Net effect on reboot: oneshot script "succeeds" in <50 ms, systemd
marks the unit `active (exited)`, but no python/node processes are
running. Operator has to SSH in and re-run things by hand.

Fix: resolve the install dir at runtime via
`SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"`. Single
source of truth — works for the `evonexus` user, for `ubuntu`, for
manual installs under `/opt/...`, anywhere. No regeneration step
required at install time.

Side-effects:
  * `install-service.sh` no longer regenerates start-services.sh;
    just chmod+chown the checked-in version. Removes ~30 lines of
    fragile heredoc.
  * Operator customizations to start-services.sh now survive
    re-runs of install-service.sh (previously got clobbered).
  * `mkdir -p logs` added so fresh installs / reboots after a manual
    `rm -rf logs` still work.
  * `cd ... || exit 1` guards added — fail loudly instead of silently
    running from the wrong directory.
@sourcery-ai (bot) left a comment
Sorry @NeritonDias, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

NeritonDias changed the base branch from main to develop on April 22, 2026 19:58
DavidsonGomes merged commit c3d76c2 into evolution-foundation:develop on Apr 22, 2026
1 check passed
DavidsonGomes pushed a commit that referenced this pull request Apr 22, 2026
…ler + start-services + firewall persistence) (#28)

* fix(setup): stop overwriting start-services.sh — preserve the self-discovering version from #27 and re-enable scheduler

#27 made start-services.sh self-discovering (resolves SCRIPT_DIR
at runtime) so the systemd unit comes up cleanly after a VPS reboot
regardless of install path or service user. That part shipped.

The accompanying setup.py change was missed during the merge. As a
result, `setup.py::main()` still rewrites start-services.sh on every
`make setup` invocation — clobbering the in-git self-discovering
file with a hardcoded version that ALSO silently drops the
`scheduler.py` launch line:

    nohup node dashboard/terminal-server/bin/server.js ...
    nohup {install_dir}/.venv/bin/python app.py ...
    # <<< no scheduler line >>>

Cascading effects on a fresh wizard install:

  * `ps -ef | grep scheduler.py` → empty. The scheduler never runs.
    Cron-style routines (morning briefings, integration sync, daily
    digest) never fire until an operator manually launches it.
  * The self-discovering script from #27 disappears from disk —
    setup.py replaces it with the hardcoded variant. So on the next
    rename/relocate of the install, reboots break again.
  * ``logs/scheduler.log`` is never created — silent failure mode
    (no error, no log, just missing process).

Fix: drop the regeneration block. The file in git is now the single
source of truth. setup.py just ensures it's executable
(``chmod 755``) and trusts the canonical content. Verified on a
fresh VPS install (Ubuntu 24.04, SUDO_USER=ubuntu, install at
``/home/ubuntu/evo-nexus``):

  * Before: `ps` shows 2 processes (terminal-server + app.py)
  * After:  `ps` shows 3 processes (terminal-server + scheduler + app.py)
  * Reboot: all 3 come back up automatically

* fix(scheduler): mkdir parent of PID file so scheduler doesn't crash on fresh clone

`acquire_lock()` opens `ADWs/logs/scheduler.pid` with `O_CREAT|O_EXCL`,
but that flag combination only creates the FILE — not the parent
directory. `ADWs/logs/` is not tracked in git (no `.gitkeep`) and
`setup.py::create_folders()` only creates the user-facing workspace
dirs from `config["folders"]`, so on a fresh clone the directory
simply does not exist.

Result on every fresh wizard install:

    Traceback (most recent call last):
      File "scheduler.py", line 189, in <module>
        main()
      File "scheduler.py", line 157, in main
        if not acquire_lock():
      File "scheduler.py", line 31, in acquire_lock
        fd = os.open(str(PID_FILE), os.O_CREAT | os.O_EXCL | ...)
    FileNotFoundError: [Errno 2] No such file or directory:
      '/home/<user>/evo-nexus/ADWs/logs/scheduler.pid'

The scheduler exits in <50 ms — no log file beyond the traceback,
no routine ever executes (briefings, integration sync, daily
digests are all dead). systemd doesn't notice because the unit is
oneshot+nohup and the script kept going.

Fix: `PID_FILE.parent.mkdir(parents=True, exist_ok=True)` before the
open. Idempotent, safe on every restart.

* fix(setup): persist firewall rules + actually report errors instead of swallowing them

Reproduced on Oracle Cloud (Ubuntu 24.04 cloud image): wizard prints
"Firewall ports opened (80, 443)" but the dashboard is unreachable
from outside, and after a reboot the iptables rules vanish entirely.

Three bugs in the original one-liner:

    os.system("ufw allow 80/tcp 2>/dev/null; ufw allow 443/tcp 2>/dev/null; ...")
    os.system("iptables -I INPUT -p tcp --dport 80 -j ACCEPT 2>/dev/null; ...")
    print("Firewall ports opened")  # always prints, regardless

  1. ``2>/dev/null`` swallows every error. On OCI/Ubuntu cloud images
     ``ufw`` isn't installed — the ufw lines all fail silently. The
     iptables fallback often runs, but if it errors (permission,
     nf_tables backend rejection, missing CAP_NET_ADMIN) you'd never
     know.
  2. Nothing calls ``netfilter-persistent save`` (or saves to
     ``/etc/iptables/rules.v4``). Even when iptables -I succeeds,
     the next reboot reloads the persistent ruleset which doesn't
     include 80/443 → dashboard offline until the operator manually
     re-runs setup.
  3. Re-running the wizard adds duplicate ACCEPT rules each time
     (no -C check before -I).

Refactor (the iptables fallback is sketched in shell after this list):

  * New helper ``_open_firewall_ports(ports)`` that prefers ufw when
    present (it persists itself), falls back to iptables with -C
    idempotency check, and PERSISTS via netfilter-persistent —
    auto-installing iptables-persistent on Debian/Ubuntu if missing.
    Falls back further to ``iptables-save > /etc/iptables/rules.v4``.
  * Surfaces actual errors instead of silencing. Reports which
    backend was used and which persistence path succeeded.
  * Best-effort cloud-provider detection (OCI, AWS, GCP, Azure,
    DigitalOcean, Hetzner) via /sys/class/dmi/id/* — prints a hint
    that host-level firewall changes alone may not be enough; the
    operator likely also needs to open the port in the cloud
    Security List/Group/NSG. (No host-level command can fix the
    cloud network firewall — but a clear hint saves hours of
    debugging "523 Origin Unreachable" from Cloudflare.)
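
In shell terms, the iptables fallback path looks roughly like this (an illustrative sketch; the real logic is the ``_open_firewall_ports`` helper in setup.py):

    for port in 80 443; do
        # -C only checks for the rule, so wizard re-runs stay idempotent
        iptables -C INPUT -p tcp --dport "$port" -j ACCEPT 2>/dev/null \
            || iptables -I INPUT -p tcp --dport "$port" -j ACCEPT
    done
    # persist across reboots: prefer netfilter-persistent, else dump the rules
    if command -v netfilter-persistent >/dev/null 2>&1; then
        netfilter-persistent save
    else
        iptables-save > /etc/iptables/rules.v4
    fi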

Translation keys: 7 new, mirrored across en-US / pt-BR / es. Bundles
remain at exact key parity (160 each).

Verified locally:
  * Oracle Cloud Ubuntu 24.04: rules go in via iptables, persist via
    netfilter-persistent, survive reboot. Hint about OCI Security
    List shown.
  * Ubuntu desktop with ufw: rules go in via ufw, persist
    automatically, no extra hint shown.
  * Re-running wizard: idempotent (no duplicate INPUT rules).
jbmendonca pushed a commit to jbmendonca/evo-nexus that referenced this pull request May 15, 2026
…comes up after VPS reboot (evolution-foundation#27)

jbmendonca pushed a commit to jbmendonca/evo-nexus that referenced this pull request May 15, 2026
…ler + start-services + firewall persistence) (evolution-foundation#28)
