fix(systemd): make start-services.sh self-discovering — service was silently failing on reboot #27
Merged
DavidsonGomes merged 3 commits on Apr 22, 2026
Conversation
…comes up after VPS reboot
Symptom (reproduced on Ubuntu 24.04): after a clean install + reboot,
the dashboard does NOT come back online — operators report needing
to manually run start-services.sh from the install dir to bring it
back. systemd actually fires the unit on boot, but the script silently
no-ops and exits 0, so systemd believes the service is "active" while
nothing is running.
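The unit file itself isn't part of this diff, but from the behaviour described it is a oneshot wrapper around the script — a sketch of what `_setup_systemd_service` presumably generates (directive values assumed, not copied from the repo; only the `evo-nexus` unit name and `WorkingDirectory=` are attested below):

    # /etc/systemd/system/evo-nexus.service — illustrative sketch only
    [Unit]
    Description=evo-nexus services
    After=network-online.target

    [Service]
    Type=oneshot
    RemainAfterExit=yes     # why systemd shows "active (exited)" after the script returns
    User=evonexus           # assumed; the auto-created service user
    WorkingDirectory=/home/evonexus/evo-nexus
    ExecStart=/home/evonexus/evo-nexus/start-services.sh

    [Install]
    WantedBy=multi-user.target

With `Type=oneshot` plus `RemainAfterExit=yes`, an exit status of 0 is all systemd checks — which is why the path bug described next goes unnoticed.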
Root cause: `start-services.sh` (committed to git) hard-codes the
path `/home/evonexus/evo-nexus` everywhere — `cd`, log redirects,
and the `.venv/bin/python` invocations. This works only by
coincidence in the single scenario the upstream `_setup_systemd_service`
exercises (root + no SUDO_USER → auto-creates `evonexus` user → install
copied to `/home/evonexus/evo-nexus`). For ANY other user/path:
* `cd /home/evonexus/evo-nexus` fails silently — bash continues with
cwd inherited from systemd's `WorkingDirectory=` (correct dir),
masking the bug during single-shot tests.
* `nohup ... > /home/evonexus/evo-nexus/logs/<file> 2>&1` cannot
open the redirect target → the spawned process dies before doing
any work, and nohup eats the error.
* `nohup /home/evonexus/evo-nexus/.venv/bin/python ...` references
a non-existent venv → ENOENT, same silent death.
Net effect on reboot: oneshot script "succeeds" in <50 ms, systemd
marks the unit `active (exited)`, but no python/node processes are
running. Operator has to SSH in and re-run things by hand.
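The first failure mode is easy to reproduce outside systemd — a minimal sketch with invented paths:

    #!/usr/bin/env bash
    # No `set -e`, no guard — mirrors the silent no-op described above.
    cd /no/such/dir                                # errors to stderr, returns 1...
    echo "still running in $(pwd)"                 # ...but bash carries on in the old cwd
    nohup sleep 30 > /no/such/dir/boot.log 2>&1 &  # child dies on the failed redirect; sleep never runs
    exit 0                                         # exits 0 anyway — systemd calls it a success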
Fix: resolve the install dir at runtime via
`SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"`. Single
source of truth — works for the `evonexus` user, for `ubuntu`, for
manual installs under `/opt/...`, anywhere. No regeneration step
required at install time.
Side-effects:
* `install-service.sh` no longer regenerates start-services.sh;
just chmod+chown the checked-in version. Removes ~30 lines of
fragile heredoc.
* Operator customizations to start-services.sh now survive
re-runs of install-service.sh (previously got clobbered).
* `mkdir -p logs` added so fresh installs / reboots after a manual
`rm -rf logs` still work.
* `cd ... || exit 1` guards added — fail loudly instead of silently
running from the wrong directory.
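Put together, the reworked prologue looks roughly like this — a sketch, with the launch line and log filename assumed for illustration rather than copied from the repo:

    #!/usr/bin/env bash
    # Resolve the install dir from the script's own on-disk location,
    # so the same file works for any user and any install path.
    SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

    cd "$SCRIPT_DIR" || exit 1     # fail loudly, never run from the wrong cwd
    mkdir -p logs                  # survives fresh installs and a manual `rm -rf logs`

    # Every launch line now derives its paths from SCRIPT_DIR:
    nohup "$SCRIPT_DIR/.venv/bin/python" app.py > "$SCRIPT_DIR/logs/app.log" 2>&1 &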
DavidsonGomes pushed a commit that referenced this pull request on Apr 22, 2026:
…ler + start-services + firewall persistence) (#28)

* fix(setup): stop overwriting start-services.sh — preserve the self-discovering version from #27 and re-enable scheduler

#27 made start-services.sh self-discovering (resolves SCRIPT_DIR at runtime) so the systemd unit comes up cleanly after a VPS reboot regardless of install path or service user. That part shipped. The accompanying setup.py change was missed during the merge. As a result, `setup.py::main()` still rewrites start-services.sh on every `make setup` invocation — clobbering the in-git self-discovering file with a hardcoded version that ALSO silently drops the `scheduler.py` launch line:

    nohup node dashboard/terminal-server/bin/server.js ...
    nohup {install_dir}/.venv/bin/python app.py ...
    # <<< no scheduler line >>>

Cascading effects on a fresh wizard install:

* `ps -ef | grep scheduler.py` → empty. The scheduler never runs. Cron-style routines (morning briefings, integration sync, daily digest) never fire until an operator manually launches it.
* The self-discovering script from #27 disappears from disk — setup.py replaces it with the hardcoded variant. So on the next rename/relocate of the install, reboots break again.
* ``logs/scheduler.log`` is never created — silent failure mode (no error, no log, just missing process).

Fix: drop the regeneration block. The file in git is now the single source of truth. setup.py just ensures it's executable (``chmod 755``) and trusts the canonical content.

Verified on a fresh VPS install (Ubuntu 24.04, SUDO_USER=ubuntu, install at ``/home/ubuntu/evo-nexus``):

* Before: `ps` shows 2 processes (terminal-server + app.py)
* After: `ps` shows 3 processes (terminal-server + scheduler + app.py)
* Reboot: all 3 come back up automatically

* fix(scheduler): mkdir parent of PID file so scheduler doesn't crash on fresh clone

`acquire_lock()` opens `ADWs/logs/scheduler.pid` with `O_CREAT|O_EXCL`, but that flag combination only creates the FILE — not the parent directory. `ADWs/logs/` is not tracked in git (no `.gitkeep`) and `setup.py::create_folders()` only creates the user-facing workspace dirs from `config["folders"]`, so on a fresh clone the directory simply does not exist. Result on every fresh wizard install:

    Traceback (most recent call last):
      File "scheduler.py", line 189, in <module>
        main()
      File "scheduler.py", line 157, in main
        if not acquire_lock():
      File "scheduler.py", line 31, in acquire_lock
        fd = os.open(str(PID_FILE), os.O_CREAT | os.O_EXCL | ...)
    FileNotFoundError: [Errno 2] No such file or directory: '/home/<user>/evo-nexus/ADWs/logs/scheduler.pid'

The scheduler exits in <50 ms — no log file beyond the traceback, no routine ever executes (briefings, integration sync, daily digests are all dead). systemd doesn't notice because the unit is oneshot+nohup and the script kept going.

Fix: `PID_FILE.parent.mkdir(parents=True, exist_ok=True)` before the open. Idempotent, safe on every restart.
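A minimal sketch of the patched lock acquisition, assuming the surrounding function shape — only the quoted `os.open` flags and the `mkdir` line come from the commit message; the rest is reconstructed:

    import os
    from pathlib import Path

    # Repo-root-relative path as quoted above; resolution method assumed.
    PID_FILE = Path(__file__).resolve().parent / "ADWs" / "logs" / "scheduler.pid"

    def acquire_lock() -> bool:
        # O_CREAT|O_EXCL creates only the file, never its parents, so make
        # sure ADWs/logs/ exists first. Idempotent, safe on every restart.
        PID_FILE.parent.mkdir(parents=True, exist_ok=True)
        try:
            fd = os.open(str(PID_FILE), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False  # another scheduler instance already holds the lock
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return True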
* fix(setup): persist firewall rules + actually report errors instead of swallowing them

Reproduced on Oracle Cloud (Ubuntu 24.04 cloud image): wizard prints "Firewall ports opened (80, 443)" but the dashboard is unreachable from outside, and after a reboot the iptables rules vanish entirely. Three bugs in the original one-liner:

    os.system("ufw allow 80/tcp 2>/dev/null; ufw allow 443/tcp 2>/dev/null; ...")
    os.system("iptables -I INPUT -p tcp --dport 80 -j ACCEPT 2>/dev/null; ...")
    print("Firewall ports opened")  # always prints, regardless

1. ``2>/dev/null`` swallows every error. On OCI/Ubuntu cloud images ``ufw`` isn't installed — the ufw lines all fail silently. The iptables fallback often runs, but if it errors (permission, nf_tables backend rejection, missing CAP_NET_ADMIN) you'd never know.
2. Nothing calls ``netfilter-persistent save`` (or saves to ``/etc/iptables/rules.v4``). Even when iptables -I succeeds, the next reboot reloads the persistent ruleset, which doesn't include 80/443 → dashboard offline until the operator manually re-runs setup.
3. Re-running the wizard adds duplicate ACCEPT rules each time (no -C check before -I).

Refactor:

* New helper ``_open_firewall_ports(ports)`` that prefers ufw when present (it persists itself), falls back to iptables with a -C idempotency check, and PERSISTS via netfilter-persistent — auto-installing iptables-persistent on Debian/Ubuntu if missing. Falls back further to ``iptables-save > /etc/iptables/rules.v4``.
* Surfaces actual errors instead of silencing them. Reports which backend was used and which persistence path succeeded.
* Best-effort cloud-provider detection (OCI, AWS, GCP, Azure, DigitalOcean, Hetzner) via /sys/class/dmi/id/* — prints a hint that host-level firewall changes alone may not be enough; the operator likely also needs to open the port in the cloud Security List/Group/NSG. (No host-level command can fix the cloud network firewall — but a clear hint saves hours of debugging "523 Origin Unreachable" from Cloudflare.)

Translation keys: 7 new, mirrored across en-US / pt-BR / es. Bundles remain at exact key parity (160 each).

Verified locally:

* Oracle Cloud Ubuntu 24.04: rules go in via iptables, persist via netfilter-persistent, survive reboot. Hint about OCI Security List shown.
* Ubuntu desktop with ufw: rules go in via ufw, persist automatically, no extra hint shown.
* Re-running wizard: idempotent (no duplicate INPUT rules).
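For shape, a sketch of such a helper under the behaviour described above — the name `_open_firewall_ports` comes from the commit message, the body is reconstructed, and the iptables-persistent auto-install, cloud-provider hinting, and i18n reporting are omitted:

    import shutil
    import subprocess

    def _open_firewall_ports(ports):
        """Open TCP ports, persist across reboots, surface real errors. Sketch only."""
        errors = []
        if shutil.which("ufw"):
            # ufw persists its own rules, so no extra save step is needed.
            for port in ports:
                r = subprocess.run(["ufw", "allow", f"{port}/tcp"],
                                   capture_output=True, text=True)
                if r.returncode != 0:
                    errors.append(f"ufw {port}: {r.stderr.strip()}")
            backend = "ufw"
        else:
            for port in ports:
                rule = ["INPUT", "-p", "tcp", "--dport", str(port), "-j", "ACCEPT"]
                # -C checks for an existing identical rule -> idempotent re-runs.
                if subprocess.run(["iptables", "-C", *rule],
                                  capture_output=True).returncode != 0:
                    r = subprocess.run(["iptables", "-I", *rule],
                                       capture_output=True, text=True)
                    if r.returncode != 0:
                        errors.append(f"iptables {port}: {r.stderr.strip()}")
            backend = "iptables"
            # iptables rules are volatile: persist them or lose them on reboot.
            if shutil.which("netfilter-persistent"):
                subprocess.run(["netfilter-persistent", "save"], capture_output=True)
            else:
                # Last-ditch persistence; assumes the Debian /etc/iptables layout.
                with open("/etc/iptables/rules.v4", "w") as f:
                    subprocess.run(["iptables-save"], stdout=f)
        return backend, errors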
jbmendonca pushed a commit to jbmendonca/evo-nexus that referenced this pull request on May 15, 2026:
…comes up after VPS reboot (evolution-foundation#27)
jbmendonca pushed a commit to jbmendonca/evo-nexus that referenced this pull request on May 15, 2026:
…ler + start-services + firewall persistence) (evolution-foundation#28)
Summary
Fixes a silent failure where the `evo-nexus` systemd service "starts" successfully on every VPS boot but no actual processes are running. Operator has to SSH in and run `start-services.sh` manually each time.
Repro
1. Install on a VPS under any account other than the `evonexus` account — e.g. when `SUDO_USER=ubuntu` is preserved by `sudo -i`/`sudo su` and the install ends up under `/home/ubuntu/evo-nexus`.
2. `sudo systemctl reboot`
3. `systemctl status evo-nexus` reports `active (exited)`. Browser hits 502/connection refused.
4. `ps -ef | grep -E 'app.py|scheduler.py|terminal-server'` — no processes.
5. Run `bash /home/<user>/evo-nexus/start-services.sh` manually → everything comes up.
Root cause
`start-services.sh` (committed to git) hard-codes the path `/home/evonexus/evo-nexus` everywhere — `cd`, log redirects, and `.venv/bin/python` invocations. This works only by coincidence in the single scenario `_setup_systemd_service` exercises (root + no `SUDO_USER` → auto-creates the `evonexus` user → installs to `/home/evonexus/evo-nexus`). For any other user or path:

* `cd /home/evonexus/evo-nexus` fails silently — bash continues with cwd inherited from systemd's `WorkingDirectory=` (which is the correct dir), so the bug is masked during single-shot tests.
* `nohup ... > /home/evonexus/evo-nexus/logs/<file> 2>&1` cannot open the redirect target → the spawned process dies before doing any work; nohup eats the error.
* `nohup /home/evonexus/evo-nexus/.venv/bin/python ...` references a non-existent venv → ENOENT, same silent death.

Net effect: the oneshot script "succeeds" in <50 ms, systemd marks the unit `active (exited)`, but no python/node processes are running.
Fix
Resolve the install dir at runtime in `start-services.sh`:

    SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

Single source of truth — works for the `evonexus` user, for `ubuntu`, for manual installs under `/opt/...`, anywhere. Side benefits:

* `install-service.sh` no longer regenerates `start-services.sh`; just `chmod`+`chown` the checked-in version. Removes ~30 lines of fragile heredoc with `\$` escaping.
* Operator customizations to `start-services.sh` now survive re-runs of `install-service.sh` (previously got clobbered).
* `mkdir -p logs` added so fresh installs and reboots after a manual `rm -rf logs` still work.
* `cd ... || exit 1` guards added — fail loudly instead of silently running from the wrong directory.
Test plan
* `bash -n start-services.sh` and `bash -n install-service.sh` parse cleanly
* `SCRIPT_DIR` resolution returns the correct absolute path when the script is invoked from any cwd (see the sketch below)
* `systemctl status evo-nexus` shows `active (exited)` AND `curl localhost:8080` returns the dashboard HTML (not connection refused)
* `ps -ef | grep -E 'app.py|scheduler.py|terminal-server'` shows all three after reboot
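The second item can be exercised with a throwaway check of the resolution idiom itself, independent of the repo (temp path invented for illustration):

    # The script should print its own directory, not the caller's cwd.
    cat > /tmp/whereami.sh <<'EOF'
    SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
    echo "$SCRIPT_DIR"
    EOF
    cd / && bash /tmp/whereami.sh   # prints /tmp no matter where it is called from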
Breaking changes
None. The new script behaves identically to the regenerated form for the existing `evonexus`-at-`/home/evonexus/evo-nexus` install — and starts working for every other layout.