fix(setup): make a fresh VPS install survive its first reboot (scheduler + start-services + firewall persistence) #28
Merged
DavidsonGomes merged 3 commits on Apr 22, 2026
Conversation
fix(setup): stop overwriting start-services.sh — preserve the self-discovering version from evolution-foundation#27 and re-enable scheduler

evolution-foundation#27 made start-services.sh self-discovering (resolves SCRIPT_DIR at runtime) so the systemd unit comes up cleanly after a VPS reboot regardless of install path or service user. That part shipped. The accompanying setup.py change was missed during the merge. As a result, `setup.py::main()` still rewrites start-services.sh on every `make setup` invocation — clobbering the in-git self-discovering file with a hardcoded version that ALSO silently drops the `scheduler.py` launch line:

```
nohup node dashboard/terminal-server/bin/server.js ...
nohup {install_dir}/.venv/bin/python app.py ...
# <<< no scheduler line >>>
```

Cascading effects on a fresh wizard install:

* `ps -ef | grep scheduler.py` → empty. The scheduler never runs. Cron-style routines (morning briefings, integration sync, daily digest) never fire until an operator manually launches it.
* The self-discovering script from evolution-foundation#27 disappears from disk — setup.py replaces it with the hardcoded variant. So on the next rename/relocate of the install, reboots break again.
* `logs/scheduler.log` is never created — silent failure mode (no error, no log, just missing process).

Fix: drop the regeneration block. The file in git is now the single source of truth. setup.py just ensures it's executable (`chmod 755`) and trusts the canonical content.

Verified on a fresh VPS install (Ubuntu 24.04, SUDO_USER=ubuntu, install at `/home/ubuntu/evo-nexus`):

* Before: `ps` shows 2 processes (terminal-server + app.py)
* After: `ps` shows 3 processes (terminal-server + scheduler + app.py)
* Reboot: all 3 come back up automatically
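The simplified setup step described above can be sketched roughly like this. This is a hedged sketch, assuming the behavior stated in the commit message; the function name `ensure_startup_script` and its layout are illustrative, not the project's exact code:

```python
# Illustrative sketch: trust the git-tracked start-services.sh and only
# ensure it is executable, instead of regenerating (and clobbering) it.
from pathlib import Path


def ensure_startup_script(install_dir: str) -> bool:
    """Return True if the git-tracked start-services.sh was made executable."""
    script = Path(install_dir) / "start-services.sh"
    if not script.exists():
        # Previously setup.py would regenerate the file here with a
        # hardcoded body; now its absence is simply reported instead.
        return False
    script.chmod(0o755)  # equivalent of `chmod 755 start-services.sh`
    return True
```

The key design choice is that setup.py no longer owns the script's content at all, so a future change to the in-git start-services.sh (such as the scheduler launch line) can never be silently dropped again.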
Reviewer's Guide (Sourcery)

This PR stops setup.py from regenerating start-services.sh with a hard-coded script body and instead treats the git-tracked, self-discovering start-services.sh as the single source of truth, only ensuring it is executable so that scheduler.py is launched and operator customizations are preserved across `make setup` runs.

Sequence diagram for service startup via `make setup` and `start-services.sh`:

```mermaid
sequenceDiagram
    actor Operator
    participant Make as make
    participant SetupPy as setup_py_main
    participant StartScript as start_services_sh
    participant TerminalServer as terminal_server
    participant AppPy as app_py
    participant SchedulerPy as scheduler_py
    Operator->>Make: run make setup
    Make->>SetupPy: call main
    alt Before_fix
        SetupPy->>StartScript: overwrite with hardcoded_body
        SetupPy->>StartScript: chmod 755
        Operator->>StartScript: execute
        StartScript->>TerminalServer: start
        StartScript->>AppPy: start
        note over SchedulerPy,StartScript: scheduler_py never launched
    else After_fix
        SetupPy->>StartScript: locate git_tracked_script
        SetupPy->>StartScript: if exists then chmod 755
        Operator->>StartScript: execute
        StartScript->>TerminalServer: start
        StartScript->>AppPy: start
        StartScript->>SchedulerPy: start
    end
```

Flow diagram for setup.py's handling of start-services.sh before and after the fix:

```mermaid
flowchart TD
    A["make setup"] --> B["setup.py main"]
    subgraph Before_fix
        B --> C_before["Create startup_script path"]
        C_before --> D_before["Write hardcoded heredoc to start_services.sh"]
        D_before --> E_before["chmod 755 start_services.sh"]
        E_before --> F_before["start_services.sh lacks scheduler launch and self discovery"]
    end
    subgraph After_fix
        B --> C_after["Create startup_script path"]
        C_after --> D_after{"start_services.sh exists?"}
        D_after -->|yes| E_after["chmod 755 start_services.sh"]
        D_after -->|no| G_after["do nothing (keep absence)"]
        E_after --> H_after["Use git tracked self discovering script with scheduler launch"]
    end
```
Hey - I've left some high level feedback:

- If `start-services.sh` is missing (e.g., in non-git or partially copied installs), `make setup` will now silently skip creating it; consider either erroring or logging a clear message in the `else` branch so operators aren't left with a non-startable install.
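One hypothetical way to address this reviewer suggestion is to fail loudly in the missing-file branch. The function name and error message below are illustrative, not from the PR:

```python
# Sketch of the reviewer's suggested behavior: raise instead of silently
# skipping when the git-tracked startup script is absent.
from pathlib import Path


def ensure_startup_script_or_fail(install_dir: str) -> None:
    script = Path(install_dir) / "start-services.sh"
    if not script.exists():
        # Surfacing the problem here prevents a "non-startable install"
        # that only shows up on the first reboot.
        raise FileNotFoundError(
            f"{script} is missing; the checkout looks incomplete. "
            "Restore it from git before re-running make setup."
        )
    script.chmod(0o755)
```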
fix(scheduler): mkdir parent of PID file so scheduler doesn't crash on fresh clone
`acquire_lock()` opens `ADWs/logs/scheduler.pid` with `O_CREAT|O_EXCL`,
but that flag combination only creates the FILE — not the parent
directory. `ADWs/logs/` is not tracked in git (no `.gitkeep`) and
`setup.py::create_folders()` only creates the user-facing workspace
dirs from `config["folders"]`, so on a fresh clone the directory
simply does not exist.
Result on every fresh wizard install:
```
Traceback (most recent call last):
  File "scheduler.py", line 189, in <module>
    main()
  File "scheduler.py", line 157, in main
    if not acquire_lock():
  File "scheduler.py", line 31, in acquire_lock
    fd = os.open(str(PID_FILE), os.O_CREAT | os.O_EXCL | ...)
FileNotFoundError: [Errno 2] No such file or directory:
'/home/<user>/evo-nexus/ADWs/logs/scheduler.pid'
```
The scheduler exits in <50 ms — no log file beyond the traceback,
no routine ever executes (briefings, integration sync, daily
digests are all dead). systemd doesn't notice because the unit is
oneshot+nohup and the script kept going.
Fix: `PID_FILE.parent.mkdir(parents=True, exist_ok=True)` before the
open. Idempotent, safe on every restart.
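In isolation, the fix looks like this. A minimal sketch: the project's `acquire_lock()` may differ in details beyond the added `mkdir` line:

```python
import os
from pathlib import Path


def acquire_lock(pid_file: Path) -> bool:
    # The one-line fix: O_CREAT|O_EXCL creates the file atomically but
    # never its parent directories, so create those first (idempotent).
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    try:
        fd = os.open(str(pid_file), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another scheduler instance already holds the lock
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True
```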
fix(setup): persist firewall rules + actually report errors instead of swallowing them
Reproduced on Oracle Cloud (Ubuntu 24.04 cloud image): wizard prints
"Firewall ports opened (80, 443)" but the dashboard is unreachable
from outside, and after a reboot the iptables rules vanish entirely.
Three bugs in the original one-liner:

```python
os.system("ufw allow 80/tcp 2>/dev/null; ufw allow 443/tcp 2>/dev/null; ...")
os.system("iptables -I INPUT -p tcp --dport 80 -j ACCEPT 2>/dev/null; ...")
print("Firewall ports opened")  # always prints, regardless
```
1. ``2>/dev/null`` swallows every error. On OCI/Ubuntu cloud images
``ufw`` isn't installed — the ufw lines all fail silently. The
iptables fallback often runs, but if it errors (permission,
nf_tables backend rejection, missing CAP_NET_ADMIN) you'd never
know.
2. Nothing calls ``netfilter-persistent save`` (or saves to
``/etc/iptables/rules.v4``). Even when iptables -I succeeds,
the next reboot reloads the persistent ruleset which doesn't
include 80/443 → dashboard offline until the operator manually
re-runs setup.
3. Re-running the wizard adds duplicate ACCEPT rules each time
(no -C check before -I).
Refactor:
* New helper ``_open_firewall_ports(ports)`` that prefers ufw when
present (it persists itself), falls back to iptables with -C
idempotency check, and PERSISTS via netfilter-persistent —
auto-installing iptables-persistent on Debian/Ubuntu if missing.
Falls back further to ``iptables-save > /etc/iptables/rules.v4``.
* Surfaces actual errors instead of silencing. Reports which
backend was used and which persistence path succeeded.
* Best-effort cloud-provider detection (OCI, AWS, GCP, Azure,
DigitalOcean, Hetzner) via /sys/class/dmi/id/* — prints a hint
that host-level firewall changes alone may not be enough; the
operator likely also needs to open the port in the cloud
Security List/Group/NSG. (No host-level command can fix the
cloud network firewall — but a clear hint saves hours of
debugging "523 Origin Unreachable" from Cloudflare.)
Translation keys: 7 new, mirrored across en-US / pt-BR / es. Bundles
remain at exact key parity (160 each).
Verified locally:
* Oracle Cloud Ubuntu 24.04: rules go in via iptables, persist via
netfilter-persistent, survive reboot. Hint about OCI Security
List shown.
* Ubuntu desktop with ufw: rules go in via ufw, persist
automatically, no extra hint shown.
* Re-running wizard: idempotent (no duplicate INPUT rules).
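The refactor described above can be sketched as follows. This is a hedged approximation of a `_open_firewall_ports`-style helper; the injectable `run`/`which` parameters are an assumption added here for testability, and the exact project code (error reporting, persistence fallbacks, auto-install) goes further than this sketch:

```python
# Sketch: prefer ufw (self-persisting), otherwise insert iptables rules
# idempotently (-C before -I) and persist via netfilter-persistent.
import shutil
import subprocess


def open_firewall_ports(ports, run=subprocess.run, which=shutil.which):
    """Return (backend_used, list_of_error_messages)."""
    errors = []
    if which("ufw"):
        for port in ports:
            result = run(["ufw", "allow", f"{port}/tcp"],
                         capture_output=True, text=True)
            if result.returncode != 0:
                errors.append(result.stderr.strip())
        return "ufw", errors  # ufw persists its own rules across reboots
    for port in ports:
        rule = ["INPUT", "-p", "tcp", "--dport", str(port), "-j", "ACCEPT"]
        # -C exits non-zero when the rule is absent; only then insert it,
        # so re-running the wizard never duplicates ACCEPT rules.
        if run(["iptables", "-C", *rule], capture_output=True).returncode != 0:
            result = run(["iptables", "-I", *rule],
                         capture_output=True, text=True)
            if result.returncode != 0:
                errors.append(result.stderr.strip())
    if which("netfilter-persistent"):
        run(["netfilter-persistent", "save"], capture_output=True)
    return "iptables", errors
```

Returning the backend and collected stderr (instead of discarding it with `2>/dev/null`) is what lets the wizard report which path was taken and whether it actually succeeded.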
jbmendonca pushed a commit to jbmendonca/evo-nexus that referenced this pull request on May 15, 2026
Summary
Three fixes that together make a fresh OCI/VPS install actually survive its first reboot end-to-end. Validated on Oracle Cloud Ubuntu 24.04 (clone → `make setup` → reboot → 3 processes up + Cloudflare returns 200). Originally split as #28 + #29 — unified into one PR because the fixes only land correctly together (see the end-to-end validation note below).
Fix 1 — stop overwriting `start-services.sh` (preserves #27's self-discovery + restores scheduler)

#27 made `start-services.sh` self-discovering so the systemd unit comes up cleanly after a VPS reboot regardless of install path or service user. That part shipped. The accompanying `setup.py` change was missed during the squash merge — so `setup.py::main()` still rewrites `start-services.sh` on every `make setup` invocation, clobbering the in-git self-discovering file with a hardcoded version that ALSO silently drops the `scheduler.py` launch line.

Cascading effects on a fresh wizard install:

* `ps -ef | grep scheduler.py` → empty. The scheduler never runs. Cron-style routines (morning briefings, integration sync, daily digests) never fire until an operator manually launches it.
* `logs/scheduler.log` is never created — silent failure mode.

Fix: drop the regeneration block. The file in git is now the single source of truth. `setup.py` just ensures it's executable.

Fix 2 — `scheduler.py`: mkdir parent of PID file

After Fix 1 finally lets the scheduler line in `start-services.sh` actually run, the scheduler still crashes on a fresh clone: `acquire_lock()` opens `ADWs/logs/scheduler.pid` with `O_CREAT|O_EXCL`, but that flag combination only creates the FILE — not the parent directory. `ADWs/logs/` is not tracked in git (no `.gitkeep`) and `setup.py::create_folders()` only creates the user-facing workspace dirs from `config["folders"]`, so on a fresh clone the directory simply does not exist.

Fix: `PID_FILE.parent.mkdir(parents=True, exist_ok=True)` before the open. Idempotent.

Fix 3 — persist firewall rules + report errors instead of swallowing them
Reproduced on OCI Ubuntu 24.04: the wizard prints "✓ Portas do firewall abertas (80, 443)" ("Firewall ports opened (80, 443)") but the dashboard is unreachable from outside, and after a reboot the iptables rules vanish entirely. Cloudflare returns 523 "Origin Unreachable" the whole time.

Three bugs in the original one-liner:

* `2>/dev/null` swallows every error. OCI/Ubuntu cloud images don't ship `ufw` — every ufw line silently fails. The iptables fallback often runs, but errors (permission, nf_tables backend rejection) are invisible.
* Nothing calls `netfilter-persistent save` (or writes to `/etc/iptables/rules.v4`). Even when `iptables -I` succeeds, the next reboot reloads the persistent ruleset, which doesn't include 80/443 → dashboard offline.
* Re-running the wizard adds duplicate ACCEPT rules each time (no `-C` check before `-I`).

Refactor: new helper `_open_firewall_ports(ports)` that prefers `ufw` when present (it persists itself), falls back to `iptables` with a `-C` idempotency check before `-I`, and persists via `netfilter-persistent save` (auto-installing `iptables-persistent` non-interactively on Debian/Ubuntu if missing; as a last resort it writes `/etc/iptables/rules.v4` directly). Surfaces actual errors instead of silencing them.

Plus best-effort cloud-provider detection (OCI, AWS, GCP, Azure, DigitalOcean, Hetzner) via `/sys/class/dmi/id/*` — prints a hint that host-level firewall changes alone may not be enough; the operator likely also needs to open the port in the cloud Security List/Group/NSG. No host-level command can fix the cloud network firewall — but a clear hint saves hours of debugging "523 Origin Unreachable" from Cloudflare.

7 new translation keys, mirrored across en-US / pt-BR / es. Bundles remain at exact key parity (160 each).
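The DMI-based cloud detection can be sketched roughly like this. The vendor-substring table and function name below are illustrative guesses, not the project's exact mapping:

```python
# Best-effort cloud-provider detection from DMI identification strings,
# as described in the commit message. Returns None when unknown.
from pathlib import Path

# Hypothetical vendor substrings -> provider labels (illustrative only).
_DMI_HINTS = {
    "oracle": "OCI",
    "amazon": "AWS",
    "google": "GCP",
    "microsoft": "Azure",
    "digitalocean": "DigitalOcean",
    "hetzner": "Hetzner",
}


def detect_cloud_provider(dmi_dir="/sys/class/dmi/id"):
    for name in ("sys_vendor", "product_name", "chassis_vendor", "bios_vendor"):
        path = Path(dmi_dir) / name
        try:
            text = path.read_text().strip().lower()
        except OSError:
            continue  # file absent or unreadable: try the next DMI field
        for needle, provider in _DMI_HINTS.items():
            if needle in text:
                return provider
    return None
```

Keeping this purely best-effort (no failure when `/sys` is unavailable) matches the commit's framing: the result only drives a printed hint about cloud Security Lists/Groups/NSGs, never a hard decision.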
End-to-end validation on a fresh Oracle Cloud Ubuntu 24.04 install
All three fixes are needed together: without #1 the scheduler launch line is missing, without #2 the scheduler crashes on startup, without #3 the dashboard is firewalled off externally and its rules vanish on the first reboot.
Test plan

* `python -c "import ast; ast.parse(open('setup.py', encoding='utf-8').read())"` — clean parse

Breaking changes
None.
* `start-services.sh` already exists in git (as of "fix(systemd): make start-services.sh self-discovering — service was silently failing on reboot" #27); behavior is strictly a superset.
* Operator customizations to `start-services.sh` now survive `make setup` re-runs (previously they got clobbered).
* `iptables-persistent` is auto-installed only when iptables is the active backend AND `apt-get` is available.