
fix(setup): make a fresh VPS install survive its first reboot (scheduler + start-services + firewall persistence)#28

Merged
DavidsonGomes merged 3 commits into evolution-foundation:develop from NeritonDias:fix/setup-stop-overwriting-start-services
Apr 22, 2026

Conversation


@NeritonDias NeritonDias commented Apr 22, 2026

Summary

Three fixes that together make a fresh OCI/VPS install actually survive its first reboot end-to-end. Validated on Oracle Cloud Ubuntu 24.04 (clone → make setup → reboot → 3 processes up + Cloudflare returns 200).

Originally split as #28 + #29; unified into one PR because the three fixes are only effective together (see validation below).

Fix 1 — stop overwriting start-services.sh (preserves #27's self-discovery + restores scheduler)

#27 made start-services.sh self-discovering so the systemd unit comes up cleanly after a VPS reboot regardless of install path or service user. That part shipped. The accompanying setup.py change was missed during the squash merge — so setup.py::main() still rewrites start-services.sh on every make setup invocation, clobbering the in-git self-discovering file with a hardcoded version that ALSO silently drops the scheduler.py launch line:

startup_script.write_text(f"""#!/bin/bash
...
nohup node dashboard/terminal-server/bin/server.js ...
nohup {install_dir}/.venv/bin/python app.py ...
# <<< no scheduler line >>>
""")

Cascading effects on a fresh wizard install:

  • ps -ef | grep scheduler.py → empty. The scheduler never runs, so cron-style routines (morning briefings, integration sync, daily digest) never fire until an operator launches it manually.
  • The self-discovering script from #27 disappears from disk — setup.py replaces it with the hardcoded variant, so on the next rename/relocate of the install, reboots break again.
  • logs/scheduler.log is never created — a silent failure mode (no error, no log, just a missing process).

Fix: drop the regeneration block. The file in git is now the single source of truth. setup.py just ensures it's executable.
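The post-fix behavior reduces to a few lines. A minimal sketch (the helper name is hypothetical, not the actual setup.py function), which also surfaces the missing-file case rather than silently skipping it:

```python
from pathlib import Path

def ensure_startup_script_executable(install_dir: Path) -> bool:
    """Hypothetical sketch of the post-fix setup.py step: never rewrite
    start-services.sh, just mark the git-tracked copy executable."""
    script = install_dir / "start-services.sh"
    if not script.exists():
        # Surface the problem (e.g. a partial copy of the repo) instead
        # of leaving the operator with a non-startable install.
        print(f"warning: {script} not found - incomplete checkout?")
        return False
    script.chmod(0o755)  # trust the canonical in-git content
    return True
```

Because setup.py no longer writes the file, operator edits to start-services.sh survive repeated make setup runs.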

Fix 2 — scheduler.py mkdir parent of PID file

After Fix 1 finally lets the scheduler line in start-services.sh actually run, the scheduler still crashes on a fresh clone:

Traceback (most recent call last):
  File "scheduler.py", line 31, in acquire_lock
    fd = os.open(str(PID_FILE), os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
FileNotFoundError: [Errno 2] No such file or directory:
  '/home/<user>/evo-nexus/ADWs/logs/scheduler.pid'

acquire_lock() opens ADWs/logs/scheduler.pid with O_CREAT|O_EXCL, but that flag combination only creates the FILE — not the parent directory. ADWs/logs/ is not tracked in git (no .gitkeep) and setup.py::create_folders() only creates the user-facing workspace dirs from config["folders"], so on a fresh clone the directory simply does not exist.

Fix: PID_FILE.parent.mkdir(parents=True, exist_ok=True) before the open. Idempotent.
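A minimal sketch of the fixed acquire_lock(), based on the traceback above; the PID-writing details are assumptions, not the exact scheduler.py code:

```python
import os
from pathlib import Path

# Path taken from the traceback above, relative to the install dir.
PID_FILE = Path("ADWs/logs/scheduler.pid")

def acquire_lock(pid_file: Path = PID_FILE) -> bool:
    """Sketch of the fixed lock acquisition."""
    # The fix: O_CREAT|O_EXCL creates the file but never its parent
    # directory, so create the directory first. Idempotent on restart.
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    try:
        fd = os.open(str(pid_file), os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False  # another scheduler instance already holds the lock
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))
    return True
```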

Fix 3 — persist firewall rules + report errors instead of swallowing them

Reproduced on OCI Ubuntu 24.04: wizard prints "✓ Portas do firewall abertas (80, 443)" but the dashboard is unreachable from outside, and after a reboot the iptables rules vanish entirely. Cloudflare returns 523 "Origin Unreachable" the whole time.

Three bugs in the original one-liner:

os.system("ufw allow 80/tcp 2>/dev/null; ufw allow 443/tcp 2>/dev/null; ...")
os.system("iptables -I INPUT -p tcp --dport 80 -j ACCEPT 2>/dev/null; ...")
print("✓ Firewall ports opened (80, 443)")  # always prints, regardless
  1. 2>/dev/null swallows every error. OCI/Ubuntu cloud images don't ship ufw — every ufw line silently fails. The iptables fallback often runs, but errors (permission, nf_tables backend rejection) are invisible.
  2. Nothing calls netfilter-persistent save (or writes to /etc/iptables/rules.v4). Even when iptables -I succeeds, the next reboot reloads the persistent ruleset which doesn't include 80/443 → dashboard offline.
  3. Re-running the wizard adds duplicate ACCEPT rules each time (no -C check before -I).

Refactor: new helper _open_firewall_ports(ports) that prefers ufw when present (it persists itself), falls back to iptables with -C idempotency check before -I, persists via netfilter-persistent save (auto-installing iptables-persistent non-interactively on Debian/Ubuntu if missing; last-resort writes /etc/iptables/rules.v4 directly). Surfaces actual errors instead of silencing.

Plus best-effort cloud-provider detection (OCI, AWS, GCP, Azure, DigitalOcean, Hetzner) via /sys/class/dmi/id/* — prints a hint that host-level firewall changes alone may not be enough; the operator likely also needs to open the port in the cloud Security List/Group/NSG. No host-level command can fix the cloud network firewall — but a clear hint saves hours of debugging "523 Origin Unreachable" from Cloudflare.
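The DMI probe could look like the sketch below; the vendor-string table is an illustrative guess, not the PR's actual mapping:

```python
from pathlib import Path

# Illustrative vendor substrings; the PR's actual lookup table may differ.
_DMI_HINTS = {
    "oracle": "Oracle Cloud (OCI)",
    "amazon": "AWS",
    "google": "GCP",
    "microsoft": "Azure",
    "digitalocean": "DigitalOcean",
    "hetzner": "Hetzner",
}

def detect_cloud_provider(dmi_dir=Path("/sys/class/dmi/id")):
    """Best-effort provider detection from DMI vendor strings.
    Returns a provider label, or None on bare metal / unknown."""
    text = ""
    for name in ("sys_vendor", "chassis_vendor", "product_name", "bios_vendor"):
        try:
            text += (dmi_dir / name).read_text().lower()
        except OSError:
            continue  # file missing or unreadable: stay best-effort
    for needle, label in _DMI_HINTS.items():
        if needle in text:
            return label
    return None
```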

7 new translation keys, mirrored across en-US / pt-BR / es. Bundles remain at exact key parity (160 each).

End-to-end validation on a fresh Oracle Cloud Ubuntu 24.04 install

$ sudo git clone --branch <this-branch> https://... /root/evonexus
$ cd /root/evonexus && make setup    # answers: language=pt-BR, mode=domain
  ✓ Portas do firewall abertas (80, 443)
  ✓ Regras persistidas via netfilter-persistent (vão sobreviver ao reboot)
    Firewall do provedor: se 80/443 ainda aparecerem bloqueados de fora, abra também na Security List/Group do seu provedor (Oracle Cloud (OCI)).

$ ps -ef | grep -E 'app.py|scheduler.py|terminal-server' | grep -v grep
ubuntu  6533  ... node dashboard/terminal-server/bin/server.js
ubuntu  6534  ... .venv/bin/python scheduler.py
ubuntu  6535  ... .venv/bin/python app.py

$ sudo iptables-save | grep -E 'dport (80|443)'
-A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 80 -j ACCEPT

$ sudo reboot
# … SSH back in …

$ ps -ef | grep -E 'app.py|scheduler.py|terminal-server' | grep -v grep
ubuntu   869  ... node dashboard/terminal-server/bin/server.js
ubuntu   870  ... .venv/bin/python scheduler.py
ubuntu   871  ... .venv/bin/python app.py

$ sudo iptables-save | grep -E 'dport (80|443)'
-A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 80 -j ACCEPT

$ curl -sI https://nexus.need.app.br/ | head -1
HTTP/2 200

All three fixes needed together: without #1 the scheduler line is missing, without #2 the scheduler line crashes, without #3 the dashboard is firewalled off externally and dies on the first reboot.

Test plan

  • python -c "import ast; ast.parse(open('setup.py',encoding='utf-8').read())" — clean parse
  • Translation parity — 160 keys per bundle, set-diff empty
  • OCI Ubuntu 24.04 fresh install + reboot → 3 processes up + iptables persisted + HTTPS 200 from Cloudflare
  • Re-running wizard idempotent — no duplicate iptables INPUT rules
  • Ubuntu desktop with ufw — rules go in via ufw, persist automatically, no extra cloud hint shown

Breaking changes

None.

  • The canonical start-services.sh already exists in git (as of fix(systemd): make start-services.sh self-discovering — service was silently failing on reboot #27); behavior is strictly a superset.
  • Operator customizations to start-services.sh now survive make setup re-runs (previously got clobbered).
  • iptables-persistent is auto-installed only when iptables is the active backend AND apt-get is available.
  • Cloud-provider hint is informational; if no provider is detected (bare metal, unknown hypervisor), nothing extra is printed.

fix(setup): stop overwriting start-services.sh — preserve the self-discovering version from evolution-foundation#27 and re-enable scheduler

evolution-foundation#27 made start-services.sh self-discovering (resolves SCRIPT_DIR
at runtime) so the systemd unit comes up cleanly after a VPS reboot
regardless of install path or service user. That part shipped.

The accompanying setup.py change was missed during the merge. As a
result, `setup.py::main()` still rewrites start-services.sh on every
`make setup` invocation — clobbering the in-git self-discovering
file with a hardcoded version that ALSO silently drops the
`scheduler.py` launch line:

    nohup node dashboard/terminal-server/bin/server.js ...
    nohup {install_dir}/.venv/bin/python app.py ...
    # <<< no scheduler line >>>

Cascading effects on a fresh wizard install:

  * `ps -ef | grep scheduler.py` → empty. The scheduler never runs.
    Cron-style routines (morning briefings, integration sync, daily
    digest) never fire until an operator manually launches it.
  * The self-discovering script from evolution-foundation#27 disappears from disk —
    setup.py replaces it with the hardcoded variant. So on the next
    rename/relocate of the install, reboots break again.
  * ``logs/scheduler.log`` is never created — silent failure mode
    (no error, no log, just missing process).

Fix: drop the regeneration block. The file in git is now the single
source of truth. setup.py just ensures it's executable
(``chmod 755``) and trusts the canonical content. Verified on a
fresh VPS install (Ubuntu 24.04, SUDO_USER=ubuntu, install at
``/home/ubuntu/evo-nexus``):

  * Before: `ps` shows 2 processes (terminal-server + app.py)
  * After:  `ps` shows 3 processes (terminal-server + scheduler + app.py)
  * Reboot: all 3 come back up automatically

sourcery-ai Bot commented Apr 22, 2026

Reviewer's Guide

This PR stops setup.py from regenerating start-services.sh with a hard-coded script body and instead treats the git-tracked, self-discovering start-services.sh as the single source of truth, only ensuring it is executable so that scheduler.py is launched and operator customizations are preserved across make setup runs.

Sequence diagram for service startup via make_setup and start_services_sh

sequenceDiagram
    actor Operator
    participant Make as make
    participant SetupPy as setup_py_main
    participant StartScript as start_services_sh
    participant TerminalServer as terminal_server
    participant AppPy as app_py
    participant SchedulerPy as scheduler_py

    Operator->>Make: run make setup
    Make->>SetupPy: call main

    alt Before_fix
        SetupPy->>StartScript: overwrite with hardcoded_body
        SetupPy->>StartScript: chmod 755
        Operator->>StartScript: execute
        StartScript->>TerminalServer: start
        StartScript->>AppPy: start
        note over SchedulerPy,StartScript: scheduler_py never launched
    else After_fix
        SetupPy->>StartScript: locate git_tracked_script
        SetupPy->>StartScript: if exists then chmod 755
        Operator->>StartScript: execute
        StartScript->>TerminalServer: start
        StartScript->>AppPy: start
        StartScript->>SchedulerPy: start
    end

Flow diagram for setup_py handling of start_services_sh before and after fix

flowchart TD
    A["make setup"] --> B["setup.py main"]

    subgraph Before_fix
        B --> C_before["Create startup_script path"]
        C_before --> D_before["Write hardcoded heredoc to start_services.sh"]
        D_before --> E_before["chmod 755 start_services.sh"]
        E_before --> F_before["start_services.sh lacks scheduler launch and self discovery"]
    end

    subgraph After_fix
        B --> C_after["Create startup_script path"]
        C_after --> D_after{"start_services.sh exists?"}
        D_after -->|yes| E_after["chmod 755 start_services.sh"]
        D_after -->|no| G_after["do nothing (keep absence)"]
        E_after --> H_after["Use git tracked self discovering script with scheduler launch"]
    end

File-Level Changes

File: setup.py
Change: Stop regenerating start-services.sh in setup.py and rely on the canonical git-tracked script, only fixing its permissions when present.
Details:
  • Remove the heredoc-style write_text call that recreated start-services.sh with hardcoded paths and without the scheduler launch.
  • Add explanatory comments documenting that start-services.sh is now sourced from git, is self-discovering, and already includes scheduler startup.
  • Change setup.py to locate start-services.sh under the install directory and chmod it to 0o755 only if it already exists, avoiding overwriting local modifications.


@sourcery-ai sourcery-ai Bot left a comment


Hey - I've left some high level feedback:

  • If start-services.sh is missing (e.g., in non-git or partially copied installs), make setup will now silently skip creating it; consider either erroring or logging a clear message in the else branch so operators aren’t left with a non-startable install.

@NeritonDias NeritonDias marked this pull request as draft April 22, 2026 20:29
fix(scheduler): mkdir parent of PID file so scheduler doesn't crash on fresh clone

`acquire_lock()` opens `ADWs/logs/scheduler.pid` with `O_CREAT|O_EXCL`,
but that flag combination only creates the FILE — not the parent
directory. `ADWs/logs/` is not tracked in git (no `.gitkeep`) and
`setup.py::create_folders()` only creates the user-facing workspace
dirs from `config["folders"]`, so on a fresh clone the directory
simply does not exist.

Result on every fresh wizard install:

    Traceback (most recent call last):
      File "scheduler.py", line 189, in <module>
        main()
      File "scheduler.py", line 157, in main
        if not acquire_lock():
      File "scheduler.py", line 31, in acquire_lock
        fd = os.open(str(PID_FILE), os.O_CREAT | os.O_EXCL | ...)
    FileNotFoundError: [Errno 2] No such file or directory:
      '/home/<user>/evo-nexus/ADWs/logs/scheduler.pid'

The scheduler exits in <50 ms — no log file beyond the traceback,
no routine ever executes (briefings, integration sync, daily
digests are all dead). systemd doesn't notice because the unit is
oneshot+nohup and the script kept going.

Fix: `PID_FILE.parent.mkdir(parents=True, exist_ok=True)` before the
open. Idempotent, safe on every restart.
fix(setup): persist firewall rules + actually report errors instead of swallowing them

Reproduced on Oracle Cloud (Ubuntu 24.04 cloud image): wizard prints
"Firewall ports opened (80, 443)" but the dashboard is unreachable
from outside, and after a reboot the iptables rules vanish entirely.

Three bugs in the original one-liner:

    os.system("ufw allow 80/tcp 2>/dev/null; ufw allow 443/tcp 2>/dev/null; ...")
    os.system("iptables -I INPUT -p tcp --dport 80 -j ACCEPT 2>/dev/null; ...")
    print("Firewall ports opened")  # always prints, regardless

  1. ``2>/dev/null`` swallows every error. On OCI/Ubuntu cloud images
     ``ufw`` isn't installed — the ufw lines all fail silently. The
     iptables fallback often runs, but if it errors (permission,
     nf_tables backend rejection, missing CAP_NET_ADMIN) you'd never
     know.
  2. Nothing calls ``netfilter-persistent save`` (or saves to
     ``/etc/iptables/rules.v4``). Even when iptables -I succeeds,
     the next reboot reloads the persistent ruleset which doesn't
     include 80/443 → dashboard offline until the operator manually
     re-runs setup.
  3. Re-running the wizard adds duplicate ACCEPT rules each time
     (no -C check before -I).

Refactor:

  * New helper ``_open_firewall_ports(ports)`` that prefers ufw when
    present (it persists itself), falls back to iptables with -C
    idempotency check, and PERSISTS via netfilter-persistent —
    auto-installing iptables-persistent on Debian/Ubuntu if missing.
    Falls back further to ``iptables-save > /etc/iptables/rules.v4``.
  * Surfaces actual errors instead of silencing. Reports which
    backend was used and which persistence path succeeded.
  * Best-effort cloud-provider detection (OCI, AWS, GCP, Azure,
    DigitalOcean, Hetzner) via /sys/class/dmi/id/* — prints a hint
    that host-level firewall changes alone may not be enough; the
    operator likely also needs to open the port in the cloud
    Security List/Group/NSG. (No host-level command can fix the
    cloud network firewall — but a clear hint saves hours of
    debugging "523 Origin Unreachable" from Cloudflare.)

Translation keys: 7 new, mirrored across en-US / pt-BR / es. Bundles
remain at exact key parity (160 each).

Verified locally:
  * Oracle Cloud Ubuntu 24.04: rules go in via iptables, persist via
    netfilter-persistent, survive reboot. Hint about OCI Security
    List shown.
  * Ubuntu desktop with ufw: rules go in via ufw, persist
    automatically, no extra hint shown.
  * Re-running wizard: idempotent (no duplicate INPUT rules).
@NeritonDias NeritonDias changed the title fix(setup): stop overwriting start-services.sh — preserve #27 self-discovery + restore scheduler fix(setup): make a fresh VPS install survive its first reboot (scheduler + start-services + firewall persistence) Apr 22, 2026
@NeritonDias NeritonDias marked this pull request as ready for review April 22, 2026 20:44

@sourcery-ai sourcery-ai Bot left a comment


Sorry @NeritonDias, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@DavidsonGomes DavidsonGomes merged commit e7add49 into evolution-foundation:develop Apr 22, 2026
1 check passed
@NeritonDias NeritonDias deleted the fix/setup-stop-overwriting-start-services branch April 24, 2026 06:09
jbmendonca pushed a commit to jbmendonca/evo-nexus that referenced this pull request May 15, 2026
…ler + start-services + firewall persistence) (evolution-foundation#28)

