
CI: Debug logging and caching (3/4)#24

Closed
ejc3 wants to merge 23 commits into pr/ci-container from pr/ci-debug

Conversation


@ejc3 ejc3 commented Dec 26, 2025

Summary

Third of 4 PRs. Builds on #23.

CI Caching & Performance:

  • Rust cache with shared-key for cache reuse
  • Cargo cache for container builds
  • Save rust cache even on failure
  • Auto-cancel in-progress runs on new push
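The caching and auto-cancel items above map to standard GitHub Actions configuration; a sketch (the concurrency group name and shared-key value are illustrative, not taken from this repo's workflow):

```yaml
# Cancel superseded in-progress runs when a new push arrives on the same ref.
concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    steps:
      # Shared Rust cache across jobs; save the cache even if the job fails.
      - uses: Swatinem/rust-cache@v2
        with:
          shared-key: fcvm-ci
          cache-on-failure: true
```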

Layer 2 Setup Fixes:

  • Add proper error handling and package resolution
  • Fix VM shutdown in Layer 2 setup
  • Fix podman-in-podman for rootless container setup
  • Add --cgroups=disabled to podman command

CI Infrastructure:

  • Create and pass /dev/userfaultfd to container
  • Install cargo-audit/deny for CVSS 4.0 support
  • Separate lint tests from integration tests
  • Fix disk space exhaustion in CI snapshot tests

Docs:

  • Add NO HACKS policy to CLAUDE.md
  • Expand README setup section

Test plan

ejc3 added 23 commits December 25, 2025 16:58
Key changes:

1. dpkg failures now fail loudly with captured output
   - Uses tee to log dpkg output to /tmp/dpkg-install.log
   - Shows specific error messages on failure
   - Exits with clear error instead of continuing silently
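The fail-loud pattern described above can be sketched as follows (a minimal bash sketch; `run_logged` and the demo commands are illustrative, not the actual fcvm setup script):

```shell
#!/usr/bin/env bash
# Sketch of the fail-loud install step: capture command output with tee,
# then surface the log and stop instead of continuing silently.
set -uo pipefail

run_logged() {
  local log=$1; shift
  if ! "$@" 2>&1 | tee "$log" >/dev/null; then
    echo "FCVM_SETUP_FAILED: '$*' failed; last lines of $log:" >&2
    tail -n 5 "$log" >&2
    return 1
  fi
}

# Stand-ins for the real dpkg invocation.
run_logged /tmp/ok.log     echo "dpkg: all packages installed"
run_logged /tmp/broken.log sh -c 'echo "dpkg: dependency problems"; exit 1' \
  || echo "install failed, aborting setup"
```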

2. Setup completion verified with marker file
   - Writes /etc/fcvm-setup-complete on successful setup
   - Rust code mounts rootfs and verifies marker exists
   - Detects FCVM_SETUP_FAILED in serial output for early bail

3. Fixed package download using apt-get install --download-only
   - Previous apt-cache depends pulled conflicting alternatives
     (e.g., libqt5gui5t64 vs libqt5gui5-gles both downloaded)
   - Now uses apt-get which properly resolves dependencies
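The download step presumably looks something like this (the package names and archive directory are illustrative):

```shell
# apt-get resolves the dependency tree and picks exactly one alternative
# per virtual package, unlike `apt-cache depends`, which lists them all.
apt-get update
apt-get install --download-only -y \
  -o Dir::Cache::archives=/packages \
  podman crun skopeo
```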

4. Fixed dangling symlinks when writing config files
   - /etc/resolv.conf is symlink to /run/systemd/... in cloud image
   - Now removes symlinks before writing files
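A minimal sketch of the symlink fix, demoed in a temp directory rather than the real rootfs (`write_file` is an illustrative name, not from the fcvm code):

```shell
#!/usr/bin/env bash
# /etc/resolv.conf in the cloud image is a symlink into /run/systemd/...,
# which dangles inside the rootfs being built. Writing through it would
# follow the dead link, so remove the link first and write a regular file.
set -eu

write_file() {
  local dest=$1
  if [ -L "$dest" ]; then
    rm -f "$dest"   # drop the symlink, keep the path
  fi
  cat > "$dest"
}

d=$(mktemp -d)
ln -s /run/systemd/resolve/stub-resolv.conf "$d/resolv.conf"  # dangling here
echo "nameserver 1.1.1.1" | write_file "$d/resolv.conf"
```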

5. Added codename field to rootfs-plan.toml
   - Specifies target Ubuntu version (noble) for package download
   - Ensures packages match target, not host OS
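Only the new `codename` field is confirmed by the commit text; a hypothetical rootfs-plan.toml fragment:

```toml
# Pins the Ubuntu release used for package download so packages match
# the target rootfs rather than whatever OS the CI host runs.
codename = "noble"
```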

Tested: sudo fcvm setup && sudo fcvm podman run --name test --network bridged nginx:alpine
- Setup completes in ~15 seconds
- VM boots, pulls image, nginx serves HTTP
- Health checks pass
CLAUDE.md:
- Document package download via podman run ubuntu:noble
- Add setup verification with marker file
- Update hash calculation components

DESIGN.md:
- Expand fcvm setup command description with steps
- Add packages cache directory to data layout
- Document rootfs hash calculation
- Bump version to 2.3
Container job needs qemu-utils, e2fsprogs, podman, skopeo, busybox-static,
cpio, zstd on the host for setup-fcvm to work (rootfs creation).
Use sysrq trigger (echo o > /proc/sysrq-trigger) for reliable shutdown
instead of poweroff -f which doesn't work in minimal initrd environment.

The CI was timing out because poweroff -f failed silently and the VM
kept running for 15 minutes after setup completed.
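The shutdown sequence described above amounts to roughly the following (do not run this outside a disposable VM; it powers the machine off immediately):

```shell
# poweroff -f needs more userspace than the minimal initrd provides;
# the sysrq trigger asks the kernel to power off directly.
echo 1 > /proc/sys/kernel/sysrq    # enable sysrq if not already on
echo o > /proc/sysrq-trigger       # 'o' = immediate power off
```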
- Add container-setup-fcvm target that runs setup inside the container
  (container already has Firecracker, qemu-utils, etc.)
- Remove host Firecracker installation from Container CI job
- Use debugfs instead of mount for marker file verification (no root needed)
- Add sanity checks before writing marker file:
  - Verify podman, crun, skopeo binaries exist
  - Verify systemd exists
  - Verify /etc/resolv.conf exists
- Improved VM shutdown with /proc re-mount and multiple fallbacks
- Add container-setup-fcvm target that runs setup inside the container
  (container already has Firecracker, qemu-utils, etc.)
- Update container-test-fast/all to depend on container-setup-fcvm
- Add fdisk package to Containerfile (provides sfdisk for partition info)
- Use debugfs instead of mount for marker file verification (no root needed)
- Add sanity checks before writing marker file:
  - Verify podman, crun, skopeo binaries exist
  - Verify systemd exists
  - Verify /etc/resolv.conf exists
- Improved VM shutdown with /proc re-mount and multiple fallbacks
- Fix cargo fmt issues
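The rootless marker-file check with debugfs might look like this (the image filename is illustrative; the marker path is from the commit text):

```shell
# debugfs reads the ext4 image directly, so no root or loop mount is
# needed; a successful `stat` means setup wrote the completion marker.
debugfs -R 'stat /etc/fcvm-setup-complete' rootfs.ext4
```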
Add --cgroups=disabled to inner podman run command when downloading
packages. This allows package download to work inside rootless containers
where cgroup creation is not permitted.

The error was: "crun: create /sys/fs/cgroup/libpod_parent: Permission denied"

Tested: make container-setup-fcvm (completes in ~1 min)
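The inner invocation presumably resembles this (image and command are illustrative, not the exact fcvm download command):

```shell
# Inside a rootless outer container, crun cannot create cgroups
# ("create /sys/fs/cgroup/libpod_parent: Permission denied"), so disable
# them entirely for the short-lived package-download container.
podman run --rm --cgroups=disabled ubuntu:noble \
  sh -c 'apt-get update && apt-get install --download-only -y podman'
```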
- Add CARGO_CACHE_DIR variable to Makefile for mounting cache volumes
- Add actions/cache step to cache cargo registry and target between runs
- Mount cache into container for faster rebuilds

This caches both the cargo registry and target directory, so subsequent
runs skip downloading crates and recompiling unchanged dependencies.
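The actions/cache step sketched below shows the usual shape of this (the cache key and paths are illustrative, not copied from this repo's workflow):

```yaml
- name: Cache cargo registry and build artifacts
  uses: actions/cache@v4
  with:
    path: |
      ~/.cargo/registry
      target
    # Keyed on the lockfile so the cache invalidates when deps change.
    key: cargo-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
```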
The previous fix only updated the hash function, not the actual Command
that executes podman. This adds --cgroups=disabled to the real download
command at line 1552.
Remove duplicate script definition - now generate_download_script() is
used for both hashing AND execution. This prevents the bug where the
hash version had --cgroups=disabled but the execution version didn't.
Add lint-tests feature to gate fmt/clippy/audit/deny tests.
These were causing test-fast to fail due to corrupt cargo-audit DB.
Now run lint explicitly with: make lint
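Feature-gating a lint test in Rust presumably looks like this (the test name and body are hypothetical; only the `lint-tests` feature name comes from the commit text):

```rust
// Compiled only when the lint-tests feature is enabled (via `make lint`),
// keeping flaky external lookups like the RustSec advisory DB out of
// the fast test path.
#[cfg(feature = "lint-tests")]
#[test]
fn clippy_is_clean() {
    let status = std::process::Command::new("cargo")
        .args(["clippy", "--", "-D", "warnings"])
        .status()
        .expect("failed to run cargo clippy");
    assert!(status.success());
}
```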
The Host job was missing cargo-audit and cargo-deny, causing lint tests
to fail with 'unsupported CVSS version: 4.0' from the RustSec DB.

Added cargo install for both tools alongside cargo-nextest.
Root cause: 15 snapshot tests running in parallel, each creating a 5.6GB
snapshot (2GB memory + 3.6GB disk). With 20GB btrfs, only ~3 tests fit.

Changes:
- Increase btrfs loopback from 20G to 60G
- Add snapshot-tests group with max-threads=3 in nextest.toml
- Assign snapshot/clone tests to this group

This limits concurrent snapshots to ~17GB disk usage, well under the 60GB
limit. This belt-and-suspenders approach ensures CI stability.
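The nextest.toml changes described above would take roughly this form (the filter expression is illustrative; the real config names the exact tests):

```toml
# .config/nextest.toml — cap concurrent snapshot tests at 3
# (3 tests x ~5.6GB ≈ 17GB, well under the 60GB volume).
[test-groups.snapshot-tests]
max-threads = 3

[[profile.default.overrides]]
filter = 'test(snapshot) | test(clone)'
test-group = 'snapshot-tests'
```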
Container tests were failing with "userfaultfd access check failed" because
the Container job wasn't setting vm.unprivileged_userfaultfd=1.

The Host job already had this, but Container was missing it. Containers
inherit host sysctl settings, so setting it on the host before running
podman allows snapshot cloning to work inside the container.
Snapshot cloning requires /dev/userfaultfd device, not just the sysctl.
- Create device with mknod in CI setup
- Pass device to container via --device flag
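The host-side setup described in the last two commits amounts to roughly this (requires root; the podman arguments after `--device` are elided):

```shell
# Allow unprivileged userfaultfd, create the device node, pass it in.
sudo sysctl -w vm.unprivileged_userfaultfd=1

# userfaultfd is a misc char device (major 10); its minor is assigned
# dynamically, so read it from /proc/misc rather than hard-coding it.
minor=$(awk '$2 == "userfaultfd" { print $1 }' /proc/misc)
sudo mknod /dev/userfaultfd c 10 "$minor"

podman run --device /dev/userfaultfd ...  # rest of the CI invocation
```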