diff --git a/README.md b/README.md index cddb812..6be30b6 100644 --- a/README.md +++ b/README.md @@ -110,14 +110,19 @@ sandlock run --net-allow github.com:22,443 --net-allow :8080 \ # Wildcard port — `host:*` permits every port to the host sandlock run --net-allow github.com:* -r /usr -r /lib -r /etc -- ssh user@github.com -# Unrestricted outbound — `:*` opens any host and any port. UDP socket -# creation is still gated by --allow-udp; pair the two for full egress. -sandlock run --net-allow :* --allow-udp -r /usr -r /lib -r /etc -- ./client +# Unrestricted outbound — `:*` opens any host and any TCP port. For full +# egress add a UDP wildcard via the `udp://*:*` scheme. +sandlock run --net-allow :* --net-allow udp://*:* \ + -r /usr -r /lib -r /etc -- ./client -# UDP — opt in to UDP and allowlist the destination (e.g. DNS) -sandlock run --allow-udp --net-allow 1.1.1.1:53 --net-allow :443 \ +# UDP — scheme prefix gates the protocol and scopes the destination +# (e.g. DNS to 1.1.1.1, plus TCP HTTPS to anywhere) +sandlock run --net-allow udp://1.1.1.1:53 --net-allow :443 \ -r /usr -r /lib -r /etc -- ./client +# Ping — kernel ping socket (SOCK_DGRAM) gated by net.ipv4.ping_group_range +sandlock run --net-allow icmp://github.com -r /usr -r /lib -r /etc -- ping github.com + # HTTP-level ACL (method + host + path rules via transparent proxy) # HTTP rules with concrete hosts auto-extend --net-allow with host:80,443 sandlock run \ @@ -516,81 +521,88 @@ Landlock + seccomp confinement. `CLONE_ID=0..N-1` is set automatically. ### Network Model -Outbound traffic is gated by a single endpoint allowlist. Each -`--net-allow` rule names a `(host, ports)` pair, multiple rules are -OR'd, and a destination is permitted iff `(IP, port)` matches at least -one rule. The same allowlist applies to TCP `connect()` and to UDP -`sendto` / `sendmsg` destinations — the latter only relevant when -`--allow-udp` is set, since UDP socket creation is denied by default. 
+Outbound traffic is gated by a single endpoint allowlist that names +**protocol × destination**. Each `--net-allow` rule is one of: ``` --net-allow repeatable; no rules = deny all outbound - = host:port[,port,...] (IP-restricted) - | :port | *:port (any IP, listed port) - | host:* (host, any port) - | :* | *:* (any IP, any port) + bare form host:port[,port,...] / :port / *:port / host:* / :* / *:* (TCP) + tcp:// same suffix grammar — explicit TCP + udp:// same suffix grammar — UDP (`udp://*:*` opens any UDP) + icmp:// host or `*`, no port — kernel ping socket (SOCK_DGRAM) ``` +Multiple rules are OR'd. A destination is permitted iff some rule +matches the **same protocol** as the socket plus the destination IP +and port (port is N/A for ICMP). + +**Protocol gating** falls out of rule presence per scheme: + + * No UDP rule → UDP socket creation is denied at the seccomp layer. + * No ICMP rule → kernel ping socket creation (SOCK_DGRAM + IPPROTO_ICMP) + is denied at the seccomp layer. + * Raw ICMP (SOCK_RAW + IPPROTO_ICMP) is **never exposed** — packet + crafting is out of scope. Workloads that need ping should rely on + the host's `net.ipv4.ping_group_range` and use the dgram path + above (`--net-allow icmp://...`). + * TCP is always permitted at the syscall level; destinations are + governed by Landlock and/or the on-behalf path. + **Defaults.** With no `--net-allow` and no HTTP ACL flags, Landlock -denies every TCP `connect()`, UDP and raw socket creation are denied -at the seccomp layer, and there is no on-behalf path active. For -unrestricted egress, opt in explicitly with `--net-allow :*` (still -UDP-gated by `--allow-udp`). +denies every TCP `connect()`, UDP / ICMP / raw socket creation are +denied at the seccomp layer, and there is no on-behalf path active. +For unrestricted TCP egress, opt in explicitly with +`--net-allow :*`; for any UDP, add `--net-allow udp://*:*`. 
**Resolution.** Concrete hostnames are resolved once at sandbox start -and pinned in a synthetic `/etc/hosts`. The synthetic file replaces -the real one only when `--net-allow` includes at least one concrete -host; pure `:port` rules leave the real `/etc/hosts` and DNS visible. +and pinned in a synthetic `/etc/hosts` (across all protocols). The +synthetic file replaces the real one only when at least one rule has +a concrete host; pure `:port` / `udp://*:*` / `icmp://*` rules leave +the real `/etc/hosts` and DNS visible. **Wildcards.** Hostnames are matched literally — `--net-allow *.example.com:443` is **not** supported, list each domain you need. -The `*` token is allowed in two positions: as the host (alias for -empty: `*:port` ≡ `:port`) and as the port to mean "any port" -(`host:*`, `:*`, `*:*`). Mixing `*` with concrete ports -(`host:80,*`) is rejected — use either the wildcard or an explicit -list. When any rule uses the all-ports wildcard, Landlock no longer -filters TCP connect at the kernel level (it cannot express "every -port" without enumerating 65535 rules); the on-behalf path becomes -the sole enforcer, and for `:*` it short-circuits to allow-all. +The `*` token is allowed as the host (alias for empty: `*:port` ≡ +`:port`) and as the port for TCP/UDP rules (`host:*`, `:*`, `*:*`, +`udp://*:*`). Mixing `*` with concrete ports (`host:80,*`) is +rejected. When any TCP rule uses the all-ports wildcard, Landlock no +longer filters TCP connect at the kernel level (it cannot express +"every port" without enumerating 65535 rules); the on-behalf path +becomes the sole enforcer, and for `:*` it short-circuits to +allow-all. **Implementation.** Two enforcement paths: - * **Direct path** — pure `:port` policies (no concrete host) and no - HTTP ACL. Landlock enforces the TCP port allowlist at the kernel - level; no per-syscall overhead. UDP is not covered by Landlock and - therefore always uses the on-behalf path when allowed. 
- * **On-behalf path** — any concrete host, any HTTP ACL rule, or - `--allow-udp`. Seccomp traps `connect()`, `sendto()`, and - `sendmsg()`; the supervisor checks the `(ip, port)` against the - resolved allowlist and performs the syscall. The HTTP/HTTPS proxy - redirect (when configured) happens here too. + * **Direct path** — pure `:port` TCP policies (no concrete host) + and no HTTP ACL. Landlock enforces the TCP port allowlist at the + kernel level; no per-syscall overhead. UDP and ICMP are not + covered by Landlock and always use the on-behalf path when allowed. + * **On-behalf path** — any concrete host, any HTTP ACL rule, or any + UDP / ICMP rule. Seccomp traps `connect()`, `sendto()`, `sendmsg()`, + and `sendmmsg()`; the supervisor dups the child fd, queries + `getsockopt(SOL_SOCKET, SO_PROTOCOL)` to learn whether the socket + is TCP / UDP / ICMP, then checks the destination against that + protocol's resolved allowlist before performing the syscall. + The HTTP/HTTPS proxy redirect (when configured) happens here too. **HTTP / HTTPS interception.** `--http-allow` / `--http-deny` route matching ports through a transparent proxy. Each rule with a concrete host auto-extends `--net-allow` with `host:80` (and `host:443` when `--https-ca` is set) so the proxy's intercept ports are reachable; -wildcard hosts auto-add `:80` / `:443` (any IP). HTTPS MITM is opt-in: -pass `--https-ca ` and `--https-key ` for a CA *you generate* -and trust inside the sandbox (typically install the cert into the -workload's `/etc/ssl/certs/`). Without `--https-ca`, port 443 is not -intercepted — `--net-allow host:443` permits raw TLS to the host with -no content inspection. +wildcard hosts auto-add `:80` / `:443` (any IP). All auto-added +entries are TCP. HTTPS MITM is opt-in: pass `--https-ca ` and +`--https-key ` for a CA *you generate* and trust inside the +sandbox (typically install the cert into the workload's +`/etc/ssl/certs/`). 
Without `--https-ca`, port 443 is not intercepted +— `--net-allow host:443` permits raw TLS to the host with no content +inspection. **Bind.** `--net-bind ` is independent from `--net-allow` and -governs server-side `bind()`. Landlock enforces it; `--port-remap` adds -on-behalf virtualization for binding. - -**UDP, ICMP, unix.** Default-deny, opt in via dedicated flags: +governs server-side `bind()`. Landlock enforces it (TCP only); +`--port-remap` adds on-behalf virtualization for binding. - * `--allow-udp` enables UDP socket creation. Outbound UDP - destinations are then gated by the same `--net-allow` allowlist - used for TCP — the seccomp on-behalf path also covers `sendto` / - `sendmsg`. Example: `--allow-udp --net-allow 1.1.1.1:53` for DNS. - * `--allow-icmp` narrowly permits `socket(AF_INET, SOCK_RAW, - IPPROTO_ICMP)` and the IPv6 equivalent only — enough for `ping`. - Other raw socket types stay denied. - * AF_UNIX sockets are governed by Landlock's - `LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET`. +**AF_UNIX sockets** are governed by Landlock's +`LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET`, independent from `--net-allow`. ### Port Virtualization @@ -651,11 +663,20 @@ Policy( # Syscall filtering (seccomp) block_syscalls=[], # Extra syscalls to block in addition to Sandlock defaults - # Network — see "Network Model" above. Each entry is `host:port[,port,...]`, - # `:port`, `*:port`, `host:*`, or `:*` / `*:*`. Empty list = deny all - # outbound; `:*` = unrestricted. Same allowlist gates UDP destinations - # when allow_udp=True (e.g. `:53` for DNS). - net_allow=["api.example.com:443", "github.com:22,443", ":8080"], + # Network — see "Network Model" above. Each entry is one of: + # bare host:port[,port,...] — TCP (default scheme) + # tcp://host:port — explicit TCP + # udp://host:port — UDP (`udp://*:*` for any UDP) + # icmp://host — kernel ping socket (`icmp://*` = any) + # Empty list = deny all outbound. 
Protocol gating falls out of rule + # presence: with no UDP/ICMP rule, the corresponding socket creation + # is denied at the seccomp layer. Raw ICMP is not exposed. + net_allow=[ + "api.example.com:443", + "github.com:22,443", + "udp://1.1.1.1:53", # DNS over UDP + "icmp://github.com", # ping (gated by ping_group_range) + ], net_bind=[8080], # TCP bind ports (Landlock; ABI v4+) # HTTP ACL (transparent proxy) @@ -665,10 +686,6 @@ Policy( https_ca="ca.pem", # CA cert for HTTPS MITM (adds port 443) https_key="ca-key.pem", # CA key for HTTPS MITM - # Socket restrictions (raw sockets and UDP denied by default) - allow_udp=False, # CLI: --allow-udp; outbound UDP still gated by net_allow - allow_icmp=False, # CLI: --allow-icmp; permits ICMP raw only (AF_INET/AF_INET6 + SOCK_RAW + IPPROTO_ICMP[V6]) - # Resources max_memory="512M", # Memory limit max_processes=64, # Peak concurrent process limit diff --git a/crates/sandlock-cli/src/main.rs b/crates/sandlock-cli/src/main.rs index 7d4501b..953caae 100644 --- a/crates/sandlock-cli/src/main.rs +++ b/crates/sandlock-cli/src/main.rs @@ -69,18 +69,6 @@ enum Command { fs_storage: Option, #[arg(long = "max-disk")] max_disk: Option, - /// Allow UDP socket creation. UDP is denied by default; this - /// turns it back on. Outbound UDP destinations are still - /// gated by `--net-allow` (the same endpoint allowlist used - /// for TCP). - #[arg(long = "allow-udp")] - allow_udp: bool, - /// Allow ICMP raw sockets only — `socket(AF_INET, SOCK_RAW, - /// IPPROTO_ICMP)` and the IPv6 equivalent. Other `SOCK_RAW` - /// types stay denied. Useful for `ping` without granting full - /// packet-crafting capability. - #[arg(long = "allow-icmp")] - allow_icmp: bool, /// Allow SysV IPC syscalls (shared memory, message queues, /// semaphores). 
Denied by default: sandlock does not use IPC /// namespaces, so without this denial two sandboxes on the @@ -186,7 +174,7 @@ async fn main() -> Result<()> { net_allow, net_bind, time_start, random_seed, clean_env, num_cpus, profile: profile_name, status_fd, max_cpu, max_open_files, chroot, uid, workdir, cwd, - fs_isolation, fs_storage, max_disk, allow_udp, allow_icmp, allow_sysv_ipc, + fs_isolation, fs_storage, max_disk, allow_sysv_ipc, http_allow, http_deny, http_ports, https_ca, https_key, port_remap, no_randomize_memory, no_huge_pages, deterministic_dirs, name, no_coredump, env_vars, exec_shell, interactive: _, fs_deny, fs_mount, cpu_cores, gpu_devices, image, dry_run, no_supervisor, cmd } => @@ -195,7 +183,7 @@ async fn main() -> Result<()> { validate_no_supervisor( &max_memory, &max_processes, &max_cpu, &max_open_files, &timeout, &net_allow, &net_bind, - allow_udp, allow_icmp, &http_allow, &http_deny, &http_ports, + &http_allow, &http_deny, &http_ports, &num_cpus, &random_seed, &time_start, no_randomize_memory, no_huge_pages, deterministic_dirs, &name, &chroot, &image, &uid, &workdir, &cwd, &fs_isolation, &fs_storage, @@ -258,9 +246,20 @@ async fn main() -> Result<()> { for p in &base.fs_writable { b = b.fs_write(p); } for p in &base.fs_denied { b = b.fs_deny(p); } for rule in &base.net_allow { - let port_csv: Vec = rule.ports.iter().map(|p| p.to_string()).collect(); - let host_part = rule.host.as_deref().unwrap_or(""); - let spec = format!("{}:{}", host_part, port_csv.join(",")); + let host_part = rule.host.as_deref().unwrap_or("*"); + let spec = match rule.protocol { + sandlock_core::policy::Protocol::Tcp => { + let ports = format_ports(&rule.ports, rule.all_ports); + format!("tcp://{}:{}", host_part, ports) + } + sandlock_core::policy::Protocol::Udp => { + let ports = format_ports(&rule.ports, rule.all_ports); + format!("udp://{}:{}", host_part, ports) + } + sandlock_core::policy::Protocol::Icmp => { + format!("icmp://{}", host_part) + } + }; b = 
b.net_allow(spec); } for p in &base.net_bind { b = b.net_bind_port(*p); } @@ -281,8 +280,6 @@ async fn main() -> Result<()> { if let Some(seed) = base.random_seed { b = b.random_seed(seed); } if let Some(n) = base.num_cpus { b = b.num_cpus(n); } b = b.block_syscalls(base.block_syscalls.clone()); - b = b.allow_udp(base.allow_udp); - b = b.allow_icmp(base.allow_icmp); b = b.allow_sysv_ipc(base.allow_sysv_ipc); b = b.clean_env(base.clean_env); if let Some(ref w) = base.workdir { b = b.workdir(w); } @@ -330,12 +327,9 @@ async fn main() -> Result<()> { } if let Some(ref path) = fs_storage { builder = builder.fs_storage(path); } if let Some(ref s) = max_disk { builder = builder.max_disk(ByteSize::parse(s)?); } - if allow_udp { builder = builder.allow_udp(true); } - // --allow-icmp narrowly permits ICMP raw sockets; arbitrary - // raw sockets stay denied. The seccomp filter inspects the - // protocol arg of `socket()` so non-ICMP `SOCK_RAW` is - // still rejected. - if allow_icmp { builder = builder.allow_icmp(true); } + // UDP and the kernel ping socket (SOCK_DGRAM + IPPROTO_ICMP) are + // gated by `--net-allow` rule presence (`udp://...` and `icmp://...` + // respectively). Raw ICMP (SOCK_RAW) is never exposed.
if allow_sysv_ipc { builder = builder.allow_sysv_ipc(true); } for rule in &http_allow { builder = builder.http_allow(rule); } for rule in &http_deny { builder = builder.http_deny(rule); } @@ -605,8 +599,6 @@ fn validate_no_supervisor( timeout: &Option, net_allow: &[String], net_bind: &[u16], - allow_udp: bool, - allow_icmp: bool, http_allow: &[String], http_deny: &[String], http_ports: &[u16], @@ -642,8 +634,6 @@ fn validate_no_supervisor( if timeout.is_some() { bad.push("--timeout"); } if !net_allow.is_empty() { bad.push("--net-allow"); } if !net_bind.is_empty() { bad.push("--net-bind"); } - if allow_udp { bad.push("--allow-udp"); } - if allow_icmp { bad.push("--allow-icmp"); } if !http_allow.is_empty() { bad.push("--http-allow"); } if !http_deny.is_empty() { bad.push("--http-deny"); } if !http_ports.is_empty() { bad.push("--http-port"); } @@ -748,6 +738,16 @@ fn no_supervisor_exec(policy: &Policy, cmd: &[&str]) -> Result<()> { } +/// Render a port list back into the `--net-allow` port-suffix form: +/// concrete ports become `80,443`; the all-ports wildcard becomes `*`. 
+fn format_ports(ports: &[u16], all_ports: bool) -> String { + if all_ports { + "*".to_string() + } else { + ports.iter().map(|p| p.to_string()).collect::<Vec<_>>().join(",") + } +} + /// Parse an ISO 8601 timestamp (e.g. "2000-01-01T00:00:00Z") into a SystemTime. fn parse_time_start(s: &str) -> Result { let ts: jiff::Timestamp = s.parse() .map_err(|e| anyhow!("invalid --time-start '{}': {}", s, e))?; diff --git a/crates/sandlock-core/src/context.rs b/crates/sandlock-core/src/context.rs index 6bf5b3b..3cf396c 100644 --- a/crates/sandlock-core/src/context.rs +++ b/crates/sandlock-core/src/context.rs @@ -15,7 +15,7 @@ use crate::sys::structs::{ SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO, SIOCETHTOOL, SIOCGIFADDR, SIOCGIFBRDADDR, SIOCGIFCONF, SIOCGIFDSTADDR, SIOCGIFFLAGS, SIOCGIFHWADDR, SIOCGIFINDEX, SIOCGIFNAME, SIOCGIFNETMASK, - SOCK_DGRAM, SOCK_RAW, SOCK_TYPE_MASK, IPPROTO_ICMP, IPPROTO_ICMPV6, TIOCLINUX, TIOCSTI, + SOCK_DGRAM, SOCK_RAW, SOCK_TYPE_MASK, TIOCLINUX, TIOCSTI, PR_SET_DUMPABLE, PR_SET_SECUREBITS, PR_SET_PTRACER, OFFSET_ARGS0_LO, OFFSET_ARGS1_LO, OFFSET_ARGS2_LO, OFFSET_ARGS3_LO, OFFSET_NR, SockFilter, @@ -175,6 +175,7 @@ pub fn syscall_name_to_nr(name: &str) -> Option { "connect" => libc::SYS_connect, "sendto" => libc::SYS_sendto, "sendmsg" => libc::SYS_sendmsg, + "sendmmsg" => libc::SYS_sendmmsg, "ioctl" => libc::SYS_ioctl, "socket" => libc::SYS_socket, "prctl" => libc::SYS_prctl, @@ -286,6 +287,7 @@ pub fn notif_syscalls(policy: &Policy, sandbox_name: Option<&str>) -> Vec { nrs.push(libc::SYS_connect as u32); nrs.push(libc::SYS_sendto as u32); nrs.push(libc::SYS_sendmsg as u32); + nrs.push(libc::SYS_sendmmsg as u32); nrs.push(libc::SYS_bind as u32); } @@ -565,17 +567,23 @@ pub fn arg_filters(policy: &Policy) -> Vec { // --- socket: block SOCK_RAW and/or SOCK_DGRAM on AF_INET/AF_INET6 --- // - // Raw sockets are always denied by default. 
The narrow `allow_icmp` - // carve-out permits only `socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)` - // and the IPv6 equivalent — handled by a separate `socket()` filter - // further down. 
When `allow_icmp` is set, SOCK_RAW is excluded from - // the simple blocked_types list so the carve-out can decide. - let raw_narrow = policy.allow_icmp; + // SOCK_RAW is unconditionally denied. Sandlock does not expose + // raw ICMP — packet-crafting capabilities aren't part of the XOA + // threat model, and destination filtering at `sendto` can't be + // honestly enforced for raw sockets (the agent controls the IP + // header). Workloads that need ping should use the kernel ping + // socket (SOCK_DGRAM + IPPROTO_ICMP) via an `icmp://...` rule. + // + // SOCK_DGRAM is denied unless a UDP or ICMP rule exists in + // net_allow. The kernel ping socket uses SOCK_DGRAM with + // IPPROTO_ICMP, so the same type bit gates both — destination + // filtering at sendto (Phase 2) is what separates them per-rule. + use crate::policy::Protocol; + let any_udp_rule = policy.net_allow.iter().any(|r| r.protocol == Protocol::Udp); + let any_icmp_rule = policy.net_allow.iter().any(|r| r.protocol == Protocol::Icmp); let mut blocked_types: Vec = Vec::new(); - if !policy.allow_icmp { - blocked_types.push(SOCK_RAW); - } - if !policy.allow_udp { + blocked_types.push(SOCK_RAW); + if !any_udp_rule && !any_icmp_rule { blocked_types.push(SOCK_DGRAM); } @@ -610,43 +618,10 @@ pub fn arg_filters(policy: &Policy) -> Vec { insns.push(stmt(BPF_RET | BPF_K, ret_errno)); } - // --- socket: ICMP-only carve-out for SOCK_RAW --- - // Active when raw sockets are otherwise denied AND --allow-icmp is set. - // Permits `socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)` and - // `socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6)`; denies every other - // SOCK_RAW. The block has 14 instructions; offsets reference the - // post-block instruction count (skip-to-end). 
- if raw_narrow { - // INST 0: LD NR - insns.push(stmt(BPF_LD | BPF_W | BPF_ABS, OFFSET_NR)); - // INST 1: JEQ socket → fall through (jt=0); not socket → skip 12 - insns.push(jump(BPF_JMP | BPF_JEQ | BPF_K, nr_socket, 0, 12)); - // INST 2-3: LD type, AND TYPE_MASK - insns.push(stmt(BPF_LD | BPF_W | BPF_ABS, OFFSET_ARGS1_LO)); - insns.push(stmt(BPF_ALU | BPF_AND | BPF_K, SOCK_TYPE_MASK)); - // INST 4: JEQ SOCK_RAW → fall through; not raw → skip 9 (allow) - insns.push(jump(BPF_JMP | BPF_JEQ | BPF_K, SOCK_RAW, 0, 9)); - // INST 5: LD domain - insns.push(stmt(BPF_LD | BPF_W | BPF_ABS, OFFSET_ARGS0_LO)); - // INST 6: JEQ AF_INET → fall to v4 proto check; else skip 3 to v6 check at INST 10 - insns.push(jump(BPF_JMP | BPF_JEQ | BPF_K, AF_INET, 0, 3)); - // INST 7: LD proto (arg2) - insns.push(stmt(BPF_LD | BPF_W | BPF_ABS, OFFSET_ARGS2_LO)); - // INST 8: JEQ IPPROTO_ICMP → skip 5 to end (allow); else fall to RET errno - insns.push(jump(BPF_JMP | BPF_JEQ | BPF_K, IPPROTO_ICMP, 5, 0)); - // INST 9: RET errno (v4 SOCK_RAW with non-ICMP proto) - insns.push(stmt(BPF_RET | BPF_K, ret_errno)); - // INST 10: JEQ AF_INET6 → fall to v6 proto check; else skip 2 to RET errno - // (other AF + SOCK_RAW, e.g. AF_PACKET/AF_NETLINK, must be denied) - insns.push(jump(BPF_JMP | BPF_JEQ | BPF_K, AF_INET6, 0, 2)); - // INST 11: LD proto - insns.push(stmt(BPF_LD | BPF_W | BPF_ABS, OFFSET_ARGS2_LO)); - // INST 12: JEQ IPPROTO_ICMPV6 → skip 1 past RET (allow); else fall to RET errno - insns.push(jump(BPF_JMP | BPF_JEQ | BPF_K, IPPROTO_ICMPV6, 1, 0)); - // INST 13: RET errno (v6 SOCK_RAW with non-ICMPv6 proto) - insns.push(stmt(BPF_RET | BPF_K, ret_errno)); - // (post-block — fall through to wait4 block below) - } + // (raw ICMP carve-out removed — SOCK_RAW is unconditionally denied + // by the blocked_types block above. Sandlock does not expose raw + // sockets; ping uses the SOCK_DGRAM kernel ping socket via an + // `icmp://...` rule, gated by host `ping_group_range`.) 
// --- wait4: skip notification for WNOHANG/WNOWAIT (non-blocking) --- // wait4(pid, status, options, rusage) — options is arg2 @@ -1223,6 +1198,7 @@ mod tests { assert!(nrs.contains(&(libc::SYS_connect as u32))); assert!(nrs.contains(&(libc::SYS_sendto as u32))); assert!(nrs.contains(&(libc::SYS_sendmsg as u32))); + assert!(nrs.contains(&(libc::SYS_sendmmsg as u32))); } #[test] @@ -1377,7 +1353,7 @@ mod tests { #[test] fn test_arg_filters_raw_sockets() { use crate::sys::structs::{BPF_ALU, BPF_AND, BPF_JEQ, BPF_JMP, BPF_K}; - // Raw sockets are blocked by default; allow_icmp is false. + // Raw sockets are unconditionally blocked; no `net_allow` scheme re-enables them. let policy = Policy::builder().build().unwrap(); let filters = arg_filters(&policy); // Should have AF_INET check @@ -1397,7 +1373,7 @@ mod tests { #[test] fn test_arg_filters_udp_denied_by_default() { use crate::sys::structs::{BPF_JEQ, BPF_JMP, BPF_K}; - // UDP is denied by default; allow_udp(false) is the default state. + // UDP is denied by default — no `udp://...` rule in net_allow. let policy = Policy::builder().build().unwrap(); let filters = arg_filters(&policy); // Should have JEQ SOCK_DGRAM diff --git a/crates/sandlock-core/src/landlock.rs b/crates/sandlock-core/src/landlock.rs index 3b48b81..f75da7d 100644 --- a/crates/sandlock-core/src/landlock.rs +++ b/crates/sandlock-core/src/landlock.rs @@ -213,7 +213,16 @@ fn confine_inner(policy: &Policy, handle_net: bool) -> Result<(), SandlockError> // the per-rule IP allowlist when the rule is `host:*`. For `:*` // the on-behalf path becomes `NetworkPolicy::Unrestricted` (no // additional check). Bind enforcement is unaffected. - let net_wildcard = policy.net_allow.iter().any(|r| r.all_ports); + // Landlock's net hooks only cover TCP (CONNECT_TCP / BIND_TCP). 
+ // UDP and ICMP rules are enforced elsewhere (BPF gates plus the + // on-behalf path), so they're filtered out here — feeding them to + // Landlock would either be a no-op (for unhandled protocols) or + // wrongly install TCP rules from a UDP wildcard. + use crate::policy::Protocol; + let net_wildcard = policy + .net_allow + .iter() + .any(|r| r.protocol == Protocol::Tcp && r.all_ports); let handled_access_net = if !handle_net { 0 } else if net_wildcard { @@ -318,6 +327,10 @@ fn confine_inner(policy: &Policy, handle_net: bool) -> Result<(), SandlockError> if handle_net && !net_wildcard { let mut connect_ports: std::collections::HashSet = std::collections::HashSet::new(); for rule in &policy.net_allow { + // TCP-only — see net_wildcard comment above. + if rule.protocol != Protocol::Tcp { + continue; + } for &p in &rule.ports { connect_ports.insert(p); } diff --git a/crates/sandlock-core/src/network.rs b/crates/sandlock-core/src/network.rs index 02f6068..9101dec 100644 --- a/crates/sandlock-core/src/network.rs +++ b/crates/sandlock-core/src/network.rs @@ -10,7 +10,7 @@ use std::os::unix::io::{AsRawFd, RawFd}; use std::sync::Arc; use crate::seccomp::ctx::SupervisorCtx; -use crate::seccomp::notif::{read_child_mem, NotifAction}; +use crate::seccomp::notif::{read_child_mem, write_child_mem, NotifAction}; use crate::sys::structs::{SeccompNotif, AF_INET, AF_INET6, ECONNREFUSED}; /// Maximum buffer size for sendto/sendmsg on-behalf operations (64 MiB). @@ -68,6 +68,43 @@ fn parse_port_from_sockaddr(bytes: &[u8]) -> Option { } } +// ============================================================ +// query_socket_protocol — derive the rule Protocol from a fd via getsockopt +// ============================================================ + +/// Query `SO_PROTOCOL` on a dup'd socket fd to learn whether to route +/// the on-behalf check through the TCP, UDP, or ICMP policy. +/// +/// Returns `None` for protocols sandlock does not gate via `net_allow` +/// (raw, SCTP, etc.) 
— the handler treats those as "no rule applies" +/// which collapses to the default-deny path. +fn query_socket_protocol(fd: RawFd) -> Option<crate::policy::Protocol> { + use crate::policy::Protocol; + let mut proto: libc::c_int = 0; + let mut len: libc::socklen_t = std::mem::size_of::<libc::c_int>() as libc::socklen_t; + let rc = unsafe { + libc::getsockopt( + fd, + libc::SOL_SOCKET, + libc::SO_PROTOCOL, + &mut proto as *mut _ as *mut libc::c_void, + &mut len, + ) + }; + if rc != 0 { + return None; + } + match proto { + libc::IPPROTO_TCP => Some(Protocol::Tcp), + libc::IPPROTO_UDP => Some(Protocol::Udp), + // IPPROTO_ICMP and IPPROTO_ICMPV6 both route to the ICMP policy + // (the policy doesn't distinguish IP versions; the rule's + // resolved IP set already covers both via DNS). + libc::IPPROTO_ICMP | libc::IPPROTO_ICMPV6 => Some(Protocol::Icmp), + _ => None, + } +} + // ============================================================ // connect_on_behalf — perform connect() on behalf of the child (TOCTOU-safe) // ============================================================ @@ -96,23 +133,39 @@ async fn connect_on_behalf( Err(_) => return NotifAction::Errno(libc::EIO), }; - // 2. Check destination (ip, port) against the endpoint allowlist. - // The on-behalf supervisor performs the connect outside Landlock, - // so this check is the only port enforcement on this path. + // 2. Check destination against the per-protocol endpoint allowlist. + // The dup we'd need anyway for the on-behalf connect doubles as + // our SO_PROTOCOL probe — one pidfd_getfd, one getsockopt. The + // per-protocol policy is keyed on whether the socket is TCP / UDP + // / kernel ping (ICMP). Unknown protocol (raw, SCTP, etc.) fails + // closed: the BPF should have prevented socket creation, so + // reaching here with one is an unexpected case worth refusing. 
if let Some(ip) = parse_ip_from_sockaddr(&addr_bytes) { let dest_port = parse_port_from_sockaddr(&addr_bytes); + let dup_fd = match crate::seccomp::notif::dup_fd_from_pid(notif.pid, sockfd) { + Ok(fd) => fd, + Err(_) => return NotifAction::Errno(libc::ENOSYS), + }; + let protocol = match query_socket_protocol(dup_fd.as_raw_fd()) { + Some(p) => p, + None => return NotifAction::Errno(ECONNREFUSED), + }; let ns = ctx.network.lock().await; let live_policy = { let pfs = ctx.policy_fn.lock().await; pfs.live_policy.clone() }; - let effective = ns.effective_network_policy(notif.pid, live_policy.as_ref()); + let effective = ns.effective_network_policy(notif.pid, protocol, live_policy.as_ref()); match (effective, dest_port) { (crate::seccomp::notif::NetworkPolicy::Unrestricted, _) => { - // No allowlist active — Landlock direct path enforces ports. - // (Reachable when on-behalf is enabled solely by HTTP ACL.) + // No rules for this protocol's wildcard — Landlock (TCP + // only) or the protocol's wildcard rule covers it; no + // additional check here. } (policy, Some(p)) => { + // For ICMP rules every per-IP entry is `PortAllow::Any`, + // so the port arg from the sockaddr (typically 0 or the + // ICMP id) is functionally ignored — IP is what matters. if !policy.allows(ip, p) { return NotifAction::Errno(ECONNREFUSED); } @@ -188,11 +241,9 @@ async fn connect_on_behalf( (addr_bytes.clone(), addr_len) }; - // 3. Duplicate child's socket into supervisor (use notif.pid for grandchild support) - let dup_fd = match crate::seccomp::notif::dup_fd_from_pid(notif.pid, sockfd) { - Ok(fd) => fd, - Err(_) => return NotifAction::Errno(libc::ENOSYS), - }; + // (The supervisor-side dup is the same fd we already created + // for the SO_PROTOCOL probe above — reuse it rather than + // pidfd_getfd-ing a second time.) // 4. 
Record original dest IP *before* connect to prevent TOCTOU race: // the proxy may receive the request before we write the mapping if @@ -338,15 +389,25 @@ async fn sendto_on_behalf( Err(_) => return NotifAction::Errno(libc::EIO), }; - // 2. Check (ip, port) against the endpoint allowlist. + // 2. Check (ip, port) against the per-protocol endpoint allowlist. + // One pidfd_getfd serves both the SO_PROTOCOL probe and the + // on-behalf sendto. if let Some(ip) = parse_ip_from_sockaddr(&addr_bytes) { let dest_port = parse_port_from_sockaddr(&addr_bytes); + let dup_fd = match crate::seccomp::notif::dup_fd_from_pid(notif.pid, sockfd) { + Ok(fd) => fd, + Err(_) => return NotifAction::Errno(libc::ENOSYS), + }; + let protocol = match query_socket_protocol(dup_fd.as_raw_fd()) { + Some(p) => p, + None => return NotifAction::Errno(ECONNREFUSED), + }; let ns = ctx.network.lock().await; let live_policy = { let pfs = ctx.policy_fn.lock().await; pfs.live_policy.clone() }; - let effective = ns.effective_network_policy(notif.pid, live_policy.as_ref()); + let effective = ns.effective_network_policy(notif.pid, protocol, live_policy.as_ref()); if !matches!(effective, crate::seccomp::notif::NetworkPolicy::Unrestricted) { match dest_port { Some(p) if !effective.allows(ip, p) => { @@ -364,11 +425,7 @@ async fn sendto_on_behalf( Err(_) => return NotifAction::Errno(libc::EIO), }; - // 4. Duplicate child's socket into supervisor (use notif.pid for grandchild support) - let dup_fd = match crate::seccomp::notif::dup_fd_from_pid(notif.pid, sockfd) { - Ok(fd) => fd, - Err(_) => return NotifAction::Errno(libc::ENOSYS), - }; + // 4. (dup_fd from step 2 is reused for the supervisor sendto.) // 5. Perform sendto in supervisor with validated sockaddr + copied data let ret = unsafe { @@ -415,21 +472,104 @@ async fn sendmsg_on_behalf( let msghdr_ptr = args[1]; let flags = args[2] as i32; - // 1. 
Read full msghdr struct (56 bytes on x86_64): - // msg_name(8) + msg_namelen(4) + pad(4) + msg_iov(8) + msg_iovlen(8) - // + msg_control(8) + msg_controllen(8) + msg_flags(4) + pad(4) - // - // If we cannot read the msghdr, fail the syscall with EFAULT instead - // of falling through to Continue. Continue would let the kernel - // re-read child memory and (for a racing thread that just remapped - // it back) potentially execute the sendmsg without the IP allowlist - // check this handler exists to enforce. EFAULT matches what the - // kernel itself would return for an unreadable msghdr pointer. + // Pre-scan for Continue cases (connected socket / non-IP family). + // Same TOCTOU-aware semantics as before: EFAULT on unreadable + // msghdr (vs. Continue, which would let the kernel re-read child + // memory and bypass our check). + match prescan_msghdr(notif, notif_fd, msghdr_ptr) { + PrescanResult::ContinueWholeCall => return NotifAction::Continue, + PrescanResult::Errno(e) => return NotifAction::Errno(e), + PrescanResult::OnBehalf => {} + } + + let dup_fd = match crate::seccomp::notif::dup_fd_from_pid(notif.pid, sockfd) { + Ok(fd) => fd, + Err(_) => return NotifAction::Errno(libc::ENOSYS), + }; + let protocol = match query_socket_protocol(dup_fd.as_raw_fd()) { + Some(p) => p, + None => return NotifAction::Errno(ECONNREFUSED), + }; + + match send_msghdr_on_behalf(notif, ctx, notif_fd, &dup_fd, protocol, msghdr_ptr, flags).await { + Ok(n) => NotifAction::ReturnValue(n as i64), + Err(errno) => NotifAction::Errno(errno), + } +} + +// ============================================================ +// prescan_msghdr / send_msghdr_on_behalf — shared per-message work +// ============================================================ + +#[derive(Clone, Copy)] +enum PrescanResult { + /// All fields present, IP-family destination — caller can take the + /// on-behalf path with `send_msghdr_on_behalf`. 
+ OnBehalf, + /// `msg_name == NULL` (connected socket) or non-IP family + /// (AF_UNIX etc.). Caller should return `NotifAction::Continue` so + /// the kernel handles the syscall in the child's namespace — + /// AF_UNIX path resolution is the canonical reason we don't take + /// these messages on behalf. + ContinueWholeCall, + /// Memory read failure. Caller maps to the appropriate errno + /// (EFAULT for unreadable msghdr, EIO for the sockaddr). + Errno(i32), +} + +/// Probe one `struct msghdr` to decide whether the on-behalf path +/// applies. Used by both `sendmsg_on_behalf` (one msghdr) and +/// `sendmmsg_on_behalf` (one per `mmsghdr` entry, before doing any +/// sends — Continue is a whole-syscall decision). +fn prescan_msghdr( + notif: &SeccompNotif, + notif_fd: RawFd, + msghdr_ptr: u64, +) -> PrescanResult { let msghdr_bytes = match read_child_mem(notif_fd, notif.id, notif.pid, msghdr_ptr, 56) { Ok(b) if b.len() >= 56 => b, - _ => return NotifAction::Errno(libc::EFAULT), + _ => return PrescanResult::Errno(libc::EFAULT), }; + let msg_name_ptr = u64::from_ne_bytes(msghdr_bytes[0..8].try_into().unwrap()); + if msg_name_ptr == 0 { + return PrescanResult::ContinueWholeCall; + } + let msg_namelen = u32::from_ne_bytes(msghdr_bytes[8..12].try_into().unwrap()); + let addr_bytes = match read_child_mem(notif_fd, notif.id, notif.pid, msg_name_ptr, msg_namelen as usize) { + Ok(b) => b, + Err(_) => return PrescanResult::Errno(libc::EIO), + }; + if parse_ip_from_sockaddr(&addr_bytes).is_none() { + return PrescanResult::ContinueWholeCall; + } + PrescanResult::OnBehalf +} +/// Validate, materialize, and send one `struct msghdr` on behalf of +/// the child. Caller is responsible for: +/// - dup'ing the child fd (`dup_fd`), +/// - resolving the socket protocol (`protocol`) via +/// `query_socket_protocol` on that dup, +/// - having confirmed via `prescan_msghdr` that `msghdr_ptr` points +/// at an IP-family destination (non-NULL `msg_name`). 
+/// +/// Returns the byte count returned by `sendmsg`, or an errno suitable +/// for `NotifAction::Errno`. ECONNREFUSED is used both for "destination +/// blocked by policy" and for "couldn't parse a port from the +/// sockaddr"; EIO for sub-buffer read failures (iovec / control). +async fn send_msghdr_on_behalf( + notif: &SeccompNotif, + ctx: &Arc, + notif_fd: RawFd, + dup_fd: &std::os::unix::io::OwnedFd, + protocol: crate::policy::Protocol, + msghdr_ptr: u64, + flags: i32, +) -> Result { + let msghdr_bytes = match read_child_mem(notif_fd, notif.id, notif.pid, msghdr_ptr, 56) { + Ok(b) if b.len() >= 56 => b, + _ => return Err(libc::EFAULT), + }; let msg_name_ptr = u64::from_ne_bytes(msghdr_bytes[0..8].try_into().unwrap()); let msg_namelen = u32::from_ne_bytes(msghdr_bytes[8..12].try_into().unwrap()); let msg_iov_ptr = u64::from_ne_bytes(msghdr_bytes[16..24].try_into().unwrap()); @@ -437,22 +577,16 @@ async fn sendmsg_on_behalf( let msg_control_ptr = u64::from_ne_bytes(msghdr_bytes[32..40].try_into().unwrap()); let msg_controllen = u64::from_ne_bytes(msghdr_bytes[40..48].try_into().unwrap()); - if msg_name_ptr == 0 { - return NotifAction::Continue; // no address — connected socket - } - - // 2. Copy sockaddr from msg_name - let addr_bytes = match read_child_mem( - notif_fd, notif.id, notif.pid, msg_name_ptr, msg_namelen as usize, - ) { + let addr_bytes = match read_child_mem(notif_fd, notif.id, notif.pid, msg_name_ptr, msg_namelen as usize) { Ok(b) => b, - Err(_) => return NotifAction::Errno(libc::EIO), + Err(_) => return Err(libc::EIO), }; - - // 3. Check (ip, port) against the endpoint allowlist. let ip = match parse_ip_from_sockaddr(&addr_bytes) { Some(ip) => ip, - None => return NotifAction::Continue, // Non-IP family — allow through + // Caller pre-checks via prescan_msghdr; reaching this branch + // means the sockaddr changed under us between the prescan and + // here. Fail closed. 
+ None => return Err(libc::EAFNOSUPPORT), }; let dest_port = parse_port_from_sockaddr(&addr_bytes); @@ -461,53 +595,42 @@ async fn sendmsg_on_behalf( let pfs = ctx.policy_fn.lock().await; pfs.live_policy.clone() }; - let effective = ns.effective_network_policy(notif.pid, live_policy.as_ref()); + let effective = ns.effective_network_policy(notif.pid, protocol, live_policy.as_ref()); if !matches!(effective, crate::seccomp::notif::NetworkPolicy::Unrestricted) { match dest_port { - Some(p) if !effective.allows(ip, p) => { - return NotifAction::Errno(ECONNREFUSED); - } - None => return NotifAction::Errno(ECONNREFUSED), + Some(p) if !effective.allows(ip, p) => return Err(ECONNREFUSED), + None => return Err(ECONNREFUSED), Some(_) => {} } } drop(ns); - // 4. Copy iovec entries and their data buffers from child memory - // Safety: cap iovlen to prevent excessive allocation let iovlen = (msg_iovlen as usize).min(1024); - let iov_size = iovlen * 16; // each iovec is 16 bytes (ptr + len) + let iov_size = iovlen * 16; let iov_bytes = match read_child_mem(notif_fd, notif.id, notif.pid, msg_iov_ptr, iov_size) { Ok(b) => b, - Err(_) => return NotifAction::Errno(libc::EIO), + Err(_) => return Err(libc::EIO), }; - let mut data_bufs: Vec> = Vec::with_capacity(iovlen); let mut local_iovs: Vec = Vec::with_capacity(iovlen); - for i in 0..iovlen { let off = i * 16; if off + 16 > iov_bytes.len() { break; } let iov_base = u64::from_ne_bytes(iov_bytes[off..off + 8].try_into().unwrap()); let iov_len = u64::from_ne_bytes(iov_bytes[off + 8..off + 16].try_into().unwrap()) as usize; - if iov_len > MAX_SEND_BUF { - return NotifAction::Errno(libc::EMSGSIZE); + return Err(libc::EMSGSIZE); } - if iov_base == 0 || iov_len == 0 { data_bufs.push(Vec::new()); continue; } - let buf = match read_child_mem(notif_fd, notif.id, notif.pid, iov_base, iov_len) { Ok(b) => b, - Err(_) => return NotifAction::Errno(libc::EIO), + Err(_) => return Err(libc::EIO), }; data_bufs.push(buf); } - - // Build local iovec 
array pointing to our copied data for buf in &data_bufs { local_iovs.push(libc::iovec { iov_base: buf.as_ptr() as *mut libc::c_void, @@ -515,7 +638,6 @@ async fn sendmsg_on_behalf( }); } - // 5. Copy control message buffer (ancillary data) let control_buf = if msg_control_ptr != 0 && msg_controllen > 0 { let len = (msg_controllen as usize).min(4096); read_child_mem(notif_fd, notif.id, notif.pid, msg_control_ptr, len).ok() @@ -523,13 +645,6 @@ async fn sendmsg_on_behalf( None }; - // 6. Duplicate child's socket into supervisor (use notif.pid for grandchild support) - let dup_fd = match crate::seccomp::notif::dup_fd_from_pid(notif.pid, sockfd) { - Ok(fd) => fd, - Err(_) => return NotifAction::Errno(libc::ENOSYS), - }; - - // 7. Build msghdr and perform sendmsg in supervisor let mut msg: libc::msghdr = unsafe { std::mem::zeroed() }; msg.msg_name = addr_bytes.as_ptr() as *mut libc::c_void; msg.msg_namelen = addr_bytes.len() as u32; @@ -541,13 +656,107 @@ async fn sendmsg_on_behalf( } let ret = unsafe { libc::sendmsg(dup_fd.as_raw_fd(), &msg, flags) }; - - // 8. Return result if ret >= 0 { - NotifAction::ReturnValue(ret as i64) + Ok(ret) } else { - let errno = unsafe { *libc::__errno_location() }; - NotifAction::Errno(errno) + Err(unsafe { *libc::__errno_location() }) + } +} + +// ============================================================ +// sendmmsg_on_behalf — multi-message variant +// ============================================================ + +/// `struct mmsghdr` size on Linux x86_64 / aarch64: 56-byte msghdr + +/// 4-byte msg_len + 4-byte tail padding = 64 bytes. msg_len lives at +/// offset 56. +const MMSGHDR_SIZE: usize = 64; +const MSG_LEN_OFFSET: usize = 56; +/// Cap on the number of messages we'll process per sendmmsg call. +/// Linux's UIO_MAXIOV is 1024; lower here to bound supervisor work +/// per syscall (each entry costs at minimum a few read_child_mem +/// hops + one sendmsg). 
+const MAX_MMSGHDR_ENTRIES: usize = 256; + +/// Perform `sendmmsg()` on behalf of the child. Pre-scans every entry +/// for Continue cases (NULL `msg_name` or non-IP family) — if any +/// entry would Continue, we Continue the whole syscall to match +/// `sendmsg_on_behalf`'s coarse-grained behavior. Otherwise dup the +/// child fd once, query SO_PROTOCOL once, then loop: +/// validate → send → write `msg_len` back to the child's mmsghdr. +/// +/// On partial failure (entry K denied or send fails), returns +/// `ReturnValue(K)` matching the kernel's "messages successfully +/// transmitted" semantics. Returns the errno only when the very first +/// entry fails — otherwise the child sees a positive count and reads +/// per-entry `msg_len` to learn the per-message status. +async fn sendmmsg_on_behalf( + notif: &SeccompNotif, + ctx: &Arc, + notif_fd: RawFd, +) -> NotifAction { + let args = &notif.data.args; + let sockfd = args[0] as i32; + let msgvec_ptr = args[1]; + let vlen = (args[2] as u32 as usize).min(MAX_MMSGHDR_ENTRIES); + let flags = args[3] as i32; + + if vlen == 0 { + return NotifAction::ReturnValue(0); + } + + // Pre-scan every entry. If any has a Continue-eligible shape + // (NULL msg_name or non-IP family), Continue the whole sendmmsg. + // Mixed-shape sendmmsg calls (some entries on-behalf, others not) + // aren't supported because Continue is binary at the syscall + // level.
+ for i in 0..vlen { + let entry_ptr = msgvec_ptr + (i * MMSGHDR_SIZE) as u64; + match prescan_msghdr(notif, notif_fd, entry_ptr) { + PrescanResult::OnBehalf => continue, + PrescanResult::ContinueWholeCall => return NotifAction::Continue, + PrescanResult::Errno(e) => return NotifAction::Errno(e), + } + } + + let dup_fd = match crate::seccomp::notif::dup_fd_from_pid(notif.pid, sockfd) { + Ok(fd) => fd, + Err(_) => return NotifAction::Errno(libc::ENOSYS), + }; + let protocol = match query_socket_protocol(dup_fd.as_raw_fd()) { + Some(p) => p, + None => return NotifAction::Errno(ECONNREFUSED), + }; + + let mut sent: usize = 0; + let mut first_errno: Option = None; + + for i in 0..vlen { + let entry_ptr = msgvec_ptr + (i * MMSGHDR_SIZE) as u64; + match send_msghdr_on_behalf(notif, ctx, notif_fd, &dup_fd, protocol, entry_ptr, flags).await { + Ok(n) => { + let bytes = (n as u32).to_ne_bytes(); + let _ = write_child_mem( + notif_fd, notif.id, notif.pid, + entry_ptr + MSG_LEN_OFFSET as u64, + &bytes, + ); + sent += 1; + } + Err(errno) => { + first_errno = Some(errno); + break; + } + } + } + + if sent > 0 { + NotifAction::ReturnValue(sent as i64) + } else { + // Defensive: vlen > 0 + no successes means at least one attempt + // failed, so first_errno is set. Fall back to ECONNREFUSED + // rather than panicking on the unwrap if invariants ever drift. + NotifAction::Errno(first_errno.unwrap_or(ECONNREFUSED)) } } @@ -591,6 +800,8 @@ pub(crate) async fn handle_net( sendto_on_behalf(notif, ctx, notif_fd).await } else if nr == libc::SYS_sendmsg { sendmsg_on_behalf(notif, ctx, notif_fd).await + } else if nr == libc::SYS_sendmmsg { + sendmmsg_on_behalf(notif, ctx, notif_fd).await } else { NotifAction::Continue } @@ -613,72 +824,116 @@ pub struct ResolvedNetAllow { pub per_ip_all_ports: HashSet, /// Ports permitted to any IP (the `:port` form). pub any_ip_ports: HashSet, - /// Any-host any-port wildcard (`:*` / `*:*`). 
When true, the - /// sandbox is fully unrestricted on outbound TCP/UDP and the - /// on-behalf path is bypassed (`NetworkPolicy::Unrestricted`). + /// Any-host any-port wildcard (`:*` / `*:*`, or `icmp://*`). When + /// true, the per-protocol policy becomes `Unrestricted` and the + /// on-behalf check is bypassed for that protocol. pub any_ip_all_ports: bool, - /// Synthetic `/etc/hosts` content for any concrete hostnames. - /// `None` when no concrete hostnames are present (real `/etc/hosts` - /// stays visible). +} + +/// Per-protocol resolved allowlists. Each protocol gets its own +/// `ResolvedNetAllow`; the on-behalf path picks the right one based on +/// the dup'd fd's `SO_PROTOCOL`. `etc_hosts` is shared across all +/// protocols (the synthetic file maps every concrete host that appears +/// in any rule). +pub struct ResolvedNetAllowSet { + pub tcp: ResolvedNetAllow, + pub udp: ResolvedNetAllow, + pub icmp: ResolvedNetAllow, + /// Synthetic `/etc/hosts` content combining every concrete host + /// across all protocols. `None` when no concrete hostnames appear. pub etc_hosts: Option, } -/// Resolve `--net-allow` rules into the runtime allowlist. +/// Resolve `--net-allow` rules into per-protocol runtime allowlists. +/// +/// Rules are grouped by `Protocol` and each group is resolved +/// independently. ICMP rules carry no ports, so the resulting ICMP +/// `ResolvedNetAllow` always has empty `any_ip_ports` / per-IP port +/// sets — the on-behalf check routes ICMP through the IP-only path +/// (PortAllow::Any). A `*` host on ICMP becomes `any_ip_all_ports`, +/// which the handler reads as "no destination check." 
pub async fn resolve_net_allow( rules: &[crate::policy::NetAllow], -) -> io::Result { - let mut per_ip: HashMap> = HashMap::new(); - let mut per_ip_all_ports: HashSet = HashSet::new(); - let mut any_ip_ports: HashSet = HashSet::new(); - let mut any_ip_all_ports = false; +) -> io::Result { + use crate::policy::Protocol; + + // Single shared etc_hosts for all protocols. Every concrete host + // (regardless of protocol) ends up resolvable in the sandbox. let mut etc_hosts = String::from("127.0.0.1 localhost\n::1 localhost\n"); let mut has_concrete_host = false; - for rule in rules { - match &rule.host { - None => { - if rule.all_ports { - any_ip_all_ports = true; - } else { - for &p in &rule.ports { - any_ip_ports.insert(p); - } - } - } - Some(host) => { - has_concrete_host = true; - let addr = format!("{}:0", host); - let resolved = tokio::net::lookup_host(addr.as_str()).await.map_err(|e| { - io::Error::new( - e.kind(), - format!("failed to resolve host '{}': {}", host, e), - ) - })?; - for socket_addr in resolved { - let ip = socket_addr.ip(); - if rule.all_ports { - per_ip_all_ports.insert(ip); - // Keep an entry in per_ip so callers iterating - // resolved hosts still see this IP. The runtime - // policy honors per_ip_all_ports first. - per_ip.entry(ip).or_default(); + let per_proto = |target: Protocol| async move { + let mut per_ip: HashMap> = HashMap::new(); + let mut per_ip_all_ports: HashSet = HashSet::new(); + let mut any_ip_ports: HashSet = HashSet::new(); + let mut any_ip_all_ports = false; + let mut local_etc_hosts = String::new(); + let mut local_has_concrete = false; + + for rule in rules.iter().filter(|r| r.protocol == target) { + match &rule.host { + None => { + if rule.all_ports || target == Protocol::Icmp { + // ICMP rules never carry ports, so a wildcard-host + // ICMP rule (`icmp://*`) means "any destination." 
+ any_ip_all_ports = true; } else { - let entry = per_ip.entry(ip).or_default(); for &p in &rule.ports { - entry.insert(p); + any_ip_ports.insert(p); } } - etc_hosts.push_str(&format!("{} {}\n", ip, host)); + } + Some(host) => { + local_has_concrete = true; + let addr = format!("{}:0", host); + let resolved = tokio::net::lookup_host(addr.as_str()).await.map_err(|e| { + io::Error::new( + e.kind(), + format!("failed to resolve host '{}': {}", host, e), + ) + })?; + for socket_addr in resolved { + let ip = socket_addr.ip(); + if rule.all_ports || target == Protocol::Icmp { + per_ip_all_ports.insert(ip); + per_ip.entry(ip).or_default(); + } else { + let entry = per_ip.entry(ip).or_default(); + for &p in &rule.ports { + entry.insert(p); + } + } + local_etc_hosts.push_str(&format!("{} {}\n", ip, host)); + } } } } + + Ok::<_, io::Error>(( + ResolvedNetAllow { + per_ip, + per_ip_all_ports, + any_ip_ports, + any_ip_all_ports, + }, + local_etc_hosts, + local_has_concrete, + )) + }; + + let (tcp, tcp_eh, tcp_concrete) = per_proto(Protocol::Tcp).await?; + let (udp, udp_eh, udp_concrete) = per_proto(Protocol::Udp).await?; + let (icmp, icmp_eh, icmp_concrete) = per_proto(Protocol::Icmp).await?; + + for chunk in [tcp_eh, udp_eh, icmp_eh] { + etc_hosts.push_str(&chunk); } + has_concrete_host |= tcp_concrete || udp_concrete || icmp_concrete; - Ok(ResolvedNetAllow { - per_ip, - per_ip_all_ports, - any_ip_ports, - any_ip_all_ports, + Ok(ResolvedNetAllowSet { + tcp, + udp, + icmp, etc_hosts: if has_concrete_host { Some(etc_hosts) } else { None }, }) } @@ -695,88 +950,161 @@ mod tests { #[tokio::test] async fn test_resolve_net_allow_empty() { let resolved = resolve_net_allow(&[]).await.unwrap(); - assert!(resolved.per_ip.is_empty()); - assert!(resolved.any_ip_ports.is_empty()); + assert!(resolved.tcp.per_ip.is_empty()); + assert!(resolved.tcp.any_ip_ports.is_empty()); + assert!(resolved.udp.per_ip.is_empty()); + assert!(resolved.icmp.per_ip.is_empty()); 
assert!(resolved.etc_hosts.is_none()); } #[tokio::test] async fn test_resolve_net_allow_concrete_host() { let rules = vec![NetAllow { + protocol: crate::policy::Protocol::Tcp, host: Some("localhost".to_string()), ports: vec![80, 443], all_ports: false, }]; let resolved = resolve_net_allow(&rules).await.unwrap(); - // localhost should resolve to at least one loopback addr. - assert!(!resolved.per_ip.is_empty()); - for ports in resolved.per_ip.values() { + // localhost should resolve to at least one loopback addr; only + // the TCP set has entries. + assert!(!resolved.tcp.per_ip.is_empty()); + for ports in resolved.tcp.per_ip.values() { assert!(ports.contains(&80)); assert!(ports.contains(&443)); } + assert!(resolved.udp.per_ip.is_empty()); + assert!(resolved.icmp.per_ip.is_empty()); assert!(resolved.etc_hosts.as_deref().unwrap_or("").contains("localhost")); } #[tokio::test] async fn test_resolve_net_allow_any_ip() { - let rules = vec![NetAllow { host: None, ports: vec![8080], all_ports: false }]; + let rules = vec![NetAllow { protocol: crate::policy::Protocol::Tcp, host: None, ports: vec![8080], all_ports: false }]; let resolved = resolve_net_allow(&rules).await.unwrap(); - assert!(resolved.per_ip.is_empty()); - assert!(resolved.any_ip_ports.contains(&8080)); - assert!(!resolved.any_ip_all_ports); + assert!(resolved.tcp.per_ip.is_empty()); + assert!(resolved.tcp.any_ip_ports.contains(&8080)); + assert!(!resolved.tcp.any_ip_all_ports); assert!(resolved.etc_hosts.is_none()); } #[tokio::test] async fn test_resolve_net_allow_any_ip_all_ports() { - // `:*` — fully unrestricted egress. - let rules = vec![NetAllow { host: None, ports: vec![], all_ports: true }]; + // `:*` — fully unrestricted egress, TCP-only. 
+ let rules = vec![NetAllow { protocol: crate::policy::Protocol::Tcp, host: None, ports: vec![], all_ports: true }]; let resolved = resolve_net_allow(&rules).await.unwrap(); - assert!(resolved.any_ip_all_ports); - assert!(resolved.per_ip.is_empty()); - assert!(resolved.per_ip_all_ports.is_empty()); - assert!(resolved.any_ip_ports.is_empty()); + assert!(resolved.tcp.any_ip_all_ports); + assert!(resolved.tcp.per_ip.is_empty()); + assert!(resolved.tcp.per_ip_all_ports.is_empty()); + assert!(resolved.tcp.any_ip_ports.is_empty()); + // UDP/ICMP unaffected by a TCP rule. + assert!(!resolved.udp.any_ip_all_ports); + assert!(!resolved.icmp.any_ip_all_ports); } #[tokio::test] async fn test_resolve_net_allow_concrete_host_all_ports() { - // `localhost:*` — every port to localhost only. + // `localhost:*` — every port to localhost only, TCP. let rules = vec![NetAllow { + protocol: crate::policy::Protocol::Tcp, host: Some("localhost".to_string()), ports: vec![], all_ports: true, }]; let resolved = resolve_net_allow(&rules).await.unwrap(); - assert!(!resolved.any_ip_all_ports); - assert!(!resolved.per_ip_all_ports.is_empty(), + assert!(!resolved.tcp.any_ip_all_ports); + assert!(!resolved.tcp.per_ip_all_ports.is_empty(), "localhost should resolve to at least one IP marked as any-port"); - // per_ip has placeholder entries for the same IPs (so callers - // iterating per_ip still see them). - for ip in resolved.per_ip_all_ports.iter() { - assert!(resolved.per_ip.contains_key(ip)); + for ip in resolved.tcp.per_ip_all_ports.iter() { + assert!(resolved.tcp.per_ip.contains_key(ip)); } - // /etc/hosts is synthesized for concrete hosts. 
assert!(resolved.etc_hosts.is_some()); } #[tokio::test] async fn test_resolve_net_allow_mixed_wildcard_and_concrete() { // Wildcard rule alongside concrete: wildcard sets the global - // any-host any-port flag; concrete rule still resolves into - // per_ip (the runtime layer chooses Unrestricted, ignoring the - // concrete entries — that's a runtime-policy concern, not a - // resolver concern). + // any-host any-port flag for TCP; concrete rule still resolves + // into per_ip (the runtime layer chooses Unrestricted, ignoring + // the concrete entries). let rules = vec![ - NetAllow { host: None, ports: vec![], all_ports: true }, + NetAllow { protocol: crate::policy::Protocol::Tcp, host: None, ports: vec![], all_ports: true }, NetAllow { + protocol: crate::policy::Protocol::Tcp, host: Some("localhost".to_string()), ports: vec![22], all_ports: false, }, ]; let resolved = resolve_net_allow(&rules).await.unwrap(); - assert!(resolved.any_ip_all_ports); - // Concrete entries still present in per_ip. - assert!(!resolved.per_ip.is_empty()); + assert!(resolved.tcp.any_ip_all_ports); + assert!(!resolved.tcp.per_ip.is_empty()); + } + + // ============================================================ + // Per-protocol resolution — UDP / ICMP slices stay isolated + // ============================================================ + + #[tokio::test] + async fn test_resolve_per_protocol_isolation() { + // A UDP rule should not appear in the TCP set, and vice versa. + // This is the property Phase 2 relies on for protocol routing. 
+ let rules = vec![ + NetAllow { + protocol: crate::policy::Protocol::Tcp, + host: Some("localhost".to_string()), + ports: vec![443], + all_ports: false, + }, + NetAllow { + protocol: crate::policy::Protocol::Udp, + host: None, + ports: vec![53], + all_ports: false, + }, + ]; + let resolved = resolve_net_allow(&rules).await.unwrap(); + assert!(!resolved.tcp.per_ip.is_empty(), "TCP rule should populate tcp set"); + assert!(resolved.udp.any_ip_ports.contains(&53), "UDP rule should populate udp set"); + // Cross-contamination check: TCP per_ip ports must not contain 53; + // UDP must not contain 443. + for ports in resolved.tcp.per_ip.values() { + assert!(!ports.contains(&53), "UDP port leaked into TCP set"); + } + assert!(!resolved.udp.any_ip_ports.contains(&443), "TCP port leaked into UDP set"); + } + + #[tokio::test] + async fn test_resolve_icmp_no_ports() { + // ICMP rules carry no ports; concrete hosts go into per_ip with + // PortAllow::Any-style empty port set, plus per_ip_all_ports. + let rules = vec![NetAllow { + protocol: crate::policy::Protocol::Icmp, + host: Some("localhost".to_string()), + ports: vec![], + all_ports: false, + }]; + let resolved = resolve_net_allow(&rules).await.unwrap(); + assert!(!resolved.icmp.per_ip.is_empty(), "icmp host should populate per_ip"); + assert!(!resolved.icmp.per_ip_all_ports.is_empty(), + "icmp host should mark per_ip_all_ports (no port check)"); + assert!(resolved.icmp.any_ip_ports.is_empty()); + // TCP/UDP unaffected. + assert!(resolved.tcp.per_ip.is_empty()); + assert!(resolved.udp.per_ip.is_empty()); + } + + #[tokio::test] + async fn test_resolve_icmp_wildcard() { + // `icmp://*` — any ICMP destination. 
+ let rules = vec![NetAllow { + protocol: crate::policy::Protocol::Icmp, + host: None, + ports: vec![], + all_ports: false, + }]; + let resolved = resolve_net_allow(&rules).await.unwrap(); + assert!(resolved.icmp.any_ip_all_ports); + assert!(!resolved.tcp.any_ip_all_ports); } } diff --git a/crates/sandlock-core/src/policy.rs b/crates/sandlock-core/src/policy.rs index 0f3ffbb..a2fc45c 100644 --- a/crates/sandlock-core/src/policy.rs +++ b/crates/sandlock-core/src/policy.rs @@ -103,8 +103,6 @@ impl TryFrom<&Policy> for ConfinePolicy { if !policy.block_syscalls.is_empty() { unsupported.push("block_syscalls"); } if !policy.net_allow.is_empty() { unsupported.push("net_allow"); } if !policy.net_bind.is_empty() { unsupported.push("net_bind"); } - if policy.allow_udp { unsupported.push("allow_udp"); } - if policy.allow_icmp { unsupported.push("allow_icmp"); } if policy.allow_sysv_ipc { unsupported.push("allow_sysv_ipc"); } if !policy.http_allow.is_empty() { unsupported.push("http_allow"); } if !policy.http_deny.is_empty() { unsupported.push("http_deny"); } @@ -168,18 +166,57 @@ pub enum BranchAction { Keep, } +/// L4 protocol that a `NetAllow` rule applies to. +/// +/// `Tcp` is the default if a rule has no scheme (the bare `host:port` +/// form). `Udp` and `Icmp` require an explicit scheme. +/// +/// `Icmp` is the kernel's unprivileged ping socket +/// (`SOCK_DGRAM + IPPROTO_ICMP{,V6}`), gated by `ping_group_range` — +/// destinations are filterable per host. Sandlock does not expose raw +/// ICMP (`SOCK_RAW + IPPROTO_ICMP`): destination filtering at `sendto` +/// would lie because raw sockets let the agent craft the IP header, +/// and packet-crafting capabilities aren't part of the XOA threat +/// model. Workloads that genuinely need raw ICMP should run outside +/// sandlock or rely on the host's `ping_group_range` for the dgram +/// path instead. 
+#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, Serialize, Deserialize)] +#[serde(rename_all = "lowercase")] +pub enum Protocol { + Tcp, + Udp, + Icmp, +} + +impl Protocol { + fn parse(s: &str) -> Option { + match s { + "tcp" => Some(Protocol::Tcp), + "udp" => Some(Protocol::Udp), + "icmp" => Some(Protocol::Icmp), + _ => None, + } + } +} + /// A network endpoint allow rule. /// -/// Each rule permits TCP `connect()` to one host (or any IP, for the -/// `:port` form) on a specific set of ports. Multiple rules are OR'd: -/// a connection is permitted if any rule matches both the destination -/// IP and the destination port. +/// Each rule permits one protocol's traffic to one host (or any IP, for +/// the `:port` form) on a specific set of ports. Multiple rules are +/// OR'd: traffic is permitted if any rule matches the protocol, the +/// destination IP, and the destination port. +/// +/// ICMP rules carry no port (ICMP has none); their `ports` is empty +/// and `all_ports` is false. #[derive(Clone, Debug, Serialize, Deserialize, PartialEq)] pub struct NetAllow { - /// Hostname; `None` means "any IP" (the `:port` form). + /// L4 protocol this rule applies to. + #[serde(default = "default_protocol_tcp")] + pub protocol: Protocol, + /// Hostname; `None` means "any IP" (the `:port` form, or `icmp://*`). pub host: Option, /// Permitted ports. Must be non-empty unless `all_ports` is true, - /// in which case it must be empty. + /// in which case it must be empty. Always empty for `Protocol::Icmp`. pub ports: Vec, /// "Any port" wildcard from the `*` token in port position. When /// true, `ports` is empty; the rule permits every TCP/UDP port to @@ -188,14 +225,41 @@ pub struct NetAllow { pub all_ports: bool, } +fn default_protocol_tcp() -> Protocol { Protocol::Tcp } + impl NetAllow { - /// Parse a `host:port[,port,...]` / `:port` / `*:port` / - /// `host:*` / `:*` / `*:*` spec. + /// Parse a rule spec. 
Forms: + /// + /// - `host:port[,port,...]`, `:port`, `*:port`, `host:*`, `:*`, `*:*` + /// — TCP (the default scheme). + /// - `tcp://...` — explicit TCP, same suffix grammar as the bare form. + /// - `udp://...` — UDP, same suffix grammar as the bare form. + /// - `icmp://host` or `icmp://*` — ICMP echo (kernel ping socket). + /// No port field; `icmp://host:80` is rejected. /// /// `*` in port position means "any port" (the all-ports wildcard). /// Mixing `*` with concrete ports (e.g. `host:80,*`) is rejected. pub fn parse(s: &str) -> Result { - let (host_part, port_part) = s.rsplit_once(':').ok_or_else(|| { + // Split off the optional scheme prefix `://`. If absent, + // default to TCP and the rest of the parser is unchanged. + let (protocol, rest) = match s.split_once("://") { + Some((scheme, body)) => { + let proto = Protocol::parse(scheme).ok_or_else(|| { + PolicyError::Invalid(format!( + "--net-allow: unknown scheme `{}://` in `{}` (expected tcp, udp, icmp)", + scheme, s + )) + })?; + (proto, body) + } + None => (Protocol::Tcp, s), + }; + + if protocol == Protocol::Icmp { + return Self::parse_icmp(rest, s); + } + + let (host_part, port_part) = rest.rsplit_once(':').ok_or_else(|| { PolicyError::Invalid(format!( "--net-allow: expected `host:port` or `:port`, got `{}`", s @@ -240,7 +304,34 @@ impl NetAllow { s ))); } - Ok(NetAllow { host, ports, all_ports: saw_wildcard }) + Ok(NetAllow { protocol, host, ports, all_ports: saw_wildcard }) + } + + /// Parse the body of an `icmp://` rule. Accepts a host or `*` — + /// ICMP has no ports, so any `:` separator is rejected. 
+ fn parse_icmp(body: &str, full: &str) -> Result { + if body.contains(':') { + return Err(PolicyError::Invalid(format!( + "--net-allow: icmp rules take no port, got `{}`", + full + ))); + } + if body.is_empty() { + return Err(PolicyError::Invalid(format!( + "--net-allow: icmp rule needs a host or `*`, got `{}`", + full + ))); + } + let host = match body { + "*" => None, + h => Some(h.to_string()), + }; + Ok(NetAllow { + protocol: Protocol::Icmp, + host, + ports: Vec::new(), + all_ports: false, + }) } } @@ -426,32 +517,30 @@ pub struct Policy { pub block_syscalls: Vec, // Network - /// Outbound endpoint allowlist as a list of `(host?, ports)` rules. - /// Applies to TCP `connect()` and to UDP `sendto`/`sendmsg` - /// destinations when `allow_udp` is set. + /// Outbound endpoint allowlist as a list of `(protocol, host?, ports)` + /// rules. Each rule names a protocol (TCP/UDP/ICMP) and either a + /// concrete host or "any IP." TCP and UDP rules carry ports; ICMP + /// rules have none. + /// + /// **Protocol gating falls out of rule presence.** Sandlock denies + /// UDP and ICMP socket creation by default; opting in is "list at + /// least one rule for that protocol" (e.g. `udp://*:*` for any UDP, + /// `icmp://*` for any ICMP echo). TCP is always permitted. /// /// Empty `net_allow` and empty `http_allow`/`http_deny` together /// mean "deny all outbound" (Landlock direct path denies, no /// on-behalf path is enabled). Otherwise, the on-behalf path /// enforces these rules: a destination is permitted iff any rule - /// matches both the destination IP (or has `host: None` = any IP) - /// and the destination port — same check for TCP and UDP. + /// matches the protocol, destination IP (or has `host: None` = any + /// IP), and destination port (N/A for ICMP). /// - /// HTTP rules with concrete hosts auto-add a matching `(host, [80])` - /// (and `(host, [443])` when `--https-ca` is set) entry at build - /// time so the proxy's intercept ports remain reachable. 
HTTP rules - /// with wildcard hosts auto-add `(None, [80])` instead. + /// HTTP rules with concrete hosts auto-add a matching + /// `(Tcp, host, [80])` (and `(Tcp, host, [443])` when `--https-ca` + /// is set) entry at build time so the proxy's intercept ports + /// remain reachable. HTTP rules with wildcard hosts auto-add + /// `(Tcp, None, [80])` instead. pub net_allow: Vec, pub net_bind: Vec, - /// Permit UDP socket creation (`socket(_, SOCK_DGRAM, _)`). UDP is - /// denied by default; outbound destinations remain gated by the - /// `net_allow` endpoint allowlist when set. - pub allow_udp: bool, - /// Narrow ICMP carve-out: permit `socket(AF_INET, SOCK_RAW, - /// IPPROTO_ICMP)` and the IPv6 equivalent. All other raw socket - /// types remain denied. Useful for `ping` without granting full - /// packet-crafting capability. - pub allow_icmp: bool, /// Permit SysV IPC syscalls: shared memory (`shmget`/`shmat`/ /// `shmdt`/`shmctl`), message queues (`msgget`/`msgsnd`/`msgrcv`/ /// `msgctl`), and semaphores (`semget`/`semop`/`semctl`/ @@ -566,8 +655,6 @@ pub struct PolicyBuilder { /// Raw `--net-allow` specs; parsed in `build()` to surface errors. net_allow: Vec, net_bind: Vec, - allow_udp: bool, - allow_icmp: bool, allow_sysv_ipc: bool, http_allow: Vec, @@ -659,20 +746,6 @@ impl PolicyBuilder { self } - /// Permit UDP socket creation. UDP is denied by default; - /// outbound destinations remain gated by `net_allow` if set. - pub fn allow_udp(mut self, v: bool) -> Self { - self.allow_udp = v; - self - } - - /// Permit `socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)` and the IPv6 - /// equivalent only. Other raw socket types stay denied. - pub fn allow_icmp(mut self, v: bool) -> Self { - self.allow_icmp = v; - self - } - /// Permit SysV IPC syscalls (shm/msg/sem). Denied by default /// because sandlock does not use IPC namespaces — without this /// denial, sandboxes on the same host share a SysV keyspace. 
@@ -911,6 +984,7 @@ impl PolicyBuilder { if wildcard_seen || (http_allow.is_empty() && http_deny.is_empty()) { // Fallback: explicit --http-port without rules, or wildcard rules. net_allow.push(NetAllow { + protocol: Protocol::Tcp, host: None, ports: http_ports.clone(), all_ports: false, @@ -918,6 +992,7 @@ impl PolicyBuilder { } for h in concrete_hosts { net_allow.push(NetAllow { + protocol: Protocol::Tcp, host: Some(h), ports: http_ports.clone(), all_ports: false, @@ -938,8 +1013,6 @@ impl PolicyBuilder { block_syscalls: self.block_syscalls, net_allow, net_bind: self.net_bind, - allow_udp: self.allow_udp, - allow_icmp: self.allow_icmp, allow_sysv_ipc: self.allow_sysv_ipc, http_allow, http_deny, @@ -1367,4 +1440,80 @@ mod http_rule_tests { assert!(r.all_ports); assert!(r.ports.is_empty()); } + + // --- Protocol scheme prefix tests --- + + #[test] + fn netallow_bare_form_defaults_to_tcp() { + let r = NetAllow::parse("example.com:443").unwrap(); + assert_eq!(r.protocol, Protocol::Tcp); + } + + #[test] + fn netallow_explicit_tcp_scheme() { + let r = NetAllow::parse("tcp://example.com:443").unwrap(); + assert_eq!(r.protocol, Protocol::Tcp); + assert_eq!(r.host.as_deref(), Some("example.com")); + assert_eq!(r.ports, vec![443]); + } + + #[test] + fn netallow_udp_scheme_with_host_port() { + let r = NetAllow::parse("udp://1.1.1.1:53").unwrap(); + assert_eq!(r.protocol, Protocol::Udp); + assert_eq!(r.host.as_deref(), Some("1.1.1.1")); + assert_eq!(r.ports, vec![53]); + } + + #[test] + fn netallow_udp_wildcard_any_anywhere() { + // The "any UDP" gate, equivalent to the old `allow_udp = true`. 
+ let r = NetAllow::parse("udp://*:*").unwrap(); + assert_eq!(r.protocol, Protocol::Udp); + assert_eq!(r.host, None); + assert!(r.all_ports); + } + + #[test] + fn netallow_icmp_scheme_with_host() { + let r = NetAllow::parse("icmp://github.com").unwrap(); + assert_eq!(r.protocol, Protocol::Icmp); + assert_eq!(r.host.as_deref(), Some("github.com")); + assert!(r.ports.is_empty()); + assert!(!r.all_ports); + } + + #[test] + fn netallow_icmp_wildcard() { + // The "any ICMP echo" gate, equivalent to the old + // `allow_icmp = true` for the SOCK_DGRAM path. + let r = NetAllow::parse("icmp://*").unwrap(); + assert_eq!(r.protocol, Protocol::Icmp); + assert_eq!(r.host, None); + } + + #[test] + fn netallow_icmp_rejects_port() { + // ICMP has no port — `:port` is meaningless and refused + // explicitly so users can't write a rule that doesn't do what + // they think. + let err = NetAllow::parse("icmp://github.com:80").unwrap_err(); + assert!(format!("{}", err).contains("icmp rules take no port")); + } + + #[test] + fn netallow_icmp_rejects_empty_body() { + let err = NetAllow::parse("icmp://").unwrap_err(); + assert!(format!("{}", err).contains("needs a host or `*`")); + } + + #[test] + fn netallow_unknown_scheme_rejected() { + // Including `icmp-raw` — sandlock does not expose raw ICMP, so + // the scheme is unknown rather than a special-case error. 
+ for spec in ["sctp://host:1234", "icmp-raw://*"] { + let err = NetAllow::parse(spec).unwrap_err(); + assert!(format!("{}", err).contains("unknown scheme"), "spec: {}", spec); + } + } } diff --git a/crates/sandlock-core/src/profile.rs b/crates/sandlock-core/src/profile.rs index f388257..a9db120 100644 --- a/crates/sandlock-core/src/profile.rs +++ b/crates/sandlock-core/src/profile.rs @@ -96,12 +96,6 @@ pub fn parse_profile(content: &str) -> Result { } // Parse booleans - if let Some(v) = sandbox.get("allow_udp").and_then(|v| v.as_bool()) { - builder = builder.allow_udp(v); - } - if let Some(v) = sandbox.get("allow_icmp").and_then(|v| v.as_bool()) { - builder = builder.allow_icmp(v); - } if let Some(v) = sandbox.get("allow_sysv_ipc").and_then(|v| v.as_bool()) { builder = builder.allow_sysv_ipc(v); } diff --git a/crates/sandlock-core/src/sandbox.rs b/crates/sandlock-core/src/sandbox.rs index 6595116..2ee247f 100644 --- a/crates/sandlock-core/src/sandbox.rs +++ b/crates/sandlock-core/src/sandbox.rs @@ -1090,38 +1090,43 @@ impl Sandbox { // TimeRandomState let time_random_state = TimeRandomState::new(time_offset, random_state); - // NetworkState + // NetworkState — three protocol-keyed policies. Each is + // built from the protocol's slice of net_allow rules; the + // on-behalf handler picks the right one at check time + // based on the dup'd fd's SO_PROTOCOL. A protocol with no + // rules gets `Unrestricted` *only* when there are no rules + // for any protocol — otherwise it's an empty AllowList, + // i.e. deny-all for that protocol. (Empty across the board + // means "no on-behalf path active," matching pre-Phase-1 + // behavior where Landlock is the sole enforcer.) let mut net_state = NetworkState::new(); - net_state.network_policy = if self.policy.net_allow.is_empty() - || resolved_net_allow.any_ip_all_ports - { - // Empty net_allow leaves only Landlock to enforce - // (kernel-level deny-all-connect by default). 
The - // `:*` wildcard explicitly opens egress, so the - // on-behalf path becomes a no-op. - crate::seccomp::notif::NetworkPolicy::Unrestricted - } else { - use crate::seccomp::notif::PortAllow; - let per_ip = resolved_net_allow - .per_ip - .iter() - .map(|(ip, ports)| { - // `host:*` resolves into per_ip_all_ports; for those - // IPs we use PortAllow::Any regardless of the (empty) - // port set in per_ip. - let allow = if resolved_net_allow.per_ip_all_ports.contains(ip) { - PortAllow::Any - } else { - PortAllow::Specific(ports.clone()) - }; - (*ip, allow) - }) - .collect(); - crate::seccomp::notif::NetworkPolicy::AllowList { - per_ip, - any_ip_ports: resolved_net_allow.any_ip_ports.clone(), + let no_rules = self.policy.net_allow.is_empty(); + let policy_from = |resolved: &network::ResolvedNetAllow| { + if no_rules || resolved.any_ip_all_ports { + crate::seccomp::notif::NetworkPolicy::Unrestricted + } else { + use crate::seccomp::notif::PortAllow; + let per_ip = resolved + .per_ip + .iter() + .map(|(ip, ports)| { + let allow = if resolved.per_ip_all_ports.contains(ip) { + PortAllow::Any + } else { + PortAllow::Specific(ports.clone()) + }; + (*ip, allow) + }) + .collect(); + crate::seccomp::notif::NetworkPolicy::AllowList { + per_ip, + any_ip_ports: resolved.any_ip_ports.clone(), + } } }; + net_state.tcp_policy = policy_from(&resolved_net_allow.tcp); + net_state.udp_policy = policy_from(&resolved_net_allow.udp); + net_state.icmp_policy = policy_from(&resolved_net_allow.icmp); net_state.http_acl_addr = self.http_acl_handle.as_ref().map(|h| h.addr); net_state.http_acl_ports = self.policy.http_ports.iter().copied().collect(); net_state.http_acl_orig_dest = self.http_acl_handle.as_ref().map(|h| h.orig_dest.clone()); @@ -1165,14 +1170,17 @@ impl Sandbox { // from per_ip keys (each represents an IP that some // endpoint rule mentions). The any_ip case has no IPs to // expose to the callback. 
- let allowed_ips = match &net_state.network_policy { - crate::seccomp::notif::NetworkPolicy::AllowList { per_ip, .. } => { - per_ip.keys().copied().collect() - } - crate::seccomp::notif::NetworkPolicy::Unrestricted => { - std::collections::HashSet::new() + // The dynamic-policy live view exposes the IPs the + // sandbox can talk to; that's the union of TCP+UDP+ICMP + // destination IPs (plus any from policy_fn overrides + // applied later). We collect from all three policies. + let mut allowed_ips: std::collections::HashSet = + std::collections::HashSet::new(); + for p in [&net_state.tcp_policy, &net_state.udp_policy, &net_state.icmp_policy] { + if let crate::seccomp::notif::NetworkPolicy::AllowList { per_ip, .. } = p { + allowed_ips.extend(per_ip.keys().copied()); } - }; + } let live = crate::policy_fn::LivePolicy { allowed_ips, max_memory_bytes: notif_policy.max_memory_bytes, diff --git a/crates/sandlock-core/src/seccomp/dispatch.rs b/crates/sandlock-core/src/seccomp/dispatch.rs index d6b560d..dd6a42e 100644 --- a/crates/sandlock-core/src/seccomp/dispatch.rs +++ b/crates/sandlock-core/src/seccomp/dispatch.rs @@ -313,7 +313,12 @@ pub(crate) fn build_dispatch_table( // Network (conditional on has_net_allowlist || has_http_acl) // ------------------------------------------------------------------ if policy.has_net_allowlist || policy.has_http_acl { - for &nr in &[libc::SYS_connect, libc::SYS_sendto, libc::SYS_sendmsg] { + for &nr in &[ + libc::SYS_connect, + libc::SYS_sendto, + libc::SYS_sendmsg, + libc::SYS_sendmmsg, + ] { let __sup = Arc::clone(ctx); table.register(nr, move |cx: &HandlerCtx| { let notif = cx.notif; diff --git a/crates/sandlock-core/src/seccomp/notif.rs b/crates/sandlock-core/src/seccomp/notif.rs index b031ba4..879b365 100644 --- a/crates/sandlock-core/src/seccomp/notif.rs +++ b/crates/sandlock-core/src/seccomp/notif.rs @@ -575,6 +575,7 @@ fn syscall_name(nr: i64) -> &'static str { n if n == libc::SYS_connect => "connect", n if n == 
libc::SYS_sendto => "sendto", n if n == libc::SYS_sendmsg => "sendmsg", + n if n == libc::SYS_sendmmsg => "sendmmsg", n if n == libc::SYS_bind => "bind", n if n == libc::SYS_clone => "clone", n if n == libc::SYS_clone3 => "clone3", @@ -605,7 +606,8 @@ fn syscall_category(nr: i64) -> crate::policy_fn::SyscallCategory { || n == libc::SYS_faccessat || n == libc::SYS_getdents64 || Some(n) == arch::SYS_GETDENTS => SyscallCategory::File, n if n == libc::SYS_connect || n == libc::SYS_sendto - || n == libc::SYS_sendmsg || n == libc::SYS_bind + || n == libc::SYS_sendmsg || n == libc::SYS_sendmmsg + || n == libc::SYS_bind || n == libc::SYS_getsockname => SyscallCategory::Network, n if n == libc::SYS_clone || n == libc::SYS_clone3 || Some(n) == arch::SYS_VFORK || Some(n) == arch::SYS_FORK @@ -1251,7 +1253,9 @@ mod tests { #[test] fn test_network_state_new() { let ns = super::super::state::NetworkState::new(); - assert!(matches!(ns.network_policy, NetworkPolicy::Unrestricted)); + assert!(matches!(ns.tcp_policy, NetworkPolicy::Unrestricted)); + assert!(matches!(ns.udp_policy, NetworkPolicy::Unrestricted)); + assert!(matches!(ns.icmp_policy, NetworkPolicy::Unrestricted)); assert!(ns.port_map.bound_ports.is_empty()); } diff --git a/crates/sandlock-core/src/seccomp/state.rs b/crates/sandlock-core/src/seccomp/state.rs index b0a3a97..e3f26f9 100644 --- a/crates/sandlock-core/src/seccomp/state.rs +++ b/crates/sandlock-core/src/seccomp/state.rs @@ -296,10 +296,18 @@ impl CowState { // NetworkState — network policy and port remapping state // ============================================================ -/// Network policy and port-remapping state. +/// Network policy and port-remapping state. Holds one +/// `NetworkPolicy` per protocol (TCP, UDP, ICMP); the on-behalf handler +/// picks the matching one based on the dup'd fd's `SO_PROTOCOL`. pub struct NetworkState { - /// Global network policy: endpoint-level allowlist or unrestricted.
- pub network_policy: crate::seccomp::notif::NetworkPolicy, + /// Allowlist for TCP destinations (`tcp://...` and bare-form rules). + pub tcp_policy: crate::seccomp::notif::NetworkPolicy, + /// Allowlist for UDP destinations (`udp://...` rules). + pub udp_policy: crate::seccomp::notif::NetworkPolicy, + /// Allowlist for ICMP destinations (`icmp://...` rules). ICMP rules + /// carry no ports, so every entry uses `PortAllow::Any` and the + /// effective check is IP-only. + pub icmp_policy: crate::seccomp::notif::NetworkPolicy, /// Port binding and remapping tracker. pub port_map: crate::port_remap::PortMap, /// Per-PID network overrides from policy_fn (IP-only via the legacy @@ -316,7 +324,9 @@ pub struct NetworkState { impl NetworkState { pub fn new() -> Self { Self { - network_policy: crate::seccomp::notif::NetworkPolicy::Unrestricted, + tcp_policy: crate::seccomp::notif::NetworkPolicy::Unrestricted, + udp_policy: crate::seccomp::notif::NetworkPolicy::Unrestricted, + icmp_policy: crate::seccomp::notif::NetworkPolicy::Unrestricted, port_map: crate::port_remap::PortMap::new(), pid_ip_overrides: std::sync::Arc::new(std::sync::RwLock::new(HashMap::new())), http_acl_addr: None, @@ -325,16 +335,20 @@ impl NetworkState { } } - /// Get the effective network policy for a PID. + /// Get the effective network policy for a PID and protocol. /// - /// Priority: per-PID override > live policy (from PolicyFnState) > global network_policy. + /// Priority: per-PID override > live policy (from PolicyFnState) > + /// the per-protocol allowlist for `protocol`. /// PID/live overrides are IP-only — any port is permitted to listed - /// IPs (legacy `policy_fn` semantics). + /// IPs (legacy `policy_fn` semantics) — and they apply across all + /// protocols, since the legacy API didn't distinguish them. 
pub fn effective_network_policy( &self, pid: u32, + protocol: crate::policy::Protocol, live_policy: Option<&std::sync::Arc>>, ) -> crate::seccomp::notif::NetworkPolicy { + use crate::policy::Protocol; use crate::seccomp::notif::{NetworkPolicy, PortAllow}; let ip_only_allow = |ips: &HashSet| { let per_ip = ips.iter().map(|&ip| (ip, PortAllow::Any)).collect(); @@ -355,7 +369,11 @@ impl NetworkState { } } } - self.network_policy.clone() + match protocol { + Protocol::Tcp => self.tcp_policy.clone(), + Protocol::Udp => self.udp_policy.clone(), + Protocol::Icmp => self.icmp_policy.clone(), + } } } diff --git a/crates/sandlock-core/src/sys/structs.rs b/crates/sandlock-core/src/sys/structs.rs index 02f8f4c..fd8c10a 100644 --- a/crates/sandlock-core/src/sys/structs.rs +++ b/crates/sandlock-core/src/sys/structs.rs @@ -249,8 +249,6 @@ pub const AF_INET6: u32 = 10; pub const SOCK_RAW: u32 = 3; pub const SOCK_DGRAM: u32 = 2; pub const SOCK_TYPE_MASK: u32 = 0xFF; -pub const IPPROTO_ICMP: u32 = 1; -pub const IPPROTO_ICMPV6: u32 = 58; // ============================================================ // Errno values diff --git a/crates/sandlock-core/tests/integration/test_network.rs b/crates/sandlock-core/tests/integration/test_network.rs index 469b07e..b98e654 100644 --- a/crates/sandlock-core/tests/integration/test_network.rs +++ b/crates/sandlock-core/tests/integration/test_network.rs @@ -13,6 +13,284 @@ fn base_policy() -> sandlock_core::PolicyBuilder { .fs_write("/tmp") } +// ============================================================ +// Phase 2: per-protocol destination scoping +// ============================================================ + +/// `udp://127.0.0.1:53` rule scopes UDP sends to 127.0.0.1:53. A +/// `sendto(1.1.1.1, 53)` on the same UDP socket must be denied because +/// the rule's host filters destinations, not just protocol creation. 
+#[tokio::test] +async fn test_udp_rule_scopes_destination_by_host() { + let out_allowed = temp_file("udp-allowed"); + let out_blocked = temp_file("udp-blocked"); + + let policy = base_policy() + .net_allow("udp://127.0.0.1:53") + .build() + .unwrap(); + + // Two sendto calls on the same socket: one to the allowed host, one + // to a different host on the same port. The on-behalf handler must + // accept the first and deny the second with ECONNREFUSED (errno 111). + let script = format!(concat!( + "import socket\n", + "s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)\n", + "try:\n", + " s.sendto(b'x', ('127.0.0.1', 53))\n", + " open('{ok}', 'w').write('ALLOWED')\n", + "except OSError as e:\n", + " open('{ok}', 'w').write(f'ERR:{{e.errno}}')\n", + "try:\n", + " s.sendto(b'x', ('1.1.1.1', 53))\n", + " open('{deny}', 'w').write('ALLOWED')\n", + "except OSError as e:\n", + " open('{deny}', 'w').write(f'ERR:{{e.errno}}')\n", + "s.close()\n", + ), ok = out_allowed.display(), deny = out_blocked.display()); + + let result = Sandbox::run_interactive(&policy, Some("test"), &["python3", "-c", &script]) + .await.unwrap(); + assert!(result.success(), "exit={:?}", result.code()); + + let allowed = std::fs::read_to_string(&out_allowed).unwrap_or_default(); + let blocked = std::fs::read_to_string(&out_blocked).unwrap_or_default(); + let _ = std::fs::remove_file(&out_allowed); + let _ = std::fs::remove_file(&out_blocked); + + assert_eq!(allowed, "ALLOWED", "sendto to allowed host should succeed"); + assert_eq!(blocked, "ERR:111", "sendto to disallowed host should ECONNREFUSED"); +} + +/// `udp://*:*` is the "any UDP destination" gate — it should not regress +/// after Phase 2's per-protocol routing. Both sendtos succeed. 
+#[tokio::test] +async fn test_udp_wildcard_allows_any_destination() { + let out_a = temp_file("udp-wild-a"); + let out_b = temp_file("udp-wild-b"); + + let policy = base_policy().net_allow("udp://*:*").build().unwrap(); + + let script = format!(concat!( + "import socket\n", + "s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)\n", + "try:\n", + " s.sendto(b'x', ('127.0.0.1', 53))\n", + " open('{a}', 'w').write('ALLOWED')\n", + "except OSError as e:\n", + " open('{a}', 'w').write(f'ERR:{{e.errno}}')\n", + "try:\n", + " s.sendto(b'x', ('1.1.1.1', 53))\n", + " open('{b}', 'w').write('ALLOWED')\n", + "except OSError as e:\n", + " open('{b}', 'w').write(f'ERR:{{e.errno}}')\n", + "s.close()\n", + ), a = out_a.display(), b = out_b.display()); + + let result = Sandbox::run_interactive(&policy, Some("test"), &["python3", "-c", &script]) + .await.unwrap(); + assert!(result.success(), "exit={:?}", result.code()); + + let a = std::fs::read_to_string(&out_a).unwrap_or_default(); + let b = std::fs::read_to_string(&out_b).unwrap_or_default(); + let _ = std::fs::remove_file(&out_a); + let _ = std::fs::remove_file(&out_b); + + assert_eq!(a, "ALLOWED"); + assert_eq!(b, "ALLOWED"); +} + +/// A UDP rule must not authorize TCP destinations. Phase 1 closed off +/// UDP socket creation under a TCP-only policy; Phase 2 must also stop +/// UDP rules from leaking into the TCP destination check. Here we have +/// a UDP-only rule for 1.1.1.1:53 and try a TCP connect to that +/// (host, port) — which should still be denied because the TCP policy +/// has no rules. 
+#[tokio::test] +async fn test_udp_rule_does_not_authorize_tcp() { + let out = temp_file("udp-no-leak-tcp"); + + let policy = base_policy().net_allow("udp://1.1.1.1:53").build().unwrap(); + + let script = format!(concat!( + "import socket\n", + "s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n", + "s.settimeout(2)\n", + "try:\n", + " s.connect(('1.1.1.1', 53))\n", + " open('{out}', 'w').write('ALLOWED')\n", + "except (OSError, socket.timeout) as e:\n", + " errno = getattr(e, 'errno', 0)\n", + " open('{out}', 'w').write(f'BLOCKED:{{errno}}')\n", + "s.close()\n", + ), out = out.display()); + + let result = Sandbox::run_interactive(&policy, Some("test"), &["python3", "-c", &script]) + .await.unwrap(); + assert!(result.success(), "exit={:?}", result.code()); + let content = std::fs::read_to_string(&out).unwrap_or_default(); + let _ = std::fs::remove_file(&out); + assert!( + content.starts_with("BLOCKED:"), + "TCP connect must not piggyback on a UDP rule, got: {}", content + ); +} + +/// `sendmmsg` is the most common UDP escape hatch — agents that want to +/// bypass per-message destination filtering can batch sends with it. +/// This test calls `libc.sendmmsg` directly via ctypes (Python's +/// `socket` module doesn't expose it) with two messages: the first to +/// an allowed host, the second to a disallowed one. The on-behalf +/// handler must let the first through and stop at the second, returning +/// 1 to match the kernel's "messages successfully transmitted" semantics +/// on partial failure. 
+#[tokio::test] +async fn test_sendmmsg_partial_failure_on_blocked_destination() { + let out = temp_file("sendmmsg-partial"); + + let policy = base_policy() + .net_allow("udp://127.0.0.1:53") + .build() + .unwrap(); + + let script = format!(concat!( + "import ctypes, socket, struct\n", + "libc = ctypes.CDLL('libc.so.6', use_errno=True)\n", + "libc.sendmmsg.restype = ctypes.c_int\n", + "\n", + "class iovec(ctypes.Structure):\n", + " _fields_ = [('iov_base', ctypes.c_void_p), ('iov_len', ctypes.c_size_t)]\n", + "\n", + "class msghdr(ctypes.Structure):\n", + " _fields_ = [\n", + " ('msg_name', ctypes.c_void_p),\n", + " ('msg_namelen', ctypes.c_uint),\n", + " ('_p1', ctypes.c_uint),\n", + " ('msg_iov', ctypes.c_void_p),\n", + " ('msg_iovlen', ctypes.c_size_t),\n", + " ('msg_control', ctypes.c_void_p),\n", + " ('msg_controllen', ctypes.c_size_t),\n", + " ('msg_flags', ctypes.c_int),\n", + " ('_p2', ctypes.c_uint),\n", + " ]\n", + "\n", + "class mmsghdr(ctypes.Structure):\n", + " _fields_ = [('msg_hdr', msghdr), ('msg_len', ctypes.c_uint), ('_p', ctypes.c_uint)]\n", + "\n", + "def sai(ip, port):\n", + " return struct.pack('=HH4s8x', socket.AF_INET, socket.htons(port), socket.inet_aton(ip))\n", + "\n", + "s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)\n", + "\n", + "addr_ok = ctypes.create_string_buffer(sai('127.0.0.1', 53))\n", + "addr_blk = ctypes.create_string_buffer(sai('1.1.1.1', 53))\n", + "data = ctypes.create_string_buffer(b'x')\n", + "\n", + "iovs = (iovec * 2)()\n", + "iovs[0].iov_base = ctypes.cast(data, ctypes.c_void_p).value\n", + "iovs[0].iov_len = 1\n", + "iovs[1].iov_base = ctypes.cast(data, ctypes.c_void_p).value\n", + "iovs[1].iov_len = 1\n", + "\n", + "vec = (mmsghdr * 2)()\n", + "vec[0].msg_hdr.msg_name = ctypes.cast(addr_ok, ctypes.c_void_p).value\n", + "vec[0].msg_hdr.msg_namelen = 16\n", + "vec[0].msg_hdr.msg_iov = ctypes.cast(ctypes.pointer(iovs[0]), ctypes.c_void_p).value\n", + "vec[0].msg_hdr.msg_iovlen = 1\n", + "vec[1].msg_hdr.msg_name = 
ctypes.cast(addr_blk, ctypes.c_void_p).value\n", + "vec[1].msg_hdr.msg_namelen = 16\n", + "vec[1].msg_hdr.msg_iov = ctypes.cast(ctypes.pointer(iovs[1]), ctypes.c_void_p).value\n", + "vec[1].msg_hdr.msg_iovlen = 1\n", + "\n", + "ret = libc.sendmmsg(s.fileno(), vec, 2, 0)\n", + "errno = ctypes.get_errno()\n", + "msg0_len = vec[0].msg_len\n", + "open('{out}', 'w').write(f'ret={{ret}} errno={{errno}} msg0_len={{msg0_len}}')\n", + "s.close()\n", + ), out = out.display()); + + let result = Sandbox::run_interactive(&policy, Some("test"), &["python3", "-c", &script]) + .await.unwrap(); + assert!(result.success(), "exit={:?}", result.code()); + + let content = std::fs::read_to_string(&out).unwrap_or_default(); + let _ = std::fs::remove_file(&out); + + // ret=1 — first message sent, second blocked. msg0_len=1 — one byte + // delivered for the first message. errno is whatever the kernel left + // it as (sendmmsg sets errno only on full failure ret<0). + assert!( + content.starts_with("ret=1 ") && content.contains("msg0_len=1"), + "expected partial success ret=1 msg0_len=1, got: {}", content + ); +} + +/// Defense-in-depth check that `sendmmsg` doesn't silently bypass the +/// per-protocol routing. With a UDP-only rule, a TCP socket using +/// `sendmsg`/`sendto` already fails (Phase 2 covered that). We verify +/// the same property holds when the agent uses `sendmmsg` to a UDP +/// destination outside the allowlist with vlen=1: ret should be -1 +/// because no entries succeeded. 
+#[tokio::test] +async fn test_sendmmsg_single_blocked_returns_econnrefused() { + let out = temp_file("sendmmsg-blocked"); + + let policy = base_policy() + .net_allow("udp://127.0.0.1:53") + .build() + .unwrap(); + + let script = format!(concat!( + "import ctypes, socket, struct\n", + "libc = ctypes.CDLL('libc.so.6', use_errno=True)\n", + "libc.sendmmsg.restype = ctypes.c_int\n", + "\n", + "class iovec(ctypes.Structure):\n", + " _fields_ = [('iov_base', ctypes.c_void_p), ('iov_len', ctypes.c_size_t)]\n", + "\n", + "class msghdr(ctypes.Structure):\n", + " _fields_ = [\n", + " ('msg_name', ctypes.c_void_p), ('msg_namelen', ctypes.c_uint), ('_p1', ctypes.c_uint),\n", + " ('msg_iov', ctypes.c_void_p), ('msg_iovlen', ctypes.c_size_t),\n", + " ('msg_control', ctypes.c_void_p), ('msg_controllen', ctypes.c_size_t),\n", + " ('msg_flags', ctypes.c_int), ('_p2', ctypes.c_uint),\n", + " ]\n", + "\n", + "class mmsghdr(ctypes.Structure):\n", + " _fields_ = [('msg_hdr', msghdr), ('msg_len', ctypes.c_uint), ('_p', ctypes.c_uint)]\n", + "\n", + "s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)\n", + "addr = ctypes.create_string_buffer(\n", + " struct.pack('=HH4s8x', socket.AF_INET, socket.htons(53), socket.inet_aton('1.1.1.1'))\n", + ")\n", + "data = ctypes.create_string_buffer(b'x')\n", + "iov = iovec()\n", + "iov.iov_base = ctypes.cast(data, ctypes.c_void_p).value\n", + "iov.iov_len = 1\n", + "vec = (mmsghdr * 1)()\n", + "vec[0].msg_hdr.msg_name = ctypes.cast(addr, ctypes.c_void_p).value\n", + "vec[0].msg_hdr.msg_namelen = 16\n", + "vec[0].msg_hdr.msg_iov = ctypes.cast(ctypes.pointer(iov), ctypes.c_void_p).value\n", + "vec[0].msg_hdr.msg_iovlen = 1\n", + "ret = libc.sendmmsg(s.fileno(), vec, 1, 0)\n", + "errno = ctypes.get_errno()\n", + "open('{out}', 'w').write(f'ret={{ret}} errno={{errno}}')\n", + "s.close()\n", + ), out = out.display()); + + let result = Sandbox::run_interactive(&policy, Some("test"), &["python3", "-c", &script]) + .await.unwrap(); + 
assert!(result.success(), "exit={:?}", result.code()); + + let content = std::fs::read_to_string(&out).unwrap_or_default(); + let _ = std::fs::remove_file(&out); + + assert_eq!( + content, "ret=-1 errno=111", + "blocked sendmmsg should return -1 with ECONNREFUSED, got: {}", content + ); +} + /// Test that --net-allow blocks connections to non-allowed hosts. #[tokio::test] async fn test_net_allow_blocks_disallowed_host() { diff --git a/crates/sandlock-core/tests/integration/test_policy.rs b/crates/sandlock-core/tests/integration/test_policy.rs index 265cb5e..9ed8875 100644 --- a/crates/sandlock-core/tests/integration/test_policy.rs +++ b/crates/sandlock-core/tests/integration/test_policy.rs @@ -5,8 +5,10 @@ fn test_default_policy() { let policy = Policy::builder().build().unwrap(); assert_eq!(policy.max_processes, 64); assert!(policy.block_syscalls.is_empty()); - assert!(!policy.allow_udp, "UDP is denied by default"); - assert!(!policy.allow_icmp, "ICMP raw is denied by default"); + // UDP, ICMP, and raw ICMP are denied by default — there are no rules + // for those protocols in `net_allow`, which is what the BPF filter + // gates on now (no separate booleans). + assert!(policy.net_allow.is_empty()); assert!(policy.uid.is_none()); assert!(policy.fs_writable.is_empty()); assert!(policy.fs_readable.is_empty()); @@ -128,15 +130,20 @@ fn test_env_var() { } #[test] -fn test_allow_udp_default_false() { +fn test_udp_default_denied() { + // Opt in via `.net_allow("udp://*:*")` (or a scoped UDP rule). let p = Policy::builder().build().unwrap(); - assert!(!p.allow_udp, "UDP is denied by default; opt in via .allow_udp(true)"); + use sandlock_core::policy::Protocol; + assert!(!p.net_allow.iter().any(|r| r.protocol == Protocol::Udp)); } #[test] -fn test_allow_icmp_default_false() { +fn test_icmp_default_denied() { + // Opt in via `.net_allow("icmp://*")` (kernel ping socket). + // Raw ICMP is unconditionally denied — sandlock does not expose it. 
let p = Policy::builder().build().unwrap(); - assert!(!p.allow_icmp, "ICMP raw is denied by default; opt in via .allow_icmp(true)"); + use sandlock_core::policy::Protocol; + assert!(!p.net_allow.iter().any(|r| r.protocol == Protocol::Icmp)); } #[test] diff --git a/crates/sandlock-core/tests/integration/test_seccomp_enforce.rs b/crates/sandlock-core/tests/integration/test_seccomp_enforce.rs index 07116e2..2dd68b5 100644 --- a/crates/sandlock-core/tests/integration/test_seccomp_enforce.rs +++ b/crates/sandlock-core/tests/integration/test_seccomp_enforce.rs @@ -143,12 +143,14 @@ async fn test_raw_socket_blocked() { } // ------------------------------------------------------------------ -// 4b. allow_icmp(true) permits AF_INET + SOCK_RAW + IPPROTO_ICMP -// while other raw socket types remain denied. +// 4b. Raw ICMP is unconditionally denied — sandlock does not expose +// SOCK_RAW + IPPROTO_ICMP, even with policy concessions. Workloads +// that need ping should use the SOCK_DGRAM kernel ping socket via +// an `icmp://...` rule (test 4d below). // ------------------------------------------------------------------ #[tokio::test] -async fn test_allow_icmp_permits_icmp_raw() { - let out = temp_out("allow-icmp-permits-icmp"); +async fn test_raw_icmp_always_denied() { + let out = temp_out("raw-icmp-denied"); let script = format!(concat!( "import socket\n", "try:\n", @@ -162,8 +164,10 @@ async fn test_allow_icmp_permits_icmp_raw() { "open('{out}', 'w').write(result)\n", ), out = out.display()); + // Even with an `icmp://*` rule (which permits the dgram path), raw + // ICMP must still be blocked: SOCK_RAW is always in the deny list. 
let policy = base_policy() - .allow_icmp(true) + .net_allow("icmp://*") .build() .unwrap(); let result = Sandbox::run_interactive(&policy, Some("test"), &["python3", "-c", &script]) @@ -172,30 +176,26 @@ async fn test_allow_icmp_permits_icmp_raw() { let contents = std::fs::read_to_string(&out).unwrap_or_default(); let _ = std::fs::remove_file(&out); - // seccomp must permit it; the kernel may still deny without CAP_NET_RAW - // (errno 1 = EPERM). Accept ALLOWED (root) or BLOCKED/ERROR:1 (non-root - // capability denial). - let trimmed = contents.trim(); - assert!( - trimmed == "ALLOWED" || trimmed == "BLOCKED" || trimmed == "ERROR:1", - "ICMP raw socket should be permitted by seccomp under allow_icmp; got: {}", - trimmed + assert_ne!( + contents.trim(), "ALLOWED", + "raw ICMP must be denied unconditionally; got: {}", + contents.trim() ); assert!(result.success()); } // ------------------------------------------------------------------ -// 4c. allow_icmp(true) still blocks SOCK_RAW with non-ICMP protocol -// (verifies the BPF arg2 protocol check) +// 4d. The kernel ping socket (SOCK_DGRAM + IPPROTO_ICMP) is permitted +// when an `icmp://*` rule is present — the modern unprivileged +// ping path, distinct from raw ICMP. // ------------------------------------------------------------------ #[tokio::test] -async fn test_allow_icmp_still_blocks_other_raw() { - let out = temp_out("allow-icmp-blocks-tcp-raw"); - // AF_INET + SOCK_RAW + IPPROTO_TCP must still be denied by seccomp. 
+async fn test_icmp_dgram_allowed_with_icmp_rule() { + let out = temp_out("icmp-dgram-allowed"); let script = format!(concat!( "import socket\n", "try:\n", - " s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_TCP)\n", + " s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_ICMP)\n", " s.close()\n", " result = 'ALLOWED'\n", "except PermissionError:\n", @@ -206,7 +206,7 @@ async fn test_allow_icmp_still_blocks_other_raw() { ), out = out.display()); let policy = base_policy() - .allow_icmp(true) + .net_allow("icmp://*") .build() .unwrap(); let result = Sandbox::run_interactive(&policy, Some("test"), &["python3", "-c", &script]) @@ -215,19 +215,22 @@ async fn test_allow_icmp_still_blocks_other_raw() { let contents = std::fs::read_to_string(&out).unwrap_or_default(); let _ = std::fs::remove_file(&out); + // Seccomp must allow the syscall. The kernel may still deny if the + // sandbox GID is outside `net.ipv4.ping_group_range` (errno 1 EPERM + // or errno 13 EACCES). Accepting ALLOWED / BLOCKED / ERROR:1 / ERROR:13 keeps + // the test green across hosts. let trimmed = contents.trim(); - // Must be denied — either via seccomp (BLOCKED) or the kernel (EPERM). - // Critically must NOT be ALLOWED. - assert_ne!( - trimmed, "ALLOWED", - "non-ICMP raw socket must remain denied under allow_icmp; got: {}", + assert!( + trimmed == "ALLOWED" || trimmed == "BLOCKED" + || trimmed == "ERROR:1" || trimmed == "ERROR:13", + "kernel ping socket should be permitted by seccomp under icmp://*; got: {}", trimmed ); assert!(result.success()); } // ------------------------------------------------------------------ -// 5. UDP allowed when allow_udp(true) +// 5. UDP allowed when a `udp://*:*` rule is present.
// ------------------------------------------------------------------ #[tokio::test] async fn test_udp_allowed_when_opted_in() { @@ -246,7 +249,7 @@ async fn test_udp_allowed_when_opted_in() { ), out = out.display()); let policy = base_policy() - .allow_udp(true) + .net_allow("udp://*:*") .build() .unwrap(); let result = Sandbox::run_interactive(&policy, Some("test"), &["python3", "-c", &script]) @@ -258,7 +261,7 @@ async fn test_udp_allowed_when_opted_in() { assert_eq!( contents.trim(), "ALLOWED", - "UDP socket should be allowed with allow_udp(true), got: {}", + "UDP socket should be allowed with udp://*:*, got: {}", contents.trim() ); assert!(result.success()); diff --git a/crates/sandlock-ffi/src/lib.rs b/crates/sandlock-ffi/src/lib.rs index 3cf42a9..10e5f0f 100644 --- a/crates/sandlock-ffi/src/lib.rs +++ b/crates/sandlock-ffi/src/lib.rs @@ -344,33 +344,11 @@ pub unsafe extern "C" fn sandlock_policy_builder_port_remap( Box::into_raw(Box::new(builder.port_remap(v))) } -/// Permit UDP socket creation. UDP is denied by default; outbound -/// destinations remain gated by `net_allow` if any rules are set. -/// -/// # Safety -/// `b` must be a valid builder pointer. -#[no_mangle] -pub unsafe extern "C" fn sandlock_policy_builder_allow_udp( - b: *mut PolicyBuilder, v: bool, -) -> *mut PolicyBuilder { - if b.is_null() { return b; } - let builder = *Box::from_raw(b); - Box::into_raw(Box::new(builder.allow_udp(v))) -} - -/// Permit `socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)` and the IPv6 -/// equivalent only. All other raw socket types remain denied. -/// -/// # Safety -/// `b` must be a valid builder pointer. 
-#[no_mangle] -pub unsafe extern "C" fn sandlock_policy_builder_allow_icmp( - b: *mut PolicyBuilder, v: bool, -) -> *mut PolicyBuilder { - if b.is_null() { return b; } - let builder = *Box::from_raw(b); - Box::into_raw(Box::new(builder.allow_icmp(v))) -} +// Protocol gating (UDP, kernel ping socket) is expressed via +// `net_allow` rule schemes (`udp://`, `icmp://`) rather than separate +// FFI booleans. There is no `allow_udp` / `allow_icmp` setter. +// Sandlock does not expose raw ICMP — SOCK_RAW is unconditionally +// denied at the seccomp layer. /// # Safety /// `b` must be a valid builder pointer. diff --git a/python/README.md b/python/README.md index 31e09b0..a05157e 100644 --- a/python/README.md +++ b/python/README.md @@ -62,11 +62,9 @@ Unset fields mean "no restriction" unless noted otherwise. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| -| `net_allow` | `list[str]` | `[]` | Outbound endpoint rules (TCP; UDP too when `allow_udp=True`). Each entry is `"host:port[,port,...]"`, `":port"`, or `"*:port"`. Empty = deny all. | +| `net_allow` | `list[str]` | `[]` | Outbound endpoint rules. Bare `host:port` is TCP; protocol prefixes opt others in: `tcp://host:port`, `udp://host:port` (or `udp://*:*` for any UDP), `icmp://host` (or `icmp://*` for any ICMP echo via the kernel ping socket — gated by `net.ipv4.ping_group_range` on the host). Empty = deny all. Raw ICMP is not exposed. | | `net_bind` | `list[int \| str]` | `[]` | TCP ports the sandbox may bind (empty = deny all) | | `port_remap` | `bool` | `False` | Transparent TCP port virtualization | -| `allow_udp` | `bool` | `False` | Permit UDP sockets (outbound destinations still gated by `net_allow`) | -| `allow_icmp` | `bool` | `False` | Permit `socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)` and IPv6 equivalent only — useful for `ping`. Other raw socket types stay denied. 
| #### HTTP ACL @@ -508,7 +506,7 @@ permissions explicitly: | Capability | Example | Description | |------------|---------|-------------| | `fs_writable` | `["/tmp/agent"]` | Paths the tool can write to | -| `net_allow` | `["api.example.com:443"]` | Outbound endpoints (`host:port`, `:port`, or `*:port`) — TCP, plus UDP when `allow_udp=True` | +| `net_allow` | `["api.example.com:443", "udp://1.1.1.1:53"]` | Outbound endpoints. Bare `host:port` is TCP; `udp://...` / `icmp://...` schemes opt UDP / ICMP echo in. | | `env` | `{"KEY": "val"}` | Environment variables to pass | | `max_memory` | `"256M"` | Memory limit | diff --git a/python/src/sandlock/_profile.py b/python/src/sandlock/_profile.py index 3170196..dad724d 100644 --- a/python/src/sandlock/_profile.py +++ b/python/src/sandlock/_profile.py @@ -32,8 +32,6 @@ # Network "net_allow": list, "net_bind": list, - "allow_udp": bool, - "allow_icmp": bool, # Resources "max_memory": str, "max_processes": int, diff --git a/python/src/sandlock/_sdk.py b/python/src/sandlock/_sdk.py index 97564bf..8b2b139 100644 --- a/python/src/sandlock/_sdk.py +++ b/python/src/sandlock/_sdk.py @@ -91,8 +91,6 @@ def _builder_fn(name, *extra_args): _b_net_allow = _builder_fn("sandlock_policy_builder_net_allow", ctypes.c_char_p) _b_net_bind_port = _builder_fn("sandlock_policy_builder_net_bind_port", ctypes.c_uint16) _b_port_remap = _builder_fn("sandlock_policy_builder_port_remap", ctypes.c_bool) -_b_allow_udp = _builder_fn("sandlock_policy_builder_allow_udp", ctypes.c_bool) -_b_allow_icmp = _builder_fn("sandlock_policy_builder_allow_icmp", ctypes.c_bool) _b_http_allow = _builder_fn("sandlock_policy_builder_http_allow", ctypes.c_char_p) _b_http_deny = _builder_fn("sandlock_policy_builder_http_deny", ctypes.c_char_p) _b_http_port = _builder_fn("sandlock_policy_builder_http_port", ctypes.c_uint16) @@ -746,7 +744,7 @@ def __del__(self): "max_memory", "max_disk", "max_processes", "max_cpu", "num_cpus", "cpu_cores", "gpu_devices", "net_allow", 
"net_bind", - "port_remap", "allow_udp", "allow_icmp", + "port_remap", "http_allow", "http_deny", "http_ports", "https_ca", "https_key", "uid", "random_seed", "time_start", "clean_env", "env", @@ -828,10 +826,10 @@ def _build_from_policy(policy: PolicyDataclass): arr = (ctypes.c_uint32 * len(policy.cpu_cores))(*policy.cpu_cores) b = _b_cpu_cores(b, arr, len(policy.cpu_cores)) - # net_allow: list of endpoint specs (`host:port[,port,...]`, - # `:port`, `*:port`). Empty = deny all outbound. Applies to TCP - # and to UDP (when allow_udp is set). Validation of each spec - # happens in the native build(). + # net_allow: list of endpoint specs. Bare `host:port` means TCP; + # `tcp://`/`udp://`/`icmp://` schemes opt other protocols in. + # Empty = deny all outbound. Validation of each spec happens in + # the native build(). for spec in (policy.net_allow or []): b = _b_net_allow(b, _encode(str(spec))) for port in parse_ports(policy.net_bind) if policy.net_bind else []: @@ -850,10 +848,6 @@ def _build_from_policy(policy: PolicyDataclass): if policy.port_remap: b = _b_port_remap(b, True) - if policy.allow_udp: - b = _b_allow_udp(b, True) - if policy.allow_icmp: - b = _b_allow_icmp(b, True) if policy.uid is not None: b = _b_uid(b, policy.uid) diff --git a/python/src/sandlock/policy.py b/python/src/sandlock/policy.py index 8edead2..c32480e 100644 --- a/python/src/sandlock/policy.py +++ b/python/src/sandlock/policy.py @@ -137,21 +137,29 @@ class Policy: block_syscalls: Sequence[str] = field(default_factory=list) """Additional syscall names to block on top of Sandlock's default blocklist.""" - # Network — endpoint allowlist (IP × port via seccomp on-behalf path) + # Network — endpoint allowlist (protocol × IP × port via seccomp on-behalf path) net_allow: Sequence[str] = field(default_factory=list) - """Outbound endpoint rules. Applies to TCP and to UDP (when - :attr:`allow_udp` is set). 
Each entry is a string of the form: - - * ``"host:port"`` — restrict to one host on one port (e.g. ``"api.openai.com:443"``) - * ``"host:port,port,..."`` — multiple ports for one host (e.g. ``"github.com:22,443"``) - * ``":port"`` or ``"*:port"`` — any IP on this port (e.g. ``":53"`` for DNS) - + """Outbound endpoint rules. Each entry is a string. The bare form is + TCP; other protocols use a scheme prefix: + + * ``"host:port"`` — TCP to one host on one port (e.g. ``"api.openai.com:443"``) + * ``"host:port,port,..."`` — TCP, multiple ports (e.g. ``"github.com:22,443"``) + * ``":port"`` / ``"*:port"`` — TCP on any IP (e.g. ``":53"``) + * ``"tcp://host:port"`` — explicit TCP (same suffix grammar as bare form) + * ``"udp://host:port"`` — UDP to a host + * ``"udp://*:*"`` — any UDP (matches the previous ``allow_udp=True`` behavior) + * ``"icmp://host"`` — kernel ping socket (SOCK_DGRAM + IPPROTO_ICMP) to a host + * ``"icmp://*"`` — any ICMP echo destination + + Sandlock does not expose raw ICMP (SOCK_RAW). Workloads that need + ping should rely on the host's ``net.ipv4.ping_group_range`` and + use the dgram path above. + + Protocol gating falls out of rule presence: with no UDP/ICMP rules, + UDP and ICMP socket creation are denied at the seccomp layer. Hostnames are resolved at sandbox-creation time and pinned via a - synthetic ``/etc/hosts``. Empty = deny all outbound (Landlock - rejects TCP on the direct path; no on-behalf path is enabled, so - UDP `sendto`/`sendmsg` are also untrapped — but UDP socket creation - itself is denied unless :attr:`allow_udp` is set). HTTP rules with - concrete hosts auto-add a matching entry on :attr:`http_ports`. + synthetic ``/etc/hosts``. Empty = deny all outbound. HTTP rules with + concrete hosts auto-add a matching TCP entry on :attr:`http_ports`. 
See README "Network Model" for details.""" no_coredump: bool = False @@ -162,24 +170,8 @@ class Policy: # Network — bind allowlist (Landlock ABI v4+, TCP only) net_bind: Sequence[int | str] = field(default_factory=list) """TCP ports the sandbox may bind. Empty = deny all. Each entry is - a port number or a ``"lo-hi"`` range string. UDP bind is gated by - :attr:`allow_udp` rather than this list — Landlock's port hooks - are TCP-only.""" - - # Socket type restrictions (seccomp-enforced). - # Raw sockets and UDP are denied by default; opt in via the flags below. - allow_udp: bool = False - """Permit UDP sockets (SOCK_DGRAM on AF_INET/AF_INET6). UDP is denied - by default. When ``True``, outbound UDP destinations are still gated - by :attr:`net_allow` — same endpoint allowlist used for TCP. AF_UNIX - datagrams are unaffected. CLI: ``--allow-udp``. Enforced via seccomp BPF.""" - - allow_icmp: bool = False - """Narrow ICMP raw socket carve-out: permit - ``socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)`` and the IPv6 equivalent - only. All other raw socket types remain denied. Useful for ``ping`` - without granting full packet-crafting capability. - CLI: ``--allow-icmp``. Enforced via seccomp BPF.""" + a port number or a ``"lo-hi"`` range string. Landlock's port hooks + are TCP-only — UDP bind is not separately gated.""" # HTTP ACL http_allow: Sequence[str] = field(default_factory=list) diff --git a/python/tests/test_mcp.py b/python/tests/test_mcp.py index 0c05b91..0a1c583 100644 --- a/python/tests/test_mcp.py +++ b/python/tests/test_mcp.py @@ -16,8 +16,6 @@ def test_no_capabilities(self): assert "/tmp/ws" in policy.fs_readable assert policy.net_allow == [] assert policy.net_bind == [] - assert policy.allow_udp is False - assert policy.allow_icmp is False def test_empty_capabilities(self): policy = policy_for_tool(workspace="/tmp/ws", capabilities={})