fix(setsockopt): increase maxOptLen from 8KB to 32KB #12698

Merged: copybara-service[bot] merged 2 commits into master from test/cl880918196 on Mar 12, 2026

@copybara-service

fix(setsockopt): increase maxOptLen from 8KB to 32KB

Summary

The hard-coded maxOptLen of 1024 * 8 (8192 bytes) in pkg/sentry/syscalls/linux/sys_socket.go causes gVisor to return EINVAL, without logging anything, for any setsockopt call whose optval exceeds 8KB. This breaks real-world workloads that rely on setsockopt(IPT_SO_SET_REPLACE) with large iptables rulesets.

Root cause: iptables-restore uses setsockopt(SOL_IP, IPT_SO_SET_REPLACE, ...) to atomically replace an entire iptables table. For service meshes like Istio (v1.28+), the nat table payload commonly exceeds 8KB due to the number of rules generated for port exclusions, owner-match rules, and REDIRECT targets. When the payload exceeds maxOptLen, gVisor returns EINVAL before the buffer even reaches the netfilter layer and logs nothing, so iptables-restore reports cryptic errors like "can't initialize iptables table 'nat': Table does not exist".

Data points:

  • Istio 1.24 full ruleset (~30 rules, TCP-only exclusions): ~5-6KB, under the 8KB limit, passes
  • Existing istio_blob test fixture in gVisor: 5,688 bytes, under the 8KB limit, passes
  • Istio 1.28.4 full ruleset (~58+ rules, TCP+UDP exclusions): ~13KB, exceeds the 8KB limit, fails
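
To make the failure mode concrete, here is a minimal sketch of an early length guard applied to two of the data points above. It is an illustration, not the sentry's actual code: checkOptLen, errEINVAL, and dataPoint are hypothetical names, and the 13KB figure is the approximate size quoted above.

```go
package main

import (
	"errors"
	"fmt"
)

// errEINVAL is a hypothetical stand-in for the EINVAL errno.
var errEINVAL = errors.New("EINVAL")

const (
	oldMaxOptLen = 1024 * 8  // limit before this PR
	newMaxOptLen = 32 * 1024 // limit after this PR
)

// checkOptLen mimics a guard that rejects negative or oversized option
// lengths before the buffer reaches any option-specific handler.
func checkOptLen(optLen, limit int) error {
	if optLen < 0 || optLen > limit {
		return errEINVAL
	}
	return nil
}

// dataPoint pairs a label with a payload size in bytes.
type dataPoint struct {
	name string
	size int
}

func main() {
	points := []dataPoint{
		{"istio_blob test fixture", 5688},
		{"Istio 1.28.4 ruleset (approx.)", 13 * 1024},
	}
	for _, p := range points {
		fmt.Printf("%-32s old=%v new=%v\n", p.name,
			checkOptLen(p.size, oldMaxOptLen),
			checkOptLen(p.size, newMaxOptLen))
	}
}
```

Under these assumptions, only the ~13KB payload is rejected by the old limit, and both payloads pass the new one.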

What Linux does

Linux limits the setsockopt optval length to INT_MAX via the int optlen parameter type and a single if (optlen < 0) return -EINVAL check in do_sock_setsockopt() (net/socket.c). There is no upper-bound check, so INT_MAX is the effective ceiling.

What other runtimes do

| Runtime | setsockopt optlen limit | Mechanism |
|---|---|---|
| Linux kernel | INT_MAX (2,147,483,647) | C int type + optlen < 0 check |
| runc | INT_MAX | Delegates to host kernel |
| Kata Containers | INT_MAX | Runs real Linux kernel in VM |
| gVisor (before) | 8,192 | Arbitrary maxOptLen constant |
| gVisor (this PR) | 32,768 | Conservative increase |

Changes

  1. pkg/sentry/syscalls/linux/sys_socket.go: Changed maxOptLen from 1024 * 8 (8KB) to 32 * 1024 (32KB). This provides ~2.5x headroom over the largest known real-world payload (Istio 1.28+ at ~13KB).

  2. test/syscalls/linux/iptables.cc: Added LargeReplacePayload regression test that constructs a valid nat table replacement payload exceeding 8KB and verifies setsockopt(IPT_SO_SET_REPLACE) succeeds. The test restores the original table afterward to avoid side effects.
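
The headroom figure in item 1 can be checked with a short calculation. The sketch below assumes the largest known payload is exactly 13 KiB, which is only the approximate size quoted in this PR; headroom is a hypothetical helper.

```go
package main

import "fmt"

// headroom returns how many times the payload fits into the limit.
func headroom(limit, payload int) float64 {
	return float64(limit) / float64(payload)
}

func main() {
	const newMaxOptLen = 32 * 1024
	const istioPayload = 13 * 1024 // approximate size, per the PR description

	fmt.Printf("headroom: %.2fx\n", headroom(newMaxOptLen, istioPayload))
}
```

With these inputs the ratio comes out just under 2.5, consistent with the "~2.5x" stated above.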

Question for maintainers

We chose a conservative 32KB to balance compatibility with resource protection. However, Linux uses INT_MAX and the existing maxControlLen in the same file is already 10MB. The sandbox's cgroup memory limits are arguably the right place to guard against resource exhaustion, not a per-syscall buffer cap.

Should this be raised to math.MaxInt32 to match Linux's INT_MAX behavior exactly? This would eliminate any future risk of hitting this limit with other large setsockopt payloads (e.g. SO_ATTACH_FILTER, complex kube-proxy rulesets, etc.).
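
To put the candidate limits side by side, the following sketch checks a few payload sizes against each one. The constant names and the 64 KiB and 16 MiB payloads are hypothetical, chosen only for illustration, and "10MB" is assumed to mean 10 * 1024 * 1024 bytes.

```go
package main

import (
	"fmt"
	"math"
)

const (
	conservativeLimit = 32 * 1024        // this PR
	controlLenLimit   = 10 * 1024 * 1024 // assumed value of the 10MB maxControlLen
	linuxLimit        = math.MaxInt32    // INT_MAX
)

// fits reports whether a payload length is non-negative and within limit.
func fits(payload, limit int) bool {
	return payload >= 0 && payload <= limit
}

func main() {
	// Only the first payload size comes from the PR; the others are hypothetical.
	payloads := []int{13 * 1024, 64 * 1024, 16 * 1024 * 1024}
	for _, p := range payloads {
		fmt.Printf("%9d bytes: 32KB=%t 10MB=%t INT_MAX=%t\n", p,
			fits(p, conservativeLimit),
			fits(p, controlLenLimit),
			fits(p, linuxLimit))
	}
}
```

A payload larger than 32KB would be rejected under the conservative limit but accepted under either larger candidate, which is the trade-off the question asks about.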

Reproducer

On a gVisor-enabled Kubernetes node with --net-raw enabled:

# Istio 1.28+ istio-init generates ~13KB nat table payloads
# This fails silently with EINVAL on gVisor (before this fix)
kubectl run istio-test --image=istio/proxyv2:1.28.0 \
  --restart=Never --rm -it \
  --overrides='{"spec":{"runtimeClassName":"gvisor"}}' \
  -- pilot-agent istio-iptables

Risk assessment

  • Low risk: The maxOptLen constant is a gVisor-internal safety cap, not a Linux compatibility requirement. 32KB is still tiny compared to what a sandboxed process can already allocate via mmap, and far smaller than the existing maxControlLen of 10MB for msg_control buffers.
  • Note: The existing maxControlLen comment acknowledges this exact pattern: "Note that this limit is smaller than Linux, which allows buffers upto INT_MAX."

Fixes #12685

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12686 from a7i:fix/iptables-restore-ipt-so-set-replace-12685 4b78644

The hard-coded maxOptLen of 8192 bytes silently rejected
setsockopt(IPT_SO_SET_REPLACE) payloads with EINVAL when the
iptables ruleset exceeded 8KB. This broke real-world workloads such
as Istio 1.28+ service mesh, whose istio-init container generates
nat table rulesets of ~13KB.

This raises the limit to 32KB, which provides ~2.5x headroom over
the largest known real-world payload. Linux itself limits this to
INT_MAX (net/socket.c: do_sock_setsockopt); other container runtimes
(runc, Kata Containers) inherit that limit by delegating to the host
kernel. We use a conservative 32KB here to balance compatibility with
resource protection in the sentry.

Fixes #12685

Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
@copybara-service (bot) added the "exported" label (Issue was exported automatically) on Mar 9, 2026
@parth-opensrc self-assigned this on Mar 11, 2026
@copybara-service (bot) force-pushed the test/cl880918196 branch 3 times, most recently from 55c5d45 to 9c850b0, on March 12, 2026 at 01:13
…place-12685

PiperOrigin-RevId: 882305944
@copybara-service (bot) merged commit 293707e into master on Mar 12, 2026
1 check was pending
@copybara-service (bot) deleted the test/cl880918196 branch on March 12, 2026 at 01:48
Labels

exported Issue was exported automatically

Development

Successfully merging this pull request may close these issues.

Istio istio-init fails on gVisor: maxOptLen 8KB limit + missing raw table block iptables-restore

3 participants