fix(setsockopt): increase maxOptLen from 8KB to 32KB #12698
Merged
copybara-service[bot] merged 2 commits into master on Mar 12, 2026
The hard-coded maxOptLen of 8192 bytes silently rejected setsockopt(IPT_SO_SET_REPLACE) payloads with EINVAL when the iptables ruleset exceeded 8KB. This broke real-world workloads such as Istio 1.28+ service mesh, whose istio-init container generates nat table rulesets of ~13KB.

This raises the limit to 32KB, which provides ~2.5x headroom over the largest known real-world payload.

Linux itself limits this to INT_MAX (net/socket.c: do_sock_setsockopt); other container runtimes (runc, Kata Containers) inherit that limit by delegating to the host kernel. We use a conservative 32KB here to balance compatibility with resource protection in the sentry.

Fixes #12685

Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
…place-12685 PiperOrigin-RevId: 882305944
fix(setsockopt): increase maxOptLen from 8KB to 32KB
Summary
The hard-coded `maxOptLen` of `1024 * 8` (8192 bytes) in `pkg/sentry/syscalls/linux/sys_socket.go` silently returns `EINVAL` for any `setsockopt` call whose `optval` exceeds 8KB. This breaks real-world workloads that rely on `setsockopt(IPT_SO_SET_REPLACE)` with large iptables rulesets.

Root cause:
`iptables-restore` uses `setsockopt(SOL_IP, IPT_SO_SET_REPLACE, ...)` to atomically replace an entire iptables table. For service meshes like Istio (v1.28+), the nat table payload commonly exceeds 8KB due to the number of rules generated for port exclusions, owner-match rules, and REDIRECT targets. When the payload exceeds `maxOptLen`, gVisor returns `EINVAL` before the buffer even reaches the netfilter layer, with no log message, causing `iptables-restore` to report cryptic errors like `"can't initialize iptables table 'nat': Table does not exist"`.

Data points:
- `istio_blob` test fixture in gVisor: 5,688 bytes, under the 8KB limit, passes

What Linux does
Linux limits `setsockopt` `optval` to `INT_MAX` via the `int optlen` parameter type and a single `if (optlen < 0) return -EINVAL` check in `do_sock_setsockopt()` (`net/socket.c`). There is no upper-bound check, so `INT_MAX` is the effective ceiling.

What other runtimes do
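| Runtime | `setsockopt` optlen limit | Enforced by |
| --- | --- | --- |
| Linux | `INT_MAX` (2,147,483,647) | `int` type + `optlen < 0` check |
| runc | `INT_MAX` | delegates to the host kernel |
| Kata Containers | `INT_MAX` | delegates to the host kernel |
| gVisor | 8KB (before this change) | `maxOptLen` constant |

Changes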
- `pkg/sentry/syscalls/linux/sys_socket.go`: Changed `maxOptLen` from `1024 * 8` (8KB) to `32 * 1024` (32KB). This provides ~2.5x headroom over the largest known real-world payload (Istio 1.28+ at ~13KB).
- `test/syscalls/linux/iptables.cc`: Added `LargeReplacePayload` regression test that constructs a valid nat table replacement payload exceeding 8KB and verifies `setsockopt(IPT_SO_SET_REPLACE)` succeeds. The test restores the original table afterward to avoid side effects.

Question for maintainers
We chose a conservative 32KB to balance compatibility with resource protection. However, Linux uses `INT_MAX` and the existing `maxControlLen` in the same file is already 10MB. The sandbox's cgroup memory limits are arguably the right place to guard against resource exhaustion, not a per-syscall buffer cap.

Should this be raised to `math.MaxInt32` to match Linux's `INT_MAX` behavior exactly? This would eliminate any future risk of hitting this limit with other large `setsockopt` payloads (e.g. `SO_ATTACH_FILTER`, complex kube-proxy rulesets, etc.).

Reproducer
On a gVisor-enabled Kubernetes node with `--net-raw` enabled:

Risk assessment
- The `maxOptLen` constant is a gVisor-internal safety cap, not a Linux compatibility requirement. 32KB is still tiny compared to what a sandboxed process can already allocate via `mmap`, and far smaller than the existing `maxControlLen` of 10MB for `msg_control` buffers.
- The `maxControlLen` comment acknowledges this exact pattern: "Note that this limit is smaller than Linux, which allows buffers upto INT_MAX."

Fixes #12685
FUTURE_COPYBARA_INTEGRATE_REVIEW=#12686 from a7i:fix/iptables-restore-ipt-so-set-replace-12685 4b78644