fix(setsockopt): increase maxOptLen from 8KB to 32KB #12686
Merged
copybara-service[bot] merged 1 commit into google:master on Mar 12, 2026
The hard-coded maxOptLen of 8192 bytes silently rejected setsockopt(IPT_SO_SET_REPLACE) payloads with EINVAL when the iptables ruleset exceeded 8KB. This broke real-world workloads such as Istio 1.28+ service mesh, whose istio-init container generates nat table rulesets of ~13KB.
This raises the limit to 32KB, which provides ~2.5x headroom over the largest known real-world payload.
Linux itself limits this to INT_MAX (net/socket.c: do_sock_setsockopt); other container runtimes (runc, Kata Containers) inherit that limit by delegating to the host kernel. We use a conservative 32KB here to balance compatibility with resource protection in the sentry.
Fixes google#12685
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
Collaborator comment:
Thanks for the investigation and the fix!
ayushr2 approved these changes on Mar 9, 2026
copybara-service bot pushed a commit that referenced this pull request on Mar 9, 2026
## Summary
The hard-coded `maxOptLen` of `1024 * 8` (8192 bytes) in `pkg/sentry/syscalls/linux/sys_socket.go` silently returns `EINVAL` for any `setsockopt` call whose `optval` exceeds 8KB. This breaks real-world workloads that rely on `setsockopt(IPT_SO_SET_REPLACE)` with large iptables rulesets.
**Root cause:** `iptables-restore` uses `setsockopt(SOL_IP, IPT_SO_SET_REPLACE, ...)` to atomically replace an entire iptables table. For service meshes like Istio (v1.28+), the nat table payload commonly exceeds 8KB due to the number of rules generated for port exclusions, owner-match rules, and REDIRECT targets. When the payload exceeds `maxOptLen`, gVisor returns `EINVAL` before the buffer even reaches the netfilter layer — with no log message — causing `iptables-restore` to report cryptic errors like `"can't initialize iptables table 'nat': Table does not exist"`.
**Data points:**
- Istio 1.24 full ruleset (~30 rules, TCP-only exclusions): ~5-6KB — under 8KB limit — **passes**
- Existing `istio_blob` test fixture in gVisor: 5,688 bytes — under 8KB limit — **passes**
- Istio 1.28.4 full ruleset (~58+ rules, TCP+UDP exclusions): ~13KB — exceeds 8KB limit — **fails**
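The data points above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the PR: the `payload` type and the `fits` and `headroom` helpers are hypothetical, the Istio sizes are the approximate figures quoted above (6KB is used as the upper estimate for Istio 1.24), and treating the limit as inclusive is an assumption.

```go
package main

import "fmt"

// Limits from the PR description, in bytes.
const (
	oldMaxOptLen = 8 * 1024  // 8,192: cap before this change
	newMaxOptLen = 32 * 1024 // 32,768: cap after this change
)

// payload is a hypothetical record type used only in this sketch.
type payload struct {
	name string
	size int // approximate size in bytes
}

// fits reports whether a payload of the given size is accepted under limit.
// The inclusive upper bound is an assumption of this sketch.
func fits(size, limit int) bool {
	return size >= 0 && size <= limit
}

// headroom returns how many times a payload of the given size fits in limit.
func headroom(size, limit int) float64 {
	return float64(limit) / float64(size)
}

func main() {
	payloads := []payload{
		{"Istio 1.24 ruleset (upper estimate)", 6 * 1024},
		{"istio_blob test fixture", 5688},
		{"Istio 1.28.4 ruleset (approximate)", 13 * 1024},
	}
	for _, p := range payloads {
		fmt.Printf("%-38s %6d bytes  old=%-5v new=%-5v headroom=%.2fx\n",
			p.name, p.size,
			fits(p.size, oldMaxOptLen), fits(p.size, newMaxOptLen),
			headroom(p.size, newMaxOptLen))
	}
}
```

With a 13KB payload, the 32KB limit leaves a ratio of roughly 2.46, which is where the "~2.5x headroom" figure comes from.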
## What Linux does
Linux bounds the `setsockopt` `optval` length only by the `int optlen` parameter type and a single `if (optlen < 0) return -EINVAL` check in `do_sock_setsockopt()` (`net/socket.c`). That generic path has no upper-bound check, so `INT_MAX` is the effective ceiling.
## What other runtimes do
| Runtime | `setsockopt` optlen limit | Mechanism |
|---------|--------------------------|-----------|
| **Linux kernel** | `INT_MAX` (2,147,483,647) | C `int` type + `optlen < 0` check |
| **runc** | `INT_MAX` | Delegates to host kernel |
| **Kata Containers** | `INT_MAX` | Runs real Linux kernel in VM |
| **gVisor (before)** | 8,192 | Arbitrary `maxOptLen` constant |
| **gVisor (this PR)** | 32,768 | Conservative increase |
## Changes
1. **`pkg/sentry/syscalls/linux/sys_socket.go`**: Changed `maxOptLen` from `1024 * 8` (8KB) to `32 * 1024` (32KB). This provides ~2.5x headroom over the largest known real-world payload (Istio 1.28+ at ~13KB).
2. **`test/syscalls/linux/iptables.cc`**: Added `LargeReplacePayload` regression test that constructs a valid nat table replacement payload exceeding 8KB and verifies `setsockopt(IPT_SO_SET_REPLACE)` succeeds. The test restores the original table afterward to avoid side effects.
## Question for maintainers
We chose a conservative 32KB to balance compatibility with resource protection. However, Linux uses `INT_MAX` and the existing `maxControlLen` in the same file is already 10MB. The sandbox's cgroup memory limits are arguably the right place to guard against resource exhaustion, not a per-syscall buffer cap.
**Should this be raised to `math.MaxInt32` to match Linux's `INT_MAX` behavior exactly?** This would eliminate any future risk of hitting this limit with other large `setsockopt` payloads (e.g. `SO_ATTACH_FILTER`, complex kube-proxy rulesets, etc.).
## Reproducer
On a gVisor-enabled Kubernetes node with `--net-raw` enabled:
```bash
# Istio 1.28+ istio-init generates ~13KB nat table payloads
# This fails silently with EINVAL on gVisor (before this fix)
kubectl run istio-test --image=istio/proxyv2:1.28.0 \
--restart=Never --rm -it \
--overrides='{"spec":{"runtimeClassName":"gvisor"}}' \
-- pilot-agent istio-iptables
```
## Risk assessment
- **Low risk**: The `maxOptLen` constant is a gVisor-internal safety cap, not a Linux compatibility requirement. 32KB is still tiny compared to what a sandboxed process can already allocate via `mmap`, and far smaller than the existing `maxControlLen` of 10MB for `msg_control` buffers.
- **Note**: The existing `maxControlLen` comment acknowledges this exact pattern: _"Note that this limit is smaller than Linux, which allows buffers upto INT_MAX."_
Fixes #12685
FUTURE_COPYBARA_INTEGRATE_REVIEW=#12686 from a7i:fix/iptables-restore-ipt-so-set-replace-12685 4b78644
PiperOrigin-RevId: 880918196
copybara-service bot pushed a commit that referenced this pull request on Mar 11, 2026
copybara-service bot pushed a commit that referenced this pull request on Mar 12, 2026
copybara-service bot pushed a commit that referenced this pull request on Mar 12, 2026
## Summary
Add the iptables `raw` table and a no-op `CT` (conntrack zone) target to gVisor's netfilter implementation. This enables Istio's `istio-init` container to apply iptables rules when DNS capture is enabled (`ISTIO_META_DNS_CAPTURE=true`).
## Problem
When Istio DNS capture is enabled, `istio-iptables` generates `iptables-restore` input containing both `* nat` and `* raw` table sections. The `raw` table rules use `-j CT --zone N` targets for conntrack zone isolation between Envoy's DNS queries and application DNS queries.
gVisor previously only implemented `nat`, `mangle`, and `filter` tables, causing `iptables-restore` to fail with:
```
iptables-restore: unable to initialize table 'raw'
```
This blocks Istio service mesh adoption on gVisor when DNS capture is required.
## Approach
**Raw table**: Added as a new `TableID` (`RawID`) with `PREROUTING` and `OUTPUT` hooks, matching the Linux kernel's raw table. Wired into `CheckPrerouting()` and `CheckOutput()` as the **first** table checked (before mangle), matching Linux's netfilter hook priority ordering:
- Linux hook order: raw → conntrack → mangle → nat → filter
- gVisor hook order (now): raw → mangle → nat (filter is separate)

**CT target**: Implemented as a **no-op** that accepts packets without modifying conntrack behavior. The target parses the `xt_ct_target_info` (revision 0) struct from userspace and stores the zone value, but does not apply zone-based conntrack isolation. This is intentional:
- gVisor's conntrack implementation does not support zones
- The CT target's purpose in Istio is to prevent conntrack table collisions between Envoy (UID 1337) and application DNS traffic
- DNS redirection still works correctly via the `nat` table's `REDIRECT` rules to port 15053
- The lack of zone tracking may cause rare conntrack 5-tuple collisions under heavy concurrent DNS load, but this is acceptable for gVisor's sandboxed environment

**How Linux and other runtimes handle this**:
- **Linux kernel**: Full `raw` table with `CT --zone` support via `nf_conntrack_zones`
- **runc / kata**: Delegate to the host Linux kernel, so they get full support for free
- **gVisor**: Must implement in userspace netstack; this PR adds the table/target scaffolding with a no-op CT action
## Changes
- `pkg/tcpip/stack/iptables.go`: Add `RawID` to `TableID` enum, `EmptyRawTable()`, default table entries for IPv4/IPv6, wire into `CheckPrerouting()` and `CheckOutput()`
- `pkg/tcpip/stack/iptables_targets.go`: Add `CTTarget` struct with no-op `Action()` returning `RuleAccept`
- `pkg/abi/linux/netfilter.go`: Add `XTCTTargetInfoV0` ABI struct (72 bytes) matching Linux's `xt_ct_target_info`
- `pkg/sentry/socket/netfilter/netfilter.go`: Register `raw` table in `nameToID`, `SetEntries`, and `DefaultLinuxTables`
- `pkg/sentry/socket/netfilter/ct_target.go`: New file with the `ctTarget` wrapper and `ctTargetMaker` with marshal/unmarshal
- `pkg/sentry/socket/netfilter/targets.go`: Register `ctTargetMaker` for IPv4 and IPv6
- `pkg/sentry/socket/netfilter/BUILD`: Add `ct_target.go` to srcs
- `test/syscalls/linux/iptables.cc`: Add `RawTableInitialState` test (gVisor-only) and `CTTargetGetRevision` test
## Testing
- `RawTableInitialState`: Verifies `IPT_SO_GET_INFO` for the "raw" table returns correct `valid_hooks` (PREROUTING + OUTPUT), `num_entries` (3), and entry sizes
- `CTTargetGetRevision`: Verifies `IPT_SO_GET_REVISION_TARGET` for "CT" target revision 0 succeeds
- **Manual end-to-end test**: Built `runsc` with this change (plus #12686), deployed to an aarch64 node, and verified Istio `istio-init` with `ISTIO_META_DNS_CAPTURE=true` completes successfully: the full `iptables-restore` input including both `* nat` and `* raw` sections is applied without error
## Related
- Fixes #12685
- Depends on #12686 (maxOptLen increase) for large Istio rulesets

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12688 from a7i:fix/raw-table-ct-target 06aa774
PiperOrigin-RevId: 882273534
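The hook ordering and the no-op CT target described in this commit message can be illustrated with a toy model. This is a heavily simplified sketch under stated assumptions: every type and function here (`verdict`, `packet`, `target`, `ctTarget`, `table`, `check`) is hypothetical and does not mirror the real `pkg/tcpip/stack` API, and real table traversal (chains, jumps, underflow rules) is omitted.

```go
package main

import "fmt"

// verdict is a hypothetical stand-in for a rule verdict.
type verdict int

const (
	accept verdict = iota
	drop
)

// packet is a placeholder; its contents do not matter in this sketch.
type packet struct{ payload []byte }

// target is a hypothetical interface for rule targets.
type target interface {
	Action(p *packet) verdict
}

// ctTarget stores the zone parsed from userspace but never applies it,
// modelling the no-op behavior described above.
type ctTarget struct{ zone uint16 }

// Action accepts every packet without touching conntrack state.
func (t ctTarget) Action(*packet) verdict { return accept }

// table is a named list of targets, standing in for a table's rules.
type table struct {
	name    string
	targets []target
}

// check runs the tables in order and stops at the first non-accept verdict.
// It also returns the names of the tables visited, to show the ordering.
func check(tables []table, p *packet) (verdict, []string) {
	var visited []string
	for _, t := range tables {
		visited = append(visited, t.name)
		for _, tg := range t.targets {
			if v := tg.Action(p); v != accept {
				return v, visited
			}
		}
	}
	return accept, visited
}

func main() {
	// Prerouting order after this change: raw first, then mangle, then nat.
	prerouting := []table{
		{name: "raw", targets: []target{ctTarget{zone: 1}}},
		{name: "mangle"},
		{name: "nat"},
	}
	v, order := check(prerouting, &packet{})
	fmt.Println(v == accept, order) // prints "true [raw mangle nat]"
}
```

Because the CT target always accepts, a packet that matches a `-j CT --zone N` rule simply continues on to the later tables, which is why the `nat` table's `REDIRECT` rules still take effect.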
copybara-service bot pushed a commit that referenced this pull request on Mar 12, 2026