fix(setsockopt): increase maxOptLen from 8KB to 32KB #12686

Merged: copybara-service[bot] merged 1 commit into google:master from a7i:fix/iptables-restore-ipt-so-set-replace-12685 on Mar 12, 2026

a7i (Contributor) commented Mar 7, 2026

Summary

The hard-coded maxOptLen of 1024 * 8 (8192 bytes) in pkg/sentry/syscalls/linux/sys_socket.go makes gVisor silently return EINVAL for any setsockopt call whose optlen exceeds 8KB. This breaks real-world workloads that rely on setsockopt(IPT_SO_SET_REPLACE) with large iptables rulesets.

Root cause: iptables-restore uses setsockopt(SOL_IP, IPT_SO_SET_REPLACE, ...) to atomically replace an entire iptables table. For service meshes like Istio (v1.28+), the nat table payload commonly exceeds 8KB due to the number of rules generated for port exclusions, owner-match rules, and REDIRECT targets. When the payload exceeds maxOptLen, gVisor returns EINVAL before the buffer even reaches the netfilter layer — with no log message — causing iptables-restore to report cryptic errors like "can't initialize iptables table 'nat': Table does not exist".

Data points:

  • Istio 1.24 full ruleset (~30 rules, TCP-only exclusions): ~5-6KB — under 8KB limit — passes
  • Existing istio_blob test fixture in gVisor: 5,688 bytes — under 8KB limit — passes
  • Istio 1.28.4 full ruleset (~58+ rules, TCP+UDP exclusions): ~13KB — exceeds 8KB limit — fails
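The data points above can be replayed against a simplified model of the length cap. This is a minimal sketch, not the sentry's actual code: `checkOptLen` and `errEINVAL` are hypothetical names, and the "~13KB" Istio payload is approximated as 13 * 1024 bytes.

```go
package main

import (
	"errors"
	"fmt"
)

// errEINVAL stands in for the EINVAL errno returned to the application.
var errEINVAL = errors.New("EINVAL")

// checkOptLen is a hypothetical helper modeling the cap described in the
// summary: a negative length or a length above maxOptLen is rejected
// before the option value is looked at.
func checkOptLen(optLen, maxOptLen int) error {
	if optLen < 0 || optLen > maxOptLen {
		return errEINVAL
	}
	return nil
}

func main() {
	const (
		oldMax = 8 * 1024  // cap before this PR
		newMax = 32 * 1024 // cap after this PR
	)
	payloads := []struct {
		name string
		size int
	}{
		{"istio_blob fixture", 5688},
		{"Istio 1.28.4 nat table (approx.)", 13 * 1024}, // assumption: ~13KB
	}
	for _, p := range payloads {
		fmt.Printf("%-34s %6d bytes  old cap: %v  new cap: %v\n",
			p.name, p.size, checkOptLen(p.size, oldMax), checkOptLen(p.size, newMax))
	}
}
```

Only the larger payload is rejected under the old cap, and both are accepted under the new one.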

What Linux does

Linux bounds the setsockopt optlen only through the int optlen parameter type and a single if (optlen < 0) return -EINVAL check in do_sock_setsockopt() (net/socket.c). There is no generic upper-bound check, so INT_MAX is the effective ceiling.

What other runtimes do

| Runtime | setsockopt optlen limit | Mechanism |
|---------|-------------------------|-----------|
| Linux kernel | INT_MAX (2,147,483,647) | C int type + optlen < 0 check |
| runc | INT_MAX | Delegates to host kernel |
| Kata Containers | INT_MAX | Runs real Linux kernel in VM |
| gVisor (before) | 8,192 | Arbitrary maxOptLen constant |
| gVisor (this PR) | 32,768 | Conservative increase |

Changes

  1. pkg/sentry/syscalls/linux/sys_socket.go: Changed maxOptLen from 1024 * 8 (8KB) to 32 * 1024 (32KB). This provides ~2.5x headroom over the largest known real-world payload (Istio 1.28+ at ~13KB).

  2. test/syscalls/linux/iptables.cc: Added LargeReplacePayload regression test that constructs a valid nat table replacement payload exceeding 8KB and verifies setsockopt(IPT_SO_SET_REPLACE) succeeds. The test restores the original table afterward to avoid side effects.
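The headroom figure can be checked with simple arithmetic. The sketch below assumes the "~13KB" payload is exactly 13 * 1024 bytes, which is an approximation; `headroom` is a hypothetical helper.

```go
package main

import "fmt"

// headroom reports how many times a payload of the given size fits
// within the limit. A value below 1 means the payload is rejected.
func headroom(limit, payload int) float64 {
	return float64(limit) / float64(payload)
}

func main() {
	const approxIstioPayload = 13 * 1024 // assumption: "~13KB" taken as 13 KiB

	fmt.Printf("old 8KB cap:  %.2fx\n", headroom(8*1024, approxIstioPayload))
	fmt.Printf("new 32KB cap: %.2fx\n", headroom(32*1024, approxIstioPayload))
}
```

Under that assumption the new cap gives a ratio of roughly 2.46, consistent with the "~2.5x" quoted above, while the old cap gives a ratio below 1.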

Question for maintainers

We chose a conservative 32KB to balance compatibility with resource protection. However, Linux uses INT_MAX and the existing maxControlLen in the same file is already 10MB. The sandbox's cgroup memory limits are arguably the right place to guard against resource exhaustion, not a per-syscall buffer cap.

Should this be raised to math.MaxInt32 to match Linux's INT_MAX behavior exactly? This would eliminate any future risk of hitting this limit with other large setsockopt payloads (e.g. SO_ATTACH_FILTER, complex kube-proxy rulesets, etc.).
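To put the candidate values side by side, the sketch below prints each limit relative to the new 32KB cap. The constant names mirror those mentioned in the PR text, but the declarations are written here for illustration; "10MB" is taken as 10 * 1024 * 1024 bytes.

```go
package main

import (
	"fmt"
	"math"
)

const (
	oldMaxOptLen  = 8 * 1024         // cap before this PR
	newMaxOptLen  = 32 * 1024        // cap chosen in this PR
	maxControlLen = 10 * 1024 * 1024 // assumption: "10MB" means 10 MiB
	linuxCeiling  = math.MaxInt32    // equals C's INT_MAX on Linux
)

func main() {
	limits := []struct {
		name  string
		bytes int64
	}{
		{"maxOptLen (before)", oldMaxOptLen},
		{"maxOptLen (this PR)", newMaxOptLen},
		{"maxControlLen", maxControlLen},
		{"math.MaxInt32", linuxCeiling},
	}
	for _, l := range limits {
		ratio := float64(l.bytes) / float64(newMaxOptLen)
		fmt.Printf("%-20s %11d bytes  (%.2fx the 32KB cap)\n", l.name, l.bytes, ratio)
	}
}
```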

Reproducer

On a gVisor-enabled Kubernetes node with --net-raw enabled:

# Istio 1.28+ istio-init generates ~13KB nat table payloads
# This fails silently with EINVAL on gVisor (before this fix)
kubectl run istio-test --image=istio/proxyv2:1.28.0 \
  --restart=Never --rm -it \
  --overrides='{"spec":{"runtimeClassName":"gvisor"}}' \
  -- pilot-agent istio-iptables

Risk assessment

  • Low risk: The maxOptLen constant is a gVisor-internal safety cap, not a Linux compatibility requirement. 32KB is still tiny compared to what a sandboxed process can already allocate via mmap, and far smaller than the existing maxControlLen of 10MB for msg_control buffers.
  • Note: The existing maxControlLen comment acknowledges this exact pattern: "Note that this limit is smaller than Linux, which allows buffers upto INT_MAX."

Fixes #12685

google-cla bot commented Mar 7, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

The hard-coded maxOptLen of 8192 bytes silently rejected
setsockopt(IPT_SO_SET_REPLACE) payloads with EINVAL when the
iptables ruleset exceeded 8KB. This broke real-world workloads such
as Istio 1.28+ service mesh, whose istio-init container generates
nat table rulesets of ~13KB.

This raises the limit to 32KB, which provides ~2.5x headroom over
the largest known real-world payload. Linux itself limits this to
INT_MAX (net/socket.c: do_sock_setsockopt); other container runtimes
(runc, Kata Containers) inherit that limit by delegating to the host
kernel. We use a conservative 32KB here to balance compatibility with
resource protection in the sentry.

Fixes google#12685

Signed-off-by: Amir Alavi <amiralavi7@gmail.com>

a7i force-pushed the fix/iptables-restore-ipt-so-set-replace-12685 branch from acf6c58 to 4b78644 on March 8, 2026 01:22
a7i changed the title from "fix(setsockopt): increase maxOptLen from 8KB to match Linux INT_MAX" to "fix(setsockopt): increase maxOptLen from 8KB to 32KB" on Mar 8, 2026

ayushr2 (Collaborator) commented Mar 9, 2026

Thanks for the investigation and the fix!

copybara-service bot pushed a commit that referenced this pull request Mar 9, 2026
FUTURE_COPYBARA_INTEGRATE_REVIEW=#12686 from a7i:fix/iptables-restore-ipt-so-set-replace-12685 4b78644
PiperOrigin-RevId: 880918196
copybara-service bot pushed a commit that referenced this pull request Mar 11, 2026
copybara-service bot pushed a commit that referenced this pull request Mar 12, 2026
copybara-service bot pushed a commit that referenced this pull request Mar 12, 2026
## Summary

Add the iptables `raw` table and a no-op `CT` (conntrack zone) target to gVisor's netfilter implementation. This enables Istio's `istio-init` container to apply iptables rules when DNS capture is enabled (`ISTIO_META_DNS_CAPTURE=true`).

## Problem

When Istio DNS capture is enabled, `istio-iptables` generates `iptables-restore` input containing both `* nat` and `* raw` table sections. The `raw` table rules use `-j CT --zone N` targets for conntrack zone isolation between Envoy's DNS queries and application DNS queries. gVisor previously only implemented `nat`, `mangle`, and `filter` tables, causing `iptables-restore` to fail with:

```
iptables-restore: unable to initialize table 'raw'
```

This blocks Istio service mesh adoption on gVisor when DNS capture is required.

## Approach

**Raw table**: Added as a new `TableID` (`RawID`) with `PREROUTING` and `OUTPUT` hooks, matching the Linux kernel's raw table. Wired into `CheckPrerouting()` and `CheckOutput()` as the **first** table checked (before mangle), matching Linux's netfilter hook priority ordering:

- Linux hook order: raw → conntrack → mangle → nat → filter
- gVisor hook order (now): raw → mangle → nat (filter is separate)
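
The ordering above can be sketched as a loop that consults tables in a fixed sequence and stops at the first drop. This is a simplified illustration, not the code in `pkg/tcpip/stack/iptables.go`: `tableID`, `verdict`, and `checkHook` are hypothetical names.

```go
package main

import "fmt"

// tableID identifies a table in this sketch (hypothetical type).
type tableID int

const (
	rawID tableID = iota
	mangleID
	natID
)

func (t tableID) String() string {
	return [...]string{"raw", "mangle", "nat"}[t]
}

// verdict is the per-table decision in this sketch.
type verdict int

const (
	accept verdict = iota
	drop
)

// checkHook consults each table in order and stops at the first drop.
// It returns the tables that were visited and whether the packet survived.
func checkHook(order []tableID, check func(tableID) verdict) ([]tableID, bool) {
	var visited []tableID
	for _, id := range order {
		visited = append(visited, id)
		if check(id) == drop {
			return visited, false
		}
	}
	return visited, true
}

func main() {
	// raw is checked first, before mangle and nat.
	order := []tableID{rawID, mangleID, natID}
	visited, ok := checkHook(order, func(tableID) verdict { return accept })
	fmt.Println(visited, ok)
}
```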

**CT target**: Implemented as a **no-op** that accepts packets without modifying conntrack behavior. The target parses the `xt_ct_target_info` (revision 0) struct from userspace, stores the zone value, but does not apply zone-based conntrack isolation. This is intentional:

- gVisor's conntrack implementation does not support zones
- The CT target's purpose in Istio is to prevent conntrack table collisions between Envoy (UID 1337) and application DNS traffic
- DNS redirection still works correctly via the `nat` table's `REDIRECT` rules to port 15053
- The lack of zone tracking may cause rare conntrack 5-tuple collisions under heavy concurrent DNS load, but this is acceptable for gVisor's sandboxed environment
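
A no-op target of this kind can be sketched as follows. The types are hypothetical stand-ins, not gVisor's `CTTarget` or its packet types: the point is only that the zone is stored after parsing but never consulted when the action runs.

```go
package main

import "fmt"

// ruleVerdict is the outcome of a target in this sketch.
type ruleVerdict int

const (
	ruleAccept ruleVerdict = iota
	ruleDrop
)

// packet is a hypothetical stand-in for a packet with conntrack metadata.
type packet struct {
	conntrackZone uint16
}

// ctTarget keeps the zone parsed from userspace but does not apply it.
type ctTarget struct {
	zone uint16
}

// action accepts the packet and leaves its conntrack metadata untouched,
// which is what "no-op" means here.
func (t *ctTarget) action(p *packet) ruleVerdict {
	return ruleAccept
}

func main() {
	target := &ctTarget{zone: 1}
	pkt := &packet{}
	v := target.action(pkt)
	fmt.Printf("verdict=%d stored zone=%d packet zone=%d\n", v, target.zone, pkt.conntrackZone)
}
```

Keeping the zone on the struct lets the rule be read back unchanged by userspace tools, even though the value has no effect on packet handling in this sketch.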

**How Linux and other runtimes handle this**:
- **Linux kernel**: Full `raw` table with `CT --zone` support via `nf_conntrack_zones`
- **runc / kata**: Delegate to the host Linux kernel, so they get full support for free
- **gVisor**: Must implement in userspace netstack — this PR adds the table/target scaffolding with a no-op CT action

## Changes

- `pkg/tcpip/stack/iptables.go`: Add `RawID` to `TableID` enum, `EmptyRawTable()`, default table entries for IPv4/IPv6, wire into `CheckPrerouting()` and `CheckOutput()`
- `pkg/tcpip/stack/iptables_targets.go`: Add `CTTarget` struct with no-op `Action()` returning `RuleAccept`
- `pkg/abi/linux/netfilter.go`: Add `XTCTTargetInfoV0` ABI struct (72 bytes) matching Linux's `xt_ct_target_info`
- `pkg/sentry/socket/netfilter/netfilter.go`: Register `raw` table in `nameToID`, `SetEntries`, and `DefaultLinuxTables`
- `pkg/sentry/socket/netfilter/ct_target.go`: New file — `ctTarget` wrapper and `ctTargetMaker` with marshal/unmarshal
- `pkg/sentry/socket/netfilter/targets.go`: Register `ctTargetMaker` for IPv4 and IPv6
- `pkg/sentry/socket/netfilter/BUILD`: Add `ct_target.go` to srcs
- `test/syscalls/linux/iptables.cc`: Add `RawTableInitialState` test (gVisor-only) and `CTTargetGetRevision` test

## Testing

- `RawTableInitialState`: Verifies `IPT_SO_GET_INFO` for the "raw" table returns correct `valid_hooks` (PREROUTING + OUTPUT), `num_entries` (3), and entry sizes
- `CTTargetGetRevision`: Verifies `IPT_SO_GET_REVISION_TARGET` for "CT" target revision 0 succeeds
- **Manual end-to-end test**: Built `runsc` with this change (plus #12686), deployed to an aarch64 node, and verified Istio `istio-init` with `ISTIO_META_DNS_CAPTURE=true` completes successfully — the full `iptables-restore` input including both `* nat` and `* raw` sections is applied without error

## Related

- Fixes #12685
- Depends on #12686 (maxOptLen increase) for large Istio rulesets

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12688 from a7i:fix/raw-table-ct-target 06aa774
PiperOrigin-RevId: 882273534
copybara-service bot pushed a commit that referenced this pull request Mar 12, 2026
copybara-service bot merged commit 293707e into google:master on Mar 12, 2026
3 checks passed
a7i deleted the fix/iptables-restore-ipt-so-set-replace-12685 branch on March 12, 2026 02:09
Development

Successfully merging this pull request may close these issues.

Istio istio-init fails on gVisor: maxOptLen 8KB limit + missing raw table block iptables-restore
