Agent may fail to spawn jobs when environment variables approach ARG_MAX limit (fork/exec: argument list too long) #3509

@jasonwbarnett

Describe the bug

The buildkite-agent fails to spawn job processes when the combined size of environment variables and command arguments exceeds the system's ARG_MAX limit (1MB on the affected systems; the exact value varies by platform). Builds then fail with "fork/exec /usr/bin/buildkite-agent-stable: argument list too long" errors.

The issue is particularly pronounced in GitHub merge queue builds due to:

  1. The BUILDKITE_PLUGINS environment variable containing the entire plugin configuration as JSON (can be 50-100KB+ for complex pipelines)
  2. Longer branch names in merge queues (e.g., gh-readonly-queue/main/pr-34570-ab491de6748ebf93bfceb873775bfbbd4fb33129)
  3. More conditional steps triggered by merge queue branch patterns
  4. Environment variable accumulation through nested/triggered pipelines

This is a fundamental architectural limitation: transporting large structured data in environment variables runs into the size limit that the kernel enforces on arguments and environment in the execve() system call.

Steps To Reproduce

  1. Create a pipeline with:

    • 30+ watch configurations in monorepo-diff plugin
    • Multiple plugins per step (4-5 plugins with configurations)
    • Long inline commands (20+ lines of bash)
    • Conditional steps that trigger on merge queue branches
  2. Configure the pipeline to run on GitHub merge queue events

    • Branch pattern: gh-readonly-queue/main/pr-*
  3. Trigger a build via GitHub merge queue

    • Branch name will be format: gh-readonly-queue/main/pr-XXXXX-<40-char-hash>
  4. Observe failure when agent attempts to spawn job processes

To measure environment size in a job:

env | wc -c  # Total bytes
printf '%s' "$BUILDKITE_PLUGINS" | wc -c  # Just plugin config (printf is more portable than echo -n)
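
The commands above give a single byte count. For a per-variable breakdown, a minimal Go sketch along the following lines may help. It is an approximation, not agent code: it assumes each entry costs its length plus a terminating NUL byte plus one 8-byte pointer (64-bit platform), and the function names are illustrative. It also flags variables close to 128 KiB, because Linux additionally caps each individual argument or environment string (MAX_ARG_STRLEN, 32 pages, assuming 4 KiB pages), so one oversized variable can produce the same error on its own.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

const (
	// ptrSize is the assumed size of one envp pointer (64-bit platform).
	ptrSize = 8
	// maxArgStrlen is the Linux per-string limit, assuming 4 KiB pages.
	maxArgStrlen = 32 * 4096
)

// envEntry records the size of one "NAME=value" string including its NUL.
type envEntry struct {
	Name string
	Size int
}

// envEntries returns per-variable sizes, largest first.
func envEntries(env []string) []envEntry {
	entries := make([]envEntry, 0, len(env))
	for _, kv := range env {
		name, _, _ := strings.Cut(kv, "=")
		entries = append(entries, envEntry{Name: name, Size: len(kv) + 1})
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].Size > entries[j].Size })
	return entries
}

// totalEnvSize approximates what the environment contributes to the
// execve() size accounting: string bytes, NUL terminators, and pointers.
func totalEnvSize(env []string) int {
	total := 0
	for _, kv := range env {
		total += len(kv) + 1 + ptrSize
	}
	return total
}

func main() {
	env := os.Environ()
	fmt.Printf("approximate environment size: %d bytes\n", totalEnvSize(env))
	// Print the five largest variables.
	for i, e := range envEntries(env) {
		if i == 5 {
			break
		}
		note := ""
		if e.Size > maxArgStrlen*9/10 {
			note = " (close to the per-string limit)"
		}
		fmt.Printf("%8d  %s%s\n", e.Size, e.Name, note)
	}
}
```

Running this inside a job step should show whether BUILDKITE_PLUGINS dominates the environment.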

Expected behavior

The buildkite-agent should be able to spawn job processes regardless of pipeline complexity or configuration size. The agent should:

  • Handle arbitrarily large plugin configurations without hitting OS limits
  • Provide clear error messages when approaching system limits
  • Detect potential ARG_MAX issues before attempting process spawning
  • Use appropriate data transport mechanisms for large structured data

Actual behaviour

The agent attempts to pass all plugin configuration via the BUILDKITE_PLUGINS environment variable. When the total environment size exceeds ARG_MAX (~1MB), the OS rejects the fork/exec call with:

fork/exec /usr/bin/buildkite-agent-stable: argument list too long

The build fails with no prior warning, and the agent gives no indication of which variable or argument exceeded the limit or how to resolve the issue.

Environment Size Breakdown (typical complex pipeline):

  • Standard Buildkite variables: ~30 KB
  • BUILDKITE_PLUGINS: ~80-100 KB (grows with pipeline complexity)
  • Other plugin environment variables: ~10 KB
  • System environment variables: ~5 KB
  • Custom environment variables: ~5 KB
  • Merge queue branch name overhead: +5 KB (appears in multiple variables)
  • Total: ~135-155 KB per job level

With nested pipelines or triggered builds, environments accumulate and can exceed the 1MB limit.
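
To make the accumulation concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model in which every nesting level adds the full per-job environment (135-155 KB from the breakdown above) on top of what it inherited; real pipelines may accumulate less. The function name is illustrative.

```go
package main

import "fmt"

// levelsUntilLimit returns the first nesting depth at which the accumulated
// environment exceeds limit, assuming each level adds perLevel bytes.
// It returns 0 when perLevel is not positive (the limit is never reached).
func levelsUntilLimit(perLevel, limit int) int {
	if perLevel <= 0 {
		return 0
	}
	return limit/perLevel + 1
}

func main() {
	const limit = 1 << 20 // 1 MiB, the ARG_MAX value reported in this issue
	for _, kb := range []int{135, 155} {
		fmt.Printf("%d KB per level: limit exceeded at level %d\n",
			kb, levelsUntilLimit(kb*1024, limit))
	}
}
```

Under this model the 1MB limit is crossed at the eighth level for 135 KB per level and at the seventh for 155 KB.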

Stack parameters:

  • Platform: Linux (Ubuntu), macOS (Darwin)
  • System ARG_MAX: 1048576 bytes (1MB) - verified via getconf ARG_MAX
  • Agent Version: Affects multiple versions (observed in production across v3.x releases)
  • Build Context: GitHub merge queue builds specifically, but can affect any complex pipeline

Additional context

Related Issues

This is part of a broader pattern of ARG_MAX issues in the agent:

Root Cause Analysis

The agent currently serializes entire plugin configurations into the BUILDKITE_PLUGINS environment variable and passes this to every child process. This design choice has fundamental limitations:

Why environment variables are the wrong medium:

  • Designed for small key-value pairs (PATH, HOME, USER)
  • Combined env + args limited by ARG_MAX (kernel constraint)
  • Not designed for structured data or large payloads
  • Accumulate through inheritance in nested pipelines

Why this specifically affects merge queues:

  1. Longer branch names (72 chars in the example above vs ~25 chars for typical feature branches)
  2. Branch name appears in multiple variables (BUILDKITE_BRANCH, BUILDKITE_MESSAGE, etc.)
  3. More conditional steps (e.g., if: build.branch =~ /^gh-readonly-queue\//)
  4. Larger overall BUILDKITE_PLUGINS due to additional step configurations

Proposed Solutions (Priority Order)

CRITICAL: Stop using environment variables for large configurations

Implement one of these approaches:

  1. Write configs to temp files - Pass file path via environment variable

    BUILDKITE_PLUGINS_FILE=/tmp/buildkite-plugins-$JOB_ID.json
  2. Store configs in agent memory - Reference by ID

    BUILDKITE_PLUGINS_ID=abc123def456
  3. Fetch via Buildkite API - Jobs request their config when needed

    • Agent already has API access
    • Avoids environment variable limits entirely
    • Natural fit for the agent's architecture

HIGH: Implement ARG_MAX detection and warnings

Before spawning processes, check if approaching the limit:

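import (
    "fmt"
    "os/exec"
)

// checkArgMax returns an error when the environment and arguments of cmd
// exceed 90% of ARG_MAX. The helper functions are assumed to return sizes
// in bytes as int. Note that a nil cmd.Env means the child inherits the
// parent's environment, which the caller must account for.
func checkArgMax(cmd *exec.Cmd) error {
    argMax := getSystemArgMax()
    totalSize := calculateEnvSize(cmd.Env) + calculateArgsSize(cmd.Args)

    // 90% threshold, compared in integer arithmetic (argMax * 0.9 does not
    // compile when argMax is an int).
    if totalSize*10 > argMax*9 {
        return fmt.Errorf(
            "environment + args (%d bytes) approaching ARG_MAX limit (%d bytes); "+
                "consider simplifying your pipeline configuration or moving inline "+
                "commands to shell scripts. See: https://buildkite.com/docs/...",
            totalSize, argMax,
        )
    }
    return nil
}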
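
The snippet above leaves getSystemArgMax, calculateEnvSize and calculateArgsSize undefined. One possible way to fill them in is sketched below, under these assumptions: the limit is read by running getconf ARG_MAX (with a 1 MiB fallback if that fails), and each string is counted as its length plus a NUL terminator. nearLimit is an illustrative helper for the 90% threshold. This is a sketch, not the agent's code.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

// fallbackArgMax is used when getconf is unavailable.
// Assumption: 1 MiB, the value reported in this issue.
const fallbackArgMax = 1 << 20

// getSystemArgMax asks getconf for ARG_MAX and falls back to a fixed default.
func getSystemArgMax() int {
	out, err := exec.Command("getconf", "ARG_MAX").Output()
	if err != nil {
		return fallbackArgMax
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(out)))
	if err != nil || n <= 0 {
		return fallbackArgMax
	}
	return n
}

// stringsSize sums the length of each string plus its NUL terminator.
func stringsSize(ss []string) int {
	total := 0
	for _, s := range ss {
		total += len(s) + 1
	}
	return total
}

func calculateEnvSize(env []string) int   { return stringsSize(env) }
func calculateArgsSize(args []string) int { return stringsSize(args) }

// nearLimit reports whether size is above 90% of limit, using integer math.
func nearLimit(size, limit int) bool {
	return size*10 > limit*9
}

func main() {
	cmd := exec.Command("echo", "hello") // constructed only to measure; never run
	cmd.Env = os.Environ()               // set explicitly so the size is visible
	total := calculateEnvSize(cmd.Env) + calculateArgsSize(cmd.Args)
	argMax := getSystemArgMax()
	fmt.Printf("env+args: %d bytes, ARG_MAX: %d, near limit: %v\n",
		total, argMax, nearLimit(total, argMax))
}
```

Shelling out to getconf keeps the sketch free of cgo; a real implementation might prefer a platform-specific call.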

MEDIUM: Configuration size validation

Validate at pipeline upload time:

  • Warn if pipeline YAML > 100KB
  • Warn if any step has inline commands > 10KB
  • Suggest using shell scripts instead of inline commands
  • Provide pipeline optimization guidance

LOW: Environment variable pruning

Strip unnecessary data:

  • Don't pass parent job's BUILDKITE_PLUGINS to unrelated child jobs
  • Only pass plugin configs relevant to the current step
  • Remove redundant BUILDKITE_* variables that child jobs don't need
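
Pruning can be as simple as filtering the environment slice before it is handed to the child process. A minimal sketch follows; pruneEnv is an illustrative name, and deciding which variables a child can safely lose is the hard part that this does not address.

```go
package main

import (
	"fmt"
	"strings"
)

// pruneEnv returns a copy of env without the variables named in drop.
// Names are matched exactly against the part before the first "=".
func pruneEnv(env []string, drop ...string) []string {
	skip := make(map[string]bool, len(drop))
	for _, name := range drop {
		skip[name] = true
	}
	out := make([]string, 0, len(env))
	for _, kv := range env {
		name, _, _ := strings.Cut(kv, "=")
		if skip[name] {
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	parent := []string{
		"PATH=/usr/bin",
		"BUILDKITE_PLUGINS=[...large JSON...]",
		"BUILDKITE_BRANCH=main",
	}
	fmt.Println(pruneEnv(parent, "BUILDKITE_PLUGINS"))
	// prints: [PATH=/usr/bin BUILDKITE_BRANCH=main]
}
```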

Immediate Workarounds for Users

Until the agent is fixed, users can mitigate by:

  1. Moving inline commands to shell scripts

    # Instead of 50-line inline command:
    command: .buildkite/scripts/my_command.sh
  2. Minimizing plugin usage per step

    • Use agent hooks instead of plugins where possible
    • Consolidate plugin functionality
  3. Simplifying pipeline configuration

    • Reduce number of watch paths in monorepo-diff
    • Use fewer conditional steps
    • Split large pipelines into smaller independent pipelines
  4. Using pipeline templates

    • Define common plugin chains once
    • Reference templates instead of duplicating

Technical References

Impact

This issue affects users with:

  • Complex monorepo configurations
  • Multiple plugins per step
  • GitHub merge queue integration
  • Nested or triggered pipelines
  • Large pipeline configurations

The current design makes it impossible to build arbitrarily complex pipelines, which limits Buildkite's scalability for large organizations with sophisticated CI/CD requirements.
