Describe the bug
The buildkite-agent fails to spawn job processes when the combined size of environment variables and command arguments exceeds the system's ARG_MAX limit (typically 1MB). This results in fork/exec /usr/bin/buildkite-agent-stable: argument list too long errors during builds.
The issue is particularly pronounced in GitHub merge queue builds due to:
- The BUILDKITE_PLUGINS environment variable containing the entire plugin configuration as JSON (can be 50-100KB+ for complex pipelines)
- Longer branch names in merge queues (e.g., gh-readonly-queue/main/pr-34570-ab491de6748ebf93bfceb873775bfbbd4fb33129)
- More conditional steps triggered by merge queue branch patterns
- Environment variable accumulation through nested/triggered pipelines
This is a fundamental architectural limitation: using environment variables to transport large structured data runs up against the ARG_MAX constraint that the execve() system call enforces.
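For illustration only (this is a standalone demo, not agent code), a small Go program that reproduces the same failure mode by attaching an oversized BUILDKITE_PLUGINS value before spawning a trivial command:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Attach a ~1 MiB BUILDKITE_PLUGINS value to an otherwise trivial command.
	// On typical Linux and macOS systems this exceeds the execve limits, and
	// the spawn fails with "argument list too long" (E2BIG).
	oversized := "BUILDKITE_PLUGINS=" + strings.Repeat("x", 1<<20)

	cmd := exec.Command("true")
	cmd.Env = append(os.Environ(), oversized)

	if err := cmd.Run(); err != nil {
		fmt.Println("spawn failed:", err) // e.g. "fork/exec /usr/bin/true: argument list too long"
		os.Exit(1)
	}
	fmt.Println("spawn succeeded")
}
```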
Steps To Reproduce
1. Create a pipeline with:
   - 30+ watch configurations in the monorepo-diff plugin
   - Multiple plugins per step (4-5 plugins with configurations)
   - Long inline commands (20+ lines of bash)
   - Conditional steps that trigger on merge queue branches
2. Configure the pipeline to run on GitHub merge queue events
   - Branch pattern: gh-readonly-queue/main/pr-*
3. Trigger a build via GitHub merge queue
   - The branch name will have the format gh-readonly-queue/main/pr-XXXXX-<40-char-hash>
4. Observe the failure when the agent attempts to spawn job processes
To measure environment size in a job:
```bash
env | wc -c                           # Total bytes
echo -n "$BUILDKITE_PLUGINS" | wc -c  # Just plugin config
```

Expected behavior
The buildkite-agent should be able to spawn job processes regardless of pipeline complexity or configuration size. The agent should:
- Handle arbitrarily large plugin configurations without hitting OS limits
- Provide clear error messages when approaching system limits
- Detect potential ARG_MAX issues before attempting process spawning
- Use appropriate data transport mechanisms for large structured data
Actual behaviour
The agent attempts to pass all plugin configuration via the BUILDKITE_PLUGINS environment variable. When the total environment size exceeds ARG_MAX (~1MB), the OS rejects the fork/exec call with:
```
fork/exec /usr/bin/buildkite-agent-stable: argument list too long
```
The build fails with no prior warning, and the agent doesn't provide guidance on what exceeded limits or how to resolve the issue.
Environment Size Breakdown (typical complex pipeline):
- Standard Buildkite variables: ~30 KB
- BUILDKITE_PLUGINS: ~80-100 KB (grows with pipeline complexity)
- Other plugin environment variables: ~10 KB
- System environment variables: ~5 KB
- Custom environment variables: ~5 KB
- Merge queue branch name overhead: +5 KB (appears in multiple variables)
- Total: ~135-155 KB per job level
With nested pipelines or triggered builds, environments accumulate and can exceed the 1MB limit.
Stack parameters (please complete the following information):
- Platform: Linux (Ubuntu), macOS (Darwin)
- System ARG_MAX: 1048576 bytes (1 MB), verified via getconf ARG_MAX
- Agent Version: Affects multiple versions (observed in production across v3.x releases)
- Build Context: GitHub merge queue builds specifically, but can affect any complex pipeline
Additional context
Related Issues
This is part of a broader pattern of ARG_MAX issues in the agent:
- Issue #1484 (annotate - buildkite-agent: Argument list too long): buildkite-agent annotate fails with "Argument list too long" when passing 240+ lines
- Issue #1269 (Git checkout failed at: argument list too long): Git checkout fails with "argument list too long" for large commit messages (2000+ lines)
Root Cause Analysis
The agent currently serializes entire plugin configurations into the BUILDKITE_PLUGINS environment variable and passes this to every child process. This design choice has fundamental limitations:
Why environment variables are the wrong medium:
- Designed for small key-value pairs (PATH, HOME, USER)
- Combined env + args limited by ARG_MAX (kernel constraint)
- Not designed for structured data or large payloads
- Accumulate through inheritance in nested pipelines
Why this specifically affects merge queues:
- Longer branch names (74 chars vs 25 chars for feature branches)
- Branch name appears in multiple variables (BUILDKITE_BRANCH, BUILDKITE_MESSAGE, etc.)
- More conditional steps (e.g., if: build.branch =~ /^gh-readonly-queue\//)
- Larger overall BUILDKITE_PLUGINS due to additional step configurations
Proposed Solutions (Priority Order)
CRITICAL: Stop using environment variables for large configurations
Implement one of these approaches:
- Write configs to temp files and pass the file path via an environment variable, e.g. BUILDKITE_PLUGINS_FILE=/tmp/buildkite-plugins-$JOB_ID.json (see the sketch after this list)
- Store configs in agent memory and reference them by ID, e.g. BUILDKITE_PLUGINS_ID=abc123def456
- Fetch configs via the Buildkite API, with jobs requesting their config when needed
  - Agent already has API access
  - Avoids environment variable limits entirely
  - Natural fit for the agent's architecture
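A minimal sketch of the first option, under stated assumptions: WritePluginsFile is a hypothetical helper, and BUILDKITE_PLUGINS_FILE is the variable name proposed above, not an existing agent feature.

```go
package plugin // hypothetical location inside the agent codebase

import (
	"fmt"
	"os"
	"path/filepath"
)

// WritePluginsFile writes the plugin JSON to a private temp file and returns
// its path, so the job environment can carry BUILDKITE_PLUGINS_FILE=<path>
// instead of the full (potentially 100 KB+) JSON payload.
func WritePluginsFile(jobID, pluginsJSON string) (string, error) {
	path := filepath.Join(os.TempDir(), fmt.Sprintf("buildkite-plugins-%s.json", jobID))
	// 0600: plugin config can contain sensitive values, so keep it readable
	// only by the agent user.
	if err := os.WriteFile(path, []byte(pluginsJSON), 0o600); err != nil {
		return "", fmt.Errorf("writing plugins file: %w", err)
	}
	return path, nil
}
```

The bootstrap would then read the file back (os.ReadFile) before plugin checkout and could delete it when the job finishes, so the environment carries only a short path rather than the whole configuration.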
HIGH: Implement ARG_MAX detection and warnings
Before spawning processes, check if approaching the limit:
```go
// checkArgMax returns an error when the combined size of the environment and
// argument list approaches the kernel's ARG_MAX limit.
func checkArgMax(cmd *exec.Cmd) error {
	argMax := getSystemArgMax()

	// execve counts every environment string and every argument string,
	// each with a trailing NUL byte.
	totalSize := 0
	for _, e := range cmd.Env {
		totalSize += len(e) + 1
	}
	for _, a := range cmd.Args {
		totalSize += len(a) + 1
	}

	if float64(totalSize) > float64(argMax)*0.9 { // warn at a 90% threshold
		return fmt.Errorf(
			"environment + args (%d bytes) approaching ARG_MAX limit (%d bytes). "+
				"Consider simplifying your pipeline configuration or moving inline "+
				"commands to shell scripts. See: https://buildkite.com/docs/...",
			totalSize, argMax,
		)
	}
	return nil
}
```
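getSystemArgMax is left abstract above. One possible sketch (again, not existing agent code) shells out to getconf ARG_MAX, which this report already uses for verification, and assumes os/exec, strconv, and strings are imported:

```go
// getSystemArgMax queries ARG_MAX via getconf and falls back to a
// conservative 1 MiB default if the query fails. Hypothetical helper.
func getSystemArgMax() int {
	out, err := exec.Command("getconf", "ARG_MAX").Output()
	if err != nil {
		return 1 << 20 // assume the common 1 MiB limit
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(out)))
	if err != nil {
		return 1 << 20
	}
	return n
}
```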
MEDIUM: Configuration size validation
Validate at pipeline upload time (see the sketch after this list):
- Warn if pipeline YAML > 100KB
- Warn if any step has inline commands > 10KB
- Suggest using shell scripts instead of inline commands
- Provide pipeline optimization guidance
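As an illustration of those checks, a hedged sketch with hypothetical names only; the 100 KB and 10 KB thresholds are the ones suggested above, and fmt is assumed to be imported:

```go
// Hypothetical upload-time validation; thresholds match the proposal above.
const (
	maxPipelineYAMLBytes  = 100 * 1024 // warn if pipeline YAML > 100 KB
	maxInlineCommandBytes = 10 * 1024  // warn if any inline command > 10 KB
)

func validatePipelineSize(pipelineYAML []byte, stepCommands map[string]string) []string {
	var warnings []string
	if len(pipelineYAML) > maxPipelineYAMLBytes {
		warnings = append(warnings, fmt.Sprintf(
			"pipeline YAML is %d bytes; consider splitting it into smaller pipelines", len(pipelineYAML)))
	}
	for label, cmd := range stepCommands {
		if len(cmd) > maxInlineCommandBytes {
			warnings = append(warnings, fmt.Sprintf(
				"step %q has a %d-byte inline command; move it to a shell script", label, len(cmd)))
		}
	}
	return warnings
}
```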
LOW: Environment variable pruning
Strip unnecessary data:
- Don't pass the parent job's BUILDKITE_PLUGINS to unrelated child jobs (see the sketch after this list)
- Only pass plugin configs relevant to the current step
- Remove redundant BUILDKITE_* variables that child jobs don't need
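A sketch of that pruning, assuming strings is imported; the deny list here is purely illustrative, not a claim about which variables are actually safe to drop:

```go
// pruneJobEnv strips heavyweight variables the child job does not need
// before spawning it. Illustrative only.
func pruneJobEnv(env []string) []string {
	drop := map[string]bool{
		"BUILDKITE_PLUGINS": true, // the parent's plugin config is irrelevant to the child
	}
	pruned := make([]string, 0, len(env))
	for _, kv := range env {
		name, _, _ := strings.Cut(kv, "=")
		if !drop[name] {
			pruned = append(pruned, kv)
		}
	}
	return pruned
}
```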
Immediate Workarounds for Users
Until the agent is fixed, users can mitigate by:
- Moving inline commands to shell scripts

  ```yaml
  # Instead of a 50-line inline command:
  command: .buildkite/scripts/my_command.sh
  ```

- Minimizing plugin usage per step
  - Use agent hooks instead of plugins where possible
  - Consolidate plugin functionality
- Simplifying pipeline configuration
  - Reduce the number of watch paths in monorepo-diff
  - Use fewer conditional steps
  - Split large pipelines into smaller independent pipelines
- Using pipeline templates
  - Define common plugin chains once
  - Reference templates instead of duplicating
Technical References
- Unix ARG_MAX limits: https://unix.stackexchange.com/questions/45583/argument-list-too-long
- POSIX execve() specification: https://pubs.opengroup.org/onlinepubs/9699919799/functions/execve.html
- Linux kernel ARG_MAX: https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/binfmts.h
Impact
This issue affects users with:
- Complex monorepo configurations
- Multiple plugins per step
- GitHub merge queue integration
- Nested or triggered pipelines
- Large pipeline configurations
The current design makes it impossible to build arbitrarily complex pipelines, which limits Buildkite's scalability for large organizations with sophisticated CI/CD requirements.