feat: extend frontmatter with A/B experiments section #29534
Conversation
- Add `experiments` section to frontmatter schema (map of name → variant list)
- Add `Experiments map[string][]string` to FrontmatterConfig and WorkflowData
- Transform `${{ experiments.name }}` expressions to `${{ env.GH_AW_EXPERIMENTS_NAME }}`
via new `transformExperimentsExpression()` in expression_extraction.go
- Generate activation-job steps: restore cache → pick variants (pick_experiment.cjs) →
save cache → upload "experiment" artifact
- Add pick_experiment.cjs: balanced round-robin variant selection using actions/cache
state file, GITHUB_ENV export, and Markdown step summary
- Add `ArtifactSetExperiment` ("experiment") to CLI artifact sets for logs/audit
- Add ExperimentArtifactName constant
- Full test coverage: Go unit tests and vitest JS tests (16 passing)
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/6603825b-d8d5-4381-a44c-3d6fd823ab92
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
- Fix expression_extraction.go: use originalContent variable and cleaner branching to avoid reusing transformedContent across two transformations; scope deprecation warning precisely to activation output transformations
- Fix ExperimentEnvVarName: use normalizeJobNameForEnvVar instead of strings.ToUpper to align Go and JS normalization (handles hyphens etc.)
- Fix buildExperimentSpecJSON: replace manual escapeJSONString with encoding/json.Marshal for correct escaping of all control characters

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/6603825b-d8d5-4381-a44c-3d6fd823ab92
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
```js
// Export the selected variant so subsequent steps can read it via $GITHUB_ENV.
const envVarName = `GH_AW_EXPERIMENTS_${name.toUpperCase().replace(/[^A-Z0-9]/g, "_")}`;
core.exportVariable(envVarName, selected);
```
- Set the experiment state serialized JSON as an output of the step and the job (`needs.activation.experiments.outputs.caveman`); do not set env vars.
- Add an experiment to the smoke-copilot prompt using a conditional template to inject "Talk like a caveman":

```yaml
experiments:
  caveman: [yes, no]
```

In the prompt:

```
{{#if experiments.caveman }}
Talk like a caveman
{{#endif}}  # use correct syntax
```
Done in commit 2acaa08. Here's what changed:
Step outputs instead of env vars (`pick_experiment.cjs`):
- Removed `core.exportVariable` entirely
- Now sets `core.setOutput(name, selected)` per experiment (e.g. `caveman=yes`)
- Sets `core.setOutput('experiments', JSON.stringify(assignments))` for the full JSON blob
- Accessible downstream as `needs.activation.outputs.experiments`
Template conditional plumbing, so `{{#if experiments.caveman }}` works correctly:
- `ExperimentExpressionMappings()` generates `GH_AW_EXPERIMENTS_CAVEMAN → steps.pick-experiment.outputs.caveman` mappings, ensuring the interpolation step sets the env var from the step output (not GITHUB_ENV)
- Added Step 2.5 in `interpolate_prompt.cjs` that substitutes `__GH_AW_EXPERIMENTS_*__` placeholders before `renderMarkdownTemplate()` runs, so the actual value (yes/no) is what gets evaluated, not the placeholder string
- Extended `isTruthy` to treat `"no"` as falsy, so `{{#if no }}` correctly removes the block
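The substitution-then-evaluate order described above can be sketched like this (a minimal illustration with made-up helper names, not the actual `interpolate_prompt.cjs` / `is_truthy.cjs` source):

```javascript
// Sketch of the "Step 2.5" idea: replace __GH_AW_EXPERIMENTS_*__ placeholders
// with the real variant values from the environment BEFORE template
// conditionals are evaluated. substituteExperimentPlaceholders is hypothetical.
function substituteExperimentPlaceholders(prompt, env) {
  return prompt.replace(/__GH_AW_EXPERIMENTS_([A-Z0-9_]+)__/g, (match, name) => {
    const value = env[`GH_AW_EXPERIMENTS_${name}`];
    return value !== undefined ? value : match; // leave unknown placeholders alone
  });
}

// Sketch of the extended truthiness check: "no" (like "false" and "0") is
// falsy, so {{#if no }} removes its block after substitution.
function isTruthy(value) {
  const v = String(value).trim().toLowerCase();
  return v !== "" && v !== "false" && v !== "0" && v !== "null" && v !== "no";
}
```

With this ordering, `{{#if experiments.caveman }}` sees the substituted variant value (`yes` or `no`) rather than the placeholder string, which is always truthy.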
`smoke-copilot.md`, added to frontmatter:

```yaml
experiments:
  caveman: [yes, no]
```

And in the prompt:

```
{{#if experiments.caveman }}
Talk like a caveman in all your responses and outputs. Use short, broken sentences. Me test. You run.
{{/if}}
```
… smoke-copilot
- Replace core.exportVariable with core.setOutput in pick_experiment.cjs:
- Per-experiment output: setOutput(name, variant) e.g. 'caveman=yes'
- Combined JSON output: setOutput('experiments', JSON) accessible as
needs.activation.outputs.experiments from downstream jobs
- Wire activation job output: ctx.outputs["experiments"] = "${{ steps.pick-experiment.outputs.experiments }}"
- Add ExperimentExpressionMappings() in compiler_experiments.go: generates
GH_AW_EXPERIMENTS_* → steps.pick-experiment.outputs.* mappings so the
interpolation and substitution steps have the env vars set from step outputs
- Add Step 2.5 in interpolate_prompt.cjs: substitute __GH_AW_EXPERIMENTS_*__
placeholders from env vars BEFORE renderMarkdownTemplate() so {{#if experiments.name}}
conditionals evaluate the actual variant value not the placeholder string
- Extend isTruthy to treat 'no' as falsy (enables [yes, no] variant pattern)
- Update transformExperimentsExpression: map experiments.NAME →
steps.pick-experiment.outputs.NAME (was env.GH_AW_EXPERIMENTS_NAME)
- Add caveman experiment to smoke-copilot.md with {{#if experiments.caveman}}
conditional that injects 'Talk like a caveman' instructions when selected
- Update tests: pick_experiment.test.cjs, is_truthy.test.cjs,
  compiler_experiments_test.go, expression_extraction_test.go
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/3ca87b03-00f9-40d6-b348-bdc026ad85c0
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
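For illustration, the wiring described in this commit could be consumed by a downstream job roughly like this (a hand-written sketch with hypothetical step invocations; the real generated YAML lives in the compiled lock files):

```yaml
jobs:
  activation:
    runs-on: ubuntu-latest
    outputs:
      # Combined JSON blob set by pick_experiment.cjs via core.setOutput('experiments', ...)
      experiments: ${{ steps.pick-experiment.outputs.experiments }}
    steps:
      # ... restore experiment cache ...
      - id: pick-experiment
        run: node pick_experiment.cjs   # hypothetical invocation for illustration
      # ... save experiment cache, upload "experiment" artifact ...

  agent:
    needs: activation
    runs-on: ubuntu-latest
    steps:
      - name: Show experiment assignments
        run: echo '${{ needs.activation.outputs.experiments }}'
```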
Just as a heads up, I was blocked by some firewall rules while working on your feedback.

⚠️ Warning: Firewall rules blocked me from connecting to one or more addresses. I tried to connect to the following addresses, but was blocked by firewall rules:

If you need me to access, download, or install something from one of these locations, you can either:
Update the logs/audit commands' run parser to extract the experiment artifact and display it to the user.
@copilot use the tech-writer agent to write a documentation page about experiments |
@copilot from an expert A/B testing research perspective, rate this implementation and suggest improvements. Do not make any code changes; write the report in report progress.
🧪 Test Quality Sentinel Report

Test Quality Score: 92/100 ✅ Excellent

Test Classification Details
View all 27 test classifications

Flagged Tests — Requires Review
No tests were flagged. All 27 tests are classified as behavioral design tests.

Language Support
Tests analyzed:

Scoring Breakdown

Verdict

Understanding Test Classifications
Design Tests (High Value) verify what the system does:
Implementation Tests (Low Value) verify how the system does it:

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators. References: §25222221598
Auto-generated draft ADR documenting the decision to add an experiments map to workflow frontmatter for balanced variant selection in A/B testing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit pushed:
🏗️ Design Decision Gate — ADR Required

This PR makes significant changes to core business logic (497 new lines in business-logic directories, threshold: 100) but does not have a linked Architecture Decision Record (ADR). AI has analyzed the PR diff and generated a draft ADR to help you get started. 📄 Draft ADR:

What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision.

Why ADRs Matter
ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you.

📋 Michael Nygard ADR Format Reference
An ADR must contain these four sections to be considered complete:

All ADRs are stored in
References: §25222221666
Pull request overview
Adds an experiments section to workflow frontmatter to support A/B experiment variant selection at runtime, and wires the selected variants into prompt interpolation/template rendering and job outputs.
Changes:
- Extend workflow data/frontmatter/schema to include `experiments: { name: [variants...] }`.
- Inject activation-job steps to restore state, pick least-used variants, persist state, and upload an experiment artifact; expose combined assignments via `needs.activation.outputs.experiments`.
- Add Go + JS logic/tests for expression remapping, experiment selection, prompt interpolation (pre-template substitution), and `"no"` falsy handling.
Show a summary per file
| File | Description |
|---|---|
| pkg/workflow/frontmatter_types.go | Adds Experiments to frontmatter config struct. |
| pkg/workflow/compiler_types.go | Stores extracted experiments on WorkflowData. |
| pkg/parser/schemas/main_workflow_schema.json | Adds experiments schema entry. |
| pkg/workflow/compiler_orchestrator_workflow.go | Extracts experiments from frontmatter into workflow data. |
| pkg/workflow/expression_extraction.go | Rewrites experiments.<name> expressions to runtime step outputs; adds env-var naming helper. |
| pkg/workflow/expression_extraction_test.go | Tests experiment expression transformation. |
| pkg/workflow/compiler_experiments.go | Generates activation-job steps for picking/persisting/uploading experiment state + env var mappings. |
| pkg/workflow/compiler_experiments_test.go | Unit tests for extraction, spec JSON building, step generation, and mappings. |
| pkg/workflow/compiler_activation_job.go | Injects experiment steps and exposes activation output experiments. |
| pkg/workflow/compiler_yaml.go | Adds experiment env-var mappings to interpolation/substitution steps. |
| actions/setup/js/pick_experiment.cjs | Implements runtime variant selection + outputs + state persistence + step summary. |
| actions/setup/js/pick_experiment.test.cjs | Vitest coverage for selection algorithm, persistence, and outputs. |
| actions/setup/js/interpolate_prompt.cjs | Adds “Step 2.5” to substitute experiment placeholders before template rendering. |
| actions/setup/js/is_truthy.cjs | Treats "no" as falsy for template evaluation. |
| actions/setup/js/is_truthy.test.cjs | Tests "no" falsy behavior. |
| pkg/constants/job_constants.go | Adds ExperimentArtifactName constant. |
| pkg/cli/logs_artifact_set.go | Adds experiment artifact set support. |
| pkg/cli/logs_artifact_set_test.go | Updates expected artifact set names. |
| .github/workflows/smoke-copilot.md | Adds a sample caveman experiment and conditional prompt block. |
| .github/workflows/smoke-copilot.lock.yml | Updates generated lock workflow to include experiment steps/outputs. |
Copilot's findings
Comments suppressed due to low confidence (2)
actions/setup/js/pick_experiment.cjs:176
- The inline comment says downstream jobs access individual experiment outputs via `needs.activation.outputs.experiment_<name>`, but the workflow wiring uses step outputs (`steps.pick-experiment.outputs.<name>`) plus a combined `needs.activation.outputs.experiments` JSON output. Updating this comment will make it clearer how consumers should actually read the chosen variant(s).
```js
// Expose the selected variant as a step output (individual per experiment).
// Downstream jobs access this via needs.activation.outputs.experiment_<name>.
core.setOutput(name, selected);
core.info(`Experiment "${name}": selected variant "${selected}" (output: ${name}=${selected})`);
```
pkg/workflow/compiler_experiments.go:140
- The comment claims the JSON output is "safe to embed in a YAML single-quoted scalar (no single quotes in output)", but `json.Marshal` does not escape `'`, so this guarantee isn't true. Please update the comment and/or change the serialization/escaping so the embedding is actually safe for all user-provided strings.
```go
// buildExperimentSpecJSON builds a compact JSON object from the experiments map.
// Uses encoding/json for proper escaping of all special characters.
// The resulting JSON is safe to embed in a YAML single-quoted scalar (no single quotes in output).
func buildExperimentSpecJSON(experiments map[string][]string, names []string) string {
```
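To make the reviewer's point concrete, here is an illustrative JavaScript sketch (the code under review is Go, and `yamlSingleQuote` is a made-up helper): JSON serialization leaves `'` untouched, and a YAML single-quoted scalar escapes a literal `'` by doubling it.

```javascript
// JSON.stringify (like Go's encoding/json.Marshal) escapes control characters
// and double quotes, but a single quote passes through unchanged:
const spec = JSON.stringify({ greeting: ["it's on", "it's off"] });
// spec still contains raw ' characters, so it is not yet safe inside '...' in YAML.

// In YAML, the only escape inside a single-quoted scalar is '' for a literal ',
// so double every single quote before embedding:
function yamlSingleQuote(s) {
  return "'" + s.replace(/'/g, "''") + "'";
}

const envLine = `GH_AW_EXPERIMENT_SPEC: ${yamlSingleQuote(spec)}`;
```

A later commit in this thread reports fixing exactly this by escaping `'` as `''` when emitting the spec into the generated YAML.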
- Files reviewed: 21/21 changed files
- Comments generated: 7
```go
// 2. Pick variants – pick_experiment.cjs (reads/writes state.json, sets step outputs)
//    Outputs: one per experiment (e.g. "caveman=yes") + "experiments" JSON blob
// 3. Experiment step summary – write a Markdown summary to GITHUB_STEP_SUMMARY
// 4. Save experiment cache – actions/cache/save keyed by workflow ID
// 5. Upload experiment artifact – actions/upload-artifact named "experiment"
```
```js
 * K is the number of variants, satisfying basic A/B statistical balance.
 *
 * Outputs:
 * - Sets GH_AW_EXPERIMENTS_<NAME>=<variant> in GITHUB_ENV for each experiment.
```
```go
// Detect experiments.NAME expressions and remap them to env.GH_AW_EXPERIMENTS_NAME
// so the substitution step reads the value set by the experiment selection step via GITHUB_ENV.
```
| "description": "A/B testing experiments. Each key is an experiment name; the value is an array of two or more variant strings. At runtime the activation job picks a variant using actions/cache to maintain consistent assignment across runs. Use ${{ experiments.<name> }} in the workflow prompt to reference the selected variant. When multiple experiments are declared, assignments are statistically balanced using a counter that round-robins across variants.", | ||
| "type": "object", |
```go
// ExperimentEnvVarName returns the GITHUB_ENV / GITHUB_OUTPUT variable name used by the
// pick_experiment step for the given experiment name.
// The name is uppercased and any character that is not A-Z, 0-9, or underscore is replaced
// with underscore, matching the JavaScript normalization in pick_experiment.cjs.
// Example: "feature1" → "GH_AW_EXPERIMENTS_FEATURE1"
// Example: "my-flag" → "GH_AW_EXPERIMENTS_MY_FLAG"
func ExperimentEnvVarName(experimentName string) string {
	return "GH_AW_EXPERIMENTS_" + normalizeJobNameForEnvVar(experimentName)
}
```
| " env:\n", | ||
| fmt.Sprintf(" GH_AW_EXPERIMENT_SPEC: '%s'\n", specJSON), | ||
| fmt.Sprintf(" GH_AW_EXPERIMENT_STATE_FILE: %s\n", experimentStateFile), | ||
| fmt.Sprintf(" GH_AW_EXPERIMENT_STATE_DIR: %s\n", experimentsCacheDir), |
```go
result := make(map[string][]string, len(rawMap))
for name, val := range rawMap {
	switch v := val.(type) {
	case []string:
		if len(v) >= 2 {
```
@copilot review all comments |
- Add pkg/cli/audit_report_experiments.go: parse state.json from the
experiment artifact directory (both logsPath root and experiment/
subdir), derive last-run variant assignments by max-count selection,
expose as ExperimentData{Assignments, CumulativeCounts}
- Add pkg/cli/audit_report_experiments_test.go: 12 unit tests
- Add Experiments *ExperimentData to AuditData; populate in buildAuditData()
- Add renderExperimentData() and render "🧪 A/B Experiments" section in
renderConsole() before the Metrics section
- Add Experiments *ExperimentData to RunData; populate in buildLogsData()
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/f8f310d4-097e-4b80-b86b-ff5b06132bfe
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
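The "max-count selection" mentioned above can be sketched roughly as follows (illustrative JavaScript with an assumed state shape; the real implementation is Go in pkg/cli/audit_report_experiments.go):

```javascript
// state maps experiment name -> { variant: cumulativeCount }. Under the
// least-used selection rule, the most recently picked variant ends up with the
// (joint) highest count, so taking the max recovers the last-run assignment.
// The state shape here is an assumption for illustration.
function deriveAssignments(state) {
  const assignments = {};
  for (const [experiment, counts] of Object.entries(state)) {
    let best = null;
    for (const [variant, count] of Object.entries(counts)) {
      if (best === null || count > counts[best]) best = variant;
    }
    assignments[experiment] = best;
  }
  return assignments;
}
```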
Done in commit 532e6de. Here's what was added:
12 unit tests cover path resolution, JSON parsing, multi-experiment extraction, and the variant selection algorithm.
@copilot the experiments artifact should be restored in the detection job so that it can be used by the detection agent.
- Fix step-list comment in compiler_experiments.go: remove non-existent "Experiment step summary" step (summary is inside pick_experiment.cjs)
- Add experiment name validation: names must match [a-zA-Z_][a-zA-Z0-9_]*; invalid names are logged and skipped (2 new test cases)
- Fix YAML single-quote escaping for GH_AW_EXPERIMENT_SPEC: escape ' as '' in YAML single-quoted scalar (new test: TestGenerateExperimentSteps_SingleQuoteEscaping)
- Fix expression_extraction.go comment: correct GITHUB_ENV → step output path
- Fix ExperimentEnvVarName comment: "replaced with underscore" → "dropped"
- Fix pick_experiment.cjs header: setOutput not GITHUB_ENV; fix downstream access expression needs.activation.outputs.<name>
- Add propertyNames pattern to experiments JSON schema (enforces identifiers)
- Create docs/src/content/docs/guides/experiments.md (how-to guide)
- Rebuild and recompile 205 workflow lock files

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/6e6bc500-04f5-47c0-a97b-ea46cb2d2517
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
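The name-validation rule added in this commit (also enforced via the schema's propertyNames pattern) amounts to a filter like this sketch (illustrative JavaScript; `filterExperiments` is a made-up name, and the real check lives in the Go compiler):

```javascript
// Valid experiment names start with a letter or underscore and continue with
// letters, digits, or underscores; anything else is logged and skipped.
const VALID_EXPERIMENT_NAME = /^[a-zA-Z_][a-zA-Z0-9_]*$/;

function filterExperiments(experiments) {
  const valid = {};
  for (const [name, variants] of Object.entries(experiments)) {
    if (VALID_EXPERIMENT_NAME.test(name)) {
      valid[name] = variants;
    } else {
      console.log(`skipping invalid experiment name: ${name}`);
    }
  }
  return valid;
}
```

Note this is stricter than the env-var normalization discussed earlier: a hyphenated name like `my-flag` is simply rejected rather than normalized.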
Done in commit bea5f0c. A how-to guide is now at
Expert analysis written in the PR description (no code changes). Summary: 6.5/10. The balanced round-robin counter is sound for allocation, but the implementation has no statistical significance framework or outcome metric collection — meaning you can distribute variants evenly but cannot yet conclude anything from the results. The most impactful gaps are: (1) no per-run assignment log for cross-run correlation, (2) no metric/outcome tracking, (3)
All 7 PR review comments addressed in commit bea5f0c:
- Add buildExperimentArtifactDownloadSteps() in compiler_experiments.go: downloads the experiment artifact to /tmp/gh-aw/experiments/ so the detection agent can read current variant assignments from state.json. Returns nil when no experiments are declared (no-op for non-experiment workflows).
- Call the new helper in buildDetectionJob() (threat_detection.go) after the agent output download step, using the activation-downstream prefix.
- Add 3 unit tests for buildExperimentArtifactDownloadSteps and one integration test TestDetectionJobDownloadsExperimentArtifact verifying step presence/absence based on experiment declarations.
- Recompile 205 workflow lock files (smoke-copilot detection job now includes Download experiment artifact step).

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/a8ddb085-cb12-4b28-9fd9-51760ccf429e
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done in commit cb6a471.
Summary
Adds an `experiments` section to the workflow frontmatter that enables A/B testing by defining groups of input macro variants. At runtime the activation job picks a statistically balanced variant and exposes it as a step output and activation job output, making it available in the workflow prompt via `{{#if experiments.name }}` template conditionals. Experiment assignments are also parsed and displayed in the `gh aw audit` and `gh aw logs` commands.

Usage
How it works
- The `experiments` map (name → [variants...]) is extracted into `WorkflowData.Experiments`. Experiment names must be valid identifiers (`[a-zA-Z_][a-zA-Z0-9_]*`); invalid names are logged and skipped.
- `${{ experiments.name }}` is rewritten to `steps.pick-experiment.outputs.name` during expression extraction so the placeholder substitution step reads the runtime-selected value from the step output.
- Experiment state is restored via `actions/cache/restore` keyed by workflow ID (the restore key allows cross-run state retrieval).
- `pick_experiment.cjs` reads the persisted counter state, selects the least-used variant per experiment (balanced round-robin across all variants), sets individual step outputs (`core.setOutput(name, variant)`) and a combined JSON output (`core.setOutput('experiments', JSON)`), and writes a Markdown step summary.
- State is persisted via `actions/cache/save`.
- An artifact (`experiment`) is uploaded for post-run analysis via `gh aw audit --artifacts experiment` or `gh aw logs`.
- `needs.activation.outputs.experiments` exposes the full variant assignment JSON blob for downstream jobs.
- `{{#if experiments.name }}` blocks in the prompt are resolved correctly: `ExperimentExpressionMappings()` ensures `GH_AW_EXPERIMENTS_*` env vars are set from step outputs in the interpolation step, and a new Step 2.5 in `interpolate_prompt.cjs` substitutes `__GH_AW_EXPERIMENTS_*__` placeholders before template rendering so the actual variant value (yes/no) is evaluated.
- `isTruthy` is extended to treat `"no"` as falsy, enabling the `[yes, no]` variant pattern.
- `gh aw audit` renders a 🧪 A/B Experiments section showing variant assignments and cumulative counts when the experiment artifact is present.
- `gh aw logs --json` includes experiment data in `RunData`.
- The detection job downloads the `experiment` artifact to `/tmp/gh-aw/experiments/` so the detection agent can read the current variant assignments from `state.json`.

Statistical design
The variant with the lowest cumulative invocation count is selected next; ties are broken by variant order. Over N runs this distributes each variant approximately N/K times (K = variant count), satisfying basic A/B balance with zero configuration.
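The rule in the paragraph above can be sketched as (an illustrative reimplementation, not the pick_experiment.cjs source):

```javascript
// Pick the variant with the lowest cumulative count; ties are broken by the
// declared variant order; then record the pick by incrementing its counter.
function pickVariant(variants, counts) {
  let selected = variants[0];
  for (const v of variants) {
    if ((counts[v] || 0) < (counts[selected] || 0)) selected = v;
  }
  counts[selected] = (counts[selected] || 0) + 1;
  return selected;
}
```

Starting from an empty counter state, six calls with `["yes", "no"]` alternate yes, no, yes, no, yes, no: each of the K = 2 variants is picked N/K = 3 times.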
Changes
| File | Description |
|---|---|
| pkg/constants/job_constants.go | `ExperimentArtifactName = "experiment"` |
| pkg/workflow/frontmatter_types.go | `Experiments map[string][]string` to `FrontmatterConfig` |
| pkg/workflow/compiler_types.go | `Experiments map[string][]string` to `WorkflowData` |
| pkg/parser/schemas/main_workflow_schema.json | `experiments` object schema with `propertyNames` pattern constraint |
| pkg/workflow/expression_extraction.go | `transformExperimentsExpression()` maps to step output; `ExperimentEnvVarName()` |
| pkg/workflow/compiler_experiments.go | `ExperimentExpressionMappings()` for env var wiring; name validation; YAML single-quote escaping for spec JSON; `buildExperimentArtifactDownloadSteps()` for detection job |
| pkg/workflow/compiler_experiments_test.go | |
| pkg/workflow/compiler_activation_job.go | `needs.activation.outputs.experiments` |
| pkg/workflow/compiler_orchestrator_workflow.go | |
| pkg/workflow/compiler_yaml.go | |
| pkg/workflow/threat_detection.go | |
| pkg/workflow/detection_success_test.go | |
| actions/setup/js/pick_experiment.cjs | `core.setOutput` per experiment + combined JSON |
| actions/setup/js/pick_experiment.test.cjs | |
| actions/setup/js/interpolate_prompt.cjs | |
| actions/setup/js/is_truthy.cjs | `"no"` as falsy |
| actions/setup/js/is_truthy.test.cjs | `"no"` falsy behaviour |
| .github/workflows/smoke-copilot.md | `caveman: [yes, no]` experiment with conditional prompt block |
| pkg/cli/logs_artifact_set.go | `ArtifactSetExperiment = "experiment"` |
| pkg/cli/logs_artifact_set_test.go | |
| pkg/cli/audit_report_experiments.go | `state.json` from experiment artifact; derive last-run assignments |
| pkg/cli/audit_report_experiments_test.go | |
| pkg/cli/audit_report.go | `Experiments *ExperimentData` to `AuditData`; populate in `buildAuditData()` |
| pkg/cli/audit_report_render.go | 🧪 A/B Experiments section in `renderConsole()` |
| pkg/cli/logs_report.go | `Experiments *ExperimentData` to `RunData`; populate in `buildLogsData()` |
| docs/src/content/docs/guides/experiments.md | |