Changelog: Bundle upload support and Private-Public Bundling SQS Lambda#3163
Changelog: Bundle upload support and Private-Public Bundling SQS Lambda#3163
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new AWS Lambda "changelog-scrubber" that processes SQS-delivered S3 object events: deletions remove the corresponding public S3 key; creations either passthrough JSON (notably registry-index.json) or download YAML/YML, deserialize bundle vs changelog, apply an allowlist-derived sanitization (including free-text scrubbing) and write sanitized YAML to a public S3 bucket. Adds a Docker-based reusable GitHub Actions workflow to build and upload the Lambda bootstrap binary, extends LinkAllowlistSanitizer with allowlist construction and text-scrubbing, and adds bundle handling to the changelog upload flow. Sequence Diagram(s)sequenceDiagram
participant S3 as S3 (Private)
participant SQS as SQS Queue
participant Lambda as Changelog Scrubber Lambda
participant S3Pub as S3 (Public)
S3->>SQS: Emit object event
SQS->>Lambda: Invoke with SQS event batch
Lambda->>Lambda: Parse batch & S3 events
alt ObjectRemoved
Lambda->>S3Pub: Delete object
S3Pub-->>Lambda: Deleted / NotFound
else ObjectCreated
alt JSON (registry-index.json or .json)
Lambda->>S3: Get object
S3-->>Lambda: Content
Lambda->>S3Pub: PutObject (passthrough)
else YAML/YML
Lambda->>S3: Get object
S3-->>Lambda: YAML content
Lambda->>Lambda: Deserialize (bundle vs changelog)
Lambda->>Lambda: Build allowlist (assembler.yml)
Lambda->>Lambda: Apply allowlist & scrub text
Lambda->>S3Pub: PutObject (application/yaml)
end
end
Lambda->>SQS: Return SQSBatchResponse (per-message failures)
Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/build-changelog-scrubber-lambda.yml:
- Around line 22-38: The workflow attempts to docker cp /app/.artifacts/publish
from the built image but the Dockerfile does not create that path, causing the
copy to fail before stat "${BINARY_PATH}"; update the docker cp step (still
referencing the image name used in docker create: changelog-scrubber:latest) to
copy the actual publish output path that the Dockerfile produces (or adjust the
Dockerfile to emit .artifacts/publish), and ensure the copied file location
matches the BINARY_PATH env variable so stat "${BINARY_PATH}" succeeds (refs:
BINARY_PATH, the docker cp/docke r create lines, and stat "${BINARY_PATH}").
In `@src/infra/docs-lambda-changelog-scrubber/lambda.DockerFile`:
- Around line 30-34: The Dockerfile is computing a runtime identifier into
/tmp/rid using TARGETARCH/TARGETOS but then ignores it by hardcoding "-r
linux-x64" and omitting the publish output folder; update the dotnet publish RUN
that targets src/infra/docs-lambda-changelog-scrubber to read the computed RID
(cat /tmp/rid) instead of linux-x64 and add the explicit output directory flag
so artifacts land in /app/.artifacts/publish (ensure the command uses the
--output or -o option to publish to that path).
In `@src/infra/docs-lambda-changelog-scrubber/Program.cs`:
- Around line 176-180: The branch that currently logs a warning and returns the
original content when LinkAllowlistSanitizer.TryApplyBundle(...) fails should
instead fail closed: throw an exception so the message is retried or goes to DLQ
rather than publishing unsafe content; replace the context.Logger.LogWarning +
return content behavior in the TryApplyBundle failure path (the block handling
LinkAllowlistSanitizer.TryApplyBundle(collector, bundle, allowRepos, owner,
repo, out var sanitized, out var changed)) with a thrown exception that includes
context (e.g., input identifiers) and ensure you make the same change for the
equivalent failure branch later in the file that handles the same sanitizer
call.
- Around line 111-115: The current check in Program.cs that passes through any
object whose key ends with ".json" (the branch that calls
CopyPassThrough(s3Client, sourceBucket, key, context)) is too broad; change it
to only allow a small explicit allowlist of known-safe filenames (e.g.,
"registry-index.json") before calling CopyPassThrough. Locate the code that
inspects the key variable and replace the EndsWith(".json", ...) condition with
a check that compares the file name portion (use Path.GetFileName(key) or
equivalent) against a HashSet or array of allowedJsonNames (case-insensitive)
and only call CopyPassThrough when the file name is in that allowlist; otherwise
continue with normal scrubbing logic.
In `@src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs`:
- Around line 159-192: Apply the reviewer request to avoid emitting reversible
private sentinels by filtering/replacing entries returned from
ApplyToReferenceList before assigning to Prs/Issues: after calling
ApplyToReferenceList (used for PRs and Issues) run a helper (e.g.,
DropPrivateSentinels) that removes any entries that start with the sentinel
prefix (or alternatively replaces them with a non-reversible marker) and update
anyRewritten accordingly, then assign the filtered list to sanitized.Prs and
sanitized.Issues; keep ScrubText usage for Description/Impact/Action unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 71d66081-7fae-43cf-a4fc-5b1c30199b54
📒 Files selected for processing (12)
.github/workflows/build-changelog-scrubber-lambda.yml.github/workflows/release.ymlsrc/infra/docs-lambda-changelog-scrubber/Program.cssrc/infra/docs-lambda-changelog-scrubber/README.mdsrc/infra/docs-lambda-changelog-scrubber/SerializerContext.cssrc/infra/docs-lambda-changelog-scrubber/aws-lambda-tools-defaults.jsonsrc/infra/docs-lambda-changelog-scrubber/docs-lambda-changelog-scrubber.csprojsrc/infra/docs-lambda-changelog-scrubber/lambda.DockerFilesrc/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cssrc/services/Elastic.Changelog/Uploading/ChangelogUploadService.cstests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cstests/Elastic.Changelog.Tests/Uploading/ChangelogUploadServiceTests.cs
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs`:
- Line 19: The GeneratedRegex attribute on LinkAllowlistSanitizer (the attribute
decorating the GitHub URL regex) is currently case-sensitive and allows
uppercase variants like HTTPS://github.com to bypass scrubbing; update the
attribute to use RegexOptions.IgnoreCase (matching ProfileFilterResolver's
approach) so the pattern for matching github.com pull/issues URLs is
case-insensitive and will correctly scrub private references.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 989d6c36-4602-4eab-8ef5-21891a1921d6
📒 Files selected for processing (3)
src/infra/docs-lambda-changelog-scrubber/Program.cssrc/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cstests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cs
🚧 Files skipped from review as they are similar to previous changes (2)
- src/infra/docs-lambda-changelog-scrubber/Program.cs
- tests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cs
Mpdreamz
left a comment
There was a problem hiding this comment.
Request changes
Thanks for wiring up the private → public pipeline and the deserialize → mutate → serialize scrubbing path. With how we plan to use these buckets, I think we need a few adjustments before this is safe to ship.
Bucket roles (why scrubbing must be strict)
- Private S3 backs indexing into Elasticsearch (internal/trusted context).
- Public S3 is what docs-builder will use to render release notes on anyone’s machine (untrusted context).
So the public objects must not carry internal repo names, PR numbers, or URLs that slipped through a new field or a renamed property. The bar for the public artifact is “no disclose,” not only “no clickable link.”
Sentinel-style redaction vs real removal
Using # PRIVATE: <original> on structured refs can still leave the full original reference in plain text in YAML that lands in the public bucket. That conflicts with the goal above: if the private copy remains authoritative in private S3, the public copy should remove disallowed references (or replace them with a non-identifying placeholder, e.g. a fixed string with no owner/repo/URL), not preserve them after a prefix.
Please align bundle scrubbing with that requirement: no residual GitHub URLs or owner/repo#N patterns in public bundle YAML unless they are explicitly allowlisted.
Post-serialize validation
Deserialize → mutate → serialize is the right core loop, but it won’t catch:
- New YAML fields that contain links in a future schema revision.
- Renamed fields where scrubbing wasn’t extended.
Please add a post-serialize validation pass on the final string destined for the public bucket (or equivalently, a second parse + walk), e.g.:
- Assert the serialized text contains no GitHub PR/issue URLs and no short-form
owner/repo#references outside an explicit allowlist (reuse or share the same patterns asScrubText/ link detection). - Fail the Lambda batch item (or fail CI tests) if validation fails so we don’t quietly publish a leaky object.
That gives us defense in depth when the model changes.
Summary
Request:
- Public bundle/changelog YAML should remove or non-identifying-redact disallowed references so internal details aren’t recoverable from the file.
- Post-serialize validation on emitted content to guard against new/renamed fields leaking links.
Happy to revisit once those are addressed.
|
Changes for residual private references:
Changes for post-serialize validation
The Lambda now calls this on every output (both changed and unchanged paths) before writing to the public bucket. If validation fails, the exception bubbles up and the SQS message goes to DLQ -- no unsafe content gets published. Added some test scenarios to cover scrubbing:
|
Mpdreamz
left a comment
There was a problem hiding this comment.
Request changes
What we want from the two buckets
- Private S3 is the trusted store: full YAML, real PR/issue links, whatever indexing and internal workflows need. It does not need redaction hints.
- Public S3 is what docs-builder will use to render release notes on any machine. That copy should contain only allowlisted GitHub references. Anything else should be absent, not replaced with a marker that something used to be there.
So the public pipeline should not depend on an intermediate # PRIVATE: representation at all.
Why drop the sentinel round-trip for public output
Right now the flow can read as: rewrite disallowed refs to # PRIVATE: …, then strip sentinels, serialize, validate (plus validation that still looks for residual # PRIVATE:).
That works, but it adds moving parts and two different concepts of “clean”:
- Mental model — Operators and future maintainers should only need one rule for public: disallowed links are removed. They should not have to reason about a string format that is explicitly not allowed in the final artifact anyway.
- Complexity — If nothing outside private S3 needs “a hint something was redacted,” the sanitizer’s public path can omit disallowed list entries (or never add them) directly, instead of tag then strip. Same outcome, fewer steps, fewer places for ordering bugs (e.g. strip forgotten on a code path).
- Validation —
ValidateNoPrivateReferencesshould primarily enforce “no non-allowlisted link-shaped content” in the serialized string. A separate “no# PRIVATE:” check is reasonable as a temporary guard while sentinel output still exists somewhere; once public code never emits that token, that check becomes redundant. Prefer one contract: public YAML must not contain private links or redaction sentinels, achieved by never writing either.
Requested changes
- Public / Lambda scrubbing path — For bundle (and any shared entry logic used only for public), do not use
# PRIVATE:as an intermediary. Resolve each PR/issue reference against the allowlist; drop disallowed entries (or equivalent) before serialization. Keep deserialize → mutate (remove) → serialize → validate string without a sentinel phase for this path. - Refactor / split if needed — If other callers still need sentinel behavior for private-side tooling, isolate that behind an API that is not used when writing to the public bucket, or branch clearly so “public emission” never constructs
# PRIVATE:. - Strip / validate — Remove
StripBundleSentinels-style workarounds once they are unnecessary, and tightenValidateNoPrivateReferencesto match the final contract (link scans + any residual rules you still need during migration).
Summary
Private artifacts can keep private links. Public artifacts should be link-scrubbed by removal, not by a two-phase sentinel encoding. Please adjust this PR so the public scrubbing path reflects that simpler model and the rationale above.
Happy to re-review once that’s in place.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs (1)
19-23:⚠️ Potential issue | 🔴 CriticalBroaden GitHub URL detection before publishing.
Line 19 only matches lowercase GitHub PR/issue URLs, so values like
HTTPS://github.com/private/repo/pull/1orhttps://github.com/private/repo/blob/main/file.mdbypass both scrubbing andValidateNoPrivateReferences. Make this matcher case-insensitive and match repo-scoped GitHub URLs, not just/pull|issues/. The case-sensitivity part was already flagged in a prior review.Suggested fix
- [GeneratedRegex(@"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)/(?:pull|issues)/\d+", RegexOptions.None)] + [GeneratedRegex( + @"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)(?:/[^\s""'<>)]*)?", + RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)] private static partial Regex GitHubUrlRegex();Read-only verification:
#!/bin/bash # Verifies the current matcher misses uppercase GitHub URLs and repo/blob URLs. rg -n -C2 'GeneratedRegex\(@"https\?://github\\\.com' src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs python - <<'PY' import re current = re.compile(r"https?://github\.com/(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)/(?:pull|issues)/\d+") samples = [ "HTTPS://github.com/private/repo/pull/123", "https://github.com/private/repo", "https://github.com/private/repo/blob/main/file.md", "https://github.com/private/repo/issues/123", ] for sample in samples: print(f"{sample}: {'MATCH' if current.search(sample) else 'MISS'}") PYExpected output: the first three samples show
MISSwith the current matcher.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs` around lines 19 - 23, The GitHubUrlRegex is too strict and case-sensitive so URLs like HTTPS://github.com/... or repo-scoped paths (blob/tree/etc.) bypass scrubbing; update the GeneratedRegex for GitHubUrlRegex to be case-insensitive (RegexOptions.IgnoreCase) and broaden its pattern to match any repo-scoped path not just /pull|issues (e.g. change @"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)/(?:pull|issues)/\d+" to a pattern that allows an optional slash and any trailing path such as @"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)(?:$|/.*)" and add RegexOptions.IgnoreCase); also add RegexOptions.IgnoreCase to ShortFormRefRegex so short owner/repo#123 forms are matched case-insensitively, ensuring ValidateNoPrivateReferences and LinkAllowlistSanitizer now catch uppercase and blob/tree URLs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs`:
- Around line 19-23: The GitHubUrlRegex is too strict and case-sensitive so URLs
like HTTPS://github.com/... or repo-scoped paths (blob/tree/etc.) bypass
scrubbing; update the GeneratedRegex for GitHubUrlRegex to be case-insensitive
(RegexOptions.IgnoreCase) and broaden its pattern to match any repo-scoped path
not just /pull|issues (e.g. change
@"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)/(?:pull|issues)/\d+"
to a pattern that allows an optional slash and any trailing path such as
@"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)(?:$|/.*)"
and add RegexOptions.IgnoreCase); also add RegexOptions.IgnoreCase to
ShortFormRefRegex so short owner/repo#123 forms are matched case-insensitively,
ensuring ValidateNoPrivateReferences and LinkAllowlistSanitizer now catch
uppercase and blob/tree URLs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 38ebf793-39b1-438d-baaf-7b10432f7fd2
📒 Files selected for processing (3)
src/infra/docs-lambda-changelog-scrubber/Program.cssrc/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cstests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cs
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cs
- src/infra/docs-lambda-changelog-scrubber/Program.cs
Public / Lambda scrubbing path:The public path now uses
For bundles, Refactor / split if needed The split is now clean at the method level: Private-side: Public-side: The two paths share only the parsing utility ( Remove
|
Mpdreamz
left a comment
There was a problem hiding this comment.
LGTM
The private → public scrub + post-serialize validation is in good shape for the intended defense-in-depth story. Shapes up.
Follow-up (not blocking)
Align scrubbing/validation with every PR/issue format we document or accept — e.g. catalog what docs-builder (and changelog/bundle schema) allow in prs / issues and in prose fields (bare numbers, owner/repo#N, full URLs, alternate hosts). ValidateNoPrivateReferences and ScrubText only match a narrow set of GitHub PR/issue URL and short-form patterns today. If the product is effectively freeform strings in those fields, we should tighten the contract (schema/docs) and/or expand scrub + validation in lockstep so we do not give the impression of “all links removed” when some shapes slip through.
Please track a follow-up to: (1) list supported reference formats, (2) assert scrub/validation each cover the same set, and (3) if inputs stay freeform, add validation or heuristics appropriate to that.
Gaps in current pattern coverage (defense-in-depth / known blind spots)
Below are examples the current GitHubUrlRegex + ShortFormRefRegex (and thus post-validate) do not flag as disallowed by owner/repo, even though they may still point at private work. TryApply paths that use ChangelogTextUtilities.TryGetGitHubRepo can accept additional shapes; validate/scrub are narrower.
| Example | Notes |
|---|---|
https://www.github.com/elastic/secret/pull/42 |
www.github.com — not matched (pattern expects github.com immediately after //) |
https://github.com/elastic/secret |
Repo home only, no pull/issues path |
https://github.com/elastic/secret/wiki/Page |
Wiki / non–PR–issue path |
https://github.com/elastic/secret/compare/8.0...8.1 |
Compare (no /pull/, /issues/) |
https://github.com/elastic/secret/blob/main/README.md |
Blob/tree paths |
https://github.com/elastic/secret/commit/abc123def |
Commit link |
git@github.com:elastic/secret.git |
SSH remotes in text |
https://gh.internal.example.com/... |
Other GitHub hostnames (Enterprise) |
(Allowlisted https://github.com/org/repo/pull|issues/n and allowlisted owner/repo#n do line up and pass as expected.)
This table is to inform the follow-up: either narrow accepted input, or broadened detection where we need parity.
Thanks for the work on this — happy to help refine the follow-up if useful.
This pull request introduces the new "changelog-scrubber" Lambda function, which sanitizes changelog and bundle files in S3 by removing private repository references before making them public. The changes include the Lambda's implementation, build and deployment automation, and supporting project files. The release workflow is updated to build, package, and deploy this Lambda alongside the existing link index updater Lambda.
Key changes:
Changelog Scrubber Lambda Implementation
changelog-scrubberLambda function inProgram.csto process SQS events, scrub changelog and bundle YAML files usingLinkAllowlistSanitizer, and copy sanitized versions to a public S3 bucket. Handles S3 object creation/removal events and supports pass-through for specific JSON files.SerializerContext.csfor optimized JSON serialization of Lambda event types.docs-lambda-changelog-scrubber.csprojwith dependencies, embedded resource configuration, and AOT publishing settings.lambda.DockerFile) for building the Lambda binary on Amazon Linux 2023 with .NET 10 AOT.aws-lambda-tools-defaults.json.README.mdexplaining build, event handling, and scrubbing logic.CI/CD and Release Workflow Updates
build-changelog-scrubber-lambda.yml) to build and archive the Lambda binary.release.ymlto build, package, and deploy the changelog-scrubber Lambda, including artifact handling and release asset upload. Also refactored jobs for clarity and robustness. [1] [2] [3] [4] [5] [6]Link Allowlist Scrubbing Enhancements
LinkAllowlistSanitizerto support scrubbing of individual changelog entries and free-text fields, and added regexes for GitHub URL and short-form reference detection. [1] [2]These changes collectively add an automated, secure way to sanitize and publish changelog and bundle files, ensuring no private repository references are leaked to the public.