Changelog: Bundle upload support and Private-Public Bundling SQS Lambda#3163

Merged
cotti merged 11 commits into main from
feature/private_bundling
Apr 22, 2026

Conversation

@cotti
Contributor

@cotti cotti commented Apr 21, 2026

This pull request introduces the new "changelog-scrubber" Lambda function, which sanitizes changelog and bundle files in S3 by removing private repository references before making them public. The changes include the Lambda's implementation, build and deployment automation, and supporting project files. The release workflow is updated to build, package, and deploy this Lambda alongside the existing link index updater Lambda.

Key changes:

Changelog Scrubber Lambda Implementation

  • Added the changelog-scrubber Lambda function in Program.cs to process SQS events, scrub changelog and bundle YAML files using LinkAllowlistSanitizer, and copy sanitized versions to a public S3 bucket. Handles S3 object creation/removal events and supports pass-through for specific JSON files.
  • Created SerializerContext.cs for optimized JSON serialization of Lambda event types.
  • Added project file docs-lambda-changelog-scrubber.csproj with dependencies, embedded resource configuration, and AOT publishing settings.
  • Included a Dockerfile (lambda.DockerFile) for building the Lambda binary on Amazon Linux 2023 with .NET 10 AOT.
  • Added AWS Lambda deployment defaults in aws-lambda-tools-defaults.json.
  • Provided a detailed README.md explaining build, event handling, and scrubbing logic.

CI/CD and Release Workflow Updates

  • Added a new GitHub Actions workflow (build-changelog-scrubber-lambda.yml) to build and archive the Lambda binary.
  • Updated release.yml to build, package, and deploy the changelog-scrubber Lambda, including artifact handling and release asset upload. Also refactored jobs for clarity and robustness.

Link Allowlist Scrubbing Enhancements

  • Improved LinkAllowlistSanitizer to support scrubbing of individual changelog entries and free-text fields, and added regexes for GitHub URL and short-form reference detection.

These changes collectively add an automated, secure way to sanitize and publish changelog and bundle files, ensuring no private repository references are leaked to the public.

@cotti cotti self-assigned this Apr 21, 2026
@cotti cotti added the feature label Apr 21, 2026
@cotti cotti requested a review from a team as a code owner April 21, 2026 13:51
@cotti cotti requested a review from reakaleek April 21, 2026 13:51
@coderabbitai coderabbitai Bot added ci and removed feature labels Apr 21, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a new AWS Lambda "changelog-scrubber" that processes SQS-delivered S3 object events: deletions remove the corresponding public S3 key; creations either passthrough JSON (notably registry-index.json) or download YAML/YML, deserialize bundle vs changelog, apply an allowlist-derived sanitization (including free-text scrubbing) and write sanitized YAML to a public S3 bucket. Adds a Docker-based reusable GitHub Actions workflow to build and upload the Lambda bootstrap binary, extends LinkAllowlistSanitizer with allowlist construction and text-scrubbing, and adds bundle handling to the changelog upload flow.

Sequence Diagram(s)

sequenceDiagram
    participant S3 as S3 (Private)
    participant SQS as SQS Queue
    participant Lambda as Changelog Scrubber Lambda
    participant S3Pub as S3 (Public)

    S3->>SQS: Emit object event
    SQS->>Lambda: Invoke with SQS event batch
    Lambda->>Lambda: Parse batch & S3 events

    alt ObjectRemoved
        Lambda->>S3Pub: Delete object
        S3Pub-->>Lambda: Deleted / NotFound
    else ObjectCreated
        alt JSON (registry-index.json or .json)
            Lambda->>S3: Get object
            S3-->>Lambda: Content
            Lambda->>S3Pub: PutObject (passthrough)
        else YAML/YML
            Lambda->>S3: Get object
            S3-->>Lambda: YAML content
            Lambda->>Lambda: Deserialize (bundle vs changelog)
            Lambda->>Lambda: Build allowlist (assembler.yml)
            Lambda->>Lambda: Apply allowlist & scrub text
            Lambda->>S3Pub: PutObject (application/yaml)
        end
    end

    Lambda->>SQS: Return SQSBatchResponse (per-message failures)
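
The routing in the diagram above can be sketched in a few lines. This is a minimal Python illustration, not the Lambda's C# implementation: `S3Record`, `route`, `process_batch`, and the action names are hypothetical, and the JSON branch mirrors the diagram as drawn (any `.json` key is passed through). The `batchItemFailures` / `itemIdentifier` shape is the one Lambda uses for SQS partial batch responses.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class S3Record:
    """Hypothetical, simplified view of one S3 notification record."""
    message_id: str  # id of the SQS message that carried the record
    event_name: str  # e.g. "ObjectCreated:Put" or "ObjectRemoved:Delete"
    key: str         # object key in the private bucket


def route(record: S3Record) -> str:
    """Pick the action for one record, following the diagram's branches."""
    if record.event_name.startswith("ObjectRemoved"):
        return "delete-public"
    if record.event_name.startswith("ObjectCreated"):
        suffix = PurePosixPath(record.key).suffix.lower()
        if suffix == ".json":
            return "pass-through"
        if suffix in (".yaml", ".yml"):
            return "scrub-and-publish"
    return "ignore"


def process_batch(records, handle):
    """Run handle(record, action) per record and report failures per message."""
    failures = []
    for record in records:
        try:
            handle(record, route(record))
        except Exception:
            # Only the failed message is reported, so SQS retries just that one.
            failures.append({"itemIdentifier": record.message_id})
    return {"batchItemFailures": failures}
```

Reporting failures per message means one bad object does not force the whole batch to be retried.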
Suggested labels

ci, enhancement

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main additions: bundle upload support and a new SQS Lambda for sanitizing changelog/bundle files before public release. |
| Description check | ✅ Passed | The description comprehensively covers the changeset: changelog-scrubber Lambda implementation, CI/CD workflow updates, and link allowlist sanitization enhancements, all of which are reflected in the modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/build-changelog-scrubber-lambda.yml:
- Around line 22-38: The workflow attempts to docker cp /app/.artifacts/publish
from the built image but the Dockerfile does not create that path, causing the
copy to fail before stat "${BINARY_PATH}"; update the docker cp step (still
referencing the image name used in docker create: changelog-scrubber:latest) to
copy the actual publish output path that the Dockerfile produces (or adjust the
Dockerfile to emit .artifacts/publish), and ensure the copied file location
matches the BINARY_PATH env variable so stat "${BINARY_PATH}" succeeds (refs:
BINARY_PATH, the docker cp/docker create lines, and stat "${BINARY_PATH}").

In `@src/infra/docs-lambda-changelog-scrubber/lambda.DockerFile`:
- Around line 30-34: The Dockerfile is computing a runtime identifier into
/tmp/rid using TARGETARCH/TARGETOS but then ignores it by hardcoding "-r
linux-x64" and omitting the publish output folder; update the dotnet publish RUN
that targets src/infra/docs-lambda-changelog-scrubber to read the computed RID
(cat /tmp/rid) instead of linux-x64 and add the explicit output directory flag
so artifacts land in /app/.artifacts/publish (ensure the command uses the
--output or -o option to publish to that path).

In `@src/infra/docs-lambda-changelog-scrubber/Program.cs`:
- Around line 176-180: The branch that currently logs a warning and returns the
original content when LinkAllowlistSanitizer.TryApplyBundle(...) fails should
instead fail closed: throw an exception so the message is retried or goes to DLQ
rather than publishing unsafe content; replace the context.Logger.LogWarning +
return content behavior in the TryApplyBundle failure path (the block handling
LinkAllowlistSanitizer.TryApplyBundle(collector, bundle, allowRepos, owner,
repo, out var sanitized, out var changed)) with a thrown exception that includes
context (e.g., input identifiers) and ensure you make the same change for the
equivalent failure branch later in the file that handles the same sanitizer
call.
- Around line 111-115: The current check in Program.cs that passes through any
object whose key ends with ".json" (the branch that calls
CopyPassThrough(s3Client, sourceBucket, key, context)) is too broad; change it
to only allow a small explicit allowlist of known-safe filenames (e.g.,
"registry-index.json") before calling CopyPassThrough. Locate the code that
inspects the key variable and replace the EndsWith(".json", ...) condition with
a check that compares the file name portion (use Path.GetFileName(key) or
equivalent) against a HashSet or array of allowedJsonNames (case-insensitive)
and only call CopyPassThrough when the file name is in that allowlist; otherwise
continue with normal scrubbing logic.

In `@src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs`:
- Around line 159-192: Apply the reviewer request to avoid emitting reversible
private sentinels by filtering/replacing entries returned from
ApplyToReferenceList before assigning to Prs/Issues: after calling
ApplyToReferenceList (used for PRs and Issues) run a helper (e.g.,
DropPrivateSentinels) that removes any entries that start with the sentinel
prefix (or alternatively replaces them with a non-reversible marker) and update
anyRewritten accordingly, then assign the filtered list to sanitized.Prs and
sanitized.Issues; keep ScrubText usage for Description/Impact/Action unchanged.
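
The pass-through finding above (Program.cs, lines 111-115) asks for an explicit file-name allowlist instead of a blanket ".json" suffix check. A minimal Python sketch of that idea follows; `ALLOWED_JSON_NAMES` and `is_pass_through` are hypothetical names, and the single allowlisted entry is the only file name this thread mentions. It is an illustration of the check, not the C# code that was merged.

```python
from pathlib import PurePosixPath

# Assumption: only file names listed here are safe to copy verbatim.
ALLOWED_JSON_NAMES = frozenset({"registry-index.json"})


def is_pass_through(key: str) -> bool:
    """True only when the key's file name is on the explicit allowlist."""
    # Compare the final path segment, case-insensitively, not the suffix.
    name = PurePosixPath(key).name.lower()
    return name in ALLOWED_JSON_NAMES
```

With this shape, a new JSON file in the private bucket is not published until someone adds its name to the allowlist.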

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 71d66081-7fae-43cf-a4fc-5b1c30199b54

📥 Commits

Reviewing files that changed from the base of the PR and between 0c07334 and aacd6e3.

📒 Files selected for processing (12)
  • .github/workflows/build-changelog-scrubber-lambda.yml
  • .github/workflows/release.yml
  • src/infra/docs-lambda-changelog-scrubber/Program.cs
  • src/infra/docs-lambda-changelog-scrubber/README.md
  • src/infra/docs-lambda-changelog-scrubber/SerializerContext.cs
  • src/infra/docs-lambda-changelog-scrubber/aws-lambda-tools-defaults.json
  • src/infra/docs-lambda-changelog-scrubber/docs-lambda-changelog-scrubber.csproj
  • src/infra/docs-lambda-changelog-scrubber/lambda.DockerFile
  • src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs
  • src/services/Elastic.Changelog/Uploading/ChangelogUploadService.cs
  • tests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cs
  • tests/Elastic.Changelog.Tests/Uploading/ChangelogUploadServiceTests.cs

@coderabbitai coderabbitai Bot added documentation Improvements or additions to documentation feature and removed ci labels Apr 21, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs`:
- Line 19: The GeneratedRegex attribute on LinkAllowlistSanitizer (the attribute
decorating the GitHub URL regex) is currently case-sensitive and allows
uppercase variants like HTTPS://github.com to bypass scrubbing; update the
attribute to use RegexOptions.IgnoreCase (matching ProfileFilterResolver's
approach) so the pattern for matching github.com pull/issues URLs is
case-insensitive and will correctly scrub private references.
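
To see why the flag matters, here is a small Python sketch. The pattern is a Python translation of the GitHub URL matcher quoted later in this thread (named groups use `(?P<name>...)`), and `re.IGNORECASE` stands in for `RegexOptions.IgnoreCase`. The sample URL is made up for illustration.

```python
import re

# Python translation of the GitHub PR/issue URL pattern quoted in this thread.
PATTERN = (
    r"https?://github\.com/(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)"
    r"/(?:pull|issues)/\d+"
)

case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.IGNORECASE)

# Hypothetical reference with an upper-case scheme and mixed-case host.
url = "HTTPS://GitHub.com/private/repo/pull/1"

missed = case_sensitive.search(url) is None         # slips past scrubbing
caught = case_insensitive.search(url) is not None   # would be scrubbed
```

Scheme and host are case-insensitive in URLs, so a case-sensitive matcher leaves an easy bypass.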

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 989d6c36-4602-4eab-8ef5-21891a1921d6

📥 Commits

Reviewing files that changed from the base of the PR and between aacd6e3 and 70af618.

📒 Files selected for processing (3)
  • src/infra/docs-lambda-changelog-scrubber/Program.cs
  • src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs
  • tests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cs
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/infra/docs-lambda-changelog-scrubber/Program.cs
  • tests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cs

Member

@Mpdreamz Mpdreamz left a comment

Request changes

Thanks for wiring up the private → public pipeline and the deserialize → mutate → serialize scrubbing path. With how we plan to use these buckets, I think we need a few adjustments before this is safe to ship.

Bucket roles (why scrubbing must be strict)

  • Private S3 backs indexing into Elasticsearch (internal/trusted context).
  • Public S3 is what docs-builder will use to render release notes on anyone’s machine (untrusted context).

So the public objects must not carry internal repo names, PR numbers, or URLs that slipped through a new field or a renamed property. The bar for the public artifact is “no disclosure,” not only “no clickable link.”

Sentinel-style redaction vs real removal

Using # PRIVATE: <original> on structured refs can still leave the full original reference in plain text in YAML that lands in the public bucket. That conflicts with the goal above: if the private copy remains authoritative in private S3, the public copy should remove disallowed references (or replace them with a non-identifying placeholder, e.g. a fixed string with no owner/repo/URL), not preserve them after a prefix.

Please align bundle scrubbing with that requirement: no residual GitHub URLs or owner/repo#N patterns in public bundle YAML unless they are explicitly allowlisted.

Post-serialize validation

Deserialize → mutate → serialize is the right core loop, but it won’t catch:

  • New YAML fields that contain links in a future schema revision.
  • Renamed fields where scrubbing wasn’t extended.

Please add a post-serialize validation pass on the final string destined for the public bucket (or equivalently, a second parse + walk), e.g.:

  • Assert the serialized text contains no GitHub PR/issue URLs and no short-form owner/repo# references outside an explicit allowlist (reuse or share the same patterns as ScrubText / link detection).
  • Fail the Lambda batch item (or fail CI tests) if validation fails so we don’t quietly publish a leaky object.

That gives us defense in depth when the model changes.
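
A post-serialize pass of this kind could look roughly like the Python sketch below. It is an illustration under assumptions, not the project's C# code: the URL pattern is a translation of the matcher quoted elsewhere in this thread, the short-form pattern is a guess (the real ShortFormRefRegex is not shown here), and `find_private_references` / `validate_public_yaml` are hypothetical names.

```python
import re

GITHUB_URL = re.compile(
    r"https?://github\.com/(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)"
    r"/(?:pull|issues)/\d+",
    re.IGNORECASE,
)
# Assumed shape of an owner/repo#N reference; not taken from the source code.
SHORT_FORM = re.compile(
    r"(?<![\w/.-])(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)#\d+"
)


def find_private_references(text, allow_repos):
    """Return every link-shaped match whose owner/repo is not allowlisted."""
    allowed = {repo.lower() for repo in allow_repos}
    found = []
    for pattern in (GITHUB_URL, SHORT_FORM):
        for match in pattern.finditer(text):
            repo = f"{match['owner']}/{match['repo']}".lower()
            if repo not in allowed:
                found.append(match.group(0))
    return found


def validate_public_yaml(text, allow_repos):
    """Fail closed: raise instead of returning text that still has leaks."""
    leaks = find_private_references(text, allow_repos)
    if leaks:
        raise ValueError(f"non-allowlisted references: {leaks}")
    return text
```

Because the scan runs on the final string rather than on the object model, it does not need to know which YAML field a reference came from.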

Summary

Request:

  1. Public bundle/changelog YAML should remove or non-identifying-redact disallowed references so internal details aren’t recoverable from the file.
  2. Post-serialize validation on emitted content to guard against new/renamed fields leaking links.

Happy to revisit once those are addressed.

@coderabbitai coderabbitai Bot added ci and removed documentation Improvements or additions to documentation feature labels Apr 22, 2026
@cotti
Contributor Author

cotti commented Apr 22, 2026

Changes for residual private references:

  • Added StripBundleSentinels -- strips # PRIVATE: entries from all prs/issues across a bundle's entries. The Lambda now calls this after TryApplyBundle before serializing bundle output.

Changes for post-serialize validation

  • Added ValidateNoPrivateReferences(string yaml, IReadOnlyList<string> allowRepos), which scans the final serialized YAML string using the same GitHubUrlRegex and ShortFormRefRegex patterns, plus checks for residual # PRIVATE: sentinels. Throws InvalidOperationException if any non-allowlisted references are found.

The Lambda now calls this on every output (both changed and unchanged paths) before writing to the public bucket. If validation fails, the exception bubbles up and the SQS message goes to DLQ -- no unsafe content gets published.

Added some test scenarios to cover scrubbing:

StripBundleSentinels:

  • Removes sentinels while keeping allowed refs in a mixed list
  • All-sentinel lists become empty
  • Null prs/issues are preserved as null
  • Multiple entries are all stripped

ValidateNoPrivateReferences:

  • Clean YAML with only allowed refs passes
  • Private GitHub URL triggers exception
  • Private short-form owner/repo#N triggers exception
  • Residual # PRIVATE: sentinel triggers exception
  • All-allowed content passes
  • Empty YAML passes
  • Mixed allowed + private throws (citing the private one)

TryApplyChangelogEntry edge cases:

  • Mixed allowed + private PRs: keeps allowed, drops private, verifies count
  • Private refs across ALL fields (prs, issues, description, impact, action): verifies complete scrubbing

@cotti cotti requested a review from Mpdreamz April 22, 2026 14:36
@coderabbitai coderabbitai Bot removed the ci label Apr 22, 2026
@coderabbitai coderabbitai Bot added documentation Improvements or additions to documentation feature labels Apr 22, 2026
Member

@Mpdreamz Mpdreamz left a comment

Request changes

What we want from the two buckets

  • Private S3 is the trusted store: full YAML, real PR/issue links, whatever indexing and internal workflows need. It does not need redaction hints.
  • Public S3 is what docs-builder will use to render release notes on any machine. That copy should contain only allowlisted GitHub references. Anything else should be absent, not replaced with a marker that something used to be there.

So the public pipeline should not depend on an intermediate # PRIVATE: representation at all.

Why drop the sentinel round-trip for public output

Right now the flow can read as: rewrite disallowed refs to # PRIVATE: …, then strip sentinels, serialize, validate (plus validation that still looks for residual # PRIVATE:).

That works, but it adds moving parts and two different concepts of “clean”:

  1. Mental model — Operators and future maintainers should only need one rule for public: disallowed links are removed. They should not have to reason about a string format that is explicitly not allowed in the final artifact anyway.
  2. Complexity — If nothing outside private S3 needs “a hint something was redacted,” the sanitizer’s public path can omit disallowed list entries (or never add them) directly, instead of tag then strip. Same outcome, fewer steps, fewer places for ordering bugs (e.g. strip forgotten on a code path).
  3. Validation: ValidateNoPrivateReferences should primarily enforce “no non-allowlisted link-shaped content” in the serialized string. A separate “no # PRIVATE:” check is reasonable as a temporary guard while sentinel output still exists somewhere; once public code never emits that token, that check becomes redundant. Prefer one contract: public YAML must not contain private links or redaction sentinels, achieved by never writing either.

Requested changes

  1. Public / Lambda scrubbing path — For bundle (and any shared entry logic used only for public), do not use # PRIVATE: as an intermediary. Resolve each PR/issue reference against the allowlist; drop disallowed entries (or equivalent) before serialization. Keep deserialize → mutate (remove) → serialize → validate string without a sentinel phase for this path.
  2. Refactor / split if needed — If other callers still need sentinel behavior for private-side tooling, isolate that behind an API that is not used when writing to the public bucket, or branch clearly so “public emission” never constructs # PRIVATE:.
  3. Strip / validate — Remove StripBundleSentinels-style workarounds once they are unnecessary, and tighten ValidateNoPrivateReferences to match the final contract (link scans + any residual rules you still need during migration).

Summary

Private artifacts can keep private links. Public artifacts should be link-scrubbed by removal, not by a two-phase sentinel encoding. Please adjust this PR so the public scrubbing path reflects that simpler model and the rationale above.

Happy to re-review once that’s in place.

@coderabbitai coderabbitai Bot added ci enhancement and removed documentation Improvements or additions to documentation feature labels Apr 22, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs (1)

19-23: ⚠️ Potential issue | 🔴 Critical

Broaden GitHub URL detection before publishing.

Line 19 only matches lowercase GitHub PR/issue URLs, so values like HTTPS://github.com/private/repo/pull/1 or https://github.com/private/repo/blob/main/file.md bypass both scrubbing and ValidateNoPrivateReferences. Make this matcher case-insensitive and match repo-scoped GitHub URLs, not just /pull|issues/. The case-sensitivity part was already flagged in a prior review.

Suggested fix
-	[GeneratedRegex(@"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)/(?:pull|issues)/\d+", RegexOptions.None)]
+	[GeneratedRegex(
+		@"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)(?:/[^\s""'<>)]*)?",
+		RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)]
 	private static partial Regex GitHubUrlRegex();

Read-only verification:

#!/bin/bash
# Verifies the current matcher misses uppercase GitHub URLs and repo/blob URLs.

rg -n -C2 'GeneratedRegex\(@"https\?://github\\\.com' src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs

python - <<'PY'
import re

current = re.compile(r"https?://github\.com/(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)/(?:pull|issues)/\d+")

samples = [
    "HTTPS://github.com/private/repo/pull/123",
    "https://github.com/private/repo",
    "https://github.com/private/repo/blob/main/file.md",
    "https://github.com/private/repo/issues/123",
]

for sample in samples:
    print(f"{sample}: {'MATCH' if current.search(sample) else 'MISS'}")
PY

Expected output: the first three samples show MISS with the current matcher.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs` around
lines 19 - 23, The GitHubUrlRegex is too strict and case-sensitive so URLs like
HTTPS://github.com/... or repo-scoped paths (blob/tree/etc.) bypass scrubbing;
update the GeneratedRegex for GitHubUrlRegex to be case-insensitive
(RegexOptions.IgnoreCase) and broaden its pattern to match any repo-scoped path
not just /pull|issues (e.g. change
@"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)/(?:pull|issues)/\d+"
to a pattern that allows an optional slash and any trailing path such as
@"https?://github\.com/(?<owner>[A-Za-z0-9_.-]+)/(?<repo>[A-Za-z0-9_.-]+)(?:$|/.*)"
and add RegexOptions.IgnoreCase); also add RegexOptions.IgnoreCase to
ShortFormRefRegex so short owner/repo#123 forms are matched case-insensitively,
ensuring ValidateNoPrivateReferences and LinkAllowlistSanitizer now catch
uppercase and blob/tree URLs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 38ebf793-39b1-438d-baaf-7b10432f7fd2

📥 Commits

Reviewing files that changed from the base of the PR and between 63a3311 and aaea34a.

📒 Files selected for processing (3)
  • src/infra/docs-lambda-changelog-scrubber/Program.cs
  • src/services/Elastic.Changelog/Bundling/LinkAllowlistSanitizer.cs
  • tests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cs
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/Elastic.Changelog.Tests/Changelogs/LinkAllowlistSanitizerTests.cs
  • src/infra/docs-lambda-changelog-scrubber/Program.cs

@cotti
Contributor Author

cotti commented Apr 22, 2026

Public / Lambda scrubbing path:

The public path now uses FilterReferenceList, which resolves each reference against the allowlist and either keeps it or drops it. The flow is:

deserialize -> FilterReferenceList drops disallowed refs + ScrubText removes private links from text fields -> serialize -> ValidateNoPrivateReferences on the final string.

For bundles, ScrubBundleForPublic iterates entries and applies TryApplyChangelogEntry (which uses FilterReferenceList) to each, plus scrubs the bundle-level description. The # PRIVATE: constant is not referenced on any public code path.

Refactor / split if needed

The split is now clean at the method level:

Private-side: TryApplyBundle -> ApplyToReferenceList -> ProcessPlainReference (produces sentinels). Used only by the docs-builder changelog bundle CLI.

Public-side: ScrubBundleForPublic / TryApplyChangelogEntry -> FilterReferenceList (drops disallowed refs, never creates sentinels). Used by the Lambda.

The two paths share only the parsing utility (TryGetGitHubRepo) and the allowlist (BuildAllowSet).
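
The difference between the two paths can be illustrated with a short Python sketch. It is not the C# implementation: `REF`, `repo_of`, `filter_reference_list`, and `tag_reference_list` are hypothetical stand-ins for the methods named above, the reference pattern is an assumption, and dropping references the sketch cannot parse is a fail-closed choice of this example rather than documented behavior.

```python
import re

SENTINEL = "# PRIVATE: "
# Assumed reference shapes: a GitHub PR/issue URL or owner/repo#N.
REF = re.compile(
    r"^(?:https?://github\.com/)?(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)"
    r"(?:#|/(?:pull|issues)/)\d+$",
    re.IGNORECASE,
)


def repo_of(ref):
    """Return 'owner/repo' in lower case, or None when the shape is unknown."""
    match = REF.match(ref.strip())
    return f"{match['owner']}/{match['repo']}".lower() if match else None


def filter_reference_list(refs, allow_repos):
    """Public side: keep allowlisted references, drop everything else."""
    if refs is None:
        return None
    allowed = {repo.lower() for repo in allow_repos}
    return [ref for ref in refs if repo_of(ref) in allowed]


def tag_reference_list(refs, allow_repos):
    """Private side: keep the original text behind a sentinel prefix."""
    allowed = {repo.lower() for repo in allow_repos}
    return [ref if repo_of(ref) in allowed else SENTINEL + ref for ref in refs]
```

The tagged list still contains the full private reference after the prefix, which is why only the filtered list is suitable for the public bucket.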

Remove StripBundleSentinels-style workarounds once unnecessary, and tighten ValidateNoPrivateReferences

StripBundleSentinels was removed, alongside DropSentinels.

ValidateNoPrivateReferences enforces one contract: the serialized YAML must not contain non-allowlisted GitHub URLs, owner/repo#N short forms, or residual # PRIVATE: strings. The sentinel check remains as a transitional safety net (it will never fire since the public path never produces sentinels, but it catches programming errors if someone accidentally routes private-side output to the public path). The primary enforcement is the link-pattern scans.

@cotti cotti requested a review from Mpdreamz April 22, 2026 19:57
Member

@Mpdreamz Mpdreamz left a comment

LGTM

The private → public scrub + post-serialize validation is in good shape for the intended defense-in-depth story. Shapes up.

Follow-up (not blocking)

Align scrubbing/validation with every PR/issue format we document or accept — e.g. catalog what docs-builder (and changelog/bundle schema) allow in prs / issues and in prose fields (bare numbers, owner/repo#N, full URLs, alternate hosts). ValidateNoPrivateReferences and ScrubText only match a narrow set of GitHub PR/issue URL and short-form patterns today. If the product is effectively freeform strings in those fields, we should tighten the contract (schema/docs) and/or expand scrub + validation in lockstep so we do not give the impression of “all links removed” when some shapes slip through.

Please track a follow-up to: (1) list supported reference formats, (2) assert scrub/validation each cover the same set, and (3) if inputs stay freeform, add validation or heuristics appropriate to that.

Gaps in current pattern coverage (defense-in-depth / known blind spots)

Below are examples the current GitHubUrlRegex + ShortFormRefRegex (and thus post-validate) do not flag as disallowed by owner/repo, even though they may still point at private work. TryApply paths that use ChangelogTextUtilities.TryGetGitHubRepo can accept additional shapes; validate/scrub are narrower.

| Example | Notes |
|---|---|
| https://www.github.com/elastic/secret/pull/42 | www.github.com is not matched (pattern expects github.com immediately after //) |
| https://github.com/elastic/secret | Repo home only, no pull/issues path |
| https://github.com/elastic/secret/wiki/Page | Wiki / non-PR, non-issue path |
| https://github.com/elastic/secret/compare/8.0...8.1 | Compare (no /pull/, /issues/) |
| https://github.com/elastic/secret/blob/main/README.md | Blob/tree paths |
| https://github.com/elastic/secret/commit/abc123def | Commit link |
| git@github.com:elastic/secret.git | SSH remotes in text |
| https://gh.internal.example.com/... | Other GitHub hostnames (Enterprise) |

(Allowlisted https://github.com/org/repo/pull|issues/n and allowlisted owner/repo#n do line up and pass as expected.)

This table is to inform the follow-up: either narrow the accepted input, or broaden detection where we need parity.
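
The blind spots listed above can be reproduced with a quick Python check. The pattern is a Python translation of the PR/issue URL matcher quoted earlier in this thread; whether the merged GitHubUrlRegex is identical is an assumption, and the short-form pattern is left out because its source is not shown here.

```python
import re

# Python translation of the narrow PR/issue URL pattern quoted in this thread.
GITHUB_URL = re.compile(
    r"https?://github\.com/(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)"
    r"/(?:pull|issues)/\d+"
)

# Examples taken from the review comment's list of gaps.
samples = [
    "https://www.github.com/elastic/secret/pull/42",
    "https://github.com/elastic/secret",
    "https://github.com/elastic/secret/wiki/Page",
    "https://github.com/elastic/secret/compare/8.0...8.1",
    "https://github.com/elastic/secret/blob/main/README.md",
    "https://github.com/elastic/secret/commit/abc123def",
    "git@github.com:elastic/secret.git",
    "https://gh.internal.example.com/...",
]

# Every sample that the pattern fails to flag ends up in this list.
missed = [sample for sample in samples if GITHUB_URL.search(sample) is None]

# Control: a plain PR URL on github.com is still detected.
control_detected = GITHUB_URL.search("https://github.com/elastic/secret/pull/42") is not None
```

A list like `samples` could double as a regression fixture for the follow-up, so scrub and validation coverage are tested against the same set of shapes.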

Thanks for the work on this — happy to help refine the follow-up if useful.

@cotti cotti merged commit e4ff8f3 into main Apr 22, 2026
28 of 29 checks passed
@cotti cotti deleted the feature/private_bundling branch April 22, 2026 20:09