update: improve s3 orphan handling #3594

Merged
isTravis merged 1 commit into main from tr/s3clean2 on Apr 20, 2026

@isTravis
Member

Improves the s3Cleanup tool so that subsequent runs don't re-process objects that were already tagged as orphans, and ensures non-asset prefixes (CloudFront logs, admin manifests, fonts) are never considered orphan candidates.

Changes

tools/s3Cleanup.ts

  • S3-based orphan manifest tracking: After a successful --tag run, the script uploads a manifest of all successfully-tagged keys to s3://assets.pubpub.org/_orphanAdmin/tagged-<timestamp>.txt. On subsequent runs, these manifests are downloaded and their keys are skipped during the S3 listing phase. This replaces the previous local-file approach and survives redeploys.
  • Skip all _ prefixed folders: Any top-level key starting with _ is now ignored (covers _testing/, _cflogs/, _orphanAdmin/, _fonts/, and any future underscore-prefixed paths).
  • Output renamed to newOrphans.txt: Phase 2 now writes only newly discovered orphans to tmp/newOrphans.txt, so the file contains only incremental finds that are not already tracked in a manifest.
  • New imports: GetObjectCommand, PutObjectCommand added for manifest download/upload.
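
To make the manifest round trip concrete, here is a minimal sketch of how a manifest key could be built and how its body could be written and read back. The helper names, the timestamp format, and the one-key-per-line layout are assumptions for illustration; the PR only states that manifests live at _orphanAdmin/tagged-<timestamp>.txt.

```typescript
// Hypothetical helpers: names, timestamp format, and line layout are
// assumptions, not taken from tools/s3Cleanup.ts.
const MANIFEST_PREFIX = "_orphanAdmin/";

/** Build a manifest key of the form `_orphanAdmin/tagged-<timestamp>.txt`. */
function manifestKey(now: Date): string {
  // Assumed format: ISO-8601 with ":" and "." replaced by "-".
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  return `${MANIFEST_PREFIX}tagged-${stamp}.txt`;
}

/** Serialize tagged keys as one key per line (assumed layout). */
function serializeManifest(keys: readonly string[]): string {
  return keys.join("\n") + "\n";
}

/** Parse a manifest body into a set of keys, ignoring empty lines. */
function parseManifest(body: string): Set<string> {
  const keys = new Set<string>();
  for (const line of body.split(/\r?\n/)) {
    // Lines are not trimmed, because S3 keys may contain spaces.
    if (line.length > 0) keys.add(line);
  }
  return keys;
}
```

In the actual tool, the serialized body would presumably be sent with PutObjectCommand and read back with GetObjectCommand. A line-based layout like this one cannot represent keys that contain newline characters.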

scripts/upload-fonts-to-s3.sh

  • Fonts now upload to _fonts/ prefix (already covered by the _ skip rule in s3Cleanup).

server/Html.tsx & workers/tasks/export/html.tsx

  • Updated font CSS URL to point at new _fonts/<hash>/fonts.css path.

How it works

  1. Phase 1: Scan DB for all referenced S3 keys (unchanged)
  2. Load previously-tagged keys from _orphanAdmin/*.txt manifests in S3
  3. Phase 2: List bucket, skip _* and fonts/ prefixes, skip previously-tagged keys → write only new orphans to tmp/newOrphans.txt
  4. Phase 3 (--tag): Tag new orphans, then upload a manifest of successfully-tagged keys to _orphanAdmin/
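
The Phase 2 filter described in the steps above can be sketched as a pure function over the bucket listing. This is an illustration under stated assumptions, not the code in tools/s3Cleanup.ts: the ListedObject type, the function names, applying the age check inside Phase 2, and treating one year as 365 days are all hypothetical.

```typescript
// Hypothetical shape of one entry from the bucket listing.
interface ListedObject {
  key: string;
  lastModified: Date;
}

// Assumption: "1 year" is approximated as 365 days.
const ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000;

/** True for keys under an underscore-prefixed top-level folder or under fonts/. */
function isSkippedPrefix(key: string): boolean {
  return key.startsWith("_") || key.startsWith("fonts/");
}

/** Return keys that are old enough, unreferenced, and not already tagged. */
function findNewOrphans(
  listed: readonly ListedObject[],
  referenced: ReadonlySet<string>,
  previouslyTagged: ReadonlySet<string>,
  now: Date,
): string[] {
  const cutoff = now.getTime() - ONE_YEAR_MS;
  return listed
    .filter(
      (obj) =>
        !isSkippedPrefix(obj.key) &&
        obj.lastModified.getTime() <= cutoff && // objects under 1 year old are excluded
        !referenced.has(obj.key) && // Phase 1 result: keys still used by the DB
        !previouslyTagged.has(obj.key), // keys recorded in earlier manifests
    )
    .map((obj) => obj.key);
}
```

The returned keys correspond to what would be written to tmp/newOrphans.txt. Keeping the filter free of I/O makes it straightforward to check against a small hand-built listing.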

Safety

  • Non-destructive: tagging is reversible with --untag
  • Objects less than 1 year old are never considered orphan candidates
  • Previously-tagged manifests are stored durably in S3 (not local disk)
  • First run on a fresh environment works fine (empty _orphanAdmin/ prefix returns no manifests)
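
The fresh-environment case can be illustrated with a small loader that reads manifests through an injected store. The ManifestStore interface and loadPreviouslyTagged are hypothetical names standing in for the S3 list and get calls; the sketch only shows why an empty _orphanAdmin/ prefix yields an empty skip set rather than an error.

```typescript
// Hypothetical abstraction over the S3 calls, so the logic can run without AWS.
interface ManifestStore {
  listKeys(prefix: string): Promise<string[]>;
  getText(key: string): Promise<string>;
}

/** Collect every key recorded in the *.txt manifests under _orphanAdmin/. */
async function loadPreviouslyTagged(store: ManifestStore): Promise<Set<string>> {
  const tagged = new Set<string>();
  const manifests = await store.listKeys("_orphanAdmin/");
  for (const manifest of manifests) {
    if (!manifest.endsWith(".txt")) continue; // ignore anything that is not a manifest
    const body = await store.getText(manifest);
    for (const line of body.split(/\r?\n/)) {
      if (line.length > 0) tagged.add(line);
    }
  }
  return tagged; // empty when no manifests exist yet
}
```

With a store whose listKeys returns an empty array, the loop body never runs and the result is an empty set, so nothing is skipped on the first run.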

@isTravis isTravis merged commit 7a96b55 into main Apr 20, 2026
1 check passed
@isTravis isTravis deleted the tr/s3clean2 branch April 20, 2026 21:23