
chore: s3 clean#3572

Open
isTravis wants to merge 12 commits into main from tr/s3-clean

Conversation


@isTravis isTravis commented Apr 9, 2026

Adds tools/s3Cleanup.ts - a script that identifies and removes unreferenced objects from the assets.pubpub.org S3 bucket.

Problem

Every uploaded asset is retained forever, even after being replaced in-app (all uploads get unique filenames). There's been no mechanism to identify or remove orphaned files.

How it works

Phase 1 — Scan the database for every reference to an assets.pubpub.org key. This covers:

  • All image/avatar TEXT columns across Communities, Pubs, Pages, Collections, Users, Attributions, ExternalPublications, Exports, and the PubHeaderTheme facet
  • JSONB structures: Pubs.downloads, layout block banner images, layout HTML blocks, layout text blocks (DocJson), layout submission-banner bodies
  • All DocJson trees (recursive ProseMirror walk): Docs, ThreadComments, Releases, Reviews, Submissions, SubmissionWorkflows (5 columns), DraftCheckpoints
  • Regex text scans for: LandingPageFeature payloads, ActivityItem payloads, CustomScripts, Pub htmlTitle/htmlDescription, Community heroText, WorkerTask input/output
  • All URL formats: direct, Fastly IO query params, resize-v3 (base64), resize v1 (Thumbor), s3-external-1.amazonaws.com
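The PR doesn't show its implementation inline, but the scan above boils down to two reusable pieces: normalizing each URL variant to a bare S3 key, and recursively walking DocJson (ProseMirror) trees for string attributes. A minimal sketch of that idea — all names are mine, and the exact resize/proxy URL formats are assumptions, not the PR's actual code:

```typescript
// Hypothetical sketch of Phase 1's key extraction; names and URL-format
// details are illustrative, not taken from tools/s3Cleanup.ts.
type DocJson = { type: string; attrs?: Record<string, unknown>; content?: DocJson[] };

// Normalize an asset URL down to a bare S3 key. This one pattern covers both
// direct URLs and s3-external-1.amazonaws.com/assets.pubpub.org/... forms.
function extractAssetKey(url: string): string | null {
  const m = url.match(/assets\.pubpub\.org\/([^?#]+)/);
  return m ? decodeURIComponent(m[1]) : null;
}

// Recursive ProseMirror walk: inspect every string attribute on every node,
// then recurse into child content.
function collectAssetKeys(node: DocJson, out: Set<string> = new Set()): Set<string> {
  for (const value of Object.values(node.attrs ?? {})) {
    if (typeof value === "string") {
      const key = extractAssetKey(value);
      if (key) out.add(key);
    }
  }
  for (const child of node.content ?? []) collectAssetKeys(child, out);
  return out;
}
```

The base64 resize-v3 variant would decode the encoded segment first and then feed the result through the same `extractAssetKey` normalizer.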

Phase 2 — List the S3 bucket and compare each object's key against the referenced set. Writes orphans to orphans.txt.

Phase 3 — Delete orphans in batches of 1,000 (only with --execute).
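The batch size isn't arbitrary: S3's DeleteObjects API accepts at most 1,000 keys per request. A sketch of the batching step (function name is mine, not from the PR):

```typescript
// Split the orphan list into DeleteObjects-sized batches; S3 caps each
// DeleteObjects call at 1,000 keys. Illustrative sketch only.
function toDeleteBatches(keys: string[], batchSize: number = 1000): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < keys.length; i += batchSize) {
    batches.push(keys.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch would then be handed to a DeleteObjects call (e.g. `DeleteObjectsCommand` in AWS SDK v3), gated behind the `--execute` flag.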

Safety measures

  • Age threshold: objects newer than 1 year are never considered orphans, preventing race conditions where a file was just uploaded but the DB read missed it (configurable via --min-age-days=N)
  • Dry-run by default: without --execute, only writes a manifest — no deletions
  • Resumable: saves an S3 listing checkpoint so interrupted runs can --resume
  • _testing/ prefix keys are always skipped
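Taken together, the Phase 2 comparison plus the safety filters above amount to one pure decision function. A sketch under assumed names (the script's real structure may differ):

```typescript
// Assumed shape for a listed S3 object; field names are illustrative.
interface ObjectMeta { key: string; lastModified: Date; }

// An object is an orphan only if it is outside the _testing/ prefix, older
// than the age threshold, and absent from the referenced-key set.
function selectOrphans(
  objects: ObjectMeta[],
  referenced: Set<string>,
  minAgeDays: number,
  now: number = Date.now(),
): string[] {
  const cutoff = now - minAgeDays * 24 * 60 * 60 * 1000;
  return objects
    .filter((o) => !o.key.startsWith("_testing/"))       // _testing/ always skipped
    .filter((o) => o.lastModified.getTime() <= cutoff)   // too new -> never an orphan
    .filter((o) => !referenced.has(o.key))               // referenced in the DB -> keep
    .map((o) => o.key);
}
```

Keeping this logic pure (no S3 or DB calls inside) is what makes the dry-run and `--skip-s3-list` modes cheap: the same function runs against a saved listing and a saved reference set.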

Usage

pnpm run tools-prod s3Cleanup                      # dry-run
pnpm run tools-prod s3Cleanup --execute            # delete orphans
pnpm run tools-prod s3Cleanup --resume             # resume interrupted listing
pnpm run tools-prod s3Cleanup --min-age-days=180   # custom age threshold
pnpm run tools-prod s3Cleanup --skip-s3-list --execute  # reuse existing orphans.txt

Contributor

Copilot AI left a comment


Pull request overview

Adds a production tooling script to identify and remove (or quarantine) orphaned objects in the assets.pubpub.org S3 bucket by scanning the Postgres DB for referenced asset keys, comparing against a full bucket listing, and then acting on the orphan set.

Changes:

  • Add tools/s3Cleanup.ts implementing DB reference collection, S3 listing with resume support, and delete/quarantine/unquarantine flows.
  • Register the new s3Cleanup command in the tools command router.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • tools/s3Cleanup.ts: New CLI tool for scanning DB references, listing S3 keys, and deleting/quarantining unreferenced objects.
  • tools/index.js: Adds s3Cleanup to the available tools commands.


