
feat(db): optimize queries to reduce data transfer #33

Merged
cobacious merged 2 commits into main from optimize-data-transfer
Jan 2, 2026

Conversation

@cobacious
Owner

Optimized database queries to fetch only necessary fields, significantly reducing egress data transfer to stay within free tier limits.

Changes:

- getUnembeddedArticles: only fetch id, title, content, snippet, url
- getUnclusteredArticles: only fetch id, embedding, createdAt
  • Updated tests to match optimized query structures

Impact: ~90-95% reduction in data transfer per query, which should help avoid exceeding the 5GB/month egress limit on Neon/Supabase free tiers.
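The change described above amounts to projecting explicit columns instead of `SELECT *`. A minimal sketch, assuming a plain SQL layer: the `buildSelect` helper, the `articles` table name, and the `embedding IS NULL` / `cluster_id IS NULL` predicates are all illustrative assumptions, not the repo's actual code; only the column lists come from the bullets above.

```typescript
// Hypothetical helper: build a SELECT that fetches only the listed
// columns instead of `SELECT *`, cutting per-row egress.
function buildSelect(table: string, columns: string[], where?: string): string {
  const projection = columns.join(", ");
  const filter = where ? ` WHERE ${where}` : "";
  return `SELECT ${projection} FROM ${table}${filter}`;
}

// Columns mirror the bullet list above; the table name and the
// WHERE predicates are assumptions for illustration.
const getUnembeddedArticlesSql = buildSelect(
  "articles",
  ["id", "title", "content", "snippet", "url"],
  "embedding IS NULL",
);

const getUnclusteredArticlesSql = buildSelect(
  "articles",
  ["id", "embedding", "created_at"],
  "cluster_id IS NULL",
);
```

Dropping wide columns (full article content, embedding vectors) from queries that don't need them is where the bulk of the egress saving comes from.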

🤖 Generated with Claude Code


feat(engine): make pipeline limits configurable via environment variables

Make all pipeline processing limits configurable via environment variables
instead of hardcoded constants. This allows flexible configuration for
different use cases (local development, production, bulk processing).

Changes:
- embedArticles.ts: MAX_EMBEDDINGS env var (0 = unlimited)
- getArticlesMissingContent.ts: MAX_CONTENT_EXTRACTION env var (0 = unlimited)
- summarizeClusters.ts: Fix TOKEN_LIMIT to properly handle 0 as unlimited
- .env.example: Document all pipeline limits with cost estimates
- run-pipeline.yml: Set limits to 0 (unlimited) for GitHub Actions

For hobby use, unlimited processing costs ~$0.02 per run (~$0.10/month
if run weekly). The configurable limits allow adding safety caps if needed
for production use.

Rationale: The previous hardcoded limits (200 embeddings, 100 content
extractions) were causing incomplete processing when there was a backlog,
requiring multiple runs to process all articles. For a hobby project with
infrequent runs, it's simpler to process everything in one go.
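The "0 = unlimited" convention the bullets describe can be sketched as follows. `MAX_EMBEDDINGS` and its previous default of 200 are named above; the `readLimit` and `capBatch` helpers are hypothetical illustrations of the pattern, not the repo's actual code.

```typescript
// Hedged sketch of the env-var limit pattern: unset or invalid values
// fall back to a default, and an explicit 0 means "no limit".
function readLimit(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number.parseInt(raw, 10);
  if (Number.isNaN(parsed) || parsed < 0) return fallback;
  return parsed === 0 ? Number.POSITIVE_INFINITY : parsed;
}

// Apply the limit to a batch; Infinity leaves the batch untouched.
function capBatch<T>(items: T[], limit: number): T[] {
  return Number.isFinite(limit) ? items.slice(0, limit) : items;
}

process.env.MAX_EMBEDDINGS = "0"; // as set in run-pipeline.yml
const maxEmbeddings = readLimit("MAX_EMBEDDINGS", 200);
```

Mapping 0 to `Infinity` keeps the call sites uniform: every stage caps its batch the same way, whether or not a limit is configured.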

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@cobacious cobacious merged commit a96d938 into main Jan 2, 2026
3 checks passed
@cobacious cobacious deleted the optimize-data-transfer branch January 2, 2026 10:12
cobacious added a commit that referenced this pull request Jan 2, 2026
* feat(db): optimize queries to reduce data transfer

* feat(engine): make pipeline limits configurable via environment variables

---------

Co-authored-by: Jacob Walton <jacob.walton@dnata.com>
Co-authored-by: Claude <noreply@anthropic.com>
