feat: asset library pipeline for entity icons #26
- extract-entities.py: add --normalize-only flag for LLM deduplication
- fetch-icons.py: CoinGecko integration with rate limiting and pre-scan
- generate-asset-checklist.py: coverage reporting with fuzzy matching
- assets/: entity inventory (1143 entities) and 200+ icons

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add fuzzy name matching (0.85 threshold) to merge typo variants (e.g., jintern/jinintern, ai16z/ai16)
- Add --dedupe flag for post-processing existing inventory
- Add --since flag for incremental extraction (CI/CD efficiency)
- Add resolve_type_conflicts() with batch LLM arbitration
- Add token cost tracking in metadata
- Constrain entity types to: token, platform, project, user
- Add type normalization (person→user, company→project)
- Remove visual_hints from extraction (not actionable)

Results: 2,320 → 1,928 entities (-17% deduplication)
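The fuzzy merge described in this commit can be illustrated with a minimal sketch. The commit does not say which similarity metric is used, so this assumes `difflib.SequenceMatcher.ratio()` from the standard library; `is_fuzzy_match` and `merge_variants` are hypothetical names, and only the 0.85 threshold comes from the commit message.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # threshold stated in the commit message


def is_fuzzy_match(a: str, b: str, threshold: float = FUZZY_THRESHOLD) -> bool:
    """Return True when two names are similar enough to be typo variants.

    Assumption: similarity is difflib's ratio on lowercased names.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def merge_variants(names: list[str]) -> list[list[str]]:
    """Greedily group names; each group is keyed by its first member."""
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            if is_fuzzy_match(name, group[0]):
                group.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([name])
    return groups
```

Greedy grouping is order-dependent, so a real pipeline would probably sort names first (for example by frequency) so that the canonical spelling becomes the group key.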
…ENTITIES filter
- Rename entity-inventory.json → manifest.json for clarity
- Update all script references to use manifest.json
- Add SKIP_ENTITIES filter to extract-entities.py to skip generic terms (crypto, token, nft, blockchain, agent, etc.)
- Remove fetch-icons.py (already merged into generate-icons.py)
- Rename all icons from {name}.ext to {name}-1.ext
- Remove legacy file handling code from generate-icons.py
- Simplify get_next_icon_filename() to only handle numbered files
- Delete obsolete/duplicate icon files
This allows multiple icons per entity (for artist reference/moodboard)
with a consistent naming convention: {base}-{n}.{ext}
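As an illustration of the `{base}-{n}.{ext}` convention, here is a rough sketch of how the next numbered filename could be chosen. It is not the PR's `get_next_icon_filename()`; the function name `next_icon_filename` and its signature (taking a list of existing filenames rather than a directory) are assumptions made to keep the example self-contained.

```python
import re
from typing import Iterable


def next_icon_filename(existing: Iterable[str], base: str, ext: str) -> str:
    """Return '{base}-{n}.{ext}' where n is one past the highest number in use.

    Numbers are shared across extensions, so bitcoin-1.png and bitcoin-2.svg
    are followed by bitcoin-3.<ext>. That sharing is an assumption.
    """
    pattern = re.compile(rf"{re.escape(base)}-(\d+)\.[A-Za-z0-9]+")
    used = []
    for name in existing:
        match = pattern.fullmatch(name)
        if match:
            used.append(int(match.group(1)))
    return f"{base}-{max(used, default=0) + 1}.{ext}"
```

For example, `next_icon_filename(["bitcoin-1.png", "bitcoin-2.svg"], "bitcoin", "png")` yields `bitcoin-3.png`, and an entity with no icons yet starts at `-1`.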
- Merge entity types from 4 to 3 (platform → project)
- Replace resolve_type_conflicts() with classify_entities()
- Single LLM call classifies type + status (keep/skip/review)
- generate-icons.py filters by status=keep
- Clean up metadata fields in --dedupe output
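The type mappings mentioned across these commits (person→user, company→project, and now platform→project) amount to a small lookup table. The sketch below assumes a hypothetical `normalize_type` helper; returning `None` for unknown types is also an assumption, chosen so the caller can route such entities to review.

```python
from typing import Optional

VALID_TYPES = {"token", "project", "user"}

# Mappings taken from the commit messages in this PR.
TYPE_ALIASES = {"person": "user", "company": "project", "platform": "project"}


def normalize_type(raw: str) -> Optional[str]:
    """Map a raw type label onto one of the three allowed types, else None."""
    label = raw.strip().lower()
    label = TYPE_ALIASES.get(label, label)
    return label if label in VALID_TYPES else None
```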
Code review
Found 1 issue:
knowledge/scripts/posters/generate-asset-checklist.py Lines 71 to 80 in 7ef9af9
The code at line 72 calls
- Updated load_entities() to handle flat list schema with type field
- Updated get_all_assets() for flat icon directory structure
- Updated generate_checklist() for new entity types (token, project, user)

Fixes schema mismatch that caused AttributeError when running script.
- Fix IGP-003: Pass entity_type as keyword arg, not positional
- Fix IGP-004: Update --batch help to show valid types (token, project, user)
- Add retry_with_backoff() helper (3 retries, 2s base backoff)
- Apply to generate_image() for OpenRouter API
- Apply to fetch_coingecko_icon() for CoinGecko API
- Remove hardcoded HTTP-Referer for fork-friendliness

Aligns with CLAUDE.md guidance on retry logic for API calls.
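A helper of this kind might look roughly like the sketch below. The commit only states "3 retries, 2s base backoff", so the exponential doubling, the "one initial attempt plus up to three retries" reading, and the injectable `sleep` parameter are all assumptions rather than the PR's actual implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,      # from the commit message
    base_delay: float = 2.0,   # seconds, from the commit message
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests (assumption)
) -> T:
    """Call func(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(base_delay * 2 ** attempt)  # waits of 2s, 4s, 8s
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

In practice the `except` clause would likely be narrowed to network and rate-limit errors so that programming mistakes fail fast instead of being retried.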
- Add tiered confidence matching: exact → alias → suffix-stripped → word boundary
- Implement MIN_SUBSTRING_LENGTH (4 chars) to prevent short name false positives
- Add word boundary matching to prevent "Go" matching "Google"
- Include domain verification as bonus using Simple Icons source field
- Load full Simple Icons metadata (slug, title, source, aliases)
- Update README with reference image pipeline documentation
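The tier order above can be sketched as a single function that returns the first tier that matches. This is an illustrative reconstruction only: `match_tier`, the `SUFFIXES` tuple, and the exact normalization are hypothetical, while `MIN_SUBSTRING_LENGTH = 4` and the tier order come from the commit message. Domain verification is omitted.

```python
import re
from typing import Iterable, Optional

MIN_SUBSTRING_LENGTH = 4  # from the commit message

# Hypothetical suffix list; the PR does not show which suffixes it strips.
SUFFIXES = (" protocol", " labs", " network")


def match_tier(entity: str, title: str, aliases: Iterable[str] = ()) -> Optional[str]:
    """Return the strongest tier at which an icon title matches an entity name."""
    e = entity.strip().lower()
    t = title.strip().lower()
    if e == t:
        return "exact"
    if e in {a.strip().lower() for a in aliases}:
        return "alias"
    for suffix in SUFFIXES:
        if e.endswith(suffix) and e[: -len(suffix)] == t:
            return "suffix-stripped"
    # Short titles are skipped, and \b keeps "go" from matching inside "google".
    if len(t) >= MIN_SUBSTRING_LENGTH and re.search(rf"\b{re.escape(t)}\b", e):
        return "word-boundary"
    return None
```

Here `match_tier("Google", "Go")` returns `None` for two independent reasons: the title is shorter than four characters, and `\bgo\b` does not occur in `google`.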
…ledge into feat/asset-library-pipeline
- Add Entity Icons section to main README.md
- Delete README_ICON_GENERATION.md (content merged)
- Add --stats flag to validate-icons.py for coverage reporting
- Delete generate-asset-checklist.py (merged into validate-icons.py)
Allows saving coverage stats to a markdown file:

    python validate-icons.py --stats -o coverage.md
- Tokens: full checklist with [x]/[ ] checkboxes
- Projects/Users: summary with top 20 have/missing items
- Tokens: full checklist (manageable size)
- Projects/Users: just have/missing counts (too large to list)
- Add "Symbol/icon marks only (not wordmarks or text-based logos)"
- Add "Seamless cells with clean edges (no visible gridlines or borders)"
- Also sync manifest before showing --stats
- Add is_invalid_entity_name() to filter numbers, IDs, timestamps, prices
- Add --filter-only flag for fast code-based filtering (no LLM)
- Filtered 55 junk entities from manifest (3180 -> 3125)
- Token coverage improved from 19% to 96%
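A code-based filter for numbers, IDs, timestamps, and prices could be sketched as below. The PR's actual patterns are not shown, so every regular expression here is an assumption chosen to illustrate the categories named in the commit.

```python
import re

# Hypothetical patterns, one per junk category named in the commit message.
_JUNK_PATTERNS = [
    re.compile(r"\d[\d,._]*%?"),                  # bare numbers and percentages
    re.compile(r"[$€£]\s?\d[\d,.]*[kmb]?"),       # prices such as $1,200 or $1.5m
    re.compile(r"\d{4}-\d{2}-\d{2}(?:[t ].*)?"),  # ISO dates and timestamps
    re.compile(r"\d{1,2}:\d{2}(?::\d{2})?"),      # clock times
    re.compile(r"0x[0-9a-f]{8,}"),                # hex hashes and addresses
    re.compile(r"#\d+"),                          # issue-style IDs
]


def is_invalid_entity_name(name: str) -> bool:
    """Return True when the whole name is junk (empty, numeric, ID, date, price)."""
    candidate = name.strip().lower()
    if not candidate:
        return True
    return any(p.fullmatch(candidate) for p in _JUNK_PATTERNS)
```

Using `fullmatch` rather than `search` matters here: names that merely contain digits, such as `ai16z` or `1inch`, are kept.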
Add explicit "OMIT ENTIRELY" section to extraction prompt:
- Dates, times, timestamps
- Numbers, amounts, prices
- IDs, hashes, addresses
- Durations, sizes/measurements
- Generic phrases

This prevents junk extraction at the source, not just in post-processing.
Add SKIP_FIELDS to filter out metadata before feeding to LLM:
- source, url, date, briefing_date, number, status, author
- item_type, sentiment, extracted_at, schema_version

This prevents dates like "2025-12-20" and URLs from being parsed as entity-containing content.
Replace SKIP_FIELDS (blocklist) with CONTENT_FIELDS (allowlist):
- claim, title, description, significance, summary
- content, text, body, message, details, notes

Allowlist is more robust: new metadata fields won't leak through.
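To make the allowlist idea concrete, the following sketch collects only string values stored under the listed content fields. The field names come from the commit message; the `extract_content` helper, the recursive walk into nested containers, and the decision to ignore bare strings inside lists are assumptions.

```python
from typing import Any

# Field names listed in the commit message.
CONTENT_FIELDS = {
    "claim", "title", "description", "significance", "summary",
    "content", "text", "body", "message", "details", "notes",
}


def extract_content(record: Any) -> list[str]:
    """Collect string values whose key is in CONTENT_FIELDS, at any depth."""
    parts: list[str] = []

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(value, str):
                    if key in CONTENT_FIELDS:
                        parts.append(value)
                else:
                    walk(value)  # descend into nested dicts and lists
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(record)
    return parts
```

A field the script has never seen before is dropped by default, which is the robustness property the commit describes.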
Entity extraction:
- Reframe extraction prompt with iconifiability test
- Add --reclassify mode for batch LLM classification (200/batch)
- Simplify code filtering to minimal checks (let LLM decide)
- Reclassified manifest: 1917 keep, 1188 skip, 20 review

Icon generation:
- Add selfhst/icons library support via SELFHST_ICONS_PATH env var
- Reference pipeline: Simple Icons → selfhst/icons → GitHub → Favicon
Reference pipeline now:
1. Simple Icons (3300+ tech brands)
2. selfhst/icons (2300+ self-hosted apps)
3. gilbarbara/logos (2000+ SVG logos) <- NEW
4. GitHub avatars
5. Google Favicon

Configure via GILBARBARA_LOGOS_PATH env var.
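The ordered pipeline above is a first-match fallback chain, which can be sketched as follows. The helper names `local_library_source` and `first_reference` are hypothetical, and the recursive file search is an assumption, since the directory layouts of the icon libraries are not described in this PR. Only the environment variable names come from the commit messages.

```python
import os
from pathlib import Path
from typing import Callable, Optional, Sequence, Tuple

Lookup = Callable[[str], Optional[str]]


def local_library_source(env_var: str, exts: Sequence[str] = ("svg", "png")) -> Lookup:
    """Build a lookup over a local icon checkout whose path is read from env_var."""
    def lookup(slug: str) -> Optional[str]:
        root = os.environ.get(env_var)
        if not root:
            return None  # library not configured, fall through to the next source
        for ext in exts:
            for path in Path(root).rglob(f"{slug}.{ext}"):
                return str(path)
        return None
    return lookup


def first_reference(slug: str, sources: Sequence[Tuple[str, Lookup]]) -> Optional[Tuple[str, str]]:
    """Try each (name, lookup) pair in order and return the first hit."""
    for name, lookup in sources:
        result = lookup(slug)
        if result:
            return name, result
    return None
```

A caller would pass the sources in the documented order, for example `("selfhst/icons", local_library_source("SELFHST_ICONS_PATH"))` ahead of `("gilbarbara/logos", local_library_source("GILBARBARA_LOGOS_PATH"))`, with the network-based lookups last.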
Summary
Pipeline for extracting entities from daily content and sourcing visual assets (icons/logos).
- extract-entities.py: Added --normalize-only flag for LLM deduplication without re-extraction
- fetch-icons.py: CoinGecko integration with rate limiting (3s/req) and pre-scan for efficiency
- generate-asset-checklist.py: Coverage reporting with fuzzy containment matching

Current Coverage
Test plan
- python scripts/posters/fetch-icons.py --tokens: pre-scans existing, skips API calls
- python scripts/posters/generate-asset-checklist.py: generates coverage report
- python scripts/etl/extract-entities.py --normalize-only -i <inventory>: normalizes without re-extraction