
feat: asset library pipeline for entity icons#26

Merged
madjin merged 31 commits into main from feat/asset-library-pipeline on Dec 30, 2025

Conversation

@madjin
Contributor

madjin commented Dec 21, 2025

Summary

Pipeline for extracting entities from daily content and sourcing visual assets (icons/logos).

  • extract-entities.py: Added --normalize-only flag for LLM deduplication without re-extraction
  • fetch-icons.py: CoinGecko integration with rate limiting (3s/req) and pre-scan for efficiency
  • generate-asset-checklist.py: Coverage reporting with fuzzy containment matching
  • 200+ icons across tokens/plugins categories
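
The pre-scan and rate-limiting behaviour described for fetch-icons.py could look roughly like the sketch below. It is not the script's actual code: `prescan_existing`, `RateLimiter`, and `fetch_missing` are hypothetical names, and `fetch_one` stands in for whatever function performs the CoinGecko request. Only the 3-second interval comes from the summary above.

```python
import time
from pathlib import Path


def prescan_existing(icon_dir: Path) -> set[str]:
    """Return lowercase stems of icon files already on disk."""
    if not icon_dir.is_dir():
        return set()
    return {p.stem.lower() for p in icon_dir.iterdir() if p.is_file()}


class RateLimiter:
    """Enforce a minimum interval between consecutive calls (hypothetical helper)."""

    def __init__(self, min_interval: float = 3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


def fetch_missing(symbols, icon_dir: Path, fetch_one, limiter: RateLimiter) -> list[str]:
    """Call fetch_one only for symbols that have no icon on disk yet."""
    have = prescan_existing(icon_dir)
    fetched = []
    for symbol in symbols:
        if symbol.lower() in have:
            continue  # pre-scan hit: no API call needed
        limiter.wait()
        fetch_one(symbol)
        fetched.append(symbol)
    return fetched
```

Scanning the directory once up front means symbols that already have an icon never consume a slot in the rate-limited request budget.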

Current Coverage

| Category | Coverage |
| --- | --- |
| Tokens | 20% (19/96) |
| Platforms | 17% (33/189) |
| Tech | 11% (18/157) |
| Projects | 14% (34/244) |
| Plugins | 30% (53/175) |
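
For reference, the percentages above are consistent with rounding have/total to the nearest whole percent. A minimal sketch of that arithmetic follows; `coverage_line` is a hypothetical helper, not a function from generate-asset-checklist.py.

```python
def coverage_line(category: str, have: int, total: int) -> str:
    """Format one coverage row as 'Category NN% (have/total)'."""
    pct = round(100 * have / total) if total else 0  # guard against empty categories
    return f"{category} {pct}% ({have}/{total})"


rows = [("Tokens", 19, 96), ("Platforms", 33, 189), ("Tech", 18, 157),
        ("Projects", 34, 244), ("Plugins", 53, 175)]
for category, have, total in rows:
    print(coverage_line(category, have, total))
```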

Test plan

  • python scripts/posters/fetch-icons.py --tokens: pre-scans existing icons, skips API calls
  • python scripts/posters/generate-asset-checklist.py: generates coverage report
  • python scripts/etl/extract-entities.py --normalize-only -i <inventory>: normalizes without re-extraction

🤖 Generated with Claude Code

- extract-entities.py: add --normalize-only flag for LLM deduplication
- fetch-icons.py: CoinGecko integration with rate limiting and pre-scan
- generate-asset-checklist.py: coverage reporting with fuzzy matching
- assets/: entity inventory (1143 entities) and 200+ icons

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Dec 21, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


madjin and others added 11 commits December 24, 2025 14:33
- Add fuzzy name matching (0.85 threshold) to merge typo variants
  (e.g., jintern/jinintern, ai16z/ai16)
- Add --dedupe flag for post-processing existing inventory
- Add --since flag for incremental extraction (CI/CD efficiency)
- Add resolve_type_conflicts() with batch LLM arbitration
- Add token cost tracking in metadata
- Constrain entity types to: token, platform, project, user
- Add type normalization (person→user, company→project)
- Remove visual_hints from extraction (not actionable)

Results: 2,320 → 1,928 entities (-17% deduplication)
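
One way the 0.85-threshold fuzzy merge could work is sketched below using `difflib.SequenceMatcher` from the standard library. Whether extract-entities.py uses difflib, and how it picks the canonical name, is not stated in the commit, so `similarity` and `merge_variants` are assumptions for illustration only.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # threshold quoted in the commit message


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_variants(names: list[str], threshold: float = FUZZY_THRESHOLD) -> dict[str, list[str]]:
    """Greedily group each name under the first canonical name it resembles.

    Assumption: the first spelling seen becomes the canonical one.
    """
    groups: dict[str, list[str]] = {}
    for name in names:
        for canonical in groups:
            if similarity(name, canonical) >= threshold:
                groups[canonical].append(name)
                break
        else:
            groups[name] = [name]  # no close match: start a new group
    return groups


print(merge_variants(["jintern", "jinintern", "ai16z", "ai16", "eliza"]))
```

Both example pairs from the commit message clear the threshold under this metric (0.875 for jintern/jinintern, about 0.889 for ai16z/ai16), while unrelated names stay in their own groups.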

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ENTITIES filter

- Rename entity-inventory.json → manifest.json for clarity
- Update all script references to use manifest.json
- Add SKIP_ENTITIES filter to extract-entities.py to skip generic terms
  (crypto, token, nft, blockchain, agent, etc.)
- Remove fetch-icons.py (already merged into generate-icons.py)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename all icons from {name}.ext to {name}-1.ext
- Remove legacy file handling code from generate-icons.py
- Simplify get_next_icon_filename() to only handle numbered files
- Delete obsolete/duplicate icon files

This allows multiple icons per entity (for artist reference/moodboard)
with a consistent naming convention: {base}-{n}.{ext}
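
A numbered-only filename helper along these lines would satisfy the {base}-{n}.{ext} convention. The function name comes from the commit message, but the signature and body here are a guess, not the code in generate-icons.py.

```python
import re
from pathlib import Path


def get_next_icon_filename(icon_dir: Path, base: str, ext: str) -> str:
    """Return '{base}-{n}.{ext}' where n is one past the highest existing number."""
    # Match any extension so solana-1.png and solana-2.svg share one counter.
    pattern = re.compile(rf"^{re.escape(base)}-(\d+)\.[A-Za-z0-9]+$")
    numbers = []
    if icon_dir.is_dir():
        for path in icon_dir.iterdir():
            match = pattern.match(path.name)
            if match:
                numbers.append(int(match.group(1)))
    return f"{base}-{max(numbers, default=0) + 1}.{ext}"
```

Anchoring the pattern on the full base name keeps an entity such as `sol` from picking up the counter of `solana-1.png`.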

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Merge entity types from 4 to 3 (platform → project)
- Replace resolve_type_conflicts() with classify_entities()
- Single LLM call classifies type + status (keep/skip/review)
- generate-icons.py filters by status=keep
- Clean up metadata fields in --dedupe output

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@madjin
Contributor Author

madjin commented Dec 29, 2025

Code review

Found 1 issue:

  1. Schema mismatch causes immediate runtime failure - load_entities() expects the old dict-based schema {"entities": {"category": [items]}} but manifest.json now uses a flat list schema {"entities": [{"name": "...", "type": "..."}]}. This causes AttributeError: 'list' object has no attribute 'items' when the script runs.

result = {}
for category, items in data.get("entities", {}).items():
    # Dedupe by lowercase, keep original for display
    seen = {}
    for item in items:
        key = item.lower()
        if key not in seen:
            seen[key] = item
    result[category] = seen
return result

The code at line 72 calls .items() on data.get("entities", {}), but after the refactor from entity-inventory.json to manifest.json, entities is now a list, not a dict.
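
A possible shape for the fix is sketched below: group the flat list by its `type` field while still tolerating the old dict schema. This is an illustration only. It takes the already-parsed JSON as an argument (the real load_entities() may read the file itself), and the `"unknown"` fallback type is an assumption.

```python
from collections import defaultdict


def load_entities(data: dict) -> dict[str, dict[str, str]]:
    """Group entity names by type, deduplicating case-insensitively."""
    entities = data.get("entities", [])
    result: dict[str, dict[str, str]] = defaultdict(dict)
    if isinstance(entities, dict):
        # Old schema: {"entities": {"category": ["name", ...]}}
        pairs = ((cat, name) for cat, items in entities.items() for name in items)
    else:
        # New schema: {"entities": [{"name": "...", "type": "..."}]}
        pairs = ((e.get("type", "unknown"), e["name"]) for e in entities)
    for category, name in pairs:
        # Keep the first spelling seen for display, keyed by lowercase name.
        result[category].setdefault(name.lower(), name)
    return dict(result)
```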

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

madjin and others added 15 commits December 29, 2025 00:35
- Updated load_entities() to handle flat list schema with type field
- Updated get_all_assets() for flat icon directory structure
- Updated generate_checklist() for new entity types (token, project, user)

Fixes schema mismatch that caused AttributeError when running script.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix IGP-003: Pass entity_type as keyword arg, not positional
- Fix IGP-004: Update --batch help to show valid types (token, project, user)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add retry_with_backoff() helper (3 retries, 2s base backoff)
- Apply to generate_image() for OpenRouter API
- Apply to fetch_coingecko_icon() for CoinGecko API
- Remove hardcoded HTTP-Referer for fork-friendliness

Aligns with CLAUDE.md guidance on retry logic for API calls.
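
A retry helper matching the "3 retries, 2s base backoff" description might look like the following. The commit does not say whether the delay grows exponentially or which exceptions are retried, so the doubling schedule, the broad `except Exception`, and the injectable `sleep` parameter are all assumptions.

```python
import time


def retry_with_backoff(func, retries: int = 3, base_delay: float = 2.0, sleep=time.sleep):
    """Call func(); on failure, retry up to `retries` times with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s with the defaults
```

Usage would be along the lines of `retry_with_backoff(lambda: fetch_coingecko_icon(symbol))`, where the argument list of `fetch_coingecko_icon` is assumed here.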

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add tiered confidence matching: exact → alias → suffix-stripped → word boundary
- Implement MIN_SUBSTRING_LENGTH (4 chars) to prevent short name false positives
- Add word boundary matching to prevent "Go" matching "Google"
- Include domain verification as bonus using Simple Icons source field
- Load full Simple Icons metadata (slug, title, source, aliases)
- Update README with reference image pipeline documentation
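
The tiered matching described in this commit can be sketched as below. The tier order and `MIN_SUBSTRING_LENGTH` come from the commit message; the `SUFFIXES` list, the `match_tier` name, and its signature are hypothetical, and the domain-verification bonus is left out.

```python
import re
from typing import Optional

MIN_SUBSTRING_LENGTH = 4  # from the commit message
SUFFIXES = (" protocol", " network", " labs", " ai")  # hypothetical suffix list


def match_tier(entity: str, title: str, aliases: tuple = ()) -> Optional[str]:
    """Return the strongest tier at which an entity name matches an icon title."""
    e, t = entity.strip().lower(), title.strip().lower()
    if e == t:
        return "exact"
    if e in {a.lower() for a in aliases}:
        return "alias"
    stripped = e
    for suffix in SUFFIXES:
        if stripped.endswith(suffix):
            stripped = stripped[: -len(suffix)]
            break
    if stripped != e and stripped == t:
        return "suffix-stripped"
    # Short titles are never matched as substrings, and longer ones must
    # appear as a whole word inside the entity name.
    if len(t) >= MIN_SUBSTRING_LENGTH and re.search(rf"\b{re.escape(t)}\b", e):
        return "word-boundary"
    return None


print(match_tier("Google", "Go"))            # None: too short, and not a whole word
print(match_tier("Uniswap Protocol", "Uniswap"))
```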

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add Entity Icons section to main README.md
- Delete README_ICON_GENERATION.md (content merged)
- Add --stats flag to validate-icons.py for coverage reporting
- Delete generate-asset-checklist.py (merged into validate-icons.py)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Allows saving coverage stats to a markdown file:
  python validate-icons.py --stats -o coverage.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Tokens: full checklist with [x]/[ ] checkboxes
- Projects/Users: summary with top 20 have/missing items

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Tokens: full checklist (manageable size)
- Projects/Users: just have/missing counts (too large to list)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add "Symbol/icon marks only (not wordmarks or text-based logos)"
- Add "Seamless cells with clean edges (no visible gridlines or borders)"
- Also sync manifest before showing --stats

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add is_invalid_entity_name() to filter numbers, IDs, timestamps, prices
- Add --filter-only flag for fast code-based filtering (no LLM)
- Filtered 55 junk entities from manifest (3180 -> 3125)
- Token coverage improved from 19% to 96%
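
A code-only filter of the kind `--filter-only` implies could be built from a handful of regular expressions. The function name is from the commit; the specific patterns below are illustrative guesses at "numbers, IDs, timestamps, prices", not the ones in extract-entities.py.

```python
import re

# Hypothetical junk patterns; each one targets a class named in the commit.
_JUNK_PATTERNS = [
    re.compile(r"^[\d\s.,]+%?$"),                                   # plain numbers, percentages
    re.compile(r"^[$€£]\s?[\d.,]+[kKmMbB]?$"),                      # prices such as $1,200 or $3.5M
    re.compile(r"^\d{4}-\d{2}-\d{2}([T ]\d{2}:\d{2}(:\d{2})?)?"),   # ISO dates and timestamps
    re.compile(r"^#?\d+$"),                                         # numeric IDs such as #26
    re.compile(r"^(0x)?[0-9a-fA-F]{16,}$"),                         # long hex hashes or addresses
]


def is_invalid_entity_name(name: str) -> bool:
    """Return True for names that are empty or look like numbers, IDs, dates, or prices."""
    name = name.strip()
    if not name:
        return True
    return any(pattern.search(name) for pattern in _JUNK_PATTERNS)
```

Names that merely contain digits, such as ai16z, pass through because no pattern matches the whole string.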

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add explicit "OMIT ENTIRELY" section to extraction prompt:
- Dates, times, timestamps
- Numbers, amounts, prices
- IDs, hashes, addresses
- Durations, sizes/measurements
- Generic phrases

This prevents junk extraction at the source, not just post-processing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add SKIP_FIELDS to filter out metadata before feeding to LLM:
- source, url, date, briefing_date, number, status, author
- item_type, sentiment, extracted_at, schema_version

This prevents dates like "2025-12-20" and URLs from being parsed
as entity-containing content.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
madjin and others added 4 commits December 29, 2025 19:13
Replace SKIP_FIELDS (blocklist) with CONTENT_FIELDS (allowlist):
- claim, title, description, significance, summary
- content, text, body, message, details, notes

Allowlist is more robust - new metadata fields won't leak through.
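
The allowlist approach can be illustrated with a small recursive walk that only picks up strings stored under the listed keys. The field names are the ones in the commit message; `collect_text` and the sample record are hypothetical, and the traversal may differ from what extract-entities.py does.

```python
CONTENT_FIELDS = {
    "claim", "title", "description", "significance", "summary",
    "content", "text", "body", "message", "details", "notes",
}


def collect_text(node) -> list:
    """Recursively gather string values stored under allowlisted keys."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, str):
                if key in CONTENT_FIELDS:
                    found.append(value)  # metadata strings are silently dropped
            else:
                found.extend(collect_text(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_text(item))
    return found


record = {
    "title": "ElizaOS ships plugins",
    "date": "2025-12-20",
    "url": "https://example.com/item",
    "items": [{"claim": "ai16z rebrands", "author": "someone"}],
}
print(collect_text(record))
```

A field added later (say, a new timestamp key) is ignored by default, which is the robustness argument made above.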

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entity extraction:
- Reframe extraction prompt with iconifiability test
- Add --reclassify mode for batch LLM classification (200/batch)
- Simplify code filtering to minimal checks (let LLM decide)
- Reclassified manifest: 1917 keep, 1188 skip, 20 review

Icon generation:
- Add selfhst/icons library support via SELFHST_ICONS_PATH env var
- Reference pipeline: Simple Icons → selfhst/icons → GitHub → Favicon

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reference pipeline now:
1. Simple Icons (3300+ tech brands)
2. selfhst/icons (2300+ self-hosted apps)
3. gilbarbara/logos (2000+ SVG logos) <- NEW
4. GitHub avatars
5. Google Favicon

Configure via GILBARBARA_LOGOS_PATH env var.
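
The ordered fallback can be expressed as a list of resolvers tried in sequence, stopping at the first hit. This is a structural sketch only: `find_reference` and the dict-backed resolvers are stand-ins, and none of them reflect how generate-icons.py actually queries each source.

```python
from typing import Callable, Optional

Resolver = Callable[[str], Optional[str]]


def find_reference(name: str, resolvers: list) -> Optional[tuple]:
    """Try each (source, resolver) pair in order; return the first (source, path) hit."""
    for source, resolve in resolvers:
        result = resolve(name)
        if result is not None:
            return source, result
    return None


# Stand-in resolvers backed by dicts; real ones would look up local icon
# libraries or call remote services.
simple_icons = {"github": "simple-icons/github.svg"}
selfhst = {"jellyfin": "selfhst/jellyfin.png"}
gilbarbara = {"vercel": "logos/vercel.svg"}

pipeline = [
    ("Simple Icons", simple_icons.get),
    ("selfhst/icons", selfhst.get),
    ("gilbarbara/logos", gilbarbara.get),
]

print(find_reference("vercel", pipeline))
print(find_reference("unknown-entity", pipeline))
```

Keeping the sources in a list makes the priority order explicit and lets a new library be slotted in at a chosen position, as this commit does with gilbarbara/logos.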

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@madjin madjin merged commit 31b0d0a into main Dec 30, 2025
1 check passed