
feat: CLI migration + progressive disclosure redesign for ultimate-scraper #33

Merged
vystrcild merged 13 commits into apify:main from lukas-bekr:feat/ultimate-scraper-cli-migration-and-workflow-upgrade on Apr 21, 2026

lukas-bekr (Contributor) commented Mar 30, 2026

Summary

Major upgrade to the apify-ultimate-scraper skill: migrates from REST API scripts to Apify CLI, restructures the information architecture using progressive disclosure, and enriches all workflow guides with 58 research-backed data pipeline patterns.

Phase 1: CLI migration

  • Replaced 3 Node.js scripts (search_actors.js, run_actor.js, fetch_actor_details.js) with CLI commands (apify actors call --json, actors search, actors info, datasets get-items)
  • --json output as stable API contract - immune to upcoming CLI UI changes (Markdown default, colors)
  • OAuth-first authentication (apify login) with env var fallback. Fixed a security contradiction in the actorization skill (it was using apify login -t, which exposes tokens in shell history; now aligned with PR #31, "fix: migrate security fixes to actorization skill")

Phase 2: Progressive disclosure restructure

  • Replaced monolithic 400-line Actor index with hub-and-spoke architecture
  • SKILL.md (~109 lines) routes to lean actor-index (206 lines) + 14 workflow guides + gotchas (108 lines)
  • Simple task ("scrape Nike's Instagram") loads ~300 lines. Complex pipeline loads ~500. Neither loads the other 13 guides.

Phase 3: Research-driven workflow enrichment

  • 4-workstream research: Notion internal use cases + AI research (Perplexity/Gemini/ChatGPT) + n8n template library scraping (85+ templates, 26 use Apify) + social media scraping
  • 58 distinct workflow patterns mapped to Apify Actors, ranked by cross-source frequency
  • Every workflow guide now has 4-6 pipelines with explicit Actor chaining, data piping (results[].website -> startUrls), PPE cost estimates, and gotchas
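
A minimal sketch of the data-piping step mentioned above (results[].website -> startUrls), assuming the first Actor's dataset items are plain JSON objects and the second Actor accepts a startUrls list of {"url": ...} objects. The helper name to_start_urls and the sample items are hypothetical; check the target Actor's input schema before relying on this shape.

```python
import json
from typing import Any


def to_start_urls(results: list[dict[str, Any]], field: str = "website") -> dict[str, Any]:
    """Build the next Actor's input from the previous Actor's dataset items."""
    seen: set[str] = set()
    start_urls: list[dict[str, str]] = []
    for item in results:
        raw = item.get(field)
        if not isinstance(raw, str):
            continue  # item has no usable website field
        url = raw.strip()
        if not url or url in seen:
            continue  # skip blanks and duplicates so each site is crawled once
        seen.add(url)
        start_urls.append({"url": url})
    # Assumption: the downstream Actor reads its URLs from "startUrls".
    return {"startUrls": start_urls}


# Hypothetical dataset items from a first Actor run.
results = [
    {"title": "Cafe A", "website": "https://a.example"},
    {"title": "Cafe B"},
    {"title": "Cafe C", "website": "https://a.example"},
    {"title": "Cafe D", "website": " https://d.example "},
]
next_input = to_start_urls(results)
print(json.dumps(next_input))
```

The resulting JSON string is what would be passed as the input of the second Actor call in the chain.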

Phase 4: New content

  • 4 new workflow categories: e-commerce price monitoring, contact enrichment, knowledge base/RAG, company research (covers 5,000+ Store Actors with previously zero workflow coverage)
  • Enriched gotchas with anti-bot guidance (Cloudflare, SPA, fingerprinting), platform rate limits, cost estimation protocols

By the numbers

  • 17 files, 1,597 lines (was 13 files, 782 lines)
  • Token budget for simple tasks: ~300 lines (unchanged, progressive disclosure)
  • 14 workflow guides with 4-6 pipelines each (was 10 with 1-4 each)
  • Design principles: Anthropic's "Lessons from Building Skills" - skip the obvious, gotchas are highest-signal, hub-and-spoke progressive disclosure, don't railroad

Scope

  • apify-ultimate-scraper skill only (full rewrite)
  • apify-actorization auth fix (aligned with PR #31, "fix: migrate security fixes to actorization skill")
  • apify-actor-development minor auth alignment (OAuth-first)
  • commands/create-actor.md auth alignment
  • Did NOT touch developer skill content (actor-development, actorization workflows) - Patrik's territory

lukas-bekr and others added 11 commits March 30, 2026 14:29
- Standardize auth to OAuth-first across all skills
- Fix security contradiction in actorization (remove -t flag)
- Delete legacy Node.js scripts (replaced by CLI commands)
- Bump version to 2.0.0
- Add design spec and implementation plan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove error handling table (moving non-obvious errors to gotchas.md),
add 4 new routing rows for e-commerce, contact enrichment, knowledge base/RAG,
and company research, and replace error section with a brief troubleshooting pointer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…low guides

Added 7 new pipelines across 3 files from combined-patterns research:
- brand-monitoring: Twitter/X real-time mention routing (P16), Reddit brand monitoring (P17), multi-platform social listening with sentiment (P18)
- review-analysis: competitor review intelligence (P21), Google Play app review monitoring (P22), multi-platform hospitality aggregation (P20)
- content-and-seo: SERP content brief generation (P23), sitemap content audit (P24), keyword rank tracking with alerts (P26), deep research agent (P54)

All pipelines include explicit pipe field paths, PPE cost estimates where applicable, and non-obvious gotchas only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…with research patterns

Added 3 new pipelines to lead-generation.md (Sales Navigator bulk, SERP discovery, Apollo icebreakers, Reddit lead mining), 3 to competitive-intel.md (website change detection, SERP position monitoring, feature benchmarking), and 3 to influencer-vetting.md (TikTok creator vetting, YouTube channel audit, cross-platform hashtag discovery). All entries include explicit field paths, cost estimates for PPE Actors, and per-pipeline gotchas.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…flow guides

Add 2 pipelines to each guide from research patterns: Instagram competitor
analysis + LinkedIn company page analytics (social); Reddit trend mining +
YouTube outlier discovery (trend); sales signal outreach + Upwork monitoring
(jobs); lead scoring/routing + construction discovery (real estate).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…esearch)

Adds workflow reference guides for the 4 new categories identified in combined-patterns.md research: e-commerce price monitoring (patterns 45-49), contact enrichment (50-52), knowledge base and RAG pipelines (53-55), and company research (56-58). Each guide follows the existing format with When/Pipeline/Output fields/Cost estimate/Gotcha sections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vystrcild (Collaborator) commented

A few issues that need to be fixed:
1 - Delete your docs/superpowers folder
2 - Unnecessary auth check before every run

The skill instructs to run apify info as an authentication check before every Actor call. This is wasteful — if auth is missing, apify actors call will fail with a clear error. The check should be removed from the workflow and auth should only be handled reactively on failure.

In other words, you don't want to run this every time when I'm already logged in.
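
One way to picture the reactive approach is the sketch below: run the command first, and only touch authentication if the failure looks auth-related. Everything here is illustrative. The names call_with_reactive_auth, fake_runner and fake_fix_auth are hypothetical, and the "not logged in" message is a stand-in, not the CLI's actual error text.

```python
import subprocess
from typing import Callable, Sequence

Result = subprocess.CompletedProcess


def call_with_reactive_auth(
    cmd: Sequence[str],
    runner: Callable[[Sequence[str]], Result],
    is_auth_error: Callable[[Result], bool],
    fix_auth: Callable[[], bool],
) -> Result:
    """Run cmd first; handle auth only if the failure looks auth-related."""
    result = runner(cmd)
    if result.returncode == 0 or not is_auth_error(result):
        return result  # success, or a failure that auth cannot fix
    if not fix_auth():
        return result  # could not authenticate; surface the original error
    return runner(cmd)  # retry once with fresh credentials


# Stubbed runner so the sketch runs without the CLI installed.
state = {"logged_in": False, "calls": 0}


def fake_runner(cmd: Sequence[str]) -> Result:
    state["calls"] += 1
    if state["logged_in"]:
        return subprocess.CompletedProcess(cmd, 0, stdout="{}", stderr="")
    return subprocess.CompletedProcess(cmd, 1, stdout="", stderr="not logged in")


def fake_fix_auth() -> bool:
    state["logged_in"] = True  # stands in for a login or token setup step
    return True


result = call_with_reactive_auth(
    ["apify", "actors", "call", "ACTOR_ID", "--json"],
    runner=fake_runner,
    is_auth_error=lambda r: "not logged in" in r.stderr,  # hypothetical message
    fix_auth=fake_fix_auth,
)
print(result.returncode, state["calls"])
```

When the user is already logged in, the command runs exactly once and no auth step is executed, which is the behavior requested in this point.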

3 - Missing stderr redirect causes JSON parsing failures

The skill says to always pass --json to CLI commands, but doesn't mention that apify actors call --json writes progress messages to stderr. When the output is piped to a JSON parser, stderr and stdout get mixed, producing invalid JSON. This caused JSONDecodeError during our test run.

Fix: All CLI command examples that are meant to be parsed programmatically should include 2>/dev/null. For example:

apify actors call "ACTOR_ID" -i 'JSON_INPUT' --json 2>/dev/null

Alternatively, add a global note to the existing rule on line 10:
Rule: Always pass --json and 2>/dev/null to CLI commands. JSON output is stable across CLI versions. Never parse human-readable output.

This applies to all commands where JSON output is consumed: apify actors call, apify actors info, apify runs info, apify datasets get-items, etc.
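
For callers that invoke the CLI from Python rather than a shell pipe, the same separation can be sketched with subprocess: capture stdout only and discard stderr, which has the same effect as 2>/dev/null. The helper name run_json is hypothetical, and the demo uses a stand-in command (a tiny Python one-liner) instead of the real CLI so it runs anywhere; its output shape is not the CLI's real output.

```python
import json
import subprocess
import sys


def run_json(cmd: list[str]):
    """Run a CLI command and parse stdout as JSON, keeping stderr out of the parse."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,     # only stdout is handed to the JSON parser
        stderr=subprocess.DEVNULL,  # same effect as 2>/dev/null
        text=True,
        check=True,                 # raise CalledProcessError on non-zero exit
    )
    return json.loads(proc.stdout)


# Stand-in for a CLI that logs progress to stderr and prints JSON to stdout.
fake_cli = [
    sys.executable,
    "-c",
    "import sys; print('Run started...', file=sys.stderr); "
    "print('{\"status\": \"SUCCEEDED\"}')",
]
result = run_json(fake_cli)
print(result["status"])

# With the real CLI the call would look like the example above:
# run_json(["apify", "actors", "call", "ACTOR_ID", "-i", "JSON_INPUT", "--json"])
```

If stderr were merged into stdout instead, the progress line would precede the JSON and json.loads would raise JSONDecodeError, which matches the failure described in this point.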

4 - Pricing is not working. I already tried to do this in a previous version of these skills and never got exact, correct costs. E.g. the estimates I got were 4x lower than the real costs. That's just confusing for users.


Just a few notes:

  • I still haven't tried many cases; these were just the first obvious issues
  • We need to be clear that the CLI team will test these skills with every update of the Apify CLI
  • I still believe that we should have two versions of this skill: one for CLI and one for API. Reason: you can't install the CLI everywhere (some VMs, CI runners, sandboxes where you don't have rights, etc.). OAuth needs a browser, so login will not work in a headless VM or a container without a GUI, although this should be fixed by the fallback to a .env file (btw I see that you're reading APIFY_TOKEN from an env var in the shell and not from a .env file, so that should be added too).
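
A minimal sketch of the fallback described in the last note, assuming the shell environment variable takes precedence and a .env file in the working directory is consulted only when APIFY_TOKEN is not set. The function names read_dotenv and resolve_token, the precedence order, and the placeholder token values are assumptions for illustration; the parser handles only simple KEY=value lines.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def read_dotenv(path) -> dict:
    """Parse simple KEY=value lines; ignores blanks, comments and malformed lines."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()  # tolerate "export KEY=value"
        values[key] = value.strip().strip("'\"")  # drop surrounding quotes
    return values


def resolve_token(env: Optional[Mapping[str, str]] = None, dotenv_path=".env") -> Optional[str]:
    """Return APIFY_TOKEN from the environment, else from the .env file, else None."""
    env = os.environ if env is None else env
    return env.get("APIFY_TOKEN") or read_dotenv(dotenv_path).get("APIFY_TOKEN")


# Demo with a temporary .env file and placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text('# local secrets\nexport APIFY_TOKEN="placeholder-token"\n', encoding="utf-8")
    from_file = resolve_token(env={}, dotenv_path=env_file)
    from_shell = resolve_token(env={"APIFY_TOKEN": "shell-token"}, dotenv_path=env_file)
    missing = resolve_token(env={}, dotenv_path=Path(tmp) / "absent.env")
print(from_file, from_shell, missing)
```

This covers the headless case without a browser: a container or CI runner only needs the variable or the file to be present.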

lukas-bekr and others added 2 commits April 14, 2026 15:27
1. Delete docs/superpowers/ (specs/plans don't belong in repo)
2. Remove pre-run auth check (apify info) - handle auth reactively on failure
   instead of checking before every run. Added .env file sourcing as auth option.
3. Add 2>/dev/null to all CLI command examples in SKILL.md to prevent stderr
   mixing with JSON output (causes JSONDecodeError in parsers)
4. Strip all dollar-amount cost estimates from workflow guides (were 4x
   inaccurate in testing). Keep pricing model awareness (FREE/PPE/FLAT)
   in gotchas.md but without specific amounts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore all cost estimates removed in the previous commit. Add a
mandatory disclaimer to gotchas.md cost estimation protocol: agents
must always present estimates as rough guidance with a warning that
actual costs can vary significantly. This addresses the accuracy
concern while keeping the estimates useful as rough signals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lukas-bekr (Contributor, Author) commented

Hey @vystrcild, thanks for the thorough review. All points addressed in 2 commits:

Fixed:

  1. docs/superpowers/ deleted
  2. Removed apify info pre-check - auth is now reactive (only on failure). Added .env file sourcing as auth option alongside env var and OAuth
  3. Added 2>/dev/null to all CLI command examples in SKILL.md. Updated the global rule to: "Always pass --json and 2>/dev/null to CLI commands"
  4. Cost estimates restored but with a mandatory disclaimer - agents must always caveat estimates as rough guidance that can vary significantly

On your strategic notes:

  • CLI team testing: agreed, let's discuss with Patrik how to integrate skill testing into the CLI release process
  • CLI vs API versions: parking for now, the env var + .env file fallback covers most headless scenarios

Jakub-Vacek commented

I like the approach, OAuth in CLI is a pretty cool unlock.

Maybe one suggestion would be to use Markdown links for referencing other files (like when SKILL.md references some file from /references). This way you can use an extension to check the validity of the links, so you are not pointing to non-existent files when modifying skills/agents/commands.
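
As a rough illustration of what such a check does, the sketch below scans a Markdown file for relative links and reports targets that do not exist on disk. It is not any particular extension's implementation; the function name broken_links and the file names in the demo (references/gotchas.md, references/missing.md) are hypothetical, and the regex covers only plain inline links.

```python
import re
import tempfile
from pathlib import Path

# Matches inline links like [text](target); reference-style links are not covered.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_links(md_file: Path) -> list:
    """Return relative link targets in md_file that do not exist on disk."""
    broken = []
    text = md_file.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL or in-page anchor, nothing to check on disk
        path_part = target.split("#", 1)[0]  # drop any "#section" suffix
        if not (md_file.parent / path_part).exists():
            broken.append(target)
    return broken


# Demo in a temporary directory with one valid and one dangling link.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "references").mkdir()
    (root / "references" / "gotchas.md").write_text("# Gotchas\n", encoding="utf-8")
    skill = root / "SKILL.md"
    skill.write_text(
        "See [gotchas](references/gotchas.md), [pricing](references/missing.md) "
        "and [docs](https://example.com).\n",
        encoding="utf-8",
    )
    broken = broken_links(skill)
print(broken)
```

A check like this could run in CI over every skill file, so a renamed or deleted reference file is caught before merge.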

And I have one high-level question (not really blocking this PR): Would it make sense in some cases to use MCP instead of CLI?

I think the biggest difference between CLI & MCP is that CLI can do some of the calls "for free"/without auth, which is super useful in the discovery phase. I am wondering how we can reuse this skill on platforms where it is not possible to install the CLI (non-technical platforms). I would say this leads to a version of this skill which uses MCP instead of tools. This version would make sense for platforms aimed at a less technical audience.

Generally there are at least 3 approaches to serve Apify:

  1. Native connector (Strands, LangChain, Claude Desktop, N8N) => great UX, needs to be developed = high investment/low portability
  2. MCP + skills (and other MD AI files) => improving UX (MCP auth), improving portability (skills standard, likely incoming plugin standards). Low investment/high portability
  3. CLI + skills (and other MD AI files) => UX depends on the CLI (and how the platform will handle it), improving portability (skills standard, likely incoming plugin standards). Low investment/high portability

Ideally we should have all of these, but maybe I just don't have enough of your context :)

@vystrcild vystrcild merged commit 2227b17 into apify:main Apr 21, 2026
2 checks passed

4 participants