feat: crewai deploy validate + defer native LLM client construction#5412

Open
greysonlalonde wants to merge 15 commits into main from feat/cli-predeploy-validation

Conversation


@greysonlalonde (Contributor) commented Apr 10, 2026

Summary

Two related changes that together prevent the most common category of deploy-time failures:

  • fix (851df79) — all native LLM providers now build their SDK clients lazily. Instantiating LLM(model="openai/gpt-4o-mini") at module scope (e.g. as a class default on a @crew method) previously crashed during crew-metadata extraction with ImportError: Error importing native provider: 1 validation error... whenever the API key wasn't in os.environ at import time. _init_clients now wraps eager construction in try/except ValueError, and the _get_sync_client / _get_async_client helpers build clients on first use, re-raising the descriptive X_API_KEY is required error only at real call time.
  • feat (9f61deb) — new crewai deploy validate command that runs 9 local checks against the project before it contacts the deploy API. crewai deploy create and crewai deploy push run the same checks automatically and abort on errors; --skip-validate opts out.
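The deferred-construction pattern described in the first bullet can be sketched as follows; `LazyProvider`, `_FakeClient`, and `EXAMPLE_API_KEY` are stand-ins for illustration, not the actual crewAI classes:

```python
import os


class _FakeClient:
    """Stand-in for a real SDK client (hypothetical for this sketch)."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


class LazyProvider:
    """Sketch of deferred construction: __init__ never raises on missing
    credentials; the descriptive error surfaces on first real use."""

    def __init__(self) -> None:
        self._sync_client = None
        self._init_clients()

    def _init_clients(self) -> None:
        # Eager construction is still attempted, but a missing key no
        # longer crashes instantiation (previously an ImportError during
        # crew-metadata extraction).
        try:
            self._sync_client = self._build_sync_client()
        except ValueError:
            self._sync_client = None  # defer until first real call

    def _build_sync_client(self) -> _FakeClient:
        api_key = os.environ.get("EXAMPLE_API_KEY")
        if not api_key:
            raise ValueError("EXAMPLE_API_KEY is required")
        return _FakeClient(api_key)

    def _get_sync_client(self) -> _FakeClient:
        # Build on first use; the descriptive ValueError is re-raised
        # here, at real call time, rather than at construction.
        if self._sync_client is None:
            self._sync_client = self._build_sync_client()
        return self._sync_client
```

Construction with no key set succeeds; only `_get_sync_client()` raises, and it succeeds once the env var appears.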

Checks

  1. pyproject.toml with [project].name
  2. lockfile (uv.lock / poetry.lock) present and not stale
  3. src/<package>/ resolves (catches empty package names and .egg-info misdetection)
  4. crew.py, config/agents.yaml, config/tasks.yaml for standard crews
  5. main.py for flows
  6. hatchling wheel target resolves
  7. crew/flow module imports cleanly in a uv run subprocess with structured classification of failures (missing provider extras, missing API keys at import, stale crewai pins, pydantic errors)
  8. env vars referenced in source vs .env (warning only)
  9. crewai lockfile pin vs a known-bad cutoff (warning only)

Each finding has a stable code plus a structured title/detail/hint, so tests can pin behavior 1:1 against the failure patterns that motivated this work.
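A minimal sketch of what such a structured finding might look like; the `Finding` dataclass and field names here are illustrative, not the actual `DeployValidator` API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One validation finding with a stable, machine-pinnable shape."""

    code: str      # stable identifier, e.g. "missing_pyproject"
    severity: str  # "error" blocks deploy; "warning" prints only
    title: str
    detail: str
    hint: str


def blocks_deploy(findings: list[Finding]) -> bool:
    """Deploy is aborted when any finding is an error; warnings pass."""
    return any(f.severity == "error" for f in findings)
```

Stable codes let downstream tooling and tests match on `finding.code` rather than on human-readable message text.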

Test plan

  • 358/358 provider + deploy tests pass (pytest tests/llms/{openai,anthropic,azure,google,bedrock} tests/cli/deploy)
  • Verified LLM(model="openai/gpt-4o-mini") succeeds with no API key set (previously raised ImportError: Error importing native provider: 1 validation error)
  • Same for Anthropic, Gemini, Azure
  • 33 new tests in tests/cli/deploy/test_validate.py cover each check 1:1
  • crewai deploy validate smoke-tested against a broken project (missing pyproject.toml) — reports missing_pyproject and exits non-zero
  • ruff check clean, ruff format clean on touched files

Note

Medium Risk
Adds a new validation gate that can block deployments and refactors native LLM providers to defer SDK client creation; both affect critical deploy/runtime paths and could surface new edge cases around credential discovery and client lifecycle.

Overview
Adds pre-deploy project validation via new crewai deploy validate, and runs the same checks automatically before crewai deploy create/push (with --skip-validate to bypass), aborting before contacting the platform when blocking issues are found.

Introduces a new DeployValidator that checks common deployment failure modes (project metadata/layout, lockfiles, hatchling wheel targeting, importability via uv run, env var usage, and stale crewai pins) and prints structured error/warning reports with stable codes.

Refactors native LLM providers (OpenAI, Anthropic, Azure, Gemini, Bedrock) to lazily build sync/async SDK clients using _get_sync_client()/_get_async_client(), so missing credentials don’t crash module import; tests are updated/added to cover lazy credential resolution and no-op cleanup behavior.

Reviewed by Cursor Bugbot for commit 4c3ff7d. Bugbot is set up for automated code reviews on this repo.

All native LLM providers built their SDK clients inside
`@model_validator(mode="after")`, which required the API key at
`LLM(...)` construction time. Instantiating an LLM at module scope
(e.g. `chat_llm=LLM(model="openai/gpt-4o-mini")` on a `@crew` method)
crashed during downstream crew-metadata extraction with a confusing
`ImportError: Error importing native provider: 1 validation error...`
before the process env vars were ever consulted.

Wrap eager client construction in a try/except in each provider and
add `_get_sync_client` / `_get_async_client` methods that build on
first use. OpenAI call sites are routed through the lazy getters so
calls made without eager construction still work. The descriptive
"X_API_KEY is required" errors are re-raised from the lazy path at
first real call.

Update two Azure tests that asserted the old eager-error contract to
assert the new lazy-error contract.

Adds a new `crewai deploy validate` command that checks a project
locally against the most common categories of deploy-time failures,
so users don't burn attempts on fixable project-structure problems.
`crewai deploy create` and `crewai deploy push` now run the same
checks automatically and abort on errors; `--skip-validate` opts out.

Checks (errors block, warnings print only):
  1. pyproject.toml present with `[project].name`
  2. lockfile (uv.lock or poetry.lock) present and not stale
  3. src/<package>/ resolves, rejecting empty names and .egg-info dirs
  4. crew.py, config/agents.yaml, config/tasks.yaml for standard crews
  5. main.py for flow projects
  6. hatchling wheel target resolves
  7. crew/flow module imports cleanly in a `uv run` subprocess, with
     classification of common failures (missing provider extras,
     missing API keys at import, stale crewai pins, pydantic errors)
  8. env vars referenced in source vs .env (warning only)
  9. crewai lockfile pin vs a known-bad cutoff (warning only)

Each finding has a stable code and a structured title/detail/hint so
downstream tooling and tests can pin behavior. 33 tests cover the
checks 1:1 against the failure patterns observed in practice.
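Check 7 above (importing the crew/flow module in a subprocess and returning a structured result) could be sketched roughly like this; `check_import` and the probe script are hypothetical stand-ins for the real implementation, which runs under `uv run` so the import happens in the project's locked environment:

```python
import json
import subprocess
import sys

# Child-process probe: import the target module and emit a JSON payload
# describing success or the failure type, for later classification.
PROBE = """
import importlib, json, sys
try:
    importlib.import_module(sys.argv[1])
    print(json.dumps({"ok": True}))
except Exception as exc:
    print(json.dumps({"ok": False,
                      "error_type": type(exc).__name__,
                      "error": str(exc)}))
"""


def check_import(module: str, use_uv: bool = True) -> dict:
    """Import `module` in a child process and parse the JSON payload.
    use_uv=False falls back to the current interpreter (used here so the
    sketch runs without a uv project on hand)."""
    cmd = ["uv", "run", "python"] if use_uv else [sys.executable]
    out = subprocess.run(cmd + ["-c", PROBE, module],
                         capture_output=True, text=True)
    return json.loads(out.stdout)
```

The real classifier would then map `error_type`/`error` onto stable codes such as missing provider extras or missing API keys at import.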

@iris-clawd left a comment


Excellent PR — two complementary changes that address the most common deploy failure category.

Lazy client construction (all 5 native providers):

  • Pattern is consistent: _init_clients wraps eager construction in try/except ValueError, _build_sync_client/_build_async_client factored out, _get_sync_client/_get_async_client build on first use
  • All call sites updated from `self._client` to `self._get_sync_client()` etc.
  • Azure tests updated to assert new lazy-error contract (construction succeeds, _get_sync_client() raises)
  • This unblocks the key pattern: LLM(model="openai/gpt-4o-mini") as a class field default at import time

Deploy validator (842-line validate.py + 400-line test suite):

  • 9 checks covering the failure patterns from production: pyproject, lockfile, package dir, crew/flow entrypoints, hatchling config, module imports, env vars, version pins
  • Import check via uv run subprocess with JSON payload — smart approach to test in the real dependency environment
  • _classify_import_error handles all the observed patterns: missing extras, missing API keys, no @crewbase, no Flow subclass, stale pins, pydantic errors
  • Error/warning severity distinction is correct: missing structure = error, missing env vars = warning (platform sets them server-side)
  • --skip-validate escape hatch on create/push
  • Tests are thorough — 33 tests covering each code 1:1, including regression cases from production failures

Minor notes (non-blocking):

  • The Bedrock lazy path catches Exception rather than ValueError (matches its broader auth failure modes — fine)
  • aclose() in Azure calls _get_async_client() which could trigger lazy construction just to close — unlikely in practice but worth noting

LGTM 🚀

The lazy-init refactor rewrote `aclose` to access the async client via
`_get_async_client()`, which forces lazy construction. When an
`AzureCompletion` is instantiated without credentials (the whole point
of deferred init), that call raises `ValueError: "Azure API key is
required"` during cleanup — including via `async with` / `__aexit__`.

Access the cached `_async_client` attribute directly so cleanup on an
uninitialized LLM is a harmless no-op. Add a regression test that
enters and exits an `async with` block against a credentials-less
`AzureCompletion`.

Two sites that were mechanically rewritten by the lazy-getter
regex shouldn't actually go through the lazy getter:

- `BedrockCompletion._ensure_async_client` manages its own client
  lifecycle through `aiobotocore` inside an exit stack. Its trailing
  `return self._get_async_client()` was a redundant indirection
  through a stub method that doesn't even attempt to build a client.
  Return the cached attribute directly.

- `GeminiCompletion._get_client_params` is a lightweight config
  accessor used at `to_config_dict()` time. Calling `_get_sync_client()`
  here forced client construction (and would raise `ValueError` when
  credentials aren't set) just to check the `vertexai` attribute. Read
  `self._client` directly and null-guard before the `hasattr` check.

`deploy_push` gained a `--skip-validate` flag that forwards to
`DeployCommand.deploy()` as `skip_validate=False` by default.
Update the two CLI tests that pin the exact call args.

`LLM(model="gpt-4o")` no longer raises at construction when
`OPENAI_API_KEY` is missing — the descriptive error now surfaces when
the client is actually built. Update the test to assert that contract:
`create_llm` succeeds, and `llm._get_sync_client()` raises.

Azure's `_normalize_azure_fields` captures env vars at construction
time. When `LLM(model="azure/...")` is instantiated before deployment
env vars are set, `self.api_key` / `self.endpoint` freeze as `None`
and the lazy client builder then always raises — defeating the point
of deferred init for Azure.

Re-read `AZURE_API_KEY` / `AZURE_ENDPOINT` (and friends) inside
`_make_client_kwargs` when the fields are still unset, matching
OpenAI's `_get_client_params` pattern. Runs the endpoint validator
on any env-provided value so the same normalization applies.

Add a regression test that constructs the LLM with no env vars set,
then patches them in afterwards and asserts `_get_sync_client()`
successfully builds a client and writes the resolved values back
onto the LLM instance.
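A rough sketch of the re-read-on-demand behavior, with a hypothetical class name and without the endpoint normalization the real change also applies:

```python
import os


class AzureFieldsSketch:
    """Sketch: fields frozen at construction are re-read from env inside
    the client-kwargs builder when they are still unset."""

    def __init__(self) -> None:
        # Construction may happen before deployment env vars exist, in
        # which case these freeze as None.
        self.api_key = os.environ.get("AZURE_API_KEY")
        self.endpoint = os.environ.get("AZURE_ENDPOINT")

    def _make_client_kwargs(self) -> dict:
        # Re-read env only for fields that are still unset, writing the
        # resolved values back onto the instance.
        if self.api_key is None:
            self.api_key = os.environ.get("AZURE_API_KEY")
        if self.endpoint is None:
            self.endpoint = os.environ.get("AZURE_ENDPOINT")
        if not self.api_key:
            raise ValueError("Azure API key is required")
        return {"api_key": self.api_key, "azure_endpoint": self.endpoint}
```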

`_prepare_completion_params` uses `is_azure_openai_endpoint` to decide
whether to include the `model` parameter in requests — Azure OpenAI
endpoints embed the deployment name in the URL and reject a `model`
field. When the endpoint was resolved lazily from env vars, the flag
stayed at its pre-resolve `False` value, causing every lazily-inited
Azure OpenAI request to include `model` and fail.

Factor the classification into `_is_azure_openai_endpoint` and call
it from both `_normalize_azure_fields` and `_make_client_kwargs`.
Extend the lazy-build regression test to assert the flag flips to
`True` once the endpoint is resolved.

`_normalize_gemini_fields` captures `GOOGLE_API_KEY` / `GEMINI_API_KEY`
/ `GOOGLE_CLOUD_PROJECT` at construction time, so an LLM constructed
before deployment env vars are set would freeze `self.api_key = None`
and the lazy `_get_sync_client` build would then always auth-fail.

Re-read the env vars inside `_get_sync_client` when `self._client` is
None and the corresponding field is still unset, matching the pattern
used for the other native providers.

Add a regression test that constructs `GeminiCompletion` with no env
vars set, patches them in afterwards, and asserts the lazy build
succeeds and writes the resolved key back onto the LLM.

`_ahandle_completion` and `_ahandle_streaming_completion` were calling
`_get_sync_client()` directly. The other native providers (OpenAI,
Anthropic, Azure) consistently route async code through
`_get_async_client()`; matching that abstraction here keeps the
contract consistent and lets a future async-specific override work
without re-touching call sites.

CodeQL flagged the `"test.openai.azure.com" in llm.endpoint` substring
check as incomplete URL sanitization — the substring could match in
an arbitrary position. Parse the URL and assert against
`urlparse(...).hostname` instead, which is the precise check we want.
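The hardened check amounts to comparing the parsed hostname rather than searching for a substring; a minimal illustration:

```python
from urllib.parse import urlparse


def is_expected_azure_host(endpoint: str) -> bool:
    """Compare the parsed hostname exactly. A plain substring check
    would also be satisfied by an attacker-style URL such as
    https://evil.example/?x=test.openai.azure.com."""
    return urlparse(endpoint).hostname == "test.openai.azure.com"
```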

`_build_async_client` called `_get_client_params()`, which under
`interceptor` constructs a sync `httpx.Client` and stores it under
`http_client`. The async builder then immediately overwrote that key
with an `httpx.AsyncClient`, leaving the sync client allocated and
unclosed.

Add an `include_http_client` flag to `_get_client_params` (defaults
True for the sync path); the async builder passes False so no sync
client is constructed and only the async one is attached.

The bare `except Exception` would silently swallow `TypeError`,
`AttributeError`, and other real bugs in `_build_sync_client` with
only a debug log. The intent is to defer on missing AWS credentials,
which boto3 surfaces as `BotoCoreError` / `ClientError` (and `ValueError`
for some validation paths). Catch only those; let everything else
propagate so genuine failures stay loud.
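A sketch of the narrowed exception handling, using stand-in exception classes in place of `botocore.exceptions`:

```python
class BotoCoreError(Exception):
    """Stand-in for botocore.exceptions.BotoCoreError in this sketch."""


class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError in this sketch."""


def init_client_sketch(build):
    """Defer only on credential-style failures; let genuine bugs
    (TypeError, AttributeError, ...) propagate loudly."""
    try:
        return build()
    except (BotoCoreError, ClientError, ValueError):
        return None  # defer to the lazy getter
```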

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


```python
# the extras-missing message contains the same phrase.
m = re.search(
    r"(?P<pkg>[A-Za-z0-9_ -]+?)\s+native provider not available.*?`([^`]+)`",
    err_msg,
```

Regex never matches real provider error messages

Medium Severity

The missing_provider_extra regex expects backtick-delimited install commands (`([^`]+)`), but the actual ImportError messages raised by all provider modules (anthropic, azure, bedrock, gemini) use plain quotes without backticks — e.g., '...to install: uv add "crewai[anthropic]"'. This means the regex will never match real errors. The test only passes because it artificially injects backticks into the error message. In production, these errors fall through to the less-specific llm_provider_init_failed classification, losing the targeted missing_provider_extra code and the extracted install command hint.


