feat: crewai deploy validate + defer native LLM client construction#5412

Open
greysonlalonde wants to merge 15 commits into main from feat/cli-predeploy-validation

Conversation


@greysonlalonde (Contributor) commented Apr 10, 2026

Summary

Two related changes that together prevent the most common category of deploy-time failures:

  • fix (851df79) — all native LLM providers now build their SDK clients lazily. Instantiating LLM(model="openai/gpt-4o-mini") at module scope (e.g. as a class default on a @crew method) previously crashed during crew-metadata extraction with ImportError: Error importing native provider: 1 validation error... whenever the API key wasn't in os.environ at import time. _init_clients now wraps eager construction in try/except ValueError, and the _get_sync_client / _get_async_client helpers build clients on first use, re-raising the descriptive X_API_KEY is required error only at real call time.
  • feat (9f61deb) — new crewai deploy validate command that runs 9 local checks against the project before it contacts the deploy API. crewai deploy create and crewai deploy push run the same checks automatically and abort on errors; --skip-validate opts out.
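The deferred-construction pattern described in the first bullet can be sketched as follows; `LazyProvider`, `_FakeClient`, and `EXAMPLE_API_KEY` are stand-ins for illustration, not the actual crewAI classes:

```python
import os


class _FakeClient:
    """Stand-in for a real SDK client (hypothetical for this sketch)."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


class LazyProvider:
    """Sketch of deferred construction: __init__ never raises on missing
    credentials; the descriptive error surfaces on first real use."""

    def __init__(self) -> None:
        self._sync_client = None
        self._init_clients()

    def _init_clients(self) -> None:
        # Eager construction is still attempted, but a missing key no
        # longer crashes instantiation (previously an ImportError during
        # crew-metadata extraction).
        try:
            self._sync_client = self._build_sync_client()
        except ValueError:
            self._sync_client = None  # defer until first real call

    def _build_sync_client(self) -> _FakeClient:
        api_key = os.environ.get("EXAMPLE_API_KEY")
        if not api_key:
            raise ValueError("EXAMPLE_API_KEY is required")
        return _FakeClient(api_key)

    def _get_sync_client(self) -> _FakeClient:
        # Build on first use; the descriptive ValueError is re-raised
        # here, at real call time, rather than at construction.
        if self._sync_client is None:
            self._sync_client = self._build_sync_client()
        return self._sync_client
```

Construction with no key set succeeds; only `_get_sync_client()` raises, and it succeeds once the env var appears.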

Checks

  1. pyproject.toml with [project].name
  2. lockfile (uv.lock / poetry.lock) present and not stale
  3. src/<package>/ resolves (catches empty package names and .egg-info misdetection)
  4. crew.py, config/agents.yaml, config/tasks.yaml for standard crews
  5. main.py for flows
  6. hatchling wheel target resolves
  7. crew/flow module imports cleanly in a uv run subprocess with structured classification of failures (missing provider extras, missing API keys at import, stale crewai pins, pydantic errors)
  8. env vars referenced in source vs .env (warning only)
  9. crewai lockfile pin vs a known-bad cutoff (warning only)

Each finding has a stable code plus a structured title/detail/hint, so tests can pin behavior 1:1 against the failure patterns that motivated this work.
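A minimal sketch of what such a structured finding might look like; the `Finding` dataclass and field names here are illustrative, not the actual `DeployValidator` API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One validation finding with a stable, machine-pinnable shape."""

    code: str      # stable identifier, e.g. "missing_pyproject"
    severity: str  # "error" blocks deploy; "warning" prints only
    title: str
    detail: str
    hint: str


def blocks_deploy(findings: list[Finding]) -> bool:
    """Deploy is aborted when any finding is an error; warnings pass."""
    return any(f.severity == "error" for f in findings)
```

Stable codes let downstream tooling and tests match on `finding.code` rather than on human-readable message text.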

Test plan

  • 358/358 provider + deploy tests pass (pytest tests/llms/{openai,anthropic,azure,google,bedrock} tests/cli/deploy)
  • Verified LLM(model="openai/gpt-4o-mini") succeeds with no API key set (previously raised ImportError: Error importing native provider: 1 validation error)
  • Same for Anthropic, Gemini, Azure
  • 33 new tests in tests/cli/deploy/test_validate.py cover each check 1:1
  • crewai deploy validate smoke-tested against a broken project (missing pyproject.toml) — reports missing_pyproject and exits non-zero
  • ruff check clean, ruff format clean on touched files

Note

Medium Risk
Adds a new validation gate that can block deployments and refactors native LLM providers to defer SDK client creation; both affect critical deploy/runtime paths and could surface new edge cases around credential discovery and client lifecycle.

Overview
Adds pre-deploy project validation via new crewai deploy validate, and runs the same checks automatically before crewai deploy create/push (with --skip-validate to bypass), aborting before contacting the platform when blocking issues are found.

Introduces a new DeployValidator that checks common deployment failure modes (project metadata/layout, lockfiles, hatchling wheel targeting, importability via uv run, env var usage, and stale crewai pins) and prints structured error/warning reports with stable codes.

Refactors native LLM providers (OpenAI, Anthropic, Azure, Gemini, Bedrock) to lazily build sync/async SDK clients using _get_sync_client()/_get_async_client(), so missing credentials don’t crash module import; tests are updated/added to cover lazy credential resolution and no-op cleanup behavior.

Reviewed by Cursor Bugbot for commit 4c3ff7d. Bugbot is set up for automated code reviews on this repo.

All native LLM providers built their SDK clients inside
`@model_validator(mode="after")`, which required the API key at
`LLM(...)` construction time. Instantiating an LLM at module scope
(e.g. `chat_llm=LLM(model="openai/gpt-4o-mini")` on a `@crew` method)
crashed during downstream crew-metadata extraction with a confusing
`ImportError: Error importing native provider: 1 validation error...`
before the process env vars were ever consulted.

Wrap eager client construction in a try/except in each provider and
add `_get_sync_client` / `_get_async_client` methods that build on
first use. OpenAI call sites are routed through the lazy getters so
calls made without eager construction still work. The descriptive
"X_API_KEY is required" errors are re-raised from the lazy path at
first real call.

Update two Azure tests that asserted the old eager-error contract to
assert the new lazy-error contract.

Adds a new `crewai deploy validate` command that checks a project
locally against the most common categories of deploy-time failures,
so users don't burn attempts on fixable project-structure problems.
`crewai deploy create` and `crewai deploy push` now run the same
checks automatically and abort on errors; `--skip-validate` opts out.

Checks (errors block, warnings print only):
  1. pyproject.toml present with `[project].name`
  2. lockfile (uv.lock or poetry.lock) present and not stale
  3. src/<package>/ resolves, rejecting empty names and .egg-info dirs
  4. crew.py, config/agents.yaml, config/tasks.yaml for standard crews
  5. main.py for flow projects
  6. hatchling wheel target resolves
  7. crew/flow module imports cleanly in a `uv run` subprocess, with
     classification of common failures (missing provider extras,
     missing API keys at import, stale crewai pins, pydantic errors)
  8. env vars referenced in source vs .env (warning only)
  9. crewai lockfile pin vs a known-bad cutoff (warning only)

Each finding has a stable code and a structured title/detail/hint so
downstream tooling and tests can pin behavior. 33 tests cover the
checks 1:1 against the failure patterns observed in practice.
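Check 7 above (importing the crew/flow module in a subprocess and returning a structured result) could be sketched roughly like this; `check_import` and the probe script are hypothetical stand-ins for the real implementation, which runs under `uv run` so the import happens in the project's locked environment:

```python
import json
import subprocess
import sys

# Child-process probe: import the target module and emit a JSON payload
# describing success or the failure type, for later classification.
PROBE = """
import importlib, json, sys
try:
    importlib.import_module(sys.argv[1])
    print(json.dumps({"ok": True}))
except Exception as exc:
    print(json.dumps({"ok": False,
                      "error_type": type(exc).__name__,
                      "error": str(exc)}))
"""


def check_import(module: str, use_uv: bool = True) -> dict:
    """Import `module` in a child process and parse the JSON payload.
    use_uv=False falls back to the current interpreter (used here so the
    sketch runs without a uv project on hand)."""
    cmd = ["uv", "run", "python"] if use_uv else [sys.executable]
    out = subprocess.run(cmd + ["-c", PROBE, module],
                         capture_output=True, text=True)
    return json.loads(out.stdout)
```

The real classifier would then map `error_type`/`error` onto stable codes such as missing provider extras or missing API keys at import.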

@iris-clawd left a comment


Excellent PR — two complementary changes that address the most common deploy failure category.

Lazy client construction (all 5 native providers):

  • Pattern is consistent: _init_clients wraps eager construction in try/except ValueError, _build_sync_client/_build_async_client factored out, _get_sync_client/_get_async_client build on first use
  • All call sites updated from `self._client` to `self._get_sync_client()` etc.
  • Azure tests updated to assert new lazy-error contract (construction succeeds, _get_sync_client() raises)
  • This unblocks the key pattern: LLM(model="openai/gpt-4o-mini") as a class field default at import time

Deploy validator (842-line validate.py + 400-line test suite):

  • 9 checks covering the failure patterns from production: pyproject, lockfile, package dir, crew/flow entrypoints, hatchling config, module imports, env vars, version pins
  • Import check via uv run subprocess with JSON payload — smart approach to test in the real dependency environment
  • _classify_import_error handles all the observed patterns: missing extras, missing API keys, no @crewbase, no Flow subclass, stale pins, pydantic errors
  • Error/warning severity distinction is correct: missing structure = error, missing env vars = warning (platform sets them server-side)
  • --skip-validate escape hatch on create/push
  • Tests are thorough — 33 tests covering each code 1:1, including regression cases from production failures

Minor notes (non-blocking):

  • The Bedrock lazy path catches Exception rather than ValueError (matches its broader auth failure modes — fine)
  • aclose() in Azure calls _get_async_client() which could trigger lazy construction just to close — unlikely in practice but worth noting

LGTM 🚀

The lazy-init refactor rewrote `aclose` to access the async client via
`_get_async_client()`, which forces lazy construction. When an
`AzureCompletion` is instantiated without credentials (the whole point
of deferred init), that call raises `ValueError: "Azure API key is
required"` during cleanup — including via `async with` / `__aexit__`.

Access the cached `_async_client` attribute directly so cleanup on an
uninitialized LLM is a harmless no-op. Add a regression test that
enters and exits an `async with` block against a credentials-less
`AzureCompletion`.

Two sites that were mechanically rewritten by the lazy-getter
regex shouldn't actually go through the lazy getter:

- `BedrockCompletion._ensure_async_client` manages its own client
  lifecycle through `aiobotocore` inside an exit stack. Its trailing
  `return self._get_async_client()` was a redundant indirection
  through a stub method that doesn't even attempt to build a client.
  Return the cached attribute directly.

- `GeminiCompletion._get_client_params` is a lightweight config
  accessor used at `to_config_dict()` time. Calling `_get_sync_client()`
  here forced client construction (and would raise `ValueError` when
  credentials aren't set) just to check the `vertexai` attribute. Read
  `self._client` directly and null-guard before the `hasattr` check.

`deploy_push` gained a `--skip-validate` flag that forwards to
`DeployCommand.deploy()` as `skip_validate=False` by default.
Update the two CLI tests that pin the exact call args.

`LLM(model="gpt-4o")` no longer raises at construction when
`OPENAI_API_KEY` is missing — the descriptive error now surfaces when
the client is actually built. Update the test to assert that contract:
`create_llm` succeeds, and `llm._get_sync_client()` raises.

Azure's `_normalize_azure_fields` captures env vars at construction
time. When `LLM(model="azure/...")` is instantiated before deployment
env vars are set, `self.api_key` / `self.endpoint` freeze as `None`
and the lazy client builder then always raises — defeating the point
of deferred init for Azure.

Re-read `AZURE_API_KEY` / `AZURE_ENDPOINT` (and friends) inside
`_make_client_kwargs` when the fields are still unset, matching
OpenAI's `_get_client_params` pattern. Runs the endpoint validator
on any env-provided value so the same normalization applies.

Add a regression test that constructs the LLM with no env vars set,
then patches them in afterwards and asserts `_get_sync_client()`
successfully builds a client and writes the resolved values back
onto the LLM instance.
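A rough sketch of the re-read-on-demand behavior, with a hypothetical class name and without the endpoint normalization the real change also applies:

```python
import os


class AzureFieldsSketch:
    """Sketch: fields frozen at construction are re-read from env inside
    the client-kwargs builder when they are still unset."""

    def __init__(self) -> None:
        # Construction may happen before deployment env vars exist, in
        # which case these freeze as None.
        self.api_key = os.environ.get("AZURE_API_KEY")
        self.endpoint = os.environ.get("AZURE_ENDPOINT")

    def _make_client_kwargs(self) -> dict:
        # Re-read env only for fields that are still unset, writing the
        # resolved values back onto the instance.
        if self.api_key is None:
            self.api_key = os.environ.get("AZURE_API_KEY")
        if self.endpoint is None:
            self.endpoint = os.environ.get("AZURE_ENDPOINT")
        if not self.api_key:
            raise ValueError("Azure API key is required")
        return {"api_key": self.api_key, "azure_endpoint": self.endpoint}
```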

`_prepare_completion_params` uses `is_azure_openai_endpoint` to decide
whether to include the `model` parameter in requests — Azure OpenAI
endpoints embed the deployment name in the URL and reject a `model`
field. When the endpoint was resolved lazily from env vars, the flag
stayed at its pre-resolve `False` value, causing every lazily-inited
Azure OpenAI request to include `model` and fail.

Factor the classification into `_is_azure_openai_endpoint` and call
it from both `_normalize_azure_fields` and `_make_client_kwargs`.
Extend the lazy-build regression test to assert the flag flips to
`True` once the endpoint is resolved.

`_normalize_gemini_fields` captures `GOOGLE_API_KEY` / `GEMINI_API_KEY`
/ `GOOGLE_CLOUD_PROJECT` at construction time, so an LLM constructed
before deployment env vars are set would freeze `self.api_key = None`
and the lazy `_get_sync_client` build would then always auth-fail.

Re-read the env vars inside `_get_sync_client` when `self._client` is
None and the corresponding field is still unset, matching the pattern
used for the other native providers.

Add a regression test that constructs `GeminiCompletion` with no env
vars set, patches them in afterwards, and asserts the lazy build
succeeds and writes the resolved key back onto the LLM.

`_ahandle_completion` and `_ahandle_streaming_completion` were calling
`_get_sync_client()` directly. The other native providers (OpenAI,
Anthropic, Azure) consistently route async code through
`_get_async_client()`; matching that abstraction here keeps the
contract consistent and lets a future async-specific override work
without re-touching call sites.

CodeQL flagged the `"test.openai.azure.com" in llm.endpoint` substring
check as incomplete URL sanitization — the substring could match in
an arbitrary position. Parse the URL and assert against
`urlparse(...).hostname` instead, which is the precise check we want.
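The hardened check amounts to comparing the parsed hostname rather than searching for a substring; a minimal illustration:

```python
from urllib.parse import urlparse


def is_expected_azure_host(endpoint: str) -> bool:
    """Compare the parsed hostname exactly. A plain substring check
    would also be satisfied by an attacker-style URL such as
    https://evil.example/?x=test.openai.azure.com."""
    return urlparse(endpoint).hostname == "test.openai.azure.com"
```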

`_build_async_client` called `_get_client_params()`, which under
`interceptor` constructs a sync `httpx.Client` and stores it under
`http_client`. The async builder then immediately overwrote that key
with an `httpx.AsyncClient`, leaving the sync client allocated and
unclosed.

Add an `include_http_client` flag to `_get_client_params` (defaults
True for the sync path); the async builder passes False so no sync
client is constructed and only the async one is attached.

The bare `except Exception` would silently swallow `TypeError`,
`AttributeError`, and other real bugs in `_build_sync_client` with
only a debug log. The intent is to defer on missing AWS credentials,
which boto3 surfaces as `BotoCoreError` / `ClientError` (and `ValueError`
for some validation paths). Catch only those; let everything else
propagate so genuine failures stay loud.
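A sketch of the narrowed exception handling, using stand-in exception classes in place of `botocore.exceptions`:

```python
class BotoCoreError(Exception):
    """Stand-in for botocore.exceptions.BotoCoreError in this sketch."""


class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError in this sketch."""


def init_client_sketch(build):
    """Defer only on credential-style failures; let genuine bugs
    (TypeError, AttributeError, ...) propagate loudly."""
    try:
        return build()
    except (BotoCoreError, ClientError, ValueError):
        return None  # defer to the lazy getter
```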

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


```python
# the extras-missing message contains the same phrase.
m = re.search(
    r"(?P<pkg>[A-Za-z0-9_ -]+?)\s+native provider not available.*?`([^`]+)`",
    err_msg,
```

Regex never matches real provider error messages

Medium Severity

The missing_provider_extra regex expects backtick-delimited install commands (`([^`]+)`), but the actual ImportError messages raised by all provider modules (anthropic, azure, bedrock, gemini) use plain quotes without backticks — e.g., '...to install: uv add "crewai[anthropic]"'. This means the regex will never match real errors. The test only passes because it artificially injects backticks into the error message. In production, these errors fall through to the less-specific llm_provider_init_failed classification, losing the targeted missing_provider_extra code and the extracted install command hint.


