feat: crewai deploy validate + defer native LLM client construction (#5412)
greysonlalonde wants to merge 15 commits into main
Conversation
All native LLM providers built their SDK clients inside `@model_validator(mode="after")`, which required the API key at `LLM(...)` construction time. Instantiating an LLM at module scope (e.g. `chat_llm=LLM(model="openai/gpt-4o-mini")` on a `@crew` method) crashed during downstream crew-metadata extraction with a confusing `ImportError: Error importing native provider: 1 validation error...` before the process env vars were ever consulted. Wrap eager client construction in a try/except in each provider and add `_get_sync_client` / `_get_async_client` methods that build on first use. OpenAI call sites are routed through the lazy getters so calls made without eager construction still work. The descriptive "X_API_KEY is required" errors are re-raised from the lazy path at first real call. Update two Azure tests that asserted the old eager-error contract to assert the new lazy-error contract.
Adds a new `crewai deploy validate` command that checks a project
locally against the most common categories of deploy-time failures,
so users don't burn attempts on fixable project-structure problems.
`crewai deploy create` and `crewai deploy push` now run the same
checks automatically and abort on errors; `--skip-validate` opts out.
Checks (errors block, warnings print only):
1. pyproject.toml present with `[project].name`
2. lockfile (uv.lock or poetry.lock) present and not stale
3. src/<package>/ resolves, rejecting empty names and .egg-info dirs
4. crew.py, config/agents.yaml, config/tasks.yaml for standard crews
5. main.py for flow projects
6. hatchling wheel target resolves
7. crew/flow module imports cleanly in a `uv run` subprocess, with
classification of common failures (missing provider extras,
missing API keys at import, stale crewai pins, pydantic errors)
8. env vars referenced in source vs .env (warning only)
9. crewai lockfile pin vs a known-bad cutoff (warning only)
Each finding has a stable code and a structured title/detail/hint so
downstream tooling and tests can pin behavior. 33 tests cover the
checks 1:1 against the failure patterns observed in practice.
iris-clawd left a comment
Excellent PR — two complementary changes that address the most common deploy failure category.
Lazy client construction (all 5 native providers):
- Pattern is consistent: `_init_clients` wraps eager construction in `try/except ValueError`; `_build_sync_client` / `_build_async_client` factored out; `_get_sync_client` / `_get_async_client` build on first use
- All call sites updated from `self._client` → `self._get_sync_client()` etc.
- Azure tests updated to assert the new lazy-error contract (construction succeeds, `_get_sync_client()` raises)
- This unblocks the key pattern: `LLM(model="openai/gpt-4o-mini")` as a class field default at import time
Deploy validator (842-line validate.py + 400-line test suite):
- 9 checks covering the failure patterns from production: pyproject, lockfile, package dir, crew/flow entrypoints, hatchling config, module imports, env vars, version pins
- Import check via `uv run` subprocess with JSON payload: a smart approach to test in the real dependency environment
- `_classify_import_error` handles all the observed patterns: missing extras, missing API keys, no `@CrewBase`, no Flow subclass, stale pins, pydantic errors
- Error/warning severity distinction is correct: missing structure = error, missing env vars = warning (platform sets them server-side)
- `--skip-validate` escape hatch on create/push
- Tests are thorough: 33 tests covering each code 1:1, including regression cases from production failures
Minor notes (non-blocking):
- The Bedrock lazy path catches `Exception` rather than `ValueError` (matches its broader auth failure modes; fine)
- `aclose()` in Azure calls `_get_async_client()`, which could trigger lazy construction just to close; unlikely in practice but worth noting
LGTM 🚀
lib/crewai/src/crewai/llms/providers/anthropic/completion.py
The lazy-init refactor rewrote `aclose` to access the async client via `_get_async_client()`, which forces lazy construction. When an `AzureCompletion` is instantiated without credentials (the whole point of deferred init), that call raises `ValueError: "Azure API key is required"` during cleanup — including via `async with` / `__aexit__`. Access the cached `_async_client` attribute directly so cleanup on an uninitialized LLM is a harmless no-op. Add a regression test that enters and exits an `async with` block against a credentials-less `AzureCompletion`.
Two sites that were mechanically rewritten by the lazy-getter regex shouldn't actually go through the lazy getter:
- `BedrockCompletion._ensure_async_client` manages its own client lifecycle through `aiobotocore` inside an exit stack. Its trailing `return self._get_async_client()` was a redundant indirection through a stub method that doesn't even attempt to build a client. Return the cached attribute directly.
- `GeminiCompletion._get_client_params` is a lightweight config accessor used at `to_config_dict()` time. Calling `_get_sync_client()` here forced client construction (and would raise `ValueError` when credentials aren't set) just to check the `vertexai` attribute. Read `self._client` directly and null-guard before the `hasattr` check.
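The second fix might look like the following sketch. `GeminiishParams` is a stand-in, not the real `GeminiCompletion`; it only shows the read-the-cache-and-null-guard shape.

```python
# Sketch: a config accessor must not force lazy client construction.

class GeminiishParams:
    def __init__(self):
        self._client = None  # not built until credentials are available

    def _get_client_params(self) -> dict:
        params: dict = {}
        # BUGFIX: read self._client, not self._get_sync_client(); building
        # the client here would raise ValueError without credentials.
        if self._client is not None and hasattr(self._client, "vertexai"):
            params["vertexai"] = self._client.vertexai
        return params
```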
`deploy_push` gained a `--skip-validate` flag that forwards to `DeployCommand.deploy()` as `skip_validate=False` by default. Update the two CLI tests that pin the exact call args.
`LLM(model="gpt-4o")` no longer raises at construction when `OPENAI_API_KEY` is missing — the descriptive error now surfaces when the client is actually built. Update the test to assert that contract: `create_llm` succeeds, and `llm._get_sync_client()` raises.
Azure's `_normalize_azure_fields` captures env vars at construction time. When `LLM(model="azure/...")` is instantiated before deployment env vars are set, `self.api_key` / `self.endpoint` freeze as `None` and the lazy client builder then always raises — defeating the point of deferred init for Azure. Re-read `AZURE_API_KEY` / `AZURE_ENDPOINT` (and friends) inside `_make_client_kwargs` when the fields are still unset, matching OpenAI's `_get_client_params` pattern. Runs the endpoint validator on any env-provided value so the same normalization applies. Add a regression test that constructs the LLM with no env vars set, then patches them in afterwards and asserts `_get_sync_client()` successfully builds a client and writes the resolved values back onto the LLM instance.
`_prepare_completion_params` uses `is_azure_openai_endpoint` to decide whether to include the `model` parameter in requests — Azure OpenAI endpoints embed the deployment name in the URL and reject a `model` field. When the endpoint was resolved lazily from env vars, the flag stayed at its pre-resolve `False` value, causing every lazily-inited Azure OpenAI request to include `model` and fail. Factor the classification into `_is_azure_openai_endpoint` and call it from both `_normalize_azure_fields` and `_make_client_kwargs`. Extend the lazy-build regression test to assert the flag flips to `True` once the endpoint is resolved.
`_normalize_gemini_fields` captures `GOOGLE_API_KEY` / `GEMINI_API_KEY` / `GOOGLE_CLOUD_PROJECT` at construction time, so an LLM constructed before deployment env vars are set would freeze `self.api_key = None` and the lazy `_get_sync_client` build would then always auth-fail. Re-read the env vars inside `_get_sync_client` when `self._client` is None and the corresponding field is still unset, matching the pattern used for the other native providers. Add a regression test that constructs `GeminiCompletion` with no env vars set, patches them in afterwards, and asserts the lazy build succeeds and writes the resolved key back onto the LLM.
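The regression-test shape described above might look like this, again with a stand-in class (`GeminiishLazy`) rather than the real `GeminiCompletion`, and `object()` standing in for the real SDK client.

```python
# Sketch: construct with no env vars, patch them in afterwards, and assert
# the lazy build succeeds and writes the resolved key back onto the LLM.
import os


class GeminiishLazy:
    def __init__(self):
        self.api_key = os.environ.get("GOOGLE_API_KEY") or os.environ.get("GEMINI_API_KEY")
        self._client = None

    def _get_sync_client(self):
        if self._client is None:
            # Late re-read: only when the field froze as None at construction.
            if self.api_key is None:
                self.api_key = (
                    os.environ.get("GOOGLE_API_KEY")
                    or os.environ.get("GEMINI_API_KEY")
                )
            if not self.api_key:
                raise ValueError("GOOGLE_API_KEY is required")
            self._client = object()  # stand-in for the real SDK client
        return self._client
```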
`_ahandle_completion` and `_ahandle_streaming_completion` were calling `_get_sync_client()` directly. The other native providers (OpenAI, Anthropic, Azure) consistently route async code through `_get_async_client()`; matching that abstraction here keeps the contract consistent and lets a future async-specific override work without re-touching call sites.
CodeQL flagged the `"test.openai.azure.com" in llm.endpoint` substring check as incomplete URL sanitization — the substring could match in an arbitrary position. Parse the URL and assert against `urlparse(...).hostname` instead, which is the precise check we want.
`_build_async_client` called `_get_client_params()`, which under `interceptor` constructs a sync `httpx.Client` and stores it under `http_client`. The async builder then immediately overwrote that key with an `httpx.AsyncClient`, leaving the sync client allocated and unclosed. Add an `include_http_client` flag to `_get_client_params` (defaults True for the sync path); the async builder passes False so no sync client is constructed and only the async one is attached.
The bare `except Exception` would silently swallow `TypeError`, `AttributeError`, and other real bugs in `_build_sync_client` with only a debug log. The intent is to defer on missing AWS credentials, which boto3 surfaces as `BotoCoreError` / `ClientError` (and `ValueError` for some validation paths). Catch only those; let everything else propagate so genuine failures stay loud.
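A sketch of the narrowed except clause. `BotoCoreError` and `ClientError` are defined locally as stand-ins here to keep the block self-contained; the real ones live in `botocore.exceptions`.

```python
# Sketch: defer only on credential-shaped errors; let genuine bugs propagate.

class BotoCoreError(Exception):
    """Stand-in for botocore.exceptions.BotoCoreError."""

class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError."""


class Bedrockish:
    def _build_sync_client(self):
        # Simulates a genuine bug, not a credentials problem.
        raise TypeError("genuine bug: wrong argument")

    def _init_clients(self):
        try:
            self._client = self._build_sync_client()
        except (BotoCoreError, ClientError, ValueError):
            # Missing/invalid AWS credentials: defer until first real call.
            self._client = None
        # Anything else (TypeError, AttributeError, ...) propagates loudly.
```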
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 4c3ff7d.
```python
# the extras-missing message contains the same phrase.
m = re.search(
    r"(?P<pkg>[A-Za-z0-9_ -]+?)\s+native provider not available.*?`([^`]+)`",
    err_msg,
)
```
Regex never matches real provider error messages
Medium Severity
The missing_provider_extra regex expects backtick-delimited install commands (`([^`]+)`), but the actual ImportError messages raised by all provider modules (anthropic, azure, bedrock, gemini) use plain quotes without backticks — e.g., '...to install: uv add "crewai[anthropic]"'. This means the regex will never match real errors. The test only passes because it artificially injects backticks into the error message. In production, these errors fall through to the less-specific llm_provider_init_failed classification, losing the targeted missing_provider_extra code and the extracted install command hint.


Summary
Two related changes that together prevent the most common category of deploy-time failures:
- `LLM(model="openai/gpt-4o-mini")` at module scope (e.g. as a class default on a `@crew` method) previously crashed during crew-metadata extraction with `ImportError: Error importing native provider: 1 validation error...` whenever the API key wasn't in `os.environ` at the moment of import. `_init_clients` now wraps eager construction in `try/except ValueError`, and `_get_sync_client` / `_get_async_client` helpers build on first use, re-raising the descriptive `X_API_KEY is required` error only at real call time.
- New `crewai deploy validate` command that runs 9 local checks against the project before it contacts the deploy API. `crewai deploy create` and `crewai deploy push` run the same checks automatically and abort on errors; `--skip-validate` opts out.

Checks
1. `pyproject.toml` with `[project].name`
2. Lockfile (`uv.lock` / `poetry.lock`) present and not stale
3. `src/<package>/` resolves (catches empty package names and `.egg-info` misdetection)
4. `crew.py`, `config/agents.yaml`, `config/tasks.yaml` for standard crews
5. `main.py` for flows
6. Hatchling wheel target resolves
7. Crew/flow module imports cleanly in a `uv run` subprocess, with structured classification of failures (missing provider extras, missing API keys at import, stale crewai pins, pydantic errors)
8. Env vars referenced in source vs `.env` (warning only)
9. crewai lockfile pin vs a known-bad cutoff (warning only)

Each finding has a stable code + structured title/detail/hint, so tests pin behavior 1:1 against the failure patterns this work was driven from.
Test plan
- Provider and deploy CLI test suites (`pytest tests/llms/{openai,anthropic,azure,google,bedrock} tests/cli/deploy`)
- `LLM(model="openai/gpt-4o-mini")` succeeds with no API key set (previously raised `ImportError: Error importing native provider: 1 validation error`)
- 33 tests in `tests/cli/deploy/test_validate.py` cover each check 1:1
- `crewai deploy validate` smoke-tested against a broken project (missing `pyproject.toml`): reports `missing_pyproject` and exits non-zero
- `ruff check` clean, `ruff format` clean on touched files

Note
Medium Risk
Adds a new validation gate that can block deployments and refactors native LLM providers to defer SDK client creation; both affect critical deploy/runtime paths and could surface new edge cases around credential discovery and client lifecycle.
Overview
Adds pre-deploy project validation via a new `crewai deploy validate` command, and runs the same checks automatically before `crewai deploy create`/`push` (with `--skip-validate` to bypass), aborting before contacting the platform when blocking issues are found.

Introduces a new `DeployValidator` that checks common deployment failure modes (project metadata/layout, lockfiles, hatchling wheel targeting, importability via `uv run`, env var usage, and stale `crewai` pins) and prints structured error/warning reports with stable codes.

Refactors native LLM providers (OpenAI, Anthropic, Azure, Gemini, Bedrock) to lazily build sync/async SDK clients using `_get_sync_client()` / `_get_async_client()`, so missing credentials don't crash module import; tests are updated/added to cover lazy credential resolution and no-op cleanup behavior.

Reviewed by Cursor Bugbot for commit 4c3ff7d. Bugbot is set up for automated code reviews on this repo.