Add initial thoughts on deployment#8
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR adds an initial deployment guide outlining how to register and run agents, manage their state, test locally, and containerize the app.
- Introduces code examples for setting up an agent app and registering agents
- Describes stateful REST routes and local testing via
LocalAgentRuntime - Provides guidance on containerizing the app with Docker
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| runtime = LocalAgentRuntime() | ||
|
|
||
| agent = MyAgent("my_id", runtime) | ||
| ``` |
There was a problem hiding this comment.
I imagine the agent is then invoked using something like agent.run(task="what is the weather in new york").
|
|
||
| ## Containerization | ||
|
|
||
| Since the app's interface essentially the `AgentRuntime` and the entrypoint, you can containerize the app using a Dockerfile. The entrypoint would be the command to run the agentrunner with the appropriate arguments, or a custom entrypoint can be injected at runtime which uses a different `AgentRuntime` implementation. |
There was a problem hiding this comment.
Once an agent is deployed via the runtime, does it expose a REST API that application can call?
There was a problem hiding this comment.
Also, do we want to support other protocols than REST, e.g. gRPC?
| agentrunner --entrypoint myapp:app --state-dir /tmp/agents | ||
| ``` | ||
|
|
||
| This would bring up a process and expose some interface that would make this agent accessible. Depending on how `agentrunner` is invoked, perhaps different interfaces are exposed. |
There was a problem hiding this comment.
Can we expect A2A to be one of the interfaces we can run through agentrunner? And other one would be something like a ChatCompletion or Responses API?
|
|
||
| ## Testing locally | ||
|
|
||
| If you wish to use the agent without hosting it within agentrunner, you can use a concrete AgentRuntime implementation to run the agent locally. For example, you can use the `LocalAgentRuntime` class to run an agent in a local process. |
There was a problem hiding this comment.
Is the LocalAgentRuntime intended to be a dev/test-only thing, or is will there be a way to specify the runtime when invoking agentunner?
|
|
||
| ## State | ||
|
|
||
| State is associated with an agent `id`. The agent id is based on how the agentrunner is invoked. For example, there may be a REST server exposed with the following route: |
There was a problem hiding this comment.
Are we still using AgentId ~= type + key?
Especially if we tie state to it, we should see if it is possible to ensure that we don't have conflicting agents registered under the same type: In distributed case we might simultaneously desire to have multiple nodes host agents of the same type, but we don't want a case wherein depending on which node a message gets sent the resulting observable behaviour would be different
| You might then be able to execute this app like: | ||
|
|
||
| ```sh | ||
| agentrunner --entrypoint myapp:app --state-dir /tmp/agents |
There was a problem hiding this comment.
[Minor] Would we support the "main" entry point for the app too, e.g. --entrypoint myapp
…sing) - Task 1.1: Add import guard for notebookutils (dual-environment support) - Task 1.2: Implement DefaultAzureCredential pattern (modern auth) - Task 1.3: Replace naive CSV parsing with pandas (robust data handling) - Task 1.4: Add pagination parameters to list_lakehouse_files (scalability) Gap Coverage: - Gap #1: Product Confusion (Azure AI Projects SDK verified) - Gap #2: Deprecated Patterns (DefaultAzureCredential + endpoint) - Gap #3: No Hardcoded Secrets (FabricMLCredential pattern) - Gap #6: Token Flow (import guard + mock credentials) - Gap #8: Resilience (pagination + pandas parsing) Validation: 7/7 Phase 1 checks passing, 11/15 overall (73%)
PHASE 4 - Resilience & Monitoring (3 checks): - Task 4.1: Add retry logic with exponential backoff and timeout (Gap #8) - Task 4.2: Implement comprehensive logging with request IDs (Gap microsoft#10) - Task 4.3: Confirm GUID-based resource identification (Gap microsoft#12) PHASE 5 - Pre-Deployment (2 checks): - Task 5.1: Add OpenTelemetry/Azure Monitor packages (Gap microsoft#10) - Task 5.2: Verify Azure AI Projects SDK is in requirements (Gap #1) Implementation Details: - Added retry_with_timeout decorator with exponential backoff - Added logging.basicConfig() and logger for observability - Added request_id (UUID) to all operations for tracing - Added logging.info() calls for major operations - Added logging.debug() calls for detailed troubleshooting - Added logging.error() with exc_info for exception tracking - Added OpenTelemetry and azure-monitor-opentelemetry to requirements.txt - Updated docstrings with GUID examples Validation Results: 15/15 checks passing (100%) - Phase 1: 7/7 ✓ - Phase 3: 3/3 ✓ - Phase 4: 3/3 ✓ - Phase 5: 2/2 ✓
* Harden Python checkpoint persistence defaults Add RestrictedUnpickler to _checkpoint_encoding.py that limits which types may be instantiated during pickle deserialization. By default FileCheckpointStorage now uses the restricted unpickler, allowing only: - Built-in Python value types (primitives, datetime, uuid, decimal, collections, etc.) - All agent_framework.* internal types - Additional types specified via the new allowed_checkpoint_types parameter on FileCheckpointStorage This narrows the default type surface area for persisted checkpoints while keeping framework-owned scenarios working without extra configuration. Developers can extend the allowed set by passing "module:qualname" strings to allowed_checkpoint_types. The decode_checkpoint_value function retains backward-compatible unrestricted behavior when called without the new allowed_types kwarg. Fixes #4894 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: resolve mypy no-any-return error in checkpoint encoding Add explicit type annotation for super().find_class() return value to satisfy mypy's no-any-return check. Fixes #4894 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Simplify find_class return in _RestrictedUnpickler (#4894) Remove unnecessary intermediate variable and apply # noqa: S301 # nosec directly on the super().find_class() call, matching the established pattern used on the pickle.loads() call in the same file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address review feedback for #4894: Python: Harden Python checkpoint persistence defaults * Restore # noqa: S301 on line 102 of _checkpoint_encoding.py (#4894) The review feedback correctly identified that removing the # noqa: S301 suppression from the find_class return statement would cause a ruff S301 lint failure, since the project enables bandit ("S") rules. This restores consistency with lines 82 and 246 in the same file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address review feedback for #4894: Python: Harden Python checkpoint persistence defaults * Address PR review comments on checkpoint encoding (#4894) - Move module docstring to proper position after __future__ import - Fix find_class return type annotation to type[Any] - Add missing # noqa: S301 pragma on find_class return - Improve error message to reference both allowed_types param and FileCheckpointStorage.allowed_checkpoint_types - Add -> None return annotation to FileCheckpointStorage.__init__ - Replace tempfile.mktemp with TemporaryDirectory in test - Replace contextlib.suppress with pytest.raises for precise assertion - Remove unused contextlib import Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address PR #4941 review comments: fix docstring position and return type - Move module docstring before 'from __future__' import so it populates __doc__ (comment #4) - Change find_class return annotation from type[Any] to type to avoid misleading callers about non-type returns like copyreg._reconstructor (comment #2) Comments #1, #3, #5, #6, #7, #8 were already addressed in the current code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address review feedback for #4894: review comment fixes * fix: use pickle.UnpicklingError in RestrictedUnpickler and improve docstring (#4894) - Change _RestrictedUnpickler.find_class to raise pickle.UnpicklingError instead of WorkflowCheckpointException, since it is pickle-level concern that gets wrapped by the caller in _base64_to_unpickle. - Remove now-unnecessary WorkflowCheckpointException re-raise in _base64_to_unpickle (pickle.UnpicklingError is caught by the generic except Exception handler and wrapped). - Expand decode_checkpoint_value docstring to show a concrete example of the module:qualname format with a user-defined class. - Add regression test verifying find_class raises pickle.UnpicklingError. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address PR #4941 review comments for checkpoint encoding - Comment 1 (line 103): Already resolved in prior commit — _RestrictedUnpickler now raises pickle.UnpicklingError instead of WorkflowCheckpointException. - Comment 2 (line 140): Add concrete usage examples to decode_checkpoint_value docstring showing both direct allowed_types usage and FileCheckpointStorage allowed_checkpoint_types usage. Rename 'SafeState' to 'MyState' across all docstrings for consistency, making it clear this is a user-defined class name. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: replace deprecated 'builtin' repo with pre-commit-hooks in pre-commit config pre-commit 4.x no longer supports 'repo: builtin'. Merge those hooks into the existing pre-commit-hooks repo entry. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * style: apply pyupgrade formatting to docstring example Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: resolve pre-commit hook paths for monorepo git root The poe-check and bandit hooks referenced paths relative to python/ but pre-commit runs hooks from the git root (monorepo root). Fix poe-check entry to cd into python/ first, and update bandit config path to python/pyproject.toml. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix pre-commit config paths for prek --cd python execution Revert bandit config path from 'python/pyproject.toml' to 'pyproject.toml' and poe-check entry from explicit 'cd python' wrapper to direct invocation, since prek --cd python already sets the working directory to python/. Also apply ruff formatting fixes to cosmos checkpoint storage files. Fixes #4894 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: add builtins:getattr to checkpoint deserialization allowlist Pickle uses builtins:getattr to reconstruct enum members (e.g., WorkflowMessage.type which is a MessageType enum). Without it in the allowlist, checkpoint roundtrip tests fail with WorkflowCheckpointException. Fixes #4894 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address review feedback for #4894: review comment fixes --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
No description provided.