12 changes: 12 additions & 0 deletions src/aks-agent/HISTORY.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,18 @@ To release a new version, please select a new version number (usually plus 1 to
Pending
+++++++

1.0.0b12
++++++++
* [BREAKING CHANGE]:
* aks-agent is now containerized and deployed per Kubernetes cluster along with a managed aks-mcp instance
* aks-agent is deployed on the AKS cluster as Helm charts during `az aks agent-init`
* aks agent commands now require --resource-group and --name parameters to specify the target AKS cluster
* Add `az aks agent-cleanup` to clean up the AKS agent from the cluster
* [SECURITY]:
* Kubernetes RBAC: Uses cluster roles to securely access Kubernetes resources with least-privilege principles
* Azure Workload Identity: Supports Azure workload identity for secure, keyless access to Azure resources
* LLM credentials are stored securely in Kubernetes secrets with encryption at rest

1.0.0b11
++++++++
* Fix(agent-init): replace max_tokens with max_completion_tokens for connection check of Azure OpenAI service.
166 changes: 37 additions & 129 deletions src/aks-agent/README.rst
Expand Up @@ -7,28 +7,34 @@ Introduction

The AKS Agent extension provides the "az aks agent" command, an AI-powered assistant that helps analyze and troubleshoot Azure Kubernetes Service (AKS) clusters using Large Language Models (LLMs). The agent combines cluster context, configurable toolsets, and LLMs to answer natural-language questions about your cluster (for example, "Why are my pods not starting?") and can investigate issues in both interactive and non-interactive (batch) modes.

New in this version: **az aks agent-init** command for easy LLM model configuration!
New in this version: **az aks agent-init** command for containerized agent deployment!

You can now use `az aks agent-init` to interactively add and configure LLM models before asking questions. This command guides you through the setup process, allowing you to add multiple models as needed. When asking questions with `az aks agent`, you can:
The `az aks agent-init` command deploys the AKS agent as a Helm chart directly in your AKS cluster with enterprise-grade security:

- Use `--config-file` to specify your own model configuration file
- Use `--model` to select a previously configured model
- If neither is provided, the last configured LLM will be used by default
- **Kubernetes RBAC**: Uses cluster roles to securely access Kubernetes resources with least-privilege principles
- **Workload Identity**: Leverages Azure workload identity for secure, keyless access to Azure resources
- **Interactive LLM Configuration**: Guides you through setting up LLM models with encrypted storage in Kubernetes secrets

This makes it much easier to manage and switch between multiple models for your AKS troubleshooting workflows.
When asking questions with `az aks agent`:

- The agent automatically uses the last configured model
- Use `--model` to select a specific model when you have multiple models configured

This architecture provides better security, scalability, and manageability for production AKS troubleshooting workflows.

Key capabilities
----------------


- **Containerized Deployment**: Agent runs as a Helm chart in your AKS cluster with `az aks agent-init`.
- **Secure Access**: Uses Kubernetes RBAC for cluster resources and Azure workload identity for Azure resources.
- **LLM Configuration**: Interactively configure LLM models with credentials stored securely in Kubernetes secrets.
- Support for multiple LLM providers (Azure OpenAI, OpenAI, Anthropic, Gemini, etc.).
- Automatically uses the last configured model by default.
- Optionally use --model to select a specific model when you have multiple models configured.
- Interactive and non-interactive modes (use --no-interactive for batch runs).
- Support for multiple LLM providers (Azure OpenAI, OpenAI, etc.) via interactive configuration.
- **Easy model setup with `az aks agent-init`**: interactively add and configure LLM models, run multiple times to add more models.
- Configurable via a JSON/YAML config file provided with --config-file, or select a model with --model.
- If no config or model is specified, the last configured LLM is used automatically.
- Control echo and tool output visibility with --no-echo-request and --show-tool-output.
- Refresh the available toolsets with --refresh-toolsets.
- Stay in traditional toolset mode by default, or opt in to aks-mcp integration with ``--aks-mcp`` when you need the enhanced capabilities.

Prerequisites
-------------
Expand All @@ -37,98 +43,6 @@ For more details about supported model providers and required
variables, see: https://docs.litellm.ai/docs/providers


LLM Configuration Explained
---------------------------

The AKS Agent uses YAML configuration files to define LLM connections. Each configuration contains a provider specification and the required environment variables for that provider.

Configuration Structure
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

    llms:
      - provider: azure
        MODEL_NAME: gpt-4.1
        AZURE_API_KEY: *******
        AZURE_API_BASE: https://{azure-openai-service}.openai.azure.com/
        AZURE_API_VERSION: 2025-04-01-preview

Field Explanations
^^^^^^^^^^^^^^^^^^

**provider**
The LiteLLM provider route that determines which LLM service to use. This follows the LiteLLM provider specification from https://docs.litellm.ai/docs/providers.

Common values:

* ``azure`` - Azure OpenAI Service
* ``openai`` - OpenAI API and OpenAI-compatible APIs (e.g., local models, other services)
* ``anthropic`` - Anthropic Claude
* ``gemini`` - Google's Gemini
* ``openai_compatible`` - OpenAI-compatible APIs (e.g., local models, other services)

**MODEL_NAME**
The specific model or deployment name to use. This varies by provider:

* For Azure OpenAI: Your deployment name (e.g., ``gpt-4.1``, ``gpt-35-turbo``)
* For OpenAI: Model name (e.g., ``gpt-4``, ``gpt-3.5-turbo``)
* For other providers: Check the specific model names in LiteLLM documentation

**Environment Variables by Provider**

The remaining fields are environment variables required by each provider. These correspond to the authentication and configuration requirements of each LLM service:

**Azure OpenAI (provider: azure)**
* ``AZURE_API_KEY`` - Your Azure OpenAI API key
* ``AZURE_API_BASE`` - Your Azure OpenAI endpoint URL (e.g., https://your-resource.openai.azure.com/)
* ``AZURE_API_VERSION`` - API version (e.g., 2024-02-01, 2025-04-01-preview)

**OpenAI (provider: openai)**
* ``OPENAI_API_KEY`` - Your OpenAI API key (starts with sk-)

**Gemini (provider: gemini)**
* ``GOOGLE_API_KEY`` - Your Google Cloud API key
* ``GOOGLE_API_ENDPOINT`` - Base URL for the Gemini API endpoint

**Anthropic (provider: anthropic)**
* ``ANTHROPIC_API_KEY`` - Your Anthropic API key

**OpenAI Compatible (provider: openai_compatible)**
* ``OPENAI_API_BASE`` - Base URL for the API endpoint
* ``OPENAI_API_KEY`` - API key (if required by the service)

Multiple Model Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can configure multiple models in a single file:

.. code-block:: yaml

    llms:
      - provider: azure
        MODEL_NAME: gpt-4
        AZURE_API_KEY: your-azure-key
        AZURE_API_BASE: https://your-azure-endpoint.openai.azure.com/
        AZURE_API_VERSION: 2024-02-01
      - provider: openai
        MODEL_NAME: gpt-4
        OPENAI_API_KEY: your-openai-key
      - provider: anthropic
        MODEL_NAME: claude-3-sonnet-20240229
        ANTHROPIC_API_KEY: your-anthropic-key

When using ``--model``, specify the provider and model as ``provider/model_name`` (e.g., ``azure/gpt-4``, ``openai/gpt-4``).
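
As an illustration of the ``provider/model_name`` convention, here is a minimal Python sketch of how such a value could be split. The helper ``parse_model_ref`` and the ``ModelRef`` type are hypothetical names introduced for this example and are not part of the extension. Treating a value without a slash as a bare model name is an assumption, based on the OpenAI example in this README that passes ``--model gpt-4o``.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    provider: Optional[str]  # None when the value carries no provider prefix
    model_name: str


def parse_model_ref(value: str) -> ModelRef:
    """Split a --model value of the form provider/model_name (hypothetical helper)."""
    value = value.strip()
    if not value:
        raise ValueError("model reference must not be empty")
    # Split on the first slash only, so any later slashes stay in the model name.
    provider, sep, model_name = value.partition("/")
    if not sep:
        # Assumption: a value without a slash is a bare model name, e.g. "gpt-4o".
        return ModelRef(None, value)
    if not provider or not model_name:
        raise ValueError(f"malformed model reference: {value!r}")
    return ModelRef(provider, model_name)
```

With this sketch, ``parse_model_ref("azure/gpt-4")`` yields the provider ``azure`` and the model name ``gpt-4``.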

Security Note
^^^^^^^^^^^^^

API keys and credentials in configuration files should be kept secure. Consider using:

* Restricted file permissions (``chmod 600 config.yaml``)
* Environment variable substitution where supported
* Separate configuration files for different environments (dev/prod)

Quick start and examples
=========================

Expand All @@ -139,14 +53,21 @@ Install the extension

    az extension add --name aks-agent

Configure LLM models interactively
----------------------------------
Initialize and configure the AKS agent
---------------------------------------

.. code-block:: bash

    az aks agent-init
    az aks agent-init --resource-group MyResourceGroup --name MyManagedCluster

This command will:

This command will guide you through adding a new LLM model. You can run it multiple times to add more models or update existing models. All configured models are saved locally and can be selected when asking questions.
1. Guide you through LLM model configuration with credentials stored securely in Kubernetes secrets
2. Deploy the AKS agent Helm chart in your cluster
3. Configure Kubernetes RBAC for secure cluster resource access
4. Optionally configure Azure workload identity for Azure resource access

You can run it multiple times to update configurations or add more models.

Run the agent (Azure OpenAI example) :
-----------------------------------
Expand All @@ -163,12 +84,6 @@ Run the agent (Azure OpenAI example) :

    az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment

**3. Use a custom config file:**

.. code-block:: bash

    az aks agent "Why are my pods not starting?" --config-file /path/to/your/model_config.yaml


Run the agent (OpenAI example)
------------------------------
Expand All @@ -185,34 +100,27 @@ Run the agent (OpenAI example)

    az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model gpt-4o

**3. Use a custom config file:**

.. code-block:: bash

    az aks agent "Why are my pods not starting?" --config-file /path/to/your/model_config.yaml

Run in non-interactive batch mode
---------------------------------

.. code-block:: bash

    az aks agent "Diagnose networking issues" --no-interactive --max-steps 15 --model azure/my-gpt4.1-deployment
    az aks agent "Diagnose networking issues" --no-interactive --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment

Opt in to MCP mode
------------------
Clean up the AKS agent
-----------------------

Traditional toolsets remain the default. Enable the aks-mcp integration when you want the enhanced toolsets by passing ``--aks-mcp``. You can return to traditional mode on a subsequent run with ``--no-aks-mcp``.
To uninstall the AKS agent and clean up all Kubernetes resources:

.. code-block:: bash

    az aks agent --aks-mcp "Check node health with MCP" --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment
    az aks agent-cleanup --resource-group MyResourceGroup --name MyManagedCluster

Using a configuration file
--------------------------
This command will:

Pass a config file with --config-file to predefine model, credentials, and toolsets. See
the example config and more detailed examples in the help definition at
`src/aks-agent/azext_aks_agent/_help.py`.
1. Uninstall the AKS agent Helm chart from your cluster
2. Remove all associated Kubernetes resources (deployments, pods, secrets, RBAC configurations)
3. Clean up the LLM configuration secrets

More help
---------
34 changes: 11 additions & 23 deletions src/aks-agent/azext_aks_agent/__init__.py
Expand Up @@ -3,27 +3,26 @@
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------


import os
from azext_aks_agent._client_factory import CUSTOM_MGMT_AKS

# pylint: disable=unused-import
import azext_aks_agent._help
from azext_aks_agent._consts import (
    CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY,
    CONST_AGENT_NAME,
    CONST_AGENT_NAME_ENV_KEY,
    CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY,
    CONST_PRIVACY_NOTICE_BANNER,
    CONST_PRIVACY_NOTICE_BANNER_ENV_KEY,
)
from azure.cli.core import AzCommandsLoader
from azure.cli.core.api import get_config_dir
from azure.cli.core.profiles import register_resource_type


def register_aks_agent_resource_type():
    register_resource_type(
        "latest",
        CUSTOM_MGMT_AKS,
        None,
    )


class ContainerServiceCommandsLoader(AzCommandsLoader):

    def __init__(self, cli_ctx=None):
        from azure.cli.core.commands import CliCommandType
        register_aks_agent_resource_type()

        aks_agent_custom = CliCommandType(operations_tmpl='azext_aks_agent.custom#{}')
        super().__init__(
Expand All @@ -44,14 +43,3 @@ def load_arguments(self, command):


COMMAND_LOADER_CLS = ContainerServiceCommandsLoader


# NOTE(mainred): holmesgpt leverages the environment variables to customize its behavior.
def customize_holmesgpt():
    os.environ[CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY] = "true"
    os.environ[CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY] = get_config_dir()
    os.environ[CONST_AGENT_NAME_ENV_KEY] = CONST_AGENT_NAME
    os.environ[CONST_PRIVACY_NOTICE_BANNER_ENV_KEY] = CONST_PRIVACY_NOTICE_BANNER


customize_holmesgpt()
23 changes: 23 additions & 0 deletions src/aks-agent/azext_aks_agent/_client_factory.py
@@ -0,0 +1,23 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

from azure.cli.core.commands.client_factory import get_mgmt_service_client
from azure.cli.core.profiles import CustomResourceType

CUSTOM_MGMT_AKS = CustomResourceType('azext_aks_agent.vendored_sdks.azure_mgmt_containerservice.2025_10_01',
                                     'ContainerServiceClient')

# Note: cf_xxx, as the client_factory option value of a command group at command declaration, it should ignore
# parameters other than cli_ctx; get_xxx_client is used as the client of other services in the command implementation,
# and usually accepts subscription_id as a parameter to reconfigure the subscription when sending the request


# container service clients
def get_container_service_client(cli_ctx, subscription_id=None):
    return get_mgmt_service_client(cli_ctx, CUSTOM_MGMT_AKS, subscription_id=subscription_id)


def cf_managed_clusters(cli_ctx, *_):
    return get_container_service_client(cli_ctx).managed_clusters
20 changes: 17 additions & 3 deletions src/aks-agent/azext_aks_agent/_consts.py
Expand Up @@ -30,6 +30,20 @@
CONST_MCP_GITHUB_REPO = "Azure/aks-mcp"
CONST_MCP_BINARY_DIR = "bin"

# Color constants for terminal output
HELP_COLOR = "cyan" # same as AI_COLOR for now
ERROR_COLOR = "red"
# Kubernetes WebSocket exec protocol constants
RESIZE_CHANNEL = 4 # WebSocket channel for terminal resize messages
# WebSocket heartbeat configuration (matching kubectl client-go)
# Based on kubernetes/client-go/tools/remotecommand/websocket.go#L59-L65
# pingPeriod = 5 * time.Second
# pingReadDeadline = (pingPeriod * 12) + (1 * time.Second)
# The read deadline is calculated to allow up to 12 missed pings plus 1 second buffer
# This provides tolerance for network delays while detecting actual connection failures
HEARTBEAT_INTERVAL = 5.0 # pingPeriod: 5 seconds between pings
HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1 # pingReadDeadline: 61 seconds total timeout

AGENT_NAMESPACE = "kube-system"
AGENT_LABEL_SELECTOR = "app.kubernetes.io/name=aks-agent"
AKS_MCP_LABEL_SELECTOR = "app.kubernetes.io/name=aks-mcp"

# Helm Configuration
HELM_VERSION = "3.16.0"
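
The heartbeat constants in `_consts.py` above encode a ping period and a read deadline. As a rough illustration of how those two values interact, here is a minimal sketch of a deadline tracker driven by an injectable clock. `HeartbeatMonitor` is a hypothetical class written for this example. It is not the extension's WebSocket implementation, which this diff does not show.

```python
import time

HEARTBEAT_INTERVAL = 5.0  # seconds between pings, as in _consts.py
HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1  # 61.0 seconds read deadline


class HeartbeatMonitor:
    """Hypothetical helper that tracks ping scheduling and the read deadline."""

    def __init__(self, now=time.monotonic):
        self._now = now  # injectable clock, which makes the logic testable
        self._last_ping = now()
        self._last_seen = now()

    def ping_due(self) -> bool:
        """True once a full ping period has elapsed since the last ping."""
        return self._now() - self._last_ping >= HEARTBEAT_INTERVAL

    def record_ping(self) -> None:
        self._last_ping = self._now()

    def record_pong(self) -> None:
        """Call whenever the peer answers, which pushes the deadline forward."""
        self._last_seen = self._now()

    def timed_out(self) -> bool:
        """True when nothing has been heard from the peer within the deadline."""
        return self._now() - self._last_seen > HEARTBEAT_TIMEOUT
```

With a 5-second period, the 61-second deadline tolerates roughly twelve unanswered pings before the connection is treated as dead.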