33 changes: 33 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,33 @@
# Project

> This repo has been populated by an initial template to help get you started. Please
> make sure to update the content to build a great experience for community-building.

As the maintainer of this project, please make a few updates:

- Improve this README.md file to provide a great experience
- Update SUPPORT.md with content about this project's support experience
- Understand the security reporting process in SECURITY.md
- Remove this section from the README

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.
144 changes: 120 additions & 24 deletions README.md
@@ -1,33 +1,129 @@
---
title: Get started with AI Foundry Local
titleSuffix: AI Foundry Local
description: Learn how to install, configure, and run your first AI model with AI Foundry Local
manager: scottpolly
keywords: Azure AI services, cognitive, AI models, local inference
ms.service: azure-ai-foundry
ms.topic: quickstart
ms.date: 02/20/2025
ms.reviewer: samkemp
ms.author: samkemp
author: samuel100
ms.custom: build-2025
#customer intent: As a developer, I want to get started with AI Foundry Local so that I can run AI models locally.
---

# Get started with AI Foundry Local

This article shows you how to get started with AI Foundry Local to run AI models on your device. Follow these steps to install the tool, discover available models, and run your first local AI model.

## Prerequisites

- A PC with sufficient specifications to run AI models locally
- Windows 10 or later
- Greater than 8GB RAM
- Greater than 3GB of free disk space for model caching (quantized Phi 3.2 models are ~3GB)
- Suggested hardware for optimal performance:
- Windows 11
- NVIDIA GPU (2000 series or newer) OR AMD GPU (6000 series or newer) OR Qualcomm Snapdragon X Elite, with 8GB or more of VRAM
- Greater than 16GB RAM
- Greater than 15GB of free disk space for model caching (the largest models are ~15GB)
- Administrator access to install software
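A quick way to verify the free-disk-space prerequisite before installing is a short script. This is an illustrative sketch, not part of Foundry Local; the 15 GB figure comes from the suggested hardware list above.

```python
import shutil

def enough_disk_for_models(path=".", required_gb=15):
    """Return True if the drive holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

print(enough_disk_for_models())
```

Run it from the directory where the model cache will live, since that drive is the one that matters.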

## Quickstart in two steps

Follow these steps to get started with AI Foundry Local:

1. **Install Foundry Local**

1. Download AI Foundry Local for your platform (Windows, macOS, Linux - x64/ARM) from the repository's releases page.
2. Install the package by following the on-screen prompts.

**IMPORTANT: For macOS/Linux users:** Run both components in separate terminals:

- Neutron Server (`Inference.Service.Agent`) - Use `chmod +x Inference.Service.Agent` to make executable
- Foundry Client (`foundry`) - Use `chmod +x foundry` to make executable, and add to your PATH

3. After installation, access the tool via command line with `foundry`.

2. **Run your first model**
1. Open a command prompt or terminal window.
2. Run the DeepSeek-R1 model on the CPU using the following command:
```bash
foundry model run deepseek-r1-1.5b-cpu
```

**💡 TIP:** The `foundry model run <model>` command automatically downloads the model if it isn't already cached on your local machine, and then starts an interactive chat session with it. Try out other models by replacing `deepseek-r1-1.5b-cpu` with the name of any model in the catalog; list them with the `foundry model list` command.

## Explore Foundry Local CLI commands

The foundry CLI is structured into several categories:

- **Model**: Commands related to managing and running models
- **Service**: Commands for managing the AI Foundry Local service
- **Cache**: Commands for managing the local cache where models are stored

To see all available commands, use the help option:

```bash
foundry --help
```

**💡 TIP:** For a complete reference of all available CLI commands and their usage, see the [Foundry Local CLI Reference](./reference/reference-cli.md)

## Security and privacy considerations

AI Foundry Local is designed with privacy and security as core principles:

- **Local processing**: All data processed by AI Foundry Local remains on your device and is never sent to Microsoft or any external services.
- **No telemetry**: AI Foundry Local does not collect usage data or model inputs.
- **Air-gapped environments**: AI Foundry Local can be used in disconnected environments after initial model download.

### Security best practices

- Use AI Foundry Local in environments that align with your organization's security policies.
- For handling sensitive data, ensure your device meets your organization's security requirements.
- Consider disk encryption for devices where cached models might contain sensitive fine-tuning data.

### Licensing considerations

Models available through AI Foundry Local are subject to their original licenses:

- Open-source models maintain their original licenses (e.g., Apache 2.0, MIT).
- Commercial models may have specific usage restrictions or require separate licensing.
- Always review the licensing information for each model before deploying in production.

## Production deployment scope

AI Foundry Local is designed primarily for:

- Individual developer workstations
- Single-node deployment
- Local application development and testing

**⚠️ IMPORTANT:** AI Foundry Local is not currently intended for distributed, containerized, or multi-machine production deployment. For production-scale deployment needs, consider Azure AI Foundry for enterprise-grade availability and scale.

## Troubleshooting

### Common issues and solutions

| Issue | Possible Cause | Solution |
| ----------------------- | --------------------------------------- | ----------------------------------------------------------------------------------------- |
| Slow inference | CPU-only model on large parameter count | Use GPU-optimized model variants when available |
| Model download failures | Network connectivity issues | Check your internet connection, try `foundry cache list` to verify cache state |
| Service won't start | Port conflicts or permission issues | Try `foundry service restart`, or file an issue with logs collected via `foundry zip-logs` |

### Diagnosing performance issues

If you're experiencing slow inference:

1. Check that you're using GPU acceleration if available
2. Monitor memory usage during inference to detect bottlenecks
3. Consider a more quantized model variant (e.g., INT8 instead of FP16)
4. Experiment with batch sizes for non-interactive workloads
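A rough way to quantify "slow inference" when comparing model variants is tokens per second. The helper below is a generic sketch (not a Foundry Local API): time a generation, count the tokens produced, and compare the throughput of, say, a CPU variant against a GPU-optimized one.

```python
def tokens_per_second(token_count, elapsed_seconds):
    """Throughput of one generation run; guards against a zero-length timing window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds

# Example: 128 tokens generated in 8 seconds gives 16.0 tokens/sec.
print(tokens_per_second(128, 8))
```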

## Next steps

- [Learn how to integrate AI Foundry Local with your applications](./how-to/integrate-with-inference-sdks.md)
- [Explore the AI Foundry Local documentation](./index.yml)
128 changes: 128 additions & 0 deletions concepts/foundry-local-architecture.md
@@ -0,0 +1,128 @@
---
title: Foundry Local Architecture
titleSuffix: AI Foundry Local
description: This article articulates the Foundry Local architecture
manager: scottpolly
ms.service: azure-ai-foundry
ms.custom: build-2025
ms.topic: concept-article
ms.date: 02/12/2025
ms.author: samkemp
author: samuel100
---

# Foundry Local Architecture

Foundry Local is designed to enable efficient, secure, and scalable AI model inference directly on local devices. This article explains the key components of the Foundry Local architecture and how they interact to deliver AI capabilities.

The benefits of Foundry Local include:

- **Low Latency**: By running models locally, Foundry Local minimizes the time it takes to process requests and return results.
- **Data Privacy**: Sensitive data can be processed locally without sending it to the cloud, ensuring compliance with data protection regulations.
- **Flexibility**: Foundry Local supports a wide range of hardware configurations, allowing users to choose the best setup for their needs.
- **Scalability**: Foundry Local can be deployed on various devices, from personal computers to powerful servers, making it suitable for different use cases.
- **Cost-Effectiveness**: Running models locally can reduce costs associated with cloud computing, especially for high-volume applications.
- **Offline Capabilities**: Foundry Local can operate without an internet connection, making it ideal for remote or disconnected environments.
- **Integration with Existing Workflows**: Foundry Local can be easily integrated into existing development and deployment workflows, allowing for a smooth transition to local inference.

## Key Components

The key components of the Foundry Local architecture are articulated in the following diagram:

![Foundry Local Architecture Diagram](../media/architecture/foundry-local-arch.png)

### Foundry Local Service

The Foundry Local Service is an OpenAI-compatible REST server that provides a standardized interface for interacting with the inference engine and managing models. Developers can use this API to send requests, run models, and retrieve results programmatically.

- **Endpoint**: `http://localhost:5272/v1`
- **Use Cases**:
- Integrating Foundry Local with custom applications.
- Running models via HTTP requests.
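Because the service exposes an OpenAI-compatible endpoint, any plain HTTP client can talk to it. The sketch below only builds the request; the `/chat/completions` route and request body follow the OpenAI convention and are assumptions for illustration, not confirmed Foundry Local specifics.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5272/v1"  # Foundry Local service endpoint

def build_chat_request(model, prompt):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("deepseek-r1-1.5b-cpu", "Hello!")
print(req.full_url)
# To actually send it (with the service running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```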

### ONNX Runtime

The ONNX runtime is a core component responsible for running AI models. It uses optimized ONNX models to perform inference efficiently on local hardware, such as CPUs, GPUs, or NPUs.

**Features**:

- Supports multiple hardware providers (for example: NVIDIA, AMD, Intel) and devices (for example: NPUs, CPUs, GPUs).
- Provides a unified interface for running models on different hardware platforms.
- Best-in-class performance.
- Supports quantized models for faster inference.

### Model Management

Foundry Local provides robust tools for managing AI models, ensuring that they're readily available for inference and easy to maintain. Model management is handled through the **Model Cache** and the **Command-Line Interface (CLI)**.

#### Model Cache

The model cache is a local storage system where AI models are downloaded and stored. It ensures that models are available for inference without requiring repeated downloads. The cache can be managed using the Foundry CLI or REST API.

- **Purpose**: Reduces latency by storing models locally.
- **Management Commands**:
- `foundry cache list`: Lists all models stored in the local cache.
- `foundry cache remove <model-name>`: Deletes a specific model from the cache.
- `foundry cache cd <path>`: Changes the directory where models are stored.
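These commands are easy to script around. The sketch below wraps `foundry cache list` with `subprocess`; the output format (one model name per line) is an assumption for illustration, so the parser accepts canned text for testing without the CLI installed.

```python
import subprocess

def list_cached_models(sample_output=None):
    """Return cached model names from `foundry cache list`.

    The one-name-per-line output format is an assumption, not a documented
    contract. Pass `sample_output` to parse canned text instead of invoking
    the CLI.
    """
    if sample_output is None:
        sample_output = subprocess.run(
            ["foundry", "cache", "list"],
            capture_output=True, text=True, check=True,
        ).stdout
    return [line.strip() for line in sample_output.splitlines() if line.strip()]

print(list_cached_models("deepseek-r1-1.5b-cpu\nphi-3-mini\n"))
```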

#### Model Lifecycle

1. **Download**: Models are downloaded from the Azure AI Foundry model catalog to local disk.
2. **Load**: Models are loaded into the Foundry Local service (and therefore memory) for inference. You can set a TTL (time-to-live) for how long the model should remain in memory (the default is 10 minutes).
3. **Run**: Models are inferenced.
4. **Unload**: Models can be unloaded from the inference engine to free up resources.
5. **Delete**: Models can be deleted from the local cache to free up disk space.
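The load/unload behavior in steps 2 and 4 can be pictured as a TTL-governed cache entry. This is a conceptual sketch of the idea, not Foundry Local's implementation; the 600-second default mirrors the 10-minute TTL mentioned in step 2.

```python
import time

class LoadedModel:
    """Toy model of an in-memory model entry that expires after idle time."""

    def __init__(self, name, ttl_seconds=600):  # 600 s = the 10-minute default
        self.name = name
        self.ttl_seconds = ttl_seconds
        self.last_used = time.monotonic()

    def touch(self):
        """Each inference request resets the idle clock."""
        self.last_used = time.monotonic()

    def should_unload(self, now=None):
        """True once the model has been idle longer than its TTL."""
        now = time.monotonic() if now is None else now
        return (now - self.last_used) > self.ttl_seconds

m = LoadedModel("deepseek-r1-1.5b-cpu", ttl_seconds=1)
print(m.should_unload())  # False: freshly loaded, idle time is ~0
```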

#### Model Compilation using Olive

Before models can be used with Foundry Local, they must be compiled and optimized in the [ONNX](https://onnx.ai) format. Microsoft provides a selection of published models in the Azure AI Foundry Model Catalog that are already optimized for Foundry Local. However, you aren't limited to those models: you can prepare your own by using [Olive](https://microsoft.github.io/Olive/). Olive is a powerful framework for preparing AI models for efficient inference. It converts models into the ONNX format, optimizes their graph structure, and applies techniques like quantization to improve performance on local hardware.

**💡 TIP**: To learn more about compiling models for Foundry Local, read [Compile Hugging Face models for Foundry Local](../how-to/compile-models-for-foundry-local.md).

### Hardware Abstraction Layer

The hardware abstraction layer ensures that Foundry Local can run on various devices by abstracting the underlying hardware. To optimize performance based on the available hardware, Foundry Local supports:

- **multiple _execution providers_**, such as NVIDIA CUDA, AMD, Qualcomm, Intel.
- **multiple _device types_**, such as CPU, GPU, NPU.

### Developer Experiences

The Foundry Local architecture is designed to provide a seamless developer experience, enabling easy integration and interaction with AI models.

Developers can choose from various interfaces to interact with the system, including:

#### Command-Line Interface (CLI)

The Foundry CLI is a powerful tool for managing models, the inference engine, and the local cache.

**Examples**:

- `foundry model list`: Lists all available models in the catalog.
- `foundry model run <model-name>`: Runs a model.
- `foundry service status`: Checks the status of the service.

**💡 TIP**: To learn more about the CLI commands, read [Foundry Local CLI Reference](../reference/reference-cli.md).

#### Inferencing SDK Integration

Foundry Local supports integration with various SDKs, such as the OpenAI SDK, enabling developers to use familiar programming interfaces to interact with the local inference engine.

- **Supported SDKs**: Python, JavaScript, C#, and more.

**💡 TIP**: To learn more about integrating with inferencing SDKs, read [Integrate Foundry Local with Inferencing SDKs](../how-to/integrate-with-inference-sdks.md).

#### AI Toolkit for Visual Studio Code

The AI Toolkit for Visual Studio Code provides a user-friendly interface for developers to interact with Foundry Local. It allows users to run models, manage the local cache, and visualize results directly within the IDE.

- **Features**:
- Model management: Download, load, and run models from within the IDE.
- Interactive console: Send requests and view responses in real-time.
- Visualization tools: Graphical representation of model performance and results.

## Next Steps

- [Get started with AI Foundry Local](../get-started.md)
- [Integrate with Inference SDKs](../how-to/integrate-with-inference-sdks.md)
- [Foundry Local CLI Reference](../reference/reference-cli.md)