diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 00000000..5cd7cecf --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,33 @@ +# Project + +> This repo has been populated by an initial template to help get you started. Please +> make sure to update the content to build a great experience for community-building. + +As the maintainer of this project, please make a few updates: + +- Improving this README.MD file to provide a great experience +- Updating SUPPORT.MD with content about this project's support experience +- Understanding the security reporting process in SECURITY.MD +- Removing this section from the README + +## Contributing + +This project welcomes contributions and suggestions. Most contributions require you to agree to a +Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us +the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. + +When you submit a pull request, a CLA bot will automatically determine whether you need to provide +a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions +provided by the bot. You will only need to do this once across all repos using our CLA. + +This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). +For more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or +contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. + +## Trademarks + +This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft +trademarks or logos is subject to and must follow +[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). 
+Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. +Any use of third-party trademarks or logos is subject to those third parties' policies. diff --git a/README.md b/README.md index 5cd7cecf..6a6a52fe 100644 --- a/README.md +++ b/README.md @@ -1,33 +1,129 @@ -# Project +--- +title: Get started with AI Foundry Local +titleSuffix: AI Foundry Local +description: Learn how to install, configure, and run your first AI model with AI Foundry Local +manager: scottpolly +keywords: Azure AI services, cognitive, AI models, local inference +ms.service: azure-ai-foundry +ms.topic: quickstart +ms.date: 02/20/2025 +ms.reviewer: samkemp +ms.author: samkemp +author: samuel100 +ms.custom: build-2025 +#customer intent: As a developer, I want to get started with AI Foundry Local so that I can run AI models locally. +--- -> This repo has been populated by an initial template to help get you started. Please -> make sure to update the content to build a great experience for community-building. +# Get started with AI Foundry Local -As the maintainer of this project, please make a few updates: +This article shows you how to get started with AI Foundry Local to run AI models on your device. Follow these steps to install the tool, discover available models, and run your first local AI model. 
-- Improving this README.MD file to provide a great experience -- Updating SUPPORT.MD with content about this project's support experience -- Understanding the security reporting process in SECURITY.MD -- Remove this section from the README +## Prerequisites -## Contributing +- A PC with sufficient specifications to run AI models locally + - Windows 10 or later + - More than 8 GB of RAM + - More than 3 GB of free disk space for model caching (quantized Phi 3.2 models are ~3 GB) +- Suggested hardware for optimal performance: + - Windows 11 + - NVIDIA GPU (2000 series or newer) OR AMD GPU (6000 series or newer) OR Qualcomm Snapdragon X Elite, with 8 GB or more of VRAM + - More than 16 GB of RAM + - More than 15 GB of free disk space for model caching (the largest models are ~15 GB) +- Administrator access to install software -This project welcomes contributions and suggestions. Most contributions require you to agree to a -Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us -the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. +## Quickstart in two steps -When you submit a pull request, a CLA bot will automatically determine whether you need to provide -a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions -provided by the bot. You will only need to do this once across all repos using our CLA. +Follow these steps to get started with AI Foundry Local: -This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). -For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or -contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. +1. **Install Foundry Local** -## Trademarks + 1. 
Download AI Foundry Local for your platform (Windows, macOS, Linux - x64/ARM) from the repository's releases page. + 2. Install the package by following the on-screen prompts. -This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft -trademarks or logos is subject to and must follow -[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). -Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. -Any use of third-party trademarks or logos are subject to those third-party's policies. + **IMPORTANT: For macOS/Linux users:** Run both components in separate terminals: + + - Neutron Server (`Inference.Service.Agent`) - Use `chmod +x Inference.Service.Agent` to make it executable + - Foundry Client (`foundry`) - Use `chmod +x foundry` to make it executable, and add it to your PATH + + 3. After installation, access the tool via the command line with `foundry`. + +2. **Run your first model** + 1. Open a command prompt or terminal window. + 2. Run the DeepSeek-R1 model on the CPU using the following command: + ```bash + foundry model run deepseek-r1-1.5b-cpu + ``` + +**💡 TIP:** The `foundry model run` command automatically downloads the model if it isn't already cached on your local machine, and then starts an interactive chat session with the model. You're encouraged to try out different models by replacing `deepseek-r1-1.5b-cpu` with the name of any other model in the catalog, which you can list with the `foundry model list` command. 
+ +## Explore Foundry Local CLI commands + +The `foundry` CLI is structured into several categories: + +- **Model**: Commands related to managing and running models +- **Service**: Commands for managing the AI Foundry Local service +- **Cache**: Commands for managing the local cache where models are stored + +To see all available commands, use the help option: + +```bash +foundry --help +``` + +**💡 TIP:** For a complete reference of all available CLI commands and their usage, see the [Foundry Local CLI Reference](./reference/reference-cli.md). + +## Security and privacy considerations + +AI Foundry Local is designed with privacy and security as core principles: + +- **Local processing**: All data processed by AI Foundry Local remains on your device and is never sent to Microsoft or any external services. +- **No telemetry**: AI Foundry Local does not collect usage data or model inputs. +- **Air-gapped environments**: AI Foundry Local can be used in disconnected environments after the initial model download. + +### Security best practices + +- Use AI Foundry Local in environments that align with your organization's security policies. +- When handling sensitive data, ensure your device meets your organization's security requirements. +- Consider disk encryption for devices where cached models might contain sensitive fine-tuning data. + +### Licensing considerations + +Models available through AI Foundry Local are subject to their original licenses: + +- Open-source models maintain their original licenses (e.g., Apache 2.0, MIT). +- Commercial models may have specific usage restrictions or require separate licensing. +- Always review the licensing information for each model before deploying in production. 
+ +## Production deployment scope + +AI Foundry Local is designed primarily for: + +- Individual developer workstations +- Single-node deployment +- Local application development and testing + +**⚠️ IMPORTANT:** AI Foundry Local is not currently intended for distributed, containerized, or multi-machine production deployment. For production-scale deployment needs, consider Azure AI Foundry for enterprise-grade availability and scale. + +## Troubleshooting + +### Common issues and solutions + +| Issue | Possible Cause | Solution | +| ----------------------- | -------------------------------------------- | ---------------------------------------------------------------------------------- | +| Slow inference | CPU-only model with a large parameter count | Use GPU-optimized model variants when available | +| Model download failures | Network connectivity issues | Check your internet connection and run `foundry cache list` to verify the cache state | +| Service won't start | Port conflicts or permission issues | Try `foundry service restart`, or file an issue with logs collected by `foundry zip-logs` | + +### Diagnosing performance issues + +If you're experiencing slow inference: + +1. Check that you're using GPU acceleration if available +2. Monitor memory usage during inference to detect bottlenecks +3. Consider a more heavily quantized model variant (e.g., INT8 instead of FP16) +4. 
Experiment with batch sizes for non-interactive workloads + +## Next steps + +- [Learn how to integrate AI Foundry Local with your applications](./how-to/integrate-with-inference-sdks.md) +- [Explore the AI Foundry Local documentation](./index.yml) diff --git a/concepts/foundry-local-architecture.md b/concepts/foundry-local-architecture.md new file mode 100644 index 00000000..80886abd --- /dev/null +++ b/concepts/foundry-local-architecture.md @@ -0,0 +1,128 @@ +--- +title: Foundry Local Architecture +titleSuffix: AI Foundry Local +description: This article describes the Foundry Local architecture +manager: scottpolly +ms.service: azure-ai-foundry +ms.custom: build-2025 +ms.topic: concept-article +ms.date: 02/12/2025 +ms.author: samkemp +author: samuel100 +--- + +# Foundry Local Architecture + +Foundry Local is designed to enable efficient, secure, and scalable AI model inference directly on local devices. This article explains the key components of the Foundry Local architecture and how they interact to deliver AI capabilities. + +The benefits of Foundry Local include: + +- **Low Latency**: By running models locally, Foundry Local minimizes the time it takes to process requests and return results. +- **Data Privacy**: Sensitive data can be processed locally without sending it to the cloud, helping you meet data protection requirements. +- **Flexibility**: Foundry Local supports a wide range of hardware configurations, allowing users to choose the best setup for their needs. +- **Scalability**: Foundry Local can be deployed on various devices, from personal computers to powerful servers, making it suitable for different use cases. +- **Cost-Effectiveness**: Running models locally can reduce costs associated with cloud computing, especially for high-volume applications. +- **Offline Capabilities**: Foundry Local can operate without an internet connection, making it ideal for remote or disconnected environments. 
+- **Integration with Existing Workflows**: Foundry Local can be easily integrated into existing development and deployment workflows, allowing for a smooth transition to local inference. + +## Key Components + +The key components of the Foundry Local architecture are shown in the following diagram: + +![Foundry Local Architecture Diagram](../media/architecture/foundry-local-arch.png) + +### Foundry Local Service + +The Foundry Local Service is an OpenAI-compatible REST server that provides a standardized interface for interacting with the inference engine and model management. Developers can use this API to send requests, run models, and retrieve results programmatically. + +- **Endpoint**: `http://localhost:5272/v1` +- **Use Cases**: + - Integrating Foundry Local with custom applications. + - Running models via HTTP requests. + +### ONNX Runtime + +The ONNX Runtime is a core component responsible for running AI models. It uses optimized ONNX models to perform inference efficiently on local hardware, such as CPUs, GPUs, or NPUs. + +**Features**: + +- Supports multiple hardware providers (for example: NVIDIA, AMD, Intel) and devices (for example: NPUs, CPUs, GPUs). +- Provides a unified interface for running models on different hardware platforms. +- Best-in-class performance. +- Supports quantized models for faster inference. + +### Model Management + +Foundry Local provides robust tools for managing AI models, ensuring that they're readily available for inference and easy to maintain. Model management is handled through the **Model Cache** and the **Command-Line Interface (CLI)**. + +#### Model Cache + +The model cache is a local storage system where AI models are downloaded and stored. It ensures that models are available for inference without requiring repeated downloads. The cache can be managed using the Foundry CLI or REST API. + +- **Purpose**: Reduces latency by storing models locally. 
+- **Management Commands**: + - `foundry cache list`: Lists all models stored in the local cache. + - `foundry cache remove <model>`: Deletes a specific model from the cache. + - `foundry cache cd <path>`: Changes the directory where models are stored. + +#### Model Lifecycle + +1. **Download**: Models are downloaded from the Azure AI Foundry model catalog to local disk. +2. **Load**: Models are loaded into the Foundry Local service (and therefore memory) for inference. You can set a TTL (time-to-live) for how long the model should remain in memory (the default is 10 minutes). +3. **Run**: The service runs inference against loaded models. +4. **Unload**: Models can be unloaded from the inference engine to free up resources. +5. **Delete**: Models can be deleted from the local cache to free up disk space. + +#### Model Compilation using Olive + +Before models can be used with Foundry Local, they must be compiled and optimized in the [ONNX](https://onnx.ai) format. Microsoft provides a selection of published models in the Azure AI Foundry Model Catalog that are already optimized for Foundry Local. However, you aren't limited to those models: you can compile your own by using [Olive](https://microsoft.github.io/Olive/). Olive is a powerful framework for preparing AI models for efficient inference. It converts models into the ONNX format, optimizes their graph structure, and applies techniques like quantization to improve performance on local hardware. + +**💡 TIP**: To learn more about compiling models for Foundry Local, read [Compile Hugging Face models for Foundry Local](../how-to/compile-models-for-foundry-local.md). + +### Hardware Abstraction Layer + +The hardware abstraction layer ensures that Foundry Local can run on various devices by abstracting the underlying hardware. To optimize performance based on the available hardware, Foundry Local supports: + +- **multiple _execution providers_**, such as NVIDIA CUDA, AMD, Qualcomm, Intel. +- **multiple _device types_**, such as CPU, GPU, NPU. 
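
The TTL-based load/unload behavior in the model lifecycle above can be illustrated with a minimal in-memory sketch. This is purely illustrative and is not Foundry Local's actual implementation; the `ModelHost` class and its methods are hypothetical:

```python
import time

class ModelHost:
    """Illustrative sketch of TTL-based model loading, not Foundry Local's real code."""

    def __init__(self, ttl_seconds=600):  # default TTL of 10 minutes
        self.ttl = ttl_seconds
        self.loaded = {}  # model name -> expiry timestamp

    def load(self, name):
        # Loading a model (or serving a request) refreshes its time-to-live.
        self.loaded[name] = time.time() + self.ttl

    def run(self, name, prompt):
        self.load(name)  # loads on first use and refreshes the TTL on every request
        return f"response from {name} to: {prompt}"

    def evict_expired(self):
        # Unload models whose TTL has elapsed to free memory.
        now = time.time()
        for name in [n for n, exp in self.loaded.items() if exp <= now]:
            del self.loaded[name]

host = ModelHost(ttl_seconds=600)
print(host.run("deepseek-r1-1.5b-cpu", "Hello"))
```

The real service applies the same idea at the process level: a request loads the model if needed, each request resets the clock, and idle models are unloaded once the TTL expires.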
+ +### Developer Experiences + +The Foundry Local architecture is designed to provide a seamless developer experience, enabling easy integration and interaction with AI models. + +Developers can choose from various interfaces to interact with the system, including: + +#### Command-Line Interface (CLI) + +The Foundry CLI is a powerful tool for managing models, the inference engine, and the local cache. + +**Examples**: + +- `foundry model list`: Lists the models available in the catalog. +- `foundry model run <model>`: Runs a model. +- `foundry service status`: Checks the status of the service. + +**💡 TIP**: To learn more about the CLI commands, read [Foundry Local CLI Reference](../reference/reference-cli.md). + +#### Inferencing SDK Integration + +Foundry Local supports integration with various SDKs, such as the OpenAI SDK, enabling developers to use familiar programming interfaces to interact with the local inference engine. + +- **Supported SDKs**: Python, JavaScript, C#, and more. + +**💡 TIP**: To learn more about integrating with inferencing SDKs, read [Integrate Foundry Local with Inferencing SDKs](../how-to/integrate-with-inference-sdks.md). + +#### AI Toolkit for Visual Studio Code + +The AI Toolkit for Visual Studio Code provides a user-friendly interface for developers to interact with Foundry Local. It allows users to run models, manage the local cache, and visualize results directly within the IDE. + +- **Features**: + - Model management: Download, load, and run models from within the IDE. + - Interactive console: Send requests and view responses in real time. + - Visualization tools: Graphical representation of model performance and results. 
+ +## Next Steps + +- [Get started with AI Foundry Local](../get-started.md) +- [Integrate with Inference SDKs](../how-to/integrate-with-inference-sdks.md) +- [Foundry Local CLI Reference](../reference/reference-cli.md) diff --git a/how-to/compile-models-for-foundry-local.md b/how-to/compile-models-for-foundry-local.md new file mode 100644 index 00000000..6535f4e1 --- /dev/null +++ b/how-to/compile-models-for-foundry-local.md @@ -0,0 +1,264 @@ +# Run Hugging Face models on Foundry Local + +Foundry Local lets you run ONNX models on your local device with high performance. While the model catalog includes pre-compiled models, you can also use any ONNX-formatted model. + +In this guide, you'll learn to: + +- **Convert and optimize** a Hugging Face model into the ONNX format using Olive +- **Run** the optimized model using Foundry Local + +## Prerequisites + +- Python 3.10 or later + +## Install Olive + +[Olive](https://github.com/microsoft/olive) is a toolkit for optimizing models to ONNX format. + +### Bash + +```bash +pip install olive-ai[auto-opt] +``` + +### PowerShell + +```powershell +pip install olive-ai[auto-opt] +``` + +**💡 TIP**: Install Olive in a virtual environment using [venv](https://docs.python.org/3/library/venv.html) or [conda](https://www.anaconda.com/docs/getting-started/miniconda/main). + +## Sign in to Hugging Face + +We'll optimize Llama-3.2-1B-Instruct, which requires Hugging Face authentication: + +### Bash + +```bash +huggingface-cli login +``` + +### PowerShell + +```powershell +huggingface-cli login +``` + +**Note**: You'll need to [create a Hugging Face token](https://huggingface.co/docs/hub/security-tokens) and [directly request access](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) to the model. 
+ +## Compile the model + +### Step 1: Run the Olive `auto-opt` command + +Run the Olive `auto-opt` command to download, convert to ONNX, quantize, and optimize the model: + +### Bash + +```bash +olive auto-opt \ + --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \ + --trust_remote_code \ + --output_path models/llama \ + --device cpu \ + --provider CPUExecutionProvider \ + --use_ort_genai \ + --precision int4 \ + --log_level 1 +``` + +### PowerShell + +```powershell +olive auto-opt ` + --model_name_or_path meta-llama/Llama-3.2-1B-Instruct ` + --trust_remote_code ` + --output_path models/llama ` + --device cpu ` + --provider CPUExecutionProvider ` + --use_ort_genai ` + --precision int4 ` + --log_level 1 +``` + +**Note**: Compilation takes ~60 seconds plus model download time. + +The command uses the following parameters: + +| Parameter | Description | +| -------------------- | -------------------------------------------------------------------------- | +| `model_name_or_path` | Model source: Hugging Face ID, local path, or Azure AI Model registry ID | +| `output_path` | Where to save the optimized model | +| `device` | Target hardware: `cpu`, `gpu`, or `npu` | +| `provider` | Execution provider (e.g., `CPUExecutionProvider`, `CUDAExecutionProvider`) | +| `precision` | Model precision: `fp16`, `fp32`, `int4`, or `int8` | +| `use_ort_genai` | Creates inference configuration files | + +You can substitute any model from Hugging Face or a local path - Olive handles the conversion, optimization, and quantization automatically. + +### Step 2: Rename the output model + +Olive places files in a generic `model` directory. 
Rename it to make it easier to use: + +### Bash + +```bash +cd models/llama +mv model llama-3.2 +``` + +### PowerShell + +```powershell +cd models/llama +Rename-Item -Path "model" -NewName "llama-3.2" +``` + +### Step 3: Create chat template file + +A chat template is a structured format that defines how input and output messages are processed for a conversational AI model. It specifies the roles (e.g., system, user, assistant) and the structure of the conversation, ensuring that the model understands the context and generates appropriate responses. + +Foundry Local requires a chat template JSON file called `inference_model.json` in order to generate the appropriate responses. The template file contains the model name and a `PromptTemplate` object, which contains a `{Content}` placeholder that Foundry Local replaces with the user prompt at runtime. + +```json +{ + "Name": "llama-3.2", + "PromptTemplate": { + "assistant": "{Content}", + "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" + } +} +``` + +To create the chat template file, you can use the `apply_chat_template` method from the Hugging Face `transformers` library: + +**Note**: The following example uses the Python Hugging Face library to create a chat template. The Hugging Face library is a dependency for Olive, so if you're using the same Python virtual environment you do not need to install it. If you're using a different environment, install the library with `pip install transformers`. + +```python +# generate_inference_model.py +# This script generates the inference_model.json file for the Llama-3.2 model. 
+import json +import os +from transformers import AutoTokenizer + +model_path = "models/llama/llama-3.2" +tokenizer = AutoTokenizer.from_pretrained(model_path) + +chat = [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "{Content}"}, +] + +template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True) + +json_template = { + "Name": "llama-3.2", + "PromptTemplate": { + "assistant": "{Content}", + "prompt": template + } +} + +json_file = os.path.join(model_path, "inference_model.json") +with open(json_file, "w") as f: + json.dump(json_template, f, indent=2) +``` + +Run the script using: + +```bash +python generate_inference_model.py +``` + +## Run the model + +You can run your compiled model using the Foundry Local CLI, REST API, or OpenAI Python SDK. First, change the model cache directory to the models directory you created in the previous step: + +### Bash + +```bash +foundry cache cd models +foundry cache ls # should show llama-3.2 +``` + +### PowerShell + +```powershell +foundry cache cd models +foundry cache ls # should show llama-3.2 +``` + +### Using the Foundry Local CLI + +### Bash + +```bash +foundry model run llama-3.2 --verbose +``` + +### PowerShell + +```powershell +foundry model run llama-3.2 --verbose +``` + +### Using the REST API + +### Bash + +```bash +curl -X POST http://localhost:5272/v1/chat/completions \ +-H "Content-Type: application/json" \ +-d '{ + "model": "llama-3.2", + "messages": [{"role": "user", "content": "What is the capital of France?"}], + "temperature": 0.7, + "max_tokens": 50, + "stream": true +}' +``` + +### PowerShell + +```powershell +Invoke-RestMethod -Uri http://localhost:5272/v1/chat/completions ` + -Method Post ` + -ContentType "application/json" ` + -Body '{ + "model": "llama-3.2", + "messages": [{"role": "user", "content": "What is the capital of France?"}], + "temperature": 0.7, + "max_tokens": 50, + "stream": true + }' +``` + +### Using the 
OpenAI Python SDK + +```python +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:5272/v1", + api_key="none", # required but not used +) + +stream = client.chat.completions.create( + model="llama-3.2", + messages=[{"role": "user", "content": "What is the capital of France?"}], + temperature=0.7, + max_tokens=50, + stream=True, +) + +for event in stream: + if event.choices[0].delta.content is not None: # some chunks carry no content + print(event.choices[0].delta.content, end="", flush=True) +print("\n\n") +``` + +**💡 TIP**: You can use any language that supports HTTP requests. See [Integrate with Inferencing SDKs](./integrate-with-inference-sdks.md) for more options. + +## Next steps + +- [Learn more about Olive](https://microsoft.github.io/Olive/) +- [Integrate Foundry Local with Inferencing SDKs](./integrate-with-inference-sdks.md) diff --git a/how-to/integrate-with-inference-sdks.md b/how-to/integrate-with-inference-sdks.md new file mode 100644 index 00000000..bd14a106 --- /dev/null +++ b/how-to/integrate-with-inference-sdks.md @@ -0,0 +1,149 @@ +--- +title: Integrate with Inference SDKs +titleSuffix: AI Foundry Local +description: This article provides instructions on how to integrate Foundry Local with common Inferencing SDKs. +manager: scottpolly +ms.service: azure-ai-foundry +ms.custom: build-2025 +ms.topic: how-to +ms.date: 02/12/2025 +ms.author: samkemp +zone_pivot_groups: azure-ai-model-catalog-samples-chat +author: samuel100 +--- + +# Integrate Foundry Local with Inferencing SDKs + +AI Foundry Local provides a REST API endpoint that makes it easy to integrate with various inferencing SDKs and programming languages. This guide shows you how to connect your applications to locally running AI models using popular SDKs. 
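
Whichever SDK you choose, each example in this guide ultimately posts the same OpenAI-style JSON body to the local `/v1/chat/completions` endpoint. As a rough sketch of that request shape (the `build_chat_request` helper is our own illustration, not part of any SDK):

```python
import json

def build_chat_request(model, user_prompt, system_prompt="You are a helpful assistant.",
                       max_tokens=100, stream=False):
    """Builds the JSON body expected by an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,  # name of the model loaded in Foundry Local
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "stream": stream,  # True to receive server-sent events instead of one response
    }

body = build_chat_request("deepseek-r1-1.5b-cpu", "What is the capital of France?")
print(json.dumps(body, indent=2))
```

The SDK clients below construct this body for you; the raw REST examples send it directly.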
+ +## Prerequisites + +- AI Foundry Local installed and running on your system +- A model loaded into the service (use `foundry model load <model>`) +- Basic knowledge of the programming language you want to use for integration +- Development environment for your chosen language + +## Understanding the REST API + +When AI Foundry Local is running, it exposes an OpenAI-compatible REST API endpoint at `http://localhost:5272/v1`. This endpoint supports standard API operations like: + +- `/completions` - For text completion +- `/chat/completions` - For chat-based interactions +- `/models` - To list available models + +## Language Examples + +### Python + +```python +from openai import OpenAI + +# Configure the client to use your local endpoint +client = OpenAI( + base_url="http://localhost:5272/v1", + api_key="not-needed" # API key isn't used but the client requires one +) + +# Chat completion example +response = client.chat.completions.create( + model="deepseek-r1-1.5b-cpu", # Use the name of your loaded model + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "What is the capital of France?"} + ], + max_tokens=100 +) + +print(response.choices[0].message.content) +``` + +### REST API + +```bash +curl http://localhost:5272/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "deepseek-r1-1.5b-cpu", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "What is the capital of France?" 
+ } + ], + "max_tokens": 100 + }' +``` + +### JavaScript + +```javascript +import OpenAI from "openai"; + +// Configure the client to use your local endpoint +const openai = new OpenAI({ + baseURL: "http://localhost:5272/v1", + apiKey: "not-needed", // API key isn't used but the client requires one +}); + +async function generateText() { + const response = await openai.chat.completions.create({ + model: "deepseek-r1-1.5b-cpu", // Use the name of your loaded model + messages: [ + { role: "system", content: "You are a helpful assistant." }, + { role: "user", content: "What is the capital of France?" }, + ], + max_tokens: 100, + }); + + console.log(response.choices[0].message.content); +} + +generateText(); +``` + +### C# + +```csharp +using Azure.AI.OpenAI; +using Azure; + +// Configure the client to use your local endpoint +OpenAIClient client = new OpenAIClient( + new Uri("http://localhost:5272/v1"), + new AzureKeyCredential("not-needed") // API key isn't used but the client requires one +); + +// Chat completion example +var chatCompletionsOptions = new ChatCompletionsOptions() +{ + Messages = + { + new ChatMessage(ChatRole.System, "You are a helpful assistant."), + new ChatMessage(ChatRole.User, "What is the capital of France?") + }, + MaxTokens = 100 +}; + +Response<ChatCompletions> response = await client.GetChatCompletionsAsync( + "deepseek-r1-1.5b-cpu", // Use the name of your loaded model + chatCompletionsOptions +); + +Console.WriteLine(response.Value.Choices[0].Message.Content); +``` + +## Best Practices + +1. **Error Handling**: Implement robust error handling to manage cases when the local service is unavailable or a model isn't loaded. +2. **Resource Management**: Be mindful of your local resources. Monitor CPU/RAM usage when making multiple concurrent requests. +3. **Fallback Strategy**: Consider implementing a fallback to cloud services for cases where local inference is insufficient. +4. 
**Model Preloading**: For production applications, ensure your model is preloaded before starting your application. + +## Next steps + +- [Compile Hugging Face models for Foundry Local](./compile-models-for-foundry-local.md) +- [Explore the AI Foundry Local CLI reference](../reference/reference-cli.md) diff --git a/how-to/manage.md b/how-to/manage.md new file mode 100644 index 00000000..bf5ddf52 --- /dev/null +++ b/how-to/manage.md @@ -0,0 +1,15 @@ +# Manage Foundry Local + +TODO + +## Prerequisites + +- TODO + +## Section + +TODO + +## Next step + +TODO diff --git a/includes/integrate-examples/csharp.md b/includes/integrate-examples/csharp.md new file mode 100644 index 00000000..ddc64521 --- /dev/null +++ b/includes/integrate-examples/csharp.md @@ -0,0 +1,65 @@ +## Basic Integration + +```csharp +// Install with: dotnet add package Azure.AI.OpenAI +using Azure.AI.OpenAI; +using Azure; + +// Create a client +OpenAIClient client = new OpenAIClient( + new Uri("http://localhost:5272/v1"), + new AzureKeyCredential("not-needed-for-local") +); + +// Chat completions +ChatCompletionsOptions options = new ChatCompletionsOptions() +{ + Messages = + { + new ChatMessage(ChatRole.User, "What is AI Foundry Local?") + }, + DeploymentName = "Phi-4-mini-gpu-int4-rtn-block-32" // Use model name here +}; + +Response<ChatCompletions> response = await client.GetChatCompletionsAsync(options); +string completion = response.Value.Choices[0].Message.Content; +Console.WriteLine(completion); +``` + +## Streaming Response + +```csharp +// Install with: dotnet add package Azure.AI.OpenAI +using Azure.AI.OpenAI; +using Azure; +using System; +using System.Threading.Tasks; + +async Task StreamCompletionsAsync() +{ + OpenAIClient client = new OpenAIClient( + new Uri("http://localhost:5272/v1"), + new AzureKeyCredential("not-needed-for-local") + ); + + ChatCompletionsOptions options = new ChatCompletionsOptions() + { + Messages = + { + new ChatMessage(ChatRole.User, "Write a short story about AI") + }, + 
DeploymentName = "Phi-4-mini-gpu-int4-rtn-block-32" + }; + + await foreach (StreamingChatCompletionsUpdate update in client.GetChatCompletionsStreaming(options)) + { + if (update.ContentUpdate != null) + { + Console.Write(update.ContentUpdate); + } + } +} + +// Call the async method +await StreamCompletionsAsync(); +``` diff --git a/includes/integrate-examples/javascript.md b/includes/integrate-examples/javascript.md new file mode 100644 index 00000000..81e864bc --- /dev/null +++ b/includes/integrate-examples/javascript.md @@ -0,0 +1,129 @@ +## Using the OpenAI Node.js SDK + +```javascript +// Install with: npm install openai +import OpenAI from 'openai'; + +const openai = new OpenAI({ + baseURL: 'http://localhost:5272/v1', + apiKey: 'not-needed-for-local' +}); + +async function generateText() { + const response = await openai.chat.completions.create({ + model: 'Phi-4-mini-gpu-int4-rtn-block-32', + messages: [ + { role: 'user', content: 'How can I integrate AI Foundry Local with my app?' } + ], + }); + + console.log(response.choices[0].message.content); +} + +generateText(); +``` + +## Using Fetch API + +```javascript +async function queryModel() { + const response = await fetch('http://localhost:5272/v1/chat/completions', { + method: 'POST', + headers: { + 'Content-Type': 'application/json', + }, + body: JSON.stringify({ + model: 'Phi-4-mini-gpu-int4-rtn-block-32', + messages: [ + { role: 'user', content: 'What are the advantages of AI Foundry Local?' 
} + ] + }), + }); + + const data = await response.json(); + console.log(data.choices[0].message.content); +} + +queryModel(); +``` + +## Streaming Responses + +### Using OpenAI SDK + +```javascript +// Install with: npm install openai +import OpenAI from 'openai'; + +const openai = new OpenAI({ + baseURL: 'http://localhost:5272/v1', + apiKey: 'not-needed-for-local' +}); + +async function streamCompletion() { + const stream = await openai.chat.completions.create({ + model: 'Phi-4-mini-gpu-int4-rtn-block-32', + messages: [{ role: 'user', content: 'Write a short story about AI' }], + stream: true, + }); + + for await (const chunk of stream) { + if (chunk.choices[0]?.delta?.content) { + process.stdout.write(chunk.choices[0].delta.content); + } + } +} + +streamCompletion(); +``` + +### Using Fetch API and ReadableStream + +```javascript +async function streamWithFetch() { + const response = await fetch('http://localhost:5272/v1/chat/completions', { + method: 'POST', + headers: { + 'Content-Type': 'application/json', + 'Accept': 'text/event-stream', + }, + body: JSON.stringify({ + model: 'Phi-4-mini-gpu-int4-rtn-block-32', + messages: [{ role: 'user', content: 'Write a short story about AI' }], + stream: true, + }), + }); + + const reader = response.body.getReader(); + const decoder = new TextDecoder(); + + while (true) { + const { done, value } = await reader.read(); + if (done) break; + + const chunk = decoder.decode(value); + const lines = chunk.split('\n').filter(line => line.trim() !== ''); + + for (const line of lines) { + if (line.startsWith('data: ')) { + const data = line.substring(6); + if (data === '[DONE]') continue; + + try { + const json = JSON.parse(data); + const content = json.choices[0]?.delta?.content || ''; + if (content) { + // Print to console without line breaks, similar to process.stdout.write + process.stdout.write(content); + } + } catch (e) { + console.error('Error parsing JSON:', e); + } + } + } + } +} + +// Call the function to start 
streaming +streamWithFetch(); +``` diff --git a/includes/integrate-examples/python.md b/includes/integrate-examples/python.md new file mode 100644 index 00000000..8bab5670 --- /dev/null +++ b/includes/integrate-examples/python.md @@ -0,0 +1,67 @@ +## Using the OpenAI SDK + +```python +# Install with: pip install openai +import openai + +# Configure the client to use your local endpoint +client = openai.OpenAI( + base_url="http://localhost:5272/v1", + api_key="not-needed-for-local" # API key is not required for local usage +) + +# Chat completions +response = client.chat.completions.create( + model="Phi-4-mini-gpu-int4-rtn-block-32", # Use a model loaded in your service + messages=[ + {"role": "user", "content": "Explain how AI Foundry Local works."} + ] +) + +print(response.choices[0].message.content) +``` + +## Using Direct HTTP Requests + +```python +# Install with: pip install requests +import requests +import json + +url = "http://localhost:5272/v1/chat/completions" + +payload = { + "model": "Phi-4-mini-gpu-int4-rtn-block-32", + "messages": [ + {"role": "user", "content": "What are the benefits of running AI models locally?"} + ] +} + +headers = { + "Content-Type": "application/json" +} + +response = requests.post(url, headers=headers, data=json.dumps(payload)) +print(response.json()["choices"][0]["message"]["content"]) +``` + +## Streaming Response + +```python +import openai + +client = openai.OpenAI( + base_url="http://localhost:5272/v1", + api_key="not-needed-for-local" +) + +stream = client.chat.completions.create( + model="Phi-4-mini-gpu-int4-rtn-block-32", + messages=[{"role": "user", "content": "Write a short story about AI"}], + stream=True +) + +for chunk in stream: + if chunk.choices[0].delta.content is not None: + print(chunk.choices[0].delta.content, end="") +``` diff --git a/includes/integrate-examples/rest.md b/includes/integrate-examples/rest.md new file mode 100644 index 00000000..3fa90343 --- /dev/null +++ b/includes/integrate-examples/rest.md 
@@ -0,0 +1,19 @@
+## Basic Request
+
+For quick tests or integration with command-line scripts:
+
+```bash
+curl http://localhost:5272/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d "{\"model\": \"Phi-4-mini-gpu-int4-rtn-block-32\", \"messages\": [{\"role\": \"user\", \"content\": \"Tell me a short story\"}]}"
+```
+
+## Streaming Response
+
+**Note**: This example works, but because the raw server-sent events aren't parsed, the output is less readable than in the SDK examples.
+
+```bash
+curl http://localhost:5272/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d "{\"model\": \"Phi-4-mini-gpu-int4-rtn-block-32\", \"messages\": [{\"role\": \"user\", \"content\": \"Tell me a short story\"}], \"stream\": true}"
+```
diff --git a/media/architecture/foundry-local-arch.png b/media/architecture/foundry-local-arch.png
new file mode 100644
index 00000000..cf5066d6
Binary files /dev/null and b/media/architecture/foundry-local-arch.png differ
diff --git a/reference/reference-cli.md b/reference/reference-cli.md
new file mode 100644
index 00000000..b4268833
--- /dev/null
+++ b/reference/reference-cli.md
@@ -0,0 +1,58 @@
+# Foundry Local CLI Reference
+
+This article provides a comprehensive reference for the AI Foundry Local command-line interface (CLI). The foundry CLI is structured into several categories to help you manage models, control the service, and maintain your local cache.
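The streaming `curl` request above prints raw server-sent events. A few lines of Python can recover just the generated text; this is a minimal sketch that assumes the OpenAI-style `data:` line format shown in the other streaming examples:

```python
import json

def extract_deltas(sse_lines):
    """Collect the text deltas from OpenAI-style SSE 'data:' lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank lines and non-data fields
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        payload = json.loads(data)
        delta = payload["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)

# Illustrative captured stream lines (not real model output):
sample = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(extract_deltas(sample))  # Hello, world
```

Piping the curl output through a script built on a helper like this (reading lines from `sys.stdin`) produces output comparable to the SDK streaming examples.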
+
+## Overview
+
+To see all available commands, use the help option:
+
+```bash
+foundry --help
+```
+
+The foundry CLI is structured into these main categories:
+
+- **Model**: Commands related to managing and running models
+- **Service**: Commands for managing the AI Foundry Local service
+- **Cache**: Commands for managing the local cache where models are stored
+
+## Model commands
+
+The following table summarizes the commands related to managing and running models:
+
+| **Command**                      | **Description**                                                                          |
+| -------------------------------- | ---------------------------------------------------------------------------------------- |
+| `foundry model --help`           | Displays all available model-related commands and their usage.                           |
+| `foundry model run <model>`      | Runs a specified model, downloading it if not cached, and starts an interactive session. |
+| `foundry model list`             | Lists all available models for local use.                                                |
+| `foundry model info <model>`     | Displays detailed information about a specific model.                                    |
+| `foundry model download <model>` | Downloads a model to the local cache without running it.                                 |
+| `foundry model load <model>`     | Loads a model into the service.                                                          |
+| `foundry model unload <model>`   | Unloads a model from the service.                                                        |
+
+## Service commands
+
+The following table summarizes the commands related to managing and running the Foundry Local service:
+
+| **Command**               | **Description**                                                    |
+| ------------------------- | ------------------------------------------------------------------ |
+| `foundry service --help`  | Displays all available service-related commands and their usage.   |
+| `foundry service start`   | Starts the AI Foundry Local service.                               |
+| `foundry service stop`    | Stops the AI Foundry Local service.                                |
+| `foundry service restart` | Restarts the AI Foundry Local service.                             |
+| `foundry service status`  | Displays the current status of the AI Foundry Local service.       |
+| `foundry service ps`      | Lists all models currently loaded in the AI Foundry Local service. |
+| `foundry service logs`    | Displays the logs of the AI Foundry Local service.                 |
+| `foundry service set`     | Sets the configuration of the AI Foundry Local service.            |
+
+## Cache commands
+
+The following table summarizes the commands related to managing the local cache where models are stored:
+
+| **Command**                    | **Description**                                                |
+| ------------------------------ | -------------------------------------------------------------- |
+| `foundry cache --help`         | Displays all available cache-related commands and their usage. |
+| `foundry cache pwd`            | Displays the current cache directory.                          |
+| `foundry cache list`           | Lists all models stored in the local cache.                    |
+| `foundry cache remove <model>` | Deletes a model from the local cache.                          |
+| `foundry cache cd <path>`      | Changes the cache directory.                                   |
diff --git a/reference/reference-rest.md b/reference/reference-rest.md
new file mode 100644
index 00000000..06a07813
--- /dev/null
+++ b/reference/reference-rest.md
@@ -0,0 +1,16 @@
+---
+title: Foundry Local REST API Reference
+titleSuffix: AI Foundry Local
+description: Reference for Foundry Local REST API.
+manager: scottpolly
+ms.service: azure-ai-foundry
+ms.custom: build-2025
+ms.topic: conceptual
+ms.date: 02/12/2025
+ms.author: samkemp
+author: samuel100
+---
+
+# Foundry Local REST API Reference
+
+TODO
diff --git a/tutorials/chat-application-with-open-web-ui.md b/tutorials/chat-application-with-open-web-ui.md
new file mode 100644
index 00000000..bf581c5d
--- /dev/null
+++ b/tutorials/chat-application-with-open-web-ui.md
@@ -0,0 +1,62 @@
+---
+title: Build a Chat application with Open Web UI
+titleSuffix: AI Foundry Local
+description: Learn how to build a chat application with Foundry Local and Open Web UI
+manager: scottpolly
+keywords: Azure AI services, cognitive, AI models, local inference
+ms.service: azure-ai-foundry
+ms.topic: tutorial
+ms.date: 02/20/2025
+ms.reviewer: samkemp
+ms.author: samkemp
+author: samuel100
+ms.custom: build-2025
+#customer intent: As a developer, I want to build a chat application with Open Web UI so that I can chat with AI models running locally.
+---
+
+# Build a Chat application with Open Web UI
+
+This tutorial guides you through setting up a chat application using AI Foundry Local and Open Web UI. By the end, you'll have a fully functional chat interface running locally on your device.
+
+## Prerequisites
+
+Before beginning this tutorial, make sure you have:
+
+- **AI Foundry Local** [installed](../get-started.md) on your machine.
+- **At least one model loaded** using the `foundry model load` command, for example:
+  ```bash
+  foundry model load Phi-4-mini-gpu-int4-rtn-block-32
+  ```
+
+## Set up Open Web UI for chat
+
+1. **Install Open Web UI** by following the installation instructions in the [Open Web UI GitHub repository](https://github.com/open-webui/open-webui).
+
+2. **Start Open Web UI** by running the following command in your terminal:
+
+   ```bash
+   open-webui serve
+   ```
+
+   Then open your browser and navigate to [http://localhost:8080](http://localhost:8080).
+
+3. 
**Connect Open Web UI to AI Foundry Local**:
+
+   - Go to **Settings** in the navigation menu
+   - Select **Connections**
+   - Choose **Manage Direct Connections**
+   - Click the **+** icon to add a new connection
+   - For the URL, enter `http://localhost:5272/v1`
+   - For the API Key, enter any placeholder value (for example, `test`); the field can't currently be left blank
+   - Save the connection
+
+4. **Start chatting with your model**:
+   - The model list should automatically populate at the top of the UI
+   - Select one of your loaded models from the dropdown
+   - Begin your chat in the input box at the bottom of the screen
+
+That's it! You're now chatting with your AI model running completely locally on your device.
+
+## Next steps
+
+- Try [different models](../how-to/load-models.md) to compare performance and capabilities
diff --git a/what-is-ai-foundry-local.md b/what-is-ai-foundry-local.md
new file mode 100644
index 00000000..6444223a
--- /dev/null
+++ b/what-is-ai-foundry-local.md
@@ -0,0 +1,36 @@
+# What is AI Foundry Local?
+
+AI Foundry Local is an on-device version of Azure AI Foundry that runs large language models (LLMs) directly on your hardware. This on-device AI inference solution provides privacy, customization, and cost benefits compared to cloud-based alternatives. Best of all, it fits into your existing workflows and applications with an easy-to-use CLI and REST API!
+
+Foundry Local builds on the optimization work of ONNX Runtime, Olive, and the ONNX ecosystem to deliver a highly optimized, performant experience for running AI models locally.
+
+## Key features
+
+- **On-Device Inference**: Run LLMs locally on your own hardware, reducing dependency on cloud services while keeping your data on-device.
+- **Model Customization**: Choose from preset models or bring your own to match your specific requirements and use cases.
+- **Cost Efficiency**: Avoid recurring cloud service costs by using your existing hardware, making AI tasks more accessible. +- **Seamless Integration**: Easily interface with your applications via an endpoint or test with the CLI, with the option to scale to Azure AI Foundry as your workload demands increase. + +## Use cases + +AI Foundry Local is ideal for scenarios where: + +- Data privacy and security are paramount +- You need to operate in environments with limited or no internet connectivity +- You want to reduce cloud inference costs +- You need low-latency AI responses for real-time applications +- You want to experiment with AI models before deploying to a cloud environment + +## Pricing and billing + +Entirely Free! You're using your own hardware, and there are no extra costs associated with running AI models locally. + +## How to get access + +Download from the Microsoft Store. (WIP) + +## Next steps + +- [Get started with AI Foundry Local](./get-started.md) +- [Compile Hugging Face models for Foundry Local](./how-to/compile-models-for-foundry-local.md) +- [Learn more about ONNX Runtime](https://onnxruntime.ai/docs/)
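As the **Seamless Integration** point above notes, any OpenAI-compatible client can talk to the local endpoint. The sketch below uses only the Python standard library; the endpoint URL and model name are reused from the examples in this repo and may differ on your machine:

```python
import json
import urllib.request

def build_chat_request(model, prompt, base_url="http://localhost:5272/v1"):
    """Build an OpenAI-style chat completion request for the local endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Phi-4-mini-gpu-int4-rtn-block-32", "Hello!")
print(req.full_url)  # http://localhost:5272/v1/chat/completions

# Sending the request requires the Foundry Local service to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```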