This guide covers everything you need to know for developing with Fluid Server, from initial setup to building and testing.
- OS: Windows 10/11
- Python: 3.10+ with the `uv` package manager
- Memory: 8GB+ RAM (16GB recommended for 8B models)
- Storage: 10GB+ free space for models
For the OpenVINO backend (Intel):
- Runtime: OpenVINO 2025.2.0+
- Hardware: Intel Arc graphics or an Intel NPU

For the QNN backend (Snapdragon):
- Runtime: ONNX Runtime QNN (bundled with the dependencies)
- Hardware: Snapdragon X Elite device with HTP (Hexagon Tensor Processor)
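Which backend applies depends on the machine. The server's own detection lives in `utils/platform_utils.py`; as a rough illustration of the distinction (a sketch, not the server's actual logic):

```python
# Rough illustration only -- the server's real detection is in
# src/fluid_server/utils/platform_utils.py.
import platform

machine = platform.machine().lower()
if machine in ("arm64", "aarch64"):
    print("ARM64 detected: likely Snapdragon X Elite, QNN backend applies")
else:
    print("x86_64 detected: likely Intel, OpenVINO backend applies")
```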
```powershell
# Install uv (if not already installed)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Download and install the OpenVINO runtime from Intel's website
# https://docs.openvino.ai/2025/get-started/install-openvino.html

# Clone the repository
git clone https://github.com/FluidInference/fluid-server.git
cd fluid-server

# Install all dependencies including development tools
uv sync
```
```powershell
# Install development dependencies separately (if needed)
uv add --dev ty
```

```powershell
# Check that dependencies are installed correctly
uv run python -c "import fluid_server; print('Setup successful')"
```

```powershell
# Run with auto-reload for development
uv run python -m fluid_server --reload

# Run with custom model path and debug logging
uv run python -m fluid_server --model-path ./models --log-level DEBUG --reload

# Use the convenience script
.\scripts\start_server.ps1
```

- `--reload` - Auto-reload on code changes (development only)
- `--model-path` - Custom path to the model directory
- `--log-level` - Set the logging level (DEBUG, INFO, WARNING, ERROR)
- `--host` - Server host (default: 127.0.0.1)
- `--port` - Server port (default: 8080)
```powershell
# Run type checking with ty
.\scripts\typecheck.ps1

# Or run directly
uv run ty
```

```powershell
# Format code with ruff
uv run ruff format .

# Check for linting issues
uv run ruff check .

# Fix auto-fixable linting issues
uv run ruff check --fix .
```

```powershell
# Test health endpoint
curl http://localhost:8080/health
# Test models endpoint
curl http://localhost:8080/v1/models
# Test basic chat completion
curl -X POST http://localhost:8080/v1/chat/completions `
-H "Content-Type: application/json" `
-d '{\"model\": \"current\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}'# Test with actual models (requires model downloads)
```powershell
# Test with actual models (requires model downloads)
.\scripts\test_with_models.ps1

# Kill server if needed
.\scripts\kill_server.ps1
```

```powershell
# Build standalone .exe with PyInstaller
.\scripts\build.ps1
```

The build script creates `dist/fluid-server.exe` (approximately 276 MB with OpenVINO + QNN bundled).
The build process:
- Installs dependencies with `uv sync`
- Runs type checking with `ty`
- Creates the executable with PyInstaller
- Includes all necessary runtime libraries
```powershell
# Test the built executable
.\scripts\test_exe.ps1

# Run the executable directly
.\dist\fluid-server.exe

# Run with custom options
.\dist\fluid-server.exe --host 127.0.0.1 --port 8080 --log-level DEBUG
```
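For a scripted smoke test of the build, one option is to launch the executable, poll `/health`, and shut it down. The sketch below assumes the default port and the paths from this guide:

```python
import subprocess
import time
import urllib.request

# Start the built executable and wait for /health to respond.
proc = subprocess.Popen([r".\dist\fluid-server.exe", "--port", "8080"])
try:
    for _ in range(30):
        try:
            with urllib.request.urlopen("http://127.0.0.1:8080/health") as resp:
                print("health check:", resp.status)
                break
        except OSError:
            time.sleep(1)  # server still starting; retry
finally:
    proc.terminate()
```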
The repository is organized as follows:

```
src/fluid_server/
├── __main__.py                 # CLI entry point
├── app.py                      # FastAPI application factory
├── config.py                   # Server configuration
├── api/                        # API endpoints
│   ├── v1/
│   │   ├── chat.py             # Chat completions
│   │   ├── audio.py            # Audio transcription
│   │   ├── models.py           # Model management
│   │   └── embeddings.py       # Text embeddings
│   └── health.py               # Health checks
├── managers/                   # Core business logic
│   ├── runtime_manager.py      # Model loading/unloading
│   └── embedding_manager.py    # Embedding generation
├── runtimes/                   # Model runtime implementations
│   ├── base.py                 # Abstract base runtime
│   ├── openvino_llm.py         # OpenVINO LLM runtime
│   ├── openvino_whisper.py     # OpenVINO Whisper runtime
│   ├── llamacpp.py             # Llama.cpp runtime
│   └── qnn_whisper.py          # QNN Whisper runtime
├── storage/                    # Data persistence
│   └── lancedb_client.py       # LanceDB vector storage
└── utils/                      # Utilities
    ├── model_utils.py          # Model discovery/downloading
    └── platform_utils.py       # Platform detection
```
Models are expected under the following layout:

```
models/
├── llm/                                       # Language models
│   ├── qwen3-8b-int8-ov/                      # OpenVINO LLM models
│   └── phi-4-mini/                            # Additional LLM models
├── whisper/                                   # Audio transcription models
│   ├── whisper-large-v3-turbo-ov-npu/         # OpenVINO Whisper
│   ├── whisper-large-v3-turbo-qnn/            # QNN Whisper
│   └── whisper-tiny/                          # Smaller models
├── embeddings/                                # Text embedding models
│   └── sentence-transformers_all-MiniLM-L6-v2/
└── cache/                                     # Compiled model cache
```
```powershell
# Set development environment variables
$env:PYTHONPATH = "src"
$env:FLUID_LOG_LEVEL = "DEBUG"
$env:FLUID_MODEL_PATH = "./models"
```
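A sketch of how such variables are typically consumed (the server's actual handling is in `config.py`; the default values here are assumptions):

```python
import os

# Illustrative only -- see config.py for the real configuration handling.
log_level = os.environ.get("FLUID_LOG_LEVEL", "INFO")
model_path = os.environ.get("FLUID_MODEL_PATH", "./models")
print(f"log level: {log_level}, model path: {model_path}")
```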
Create `.vscode/settings.json`:

```json
{
    "python.defaultInterpreterPath": ".venv/Scripts/python.exe",
    "python.linting.enabled": true,
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": false,
    "[python]": {
        "editor.defaultFormatter": "charliermarsh.ruff"
    }
}
```

```powershell
# Ensure PYTHONPATH includes src directory
$env:PYTHONPATH = "src"
uv run python -m fluid_server
```

```powershell
# Run with debug logging to see model loading details
uv run python -m fluid_server --log-level DEBUG
```

```powershell
# Kill any existing server processes
.\scripts\kill_server.ps1

# Or use a different port
uv run python -m fluid_server --port 8081
```

```python
import logging

# Enable debug logging for specific components
logging.getLogger("fluid_server.managers.runtime_manager").setLevel(logging.DEBUG)
logging.getLogger("fluid_server.runtimes").setLevel(logging.DEBUG)
```
```powershell
# Run with performance profiling
uv run python -m cProfile -o profile_output.pstats -m fluid_server

# Analyze profile results
uv run python -c "import pstats; pstats.Stats('profile_output.pstats').sort_stats('cumulative').print_stats(20)"
```

- Follow PEP 8 style guidelines
- Use type hints for all function parameters and return values (see the example after this list)
- Format code with `ruff format`
- Ensure all linting checks pass with `ruff check`
- Use conventional commit messages
- Include relevant tests for new features
- Ensure all existing tests pass
- Update documentation for API changes
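For instance, a function in the expected style, fully annotated and `ruff format`-clean (a made-up helper, purely illustrative):

```python
def chunk_text(text: str, max_words: int = 256) -> list[str]:
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)
    ]
```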
- Fork the repository
- Create a feature branch from `main`
- Make your changes with proper testing
- Ensure all checks pass (linting, type checking, tests)
- Submit a pull request with a clear description
- Implement the `BaseRuntime` abstract class (see the sketch after this list)
- Add the runtime to the `RuntimeManager`
- Update configuration options
- Add appropriate tests
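A minimal sketch of what that looks like; the real abstract interface is defined in `runtimes/base.py`, and the method names below (`load`, `unload`, `generate`) are illustrative assumptions rather than the actual signatures:

```python
from fluid_server.runtimes.base import BaseRuntime  # actual interface lives here


class MyRuntime(BaseRuntime):
    """Hypothetical runtime backed by some new inference engine."""

    async def load(self, model_path: str) -> None:
        # Load model weights and warm up the engine.
        ...

    async def unload(self) -> None:
        # Release memory so another runtime can load.
        ...

    async def generate(self, prompt: str) -> str:
        # Run inference and return the completion text.
        ...
```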
- Create new endpoint modules in `api/v1/` (a sketch follows this list)
- Register routes in the main application
- Add request/response models using Pydantic
- Include comprehensive error handling
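A sketch of a new endpoint module in that style; the route and model names are illustrative, not taken from the codebase:

```python
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter()


class EchoRequest(BaseModel):
    text: str


class EchoResponse(BaseModel):
    text: str


@router.post("/v1/echo", response_model=EchoResponse)
async def echo(request: EchoRequest) -> EchoResponse:
    # Reject invalid input explicitly rather than failing downstream.
    if not request.text:
        raise HTTPException(status_code=400, detail="text must be non-empty")
    return EchoResponse(text=request.text)
```

The router would then be registered on the FastAPI app, e.g. with `app.include_router(router)` in `app.py`.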
- Use async/await for I/O operations (see the sketch after this list)
- Implement connection pooling for external services
- Cache frequently accessed data
- Monitor memory usage with large models
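To illustrate the first and third tips together, a sketch combining non-blocking I/O with a simple cache (the helper names are hypothetical):

```python
import asyncio
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=128)
def list_models(models_dir: str) -> tuple[str, ...]:
    # Expensive directory scan, cached so repeated calls are cheap.
    return tuple(p.name for p in Path(models_dir).iterdir() if p.is_dir())


async def handle_models_request(models_dir: str) -> list[str]:
    # Push blocking filesystem work off the event loop so the server
    # keeps serving other requests while it runs.
    return list(await asyncio.to_thread(list_models, models_dir))
```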
This development guide provides the foundation for contributing to and extending Fluid Server. For specific implementation details, refer to the existing codebase and follow the established patterns.