A simple FastAPI project with a chat interface and API endpoints, featuring a local LLM optimized for macOS.
- Chat Interface: Chat UI with multiple conversation threads, similar to popular AI chat applications
- Dark/Light Theme: Toggle between dark and light modes
- Local LLM Integration: Run AI models directly on your machine
- Model Switching: Change between different models on-the-fly
- API Endpoints: Access LLM functionality programmatically
This project uses uv for dependency management. If you don't have uv installed:
```bash
# Install uv (using pip)
pip install uv

# Install dependencies using uv
uv pip install -e .
```

This project uses strict typing with mypy. To run type checks:
```bash
uv run mypy main.py
```

```bash
# Run in development mode with auto-reload
python main.py
```
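For reference, here is a minimal sketch of what a strictly typed, auto-reloading entrypoint can look like. The route shown is the project's /welcome endpoint, but the handler body is illustrative, not the actual main.py:

```python
# Illustrative sketch only -- the real main.py may be structured differently.
from fastapi import FastAPI
import uvicorn

app = FastAPI()


@app.get("/welcome")
def welcome() -> dict[str, str]:
    # Explicit return annotation keeps strict mypy checks happy
    return {"message": "Welcome!"}


if __name__ == "__main__":
    # reload=True provides the auto-reload behaviour mentioned above
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)
```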
For macOS users (especially with Apple Silicon), use the optimized script:

```bash
./run_macos.sh
```

The script automatically detects your available RAM and selects the best model.
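The script itself is shell, but the selection it performs amounts to something like the following Python sketch; the exact thresholds are assumptions based on the RAM ranges listed under "Available models" below:

```python
# Rough sketch of the RAM-based selection run_macos.sh performs (thresholds assumed).
import subprocess


def pick_model_tier() -> str:
    # hw.memsize reports total physical RAM in bytes on macOS
    mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
    mem_gb = mem_bytes / 1024**3
    if mem_gb >= 8:
        return "medium"  # microsoft/phi-2, best quality on 8GB+ machines
    if mem_gb >= 4:
        return "small"   # bigscience/bloom-560m
    return "tiny"        # TinyLlama, smallest footprint


print(pick_model_tier())
```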
The application will be available at:
- Chat Interface: http://localhost:8000/
- API Documentation: http://localhost:8000/docs
For production deployment, use the provided production script:
```bash
./run_production.sh
```

- Access the Chat Interface: Open your browser and go to http://localhost:8000/
- Create New Conversations: Click "New Chat" to start a new thread
- Switch Between Threads: Click on any thread in the sidebar to switch contexts
- Change Models: Use the dropdown menu in the top-right to switch between models
- Toggle Dark/Light Mode: Click the moon/sun icon to change the theme
- `GET /`: Chat interface
- `POST /api/chat`: Generate a chat response
- `POST /api/set-model`: Change the active model
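For example, the chat endpoints can be called with any HTTP client. The JSON field names below ("model", "message") are assumptions for illustration; check the schemas at /docs for the exact request format:

```python
import requests  # any HTTP client works; requests is assumed to be installed

BASE = "http://localhost:8000"

# Switch the active model (field name assumed)
requests.post(f"{BASE}/api/set-model", json={"model": "tiny"})

# Request a chat response (field name assumed)
resp = requests.post(f"{BASE}/api/chat", json={"message": "Hello!"})
print(resp.json())
```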
- `GET /welcome`: Returns a welcome message
- `GET /items`: Returns all items in the collection
- `POST /items`: Add a new item to the collection
- `GET /llm/info`: Get information about available models
- `POST /llm/generate`: Generate a response from the LLM model
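Likewise for the programmatic endpoints; the "prompt" field below is an illustrative guess, so verify the actual schema in the interactive docs:

```python
import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/welcome").json())   # welcome message
print(requests.get(f"{BASE}/llm/info").json())  # available model info

# Generate text from the local LLM ("prompt" field assumed)
out = requests.post(f"{BASE}/llm/generate", json={"prompt": "Tell me a joke"})
print(out.json())
```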
This project has been optimized to work on macOS with Apple Silicon (M1/M2/M3). It uses:
- MPS (Metal Performance Shaders) for GPU acceleration when available (see the sketch after this list)
- Models that are compatible with 8GB RAM on macOS
- Memory optimizations for efficient inference
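A minimal sketch of that device selection, assuming PyTorch is the inference backend (the project's actual code may differ):

```python
import torch


def pick_device() -> torch.device:
    # Prefer Apple's Metal backend (MPS) when this PyTorch build exposes it,
    # otherwise fall back to CPU.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


print(pick_device())  # device(type='mps') on an M-series Mac with a recent PyTorch
```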
You can select which LLM model to use by setting the LLM_MODEL environment variable:
```bash
# Use the tiny model (default, suitable for systems with limited RAM)
LLM_MODEL=tiny python main.py

# Use the small model (better capabilities but requires more RAM)
LLM_MODEL=small python main.py

# Use the medium model (best capabilities on 8GB RAM)
LLM_MODEL=medium python main.py
```

Available models:

- tiny: TinyLlama-1.1B-Chat-v1.0 (works on 4-8GB RAM)
- small: bigscience/bloom-560m (works on 4-8GB RAM)
- medium: microsoft/phi-2 (works on 8GB+ RAM)
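Internally this amounts to mapping the LLM_MODEL value to a Hugging Face checkpoint. A rough sketch, assuming the transformers library and the standard repository IDs for the models listed above (exact loading options may differ):

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint names assumed from the model list above
MODEL_MAP = {
    "tiny": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "small": "bigscience/bloom-560m",
    "medium": "microsoft/phi-2",
}

checkpoint = MODEL_MAP[os.environ.get("LLM_MODEL", "tiny")]
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
```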
See TRAINING.md for information on fine-tuning models with custom datasets.
You can test the API endpoints with the interactive Swagger UI at http://localhost:8000/docs