Build intelligent digital personas that remember, learn, and evolve.
A full-stack multimodal AI agent framework that combines conversational AI, vector memory, knowledge graphs, and face recognition into a cohesive digital human experience.
Current AI assistants are stateless. Each conversation starts from scratch. They don't remember past interactions, can't recognize familiar faces, and lack persistent understanding of the people they talk to.
What if an AI could:
- Remember your preferences across sessions?
- Recognize you by face and know your history?
- Build a growing knowledge graph of everything you've shared?
- Understand context from images, voice, and video — not just text?
AI Digital Human is an open-source framework for building persistent, multimodal AI personas that bridge the gap between stateless chatbots and truly intelligent digital companions.
Unlike traditional RAG systems that only retrieve documents, or chatbot frameworks that only handle text, AI Digital Human creates holistic digital identities with:
- Face identity — Recognize who you're talking to
- Episodic memory — Remember conversation history via vector similarity
- Structured knowledge — Build knowledge graphs of entities and relationships
- Multimodal understanding — Process images, audio, and video alongside text
- Configurable personality — Adjust tone, pace, and speaking style
User Input (text / image / audio / video)
│
▼
┌──────────────────────────────────────────────────────┐
│ Multimodal Orchestrator │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐│
│ │ Face │ │ STT │ │ Vision ││
│ │Recognition│ │(Whisper) │ │ (GLM-4V/GPT-4V)││
│ └────┬─────┘ └────┬─────┘ └────────┬─────────┘│
│ │ │ │ │
│ └──────────────┴───────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ Context Assembler │ │
│ │ ┌─────────────────┐ │ │
│ │ │ User Profile │ │◄── Face ID │
│ │ │ Vector Memory │ │◄── Qdrant │
│ │ │ Knowledge Graph │ │◄── PostgreSQL │
│ │ │ + System Prompt│ │◄── Speaking Style │
│ │ └─────────────────┘ │ │
│ └────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ LLM Orchestrator │ │
│ │ (GLM-4 / GPT-4o) │ │
│ └────────────────────────┘ │
│ │ │
└──────────────────────┼──────────────────────────────┘
│
▼
Streaming Response
| Store | Technology | Purpose | What's Stored |
|---|---|---|---|
| Vector Memory | Qdrant | Semantic similarity search | Conversation excerpts, facts, preferences |
| Knowledge Graph | PostgreSQL | Structured entity relationships | People, concepts, events, connections |
This dual-store approach mirrors how humans use both associative memory (similarity-based) and semantic memory (structured facts).
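The dual-store idea can be sketched in a few lines of plain Python, with a list standing in for Qdrant and a dict standing in for the PostgreSQL knowledge graph. The class and method names here are illustrative, not the project's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class DualStoreMemory:
    """Illustrative dual-store memory: associative (vector) + semantic (graph)."""
    vectors: list = field(default_factory=list)   # stands in for Qdrant
    facts: dict = field(default_factory=dict)     # stands in for the PostgreSQL KG

    def remember_excerpt(self, text: str, embedding: list[float]) -> None:
        # Episodic side: store the excerpt with its embedding for similarity search.
        self.vectors.append((embedding, text))

    def remember_fact(self, entity: str, relation: str, target: str) -> None:
        # Semantic side: store a structured (entity, relation, target) triple.
        self.facts.setdefault(entity, []).append((relation, target))

    def recall(self, query_embedding: list[float], top_k: int = 3) -> list[str]:
        # Rank stored excerpts by cosine similarity to the query embedding.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(y * y for y in b) ** 0.5
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.vectors, key=lambda v: cosine(query_embedding, v[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]

mem = DualStoreMemory()
mem.remember_excerpt("User likes hiking", [1.0, 0.0])
mem.remember_excerpt("User dislikes cilantro", [0.0, 1.0])
mem.remember_fact("User", "likes", "hiking")
print(mem.recall([0.9, 0.1], top_k=1))  # → ['User likes hiking']
```

A similarity query answers "what did we talk about that feels like this?", while the fact store answers "what do I know for certain about this person?" — the two retrieval paths complement each other exactly as the table above describes.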
Upload Image → InsightFace (ArcFace) → 512-dim Embedding
│
▼
Qdrant Vector Search
│
▼
Cosine Similarity > Threshold
│
▼
User Identity + Profile
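The matching step of the pipeline above reduces to a thresholded nearest-neighbor search over face embeddings. The sketch below uses toy 3-dimensional vectors and a plain dict in place of real 512-dim ArcFace embeddings and Qdrant; the `identify` helper and the 0.6 threshold are illustrative assumptions, not the project's tuned values:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def identify(query: list[float], known: dict[str, list[float]], threshold: float = 0.6):
    """Return (user_id, score) for the best match above threshold, else (None, score)."""
    best_id, best_score = None, -1.0
    for user_id, emb in known.items():
        score = cosine_similarity(query, emb)
        if score > best_score:
            best_id, best_score = user_id, score
    return (best_id, best_score) if best_score > threshold else (None, best_score)

# Toy 3-dim embeddings standing in for real 512-dim ArcFace vectors.
known_faces = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 0.8, 0.6]}
print(identify([0.88, 0.12, 0.01], known_faces)[0])  # → alice
```

In production this loop is what Qdrant's vector search performs server-side; the threshold gate is what turns "nearest face" into "recognized face or stranger".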
Before every LLM call, ChatService assembles context from:
- Speaking Style — Tone, pace, formality from database
- User Profile — Name, preferences, appearance from Knowledge Graph
- Relevant Memories — Top-K semantically similar past conversations
- Knowledge Subgraph — Entities and relationships relevant to the query
- Multimodal Analysis — Extracted from images/audio/video inputs
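The assembly step above is essentially structured string building. A minimal sketch, with hypothetical field names (the real ChatService pulls these sources from the database, Qdrant, and the knowledge graph):

```python
def assemble_context(style: str, profile: dict, memories: list[str], facts: list[str]) -> str:
    """Concatenate the context sources into a single system prompt (illustrative)."""
    sections = [
        f"Speaking style: {style}",
        "User profile: " + ", ".join(f"{k}={v}" for k, v in profile.items()),
        "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories),
        "Known facts:\n" + "\n".join(f"- {f}" for f in facts),
    ]
    return "\n\n".join(sections)

prompt = assemble_context(
    style="warm, concise",
    profile={"name": "Alice", "likes": "hiking"},
    memories=["Asked about trail shoes last week"],
    facts=["Alice -> lives_in -> Denver"],
)
print(prompt.splitlines()[0])  # → Speaking style: warm, concise
```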
Conversation Text
│
▼
LLM (structured output)
│
▼
┌──────────────────────────────────────┐
│ Entities: {name, type, attributes} │
│ Relationships: {source, target, │
│ type, strength} │
└──────────────────────────────────────┘
│
▼
PostgreSQL (Knowledge Graph)
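Before the extracted graph is written to PostgreSQL, the LLM's structured output needs validation — models occasionally emit relationships that reference entities they never defined. A minimal validation sketch (the JSON shape below is a hypothetical example of the structured output, not the project's exact schema):

```python
import json

# Hypothetical structured output the LLM is prompted to return.
llm_output = """
{
  "entities": [
    {"name": "Alice", "type": "person", "attributes": {"hobby": "hiking"}},
    {"name": "Denver", "type": "place", "attributes": {}}
  ],
  "relationships": [
    {"source": "Alice", "target": "Denver", "type": "lives_in", "strength": 0.9}
  ]
}
"""

def parse_graph_update(raw: str) -> tuple[list[dict], list[dict]]:
    """Validate the LLM's JSON before writing it to the knowledge graph."""
    data = json.loads(raw)
    entities = [e for e in data.get("entities", []) if e.get("name") and e.get("type")]
    names = {e["name"] for e in entities}
    # Keep only relationships whose endpoints are known entities.
    relationships = [
        r for r in data.get("relationships", [])
        if r.get("source") in names and r.get("target") in names
    ]
    return entities, relationships

entities, relationships = parse_graph_update(llm_output)
print(len(entities), len(relationships))  # → 2 1
```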
This project draws from several research areas:
| Field | Application in This Project |
|---|---|
| Retrieval-Augmented Generation (RAG) | Vector similarity search for relevant memories |
| Knowledge Graphs (KG) | Entity extraction + relationship mapping |
| Face Recognition | InsightFace ArcFace embeddings |
| Multimodal Learning | Unified processing of text/image/audio/video |
| Persona Engineering | Configurable system prompts + speaking styles |
| Agent Memory Systems | Dual-store memory architecture |
| Aspect | Traditional Chatbot | AI Digital Human |
|---|---|---|
| Identity | Anonymous | Face-recognized |
| Memory | Session-only | Persistent vector + KG |
| Context | Last N messages | Full history + profile |
| Input | Text only | Multimodal |
| Personality | Fixed | Configurable per-user |
| Knowledge | LLM weights | Extracted + structured |
# 1. Clone
git clone https://github.com/SonicBotMan/AI-Digital-Human
cd AI-Digital-Human
# 2. Configure
cp .env.default .env
# Edit .env — add your GLM_API_KEY (free at https://open.bigmodel.cn)
# 3. Deploy
./deploy.sh --production
# 4. Access
open http://localhost:3000

| Category | Default | Alternatives |
|---|---|---|
| LLM | GLM-4-Flash (free) | GPT-4o, MiniMax Text-01 |
| Vision | GLM-4V-Flash | GPT-4o Vision |
| STT | Whisper Turbo | — |
| Face | Buffalo_L (ArcFace) | — |
Backend: FastAPI · Pydantic v2 · SQLAlchemy 2.0 (async) · PostgreSQL · Qdrant · Redis
Frontend: Next.js 14 · React 18 · TypeScript · Tailwind CSS · shadcn/ui · React Flow
AI: GLM-4 (ZhipuAI) · InsightFace · faster-whisper
# Chat with streaming
POST /api/chat/stream # WebSocket for real-time streaming
# Face identity
POST /api/faces/register # Register face → embedding stored in Qdrant
POST /api/faces/identify # Upload image → returns matched user
# Knowledge graph
GET /api/knowledge/{user_id}/graph # Full KG as React Flow nodes/edges
POST /api/knowledge/{user_id}/entities # Add entity
POST /api/knowledge/{user_id}/relationships # Add relationship
# Multimodal
POST /api/analyze                            # Upload image/audio/video → AI analysis + entity extraction

Full API docs: http://localhost:8000/api/docs
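For orientation, the graph endpoint serves the knowledge graph directly as React Flow-style nodes and edges. The payload below is a hypothetical example of that shape (field values invented for illustration), along with the consistency check a client might run:

```python
# Hypothetical shape of GET /api/knowledge/{user_id}/graph — React Flow nodes/edges.
graph = {
    "nodes": [
        {"id": "alice", "data": {"label": "Alice", "type": "person"}},
        {"id": "denver", "data": {"label": "Denver", "type": "place"}},
    ],
    "edges": [
        {"id": "e1", "source": "alice", "target": "denver", "label": "lives_in"},
    ],
}

# Every edge must reference node ids that exist, or React Flow drops it silently.
node_ids = {n["id"] for n in graph["nodes"]}
assert all(e["source"] in node_ids and e["target"] in node_ids for e in graph["edges"])
print("graph is consistent")
```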
# Backend (Python 3.11+)
cd apps/api
source venv/bin/activate
uvicorn app.main:app --reload
# Frontend (Node.js 18+)
cd apps/web
npm install
npm run dev
# Run tests
cd apps/api && pytest tests/apps/
├── api/
│ ├── app/
│ │ ├── main.py # FastAPI entry point
│ │ ├── services/ # Business logic
│ │ │ ├── chat_service.py # Orchestrates context → LLM
│ │ │ ├── graph_service.py # Entity extraction + KG CRUD
│ │ │ ├── memory_service.py # Qdrant vector operations
│ │ │ ├── face_service.py # InsightFace integration
│ │ │ └── llm_service.py # LLM abstraction (GLM/GPT)
│ │ ├── routers/ # API endpoints
│ │ ├── models/ # SQLAlchemy + Pydantic schemas
│ │ └── mcp/ # Model Context Protocol server
│ └── tests/ # pytest + httpx
│
└── web/
├── src/
│ ├── app/ # Next.js App Router pages
│ ├── components/ # React components
│ │ ├── graph/ # React Flow knowledge graph
│ │ └── chat/ # Chat UI components
│ └── hooks/ # Custom React hooks
└── public/
MIT License — see LICENSE.
Star ⭐ if you find this interesting — contributions welcome!