A semantic chunking service for documents, GitHub repos, webpages, and Confluence. It creates semantically relevant chunks from your content that can be queried later.
- Multi-source ingestion: Markdown files, GitHub repositories, webpages, Confluence pages
- Website crawling: Ingest entire documentation sites via sitemap or link crawling
- Semantic chunking: Intelligent text chunking with code-awareness
- Code-aware: Extracts code blocks with context and language detection
- Vector search: Powered by Qdrant for fast semantic search
- Fast embeddings: Uses FastEmbed (BAAI/bge-small-en-v1.5) for efficient embedding generation
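To make the "code-aware" feature above more concrete, here is a minimal sketch of pulling fenced code blocks out of Markdown together with their language tag and the nearest preceding line of prose. It is not DeepContext's implementation: the names `extract_code_blocks` and `CodeBlock`, and the rule of treating the fence's info string as the language, are assumptions made for illustration.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block


@dataclass
class CodeBlock:
    language: str  # fence info string, or "" when none is given
    code: str
    context: str   # last non-empty line before the block


# Hypothetical pattern: opening fence with optional language tag,
# a non-greedy body, then a closing fence at the start of a line.
_BLOCK_RE = re.compile(
    rf"^{FENCE}([\w+-]*)[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_code_blocks(markdown: str) -> list[CodeBlock]:
    blocks = []
    for match in _BLOCK_RE.finditer(markdown):
        # Everything before the block, reduced to its non-empty lines.
        preceding = [
            line.strip()
            for line in markdown[: match.start()].splitlines()
            if line.strip()
        ]
        blocks.append(
            CodeBlock(
                language=match.group(1).lower(),
                code=match.group(2).rstrip("\n"),
                context=preceding[-1] if preceding else "",
            )
        )
    return blocks


sample = f"Redirect example:\n\n{FENCE}typescript\nexport const x = 1\n{FENCE}\n"
for block in extract_code_blocks(sample):
    print(block.language, "|", block.context)
```

A real implementation would probably also handle indented fences, tilde fences, and language guessing for untagged blocks; this sketch deliberately skips those.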
# Using Docker
docker compose up -d
# Or using Podman
podman compose up -d
uv sync
# From a local markdown file
uv run deepcontext ingest ./docs/README.md
# From a GitHub URL
uv run deepcontext ingest https://github.com/vercel/next.js/blob/canary/contributing/core/developing.md
# From a webpage
uv run deepcontext ingest https://nextjs.org/docs/app/building-your-application/routing/middleware
# From a Confluence page (requires auth)
export CONFLUENCE_EMAIL=you@company.com
export CONFLUENCE_TOKEN=your_api_token
uv run deepcontext ingest https://company.atlassian.net/wiki/spaces/DOCS/pages/123456/Page-Title
# From an entire Confluence space
uv run deepcontext ingest-confluence https://company.atlassian.net/wiki DOCS --limit 50
# From an entire website (uses sitemap or crawls links)
uv run deepcontext ingest-website https://ui.shadcn.com/docs --max-pages 50
uv run deepcontext search "middleware authentication"
Search Results for: how to redirect in middleware
────────────────────────────────────────────────────────────
### Example
Source: /path/to/middleware.md
Score: 0.6693
Example typescript code demonstrating implementation.
┌──────────────────────────────────────────────────────────┐
│ import { NextResponse } from 'next/server'               │
│ import type { NextRequest } from 'next/server'           │
│                                                          │
│ export function middleware(request: NextRequest) {       │
│   return NextResponse.redirect(new URL('/home', ...))    │
│ }                                                        │
│                                                          │
│ export const config = {                                  │
│   matcher: '/about/:path*',                              │
│ }                                                        │
└──────────────────────────────────────────────────────────┘
────────────────────────────────────────────────────────────
Ingest content from any supported source.
SOURCE can be:
- A local file path (e.g., ./docs/README.md)
- A GitHub URL (e.g., https://github.com/vercel/next.js/blob/canary/docs/...)
- A Confluence page (e.g., https://company.atlassian.net/wiki/spaces/DOCS/pages/123456)
- A webpage URL (e.g., https://example.com/docs)
Options:
- --chunk-size: Maximum chunk size in tokens (default: 1024)
- --threshold: Similarity threshold for semantic chunking (default: 0.7)
- --code-aware/--no-code-aware: Use code-aware chunking (default: enabled)
- --collection: Qdrant collection name (default: deepcontext_chunks)
- --host: Qdrant host (default: localhost)
- --port: Qdrant port (default: 6333)
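One way to picture how --threshold and --chunk-size could interact is the sketch below: consecutive sentences stay in the same chunk while they remain similar enough and the chunk stays under the size cap. This is an illustration under stated assumptions, not the tool's actual algorithm. The bag-of-words `embed` function stands in for a real embedding model, whitespace-separated words stand in for tokens, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption; a real pipeline would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str], threshold: float = 0.7, chunk_size: int = 1024
) -> list[str]:
    """Group consecutive sentences; split on a similarity drop or at the size cap."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks:
            current = chunks[-1]
            # Compare against the previous sentence only (a simplification).
            similar = cosine(embed(current[-1]), embed(sentence)) >= threshold
            words = sum(len(s.split()) for s in current) + len(sentence.split())
            if similar and words <= chunk_size:
                current.append(sentence)
                continue
        chunks.append([sentence])
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Middleware runs before a request completes.",
    "Middleware runs before a request is cached.",
    "Qdrant stores vectors.",
]
for chunk in semantic_chunks(sentences):
    print(repr(chunk))
```

In this sketch a higher threshold produces more, smaller chunks, because fewer neighbouring sentences count as "similar enough" to merge.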
Ingest all documentation from a GitHub repository.
Options:
- --branch: Branch to fetch from (default: auto-detect)
- --path: Path within the repo to start from
- --extensions: File extensions to fetch (default: .md,.mdx)
- --collection: Qdrant collection name (default: deepcontext_chunks)
- --host: Qdrant host (default: localhost)
- --port: Qdrant port (default: 6333)
Authentication: Set the GITHUB_TOKEN environment variable for higher rate limits (5,000 requests/hour, versus 60/hour for unauthenticated requests).
# Create token at https://github.com/settings/tokens with 'public_repo' scope
export GITHUB_TOKEN=ghp_your_token_here
uv run deepcontext ingest-repo vercel/next.js --path docs
Ingest all pages from a Confluence space.
Requires environment variables:
- CONFLUENCE_EMAIL: Your Atlassian account email
- CONFLUENCE_TOKEN: Your Atlassian API token
Options:
- --limit: Maximum pages to fetch (default: 100)
- --collection: Qdrant collection name (default: deepcontext_chunks)
- --host: Qdrant host (default: localhost)
- --port: Qdrant port (default: 6333)
Example:
export CONFLUENCE_EMAIL=you@company.com
export CONFLUENCE_TOKEN=your_api_token
uv run deepcontext ingest-confluence https://company.atlassian.net/wiki ENGINEERING
Ingest all pages from a website using sitemap or link crawling.
The command will:
- Try to find and parse sitemap.xml
- If no sitemap, crawl links from the base URL
- Filter URLs to only those under the base path
Options:
- --max-pages: Maximum pages to fetch (default: 100)
- --pattern: URL pattern to filter pages (regex)
- --no-sitemap: Don't try to use sitemap.xml, crawl links instead
- --collection: Qdrant collection name (default: deepcontext_chunks)
- --host: Qdrant host (default: localhost)
- --port: Qdrant port (default: 6333)
Examples:
# Ingest shadcn docs (uses sitemap automatically)
uv run deepcontext ingest-website https://ui.shadcn.com/docs
# Ingest with URL pattern filter
uv run deepcontext ingest-website https://nextjs.org/docs --pattern "/docs/app/"
# Force crawling instead of sitemap
uv run deepcontext ingest-website https://example.com/docs --no-sitemap
Search for semantically similar chunks.
Options:
- --limit: Maximum results (default: 5)
- --language: Filter by code language
- --source-type: Filter by source type (markdown, github, webpage, confluence)
- --collection: Qdrant collection name (default: deepcontext_chunks)
- --host: Qdrant host (default: localhost)
- --port: Qdrant port (default: 6333)
Show statistics about the vector store.
Options:
- --collection: Qdrant collection name (default: deepcontext_chunks)
- --host: Qdrant host (default: localhost)
- --port: Qdrant port (default: 6333)
Clear all data from the vector store. Requires confirmation.
Options:
- --collection: Qdrant collection name (default: deepcontext_chunks)
- --host: Qdrant host (default: localhost)
- --port: Qdrant port (default: 6333)
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ ContentFetcher │ ──▶ │ DocumentChunker │ ──▶ │ VectorStore │
│ │ │ │ │ (Qdrant) │
│ - Markdown │ │ - Code-aware │ │ │
│ - GitHub │ │ - Section-based │ │ - FastEmbed │
│ - Webpages │ │ - Smart titles │ │ - Cosine sim │
│ - Websites │ │ │ │ │
│ - Confluence │ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
from deepcontext import ContentFetcher, VectorStore
from deepcontext.chunker import ChunkConfig, DocumentChunker
# Fetch content
with ContentFetcher() as fetcher:
    doc = fetcher.fetch("https://github.com/vercel/next.js/blob/canary/docs/...")
# Chunk with code awareness
config = ChunkConfig(
    max_chunk_size=2000,  # Max characters per chunk
    min_chunk_size=200,   # Min characters for a chunk
    overlap_size=100,     # Overlap between chunks
)
chunker = DocumentChunker(config)
chunks = chunker.chunk_document(doc)
# Store and search
store = VectorStore()
store.index_chunks(chunks)
results = store.search("middleware authentication")
for result in results:
    print(result.chunk.to_display_format())
    print(f"Score: {result.score}")
store.close()
from deepcontext import ContentFetcher
with ContentFetcher() as fetcher:
    # Fetch entire GitHub repo
    docs = fetcher.fetch_github_repo("vercel/next.js", path="docs")
    # Fetch Confluence space
    docs = fetcher.fetch_confluence_space(
        base_url="https://company.atlassian.net/wiki",
        space_key="DOCS",
        limit=50,
    )
    # Fetch entire website
    docs = fetcher.fetch_website(
        base_url="https://ui.shadcn.com/docs",
        max_pages=100,
        url_pattern=r"/docs/",  # Optional regex filter
    )
| Variable | Description |
|---|---|
| GITHUB_TOKEN | GitHub personal access token for higher API rate limits |
| CONFLUENCE_EMAIL | Atlassian account email for Confluence access |
| CONFLUENCE_TOKEN | Atlassian API token for Confluence access |
You can also put these variables in a .env file in the project root; it is loaded automatically.
- Python 3.12+
- Docker or Podman (for Qdrant)
License: MIT