Datacortex

Knowledge graph visualization and AI-powered analysis for Datacore.

Features

Workspace UI: Browser-based editor with file tree, Claude Code terminal, and knowledge graph
Graph Visualization: Interactive D3.js force-directed graph with node labels
Temporal Pulses: Snapshot graph state over time
Multi-Space: Configure which Datacore spaces to include
Metrics: Degree centrality, PageRank, Louvain clustering
AI Extensions: Semantic search, link suggestions, gap detection, Q&A

Installation

cd ~/Data/1-datafund/2-projects/datacortex
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Quick Start

# 1. Compute embeddings (first time, ~5-10 min)
datacortex embed

# 2. Start server and open workspace
datacortex serve
# Open http://localhost:8765/workspace.html

# 3. Get AI-powered suggestions
datacortex digest              # Link suggestions
datacortex gaps                # Knowledge gaps
datacortex insights            # Cluster analysis
datacortex search "query"      # Question answering
datacortex opportunities       # Low-hanging fruit for research

Workspace UI

The workspace provides a browser-based interface for working with your knowledge base alongside Claude Code.

┌──────────────────────────────────────────────────────────────────────────┐
│ [Search: Ctrl+P]                                  [Links ▼] [Graph]      │
├───────────────┬──────────────────────────────────┬───────────────────────┤
│ File Tree     │  Markdown Editor (CodeMirror 6)  │ Knowledge Graph (D3)  │
│ ├── 0-personal│  [Ask Claude] [Save] [Discard]   │ - Click node to open  │
│ └── 1-datafund│                                  │ - Synced with editor  │
├───────────────┴──────────────────────────────────┴───────────────────────┤
│ Claude Code Terminal (xterm.js) [Clear] [Reconnect]                      │
└──────────────────────────────────────────────────────────────────────────┘

Features:

File Tree: Browse and open files from your Datacore spaces
Markdown Editor: CodeMirror 6 with syntax highlighting, save/discard
Claude Code Terminal: WebSocket PTY bridge to Claude Code (new session per connection)
Knowledge Graph: D3.js force graph with labels, docked on right panel (toggle with Graph button)
Synced Views: Click a file → highlights in tree + zooms in graph; click graph node → opens file
Search: Ctrl+P to search by filename or content
Links: Dropdown showing outgoing wiki-links and backlinks

Access:

datacortex serve --port 8765
open http://localhost:8765/workspace.html

Or use /datacortex workspace from Claude Code.

CLI Commands

# Graph generation
datacortex generate --spaces personal,datafund
datacortex stats

# Pulse snapshots
datacortex pulse generate
datacortex pulse list
datacortex pulse diff 2025-01-01 2025-01-15

# AI Extensions
datacortex embed [--space NAME] [--force]
datacortex digest [--threshold 0.8] [--top-n 20]
datacortex gaps [--min-score 0.3]
datacortex insights [--cluster N] [--top 5]
datacortex search "query" [--top 10] [--no-expand]
datacortex opportunities [--top 15]

# Server
datacortex serve [--port 8765]

Datacore Commands

Use from Claude Code for AI-synthesized insights. These commands run the CLI tools and have Claude synthesize natural language recommendations from the results.

| Command | Model | Purpose |
|---|---|---|
| `/datacortex` | - | Launch visualization server and open browser |
| `/datacortex-digest` | haiku | Link suggestions based on semantic similarity |
| `/datacortex-gaps` | haiku | Bridge suggestions between knowledge clusters |
| `/datacortex-insights` | sonnet | Deep cluster analysis with themes and patterns |
| `/datacortex-ask [question]` | haiku | Answer questions from your knowledge base (RAG) |
| `/datacortex-opportunities` | haiku | Find low-hanging fruit for research |

Model assignments: Each command declares its model with a `## Model` hint, which tells Claude Code which model to use. Haiku is fast and cheap, so it handles suggestions; Sonnet provides deeper analysis for insights.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    DATACORTEX SERVER                         │
│  - Embeddings (sentence-transformers, cached in SQLite)     │
│  - Vector similarity (cosine, matrix computation)           │
│  - Graph metrics (NetworkX, Louvain clustering)             │
│  - Compact output (TSV/markdown, ~60% smaller than JSON)    │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                    CLAUDE CODE SESSION                       │
│  - Natural language synthesis                                │
│  - Link suggestions with reasoning                          │
│  - Bridge recommendations                                    │
│  - Question answering with citations                        │
└─────────────────────────────────────────────────────────────┘
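The "compact output" line in the diagram can be illustrated with a minimal sketch. It renders the same rows as TSV and as JSON and compares their sizes. The `to_tsv` helper and the sample rows are hypothetical, not Datacortex's actual serializer, and the real saving depends on the data.

```python
import json


def to_tsv(rows, columns):
    """Render a list of dicts as TSV: one header line, then one line per row."""
    lines = ["\t".join(columns)]
    for row in rows:
        lines.append("\t".join(str(row[col]) for col in columns))
    return "\n".join(lines)


# Hypothetical link suggestions, shaped like something a digest might emit.
rows = [
    {"source": "Fair Data Economy", "target": "Investment Thesis", "similarity": 0.82},
    {"source": "ChainLink", "target": "Bootstrap Liquidity Fund", "similarity": 0.79},
]
columns = ["source", "target", "similarity"]

tsv_text = to_tsv(rows, columns)
json_text = json.dumps(rows)
print(len(tsv_text), len(json_text))  # TSV avoids repeating keys and quotes per row
```

The saving comes from stating column names once instead of once per row, which matters when the output is pasted into a Claude Code session.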

Project Structure

datacortex/
├── src/datacortex/
│   ├── core/           # Models, config, database
│   ├── indexer/        # Graph building from zettel_db
│   ├── metrics/        # Centrality, clustering
│   ├── pulse/          # Temporal snapshots
│   ├── ai/             # Embeddings, similarity, cache
│   ├── digest/         # Daily link suggestions
│   ├── gaps/           # Knowledge gap detection
│   ├── insights/       # Cluster analysis
│   ├── qa/             # Question answering (RAG)
│   ├── api/            # FastAPI backend
│   │   └── routes/     # API endpoints (graph, files, terminal)
│   └── cli/            # Click commands
├── frontend/
│   ├── index.html      # Graph visualization
│   └── workspace.html  # Workspace UI (editor, terminal, graph)
├── config/             # YAML configuration
└── docs/               # Documentation

AI Extensions

Datacortex includes six AI-powered features (the phases below) that work together. The server computes embeddings, similarity, and metrics; Claude Code synthesizes natural language insights from the results.

Phase 1: Embeddings (`datacortex embed`)

Compute semantic embeddings for all documents using local sentence-transformers (no API keys needed).

Model: sentence-transformers/all-mpnet-base-v2 (768 dimensions, high quality)
Content: Title + first 500 characters (balanced quality/speed)
Cache: SQLite with content hash invalidation (only recomputes changed docs)
Speed: ~25 docs/sec on M1 Mac

datacortex embed              # Incremental (only new/changed)
datacortex embed --force      # Recompute all
datacortex embed --space personal  # Single space
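
The content-hash cache described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the table layout, the function names, and `fake_embed` (a stand-in for the sentence-transformers model so the sketch runs offline) are all assumptions.

```python
import hashlib
import json
import sqlite3


def embedding_text(title, body, content_length=500):
    """Text that gets embedded: title plus the first `content_length` characters."""
    return f"{title}\n{body[:content_length]}"


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def open_cache(path=":memory:"):
    # Hypothetical schema: one row per document, vector stored as JSON text.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "doc_id TEXT PRIMARY KEY, hash TEXT NOT NULL, vector TEXT NOT NULL)"
    )
    return conn


def embed_incremental(conn, docs, embed_fn, force=False):
    """Embed only new or changed docs; return the ids that were recomputed."""
    recomputed = []
    for doc_id, (title, body) in docs.items():
        text = embedding_text(title, body)
        digest = content_hash(text)
        row = conn.execute(
            "SELECT hash FROM embeddings WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if not force and row is not None and row[0] == digest:
            continue  # unchanged since the last run: keep the cached vector
        conn.execute(
            "INSERT OR REPLACE INTO embeddings (doc_id, hash, vector) VALUES (?, ?, ?)",
            (doc_id, digest, json.dumps(embed_fn(text))),
        )
        recomputed.append(doc_id)
    conn.commit()
    return recomputed


def fake_embed(text):
    # Stand-in for a real embedding model; returns a tiny deterministic vector.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]
```

Because this sketch hashes only the text that is embedded, an edit past the first 500 characters would not trigger a recompute. Whether the real cache behaves that way is not stated here.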

Phase 2: Daily Digest (`datacortex digest`)

Find documents that should be linked based on semantic similarity but aren't yet connected.

Similar pairs: Documents with cosine similarity > 0.75 that have no existing link
Scoring: similarity * 0.5 + recency * 0.3 + centrality * 0.2
Orphans: Documents with no incoming links (candidates for integration)
Output: Compact TSV format for Claude Code to synthesize recommendations

datacortex digest --threshold 0.8 --top-n 20
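
As a rough sketch of the scoring above: the weights and the 0.75 threshold come from this section, while the data layout, the function names, and the choice to average recency and centrality over the two documents in a pair are assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def digest_score(similarity, recency, centrality):
    # Weights as documented: similarity 0.5, recency 0.3, centrality 0.2.
    return similarity * 0.5 + recency * 0.3 + centrality * 0.2


def suggest_links(docs, links, threshold=0.75, top_n=20):
    """docs: id -> {"vec", "recency", "centrality"}; links: set of frozenset pairs."""
    suggestions = []
    for a, b in combinations(sorted(docs), 2):
        if frozenset((a, b)) in links:
            continue  # already connected, nothing to suggest
        sim = cosine(docs[a]["vec"], docs[b]["vec"])
        if sim <= threshold:
            continue
        # Assumption: pair-level recency and centrality are the mean of both docs.
        recency = (docs[a]["recency"] + docs[b]["recency"]) / 2
        centrality = (docs[a]["centrality"] + docs[b]["centrality"]) / 2
        suggestions.append((digest_score(sim, recency, centrality), a, b))
    suggestions.sort(reverse=True)
    return suggestions[:top_n]
```

Recency and centrality are expected to be normalized to the 0..1 range here so that the three terms are comparable.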

Phase 3: Knowledge Gaps (`datacortex gaps`)

Detect sparse areas between knowledge clusters that need bridge notes.

Cluster centroids: Mean embedding of all documents in each Louvain cluster
Gap score: semantic_similarity - link_density (high similarity but few links = gap)
Boundary nodes: Documents that link to both clusters (potential bridges)
Bridge suggestions: Topics that could connect the clusters

datacortex gaps --min-score 0.3
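
The gap score can be sketched like this. Centroids and the `semantic_similarity - link_density` formula follow the description above; defining link density as the fraction of possible cross-cluster pairs that are linked is an assumption, as are the function names and the input format (edges as unique undirected pairs).

```python
import math


def centroid(vectors):
    """Mean embedding of a cluster (component-wise average)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_density(cluster_a, cluster_b, edges):
    """Assumed definition: linked cross-cluster pairs / possible cross-cluster pairs."""
    possible = len(cluster_a) * len(cluster_b)
    if possible == 0:
        return 0.0
    a, b = set(cluster_a), set(cluster_b)
    actual = sum(1 for u, v in edges if (u in a and v in b) or (u in b and v in a))
    return actual / possible


def gap_score(embeddings, cluster_a, cluster_b, edges):
    """High when two clusters are semantically close but sparsely linked."""
    similarity = cosine(
        centroid([embeddings[d] for d in cluster_a]),
        centroid([embeddings[d] for d in cluster_b]),
    )
    return similarity - link_density(cluster_a, cluster_b, edges)
```

A pair of clusters would then be reported when `gap_score(...)` is at least the `--min-score` value.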

Phase 4: Insight Extraction (`datacortex insights`)

Analyze knowledge clusters to identify themes, hubs, and patterns.

Cluster stats: Size, density, average centrality
Hub documents: Top 10 by PageRank centrality (most connected/influential)
Tag frequency: Top 10 tags revealing cluster themes
Content samples: Excerpts from top docs for context

datacortex insights --cluster 3    # Single cluster detail
datacortex insights --top 5        # Top 5 clusters by size
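
A minimal sketch of the per-cluster statistics listed above, assuming PageRank scores have already been computed elsewhere (for example with NetworkX). The function name, the input shapes, and the use of undirected density are illustrative assumptions.

```python
from collections import Counter


def cluster_summary(nodes, edges, pagerank, tags, top=10):
    """Summarize one cluster: size, density, average centrality, hubs, top tags."""
    members = set(nodes)
    # Count each undirected pair once and ignore edges leaving the cluster.
    internal = {
        frozenset((u, v))
        for u, v in edges
        if u in members and v in members and u != v
    }
    n = len(members)
    possible = n * (n - 1) / 2
    density = len(internal) / possible if possible else 0.0
    avg_centrality = sum(pagerank.get(d, 0.0) for d in members) / n if n else 0.0
    # Hubs: highest PageRank first, ties broken by id for a stable order.
    hubs = sorted(members, key=lambda d: (-pagerank.get(d, 0.0), d))[:top]
    tag_counts = Counter(t for d in sorted(members) for t in tags.get(d, []))
    return {
        "size": n,
        "density": density,
        "avg_centrality": avg_centrality,
        "hubs": hubs,
        "top_tags": tag_counts.most_common(top),
    }
```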

Phase 5: Question Answering (`datacortex search`)

RAG (Retrieval-Augmented Generation) pipeline for "What do I know about X?" queries.

Pipeline: Embed query → vector search top 10 → graph expansion (1-hop neighbors) → re-rank
Re-ranking: vec_score * 0.6 + recency * 0.2 + centrality * 0.2
Direct match boost: 1.2x for original vector search hits
Full content: Complete document text included for Claude to synthesize answers

datacortex search "data tokenization" --top 10
datacortex search "Data pilot" --no-expand  # Skip graph expansion
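
The retrieval and re-ranking steps can be sketched as below. The weights, the 1-hop expansion, and the 1.2x boost come from this section; everything else is assumed for illustration, including the function name, the input shapes, and the choice to score an expanded neighbor by its own query similarity.

```python
def search(similarity, neighbors, meta, top=10, expand=True, boost=1.2):
    """Rank documents for a query.

    similarity: doc id -> cosine similarity to the (already embedded) query
    neighbors:  doc id -> iterable of 1-hop linked doc ids
    meta:       doc id -> (recency, centrality), both assumed to be in 0..1
    """
    # Step 1: vector search, keeping the top-k most similar documents.
    hits = set(sorted(similarity, key=lambda d: (-similarity[d], d))[:top])

    # Step 2: graph expansion, adding direct neighbors of every hit.
    candidates = set(hits)
    if expand:
        for doc_id in hits:
            candidates.update(neighbors.get(doc_id, ()))

    # Step 3: re-rank with the documented weights, boosting direct hits.
    ranked = []
    for doc_id in candidates:
        recency, centrality = meta.get(doc_id, (0.0, 0.0))
        score = similarity.get(doc_id, 0.0) * 0.6 + recency * 0.2 + centrality * 0.2
        if doc_id in hits:
            score *= boost
        ranked.append((score, doc_id))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked[:top]
```

Passing `expand=False` mirrors the `--no-expand` flag: only the original vector hits are re-ranked.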

Phase 6: Research Opportunities (`datacortex opportunities`)

Find "low-hanging fruit" - stubs to fill, orphans to integrate, underlinked content to connect.

High-value stubs: Stub notes with many references but no content (concepts your KB expects but hasn't defined)
Integration candidates: Orphan documents with real content (100+ words) but no links
Underlinked content: Substantial documents (300+ words) with only 1-2 connections
Stub-heavy clusters: Topic areas where most notes are stubs (need research)

datacortex opportunities              # Find research opportunities
datacortex opportunities --top 20     # More results per category
datacortex opportunities --space datafund  # Single space

Example output:

## HIGH_VALUE_STUBS
Fair Data Economy | 16 refs | 0.402 | stub, needs-content
Bootstrap Liquidity Fund | 11 refs | 0.237 | stub, needs-content

## INTEGRATION_CANDIDATES
The fund in Datafund | 9166w | unknown | research/The fund in Datafund.md
SemantiCord - Technical Overview | 3700w | unknown | research/SemantiCord.md

## UNDERLINKED_CONTENT
ChainLink | 3507w | 1 links | page
Investment Thesis | 3337w | 2 links | page

## STUB_HEAVY_CLUSTERS
Cluster 89 | 129 nodes | 127 stubs | 98% | Datahaven; Roam Network; Swarmy.cloud
Cluster 1 | 41 nodes | 30 stubs | 73% | Triple-Sided Marketplace; AI Agents

Use with /datacortex-opportunities in Claude Code - it presents the list and offers to research/fill selected items.
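
The category rules above can be sketched as a small classifier. The word and link thresholds come from this section. The `min_refs` cutoff for "many references", the note fields, and the function names are hypothetical.

```python
def classify(note, min_refs=5):
    """Return the opportunity category for a note, or None if it does not qualify.

    note: dict with "is_stub" (bool), "refs" (incoming references),
          "words" (word count), and "links" (total connections).
    """
    if note["is_stub"]:
        # min_refs is an assumed cutoff; the documented rule is "many references".
        return "HIGH_VALUE_STUBS" if note["refs"] >= min_refs else None
    if note["links"] == 0 and note["words"] >= 100:
        return "INTEGRATION_CANDIDATES"
    if note["words"] >= 300 and 1 <= note["links"] <= 2:
        return "UNDERLINKED_CONTENT"
    return None


def stub_ratio(cluster_notes):
    """Share of stubs in a cluster, as used to spot stub-heavy clusters."""
    if not cluster_notes:
        return 0.0
    return sum(1 for n in cluster_notes if n["is_stub"]) / len(cluster_notes)
```

For example, a 3507-word page with a single link falls under `UNDERLINKED_CONTENT`, which matches the ChainLink row in the example output.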

Configuration

Create config/datacortex.local.yaml to override defaults:

spaces:
  - personal
  - datafund

server:
  port: 8765

graph:
  include_stubs: false
  compute_clusters: true

ai:
  embedding_model: sentence-transformers/all-mpnet-base-v2
  content_length: 500
  qa_model: claude-3-haiku-20240307
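
One plausible way a local file could override defaults is a recursive merge, sketched here on plain dictionaries (as they would look after YAML parsing). Whether Datacortex merges nested keys this way is an assumption, and the default values shown are placeholders rather than the project's real defaults.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not combined.
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder defaults and a local override, for illustration only.
defaults = {
    "spaces": ["personal"],
    "server": {"port": 8765},
    "graph": {"include_stubs": False, "compute_clusters": True},
}
local = {
    "spaces": ["personal", "datafund"],
    "graph": {"include_stubs": True},
}
config = deep_merge(defaults, local)
```

With this approach a local file only needs to list the keys it changes; untouched keys such as `server.port` keep their default values.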

Development

pip install -e ".[dev]"
pytest

Version History

v0.2.0 - Workspace UI with file browser and terminal integration
v0.1.0 - Initial release: Graph visualization, embeddings, digest, gaps, insights, Q&A, opportunities, multi-hop search

License

MIT
