Selftag: LLM-Powered Semantic Search Engine with Adaptive RDF Indexing

⚠️ Work in Progress (WIP): This project is currently under active development. Features may be incomplete, APIs may change, and documentation may be outdated. Use at your own risk.

What is Selftag?

Selftag is an intelligent semantic search engine that uses Large Language Models (LLMs) to automatically build and maintain a searchable index of your documents. Unlike traditional search engines that rely on keyword matching, Selftag understands the semantic meaning of your content and creates a dynamic, ontology-driven index that adapts to your personal knowledge domain.

Key Innovation: LLM-Defined Search Index

Traditional Search:

  • Static keyword-based indexing
  • Fixed taxonomy and categories
  • Limited semantic understanding

Selftag:

  • LLM-powered semantic analysis generates RDF triples from document content
  • User-defined OWL ontologies customize the index structure
  • Adaptive indexing that evolves with your document collection
  • RDF/Turtle output for semantic interoperability

How It Works

1. LLM-Based Content Analysis

When you add a document, Selftag uses LLMs (via Ollama or the OLLM library) to:

  • Analyze the actual content (not just filenames)
  • Extract semantic concepts, topics, and relationships
  • Generate RDF triples using Dublin Core predicates
  • Understand context, entities, and document purpose

Example RDF Output:

@prefix dc: <http://purl.org/dc/elements/1.1/> .

<file://document.pdf> dc:title "Data Protection Legislation" .
<file://document.pdf> dc:subject "privacy-rights" .
<file://document.pdf> dc:subject "data-governance" .
<file://document.pdf> dc:type "legal-document" .
<file://document.pdf> dc:coverage "India" .
<file://document.pdf> dc:date "2023" .

2. User-Defined OWL Ontology

Your personal OWL ontology defines:

  • Custom classes for your domain (e.g., "LegalDocument", "ResearchPaper", "Policy")
  • Object properties for relationships (e.g., "cites", "implements", "supersedes")
  • Datatype properties for attributes (e.g., "hasJurisdiction", "hasEffectiveDate")
  • SHACL shapes for validation rules

The LLM uses your ontology to:

  • Generate RDF triples that match your domain vocabulary
  • Create index entries aligned with your knowledge structure
  • Enable semantic queries using your custom predicates

3. Adaptive Index Evolution

The search index dynamically adapts based on:

  • Document density: New categories emerge as you add related documents
  • User taxonomy: Your OWL ontology guides classification
  • Content patterns: The system learns your domain-specific terminology
  • Multi-dimensional classification: Documents can belong to multiple semantic dimensions

4. RDF-Specific Outputs

All indexing uses RDF/Turtle format for:

  • Semantic interoperability: Compatible with other RDF systems
  • Standard predicates: Dublin Core, FOAF, SKOS, and custom namespaces
  • Query flexibility: SPARQL-compatible triple stores
  • Knowledge graphs: Build rich semantic networks from your documents
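As a rough illustration of why triples make a flexible index (a toy sketch, not Selftag's actual storage layer, which is an RDF graph database queried via SPARQL), the index can be pictured as a set of (subject, predicate, object) tuples filtered by predicate and object:

```python
# Toy in-memory triple index. The file URIs and tag values below mirror
# the example output above; the functions are illustrative only.
DC = "http://purl.org/dc/elements/1.1/"

triples = {
    ("file://document.pdf", DC + "subject", "privacy-rights"),
    ("file://document.pdf", DC + "subject", "data-governance"),
    ("file://document.pdf", DC + "type", "legal-document"),
    ("file://report.pdf", DC + "type", "financial-report"),
}

def find_subjects(predicate: str, obj: str) -> set:
    """Return every subject with a matching (predicate, object) pair --
    the core lookup behind 'find all documents tagged with X'."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}

print(sorted(find_subjects(DC + "subject", "privacy-rights")))
```

A real triple store answers the same question with a SPARQL pattern rather than a Python set comprehension, but the data model is identical.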

Use Cases

Healthcare & Medical Research

Selftag excels in healthcare and medical research by integrating with established medical ontologies:

Medical Ontologies Supported:

  • DRON (Drug Ontology): Drug names, interactions, and pharmacological properties
  • SNOMED CT: Clinical terminology and medical concepts
  • ICD-10/ICD-11: Disease classification and coding
  • LOINC: Laboratory and clinical observations
  • UMLS: Unified Medical Language System
  • HL7 FHIR: Healthcare data exchange standards

Healthcare Applications:

  • Clinical Document Management: Index patient records, lab results, and medical reports using clinical terminology
  • Research Literature: Organize medical research papers by disease, treatment, and methodology
  • Drug Information: Tag pharmaceutical documents with DRON drug ontology terms
  • Regulatory Compliance: Index FDA submissions, clinical trial data, and regulatory documents
  • Medical Imaging: Semantic tagging of radiology reports and imaging studies

Example RDF Output for Medical Document:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dron: <http://purl.org/dron/> .
@prefix snomed: <http://snomed.info/id/> .

<file://patient-report.pdf> dc:type "clinical-document" .
<file://patient-report.pdf> dc:subject "diabetes-mellitus-type-2" .
<file://patient-report.pdf> dron:drug "metformin" .
<file://patient-report.pdf> snomed:condition "44054006" .  # Diabetes mellitus type 2
<file://patient-report.pdf> dc:date "2024-03-15" .

Legal & Compliance

  • Regulatory Documents: Index legislation, policies, and compliance documents
  • Case Law: Organize legal precedents by jurisdiction, topic, and outcome
  • Contract Management: Semantic search across contracts and agreements

Research & Academia

  • Academic Papers: Organize research by methodology, domain, and findings
  • Grant Applications: Index funding proposals and research plans
  • Data Management: Tag datasets with domain-specific ontologies

Business Intelligence

  • Financial Reports: Index earnings, audits, and financial statements
  • Market Research: Organize industry reports and competitive analysis
  • Strategic Planning: Tag business plans and strategy documents

Why Selftag is Better

Traditional Search Limitations

  • ❌ Keyword matching misses semantic relationships
  • ❌ Fixed taxonomies don't adapt to your domain
  • ❌ No understanding of document meaning or context
  • ❌ Limited to exact text matches

Selftag Advantages

  • Semantic understanding: Finds documents by meaning, not just keywords
  • Custom ontologies: Index structure matches your domain knowledge (including medical ontologies like DRON, SNOMED CT)
  • Adaptive indexing: Index evolves as your collection grows
  • RDF-based: Standard semantic web format for interoperability
  • Multi-modal analysis: Handles text, PDFs, images, and CSV files with appropriate models
  • Context-aware: Understands relationships, entities, and document purpose
  • Domain-specific: Healthcare, legal, research, and business ontologies supported

Quick Start

Option 1: Native macOS (Recommended for MPS/GPU acceleration)

Prerequisites:

  • macOS with Apple Silicon (for Metal/MPS support)
  • Rust 1.70+ with Cargo
  • Python 3.8+ with PyTorch
  • OLLM library for large-context processing

Installation:

# Install Python dependencies
pip3 install torch torchvision torchaudio
pip3 install --no-build-isolation --no-deps ollm transformers accelerate

# Build the application
cargo build --release

# Run natively (enables MPS acceleration)
./scripts/run_native_macos.sh

Access:

  • Web UI: http://localhost:8080
  • API: http://localhost:8080/api

Option 2: Docker (Cross-platform, CPU-only)

Prerequisites:

  • Docker and Docker Compose
  • Ollama service (for fallback processing)

Installation:

# Build and start services
docker-compose build
docker-compose up -d

# Access
# Web UI: http://localhost:8080
# API: http://localhost:8080/api

Note: Docker on macOS uses CPU only (MPS not available in containers). For GPU acceleration, use native macOS installation.

Usage

Web Interface

  1. Upload Documents

    • Open http://localhost:8080
    • Upload PDFs, images, CSV files, or text documents
    • Documents are automatically analyzed by LLM
  2. View Semantic Tags

    • See RDF triples generated from content
    • Browse by Dublin Core predicates (dc:subject, dc:type, etc.)
    • Filter by custom ontology classes
  3. Create Custom Ontology

    • Use the Ontology Wizard to define your domain structure
    • Create custom classes, properties, and relationships
    • The LLM will use your ontology for future indexing
  4. Search by Meaning

    • Query using semantic concepts, not just keywords
    • Find related documents through RDF relationships
    • Explore knowledge graphs of your document collection

API Endpoints

Process Files:

curl -X POST http://localhost:8080/api/process-files \
  -H "Content-Type: application/json" \
  -d '{
    "files": [{
      "name": "document.pdf",
      "content": "base64encodedcontent",
      "mime_type": "application/pdf",
      "size": 1024000
    }]
  }'
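The `content` field carries the file's bytes as base64. A minimal Python sketch of building that payload with only the standard library (the endpoint and field names follow the curl example above):

```python
import base64
import json

def build_payload(name: str, data: bytes, mime_type: str) -> str:
    """Build the JSON body expected by /api/process-files:
    raw file bytes are base64-encoded into the 'content' field."""
    return json.dumps({
        "files": [{
            "name": name,
            "content": base64.b64encode(data).decode("ascii"),
            "mime_type": mime_type,
            "size": len(data),
        }]
    })

body = build_payload("document.pdf", b"%PDF-1.4 ...", "application/pdf")
# POST `body` to http://localhost:8080/api/process-files with
# Content-Type: application/json (via urllib, requests, or curl).
```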

Get Device Status:

curl http://localhost:8080/api/ollm-device-status
# Returns: {"device": "mps", "mps_available": true, "ollm_available": true}

Health Check:

curl http://localhost:8080/health

Architecture

LLM Processing Pipeline

Document Upload
    ↓
Content Extraction (PDF/Image/CSV)
    ↓
LLM Analysis (Ollama/OLLM)
    ↓
RDF Triple Generation (Dublin Core + Custom Ontology)
    ↓
Index Storage (RDF Graph Database)
    ↓
Semantic Search Interface

LLM Providers

  • OLLM: Large-context processing (100k+ tokens) with KV cache offloading

    • Auto-detects device: CUDA → MPS → CPU
    • Supports models like llama3-8B-chat
    • Best for large documents and comprehensive analysis
  • Ollama: Fast inference with vision models

    • Text models: qwen2.5, llama3.1
    • Vision models: llava (for PDF OCR)
    • Fallback when OLLM unavailable

RDF Index Structure

The search index is built from RDF triples:

  • Subject: File URI (e.g., file://document.pdf)
  • Predicate: Semantic property (e.g., dc:subject, dc:type, custom ontology predicates)
  • Object: Tag value (e.g., "privacy-rights", "legal-document")

Example Index Entry:

<file://data-protection-bill.pdf> dc:subject "data-protection-law" .
<file://data-protection-bill.pdf> dc:subject "privacy-rights" .
<file://data-protection-bill.pdf> dc:type "legislation" .
<file://data-protection-bill.pdf> custom:hasJurisdiction "India" .
<file://data-protection-bill.pdf> custom:effectiveDate "2023-01-01" .

OWL Ontology Integration

Your custom OWL ontology defines:

  1. Classes: Document types in your domain

    :LegalDocument a owl:Class .
    :ResearchPaper a owl:Class .
    :PolicyDocument a owl:Class .
    :ClinicalDocument a owl:Class .
    :MedicalReport a owl:Class .
  2. Properties: Relationships and attributes

    :hasJurisdiction a owl:DatatypeProperty .
    :cites a owl:ObjectProperty .
    :supersedes a owl:ObjectProperty .
    :hasDiagnosis a owl:ObjectProperty .
    :prescribesDrug a owl:ObjectProperty .
  3. SHACL Validation: Rules for RDF generation

    :LegalDocumentShape a sh:NodeShape ;
      sh:property [
        sh:path :hasJurisdiction ;
        sh:datatype xsd:string ;
      ] .
    
    :ClinicalDocumentShape a sh:NodeShape ;
      sh:property [
        sh:path :hasDiagnosis ;
        sh:class snomed:ClinicalFinding ;
      ] .
    

Medical Ontology Integration:

  • DRON (Drug Ontology): Import drug classes and properties for pharmaceutical document indexing
  • SNOMED CT: Use clinical terminology for medical document classification
  • ICD-10/ICD-11: Disease classification codes for healthcare indexing
  • LOINC: Laboratory observation codes for clinical data
  • UMLS: Unified Medical Language System integration
  • HL7 FHIR: Healthcare data exchange standard properties
  • Custom Medical Ontologies: Define your own healthcare domain structures

The LLM uses your ontology to:

  • Generate RDF triples with your custom predicates (including medical ontologies)
  • Classify documents into your domain classes (e.g., ClinicalDocument, ResearchPaper)
  • Create index entries that match your knowledge structure
  • Integrate with standard medical vocabularies (DRON, SNOMED CT, ICD, LOINC, UMLS, HL7 FHIR)
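One way to picture the ontology's role (a toy sketch, not Selftag's implementation): the ontology supplies the allowed vocabulary, and LLM-generated triples are kept only if their predicate belongs to it. The predicate set and helper below are hypothetical:

```python
# Hypothetical filter: keep only generated triples whose predicate is
# declared in the user's ontology or a standard vocabulary.
ONTOLOGY_PREDICATES = {
    "dc:subject", "dc:type", "dc:date",         # standard Dublin Core terms
    "custom:hasJurisdiction", "custom:cites",   # user-defined properties
}

def filter_triples(candidates):
    """Drop triples whose predicate the ontology does not define,
    so the index stays aligned with the user's vocabulary."""
    return [t for t in candidates if t[1] in ONTOLOGY_PREDICATES]

generated = [
    ("file://bill.pdf", "dc:type", "legislation"),
    ("file://bill.pdf", "custom:hasJurisdiction", "India"),
    ("file://bill.pdf", "made:up-predicate", "noise"),  # rejected below
]
print(filter_triples(generated))
```

In practice the constraint runs the other direction as well: the ontology is included in the LLM prompt so most generated triples already use the right predicates, with SHACL shapes validating the result.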

Configuration

Environment Variables

# Logging
export RUST_LOG=info

# Repository path
export TAGSISTANT_REPOSITORY=./data

# LLM Configuration (optional)
export OLLAMA_URL=http://localhost:11434
export OLLAMA_MODEL=llama3.1:8b

Model Files

Place OLLM model files in ./models/:

models/
  llama3-8B-chat/
    model-00001-of-00004.safetensors
    model-00002-of-00004.safetensors
    ...

Performance

Native macOS (MPS)

  • Model loading: 1-2 minutes
  • Inference: 5-10x faster than CPU
  • Large context: 100k+ tokens with KV cache

Docker (CPU)

  • Model loading: 5-10 minutes
  • Inference: Slower but functional
  • Large context: Supported with layer offloading

Troubleshooting

MPS Not Available

  • Docker: MPS not available in containers (use native macOS)
  • Native: Check PyTorch installation: python3 -c "import torch; print(torch.backends.mps.is_available())"

OLLM Model Not Found

  • Ensure model files are in ./models/{model-name}/
  • Check volume mount in Docker: ./models:/app/models

API Not Responding

  • Check server logs: tail -f /tmp/memesis_native.log
  • Verify port 8080 is not in use: lsof -i :8080

Scripts

All utility scripts are in the scripts/ directory:

  • run_native_macos.sh: Start application natively with MPS support
  • Test scripts: Various testing utilities (see scripts/ directory)

License

This project is licensed under the MIT License.

Copyright (c) 2025 Michael Holborn (mikeholborn1990@gmail.com)

See LICENSE file for full license text.

Project Status

⚠️ Work in Progress: This project is under active development.

  • Features are being implemented and tested
  • APIs and data structures may change
  • Documentation may be incomplete or outdated
  • Some functionality may not work as expected

Contributions, bug reports, and feedback are welcome, but please note that this is an early-stage project.

Acknowledgments

  • Ollama: Local LLM inference
  • OLLM: Large-context processing with KV cache
  • PyTorch: Deep learning framework with MPS support
  • Dublin Core: Standard metadata vocabulary
  • OWL/SHACL: Semantic web standards for ontologies
