⚠️ Work in Progress (WIP): This project is currently under active development. Features may be incomplete, APIs may change, and documentation may be outdated. Use at your own risk.
Selftag is an intelligent semantic search engine that uses Large Language Models (LLMs) to automatically build and maintain a searchable index of your documents. Unlike traditional search engines that rely on keyword matching, Selftag understands the semantic meaning of your content and creates a dynamic, ontology-driven index that adapts to your personal knowledge domain.
Traditional Search:
- Static keyword-based indexing
- Fixed taxonomy and categories
- Limited semantic understanding
Selftag:
- LLM-powered semantic analysis generates RDF triples from document content
- User-defined OWL ontologies customize the index structure
- Adaptive indexing that evolves with your document collection
- RDF/Turtle output for semantic interoperability
When you add a document, Selftag uses large language models (served via Ollama or OLLM) to:
- Analyze the actual content (not just filenames)
- Extract semantic concepts, topics, and relationships
- Generate RDF triples using Dublin Core predicates
- Understand context, entities, and document purpose
Example RDF Output:
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<file://document.pdf> dc:title "Data Protection Legislation" .
<file://document.pdf> dc:subject "privacy-rights" .
<file://document.pdf> dc:subject "data-governance" .
<file://document.pdf> dc:type "legal-document" .
<file://document.pdf> dc:coverage "India" .
<file://document.pdf> dc:date "2023" .

Your personal OWL ontology defines:
- Custom classes for your domain (e.g., "LegalDocument", "ResearchPaper", "Policy")
- Object properties for relationships (e.g., "cites", "implements", "supersedes")
- Datatype properties for attributes (e.g., "hasJurisdiction", "hasEffectiveDate")
- SHACL shapes for validation rules
The LLM uses your ontology to:
- Generate RDF triples that match your domain vocabulary
- Create index entries aligned with your knowledge structure
- Enable semantic queries using your custom predicates
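One way to picture ontology-guided tagging is prompt construction: the user's classes and predicates are injected into the LLM prompt so that generated triples stay inside the domain vocabulary. The sketch below is illustrative only; the function name, prompt wording, and vocabulary are assumptions, not Selftag's actual prompt template.

```python
# Sketch: embed user-defined ontology terms into an LLM tagging prompt.
# The class/property names and prompt wording are illustrative examples,
# not Selftag's actual prompt format.

def build_tagging_prompt(document_text, classes, properties):
    """Ask the LLM to emit RDF triples restricted to the user's vocabulary."""
    return (
        "Classify the document using ONLY these classes: "
        + ", ".join(classes)
        + "\nand ONLY these predicates: "
        + ", ".join(properties)
        + "\nEmit one RDF triple per line in Turtle syntax.\n\nDocument:\n"
        + document_text[:2000]  # truncate to stay within the context window
    )

prompt = build_tagging_prompt(
    "The Data Protection Bill regulates processing of personal data...",
    classes=[":LegalDocument", ":PolicyDocument"],
    properties=[":hasJurisdiction", ":cites"],
)
print(":LegalDocument" in prompt)  # True
```

Constraining the output vocabulary this way is what keeps the generated triples queryable against the user's own predicates later.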
The search index dynamically adapts based on:
- Document density: New categories emerge as you add related documents
- User taxonomy: Your OWL ontology guides classification
- Content patterns: The system learns your domain-specific terminology
- Multi-dimensional classification: Documents can belong to multiple semantic dimensions
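The "document density" idea above can be illustrated with a toy promotion rule: a tag shared by enough documents becomes an index category. The threshold, data, and function are hypothetical, not Selftag's actual algorithm.

```python
from collections import Counter

# Toy illustration of density-driven category emergence: a dc:subject tag
# that appears on at least `threshold` documents is promoted to a category.
# The threshold and documents are hypothetical.

def emergent_categories(doc_tags, threshold=2):
    counts = Counter(tag for tags in doc_tags.values() for tag in set(tags))
    return sorted(tag for tag, n in counts.items() if n >= threshold)

docs = {
    "bill.pdf": ["privacy-rights", "data-protection-law"],
    "gdpr-note.pdf": ["privacy-rights", "eu-law"],
    "memo.txt": ["strategy"],
}
print(emergent_categories(docs))  # ['privacy-rights']
```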
All indexing uses RDF/Turtle format for:
- Semantic interoperability: Compatible with other RDF systems
- Standard predicates: Dublin Core, FOAF, SKOS, and custom namespaces
- Query flexibility: SPARQL-compatible triple stores
- Knowledge graphs: Build rich semantic networks from your documents
Selftag excels in healthcare and medical research by integrating with established medical ontologies:
Medical Ontologies Supported:
- DRON (Drug Ontology): Drug names, interactions, and pharmacological properties
- SNOMED CT: Clinical terminology and medical concepts
- ICD-10/ICD-11: Disease classification and coding
- LOINC: Laboratory and clinical observations
- UMLS: Unified Medical Language System
- HL7 FHIR: Healthcare data exchange standards
Healthcare Applications:
- Clinical Document Management: Index patient records, lab results, and medical reports using clinical terminology
- Research Literature: Organize medical research papers by disease, treatment, and methodology
- Drug Information: Tag pharmaceutical documents with DRON drug ontology terms
- Regulatory Compliance: Index FDA submissions, clinical trial data, and regulatory documents
- Medical Imaging: Semantic tagging of radiology reports and imaging studies
Example RDF Output for Medical Document:
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dron: <http://purl.org/dron/> .
@prefix snomed: <http://snomed.info/id/> .
<file://patient-report.pdf> dc:type "clinical-document" .
<file://patient-report.pdf> dc:subject "diabetes-mellitus-type-2" .
<file://patient-report.pdf> dron:drug "metformin" .
<file://patient-report.pdf> snomed:condition "44054006" . # Diabetes mellitus type 2
<file://patient-report.pdf> dc:date "2024-03-15" .

- Regulatory Documents: Index legislation, policies, and compliance documents
- Case Law: Organize legal precedents by jurisdiction, topic, and outcome
- Contract Management: Semantic search across contracts and agreements
- Academic Papers: Organize research by methodology, domain, and findings
- Grant Applications: Index funding proposals and research plans
- Data Management: Tag datasets with domain-specific ontologies
- Financial Reports: Index earnings, audits, and financial statements
- Market Research: Organize industry reports and competitive analysis
- Strategic Planning: Tag business plans and strategy documents
- ❌ Keyword matching misses semantic relationships
- ❌ Fixed taxonomies don't adapt to your domain
- ❌ No understanding of document meaning or context
- ❌ Limited to exact text matches
- ✅ Semantic understanding: Finds documents by meaning, not just keywords
- ✅ Custom ontologies: Index structure matches your domain knowledge (including medical ontologies like DRON, SNOMED CT)
- ✅ Adaptive indexing: Index evolves as your collection grows
- ✅ RDF-based: Standard semantic web format for interoperability
- ✅ Multi-modal analysis: Handles text, PDFs, images, and CSV files with appropriate models
- ✅ Context-aware: Understands relationships, entities, and document purpose
- ✅ Domain-specific: Healthcare, legal, research, and business ontologies supported
Prerequisites:
- macOS with Apple Silicon (for Metal/MPS support)
- Rust 1.70+ with Cargo
- Python 3.8+ with PyTorch
- OLLM library for large-context processing
Installation:
# Install Python dependencies
pip3 install torch torchvision torchaudio
pip3 install --no-build-isolation --no-deps ollm transformers accelerate
# Build the application
cargo build --release
# Run natively (enables MPS acceleration)
./scripts/run_native_macos.sh

Access:
- Web UI: http://localhost:8080
- API: http://localhost:8080/api
- Device status: Check top-right corner for MPS/GPU/CPU indicator
Prerequisites:
- Docker and Docker Compose
- Ollama service (for fallback processing)
Installation:
# Build and start services
docker-compose build
docker-compose up -d
# Access
# Web UI: http://localhost:8080
# API: http://localhost:8080/api

Note: Docker on macOS uses CPU only (MPS is not available in containers). For GPU acceleration, use the native macOS installation.
1. Upload Documents
   - Open http://localhost:8080
   - Upload PDFs, images, CSV files, or text documents
   - Documents are automatically analyzed by the LLM
2. View Semantic Tags
   - See RDF triples generated from content
   - Browse by Dublin Core predicates (dc:subject, dc:type, etc.)
   - Filter by custom ontology classes
3. Create Custom Ontology
   - Use the Ontology Wizard to define your domain structure
   - Create custom classes, properties, and relationships
   - The LLM will use your ontology for future indexing
4. Search by Meaning
   - Query using semantic concepts, not just keywords
   - Find related documents through RDF relationships
   - Explore knowledge graphs of your document collection
Process Files:
curl -X POST http://localhost:8080/api/process-files \
-H "Content-Type: application/json" \
-d '{
"files": [{
"name": "document.pdf",
"content": "base64encodedcontent",
"mime_type": "application/pdf",
"size": 1024000
}]
}'

Get Device Status:

curl http://localhost:8080/api/ollm-device-status
# Returns: {"device": "mps", "mps_available": true, "ollm_available": true}

Health Check:

curl http://localhost:8080/health

Document Upload
↓
Content Extraction (PDF/Image/CSV)
↓
LLM Analysis (Ollama/OLLM)
↓
RDF Triple Generation (Dublin Core + Custom Ontology)
↓
Index Storage (RDF Graph Database)
↓
Semantic Search Interface
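The upload step of this pipeline can be exercised programmatically. Below is a stdlib-only sketch of a client for the /api/process-files endpoint shown above; the field names mirror the curl example, and the error handling is deliberately minimal.

```python
import base64
import json
from urllib import request

# Sketch of a client for the /api/process-files endpoint. Field names
# mirror the curl example in this README; response shape is not assumed.

def build_payload(name, data, mime_type):
    """Package raw bytes as the base64 file descriptor the API expects."""
    return {
        "files": [{
            "name": name,
            "content": base64.b64encode(data).decode("ascii"),
            "mime_type": mime_type,
            "size": len(data),
        }]
    }

def process_files(payload, url="http://localhost:8080/api/process-files"):
    """POST the payload; requires the Selftag server to be running."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_payload("document.pdf", b"%PDF-1.4 ...", "application/pdf")
print(payload["files"][0]["size"])  # 12
```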
- OLLM: Large-context processing (100k+ tokens) with KV cache offloading
  - Auto-detects device: CUDA → MPS → CPU
  - Supports models like llama3-8B-chat
  - Best for large documents and comprehensive analysis
- Ollama: Fast inference with vision models
  - Text models: qwen2.5, llama3.1
  - Vision models: llava (for PDF OCR)
  - Fallback when OLLM is unavailable
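The CUDA → MPS → CPU auto-detection order can be sketched in a few lines. This assumes PyTorch is the backing framework (as in the prerequisites above) and falls back to CPU when torch is not installed; it is an illustration, not Selftag's actual detection code.

```python
# Sketch of the CUDA → MPS → CPU auto-detection order described above.
# Assumes PyTorch when present; returns "cpu" if torch is not installed.

def detect_device():
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older torch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(detect_device())
```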
The search index is built from RDF triples:
Subject: File URI (e.g., file://document.pdf)
Predicate: Semantic property (e.g., dc:subject, dc:type, custom ontology predicates)
Object: Tag value (e.g., "privacy-rights", "legal-document")
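This subject/predicate/object structure supports simple pattern matching, where a missing term acts like a SPARQL variable. The toy index below is an illustration of the idea, not Selftag's storage engine.

```python
# Toy in-memory triple index illustrating the (subject, predicate, object)
# structure above; None acts as a wildcard, like a SPARQL variable.

TRIPLES = [
    ("file://data-protection-bill.pdf", "dc:subject", "data-protection-law"),
    ("file://data-protection-bill.pdf", "dc:subject", "privacy-rights"),
    ("file://data-protection-bill.pdf", "dc:type", "legislation"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the given pattern; None matches anything."""
    return [t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# All subjects tagged "privacy-rights":
print([t[0] for t in match(p="dc:subject", o="privacy-rights")])
```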
Example Index Entry:
<file://data-protection-bill.pdf> dc:subject "data-protection-law" .
<file://data-protection-bill.pdf> dc:subject "privacy-rights" .
<file://data-protection-bill.pdf> dc:type "legislation" .
<file://data-protection-bill.pdf> custom:hasJurisdiction "India" .
<file://data-protection-bill.pdf> custom:effectiveDate "2023-01-01" .

Your custom OWL ontology defines:
- Classes: Document types in your domain

  :LegalDocument a owl:Class .
  :ResearchPaper a owl:Class .
  :PolicyDocument a owl:Class .
  :ClinicalDocument a owl:Class .
  :MedicalReport a owl:Class .
- Properties: Relationships and attributes

  :hasJurisdiction a owl:DatatypeProperty .
  :cites a owl:ObjectProperty .
  :supersedes a owl:ObjectProperty .
  :hasDiagnosis a owl:ObjectProperty .
  :prescribesDrug a owl:ObjectProperty .
- SHACL Validation: Rules for RDF generation

  :LegalDocumentShape a sh:NodeShape ;
      sh:property [
          sh:path :hasJurisdiction ;
          sh:datatype xsd:string ;
      ] .

  :ClinicalDocumentShape a sh:NodeShape ;
      sh:property [
          sh:path :hasDiagnosis ;
          sh:class snomed:ClinicalFinding ;
      ] .
Medical Ontology Integration:
- DRON (Drug Ontology): Import drug classes and properties for pharmaceutical document indexing
- SNOMED CT: Use clinical terminology for medical document classification
- ICD-10/ICD-11: Disease classification codes for healthcare indexing
- LOINC: Laboratory observation codes for clinical data
- UMLS: Unified Medical Language System integration
- HL7 FHIR: Healthcare data exchange standard properties
- Custom Medical Ontologies: Define your own healthcare domain structures
The LLM uses your ontology to:
- Generate RDF triples with your custom predicates (including medical ontologies)
- Classify documents into your domain classes (e.g., ClinicalDocument, ResearchPaper)
- Create index entries that match your knowledge structure
- Integrate with standard medical vocabularies (DRON, SNOMED CT, ICD, LOINC, UMLS, HL7 FHIR)
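A SHACL shape like the LegalDocumentShape above amounts to a constraint check over generated triples. The sketch below is a plain Python stand-in for that check, not a real SHACL engine; the triple data is hypothetical.

```python
# Toy SHACL-style check mirroring the LegalDocumentShape above: every
# :LegalDocument must carry a :hasJurisdiction string value.
# This is a plain Python stand-in, not a real SHACL engine.

def validate_legal_document(triples):
    """Return subjects typed :LegalDocument that violate the shape."""
    subjects = {s for s, p, o in triples
                if p == "rdf:type" and o == ":LegalDocument"}
    has_jurisdiction = {s for s, p, o in triples
                        if p == ":hasJurisdiction" and isinstance(o, str)}
    return sorted(subjects - has_jurisdiction)

triples = [
    ("file://bill.pdf", "rdf:type", ":LegalDocument"),
    ("file://bill.pdf", ":hasJurisdiction", "India"),
    ("file://memo.pdf", "rdf:type", ":LegalDocument"),  # missing jurisdiction
]
print(validate_legal_document(triples))  # ['file://memo.pdf']
```

Running a check like this after LLM generation is one way to catch triples that drift outside the ontology before they enter the index.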
# Logging
export RUST_LOG=info
# Repository path
export TAGSISTANT_REPOSITORY=./data
# LLM Configuration (optional)
export OLLAMA_URL=http://localhost:11434
export OLLAMA_MODEL=llama3.1:8b

Place OLLM model files in ./models/:
models/
  llama3-8B-chat/
    model-00001-of-00004.safetensors
    model-00002-of-00004.safetensors
    ...
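Discovering sharded checkpoints under this layout is a small directory scan. The sketch below demonstrates the idea against a temporary directory so it runs anywhere; the helper name is hypothetical, not part of Selftag.

```python
import pathlib
import tempfile

# Sketch: discover model directories containing safetensors shards under
# a ./models/ root, matching the layout above. Helper name is hypothetical.

def find_models(root):
    """Return names of subdirectories holding at least one .safetensors file."""
    root = pathlib.Path(root)
    return sorted(d.name for d in root.iterdir()
                  if d.is_dir() and any(d.glob("*.safetensors")))

# Demonstrated against a throwaway directory:
with tempfile.TemporaryDirectory() as tmp:
    model_dir = pathlib.Path(tmp) / "llama3-8B-chat"
    model_dir.mkdir()
    (model_dir / "model-00001-of-00004.safetensors").touch()
    print(find_models(tmp))  # ['llama3-8B-chat']
```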
- Model loading: 1-2 minutes
- Inference: 5-10x faster than CPU
- Large context: 100k+ tokens with KV cache
- Model loading: 5-10 minutes
- Inference: Slower but functional
- Large context: Supported with layer offloading
- Docker: MPS not available in containers (use native macOS)
- Native: Check PyTorch installation:
python3 -c "import torch; print(torch.backends.mps.is_available())"
- Ensure model files are in ./models/{model-name}/
- Check the volume mount in Docker: ./models:/app/models
- Check server logs: tail -f /tmp/memesis_native.log
- Verify port 8080 is not in use: lsof -i :8080
All utility scripts are in the scripts/ directory:
- run_native_macos.sh: Start the application natively with MPS support
- Test scripts: Various testing utilities (see the scripts/ directory)
This project is licensed under the MIT License.
Copyright (c) 2025 Michael Holborn (mikeholborn1990@gmail.com)
See LICENSE file for full license text.
- Features are being implemented and tested
- APIs and data structures may change
- Documentation may be incomplete or outdated
- Some functionality may not work as expected
Contributions, bug reports, and feedback are welcome, but please note that this is an early-stage project.
- Ollama: Local LLM inference
- OLLM: Large-context processing with KV cache
- PyTorch: Deep learning framework with MPS support
- Dublin Core: Standard metadata vocabulary
- OWL/SHACL: Semantic web standards for ontologies