Selftag: LLM-Powered Semantic Search Engine with Adaptive RDF Indexing

⚠️ Work in Progress (WIP): This project is currently under active development. Features may be incomplete, APIs may change, and documentation may be outdated. Use at your own risk.

What is Selftag?

Selftag is an intelligent semantic search engine that uses Large Language Models (LLMs) to automatically build and maintain a searchable index of your documents. Unlike traditional search engines that rely on keyword matching, Selftag understands the semantic meaning of your content and creates a dynamic, ontology-driven index that adapts to your personal knowledge domain.

Key Innovation: LLM-Defined Search Index

Traditional Search:

  • Static keyword-based indexing
  • Fixed taxonomy and categories
  • Limited semantic understanding

Selftag:

  • LLM-powered semantic analysis generates RDF triples from document content
  • User-defined OWL ontologies customize the index structure
  • Adaptive indexing that evolves with your document collection
  • RDF/Turtle output for semantic interoperability

How It Works

1. LLM-Based Content Analysis

When you add a document, Selftag uses LLMs (via Ollama or the OLLM library) to:

  • Analyze the actual content (not just filenames)
  • Extract semantic concepts, topics, and relationships
  • Generate RDF triples using Dublin Core predicates
  • Understand context, entities, and document purpose

Example RDF Output:

@prefix dc: <http://purl.org/dc/elements/1.1/> .

<file://document.pdf> dc:title "Data Protection Legislation" .
<file://document.pdf> dc:subject "privacy-rights" .
<file://document.pdf> dc:subject "data-governance" .
<file://document.pdf> dc:type "legal-document" .
<file://document.pdf> dc:coverage "India" .
<file://document.pdf> dc:date "2023" .

2. User-Defined OWL Ontology

Your personal OWL ontology defines:

  • Custom classes for your domain (e.g., "LegalDocument", "ResearchPaper", "Policy")
  • Object properties for relationships (e.g., "cites", "implements", "supersedes")
  • Datatype properties for attributes (e.g., "hasJurisdiction", "hasEffectiveDate")
  • SHACL shapes for validation rules

The LLM uses your ontology to:

  • Generate RDF triples that match your domain vocabulary
  • Create index entries aligned with your knowledge structure
  • Enable semantic queries using your custom predicates

3. Adaptive Index Evolution

The search index dynamically adapts based on:

  • Document density: New categories emerge as you add related documents
  • User taxonomy: Your OWL ontology guides classification
  • Content patterns: The system learns your domain-specific terminology
  • Multi-dimensional classification: Documents can belong to multiple semantic dimensions

4. RDF-Specific Outputs

All indexing uses RDF/Turtle format for:

  • Semantic interoperability: Compatible with other RDF systems
  • Standard predicates: Dublin Core, FOAF, SKOS, and custom namespaces
  • Query flexibility: SPARQL-compatible triple stores
  • Knowledge graphs: Build rich semantic networks from your documents
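As a rough illustration of why triples make a flexible index (a toy sketch, not Selftag's actual storage layer, which is an RDF graph database queried via SPARQL), the index can be pictured as a set of (subject, predicate, object) tuples filtered by predicate and object:

```python
# Toy in-memory triple index. The file URIs and tag values below mirror
# the example output above; the functions are illustrative only.
DC = "http://purl.org/dc/elements/1.1/"

triples = {
    ("file://document.pdf", DC + "subject", "privacy-rights"),
    ("file://document.pdf", DC + "subject", "data-governance"),
    ("file://document.pdf", DC + "type", "legal-document"),
    ("file://report.pdf", DC + "type", "financial-report"),
}

def find_subjects(predicate: str, obj: str) -> set:
    """Return every subject with a matching (predicate, object) pair --
    the core lookup behind 'find all documents tagged with X'."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}

print(sorted(find_subjects(DC + "subject", "privacy-rights")))
```

A real triple store answers the same question with a SPARQL pattern rather than a Python set comprehension, but the data model is identical.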

Use Cases

Healthcare & Medical Research

Selftag excels in healthcare and medical research by integrating with established medical ontologies:

Medical Ontologies Supported:

  • DRON (Drug Ontology): Drug names, interactions, and pharmacological properties
  • SNOMED CT: Clinical terminology and medical concepts
  • ICD-10/ICD-11: Disease classification and coding
  • LOINC: Laboratory and clinical observations
  • UMLS: Unified Medical Language System
  • HL7 FHIR: Healthcare data exchange standards

Healthcare Applications:

  • Clinical Document Management: Index patient records, lab results, and medical reports using clinical terminology
  • Research Literature: Organize medical research papers by disease, treatment, and methodology
  • Drug Information: Tag pharmaceutical documents with DRON drug ontology terms
  • Regulatory Compliance: Index FDA submissions, clinical trial data, and regulatory documents
  • Medical Imaging: Semantic tagging of radiology reports and imaging studies

Example RDF Output for Medical Document:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dron: <http://purl.org/dron/> .
@prefix snomed: <http://snomed.info/id/> .

<file://patient-report.pdf> dc:type "clinical-document" .
<file://patient-report.pdf> dc:subject "diabetes-mellitus-type-2" .
<file://patient-report.pdf> dron:drug "metformin" .
<file://patient-report.pdf> snomed:condition "44054006" .  # Diabetes mellitus type 2
<file://patient-report.pdf> dc:date "2024-03-15" .

Legal & Compliance

  • Regulatory Documents: Index legislation, policies, and compliance documents
  • Case Law: Organize legal precedents by jurisdiction, topic, and outcome
  • Contract Management: Semantic search across contracts and agreements

Research & Academia

  • Academic Papers: Organize research by methodology, domain, and findings
  • Grant Applications: Index funding proposals and research plans
  • Data Management: Tag datasets with domain-specific ontologies

Business Intelligence

  • Financial Reports: Index earnings, audits, and financial statements
  • Market Research: Organize industry reports and competitive analysis
  • Strategic Planning: Tag business plans and strategy documents

Why Selftag is Better

Traditional Search Limitations

  • ❌ Keyword matching misses semantic relationships
  • ❌ Fixed taxonomies don't adapt to your domain
  • ❌ No understanding of document meaning or context
  • ❌ Limited to exact text matches

Selftag Advantages

  • Semantic understanding: Finds documents by meaning, not just keywords
  • Custom ontologies: Index structure matches your domain knowledge (including medical ontologies like DRON, SNOMED CT)
  • Adaptive indexing: Index evolves as your collection grows
  • RDF-based: Standard semantic web format for interoperability
  • Multi-modal analysis: Handles text, PDFs, images, and CSV files with appropriate models
  • Context-aware: Understands relationships, entities, and document purpose
  • Domain-specific: Healthcare, legal, research, and business ontologies supported

Quick Start

Option 1: Native macOS (Recommended for MPS/GPU acceleration)

Prerequisites:

  • macOS with Apple Silicon (for Metal/MPS support)
  • Rust 1.70+ with Cargo
  • Python 3.8+ with PyTorch
  • OLLM library for large-context processing

Installation:

# Install Python dependencies
pip3 install torch torchvision torchaudio
pip3 install --no-build-isolation --no-deps ollm transformers accelerate

# Build the application
cargo build --release

# Run natively (enables MPS acceleration)
./scripts/run_native_macos.sh

Access:

  • Web UI: http://localhost:8080
  • API: http://localhost:8080/api

Option 2: Docker (Cross-platform, CPU-only)

Prerequisites:

  • Docker and Docker Compose
  • Ollama service (for fallback processing)

Installation:

# Build and start services
docker-compose build
docker-compose up -d

# Access
# Web UI: http://localhost:8080
# API: http://localhost:8080/api

Note: Docker on macOS uses CPU only (MPS not available in containers). For GPU acceleration, use native macOS installation.

Usage

Web Interface

  1. Upload Documents

    • Open http://localhost:8080
    • Upload PDFs, images, CSV files, or text documents
    • Documents are automatically analyzed by LLM
  2. View Semantic Tags

    • See RDF triples generated from content
    • Browse by Dublin Core predicates (dc:subject, dc:type, etc.)
    • Filter by custom ontology classes
  3. Create Custom Ontology

    • Use the Ontology Wizard to define your domain structure
    • Create custom classes, properties, and relationships
    • The LLM will use your ontology for future indexing
  4. Search by Meaning

    • Query using semantic concepts, not just keywords
    • Find related documents through RDF relationships
    • Explore knowledge graphs of your document collection

API Endpoints

Process Files:

curl -X POST http://localhost:8080/api/process-files \
  -H "Content-Type: application/json" \
  -d '{
    "files": [{
      "name": "document.pdf",
      "content": "base64encodedcontent",
      "mime_type": "application/pdf",
      "size": 1024000
    }]
  }'
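The `content` field carries the file's bytes as base64. A minimal Python sketch of building that payload with only the standard library (the endpoint and field names follow the curl example above):

```python
import base64
import json

def build_payload(name: str, data: bytes, mime_type: str) -> str:
    """Build the JSON body expected by /api/process-files:
    raw file bytes are base64-encoded into the 'content' field."""
    return json.dumps({
        "files": [{
            "name": name,
            "content": base64.b64encode(data).decode("ascii"),
            "mime_type": mime_type,
            "size": len(data),
        }]
    })

body = build_payload("document.pdf", b"%PDF-1.4 ...", "application/pdf")
# POST `body` to http://localhost:8080/api/process-files with
# Content-Type: application/json (via urllib, requests, or curl).
```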

Get Device Status:

curl http://localhost:8080/api/ollm-device-status
# Returns: {"device": "mps", "mps_available": true, "ollm_available": true}

Health Check:

curl http://localhost:8080/health

Architecture

LLM Processing Pipeline

Document Upload
    ↓
Content Extraction (PDF/Image/CSV)
    ↓
LLM Analysis (Ollama/OLLM)
    ↓
RDF Triple Generation (Dublin Core + Custom Ontology)
    ↓
Index Storage (RDF Graph Database)
    ↓
Semantic Search Interface

LLM Providers

  • OLLM: Large-context processing (100k+ tokens) with KV cache offloading

    • Auto-detects device: CUDA → MPS → CPU
    • Supports models like llama3-8B-chat
    • Best for large documents and comprehensive analysis
  • Ollama: Fast inference with vision models

    • Text models: qwen2.5, llama3.1
    • Vision models: llava (for PDF OCR)
    • Fallback when OLLM unavailable

RDF Index Structure

The search index is built from RDF triples:

  • Subject: File URI (e.g., file://document.pdf)
  • Predicate: Semantic property (e.g., dc:subject, dc:type, custom ontology predicates)
  • Object: Tag value (e.g., "privacy-rights", "legal-document")

Example Index Entry:

<file://data-protection-bill.pdf> dc:subject "data-protection-law" .
<file://data-protection-bill.pdf> dc:subject "privacy-rights" .
<file://data-protection-bill.pdf> dc:type "legislation" .
<file://data-protection-bill.pdf> custom:hasJurisdiction "India" .
<file://data-protection-bill.pdf> custom:effectiveDate "2023-01-01" .

OWL Ontology Integration

Your custom OWL ontology defines:

  1. Classes: Document types in your domain

    :LegalDocument a owl:Class .
    :ResearchPaper a owl:Class .
    :PolicyDocument a owl:Class .
    :ClinicalDocument a owl:Class .
    :MedicalReport a owl:Class .
  2. Properties: Relationships and attributes

    :hasJurisdiction a owl:DatatypeProperty .
    :cites a owl:ObjectProperty .
    :supersedes a owl:ObjectProperty .
    :hasDiagnosis a owl:ObjectProperty .
    :prescribesDrug a owl:ObjectProperty .
  3. SHACL Validation: Rules for RDF generation

    :LegalDocumentShape a sh:NodeShape ;
      sh:property [
        sh:path :hasJurisdiction ;
        sh:datatype xsd:string ;
      ] .
    
    :ClinicalDocumentShape a sh:NodeShape ;
      sh:property [
        sh:path :hasDiagnosis ;
        sh:class snomed:ClinicalFinding ;
      ] .
    

Medical Ontology Integration:

  • DRON (Drug Ontology): Import drug classes and properties for pharmaceutical document indexing
  • SNOMED CT: Use clinical terminology for medical document classification
  • ICD-10/ICD-11: Disease classification codes for healthcare indexing
  • LOINC: Laboratory observation codes for clinical data
  • UMLS: Unified Medical Language System integration
  • HL7 FHIR: Healthcare data exchange standard properties
  • Custom Medical Ontologies: Define your own healthcare domain structures

The LLM uses your ontology to:

  • Generate RDF triples with your custom predicates (including medical ontologies)
  • Classify documents into your domain classes (e.g., ClinicalDocument, ResearchPaper)
  • Create index entries that match your knowledge structure
  • Integrate with standard medical vocabularies (DRON, SNOMED CT, ICD, LOINC, UMLS, HL7 FHIR)
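One way to picture the ontology's role (a toy sketch, not Selftag's implementation): the ontology supplies the allowed vocabulary, and LLM-generated triples are kept only if their predicate belongs to it. The predicate set and helper below are hypothetical:

```python
# Hypothetical filter: keep only generated triples whose predicate is
# declared in the user's ontology or a standard vocabulary.
ONTOLOGY_PREDICATES = {
    "dc:subject", "dc:type", "dc:date",         # standard Dublin Core terms
    "custom:hasJurisdiction", "custom:cites",   # user-defined properties
}

def filter_triples(candidates):
    """Drop triples whose predicate the ontology does not define,
    so the index stays aligned with the user's vocabulary."""
    return [t for t in candidates if t[1] in ONTOLOGY_PREDICATES]

generated = [
    ("file://bill.pdf", "dc:type", "legislation"),
    ("file://bill.pdf", "custom:hasJurisdiction", "India"),
    ("file://bill.pdf", "made:up-predicate", "noise"),  # rejected below
]
print(filter_triples(generated))
```

In practice the constraint runs the other direction as well: the ontology is included in the LLM prompt so most generated triples already use the right predicates, with SHACL shapes validating the result.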

Configuration

Environment Variables

# Logging
export RUST_LOG=info

# Repository path
export TAGSISTANT_REPOSITORY=./data

# LLM Configuration (optional)
export OLLAMA_URL=http://localhost:11434
export OLLAMA_MODEL=llama3.1:8b

Model Files

Place OLLM model files in ./models/:

models/
  llama3-8B-chat/
    model-00001-of-00004.safetensors
    model-00002-of-00004.safetensors
    ...

Performance

Native macOS (MPS)

  • Model loading: 1-2 minutes
  • Inference: 5-10x faster than CPU
  • Large context: 100k+ tokens with KV cache

Docker (CPU)

  • Model loading: 5-10 minutes
  • Inference: Slower but functional
  • Large context: Supported with layer offloading

Troubleshooting

MPS Not Available

  • Docker: MPS not available in containers (use native macOS)
  • Native: Check PyTorch installation: python3 -c "import torch; print(torch.backends.mps.is_available())"

OLLM Model Not Found

  • Ensure model files are in ./models/{model-name}/
  • Check volume mount in Docker: ./models:/app/models

API Not Responding

  • Check server logs: tail -f /tmp/memesis_native.log
  • Verify port 8080 is not in use: lsof -i :8080

Scripts

All utility scripts are in the scripts/ directory:

  • run_native_macos.sh: Start application natively with MPS support
  • Test scripts: Various testing utilities (see scripts/ directory)

License

This project is licensed under the MIT License.

Copyright (c) 2025 Michael Holborn (mikeholborn1990@gmail.com)

See LICENSE file for full license text.

Project Status

⚠️ Work in Progress: This project is under active development.

  • Features are being implemented and tested
  • APIs and data structures may change
  • Documentation may be incomplete or outdated
  • Some functionality may not work as expected

Contributions, bug reports, and feedback are welcome, but please note that this is an early-stage project.

Acknowledgments

  • Ollama: Local LLM inference
  • OLLM: Large-context processing with KV cache
  • PyTorch: Deep learning framework with MPS support
  • Dublin Core: Standard metadata vocabulary
  • OWL/SHACL: Semantic web standards for ontologies
