Real-time plagiarism detection for student image submissions using perceptual hashing, CLIP embeddings, and vector search.
- Perceptual hashing (pHash, dHash, aHash) for fast duplicate detection
- CLIP embeddings for semantic similarity
- Vector search with FAISS or pgvector
- AI-generated image detection (DALL-E, Midjourney, Stable Diffusion)
- Peer and self-plagiarism checking
- Async processing with RabbitMQ
- Optional student ID hashing for privacy
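As an illustration of the optional student ID hashing listed above, here is a minimal sketch using a keyed SHA-256 digest from the Python standard library. The function name `hash_student_id`, the normalization step, and the idea of a secret key are assumptions made for this example; the actual behavior lives in utils/security.py and may differ.

```python
import hashlib
import hmac


def hash_student_id(student_id: str, secret: bytes) -> str:
    """Return a keyed SHA-256 hex digest of a student ID.

    Hypothetical helper: the name, the normalization, and the use of
    HMAC are assumptions for illustration, not the project's code.
    """
    # Normalize so that " S123 " and "s123" map to the same digest.
    normalized = student_id.strip().lower()
    return hmac.new(secret, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Placeholder key; a real deployment would load it from configuration.
    print(hash_student_id("S123456", b"example-secret"))
```

A keyed digest keeps the stored value stable for matching resubmissions while making it harder to recover IDs by hashing guesses without the key.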
./start-dev-env.sh
This starts PostgreSQL and RabbitMQ containers using Podman and creates a .env file with default settings.
python -m venv venv
source venv/bin/activate # Windows: .\venv\Scripts\activate
pip install -r requirements.txt
python app.py
The application will:
- Connect to PostgreSQL and RabbitMQ
- Start listening for plagiarism check submissions
- Process images and store results in the database
Configuration is managed via a .env file, which the start script creates automatically.
# Database Connection
POSTGRES_USER=plagiarism_user
POSTGRES_PASSWORD=secure_password
POSTGRES_DB=plagiarism_db
POSTGRES_HOST=postgres # Use 'localhost' for local development
# Message Queue
RABBITMQ_HOST=rabbitmq # Use 'localhost' for local development
RABBITMQ_USER=admin
RABBITMQ_PASS=admin123
# Hash matching (lower = stricter)
HASH_MATCH_THRESHOLD=8 # Hamming distance
# Semantic similarity (higher = stricter)
SEMANTIC_MATCH_THRESHOLD=0.80 # 0.0 to 1.0
# Self-plagiarism grace period
RESUBMISSION_WINDOW_DAYS=14 # Days
# Choose between FAISS (in-memory) or pgvector (database)
USE_PGVECTOR=false # Set to 'true' for pgvector
See .env.example for all available options.
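To make the two thresholds concrete, the sketch below reads them from the environment (falling back to the defaults shown above) and applies them to a pair of hashes and a pair of embeddings. It assumes hashes are hex strings, that semantic similarity means cosine similarity, and that both comparisons are inclusive; the function names are hypothetical and the project's own handlers may work differently.

```python
import math
import os

# Defaults mirror the values in the .env listing above.
HASH_MATCH_THRESHOLD = int(os.getenv("HASH_MATCH_THRESHOLD", "8"))
SEMANTIC_MATCH_THRESHOLD = float(os.getenv("SEMANTIC_MATCH_THRESHOLD", "0.80"))


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def cosine_similarity(vec_a, vec_b) -> float:
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0


def is_hash_match(hash_a, hash_b, threshold=HASH_MATCH_THRESHOLD) -> bool:
    # Lower threshold = stricter: fewer differing bits are tolerated.
    # Assumption: the comparison is inclusive (<=).
    return hamming_distance(hash_a, hash_b) <= threshold


def is_semantic_match(vec_a, vec_b, threshold=SEMANTIC_MATCH_THRESHOLD) -> bool:
    # Higher threshold = stricter: embeddings must be more alike.
    # Assumption: the comparison is inclusive (>=).
    return cosine_similarity(vec_a, vec_b) >= threshold
```

With a 64-bit hash, a threshold of 8 tolerates up to 8 differing bits out of 64, so small edits such as recompression or mild resizing would typically still count as a match under these assumptions.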
# Install test dependencies
pip install -r requirements-test.txt
# Run tests
pytest tests/ -v
# With coverage report
pytest tests/ --cov --cov-report=html
- Total: 384 tests
- Passing: 364 (94.8%)
- Skipped: 20 (DB manager mocks - covered by integration tests)
- Failing: 0
- Execution Time: ~7 minutes
Coverage by Component:
- ✅ Integration tests: 11/11 (end-to-end workflows)
- ✅ Hash handlers: Complete coverage
- ✅ CLIP embeddings: 35/38 (3 skipped - validation tests)
- ✅ AI detection: 15/15
- ✅ FAISS handler: 22/22
- ✅ Image validator: 18/18
mentorme/
├── app.py # Main application entry
├── api/
│ └── api.py # FastAPI REST endpoints
├── config/
│ └── config.py # Configuration management
├── database/
│ ├── db_manager.py # Database operations
│ ├── init.sql # Schema definition
│ └── dumps/ # Database backups
├── image_worker/
│ ├── worker.py # Core detection engine
│ ├── clip_handler.py # CLIP embeddings (768D)
│ ├── hash_handler.py # Perceptual hashing
│ ├── ai_generated_detector.py # AI detection
│ ├── image_validator.py # Image validation
│ ├── faiss_handler.py # FAISS vector backend
│ └── pgvector_handler.py # pgvector backend
├── mq/
│ └── rmq_client.py # RabbitMQ client
├── plag_checker/
│ ├── submissions_checker.py # Message orchestrator
│ └── submission_status.py # Status tracking
├── processors/
│ ├── base_processor.py # Base processor class
│ ├── image_processor.py # Image processing
│ └── text_processor.py # Text processing
├── scripts/
│ ├── dump_database.sh # Database backup
│ ├── restore_database.sh # Database restore
│ └── download_clip_model.py # CLIP model downloader
├── seeding/
│ ├── seed_ref_images.py # Reference image indexing
│ └── seed_from_xlsx.py # Bulk submission seeding
├── utils/
│ ├── security.py # Student ID hashing
│ └── exceptions.py # Custom exceptions
└── tests/ # Test suite (384 tests)
- Hash Check: Compares perceptual hashes (Hamming distance)
- CLIP Check: Generates 768D embedding (ViT-L/14), searches vector index
- AI Detection: Checks metadata and statistical patterns
- Result: Returns plagiarism status with confidence score
Priority: Peer > Reference > Self (resubmission) > Original
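The priority rule above can be sketched as picking the highest-ranked match type among everything the checks found. The `MatchType` enum and `resolve_status` function are hypothetical names introduced for this example; they only illustrate the ordering Peer > Reference > Self (resubmission) > Original and are not taken from the project's code.

```python
from enum import IntEnum


class MatchType(IntEnum):
    """Match categories ordered by priority (higher value wins)."""

    ORIGINAL = 0   # no match found
    SELF = 1       # resubmission by the same student
    REFERENCE = 2  # match against a reference image
    PEER = 3       # match against another student's submission


def resolve_status(matches) -> MatchType:
    """Return the highest-priority match type, or ORIGINAL if none."""
    return max(matches, default=MatchType.ORIGINAL)


if __name__ == "__main__":
    found = [MatchType.SELF, MatchType.PEER]
    print(resolve_status(found).name)
```

Encoding the order in an `IntEnum` keeps the rule in one place, so a submission that matches both a peer and the student's own earlier work is reported as a peer match.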
# Check status
podman ps
# View logs
podman logs mentorme-postgres
podman logs mentorme-rabbitmq
# Stop services
podman stop mentorme-postgres mentorme-rabbitmq
# Restart services
./start-dev-env.sh
RabbitMQ Management UI:
- URL: http://localhost:15672
- Login: admin/admin123
Database Stats:
-- Total submissions
SELECT COUNT(*) FROM submissions;
-- Plagiarism rate (percent); NULLIF avoids division by zero on an empty table
SELECT
  COUNT(*) FILTER (WHERE is_plagiarized = true) * 100.0 / NULLIF(COUNT(*), 0) AS plagiarism_rate
FROM submissions;
RabbitMQ Connection Issues:
podman restart mentorme-rabbitmq
podman logs mentorme-rabbitmq
PostgreSQL Connection Issues:
podman restart mentorme-postgres
podman exec mentorme-postgres pg_isready -U plagiarism_user
Test Failures:
pip install -r requirements-test.txt --upgrade
pytest tests/ -vv --tb=short
- DOCUMENTATION.md - Complete technical documentation
- DATABASE_DUMPS.md - Database backup/restore guide
- Seeding README - Data seeding instructions
- Copilot Instructions - Development guidelines
MIT License
- OpenCLIP - Image embeddings
- FAISS - Vector search
- imagehash - Perceptual hashing
- asyncpg, psycopg3 - PostgreSQL drivers
- aio-pika - RabbitMQ client