Docent is a Python-based pipeline that extracts meaningful titles and structured headings from PDFs and filters them based on relevance to a given job description or persona. It leverages semantic similarity to return only the most informative content.
Docent processes collections of PDFs and identifies the most relevant sections using natural language understanding. The pipeline outputs a structured JSON summary that emphasizes meaningful and contextually appropriate content for a given use case (e.g., job matching or persona targeting).
For each PDF, Docent performs the following steps:
- 📘 PDF Parsing: Extracts hierarchical text data using `PyMuPDF`.
- 🧬 Embedding Generation: Generates sentence-level embeddings via `sentence-transformers` models (e.g., `all-MiniLM-L6-v2`).
- 🔍 Semantic Similarity: Computes cosine similarity between PDF content and the target persona or job description.
- ✂️ Subsection Filtering: Ranks paragraphs and retains only the most relevant ones based on similarity thresholds.
- 📦 Output Formatting: Outputs structured data following a consistent JSON format (compatible with downstream systems or integrations).
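The similarity and filtering steps can be illustrated with a minimal sketch. It is not Docent's actual code: the function names `cosine_similarity` and `rank_paragraphs`, the `threshold` default of 0.5, and the toy 3-dimensional vectors are all assumptions made for illustration. The pipeline itself reportedly uses scikit-learn for the cosine computation and a `sentence-transformers` model for the embeddings; plain Python is used here only to keep the example dependency-free.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_paragraphs(query_vec, paragraphs, threshold=0.5):
    """Return (text, score) pairs scoring at or above threshold, best first.

    `paragraphs` is a list of (text, embedding) pairs. The threshold value
    is a hypothetical default, not one taken from Docent.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in paragraphs]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings"; real ones would come from the embedding model.
query = [1.0, 0.0, 0.0]
paragraphs = [
    ("budget planning", [0.9, 0.1, 0.0]),
    ("unrelated appendix", [0.0, 1.0, 0.0]),
    ("travel costs", [0.6, 0.0, 0.8]),
]
ranked = rank_paragraphs(query, paragraphs, threshold=0.5)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

Here the "unrelated appendix" paragraph is orthogonal to the query and falls below the threshold, so only the other two paragraphs are kept, ordered by score.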
Install required libraries using pip:

    pip install pymupdf sentence-transformers scikit-learn numpy

- `sentence-transformers` – Semantic embedding
- `PyMuPDF` (`fitz`) – PDF parsing and layout detection
- `scikit-learn` – Cosine similarity computation
- `numpy` – Numerical array operations
- Standard libraries: `os`, `glob`, `json`, `re`
DOCENT/
├── app/
│ ├── embedding_utils.py # Text embedding functions
│ ├── main.py # Entry point
│ ├── output_formatter.py # JSON output formatter
│ ├── pdf_parser.py # PDF heading extractor
│ ├── pipeline.py # Workflow coordinator
│ ├── similarity_ranking.py # Semantic similarity scoring
│ └── subsection_processing.py # Subsection refinement
│
├── Collection 1/
│ ├── PDF/ # Source PDFs
│ ├── challenge1b_input.json # Persona/Job Description input
│ └── challenge1b_output.json # Final JSON output (generated)
│
├── Collection 2/
├── Collection 3/
│
├── sentence_transformers/ # (Optional) Local model directory
├── sentence_transformers.py # SentenceTransformer model wrapper
├── test_sentence_transformer.py # Model test script
├── output.json # Aggregated results (optional)
├── requirements.txt # Dependencies
├── Dockerfile # Container setup
└── .gitignore
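The tree above lists `app/pdf_parser.py` as the PDF heading extractor. One common way to find headings, sketched below, is to treat text set in a larger font than the body text as a heading. This heuristic and the function name `extract_headings` are assumptions for illustration; the actual logic in `pdf_parser.py` may differ. The input mirrors the nested structure that PyMuPDF's `page.get_text("dict")` returns (blocks, then lines, then spans with `"text"` and `"size"`), so the sketch runs on a hand-built sample without needing a PDF.

```python
from collections import Counter


def extract_headings(page_dict):
    """Return the text of spans whose font size exceeds the page's body size.

    `page_dict` has the shape returned by PyMuPDF's page.get_text("dict").
    """
    spans = [
        span
        for block in page_dict.get("blocks", [])
        for line in block.get("lines", [])  # image blocks carry no "lines"
        for span in line.get("spans", [])
        if span["text"].strip()
    ]
    if not spans:
        return []
    # Assumption: the font size covering the most characters is body text.
    size_weights = Counter()
    for span in spans:
        size_weights[round(span["size"], 1)] += len(span["text"])
    body_size = size_weights.most_common(1)[0][0]
    return [s["text"].strip() for s in spans if round(s["size"], 1) > body_size]


# Hand-built stand-in for one parsed page.
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "1. Introduction", "size": 16.0}]}]},
        {"type": 0, "lines": [{"spans": [
            {"text": "This report covers quarterly results in detail.", "size": 10.0},
        ]}]},
        {"type": 1},  # image block
    ]
}
headings = extract_headings(sample_page)
print(headings)
```

With PyMuPDF installed, the same function could be fed real pages by opening a file with `fitz.open(path)` and calling `page.get_text("dict")` on each page.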
- Place PDFs inside the relevant `Collection` folder (e.g., `Collection 1/PDF/`)
- Add a `challenge1b_input.json` file in the collection folder containing the persona/job description
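The folder convention above can be checked programmatically before running the pipeline. The following is a minimal sketch, assuming collections are named `Collection *` and sit directly under a root directory as in the project layout; the helper name `find_collections` is hypothetical and is not part of Docent.

```python
import glob
import os


def find_collections(root):
    """Map each collection folder under root to its sorted list of PDF paths.

    A folder counts as a collection only if it holds challenge1b_input.json.
    """
    collections = {}
    pattern = os.path.join(root, "Collection *", "challenge1b_input.json")
    for input_path in sorted(glob.glob(pattern)):
        folder = os.path.dirname(input_path)
        pdfs = sorted(glob.glob(os.path.join(folder, "PDF", "*.pdf")))
        collections[os.path.basename(folder)] = pdfs
    return collections


if __name__ == "__main__":
    # Report what would be processed when run from the project root.
    for name, pdfs in find_collections(".").items():
        print(f"{name}: {len(pdfs)} PDF(s)")
```

A collection folder that lacks the input file is skipped, which makes a missing persona/job description easy to spot.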
- Build Docker Image

      docker build -t docent:latest .

- Run Container

      docker run -it --rm docent:latest

Customize paths or bind volumes if needed for local file access.
- A JSON file (e.g., `challenge1b_output.json`) is generated for each collection
- Output includes:
  - Extracted titles/headings
  - Most relevant paragraphs
  - Sectional structure for better readability
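To make the shape of such a file concrete, here is a minimal sketch of assembling and serializing a summary. The real schema of `challenge1b_output.json` is not documented here, so every key name (`collection`, `sections`, `document`, `heading`, `relevant_paragraphs`), the helper `build_output`, and the sample values are hypothetical placeholders.

```python
import json


def build_output(collection_name, sections):
    """Assemble a JSON-serializable summary; all key names are hypothetical."""
    return {
        "collection": collection_name,
        "sections": [
            {
                "document": s["document"],
                "heading": s["heading"],
                "relevant_paragraphs": s["paragraphs"],
            }
            for s in sections
        ],
    }


# Made-up sample data, only to show the resulting structure.
summary = build_output(
    "Collection 1",
    [
        {
            "document": "guide.pdf",
            "heading": "Overview",
            "paragraphs": ["First relevant paragraph."],
        }
    ],
)
output_text = json.dumps(summary, indent=2, ensure_ascii=False)
print(output_text)
```

Passing `ensure_ascii=False` keeps any non-ASCII text from the PDFs readable in the written file instead of escaping it.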
- 🔍 Smart relevance-based PDF section filtering
- 💬 Semantic understanding using transformer-based models
- 📂 Outputs in clean, structured JSON
- 🔒 Fully offline and customizable