DocAnalyzer is a command-line tool built with Python that:
- Extracts full text from PDF files
- Identifies and highlights named entities (like names, dates, places)
- Analyzes and visualizes word frequency
- Generates a word cloud
- Summarizes the document by selecting its most important sentences
- Saves output in organized files and images
This is a self-directed NLP + automation project designed for offline usage, perfect for researchers and students.
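The word-frequency and summary steps listed above can be illustrated with a small, standard-library-only sketch. It is not the code in pdf_analyzer.py: the tiny `STOPWORDS` set, the function names, and the scoring rule (sum of word counts per sentence) are assumptions chosen for illustration, whereas the actual tool relies on NLTK's stopword list and tokenizer.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption for this sketch);
# the real tool downloads NLTK's "stopwords" corpus instead.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def word_frequencies(text):
    """Count lowercase words in the text, skipping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freqs = word_frequencies(text)

    def score(sentence):
        # A sentence scores the sum of the document-wide counts of its words.
        return sum(freqs[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]
```

Because the score is a plain sum, longer sentences tend to rank higher; dividing by sentence length would be one way to offset that.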
- pdfplumber
- nltk
- spaCy
- matplotlib
- wordcloud
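As a rough sketch of how the first libraries in this list might fit together, the snippet below pulls text out of a PDF with pdfplumber and collects named entities with spaCy's `en_core_web_sm` model. The function names (`extract_text`, `find_entities`, `group_by_label`) are hypothetical and are not taken from pdf_analyzer.py.

```python
from collections import defaultdict


def extract_text(pdf_path):
    """Join the text of every page in the PDF."""
    # Imported lazily so the pure helper below works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def find_entities(text):
    """Return (entity text, label) pairs found by spaCy's small English model."""
    import spacy

    nlp = spacy.load("en_core_web_sm")
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def group_by_label(pairs):
    """Group entity strings under their label, dropping repeats but keeping order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)
```

With the dependencies installed, `group_by_label(find_entities(extract_text("sample.pdf")))` would give a dictionary keyed by entity label such as `PERSON`, `DATE`, or `GPE`.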
| Word Cloud Example | Word Frequency Plot |
|---|---|
| ![Word cloud example]() | ![Word frequency plot]() |
The demo uses a publicly available, AI-related sample PDF (sample.pdf) that contains both natural-language and technical content. You can replace it with any other PDF document.
All results are saved in the output/ folder:
| File | Description |
|---|---|
| full_text.txt | Complete text extracted from the PDF |
| summary.txt | Top most relevant sentences |
| wordcloud.png | Word cloud of frequent words |
| frequency_plot.png | Bar chart of top word frequencies |
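A minimal sketch of how the two text outputs in this table could be written, assuming the extracted text and the summary sentences are already available as Python values. The helper name `save_text_outputs` is hypothetical; only the folder and file names come from the table above.

```python
from pathlib import Path


def save_text_outputs(full_text, summary_sentences, out_dir="output"):
    """Write full_text.txt and summary.txt into out_dir and return both paths."""
    out = Path(out_dir)
    # Create the folder on first run; do nothing if it already exists.
    out.mkdir(parents=True, exist_ok=True)

    full_path = out / "full_text.txt"
    summary_path = out / "summary.txt"

    full_path.write_text(full_text, encoding="utf-8")
    # One summary sentence per line.
    summary_path.write_text("\n".join(summary_sentences) + "\n", encoding="utf-8")
    return full_path, summary_path
```

The two PNG files would be written separately by the plotting code, for example with matplotlib's `savefig` for the bar chart.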
Make sure sample.pdf is placed in the project root directory, like this:
DocAnalyzer/
├── pdf_analyzer.py
├── sample.pdf
├── requirements.txt
├── README.md
├── output/
│ ├── full_text.txt
│ ├── summary.txt
│ ├── frequency_plot.png
│ └── wordcloud.png
└── .gitignore
git clone https://github.com/Waleed99i/DocAnalyzer.git
cd DocAnalyzer
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -m nltk.downloader punkt stopwords
python -m spacy download en_core_web_sm
source venv/bin/activate
python pdf_analyzer.py
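If the run fails with an import error, a quick check like the one below can show which dependencies are missing from the active virtual environment. This is a hypothetical helper, not part of the repository; the module names are the import names of the libraries listed earlier plus the spaCy model installed above.

```python
from importlib.util import find_spec

# Import names of the required packages; en_core_web_sm is the spaCy model,
# which "python -m spacy download" installs as a regular package.
REQUIRED_MODULES = [
    "pdfplumber",
    "nltk",
    "spacy",
    "matplotlib",
    "wordcloud",
    "en_core_web_sm",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

The check only confirms that the packages can be located; it does not verify the NLTK data (punkt, stopwords) downloaded during setup.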
Check out my LinkedIn profile.
This project is free to use under the MIT License.

