A production-ready Named Entity Recognition system for Twitter data using state-of-the-art Transformer models
Built by RATNESH SINGH
- Overview
- Features
- Demo
- Architecture
- Technology Stack
- Installation
- Usage
- UI Sections
- API Documentation
- Model Training
- Dataset
- Project Structure
- Performance
- Contributing
- License
- Contact
This project implements a Named Entity Recognition (NER) system specifically designed for Twitter data. It automatically identifies and classifies named entities such as persons, locations, companies, products, and more from informal, noisy tweet text.
Twitter generates ~500 million tweets per day. Understanding trends and topics requires going beyond simple hashtag analysis to extract meaningful entities from the content itself. This system addresses:
- Volume: Processing large-scale social media data
- Noise: Handling informal, unstructured user-generated content
- Accuracy: Providing fine-grained entity classification (10+ categories)
A full-stack web application powered by BERT (Bidirectional Encoder Representations from Transformers) that provides:
- Real-time entity extraction from text
- Interactive visualization of results
- Model training capabilities
- Comprehensive analytics dashboard
- Real-time NER: Instant entity extraction from any text input
- Multi-model Support: BERT, DistilBERT, RoBERTa, XLM-RoBERTa
- 10+ Entity Types: Person, Location, Company, Product, Facility, Music Artist, TV Show, Sports Team, and more
- Visual Analytics: Interactive charts and entity distribution graphs
- Model Training: Train custom models directly from the UI
- Business Case: Comprehensive project overview and impact analysis
- About: System features and capabilities
- Technical Documentation: Detailed technical guide from the research paper
- Analyze: Real-time entity extraction with visual highlighting
- Model & Training: Model selection and training interface
- Data Statistics: Dataset insights and entity distribution
- Logs: Real-time API monitoring
- Lazy Loading: Optimized startup time (<1 second)
- Caching: Smart data caching for improved response times (both sketched after this list)
- Async Processing: Non-blocking model operations
- Health Monitoring: Real-time backend status indicators
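A rough sketch of the lazy-loading and caching patterns above (illustrative only; names like `get_model` and `cached` are hypothetical, not this project's actual code):

```python
import time

_model = None      # loaded on first use, not at startup
_cache: dict = {}  # key -> (timestamp, value), with a short TTL

def get_model():
    """Load the model lazily so the server starts in well under a second."""
    global _model
    if _model is None:
        # Import and load only when the first request arrives.
        from transformers import pipeline
        # Placeholder checkpoint; the real app would load its fine-tuned model.
        _model = pipeline("token-classification", model="distilbert-base-uncased")
    return _model

def cached(key, compute, ttl=60):
    """Return a cached value, recomputing it once the TTL expires."""
    now = time.time()
    if key in _cache and now - _cache[key][0] < ttl:
        return _cache[key][1]
    value = compute()
    _cache[key] = (now, value)
    return value
```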
- Streamlit Profile - https://share.streamlit.io/user/ratnesh-181998
- Project Demo - https://twitter-ner-system-ab12c.streamlit.app/
Input:

```
Apple MacBook is the best laptop in the world
```

Output:

- Apple → Company
- MacBook → Product
- world → Geo-location
The application features:
- Color-coded entity highlighting
- Interactive entity distribution charts
- Real-time prediction results
- Detailed entity tables with counts
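The same prediction can be reproduced programmatically against the backend's `/predict` endpoint (documented in the API section below). A minimal client sketch, assuming the backend is running locally on port 8000:

```python
import requests

# Call the /predict endpoint with the demo sentence.
resp = requests.post(
    "http://localhost:8000/predict",
    json={"text": "Apple MacBook is the best laptop in the world"},
    timeout=30,
)
resp.raise_for_status()

result = resp.json()
for word, tag in zip(result["words"], result["entities"]):
    if tag != "O":  # print only the tokens tagged as entities
        print(f"{word} -> {tag}")
```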
```
┌───────────────────────────────────────────────────────────────┐
│                      Frontend (Streamlit)                      │
│  ┌──────────┬──────────┬──────────┬──────────┬──────────┐     │
│  │ Business │  About   │Technical │ Analyze  │  Model   │     │
│  │   Case   │          │   Docs   │          │ Training │     │
│  └──────────┴──────────┴──────────┴──────────┴──────────┘     │
└──────────────────────────┬────────────────────────────────────┘
                           │ REST API (HTTP)
┌──────────────────────────┼────────────────────────────────────┐
│                      Backend (FastAPI)                         │
│  ┌────────────────────────────────────────────────────────┐   │
│  │                     API Endpoints                      │   │
│  │    /predict   /train   /status   /data-stats   /logs   │   │
│  └────────────────────────────────────────────────────────┘   │
│  ┌────────────────────────────────────────────────────────┐   │
│  │                      Model Layer                       │   │
│  │   • BERT/DistilBERT/RoBERTa                            │   │
│  │   • Tokenization & Alignment                           │   │
│  │   • Training & Inference                               │   │
│  └────────────────────────────────────────────────────────┘   │
└──────────────────────────┬────────────────────────────────────┘
                           │
┌──────────────────────────┼────────────────────────────────────┐
│                       Data Layer                               │
│   • wnut 16.txt.conll (Training)                               │
│   • wnut 16test.txt.conll (Testing)                            │
│   • Saved Models (PyTorch)                                     │
└───────────────────────────────────────────────────────────────┘
```
Frontend:

- Streamlit 1.28.1 - Interactive web application framework
- Plotly 5.17.0 - Interactive data visualization
- Pandas 2.1.3 - Data manipulation and analysis
- annotated-text 4.0.1 - Text annotation display

Backend:

- FastAPI 0.104.1 - Modern, high-performance web framework
- Uvicorn 0.24.0 - ASGI server
- Pydantic 2.5.0 - Data validation

Machine Learning:

- PyTorch 2.1.1 - Deep learning framework
- Transformers 4.35.0 - Hugging Face transformers library
- NumPy 1.26.2 - Numerical computing
- BERT (bert-base-uncased) - 110M parameters
- DistilBERT (distilbert-base-uncased) - 66M parameters (faster)
- RoBERTa (roberta-base) - 125M parameters (improved BERT)
- XLM-RoBERTa (xlm-roberta-base) - Multilingual support
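Any of these checkpoints can be loaded for token classification through the Transformers Auto classes. A minimal sketch (the `num_labels` value is an assumption about this project's tag set, not a confirmed detail):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Any of the checkpoints above can be swapped in here.
checkpoint = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels must match the tag set; 21 assumes 10 entity types in
# B-/I- form plus the O tag.
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=21)

inputs = tokenizer("Apple MacBook is the best laptop", return_tensors="pt")
logits = model(**inputs).logits  # one score per subword token per label
```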
- Python 3.8 or higher
- pip package manager
- 4GB+ RAM (8GB recommended for training)
- Internet connection (for first-time model download)
```bash
git clone https://github.com/YOUR_USERNAME/twitter-ner-system.git
cd twitter-ner-system
pip install -r requirements.txt
python --version              # Should be 3.8+
pip list | grep transformers  # Verify transformers is installed
```

- Start the Backend Server

```bash
cd project/backend
python -m uvicorn main:app --port 8000 --reload
```

- Start the Frontend Application (in a new terminal)

```bash
cd project/frontend
streamlit run app.py --server.port 8501
```

- Access the Application
- Frontend: http://localhost:8501
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
On first run, the system will:
- Download the BERT model (~400MB) - takes 2-5 minutes
- Load and prepare the training data
- Initialize the model for inference
Note: Subsequent runs are instant due to caching!
- Objective: Understanding Twitter trends through NER
- Challenge: Processing 500M+ tweets/day with noisy data
- Solution: Automated entity extraction for trend analysis
- Impact: Improved content recommendation and ad targeting
- System features and capabilities
- Supported entity types (10+ categories)
- Model architecture overview
- Dataset information
Interactive navigation through:
- Problem Statement
- Data Description (CoNLL format, BIO tagging)
- Process Overview
- LSTM + CRF Model Training
- BERT Model Implementation
- Tokenization & Alignment (see the sketch after this list)
- Model Comparison
- Future Work & Questions
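The "Tokenization & Alignment" step deserves a note: BERT-style tokenizers split words into subwords, so word-level BIO tags must be mapped onto subword tokens. Conventionally, only a word's first subword keeps the label and the rest get -100 so the loss ignores them. A minimal sketch of the standard Hugging Face approach (not necessarily this project's exact code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["Harry", "Potter", "was", "living", "in", "London"]
tags = ["B-person", "I-person", "O", "O", "O", "B-geo-loc"]

enc = tokenizer(words, is_split_into_words=True)
aligned, prev = [], None
for word_id in enc.word_ids():
    if word_id is None or word_id == prev:
        aligned.append(-100)           # special tokens and extra subwords are ignored
    else:
        aligned.append(tags[word_id])  # first subword carries the word's tag
    prev = word_id

# Inspect the subword-to-label mapping.
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), aligned)))
```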
- Text Input: Enter any text for entity extraction
- Visual Output: Color-coded entity highlighting
- Analytics:
- Total entities found
- Unique entity types
- Entity distribution chart
- Detailed entity table with counts
- Model Selection: Choose from 5 model architectures
- Training Controls:
- Epochs (1-10)
- Batch Size (8-64)
- Real-time training progress
- Dataset Info: Training/validation split details
- Training samples: Count and distribution
- Test samples: Validation data overview
- Entity distribution: Visual breakdown
- Max sequence length: Data characteristics
- Real-time API activity monitoring
- Error tracking and debugging
- Download logs functionality
Base URL: `http://localhost:8000`

GET / - Returns API status and available endpoints.

POST /predict

```http
Content-Type: application/json

{
  "text": "Apple MacBook is the best laptop in the world"
}
```

Response:

```json
{
  "words": ["Apple", "MacBook", "is", "the", "best", "laptop", "in", "the", "world"],
  "entities": ["B-company", "B-product", "O", "O", "O", "O", "O", "O", "B-geo-loc"],
  "annotated": [
    {"word": "Apple", "entity": "B-company", "color": "#2980B9"},
    {"word": "MacBook", "entity": "B-product", "color": "#D35400"},
    ...
  ]
}
```

POST /train

```http
Content-Type: application/json

{
  "model_type": "bert-base-uncased",
  "epochs": 3,
  "batch_size": 32
}
```

Also available:

- GET /status
- GET /data-stats
- GET /logs?lines=100

To train a model from the UI:

- Select Model: Choose from BERT, DistilBERT, RoBERTa, or XLM-RoBERTa
- Configure Parameters:
- Epochs: Number of training iterations (recommended: 3-5)
- Batch Size: Samples per batch (recommended: 16-32)
- Start Training: Click "Start Training" button
- Monitor Progress: View real-time training status
- Model Saved: Automatically saved as `{model_name}_ner_model`
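The same run can be triggered without the UI by POSTing to `/train`. A sketch, assuming the backend is running on localhost:8000 (the exact response fields are whatever the backend returns):

```python
import requests

# Start a training job with the same parameters the UI exposes.
resp = requests.post(
    "http://localhost:8000/train",
    json={"model_type": "distilbert-base-uncased", "epochs": 3, "batch_size": 32},
    timeout=60,
)
print(resp.json())

# Poll GET /status for progress while training runs in the background.
print(requests.get("http://localhost:8000/status", timeout=10).json())
```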
- Format: CoNLL (BIO tagging scheme)
- Training Set: `wnut 16.txt.conll`
- Test Set: `wnut 16test.txt.conll`
- Entity Tags: 10+ fine-grained categories
- Use DistilBERT for faster training (66M params)
- Use BERT for best accuracy (110M params)
- Increase batch size if you have more RAM
- Monitor logs for training progress
- Source: Workshop on Noisy User-generated Text (WNUT) 2016
- Domain: Twitter/Social Media
- Format: CoNLL (one word per line, BIO tagging)
- Entities: 10 fine-grained types
- person - Names of people
- geo-loc - Geographic locations
- company - Company/organization names
- product - Product names
- facility - Buildings and facilities
- musicartist - Musicians and bands
- tvshow - TV show titles
- sportsteam - Sports team names
- movie - Movie titles
- other - Other named entities
```
Harry   B-person
Potter  I-person
was     O
living  O
in      O
London  B-geo-loc
```
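Reading this format takes only a few lines of Python. A sketch of a loader (illustrative; the project's actual loader in model_utils.py may differ):

```python
def read_conll(path):
    """Parse a CoNLL file into (words, tags) sentence pairs.

    Sentences are separated by blank lines; each non-blank line holds
    a word and its BIO tag separated by whitespace.
    """
    sentences, words, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line ends the current sentence
                if words:
                    sentences.append((words, tags))
                    words, tags = [], []
                continue
            word, tag = line.split()[:2]
            words.append(word)
            tags.append(tag)
    if words:  # keep the last sentence if the file lacks a trailing blank line
        sentences.append((words, tags))
    return sentences

# Example: sentences = read_conll("wnut 16.txt.conll")
```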
```
project/
├── backend/
│   ├── main.py               # FastAPI application
│   ├── model_utils.py        # NER model implementation
│   ├── train_initial.py      # Initial training script
│   └── ner_api.log           # API logs
├── frontend/
│   └── app.py                # Streamlit application
├── wnut 16.txt.conll         # Training data
├── wnut 16test.txt.conll     # Test data
├── tweeter-ner-nlp.pdf       # Technical documentation
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── LICENSE                   # MIT License
└── .gitignore                # Git ignore rules
```
| Model | Parameters | Accuracy | Speed | Memory |
|---|---|---|---|---|
| BERT | 110M | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 400MB |
| DistilBERT | 66M | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 250MB |
| RoBERTa | 125M | ⭐⭐⭐⭐⭐ | ⭐⭐ | 500MB |
| XLM-RoBERTa | 125M | ⭐⭐⭐⭐⭐ | ⭐⭐ | 500MB |
- ✅ Lazy model loading (startup < 1 second)
- ✅ Data caching (60-second TTL)
- ✅ Async API endpoints (sketched after this list)
- ✅ Batch processing support
- ✅ GPU acceleration (if available)
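To illustrate the async-endpoint item, blocking model inference can be offloaded to a worker thread so the event loop stays free for other requests. A sketch (not this project's actual main.py; `run_model` is a hypothetical stand-in):

```python
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

def run_model(text: str) -> dict:
    # Hypothetical stand-in for the real (blocking) model inference.
    words = text.split()
    return {"words": words, "entities": ["O"] * len(words)}

@app.post("/predict")
async def predict(req: PredictRequest):
    # Offload the blocking call so concurrent requests aren't starved.
    return await run_in_threadpool(run_model, req.text)

# Run with e.g.: uvicorn sketch:app --port 8001  (module name is hypothetical)
```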
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch
```bash
git checkout -b feature/amazing-feature
```

- Commit your changes

```bash
git commit -m 'Add amazing feature'
```

- Push to the branch

```bash
git push origin feature/amazing-feature
```
- Open a Pull Request
- Follow PEP 8 style guide
- Add docstrings to all functions
- Include unit tests for new features
- Update documentation as needed
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ❌ Liability
- ❌ Warranty
RATNESH SINGH
- 📧 Email: rattudacsit2021gate@gmail.com
- 💼 LinkedIn: https://www.linkedin.com/in/ratneshkumar1998/
- 🐙 GitHub: https://github.com/Ratnesh-181998
- 📱 Phone: +91-947XXXXX46
- 🌐 Live Demo: Streamlit
- 📚 Documentation: GitHub Wiki
- 🐛 Issue Tracker: GitHub Issues
- Hugging Face for the Transformers library
- WNUT-16 for the dataset
- FastAPI and Streamlit communities
- PyTorch team for the deep learning framework
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Sanh, V., et al. (2019). DistilBERT, a distilled version of BERT.
- Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
- WNUT-16 Shared Task on Named Entity Recognition in Twitter.
⭐ Star this repository if you find it helpful!
Made with ❤️ by RATNESH SINGH