A production-ready Named Entity Recognition system for Twitter data using state-of-the-art Transformer models
Built by RATNESH SINGH
- Overview
- Features
- Demo
- Architecture
- Technology Stack
- Installation
- Usage
- UI Sections
- API Documentation
- Model Training
- Dataset
- Project Structure
- Performance
- Contributing
- License
- Contact
This project implements a Named Entity Recognition (NER) system specifically designed for Twitter data. It automatically identifies and classifies named entities such as persons, locations, companies, products, and more from informal, noisy tweet text.
Twitter generates ~500 million tweets per day. Understanding trends and topics requires going beyond simple hashtag analysis to extract meaningful entities from the content itself. This system addresses:
- Volume: Processing large-scale social media data
- Noise: Handling informal, unstructured user-generated content
- Accuracy: Providing fine-grained entity classification (10+ categories)
A full-stack web application powered by BERT (Bidirectional Encoder Representations from Transformers) that provides:
- Real-time entity extraction from text
- Interactive visualization of results
- Model training capabilities
- Comprehensive analytics dashboard
- Real-time NER: Instant entity extraction from any text input
- Multi-model Support: BERT, DistilBERT, RoBERTa, XLM-RoBERTa
- 10+ Entity Types: Person, Location, Company, Product, Facility, Music Artist, TV Show, Sports Team, and more
- Visual Analytics: Interactive charts and entity distribution graphs
- Model Training: Train custom models directly from the UI
- Business Case: Comprehensive project overview and impact analysis
- About: System features and capabilities
- Technical Documentation: Detailed technical guide from the research paper
- Analyze: Real-time entity extraction with visual highlighting
- Model & Training: Model selection and training interface
- Data Statistics: Dataset insights and entity distribution
- Logs: Real-time API monitoring
- Lazy Loading: Optimized startup time (<1 second)
- Caching: Smart data caching for improved response times (both sketched after this list)
- Async Processing: Non-blocking model operations
- Health Monitoring: Real-time backend status indicators
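A rough sketch of the lazy-loading and caching patterns above (illustrative only; names like `get_model` and `cached` are hypothetical, not this project's actual code):

```python
import time

_model = None      # loaded on first use, not at startup
_cache: dict = {}  # key -> (timestamp, value), with a short TTL

def get_model():
    """Load the model lazily so the server starts in well under a second."""
    global _model
    if _model is None:
        # Import and load only when the first request arrives.
        from transformers import pipeline
        # Placeholder checkpoint; the real app would load its fine-tuned model.
        _model = pipeline("token-classification", model="distilbert-base-uncased")
    return _model

def cached(key, compute, ttl=60):
    """Return a cached value, recomputing it once the TTL expires."""
    now = time.time()
    if key in _cache and now - _cache[key][0] < ttl:
        return _cache[key][1]
    value = compute()
    _cache[key] = (now, value)
    return value
```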
- Streamlit Profile - https://share.streamlit.io/user/ratnesh-181998
- Project Demo - https://twitter-ner-system-ab12c.streamlit.app/
Input:

```
Apple MacBook is the best laptop in the world
```

Output:

- Apple → Company
- MacBook → Product
- world → Geo-location
The application features:
- Color-coded entity highlighting
- Interactive entity distribution charts
- Real-time prediction results
- Detailed entity tables with counts
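The same prediction can be reproduced programmatically against the backend's `/predict` endpoint (documented in the API section below). A minimal client sketch, assuming the backend is running locally on port 8000:

```python
import requests

# Call the /predict endpoint with the demo sentence.
resp = requests.post(
    "http://localhost:8000/predict",
    json={"text": "Apple MacBook is the best laptop in the world"},
    timeout=30,
)
resp.raise_for_status()

result = resp.json()
for word, tag in zip(result["words"], result["entities"]):
    if tag != "O":  # print only the tokens tagged as entities
        print(f"{word} -> {tag}")
```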
```
┌───────────────────────────────────────────────────────────────┐
│                      Frontend (Streamlit)                      │
│  ┌──────────┬──────────┬──────────┬──────────┬──────────┐     │
│  │ Business │  About   │Technical │ Analyze  │  Model   │     │
│  │   Case   │          │   Docs   │          │ Training │     │
│  └──────────┴──────────┴──────────┴──────────┴──────────┘     │
└──────────────────────────┬────────────────────────────────────┘
                           │ REST API (HTTP)
┌──────────────────────────┼────────────────────────────────────┐
│                      Backend (FastAPI)                         │
│  ┌────────────────────────────────────────────────────────┐   │
│  │                     API Endpoints                      │   │
│  │    /predict   /train   /status   /data-stats   /logs   │   │
│  └────────────────────────────────────────────────────────┘   │
│  ┌────────────────────────────────────────────────────────┐   │
│  │                      Model Layer                       │   │
│  │   • BERT/DistilBERT/RoBERTa                            │   │
│  │   • Tokenization & Alignment                           │   │
│  │   • Training & Inference                               │   │
│  └────────────────────────────────────────────────────────┘   │
└──────────────────────────┬────────────────────────────────────┘
                           │
┌──────────────────────────┼────────────────────────────────────┐
│                       Data Layer                               │
│   • wnut 16.txt.conll (Training)                               │
│   • wnut 16test.txt.conll (Testing)                            │
│   • Saved Models (PyTorch)                                     │
└───────────────────────────────────────────────────────────────┘
```
Frontend:

- Streamlit 1.28.1 - Interactive web application framework
- Plotly 5.17.0 - Interactive data visualization
- Pandas 2.1.3 - Data manipulation and analysis
- annotated-text 4.0.1 - Text annotation display

Backend:

- FastAPI 0.104.1 - Modern, high-performance web framework
- Uvicorn 0.24.0 - ASGI server
- Pydantic 2.5.0 - Data validation

Machine Learning:

- PyTorch 2.1.1 - Deep learning framework
- Transformers 4.35.0 - Hugging Face transformers library
- NumPy 1.26.2 - Numerical computing
- BERT (bert-base-uncased) - 110M parameters
- DistilBERT (distilbert-base-uncased) - 66M parameters (faster)
- RoBERTa (roberta-base) - 125M parameters (improved BERT)
- XLM-RoBERTa (xlm-roberta-base) - Multilingual support
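Any of these checkpoints can be loaded for token classification through the Transformers Auto classes. A minimal sketch (the `num_labels` value is an assumption about this project's tag set, not a confirmed detail):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Any of the checkpoints above can be swapped in here.
checkpoint = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels must match the tag set; 21 assumes 10 entity types in
# B-/I- form plus the O tag.
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=21)

inputs = tokenizer("Apple MacBook is the best laptop", return_tensors="pt")
logits = model(**inputs).logits  # one score per subword token per label
```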
- Python 3.8 or higher
- pip package manager
- 4GB+ RAM (8GB recommended for training)
- Internet connection (for first-time model download)
```bash
git clone https://github.com/YOUR_USERNAME/twitter-ner-system.git
cd twitter-ner-system
pip install -r requirements.txt
python --version              # Should be 3.8+
pip list | grep transformers  # Verify transformers is installed
```

- Start the Backend Server

```bash
cd project/backend
python -m uvicorn main:app --port 8000 --reload
```

- Start the Frontend Application (in a new terminal)

```bash
cd project/frontend
streamlit run app.py --server.port 8501
```

- Access the Application
- Frontend: http://localhost:8501
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
On first run, the system will:
- Download the BERT model (~400MB) - takes 2-5 minutes
- Load and prepare the training data
- Initialize the model for inference
Note: Subsequent runs are instant due to caching!
- Objective: Understanding Twitter trends through NER
- Challenge: Processing 500M+ tweets/day with noisy data
- Solution: Automated entity extraction for trend analysis
- Impact: Improved content recommendation and ad targeting
- System features and capabilities
- Supported entity types (10+ categories)
- Model architecture overview
- Dataset information
Interactive navigation through:
- Problem Statement
- Data Description (CoNLL format, BIO tagging)
- Process Overview
- LSTM + CRF Model Training
- BERT Model Implementation
- Tokenization & Alignment (see the sketch after this list)
- Model Comparison
- Future Work & Questions
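The "Tokenization & Alignment" step deserves a note: BERT-style tokenizers split words into subwords, so word-level BIO tags must be mapped onto subword tokens. Conventionally, only a word's first subword keeps the label and the rest get -100 so the loss ignores them. A minimal sketch of the standard Hugging Face approach (not necessarily this project's exact code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["Harry", "Potter", "was", "living", "in", "London"]
tags = ["B-person", "I-person", "O", "O", "O", "B-geo-loc"]

enc = tokenizer(words, is_split_into_words=True)
aligned, prev = [], None
for word_id in enc.word_ids():
    if word_id is None or word_id == prev:
        aligned.append(-100)           # special tokens and extra subwords are ignored
    else:
        aligned.append(tags[word_id])  # first subword carries the word's tag
    prev = word_id

# Inspect the subword-to-label mapping.
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), aligned)))
```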
- Text Input: Enter any text for entity extraction
- Visual Output: Color-coded entity highlighting
- Analytics:
- Total entities found
- Unique entity types
- Entity distribution chart
- Detailed entity table with counts
- Model Selection: Choose from 5 model architectures
- Training Controls:
- Epochs (1-10)
- Batch Size (8-64)
- Real-time training progress
- Dataset Info: Training/validation split details
- Training samples: Count and distribution
- Test samples: Validation data overview
- Entity distribution: Visual breakdown
- Max sequence length: Data characteristics
- Real-time API activity monitoring
- Error tracking and debugging
- Download logs functionality
Base URL: `http://localhost:8000`

GET / - Returns API status and available endpoints.

POST /predict

```http
Content-Type: application/json

{
  "text": "Apple MacBook is the best laptop in the world"
}
```

Response:

```json
{
  "words": ["Apple", "MacBook", "is", "the", "best", "laptop", "in", "the", "world"],
  "entities": ["B-company", "B-product", "O", "O", "O", "O", "O", "O", "B-geo-loc"],
  "annotated": [
    {"word": "Apple", "entity": "B-company", "color": "#2980B9"},
    {"word": "MacBook", "entity": "B-product", "color": "#D35400"},
    ...
  ]
}
```

POST /train

```http
Content-Type: application/json

{
  "model_type": "bert-base-uncased",
  "epochs": 3,
  "batch_size": 32
}
```

Also available:

- GET /status
- GET /data-stats
- GET /logs?lines=100

To train a model from the UI:

- Select Model: Choose from BERT, DistilBERT, RoBERTa, or XLM-RoBERTa
- Configure Parameters:
- Epochs: Number of training iterations (recommended: 3-5)
- Batch Size: Samples per batch (recommended: 16-32)
- Start Training: Click "Start Training" button
- Monitor Progress: View real-time training status
- Model Saved: Automatically saved as `{model_name}_ner_model`
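The same run can be triggered without the UI by POSTing to `/train`. A sketch, assuming the backend is running on localhost:8000 (the exact response fields are whatever the backend returns):

```python
import requests

# Start a training job with the same parameters the UI exposes.
resp = requests.post(
    "http://localhost:8000/train",
    json={"model_type": "distilbert-base-uncased", "epochs": 3, "batch_size": 32},
    timeout=60,
)
print(resp.json())

# Poll GET /status for progress while training runs in the background.
print(requests.get("http://localhost:8000/status", timeout=10).json())
```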
- Format: CoNLL (BIO tagging scheme)
- Training Set: `wnut 16.txt.conll`
- Test Set: `wnut 16test.txt.conll`
- Entity Tags: 10+ fine-grained categories
- Use DistilBERT for faster training (66M params)
- Use BERT for best accuracy (110M params)
- Increase batch size if you have more RAM
- Monitor logs for training progress
- Source: Workshop on Noisy User-generated Text (WNUT) 2016
- Domain: Twitter/Social Media
- Format: CoNLL (one word per line, BIO tagging)
- Entities: 10 fine-grained types
- person - Names of people
- geo-loc - Geographic locations
- company - Company/organization names
- product - Product names
- facility - Buildings and facilities
- musicartist - Musicians and bands
- tvshow - TV show titles
- sportsteam - Sports team names
- movie - Movie titles
- other - Other named entities
```
Harry   B-person
Potter  I-person
was     O
living  O
in      O
London  B-geo-loc
```
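Reading this format takes only a few lines of Python. A sketch of a loader (illustrative; the project's actual loader in model_utils.py may differ):

```python
def read_conll(path):
    """Parse a CoNLL file into (words, tags) sentence pairs.

    Sentences are separated by blank lines; each non-blank line holds
    a word and its BIO tag separated by whitespace.
    """
    sentences, words, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line ends the current sentence
                if words:
                    sentences.append((words, tags))
                    words, tags = [], []
                continue
            word, tag = line.split()[:2]
            words.append(word)
            tags.append(tag)
    if words:  # keep the last sentence if the file lacks a trailing blank line
        sentences.append((words, tags))
    return sentences

# Example: sentences = read_conll("wnut 16.txt.conll")
```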
```
project/
├── backend/
│   ├── main.py               # FastAPI application
│   ├── model_utils.py        # NER model implementation
│   ├── train_initial.py      # Initial training script
│   └── ner_api.log           # API logs
├── frontend/
│   └── app.py                # Streamlit application
├── wnut 16.txt.conll         # Training data
├── wnut 16test.txt.conll     # Test data
├── tweeter-ner-nlp.pdf       # Technical documentation
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── LICENSE                   # MIT License
└── .gitignore                # Git ignore rules
```
| Model | Parameters | Accuracy | Speed | Memory |
|---|---|---|---|---|
| BERT | 110M | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 400MB |
| DistilBERT | 66M | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 250MB |
| RoBERTa | 125M | ⭐⭐⭐⭐⭐ | ⭐⭐ | 500MB |
| XLM-RoBERTa | 125M | ⭐⭐⭐⭐⭐ | ⭐⭐ | 500MB |
- ✅ Lazy model loading (startup < 1 second)
- ✅ Data caching (60-second TTL)
- ✅ Async API endpoints (sketched after this list)
- ✅ Batch processing support
- ✅ GPU acceleration (if available)
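To illustrate the async-endpoint item, blocking model inference can be offloaded to a worker thread so the event loop stays free for other requests. A sketch (not this project's actual main.py; `run_model` is a hypothetical stand-in):

```python
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

def run_model(text: str) -> dict:
    # Hypothetical stand-in for the real (blocking) model inference.
    words = text.split()
    return {"words": words, "entities": ["O"] * len(words)}

@app.post("/predict")
async def predict(req: PredictRequest):
    # Offload the blocking call so concurrent requests aren't starved.
    return await run_in_threadpool(run_model, req.text)

# Run with e.g.: uvicorn sketch:app --port 8001  (module name is hypothetical)
```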
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch
```bash
git checkout -b feature/amazing-feature
```

- Commit your changes

```bash
git commit -m 'Add amazing feature'
```

- Push to the branch

```bash
git push origin feature/amazing-feature
```
- Open a Pull Request
- Follow PEP 8 style guide
- Add docstrings to all functions
- Include unit tests for new features
- Update documentation as needed
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ❌ Liability
- ❌ Warranty
RATNESH SINGH
- 📧 Email: rattudacsit2021gate@gmail.com
- 💼 LinkedIn: https://www.linkedin.com/in/ratneshkumar1998/
- 🐙 GitHub: https://github.com/Ratnesh-181998
- 📱 Phone: +91-947XXXXX46
- 🌐 Live Demo: Streamlit
- 📚 Documentation: GitHub Wiki
- 🐛 Issue Tracker: GitHub Issues
- Hugging Face for the Transformers library
- WNUT-16 for the dataset
- FastAPI and Streamlit communities
- PyTorch team for the deep learning framework
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Sanh, V., et al. (2019). DistilBERT, a distilled version of BERT.
- Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
- WNUT-16 Shared Task on Named Entity Recognition in Twitter.
⭐ Star this repository if you find it helpful!
Made with ❤️ by RATNESH SINGH