🐦 Twitter Named Entity Recognition System

Python FastAPI Streamlit Transformers License: MIT

A production-ready Named Entity Recognition system for Twitter data using state-of-the-art Transformer models

Built by RATNESH SINGH


πŸ“‹ Table of Contents

  • 🎯 Overview
  • ✨ Features
  • 🎬 Demo
  • πŸ—οΈ Architecture
  • πŸ› οΈ Technology Stack
  • πŸ“¦ Installation
  • πŸš€ Usage
  • 🎨 UI Sections
  • πŸ“‘ API Documentation
  • πŸŽ“ Model Training
  • πŸ“Š Dataset
  • πŸ“ Project Structure
  • πŸ“ˆ Performance
  • 🀝 Contributing
  • πŸ“„ License
  • πŸ“ž Contact
  • πŸ™ Acknowledgments
  • πŸ“š References

🎯 Overview

This project implements a Named Entity Recognition (NER) system specifically designed for Twitter data. It automatically identifies and classifies named entities such as persons, locations, companies, products, and more from informal, noisy tweet text.

Problem Statement

Twitter generates ~500 million tweets per day. Understanding trends and topics requires going beyond simple hashtag analysis to extract meaningful entities from the content itself. This system addresses:

  • Volume: Processing large-scale social media data
  • Noise: Handling informal, unstructured user-generated content
  • Accuracy: Providing fine-grained entity classification (10 categories)

Solution

A full-stack web application powered by BERT (Bidirectional Encoder Representations from Transformers) that provides:

  • Real-time entity extraction from text
  • Interactive visualization of results
  • Model training capabilities
  • Comprehensive analytics dashboard

✨ Features

πŸ” Core Functionality

  • Real-time NER: Instant entity extraction from any text input
  • Multi-model Support: BERT, DistilBERT, RoBERTa, XLM-RoBERTa
  • 10 Entity Types: Person, Location, Company, Product, Facility, Music Artist, TV Show, Sports Team, Movie, and Other
  • Visual Analytics: Interactive charts and entity distribution graphs
  • Model Training: Train custom models directly from the UI

🎨 User Interface

  • Business Case: Comprehensive project overview and impact analysis
  • About: System features and capabilities
  • Technical Documentation: Detailed technical guide from the research paper
  • Analyze: Real-time entity extraction with visual highlighting
  • Model & Training: Model selection and training interface
  • Data Statistics: Dataset insights and entity distribution
  • Logs: Real-time API monitoring

πŸš€ Performance

  • Lazy Loading: Optimized startup time (<1 second)
  • Caching: Smart data caching for improved response times
  • Async Processing: Non-blocking model operations
  • Health Monitoring: Real-time backend status indicators

🎬 Demo

Entity Extraction Example

Input:

Apple MacBook is the best laptop in the world

Output:

  • Apple β†’ Company
  • MacBook β†’ Product
  • world β†’ Geo-location

Visual Interface

The application features:

  • Color-coded entity highlighting
  • Interactive entity distribution charts
  • Real-time prediction results
  • Detailed entity tables with counts

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Frontend (Streamlit)                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚Business  β”‚  About   β”‚Technical β”‚ Analyze  β”‚  Model   β”‚  β”‚
β”‚  β”‚  Case    β”‚          β”‚   Docs   β”‚          β”‚ Training β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ REST API (HTTP)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Backend (FastAPI)                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚              API Endpoints                            β”‚  β”‚
β”‚  β”‚  /predict  /train  /status  /data-stats  /logs       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚              Model Layer                              β”‚  β”‚
β”‚  β”‚  β€’ BERT/DistilBERT/RoBERTa                           β”‚  β”‚
β”‚  β”‚  β€’ Tokenization & Alignment                          β”‚  β”‚
β”‚  β”‚  β€’ Training & Inference                              β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Data Layer                                β”‚
β”‚  β€’ wnut 16.txt.conll (Training)                             β”‚
β”‚  β€’ wnut 16test.txt.conll (Testing)                          β”‚
β”‚  β€’ Saved Models (PyTorch)                                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ› οΈ Technology Stack

Frontend

  • Streamlit 1.28.1 - Interactive web application framework
  • Plotly 5.17.0 - Interactive data visualization
  • Pandas 2.1.3 - Data manipulation and analysis
  • annotated-text 4.0.1 - Text annotation display

Backend

  • FastAPI 0.104.1 - Modern, high-performance web framework
  • Uvicorn 0.24.0 - ASGI server
  • Pydantic 2.5.0 - Data validation

Machine Learning

  • PyTorch 2.1.1 - Deep learning framework
  • Transformers 4.35.0 - Hugging Face transformers library
  • NumPy 1.26.2 - Numerical computing

Supported Models

  1. BERT (bert-base-uncased) - 110M parameters
  2. DistilBERT (distilbert-base-uncased) - 66M parameters (faster)
  3. RoBERTa (roberta-base) - 125M parameters (improved BERT)
  4. XLM-RoBERTa (xlm-roberta-base) - Multilingual support
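
All four checkpoints load through the same Transformers interface, so switching architectures is a one-line change. A minimal sketch (the 21-label count is an assumption: the 10 WNUT-16 entity types in B-/I- form plus the O tag):

from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # or bert-base-uncased, roberta-base, xlm-roberta-base
NUM_LABELS = 21                         # assumption: 10 entity types x {B-, I-} + "O"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# The token-classification head is randomly initialized until fine-tuned.
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)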

πŸ“¦ Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • 4GB+ RAM (8GB recommended for training)
  • Internet connection (for first-time model download)

Step 1: Clone the Repository

git clone https://github.com/Ratnesh-181998/Twitter-NER-System.git
cd Twitter-NER-System

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Verify Installation

python --version  # Should be 3.8+
pip list | grep transformers  # Verify transformers is installed

πŸš€ Usage

Quick Start

  1. Start the Backend Server
cd project/backend
python -m uvicorn main:app --port 8000 --reload
  2. Start the Frontend Application (in a new terminal)
cd project/frontend
streamlit run app.py --server.port 8501
  3. Access the Application
    • Frontend UI: http://localhost:8501
    • Backend API: http://localhost:8000

First-Time Setup

On first run, the system will:

  1. Download the BERT model (~400MB) - takes 2-5 minutes
  2. Load and prepare the training data
  3. Initialize the model for inference

Note: Subsequent runs are instant due to caching!
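
Once both servers are running, a quick smoke test against the health-check endpoint confirms the backend is reachable (a minimal sketch; the requests package is not listed in requirements.txt and may need a separate install):

import requests

resp = requests.get("http://localhost:8000/")  # health-check endpoint
resp.raise_for_status()
print(resp.json())                             # API status and available endpoints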


🎨 UI Sections

1. πŸ’Ό Business Case

  • Objective: Understanding Twitter trends through NER
  • Challenge: Processing 500M+ tweets/day with noisy data
  • Solution: Automated entity extraction for trend analysis
  • Impact: Improved content recommendation and ad targeting

2. ℹ️ About

  • System features and capabilities
  • Supported entity types (10 categories)
  • Model architecture overview
  • Dataset information

3. πŸ“š Technical Documentation

Interactive navigation through:

  • Problem Statement
  • Data Description (CoNLL format, BIO tagging)
  • Process Overview
  • LSTM + CRF Model Training
  • BERT Model Implementation
  • Tokenization & Alignment
  • Model Comparison
  • Future Work & Questions

4. πŸ” Analyze

  • Text Input: Enter any text for entity extraction
  • Visual Output: Color-coded entity highlighting
  • Analytics:
    • Total entities found
    • Unique entity types
    • Entity distribution chart
    • Detailed entity table with counts

5. πŸ› οΈ Model & Training

  • Model Selection: Choose from the four supported architectures
  • Training Controls:
    • Epochs (1-10)
    • Batch Size (8-64)
    • Real-time training progress
  • Dataset Info: Training/validation split details

6. πŸ“Š Data Statistics

  • Training samples: Count and distribution
  • Test samples: Validation data overview
  • Entity distribution: Visual breakdown
  • Max sequence length: Data characteristics

7. πŸ“ Logs

  • Real-time API activity monitoring
  • Error tracking and debugging
  • Download logs functionality

πŸ“‘ API Documentation

Base URL

http://localhost:8000

Endpoints

1. Health Check

GET /

Returns API status and available endpoints.

2. Predict Entities

POST /predict
Content-Type: application/json

{
  "text": "Apple MacBook is the best laptop in the world"
}

Response:

{
  "words": ["Apple", "MacBook", "is", "the", "best", "laptop", "in", "the", "world"],
  "entities": ["B-company", "B-product", "O", "O", "O", "O", "O", "O", "B-geo-loc"],
  "annotated": [
    {"word": "Apple", "entity": "B-company", "color": "#2980B9"},
    {"word": "MacBook", "entity": "B-product", "color": "#D35400"},
    ...
  ]
}
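
A minimal Python client for this endpoint, assuming the backend is running locally on port 8000 (requires the requests package):

import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"text": "Apple MacBook is the best laptop in the world"},
)
resp.raise_for_status()
result = resp.json()

# Pair each word with its predicted BIO tag, skipping non-entities.
for word, tag in zip(result["words"], result["entities"]):
    if tag != "O":
        print(f"{word:>10} -> {tag}")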

3. Train Model

POST /train
Content-Type: application/json

{
  "model_type": "bert-base-uncased",
  "epochs": 3,
  "batch_size": 32
}

4. Training Status

GET /status

5. Data Statistics

GET /data-stats

6. API Logs

GET /logs?lines=100
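
Training runs asynchronously, so a typical client POSTs to /train and then polls /status. A sketch (the exact shape of the /status payload is not documented here, so the loop simply prints the raw JSON):

import time
import requests

BASE = "http://localhost:8000"

# Kick off a training run with the parameters documented above.
requests.post(BASE + "/train", json={
    "model_type": "distilbert-base-uncased",
    "epochs": 3,
    "batch_size": 32,
}).raise_for_status()

for _ in range(60):                             # poll for up to ~10 minutes
    print(requests.get(BASE + "/status").json())
    time.sleep(10)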

πŸŽ“ Model Training

Training Process

  1. Select Model: Choose from BERT, DistilBERT, RoBERTa, or XLM-RoBERTa
  2. Configure Parameters:
    • Epochs: Number of training iterations (recommended: 3-5)
    • Batch Size: Samples per batch (recommended: 16-32)
  3. Start Training: Click "Start Training" button
  4. Monitor Progress: View real-time training status
  5. Model Saved: Automatically saved as {model_name}_ner_model

Training Data

  • Format: CoNLL (BIO tagging scheme)
  • Training Set: wnut 16.txt.conll
  • Test Set: wnut 16test.txt.conll
  • Entity Tags: 10 fine-grained categories
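
The CoNLL data is labelled per word, but BERT tokenizers split words into subwords, so labels must be re-aligned before training (the "Tokenization & Alignment" step mentioned above). A sketch of the standard Hugging Face recipe, not necessarily the exact code in model_utils.py: each label goes to its word's first subword, and continuation subwords and special tokens are masked with -100 so PyTorch's cross-entropy loss ignores them.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["Harry", "Potter", "was", "living", "in", "London"]
labels = ["B-person", "I-person", "O", "O", "O", "B-geo-loc"]

encoding = tokenizer(words, is_split_into_words=True, truncation=True)

aligned, previous_word = [], None
for word_idx in encoding.word_ids():
    if word_idx is None:                # special tokens like [CLS]/[SEP]
        aligned.append(-100)
    elif word_idx != previous_word:     # first subword keeps the word's label
        aligned.append(labels[word_idx])
    else:                               # continuation subwords are masked
        aligned.append(-100)
    previous_word = word_idx

print(list(zip(encoding.tokens(), aligned)))
# In real training the string labels would be mapped to integer ids first.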

Performance Tips

  • Use DistilBERT for faster training (66M params)
  • Use BERT for best accuracy (110M params)
  • Increase batch size if you have more RAM
  • Monitor logs for training progress

πŸ“Š Dataset

WNUT-16 Dataset

  • Source: Workshop on Noisy User-generated Text (WNUT) 2016
  • Domain: Twitter/Social Media
  • Format: CoNLL (one word per line, BIO tagging)
  • Entities: 10 fine-grained types

Entity Types

  1. person - Names of people
  2. geo-loc - Geographic locations
  3. company - Company/organization names
  4. product - Product names
  5. facility - Buildings and facilities
  6. musicartist - Musicians and bands
  7. tvshow - TV show titles
  8. sportsteam - Sports team names
  9. movie - Movie titles
  10. other - Other named entities

Example Format

Harry       B-person
Potter      I-person
was         O
living      O
in          O
London      B-geo-loc
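
A minimal reader for this format (a sketch, assuming whitespace-separated "word TAG" pairs with blank lines between tweets, as in the example above):

def read_conll(path):
    """Yield (words, tags) pairs, one per sentence/tweet."""
    words, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                 # blank line ends a sentence
                if words:
                    yield words, tags
                    words, tags = [], []
                continue
            parts = line.split()
            words.append(parts[0])       # first column: token
            tags.append(parts[-1])       # last column: BIO tag
    if words:                            # handle a missing trailing blank line
        yield words, tags

# e.g. train_sentences = list(read_conll("wnut 16.txt.conll"))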

πŸ“ Project Structure

project/
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ main.py                 # FastAPI application
β”‚   β”œβ”€β”€ model_utils.py          # NER model implementation
β”‚   β”œβ”€β”€ train_initial.py        # Initial training script
β”‚   └── ner_api.log            # API logs
β”œβ”€β”€ frontend/
β”‚   └── app.py                  # Streamlit application
β”œβ”€β”€ wnut 16.txt.conll          # Training data
β”œβ”€β”€ wnut 16test.txt.conll      # Test data
β”œβ”€β”€ tweeter-ner-nlp.pdf        # Technical documentation
β”œβ”€β”€ requirements.txt            # Python dependencies
β”œβ”€β”€ README.md                   # This file
β”œβ”€β”€ LICENSE                     # MIT License
└── .gitignore                 # Git ignore rules

πŸ“ˆ Performance

Model Comparison

Model          Parameters   Accuracy    Speed       Memory
BERT           110M         ⭐⭐⭐⭐⭐      ⭐⭐⭐        400MB
DistilBERT     66M          ⭐⭐⭐⭐       ⭐⭐⭐⭐⭐      250MB
RoBERTa        125M         ⭐⭐⭐⭐⭐      ⭐⭐         500MB
XLM-RoBERTa    125M         ⭐⭐⭐⭐⭐      ⭐⭐         500MB

Optimization Features

  • βœ… Lazy model loading (startup < 1 second)
  • βœ… Data caching (60-second TTL)
  • βœ… Async API endpoints
  • βœ… Batch processing support
  • βœ… GPU acceleration (if available)
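
The lazy-loading pattern behind the sub-second startup works roughly as follows: nothing model-related happens at import time, and the first request pays the one-time load cost. An illustrative sketch, not the project's actual code (the saved-model path follows the {model_name}_ner_model naming scheme described above):

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

_model, _tokenizer = None, None

def get_model():
    """Load the fine-tuned model on first use only (lazy loading)."""
    global _model, _tokenizer
    if _model is None:                   # first call pays the load cost
        path = "bert-base-uncased_ner_model"
        _tokenizer = AutoTokenizer.from_pretrained(path)
        _model = AutoModelForTokenClassification.from_pretrained(path)
        if torch.cuda.is_available():    # GPU acceleration when present
            _model = _model.cuda()
        _model.eval()
    return _model, _tokenizer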

🀝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch
    git checkout -b feature/amazing-feature
  3. Commit your changes
    git commit -m 'Add amazing feature'
  4. Push to the branch
    git push origin feature/amazing-feature
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guide
  • Add docstrings to all functions
  • Include unit tests for new features
  • Update documentation as needed

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary

  • βœ… Commercial use
  • βœ… Modification
  • βœ… Distribution
  • βœ… Private use
  • ❌ Liability
  • ❌ Warranty

πŸ“ž Contact

RATNESH SINGH


πŸ™ Acknowledgments

  • Hugging Face for the Transformers library
  • WNUT-16 for the dataset
  • FastAPI and Streamlit communities
  • PyTorch team for the deep learning framework

πŸ“š References

  1. Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  2. Sanh, V., et al. (2019). DistilBERT, a distilled version of BERT.
  3. Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  4. WNUT-16 Shared Task on Named Entity Recognition in Twitter.

⭐ Star this repository if you find it helpful!

Made with ❀️ by RATNESH SINGH
