This toolkit provides a comprehensive pipeline for analyzing semantic change of words over time using contextualized word embeddings from Transformer models. It includes modules for corpus ingestion, embedding generation, word sense induction, and visualization. A user-friendly GUI allows for interactive configuration and analysis.
The analysis process is divided into three main stages:
1. **Data Ingestion (Model-Agnostic):** Raw text corpora from different time periods are processed into a structured format (SQLite database). This step involves tokenization and lemmatization. It only needs to be run once per corpus.
2. **Batch Embedding & Ranking (Model-Specific):**
   - The system queries sentences from the ingested databases.
   - It generates embeddings for frequent words using your chosen Transformer model.
   - It then ranks these words by their semantic shift (cosine distance) to help you find interesting cases.
3. **Detailed Analysis (Visualization):**
   - You select a specific word (e.g., one with a high shift score).
   - The system clusters its embeddings to find distinct senses (WSI).
   - Visualizations show how these senses evolve over time.
Key Takeaway: You can experiment with different models (e.g., BERT, RoBERTa) on the same ingested data without ever needing to re-ingest the corpus.
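The ranking in stage 2 amounts to comparing a word's average embedding across the two periods. Here is a minimal pure-Python sketch with toy 2-D vectors (the real pipeline uses high-dimensional Transformer embeddings; function names here are illustrative, not the toolkit's API):

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_vector(vectors):
    # Component-wise average of a list of vectors
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def shift_score(embeddings_t1, embeddings_t2):
    # Semantic shift: cosine distance between the word's mean
    # embedding in period 1 and its mean embedding in period 2.
    return cosine_distance(mean_vector(embeddings_t1), mean_vector(embeddings_t2))

# Toy 2-D "embeddings" of one word in two periods
t1 = [[1.0, 0.1], [0.9, 0.0]]
t2 = [[0.1, 1.0], [0.0, 0.9]]
print(round(shift_score(t1, t2), 3))  # → 0.895 (large shift)
```

A score near 0 means the word is used in similar contexts in both periods; a score near 1 flags it as a candidate for detailed analysis.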
If you are new to Python, follow these steps to get started:
1. **Install Python 3.12:**
   - Go to python.org.
   - Download version 3.12 for your operating system (Windows/macOS).
   - Run the installer. Important: On Windows, make sure to check the box that says "Add Python to PATH" before clicking Install.
2. **Open a Terminal:**
   - Windows: Press the Windows Key, type `PowerShell`, and press Enter.
   - macOS: Press `Cmd + Space`, type `Terminal`, and press Enter. (On Linux, open your distribution's terminal application.)
3. **Install the Project Manager (`uv`):**
   - In the terminal window, type the following command and press Enter:
     ```bash
     pip install uv
     ```
   - This tool will help you install all other required software easily.
Requirement: Python 3.12
1. **Clone the repository:**
   ```bash
   git clone <repository_url>
   cd <repository_name>
   ```
2. **Install Dependencies:** This project uses `uv` for fast package management.
   ```bash
   pip install uv
   uv sync
   ```
3. **Run Tests:** Verify the installation with the test suite.
   ```bash
   uv run pytest tests/ -v
   ```
Place your raw text files for each time period into separate directories, for example:
```
data_source/t1/
data_source/t2/
```
Then, run the ingestion script:
```bash
uv run python src/run_ingest.py --input-t1 data_source/t1 --input-t2 data_source/t2 --label-t1 1800 --label-t2 1900
```

This will create `data/corpus_t1.db` and `data/corpus_t2.db`. You only need to do this once.
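Conceptually, ingestion writes each sentence into an SQLite table tagged with its period label. The sketch below uses an illustrative schema (the actual tables in `ingestor.py` may differ, and the naive whitespace tokenization stands in for real tokenization and lemmatization):

```python
import sqlite3

def ingest(sentences, conn, period_label):
    """Store one row per sentence, tagged with its time-period label.
    Illustrative schema only; the real pipeline also lemmatizes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentences "
        "(id INTEGER PRIMARY KEY, period TEXT, text TEXT, tokens TEXT)"
    )
    for text in sentences:
        # Naive lowercase whitespace tokenization as a placeholder
        # for the real tokenization + lemmatization step.
        tokens = " ".join(text.lower().split())
        conn.execute(
            "INSERT INTO sentences (period, text, tokens) VALUES (?, ?, ?)",
            (period_label, text, tokens),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")  # the real pipeline writes data/corpus_t1.db
ingest(["The factory was built.", "Workers ran the factory."], conn, "1800")
count = conn.execute("SELECT COUNT(*) FROM sentences").fetchone()[0]
print(count)  # → 2
```

Because the databases store model-agnostic text, every later stage can reuse them with any Transformer model.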
The easiest way to run an analysis is through the Streamlit-based GUI.
```bash
uv run streamlit run gui.py
```

This will launch a web interface (usually at http://localhost:8501) where you can:
- Configure Settings: Set the data directory, embedding model (from Hugging Face), and analysis parameters.
- Run Single Word Analysis: Interactively analyze a focus word, view its clusters, and see the nearest neighbors for each sense.
- Run Batch Analysis: Pre-compute embeddings for all shared nouns between the two corpora.
- View Reports: Generate and view a Markdown report comparing word frequencies.
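Under the hood, the sense clustering (WSI) used in single word analysis groups a word's contextual embeddings so that each cluster approximates one sense. A toy k-means sketch in pure Python illustrates the idea (the actual algorithm and parameters in `wsi.py` may differ):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means over embedding vectors: each cluster stands in
    for one word sense. Illustrative only, not the wsi.py API."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean).
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            groups[dists.index(min(dists))].append(p)
        # Recompute each center as the mean of its group.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return [min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points]

# Two well-separated "senses" in toy 2-D embedding space
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
labels = kmeans(points, 2)
print(labels)  # two points per cluster
```

Counting how many of each period's embeddings fall into each cluster is what drives the "senses over time" visualizations.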
You can also run analyses directly from the command line:
**Step 1: Batch Embedding Generation.** Pre-compute embeddings for frequent words.

```bash
uv run python -m src.semantic_change.embeddings_generation --model bert-base-uncased --max-samples 200
```

**Step 2: Rank Semantic Change.** Calculate the shift for all shared words to find the most interesting ones.

```bash
uv run python src/rank_semantic_change.py --output output/ranking.csv
```

**Step 3: Single Word Analysis.** Deep dive into a specific word.

```bash
uv run python main.py --word factory --model bert-base-uncased
```

This project includes tools to run computationally intensive tasks (Ingestion, Embedding) on a SLURM-based HPC cluster.
See user_guide.md for detailed instructions on:
- Pushing code and data to the cluster (`src.cli.hpc push`).
- Submitting jobs (`src.cli.hpc submit`).
- Pulling results back to your local machine (`src.cli.hpc pull`).
```
├── gui.py                      # Streamlit entry point
├── main.py                     # CLI entry point
├── config.json                 # Runtime configuration
├── pyproject.toml              # Dependencies and pytest config
├── src/
│   ├── gui_app.py              # Streamlit UI (view layer)
│   ├── main.py                 # CLI analysis logic
│   ├── semantic_change/
│   │   ├── config_manager.py   # Configuration management (AppConfig dataclass)
│   │   ├── services.py         # Business logic (StatsService, ClusterService)
│   │   ├── corpus.py           # SQLite corpus access
│   │   ├── embedding.py        # Transformer embeddings (BertEmbedder)
│   │   ├── vector_store.py     # ChromaDB cache for embeddings
│   │   ├── wsi.py              # Word Sense Induction (clustering)
│   │   ├── visualization.py    # Plotly interactive visualizations
│   │   └── ingestor.py         # Corpus ingestion pipeline
│   └── utils/
│       └── dependencies.py     # Dependency checking utilities
├── tests/                      # Unit tests (pytest)
│   ├── test_config_manager.py
│   ├── test_services.py
│   ├── test_dependencies.py
│   └── test_vector_store.py
└── data/
    ├── corpus_t1.db            # Ingested corpus (period 1)
    ├── corpus_t2.db            # Ingested corpus (period 2)
    └── chroma_db/              # Cached embeddings (ChromaDB)
```
The codebase follows an MVC-like pattern:
- **View Layer** (`gui_app.py`): Handles Streamlit rendering and user input
- **Service Layer** (`services.py`): Business logic for statistics and cluster operations
- **Data Layer** (`corpus.py`, `vector_store.py`): Database and cache access
- **Configuration** (`config_manager.py`): Centralized settings using the dataclass pattern
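The configuration layer's dataclass pattern keeps all settings in one typed object that can round-trip through a JSON file such as `config.json`. A minimal sketch of the pattern (field names and methods here are illustrative assumptions, not the actual `AppConfig` API):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class AppConfig:
    # Illustrative fields; the real AppConfig in config_manager.py may differ.
    data_dir: str = "data"
    model_name: str = "bert-base-uncased"
    max_samples: int = 200

    @classmethod
    def load(cls, path):
        """Read settings from a JSON file, ignoring unknown keys and
        falling back to the defaults above for missing ones."""
        with open(path) as f:
            raw = json.load(f)
        known = {k: v for k, v in raw.items() if k in cls.__dataclass_fields__}
        return cls(**known)

    def save(self, path):
        """Write the current settings back out as JSON."""
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

cfg = AppConfig(model_name="roberta-base")
print(cfg.model_name)  # → roberta-base
```

Because the dataclass declares types and defaults in one place, both the GUI and the CLI can share the same validated settings object.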
This project is developed by Fotis Jannidis with significant contributions from Claude and Gemini. See CONTRIBUTORS.md for details.