This toolkit provides a comprehensive pipeline for analyzing semantic change of words over time using contextualized word embeddings from Transformer models. It includes modules for corpus ingestion, embedding generation, word sense induction, and visualization. A user-friendly GUI allows for interactive configuration and analysis.
The analysis process is divided into three main stages:
1. **Data Ingestion (Model-Agnostic):** Raw text corpora from different time periods are processed into a structured format (SQLite database). This step involves tokenization and lemmatization. It only needs to be run once per corpus.
2. **Batch Embedding & Ranking (Model-Specific):**
   - The system queries sentences from the ingested databases.
   - It generates embeddings for frequent words using your chosen Transformer model.
   - It then ranks these words by their semantic shift (cosine distance) to help you find interesting cases.
3. **Detailed Analysis (Visualization):**
   - You select a specific word (e.g., one with a high shift score).
   - The system clusters its embeddings to find distinct senses (WSI).
   - Visualizations show how these senses evolve over time.
Key Takeaway: You can experiment with different models (e.g., BERT, RoBERTa) on the same ingested data without ever needing to re-ingest the corpus.
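The ranking in stage 2 amounts to comparing a word's average embedding across the two periods. Here is a minimal pure-Python sketch with toy 2-D vectors (the real pipeline uses high-dimensional Transformer embeddings; function names here are illustrative, not the toolkit's API):

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_vector(vectors):
    # Component-wise average of a list of vectors
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def shift_score(embeddings_t1, embeddings_t2):
    # Semantic shift: cosine distance between the word's mean
    # embedding in period 1 and its mean embedding in period 2.
    return cosine_distance(mean_vector(embeddings_t1), mean_vector(embeddings_t2))

# Toy 2-D "embeddings" of one word in two periods
t1 = [[1.0, 0.1], [0.9, 0.0]]
t2 = [[0.1, 1.0], [0.0, 0.9]]
print(round(shift_score(t1, t2), 3))  # → 0.895 (large shift)
```

A score near 0 means the word is used in similar contexts in both periods; a score near 1 flags it as a candidate for detailed analysis.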
If you are new to Python, follow these steps to get started:
1. **Install Python 3.12:**
   - Go to python.org.
   - Download version 3.12 for your operating system (Windows/macOS).
   - Run the installer. Important: On Windows, make sure to check the box that says "Add Python to PATH" before clicking Install.
2. **Open a Terminal:**
   - Windows: Press the Windows Key, type `PowerShell`, and press Enter.
   - macOS: Press `Cmd + Space`, type `Terminal`, and press Enter. (On Linux, open your distribution's terminal application.)
3. **Install the Project Manager (`uv`):**
   - In the terminal window, type the following command and press Enter:
     ```bash
     pip install uv
     ```
   - This tool will help you install all other required software easily.
Requirement: Python 3.12
1. **Clone the repository:**
   ```bash
   git clone <repository_url>
   cd <repository_name>
   ```
2. **Install Dependencies:** This project uses `uv` for fast package management.
   ```bash
   pip install uv
   uv sync
   ```
3. **Run Tests:** Verify the installation with the test suite.
   ```bash
   uv run pytest tests/ -v
   ```
Place your raw text files for each time period into separate directories, for example:
```
data_source/t1/
data_source/t2/
```
Then, run the ingestion script:
```bash
uv run python src/run_ingest.py --input-t1 data_source/t1 --input-t2 data_source/t2 --label-t1 1800 --label-t2 1900
```

This will create `data/corpus_t1.db` and `data/corpus_t2.db`. You only need to do this once.
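Conceptually, ingestion writes each sentence into an SQLite table tagged with its period label. The sketch below uses an illustrative schema (the actual tables in `ingestor.py` may differ, and the naive whitespace tokenization stands in for real tokenization and lemmatization):

```python
import sqlite3

def ingest(sentences, conn, period_label):
    """Store one row per sentence, tagged with its time-period label.
    Illustrative schema only; the real pipeline also lemmatizes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentences "
        "(id INTEGER PRIMARY KEY, period TEXT, text TEXT, tokens TEXT)"
    )
    for text in sentences:
        # Naive lowercase whitespace tokenization as a placeholder
        # for the real tokenization + lemmatization step.
        tokens = " ".join(text.lower().split())
        conn.execute(
            "INSERT INTO sentences (period, text, tokens) VALUES (?, ?, ?)",
            (period_label, text, tokens),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")  # the real pipeline writes data/corpus_t1.db
ingest(["The factory was built.", "Workers ran the factory."], conn, "1800")
count = conn.execute("SELECT COUNT(*) FROM sentences").fetchone()[0]
print(count)  # → 2
```

Because the databases store model-agnostic text, every later stage can reuse them with any Transformer model.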
The easiest way to run an analysis is through the Streamlit-based GUI.
```bash
uv run streamlit run gui.py
```

This will launch a web interface (usually at http://localhost:8501) where you can:
- Configure Settings: Set the data directory, embedding model (from Hugging Face), and analysis parameters.
- Run Single Word Analysis: Interactively analyze a focus word, view its clusters, and see the nearest neighbors for each sense.
- Run Batch Analysis: Pre-compute embeddings for all shared nouns between the two corpora.
- View Reports: Generate and view a Markdown report comparing word frequencies.
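Under the hood, the sense clustering (WSI) used in single word analysis groups a word's contextual embeddings so that each cluster approximates one sense. A toy k-means sketch in pure Python illustrates the idea (the actual algorithm and parameters in `wsi.py` may differ):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means over embedding vectors: each cluster stands in
    for one word sense. Illustrative only, not the wsi.py API."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean).
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            groups[dists.index(min(dists))].append(p)
        # Recompute each center as the mean of its group.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return [min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points]

# Two well-separated "senses" in toy 2-D embedding space
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
labels = kmeans(points, 2)
print(labels)  # two points per cluster
```

Counting how many of each period's embeddings fall into each cluster is what drives the "senses over time" visualizations.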
You can also run analyses directly from the command line:
**Step 1: Batch Embedding Generation.** Pre-compute embeddings for frequent words.

```bash
uv run python -m src.semantic_change.embeddings_generation --model bert-base-uncased --max-samples 200
```

**Step 2: Rank Semantic Change.** Calculate the shift for all shared words to find the most interesting ones.

```bash
uv run python src/rank_semantic_change.py --output output/ranking.csv
```

**Step 3: Single Word Analysis.** Deep dive into a specific word.

```bash
uv run python main.py --word factory --model bert-base-uncased
```

This project includes tools to run computationally intensive tasks (Ingestion, Embedding) on a SLURM-based HPC cluster.
See user_guide.md for detailed instructions on:
- Pushing code and data to the cluster (`src.cli.hpc push`).
- Submitting jobs (`src.cli.hpc submit`).
- Pulling results back to your local machine (`src.cli.hpc pull`).
```
├── gui.py                      # Streamlit entry point
├── main.py                     # CLI entry point
├── config.json                 # Runtime configuration
├── pyproject.toml              # Dependencies and pytest config
├── src/
│   ├── gui_app.py              # Streamlit UI (view layer)
│   ├── main.py                 # CLI analysis logic
│   ├── semantic_change/
│   │   ├── config_manager.py   # Configuration management (AppConfig dataclass)
│   │   ├── services.py         # Business logic (StatsService, ClusterService)
│   │   ├── corpus.py           # SQLite corpus access
│   │   ├── embedding.py        # Transformer embeddings (BertEmbedder)
│   │   ├── vector_store.py     # ChromaDB cache for embeddings
│   │   ├── wsi.py              # Word Sense Induction (clustering)
│   │   ├── visualization.py    # Plotly interactive visualizations
│   │   └── ingestor.py         # Corpus ingestion pipeline
│   └── utils/
│       └── dependencies.py     # Dependency checking utilities
├── tests/                      # Unit tests (pytest)
│   ├── test_config_manager.py
│   ├── test_services.py
│   ├── test_dependencies.py
│   └── test_vector_store.py
└── data/
    ├── corpus_t1.db            # Ingested corpus (period 1)
    ├── corpus_t2.db            # Ingested corpus (period 2)
    └── chroma_db/              # Cached embeddings (ChromaDB)
```
The codebase follows an MVC-like pattern:
- **View Layer** (`gui_app.py`): Handles Streamlit rendering and user input
- **Service Layer** (`services.py`): Business logic for statistics and cluster operations
- **Data Layer** (`corpus.py`, `vector_store.py`): Database and cache access
- **Configuration** (`config_manager.py`): Centralized settings using the dataclass pattern
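The configuration layer's dataclass pattern keeps all settings in one typed object that can round-trip through a JSON file such as `config.json`. A minimal sketch of the pattern (field names and methods here are illustrative assumptions, not the actual `AppConfig` API):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class AppConfig:
    # Illustrative fields; the real AppConfig in config_manager.py may differ.
    data_dir: str = "data"
    model_name: str = "bert-base-uncased"
    max_samples: int = 200

    @classmethod
    def load(cls, path):
        """Read settings from a JSON file, ignoring unknown keys and
        falling back to the defaults above for missing ones."""
        with open(path) as f:
            raw = json.load(f)
        known = {k: v for k, v in raw.items() if k in cls.__dataclass_fields__}
        return cls(**known)

    def save(self, path):
        """Write the current settings back out as JSON."""
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

cfg = AppConfig(model_name="roberta-base")
print(cfg.model_name)  # → roberta-base
```

Because the dataclass declares types and defaults in one place, both the GUI and the CLI can share the same validated settings object.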
This project is developed by Fotis Jannidis with significant contributions from Claude and Gemini. See CONTRIBUTORS.md for details.