diff --git a/README.md b/README.md
index 1342353..db1907a 100644
--- a/README.md
+++ b/README.md
@@ -1,56 +1,114 @@
 # CodeVibing
-A visual gallery of AI-generated React components and experiments. Share and explore creative coding with AI assistance.
+CodeVibing is a hybrid workspace that pairs a visual gallery of AI-generated React components with a research-grade Latin bibliography toolkit. The project combines a shareable Next.js playground for creative coding with a Python pipeline for constructing a master catalogue of Latin works (1450–1900).

-## Features
+```mermaid
+flowchart TD
+    subgraph Frontend Gallery
+        A[Next.js App Router]
+        B[Shared UI Components]
+        C[Data Seeds]
+        A --> B
+        A --> C
+    end

-- 🎨 Visual gallery of AI-generated projects
-- 💻 Live React playground
-- 🌟 Easy project sharing
-- 📱 Responsive design
-- 🎥 Auto-generated previews
+    subgraph Latin Corpus Toolkit
+        R[Raw Catalogue CSVs]
+        N[Normalization Utilities]
+        M[Master Bibliography Builder]
+        T[Translation Matcher]
+        P[Priority Scorer]
+        O[latin_master_1450_1900.csv]
+        R --> N --> M --> T --> P --> O
+    end

-## Getting Started
+
+    B -->|Showcase| Gallery[Live Gallery Experience]
+    O -->|Insights| Gallery
+```

-1. Clone the repository:
-   ```bash
-   git clone https://github.com/JDerekLomas/codevibing.git
-   cd codevibing
-   ```
+## Repository Structure
+
+```
+codevibing/
+├── src/              # Next.js application source
+├── public/           # Static assets for the gallery
+├── latin_corpus/     # Python toolkit for the Latin master bibliography
+├── notebooks/        # Prototyping notebooks for dataset exploration
+├── package.json      # Frontend dependencies
+└── requirements.txt  # Python dependencies for the toolkit
+```

-2. Install dependencies:
+## Frontend Quick Start
+
+1. **Install dependencies**
    ```bash
    npm install
    ```

-3. Copy .env.example to .env.local and add your credentials:
+2. **Configure environment variables**
    ```bash
    cp .env.example .env.local
+   # Edit .env.local and add any required API keys
    ```

-4. Start the development server:
+3. **Run the development server**
    ```bash
    npm run dev
    ```

-Visit [http://localhost:3000](http://localhost:3000) to see the app running.
+   Visit [http://localhost:3000](http://localhost:3000) to explore the gallery.

-## Project Structure
+## Latin Corpus Toolkit Overview

+The toolkit in `latin_corpus/` assembles catalogue exports, flags digitization and translation coverage, and scores works for follow-up research.
+
+### Prerequisites
+
+```bash
+cd latin_corpus
+python -m venv .venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\Activate.ps1
+pip install -r requirements.txt
 ```
-codevibing/
-├── src/
-│   ├── app/              # Next.js app directory
-│   ├── components/       # Shared components
-│   ├── lib/              # Utilities and shared code
-│   └── data/             # Initial seed data
-└── public/               # Static assets
-```
+
+### Workflow
+
+1. Place catalogue exports (USTC, VD16/17/18, ESTC, etc.) and translation series CSVs in `latin_corpus/data/raw/`.
+2. Run the end-to-end builder:
+   ```bash
+   python -m latin_corpus.main
+   ```
+3. Inspect the generated master table at `latin_corpus/data/processed/latin_master_1450_1900.csv`.
+
+See [latin_corpus/README.md](latin_corpus/README.md) for detailed customization options, column mappings, and troubleshooting tips.
+
+## Publishing Your Own Copy to GitHub
+
+If you started from a local folder and want to push it to a new GitHub repository, follow these steps:
+
+1. Create an empty repository at [https://github.com/new](https://github.com/new).
+2. Run the following commands from your project directory (replace the URL with your repo):
+   ```bash
+   git init
+   git remote add origin https://github.com/<your-username>/codevibing.git
+   git add .
+   git commit -m "Initial commit"
+   git branch -M main
+   git push -u origin main
+   ```
+3. Verify the remote:
+   ```bash
+   git remote -v
+   ```
+4. Clone elsewhere when needed:
+   ```bash
+   git clone https://github.com/<your-username>/codevibing.git
+   ```

 ## Contributing

-We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
+We welcome improvements! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines.

 ## License

-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
\ No newline at end of file
+This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.
diff --git a/latin_corpus/README.md b/latin_corpus/README.md
new file mode 100644
index 0000000..bb63d91
--- /dev/null
+++ b/latin_corpus/README.md
@@ -0,0 +1,81 @@
+# Latin Corpus Toolkit
+
+This toolkit assembles disparate catalogue exports into a unified master bibliography of Latin works published between roughly 1450 and 1900. It normalizes metadata, flags digitized editions and modern translations, and assigns a configurable research priority score.
+
+## Pipeline at a Glance
+
+```mermaid
+flowchart LR
+    R[Raw Catalogue CSVs\nUSTC / VD16-18 / ESTC / etc.] --> N[normalize.py\nAuthor & title cleanup]
+    N --> M[merge.py\nBuild master bibliography]
+    M --> T[translation_match.py\nMatch modern translations]
+    T --> P[priority.py\nScore & tag works]
+    P --> O[data/processed/latin_master_1450_1900.csv]
+```
+
+Each stage uses pandas DataFrames and can be customized through configuration dictionaries and helper functions.
+
+## Directory Layout
+
+```
+latin_corpus/
+├── data/
+│   ├── raw/          # Drop catalogue & translation CSV/TSV exports here
+│   └── processed/    # Generated outputs (e.g., latin_master_1450_1900.csv)
+├── latin_corpus/     # Python package with the normalization/merge pipeline
+├── notebooks/        # Optional Jupyter notebooks for exploration
+└── requirements.txt  # Toolkit-specific dependencies
+```
+
+## Quick Start
+
+1. **Create a virtual environment and install dependencies**
+   ```bash
+   cd latin_corpus
+   python -m venv .venv
+   source .venv/bin/activate  # On Windows: .venv\Scripts\Activate.ps1
+   pip install -r requirements.txt
+   ```
+
+2. **Stage your source data**
+   * Copy catalogue exports (USTC, VD16/VD17/VD18, ESTC, national catalogues, etc.) into `data/raw/`.
+   * Add translation spreadsheets (Loeb, I Tatti, Brill, or custom lists) to the same folder.
+
+3. **Run the end-to-end build**
+   ```bash
+   python -m latin_corpus.main
+   ```
+   The script prints progress summaries and writes `data/processed/latin_master_1450_1900.csv`.
+
+## Configuring Inputs
+
+* **Column mappings:** The loader functions in `io_utils.py` accept optional dictionaries for renaming columns when catalogue exports use different headings.
+* **Language filtering:** `merge.py` keeps only records whose language field standardizes to Latin; add or remove variants in the `LANGUAGE_MAP` dictionary of `normalize.py` as needed (e.g., `"lat"`, `"latin"`).
+* **Translation files:** Adjust the `TRANSLATION_SERIES` tuple near the top of `latin_corpus/main.py` if your filenames differ or you want to add additional translation datasets.
+* **Fuzzy matching:** `translation_match.py` exposes `DEFAULT_MATCH_CONFIG` for enabling/disabling fuzzy title similarity and tuning thresholds; see the sketch below.
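+
+The sketch below shows how these settings can be combined when driving the pipeline from Python rather than through `python -m latin_corpus.main`. The file `ustc_subset.csv` and the `0.85` threshold are placeholder values, not data shipped with the toolkit.
+
+```python
+from latin_corpus.io_utils import load_translation_list
+from latin_corpus.merge import build_master_bibliography
+from latin_corpus.translation_match import add_translation_flags, build_translation_index
+
+# Build the master table from a custom USTC export placed in data/raw/ (hypothetical file name).
+master = build_master_bibliography(overrides={"USTC": {"path": "ustc_subset.csv"}})
+
+# Index a single translation series and relax the fuzzy-matching threshold.
+loeb = load_translation_list(series_name="Loeb", path="loeb_classical_library.csv")
+index = build_translation_index({"Loeb": loeb})
+flagged = add_translation_flags(master, index, config={"enable_fuzzy": True, "fuzzy_threshold": 0.85})
+flagged[["author", "title", "has_modern_translation"]].head()
+```
+
+`add_priority_scores` from `latin_corpus.priority` can then be applied to the result, mirroring the final step of `run_pipeline()`.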
+ +## Inspecting Results + +You can explore the master bibliography interactively using the provided notebook: + +```bash +jupyter notebook notebooks/build_master_example.ipynb +``` + +Within the notebook, import and call: + +```python +from latin_corpus.merge import build_master_bibliography +master_df = build_master_bibliography() +master_df.head() +``` + +## Troubleshooting + +* Install pandas and related dependencies if you see a `MissingDependencyError` from `_compat.py`. +* Verify filenames and encodings for any CSV/TSV that fails to load; the loaders accept both UTF-8 and Latin-1. +* Delete or move old outputs in `data/processed/` if you want to regenerate the master CSV from scratch. + +## Contributing + +Pull requests and issue reports are welcome. Please follow the repository-wide [CONTRIBUTING.md](../CONTRIBUTING.md) guidelines when proposing changes. diff --git a/latin_corpus/data/processed/.gitkeep b/latin_corpus/data/processed/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/latin_corpus/data/raw/.gitkeep b/latin_corpus/data/raw/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/latin_corpus/latin_corpus/__init__.py b/latin_corpus/latin_corpus/__init__.py new file mode 100644 index 0000000..54c53bf --- /dev/null +++ b/latin_corpus/latin_corpus/__init__.py @@ -0,0 +1,32 @@ +"""Utility package for constructing a Latin bibliography master table.""" + +from __future__ import annotations + +from importlib import import_module +from typing import Any + +__all__ = [ + "add_priority_scores", + "add_translation_flags", + "build_master_bibliography", + "build_translation_index", + "run_pipeline", +] + + +_MODULE_MAP = { + "run_pipeline": (".main", "run_pipeline"), + "build_master_bibliography": (".merge", "build_master_bibliography"), + "add_priority_scores": (".priority", "add_priority_scores"), + "add_translation_flags": (".translation_match", "add_translation_flags"), + "build_translation_index": (".translation_match", "build_translation_index"), +} + + +def __getattr__(name: str) -> Any: # pragma: no cover - dynamic import glue + try: + module_name, attr = _MODULE_MAP[name] + except KeyError as exc: + raise AttributeError(f"module 'latin_corpus' has no attribute {name!r}") from exc + module = import_module(module_name, package=__name__) + return getattr(module, attr) diff --git a/latin_corpus/latin_corpus/_compat.py b/latin_corpus/latin_corpus/_compat.py new file mode 100644 index 0000000..ce23449 --- /dev/null +++ b/latin_corpus/latin_corpus/_compat.py @@ -0,0 +1,33 @@ +"""Compatibility helpers for optional runtime dependencies.""" + +from __future__ import annotations + +from importlib import import_module +from types import ModuleType + + +class MissingDependencyError(RuntimeError): + """Raised when a required optional dependency is unavailable.""" + + +def require_pandas() -> ModuleType: + """Return the :mod:`pandas` module or raise a helpful error message. + + The toolkit leans heavily on pandas for all tabular operations. When the + dependency is not installed, importing modules that rely on pandas results + in an opaque ``ModuleNotFoundError``. Centralising the import behind this + helper lets us surface an actionable instruction for users instead. + """ + + try: + return import_module("pandas") + except ModuleNotFoundError as exc: # pragma: no cover - import-time guard + raise MissingDependencyError( + "pandas is required for the latin_corpus toolkit. 
Install the " + "dependencies via 'pip install -r requirements.txt' before running " + "the pipeline." + ) from exc + + +__all__ = ["MissingDependencyError", "require_pandas"] + diff --git a/latin_corpus/latin_corpus/io_utils.py b/latin_corpus/latin_corpus/io_utils.py new file mode 100644 index 0000000..4597fbb --- /dev/null +++ b/latin_corpus/latin_corpus/io_utils.py @@ -0,0 +1,236 @@ +"""Input/output utilities for catalogue and translation data. + +This module centralises reading and writing logic so that the rest of the +pipeline can rely on consistent DataFrame schemas. All file interactions are +confined to the ``latin_corpus/data`` directory tree by default, but custom +paths may be supplied when integrating additional catalogues. +""" + +from __future__ import annotations + +import logging +from pathlib import Path +from typing import Iterable, Mapping, MutableMapping, Optional + +from ._compat import require_pandas + +pd = require_pandas() + +LOGGER = logging.getLogger(__name__) + +PACKAGE_ROOT = Path(__file__).resolve().parents[1] +RAW_DATA_DIR = PACKAGE_ROOT / "data" / "raw" +PROCESSED_DATA_DIR = PACKAGE_ROOT / "data" / "processed" + +# Default file names act as placeholders that users can replace with their own +# exports. The files do not need to exist; missing files yield empty frames so +# that the rest of the pipeline can still run for testing purposes. +DEFAULT_FILENAMES: Mapping[str, str] = { + "USTC": "ustc_export.csv", + "VD16": "vd16_export.csv", + "VD17": "vd17_export.csv", + "VD18": "vd18_export.csv", + "ESTC": "estc_export.csv", +} + +# Columns that are commonly requested downstream. Missing columns are filled +# with ``pd.NA`` so that DataFrame operations remain well defined. +CORE_CATALOG_COLUMNS: tuple[str, ...] = ( + "source_id", + "author", + "title", + "full_title", + "imprint_place", + "imprint_year", + "language", + "subjects", + "digital_facsimile_urls", +) + +TRANSLATION_COLUMNS: tuple[str, ...] = ( + "latin_author", + "latin_title", + "modern_language", + "translation_series", + "year_of_translation", +) + +DEFAULT_READ_KWARGS: Mapping[str, object] = { + "dtype": str, + "keep_default_na": False, + "na_values": ["", "NA", "N/A", "null", "None"], +} + + +def _resolve_path(path: Optional[Path | str], default_name: str) -> Path: + """Return the resolved path to a raw data file. + + Parameters + ---------- + path: + Custom path provided by the caller. May be absolute or relative. + default_name: + File name (without directory) to use when ``path`` is ``None``. + + Returns + ------- + Path + Fully resolved file path. The file does not need to exist yet. + """ + + if path is None: + candidate = RAW_DATA_DIR / default_name + else: + candidate = Path(path) + if not candidate.is_absolute(): + candidate = RAW_DATA_DIR / candidate + return candidate + + +def _ensure_columns(frame: pd.DataFrame, required: Iterable[str]) -> pd.DataFrame: + """Guarantee that ``frame`` has the specified ``required`` columns.""" + + missing = [col for col in required if col not in frame.columns] + if missing: + frame = frame.assign(**{col: pd.NA for col in missing}) + return frame + + +def _load_csv(path: Path, *, column_map: Optional[Mapping[str, str]] = None, **kwargs) -> pd.DataFrame: + """Load a CSV/TSV file into a DataFrame with optional column renaming. + + If the file does not exist, an empty DataFrame with mapped columns is + returned instead of raising an exception. This behaviour allows the + pipeline to run in environments where only a subset of catalogues are + available. 
+ """ + + read_kwargs: MutableMapping[str, object] = dict(DEFAULT_READ_KWARGS) + read_kwargs.update(kwargs) + + if path.suffix.lower() == ".tsv": + read_kwargs.setdefault("sep", "\t") + + if not path.exists(): + LOGGER.warning("File not found: %s", path) + frame = pd.DataFrame() + else: + frame = pd.read_csv(path, **read_kwargs) + LOGGER.info("Loaded %s with %s rows and %s columns", path, len(frame), len(frame.columns)) + + if column_map: + frame = frame.rename(columns=column_map) + + return frame + + +def load_ustc(path: Optional[Path | str] = None, *, column_map: Optional[Mapping[str, str]] = None, **kwargs) -> pd.DataFrame: + """Load a Universal Short Title Catalogue (USTC) export. + + Parameters + ---------- + path: + Optional custom path. Defaults to ``data/raw/ustc_export.csv``. + column_map: + Mapping from source column names to canonical names. Columns listed in + :data:`CORE_CATALOG_COLUMNS` should be covered. + **kwargs: + Additional arguments forwarded to :func:`pandas.read_csv`. + + Returns + ------- + pandas.DataFrame + DataFrame with at least the columns defined in + :data:`CORE_CATALOG_COLUMNS`. Missing fields are populated with ``pd.NA``. + """ + + resolved = _resolve_path(path, DEFAULT_FILENAMES["USTC"]) + frame = _load_csv(resolved, column_map=column_map, **kwargs) + frame = _ensure_columns(frame, CORE_CATALOG_COLUMNS) + return frame + + +def load_vd(path: Optional[Path | str] = None, *, catalog_name: str, column_map: Optional[Mapping[str, str]] = None, **kwargs) -> pd.DataFrame: + """Load a VD catalogue (VD16/VD17/VD18) export. + + Parameters + ---------- + path: + Optional custom file path relative to ``data/raw``. If omitted, a + placeholder derived from ``catalog_name`` is used (e.g. ``vd16_export.csv``). + catalog_name: + Name of the catalogue, used to construct default file names and + populate metadata fields downstream. + column_map: + Optional rename mapping, analogous to :func:`load_ustc`. + **kwargs: + Additional keyword arguments for :func:`pandas.read_csv`. + """ + + default_name = DEFAULT_FILENAMES.get(catalog_name.upper(), f"{catalog_name.lower()}_export.csv") + resolved = _resolve_path(path, default_name) + frame = _load_csv(resolved, column_map=column_map, **kwargs) + frame = _ensure_columns(frame, CORE_CATALOG_COLUMNS) + return frame + + +def load_translation_list(path: Optional[Path | str] = None, *, series_name: str, column_map: Optional[Mapping[str, str]] = None, **kwargs) -> pd.DataFrame: + """Load a CSV containing information about modern translations. + + Parameters + ---------- + path: + Optional custom path relative to ``data/raw``. If omitted, the file name + defaults to ``{series_name}.csv`` in snake case (e.g. ``loeb_classical_library.csv``). + series_name: + Label describing the translation series (e.g. "Loeb"). This value is not + used during loading but is convenient when building indices downstream. + column_map: + Optional rename mapping for column normalisation. + **kwargs: + Additional keyword arguments for :func:`pandas.read_csv`. + """ + + default_name = f"{series_name.lower().replace(' ', '_')}.csv" + resolved = _resolve_path(path, default_name) + frame = _load_csv(resolved, column_map=column_map, **kwargs) + frame = _ensure_columns(frame, TRANSLATION_COLUMNS) + return frame + + +def save_processed(df: pd.DataFrame, filename: str, *, index: bool = False, **kwargs) -> Path: + """Write ``df`` to ``data/processed`` with ``filename``. + + Parameters + ---------- + df: + DataFrame to be persisted. 
+ filename: + File name (with extension) relative to ``data/processed``. + index: + Whether to include the DataFrame index. Defaults to ``False``. + **kwargs: + Additional arguments forwarded to :meth:`pandas.DataFrame.to_csv`. + + Returns + ------- + Path + The path of the saved file, allowing the caller to log or reuse it. + """ + + PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True) + path = PROCESSED_DATA_DIR / filename + df.to_csv(path, index=index, **kwargs) + LOGGER.info("Saved processed data to %s", path) + return path + + +__all__ = [ + "DEFAULT_FILENAMES", + "RAW_DATA_DIR", + "PROCESSED_DATA_DIR", + "load_translation_list", + "load_ustc", + "load_vd", + "save_processed", +] diff --git a/latin_corpus/latin_corpus/main.py b/latin_corpus/latin_corpus/main.py new file mode 100644 index 0000000..2210668 --- /dev/null +++ b/latin_corpus/latin_corpus/main.py @@ -0,0 +1,110 @@ +"""Command-line entry point for building the Latin master dataset.""" + +from __future__ import annotations + +import logging +from pathlib import Path +from typing import Iterable, Mapping, MutableMapping, TYPE_CHECKING + +from ._compat import MissingDependencyError + +if TYPE_CHECKING: # pragma: no cover - import for typing only + import pandas as pd + + +LOGGER = logging.getLogger(__name__) + +MASTER_OUTPUT_FILENAME = "latin_master_1450_1900.csv" + +# Default translation series specifications. Adjust or extend this tuple to suit +# the catalogues available in ``data/raw``. +TRANSLATION_SERIES: tuple[Mapping[str, object], ...] = ( + {"label": "Loeb", "path": "loeb_classical_library.csv"}, + {"label": "I Tatti", "path": "i_tatti_renaissance_library.csv"}, + {"label": "Brill", "path": "brill_translations.csv"}, +) + + +def _load_translation_frames(series_specs: Iterable[Mapping[str, object]]) -> Mapping[str, "pd.DataFrame"]: + frames: MutableMapping[str, "pd.DataFrame"] = {} + for spec in series_specs: + label = str(spec.get("label", "")).strip() + if not label: + LOGGER.warning("Skipping translation series with missing label: %s", spec) + continue + path = spec.get("path") + column_map = spec.get("column_map") + read_kwargs = spec.get("read_kwargs", {}) + from .io_utils import load_translation_list + + frames[label] = load_translation_list(path=path, series_name=label, column_map=column_map, **read_kwargs) + return frames + + +def _print_summary(df: "pd.DataFrame", output_path: Path) -> None: + total_rows = len(df) + LOGGER.info("Saved master table to %s", output_path) + LOGGER.info("Total rows: %s", total_rows) + + if total_rows == 0: + LOGGER.warning("No data rows available. 
Check catalogue inputs in data/raw/.") + return + + facsimile_series = df["has_digital_facsimile"].fillna(False).astype(bool) + translation_series = df["has_modern_translation"].fillna(False).astype(bool) + + percent_unscanned = 100.0 * (1.0 - facsimile_series.mean()) + percent_untranslated = 100.0 * (1.0 - translation_series.mean()) + + LOGGER.info("%% without digital facsimile: %.2f", percent_unscanned) + LOGGER.info("%% without modern translation: %.2f", percent_untranslated) + + top_priority = df.sort_values("priority_score", ascending=False).head(20) + if top_priority.empty: + LOGGER.info("No rows with positive priority scores yet.") + return + + display_columns = [ + "work_id", + "author", + "title", + "imprint_year", + "has_digital_facsimile", + "has_modern_translation", + "priority_score", + "priority_tags", + ] + LOGGER.info("Top 20 priority works:\n%s", top_priority[display_columns].to_string(index=False)) + + +def run_pipeline() -> "pd.DataFrame": + """Execute the full pipeline and return the enriched master DataFrame.""" + + from .io_utils import save_processed + from .merge import build_master_bibliography + from .priority import add_priority_scores + from .translation_match import DEFAULT_MATCH_CONFIG, add_translation_flags, build_translation_index + + master = build_master_bibliography() + translation_frames = _load_translation_frames(TRANSLATION_SERIES) + translation_index = build_translation_index(translation_frames) + with_translations = add_translation_flags(master, translation_index, config=DEFAULT_MATCH_CONFIG) + scored = add_priority_scores(with_translations) + output_path = save_processed(scored, MASTER_OUTPUT_FILENAME) + _print_summary(scored, output_path) + return scored + + +def main() -> None: + """Entry point used by ``python -m latin_corpus.main``.""" + + logging.basicConfig(level=logging.INFO, format="%(levelname)s:%(name)s:%(message)s") + try: + run_pipeline() + except MissingDependencyError as exc: + LOGGER.error("%s", exc) + raise SystemExit(1) from exc + + +if __name__ == "__main__": # pragma: no cover - CLI entry point + main() diff --git a/latin_corpus/latin_corpus/merge.py b/latin_corpus/latin_corpus/merge.py new file mode 100644 index 0000000..ceec5c9 --- /dev/null +++ b/latin_corpus/latin_corpus/merge.py @@ -0,0 +1,295 @@ +"""Catalogue normalisation and merging utilities.""" + +from __future__ import annotations + +import hashlib +import logging +from dataclasses import dataclass +from typing import Callable, Dict, Iterable, Mapping, Optional + +from ._compat import require_pandas + +pd = require_pandas() + +from .io_utils import DEFAULT_FILENAMES, load_ustc, load_vd +from .normalize import extract_year, normalize_author, normalize_title, standardize_language_label + +LOGGER = logging.getLogger(__name__) + +CatalogLoader = Callable[..., pd.DataFrame] + + +@dataclass +class CatalogSpec: + """Configuration for a source catalogue.""" + + loader: CatalogLoader + column_map: Mapping[str, str] + default_filename: str + extra_kwargs: Mapping[str, object] | None = None + + +# Default column mappings are deliberately conservative and should be adjusted to +# match the exported CSV headers used in your environment. 
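+# Note: CATALOG_PRIORITY and DEFAULT_FILENAMES also list ESTC, but no ESTC entry is
+# defined below. If an ESTC export is available, a CatalogSpec analogous to the VD
+# entries (using that export's actual column headings) would need to be added here.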
+CATALOG_SPECS: Dict[str, CatalogSpec] = { + "USTC": CatalogSpec( + loader=load_ustc, + column_map={ + "ustc_id": "source_id", + "author": "author", + "short_title": "title", + "full_title": "full_title", + "imprint_place": "imprint_place", + "imprint_year": "imprint_year", + "language": "language", + "subjects": "subjects", + "digital_facsimile_urls": "digital_facsimile_urls", + }, + default_filename=DEFAULT_FILENAMES["USTC"], + ), + "VD16": CatalogSpec( + loader=load_vd, + column_map={ + "vd16": "source_id", + "author": "author", + "title_short": "title", + "title_full": "full_title", + "place": "imprint_place", + "year": "imprint_year", + "language": "language", + "keywords": "subjects", + "digital_urls": "digital_facsimile_urls", + }, + default_filename=DEFAULT_FILENAMES["VD16"], + extra_kwargs={"catalog_name": "VD16"}, + ), + "VD17": CatalogSpec( + loader=load_vd, + column_map={ + "vd17": "source_id", + "author": "author", + "short_title": "title", + "full_title": "full_title", + "place": "imprint_place", + "imprint_year": "imprint_year", + "language": "language", + "subjects": "subjects", + "digital_facsimile": "digital_facsimile_urls", + }, + default_filename=DEFAULT_FILENAMES["VD17"], + extra_kwargs={"catalog_name": "VD17"}, + ), + "VD18": CatalogSpec( + loader=load_vd, + column_map={ + "vd18": "source_id", + "author": "author", + "title": "title", + "title_full": "full_title", + "place": "imprint_place", + "year": "imprint_year", + "language": "language", + "subjects": "subjects", + "digital_facsimile": "digital_facsimile_urls", + }, + default_filename=DEFAULT_FILENAMES["VD18"], + extra_kwargs={"catalog_name": "VD18"}, + ), +} + +CATALOG_PRIORITY: tuple[str, ...] = ("USTC", "VD16", "VD17", "VD18", "ESTC") + +VALUE_COLUMNS: tuple[str, ...] = ( + "author", + "title", + "full_title", + "imprint_place", + "subjects", + "digital_facsimile_urls", +) + + +def _load_catalogue(name: str, overrides: Optional[Mapping[str, object]] = None) -> pd.DataFrame: + """Load and lightly clean a single catalogue.""" + + spec = CATALOG_SPECS.get(name) + if spec is None: + LOGGER.warning("No catalog specification for %s; skipping", name) + return pd.DataFrame() + + kwargs = dict(spec.extra_kwargs or {}) + if overrides: + kwargs.update(overrides) + + frame = spec.loader(path=kwargs.pop("path", None), column_map=spec.column_map, **kwargs) + if frame.empty: + return frame + + frame["source_catalog"] = name + + for col in ("author", "title", "full_title", "imprint_place", "subjects", "digital_facsimile_urls"): + if col in frame.columns: + frame[col] = frame[col].fillna("").astype(str).str.strip() + else: + frame[col] = "" + + frame["author_norm"] = frame["author"].apply(normalize_author) + frame["title_norm"] = frame["title"].apply(normalize_title) + frame["imprint_year"] = frame["imprint_year"].apply(extract_year) + + frame["language_standardized"] = frame["language"].apply(standardize_language_label) + lang_text = frame["language"].fillna("").astype(str) + mask = frame["language_standardized"].eq("Latin") | lang_text.str.contains("latin", case=False) + frame = frame[mask] + frame["language"] = frame["language_standardized"].fillna(frame["language"]) + + frame["has_digital_facsimile"] = frame["digital_facsimile_urls"].apply(lambda value: bool(str(value).strip())) + frame["digital_facsimile_sources"] = frame["has_digital_facsimile"].map({True: name, False: ""}) + + return frame[ + [ + "source_catalog", + "source_id", + "author", + "author_norm", + "title", + "title_norm", + "full_title", + "imprint_place", + 
"imprint_year", + "language", + "subjects", + "digital_facsimile_urls", + "has_digital_facsimile", + "digital_facsimile_sources", + ] + ] + + +def _combine_source_ids(values: Iterable[str]) -> str: + unique = sorted({v for v in values if v}) + return ";".join(unique) + + +def _combine_strings(values: Iterable[str]) -> str: + unique = sorted({v.strip() for v in values if v and v.strip()}) + return ";".join(unique) + + +def _generate_work_id(author_norm: str, title_norm: str, imprint_year: Optional[int]) -> str: + year_token = str(imprint_year) if imprint_year is not None else "na" + digest = hashlib.md5(f"{author_norm}||{title_norm}||{year_token}".encode("utf-8")).hexdigest() + return f"wrk_{digest[:12]}" + + +def _deduplicate(master: pd.DataFrame) -> pd.DataFrame: + if master.empty: + return master + + rank_map = {name: idx for idx, name in enumerate(CATALOG_PRIORITY)} + master = master.copy() + master["catalog_rank"] = master["source_catalog"].map(rank_map).fillna(len(rank_map)).astype(int) + master["data_completeness"] = master[list(VALUE_COLUMNS)].notna().sum(axis=1) + master["imprint_year_group"] = master["imprint_year"].fillna(-1).astype(int) + master["dedupe_key"] = list(zip(master["author_norm"], master["title_norm"], master["imprint_year_group"])) + + master_sorted = master.sort_values(by=["catalog_rank", "data_completeness"], ascending=[True, False]) + best = master_sorted.drop_duplicates(subset="dedupe_key", keep="first") + + source_id_map = master.groupby("dedupe_key")["source_id"].apply(_combine_source_ids) + digital_url_map = master.groupby("dedupe_key")["digital_facsimile_urls"].apply(_combine_strings) + digital_sources_map = master.groupby("dedupe_key")["digital_facsimile_sources"].apply(_combine_strings) + has_digital_map = master.groupby("dedupe_key")["has_digital_facsimile"].any() + + best = best.copy() + best["source_id"] = best["dedupe_key"].map(source_id_map) + best["digital_facsimile_urls"] = best["dedupe_key"].map(digital_url_map).fillna("") + best["digital_facsimile_sources"] = best["dedupe_key"].map(digital_sources_map).fillna("") + best["has_digital_facsimile"] = best["dedupe_key"].map(has_digital_map).fillna(False) + + best["work_id"] = best.apply( + lambda row: _generate_work_id(row["author_norm"], row["title_norm"], row["imprint_year"]), axis=1 + ) + + best = best.drop(columns=["catalog_rank", "data_completeness", "imprint_year_group", "dedupe_key"]) + best["imprint_year"] = best["imprint_year"].astype("Int64") + + columns = [ + "work_id", + "source_catalog", + "source_id", + "author", + "author_norm", + "title", + "title_norm", + "full_title", + "imprint_place", + "imprint_year", + "language", + "subjects", + "digital_facsimile_urls", + "has_digital_facsimile", + "digital_facsimile_sources", + ] + + return best[columns] + + +def build_master_bibliography(overrides: Optional[Mapping[str, Mapping[str, object]]] = None) -> pd.DataFrame: + """Construct a unified DataFrame across all configured catalogues. + + Parameters + ---------- + overrides: + Optional mapping keyed by catalogue name that supplies keyword arguments + for the respective loader (e.g. ``{"USTC": {"path": "ustc_subset.csv"}}``). + + Returns + ------- + pandas.DataFrame + Normalised and de-duplicated catalogue entries limited to Latin-language + records. The result includes generated ``work_id`` values and flags for + known digital facsimiles. 
+ """ + + frames = [] + for name in CATALOG_SPECS: + frame = _load_catalogue(name, overrides=overrides.get(name) if overrides else None) + if frame.empty: + LOGGER.info("Catalogue %s produced no records (missing file or no Latin entries)", name) + continue + frames.append(frame) + + if not frames: + LOGGER.warning("No catalogue data available; returning empty DataFrame") + return pd.DataFrame( + columns=[ + "work_id", + "source_catalog", + "source_id", + "author", + "author_norm", + "title", + "title_norm", + "full_title", + "imprint_place", + "imprint_year", + "language", + "subjects", + "digital_facsimile_urls", + "has_digital_facsimile", + "digital_facsimile_sources", + ] + ) + + combined = pd.concat(frames, ignore_index=True) + master = _deduplicate(combined) + LOGGER.info("Master bibliography contains %s rows", len(master)) + return master + + +__all__ = [ + "CATALOG_PRIORITY", + "CATALOG_SPECS", + "build_master_bibliography", +] diff --git a/latin_corpus/latin_corpus/normalize.py b/latin_corpus/latin_corpus/normalize.py new file mode 100644 index 0000000..c697973 --- /dev/null +++ b/latin_corpus/latin_corpus/normalize.py @@ -0,0 +1,125 @@ +"""Normalisation helpers for bibliographic metadata.""" + +from __future__ import annotations + +import re +import string +from typing import Iterable, Optional + +from unidecode import unidecode + +# Configuration values collected in a dictionary for quick adjustments. +CONFIG = { + "author_honorifics": ( + "dr", + "prof", + "professor", + "rev", + "reverend", + "sir", + "dom", + "fr", + "fra", + ), + "title_leading_stopwords": ( + "de", + "in", + "ad", + "liber", + ), + "punctuation_preserve_title": {":", ","}, +} + +PUNCTUATION_TABLE_AUTHOR = str.maketrans({ch: " " for ch in string.punctuation}) +PUNCTUATION_TABLE_TITLE = str.maketrans( + {ch: " " for ch in string.punctuation if ch not in CONFIG["punctuation_preserve_title"]} +) + +LANGUAGE_MAP = { + "lat": "Latin", + "la": "Latin", + "latin": "Latin", + "latine": "Latin", + "latius": "Latin", +} + +YEAR_PATTERN = re.compile(r"(1[45-9]\d{2})") + + +def _normalise_whitespace(value: str) -> str: + return re.sub(r"\s+", " ", value).strip() + + +def _strip_honorifics(value: str, honorifics: Iterable[str]) -> str: + pattern = r"^(?:(?:" + "|".join(map(re.escape, honorifics)) + r")\.?,?\s+)+" + return re.sub(pattern, "", value) + + +def normalize_author(name: Optional[str]) -> str: + """Return a lowercased, ASCII-fied author string without honorifics.""" + + if not name: + return "" + value = unidecode(str(name)).lower() + value = _strip_honorifics(value, CONFIG["author_honorifics"]) + value = value.translate(PUNCTUATION_TABLE_AUTHOR) + return _normalise_whitespace(value) + + +def normalize_title(title: Optional[str]) -> str: + """Return a normalised title suitable for matching across catalogues.""" + + if not title: + return "" + value = unidecode(str(title)).lower() + value = value.translate(PUNCTUATION_TABLE_TITLE) + value = _normalise_whitespace(value) + + for stopword in CONFIG["title_leading_stopwords"]: + if value.startswith(f"{stopword} "): + value = value[len(stopword) + 1 :] + break + + return value + + +def extract_year(value: Optional[str | int | float]) -> Optional[int]: + """Extract the first plausible Gregorian year (1450–1999) from ``value``.""" + + if value is None or value != value: # NaN check + return None + + if isinstance(value, (int, float)) and not isinstance(value, bool): + int_value = int(value) + if 1450 <= int_value <= 1900: + return int_value + return None + + text = 
unidecode(str(value)) + match = YEAR_PATTERN.search(text) + if not match: + return None + year = int(match.group(1)) + return year if 1450 <= year <= 1900 else None + + +def standardize_language_label(label: Optional[str]) -> Optional[str]: + """Map language codes and descriptors to canonical names.""" + + if not label: + return None + cleaned = unidecode(label).lower().strip() + if cleaned in LANGUAGE_MAP: + return LANGUAGE_MAP[cleaned] + if cleaned.startswith("lat"): + return "Latin" + return label.strip() + + +__all__ = [ + "CONFIG", + "extract_year", + "normalize_author", + "normalize_title", + "standardize_language_label", +] diff --git a/latin_corpus/latin_corpus/priority.py b/latin_corpus/latin_corpus/priority.py new file mode 100644 index 0000000..dcc9fcd --- /dev/null +++ b/latin_corpus/latin_corpus/priority.py @@ -0,0 +1,107 @@ +"""Priority scoring utilities for the Latin master table.""" + +from __future__ import annotations + +from typing import Mapping, MutableMapping, Optional + +from ._compat import require_pandas + +pd = require_pandas() + + +PRIORITY_WEIGHTS: Mapping[str, float] = { + "missing_facsimile": 2.0, + "missing_translation": 2.0, + "scientific": 1.0, + "hermetic": 1.0, + "colonial": 1.0, + "early_modern_peak": 1.0, +} + +KEYWORD_GROUPS: Mapping[str, tuple[str, ...]] = { + "scientific": ("astronom", "physic", "medic", "anatom", "botan", "mathemat"), + "hermetic": ("hermet", "alchem", "cabal", "magia", "occult"), + "colonial": ("india", "china", "mexic", "peru", "brazil", "goa", "iapon", "japan"), +} + +EARLY_MODERN_RANGE: tuple[int, int] = (1500, 1650) + + +def _ensure_columns(frame: pd.DataFrame) -> pd.DataFrame: + defaults = { + "has_digital_facsimile": False, + "has_modern_translation": False, + "subjects": "", + "title": "", + "priority_score": 0.0, + "priority_tags": "", + } + for col, default in defaults.items(): + if col not in frame.columns: + frame[col] = default + return frame + + +def _detect_keyword_tags(text: str) -> set[str]: + if not text: + return set() + lowered = text.lower() + tags = {name for name, keywords in KEYWORD_GROUPS.items() if any(keyword in lowered for keyword in keywords)} + return tags + + +def add_priority_scores(master_df: pd.DataFrame, *, weights: Optional[Mapping[str, float]] = None) -> pd.DataFrame: + """Compute priority scores and tags for ``master_df``.""" + + if master_df.empty: + result = master_df.copy() + result["priority_score"] = pd.Series(dtype=float) + result["priority_tags"] = pd.Series(dtype=str) + return result + + working = _ensure_columns(master_df.copy()) + applied_weights: MutableMapping[str, float] = dict(PRIORITY_WEIGHTS) + if weights: + applied_weights.update(weights) + + scores = [] + tags_list = [] + lower_bound, upper_bound = EARLY_MODERN_RANGE + + for _, row in working.iterrows(): + score = 0.0 + tags: list[str] = [] + + if not bool(row.get("has_digital_facsimile", False)): + score += applied_weights["missing_facsimile"] + tags.append("unscanned") + + if not bool(row.get("has_modern_translation", False)): + score += applied_weights["missing_translation"] + tags.append("untranslated") + + text_blob = f"{row.get('title', '')} {row.get('subjects', '')}".strip() + for keyword_tag in _detect_keyword_tags(text_blob): + score += applied_weights.get(keyword_tag, 0.0) + tags.append(keyword_tag) + + imprint_year = row.get("imprint_year") + if pd.notna(imprint_year): + try: + year_int = int(imprint_year) + except (TypeError, ValueError): + year_int = None + if year_int is not None and lower_bound <= year_int <= 
upper_bound: + score += applied_weights["early_modern_peak"] + tags.append("early_modern_peak") + + scores.append(score) + tags_list.append(";".join(sorted(dict.fromkeys(tags)))) + + working["priority_score"] = scores + working["priority_tags"] = tags_list + + return working + + +__all__ = ["add_priority_scores", "PRIORITY_WEIGHTS", "KEYWORD_GROUPS", "EARLY_MODERN_RANGE"] diff --git a/latin_corpus/latin_corpus/translation_match.py b/latin_corpus/latin_corpus/translation_match.py new file mode 100644 index 0000000..7201178 --- /dev/null +++ b/latin_corpus/latin_corpus/translation_match.py @@ -0,0 +1,255 @@ +"""Utilities for matching Latin works to modern translations.""" + +from __future__ import annotations + +import logging +from typing import Iterable, Mapping, MutableMapping, Optional + +from ._compat import require_pandas + +pd = require_pandas() + +from .normalize import normalize_author, normalize_title + +LOGGER = logging.getLogger(__name__) + + +TRANSLATION_INDEX_COLUMNS: tuple[str, ...] = ( + "series_name", + "latin_author_norm", + "latin_title_norm", + "modern_language", + "translation_year", +) + + +DEFAULT_MATCH_CONFIG: Mapping[str, object] = { + "enable_fuzzy": True, + "fuzzy_threshold": 0.9, +} + + +try: # pragma: no cover - optional dependency handling + from rapidfuzz import fuzz as _rf_fuzz + + def _similarity(a: str, b: str) -> float: + return _rf_fuzz.ratio(a, b) / 100.0 + +except Exception: # pragma: no cover - optional dependency handling + try: + from Levenshtein import ratio as _lev_ratio + + def _similarity(a: str, b: str) -> float: + return _lev_ratio(a, b) + + except Exception: + _similarity = None # type: ignore[assignment] + + +def _normalise_translation_frame(frame: pd.DataFrame, series_name: str) -> pd.DataFrame: + """Return a copy of ``frame`` with normalised author/title columns.""" + + if frame.empty: + return pd.DataFrame(columns=[*TRANSLATION_INDEX_COLUMNS, "modern_languages", "translation_years"]) + + working = frame.copy() + working["series_name"] = series_name + working["latin_author_norm"] = working["latin_author"].fillna("").astype(str).map(normalize_author) + working["latin_title_norm"] = working["latin_title"].fillna("").astype(str).map(normalize_title) + working["modern_language"] = working["modern_language"].fillna("").astype(str).str.strip() + working["translation_year"] = working["year_of_translation"].fillna("").astype(str).str.extract(r"(\d{4})")[0] + + return working[TRANSLATION_INDEX_COLUMNS] + + +def build_translation_index(frames: Mapping[str, pd.DataFrame]) -> pd.DataFrame: + """Construct a translation index DataFrame from raw series frames. + + Parameters + ---------- + frames: + Mapping of human-readable series labels to DataFrames loaded via + :func:`latin_corpus.io_utils.load_translation_list`. Each frame should + contain the columns ``latin_author`` and ``latin_title`` in addition to + ``modern_language`` and ``year_of_translation``. + + Returns + ------- + pandas.DataFrame + Normalised index keyed by ``latin_author_norm`` and ``latin_title_norm``. + Additional columns store the concatenated translation sources, modern + languages, and translation years for quick lookups during matching. 
+ """ + + normalised: list[pd.DataFrame] = [] + for series_name, frame in frames.items(): + normalised.append(_normalise_translation_frame(frame, series_name)) + + if not normalised: + return pd.DataFrame(columns=[ + "latin_author_norm", + "latin_title_norm", + "translation_sources", + "modern_languages", + "translation_years", + ]) + + combined = pd.concat(normalised, ignore_index=True) + if combined.empty: + empty = combined.assign( + translation_sources=pd.Series(dtype=str), + modern_languages=pd.Series(dtype=str), + translation_years=pd.Series(dtype=str), + ) + return empty[ + [ + "latin_author_norm", + "latin_title_norm", + "translation_sources", + "modern_languages", + "translation_years", + ] + ] + grouped = combined.groupby(["latin_author_norm", "latin_title_norm"], dropna=False) + + def _collapse(values: Iterable[str]) -> str: + cleaned = [] + for value in values: + if pd.isna(value): + continue + text = str(value).strip() + if not text or text.lower() in {"na", "nan", "none"}: + continue + cleaned.append(text) + unique = sorted(set(cleaned)) + return ";".join(unique) + + aggregated = grouped.agg( + translation_sources=("series_name", _collapse), + modern_languages=("modern_language", _collapse), + translation_years=("translation_year", _collapse), + ).reset_index() + + return aggregated + + +def _prepare_author_lookup(index: pd.DataFrame) -> Mapping[str, pd.DataFrame]: + lookup: MutableMapping[str, pd.DataFrame] = {} + if index.empty: + return lookup + + for author, frame in index.groupby("latin_author_norm"): + lookup[str(author)] = frame + return lookup + + +def add_translation_flags( + master_df: pd.DataFrame, + translation_index: pd.DataFrame, + *, + config: Optional[Mapping[str, object]] = None, +) -> pd.DataFrame: + """Annotate ``master_df`` with translation availability information. + + Parameters + ---------- + master_df: + Bibliographic DataFrame produced by :func:`build_master_bibliography`. + Must contain ``author_norm`` and ``title_norm`` columns. + translation_index: + Output of :func:`build_translation_index` with normalised keys and + aggregated translation metadata. + config: + Optional configuration mapping overriding values in + :data:`DEFAULT_MATCH_CONFIG`. Supported keys are ``enable_fuzzy`` (bool) + and ``fuzzy_threshold`` (float). + + Returns + ------- + pandas.DataFrame + Copy of ``master_df`` enriched with the boolean field + ``has_modern_translation`` and supporting metadata columns + (``translation_sources``, ``translation_languages``, + ``translation_years``). 
+ """ + + if master_df.empty: + result = master_df.copy() + result["has_modern_translation"] = False + result["translation_sources"] = "" + result["translation_languages"] = "" + result["translation_years"] = "" + return result + + working = master_df.copy() + + merged = working.merge( + translation_index, + how="left", + left_on=["author_norm", "title_norm"], + right_on=["latin_author_norm", "latin_title_norm"], + ) + + merged["has_modern_translation"] = merged["translation_sources"].fillna("").astype(str).str.len() > 0 + merged["translation_sources"] = merged["translation_sources"].fillna("").astype(str) + merged["translation_languages"] = merged["modern_languages"].fillna("").astype(str) + merged["translation_years"] = merged["translation_years"].fillna("").astype(str) + + needs_fuzzy = ~merged["has_modern_translation"] + if needs_fuzzy.any(): + merged = _apply_fuzzy_matches(merged, translation_index, needs_fuzzy, config) + + merged = merged.drop(columns=["latin_author_norm", "latin_title_norm", "modern_languages"], errors="ignore") + + return merged + + +def _apply_fuzzy_matches( + merged: pd.DataFrame, + translation_index: pd.DataFrame, + mask: pd.Series, + config: Optional[Mapping[str, object]], +) -> pd.DataFrame: + options: MutableMapping[str, object] = dict(DEFAULT_MATCH_CONFIG) + if config: + options.update(config) + + if not options.get("enable_fuzzy", True): + return merged + + if _similarity is None: + LOGGER.warning("Fuzzy matching requested but no similarity backend is available.") + return merged + + threshold = float(options.get("fuzzy_threshold", 0.9)) + + lookup = _prepare_author_lookup(translation_index) + for idx in merged.index[mask]: + author = str(merged.at[idx, "author_norm"]) + title = str(merged.at[idx, "title_norm"]) + candidates = lookup.get(author) + if candidates is None or candidates.empty: + continue + + best_score = 0.0 + best_row: Optional[pd.Series] = None + for _, candidate in candidates.iterrows(): + score = _similarity(title, str(candidate["latin_title_norm"])) + if score > best_score: + best_score = score + best_row = candidate + + if best_row is not None and best_score >= threshold: + merged.at[idx, "has_modern_translation"] = True + merged.at[idx, "translation_sources"] = best_row.get("translation_sources", "") + merged.at[idx, "translation_languages"] = best_row.get("modern_languages", "") + merged.at[idx, "translation_years"] = best_row.get("translation_years", "") + + return merged + + +__all__ = [ + "DEFAULT_MATCH_CONFIG", + "build_translation_index", + "add_translation_flags", +] diff --git a/latin_corpus/notebooks/build_master_example.ipynb b/latin_corpus/notebooks/build_master_example.ipynb new file mode 100644 index 0000000..0bb9b38 --- /dev/null +++ b/latin_corpus/notebooks/build_master_example.ipynb @@ -0,0 +1,38 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Master bibliography quickstart\n", + "This notebook cell demonstrates how to call `build_master_bibliography()` and inspect the resulting DataFrame." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from latin_corpus.merge import build_master_bibliography\n",
+    "\n",
+    "master_df = build_master_bibliography()\n",
+    "master_df.head()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
\ No newline at end of file
diff --git a/latin_corpus/requirements.txt b/latin_corpus/requirements.txt
new file mode 100644
index 0000000..aecd483
--- /dev/null
+++ b/latin_corpus/requirements.txt
@@ -0,0 +1,4 @@
+pandas
+python-Levenshtein
+unidecode
+rapidfuzz