Text Analysis Tutorial: Setup & Template Walkthrough #244

@chinaexpert1

Description

Got it — we’ll start with T0: Setup & Template Walkthrough and instantiate it into the tutorial sub-issue format. I’ll fill in the details and add a Notes field at the end for context and possible improvements. Here’s the draft:


Title & Overview

Template: Setup & Template Walkthrough: An Intermediate, End-to-End Analysis Tutorial
Overview (≤2 sentences): This tutorial introduces learners to a standardized workflow for NLP experiments, covering environment setup, data hygiene, reproducibility, and report generation. It is intermediate because it emphasizes rigorous experiment scaffolding and governance rather than just “getting a model to run.”

Purpose

The value-add is building defensible, reproducible baselines while setting up a robust project structure. Learners move beyond single notebooks toward versioned experiments, config management, error analysis scaffolding, and light reporting/serving foundations.

Prerequisites

  • Skills: Python, Git, virtual envs; basic NumPy/pandas; ML basics (train/val/test, overfitting, regularization).
  • NLP: tokenization (wordpiece/BPE), embeddings vs TF-IDF, evaluation metrics (accuracy, macro-F1).
  • Tooling: scikit-learn or gensim; spaCy or NLTK; Hugging Face Transformers or Haystack.

Setup Instructions

  • Environment: Conda/Poetry (Python 3.11), deterministic seeds, .env (for secrets/paths, e.g., MLFLOW_TRACKING_URI, DATA_CACHE_DIR).
  • Install: pandas, scikit-learn, spaCy, Hugging Face Transformers, Datasets, MLflow, FastAPI, Uvicorn.
  • Dataset: use IMDB (small) and AG News (medium) classification datasets (HF Datasets catalog). Both have permissive licenses and train/validation/test splits.
  • Repo layout:
tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt

Core Concepts

  • Determinism in ML experiments: seeds, config files, pinned deps.
  • Reproducibility: track dataset versions, metrics, and commits.
  • Data hygiene: leakage prevention, split integrity, license notes.
  • Governance: documenting metrics tables, configs, and error analysis.
  • Guardrails: schema validation, simple checks before training or serving (see the sketch after this list).
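
A minimal sketch of the guardrails idea, assuming pydantic v2 (already in requirements.txt): the BaselineConfig model below is hypothetical and simply mirrors configs/baseline.yaml, failing fast if a field is missing or out of range.

# guardrail_check.py (hypothetical pre-flight check; run from tutorials/setup_template)
from typing import List
import yaml
from pydantic import BaseModel, Field

class TfidfConfig(BaseModel):
    max_features: int = Field(gt=0)
    ngram_range: List[int]

class ModelConfig(BaseModel):
    type: str
    C: float = Field(gt=0)
    max_iter: int = Field(gt=0)

class BaselineConfig(BaseModel):
    experiment_name: str
    dataset: str
    test_size: float = Field(gt=0, lt=1)
    random_state: int
    tfidf: TfidfConfig
    model: ModelConfig  # extra YAML keys (e.g. metrics) are ignored by default

def validate_config(path: str = "configs/baseline.yaml") -> BaselineConfig:
    with open(path) as f:
        return BaselineConfig(**yaml.safe_load(f))  # raises ValidationError on bad input

if __name__ == "__main__":
    print("Config OK:", validate_config().experiment_name)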

Step-by-Step Walkthrough

What you’ll build: a tiny app that reads text (movie reviews/news), turns it into numbers (TF-IDF), trains a simple classifier (Logistic Regression), and tracks results with MLflow. Optional: a tiny FastAPI endpoint to get predictions.

  1. Make the project folder
    Windows (PowerShell):
# Create folders
New-Item -ItemType Directory -Force -Path tutorials\setup_template\notebooks | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\src | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\configs | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\reports | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\data | Out-Null

# Create empty files we’ll fill next
New-Item tutorials\setup_template\.env.example -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\requirements.txt -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\configs\baseline.yaml -ItemType File -Force | Out-Null

macOS (Apple Silicon, zsh):

mkdir -p tutorials/setup_template/{notebooks,src,configs,reports,data}
touch tutorials/setup_template/.env.example \
      tutorials/setup_template/requirements.txt \
      tutorials/setup_template/configs/baseline.yaml

Project layout (for reference):

tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt
  2. Create and activate the Python environment (Python 3.11)

Windows (PowerShell):

conda create -n nlp311 python=3.11 -y
conda activate nlp311

macOS (zsh):

conda create -n nlp311 python=3.11 -y
conda activate nlp311
# Optional: if builds fail on Apple Silicon
python -m pip install --upgrade pip wheel setuptools
  3. Install the packages
    Open tutorials/setup_template/requirements.txt and paste:
pandas==2.2.2
scikit-learn==1.5.2
spacy==3.7.6
matplotlib==3.9.2
datasets==3.0.1
transformers==4.44.2
mlflow==2.16.2
python-dotenv==1.0.1
pyyaml==6.0.2
pydantic==2.9.2
fastapi==0.115.0
uvicorn==0.30.6

Then install:
Windows

cd tutorials\setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm

macOS:

cd tutorials/setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm
  4. Add config and environment variables
    .env.example (then copy to .env):
MLFLOW_TRACKING_URI=./mlruns
DATA_CACHE_DIR=./.hf_cache

Copy example to real file:

Windows:

Copy-Item .env.example .env -Force

macOS:

cp .env.example .env

configs/baseline.yaml (paste the following):

experiment_name: "t0_setup_template"
dataset: "imdb"         # options: imdb, ag_news
test_size: 0.2
random_state: 42
tfidf:
  max_features: 30000
  ngram_range: [1, 2]
model:
  type: "logreg"
  C: 2.0
  max_iter: 200
metrics:
  average: "macro"
  5. Add the code files
    Create and paste the code below into files inside src/
    5.1 src/utils.py
import os, random, numpy as np

def set_all_seeds(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except Exception:
        # torch is optional; ignore if not installed
        pass

def get_env(name: str, default: str = "") -> str:
    from dotenv import load_dotenv
    load_dotenv()
    return os.getenv(name, default)

5.2 src/data.py

from datasets import load_dataset
import pandas as pd
from collections import Counter

def load_text_classification(name, cache_dir=None):
    """
    Loads a Hugging Face dataset and returns 3 DataFrames:
    train_df, valid_df (or None), test_df with columns: text, label
    """
    ds = load_dataset(name, cache_dir=cache_dir)
    train_df = pd.DataFrame(ds["train"])
    test_df  = pd.DataFrame(ds["test"])
    valid_df = pd.DataFrame(ds["validation"]) if "validation" in ds else None
    return train_df, valid_df, test_df

def describe_dataset(df, text_col="text", label_col="label"):
    lengths = df[text_col].astype(str).str.split().map(len)
    counts  = Counter(df[label_col])
    return {
        "rows": len(df),
        "avg_tokens": float(lengths.mean()),
        "median_tokens": float(lengths.median()),
        "label_counts": dict(counts),
    }

5.3 src/eda.py

# Quick, beginner-friendly EDA that saves pictures into reports/
import os, yaml
import matplotlib.pyplot as plt
from src.utils import set_all_seeds, get_env
from src.data import load_text_classification, describe_dataset

def main(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    cfg = yaml.safe_load(open(cfg_path))
    cache = get_env("DATA_CACHE_DIR", "./.hf_cache")

    train_df, valid_df, test_df = load_text_classification(cfg["dataset"], cache_dir=cache)

    # 1) Print simple stats
    print("TRAIN:", describe_dataset(train_df))
    if valid_df is not None:
        print("VALID:", describe_dataset(valid_df))
    print("TEST :", describe_dataset(test_df))

    # 2) Plot token length histogram (train)
    lengths = train_df["text"].astype(str).str.split().map(len)
    plt.figure()
    lengths.hist(bins=50)
    plt.xlabel("Tokens per example"); plt.ylabel("Count"); plt.title("Token Lengths (train)")
    os.makedirs("reports", exist_ok=True)
    plt.savefig("reports/eda_token_lengths.png", dpi=160, bbox_inches="tight")
    print("Saved: reports/eda_token_lengths.png")

if __name__ == "__main__":
    main()

5.4 src/baseline.py

import os, yaml, mlflow, mlflow.sklearn
from src.utils import set_all_seeds, get_env
from src.data import load_text_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

def run_baseline(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    cfg = yaml.safe_load(open(cfg_path))

    mlflow.set_tracking_uri(get_env("MLFLOW_TRACKING_URI", "./mlruns"))
    mlflow.set_experiment(cfg["experiment_name"])

    train_df, valid_df, test_df = load_text_classification(
        cfg["dataset"], cache_dir=get_env("DATA_CACHE_DIR", "./.hf_cache")
    )

    if valid_df is None:
        train_df, valid_df = train_test_split(
            train_df, test_size=cfg["test_size"], random_state=cfg["random_state"], stratify=train_df["label"]
        )

    X_train, y_train = train_df["text"].astype(str), train_df["label"]
    X_valid, y_valid = valid_df["text"].astype(str), valid_df["label"]
    X_test,  y_test  = test_df["text"].astype(str),  test_df["label"]

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            max_features=cfg["tfidf"]["max_features"],
            ngram_range=tuple(cfg["tfidf"]["ngram_range"])
        )),
        ("clf", LogisticRegression(
            C=cfg["model"]["C"],
            max_iter=cfg["model"]["max_iter"]
        ))
    ])

    with mlflow.start_run():
        # log params
        mlflow.log_params({
            "dataset": cfg["dataset"],
            "tfidf_max_features": cfg["tfidf"]["max_features"],
            "tfidf_ngram_range": str(cfg["tfidf"]["ngram_range"]),
            "model": cfg["model"]["type"],
            "C": cfg["model"]["C"],
            "max_iter": cfg["model"]["max_iter"],
            "random_state": cfg["random_state"]
        })

        pipe.fit(X_train, y_train)
        y_pred_valid = pipe.predict(X_valid)
        y_pred_test  = pipe.predict(X_test)

        # metrics
        acc_valid = accuracy_score(y_valid, y_pred_valid)
        f1_valid  = f1_score(y_valid, y_pred_valid, average=cfg["metrics"]["average"])
        acc_test  = accuracy_score(y_test, y_pred_test)
        f1_test   = f1_score(y_test, y_pred_test, average=cfg["metrics"]["average"])

        mlflow.log_metrics({
            "valid_accuracy": acc_valid,
            "valid_f1_macro": f1_valid,
            "test_accuracy": acc_test,
            "test_f1_macro": f1_test
        })

        # save confusion matrix
        os.makedirs("reports", exist_ok=True)
        fig = ConfusionMatrixDisplay.from_predictions(y_test, y_pred_test).figure_
        fig.savefig("reports/confusion_matrix.png", dpi=180, bbox_inches="tight")
        mlflow.log_artifact("reports/confusion_matrix.png")

        # save text report
        report = classification_report(y_test, y_pred_test)
        with open("reports/classification_report.txt", "w") as f:
            f.write(report)
        mlflow.log_artifact("reports/classification_report.txt")

        # save model
        mlflow.sklearn.log_model(pipe, artifact_path="model")

        print("Validation -> acc:", acc_valid, "f1_macro:", f1_valid)
        print("Test       -> acc:", acc_test,  "f1_macro:", f1_test)
        print("\nClassification report saved at reports/classification_report.txt")

if __name__ == "__main__":
    run_baseline()

5.5 (Optional) src/serve.py

from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.pyfunc, glob, os

app = FastAPI(title="T0 Baseline Inference")

class InferRequest(BaseModel):
    text: str

def _latest_model_path():
    # Find the most recently written model artifact saved by MLflow locally
    candidates = glob.glob("mlruns/*/*/artifacts/model")
    if not candidates:
        raise RuntimeError("No model artifacts found. Run the baseline first.")
    return max(candidates, key=os.path.getmtime)

@app.post("/infer")
def infer(payload: InferRequest):
    model = mlflow.pyfunc.load_model(_latest_model_path())
    pred = model.predict([payload.text])
    return {"label": int(pred[0])}
  6. Run it: EDA → Baseline → MLflow
    6.1 EDA (quick checks)
    Windows:
$env:MLFLOW_TRACKING_URI = ".\mlruns"
$env:DATA_CACHE_DIR = ".\.hf_cache"
# run as a module so the `from src...` imports resolve
python -m src.eda

macOS:

export MLFLOW_TRACKING_URI=./mlruns
export DATA_CACHE_DIR=./.hf_cache
# run as a module so the `from src...` imports resolve
python -m src.eda

You’ll see dataset stats printed and an image saved to reports/eda_token_lengths.png.
6.2 Train the baseline + log to MLflow
Windows:

python -m src.baseline
mlflow ui --backend-store-uri ".\mlruns" --host 127.0.0.1 --port 5000

macOS:

python -m src.baseline
mlflow ui --backend-store-uri ./mlruns --host 127.0.0.1 --port 5000

Open http://127.0.0.1:5000/ → you’ll see your run, parameters, metrics, and artifacts (confusion_matrix.png, classification_report.txt, model).
Expectations: TF-IDF + Logistic Regression usually reaches a solid accuracy on IMDB (often ~0.85–0.90). AG News is a 4-class task, so its numbers will differ and are not directly comparable to the binary IMDB scores.

  7. Optional: Run a tiny API for inference
    Start the server:
# Same command on macOS (zsh) and Windows (PowerShell)
uvicorn src.serve:app --host 127.0.0.1 --port 8000 --reload

Test it (one example):
macOS (curl):

curl -X POST http://127.0.0.1:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"text":"A surprisingly heartfelt and funny movie."}'

Windows (PowerShell):

$body = @{ text = "A surprisingly heartfelt and funny movie." } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/infer -ContentType "application/json" -Body $body
  8. Switch dataset and re-run (practice)
    Change configs/baseline.yaml:
dataset: "ag_news"

Then re-run steps 6.1 and 6.2. Compare metrics in MLflow.
This teaches that different tasks/datasets change difficulty and results.

Tiny glossary (for absolute beginners)

  • Token: a piece of text, usually a word.
  • TF-IDF: a way to turn text into numbers by counting words and down-weighting common ones.
  • Logistic Regression: a simple, reliable classifier.
  • Train / Validation / Test: train the model, tune it on validation, and report final scores on test.
  • Accuracy: how often predictions are correct.
  • Macro-F1: balances precision/recall across classes; good when classes are uneven (a tiny worked example follows).
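
To see the difference between accuracy and macro-F1, here is a self-contained toy example with made-up labels (not drawn from IMDB or AG News):

# Imbalanced toy data: 8 negatives, 2 positives; the "classifier" always predicts 0.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print("accuracy:", accuracy_score(y_true, y_pred))                              # 0.8, looks decent
print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.44, exposes the ignored class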

Setup Troubleshooting

  • Install fails on Mac M-series: run python -m pip install --upgrade pip wheel setuptools and try again.
  • spaCy model error: run python -m spacy download en_core_web_sm.
  • MLflow UI empty: make sure you ran src/baseline.py before opening the UI.
  • No model found for API: run the baseline once to create a model artifact.

Tutorial Milestones

  1. Environment setup: Conda/Poetry, .env, fixed seeds.
  2. Dataset load: download IMDB/AG News, verify splits, save schema in data/README.md.
  3. EDA: class balance, token length distributions.
  4. Baseline sanity: TF-IDF + Logistic Regression, log metrics table.
  5. Experiment governance: config YAML for hyperparams, metrics logging to MLflow.
  6. Reporting: generate reports/setup_template.md from the notebook with nbconvert (see the sketch after this list).
  7. (Optional) Serve: demo FastAPI endpoint for inference with schema validation.
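
A minimal reporting sketch, assuming a notebook named notebooks/setup_template.ipynb (a placeholder) and that nbconvert/nbformat are installed (they are not in requirements.txt):

# report.py (hypothetical): export the walkthrough notebook to Markdown
import nbformat
from nbconvert import MarkdownExporter

nb = nbformat.read("notebooks/setup_template.ipynb", as_version=4)
body, _ = MarkdownExporter().from_notebook_node(nb)
with open("reports/setup_template.md", "w") as f:
    f.write(body)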

Hands-On Exercises

  • Try both datasets (IMDB vs AG News) and compare reproducibility logs.
  • Add noise (duplicates, shuffle seeds) to test determinism.
  • Run ablations: turn off seed fixing, compare reproducibility (see the sketch after this list).
  • Stretch: connect MLflow run metadata to Weights & Biases.
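
A tiny determinism check, sketched under the assumption that identical seeds should give identical predictions (the texts and labels below are made up):

# determinism_check.py (hypothetical helper, not one of the tutorial files)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["great movie", "terrible plot", "loved it", "boring and slow"]
labels = [1, 0, 1, 0]

def fit_and_predict(seed: int):
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(random_state=seed, max_iter=200)),
    ])
    pipe.fit(texts, labels)
    return pipe.predict(texts)

# Same seed twice: expect identical predictions from a deterministic pipeline.
assert np.array_equal(fit_and_predict(42), fit_and_predict(42)), "non-deterministic run"
print("Deterministic: repeated runs produced identical predictions")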

Common Pitfalls & Troubleshooting

  • Forgetting to set seeds → non-reproducible results.
  • Data leakage from overlapping splits (a quick overlap check is sketched below).
  • Unpinned dependencies breaking reproducibility.
  • Missing .env → misconfigured paths and secrets.
  • CI not running → unchecked notebook failures.
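
A quick leakage check, assuming the DataFrames returned by src/data.py's load_text_classification (run it from tutorials/setup_template, e.g. via python -m, or inside a notebook):

# Hypothetical snippet: flag identical texts shared between the train and test splits.
from src.data import load_text_classification

train_df, _, test_df = load_text_classification("imdb")
overlap = set(train_df["text"]) & set(test_df["text"])
print(f"train/test overlap: {len(overlap)} identical texts")
if overlap:
    print("Possible leakage: inspect or deduplicate before training.")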

Best Practices

  • Always log the commit hash, dataset version, and config (see the sketch after this list).
  • PR checklist: metrics ≥ baseline, README updated, tests green.
  • Write unit tests for tokenization, vectorization, and schema validation.
  • Keep seed-fixing utilities in src/utils.py.
  • Separate experiments (configs/notebooks) from reporting.
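
A minimal sketch of commit-hash logging that could sit inside run_baseline's mlflow.start_run() block; it assumes git is on PATH and the project is a git repository:

import subprocess

def current_commit() -> str:
    # Return the current git commit hash, or "unknown" outside a repo.
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        return "unknown"

# Example usage inside `with mlflow.start_run():` in src/baseline.py:
#     mlflow.set_tag("git_commit", current_commit())
#     mlflow.log_artifact(cfg_path)  # snapshot the exact config used for the run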

Reflection & Discussion Prompts

  • Why does reproducibility matter in civic-tech / applied NLP projects?
  • What’s the tradeoff between fast iteration and strict reproducibility?
  • How might governance differ in regulated vs open-data contexts?

Next Steps / Advanced Extensions

  • Automate report generation in CI.
  • Introduce containerized reproducibility (Docker).
  • Connect experiment tracking with deployment logs.
  • Move from IMDB/AG News to a civic dataset (e.g., 311 complaints).

Glossary / Key Terms

  • Reproducibility: the ability to re-run an experiment and get identical results.
  • Data leakage: unintended information shared between training and evaluation splits.
  • Seed fixing: controlling randomness across frameworks.
  • Governance: tracking configs, metrics, and artifacts.

Additional Resources

Contributors

Author(s): TBD
Reviewer(s): TBD
Maintainer(s): TBD
Date updated: 2025-09-20
Dataset license: IMDB (ACL), AG News (Creative Commons).

Issues Referenced

Epic: HfLA Text Analysis Tutorials (T0–T14).
This sub-issue: T0 Setup & Template Walkthrough.


Notes:
I chose IMDB (small binary classification) and AG News (medium 4-class classification) because they are light enough for setup/debug, yet distinct in size and task complexity. Both test the scaffolding under different load conditions. For governance, I leaned on MLflow for run-tracking (simpler than W&B but extensible). The FastAPI step is optional but sets the stage for later deployment tutorials.

