Text Analysis Tutorial: Setup & Template Walkthrough #244

@chinaexpert1

Description

Got it — we’ll start with T0: Setup & Template Walkthrough and instantiate it into the tutorial sub-issue format. I’ll fill in the details and add a Notes field at the end for context and possible improvements. Here’s the draft:


Title & Overview

Template: Setup & Template Walkthrough: An Intermediate, End-to-End Analysis Tutorial
Overview (≤2 sentences): This tutorial introduces learners to a standardized workflow for NLP experiments, covering environment setup, data hygiene, reproducibility, and report generation. It is intermediate because it emphasizes rigorous experiment scaffolding and governance rather than just “getting a model to run.”

Purpose

The value-add is building defensible, reproducible baselines while setting up a robust project structure. Learners move beyond single notebooks toward versioned experiments, config management, error analysis scaffolding, and light reporting/serving foundations.

Prerequisites

  • Skills: Python, Git, virtual envs; basic NumPy/pandas; ML basics (train/val/test, overfitting, regularization).
  • NLP: tokenization (wordpiece/BPE), embeddings vs TF-IDF, evaluation metrics (accuracy, macro-F1).
  • Tooling: scikit-learn or gensim; spaCy or NLTK; Hugging Face Transformers or Haystack.

Setup Instructions

  • Environment: Conda/Poetry (Python 3.11), deterministic seeds, .env (for secrets/paths, e.g., MLFLOW_TRACKING_URI, DATA_CACHE_DIR).
  • Install: pandas, scikit-learn, spaCy, Hugging Face Transformers, Datasets, MLflow, FastAPI, Uvicorn.
  • Dataset: use IMDB (small) and AG News (medium) classification datasets (HF Datasets catalog). Both have permissive licenses and train/validation/test splits.
  • Repo layout:
tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt

Core Concepts

  • Determinism in ML experiments: seeds, config files, pinned deps.
  • Reproducibility: track dataset versions, metrics, and commits.
  • Data hygiene: leakage prevention, split integrity, license notes.
  • Governance: documenting metrics tables, configs, and error analysis.
  • Guardrails: schema validation, simple checks before training or serving (see the sketch after this list).
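
A minimal sketch of the guardrails idea, assuming pydantic v2 (already in requirements.txt): the BaselineConfig model below is hypothetical and simply mirrors configs/baseline.yaml, failing fast if a field is missing or out of range.

# guardrail_check.py (hypothetical pre-flight check; run from tutorials/setup_template)
from typing import List
import yaml
from pydantic import BaseModel, Field

class TfidfConfig(BaseModel):
    max_features: int = Field(gt=0)
    ngram_range: List[int]

class ModelConfig(BaseModel):
    type: str
    C: float = Field(gt=0)
    max_iter: int = Field(gt=0)

class BaselineConfig(BaseModel):
    experiment_name: str
    dataset: str
    test_size: float = Field(gt=0, lt=1)
    random_state: int
    tfidf: TfidfConfig
    model: ModelConfig  # extra YAML keys (e.g. metrics) are ignored by default

def validate_config(path: str = "configs/baseline.yaml") -> BaselineConfig:
    with open(path) as f:
        return BaselineConfig(**yaml.safe_load(f))  # raises ValidationError on bad input

if __name__ == "__main__":
    print("Config OK:", validate_config().experiment_name)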

Step-by-Step Walkthrough

What you’ll build: a tiny app that reads text (movie reviews/news), turns it into numbers (TF-IDF), trains a simple classifier (Logistic Regression), and tracks results with MLflow. Optional: a tiny FastAPI endpoint to get predictions.

  1. Make the project folder
    Windows (PowerShell):
# Create folders
New-Item -ItemType Directory -Force -Path tutorials\setup_template\notebooks | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\src | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\configs | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\reports | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\data | Out-Null

# Create empty files we’ll fill next
New-Item tutorials\setup_template\.env.example -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\requirements.txt -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\configs\baseline.yaml -ItemType File -Force | Out-Null

macOS (Apple Silicon, zsh):

mkdir -p tutorials/setup_template/{notebooks,src,configs,reports,data}
touch tutorials/setup_template/.env.example \
      tutorials/setup_template/requirements.txt \
      tutorials/setup_template/configs/baseline.yaml

Project layout (for reference):

tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt
  2. Create and activate the Python environment (Python 3.11)

Windows (PowerShell):

conda create -n nlp311 python=3.11 -y
conda activate nlp311

macOS (zsh):

conda create -n nlp311 python=3.11 -y
conda activate nlp311
# Optional: if builds fail on Apple Silicon
python -m pip install --upgrade pip wheel setuptools
  3. Install the packages
    Open tutorials/setup_template/requirements.txt and paste:
pandas==2.2.2
scikit-learn==1.5.2
spacy==3.7.6
matplotlib==3.9.2
datasets==3.0.1
transformers==4.44.2
mlflow==2.16.2
python-dotenv==1.0.1
pyyaml==6.0.2
pydantic==2.9.2
fastapi==0.115.0
uvicorn==0.30.6

Then install:
Windows

cd tutorials\setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm

macOS:

cd tutorials/setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm
  4. Add config and environment variables
    .env.example (then copy to .env):
MLFLOW_TRACKING_URI=./mlruns
DATA_CACHE_DIR=./.hf_cache

Copy example to real file:

Windows:

Copy-Item .env.example .env -Force

macOS:

cp .env.example .env

configs/baseline.yaml (paste the following):

experiment_name: "t0_setup_template"
dataset: "imdb"         # options: imdb, ag_news
test_size: 0.2
random_state: 42
tfidf:
  max_features: 30000
  ngram_range: [1, 2]
model:
  type: "logreg"
  C: 2.0
  max_iter: 200
metrics:
  average: "macro"
  5. Add the code files
    Create and paste the code below into files inside src/
    5.1 src/utils.py
import os, random, numpy as np

def set_all_seeds(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except Exception:
        # torch is optional; ignore if not installed
        pass

def get_env(name: str, default: str = "") -> str:
    from dotenv import load_dotenv
    load_dotenv()
    return os.getenv(name, default)

5.2 src/data.py

from datasets import load_dataset
import pandas as pd
from collections import Counter

def load_text_classification(name, cache_dir=None):
    """
    Loads a Hugging Face dataset and returns 3 DataFrames:
    train_df, valid_df (or None), test_df with columns: text, label
    """
    ds = load_dataset(name, cache_dir=cache_dir)
    train_df = pd.DataFrame(ds["train"])
    test_df  = pd.DataFrame(ds["test"])
    valid_df = pd.DataFrame(ds["validation"]) if "validation" in ds else None
    return train_df, valid_df, test_df

def describe_dataset(df, text_col="text", label_col="label"):
    lengths = df[text_col].astype(str).str.split().map(len)
    counts  = Counter(df[label_col])
    return {
        "rows": len(df),
        "avg_tokens": float(lengths.mean()),
        "median_tokens": float(lengths.median()),
        "label_counts": dict(counts),
    }

5.3 src/eda.py

# Quick, beginner-friendly EDA that saves pictures into reports/
import os, yaml
import matplotlib.pyplot as plt
from src.utils import set_all_seeds, get_env
from src.data import load_text_classification, describe_dataset

def main(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    cfg = yaml.safe_load(open(cfg_path))
    cache = get_env("DATA_CACHE_DIR", "./.hf_cache")

    train_df, valid_df, test_df = load_text_classification(cfg["dataset"], cache_dir=cache)

    # 1) Print simple stats
    print("TRAIN:", describe_dataset(train_df))
    if valid_df is not None:
        print("VALID:", describe_dataset(valid_df))
    print("TEST :", describe_dataset(test_df))

    # 2) Plot token length histogram (train)
    lengths = train_df["text"].astype(str).str.split().map(len)
    plt.figure()
    lengths.hist(bins=50)
    plt.xlabel("Tokens per example"); plt.ylabel("Count"); plt.title("Token Lengths (train)")
    os.makedirs("reports", exist_ok=True)
    plt.savefig("reports/eda_token_lengths.png", dpi=160, bbox_inches="tight")
    print("Saved: reports/eda_token_lengths.png")

if __name__ == "__main__":
    main()

5.4 src/baseline.py

import os, yaml, mlflow, mlflow.sklearn
from src.utils import set_all_seeds, get_env
from src.data import load_text_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

def run_baseline(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    cfg = yaml.safe_load(open(cfg_path))

    mlflow.set_tracking_uri(get_env("MLFLOW_TRACKING_URI", "./mlruns"))
    mlflow.set_experiment(cfg["experiment_name"])

    train_df, valid_df, test_df = load_text_classification(
        cfg["dataset"], cache_dir=get_env("DATA_CACHE_DIR", "./.hf_cache")
    )

    if valid_df is None:
        train_df, valid_df = train_test_split(
            train_df, test_size=cfg["test_size"], random_state=cfg["random_state"], stratify=train_df["label"]
        )

    X_train, y_train = train_df["text"].astype(str), train_df["label"]
    X_valid, y_valid = valid_df["text"].astype(str), valid_df["label"]
    X_test,  y_test  = test_df["text"].astype(str),  test_df["label"]

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            max_features=cfg["tfidf"]["max_features"],
            ngram_range=tuple(cfg["tfidf"]["ngram_range"])
        )),
        ("clf", LogisticRegression(
            C=cfg["model"]["C"],
            max_iter=cfg["model"]["max_iter"]
        ))
    ])

    with mlflow.start_run():
        # log params
        mlflow.log_params({
            "dataset": cfg["dataset"],
            "tfidf_max_features": cfg["tfidf"]["max_features"],
            "tfidf_ngram_range": str(cfg["tfidf"]["ngram_range"]),
            "model": cfg["model"]["type"],
            "C": cfg["model"]["C"],
            "max_iter": cfg["model"]["max_iter"],
            "random_state": cfg["random_state"]
        })

        pipe.fit(X_train, y_train)
        y_pred_valid = pipe.predict(X_valid)
        y_pred_test  = pipe.predict(X_test)

        # metrics
        acc_valid = accuracy_score(y_valid, y_pred_valid)
        f1_valid  = f1_score(y_valid, y_pred_valid, average=cfg["metrics"]["average"])
        acc_test  = accuracy_score(y_test, y_pred_test)
        f1_test   = f1_score(y_test, y_pred_test, average=cfg["metrics"]["average"])

        mlflow.log_metrics({
            "valid_accuracy": acc_valid,
            "valid_f1_macro": f1_valid,
            "test_accuracy": acc_test,
            "test_f1_macro": f1_test
        })

        # save confusion matrix
        os.makedirs("reports", exist_ok=True)
        fig = ConfusionMatrixDisplay.from_predictions(y_test, y_pred_test).figure_
        fig.savefig("reports/confusion_matrix.png", dpi=180, bbox_inches="tight")
        mlflow.log_artifact("reports/confusion_matrix.png")

        # save text report
        report = classification_report(y_test, y_pred_test)
        with open("reports/classification_report.txt", "w") as f:
            f.write(report)
        mlflow.log_artifact("reports/classification_report.txt")

        # save model
        mlflow.sklearn.log_model(pipe, artifact_path="model")

        print("Validation -> acc:", acc_valid, "f1_macro:", f1_valid)
        print("Test       -> acc:", acc_test,  "f1_macro:", f1_test)
        print("\nClassification report saved at reports/classification_report.txt")

if __name__ == "__main__":
    run_baseline()

5.5 (Optional) src/serve.py

from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.pyfunc, glob, os

app = FastAPI(title="T0 Baseline Inference")

class InferRequest(BaseModel):
    text: str

def _latest_model_path():
    # Find the most recently written model artifact saved by MLflow locally
    candidates = glob.glob("mlruns/*/*/artifacts/model")
    if not candidates:
        raise RuntimeError("No model artifacts found. Run the baseline first.")
    return max(candidates, key=os.path.getmtime)

@app.post("/infer")
def infer(payload: InferRequest):
    model = mlflow.pyfunc.load_model(_latest_model_path())
    pred = model.predict([payload.text])
    return {"label": int(pred[0])}
  6. Run it: EDA → Baseline → MLflow
    6.1 EDA (quick checks)
    Windows:
$env:MLFLOW_TRACKING_URI = ".\mlruns"
$env:DATA_CACHE_DIR = ".\.hf_cache"
# run as a module so the `from src...` imports resolve
python -m src.eda

macOS:

export MLFLOW_TRACKING_URI=./mlruns
export DATA_CACHE_DIR=./.hf_cache
# run as a module so the `from src...` imports resolve
python -m src.eda

You’ll see dataset stats printed and an image saved to reports/eda_token_lengths.png.
6.2 Train the baseline + log to MLflow
Windows:

python -m src.baseline
mlflow ui --backend-store-uri ".\mlruns" --host 127.0.0.1 --port 5000

macOS:

python -m src.baseline
mlflow ui --backend-store-uri ./mlruns --host 127.0.0.1 --port 5000

Open http://127.0.0.1:5000/ → you’ll see your run, parameters, metrics, and artifacts (confusion_matrix.png, classification_report.txt, model).
Expectations: TF-IDF + Logistic Regression usually reaches a solid accuracy on IMDB (often ~0.85–0.90). AG News is a 4-class task, so its numbers will differ and are not directly comparable to the binary IMDB scores.

  7. Optional: Run a tiny API for inference
    Start the server:
# Same command on macOS (zsh) and Windows (PowerShell)
uvicorn src.serve:app --host 127.0.0.1 --port 8000 --reload

Test it (one example):
macOS (curl):

curl -X POST http://127.0.0.1:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"text":"A surprisingly heartfelt and funny movie."}'

Windows (PowerShell):

$body = @{ text = "A surprisingly heartfelt and funny movie." } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/infer -ContentType "application/json" -Body $body
  8. Switch dataset and re-run (practice)
    Change configs/baseline.yaml:
dataset: "ag_news"

Then re-run steps 6.1 and 6.2. Compare metrics in MLflow.
This teaches that different tasks/datasets change difficulty and results.

Tiny glossary (for absolute beginners)

  • Token: a piece of text, usually a word.
  • TF-IDF: a way to turn text into numbers by counting words and down-weighting common ones.
  • Logistic Regression: a simple, reliable classifier.
  • Train / Validation / Test: train the model, tune it on validation, and report final scores on test.
  • Accuracy: how often predictions are correct.
  • Macro-F1: balances precision/recall across classes; good when classes are uneven (a tiny worked example follows).
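
To see the difference between accuracy and macro-F1, here is a self-contained toy example with made-up labels (not drawn from IMDB or AG News):

# Imbalanced toy data: 8 negatives, 2 positives; the "classifier" always predicts 0.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print("accuracy:", accuracy_score(y_true, y_pred))                              # 0.8, looks decent
print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.44, exposes the ignored class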

Setup Troubleshooting

  • Install fails on Mac M-series: run python -m pip install --upgrade pip wheel setuptools and try again.
  • spaCy model error: run python -m spacy download en_core_web_sm.
  • MLflow UI empty: make sure you ran src/baseline.py before opening the UI.
  • No model found for API: run the baseline once to create a model artifact.

Tutorial Milestones

  1. Environment setup: Conda/Poetry, .env, fixed seeds.
  2. Dataset load: download IMDB/AG News, verify splits, save schema in data/README.md.
  3. EDA: class balance, token length distributions.
  4. Baseline sanity: TF-IDF + Logistic Regression, log metrics table.
  5. Experiment governance: config YAML for hyperparams, metrics logging to MLflow.
  6. Reporting: generate reports/setup_template.md from the notebook with nbconvert (see the sketch after this list).
  7. (Optional) Serve: demo FastAPI endpoint for inference with schema validation.
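
A minimal reporting sketch, assuming a notebook named notebooks/setup_template.ipynb (a placeholder) and that nbconvert/nbformat are installed (they are not in requirements.txt):

# report.py (hypothetical): export the walkthrough notebook to Markdown
import nbformat
from nbconvert import MarkdownExporter

nb = nbformat.read("notebooks/setup_template.ipynb", as_version=4)
body, _ = MarkdownExporter().from_notebook_node(nb)
with open("reports/setup_template.md", "w") as f:
    f.write(body)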

Hands-On Exercises

  • Try both datasets (IMDB vs AG News) and compare reproducibility logs.
  • Add noise (duplicates, shuffle seeds) to test determinism.
  • Run ablations: turn off seed fixing, compare reproducibility (see the sketch after this list).
  • Stretch: connect MLflow run metadata to Weights & Biases.
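
A tiny determinism check, sketched under the assumption that identical seeds should give identical predictions (the texts and labels below are made up):

# determinism_check.py (hypothetical helper, not one of the tutorial files)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["great movie", "terrible plot", "loved it", "boring and slow"]
labels = [1, 0, 1, 0]

def fit_and_predict(seed: int):
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(random_state=seed, max_iter=200)),
    ])
    pipe.fit(texts, labels)
    return pipe.predict(texts)

# Same seed twice: expect identical predictions from a deterministic pipeline.
assert np.array_equal(fit_and_predict(42), fit_and_predict(42)), "non-deterministic run"
print("Deterministic: repeated runs produced identical predictions")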

Common Pitfalls & Troubleshooting

  • Forgetting to set seeds → non-reproducible results.
  • Data leakage from overlapping splits (a quick overlap check is sketched below).
  • Unpinned dependencies breaking reproducibility.
  • Missing .env → misconfigured paths and secrets.
  • CI not running → unchecked notebook failures.
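
A quick leakage check, assuming the DataFrames returned by src/data.py's load_text_classification (run it from tutorials/setup_template, e.g. via python -m, or inside a notebook):

# Hypothetical snippet: flag identical texts shared between the train and test splits.
from src.data import load_text_classification

train_df, _, test_df = load_text_classification("imdb")
overlap = set(train_df["text"]) & set(test_df["text"])
print(f"train/test overlap: {len(overlap)} identical texts")
if overlap:
    print("Possible leakage: inspect or deduplicate before training.")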

Best Practices

  • Always log the commit hash, dataset version, and config (see the sketch after this list).
  • PR checklist: metrics ≥ baseline, README updated, tests green.
  • Write unit tests for tokenization, vectorization, and schema validation.
  • Keep seed-fixing utilities in src/utils.py.
  • Separate experiments (configs/notebooks) from reporting.
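
A minimal sketch of commit-hash logging that could sit inside run_baseline's mlflow.start_run() block; it assumes git is on PATH and the project is a git repository:

import subprocess

def current_commit() -> str:
    # Return the current git commit hash, or "unknown" outside a repo.
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        return "unknown"

# Example usage inside `with mlflow.start_run():` in src/baseline.py:
#     mlflow.set_tag("git_commit", current_commit())
#     mlflow.log_artifact(cfg_path)  # snapshot the exact config used for the run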

Reflection & Discussion Prompts

  • Why does reproducibility matter in civic-tech / applied NLP projects?
  • What’s the tradeoff between fast iteration and strict reproducibility?
  • How might governance differ in regulated vs open-data contexts?

Next Steps / Advanced Extensions

  • Automate report generation in CI.
  • Introduce containerized reproducibility (Docker).
  • Connect experiment tracking with deployment logs.
  • Move from IMDB/AG News to a civic dataset (e.g., 311 complaints).

Glossary / Key Terms

  • Reproducibility: the ability to re-run an experiment and get identical results.
  • Data leakage: unintended information shared between training and evaluation splits.
  • Seed fixing: controlling randomness across frameworks.
  • Governance: tracking configs, metrics, and artifacts.

Additional Resources

Contributors

Author(s): TBD
Reviewer(s): TBD
Maintainer(s): TBD
Date updated: 2025-09-20
Dataset license: IMDB (ACL), AG News (Creative Commons).

Issues Referenced

Epic: HfLA Text Analysis Tutorials (T0–T14).
This sub-issue: T0 Setup & Template Walkthrough.


Notes:
I chose IMDB (small binary classification) and AG News (medium 4-class classification) because they are light enough for setup/debug, yet distinct in size and task complexity. Both test the scaffolding under different load conditions. For governance, I leaned on MLflow for run-tracking (simpler than W&B but extensible). The FastAPI step is optional but sets the stage for later deployment tutorials.

