Got it — we’ll start with T0: Setup & Template Walkthrough and instantiate it into the tutorial sub-issue format. I’ll fill in the details and add a Notes field at the end for context and possible improvements. Here’s the draft:
Title & Overview
Template: Setup & Template Walkthrough: An Intermediate, End-to-End Analysis Tutorial
Overview (≤2 sentences): This tutorial introduces learners to a standardized workflow for NLP experiments, covering environment setup, data hygiene, reproducibility, and report generation. It is intermediate because it emphasizes rigorous experiment scaffolding and governance rather than just “getting a model to run.”
Purpose
The value-add is building defensible, reproducible baselines while setting up a robust project structure. Learners move beyond single notebooks toward versioned experiments, config management, error analysis scaffolding, and light reporting/serving foundations.
Prerequisites
- Skills: Python, Git, virtual envs; basic NumPy/pandas; ML basics (train/val/test, overfitting, regularization).
- NLP: tokenization (wordpiece/BPE), embeddings vs TF-IDF, evaluation metrics (accuracy, macro-F1).
- Tooling: scikit-learn or gensim; spaCy or NLTK; Hugging Face Transformers or Haystack.
Setup Instructions
- Environment: Conda/Poetry (Python 3.11), deterministic seeds, `.env` for secrets/paths (e.g., `MLFLOW_TRACKING_URI`, `DATA_CACHE_DIR`).
- Install: pandas, scikit-learn, spaCy, Hugging Face Transformers, Datasets, MLflow, FastAPI, Uvicorn.
- Dataset: use IMDB (small) and AG News (medium) classification datasets (HF Datasets catalog). Both have permissive licenses and train/validation/test splits.
- Repo layout:
```
tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt
```
Core Concepts
- Determinism in ML experiments: seeds, config files, pinned deps.
- Reproducibility: track dataset versions, metrics, and commits.
- Data hygiene: leakage prevention, split integrity, license notes.
- Governance: documenting metrics tables, configs, and error analysis.
- Guardrails: schema validation, simple checks before training or serving.
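The guardrail idea can be sketched without any framework: a minimal pre-training schema check in plain Python. The tutorial pins pydantic, which expresses the same checks declaratively; `validate_rows` here is a hypothetical helper, not part of the repo.

```python
def validate_rows(rows):
    """Minimal schema guardrail: every row needs non-empty text and an integer label."""
    for i, row in enumerate(rows):
        if not isinstance(row.get("text"), str) or not row["text"].strip():
            raise ValueError(f"row {i}: 'text' must be a non-empty string")
        if not isinstance(row.get("label"), int):
            raise ValueError(f"row {i}: 'label' must be an int")
    return rows
```

Run it on each split before fitting; failing fast here is cheaper than debugging a trained model.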
Step-by-Step Walkthrough
What you’ll build: a tiny app that reads text (movie reviews/news), turns it into numbers (TF-IDF), trains a simple classifier (Logistic Regression), and tracks results with MLflow. Optional: a tiny FastAPI endpoint to get predictions.
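To make "turns text into numbers" concrete, here is a toy TF-IDF in pure Python. It is only an illustration; the real pipeline uses scikit-learn's `TfidfVectorizer`, whose exact formula differs (smoothed IDF, vector normalization).

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Weight each term by its frequency in a doc, discounted by how many docs contain it."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))  # document frequency
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (tf[t] / len(tokens)) * math.log(n / df[t]) for t in tf})
    return vectors
```

A word like "movie" that appears in every document gets weight 0 (log(n/n) = 0), while distinctive words keep positive weight.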
1. Make the project folder

Windows (PowerShell):

```powershell
# Create folders
New-Item -ItemType Directory -Force -Path tutorials\setup_template\notebooks | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\src | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\configs | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\reports | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\data | Out-Null

# Create empty files we’ll fill next
New-Item tutorials\setup_template\.env.example -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\requirements.txt -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\configs\baseline.yaml -ItemType File -Force | Out-Null
```

macOS (Apple Silicon, zsh):

```zsh
mkdir -p tutorials/setup_template/{notebooks,src,configs,reports,data}
touch tutorials/setup_template/.env.example \
      tutorials/setup_template/requirements.txt \
      tutorials/setup_template/configs/baseline.yaml
```

Project layout (for reference):
```
tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt
```
2. Create and activate the Python environment (Python 3.11)

Windows (PowerShell):

```powershell
conda create -n nlp311 python=3.11 -y
conda activate nlp311
```

macOS (zsh):

```zsh
conda create -n nlp311 python=3.11 -y
conda activate nlp311

# Optional: if builds fail on Apple Silicon
python -m pip install --upgrade pip wheel setuptools
```

3. Install the packages

Open `tutorials/setup_template/requirements.txt` and paste:
```
pandas==2.2.2
scikit-learn==1.5.2
spacy==3.7.6
matplotlib==3.9.2
datasets==3.0.1
transformers==4.44.2
mlflow==2.16.2
python-dotenv==1.0.1
pydantic==2.9.2
fastapi==0.115.0
uvicorn==0.30.6
```
Then install:

Windows (PowerShell):

```powershell
cd tutorials\setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

macOS (zsh):

```zsh
cd tutorials/setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

4. Add config and environment variables

`.env.example` (then copy to `.env`):

```
MLFLOW_TRACKING_URI=./mlruns
DATA_CACHE_DIR=./.hf_cache
```

Copy the example to the real file:

Windows (PowerShell):

```powershell
Copy-Item .env.example .env -Force
```

macOS (zsh):

```zsh
cp .env.example .env
```

`configs/baseline.yaml` — paste:
```yaml
experiment_name: "t0_setup_template"
dataset: "imdb"        # options: imdb, ag_news
test_size: 0.2
random_state: 42
tfidf:
  max_features: 30000
  ngram_range: [1, 2]
model:
  type: "logreg"
  C: 2.0
  max_iter: 200
metrics:
  average: "macro"
```

5. Add the code files

Create the files below inside `src/` and paste in the code.
5.1 `src/utils.py`

```python
import os
import random

import numpy as np


def set_all_seeds(seed: int = 42):
    """Fix RNG seeds across libraries for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except Exception:
        # torch is optional; ignore if not installed
        pass


def get_env(name: str, default: str = "") -> str:
    """Read a variable from the environment, loading .env first."""
    from dotenv import load_dotenv
    load_dotenv()
    return os.getenv(name, default)
```

5.2 `src/data.py`
```python
from collections import Counter

import pandas as pd
from datasets import load_dataset


def load_text_classification(name, cache_dir=None):
    """
    Load a Hugging Face dataset and return three DataFrames:
    train_df, valid_df (or None), test_df with columns: text, label.
    """
    ds = load_dataset(name, cache_dir=cache_dir)
    train_df = pd.DataFrame(ds["train"])
    test_df = pd.DataFrame(ds["test"])
    valid_df = pd.DataFrame(ds["validation"]) if "validation" in ds else None
    return train_df, valid_df, test_df


def describe_dataset(df, text_col="text", label_col="label"):
    """Return simple size, length, and class-balance stats."""
    lengths = df[text_col].astype(str).str.split().map(len)
    counts = Counter(df[label_col])
    return {
        "rows": len(df),
        "avg_tokens": float(lengths.mean()),
        "median_tokens": float(lengths.median()),
        "label_counts": dict(counts),
    }
```

5.3 `src/eda.py`
```python
# Quick, beginner-friendly EDA that saves figures into reports/
import os

import matplotlib.pyplot as plt
import yaml

from src.data import describe_dataset, load_text_classification
from src.utils import get_env, set_all_seeds


def main(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    cache = get_env("DATA_CACHE_DIR", "./.hf_cache")
    train_df, valid_df, test_df = load_text_classification(cfg["dataset"], cache_dir=cache)

    # 1) Print simple stats
    print("TRAIN:", describe_dataset(train_df))
    if valid_df is not None:
        print("VALID:", describe_dataset(valid_df))
    print("TEST :", describe_dataset(test_df))

    # 2) Plot token length histogram (train)
    lengths = train_df["text"].astype(str).str.split().map(len)
    plt.figure()
    lengths.hist(bins=50)
    plt.xlabel("Tokens per example")
    plt.ylabel("Count")
    plt.title("Token Lengths (train)")
    os.makedirs("reports", exist_ok=True)
    plt.savefig("reports/eda_token_lengths.png", dpi=160, bbox_inches="tight")
    print("Saved: reports/eda_token_lengths.png")


if __name__ == "__main__":
    main()
```

5.4 `src/baseline.py`
```python
import os

import mlflow
import mlflow.sklearn
import yaml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    f1_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from src.data import load_text_classification
from src.utils import get_env, set_all_seeds


def run_baseline(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    mlflow.set_tracking_uri(get_env("MLFLOW_TRACKING_URI", "./mlruns"))
    mlflow.set_experiment(cfg["experiment_name"])

    train_df, valid_df, test_df = load_text_classification(
        cfg["dataset"], cache_dir=get_env("DATA_CACHE_DIR", "./.hf_cache")
    )
    if valid_df is None:
        train_df, valid_df = train_test_split(
            train_df,
            test_size=cfg["test_size"],
            random_state=cfg["random_state"],
            stratify=train_df["label"],
        )

    X_train, y_train = train_df["text"].astype(str), train_df["label"]
    X_valid, y_valid = valid_df["text"].astype(str), valid_df["label"]
    X_test, y_test = test_df["text"].astype(str), test_df["label"]

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            max_features=cfg["tfidf"]["max_features"],
            ngram_range=tuple(cfg["tfidf"]["ngram_range"]),
        )),
        ("clf", LogisticRegression(
            C=cfg["model"]["C"],
            max_iter=cfg["model"]["max_iter"],
        )),
    ])

    with mlflow.start_run():
        # log params
        mlflow.log_params({
            "dataset": cfg["dataset"],
            "tfidf_max_features": cfg["tfidf"]["max_features"],
            "tfidf_ngram_range": str(cfg["tfidf"]["ngram_range"]),
            "model": cfg["model"]["type"],
            "C": cfg["model"]["C"],
            "max_iter": cfg["model"]["max_iter"],
            "random_state": cfg["random_state"],
        })

        pipe.fit(X_train, y_train)
        y_pred_valid = pipe.predict(X_valid)
        y_pred_test = pipe.predict(X_test)

        # metrics
        acc_valid = accuracy_score(y_valid, y_pred_valid)
        f1_valid = f1_score(y_valid, y_pred_valid, average=cfg["metrics"]["average"])
        acc_test = accuracy_score(y_test, y_pred_test)
        f1_test = f1_score(y_test, y_pred_test, average=cfg["metrics"]["average"])
        mlflow.log_metrics({
            "valid_accuracy": acc_valid,
            "valid_f1_macro": f1_valid,
            "test_accuracy": acc_test,
            "test_f1_macro": f1_test,
        })

        # save confusion matrix
        os.makedirs("reports", exist_ok=True)
        fig = ConfusionMatrixDisplay.from_predictions(y_test, y_pred_test).figure_
        fig.savefig("reports/confusion_matrix.png", dpi=180, bbox_inches="tight")
        mlflow.log_artifact("reports/confusion_matrix.png")

        # save text report
        report = classification_report(y_test, y_pred_test)
        with open("reports/classification_report.txt", "w") as f:
            f.write(report)
        mlflow.log_artifact("reports/classification_report.txt")

        # save model
        mlflow.sklearn.log_model(pipe, artifact_path="model")

    print("Validation -> acc:", acc_valid, "f1_macro:", f1_valid)
    print("Test       -> acc:", acc_test, "f1_macro:", f1_test)
    print("\nClassification report saved at reports/classification_report.txt")


if __name__ == "__main__":
    run_baseline()
```

5.5 (Optional) `src/serve.py`
```python
import glob

import mlflow.pyfunc
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="T0 Baseline Inference")


class InferRequest(BaseModel):
    text: str


def _latest_model_path():
    # look for the newest model saved by MLflow locally
    candidates = sorted(glob.glob("mlruns/*/*/artifacts/model"))
    if not candidates:
        raise RuntimeError("No model artifacts found. Run the baseline first.")
    return candidates[-1]


@app.post("/infer")
def infer(payload: InferRequest):
    model = mlflow.pyfunc.load_model(_latest_model_path())
    pred = model.predict([payload.text])
    return {"label": int(pred[0])}
```

6. Run it — EDA → Baseline → MLflow
6.1 EDA (quick checks)

Run from `tutorials/setup_template/` as a module so the `src.` imports resolve.

Windows (PowerShell):

```powershell
$env:MLFLOW_TRACKING_URI = ".\mlruns"
$env:DATA_CACHE_DIR = ".\.hf_cache"
python -m src.eda
```

macOS (zsh):

```zsh
export MLFLOW_TRACKING_URI=./mlruns
export DATA_CACHE_DIR=./.hf_cache
python -m src.eda
```

You’ll see dataset stats printed and an image saved to `reports/eda_token_lengths.png`.
6.2 Train the baseline and log to MLflow

Windows (PowerShell):

```powershell
python -m src.baseline
mlflow ui --backend-store-uri ".\mlruns" --host 127.0.0.1 --port 5000
```

macOS (zsh):

```zsh
python -m src.baseline
mlflow ui --backend-store-uri ./mlruns --host 127.0.0.1 --port 5000
```

Open http://127.0.0.1:5000/ → you’ll see your run, parameters, metrics, and artifacts (`confusion_matrix.png`, `classification_report.txt`, `model`).
Expectations: IMDB usually gets a solid accuracy with TF-IDF + Logistic Regression (often ~0.85–0.9). AG News will be lower/harder because it’s 4-class.
7. Optional: Run a tiny API for inference

Start the server (same command on macOS and Windows, from `tutorials/setup_template/`):

```zsh
uvicorn src.serve:app --host 127.0.0.1 --port 8000 --reload
```
Test it (one example):

macOS (curl):

```zsh
curl -X POST http://127.0.0.1:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"text":"A surprisingly heartfelt and funny movie."}'
```

Windows (PowerShell):

```powershell
$body = @{ text = "A surprisingly heartfelt and funny movie." } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/infer -ContentType "application/json" -Body $body
```
8. Switch dataset and re-run (practice)

Change `configs/baseline.yaml`:

```yaml
dataset: "ag_news"
```

Then re-run steps 6.1 and 6.2 and compare metrics in MLflow.
This teaches that different tasks/datasets change difficulty and results.
Tiny glossary (for absolute beginners)
- Token: a piece of text, usually a word.
- TF-IDF: a way to turn text into numbers by counting words and down-weighting common ones.
- Logistic Regression: a simple, reliable classifier.
- Train / Validation / Test: train the model, tune it on validation, and report final scores on test.
- Accuracy: how often predictions are correct.
- Macro-F1: balances precision/recall across classes; good when classes are uneven.
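Macro-F1 follows directly from its definition. This hand-rolled version (matching what `sklearn.metrics.f1_score(..., average="macro")` computes, up to zero-division handling) shows why it penalizes ignoring a rare class:

```python
def macro_f1(y_true, y_pred):
    """Per-class F1, averaged with equal weight per class (not per example)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

Predicting the majority class on `[0, 0, 0, 1]` scores 0.75 accuracy but only about 0.43 macro-F1, because class 1's F1 is zero.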
Common Pitfalls & Troubleshooting
- Install fails on Mac M-series: run `python -m pip install --upgrade pip wheel setuptools` and try again.
- spaCy model error: run `python -m spacy download en_core_web_sm`.
- MLflow UI empty: make sure you ran `src/baseline.py` before opening the UI.
- No model found for the API: run the baseline once to create a model artifact.
Additional Resources
- TF-IDF (scikit-learn)
- Logistic Regression (scikit-learn)
- Datasets (Hugging Face)
- MLflow Tracking
- FastAPI Tutorial
- Environment setup: Conda/Poetry, `.env`, fixed seeds.
- Dataset load: download IMDB/AG News, verify splits, save schema in `data/README.md`.
- EDA: class balance, token length distributions.
- Baseline sanity: TF-IDF + Logistic Regression, log metrics table.
- Experiment governance: config YAML for hyperparams, metrics logging to MLflow.
- Reporting: generate `reports/setup_template.md` from notebook with nbconvert.
- (Optional) Serve: demo FastAPI endpoint for inference with schema validation.
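If you prefer a script to the nbconvert route, the same report can be rendered directly from the logged metrics. This is a stdlib-only sketch, and `write_report` is a hypothetical helper, not one of the tutorial's `src/` files:

```python
from pathlib import Path

def write_report(metrics, path="reports/setup_template.md"):
    """Render a metrics dict as a small markdown table and save it."""
    lines = [
        "# Setup & Template Walkthrough: Results",
        "",
        "| metric | value |",
        "| --- | --- |",
    ]
    lines += [f"| {name} | {value:.4f} |" for name, value in metrics.items()]
    text = "\n".join(lines) + "\n"
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create reports/ if missing
    out.write_text(text)
    return text
```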
Hands-On Exercises
- Try both datasets (IMDB vs AG News) and compare reproducibility logs.
- Add noise (duplicates, shuffle seeds) to test determinism.
- Run ablations: turn off seed fixing, compare reproducibility.
- Stretch: connect MLflow run metadata to Weights & Biases.
Common Pitfalls & Troubleshooting
- Forgetting to set seeds → non-reproducible results.
- Data leakage from overlapping splits.
- Unpinned dependencies breaking reproducibility.
- Missing `.env` leads to secret path issues.
- CI not running → unchecked notebook failures.
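A quick leakage check catches the overlapping-splits pitfall. This sketch looks only for exact duplicate texts (near-duplicates need fuzzier matching), and `split_overlap` is a hypothetical helper:

```python
def split_overlap(train_texts, test_texts):
    """Count exact-duplicate texts shared between two splits."""
    shared = set(train_texts) & set(test_texts)
    return len(shared), sorted(shared)[:5]  # total count plus a few examples
```

A nonzero count before training means the reported test score will be optimistic.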
Best Practices
- Always log commit hash, dataset version, config.
- PR checklist: metrics ≥ baseline, README updated, tests green.
- Write unit tests for tokenization, vectorization, and schema validation.
- Keep seed-fixing utilities in `src/utils.py`.
- Separate experiments (configs/notebooks) from reporting.
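A unit test for the seed-fixing utility can be as small as this. The sketch uses a stdlib-only stand-in for `set_all_seeds` so it runs anywhere; in the repo you would import the real one from `src/utils.py`:

```python
import random

def set_all_seeds(seed: int = 42):
    """Stand-in for src/utils.set_all_seeds (stdlib RNG only)."""
    random.seed(seed)

def test_seed_determinism():
    # Two draws after re-seeding must be identical, else runs aren't reproducible.
    set_all_seeds(123)
    first = [random.random() for _ in range(3)]
    set_all_seeds(123)
    second = [random.random() for _ in range(3)]
    assert first == second
```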
Reflection & Discussion Prompts
- Why does reproducibility matter in civic-tech / applied NLP projects?
- What’s the tradeoff between fast iteration and strict reproducibility?
- How might governance differ in regulated vs open-data contexts?
Next Steps / Advanced Extensions
- Automate report generation in CI.
- Introduce containerized reproducibility (Docker).
- Connect experiment tracking with deployment logs.
- Move from IMDB/AG News to a civic dataset (e.g., 311 complaints).
Glossary / Key Terms
- Reproducibility: ability to re-run experiment with identical results.
- Data leakage: unintended information in train/test overlap.
- Seed fixing: controlling randomness across frameworks.
- Governance: tracking configs, metrics, and artifacts.
Additional Resources
- [scikit-learn docs](https://scikit-learn.org/stable/)
- [spaCy](https://spacy.io/)
- [Hugging Face Transformers](https://huggingface.co/docs/transformers)
- [Hugging Face Datasets](https://huggingface.co/datasets)
- [MLflow](https://mlflow.org/)
- [FastAPI](https://fastapi.tiangolo.com/)
Contributors
Author(s): TBD
Reviewer(s): TBD
Maintainer(s): TBD
Date updated: 2025-09-20
Dataset license: IMDB (ACL), AG News (Creative Commons).
Issues Referenced
Epic: HfLA Text Analysis Tutorials (T0–T14).
This sub-issue: T0 Setup & Template Walkthrough.
Notes:
I chose IMDB (small binary classification) and AG News (medium 4-class classification) because they are light enough for setup/debug, yet distinct in size and task complexity. Both test the scaffolding under different load conditions. For governance, I leaned on MLflow for run-tracking (simpler than W&B but extensible). The FastAPI step is optional but sets the stage for later deployment tutorials.