4 changes: 2 additions & 2 deletions 01_materials/notebooks/Classification-1.ipynb
Original file line number Diff line number Diff line change
@@ -2326,7 +2326,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -2340,7 +2340,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
"version": "3.9.6"
}
},
"nbformat": 4,
62 changes: 37 additions & 25 deletions 01_materials/notebooks/Classification-2.ipynb
@@ -1,5 +1,10 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@@ -951,12 +956,14 @@
"\"But wait, what is that `np.random.seed()` thing?\". It comes from **NumPy**, a popular Python library that helps us work with numbers and do math really fast, especially when dealing with large amounts of data. It's great for things like handling lists of numbers, performing calculations, and generating random numbers.\n",
"\n",
"The `np.random.seed()` function is used to control the randomness in your code. Normally, when we generate random numbers, they change every time we run the code. By setting a \"seed\" with `np.random.seed()`, we make sure the random numbers stay the same each time we run it. This is useful when we want consistent results for testing or comparisons.\n",
"NumPy arrays are fast and powerful, allowing us to do all sorts of math and number operations easily!"
"NumPy arrays are fast and powerful, allowing us to do all sorts of math and number operations easily!\n",
"\n",
"**Note:** the value 1 in `np.random.seed(1)` is arbitrary; any integer works as the seed."
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -965,7 +972,7 @@
"\n",
"#split the data\n",
"cancer_train, cancer_test = train_test_split(\n",
" standardized_cancer, train_size=0.75, stratify=standardized_cancer[\"diagnosis\"]\n",
"    standardized_cancer, train_size=0.75, shuffle=True, stratify=standardized_cancer[\"diagnosis\"]\n",
")"
]
},
@@ -1277,7 +1284,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1295,14 +1302,16 @@
"knn.score(\n",
" cancer_test[[\"perimeter_mean\", \"concavity_mean\"]],\n",
" cancer_test[\"diagnosis\"]\n",
")"
")\n",
"\n",
"# score() takes the test features (X) and the true labels (y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output shows that the estimated accuracy of the classifier on the test data was 88%!"
"The output shows that the estimated accuracy of the classifier on the test data was 92%!"
]
},
{
@@ -1436,10 +1445,10 @@
"source": [
"The confusion matrix reveals the following:\n",
"\n",
"- 44 observations were correctly predicted as malignant.\n",
"- 88 observations were correctly predicted as benign.\n",
"- 9 observations were incorrectly classified as benign when they were actually malignant.\n",
"- 2 observations were incorrectly classified as malignant when they were actually benign."
"- True positive: 44 observations were correctly predicted as malignant.\n",
"- True negative: 88 observations were correctly predicted as benign.\n",
"- False negative: 9 observations were incorrectly classified as benign when they were actually malignant.\n",
"- False positive: 2 observations were incorrectly classified as malignant when they were actually benign."
]
},
{
@@ -1578,7 +1587,7 @@
"\n",
"To find the best value for $k$ or tune any model parameter, we aim to maximize the classifier’s accuracy on unseen data. However, the test set should not be used during tuning. Instead, we split the training data into two subsets: one for **training the model** and the other for **evaluating its performance (validation)**. This approach helps select the optimal parameter value while keeping the test set untouched.\n",
"\n",
"so the data split would look like:\n",
"so the data split would look like this (one group is randomly chosen for validation):\n",
"\n",
"![](./images/TVT.001.png)"
]
@@ -1592,7 +1601,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1609,16 +1618,16 @@
"source": [
"# We're re-using the train_test_split function here in order to split the training data into sub-training and validation sets.\n",
"cancer_subtrain, cancer_validation = train_test_split(\n",
" cancer_train, train_size=0.75, stratify=cancer_train[\"diagnosis\"]\n",
" cancer_train, train_size=0.75, shuffle=True, stratify=cancer_train[\"diagnosis\"]\n",
")\n",
"\n",
"# fit the model on the sub-training data\n",
"knn = KNeighborsClassifier(n_neighbors=3)\n",
"knn = KNeighborsClassifier(n_neighbors=3) # create KNN model with K=3\n",
"X = cancer_subtrain[[\"perimeter_mean\", \"concavity_mean\"]]\n",
"y = cancer_subtrain[\"diagnosis\"]\n",
"knn.fit(X, y)\n",
"knn.fit(X, y) # fit to sub-training data\n",
"\n",
"# compute the score on validation data\n",
"# compute the score on the validation data and compare it with the earlier score at K=5\n",
"acc = knn.score(\n",
" cancer_validation[[\"perimeter_mean\", \"concavity_mean\"]],\n",
" cancer_validation[\"diagnosis\"]\n",
@@ -1671,7 +1680,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1763,12 +1772,13 @@
"\n",
"cv_5_df = pd.DataFrame(returned_dictionary) # Converting it to pandas DataFrame\n",
"\n",
"cv_5_df"
"cv_5_df\n",
"# cv=5 splits the data into 5 folds (scikit-learn's K-fold splitter does not shuffle by default)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1826,7 +1836,7 @@
}
],
"source": [
"# Compute mean and standard error of the mean (SEM) for each column\n",
"# Compute the mean and standard error of the mean (SEM) for each column, including the accuracy (test_score)\n",
"cv_5_metrics = cv_5_df.agg([\"mean\", \"sem\"])\n",
"\n",
"cv_5_metrics"
@@ -1884,7 +1894,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -1896,9 +1906,10 @@
"`range(stop)` starts from 0 and goes up to `stop-1`. For example, `range(4)` produces 0, 1, 2, 3.\n",
"\"\"\"\n",
"\n",
"# candidate values of K from 1 to 96 in steps of 5 (1, 6, 11, ..., 96)\n",
"parameter_grid = {\n",
" \"n_neighbors\": range(1, 100, 5),\n",
"}"
"}\n"
]
},
{
@@ -1916,7 +1927,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -2551,6 +2562,7 @@
" cancer_train[\"diagnosis\"]\n",
")\n",
"\n",
"# collect the grid search results: mean and standard deviation of accuracy for each K\n",
"accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)\n",
"accuracies_grid"
]
@@ -2789,7 +2801,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "lcr-env",
"language": "python",
"name": "python3"
},
@@ -2803,7 +2815,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.13"
}
},
"nbformat": 4,
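The `stratify=` argument this notebook adds to `train_test_split` keeps each class's proportion intact in both splits. A pure-Python sketch of that idea, on hypothetical toy labels rather than the notebook's cancer data (scikit-learn's actual implementation differs):

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac, seed=1):
    """Split indices so each class keeps its proportion in the training set."""
    random.seed(seed)                      # like np.random.seed: reproducible shuffles
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for indices in by_class.values():
        random.shuffle(indices)            # shuffle within each class
        cut = round(len(indices) * train_frac)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# toy labels: 25% "M" (malignant), 75% "B" (benign)
labels = ["M"] * 20 + ["B"] * 60
train, test = stratified_split(labels, train_frac=0.75)
m_in_train = sum(labels[i] == "M" for i in train)
print(m_in_train / len(train))  # 0.25: the class ratio is preserved in train
```

Because each class is shuffled and cut separately, both splits mirror the overall class balance, which is exactly why `stratify=` matters for imbalanced diagnosis labels.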
8 changes: 4 additions & 4 deletions 01_materials/notebooks/Clustering.ipynb
@@ -454,12 +454,12 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Perform K-means clustering\n",
"kmeans = KMeans(n_clusters=5, random_state=0)\n",
"kmeans = KMeans(n_clusters=5, random_state=0)  # K=5 clusters; random_state fixes the seed for reproducibility\n",
"clusters = kmeans.fit(standardized_penguins)"
]
},
@@ -875,7 +875,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "lcr-env",
"language": "python",
"name": "python3"
},
@@ -889,7 +889,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.13"
}
},
"nbformat": 4,
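The commented `KMeans(n_clusters=5, random_state=0)` call hides a simple two-step loop: assign each point to its nearest centre, then move each centre to the mean of its points. A minimal 1-D sketch of that loop on made-up data (not scikit-learn's implementation, which adds smarter initialization and convergence checks):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Tiny 1-D k-means: assign points to the nearest centre, then re-average."""
    random.seed(seed)                      # fix the seed, like random_state=0
    centres = random.sample(points, k)     # random initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # move each centre to the mean of its cluster (keep it if empty)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
centres = kmeans_1d(points, 2)
print(centres)  # two centres, one near 1.0 and one near 9.0
```

Fixing the seed makes the random initial centres, and therefore the final clusters, the same on every run, which is all `random_state=0` does in the notebook.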
21 changes: 11 additions & 10 deletions 01_materials/notebooks/Regression-1.ipynb
@@ -1385,7 +1385,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1422,13 +1422,13 @@
" \"n_neighbors\": range(1, 201, 3), # But wait...? What is this?\n",
"}\n",
"\n",
"# Step 4: Initialize and fit GridSearchCV\n",
"# Step 4: Initialize and fit GridSearchCV to find the best n_neighbors\n",
"sacr_gridsearch = GridSearchCV(\n",
" estimator=knn_regressor,\n",
" param_grid=param_grid,\n",
" cv=5,\n",
" scoring=\"neg_root_mean_squared_error\"\n",
")\n",
")  # RMSE is negated because scikit-learn assumes higher scores are better\n",
"\n",
"sacr_gridsearch.fit(X_train, y_train)\n",
"\n",
@@ -1604,7 +1604,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1716,6 +1716,7 @@
}
],
"source": [
"# flip the negated scores back to positive RMSPE; the smallest value indicates the best model\n",
"results[\"mean_test_score\"] = -results[\"mean_test_score\"]\n",
"# could also code this as results[\"mean_test_score\"] = results[\"mean_test_score\"].abs()\n",
"results"
@@ -1856,7 +1857,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1874,7 +1875,7 @@
"# Make predictions on the test set\n",
"sacramento_test[\"predicted\"] = sacr_gridsearch.predict(sacramento_test[[\"sq__ft\"]])\n",
"\n",
"# Calculate RMSPE\n",
"# Calculate RMSPE (prediction error) for the test dataset\n",
"rmspe = mean_squared_error(\n",
" y_true=sacramento_test[\"price\"],\n",
" y_pred=sacramento_test[\"predicted\"]\n",
@@ -1903,7 +1904,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1918,7 +1919,7 @@
}
],
"source": [
"# Calculate R² \n",
"# Calculate R² to check goodness of fit of the model (closer to 1 indicates better fit)\n",
"r2 = r2_score( \n",
"y_true=sacramento_test[\"price\"], y_pred=sacramento_test[\"predicted\"] \n",
")\n",
@@ -2014,7 +2015,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "lcr-env",
"language": "python",
"name": "python3"
},
@@ -2028,7 +2029,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.13"
}
},
"nbformat": 4,
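The comments added around `scoring="neg_root_mean_squared_error"` explain that scikit-learn negates RMSE because its grid search maximizes scores. A small self-contained illustration of why the negation preserves the ranking (toy numbers, not the Sacramento data):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: lower means better predictions."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

y_true  = [100, 200, 300]
model_a = [110, 190, 310]   # small errors
model_b = [150, 150, 350]   # larger errors

# GridSearchCV maximizes its score, so RMSE is negated:
# the higher (less negative) score belongs to the smaller error.
score_a, score_b = -rmse(y_true, model_a), -rmse(y_true, model_b)
print(score_a > score_b)  # True: model_a wins under the negated score
```

This is also why the notebook later flips `mean_test_score` back to positive: the negation is purely a convention for ranking, not a change to the error itself.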
8 changes: 4 additions & 4 deletions 01_materials/notebooks/Regression-2.ipynb
@@ -402,7 +402,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -454,7 +454,7 @@
"# fit the linear regression model\n",
"lm = LinearRegression()\n",
"lm.fit(\n",
" sacramento_train[[\"sq__ft\"]], # A single-column data frame (square footage)\n",
"    sacramento_train[[\"sq__ft\"]], # a single-column data frame (square footage); double brackets keep the predictors (X) two-dimensional\n",
" sacramento_train[\"price\"] # A series (house prices)\n",
")\n",
"\n",
@@ -1576,7 +1576,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "lcr-env",
"language": "python",
"name": "python3"
},
@@ -1590,7 +1590,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.13"
}
},
"nbformat": 4,
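The `LinearRegression().fit(X, y)` call this notebook annotates solves an ordinary least-squares problem. For a single predictor, the slope and intercept have a closed form, shown here in plain Python on made-up points (not the notebook's Sacramento data):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# made-up points lying exactly on price = 100 * sq_ft + 50000
sq_ft = [800, 1000, 1200, 1500]
price = [130000, 150000, 170000, 200000]
slope, intercept = fit_line(sq_ft, price)
print(slope, intercept)  # recovers 100.0 and 50000.0
```

Real data never lies exactly on a line, so the fitted slope and intercept minimize the sum of squared residuals rather than reproducing the points; that minimization is what `lm.fit(...)` performs.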