4 changes: 2 additions & 2 deletions 01_materials/notebooks/Classification-1.ipynb
Original file line number Diff line number Diff line change
@@ -2326,7 +2326,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -2340,7 +2340,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
"version": "3.9.6"
}
},
"nbformat": 4,
62 changes: 37 additions & 25 deletions 01_materials/notebooks/Classification-2.ipynb
@@ -1,5 +1,10 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@@ -951,12 +956,14 @@
"\"But wait, what is that `np.random.seed()` thing?\". It comes from **NumPy**, a popular Python library that helps us work with numbers and do math really fast, especially when dealing with large amounts of data. It's great for things like handling lists of numbers, performing calculations, and generating random numbers.\n",
"\n",
"The `np.random.seed()` function is used to control the randomness in your code. Normally, when we generate random numbers, they change every time we run the code. By setting a \"seed\" with `np.random.seed()`, we make sure the random numbers stay the same each time we run it. This is useful when we want consistent results for testing or comparisons.\n",
"NumPy arrays are fast and powerful, allowing us to do all sorts of math and number operations easily!"
"NumPy arrays are fast and powerful, allowing us to do all sorts of math and number operations easily!\n",
"\n",
"**Note:** the value 1 in `np.random.seed(1)` is arbitrary; any integer works as the seed."
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -965,7 +972,7 @@
"\n",
"#split the data\n",
"cancer_train, cancer_test = train_test_split(\n",
" standardized_cancer, train_size=0.75, stratify=standardized_cancer[\"diagnosis\"]\n",
"    standardized_cancer, train_size=0.75, shuffle=True, stratify=standardized_cancer[\"diagnosis\"]\n",
")"
]
},
@@ -1277,7 +1284,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1295,14 +1302,16 @@
"knn.score(\n",
" cancer_test[[\"perimeter_mean\", \"concavity_mean\"]],\n",
" cancer_test[\"diagnosis\"]\n",
")"
")\n",
"\n",
"# score() takes the test features (X) and the true labels (y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output shows that the estimated accuracy of the classifier on the test data was 88%!"
"The output shows that the estimated accuracy of the classifier on the test data was 92%!"
]
},
{
@@ -1436,10 +1445,10 @@
"source": [
"The confusion matrix reveals the following:\n",
"\n",
"- 44 observations were correctly predicted as malignant.\n",
"- 88 observations were correctly predicted as benign.\n",
"- 9 observations were incorrectly classified as benign when they were actually malignant.\n",
"- 2 observations were incorrectly classified as malignant when they were actually benign."
"- True positive: 44 observations were correctly predicted as malignant.\n",
"- True negative: 88 observations were correctly predicted as benign.\n",
"- False negative: 9 observations were incorrectly classified as benign when they were actually malignant.\n",
"- False positive: 2 observations were incorrectly classified as malignant when they were actually benign."
]
},
{
@@ -1578,7 +1587,7 @@
"\n",
"To find the best value for $k$ or tune any model parameter, we aim to maximize the classifier’s accuracy on unseen data. However, the test set should not be used during tuning. Instead, we split the training data into two subsets: one for **training the model** and the other for **evaluating its performance (validation)**. This approach helps select the optimal parameter value while keeping the test set untouched.\n",
"\n",
"so the data split would look like:\n",
"so the data split would look like this (one group is randomly chosen for validation):\n",
"\n",
"![](./images/TVT.001.png)"
]
@@ -1592,7 +1601,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1609,16 +1618,16 @@
"source": [
"# We're re-using the train_test_split function here in order to split the training data into sub-training and validation sets.\n",
"cancer_subtrain, cancer_validation = train_test_split(\n",
" cancer_train, train_size=0.75, stratify=cancer_train[\"diagnosis\"]\n",
" cancer_train, train_size=0.75, shuffle=True, stratify=cancer_train[\"diagnosis\"]\n",
")\n",
"\n",
"# fit the model on the sub-training data\n",
"knn = KNeighborsClassifier(n_neighbors=3)\n",
"knn = KNeighborsClassifier(n_neighbors=3) # create KNN model with K=3\n",
"X = cancer_subtrain[[\"perimeter_mean\", \"concavity_mean\"]]\n",
"y = cancer_subtrain[\"diagnosis\"]\n",
"knn.fit(X, y)\n",
"knn.fit(X, y) # fit to sub-training data\n",
"\n",
"# compute the score on validation data\n",
"# compute the score on the validation data and compare it with the earlier score at K=5\n",
"acc = knn.score(\n",
" cancer_validation[[\"perimeter_mean\", \"concavity_mean\"]],\n",
" cancer_validation[\"diagnosis\"]\n",
@@ -1671,7 +1680,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1763,12 +1772,13 @@
"\n",
"cv_5_df = pd.DataFrame(returned_dictionary) # Converting it to pandas DataFrame\n",
"\n",
"cv_5_df"
"cv_5_df\n",
"# cv=5 splits the data into 5 folds (scikit-learn's K-fold splitter does not shuffle by default)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1826,7 +1836,7 @@
}
],
"source": [
"# Compute mean and standard error of the mean (SEM) for each column\n",
"# Compute the mean and standard error of the mean (SEM) for each column, including the accuracy (test_score)\n",
"cv_5_metrics = cv_5_df.agg([\"mean\", \"sem\"])\n",
"\n",
"cv_5_metrics"
@@ -1884,7 +1894,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -1896,9 +1906,10 @@
"`range(stop)` starts from 0 and goes up to `stop-1`. For example, `range(4)` produces 0, 1, 2, 3.\n",
"\"\"\"\n",
"\n",
"# candidate values of K from 1 to 96 in steps of 5 (1, 6, 11, ..., 96)\n",
"parameter_grid = {\n",
" \"n_neighbors\": range(1, 100, 5),\n",
"}"
"}\n"
]
},
{
@@ -1916,7 +1927,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -2551,6 +2562,7 @@
" cancer_train[\"diagnosis\"]\n",
")\n",
"\n",
"# collect the grid search results: mean and standard deviation of accuracy for each K\n",
"accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)\n",
"accuracies_grid"
]
@@ -2789,7 +2801,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "lcr-env",
"language": "python",
"name": "python3"
},
@@ -2803,7 +2815,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.13"
}
},
"nbformat": 4,
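The `stratify=` argument this notebook adds to `train_test_split` keeps each class's proportion intact in both splits. A pure-Python sketch of that idea, on hypothetical toy labels rather than the notebook's cancer data (scikit-learn's actual implementation differs):

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac, seed=1):
    """Split indices so each class keeps its proportion in the training set."""
    random.seed(seed)                      # like np.random.seed: reproducible shuffles
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for indices in by_class.values():
        random.shuffle(indices)            # shuffle within each class
        cut = round(len(indices) * train_frac)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# toy labels: 25% "M" (malignant), 75% "B" (benign)
labels = ["M"] * 20 + ["B"] * 60
train, test = stratified_split(labels, train_frac=0.75)
m_in_train = sum(labels[i] == "M" for i in train)
print(m_in_train / len(train))  # 0.25: the class ratio is preserved in train
```

Because each class is shuffled and cut separately, both splits mirror the overall class balance, which is exactly why `stratify=` matters for imbalanced diagnosis labels.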
8 changes: 4 additions & 4 deletions 01_materials/notebooks/Clustering.ipynb
@@ -454,12 +454,12 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Perform K-means clustering\n",
"kmeans = KMeans(n_clusters=5, random_state=0)\n",
"kmeans = KMeans(n_clusters=5, random_state=0)  # K=5 clusters; random_state fixes the seed for reproducibility\n",
"clusters = kmeans.fit(standardized_penguins)"
]
},
@@ -875,7 +875,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "lcr-env",
"language": "python",
"name": "python3"
},
@@ -889,7 +889,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.13"
}
},
"nbformat": 4,
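The commented `KMeans(n_clusters=5, random_state=0)` call hides a simple two-step loop: assign each point to its nearest centre, then move each centre to the mean of its points. A minimal 1-D sketch of that loop on made-up data (not scikit-learn's implementation, which adds smarter initialization and convergence checks):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Tiny 1-D k-means: assign points to the nearest centre, then re-average."""
    random.seed(seed)                      # fix the seed, like random_state=0
    centres = random.sample(points, k)     # random initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # move each centre to the mean of its cluster (keep it if empty)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
centres = kmeans_1d(points, 2)
print(centres)  # two centres, one near 1.0 and one near 9.0
```

Fixing the seed makes the random initial centres, and therefore the final clusters, the same on every run, which is all `random_state=0` does in the notebook.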
21 changes: 11 additions & 10 deletions 01_materials/notebooks/Regression-1.ipynb
@@ -1385,7 +1385,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1422,13 +1422,13 @@
" \"n_neighbors\": range(1, 201, 3), # But wait...? What is this?\n",
"}\n",
"\n",
"# Step 4: Initialize and fit GridSearchCV\n",
"# Step 4: Initialize and fit GridSearchCV to find the best n_neighbors\n",
"sacr_gridsearch = GridSearchCV(\n",
" estimator=knn_regressor,\n",
" param_grid=param_grid,\n",
" cv=5,\n",
" scoring=\"neg_root_mean_squared_error\"\n",
")\n",
")  # RMSE is negated because scikit-learn assumes higher scores are better\n",
"\n",
"sacr_gridsearch.fit(X_train, y_train)\n",
"\n",
@@ -1604,7 +1604,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1716,6 +1716,7 @@
}
],
"source": [
"# flip the negated scores back to positive RMSPE; the smallest value indicates the best model\n",
"results[\"mean_test_score\"] = -results[\"mean_test_score\"]\n",
"# could also code this as results[\"mean_test_score\"] = results[\"mean_test_score\"].abs()\n",
"results"
@@ -1856,7 +1857,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1874,7 +1875,7 @@
"# Make predictions on the test set\n",
"sacramento_test[\"predicted\"] = sacr_gridsearch.predict(sacramento_test[[\"sq__ft\"]])\n",
"\n",
"# Calculate RMSPE\n",
"# Calculate RMSPE (prediction error) for the test dataset\n",
"rmspe = mean_squared_error(\n",
" y_true=sacramento_test[\"price\"],\n",
" y_pred=sacramento_test[\"predicted\"]\n",
@@ -1903,7 +1904,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -1918,7 +1919,7 @@
}
],
"source": [
"# Calculate R² \n",
"# Calculate R² to check goodness of fit of the model (closer to 1 indicates better fit)\n",
"r2 = r2_score( \n",
"y_true=sacramento_test[\"price\"], y_pred=sacramento_test[\"predicted\"] \n",
")\n",
@@ -2014,7 +2015,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "lcr-env",
"language": "python",
"name": "python3"
},
@@ -2028,7 +2029,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.13"
}
},
"nbformat": 4,
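The comments added around `scoring="neg_root_mean_squared_error"` explain that scikit-learn negates RMSE because its grid search maximizes scores. A small self-contained illustration of why the negation preserves the ranking (toy numbers, not the Sacramento data):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: lower means better predictions."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

y_true  = [100, 200, 300]
model_a = [110, 190, 310]   # small errors
model_b = [150, 150, 350]   # larger errors

# GridSearchCV maximizes its score, so RMSE is negated:
# the higher (less negative) score belongs to the smaller error.
score_a, score_b = -rmse(y_true, model_a), -rmse(y_true, model_b)
print(score_a > score_b)  # True: model_a wins under the negated score
```

This is also why the notebook later flips `mean_test_score` back to positive: the negation is purely a convention for ranking, not a change to the error itself.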
8 changes: 4 additions & 4 deletions 01_materials/notebooks/Regression-2.ipynb
@@ -402,7 +402,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -454,7 +454,7 @@
"# fit the linear regression model\n",
"lm = LinearRegression()\n",
"lm.fit(\n",
" sacramento_train[[\"sq__ft\"]], # A single-column data frame (square footage)\n",
"    sacramento_train[[\"sq__ft\"]], # a single-column data frame (square footage); double brackets keep the predictors (X) two-dimensional\n",
" sacramento_train[\"price\"] # A series (house prices)\n",
")\n",
"\n",
@@ -1576,7 +1576,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "lcr-env",
"language": "python",
"name": "python3"
},
@@ -1590,7 +1590,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.13"
}
},
"nbformat": 4,
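The `LinearRegression().fit(X, y)` call this notebook annotates solves an ordinary least-squares problem. For a single predictor, the slope and intercept have a closed form, shown here in plain Python on made-up points (not the notebook's Sacramento data):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# made-up points lying exactly on price = 100 * sq_ft + 50000
sq_ft = [800, 1000, 1200, 1500]
price = [130000, 150000, 170000, 200000]
slope, intercept = fit_line(sq_ft, price)
print(slope, intercept)  # recovers 100.0 and 50000.0
```

Real data never lies exactly on a line, so the fitted slope and intercept minimize the sum of squared residuals rather than reproducing the points; that minimization is what `lm.fit(...)` performs.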