-
Notifications
You must be signed in to change notification settings - Fork 3
Expand file tree
/
Copy path2_Classification.Rmd
More file actions
816 lines (470 loc) · 24.4 KB
/
2_Classification.Rmd
File metadata and controls
816 lines (470 loc) · 24.4 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
---
title: "Biological Examples"
author: "Ryan Franckowiak"
date: "January 30, 2019"
output: html_document
---
###Load Required Libraries
```{r}
library(plyr)
library(dplyr)
library(FSA)
library(PerformanceAnalytics)
library(psych)
library(car)
library(rcompanion)
library(lmtest)
library(aod)
library(Amelia)
library(cvAUC)
```
###Bird introductions to New Zealand
In the 1800s, many people released their favorite bird species in New Zealand hoping that they would become established in the country. Veltman et al. (1996) assessed the success or failure (dependent variable) of these introduced species (n=79) based on 14 independent variables
Logistic regression can be used to predict the probability of success of a new introduction and assess which factors influence the success of these introductions. Once a model is developed it can be used to predict the success of future introductions.
```{r}
#Load New Zealand bird data file
Data.df<-read.csv("ZealandBirds.csv", header=TRUE)
Data.df
```
```{r}
#Only include variables that are numeric or can be made numeric
Data.num <-
select(Data.df,
Status,
Length,
Mass,
Range,
Migr,
Insect,
Diet,
Clutch,
Broods,
Wood,
Upland,
Water,
Release,
Indiv)
```
```{r}
### Covert integer variables to numeric variables
Data.num$Status = as.numeric(Data.num$Status)
Data.num$Length = as.numeric(Data.num$Length)
Data.num$Migr = as.numeric(Data.num$Migr)
Data.num$Insect = as.numeric(Data.num$Insect)
Data.num$Diet = as.numeric(Data.num$Diet)
Data.num$Broods = as.numeric(Data.num$Broods)
Data.num$Wood = as.numeric(Data.num$Wood)
Data.num$Upland = as.numeric(Data.num$Upland)
Data.num$Water = as.numeric(Data.num$Water)
Data.num$Release = as.numeric(Data.num$Release)
Data.num$Indiv = as.numeric(Data.num$Indiv)
headtail(Data.num)
```
```{r}
summary(Data.num)
```
###Multicollinearity
Multicollinearity is a statistical phenomenon in which predictor variables in a logistic regression model are highly correlated. It is not uncommon when there are a large number of covariates in the model.
Multicollinearity can cause unstable estimates and inaccurate variances which affects confidence intervals and hypothesis tests. The existence of collinearity inflates the variances of the parameter estimates, and thus can lead to incorrect inferences about relationships between explanatory and response variables.
Examining the correlation matrix for the selected covariates can be helpful to detect multicollinearity
```{r}
#Examining correlations among variables (Pearson or Spearman)
#library(PerformanceAnalytics)
chart.Correlation(Data.num,
method="spearman",
histogram=TRUE,
pch=16)
```
Another approach is to examine the correlation coefficients directly
```{r}
#library(psych)
corr.test(Data.num,
use = "pairwise",
method="spearman",
adjust="none", # Can adjust p-values; see ?p.adjust for options
alpha=.05)
```
###Missing Data
Missing data can severely affect the performance and reliability of logistic regression models. A missing data plot can be used to get a quick idea of the amount of missing data in the dataset. The x-axis shows attributes (i.e., covariates) and the y-axis shows instances (i.e., observations or individuals). Horizontal lines indicate missing data for an instance, vertical blocks represent missing data for an attribute.
```{r}
Amelia::missmap(Data.num, main = "Missing values vs observed")
```
###Remove missing data
```{r}
Data.omit = na.omit(Data.num)
```
###Simple Logistic Regression Example
```{r}
model.simple = glm(Status ~ Release,
data=Data.omit,
family = binomial(link="logit")
)
summary(model.simple)
```
In the output above, the 'call' is R reminding us what model was ran, what options we specified, etc.
The deviance residuals, are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model.
The next part of the output shows (1) the coefficients, (2) their standard errors, (3) the z-statistic (sometimes called a Wald z-statistic), and (4) the associated p-values.
The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
For example: every unit change in 'Release', the log odds of establishment (versus non-establishment) increases by 0.33234.
Below the table of coefficients are fit indices, including the null and deviance residuals and the AIC. These values can be used to help assess model fit.
###Model significance testing
The overall effect of 'Release' can be tested using the wald.test() function from the 'aod' package.
The wald.test() function.
(1) b supplies the coefficients,
(2) Sigma supplies the variance covariance matrix of the error terms,
(3) Terms tells R which terms in the model are to be tested (in this case only 1)
```{r}
aod::wald.test(b = coef(model.simple), Sigma = vcov(model.simple), Terms = 1)
```
The chi-squared test statistic of 19.7, with one degrees of freedom is associated with a p-value of 8.9e-06 indicating that the overall effect of 'Release' is statistically significant.
###Odds ratio
Odds-ratios can be calculated by exponentiating the modely coefficients
```{r}
## odds ratios only
exp(coef(model.simple))
```
The same logic can be used to get odds ratios and their confidence intervals.
```{r}
## odds ratios and 95% CI
exp(cbind(OR = coef(model.simple), confint(model.simple)))
```
This shows that a one unit increase in 'Release' increases the odds of becoming established in New Zealand (versus not establishing) by a factor of 1.39.
###Assessing Model fit
One measure of model fit is to test whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model). The test statistic is the difference between the residual deviance for the model with predictors and the null model. The test statistic has a chi-squared distribution with the degrees of freedom equal to the differences in degrees of freedom between the current and the null model (i.e., the number of predictor variables in the model). To find the difference in deviance for the two models (i.e., the test statistic) we can use the command:
```{r}
with(model.simple, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
```
The chi-square of 34.21 with 1 degrees of freedom and an associated p-value of 4.939945e-09 tells us that our model as a whole fits significantly better than an empty (NULL)model.
```{r}
library(ggplot2)
ggplot(Data.omit, aes(x=Release, y=Status)) + geom_point() +
stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
```
###Reciever Operating Characteristics (ROC)
ROC curves provide a graphical representation of classifier performance. The ROC curve is a two-dimensional plot with sensitivity on the Y axis, 1 -Specificity on the X axis.
```{r}
#ROCR Curve
library(ROCR)
predicted <- predict(model.simple, type = 'response')
ROCRpred <- prediction(predicted, Data.omit$Status)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))
```
###Area under the curve (AUC)
The AUC is defined as the probability that the fit model will score a randomly drawn positive sample higher than a randomly drawn negative sample. AUC ranges between 0 and 1. A higher AUC values indicates a more powerful classifier (Assuming the general shape of the ROC curves remains relative constant)
```{r}
# Outcome Flag & Predicted probability
cvAUC::AUC(predicted, Data.omit$Status, label.ordering = NULL)
```
###Multiple logistic regression
The New Zealand bird data will be extended to include multiple predictor variables (i.e., independent variables). Missing data can be problematic for multiple logistic regression. The method for handle missing values in a multiple regression is to remove all observations from the data set that have any missing values (e.g., data frame called Data.omit).
```{r}
### Create new data frame with all missing values removed (NA’s)
Data.omit = na.omit(Data.num)
```
###Variable selection procedure
There are two main approaches towards variable selection: the all possible regressions approach (or multiple working hypotheses) and automatic methods. Automatic methods are useful when the number of explanatory variables is large and it is not feasible to fit all possible models. Automatic methods use a search algorithm (e.g., Forward selection, Backward elimination and Stepwise regression) to find the best model. To perform forward selection we need to begin by specifying a starting model and the range of models which we want to examine in the search. The R function step() can be used to perform variable selection.
```{r}
### Define full and null models and do step procedure
model.null = glm(Status ~ 1,
data=Data.omit,
family = binomial(link="logit")
)
model.full = glm(Status ~ Length + Mass + Range + Migr + Insect + Diet +
Clutch + Broods + Wood + Upland + Water +
Release + Indiv,
data=Data.omit,
family = binomial(link="logit")
)
step(model.null,
scope = list(upper=model.full),
direction="both",
test="Chisq",
data=Data.omit)
```
According to this procedure, the best model is the one that includes the variables: Upland, Migr, Mass, Indiv, Insect, and Wood
###Final Model Data
In the final model, it is important to exclude only those observations that have missing values in the variables that are actually included in the final model. For testing the overall p-value of the final model, plotting the final model, or using the glm.compare function, we will create a data frame with only those observations excluded (i.e., data frame called Data.final).
```{r}
Data.final <-
select(Data.num,
Status,
Upland,
Migr,
Mass,
Indiv,
Insect,
Wood)
Data.final <- na.omit(Data.final)
```
```{r}
#Final Model
model.final = glm(Status ~ Upland + Migr + Mass + Indiv + Insect + Wood,
data=Data.final,
family = binomial(link="logit")
)
summary(model.final)
```
In the output above, the 'call' is R reminding us what model was ran, what options we specified, etc.
The deviance residuals, are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model.
The next part of the output shows (1) the coefficients, (2) their standard errors, (3) the z-statistic (sometimes called a Wald z-statistic), and (4) the associated p-values.
The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
###Assessing significance of model covariates
Another way to overall effect of retained predictor variables is using the Wald test (results similar to those shown above).
```{r}
#Analysis of variance for individual terms
car::Anova(model.final, type="II", test="Wald")
```
###Pseudo-R-squared
R does not produce r-squared values for generalized linear models (glm) to assess model fit. Instead, psuedo-R-squared values can be generated that to provide information similar to that provided by R-squared in OLS regression; however, psuedo-R-squared values cannot be interpretated exactly as R-squared in OLS regression. These pseudo-R-squared values compare the maximum likelihood of the model to a nested null model fit with the same method. They should be interpreted as a relative measure among similar models.
```{r}
#Pseudo-R-squared
rcompanion::nagelkerke(model.final)
```
###Testing overall model fit
Note that testing p-values for a logistic regression uses a Chi-square tests to evaluate the significance of the overall model.
```{r}
### Define null models and compare to final model
model.null = glm(Status ~ 1,
data=Data.final,
family = binomial(link="logit")
)
anova(model.final, model.null, test="Chisq")
```
###Likelihood ratio test
A likelihood ratio test can also be used to test the significance of the overall model
```{r}
lrtest(model.final)
```
A plot of standardized residuals vs. predicted values. The residuals should be unbiased and homoscedastic.
```{r}
#Plot of standardized residuals
plot(fitted(model.final),
rstandard(model.final))
```
```{r}
#Simple plot of predicted values
Data.final =
select(Data.num,
Status,
Upland,
Migr,
Mass,
Indiv,
Insect,
Wood)
Data.final = na.omit(Data.final)
Data.final$predy = predict(model.final, type="response")
### Plot
library(ggplot2)
ggplot(Data.final, aes(x=predy, y=Status)) + geom_point() +
stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
```
```{r}
#ROCR Curve
library(ROCR)
ROCRpred.final <- prediction(Data.final$predy, Data.final$Status)
ROCRperf.final <- performance(ROCRpred.final, 'tpr','fpr')
plot(ROCRperf.final, colorize = TRUE, text.adj = c(-0.2,1.7))
```
###Overdispersion
Is an indication that the model doesn’t fit the data well. Overdispersion is a situation where the residual deviance of the glm is large relative to the residual degrees of freedom. These values are shown in the summary of the model. One guideline is that if the ratio of the residual deviance to the residual degrees of freedom exceeds 1.5, then the model is overdispersed.
```{r}
summary(model.final)$deviance/summary(model.final)$df.residual
```
######################################
######################################
###Linear Discriminant Analysis (LDA)
The LDA aims to identify linear combinations of predictor variables that maximize the separation between classes, then use these linear discrimants to predict the class of individuals. LDA assumes that predictors are normally distributed (Gaussian distribution) and that the different classes have class-specific means and equal variance/covariance.
Iris data:
Goal: predicting iris species based on the predictor variables Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.
```{r}
library(tidyverse)
library(caret)
library(klaR)
```
```{r}
data(iris)
head(iris)
```
###Training vs Test data
To generate a LDA model and evaluate its performance requires the available data to be separated into training and test data.
```{r}
# Split the data into training (80%) and test set (20%)
set.seed(123)
training.samples <- iris$Species %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
```
###Assumption of LDA
Inspecting the univariate distributions of each variable and make sure that they are normally distribute. If not, you can transform them using log and root for exponential distributions and Box-Cox for skewed distributions.
It is generally recommended to standardize/normalize continuous predictors before the analysis since discriminant analysis can be affected by the scale/unit in which predictor variables are measured, and also outliers should be removed.
```{r}
# Estimate preprocessing parameters
preproc.param <- train.data %>% preProcess(method = c("center", "scale"))
# Transform the data using the estimated parameters
train.transformed <- preproc.param %>% predict(train.data)
test.transformed <- preproc.param %>% predict(test.data)
```
###Specifying the LDA Model
```{r}
library(MASS)
model.lda <- lda(Species~., data = train.transformed)
model.lda
```
LDA determines group means and for each individual computes the probability of belonging to the different groups. The individual is then assigned to the group with the highest probability score.
LDA output:
1) Prior probabilities of groups: the proportion of training observations in each group (e.g., 33% of the training observations in the setosa group)
2) Group means: group center of gravity or mean of each variable in each group.
3) Coefficients of linear discriminants: linear combination of predictor variables used to form the LDA decision rule.
LD1 = 0.90*Sepal.Length + 0.65*Sepal.Width - 4.04*Petal.Length - 2.4*Petal.Width.
LD2 = 0.04*Sepal.Length + 0.89*Sepal.Width - 2.1*Petal.Length - 2.6*Petal.Width.
###Two-dimensional Plot of linear discriminants
Linear discriminants are plotted using values of LD1 and LD2 for each of the training observations.
```{r}
lda.data <- cbind(train.transformed, predict(model.lda)$x)
ggplot(lda.data, aes(LD1, LD2)) + geom_point(aes(color = Species))
```
As you can see, there are three distinct groups with some overlap between virginica and versicolor. Plotting again, but adding the code dimen = 1 will only plot in one dimension (LD1). Think of it as a projection of the data onto LD1 with a histogram of that data.
###One-dimensional plot of linear discriminants
```{r, fig.width=6, fig.height=18}
plot(model.lda, dimen = 1, type = "b")
```
These plots illustrate the separation between groups as well as overlapping areas that are potential for mix-ups when predicting classes.
###LDA Partition Plots
An alternate way to plot the linear discriminant functions is to produce an array of plots for every combination of two variables. Colored regions delineate each classification area and any observation that falls within a region is predicted to be from a specific class. Each plot also includes the apparent error rate for that view of the data.
```{r}
partimat(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=train.transformed, method="lda")
```
###Confusion matrix
A confusion matrix can be used to assess the prediction accuracy of a LDA model. The model is run on the training data to verify the model fits the data properly (observed vs predicted). The table output includes three species as the row labels and the predicted species at the column labels.
```{r}
lda.train <- predict(model.lda)
train.transformed$lda <- lda.train$class
table(train.transformed$lda,train.transformed$Species)
```
The total number of correctly predicted observations is the sum of the diagonal.Based on the earlier plots, it makes sense that a few iris versicolor and iris virginica observations may be miscategorized.
###Assess model performance
The test set is run against the model to determine its accuracy.
```{r}
lda.test <- predict(model.lda,test.transformed)
test.transformed$lda <- lda.test$class
table(test.transformed$lda,test.transformed$Species)
```
Overall the model performs very well with the testing set with an accuracy of 100%.
###Overall Model Accuracy:
You can compute the overall model accuracy (x100).
```{r}
mean(lda.test$class==test.transformed$Species)
```
The model correctly classified 100% of observations.
#Prediction: Test data
Information can be extracted for each individuals in the test data set regarding their class assignment
```{r}
predictions <- model.lda %>% predict(test.transformed)
# Predicted classes
head(predictions$class, 6)
# Predicted probabilities of class memebership.
tail(predictions$posterior, 6)
# Linear discriminants
head(predictions$x, 3)
```
The output returns the following elements:
1) The predicted classes of observations (i.e., assigned species name)
2) The posterior probability that the corresponding observation belongs to the groups (columns are the groups, rows are the individuals and values are posterior probabilities)
3) Linear discriminants (e.g., used for plotting)
###Quadratic discriminant analysis - QDA
Quadratic discriminant analysis is a modification of LDA that does not assume equal covariance matrices amongst the groups (i.e., the covariance matrix can be different for each class). QDA is recommended if the training set is very large, or if the assumption of a common covariance matrix for the K classes is clearly not met.
```{r}
#library(MASS)
# Fit the model
model.qda <- qda(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data = train.transformed)
model.qda
# Make predictions
predictions.qda <- model.qda %>% predict(test.transformed)
# Model accuracy
mean(predictions.qda$class == test.transformed$Species)
```
### QDA Partition Plots
Plotting the the quadratic discriminant functions using partition plots provides a good way visualize the difference between the linear functions used in LDA and the quadratic functions used in QDA. The colored regions delineate each classification area. Observations that fall within a region is predicted to be from that specific class. Each plot also includes the apparent error rate for that view of the data.
```{r}
partimat(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data = train.transformed, method="qda")
```
#QDA confusion matrix
The predict function works exactly the same way as before and can be used to create a confusion matrix for the training data.
```{r}
qda.train <- predict(model.qda)
train.transformed$qda <- qda.train$class
table(train.transformed$qda,train.transformed$Species)
```
This model fit the training data very well.
###QDA Model Performance
The test set is run against the model to determine its accuracy.
```{r}
qda.test <- predict(model.qda,test.transformed)
test.transformed$qda <- qda.test$class
table(test.transformed$qda,test.transformed$Species)
```
The model correctly predicted the class of observations 100% of the time.
### K-Nearest Neighbor (KNN)
KNN is a non-parametric supervised learning technique in which we try to classify the data point to a given category with the help of training set. More simply, it captures information of all training cases and classifies new cases based on a similarity. Predictions are made for a new instance (x) by searching through the entire training set for the K most similar cases (neighbors) and summarizing the output variable for those K cases (i.e., based on the mode or most common class value).
```{r}
head(iris)
```
Standardization
When independent variables in training data are measured in different units, it is important to standardize variables before calculating distance.
```{r}
##Generate a random number that is 80% of the total number of rows in dataset.
ran <- sample(1:nrow(iris), 0.8 * nrow(iris))
##the normalization function is created
nor <-function(x) {
(x-min(x))/(max(x)-min(x))
}
##Run nomalization on first 4 coulumns of dataset because they are the predictors
iris_norm <- as.data.frame(lapply(iris[,-5], nor))
summary(iris_norm)
```
The “k” is the number of neighbors that knn uses to determine what class to label an unknown example. The number of “k” to use can vary but a rule of thumb is to take the square root of the total number of observations in the training data (e.g., k=11)
```{r}
##extract training set
iris_train <- iris_norm[ran,]
##extract testing set
iris_test <- iris_norm[-ran,]
##extract 5th column of train dataset because it will be used as 'cl' argument in knn function.
iris_target_category <- iris[ran,5]
##extract 5th column if test dataset to measure the accuracy
iris_test_category <- iris[-ran,5]
##load the package class
library(class)
##run knn function
pr <- knn(iris_train,iris_test,cl=iris_target_category,k=11)
##create confusion matrix
tab <- table(pr,iris_test_category)
tab
```
```{r}
##this function divides the correct predictions by total number of predictions that tell us how accurate teh model is.
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
```
###KNN plot
```{r}
#library(class)
#library(ggplot2)
# Do knn
fit = knn(iris_train,iris_test,cl=iris_target_category,k=11)
# Create a dataframe to simplify charting
plot.df = data.frame(iris_test, predicted = fit)
# Use ggplot
# 2-D plots example only
# Sepal.Length vs Sepal.Width
# First use Convex hull to determine boundary points of each cluster
plot.df1 = data.frame(x = plot.df$Sepal.Length,
y = plot.df$Sepal.Width,
predicted = plot.df$predicted)
find_hull = function(df) df[chull(df$x, df$y), ]
boundary = plyr::ddply(plot.df1, .variables = "predicted", .fun = find_hull)
ggplot(plot.df, aes(Sepal.Length, Sepal.Width, color = predicted, fill = predicted)) +
geom_point(size = 5) +
geom_polygon(data = boundary, aes(x,y), alpha = 0.5)
```