---
title: "Classification Methods w/ Biological Examples"
author: "Ryan Franckowiak"
date: "January 30, 2019"
output: html_document
---
#Classification Methods
Classification is the process of predicting a qualitative response for an observation: the observation is assigned to a category, or class. Qualitative variables are also referred to as categorical variables, and the two terms are often used interchangeably.
There are a variety of different classification techniques, or classifiers, used to predict a qualitative response. The four most widely used classifiers are: logistic regression, linear discriminant analysis, quadratic discriminant analysis, and K-nearest neighbors.
This tutorial begins with an example data set of non-native bird species introduced to New Zealand (Veltman et al. 1996). Simple and multiple logistic regression analyses will be used to model each species' likelihood of becoming established or failing to become established (a binomial response variable) based on one or more predictor variables.
The second half of the tutorial uses linear discriminant analysis, quadratic discriminant analysis, and K-nearest neighbors to predict the identity of three iris species (>2 classes) based on flower characteristics (predictor variables) (Anderson 1936, Fisher 1936).
###Load Required Libraries
```{r, echo=FALSE}
library(readr)
library(plyr)
library(dplyr)
library(FSA)
library(PerformanceAnalytics)
library(psych)
library(car)
library(rcompanion)
library(lmtest)
library(aod)
library(Amelia)
library(cvAUC)
library(caret)
library(ROCR)
library(ggplot2)
library(tidyverse)
library(broom)
library(stats)
library(stats4)
```
##Exercise 1
###Example data set: Bird introductions to New Zealand
During the 19th century, many people released their favorite bird species in New Zealand hoping to establish naturally reproducing populations. Veltman et al. (1996) assessed the success or failure (dependent variable) of 79 species involved in 496 introduction events based on the present-day status of each species. Predictors included life-history variables (taxon, body mass, body length, geographic range, clutch size, brood frequency, diet, migratory status, and habitat use) and measures of introduction effort (number of release events, minimum number of individuals released).
In this tutorial, simple and multiple logistic regression will be used to predict the probability of success of bird introductions and to assess which factors influence that success. Begin by importing the "Birds.csv" data set from the working directory (i.e., GitHub).
```{r, echo=FALSE}
#library(readr)
Birds_df <- readr::read_csv("Birds.csv")
#View(Birds_df)
```
###Numeric variables
All of the variables included in the Birds_df data set need to be converted from integers to numeric variables. Select the 14 variables using the 'select()' function.
```{r}
#library(dplyr)
Birds_num <-
dplyr::select(Birds_df,
Status,
Length,
Mass,
Range,
Migr,
Insect,
Diet,
Clutch,
Broods,
Wood,
Upland,
Water,
Release,
Indiv)
```
###Convert variables
Convert the selected variables from integers to numeric variables using the 'as.numeric()' function.
```{r}
Birds_num$Status = as.numeric(Birds_num$Status)
Birds_num$Length = as.numeric(Birds_num$Length)
Birds_num$Mass = as.numeric(Birds_num$Mass)
Birds_num$Range = as.numeric(Birds_num$Range)
Birds_num$Migr = as.numeric(Birds_num$Migr)
Birds_num$Insect = as.numeric(Birds_num$Insect)
Birds_num$Diet = as.numeric(Birds_num$Diet)
Birds_num$Clutch = as.numeric(Birds_num$Clutch)
Birds_num$Broods = as.numeric(Birds_num$Broods)
Birds_num$Wood = as.numeric(Birds_num$Wood)
Birds_num$Upland = as.numeric(Birds_num$Upland)
Birds_num$Water = as.numeric(Birds_num$Water)
Birds_num$Release = as.numeric(Birds_num$Release)
Birds_num$Indiv = as.numeric(Birds_num$Indiv)
#Visualise
headTail(Birds_num)
```
###Summary statistics
Examine the summary statistics for each of the 14 variables.
```{r}
summary(Birds_num)
```
###Missing data plot
Missing data can severely affect the performance and reliability of logistic regression models. To get a quick idea of the amount of missing data in the dataset, use the 'missmap()' function from the package 'Amelia' to generate a missing data plot.
```{r}
#library(Amelia)
Amelia::missmap(Birds_num, main = "Missing values vs observed")
```
The x-axis shows attributes (i.e., covariates) and the y-axis shows the observations (i.e., species number, n=79). Horizontal red lines indicate missing data for a given species; vertical red blocks represent missing data for an attribute.
###Remove missing data
Remove observations (i.e., bird species) with missing data using the 'na.omit()' function.
```{r}
Birds_omit = na.omit(Birds_num)
```
###Generate training and testing datasets
Split the data into separate training (e.g., 80%) and testing (e.g., 20%) datasets using the 'createDataPartition()' function from the 'caret' package. The training data will be used to develop the logistic regression model and the testing dataset will be used to independently evaluate the performance of the model. To ensure reproducibility of results, set the random seed using the 'set.seed()' function.
```{r}
#library(caret)
set.seed(123)
training_samples <- Birds_omit$Status %>% caret::createDataPartition(p = 0.8, list = FALSE)
Birds_train <- Birds_omit[training_samples, ]
Birds_test <- Birds_omit[-training_samples, ]
```
###Logistic regression: General information
Logistic regression applies to data where the response variable falls into one of two categories (e.g., 1/0, Yes/No, Presence/Absence, etc.). Instead of modeling the response Y directly, logistic regression models the probability that Y belongs to a particular category.
In this example, simple logistic regression will be used to model the relationship between a binary outcome variable (success or failure) and a single predictor variable (e.g., Release: number of times a species was introduced). This is accomplished by modeling the logit-transformed probability as a linear function of the single predictor variable. The model is fit using the 'glm()' function with a 'logit' link function (Bernoulli distribution).
Note: Predictor variables do not need to be standardized. However, standardization may help resolve convergence issues, although such issues typically occur only in extreme cases.
```{r}
#library(stats)
Birds_SLRmod = stats::glm(Status ~ Release,
data=Birds_train,
family = binomial(link="logit")
)
summary(Birds_SLRmod)
```
###Simple logistic regression output
In this example, the estimated coefficient for the intercept is interpreted as the log odds of a bird species successfully becoming established when it has been released zero times. This coefficient is often meaningless on its own and is largely ignored.
The estimated coefficient for 'Release' provides the change in the log odds of the outcome (successfully becoming established) for a one-unit increase in the predictor variable (number of times released). In the current example: for each additional time a species was released, the log odds of becoming established increases by 0.4641 (see 'Odds ratio' below for interpretation).
The Z-value is a test statistic (a Wald test) equal to the ratio of the regression coefficient to its standard error. A Z-value sufficiently far from 0 indicates the coefficient estimate is both large and precise enough to be statistically different from 0 (p-value < 0.05). Conversely, a Z-value close to 0 indicates the coefficient estimate is too small or too imprecise to be confident that the term has an effect on the response (p-value > 0.05).
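The Wald z-values reported by 'summary()' can be reproduced by hand from the coefficient table; a brief sketch (the object names are illustrative):

```{r}
# Recompute the Wald z-values as estimate / standard error, and the
# corresponding two-sided p-values from the standard normal distribution.
# These should match the 'z value' and 'Pr(>|z|)' columns in summary(Birds_SLRmod).
SLR_coefs <- summary(Birds_SLRmod)$coefficients
z_vals <- SLR_coefs[, "Estimate"] / SLR_coefs[, "Std. Error"]
p_vals <- 2 * pnorm(-abs(z_vals))
cbind(z = z_vals, p = p_vals)
```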
###Odds ratio
Because coefficients are in log-odds units, they are often difficult to interpret. To improve interpretation, estimates are typically converted into odds ratios by exponentiating the model coefficients using the 'exp()' function.
```{r}
## odds ratios only
exp(coef(Birds_SLRmod))
```
###Odds ratio w/ confidence intervals
The same logic can be used to get odds ratios and their confidence intervals.
```{r}
## odds ratios and 95% CI
exp(cbind(OR = coef(Birds_SLRmod), confint(Birds_SLRmod)))
```
This shows us that the odds of a bird species becoming established in New Zealand increased by a factor of 1.59x for each additional introduction attempt made (Number of releases).
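Odds can also be translated back into probabilities by applying the inverse-logit to the linear predictor; a minimal sketch (the choice of 5 releases is purely illustrative):

```{r}
# Predicted probability of establishment for a species released 5 times
# (an arbitrary illustrative value), via the inverse-logit (plogis).
b <- coef(Birds_SLRmod)
plogis(b[1] + b[2] * 5)
# Equivalently, using predict() on new data:
predict(Birds_SLRmod, newdata = data.frame(Release = 5), type = "response")
```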
###Model Assumptions:
Logistic regression relaxes many of the assumptions of linear regression (i.e., general linear models). For example, logistic regression:
- does not require a linear relationship between the dependent and independent variables
- does not require the error terms (residuals) to be normally distributed
- does not require homoscedasticity (homogeneity of variance)
###Model Requirements:
Logistic regression does however require that:
- observations are independent (no repeated measurements or matched data)
- the dependent variable is binary (cannot be measured on an interval or ratio scale)
- predictor variables are linearly related to log odds
- excessively influential observations (i.e., outliers) are removed
- sample sizes are large
###Linearity assumption
Check the linear relationship between predictor variables and the logit of the outcome. This can be done by visually inspecting the scatter plot between each predictor and the logit values.
```{r}
#library(stats)
#library(tidyverse)
#library(broom)
#Generate predicted values using the fitted SLR model
Birds_SLRprob <- stats::predict(Birds_SLRmod, type = "response")
#Convert single predictor variable to a new data frame
Birds_SLRrelease <- data.frame(Birds_train[c("Release")])
#Calculate logit values
Birds_SLRlogit <- Birds_SLRrelease %>%
dplyr::mutate(logit = log(Birds_SLRprob/(1-Birds_SLRprob))) %>%
tidyr::gather(key = "predictors", value = "predictor.value", -logit)
#Plot relationship using 'ggplot()' function
ggplot(Birds_SLRlogit, aes(predictor.value, logit)) +
geom_point(size = 2.0, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~"Release", scales = "free_y")
```
###Influential values
Influential values (i.e., outliers) are extreme data points that can alter the quality of the logistic regression model. The most extreme values in the data can be identified by visualizing Cook's distance values (which = 4) using the base 'plot()' method for fitted models (from the 'stats' package).
```{r}
plot(Birds_SLRmod, which = 4, id.n = 4)
```
Note that there are four observations that could be potential outliers. Use the standardized residuals to check whether the four most extreme observations are potentially influential observations.
Data points with absolute standardized residuals above 3 represent 'true' outliers. Calculate the standardized residuals and the Cook’s distance using the 'augment()' function available in the 'broom' package.
```{r}
#library(broom)
#library(dplyr)
# Extract model results
Birds_SLRcooks <- broom::augment(Birds_SLRmod) %>%
dplyr::mutate(index = 1:n())
Birds_SLRcooks %>% dplyr::top_n(4, .cooksd)
```
Visualize the standardized residuals using the ggplot() function.
```{r}
#library(ggplot2)
ggplot2::ggplot(Birds_SLRcooks, aes(index, .std.resid)) +
geom_point(alpha = .5) + theme_bw()
```
None of the observations have a standardized residual larger than 3 (or -3), so we can conclude there are no outliers (influential observations) in the bird training data.
###Assessing Model fit
The likelihood ratio test can be used to assess improvements in model fit when predictor variables are added to a model. This is accomplished by comparing the likelihood of the data under the model of interest (e.g., single-predictor model) against the likelihood of the data under a model with fewer predictors (e.g., null model). The resulting test statistic approximates a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters. The likelihood ratio test can be performed using the 'lrtest()' function from the 'lmtest' package.
```{r}
#library(lmtest)
lmtest::lrtest(Birds_SLRmod)
```
A similar approach for assessing model fit is to calculate the difference between the residual deviance for the model of interest (e.g., single-predictor model) and the residual deviance for the null model (i.e., intercept only). This test also approximates a chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom between the model of interest and the null model (i.e., the number of predictor variables in the model). The test can be calculated using the 'anova()' function from the 'stats' package.
```{r}
#library(stats)
stats::anova(Birds_SLRmod,
stats::update(Birds_SLRmod, ~1), # update produces null model for comparison
test="Chisq")
#or
# with(Bird_SLRmod, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
```
The chi-square of 37.62 with a difference of 1 degree of freedom has an associated p-value of 8.605e-10, which demonstrates that the single variable model fits significantly better than the null model.
###Plotting
A simple way to plot the model is to make use of ggplot’s 'stat_smooth()' function.
```{r}
#library(ggplot2)
ggplot2::ggplot(Birds_train, aes(x=Release, y=Status)) + geom_point() +
stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
```
###Confusion matrix
A confusion matrix is a very useful tool for evaluating the output of a model and examining all possible outcomes of your predictions (true positive, true negative, false positive, false negative). A confusion matrix can be generated using the 'confusionMatrix()' function available in the 'caret' package.
```{r}
#library(caret)
# Use your model to generate predicted values for the training data
set.seed(123)
pBirds_SLRtrain <- predict(Birds_SLRmod, newdata = Birds_train, type = "response")
# use caret and compute a confusion matrix
caret::confusionMatrix(data = as.factor(as.numeric(pBirds_SLRtrain>0.5)), reference = as.factor(Birds_train$Status))
```
###Receiver Operating Characteristic (ROC)
ROC curves provide a graphical representation of classifier performance. The ROC curve is a two-dimensional plot with sensitivity on the y-axis and 1 - specificity on the x-axis.
```{r}
#ROCR Curve
#library(ROCR)
ROCRpred_SLRtrain<-ROCR::prediction(pBirds_SLRtrain, Birds_train$Status)
ROCRperf_SLRtrain<-ROCR::performance(ROCRpred_SLRtrain, 'tpr','fpr')
ROCR::plot(ROCRperf_SLRtrain, colorize = TRUE, text.adj = c(-0.2,1.7))
```
###Area under the curve (AUC)
The AUC is defined as the probability that the fitted model will score a randomly drawn positive sample higher than a randomly drawn negative sample. AUC ranges between 0 and 1; higher AUC values indicate a more powerful classifier (assuming the general shape of the ROC curve remains relatively constant).
```{r}
#library(cvAUC)
cvAUC::AUC(pBirds_SLRtrain, Birds_train$Status, label.ordering = NULL)
```
###Model performance: test data
Model performance should be evaluated using the test dataset.
```{r}
#library(caret)
# Use your model predicted values for the test dataset
set.seed(123)
pBirds_SLRtest <- predict(Birds_SLRmod, newdata = Birds_test, type = "response")
# use caret and compute a confusion matrix
caret::confusionMatrix(data = as.factor(as.numeric(pBirds_SLRtest>0.5)), reference = as.factor(Birds_test$Status))
```
The classifier's performance for the test data, which are distinct from the data used to train the classifier, differs from its performance on the training data, suggesting the model is not particularly good at classifying individuals, although some of this may be due to the small sample size.
###Receiver Operating Characteristic (ROC)
ROC curves provide a graphical representation of classifier performance. The ROC curve is a two-dimensional plot with sensitivity on the y-axis and 1 - specificity on the x-axis.
```{r}
#ROCR Curve
#library(ROCR)
ROCRpred_SLRtest <- ROCR::prediction(pBirds_SLRtest, Birds_test$Status)
ROCRperf_SLRtest <- ROCR::performance(ROCRpred_SLRtest, 'tpr','fpr')
ROCR::plot(ROCRperf_SLRtest, colorize = TRUE, text.adj = c(-0.2,1.7))
```
###Area under the curve (AUC)
The AUC is defined as the probability that the fitted model will score a randomly drawn positive sample higher than a randomly drawn negative sample. AUC ranges between 0 and 1; higher AUC values indicate a more powerful classifier (assuming the general shape of the ROC curve remains relatively constant).
```{r}
#library(cvAUC)
cvAUC::AUC(pBirds_SLRtest, Birds_test$Status, label.ordering = NULL)
```
The AUC value reveals that the single-variable model will only classify individuals correctly with a probability of 0.85. This is considerably lower than the AUC value reported for the training data (0.93), which clearly demonstrates why it is important to independently evaluate the performance of a fitted model using a different dataset than the one used to generate the model.
###Multiple logistic regression
The New Zealand bird data will be extended to include multiple independent (predictor) variables. Just as a reminder, the predictors include both life-history variables (taxon, body mass, body length, geographic range, clutch size, brood frequency, diet, migratory status, and habitat use) and measures of introduction effort (number of release events, minimum number of individuals released).
Unfortunately, due to the limited number of samples included in this dataset, it is not possible to separate the data into training and test datasets as this results in the model failing to converge. Therefore, the multiple logistic regression analysis will be run with the numeric transformed data with missing data omitted. As a consequence, it is not possible to independently evaluate model performance in this case.
###Variable selection procedure
There are two main approaches for variable selection: the all possible regressions approach (or multiple working hypotheses) and automatic methods. Automatic methods are useful when the number of explanatory variables is large and it is not feasible to fit all possible models. Automatic methods use a search algorithm (e.g., Forward selection, Backward elimination and Stepwise regression) to find the best model. To perform forward selection we need to begin by specifying a starting model and the range of models which we want to examine in the search. The 'step()' function from base R can be used to perform variable selection.
```{r}
#library(stats)
### Define full and null models and do step procedure
Birds_MLRnull = stats::glm(Status ~ 1,
data=Birds_omit,
family = binomial(link="logit")
)
Birds_MLRfull = stats::glm(Status ~ Length + Mass + Range + Migr + Insect + Diet +
Clutch + Broods + Wood + Upland + Water +
Release + Indiv,
data=Birds_omit,
family = binomial(link="logit")
)
step(Birds_MLRnull,
scope = list(upper=Birds_MLRfull),
direction="both",
test="Chisq",
data=Birds_omit)
```
According to this procedure, the best model is the one that includes the variables Upland, Migr, Mass, Indiv, Insect, and Wood.
###Final Model Data
For the final model, it is important to retain only the selected variables and to exclude observations that have missing values for those variables, so that only complete records are included.
```{r}
Birds_final <-
dplyr::select(Birds_num,
Status,
Upland,
Migr,
Mass,
Indiv,
Insect,
Wood)
Birds_final <- na.omit(Birds_final)
```
###Multicollinearity
Multicollinearity is a statistical phenomenon in which predictor variables in a logistic regression model are highly correlated. It is not uncommon when there are a large number of covariates in the model. Multicollinearity can cause unstable estimates and inaccurate variances which affects confidence intervals and hypothesis tests. The existence of collinearity inflates the variances of parameter estimates, and thus can lead to incorrect inferences about relationships between explanatory and response variables. Examining the correlation matrix for the selected covariates can be helpful to detect multicollinearity.
```{r}
#library(PerformanceAnalytics)
#Examining correlations among variables (Pearson or Spearman)
PerformanceAnalytics::chart.Correlation(Birds_final,
method="spearman",
histogram=TRUE,
pch=16)
```
Another approach is to examine the correlation coefficients directly
```{r}
#library(psych)
psych::corr.test(Birds_final,
use = "pairwise",
method="spearman",
adjust="none", # Can adjust p-values; see ?p.adjust for options
alpha=.05)
```
None of the correlation values exceed 0.70 (a common rule of thumb), so there is little to no evidence of multicollinearity among predictors.
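As a complementary numeric check, variance inflation factors (VIFs) can be computed with 'car::vif()'; a brief sketch that fits the selected model inline (the same model is fit again in the next section):

```{r}
#library(car)
# Variance inflation factors for the selected predictors; values well
# below 5 (a common cutoff) suggest multicollinearity is not a concern.
car::vif(stats::glm(Status ~ Upland + Migr + Mass + Indiv + Insect + Wood,
                    data = Birds_final,
                    family = binomial(link = "logit")))
```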
###Fitting final model
```{r}
#library(stats)
Birds_MLRmod = stats::glm(Status ~ Upland + Migr + Mass + Indiv + Insect + Wood,
data=Birds_final,
family = binomial(link="logit")
)
summary(Birds_MLRmod)
```
###Assessing significance of model covariates
Another way to assess the overall effect of the retained predictor variables is the Wald test (results are similar to those shown above).
```{r}
#library(car)
car::Anova(Birds_MLRmod, type="II", test="Wald")
```
###Variable Importance
To assess the relative importance of individual predictors in the model, we can also look at the absolute value of each coefficient's test statistic. This technique is implemented by the 'varImp()' function in the 'caret' package for general and generalized linear models.
```{r}
library(caret)
caret::varImp(Birds_MLRmod)
```
###Odds ratio
Odds ratios for each predictor are calculated by exponentiating the model coefficients using the 'exp()' function.
```{r}
exp(coef(Birds_MLRmod))
```
###Influential values
Influential values (i.e., outliers) are identified by visualizing Cook's distance values (which = 4) using the base 'plot()' method for fitted models (from the 'stats' package).
```{r}
plot(Birds_MLRmod, which = 4, id.n = 3)
```
Note that there are three observations that could be potential outliers. Use the standardized residuals to check whether the three most extreme observations are potentially influential observations.
Data points with absolute standardized residuals above 3 represent 'true' outliers. Calculate the standardized residuals and the Cook's distance using the 'augment()' function available in the 'broom' package.
```{r}
#library(broom)
#library(dplyr)
# Extract model results
Birds_MLRcooks <- broom::augment(Birds_MLRmod) %>%
dplyr::mutate(index = 1:n())
Birds_MLRcooks %>% dplyr::top_n(3, .cooksd)
```
Visualize the standardized residuals using the ggplot() function.
```{r}
#library(ggplot2)
ggplot2::ggplot(Birds_MLRcooks, aes(index, .std.resid)) +
geom_point(alpha = .5) + theme_bw()
```
None of the observations have a standardized residual larger than 3 (or -3), so we can conclude there are no outliers (influential observations) in the bird data.
###Overdispersion
Overdispersion is an indication that the model doesn't fit the data well: it is a situation where the residual deviance of the GLM is large relative to the residual degrees of freedom. These values are shown in the model summary. One guideline is that if the ratio of the residual deviance to the residual degrees of freedom exceeds 1.5, then the model is overdispersed.
```{r}
summary(Birds_MLRmod)$deviance/summary(Birds_MLRmod)$df.residual
```
There is no evidence of overdispersion for this model.
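Had the ratio exceeded ~1.5, one common remedy is to refit with a quasibinomial family, which estimates a dispersion parameter and inflates the standard errors accordingly; a hedged sketch (the refit object name is illustrative):

```{r}
# Refit the final model with a quasibinomial family; the estimated
# dispersion parameter (close to 1 here) would be >1 under overdispersion.
Birds_MLRquasi <- stats::glm(Status ~ Upland + Migr + Mass + Indiv + Insect + Wood,
                             data = Birds_final,
                             family = quasibinomial(link = "logit"))
summary(Birds_MLRquasi)$dispersion
```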
###Likelihood ratio test
A likelihood ratio test is used to test the significance of the overall model.
```{r}
lmtest::lrtest(Birds_MLRmod)
```
A similar approach for assessing model fit is to calculate the difference between the residual deviance for the model of interest (e.g., final model) and the residual deviance for the null model (i.e., intercept only). This test also approximates a chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom between the model of interest and the null model (i.e., the number of predictor variables in the model). The test can be calculated using the 'anova()' function from the 'stats' package.
```{r}
#library(stats)
stats::anova(Birds_MLRmod,
stats::update(Birds_MLRmod, ~1), # update produces null model for comparison
test="Chisq")
#or
# with(Bird_MLRmod, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
```
###Pseudo-R-squared
R does not produce R-squared values for generalized linear models (GLMs) to assess model fit. Instead, pseudo-R-squared values can be generated that provide information similar to that provided by R-squared in OLS regression; however, pseudo-R-squared values cannot be interpreted exactly like R-squared in OLS regression. These pseudo-R-squared values compare the maximum likelihood of the model to that of a nested null model fit with the same method. They should be interpreted as a relative measure among similar models.
```{r}
#library(rcompanion)
rcompanion::nagelkerke(Birds_MLRmod)
```
###Plot predicted logistic curve
Because there are multiple predictors included in the model, the only way to examine the relationship visually is to plot predicted values from the model using the 'predict()' function available in the 'stats' package. The plot is generated using ggplot’s 'stat_smooth()' function.
```{r}
#Predicted values
#library(stats)
Birds_final$predy = stats::predict(Birds_MLRmod, type="response")
###Plot
#library(ggplot2)
ggplot(Birds_final, aes(x=predy, y=Status)) + geom_point() +
stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
```
###Confusion matrix
A confusion matrix is a very useful tool for evaluating the output of a model and examining all possible outcomes of your predictions (true positive, true negative, false positive, false negative). A confusion matrix can be generated using the 'confusionMatrix()' function available in the 'caret' package.
```{r}
#library(caret)
# Use your model to generate predicted values
set.seed(123)
pBirds_MLRfinal <- predict(Birds_MLRmod, newdata = Birds_final, type = "response")
# use caret and compute a confusion matrix
caret::confusionMatrix(data = as.factor(as.numeric(pBirds_MLRfinal>0.5)), reference = as.factor(Birds_final$Status))
```
###Receiver Operating Characteristic (ROC)
ROC curves provide a graphical representation of classifier performance. The ROC curve is a two-dimensional plot with sensitivity on the y-axis and 1 - specificity on the x-axis.
```{r}
#library(ROCR)
ROCRpred_MLRfinal <- ROCR::prediction(Birds_final$predy, Birds_final$Status)
ROCRperf_MLRfinal <- ROCR::performance(ROCRpred_MLRfinal, 'tpr','fpr')
ROCR::plot(ROCRperf_MLRfinal, colorize = TRUE, text.adj = c(-0.2,1.7))
```
###Area under the curve (AUC)
The AUC is defined as the probability that the fitted model will score a randomly drawn positive sample higher than a randomly drawn negative sample. AUC ranges between 0 and 1; higher AUC values indicate a more powerful classifier (assuming the general shape of the ROC curve remains relatively constant).
```{r}
#library(cvAUC)
cvAUC::AUC(pBirds_MLRfinal, Birds_final$Status, label.ordering = NULL)
```
##Exercise 2
###Linear Discriminant Analysis (LDA)
LDA aims to identify linear combinations of predictor variables that maximize the separation between classes (here, >2 classes), then uses these linear discriminants to predict the class of individuals. LDA assumes that predictors are normally distributed (Gaussian distribution) and that the different classes have class-specific means and equal variance/covariance.
Iris data:
The iris data set was first introduced by Ronald Fisher in 1936. It contains three plant species (setosa, virginica, versicolor) and four features measured for each sample: sepal length, sepal width, petal length, and petal width. All measurements are given in centimeters.
###Load Libraries
```{r}
library(tidyverse)
library(caret)
library(klaR)
```
###Load data
```{r}
library(readr)
Iris_df <- readr::read_csv("Iris.csv")
Iris_df$species <- as.factor(Iris_df$species)
#View(Iris_df)
```
###Generate training and testing datasets
Split the data into separate training (e.g., 80%) and testing (e.g., 20%) datasets using the 'createDataPartition()' function from the 'caret' package. To ensure reproducibility of results, set the random seed using the 'set.seed()' function.
```{r}
#library(caret)
# Split the data into training (80%) and test set (20%)
set.seed(123)
Iris_training_samples <- Iris_df$species %>% caret::createDataPartition(p = 0.8, list = FALSE)
Iris_train <- data.frame(Iris_df[Iris_training_samples, ])
Iris_test  <- data.frame(Iris_df[-Iris_training_samples, ])
```
###Assumption of LDA
Inspect the univariate distribution of each variable and make sure it is approximately normally distributed. If not, you can transform it: log and root transformations for exponential distributions, Box-Cox for skewed distributions.
```{r}
library(car)
par(mfrow = c(2, 2))
car::qqPlot(Iris_train$s.length)
car::qqPlot(Iris_train$s.width)
car::qqPlot(Iris_train$p.length)
car::qqPlot(Iris_train$p.width)
```
```{r}
#Test for normality
shapiro.test(Iris_train$s.length)
shapiro.test(Iris_train$s.width)
shapiro.test(Iris_train$p.length)
shapiro.test(Iris_train$p.width)
```
It is generally recommended to standardize/normalize continuous predictors before the analysis, since discriminant analysis can be affected by the scale/units in which the predictor variables are measured; outliers should also be removed.
```{r}
#library(caret)
#library(stats)
# Estimate preprocessing parameters
preproc.param <- Iris_train %>% caret::preProcess(method = c("center", "scale"))
# Transform the data using the estimated parameters
Iris_train_trans <- preproc.param %>% stats::predict(Iris_train)
Iris_test_trans <- preproc.param %>% stats::predict(Iris_test)
```
###Specifying the LDA Model
```{r}
#library(MASS)
Iris_train_lda <- MASS::lda(species~., data = Iris_train_trans)
Iris_train_lda
```
LDA determines group means and for each individual computes the probability of belonging to the different groups. The individual is then assigned to the group with the highest probability score.
LDA output:
1) Prior probabilities of groups: the proportion of training observations in each group (e.g., 33% of the training observations in each of the three groups)
2) Group means: group center of gravity or mean of each variable in each group.
3) Coefficients of linear discriminants: linear combination of predictor variables used to form the LDA decision rule.
LD1 = 0.90*s.length + 0.66*s.width - 4.04*p.length - 2.4*p.width
LD2 = 0.04*s.length + 0.89*s.width - 2.2*p.length - 2.6*p.width
###Two-dimensional Plot of linear discriminants
Linear discriminants are plotted using values of LD1 and LD2 for each of the training observations.
```{r}
#library(stats)
Iris_lda_data <- cbind(Iris_train_trans, stats::predict(Iris_train_lda)$x)
ggplot(Iris_lda_data, aes(LD1, LD2)) + geom_point(aes(color = species))
```
There are three relatively distinct groups with some overlap between virginica and versicolor.
Plotting again, but this time adding the argument dimen = 1, plots in one dimension (LD1) only. Think of it as a projection of the data onto LD1, shown as a histogram.
###One-dimensional plot of linear discriminants
```{r, fig.width=6, fig.height=18}
plot(Iris_train_lda, dimen = 1, type = "b")
```
These plots illustrate the separation between groups, as well as the overlapping areas where classes may be confused when predicting.
###LDA Partition Plots
An alternative way to plot the linear discriminant functions is to produce an array of plots for every pairwise combination of variables. Colored regions delineate the classification areas; any observation that falls within a region is predicted to belong to that class. Each plot also includes the apparent error rate for that view of the data.
```{r}
#library(klaR)
partimat(species~ s.length + s.width + p.length + p.width, data=Iris_train_trans, method="lda")
```
###Confusion matrix
A confusion matrix can be used to assess the prediction accuracy of an LDA model. The model is first run on the training data to verify that it fits the data properly (observed vs. predicted). In the table output, the rows are the predicted species and the columns are the observed species.
```{r}
Iris_train_pred <- predict(Iris_train_lda)
Iris_train_trans$lda <- Iris_train_pred$class
table(Iris_train_trans$lda,Iris_train_trans$species)
```
The total number of correctly predicted observations is the sum of the diagonal. Based on the earlier plots, it makes sense that a few iris versicolor and iris virginica observations may be miscategorized.
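The accuracy implied by any confusion matrix can be computed directly from the table: correct predictions lie on the diagonal. A small sketch with made-up counts (not the fitted model's table):

```{r}
# Accuracy from a confusion matrix: sum of the diagonal (correct predictions)
# divided by the total count (counts below are hypothetical)
tab <- matrix(c(40, 0, 0,
                0, 38, 2,
                0, 3, 37), nrow = 3, byrow = TRUE)
sum(diag(tab)) / sum(tab)  # 115/120, i.e. ~0.958
```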
###Assess model performance
The test dataset is run against the model to determine model assignment accuracy.
```{r}
Iris_test_pred <- predict(Iris_train_lda, Iris_test_trans)
Iris_test_trans$lda <- Iris_test_pred$class
table(Iris_test_trans$lda,Iris_test_trans$species)
```
Overall, the model performs very well on the test set, with an accuracy of 100%.
###Overall Model Accuracy
You can compute the overall model accuracy (multiply by 100 for a percentage).
```{r}
mean(Iris_test_pred$class==Iris_test_trans$species)
```
The model correctly classified 100% of observations.
###Prediction: Test data
Information can be extracted for each individual in the test dataset regarding its class assignment.
```{r}
#Iris_test_pred <- Iris_train_lda %>% predict(Iris_test_trans)
# Predicted classes
head(Iris_test_pred$class, 6)
# Predicted probabilities of class membership
head(Iris_test_pred$posterior, 6)
# Linear discriminants
head(Iris_test_pred$x, 6)
```
The output returns the following elements:
1) The predicted classes of observations (i.e., assigned species name)
2) The posterior probability that the corresponding observation belongs to the groups (columns are the groups, rows are the individuals and values are posterior probabilities)
3) Linear discriminants (e.g., used for plotting)
###Quadratic discriminant analysis - QDA
Quadratic discriminant analysis is a modification of LDA that does not assume equal covariance matrices amongst the groups (i.e., the covariance matrix can be different for each class). QDA is recommended if the training set is very large, or if the assumption of a common covariance matrix for the K classes is clearly not met.
###Specifying the QDA Model
```{r}
#library(MASS)
# Fit the model
Iris_train_qda <- qda(species~ s.length + s.width + p.length + p.width, data = Iris_train_trans)
Iris_train_qda
# Make predictions
Iris_pred_qda <- Iris_train_qda %>% predict(Iris_train_trans)
# Model accuracy
mean(Iris_pred_qda$class == Iris_train_trans$species)
```
### QDA Partition Plots
Plotting the quadratic discriminant functions with partition plots is a good way to visualize the difference between the linear boundaries used in LDA and the quadratic boundaries used in QDA. Colored regions delineate the classification areas; observations that fall within a region are predicted to belong to that class. Each plot also includes the apparent error rate for that view of the data.
```{r}
partimat(species~ s.length + s.width + p.length + p.width, data = Iris_train_trans, method="qda")
```
###QDA confusion matrix
The predict function works exactly the same way as before and can be used to create a confusion matrix for the training data. In the table output, the rows are the predicted species and the columns are the observed species.
```{r}
Iris_train_QDApred <- predict(Iris_train_qda)
Iris_train_trans$qda <- Iris_train_QDApred$class
table(Iris_train_trans$qda,Iris_train_trans$species)
```
The total number of correctly predicted observations is the sum of the diagonal. Based on the earlier plots, it makes sense that a few iris versicolor and iris virginica observations may be miscategorized.
###Assess model performance
The test dataset is run against the model to determine model assignment accuracy.
```{r}
Iris_test_QDApred <- predict(Iris_train_qda, Iris_test_trans)
Iris_test_trans$qda <- Iris_test_QDApred$class
table(Iris_test_trans$qda,Iris_test_trans$species)
```
Overall, the model performs very well on the test set, with an accuracy of 100%.
###Overall Model Accuracy
You can compute the overall model accuracy (multiply by 100 for a percentage).
```{r}
mean(Iris_test_QDApred$class==Iris_test_trans$species)
```
The model correctly classified 100% of observations.
### K-Nearest Neighbor (KNN)
KNN is a non-parametric supervised learning technique that classifies a new data point with the help of the training set. More simply, it stores all training cases and classifies new cases by similarity: a prediction for a new instance (x) is made by searching the entire training set for the K most similar cases (neighbors) and summarizing the output variable of those K cases (for classification, the mode, i.e., the most common class value).
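As a minimal sketch of the idea (using R's built-in iris data and the class package's knn(), which may differ from the implementation used in the exercise below), with predictors scaled first since KNN is distance-based:

```{r}
# Minimal KNN sketch on R's built-in iris data (class::knn is an assumption;
# the exercise below may use a different package)
library(class)
set.seed(123)
idx <- sample(seq_len(nrow(iris)), 120)  # 120 training rows, 30 test rows
train_x <- scale(iris[idx, 1:4])         # KNN is distance-based, so scale
test_x <- scale(iris[-idx, 1:4],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
pred <- class::knn(train_x, test_x, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])         # test-set accuracy
```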
```{r}