<!DOCTYPE html><html><head><meta charset="utf-8"><title>Untitled Document.md</title><style></style></head><body id="preview">
<p>Initialisation of the project: set the working directory, load the libraries, and register multiple cores. It is unclear whether the multicore setup actually speeds up training; see
<a href="http://machinelearningmastery.com/tuning-machine-learning-models-using-the-caret-r-package/">http://machinelearningmastery.com/tuning-machine-learning-models-using-the-caret-r-package/</a></p>
<p>Loading Data</p>
<pre><code class="language-r">setwd("/cousera/")
library(caret)
library(randomForest)  # needed for randomForest() further below
library(doMC)
set.seed(1000)
registerDoMC(cores = 2)
</code></pre>
<p>Mask NA, empty strings, and "#DIV/0!" (a spreadsheet division-by-zero artifact) as missing values when reading the data:</p>
<pre><code class="language-r">ignoreStrings <- c("NA","","#DIV/0!")
trainingRaw <- read.csv("pml-training.csv", na.strings=ignoreStrings)
trainingRaw <- trainingRaw[,8:160]
</code></pre>
<p>The first seven columns (user name, timestamps, etc.) are skipped; "roll_belt" is the first interesting predictor. The same is done for the test set:</p>
<pre><code class="language-r">testingRaw <- read.csv("pml-testing.csv", na.strings=ignoreStrings)
testingRaw <- testingRaw[,8:160]
</code></pre>
<p>Searching for predictors with near-zero variance:</p>
<pre><code class="language-r">noVariance <- nearZeroVar(trainingRaw)
str(noVariance)
</code></pre>
<p>The columns indexed by noVariance carry almost no variance, so they are removed from both the training and the testing set:</p>
<pre><code class="language-r">trainingRawNoVariance <- trainingRaw[,-noVariance]
testingRawNoVariance <- testingRaw[,-noVariance]
</code></pre>
<p>The NA fields are still there:</p>
<pre><code class="language-r">str(trainingRawNoVariance)
</code></pre>
<p>So next, remove the columns that contain NAs:</p>
<pre><code class="language-r">nonEmptyFields <- names(trainingRawNoVariance[,colSums(is.na(trainingRawNoVariance)) == 0])
trainingData <- trainingRawNoVariance[,c(nonEmptyFields)]
</code></pre>
<p>Estimating the out-of-sample error on a held-out split:</p>
<pre><code class="language-r">tmpTrain <- createDataPartition(y = trainingData$classe,
                                p = 0.8,
                                list = FALSE)
trainingInt <- trainingData[tmpTrain,]
testingInt <- trainingData[-tmpTrain,]
# method="parRF" belongs to caret::train(), not randomForest(), so it is dropped here
modelRfOutOfSample <- randomForest(classe ~ ., data = trainingInt)
predictionsOutOfSample <- predict(modelRfOutOfSample, newdata = testingInt)
confusionMatrix(predictionsOutOfSample, testingInt$classe)
</code></pre>
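<p>The estimated out-of-sample error rate can be read directly off the held-out confusion matrix as one minus its accuracy. A small sketch using the objects created above (the variable names cmOutOfSample and outOfSampleError are illustrative, not part of the original script):</p>
<pre><code class="language-r"># estimated out-of-sample error = 1 - accuracy on the held-out 20%
cmOutOfSample <- confusionMatrix(predictionsOutOfSample, testingInt$classe)
outOfSampleError <- 1 - as.numeric(cmOutOfSample$overall["Accuracy"])
print(outOfSampleError)
</code></pre>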
<p>Random forest is the model with the best error rate, so train it on the full training data:</p>
<pre><code class="language-r">modelRandomForest <- train(classe ~ ., method="parRF", data=trainingData)
predictTraining <- predict(modelRandomForest, trainingData)
</code></pre>
<p>Check the predictions against the training data and print the accuracy. Random forest seems to be the best-matching model:</p>
<pre><code class="language-r">confMatrix <- confusionMatrix(predictTraining, trainingData$classe)
print(confMatrix$overall)
</code></pre>
<p>Confusion matrix output (note that this is accuracy on the training data itself, so perfect accuracy is expected):</p>
<pre><code class="language-txt">Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue
1.0000000 1.0000000 0.9998120 1.0000000 0.2843747 0.0000000 NaN
</code></pre>
<p>Now predict on the test data and write the quiz answers to a file:</p>
<pre><code class="language-r">predictionTesting <- predict(modelRandomForest, testingRawNoVariance)
print(predictionTesting)
write.table(predictionTesting, file="pred.txt")
</code></pre>
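<p>For the quiz, one answer per file can be more convenient than a single table. A possible helper for that (the function name and file naming scheme below are illustrative assumptions, not part of the original script):</p>
<pre><code class="language-r"># write one plain-text file per prediction, e.g. problem_id_1.txt, problem_id_2.txt, ...
writePredictionFiles <- function(predictions) {
  for (i in seq_along(predictions)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(predictions[i], file = filename,
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictionFiles(predictionTesting)
</code></pre>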
</body></html>