<!DOCTYPE html><html><head><meta charset="utf-8"><title>Untitled Document.md</title><style></style></head><body id="preview">
<p>Initialisation of the project: set the working directory, load the libraries, and register multiple cores. It is unclear whether the multicore setup actually speeds up training; see
<a href="http://machinelearningmastery.com/tuning-machine-learning-models-using-the-caret-r-package/">http://machinelearningmastery.com/tuning-machine-learning-models-using-the-caret-r-package/</a></p>
<p>Loading Data</p>
<pre><code class="language-r">setwd("/cousera/")
library(caret)
library(randomForest)  # needed for randomForest() further below
library(doMC)
set.seed(1000)
registerDoMC(cores = 2)
</code></pre>
<p>Mask NA, empty strings, and "#DIV/0!" (a spreadsheet division-by-zero artifact) as missing values when reading the data:</p>
<pre><code class="language-r">ignoreStrings <- c("NA","","#DIV/0!")
trainingRaw <- read.csv("pml-training.csv", na.strings=ignoreStrings)
trainingRaw <- trainingRaw[,8:160]
</code></pre>
<p>The first seven columns (user name, timestamps, etc.) are skipped; "roll_belt" is the first interesting predictor. The same is done for the test set:</p>
<pre><code class="language-r">testingRaw <- read.csv("pml-testing.csv", na.strings=ignoreStrings)
testingRaw <- testingRaw[,8:160]
</code></pre>
<p>Searching for predictors with near-zero variance:</p>
<pre><code class="language-r">noVariance <- nearZeroVar(trainingRaw)
str(noVariance)
</code></pre>
<p>The columns indexed by noVariance carry almost no variance, so they are removed from both the training and the testing set:</p>
<pre><code class="language-r">trainingRawNoVariance <- trainingRaw[,-noVariance]
testingRawNoVariance <- testingRaw[,-noVariance]
</code></pre>
<p>The NA fields are still there:</p>
<pre><code class="language-r">str(trainingRawNoVariance)
</code></pre>
<p>So next, remove the columns that contain NAs:</p>
<pre><code class="language-r">nonEmptyFields <- names(trainingRawNoVariance[,colSums(is.na(trainingRawNoVariance)) == 0])
trainingData <- trainingRawNoVariance[,c(nonEmptyFields)]
</code></pre>
<p>Estimating the out-of-sample error on a held-out split:</p>
<pre><code class="language-r">tmpTrain <- createDataPartition(y = trainingData$classe,
                                p = 0.8,
                                list = FALSE)
trainingInt <- trainingData[tmpTrain,]
testingInt <- trainingData[-tmpTrain,]
# method="parRF" belongs to caret::train(), not randomForest(), so it is dropped here
modelRfOutOfSample <- randomForest(classe ~ ., data = trainingInt)
predictionsOutOfSample <- predict(modelRfOutOfSample, newdata = testingInt)
confusionMatrix(predictionsOutOfSample, testingInt$classe)
</code></pre>
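<p>The estimated out-of-sample error rate can be read directly off the held-out confusion matrix as one minus its accuracy. A small sketch using the objects created above (the variable names cmOutOfSample and outOfSampleError are illustrative, not part of the original script):</p>
<pre><code class="language-r"># estimated out-of-sample error = 1 - accuracy on the held-out 20%
cmOutOfSample <- confusionMatrix(predictionsOutOfSample, testingInt$classe)
outOfSampleError <- 1 - as.numeric(cmOutOfSample$overall["Accuracy"])
print(outOfSampleError)
</code></pre>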
<p>Random forest is the model with the best error rate, so train it on the full training data:</p>
<pre><code class="language-r">modelRandomForest <- train(classe ~ ., method="parRF", data=trainingData)
predictTraining <- predict(modelRandomForest, trainingData)
</code></pre>
<p>Check the predictions against the training data and print the accuracy. Random forest seems to be the best-matching model:</p>
<pre><code class="language-r">confMatrix <- confusionMatrix(predictTraining, trainingData$classe)
print(confMatrix$overall)
</code></pre>
<p>Confusion matrix output (note that this is accuracy on the training data itself, so perfect accuracy is expected):</p>
<pre><code class="language-txt">Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue
1.0000000 1.0000000 0.9998120 1.0000000 0.2843747 0.0000000 NaN
</code></pre>
<p>Now predict on the test data and write the quiz answers to a file:</p>
<pre><code class="language-r">predictionTesting <- predict(modelRandomForest, testingRawNoVariance)
print(predictionTesting)
write.table(predictionTesting, file="pred.txt")
</code></pre>
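<p>For the quiz, one answer per file can be more convenient than a single table. A possible helper for that (the function name and file naming scheme below are illustrative assumptions, not part of the original script):</p>
<pre><code class="language-r"># write one plain-text file per prediction, e.g. problem_id_1.txt, problem_id_2.txt, ...
writePredictionFiles <- function(predictions) {
  for (i in seq_along(predictions)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(predictions[i], file = filename,
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictionFiles(predictionTesting)
</code></pre>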
</body></html>