diff --git a/Questions.txt b/Questions.txt
new file mode 100644
index 0000000..f6f8f26
--- /dev/null
+++ b/Questions.txt
@@ -0,0 +1,4 @@
+1. The general trend in the curve is that the higher the percentage of data used for training, the higher the accuracy on the test set.
+2. The lower half of the curve appears noisier than the upper half, which may be because there is less data to build the model with, so we cannot expect very representative models.
+3. The greater the number of trials, the smoother the curve. At 100 trials, I start to get a smoother curve with some recurring bumps; when increased to 500 trials, I get a fairly smooth curve.
+4. When the C value is increased, the model's accuracy at lower percentages of training data seems to improve. When the C value is decreased, the model's accuracy at higher percentages of training data seems to improve.
diff --git a/learning_curve.py b/learning_curve.py
index 2364f2c..bb7fa97 100755
--- a/learning_curve.py
+++ b/learning_curve.py
@@ -19,6 +19,15 @@
 
 # TODO: your code here
 
+for i in range(len(train_percentages)):
+    result = 0
+    for j in range(num_trials):
+        X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=train_percentages[i]/100.0)
+        model = LogisticRegression(C=10**-10)
+        model.fit(X_train, y_train)
+        result += model.score(X_test, y_test)
+    test_accuracies[i] = result/num_trials
+
 fig = plt.figure()
 plt.plot(train_percentages, test_accuracies)
 plt.xlabel('Percentage of Data Used for Training')
diff --git a/plot.png b/plot.png
new file mode 100644
index 0000000..c94ba36
Binary files /dev/null and b/plot.png differ
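
Note (not part of the patch above): the observation in question 4 can be checked by sweeping over several C values and plotting one learning curve per value. The following is a minimal sketch under the assumption that the data is scikit-learn's digits dataset (sklearn.datasets.load_digits) and that train_test_split comes from sklearn.model_selection; the actual data object and imports in learning_curve.py are not shown in this hunk, so the names may need to be adapted.

# Sketch: compare learning curves for several regularization strengths C.
# Assumes the digits dataset; the real learning_curve.py may load its data differently.
import matplotlib.pyplot as plt
import numpy
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_digits()
train_percentages = range(5, 95, 5)
num_trials = 20  # more trials give a smoother curve (see question 3)

for C in [10**-10, 10**-5, 1.0, 10**5]:
    test_accuracies = numpy.zeros(len(train_percentages))
    for i, pct in enumerate(train_percentages):
        total = 0.0
        for _ in range(num_trials):
            # New random train/test split each trial, then average the test scores.
            X_train, X_test, y_train, y_test = train_test_split(
                data.data, data.target, train_size=pct / 100.0)
            model = LogisticRegression(C=C)
            model.fit(X_train, y_train)
            total += model.score(X_test, y_test)
        test_accuracies[i] = total / num_trials
    plt.plot(train_percentages, test_accuracies, label='C = %g' % C)

plt.xlabel('Percentage of Data Used for Training')
plt.ylabel('Accuracy on Test Set')
plt.legend()
plt.show()

Overlaying the curves makes the trade-off described in question 4 visible directly: heavier regularization (smaller C) tends to help when little training data is available, while weaker regularization (larger C) tends to help once more data is used.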