diff --git a/figure_1.png b/figure_1.png
new file mode 100644
index 0000000..c6ffb12
Binary files /dev/null and b/figure_1.png differ
diff --git a/learning_curve.py b/learning_curve.py
index 2364f2c..92e0283 100755
--- a/learning_curve.py
+++ b/learning_curve.py
@@ -17,7 +17,14 @@
 # You should repeat each training percentage num_trials times to smooth out variability
 # for consistency with the previous example use model = LogisticRegression(C=10**-10) for your learner
-# TODO: your code here
+model = LogisticRegression(C=10**-50)
+for index, x in enumerate(train_percentages):
+    summing = []
+    for i in range(num_trials):
+        x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, train_size=float(x) * .01)
+        model.fit(x_train, y_train)
+        summing.append(model.score(x_test, y_test))
+    test_accuracies[index] = float(sum(summing))/len(summing)

 fig = plt.figure()
 plt.plot(train_percentages, test_accuracies)
diff --git a/questions.txt b/questions.txt
new file mode 100644
index 0000000..e8e14e4
--- /dev/null
+++ b/questions.txt
@@ -0,0 +1,4 @@
+1. The general trend of the curve is upward: as the percentage of data used for training increases, test accuracy increases, though the curve appears to level off at higher percentages.
+2. The curve appears noisier in the range of 55 to 80 percent of data used for training. The likely reason is that fewer observations are left over for the test set, so each measurement of the model's accuracy varies more.
+3. About 100 trials are needed to get a smooth curve.
+4. When I tried C=10**-1 (a larger value than the one I used), the curve rises rapidly and levels off at a high accuracy.
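
For comparison, scikit-learn also ships a learning_curve helper that performs the same sweep-and-average loop internally. The following is a minimal sketch, not part of the patch above; the digits dataset and the C=10**-10 learner follow the assignment comment, and every other name and parameter is an illustrative assumption.

# Sketch: reproduce the learning-curve plot with sklearn.model_selection.learning_curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

data = load_digits()

# Sweep training-set sizes from 10% to 90% of each cross-validation split and
# let 5-fold cross-validation do the averaging instead of hand-rolled trials.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(C=10**-10),
    data.data,
    data.target,
    train_sizes=np.linspace(0.1, 0.9, 9),
    cv=5,
)

plt.plot(train_sizes, test_scores.mean(axis=1))
plt.xlabel("Number of training examples")
plt.ylabel("Mean accuracy on the held-out fold")
plt.show()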