From a9ad672a5a3c05ce83b05723c130935997c7681d Mon Sep 17 00:00:00 2001 From: jralexander Date: Fri, 26 Apr 2019 13:57:02 -0700 Subject: [PATCH 1/3] Update GitHub Issue Classification to v 1.0.0 --- .../tutorials/github-issue-classification.md | 231 ++++-------------- 1 file changed, 54 insertions(+), 177 deletions(-) diff --git a/docs/machine-learning/tutorials/github-issue-classification.md b/docs/machine-learning/tutorials/github-issue-classification.md index 1a039c4ca39df..8f3d95ad2a2cc 100644 --- a/docs/machine-learning/tutorials/github-issue-classification.md +++ b/docs/machine-learning/tutorials/github-issue-classification.md @@ -1,19 +1,17 @@ --- title: Use ML.NET in a GitHub issue multiclass classification scenario description: Discover how to use ML.NET in a multiclass classification scenario to classify GitHub issues to assign them to a given area. -ms.date: 03/12/2019 +ms.date: 04/26/2019 ms.topic: tutorial ms.custom: mvc #Customer intent: As a developer, I want to use ML.NET to apply a multiclass classification learning algorithm so that I can understand how to classify GitHGub issues to assign them to a given area. --- # Tutorial: Use ML.NET in a multiclass classification scenario to classify GitHub issues -This sample tutorial illustrates using ML.NET to create a GitHub issue classifier via a .NET Core console application using C# in Visual Studio 2017. +This sample tutorial illustrates using ML.NET to create a GitHub issue classifier to train a model that classifies and predicts the Area label for a GitHub issue via a .NET Core console application using C# in Visual Studio 2019. 
In this tutorial, you learn how to: > [!div class="checklist"] -> * Understand the problem -> * Select the appropriate machine learning algorithm > * Prepare your data > * Transform the data > * Train the model @@ -21,15 +19,6 @@ In this tutorial, you learn how to: > * Predict with the trained model > * Deploy and Predict with a loaded model -> [!NOTE] -> This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit [the ML.NET introduction](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet). - -This tutorial and related sample are currently using **ML.NET version 0.11**. For more information, see the release notes at the [dotnet/machinelearning github repo](https://github.com/dotnet/machinelearning/tree/master/docs/release-notes). - -## GitHub issue sample overview - -The sample is a console app that uses ML.NET to train a model that classifies and predicts the Area label for a GitHub issue. It also evaluates the model with a second dataset for quality analysis. The issue datasets are from the dotnet/corefx GitHub repo. - You can find the source code for this tutorial at the [dotnet/samples](https://github.com/dotnet/samples/tree/master/machine-learning/tutorials/GitHubIssueClassification) repository. ## Prerequisites @@ -39,74 +28,6 @@ You can find the source code for this tutorial at the [dotnet/samples](https://g * The [Github issues tab separated file (issues_train.tsv)](https://raw.githubusercontent.com/dotnet/samples/master/machine-learning/tutorials/GitHubIssueClassification/Data/issues_train.tsv). * The [Github issues test tab separated file (issues_test.tsv)](https://raw.githubusercontent.com/dotnet/samples/master/machine-learning/tutorials/GitHubIssueClassification/Data/issues_test.tsv). -## Machine learning workflow - -This tutorial follows a machine learning workflow that enables the process to move in an orderly fashion. 
- -The workflow phases are as follows: - -1. **Understand the problem** -2. **Prepare your data** - * **Load the data** - * **Extract features (Transform your data)** -3. **Build and train** - * **Train the model** - * **Evaluate the model** -4. **Deploy Model** - * **Use the Model to predict** - -### Understand the problem - -You first need to understand the problem, so you can break it down to parts that can support building and training the model. Breaking down the problem allows you to predict and evaluate the results. - -The problem for this tutorial is to understand what area incoming GitHub issues belong to in order to label them correctly for prioritization and scheduling. - -You can break down the problem to the following parts: - -* the issue title text -* the issue description text -* an area value for the model training data -* a predicted area value that you can evaluate and then use operationally - -You then need to **determine** the area, which helps you with the machine learning task selection. - -## Select the appropriate machine learning algorithm - -With this problem, you know the following facts: - -Training data: - -GitHub issues can be labeled in several areas (**Area**) as in the following examples: - -* area-System.Numerics -* area-System.Xml -* area-Infrastructure -* area-System.Linq -* area-System.IO - -Predict the **Area** of a new GitHub Issue such as in the following examples: - -* Contract.Assert vs Debug.Assert -* Make fields readonly in System.Xml - -The classification machine learning algorithm is best suited for this scenario. - -### About the classification learning algorithm - -Classification is a machine learning algorithm that uses data to **determine** the category, type, or class of an item or row of data. For example, you can use classification to: - -* Identify sentiment as positive or negative. -* Classify email as spam, junk, or good. -* Determine whether a patient's lab sample is cancerous. 
-* Categorize customers by their propensity to respond to a sales campaign. - -Classification learning algorithm use cases are frequently one of the following types: - -* Binary: either A or B. -* Multiclass: multiple categories that can be predicted by using a single model. - -For this type of problem, use a Multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary). - ## Create a console application ### Create a project @@ -123,7 +44,7 @@ For this type of problem, use a Multiclass classification learning algorithm, si 4. Install the **Microsoft.ML NuGet Package**: - In Solution Explorer, right-click on your project and select **Manage NuGet Packages**. Choose "nuget.org" as the Package source, select the Browse tab, search for **Microsoft.ML**, select that package in the list, and select the **Install** button. Select the **OK** button on the **Preview Changes** dialog and then select the **I Accept** button on the **License Acceptance** dialog if you agree with the license terms for the packages listed. + In Solution Explorer, right-click on your project and select **Manage NuGet Packages**. Choose "nuget.org" as the Package source, select the Browse tab, search for **Microsoft.ML**, select the **v 1.0.0** package in the list, and select the **Install** button. Select the **OK** button on the **Preview Changes** dialog and then select the **I Accept** button on the **License Acceptance** dialog if you agree with the license terms for the packages listed. 
### Prepare your data @@ -137,15 +58,14 @@ Add the following additional `using` statements to the top of the *Program.cs* f [!code-csharp[AddUsings](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#AddUsings)] -Create three global fields to hold the paths to the recently downloaded files, and global variables for the `MLContext`, `DataView`, `PredictionEngine`, and `TextLoader`: +Create three global fields to hold the paths to the recently downloaded files, and global variables for the `MLContext`, `DataView`, and `PredictionEngine`: * `_trainDataPath` has the path to the dataset used to train the model. * `_testDataPath` has the path to the dataset used to evaluate the model. * `_modelPath` has the path where the trained model is saved. * `_mlContext` is the `MLContext` that provides processing context. -* `_trainingDataView` is the `IDataView` used to process the training dataset. +* `_trainingDataView` is the [IDataView](xref:Microsoft.ML.IDataView) used to process the training dataset. * `_predEngine` is the `PredictionEngine` used for single predictions. -* `_reader` is the `TextLoader` used to load and transform the datasets. Add the following code to the line right above the `Main` method to specify those paths and the other variables: @@ -165,16 +85,20 @@ Remove the existing class definition and add the following code, which has two c [!code-csharp[DeclareGlobalVariables](~/samples/machine-learning/tutorials/GitHubIssueClassification/GitHubIssueData.cs#DeclareTypes)] +The `label` is the column you want to predict. The identified `Features` are the inputs you give the model to predict the Label. + +Use the [LoadColumnAttribute](xref:Microsoft.ML.Data.LoadColumnAttribute) to specify the indices of the source columns in the data set. 
+ `GitHubIssue` is the input dataset class and has the following fields: -* `ID` contains a value for the GitHub issue ID -* `Area` contains a value for the `Area` label -* `Title` contains the GitHub issue title -* `Description` contains the GitHub issue description +* the first column `ID` (GitHub Issue ID) +* the second column `Area` (the prediction for training) +* the third column `Title` (GitHub issue title) is the first `feature` used for predicting the `Area` +* the fourth column `Description` is the second `feature` used for predicting the `Area` -`IssuePrediction` is the class used for prediction after the model has been trained. It has a single `string` (`Area`) and a `PredictedLabel` `ColumnName` attribute. The `Label` is used to create and train the model, and it's also used with a second dataset to evaluate the model. The `PredictedLabel` is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used. +`IssuePrediction` is the class used for prediction after the model has been trained. It has a single `string` (`Area`) and a `PredictedLabel` `ColumnName` attribute. The `PredictedLabel` is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used. -When building a model with ML.NET, you start by creating an . `MLContext` is comparable conceptually to using `DbContext` in Entity Framework. The environment provides a context for your ML job that can be used for exception tracking and logging. +All ML.NET operations start in the [MLContext](xref:Microsoft.ML.MLContext) class. Initializing `mlContext` creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to `DBContext` in `Entity Framework`. 
### Initialize variables in Main @@ -184,28 +108,14 @@ Initialize the `_mlContext` global variable with a new instance of `MLContext` ## Load the data -Next, initialize the `_trainingDataView` global variable and load the data with the `_trainDataPath` parameter. - - As the input and output of [`Transforms`](../basic-concepts-model-training-in-mldotnet.md#transformer), a `DataView` is the fundamental data pipeline type, comparable to `IEnumerable` for `LINQ`. - -In ML.NET, data is similar to a `SQL view`. It is lazily evaluated, schematized, and heterogenous. The object is the first part of the pipeline, and loads the data. For this tutorial, it loads a dataset with issue titles, descriptions, and corresponding area GitHub label. The `DataView` is used to create and train the model. - -Since your previously created `GitHubIssue` data model type matches the dataset schema, you can combine the initialization, mapping, and dataset loading into one line of code. - -Load the data using the `MLContext.Data.LoadFromTextFile` wrapper for the [LoadFromTextFile method](xref:Microsoft.ML.TextLoaderSaverCatalog.LoadFromTextFile%60%601%28Microsoft.ML.DataOperationsCatalog,System.String,System.Char,System.Boolean,System.Boolean,System.Boolean,System.Boolean%29). It returns a - which infers the dataset schema from the `GitHubIssue` data model type and uses the dataset header. - -You defined the data schema previously when you created the `GitHubIssue` class. For your schema: - -* the first column `ID` (GitHub Issue ID) -* the second column `Area` (the prediction for training) -* the third column `Title` (GitHub issue title) is the first [feature](../resources/glossary.md##feature) used for predicting the `Area` -* the fourth column `Description` is the second feature used for predicting the `Area` +ML.NET uses the [IDataView class](xref:Microsoft.ML.IDataView) as a flexible, efficient way of describing numeric or text tabular data. 
`IDataView` can load either text files or in real time (for example, SQL database or log files). To initialize and load the `_trainingDataView` global variable in order to use it for the pipeline, add the following code after the `mlContext` initialization: [!code-csharp[LoadTrainData](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#LoadTrainData)] +The [LoadFromTextFile()](xref:Microsoft.ML.TextLoaderSaverCatalog.LoadFromTextFile%60%601%28Microsoft.ML.DataOperationsCatalog,System.String,System.Char,System.Boolean,System.Boolean,System.Boolean,System.Boolean%29) defines the data schema and reads in the file. It takes in the data path variables and returns an `IDataView`. + Add the following as the next line of code in the `Main` method: [!code-csharp[CallProcessData](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CallProcessData)] @@ -226,24 +136,15 @@ public static IEstimator ProcessData() ## Extract Features and transform the data -Pre-processing and cleaning data are important tasks that occur before a dataset is used effectively for machine learning. Raw data is often noisy and unreliable, and may be missing values. Using data without these modeling tasks can produce misleading results. - -ML.NET's transform pipelines compose a custom `transforms`set that is applied to your data before training or testing. The transforms' primary purpose is data [featurization](../resources/glossary.md#feature-engineering). Machine learning algorithms understand [featurized](../resources/glossary.md#feature) data, so the next step is to transform our textual data into a format that our ML algorithms recognize. That format is a [numeric vector](../resources/glossary.md#numerical-feature-vector). - -In the next steps, we refer to the columns by the names defined in the `GitHubIssue` class. - -When the model is trained and evaluated, by default, the values in the **Label** column are considered as correct values to be predicted. 
As we want to predict the Area GitHub label for a `GitHubIssue`, copy the `Area` column into the **Label** column. To do that, use the `MLContext.Transforms.Conversion.MapValueToKey`, which is a wrapper for the transformation class. The `MapValueToKey` returns an that will effectively be a pipeline. Name this `pipeline` as you will then append the trainer to the `EstimatorChain`. Add the next line of code: +As you want to predict the Area GitHub label for a `GitHubIssue`, use the [MapValueToKey()](xref:Microsoft.ML.ConversionsExtensionsCatalog.MapValueToKey%2A) method to transform the `Area` column into a numeric key type `Label` column (a format accepted by classification algorithms) and add it as a new dataset column: [!code-csharp[MapValueToKey](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#MapValueToKey)] - Featurizing assigns different numeric key values to the different values in each of the columns and is used by the machine learning algorithm. Next, call `mlContext.Transforms.Text.FeaturizeText` which featurizes the text (`Title` and `Description`) columns into a numeric vector for each called `TitleFeaturized` and `DescriptionFeaturized`. Append the featurization for both columns to the pipeline with the following code: +Next, call `mlContext.Transforms.Text.FeaturizeText` which transforms the text (`Title` and `Description`) columns into a numeric vector for each called `TitleFeaturized` and `DescriptionFeaturized`. Append the featurization for both columns to the pipeline with the following code: [!code-csharp[FeaturizeText](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#FeaturizeText)] ->[!WARNING] -> ML.NET Version 0.10 has changed the order of the Transform parameters. This will not error out until you build. Use the parameter names for Transforms as illustrated in the previous code snippet. 
- -The last step in data preparation combines all of the feature columns into the **Features** column using the `Concatenate` transformation class. By default, a learning algorithm processes only features from the **Features** column. Append this transformation to the pipeline with the following code: +The last step in data preparation combines all of the feature columns into the **Features** column using the [Concatenate()](xref:Microsoft.ML.TransformExtensionsCatalog.Concatenate%2A) method. By default, a learning algorithm processes only features from the **Features** column. Append this transformation to the pipeline with the following code: [!code-csharp[Concatenate](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#Concatenate)] @@ -271,7 +172,6 @@ The `BuildAndTrainModel` method executes the following tasks: * Creates the training algorithm class. * Trains the model. * Predicts area based on training data. -* Saves the model to a `.zip` file. * Returns the model. Create the `BuildAndTrainModel` method, just after the `Main` method, using the following code: @@ -283,25 +183,38 @@ public static IEstimator BuildAndTrainModel(IDataView trainingData } ``` -Notice that two parameters are passed into the BuildAndTrainModel method; an `IDataView` for the training dataset (`trainingDataView`), and a for the processing pipeline created in ProcessData (`pipeline`). +Add the following code as the first line of the `BuildAndTrainModel` method: + +### Add a machine learning algorithm + +### About the classification task + +"Is it A or B?" 
- Add the following code as the first line of the `BuildAndTrainModel` method: +![classification machine learning algorithm](./media/sentiment-analysis/classification.png) -### Choose a learning algorithm +Classification is a machine learning task that uses data to **determine** the category, type, or class of an item or row of data and is frequently one of the following types: -To add the learning algorithm, call the `mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent` wrapper method which returns a object. The `SdcaMultiClassTrainer` is appended to the `pipeline` and accepts the featurized `Title` and `Description` (`Features`) and the `Label` input parameters to learn from the historic data. You also need to map the label to the value to return to its original readable state. Do both of those actions with the following code: +* Binary: either A or B. +* Multiclass: multiple categories that can be predicted by using a single model. + +For this type of problem, use a Multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary). + +Append the machine learning algorithm to the data transformation definitions by adding the following as the next line of code in `BuildAndTrainModel()`: [!code-csharp[AddTrainer](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#AddTrainer)] -### Train the model +The [SdcaMaximumEntropy](xref:Microsoft.ML.Trainers.SdcaMaximumEntropyMulticlassTrainer) is your multiclass classification training algorithm. This is appended to the `pipeline` and accepts the featurized `Title` and `Description` (`Features`) and the `Label` input parameters to learn from the historic data. -You train the model, , based on the dataset that has been loaded and transformed. Once the estimator has been defined, you train your model using the while providing the already loaded training data. 
This method returns a model to use for predictions. `trainingPipeline.Fit()` trains the pipeline and returns a `Transformer` based on the `DataView` passed in. The experiment is not executed until the `.Fit()` method runs. +### Train the model -Add the following code to the `BuildAndTrainModel` method: +Fit the model to the `trainingDataView` data and return the trained model by adding the following as the next line of code in the `BuildAndTrainModel()` method: [!code-csharp[TrainModel](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#TrainModel)] -While the `model` is a `transformer` that operates on many rows of data, a need for predictions on individual examples is a common production scenario. The `PredictionEngine` is a wrapper that is returned from the `CreatePredictionEngine` method. Let's add the following code to create the `PredictionEngine` as the next line in the `BuildAndTrainModel` Method: +The `Fit()` method trains your model by transforming the dataset and applying the training. + +The [PredictionEngine](xref:Microsoft.ML.PredictionEngine%602) is a convenience API that lets you pass in a single instance of data and then perform a prediction on it. Add this as the next line in the `BuildAndTrainModel()` method: [!code-csharp[CreatePredictionEngine1](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CreatePredictionEngine1)] @@ -311,7 +224,7 @@ Add a GitHub issue to test the trained model's prediction in the `Predict` metho [!code-csharp[CreateTestIssue1](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CreateTestIssue1)] -You can use that to predict the `Area` label of a single instance of the issue data. To get a prediction, use `Predict()` on the data. The input data is a string and the model includes the featurization. Your pipeline is in sync during training and prediction. 
You didn’t have to write preprocessing/featurization code specifically for predictions, and the same API takes care of both batch and one-time predictions. +Use the [Predict()](xref:Microsoft.ML.PredictionEngine%602.Predict%2A) function to make a prediction on a single row of data: [!code-csharp[Predict](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#Predict)] @@ -349,13 +262,13 @@ Add a call to the new method from the `Main` method, right under the `BuildAndTr [!code-csharp[CallEvaluate](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CallEvaluate)] -As you did previously with the training dataset, you can combine the initialization, mapping, and test dataset loading into one line of code. You can evaluate the model using this dataset as a quality check. Add the following code to the `Evaluate` method: +As you did previously with the training dataset, load the test dataset by adding the following code to the `Evaluate` method: [!code-csharp[LoadTestDataset](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#LoadTestDataset)] -The `MulticlassClassificationContext.Evaluate` is a wrapper for the method that computes the quality metrics for the model using the specified dataset. It returns a object that contains the overall metrics computed by multiclass classification evaluators. +The [Evaluate()](xref:Microsoft.ML.MulticlassClassificationCatalog.Evaluate%2A) method computes the quality metrics for the model using the specified dataset. It returns a `MulticlassClassificationMetrics` object that contains the overall metrics computed by multiclass classification evaluators. To display the metrics to determine the quality of the model, you need to get them first. -Notice the use of the `Transform` method of the machine learning `_trainedModel` global variable (a transformer) to input the features and return predictions. 
Add the following code to the `Evaluate` method as the next line: +Notice the use of the [Transform()](xref:Microsoft.ML.ITransformer.Transform) method of the machine learning `_trainedModel` global variable (an [ITransformer](xref:Microsoft.ML.ITransformer)) to input the features and return predictions. Add the following code to the `Evaluate` method as the next line: [!code-csharp[Evaluate](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#Evaluate)] @@ -375,38 +288,7 @@ Use the following code to display the metrics, share the results, and then act o [!code-csharp[DisplayMetrics](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#DisplayMetrics)] -### Save the trained and evaluated model - -At this point, you have a model of type that can be integrated into any of your existing or new .NET applications. To save your trained model to a .zip file, add the following code to call the `SaveModelAsFile` method as the next line in `BuildAndTrainModel`: - -[!code-csharp[CallSaveModel](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CallSaveModel)] - -## Save the model as a .zip file - -Create the `SaveModelAsFile` method, just after the `Evaluate` method, using the following code: - -```csharp -private static void SaveModelAsFile(MLContext mlContext, ITransformer model) -{ - -} -``` - -The `SaveModelAsFile` method executes the following tasks: - -* Saves the model as a .zip file. - -Next, create a method to save the model so that it can be reused and consumed in other applications. The `ITransformer` has a method that takes in the `_modelPath` global field, and a . To save the model as a zip file, you'll create the `FileStream` immediately before calling the `SaveTo` method. 
Add the following code to the `SaveModelAsFile` method as the next line: - -[!code-csharp[SaveModel](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#SaveModel)] - -You could also display where the file was written by writing a console message with the `_modelPath`, using the following code: - -```csharp -Console.WriteLine("The model is saved to {0}", _modelPath); -``` - -## Deploy and Predict with a loaded model +## Deploy and Predict with a model Add a call to the new method from the `Main` method, right under the `Evaluate` method call, using the following code: @@ -428,17 +310,15 @@ The `PredictIssue` method executes the following tasks: * Combines test data and predictions for reporting. * Displays the predicted results. -First, load the model that you saved previously with the following code: - -[!code-csharp[LoadModel](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#LoadModel)] - Add a GitHub issue to test the trained model's prediction in the `Predict` method by creating an instance of `GitHubIssue`: [!code-csharp[AddTestIssue](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#AddTestIssue)] +As you did previously, create a `PredictionEngine` instance with the following code: + [!code-csharp[CreatePredictionEngine](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CreatePredictionEngine)] -Now that you have a model, you can use that to predict the Area GitHub label of a single instance of the GitHub issue data. To get a prediction, use on the data. The input data is a string and the model includes the featurization. Your pipeline is in sync during training and prediction. You didn’t have to write preprocessing/featurization code specifically for predictions, and the same API takes care of both batch and one-time predictions. 
Add the following code to the `PredictIssue` method for the predictions: +Use the `PredictionEngine` to predict the Area GitHub label by adding the following code to the `PredictIssue` method for the prediction: [!code-csharp[PredictIssue](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#PredictIssue)] @@ -454,14 +334,13 @@ Your results should be similar to the following. As the pipeline processes, it d ```console =============== Single Prediction just-trained-model - Result: area-System.Net =============== -The model is saved to C:\Users\johalex\dotnet-samples\samples\machine-learning\tutorials\GitHubIssueClassification\bin\Debug\netcoreapp2.0\..\..\..\Models\model.zip ************************************************************************************************************* * Metrics for Multi-class Classification model - Test Data *------------------------------------------------------------------------------------------------------------ -* MicroAccuracy: 0.74 -* MacroAccuracy: 0.687 -* LogLoss: .932 -* LogLossReduction: 63.852 +* MicroAccuracy: 0.741 +* MacroAccuracy: 0.67 +* LogLoss: .916 +* LogLossReduction: .645 ************************************************************************************************************* =============== Single Prediction - Result: area-System.Data =============== ``` @@ -472,8 +351,6 @@ Congratulations! 
You've now successfully built a machine learning model for clas In this tutorial, you learned how to: > [!div class="checklist"] -> * Understand the problem -> * Select the appropriate machine learning algorithm > * Prepare your data > * Transform the data > * Train the model From 6aec730336a48237565713e6c91323ae2a9493e3 Mon Sep 17 00:00:00 2001 From: John Alexander Date: Mon, 29 Apr 2019 11:23:17 -0700 Subject: [PATCH 2/3] fixed xref --- .../tutorials/github-issue-classification.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/machine-learning/tutorials/github-issue-classification.md b/docs/machine-learning/tutorials/github-issue-classification.md index 8f3d95ad2a2cc..1f41a245c3bee 100644 --- a/docs/machine-learning/tutorials/github-issue-classification.md +++ b/docs/machine-learning/tutorials/github-issue-classification.md @@ -1,7 +1,7 @@ --- title: Use ML.NET in a GitHub issue multiclass classification scenario description: Discover how to use ML.NET in a multiclass classification scenario to classify GitHub issues to assign them to a given area. -ms.date: 04/26/2019 +ms.date: 04/29/2019 ms.topic: tutorial ms.custom: mvc #Customer intent: As a developer, I want to use ML.NET to apply a multiclass classification learning algorithm so that I can understand how to classify GitHGub issues to assign them to a given area. @@ -44,7 +44,7 @@ You can find the source code for this tutorial at the [dotnet/samples](https://g 4. Install the **Microsoft.ML NuGet Package**: - In Solution Explorer, right-click on your project and select **Manage NuGet Packages**. Choose "nuget.org" as the Package source, select the Browse tab, search for **Microsoft.ML**, select the **v 1.0.0** package in the list, and select the **Install** button. Select the **OK** button on the **Preview Changes** dialog and then select the **I Accept** button on the **License Acceptance** dialog if you agree with the license terms for the packages listed. 
+ In Solution Explorer, right-click on your project and select **Manage NuGet Packages**. Choose "nuget.org" as the Package source, select the Browse tab, search for **Microsoft.ML**, select the **v 1.0.0-preview** package in the list, and select the **Install** button. Select the **OK** button on the **Preview Changes** dialog and then select the **I Accept** button on the **License Acceptance** dialog if you agree with the license terms for the packages listed. ### Prepare your data @@ -268,7 +268,7 @@ As you did previously with the training dataset, load the test dataset by adding The [Evaluate()](xref:Microsoft.ML.MulticlassClassificationCatalog.Evaluate%2A) method computes the quality metrics for the model using the specified dataset. It returns a `MulticlassClassificationMetrics` object that contains the overall metrics computed by multiclass classification evaluators. To display the metrics to determine the quality of the model, you need to get them first. -Notice the use of the [Transform()](xref:Microsoft.ML.ITransformer.Transform) method of the machine learning `_trainedModel` global variable (an [ITransformer](xref:Microsoft.ML.ITransformer)) to input the features and return predictions. +Notice the use of the [Transform()](xref:Microsoft.ML.ITransformer.Transform%2A) method of the machine learning `_trainedModel` global variable (an [ITransformer](xref:Microsoft.ML.ITransformer)) to input the features and return predictions. 
Add the following code to the `Evaluate` method as the next line: [!code-csharp[Evaluate](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#Evaluate)] From c61ad42c7acd6d28b18a22b5a12fb5a76e2b7a7b Mon Sep 17 00:00:00 2001 From: John Alexander Date: Mon, 29 Apr 2019 15:23:55 -0700 Subject: [PATCH 3/3] Revised based on feedback --- .../tutorials/github-issue-classification.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/docs/machine-learning/tutorials/github-issue-classification.md b/docs/machine-learning/tutorials/github-issue-classification.md index 1f41a245c3bee..ae8641514426c 100644 --- a/docs/machine-learning/tutorials/github-issue-classification.md +++ b/docs/machine-learning/tutorials/github-issue-classification.md @@ -19,6 +19,12 @@ In this tutorial, you learn how to: > * Predict with the trained model > * Deploy and Predict with a loaded model +> [!NOTE] +> This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit [the ML.NET introduction](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet). + +This tutorial and related sample are currently using **ML.NET version 1.0.0-preview**. For more information, see the release notes at the [dotnet/machinelearning GitHub repo](https://github.com/dotnet/machinelearning/tree/master/docs/release-notes). + + You can find the source code for this tutorial at the [dotnet/samples](https://github.com/dotnet/samples/tree/master/machine-learning/tutorials/GitHubIssueClassification) repository. ## Prerequisites @@ -183,16 +189,8 @@ public static IEstimator BuildAndTrainModel(IDataView trainingData } ``` -Add the following code as the first line of the `BuildAndTrainModel` method: - -### Add a machine learning algorithm - ### About the classification task -"Is it A or B?" 
- -![classification machine learning algorithm](./media/sentiment-analysis/classification.png) - Classification is a machine learning task that uses data to **determine** the category, type, or class of an item or row of data and is frequently one of the following types: * Binary: either A or B. @@ -200,7 +198,7 @@ Classification is a machine learning task that uses data to **determine** the ca For this type of problem, use a Multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary). -Append the machine learning algorithm to the data transformation definitions by adding the following as the next line of code in `BuildAndTrainModel()`: +Append the machine learning algorithm to the data transformation definitions by adding the following as the first line of code in `BuildAndTrainModel()`: [!code-csharp[AddTrainer](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#AddTrainer)]
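
For orientation, the workflow that the revised tutorial text describes (load, map the label to a key, featurize the text columns, concatenate, train, predict) can be sketched roughly as below. This is a minimal sketch and not the contents of the `Program.cs` snippets the patch references. It assumes the Microsoft.ML 1.0.0 (or later) NuGet package and a local `issues_train.tsv` with a header row. The relative file path, the sample issue text, and the `PipelineSketch` class name are illustrative assumptions.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Input schema: column indices follow the order described in the tutorial text.
public class GitHubIssue
{
    [LoadColumn(0)] public string ID { get; set; }
    [LoadColumn(1)] public string Area { get; set; }
    [LoadColumn(2)] public string Title { get; set; }
    [LoadColumn(3)] public string Description { get; set; }
}

// Output schema: the predicted key is mapped back to the readable Area string.
public class IssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string Area;
}

public static class PipelineSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Assumed path: adjust to wherever the training file was downloaded.
        IDataView trainingData = mlContext.Data.LoadFromTextFile<GitHubIssue>(
            "issues_train.tsv", hasHeader: true);

        // Data preparation: Area -> key-typed Label, text -> numeric vectors,
        // then all feature vectors combined into the single Features column.
        var pipeline = mlContext.Transforms.Conversion
                .MapValueToKey(outputColumnName: "Label", inputColumnName: "Area")
            .Append(mlContext.Transforms.Text.FeaturizeText(
                outputColumnName: "TitleFeaturized", inputColumnName: "Title"))
            .Append(mlContext.Transforms.Text.FeaturizeText(
                outputColumnName: "DescriptionFeaturized", inputColumnName: "Description"))
            .Append(mlContext.Transforms.Concatenate(
                "Features", "TitleFeaturized", "DescriptionFeaturized"));

        // Trainer appended first in BuildAndTrainModel, followed by the
        // key-to-value mapping that restores the readable label.
        var trainingPipeline = pipeline
            .Append(mlContext.MulticlassClassification.Trainers
                .SdcaMaximumEntropy("Label", "Features"))
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        // Nothing executes until Fit() is called.
        ITransformer model = trainingPipeline.Fit(trainingData);

        // Single-row prediction through the convenience API.
        var engine = mlContext.Model
            .CreatePredictionEngine<GitHubIssue, IssuePrediction>(model);
        var prediction = engine.Predict(new GitHubIssue
        {
            Title = "Example issue title (placeholder)",
            Description = "Example issue description (placeholder)"
        });

        Console.WriteLine($"Predicted area: {prediction.Area}");
    }
}
```

Appending the trainer as the first statement of `BuildAndTrainModel()`, as the final hunk now states, keeps the data-preparation estimator returned by `ProcessData()` reusable on its own. The trainer and `MapKeyToValue` are layered on top of it rather than baked into it.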