diff --git a/docs/machine-learning/how-to-guides/normalizers-preprocess-data-ml-net.md b/docs/machine-learning/how-to-guides/normalizers-preprocess-data-ml-net.md deleted file mode 100644 index d8e467d9099ca..0000000000000 --- a/docs/machine-learning/how-to-guides/normalizers-preprocess-data-ml-net.md +++ /dev/null @@ -1,61 +0,0 @@ ---- -title: Preprocess training data with normalizers to use in data processing -description: Learn how to use normalizers to preprocess training data for use in machine learning model building, training, and scoring with ML.NET -ms.date: 03/05/2019 -ms.custom: mvc,how-to -#Customer intent: As a developer, I want to use normalizers to preprocess training data so that I can optimize it in machine learning model building, training, and scoring with ML.NET. ---- -# Preprocess training data with normalizers to use in data processing - -> [!NOTE] -> This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit [the ML.NET introduction](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet). - -This how-to and related sample are currently using **ML.NET version 0.10**. For more information, see the release notes at the [dotnet/machinelearning GitHub repo](https://github.com/dotnet/machinelearning/tree/master/docs/release-notes). - -ML.NET exposes a number of [parametric and non-parametric algorithms](https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/). - -It's **not** as important which normalizer you choose as it is to **use** a normalizer when training linear or other parametric models. - -Always include the normalizer directly in the ML.NET learning pipeline, so it: - -- is only trained on the training data, and not on your test data, -- is correctly applied to all the new incoming data, without the need for extra pre-processing at prediction time. 
- -Here's a snippet of code that demonstrates normalization in learning pipelines. It assumes the Iris dataset: - -```csharp -// Create a new context for ML.NET operations. It can be used for exception tracking and logging, -// as a catalog of available operations and as the source of randomness. -var mlContext = new MLContext(); - -// Define the reader: specify the data columns and where to find them in the text file. -var reader = mlContext.Data.CreateTextLoader( - columns: new TextLoader.Column[] - { - // The four features of the Iris dataset will be grouped together as one Features column. - new TextLoader.Column("Features", DataKind.R4, 0, 3), - // Label: kind of iris. - new TextLoader.Column("Label", DataKind.TX, 4) - }, - // Default separator is tab, but the dataset has comma. - separatorChar: ',', - // First line of the file is a header, not a data row. - hasHeader: true -); - -// Read the training data. -var trainData = reader.Read(dataPath); - -// Apply all kinds of standard ML.NET normalization to the raw features. -var pipeline = - mlContext.Transforms.Normalize( - new NormalizingEstimator.MinMaxColumn(inputColumnName: "Features", outputColumnName: "MinMaxNormalized", fixZero: true), - new NormalizingEstimator.MeanVarColumn(inputColumnName: "Features", outputColumnName: "MeanVarNormalized", fixZero: true), - new NormalizingEstimator.BinningColumn(inputColumnName: "Features", outputColumnName: "BinNormalized", numBins: 256)); - -// Let's train our pipeline of normalizers, and then apply it to the same data. -var normalizedData = pipeline.Fit(trainData).Transform(trainData); - -// Inspect one column of the resulting dataset. 
-var meanVarValues = normalizedData.GetColumn<float[]>(mlContext, "MeanVarNormalized").ToArray(); -``` diff --git a/docs/machine-learning/how-to-guides/serve-model-serverless-azure-functions-ml-net.md b/docs/machine-learning/how-to-guides/serve-model-serverless-azure-functions-ml-net.md index c04e330f73ba7..0f29f3157d63b 100644 --- a/docs/machine-learning/how-to-guides/serve-model-serverless-azure-functions-ml-net.md +++ b/docs/machine-learning/how-to-guides/serve-model-serverless-azure-functions-ml-net.md @@ -1,28 +1,28 @@ --- -title: Deploy ML.NET Model to Azure Functions +title: "How-To: Deploy ML.NET machine learning model to Azure Functions" description: Serve ML.NET sentiment analysis machine learning model for prediction over the internet using Azure Functions -ms.date: 03/08/2019 -ms.custom: mvc,how-to +ms.date: 04/29/2019 +author: luisquintanilla +ms.author: luquinta +ms.custom: mvc, how-to #Customer intent: As a developer, I want to use my ML.NET Machine Learning model to make predictions through the internet using Azure Functions --- -# How-To: Use ML.NET Model in Azure Functions +# How-To: Deploy ML.NET machine learning model to Azure Functions -This how-to shows how individual predictions can be made using a pre-built ML.NET machine learning model through the internet in a serverless environment such as Azure Functions. - -> [!NOTE] -> This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit [the ML.NET introduction](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet). - -This how-to and related sample are currently using **ML.NET version 0.10**. For more information, see the release notes at the [dotnet/machinelearning GitHub repo](https://github.com/dotnet/machinelearning/tree/master/docs/release-notes). +Learn how to deploy a pre-trained ML.NET machine learning model for predictions over HTTP through an Azure Functions serverless environment. 
## Prerequisites - [Visual Studio 2017 15.6 or later](https://visualstudio.microsoft.com/downloads/?utm_medium=microsoft&utm_source=docs.microsoft.com&utm_campaign=inline+link&utm_content=download+vs2017) with the ".NET Core cross-platform development" workload and "Azure development" installed. - [Azure Functions Tools](/azure/azure-functions/functions-develop-vs#check-your-tools-version) - PowerShell -- Pre-trained model. - - Use the [ML.NET Sentiment Analysis tutorial](../tutorials/sentiment-analysis.md) to build your own model. - - Download this [pre-trained sentiment analysis machine learning model](https://github.com/dotnet/samples/blob/master/machine-learning/models/sentimentanalysis/sentiment_model.zip) +- Pre-trained model + - Use the [ML.NET Sentiment Analysis tutorial](../tutorials/sentiment-analysis.md) to build your own model. + + or + + - Download this [pre-trained sentiment analysis machine learning model](https://github.com/dotnet/samples/blob/master/machine-learning/models/sentimentanalysis/sentiment_model.zip) ## Create Azure Functions Project @@ -53,23 +53,21 @@ Create a class to predict sentiment. Add a new class to your project: The *AnalyzeSentiment.cs* file opens in the code editor. 
Add the following `using` statements to the top of *AnalyzeSentiment.cs*: -```csharp -using System; -using System.IO; -using System.Threading.Tasks; -using Microsoft.AspNetCore.Mvc; -using Microsoft.Azure.WebJobs; -using Microsoft.Azure.WebJobs.Extensions.Http; -using Microsoft.AspNetCore.Http; -using Microsoft.Extensions.Logging; -using Newtonsoft.Json; -using Microsoft.ML; -using Microsoft.ML.Core.Data; -using Microsoft.ML.Data; -using MLNETServerless.DataModels; -``` + ```csharp + using System; + using System.IO; + using System.Threading.Tasks; + using Microsoft.AspNetCore.Mvc; + using Microsoft.Azure.WebJobs; + using Microsoft.Azure.WebJobs.Extensions.Http; + using Microsoft.AspNetCore.Http; + using Microsoft.Extensions.Logging; + using Newtonsoft.Json; + using Microsoft.ML; + using SentimentAnalysisFunctionsApp.DataModels; + ``` -### Create Data Models +## Create Data Models You need to create some classes for your input data and predictions. Add a new class to your project: @@ -82,8 +80,8 @@ You need to create some classes for your input data and predictions. Add a new c using Microsoft.ML.Data; ``` - Remove the existing class definition and add the following code to the SentimentData.cs file: - + Remove the existing class definition and add the following code to the *SentimentData.cs* file: + ```csharp public class SentimentData { @@ -104,42 +102,65 @@ You need to create some classes for your input data and predictions. Add a new c Remove the existing class definition and add the following code to the *SentimentPrediction.cs* file: ```csharp - public class SentimentPrediction + public class SentimentPrediction : SentimentData { [ColumnName("PredictedLabel")] public bool Prediction { get; set; } } ``` -### Add Prediction Logic +`SentimentPrediction` inherits from `SentimentData` which provides access to the original data in the `Text` property as well as the output generated by the model. 
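As an aside (this snippet is not part of the tutorial's files), the benefit of the inheritance is that one prediction object carries both the model's input and its output. A minimal, ML.NET-free sketch of the same shape follows; the real classes also carry the `ColumnName` attribute shown above, which is omitted here:

```csharp
using System;

// Sketch of the data model shapes described above (attributes omitted).
public class SentimentData
{
    public string Text { get; set; }
}

// Inheriting from SentimentData means the prediction exposes the input too.
public class SentimentPrediction : SentimentData
{
    public bool Prediction { get; set; }
}

public static class Demo
{
    public static void Main()
    {
        // One object: original input (inherited Text) plus the model's output.
        var prediction = new SentimentPrediction { Text = "This movie was great", Prediction = false };
        Console.WriteLine($"{prediction.Text} -> {(prediction.Prediction ? "Toxic" : "Not Toxic")}");
        // prints: This movie was great -> Not Toxic
    }
}
```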
+ +## Load the model into the function + +Insert the following code inside the *AnalyzeSentiment* class: + +```csharp +// Define MLContext +static MLContext _mlContext; + +// Define model +static ITransformer _model; + +// Define model's DataViewSchema +static DataViewSchema _modelSchema; + +// Define PredictionEngine +static PredictionEngine<SentimentData, SentimentPrediction> _predictionEngine; + +// AnalyzeSentiment class constructor +static AnalyzeSentiment() +{ + // Create MLContext + _mlContext = new MLContext(); + + // Load Model + _model = _mlContext.Model.Load("MLModels/sentiment_model.zip", out _modelSchema); + + // Create Prediction Engine + _predictionEngine = _mlContext.Model.CreatePredictionEngine<SentimentData, SentimentPrediction>(_model); +} +``` + +The constructor contains initialization logic for the [`MLContext`](xref:Microsoft.ML.MLContext), model, and [`PredictionEngine`](xref:Microsoft.ML.PredictionEngine%602) so that they can be shared throughout the lifecycle of the function instance. This approach reduces the need to load the model from disk each time the `Run` method executes. 
+ +## Use the model to make predictions Replace the existing implementation of the *Run* method in the *AnalyzeSentiment* class with the following code: ```csharp public static async Task<IActionResult> Run( - [HttpTrigger(AuthorizationLevel.Function,"post", Route = null)] HttpRequest req, - ILogger log) +[HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req, +ILogger log) { log.LogInformation("C# HTTP trigger function processed a request."); - //Create Context - MLContext mlContext = new MLContext(); - - //Load Model - using (var fs = File.OpenRead("MLModels/sentiment_model.zip")) - { - model = mlContext.Model.Load(fs); - } - //Parse HTTP Request Body string requestBody = await new StreamReader(req.Body).ReadToEndAsync(); SentimentData data = JsonConvert.DeserializeObject<SentimentData>(requestBody); - - //Create Prediction Engine - PredictionEngine<SentimentData, SentimentPrediction> predictionEngine = model.CreatePredictionEngine<SentimentData, SentimentPrediction>(mlContext); - + //Make Prediction - SentimentPrediction prediction = predictionEngine.Predict(data); + SentimentPrediction prediction = _predictionEngine.Predict(data); //Convert prediction to string string isToxic = Convert.ToBoolean(prediction.Prediction) ? "Toxic" : "Not Toxic"; @@ -149,6 +170,8 @@ public static async Task<IActionResult> Run( } ``` +When the `Run` method executes, the incoming data from the HTTP request is deserialized and used as input for the [`PredictionEngine`](xref:Microsoft.ML.PredictionEngine%602). The [`Predict`](xref:Microsoft.ML.PredictionEngineBase%602.Predict*) method is then called to generate a prediction and return the result to the user. + ## Test Locally Now that everything is set up, it's time to test the application: @@ -156,15 +179,15 @@ Now that everything is set up, it's time to test the application: 1. Run the application 1. Open PowerShell and enter the code into the prompt where PORT is the port your application is running on. Typically the port is 7071. 
-```powershell -Invoke-RestMethod "http://localhost:<PORT>/api/AnalyzeSentiment" -Method Post -Body (@{Text="This is a very rude movie"} | ConvertTo-Json) -ContentType "application/json" -``` - -If successful, the output should look similar to the text below: + ```powershell + Invoke-RestMethod "http://localhost:<PORT>/api/AnalyzeSentiment" -Method Post -Body (@{Text="This is a very rude movie"} | ConvertTo-Json) -ContentType "application/json" + ``` -```powershell -Toxic -``` + If successful, the output should look similar to the text below: + + ```powershell + Toxic + ``` Congratulations! You have successfully served your model to make predictions over the internet using an Azure Function. diff --git a/docs/machine-learning/how-to-guides/train-model-categorical-ml-net.md b/docs/machine-learning/how-to-guides/train-model-categorical-ml-net.md deleted file mode 100644 index f5ed8b75508e4..0000000000000 --- a/docs/machine-learning/how-to-guides/train-model-categorical-ml-net.md +++ /dev/null @@ -1,99 +0,0 @@ ---- -title: Apply feature engineering for model training on categorical data -description: Learn how to apply feature engineering for machine learning model training on categorical data with ML.NET -ms.date: 03/05/2019 -ms.custom: mvc,how-to -#Customer intent: As a developer, I want to apply feature engineering for my machine learning model training on categorical data with ML.NET so that I can use my model in the ML.NET processing pipeline. ---- - -# Apply feature engineering for model training on categorical data - -> [!NOTE] -> This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit [the ML.NET introduction](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet). - -This how-to and related sample are currently using **ML.NET version 0.10**. 
For more information, see the release notes at the [dotnet/machinelearning GitHub repo](https://github.com/dotnet/machinelearning/tree/master/docs/release-notes). - -You need to convert any non-float data to `float` data types since all ML.NET `learners` expect features as a `float vector`. - -If the dataset contains `categorical` data (for example, 'enum'), ML.NET offers several ways of converting it to features: - -- One-hot encoding -- Hash-based one-hot encoding -- Binary encoding (convert category index into a bit sequence and use bits as features) - -A `one-hot encoding` can be wasteful if some categories are very high-cardinality (lots of different values, with only a small set commonly occurring). In that case, reduce the number of slots to encode with count-based feature selection. - -Include categorical featurization directly in the ML.NET learning pipeline to ensure that the categorical transformation: - -- is only 'trained' on the training data, and not on your test data, -- is correctly applied to new incoming data, without extra pre-processing at prediction time. - -The following example illustrates categorical handling for the [adult census dataset](https://github.com/dotnet/machinelearning/blob/master/test/data/adult.tiny.with-schema.txt): - - -```console -Label Workclass education marital-status occupation relationship ethnicity sex native-country-region age fnlwgt education-num capital-gain capital-loss hours-per-week -0 Private 11th Never-married Machine-op-inspct Own-child Black Male United-States 25 226802 7 0 0 40 -0 Private HS-grad Married-civ-spouse Farming-fishing Husband White Male United-States 38 89814 9 0 0 50 -1 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White Male United-States 28 336951 12 0 0 40 -1 Private Some-college Married-civ-spouse Machine-op-inspct Husband Black Male United-States 44 160323 10 7688 0 40 -``` - - -```csharp -// Create a new context for ML.NET operations. 
It can be used for exception tracking and logging, -// as a catalog of available operations and as the source of randomness. -var mlContext = new MLContext(); - -// Define the reader: specify the data columns and where to find them in the text file. -var reader = mlContext.Data.CreateTextLoader(new[] - { - new TextLoader.Column("Label", DataKind.BL, 0), - // We will load all the categorical features into one vector column of size 8. - new TextLoader.Column("CategoricalFeatures", DataKind.TX, 1, 8), - // Similarly, load all numerical features into one vector of size 6. - new TextLoader.Column("NumericalFeatures", DataKind.R4, 9, 14), - // Let's also separately load the 'Workclass' column. - new TextLoader.Column("Workclass", DataKind.TX, 1), - }, - hasHeader: true -); - -// Read the data. -var data = reader.Read(dataPath); - -// Inspect the first 10 records of the categorical columns to check that they are correctly read. -var catColumns = data.GetColumn<string[]>(mlContext, "CategoricalFeatures").Take(10).ToArray(); - -// Build several alternative featurization pipelines. -var pipeline = - // Convert each categorical feature into one-hot encoding independently. - mlContext.Transforms.Categorical.OneHotEncoding("CategoricalFeatures", "CategoricalOneHot") - // Convert all categorical features into indices, and build a 'word bag' of these. - .Append(mlContext.Transforms.Categorical.OneHotEncoding("CategoricalFeatures", "CategoricalBag", OneHotEncodingTransformer.OutputKind.Bag)) - // One-hot encode the workclass column, then drop all the categories that have fewer than 10 instances in the train set. - .Append(mlContext.Transforms.Categorical.OneHotEncoding("Workclass", "WorkclassOneHot")) - .Append(mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnCount("WorkclassOneHot", "WorkclassOneHotTrimmed", count: 10)); - -// Let's train our pipeline, and then apply it to the same data. 
-var transformedData = pipeline.Fit(data).Transform(data); - -// Inspect some columns of the resulting dataset. -var categoricalBags = transformedData.GetColumn<float[]>(mlContext, "CategoricalBag").Take(10).ToArray(); -var workclasses = transformedData.GetColumn<float[]>(mlContext, "WorkclassOneHotTrimmed").Take(10).ToArray(); - -// Of course, if we want to train the model, we will need to compose a single float vector of all the features. -// Here's how we could do this: - -var fullLearningPipeline = pipeline - // Concatenate two of the three categorical pipelines, and the numeric features. - .Append(mlContext.Transforms.Concatenate("Features", "NumericalFeatures", "CategoricalBag", "WorkclassOneHotTrimmed")) - // Cache data in memory so that the following trainer will be able to access training examples without - // reading them from disk multiple times. - .AppendCacheCheckpoint(mlContext) - // Now we're ready to train. We chose our FastTree trainer for this classification task. - .Append(mlContext.BinaryClassification.Trainers.FastTree(numTrees: 50)); - -// Train the model. -var model = fullLearningPipeline.Fit(data); -``` diff --git a/docs/machine-learning/how-to-guides/train-model-textual-ml-net.md b/docs/machine-learning/how-to-guides/train-model-textual-ml-net.md deleted file mode 100644 index a0c6eaa9254c6..0000000000000 --- a/docs/machine-learning/how-to-guides/train-model-textual-ml-net.md +++ /dev/null @@ -1,88 +0,0 @@ ---- -title: Apply feature engineering for model training on textual data -description: Learn how to apply feature engineering for model training on textual data with ML.NET -ms.date: 03/05/2019 -ms.custom: mvc,how-to -#Customer intent: As a developer, I want to apply feature engineering for my model training on textual data with ML.NET so that I can use my model in the ML.NET processing pipeline. 
---- - -# Apply feature engineering for machine learning model training on textual data with ML.NET - -> [!NOTE] -> This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit [the ML.NET introduction](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet). - -This how-to and related sample are currently using **ML.NET version 0.10**. For more information, see the release notes at the [dotnet/machinelearning GitHub repo](https://github.com/dotnet/machinelearning/tree/master/docs/release-notes). - -You need to convert any non-float data to `float` data types since all ML.NET `learners` expect features as a `float vector`. - -To learn on textual data, you need to extract text features. ML.NET has some basic text feature extraction mechanisms: - -- `Text normalization` (removing punctuation, diacritics, switching to lowercase, etc.) -- `Separator-based tokenization`. -- `Stopword` removal. -- `Ngram` and `skip-gram` extraction. -- `TF-IDF` rescaling. -- `Bag of words` conversion. - -The following example demonstrates ML.NET text feature extraction mechanisms using the -[Wikipedia detox dataset](https://github.com/dotnet/machinelearning/blob/master/test/data/wikipedia-detox-250-line-data.tsv): - -```console -Sentiment SentimentText -1 Stop trolling, zapatancas, calling me a liar merely demonstartes that you arer Zapatancas. You may choose to chase every legitimate editor from this site and ignore me but I am an editor with a record that isnt 99% trolling and therefore my wishes are not to be completely ignored by a sockpuppet like yourself. The consensus is overwhelmingly against you and your trolling lover Zapatancas, -1 ::::: Why are you threatening me? I'm not being disruptive, its you who is being disruptive. -0 " *::Your POV and propaganda pushing is dully noted. However listing interesting facts in a netral and unacusitory tone is not POV. 
You seem to be confusing Censorship with POV monitoring. I see nothing POV expressed in the listing of intersting facts. If you want to contribute more facts or edit wording of the cited fact to make them sound more netral then go ahead. No need to CENSOR interesting factual information. " -0 ::::::::This is a gross exaggeration. Nobody is setting a kangaroo court. There was a simple addition concerning the airline. It is the only one disputed here. -``` - - -```csharp -// Define the reader: specify the data columns and where to find them in the text file. -var reader = mlContext.Data.CreateTextLoader(new[] - { - new TextLoader.Column("IsToxic", DataKind.BL, 0), - new TextLoader.Column("Message", DataKind.TX, 1), - }, - hasHeader: true -); - -// Read the data. -var data = reader.Read(dataPath); - -// Inspect the message texts that are read from the file. -var messageTexts = data.GetColumn<string>(mlContext, "Message").Take(20).ToArray(); - -// Apply various kinds of text operations supported by ML.NET. -var pipeline = - // One-stop shop to run the full text featurization. - mlContext.Transforms.Text.FeaturizeText("TextFeatures", "Message") - - // Normalize the message for later transforms - .Append(mlContext.Transforms.Text.NormalizeText("NormalizedMessage", "Message")) - - // NLP pipeline 1: bag of words. - .Append(new WordBagEstimator(mlContext, "BagOfWords", "NormalizedMessage")) - - // NLP pipeline 2: bag of bigrams, using hashes instead of dictionary indices. - .Append(new WordHashBagEstimator(mlContext, "BagOfBigrams", "NormalizedMessage", - ngramLength: 2, allLengths: false)) - - // NLP pipeline 3: bag of tri-character sequences with TF-IDF weighting. - .Append(mlContext.Transforms.Text.TokenizeCharacters("MessageChars", "Message")) - .Append(new NgramExtractingEstimator(mlContext, "BagOfTrichar", "MessageChars", - ngramLength: 3, weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf)) - - // NLP pipeline 4: word embeddings. 
- .Append(mlContext.Transforms.Text.TokenizeWords("TokenizedMessage", "NormalizedMessage")) - .Append(mlContext.Transforms.Text.ExtractWordEmbeddings("Embeddings", "TokenizedMessage", - WordEmbeddingsExtractingTransformer.PretrainedModelKind.GloVeTwitter25D)); - -// Let's train our pipeline, and then apply it to the same data. -// Note that even on a small dataset of 70KB the pipeline above can take up to a minute to completely train. -var transformedData = pipeline.Fit(data).Transform(data); - -// Inspect some columns of the resulting dataset. -var embeddings = transformedData.GetColumn<float[]>(mlContext, "Embeddings").Take(10).ToArray(); -var unigrams = transformedData.GetColumn<float[]>(mlContext, "BagOfWords").Take(10).ToArray(); -``` diff --git a/docs/machine-learning/index.yml b/docs/machine-learning/index.yml index 01bd52edc3a21..c1c9e197190d0 100644 --- a/docs/machine-learning/index.yml +++ b/docs/machine-learning/index.yml @@ -19,7 +19,7 @@ sections: - title: Get Started items: - type: paragraph - text: 'If you are new to machine learning, get an overview from What is machine learning? To understand how an ML.NET application is built, read How does ML.NET work? Or get started by adding the Microsoft.ML nuget package to your application.' + text: 'If you are new to machine learning, get an overview from What is machine learning? To understand how an ML.NET application is built, read How does ML.NET work? Or get started by adding the Microsoft.ML NuGet package to your application.' 
- title: Step-by-Step Tutorials items: - type: paragraph diff --git a/docs/machine-learning/tutorials/github-issue-classification.md b/docs/machine-learning/tutorials/github-issue-classification.md index 55a717cddc008..916f79484cc11 100644 --- a/docs/machine-learning/tutorials/github-issue-classification.md +++ b/docs/machine-learning/tutorials/github-issue-classification.md @@ -1,19 +1,17 @@ --- title: Classify GitHub issues - multiclass classification description: Discover how to use ML.NET in a multiclass classification scenario to classify GitHub issues to assign them to a given area. -ms.date: 03/12/2019 +ms.date: 04/29/2019 ms.topic: tutorial ms.custom: mvc #Customer intent: As a developer, I want to use ML.NET to apply a multiclass classification learning algorithm so that I can understand how to classify GitHGub issues to assign them to a given area. --- # Tutorial: Use ML.NET in a multiclass classification scenario to classify GitHub issues -This sample tutorial illustrates using ML.NET to create a GitHub issue classifier via a .NET Core console application using C# in Visual Studio 2017. +This sample tutorial illustrates using ML.NET to create a GitHub issue classifier to train a model that classifies and predicts the Area label for a GitHub issue via a .NET Core console application using C# in Visual Studio 2019. In this tutorial, you learn how to: > [!div class="checklist"] -> * Understand the problem -> * Select the appropriate machine learning algorithm > * Prepare your data > * Transform the data > * Train the model @@ -24,11 +22,8 @@ In this tutorial, you learn how to: > [!NOTE] > This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit [the ML.NET introduction](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet). -This tutorial and related sample are currently using **ML.NET version 0.11**. 
For more information, see the release notes at the [dotnet/machinelearning github repo](https://github.com/dotnet/machinelearning/tree/master/docs/release-notes). +This tutorial and related sample are currently using **ML.NET version 1.0.0-preview**. For more information, see the release notes at the [dotnet/machinelearning GitHub repo](https://github.com/dotnet/machinelearning/tree/master/docs/release-notes). -## GitHub issue sample overview - -The sample is a console app that uses ML.NET to train a model that classifies and predicts the Area label for a GitHub issue. It also evaluates the model with a second dataset for quality analysis. The issue datasets are from the dotnet/corefx GitHub repo. You can find the source code for this tutorial at the [dotnet/samples](https://github.com/dotnet/samples/tree/master/machine-learning/tutorials/GitHubIssueClassification) repository. @@ -39,74 +34,6 @@ You can find the source code for this tutorial at the [dotnet/samples](https://g * The [GitHub issues tab separated file (issues_train.tsv)](https://raw.githubusercontent.com/dotnet/samples/master/machine-learning/tutorials/GitHubIssueClassification/Data/issues_train.tsv). * The [GitHub issues test tab separated file (issues_test.tsv)](https://raw.githubusercontent.com/dotnet/samples/master/machine-learning/tutorials/GitHubIssueClassification/Data/issues_test.tsv). -## Machine learning workflow - -This tutorial follows a machine learning workflow that enables the process to move in an orderly fashion. - -The workflow phases are as follows: - -1. **Understand the problem** -2. **Prepare your data** - * **Load the data** - * **Extract features (Transform your data)** -3. **Build and train** - * **Train the model** - * **Evaluate the model** -4. **Deploy Model** - * **Use the Model to predict** - -### Understand the problem - -You first need to understand the problem, so you can break it down to parts that can support building and training the model. 
Breaking down the problem allows you to predict and evaluate the results. - -The problem for this tutorial is to understand what area incoming GitHub issues belong to in order to label them correctly for prioritization and scheduling. - -You can break down the problem to the following parts: - -* the issue title text -* the issue description text -* an area value for the model training data -* a predicted area value that you can evaluate and then use operationally - -You then need to **determine** the area, which helps you with the machine learning task selection. - -## Select the appropriate machine learning algorithm - -With this problem, you know the following facts: - -Training data: - -GitHub issues can be labeled in several areas (**Area**) as in the following examples: - -* area-System.Numerics -* area-System.Xml -* area-Infrastructure -* area-System.Linq -* area-System.IO - -Predict the **Area** of a new GitHub Issue such as in the following examples: - -* Contract.Assert vs Debug.Assert -* Make fields readonly in System.Xml - -The classification machine learning algorithm is best suited for this scenario. - -### About the classification learning algorithm - -Classification is a machine learning algorithm that uses data to **determine** the category, type, or class of an item or row of data. For example, you can use classification to: - -* Identify sentiment as positive or negative. -* Classify email as spam, junk, or good. -* Determine whether a patient's lab sample is cancerous. -* Categorize customers by their propensity to respond to a sales campaign. - -Classification learning algorithm use cases are frequently one of the following types: - -* Binary: either A or B. -* Multiclass: multiple categories that can be predicted by using a single model. - -For this type of problem, use a Multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary). 
- ## Create a console application ### Create a project @@ -123,7 +50,7 @@ For this type of problem, use a Multiclass classification learning algorithm, si 4. Install the **Microsoft.ML NuGet Package**: - In Solution Explorer, right-click on your project and select **Manage NuGet Packages**. Choose "nuget.org" as the Package source, select the Browse tab, search for **Microsoft.ML**, select that package in the list, and select the **Install** button. Select the **OK** button on the **Preview Changes** dialog and then select the **I Accept** button on the **License Acceptance** dialog if you agree with the license terms for the packages listed. + In Solution Explorer, right-click on your project and select **Manage NuGet Packages**. Choose "nuget.org" as the Package source, select the Browse tab, search for **Microsoft.ML**, select the **v 1.0.0-preview** package in the list, and select the **Install** button. Select the **OK** button on the **Preview Changes** dialog and then select the **I Accept** button on the **License Acceptance** dialog if you agree with the license terms for the packages listed. ### Prepare your data @@ -137,15 +64,14 @@ Add the following additional `using` statements to the top of the *Program.cs* f [!code-csharp[AddUsings](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#AddUsings)] -Create three global fields to hold the paths to the recently downloaded files, and global variables for the `MLContext`,`DataView`, `PredictionEngine`, and `TextLoader`: +Create three global fields to hold the paths to the recently downloaded files, and global variables for the `MLContext`, `DataView`, and `PredictionEngine`: * `_trainDataPath` has the path to the dataset used to train the model. * `_testDataPath` has the path to the dataset used to evaluate the model. * `_modelPath` has the path where the trained model is saved. * `_mlContext` is the <xref:Microsoft.ML.MLContext> that provides processing context. 
-* `_trainingDataView` is the used to process the training dataset.
+* `_trainingDataView` is the <xref:Microsoft.ML.IDataView> used to process the training dataset.
 * `_predEngine` is the <xref:Microsoft.ML.PredictionEngine%602> used for single predictions.
-* `_reader` is the used to load and transform the datasets.
 
 Add the following code to the line right above the `Main` method to specify those paths and the other variables:
 
@@ -165,16 +91,20 @@ Remove the existing class definition and add the following code, which has two c
 
 [!code-csharp[DeclareGlobalVariables](~/samples/machine-learning/tutorials/GitHubIssueClassification/GitHubIssueData.cs#DeclareTypes)]
 
+The `Label` is the column you want to predict. The identified `Features` are the inputs you give the model to predict the `Label`.
+
+Use the [LoadColumnAttribute](xref:Microsoft.ML.Data.LoadColumnAttribute) to specify the indices of the source columns in the data set.
+
 `GitHubIssue` is the input dataset class and has the following fields:
 
-* `ID` contains a value for the GitHub issue ID
-* `Area` contains a value for the `Area` label
-* `Title` contains the GitHub issue title
-* `Description` contains the GitHub issue description
+* the first column `ID` (GitHub Issue ID)
+* the second column `Area` (the prediction for training)
+* the third column `Title` (GitHub issue title) is the first `feature` used for predicting the `Area`
+* the fourth column `Description` is the second `feature` used for predicting the `Area`
 
-`IssuePrediction` is the class used for prediction after the model has been trained. It has a single `string` (`Area`) and a `PredictedLabel` `ColumnName` attribute. The `Label` is used to create and train the model, and it's also used with a second dataset to evaluate the model. The `PredictedLabel` is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
+`IssuePrediction` is the class used for prediction after the model has been trained.
It has a single `string` (`Area`) and a `PredictedLabel` `ColumnName` attribute. The `PredictedLabel` is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
 
-When building a model with ML.NET, you start by creating an . `MLContext` is comparable conceptually to using `DbContext` in Entity Framework. The environment provides a context for your ML job that can be used for exception tracking and logging.
+All ML.NET operations start in the [MLContext](xref:Microsoft.ML.MLContext) class. Initializing `mlContext` creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to `DbContext` in Entity Framework.
 
 ### Initialize variables in Main
 
@@ -184,28 +114,14 @@ Initialize the `_mlContext` global variable with a new instance of `MLContext`
 
 ## Load the data
 
-Next, initialize the `_trainingDataView` global variable and load the data with the `_trainDataPath` parameter.
-
- As the input and output of [`Transforms`](../resources/glossary.md#transformer), a `DataView` is the fundamental data pipeline type, comparable to `IEnumerable` for `LINQ`.
-
-In ML.NET, data is similar to a `SQL view`. It is lazily evaluated, schematized, and heterogenous. The object is the first part of the pipeline, and loads the data. For this tutorial, it loads a dataset with issue titles, descriptions, and corresponding area GitHub label. The `DataView` is used to create and train the model.
-
-Since your previously created `GitHubIssue` data model type matches the dataset schema, you can combine the initialization, mapping, and dataset loading into one line of code.
-
-Load the data using the `MLContext.Data.LoadFromTextFile` wrapper for the [LoadFromTextFile method](xref:Microsoft.ML.TextLoaderSaverCatalog.LoadFromTextFile%60%601%28Microsoft.ML.DataOperationsCatalog,System.String,System.Char,System.Boolean,System.Boolean,System.Boolean,System.Boolean%29).
It returns a which infers the dataset schema from the `GitHubIssue` data model type and uses the dataset header.
-
-You defined the data schema previously when you created the `GitHubIssue` class. For your schema:
-
-* the first column `ID` (GitHub Issue ID)
-* the second column `Area` (the prediction for training)
-* the third column `Title` (GitHub issue title) is the first [feature](../resources/glossary.md#feature) used for predicting the `Area`
-* the fourth column `Description` is the second feature used for predicting the `Area`
+ML.NET uses the [IDataView class](xref:Microsoft.ML.IDataView) as a flexible, efficient way of describing numeric or text tabular data. `IDataView` can load data either from text files or in real time (for example, from a SQL database or log files).
 
 To initialize and load the `_trainingDataView` global variable in order to use it for the pipeline, add the following code after the `mlContext` initialization:
 
 [!code-csharp[LoadTrainData](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#LoadTrainData)]
 
+The [LoadFromTextFile()](xref:Microsoft.ML.TextLoaderSaverCatalog.LoadFromTextFile%60%601%28Microsoft.ML.DataOperationsCatalog,System.String,System.Char,System.Boolean,System.Boolean,System.Boolean,System.Boolean%29) method defines the data schema and reads in the file. It takes in the data path variables and returns an `IDataView`.
+
 Add the following as the next line of code in the `Main` method:
 
 [!code-csharp[CallProcessData](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CallProcessData)]
 
@@ -226,24 +142,15 @@ public static IEstimator ProcessData()
 
 ## Extract Features and transform the data
 
-Pre-processing and cleaning data are important tasks that occur before a dataset is used effectively for machine learning. Raw data is often noisy and unreliable, and may be missing values. Using data without these modeling tasks can produce misleading results.
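As a point of reference, the data model classes and one-line load described above can be sketched as follows. This is a sketch, not the tutorial's actual code (which lives in the referenced sample files): the exact member layout is an assumption based on this tutorial's column descriptions and the `LoadColumnAttribute` and `PredictedLabel` attributes it mentions.

```csharp
using Microsoft.ML.Data;

// Input row type: LoadColumn indices map to the four tab-separated dataset columns.
public class GitHubIssue
{
    [LoadColumn(0)] public string ID;
    [LoadColumn(1)] public string Area;        // the value to predict (becomes the Label)
    [LoadColumn(2)] public string Title;       // first feature
    [LoadColumn(3)] public string Description; // second feature
}

// Output row type: holds the predicted Area for a scored issue.
public class IssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string Area;
}
```

With these types in place, the schema is inferred from `GitHubIssue`, so loading reduces to a single call along the lines of `_trainingDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_trainDataPath, hasHeader: true);`.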
- -ML.NET's transform pipelines compose a custom `transforms`set that is applied to your data before training or testing. The transforms' primary purpose is data [featurization](../resources/glossary.md#feature-engineering). Machine learning algorithms understand [featurized](../resources/glossary.md#feature) data, so the next step is to transform our textual data into a format that our ML algorithms recognize. That format is a [numeric vector](../resources/glossary.md#numerical-feature-vector). - -In the next steps, we refer to the columns by the names defined in the `GitHubIssue` class. - -When the model is trained and evaluated, by default, the values in the **Label** column are considered as correct values to be predicted. As we want to predict the Area GitHub label for a `GitHubIssue`, copy the `Area` column into the **Label** column. To do that, use the `MLContext.Transforms.Conversion.MapValueToKey`, which is a wrapper for the transformation class. The `MapValueToKey` returns an that will effectively be a pipeline. Name this `pipeline` as you will then append the trainer to the `EstimatorChain`. Add the next line of code: +As you want to predict the Area GitHub label for a `GitHubIssue`, use the [MapValueToKey()](xref:Microsoft.ML.ConversionsExtensionsCatalog.MapValueToKey%2A) method to transform the `Area` column into a numeric key type `Label` column (a format accepted by classification algorithms) and add it as a new dataset column: [!code-csharp[MapValueToKey](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#MapValueToKey)] - Featurizing assigns different numeric key values to the different values in each of the columns and is used by the machine learning algorithm. Next, call `mlContext.Transforms.Text.FeaturizeText` which featurizes the text (`Title` and `Description`) columns into a numeric vector for each called `TitleFeaturized` and `DescriptionFeaturized`. 
Append the featurization for both columns to the pipeline with the following code: +Next, call `mlContext.Transforms.Text.FeaturizeText` which transforms the text (`Title` and `Description`) columns into a numeric vector for each called `TitleFeaturized` and `DescriptionFeaturized`. Append the featurization for both columns to the pipeline with the following code: [!code-csharp[FeaturizeText](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#FeaturizeText)] ->[!WARNING] -> ML.NET Version 0.10 has changed the order of the Transform parameters. This will not error out until you build. Use the parameter names for Transforms as illustrated in the previous code snippet. - -The last step in data preparation combines all of the feature columns into the **Features** column using the `Concatenate` transformation class. By default, a learning algorithm processes only features from the **Features** column. Append this transformation to the pipeline with the following code: +The last step in data preparation combines all of the feature columns into the **Features** column using the [Concatenate()](xref:Microsoft.ML.TransformExtensionsCatalog.Concatenate%2A) method. By default, a learning algorithm processes only features from the **Features** column. Append this transformation to the pipeline with the following code: [!code-csharp[Concatenate](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#Concatenate)] @@ -271,7 +178,6 @@ The `BuildAndTrainModel` method executes the following tasks: * Creates the training algorithm class. * Trains the model. * Predicts area based on training data. -* Saves the model to a `.zip` file. * Returns the model. 
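Taken together, the three transformations described above chain into a single estimator. A sketch of what the `ProcessData` pipeline might look like, assuming the ML.NET 1.0 API surface and the column names used in this tutorial (the authoritative version is in the referenced sample file):

```csharp
// MapValueToKey converts the Area text into a key-type Label column,
// FeaturizeText turns each text column into a numeric feature vector,
// and Concatenate merges both vectors into the single Features column.
var pipeline = _mlContext.Transforms.Conversion
        .MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
    .Append(_mlContext.Transforms.Text.FeaturizeText(
        inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
    .Append(_mlContext.Transforms.Text.FeaturizeText(
        inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
    .Append(_mlContext.Transforms.Concatenate(
        "Features", "TitleFeaturized", "DescriptionFeaturized"));
```

Because each `Append` returns a new estimator chain, the trainer can later be appended to this same `pipeline` object.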
Create the `BuildAndTrainModel` method, just after the `Main` method, using the following code: @@ -283,25 +189,30 @@ public static IEstimator BuildAndTrainModel(IDataView trainingData } ``` -Notice that two parameters are passed into the BuildAndTrainModel method; an `IDataView` for the training dataset (`trainingDataView`), and a for the processing pipeline created in ProcessData (`pipeline`). +### About the classification task - Add the following code as the first line of the `BuildAndTrainModel` method: +Classification is a machine learning task that uses data to **determine** the category, type, or class of an item or row of data and is frequently one of the following types: -### Choose a learning algorithm +* Binary: either A or B. +* Multiclass: multiple categories that can be predicted by using a single model. -To add the learning algorithm, call the `mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent` wrapper method which returns a object. The `SdcaMultiClassTrainer` is appended to the `pipeline` and accepts the featurized `Title` and `Description` (`Features`) and the `Label` input parameters to learn from the historic data. You also need to map the label to the value to return to its original readable state. Do both of those actions with the following code: +For this type of problem, use a Multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary). + +Append the machine learning algorithm to the data transformation definitions by adding the following as the first line of code in `BuildAndTrainModel()`: [!code-csharp[AddTrainer](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#AddTrainer)] -### Train the model +The [SdcaMaximumEntropy](xref:Microsoft.ML.Trainers.SdcaMaximumEntropyMulticlassTrainer) is your multiclass classification training algorithm. 
This is appended to the `pipeline` and accepts the featurized `Title` and `Description` (`Features`) and the `Label` input parameters to learn from the historic data.
 
-You train the model, , based on the dataset that has been loaded and transformed. Once the estimator has been defined, you train your model using the while providing the already loaded training data. This method returns a model to use for predictions. `trainingPipeline.Fit()` trains the pipeline and returns a `Transformer` based on the `DataView` passed in. The experiment is not executed until the `.Fit()` method runs.
+### Train the model
 
-Add the following code to the `BuildAndTrainModel` method:
+Fit the model to the `splitTrainSet` data and return the trained model by adding the following as the next line of code in the `BuildAndTrainModel()` method:
 
 [!code-csharp[TrainModel](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#TrainModel)]
 
-While the `model` is a `transformer` that operates on many rows of data, a need for predictions on individual examples is a common production scenario. The is a wrapper that is returned from the `CreatePredictionEngine` method. Let's add the following code to create the `PredictionEngine` as the next line in the `BuildAndTrainModel` Method:
+The `Fit()` method trains your model by transforming the dataset and applying the training.
+
+The [PredictionEngine](xref:Microsoft.ML.PredictionEngine%602) is a convenience API, which allows you to pass in and then perform a prediction on a single instance of data.
Add this as the next line in the `BuildAndTrainModel()` method:
 
 [!code-csharp[CreatePredictionEngine1](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CreatePredictionEngine1)]
 
@@ -311,7 +222,7 @@ Add a GitHub issue to test the trained model's prediction in the `Predict` metho
 
 [!code-csharp[CreateTestIssue1](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CreateTestIssue1)]
 
-You can use that to predict the `Area` label of a single instance of the issue data. To get a prediction, use on the data. The input data is a string and the model includes the featurization. Your pipeline is in sync during training and prediction. You didn’t have to write preprocessing/featurization code specifically for predictions, and the same API takes care of both batch and one-time predictions.
+The [Predict()](xref:Microsoft.ML.PredictionEngine%602.Predict%2A) function makes a prediction on a single row of data:
 
 [!code-csharp[Predict](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#Predict)]
 
@@ -349,13 +260,13 @@ Add a call to the new method from the `Main` method, right under the `BuildAndTr
 
 [!code-csharp[CallEvaluate](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CallEvaluate)]
 
-As you did previously with the training dataset, you can combine the initialization, mapping, and test dataset loading into one line of code. You can evaluate the model using this dataset as a quality check. Add the following code to the `Evaluate` method:
+As you did previously with the training dataset, load the test dataset by adding the following code to the `Evaluate` method:
 
 [!code-csharp[LoadTestDataset](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#LoadTestDataset)]
 
-The `MulticlassClassificationContext.Evaluate` is a wrapper for the method that computes the quality metrics for the model using the specified dataset.
It returns a object that contains the overall metrics computed by multiclass classification evaluators.
+The [Evaluate()](xref:Microsoft.ML.MulticlassClassificationCatalog.Evaluate%2A) method computes the quality metrics for the model using the specified dataset. It returns an object that contains the overall metrics computed by multiclass classification evaluators. To display the metrics to determine the quality of the model, you need to get them first.
 
-Notice the use of the `Transform` method of the machine learning `_trainedModel` global variable (a transformer) to input the features and return predictions. Add the following code to the `Evaluate` method as the next line:
+Notice the use of the [Transform()](xref:Microsoft.ML.ITransformer.Transform%2A) method of the machine learning `_trainedModel` global variable (an [ITransformer](xref:Microsoft.ML.ITransformer)) to input the features and return predictions. Add the following code to the `Evaluate` method as the next line:
 
 [!code-csharp[Evaluate](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#Evaluate)]
 
@@ -375,38 +286,7 @@ Use the following code to display the metrics, share the results, and then act o
 
 [!code-csharp[DisplayMetrics](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#DisplayMetrics)]
 
-### Save the trained and evaluated model
-
-At this point, you have a model of type that can be integrated into any of your existing or new .NET applications.
To save your trained model to a .zip file, add the following code to call the `SaveModelAsFile` method as the next line in `BuildAndTrainModel`: - -[!code-csharp[CallSaveModel](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CallSaveModel)] - -## Save the model as a .zip file - -Create the `SaveModelAsFile` method, just after the `Evaluate` method, using the following code: - -```csharp -private static void SaveModelAsFile(MLContext mlContext, ITransformer model) -{ - -} -``` - -The `SaveModelAsFile` method executes the following tasks: - -* Saves the model as a .zip file. - -Next, create a method to save the model so that it can be reused and consumed in other applications. The `ITransformer` has a method that takes in the `_modelPath` global field, and a . To save the model as a zip file, you'll create the `FileStream` immediately before calling the `SaveTo` method. Add the following code to the `SaveModelAsFile` method as the next line: - -[!code-csharp[SaveModel](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#SaveModel)] - -You could also display where the file was written by writing a console message with the `_modelPath`, using the following code: - -```csharp -Console.WriteLine("The model is saved to {0}", _modelPath); -``` - -## Deploy and Predict with a loaded model +## Deploy and Predict with a model Add a call to the new method from the `Main` method, right under the `Evaluate` method call, using the following code: @@ -428,17 +308,15 @@ The `PredictIssue` method executes the following tasks: * Combines test data and predictions for reporting. * Displays the predicted results. 
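A sketch of how those `PredictIssue` tasks fit together follows. The issue text here is made up for illustration, and the real values live in the sample's `AddTestIssue` and `CreatePredictionEngine` snippets; the sketch assumes the `GitHubIssue`/`IssuePrediction` shapes and the `_mlContext` and `_trainedModel` globals described earlier in this tutorial.

```csharp
// Create a hypothetical issue to classify.
GitHubIssue singleIssue = new GitHubIssue
{
    Title = "Entity Framework crashes",
    Description = "When connecting to the database, EF is crashing"
};

// PredictionEngine wraps the trained model for one-off, single-row predictions.
_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(_trainedModel);

// Predict() runs the full featurization pipeline plus the trainer on this one row.
var prediction = _predEngine.Predict(singleIssue);
Console.WriteLine($"=============== Single Prediction - Result: {prediction.Area} ===============");
```

Because the featurization transforms are part of the saved pipeline, no separate preprocessing code is needed at prediction time.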
-First, load the model that you saved previously with the following code: - -[!code-csharp[LoadModel](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#LoadModel)] - Add a GitHub issue to test the trained model's prediction in the `Predict` method by creating an instance of `GitHubIssue`: [!code-csharp[AddTestIssue](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#AddTestIssue)] +As you did previously, create a `PredictionEngine` instance with the following code: + [!code-csharp[CreatePredictionEngine](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#CreatePredictionEngine)] -Now that you have a model, you can use that to predict the Area GitHub label of a single instance of the GitHub issue data. To get a prediction, use on the data. The input data is a string and the model includes the featurization. Your pipeline is in sync during training and prediction. You didn’t have to write preprocessing/featurization code specifically for predictions, and the same API takes care of both batch and one-time predictions. Add the following code to the `PredictIssue` method for the predictions: +Use the `PredictionEngine` to predict the Area GitHub label by adding the following code to the `PredictIssue` method for the prediction: [!code-csharp[PredictIssue](~/samples/machine-learning/tutorials/GitHubIssueClassification/Program.cs#PredictIssue)] @@ -454,14 +332,13 @@ Your results should be similar to the following. 
As the pipeline processes, it displays messages.
 
```console
=============== Single Prediction just-trained-model - Result: area-System.Net ===============
-The model is saved to C:\Users\johalex\dotnet-samples\samples\machine-learning\tutorials\GitHubIssueClassification\bin\Debug\netcoreapp2.0\..\..\..\Models\model.zip
*************************************************************************************************************
*       Metrics for Multi-class Classification model - Test Data
*------------------------------------------------------------------------------------------------------------
-* MicroAccuracy: 0.74
-* MacroAccuracy: 0.687
-* LogLoss: .932
-* LogLossReduction: 63.852
+* MicroAccuracy: 0.741
+* MacroAccuracy: 0.67
+* LogLoss: .916
+* LogLossReduction: .645
*************************************************************************************************************
=============== Single Prediction - Result: area-System.Data ===============
```
 
@@ -472,8 +349,6 @@ Congratulations! You've now successfully built a machine learning model for clas
 In this tutorial, you learned how to:
 
 > [!div class="checklist"]
-> * Understand the problem
-> * Select the appropriate machine learning algorithm
 > * Prepare your data
 > * Transform the data
 > * Train the model
diff --git a/docs/standard/native-interop/customize-struct-marshaling.md b/docs/standard/native-interop/customize-struct-marshaling.md
index d00cda03c49d0..bcd08a0e2ecfd 100644
--- a/docs/standard/native-interop/customize-struct-marshaling.md
+++ b/docs/standard/native-interop/customize-struct-marshaling.md
@@ -21,7 +21,7 @@ Sometimes the default marshaling rules for structures aren't exactly what you ne
 
 **✔️ DO** only use `LayoutKind.Explicit` in marshaling when your native struct also has an explicit layout, such as a union.
 
-**❌ AVOID** using `LayoutKind.Explicit` when marshaling structures on non-Windows platforms.
The .NET Core runtime doesn't support passing explicit structures by value to native functions on Intel or AMD 64-bit non-Windows systems. However, the runtime supports passing explicit structures by reference on all platforms. +**❌ AVOID** using `LayoutKind.Explicit` when marshaling structures on non-Windows platforms if you need to target runtimes before .NET Core 3.0. The .NET Core runtime before 3.0 Preview 4 doesn't support passing explicit structures by value to native functions on Intel or AMD 64-bit non-Windows systems. However, the runtime supports passing explicit structures by reference on all platforms. ## Customizing boolean field marshaling