The ML.Net library suffers from a lack of decoupling between data preparation and model training, which is needed to do an efficient grid search over training parameters.
That is, ideally the API should be structured in such a way that it is possible to do the following:
- Prepare the data set once, so that it can be re-used multiple times. As much as possible, any pre-training calculations should be done up front (or cached for re-use). For large data sets, the overhead of repeating this step on every run is significant, taking as long as or longer than the training itself.
- For algorithms with multiple training iterations, it should be straightforward to retain the intermediate trained models at each iteration (or at a specified set of iterations). It is then easy to compute metrics for the intermediate models on training and validation data sets, and ultimately to select one of them for production use without re-running the training (a sketch of what this could look like follows this list).
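To make the desired workflow concrete, here is a rough sketch from the caller's side. This is purely hypothetical: `PreparedDataset`, `GridTrainer`, `Snapshot`, `TrainingArgs`, `ValidationData` and `Evaluate` are invented names, not existing ML.Net types.

```csharp
// Hypothetical sketch only: none of these types exist in ML.Net today.
static void GridSearch(
    PreparedDataset prepared,                 // data prepared once, up front
    IEnumerable<TrainingArgs> parameterGrid,  // points to search over
    GridTrainer trainer,
    ValidationData validationData)
{
    foreach (var trainingArgs in parameterGrid)
    {
        // Re-use the same prepared data for every point in the grid.
        var snapshots = trainer.Train(prepared, trainingArgs,
            snapshotIterations: new[] { 50, 100, 200 });

        foreach (var snapshot in snapshots)
        {
            // Score each intermediate model without re-training.
            var metrics = Evaluate(snapshot.Model, validationData);
            Console.WriteLine($"{trainingArgs}: iter {snapshot.Iteration} -> {metrics}");
        }
    }
}
```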
For example, consider training a LightGBM model. This is the training method in LightGbmTrainerBase.cs:
```csharp
public void Train(RoleMappedData data)
{
    Dataset dtrain;
    CategoricalMetaData catMetaData;
    using (var ch = Host.Start("Loading data for LightGBM"))
    {
        using (var pch = Host.StartProgressChannel("Loading data for LightGBM"))
            dtrain = LoadTrainingData(ch, data, out catMetaData);
        ch.Done();
    }
    using (var ch = Host.Start("Training with LightGBM"))
    {
        using (var pch = Host.StartProgressChannel("Training with LightGBM"))
            TrainCore(ch, pch, dtrain, catMetaData);
        ch.Done();
    }
    dtrain.Dispose();
    DisposeParallelTraining();
}
```

In order to address point 1) above, the dtrain object returned by LoadTrainingData should be made available for re-use. This would require that the configuration parameters for data preparation be specified separately from those for training, rather than all being thrown in together into the LightGbmArguments type.
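One possible split of the arguments might look like the following. This is only a sketch, assuming the data-preparation options can be cleanly separated; `LightGbmDataArguments`, `LightGbmTrainingArguments`, `PrepareData` and the two-argument `Train` overload are all hypothetical names, not existing ML.Net API.

```csharp
// Hypothetical split of LightGbmArguments into data-preparation and training concerns.
public sealed class LightGbmDataArguments
{
    public int MaxBin = 255;            // binning resolution used when building the Dataset
    public bool UseMissing = false;     // how missing values are encoded during data prep
    // ... other options that only affect LoadTrainingData ...
}

public sealed class LightGbmTrainingArguments
{
    public double LearningRate = 0.1;
    public int NumLeaves = 31;
    public int NumIterations = 100;
    // ... other options that only affect TrainCore ...
}

// The trainer could then expose the prepared Dataset and accept it back,
// so a single prepared Dataset serves many training runs:
// Dataset dtrain = trainer.PrepareData(data, dataArgs);    // run once
// var model = trainer.Train(dtrain, trainingArgs);         // run per grid point
```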
Now, regarding point 2) above, note that the TrainCore method calls WrappedLightGBMTraining.Train, which has the following structure:
```csharp
public static Booster Train(IChannel ch, IProgressChannel pch,
    Dictionary<string, object> parameters, Dataset dtrain, Dataset dvalid = null, int numIteration = 100,
    bool verboseEval = true, int earlyStoppingRound = 0)
{
    // create Booster.
    Booster bst = new Booster(parameters, dtrain, dvalid);
    for (int iter = 0; iter < numIteration; ++iter)
    {
        // training logic
    }
    return bst;
}
```

In order to get the intermediate models, this method should return Booster[] instead of just the final Booster (or perhaps, in this case, the Booster object should instead support extraction of a prediction model that contains only the first N trees of the ensemble).
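A minimal sketch of the first option follows, assuming a hypothetical Booster.CreateSnapshot() that copies the model state at the current iteration (no such method exists today):

```csharp
// Hypothetical variant of WrappedLightGBMTraining.Train that retains intermediate models.
// Booster.CreateSnapshot() is an invented method standing in for whatever mechanism
// copies the current model state (e.g. serializing the first N trees).
public static Booster[] Train(IChannel ch, IProgressChannel pch,
    Dictionary<string, object> parameters, Dataset dtrain, Dataset dvalid = null,
    int numIteration = 100, bool verboseEval = true, int earlyStoppingRound = 0,
    ISet<int> snapshotIterations = null)
{
    var snapshots = new List<Booster>();
    Booster bst = new Booster(parameters, dtrain, dvalid);
    for (int iter = 0; iter < numIteration; ++iter)
    {
        // training logic
        if (snapshotIterations != null && snapshotIterations.Contains(iter + 1))
            snapshots.Add(bst.CreateSnapshot());   // hypothetical copy of the model so far
    }
    snapshots.Add(bst);   // always include the final model
    return snapshots.ToArray();
}
```

The parenthetical alternative may be cheaper still: because boosting only ever appends trees, the model at iteration N is a prefix of the final ensemble, so a single final Booster could stand in for all intermediate models if truncated prediction were supported.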
Perhaps this facility already exists in ML.Net, but I was unable to find anything in my reading of the source or any of the examples.
I think 99.9% of all machine learning research requires a parameter grid search at some stage; hence this is essential functionality that should be as efficient as possible.