The ML.Net library suffers from a lack of decoupling between data preparation and model training, which is needed to do an efficient grid search over training parameters.
That is, ideally the API should be structured in such a way that it is possible to do the following:
- Prepare the data set once, so that it can be re-used multiple times. As much as possible, any pre-training calculations should be done up front (or cached for re-use). For large data sets, the overhead of repeating this step on every run is significant, taking as long as or longer than the training itself.
- For algorithms with multiple training iterations, it should be straightforward to retain the intermediate trained models at each iteration (or at a specified set of iterations). It is then easy to compute metrics for the intermediate models on training and validation data sets, and ultimately to select one of them for production use without re-running the training (a sketch of what this could look like follows this list).
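To make the desired workflow concrete, here is a rough sketch from the caller's side. This is purely hypothetical: `PreparedDataset`, `GridTrainer`, `Snapshot`, `TrainingArgs`, `ValidationData` and `Evaluate` are invented names, not existing ML.Net types.

```csharp
// Hypothetical sketch only: none of these types exist in ML.Net today.
static void GridSearch(
    PreparedDataset prepared,                 // data prepared once, up front
    IEnumerable<TrainingArgs> parameterGrid,  // points to search over
    GridTrainer trainer,
    ValidationData validationData)
{
    foreach (var trainingArgs in parameterGrid)
    {
        // Re-use the same prepared data for every point in the grid.
        var snapshots = trainer.Train(prepared, trainingArgs,
            snapshotIterations: new[] { 50, 100, 200 });

        foreach (var snapshot in snapshots)
        {
            // Score each intermediate model without re-training.
            var metrics = Evaluate(snapshot.Model, validationData);
            Console.WriteLine($"{trainingArgs}: iter {snapshot.Iteration} -> {metrics}");
        }
    }
}
```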
For example, consider training a LightGBM model. This is the training method in LightGbmTrainerBase.cs:
```csharp
public void Train(RoleMappedData data)
{
    Dataset dtrain;
    CategoricalMetaData catMetaData;
    using (var ch = Host.Start("Loading data for LightGBM"))
    {
        using (var pch = Host.StartProgressChannel("Loading data for LightGBM"))
            dtrain = LoadTrainingData(ch, data, out catMetaData);
        ch.Done();
    }
    using (var ch = Host.Start("Training with LightGBM"))
    {
        using (var pch = Host.StartProgressChannel("Training with LightGBM"))
            TrainCore(ch, pch, dtrain, catMetaData);
        ch.Done();
    }
    dtrain.Dispose();
    DisposeParallelTraining();
}
```

In order to address point 1) above, the dtrain object returned by LoadTrainingData should be made available for re-use. This would require that the configuration parameters for data preparation be specified separately from those for training, rather than all being thrown in together into the LightGbmArguments type.
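One possible split of the arguments might look like the following. This is only a sketch, assuming the data-preparation options can be cleanly separated; `LightGbmDataArguments`, `LightGbmTrainingArguments`, `PrepareData` and the two-argument `Train` overload are all hypothetical names, not existing ML.Net API.

```csharp
// Hypothetical split of LightGbmArguments into data-preparation and training concerns.
public sealed class LightGbmDataArguments
{
    public int MaxBin = 255;            // binning resolution used when building the Dataset
    public bool UseMissing = false;     // how missing values are encoded during data prep
    // ... other options that only affect LoadTrainingData ...
}

public sealed class LightGbmTrainingArguments
{
    public double LearningRate = 0.1;
    public int NumLeaves = 31;
    public int NumIterations = 100;
    // ... other options that only affect TrainCore ...
}

// The trainer could then expose the prepared Dataset and accept it back,
// so a single prepared Dataset serves many training runs:
// Dataset dtrain = trainer.PrepareData(data, dataArgs);    // run once
// var model = trainer.Train(dtrain, trainingArgs);         // run per grid point
```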
Now, regarding point 2) above, note that the TrainCore method calls WrappedLightGBMTraining.Train, which has the following structure:
```csharp
public static Booster Train(IChannel ch, IProgressChannel pch,
    Dictionary<string, object> parameters, Dataset dtrain, Dataset dvalid = null, int numIteration = 100,
    bool verboseEval = true, int earlyStoppingRound = 0)
{
    // create Booster.
    Booster bst = new Booster(parameters, dtrain, dvalid);
    for (int iter = 0; iter < numIteration; ++iter)
    {
        // training logic
    }
    return bst;
}
```

In order to get the intermediate models, this method should return Booster[] instead of just the final Booster (or perhaps, in this case, the Booster object should instead support extraction of a prediction model that contains only the first N trees of the ensemble).
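A minimal sketch of the first option follows, assuming a hypothetical Booster.CreateSnapshot() that copies the model state at the current iteration (no such method exists today):

```csharp
// Hypothetical variant of WrappedLightGBMTraining.Train that retains intermediate models.
// Booster.CreateSnapshot() is an invented method standing in for whatever mechanism
// copies the current model state (e.g. serializing the first N trees).
public static Booster[] Train(IChannel ch, IProgressChannel pch,
    Dictionary<string, object> parameters, Dataset dtrain, Dataset dvalid = null,
    int numIteration = 100, bool verboseEval = true, int earlyStoppingRound = 0,
    ISet<int> snapshotIterations = null)
{
    var snapshots = new List<Booster>();
    Booster bst = new Booster(parameters, dtrain, dvalid);
    for (int iter = 0; iter < numIteration; ++iter)
    {
        // training logic
        if (snapshotIterations != null && snapshotIterations.Contains(iter + 1))
            snapshots.Add(bst.CreateSnapshot());   // hypothetical copy of the model so far
    }
    snapshots.Add(bst);   // always include the final model
    return snapshots.ToArray();
}
```

The parenthetical alternative may be cheaper still: because boosting only ever appends trees, the model at iteration N is a prefix of the final ensemble, so a single final Booster could stand in for all intermediate models if truncated prediction were supported.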
Perhaps this facility already exists in ML.Net, but I was unable to find anything in my reading of the source or any of the examples.
I think 99.9% of all machine learning research requires a parameter grid search at some stage; hence this is essential functionality that should be as efficient as possible.