Quickstart Python
This guide assumes that lemkit has been set up for Python.
Lemkit supports both training and prediction in Python. Each can be done either with the command-line applications or by calling library functions.
##Library Functions
###Training
To import all available training methods, use `from lemkit.train import *`. Currently, the lemkit Python package supports training only multinomial logistic models, which it does by interfacing with the scikit-learn package.
Basic usage of the logistic module is demonstrated below. `logistic.train()` takes a file path as an argument and returns a LinearModel object. The input file must conform to the format described on the Input Format page.
from lemkit.train import *
model = logistic.train("path_to_lemkit/data/iris.train.txt")
A LinearModel object can be persisted to an output file by calling its `writeBinary` or `writeJson` method.
model.writeBinary("iris.model.bin")
model.writeJson("iris.model.json")
`logistic.train` also supports feature hashing and other training options. Below, a logistic model is trained using L1 regularization and feature hashing with a maximum of 10,000 features.
model = logistic.train("path_to_lemkit/data/iris.train.txt", hash_trick=True, hashmod=10000, regularization="L1")
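To make the effect of a hash modulus of 10,000 concrete, here is a minimal standalone sketch of the hashing trick. It does not call lemkit: the helper name `hash_features`, the feature names, and the choice of MD5 as the hash function are all assumptions made for illustration, and lemkit's own hashing may differ.

```python
import hashlib
from collections import defaultdict


def hash_features(features, hashmod=10000):
    """Map named feature values onto at most `hashmod` integer indices.

    Hypothetical helper for illustration; not part of lemkit.
    """
    hashed = defaultdict(float)
    for name, value in features.items():
        # Use a stable hash so indices are identical across runs
        # (Python's built-in hash() is randomized for strings).
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        index = int(digest, 16) % hashmod
        # Features that collide share an index, so their values are summed.
        hashed[index] += value
    return dict(hashed)


# Made-up feature names standing in for one iris instance.
instance = {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4}
hashed_instance = hash_features(instance, hashmod=10000)
print(hashed_instance)
```

Whatever the number of distinct feature names, the model never needs more than `hashmod` weights per label; the cost is that colliding features become indistinguishable.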
###Prediction
Prediction requires a LinearModel object, which can be created either by reading a pre-trained model file or by training on raw input data.
import lemkit
model = lemkit.model_tools.readBinaryModel("path_to_lemkit/data/model_files/iris.vw.model.bin")
predictions = model.predict("path_to_lemkit/data/iris/iris.test.txt")
predictions[:5]
[['1', 'Iris-setosa', 'Iris-setosa', 2.8037471], ['2', 'Iris-versicolor', 'Iris-versicolor', -0.09192749999999994], ['3', 'Iris-versicolor', 'Iris-versicolor', 0.6667894000000005], ['4', 'Iris-setosa', 'Iris-setosa', 3.3045669999999996], ['5', 'Iris-setosa', 'Iris-setosa', 3.9565813]]
`model.predict()` returns a list in which each element has the form [index, Gold_Label, Predicted_Label, Score]. If `model.predict()` is run on a file without gold labels, the second entry of each element will be blank.
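Because each row carries both the gold and the predicted label, accuracy can be computed directly from the returned list. The sketch below uses the five rows shown above as literal data rather than calling lemkit. The helper name `accuracy` is hypothetical, and representing a blank gold label as an empty string is an assumption.

```python
def accuracy(predictions):
    """Fraction of labeled rows whose predicted label matches the gold label.

    Rows without a gold label (assumed here to be an empty string) are skipped.
    Returns None when no row has a gold label.
    """
    labeled = [row for row in predictions if row[1] != ""]
    if not labeled:
        return None
    correct = sum(1 for row in labeled if row[1] == row[2])
    return correct / len(labeled)


# The first five rows from the example output above.
predictions = [
    ['1', 'Iris-setosa', 'Iris-setosa', 2.8037471],
    ['2', 'Iris-versicolor', 'Iris-versicolor', -0.09192749999999994],
    ['3', 'Iris-versicolor', 'Iris-versicolor', 0.6667894000000005],
    ['4', 'Iris-setosa', 'Iris-setosa', 3.3045669999999996],
    ['5', 'Iris-setosa', 'Iris-setosa', 3.9565813],
]
print(accuracy(predictions))  # 1.0, since all five rows match
```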
##Command Line Applications
The command-line applications lktrain and lkpredict provide model training and prediction from the shell.
The following usage of lktrain trains a logistic model with L1 regularization and feature hashing with a maximum of 10,000 features, and writes the trained model in binary format to iris.model.bin.
$LEMKIT/python/bin/lktrain -t $LEMKIT/data/iris/iris.train.txt --model-type logistic --reg L1 --hash 10000 --mf binary -o iris.model.bin
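The same command can also be launched from a Python script with the standard `subprocess` module. This is only a sketch: the helper name `build_lktrain_command` is hypothetical, the flags are copied verbatim from the command above, and it assumes the `LEMKIT` environment variable points at the lemkit checkout.

```python
import os
import subprocess


def build_lktrain_command(lemkit_root, train_file, outfile):
    """Assemble the lktrain argument list used in the example above."""
    return [
        os.path.join(lemkit_root, "python", "bin", "lktrain"),
        "-t", train_file,
        "--model-type", "logistic",
        "--reg", "L1",
        "--hash", "10000",
        "--mf", "binary",
        "-o", outfile,
    ]


# Fall back to a placeholder path when LEMKIT is not set.
lemkit_root = os.environ.get("LEMKIT", "path_to_lemkit")
command = build_lktrain_command(
    lemkit_root,
    os.path.join(lemkit_root, "data", "iris", "iris.train.txt"),
    "iris.model.bin",
)
print(" ".join(command))

# Only run the command when the lktrain script is actually present.
if os.path.exists(command[0]):
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` raises an exception if lktrain exits with a non-zero status.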
A full listing of lktrain arguments:
Argument | Meaning
----------------------|------------------------------------------------------------
-t, --train | Training file
-o, --outfile | File the trained model will be written to
-m, --model-type | Type of model to train (logistic)
-r, --reg | Regularization method used (L1 or L2)
-f, --model-format | Write format of the model (json or binary)
--hash | Integer value specifying the mod size of the hash trick (optional)
-s, --sparse | Write model weights in a sparse format (True or False)
lkpredict is invoked similarly, with the file of instances to predict passed via --predict:
$LEMKIT/python/bin/lkpredict --predict $LEMKIT/data/iris/iris.test.txt
A full listing of lkpredict arguments:
Argument | Meaning
-----------------------------|------------------------------------------------------------
-f, --mf, --model-format | Model format (json or binary, default binary)
-p, --predict | File containing data instances to predict
-m, --model | Trained model file
-a, --show-accuracy | Output accuracy at end
-c, --show-correct | Output a column indicating whether each prediction is correct or wrong