Grant edited this page Jan 5, 2015 · 7 revisions

This guide assumes that lemkit has been set up for Python.

Lemkit supports both training and prediction in Python, either through command-line applications or through library functions.

##Library Functions

###Training

To import all available training methods, use from lemkit.train import *. Currently, the Python version of lemkit supports training only multinomial logistic models, which it does by interfacing with the scikit-learn package.

Basic usage of the logistic module is demonstrated below. logistic.train() takes a file path as an argument and returns a LinearModel object. The input file must follow the format described on the Input Format page.

from lemkit.train import *
model = logistic.train("path_to_lemkit/data/iris.train.txt")

A LinearModel object can be saved to an output file by calling its writeBinary or writeJson method.

model.writeBinary("iris.model.bin")
model.writeJson("iris.model.json")

logistic.train() also supports feature hashing and other training options. Below, a logistic model is trained with L1 regularization and feature hashing with a maximum of 10,000 features.

model = logistic.train("path_to_lemkit/data/iris.train.txt", hash_trick=True, hashmod=10000, regularization="L1")
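To illustrate what the hashing option does, here is a minimal standalone sketch of the hashing trick. It does not call lemkit. The hash function lemkit uses internally is not documented on this page, so zlib.crc32 is an assumption, chosen only because it is deterministic across runs. The helper names hash_feature and hash_features and the feature names are hypothetical.

```python
import zlib
from collections import defaultdict


def hash_feature(name, hashmod=10000):
    """Map a feature name to a bucket index in [0, hashmod).

    zlib.crc32 is a stand-in hash (an assumption, not lemkit's own).
    """
    return zlib.crc32(name.encode("utf-8")) % hashmod


def hash_features(features, hashmod=10000):
    """Collapse a {feature_name: value} dict into {bucket_index: value}.

    Features that collide in the same bucket have their values summed.
    """
    hashed = defaultdict(float)
    for name, value in features.items():
        hashed[hash_feature(name, hashmod)] += value
    return dict(hashed)


# Hypothetical feature names for a single iris instance.
instance = {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4}
print(hash_features(instance))
```

Whatever the number of distinct feature names, the model never needs more than hashmod weights per class. The cost is that colliding features share a weight.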

###Predicting

Predicting requires a LinearModel object, which can be created either by reading a pre-trained model file or by training from raw input data.

import lemkit
model = lemkit.model_tools.readBinaryModel("path_to_lemkit/data/model_files/iris.vw.model.bin")
predictions = model.predict("path_to_lemkit/data/iris/iris.test.txt")
predictions[:5]
[['1', 'Iris-setosa', 'Iris-setosa', 2.8037471], ['2', 'Iris-versicolor', 'Iris-versicolor', -0.09192749999999994], ['3', 'Iris-versicolor', 'Iris-versicolor', 0.6667894000000005], ['4', 'Iris-setosa', 'Iris-setosa', 3.3045669999999996], ['5', 'Iris-setosa', 'Iris-setosa', 3.9565813]]

model.predict() returns a list in which each element has the form [index, Gold_Label, Predicted_Label, Score]. If model.predict() is run on a file without gold labels, the second entry of each element will be blank.
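Because each row carries both the gold and the predicted label, accuracy can be computed directly from the returned list. The sketch below is standalone and does not call lemkit. The helper name accuracy is hypothetical, and it assumes that a "blank" gold label is an empty string or None.

```python
def accuracy(predictions):
    """Fraction of rows whose predicted label equals the gold label.

    Each row is expected to look like [index, gold, predicted, score].
    Rows with a blank gold label (assumed to be "" or None) are skipped.
    Returns None if no row has a gold label.
    """
    labeled = [row for row in predictions if row[1]]
    if not labeled:
        return None
    correct = sum(1 for _, gold, predicted, _ in labeled if gold == predicted)
    return correct / float(len(labeled))


# Rows in the documented format, with values taken from the sample output.
sample = [
    ["1", "Iris-setosa", "Iris-setosa", 2.8037471],
    ["2", "Iris-versicolor", "Iris-versicolor", -0.09192749999999994],
    ["3", "Iris-versicolor", "Iris-versicolor", 0.6667894000000005],
]
print(accuracy(sample))  # prints 1.0
```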

##Command-line applications

The command-line applications lktrain and lkpredict provide model training and prediction from the shell.

###lktrain

The following invocation of lktrain trains a logistic model with L1 regularization and feature hashing with a maximum of 10,000 features, and writes the trained model in binary format to iris.model.bin:

$LEMKIT/python/bin/lktrain -t $LEMKIT/data/iris/iris.train.txt --model-type logistic --reg L1 --hash 10000 --mf binary -o iris.model.bin

A full listing of possible arguments:

Argument | Meaning
---------|--------
-t, --train | training file
-o, --outfile | file trained model will be written to
-m, --model-type | type of model to train (logistic)
-r, --reg | regularization method used (L1 or L2)
-f, --model-format | write format of model (json or binary)
--hash | integer value specifying mod size of hash trick (optional)
-s, --sparse | write model weights in a sparse format (True or False)

###lkpredict

$LEMKIT/python/bin/lkpredict --predict $LEMKIT/data/iris/iris.test.txt 

A full listing of possible arguments:

Argument | Meaning
---------|--------
-f, --mf, --model-format | Model format (json or binary, default binary)
-p, --predict | File containing data instances to predict
-m, --model | Trained model file
-a, --show-accuracy | Output accuracy at end
-c, --show-correct | Output column indicating correct or wrong
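Both tools can also be driven from a Python script. The sketch below only builds the argument lists, using the long option names from the tables above, and does not run anything. The function names are hypothetical, and it assumes that --show-accuracy is a switch that takes no value, which this page does not state.

```python
import os


def lktrain_command(lemkit_root, train_file, outfile, reg="L1",
                    hashmod=None, model_format="binary"):
    """Build the argument list for lktrain (the command is not run here)."""
    cmd = [
        os.path.join(lemkit_root, "python", "bin", "lktrain"),
        "--train", train_file,
        "--model-type", "logistic",
        "--reg", reg,
        "--model-format", model_format,
        "--outfile", outfile,
    ]
    if hashmod is not None:
        # --hash is optional, so it is added only when requested.
        cmd += ["--hash", str(hashmod)]
    return cmd


def lkpredict_command(lemkit_root, predict_file, model_file,
                      model_format="binary", show_accuracy=False):
    """Build the argument list for lkpredict (the command is not run here)."""
    cmd = [
        os.path.join(lemkit_root, "python", "bin", "lkpredict"),
        "--predict", predict_file,
        "--model", model_file,
        "--model-format", model_format,
    ]
    if show_accuracy:
        # Assumption: --show-accuracy is a flag without a value.
        cmd.append("--show-accuracy")
    return cmd


# Fall back to a placeholder path when $LEMKIT is not set.
root = os.environ.get("LEMKIT", "/path/to/lemkit")
print(" ".join(lktrain_command(root, "iris.train.txt", "iris.model.bin",
                               hashmod=10000)))
```

Passing either list to subprocess.check_call would run the tool and raise an exception on a non-zero exit status. Keeping the arguments as a list avoids shell quoting problems with file paths.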
