benwing edited this page Jan 5, 2015 · 3 revisions

Go User Guide

Library functions

The library functions in Go are modeled after those in Scala.

Model objects for prediction are held in the LinearClassifier struct, which encapsulates a more general IndexingClassifier struct. The Predict() function computes a prediction for a set of features and has two return values, the label (a string) and the score. The label returned is the one with the highest score. You can also fetch the entire set of scores using Scores() (in the form of an array of scores) or using ScoresMap() (in the form of a map from labels to scores). In the former case, the scores are returned in the same order as the labels returned by the Labels() function of the encapsulated IndexingClassifier. LinearClassifier objects can be read from binary or JSON-formatted files using (respectively) ReadBinaryModel() and ReadJSONModel().
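As a rough illustration of how the label returned by Predict() relates to the output of Scores() and ScoresMap(), the sketch below picks the highest-scoring label from a pair of parallel slices and builds a label-to-score map from them. It does not call the library: bestLabel and scoresToMap are hypothetical helper names, and the score values are made up.

```go
package main

import "fmt"

// bestLabel returns the label with the highest score, together with that
// score. labels and scores are parallel slices (scores[i] belongs to
// labels[i]) and are assumed to be non-empty and of equal length.
// Hypothetical helper, not part of the library.
func bestLabel(labels []string, scores []float64) (string, float64) {
	best := 0
	for i := range scores {
		if scores[i] > scores[best] {
			best = i
		}
	}
	return labels[best], scores[best]
}

// scoresToMap converts the parallel slices into a map from label to score.
// Hypothetical helper, not part of the library.
func scoresToMap(labels []string, scores []float64) map[string]float64 {
	m := make(map[string]float64, len(labels))
	for i, label := range labels {
		m[label] = scores[i]
	}
	return m
}

func main() {
	labels := []string{"Iris-setosa", "Iris-versicolor", "Iris-virginica"}
	scores := []float64{-1.2, 0.7, 0.3} // made-up values for illustration

	label, score := bestLabel(labels, scores)
	fmt.Println(label, score)
	fmt.Println(scoresToMap(labels, scores)["Iris-virginica"])
}
```

The slice form is compact but only meaningful alongside the label order, which is why the map form is convenient when scores need to be looked up by label.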

Both labels and features are externally represented as strings, but internally converted to integer indices. Indexing of labels to integers is always exact, using a hash table. Indexing of features to integers may be exact (using ExactFeatureMap) or approximate, via feature hashing (using HashedFeatureMap). Both types of indexes are held in a ClassifierIndexer struct, contained inside IndexingClassifier.
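The difference between exact indexing and feature hashing can be sketched as follows. This is a minimal illustration, not the library's implementation: exactIndexer and hashedIndex are hypothetical names, and the choice of FNV-1a as the hash function and of 16 buckets is an assumption made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// exactIndexer assigns each distinct string its own integer index by
// remembering every string it has seen in a hash table.
// Hypothetical type, not part of the library.
type exactIndexer struct {
	index map[string]int
}

func newExactIndexer() *exactIndexer {
	return &exactIndexer{index: make(map[string]int)}
}

// indexOf returns the index for s, allocating the next unused index the
// first time s is seen.
func (e *exactIndexer) indexOf(s string) int {
	if i, ok := e.index[s]; ok {
		return i
	}
	i := len(e.index)
	e.index[s] = i
	return i
}

// hashedIndex maps s to a bucket in [0, buckets) without storing s.
// Distinct strings may collide in the same bucket. FNV-1a is an assumed
// hash function chosen only for this example.
func hashedIndex(s string, buckets uint32) int {
	h := fnv.New32a()
	h.Write([]byte(s)) // Write on an FNV hash never returns an error
	return int(h.Sum32() % buckets)
}

func main() {
	exact := newExactIndexer()
	fmt.Println(exact.indexOf("foo"), exact.indexOf("bar"), exact.indexOf("foo"))
	fmt.Println(hashedIndex("foo", 16), hashedIndex("bar", 16))
}
```

Exact indexing needs memory proportional to the number of distinct features but never confuses two of them; hashing uses a fixed number of buckets at the cost of possible collisions.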

Note: the rest of this section needs updating. The descriptions and code below still refer to the Scala API.

Features as passed to the functions of Classifier are of type FeatureSet[String], which is currently a type alias for Seq[FeatureObservation[String]]. Each FeatureObservation encapsulates a single feature and its value.

There is also a class Example that encapsulates a complete data instance, i.e. a set of features, a label, and an optional importance weight. Example is type-parameterized on the feature and label types. Currently the code uses only Example[String, String] (the externally-visible view of an Example, with features and labels represented as strings) and Example[Int, Int] (the internal representation of an Example, with features and labels indexed to integers).

Example and FeatureObservation are case classes, and can be used to directly create data instances, e.g.:

    val feats = Seq("foo" -> 2.0, "bar" -> 3.0).map {
      case (feat, value) => FeatureObservation(feat, value)
    }
    val instance = Example(feats, "positive")

You can also read in a set of instances from a data file (in the format described in Input Format) using ClassifierSource:

    val instances = ClassifierSource.readDataFile("data.predict")

The value returned by this function is an iterator; you may want to convert it to a sequence using toSeq.

The following code, also found in the Quick Start, shows how to read in a model and a set of instances and do prediction:

  import com.peoplepattern.classify._
  import com.peoplepattern.classify.data._

  val classifier = LinearClassifier.readBinaryModel("model.binary")
  val predictData = ClassifierSource.readDataFile("data.predict").toSeq
  val predictions = predictData.map(i => classifier(i.features))
  for ((prediction, inst) <- predictions zip predictData) {
    println(s"Predicted label: ${prediction}, correct label: ${inst.label}")
  }

Command-line applications

lkpredict

lkpredict runs prediction using linear classifier models trained with Scala or Python. It takes the following arguments:

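| Argument | Meaning |
| --- | --- |
| `--model-format`, `-f` | Format of the model file (binary or JSON) |
| `--predict`, `-p` | File of instances to predict on, as described in Input Format (required) |
| `--model`, `-m` | Model file to load (required) |
| `--show-accuracy`, `-a` | Output a final line showing overall accuracy |
| `--show-correct`, `-c` | Add a column indicating whether each prediction was correct or wrong |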

The arguments --predict (or -p) and --model (or -m) are required.

A basic invocation of lkpredict might be:

lkpredict -m vw.iris.exact.model.bin -p iris.data.test.txt

The file passed to --predict is as described in Input Format.

The file passed to --model should be a binary-format or JSON-format model file as created using Scala or Python lktrain.

The predictions are sent to stdout, normally formatted as follows:

1 Iris-setosa Iris-setosa
2 Iris-versicolor Iris-versicolor
...
26 Iris-versicolor Iris-virginica
...

Each line consists of a line number, then the correct label, then the predicted label.

If --show-correct (or -c) is used, a second column is added indicating whether the prediction was correct or wrong. If --show-accuracy (or -a) is used, a line at the end is output showing overall accuracy. For example, executing the following:

lkpredict --model vw.iris.exact.model.bin \
  --predict iris.data.test.txt --show-accuracy --show-correct

might produce output as follows:

1 CORRECT Iris-setosa Iris-setosa
2 CORRECT Iris-versicolor Iris-versicolor
...
26 WRONG Iris-versicolor Iris-virginica
...
30 CORRECT Iris-setosa Iris-setosa
Accuracy: 93.33%
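If the predictions need to be post-processed, the output formats shown above are simple to parse. The following sketch recomputes the number of correct predictions from captured lkpredict output, accepting lines with or without the CORRECT/WRONG column. The function name countCorrect is hypothetical, and the code assumes that labels contain no whitespace.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// countCorrect scans lkpredict-style output and returns how many prediction
// lines had matching correct and predicted labels, and how many prediction
// lines were seen in total. Hypothetical helper; assumes labels contain no
// whitespace.
func countCorrect(output string) (correct, total int) {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// A prediction line is "N gold predicted" or
		// "N CORRECT|WRONG gold predicted".
		if len(fields) != 3 && len(fields) != 4 {
			continue
		}
		// The first field must be a line number; this skips anything else,
		// such as the final "Accuracy: ..." summary line.
		if _, err := strconv.Atoi(fields[0]); err != nil {
			continue
		}
		gold, predicted := fields[len(fields)-2], fields[len(fields)-1]
		total++
		if gold == predicted {
			correct++
		}
	}
	return correct, total
}

func main() {
	out := "1 CORRECT Iris-setosa Iris-setosa\n" +
		"2 CORRECT Iris-versicolor Iris-versicolor\n" +
		"26 WRONG Iris-versicolor Iris-virginica\n"
	correct, total := countCorrect(out)
	fmt.Printf("%d of %d correct\n", correct, total) // prints "2 of 3 correct"
}
```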
