API documentation

pymurtree is implemented as a thin Python wrapper around the main C++ MurTree application. The main functionality of MurTree is exposed in pymurtree via the OptimalDecisionTreeClassifier class. Utility functions to load training datasets and to export the tree in text and DOT formats are also included in the Python package.
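
A minimal end-to-end sketch of a typical workflow, based on the methods documented below (the toy data is made up for illustration):

import pymurtree
import numpy

# Toy training data: rows are instances, columns are binary features
x = numpy.array([[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]])
y = numpy.array([5, 5, 4])  # integer class labels

# Fit an optimal decision tree classifier
model = pymurtree.OptimalDecisionTreeClassifier()
model.fit(x, y)

# Predict labels and print the tree rules to standard output
labels = model.predict(x)
model.export_text()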

OptimalDecisionTreeClassifier class

OptimalDecisionTreeClassifier.py

Methods

  • constructor: initialize the parameters of the model
  • fit: fit a decision tree classifier to the given training dataset
  • predict: predict the labels for a set of features
  • score: return the misclassification score of the tree
  • depth: return the depth of the tree
  • num_nodes: return the number of nodes of the tree
  • export_text: export decision tree in text format
  • export_dot: export decision tree in DOT format

constructor

Construct the object and initialize the model parameters.

Parameters
  • time: int, default = 600
    Maximum runtime, in seconds.

  • max_depth: int, default = 3
    Maximum allowed depth of the tree, where the depth is defined as the largest number of decision/feature nodes from the root to any leaf.
    Depth greater than four is usually time-consuming.

  • max_num_nodes: int, default = 7
    Maximum number of decision/feature nodes allowed. Note that a tree with k feature nodes has k+1 leaf nodes.

  • sparse_coefficient: float, default = 0.0
    Assigns the penalty for using decision/feature nodes.
    Large sparse coefficients will result in smaller trees.

  • verbose: bool, default = True
    Determines whether the solver should print logging information to standard output.

  • all_trees: bool, default = True
    Instructs the algorithm to compute trees for all allowed combinations of max_depth and max_num_nodes. Used to stress-test the algorithm.

  • incremental_frequency: bool, default = True
    Activates incremental frequency computation, which takes previously computed trees into account when recomputing the frequency.
    In our experiments this proved effective on all datasets.

  • similarity_lower_bound: bool, default = True
    Activates similarity-based lower bounding. Disabling this option may be better for some benchmarks, but on most of our tested datasets keeping it on was beneficial.

  • node_selection: int, default = 0
    Node selection strategy used to decide whether the algorithm should examine the left or right child node first.
    0 = dynamic
    1 = post-order

  • feature_ordering: int, default = 0
    Strategy that determines the order in which features are inspected at each node.
    0 = in-order
    1 = random
    2 = gini

  • random_seed: int, default = 3
    Random seed, used only if feature_ordering is set to random.
    A seed of -1 assigns the seed based on the current time.

  • cache_type: int, default = 0
    Cache type used to store computed subtrees. Needs to be determined experimentally.
    0 = dataset
    1 = branch
    2 = closure
    'Dataset' is more powerful than 'branch' but may require more computational time.
    'Closure' is experimental and typically slower than the other options.

  • duplicate_factor: int, default = 1
    Duplicates each instance the given number of times. Used for stress-testing the algorithm; not a practical parameter.

Returns
  • None
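Example
A sketch using a subset of the parameters documented above; the values shown are illustrative, not recommendations.

import pymurtree

# Restrict the search to small trees and silence solver logging
model = pymurtree.OptimalDecisionTreeClassifier(max_depth=2,
                                                max_num_nodes=3,
                                                verbose=False)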

fit

Build a decision tree classifier from the training set (x, y).

Parameters
  • x: numpy.ndarray
    2D array with the training features. Each row corresponds to an instance, and each column corresponds to a binary feature.

  • y: numpy.ndarray
    1D array with the class labels of the training instances.

  • time: int, default = 600
    Maximum runtime, in seconds.

  • max_depth: int, default = 3
    Maximum allowed depth of the tree, where the depth is defined as the largest number of decision/feature nodes from the root to any leaf.
    Depth greater than four is usually time-consuming.

  • max_num_nodes: int, default = 7
    Maximum number of decision/feature nodes allowed. Note that a tree with k feature nodes has k+1 leaf nodes.

  • sparse_coefficient: float, default = 0.0
    Assigns the penalty for using decision/feature nodes.
    Large sparse coefficients will result in smaller trees.

  • verbose: bool, default = True
    Determines whether the solver should print logging information to standard output.

  • all_trees: bool, default = True
    Instructs the algorithm to compute trees for all allowed combinations of max_depth and max_num_nodes. Used to stress-test the algorithm.

  • incremental_frequency: bool, default = True
    Activates incremental frequency computation, which takes previously computed trees into account when recomputing the frequency.
    In our experiments this proved effective on all datasets.

  • similarity_lower_bound: bool, default = True
    Activates similarity-based lower bounding. Disabling this option may be better for some benchmarks, but on most of our tested datasets keeping it on was beneficial.

  • node_selection: int, default = 0
    Node selection strategy used to decide whether the algorithm should examine the left or right child node first.
    0 = dynamic
    1 = post-order

  • feature_ordering: int, default = 0
    Strategy that determines the order in which features are inspected at each node.
    0 = in-order
    1 = random
    2 = gini

  • random_seed: int, default = 3
    Random seed, used only if feature_ordering is set to random.
    A seed of -1 assigns the seed based on the current time.

  • cache_type: int, default = 0
    Cache type used to store computed subtrees. Needs to be determined experimentally.
    0 = dataset
    1 = branch
    2 = closure
    'Dataset' is more powerful than 'branch' but may require more computational time.
    'Closure' is experimental and typically slower than the other options.

  • duplicate_factor: int, default = 1
    Duplicates each instance the given number of times. Used for stress-testing the algorithm; not a practical parameter.

Returns
  • None
Example
>>> import pymurtree
>>> import numpy

# Create training data
>>> x = numpy.array([[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]])
>>> y = numpy.array([5, 5, 4])

# Build tree classifier
>>> model = pymurtree.OptimalDecisionTreeClassifier()
>>> model.fit(x, y)

predict

Predict the labels for a set of features.

Parameters
  • x: numpy.ndarray
    1D or 2D array with the set of features. Each row corresponds to an instance, and each column corresponds to a feature.
Returns
  • numpy.ndarray
    1D array with the labels of the input features. The i-th element holds the label of the i-th instance in x.
Example
>>> import pymurtree
>>> import numpy

# Create training data
>>> x = numpy.array([[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]])
>>> y = numpy.array([5, 5, 4])

# Build tree classifier
>>> model = pymurtree.OptimalDecisionTreeClassifier()
>>> model.fit(x, y)

# Predict labels for a new set of features
>>> ft = numpy.array([[1, 1, 0, 1], [0, 0, 1, 1], [1, 0, 1, 0]])
>>> labels = model.predict(ft)
>>> print(labels)
[5 5 4]

score

Return the misclassification score of the tree.

Parameters
  • None
Returns
  • int
    Misclassification score of the tree.
Example
>>> import pymurtree
>>> clf = pymurtree.OptimalDecisionTreeClassifier()
...
>>> clf.score()
112

depth

Return the depth of the decision tree.

Parameters
  • None
Returns
  • int
    The maximum depth of the tree.
Example
>>> import pymurtree
>>> clf = pymurtree.OptimalDecisionTreeClassifier()
...
>>> clf.depth()
4

num_nodes

Return the number of nodes in the tree.

Parameters
  • None
Returns
  • int
    Number of nodes in the tree.
Example
>>> import pymurtree
>>> clf = pymurtree.OptimalDecisionTreeClassifier()
...
>>> clf.num_nodes()
15

export_text

Create a text representation of all the rules in the decision tree. The text is written to out_file if given; otherwise it is printed to standard output.

Parameters
  • out_file: str, optional
    Name of the output file. If None, the result is displayed in the standard output.
  • feature_names: numpy.ndarray, optional
    1D array with the names of the features.
  • class_names: Dict[int, str], optional
    Dictionary with int keys and str values that represent the class names.
Returns
  • None
Example
import pymurtree
import numpy

# Create training data
x = numpy.array([[1, 0, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0]]) # features
y = numpy.array([5, 5, 4]) # labels

# Build tree classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x, y)

# Export tree in text format
ftnames = numpy.array(['has legs', 'has hair', 'can swim', 'can fly']) # feature names
clnames = {
    0 : 'snake',
    1 : 'cat',
    2 : 'crocodile',
    3 : 'horse',
    4 : 'fish',
    5 : 'bird'
} # class names
clf.export_text(feature_names=ftnames, class_names=clnames)
#|---has legs is true
#|   |---bird
#|---has legs is false
#|   |---fish

# Save tree in tree.txt
clf.export_text("tree.txt", ftnames, clnames)

export_dot

Export the decision tree in DOT format. The tree is written to out_file if given; otherwise it is printed to standard output. This function generates a Graphviz representation of the decision tree, from which graphical renderings can be produced with the Graphviz dot tool:

dot -Tpdf tree.dot -o tree.pdf    (PDF format)
dot -Tpng tree.dot -o tree.png    (PNG format)
Parameters
  • out_file: str, optional
    Name of the output file. If None, the result is displayed in the standard output.
  • feature_names: numpy.ndarray, optional
    1D array with the names of the features.
  • class_names: Dict[int, str], optional
    Dictionary with int keys and str values that represent the class names.
Returns
  • None
Example
import pymurtree
import numpy

x = numpy.array([[1, 0, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0]]) # features
y = numpy.array([5, 5, 4]) # labels

# Build tree classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x, y)

# Export tree in DOT format
ftnames = numpy.array(['has legs', 'has hair', 'can swim', 'can fly']) # feature names
clnames = {
    0 : 'snake',
    1 : 'cat',
    2 : 'crocodile',
    3 : 'horse',
    4 : 'fish',
    5 : 'bird'
} # class names
clf.export_dot(feature_names=ftnames, class_names=clnames)
# digraph Tree {
# node [shape=box, style="filled, rounded", fontname="helvetica", fontsize="8"] ;
# edge [fontname="helvetica", fontsize="6"] ;
# 0 [label=<has legs>, color="#8CB77F", fillcolor="#8CB77F"] ;
# 1 [label=<fish>, color="#B77F8C", fillcolor="#B77F8C"] ;
# 0 -> 1 [label=" 0 "] ;
# 2 [label=<bird>, color="#B77F8C", fillcolor="#B77F8C"] ;
# 0 -> 2 [label=" 1 "] ;
# }

# Save tree in tree.dot
clf.export_dot("tree.dot", ftnames, clnames)
# On the command line, render the tree in PNG format using Graphviz:
dot -Tpng tree.dot -o tree.png
open tree.png


Utility functions

readdata.py

Functions

  • load_murtree_dataset_to_pandas_dataframes: read features and labels from a file into a pandas.DataFrame and a pandas.Series
  • load_murtree_dataset_to_numpy_arrays: read features and labels from a file into numpy.ndarray objects

Both functions read datasets from a plain text file with the following format:

  • The first column contains the target variable (y), and the remaining columns contain the features (x).
  • All values should be space-separated integers.
  • All features must be 0 or 1, and class labels must be greater than or equal to zero.
  • No headers or footers are allowed.

These utility functions can be used to easily load the datasets in the murtree-data repository.
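
For illustration, a file in this format holding three instances with four binary features each could look as follows (hypothetical contents):

5 0 1 0 1
5 1 0 0 1
4 1 1 0 0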

load_murtree_dataset_to_pandas_dataframes

Read a dataset of features and labels from a plain text file into a pandas.DataFrame (features) and a pandas.Series (labels).

Parameters
  • path: str
    Path to the dataset.
Returns
  • tuple:
    x (pandas.DataFrame): all columns of the dataset except the first
    y (pandas.Series): the first column of the dataset
Example
# On the command line, clone the murtree-data repository first:
git clone https://github.com/MurTree/murtree-data.git

import pymurtree
from pymurtree.readdata import load_murtree_dataset_to_pandas_dataframes

# Load training dataset
(x, y) = load_murtree_dataset_to_pandas_dataframes("murtree-data/DL/anneal.txt")

# Build classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x.to_numpy(), y.to_numpy())

load_murtree_dataset_to_numpy_arrays

Read a dataset of features and labels from a plain text file into a 2D numpy.ndarray (features) and a 1D numpy.ndarray (labels).

Parameters
  • path: str
    Path to the dataset.
Returns
  • tuple:
    x (numpy.ndarray): 2D array with all columns except the first
    y (numpy.ndarray): 1D array with the first column
Example
# On the command line, clone the murtree-data repository first:
git clone https://github.com/MurTree/murtree-data.git

import pymurtree
from pymurtree.readdata import load_murtree_dataset_to_numpy_arrays

# Load training dataset
(x, y) = load_murtree_dataset_to_numpy_arrays("murtree-data/DL/anneal.txt")

# Build classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x, y)