API documentation

pymurtree is implemented as a thin Python wrapper around the main C++ MurTree application. The main functionality of MurTree is exposed in pymurtree via the OptimalDecisionTreeClassifier class. Utility functions to load training datasets and to export the tree in text and DOT formats are also included in the Python package.
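
A minimal end-to-end sketch of a typical workflow, based on the methods documented below (the toy data is made up for illustration):

import pymurtree
import numpy

# Toy training data: rows are instances, columns are binary features
x = numpy.array([[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]])
y = numpy.array([5, 5, 4])  # integer class labels

# Fit an optimal decision tree classifier
model = pymurtree.OptimalDecisionTreeClassifier()
model.fit(x, y)

# Predict labels and print the tree rules to standard output
labels = model.predict(x)
model.export_text()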

OptimalDecisionTreeClassifier class

OptimalDecisionTreeClassifier.py

Methods

  • constructor: initialize the parameters of the model
  • fit: fit a decision tree classifier to the given training dataset
  • predict: predict the labels for a set of features
  • score: return the misclassification score of the tree
  • depth: return the depth of the tree
  • num_nodes: return the number of nodes of the tree
  • export_text: export decision tree in text format
  • export_dot: export decision tree in DOT format

constructor

Construct the object and initialize the model parameters.

Parameters
  • time: int, default = 600
    Maximum runtime, in seconds.

  • max_depth: int, default = 3
    Maximum allowed depth of the tree, where the depth is defined as the largest number of decision/feature nodes from the root to any leaf.
    Depth greater than four is usually time-consuming.

  • max_num_nodes: int, default = 7
    Maximum number of decision/feature nodes allowed. Note that a tree with k feature nodes has k+1 leaf nodes.

  • sparse_coefficient: float, default = 0.0
    Assigns the penalty for using decision/feature nodes.
    Large sparse coefficients will result in smaller trees.

  • verbose: bool, default = True
    Determines whether the solver should print logging information to standard output.

  • all_trees: bool, default = True
    Instructs the algorithm to compute trees for all allowed combinations of max_depth and max_num_nodes. Used to stress-test the algorithm.

  • incremental_frequency: bool, default = True
    Activates incremental frequency computation, which takes previously computed trees into account when recomputing the frequency.
    In our experiments this proved effective on all datasets.

  • similarity_lower_bound: bool, default = True
    Activates similarity-based lower bounding. Disabling this option may be better for some benchmarks, but on most of our tested datasets keeping it on was beneficial.

  • node_selection: int, default = 0
    Node selection strategy used to decide whether the algorithm should examine the left or right child node first.
    0 = dynamic
    1 = post-order

  • feature_ordering: int, default = 0
    Strategy that determines the order in which features are inspected at each node.
    0 = in-order
    1 = random
    2 = gini

  • random_seed: int, default = 3
    Random seed, used only if feature_ordering is set to random.
    A seed of -1 assigns the seed based on the current time.

  • cache_type: int, default = 0
    Cache type used to store computed subtrees. Needs to be determined experimentally.
    0 = dataset
    1 = branch
    2 = closure
    'Dataset' is more powerful than 'branch' but may require more computational time.
    'Closure' is experimental and typically slower than the other options.

  • duplicate_factor: int, default = 1
    Duplicates each instance the given number of times. Used for stress-testing the algorithm; not a practical parameter.

Returns
  • None
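Example
A sketch using a subset of the parameters documented above; the values shown are illustrative, not recommendations.

import pymurtree

# Restrict the search to small trees and silence solver logging
model = pymurtree.OptimalDecisionTreeClassifier(max_depth=2,
                                                max_num_nodes=3,
                                                verbose=False)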

fit

Build a decision tree classifier from the training set (x, y).

Parameters
  • x: numpy.ndarray
    2D array with the training features. Each row corresponds to an instance, and each column corresponds to a binary feature.

  • y: numpy.ndarray
    1D array with the class labels of the training instances.

  • time: int, default = 600
    Maximum runtime, in seconds.

  • max_depth: int, default = 3
    Maximum allowed depth of the tree, where the depth is defined as the largest number of decision/feature nodes from the root to any leaf.
    Depth greater than four is usually time-consuming.

  • max_num_nodes: int, default = 7
    Maximum number of decision/feature nodes allowed. Note that a tree with k feature nodes has k+1 leaf nodes.

  • sparse_coefficient: float, default = 0.0
    Assigns the penalty for using decision/feature nodes.
    Large sparse coefficients will result in smaller trees.

  • verbose: bool, default = True
    Determines whether the solver should print logging information to standard output.

  • all_trees: bool, default = True
    Instructs the algorithm to compute trees for all allowed combinations of max_depth and max_num_nodes. Used to stress-test the algorithm.

  • incremental_frequency: bool, default = True
    Activates incremental frequency computation, which takes previously computed trees into account when recomputing the frequency.
    In our experiments this proved effective on all datasets.

  • similarity_lower_bound: bool, default = True
    Activates similarity-based lower bounding. Disabling this option may be better for some benchmarks, but on most of our tested datasets keeping it on was beneficial.

  • node_selection: int, default = 0
    Node selection strategy used to decide whether the algorithm should examine the left or right child node first.
    0 = dynamic
    1 = post-order

  • feature_ordering: int, default = 0
    Strategy that determines the order in which features are inspected at each node.
    0 = in-order
    1 = random
    2 = gini

  • random_seed: int, default = 3
    Random seed, used only if feature_ordering is set to random.
    A seed of -1 assigns the seed based on the current time.

  • cache_type: int, default = 0
    Cache type used to store computed subtrees. Needs to be determined experimentally.
    0 = dataset
    1 = branch
    2 = closure
    'Dataset' is more powerful than 'branch' but may require more computational time.
    'Closure' is experimental and typically slower than the other options.

  • duplicate_factor: int, default = 1
    Duplicates each instance the given number of times. Used for stress-testing the algorithm; not a practical parameter.

Returns
  • None
Example
>>> import pymurtree
>>> import numpy

# Create training data
>>> x = numpy.array([[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]])
>>> y = numpy.array([5, 5, 4])

# Build tree classifier
>>> model = pymurtree.OptimalDecisionTreeClassifier()
>>> model.fit(x, y)

predict

Predict the labels for a set of features.

Parameters
  • x: numpy.ndarray
    1D or 2D array with the set of features. Each row corresponds to an instance, and each column corresponds to a feature.
Returns
  • numpy.ndarray
    1D array with the labels of the input features. The i-th element holds the label of the i-th instance in x.
Example
>>> import pymurtree
>>> import numpy

# Create training data
>>> x = numpy.array([[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]])
>>> y = numpy.array([5, 5, 4])

# Build tree classifier
>>> model = pymurtree.OptimalDecisionTreeClassifier()
>>> model.fit(x, y)

# Predict labels for a new set of features
>>> ft = numpy.array([[1, 1, 0, 1], [0, 0, 1, 1], [1, 0, 1, 0]])
>>> labels = model.predict(ft)
>>> print(labels)
[5 5 4]

score

Return the misclassification score of the tree.

Parameters
  • None
Returns
  • int
    Misclassification score of the tree.
Example
>>> import pymurtree
>>> clf = pymurtree.OptimalDecisionTreeClassifier()
...
>>> clf.score()
112

depth

Return the depth of the decision tree.

Parameters
  • None
Returns
  • int
    The maximum depth of the tree.
Example
>>> import pymurtree
>>> clf = pymurtree.OptimalDecisionTreeClassifier()
...
>>> clf.depth()
4

num_nodes

Return the number of nodes in the tree.

Parameters
  • None
Returns
  • int
    Number of nodes in the tree.
Example
>>> import pymurtree
>>> clf = pymurtree.OptimalDecisionTreeClassifier()
...
>>> clf.num_nodes()
15

export_text

Create a text representation of all the rules in the decision tree. The text is written to out_file if given; otherwise it is printed to standard output.

Parameters
  • out_file: str, optional
    Name of the output file. If None, the result is displayed in the standard output.
  • feature_names: numpy.ndarray, optional
    1D array with the names of the features.
  • class_names: Dict[int, str], optional
    Dictionary with int keys and str values that represent the class names.
Returns
  • None
Example
import pymurtree
import numpy

# Create training data
x = numpy.array([[1, 0, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0]]) # features
y = numpy.array([5, 5, 4]) # labels

# Build tree classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x, y)

# Export tree in text format
ftnames = numpy.array(['has legs', 'has hair', 'can swim', 'can fly']) # feature names
clnames = {
    0 : 'snake',
    1 : 'cat',
    2 : 'crocodile',
    3 : 'horse',
    4 : 'fish',
    5 : 'bird'
} # class names
clf.export_text(feature_names=ftnames, class_names=clnames)
#|---has legs is true
#|   |---bird
#|---has legs is false
#|   |---fish

# Save tree in tree.txt
clf.export_text("tree.txt", ftnames, clnames)

export_dot

Export the decision tree in DOT format. The tree is written to out_file if given; otherwise it is printed to standard output. This function generates a Graphviz representation of the decision tree, from which graphical renderings can be produced with the Graphviz dot tool:

dot -Tpdf tree.dot -o tree.pdf    (PDF format)
dot -Tpng tree.dot -o tree.png    (PNG format)
Parameters
  • out_file: str, optional
    Name of the output file. If None, the result is displayed in the standard output.
  • feature_names: numpy.ndarray, optional
    1D array with the names of the features.
  • class_names: Dict[int, str], optional
    Dictionary with int keys and str values that represent the class names.
Returns
  • None
Example
import pymurtree
import numpy

x = numpy.array([[1, 0, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0]]) # features
y = numpy.array([5, 5, 4]) # labels

# Build tree classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x, y)

# Export tree in DOT format
ftnames = numpy.array(['has legs', 'has hair', 'can swim', 'can fly']) # feature names
clnames = {
    0 : 'snake',
    1 : 'cat',
    2 : 'crocodile',
    3 : 'horse',
    4 : 'fish',
    5 : 'bird'
} # class names
clf.export_dot(feature_names=ftnames, class_names=clnames)
# digraph Tree {
# node [shape=box, style="filled, rounded", fontname="helvetica", fontsize="8"] ;
# edge [fontname="helvetica", fontsize="6"] ;
# 0 [label=<has legs>, color="#8CB77F", fillcolor="#8CB77F"] ;
# 1 [label=<fish>, color="#B77F8C", fillcolor="#B77F8C"] ;
# 0 -> 1 [label=" 0 "] ;
# 2 [label=<bird>, color="#B77F8C", fillcolor="#B77F8C"] ;
# 0 -> 2 [label=" 1 "] ;
# }

# Save tree in tree.dot
clf.export_dot("tree.dot", ftnames, clnames)
# On the command line, render the tree in PNG format using Graphviz:
dot -Tpng tree.dot -o tree.png
open tree.png


Utility functions

readdata.py

Functions

  • load_murtree_dataset_to_pandas_dataframes: read features and labels from a file into a pandas.DataFrame and a pandas.Series
  • load_murtree_dataset_to_numpy_arrays: read features and labels from a file into numpy.ndarray objects

Both functions read datasets from a plain text file with the following format:

  • The first column contains the target variable (y), and the remaining columns contain the features (x).
  • All values should be space-separated integers.
  • All features must be 0 or 1, and class labels must be greater than or equal to zero.
  • No headers or footers are allowed.

These utility functions can be used to easily load the datasets in the murtree-data repository.
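
For illustration, a file in this format holding three instances with four binary features each could look as follows (hypothetical contents):

5 0 1 0 1
5 1 0 0 1
4 1 1 0 0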

load_murtree_dataset_to_pandas_dataframes

Read a dataset of features and labels from a plain text file into a pandas.DataFrame (features) and a pandas.Series (labels).

Parameters
  • path: str
    Path to the dataset.
Returns
  • tuple:
    x (pandas.DataFrame): all columns of the dataset except the first
    y (pandas.Series): the first column of the dataset
Example
# On the command line, clone the murtree-data repository first:
git clone https://github.com/MurTree/murtree-data.git

import pymurtree
from pymurtree.readdata import load_murtree_dataset_to_pandas_dataframes

# Load training dataset
(x, y) = load_murtree_dataset_to_pandas_dataframes("murtree-data/DL/anneal.txt")

# Build classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x.to_numpy(), y.to_numpy())

load_murtree_dataset_to_numpy_arrays

Read a dataset of features and labels from a plain text file into a 2D numpy.ndarray (features) and a 1D numpy.ndarray (labels).

Parameters
  • path: str
    Path to the dataset.
Returns
  • tuple:
    x (numpy.ndarray): 2D array with all columns except the first
    y (numpy.ndarray): 1D array with the first column
Example
# On the command line, clone the murtree-data repository first:
git clone https://github.com/MurTree/murtree-data.git

import pymurtree
from pymurtree.readdata import load_murtree_dataset_to_numpy_arrays

# Load training dataset
(x, y) = load_murtree_dataset_to_numpy_arrays("murtree-data/DL/anneal.txt")

# Build classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x, y)