LLM Boost: LLMs Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

This repository is built on top of the Tablet codebase. The original README at https://github.com/dylan-slack/Tablet contains useful information on installation and LLM inference.

Installation

Please follow the instructions in the Installation section of the original README (https://github.com/dylan-slack/Tablet) to create a conda environment and install Tablet. Next, install the additional dependencies into your environment:

pip install -r requirements.txt

LLM Inference

Use the following command to run inference and store the LLM scores; it uses Hugging Face to download the specified LLM. See the data folder for the datasets prepared for language model inference. You can use the instructions in the original Tablet README to create new datasets in this format. The repo currently works with, and is tested on, the following families of LLMs: Flan-T5, Llama-3, and Qwen-2.5.

python llm_infer.py --seed 0 --k_shot 3 --name Adult --model google/flan-t5-base

This command saves new train and test CSV files, with LLM scores for each row, to ./data/dataset_model_n-shot_seed (in this case ./data/Adult_flan-t5-base_3-shot_0). The full list of dataset inference commands is given in launch_llm_infer.sh.
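The naming pattern above can be reproduced with a small helper when you need to locate the output folder for a given run. This is only a sketch: the function name `llm_score_dir` is hypothetical and not part of the repository, and it assumes (based on the single example above) that only the part of the model id after the last "/" appears in the folder name.

```python
def llm_score_dir(name: str, model: str, k_shot: int, seed: int,
                  data_root: str = "./data") -> str:
    """Build the output folder path following the pattern dataset_model_n-shot_seed."""
    # Assumption: the organisation prefix of the Hugging Face model id is dropped,
    # as in "google/flan-t5-base" -> "flan-t5-base" in the README example.
    model_short = model.split("/")[-1]
    return f"{data_root}/{name}_{model_short}_{k_shot}-shot_{seed}"


# Mirrors the flags of the example command: --name Adult --model google/flan-t5-base
print(llm_score_dir("Adult", "google/flan-t5-base", k_shot=3, seed=0))
# prints ./data/Adult_flan-t5-base_3-shot_0
```

The returned path is what you would pass as --data_path to the boosting scripts.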

LLM-Boost Experiments

After generating the LLM scores, use the following command to run our boosting experiments. This runs standard XGBoost (with hyperparameter tuning) as well as the additional LLM-Boost hyperparameter tuning.

python xgb_llm.py --seed 0 --data_path ./data/Adult_flan-t5-base_3-shot_0 --train_size 20 --cv_folds 40

A sample list of training commands for all datasets is given in launch_xgb_llm.sh.
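If you want to run the same experiment at several training-set sizes, the command lines can be generated rather than typed by hand. The sketch below only builds and prints the commands; it uses the flags shown in the example command and nothing else. The helper name `xgb_llm_commands` and the training sizes in the usage line are hypothetical and are not taken from launch_xgb_llm.sh.

```python
import shlex


def xgb_llm_commands(data_path, train_sizes, seed=0, cv_folds=40):
    """Return one xgb_llm.py command line per training-set size."""
    commands = []
    for size in train_sizes:
        args = [
            "python", "xgb_llm.py",
            "--seed", str(seed),
            "--data_path", data_path,
            "--train_size", str(size),
            "--cv_folds", str(cv_folds),
        ]
        # shlex.join quotes arguments where needed so each line is safe to paste into a shell.
        commands.append(shlex.join(args))
    return commands


# Hypothetical sweep over three training sizes for the Adult example.
for cmd in xgb_llm_commands("./data/Adult_flan-t5-base_3-shot_0", [10, 20, 50]):
    print(cmd)
```

Printing instead of executing keeps the sketch free of side effects; the output can be redirected into a shell script or passed to a job scheduler.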

PFN-Boost Experiments

For our PFN-Boost experiments, use the following command.

python xgb_tabpfn.py --seed 0 --data_path ./data/Adult/prototypes-naturallanguage-performance-0 --train_size 20 --cv_folds 40

A sample list of training commands for all datasets is given in launch_xgb_tabpfn.sh.

LLM+LGBM Experiments

Similar to the above, use the following command for the LLM+LGBM experiments.

python lgbm_llm.py --seed 0 --data_path ./data/Adult_flan-t5-base_3-shot_0 --train_size 20 --cv_folds 40

A sample list of training commands for all datasets is given in launch_lgbm_llm.sh.
