This repository contains experiments with LLMs on transactional datasets (Rosbank, Age, Gender) for the paper **LLM4ES: Learning User Embeddings from Event Sequences via Large Language Models**.

If you use the code in this repository in your own work, please cite the publication:
```bibtex
@inproceedings{10.1145/3746252.3760828,
author = {Shestov, Aleksei and Zoloev, Omar and Makarenko, Maksim and Orlov, Mikhail and Fadeev, Egor and Kireev, Ivan and Savchenko, Andrey},
title = {LLM4ES: Learning User Embeddings from Event Sequences via Large Language Models},
year = {2025},
isbn = {9798400720406},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746252.3760828},
doi = {10.1145/3746252.3760828},
booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
pages = {5238–5242},
numpages = {5},
keywords = {event sequences, llms, user embeddings},
location = {Seoul, Republic of Korea},
series = {CIKM '25}
}
```
This paper presents LLM4ES, a novel framework that exploits large pre-trained language models (LLMs) to derive user embeddings from event sequences. Event sequences are transformed into a textual representation, which is subsequently used to fine-tune an LLM through next-token prediction to generate high-quality embeddings. We introduce a text enrichment technique that enhances LLM adaptation to event sequence data, improving representation quality for low-variability domains. Experimental results demonstrate that LLM4ES achieves state-of-the-art performance in user classification tasks in financial and other domains, outperforming existing embedding methods. The resulting user embeddings can be incorporated into a wide range of applications, from user segmentation in finance to patient outcome prediction in healthcare.
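To make the text conversion step concrete, here is a purely illustrative sketch; the field names (`day`, `amount`, `mcc`) and the sentence template are assumptions, and the real converters live in `source/llm4trx/convert_to_text.py` and `converters.py` (described below):

```python
# Illustrative event-sequence-to-text conversion; field names and the
# sentence template are hypothetical, not the repo's actual format.
from typing import Dict, List

def transactions_to_text(transactions: List[Dict]) -> str:
    """Render one user's transaction history as a single line of text."""
    parts = [
        f"on day {trx['day']} the client spent {trx['amount']} in category {trx['mcc']}"
        for trx in transactions
    ]
    return "; ".join(parts) + "."

example = [
    {"day": 1, "amount": 12.5, "mcc": 5411},
    {"day": 3, "amount": 40.0, "mcc": 5812},
]
print(transactions_to_text(example))
# on day 1 the client spent 12.5 in category 5411; on day 3 the client spent 40.0 in category 5812.
```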
In this repo:
- `source/` - source code
  - `llm4trx/` - HF-style multi-GPU LLM training
    - `augmentation.py` - launches vLLM to generate augmentations (see the sketch after this list)
    - `pretrain.py` - next-token-prediction LLM training
    - `inference.py` - multi-GPU LLM inference
    - `convert_to_text.py` - converter to the base text format, saved as JSONL (for later conversion to Streaming)
    - `converters.py` - converters to different formats
    - `src/`
      - `dataset.py` - everything related to dataset processing: text conversion, DataLoader creation, etc.
      - `dataset_hf.py` - dataset version for HF-style training and augmentations
      - `utils.py` - utilities for loading models and embeddings, counting model parameters, etc.
    - `run.sh` - runs the entire pipeline based on HF Transformers
  - `llm-foundry/` - fastest multi-GPU LLM training
  - `ptls-experiments/` - data & downstream embeddings validation
  - `scripts/` - scripts for running experiments and augmentations
    - `convert_to_text.sh` - converts arrays of transactions into text format, then into MosaicML Streaming format
    - `train.sh` - multi-GPU LLM training for next-token prediction
    - `model_convertation.sh` - converts a model from MosaicML Composer format into HF Transformers format
    - `inference.sh` - multi-GPU inference
    - `run.sh` - full pipeline with a given seed
    - `run_multi_seed.sh` - full pipeline with multiple seeds
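Since `augmentation.py` launches vLLM to enrich the textualized sequences, here is a minimal sketch of such a call; the model name and prompt template are assumptions, not necessarily what the script uses:

```python
# Minimal vLLM text-augmentation sketch; model choice and prompt are hypothetical.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = [
    "Paraphrase the following transaction history, keeping all numbers intact:\n"
    "on day 1 the client spent 12.5 in category 5411."
]
for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text)
```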
The three main configs (one per dataset) are located in:

`source/llm-foundry/scripts/train/yamls/pretrain`

The HF-style configs (currently used for augmentations) are located in:

`source/llm4trx/config`
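Both sets are plain Hydra YAMLs consumed through the `--config-dir`/`--config-name` flags and `++key=value` overrides seen in the scripts below. A minimal sketch of the entrypoint pattern (the decorator arguments and config keys here are assumptions for illustration):

```python
# Minimal Hydra entrypoint sketch; config_path/config_name and the
# `variables.work_dir` key are assumptions mirroring the shell scripts below.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="config", config_name="rosbank", version_base=None)
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))    # the fully resolved config
    print(cfg.variables.work_dir)    # overridable via `variables.work_dir=...`

if __name__ == "__main__":
    main()
```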
By default llm-foundry uses argparse, which is not very convenient.
I rewrote part of their code to make it possible to use Hydra and configs more easily.
I also added to their ConcatTokensDataset the ability to truncate by max_length instead of only concat_tokens packing (the library targets pretraining LLMs from scratch, where concat_tokens packing is the natural choice, so per-sample max_length truncation had to be added for our setting).
These are the only differences between the original library and my fork used in this repo.
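A minimal sketch of the difference between the two strategies (illustrative only, not the fork's actual ConcatTokensDataset code):

```python
# concat_tokens packs all tokenized documents back to back into fixed-size
# blocks; max_length truncation keeps one (clipped) document per sample.
from typing import Iterable, List

def concat_tokens(docs: Iterable[List[int]], block_size: int) -> List[List[int]]:
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for doc in docs:
        buffer.extend(doc)
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks  # the trailing remainder is dropped here for simplicity

def truncate_to_max_length(docs: Iterable[List[int]], max_length: int) -> List[List[int]]:
    return [doc[:max_length] for doc in docs]
```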
Some differences between the two training variants:

| Parameter | transformers | llm-foundry |
|---|---|---|
| augmentations | vLLM | vLLM |
| dataset | `.csv` converted into an HF dataset | `.jsonl` converted into a Streaming dataset |
| FSDP | - | + |
| inference | multi-GPU Accelerate | multi-GPU Accelerate |
| model | Hugging Face `AutoModel` | MosaicML Composer |
| training time (Rosbank dataset) | 2.5 h | 1.3 h |
| extensibility | highly flexible | hard to add new features without rewriting |
| used for initial experiments | + | - |
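Both variants extract embeddings with multi-GPU Accelerate. A minimal sketch of mean-pooled embedding extraction (model path, pooling strategy, and batch contents are assumptions; the actual logic lives in `source/llm4trx/inference.py`):

```python
# Launch with: accelerate launch extract_embeddings.py
# All paths and the pooling choice below are hypothetical.
import torch
from accelerate import Accelerator
from transformers import AutoModel, AutoTokenizer

accelerator = Accelerator()
tokenizer = AutoTokenizer.from_pretrained("path/to/converted-hf-model")
if tokenizer.pad_token is None:           # decoder-only models often lack one
    tokenizer.pad_token = tokenizer.eos_token
model = accelerator.prepare(AutoModel.from_pretrained("path/to/converted-hf-model"))
model.eval()

texts = ["on day 1 the client spent 12.5 in category 5411."]  # one user = one text
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
batch = {k: v.to(accelerator.device) for k, v in batch.items()}

with torch.no_grad():
    hidden = model(**batch).last_hidden_state                 # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    emb = (hidden * mask).sum(1) / mask.sum(1)                # mean-pooled user embeddings

emb = accelerator.gather(emb)  # collect results from all GPUs
```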
Docker image: https://hub.docker.com/orgs/mosaicml/repositories

```bash
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry
pip install -e ".[gpu]"

git clone https://github.com/tsebaka/LLM4Trx-research.git
cd LLM4Trx-research
cd source
# prep ptls-experiments
cd ptls-experiments
python3 -m venv ptls-venv
source ptls-venv/bin/activate
pip install pytorch-lifestream
cd ..
# prep llm-foundry
cd llm-foundry
python3 -m venv llmfoundry-venv
source llmfoundry-venv/bin/activate
pip install cmake packaging torch
pip install -e ".[gpu]"
pip install deepspeed==0.15.4
cd ..
```

Full llm-foundry pipeline (see `scripts/run.sh`):

```bash
WORK_DIR=$HOME/zoloev-city/exp_name
CONFIG_DIR=$WORK_DIR/source/llm-foundry/scripts/train/yamls/pretrain
source $WORK_DIR/source/llm-foundry/llmfoundry-venv/bin/activate
export WANDB_API_KEY=""
export WANDB_PROJECT="llm4trx"
export WANDB_DIR=$WORK_DIR/checkpoints
CONFIG=config_name
echo "========== starting... $CONFIG =========="
echo "========== convert to text... =========="
python $WORK_DIR/source/llm4trx/convert_to_text.py \
--config-dir $CONFIG_DIR \
--config-name $CONFIG \
variables.work_dir=$WORK_DIR
echo "========== convert to streaming... =========="
python $WORK_DIR/source/llm-foundry/scripts/data_prep/convert_dataset_json.py \
--config-dir $CONFIG_DIR \
--config-name $CONFIG \
variables.work_dir=$WORK_DIR
echo "========== training llm foundry... =========="
composer $WORK_DIR/source/llm-foundry/scripts/train/train.py \
$CONFIG_DIR/$CONFIG \
variables.work_dir=$WORK_DIR
echo "========== convert model to hf... =========="
python $WORK_DIR/source/llm-foundry/scripts/inference/convert_composer_to_hf.py \
--config-dir $CONFIG_DIR \
--config-name $CONFIG \
variables.work_dir=$WORK_DIR
echo "========== inference... =========="
accelerate launch $WORK_DIR/source/llm4trx/inference.py \
--config-dir $CONFIG_DIR \
--config-name $CONFIG \
variables.work_dir=$WORK_DIR
echo "========== completed! =========="eval "$(conda shell.bash hook)"
HF Transformers pipeline (see `source/llm4trx/run.sh`):

```bash
eval "$(conda shell.bash hook)"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export WANDB_PROJECT="llm4trx"
exp_name=config_name
log_dir=.../${exp_name}
checkpoint="checkpoint-id"
config_name="${exp_name}.yaml"
export WANDB_DIR=$log_dir
source llmfoundry-venv/bin/activate  # for the HF-style pipeline you will need to additionally install the required transformers version
# llm text augmentations
python -m dataset_preparing \
--config-dir config \
--config-name ${config_name} \
++exp_name=${exp_name} \
++log_dir=${log_dir} \
++dataset.presave=false
# ntp train
accelerate launch sft_train.py \
--config-dir config \
--config-name ${config_name} \
++exp_name=${exp_name} \
++log_dir=${log_dir} \
++dataset.presave=false
# inference
checkpoint="checkpoint-X"
accelerate launch inference.py \
--config-dir config \
--config-name ${config_name} \
++exp_name=${exp_name} \
++log_dir=${log_dir} \
++checkpoint=${checkpoint} \
++dataset.presave=false
# downstream validation
conda activate kaggle_kernel
cd .../ptls-experiments/scenario_rosbank
rm -r embeddings_validation.work/
pipenv run python -m embeddings_validation \
--config-dir conf \
--config-name embeddings_validation_baselines_unsupervised \
+workers=10 \
+total_cpu_count=20 \
++report_file=".../checkpoints-logs/${exp_name}/experiment_name.txt"
conda deactivate
```
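Before running the full ptls-experiments validation, the extracted embeddings can be sanity-checked with scikit-learn. Everything here is hypothetical (file names, formats, and the label column depend on your setup):

```python
# Quick downstream check: 5-fold ROC-AUC of a linear probe on the embeddings.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

emb = np.load("embeddings.npy")                # hypothetical: (n_users, hidden_size)
labels = pd.read_csv("targets.csv")["target"]  # hypothetical downstream label per user

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, emb, labels, cv=5, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```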
Hardware used for the experiments:
- 8x NVIDIA A100 GPUs (80 GB HBM2e per GPU)
- 1 TB of DDR4 RAM



