
Parallel hyperoptimization with MongoDB #1921

Merged

Cmurilochem merged 35 commits into master from mongodb_hyperopt on Mar 8, 2024

Conversation

@Cmurilochem (Collaborator) commented on Jan 26, 2024

Aim

This PR aims to implement parallel hyperoptimization using MongoDB databases and mongo workers. This will enable us to calculate several trials simultaneously.

Strategy

Similarly to FileTrials, the main idea is to implement a MongoFileTrials class that inherits from MongoTrials. This new MongoFileTrials class will then be the one we instantiate before calling hyperopt's fmin.
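
For illustration, a minimal sketch of this idea, assuming hyperopt's MongoTrials as the base class; the extra constructor arguments shown here are hypothetical, not the actual interface:

# Hypothetical sketch, not the actual n3fit implementation: a trials
# class backed by MongoDB, built on hyperopt's MongoTrials.
from hyperopt.mongoexp import MongoTrials

class MongoFileTrials(MongoTrials):
    """Trials object that stores hyperopt results in a MongoDB database."""

    def __init__(self, db_host="localhost", db_port=27017,
                 db_name="hyperopt-db", num_workers=1, **kwargs):
        # Assumed extra state: how many hyperopt-mongo-workers to launch.
        self.num_workers = num_workers
        # MongoTrials expects a mongo:// URL pointing at a jobs collection.
        mongo_url = f"mongo://{db_host}:{db_port}/{db_name}/jobs"
        super().__init__(mongo_url, **kwargs)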

Tasks

  • Implement MongoFileTrials
  • Parse MongoDB option to n3fit command and HyperScanner
  • Adapt hyper_scan_wrapper to allow for parallel evaluation of fmin trials (see the sketch after this list)
  • Add MongoDB and pymongo as dependencies
  • Add unit/integration test
  • Quantify performance improvement
  • Run test on Snellius
  • Add documentation
  • Add restarting options
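
For reference, the standard hyperopt pattern that such parallel evaluation of fmin trials relies on: fmin enqueues jobs in the Mongo-backed trials object, and the mongo workers evaluate them. The objective and search space below are toy placeholders, not the n3fit hyperopt space:

from hyperopt import fmin, hp, tpe
from hyperopt.mongoexp import MongoTrials

def objective(x):
    # With MongoTrials, the objective runs inside the mongo workers,
    # so in practice it must be importable from a module they can see.
    return (x - 3) ** 2

trials = MongoTrials("mongo://localhost:27017/hyperopt-db/jobs", exp_key="exp1")
best = fmin(
    fn=objective,
    space=hp.uniform("x", -10, 10),
    algo=tpe.suggest,
    max_evals=30,
    trials=trials,  # fmin only enqueues jobs; the mongo workers evaluate them
)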

Usage

Local Machine (for simple tests only)

First, make sure that you have MongoDB installed, either via conda (not sure if it is available in the latest conda version) or via apt-get/brew. pymongo is also necessary, but it can be easily installed via pip (and it has already been added as a dependency).

In the latest version of the code in this PR, n3fit is adapted to launch automatically (via internal subprocesses) both mongod (which creates and serves the MongoDB database) and hyperopt-mongo-worker (which launches the mongo workers).
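
A minimal sketch of what this internal subprocessing could look like (the function names, default paths, and port are illustrative assumptions, not the actual n3fit code):

import subprocess

def start_mongod(dbpath="mongodb_database", port=27017):
    # Launch the mongod server that hosts the trials database.
    return subprocess.Popen(["mongod", "--dbpath", dbpath, "--port", str(port)])

def start_mongo_workers(num_workers, db="localhost:27017/hyperopt-db"):
    # Launch num_workers hyperopt-mongo-worker processes; each one
    # pulls queued trials from the database and evaluates them.
    return [
        subprocess.Popen(["hyperopt-mongo-worker", f"--mongo={db}"])
        for _ in range(num_workers)
    ]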

To run parallel hyperopts with n3fit, do:

n3fit hyper-quickcard.yml 1 -r N_replicas -o dir_output_name --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N

where N defines the number of mongo workers you want to launch in parallel; N thereby also sets the number of trials calculated simultaneously. If you want to restart a job, make sure you have dir_output_name in your current path and do:

n3fit hyper-quickcard.yml 1 -r N_replicas -o dir_output_name --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N --restart

Snellius

Here is a complete slurm script showing how we would run a hyperopt experiment in parallel on Snellius (including restarts if needed):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition gpu
#SBATCH --gpus-per-node=4
#SBATCH --time 24:00:00
#SBATCH --output=logs/parallel_slurm-%j.out


# Print job info
echo "Job started on $(hostname) at $(date)"


# conda env
ENVNAME=py_nnpdf-master-gpu

# calc details
RUNCARD="hyper-quickcard.yml"
REPLICAS=2
TRIALS=30
DIR_OUTPUT_NAME="test_hyperopt"
RESTART=false

# number of mongo workers to launch
N_MONGOWORKERS=4


# activate conda environment
source ~/.bashrc
anaconda
conda activate $ENVNAME


# set up cudnn to run on the gpu
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
echo "CUDNN path: $CUDNN_PATH"
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH"
echo "LD_LIBRARY_PATH: $LD_LIBRARY_PATH"

# Verify GPU usage
ngpus=$(python3 -c "import tensorflow as tf; print(len(tf.config.list_physical_devices('GPU')))")
ngpus_list=$(python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))")

echo "List of physical devices '$ngpus_list'"

if [ ${ngpus} -eq 0 ]; then
    echo "GPUs not being used!"
else
    echo "Using GPUs!"
    echo "Num GPUs Available: ${ngpus}"
fi


# Run n3fit

echo "Changing directory to $TMPDIR"
cp "runcards/$RUNCARD" $TMPDIR
if [ ${RESTART} == "true" ]; then
    cp -r $DIR_OUTPUT_NAME $TMPDIR
fi
cd $TMPDIR


echo "Running n3fit..."

if [ ${RESTART} == "true" ]; then

    echo "Restarting job...."
    echo "n3fit '$TMPDIR/$RUNCARD' 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS --restart"

    n3fit "$TMPDIR/$RUNCARD" 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS --restart

else

    echo "n3fit '$TMPDIR/$RUNCARD' 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS"

    n3fit "$TMPDIR/$RUNCARD" 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS

fi


echo "Copying outputs to $SLURM_SUBMIT_DIR ..."
cp -r "$TMPDIR/$DIR_OUTPUT_NAME" $SLURM_SUBMIT_DIR


echo "Returning to $SLURM_SUBMIT_DIR ..."
cd $SLURM_SUBMIT_DIR


echo "Job completed at $(date)"

This would be run by doing (note that sbatch options must come before the script name, otherwise they are passed as arguments to the script):

sbatch --exclusive minimal_parallel_hyperopt.slurm

Here, each of the selected mongo workers (4) sees and runs on one separate GPU, as implemented here:

[Screenshots: top and nvidia-smi output showing 4 workers, one per GPU]

In this run, we are then calculating 4 trials in parallel.

We could also set up our experiment to run 2 mongo workers on each GPU (8 trials in parallel), e.g., by using N_MONGOWORKERS=8 in the script above. In this case, we would observe:

[Screenshots: top and nvidia-smi output showing 8 workers, two per GPU]

Performance assessment

Local Machine

I have just made a very quick test on my local PC to assess the possible performance improvement with parallel hyperopt. I used the hyper-quickcard.yml card from n3fit/tests/regression (with minor modifications) and ran it for 10 trials and 2 replicas, varying the number of simultaneously launched mongo workers. The results are summarised in the figure below:

[Figure: run time vs. number of mongo workers on the local machine]

The results look encouraging a priori.

Snellius

For the Snellius tests, I employed the slurm script above as a template and a more complete runcard.txt. I ran 10 trials with 2 replicas with varying numbers of mongo workers. The final results (after several rounds of fine-tuning the code) are plotted in the figure below:

[Figure: total wall-clock run time vs. number of mongo workers on Snellius]

It shows the variation of the total wall-clock run time of each job as a function of the number of launched mongo workers. The idea is that each mongo worker is responsible for one trial at a time, so the more mongo workers we launch, the more trials we calculate simultaneously.

I also tested launching more than one mongo worker per GPU; see the right (light grey) part of the figure. This is actually where we observe the best performance improvement: as seen, a job with 8 mongo workers (2 per GPU) is nearly 8x faster than a serial hyperopt.

Cmurilochem marked this pull request as draft on January 26, 2024 16:04
Cmurilochem self-assigned this on Jan 26, 2024
Cmurilochem added the labels n3fit (Issues and PRs related to n3fit), escience, and enhancement (New feature or request) on Jan 26, 2024
Cmurilochem force-pushed the mongodb_hyperopt branch 3 times, most recently from f22720e to 1300766, on January 30, 2024 15:02
Cmurilochem force-pushed the mongodb_hyperopt branch 5 times, most recently from 62027b8 to 8b07ebd, on February 6, 2024 14:46
@scarlehoff (Member) commented:

Hi @Cmurilochem do we absolutely need mongodb for this? And if so, is there no pip package (or, at worst, a conda-forge package)?

Using the defaults channel introduces licensing problems.

(If there's no other solution, so be it, but we can't add it to the conda recipe)

@Cmurilochem (Collaborator, Author) commented:

> Hi @Cmurilochem do we absolutely need mongodb for this? And if so, is there no pip package (or, at worst, a conda-forge package)?
>
> Using the defaults channel introduces licensing problems.
>
> (If there's no other solution, so be it, but we can't add it to the conda recipe)

Hi @scarlehoff. Thanks for your help. I created a test for parallel hyperopt and wanted to make it run, and the only way to do that is with mongodb. Yes, I see. Unfortunately there is no pip package for it, only conda. It runs nicely in the "Test python installation" workflow but never in the "Tests" one.

If you suggest it (mainly to avoid adding one more dependency apart from lhapdf and pandoc), I could skip the test for now and remove mongodb from the conda install.

@scarlehoff (Member) commented on Feb 8, 2024

The problem of the dependency is a separate one (if needed, we can add it).

But wouldn't it be possible to use it from conda-forge? https://anaconda.org/conda-forge/mongodb
That way it could be added to the conda recipe.

Edit: otherwise we simply don't add it to the conda recipe, and anyone who wants to run with mongodb will have to procure it themselves. It's not a big problem; I just hoped the conda-forge version worked, but I see it is failing...

@Cmurilochem (Collaborator, Author) commented on Feb 8, 2024

> The problem of the dependency is a separate one (if needed, we can add it).
>
> But wouldn't it be possible to use it from conda-forge? https://anaconda.org/conda-forge/mongodb That way it could be added to the conda recipe.
>
> Edit: otherwise we simply don't add it to the conda recipe, and anyone who wants to run with mongodb will have to procure it themselves. It's not a big problem; I just hoped the conda-forge version worked, but I see it is failing...

Yes... my test with conda install -c conda-forge mongodb --yes did indeed fail. I will try to add it anyway and see what happens. Let's see...

Edit: It worked, surprisingly...

Cmurilochem force-pushed the mongodb_hyperopt branch 3 times, most recently from 7dbd3c3 to 727b378, on February 9, 2024 16:41
@APJansen (Collaborator) commented:

> The tests have shown that there may exist an immense communication overhead, with workers acting in a kind of chaotic way. This is a point that I will investigate in more detail.

Why do you say it's due to communication? That should be very minimal. It seems to me it's memory usage: more than one worker just doesn't fit on one GPU. In the screenshot you posted, you can see that the memory usage is close to 100%. Or... is that this tensorflow thing, where it just reserves all the memory it can?
Even if it is the memory, though, I expect that to improve with coming PRs; we can discuss that later.

> I ran 10 trials with 2 replicas with varying numbers of mongo workers, using one gpu each. The results are plotted in the figure below:

I'm still confused by this plot. First of all, the parallel/sequential is always referring to parallelization over trials, right? Not over replicas?
Is it true that using MongoDB with only one worker is 1.6 times slower per trial than using the old method, for a test job that takes an hour? That seems very strange.
Then the scaling to 2 and 3 GPUs looks good, but it seems to plateau there; any idea why? Communication happens only once per trial, right, so that shouldn't be it. Are they waiting for each other to finish or something?

@scarlehoff (Member) commented:

> Or... is that this tensorflow thing, where it just reserves all the memory it can?

Tensorflow allocates all GPU memory for itself by default. You need to do this to control it: https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
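
The gist of the linked guide, for convenience (this is TensorFlow's documented API):

import tensorflow as tf

# Enable memory growth so TensorFlow allocates GPU memory as needed
# instead of reserving it all upfront; must run before GPU initialization.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)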

@APJansen (Collaborator) commented on Feb 13, 2024

Thanks Juan, I remember we did this before.
Edit: just noticed you already have it, Carlos :)

I quickly checked the effect on performance: with 100 replicas and the production runcard, it made things 1% slower, which may just be random variation.

I don't know if it will solve the problems here, but maybe it makes sense to just always use this @scarlehoff? I can make a small separate PR for it if you agree.

@Cmurilochem (Collaborator, Author) commented on Feb 13, 2024

> Thanks Juan, I remember we did this before. Edit: just noticed you already have it, Carlos :)
>
> I quickly checked the effect on performance: with 100 replicas and the production runcard, it made things 1% slower, which may just be random variation.
>
> I don't know if it will solve the problems here, but maybe it makes sense to just always use this @scarlehoff? I can make a small separate PR for it if you agree.

Hi @APJansen and @scarlehoff. Thanks for your help. I had a spelling mistake while setting TF_FORCE_GPU_ALLOW_GROWTH: instead of my_env["TF_FORCE_GPU_ALLOW_GROWTH"] = "True", this should be my_env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true". Because of this, the environment variable was not recognized and I was having memory fragmentation problems. It seems to be fine for me now: I am able to run 2 workers on the same GPU.
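
To spell out the fix (illustrative; only the variable name my_env is taken from the comment above):

import os

# Environment variable values are strings, and TensorFlow only
# recognizes the exact lowercase value "true" for this setting.
my_env = os.environ.copy()
my_env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # "True" is not recognized

# The patched environment is then passed to the spawned workers, e.g.:
# subprocess.Popen(["hyperopt-mongo-worker", ...], env=my_env)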

Just to illustrate the idea, the output of nvidia-smi gives me:

  • for the case of two workers running on two different GPUs:
    [Screenshot: nvidia-smi showing one mongo worker on each of two GPUs]
  • for two workers running on the same GPU:
    [Screenshot: nvidia-smi showing two mongo workers on one GPU]

@APJansen (Collaborator) commented:

Ah great :) Looks promising, and it's actually still roughly twice as fast as one worker per GPU?

I saw when I was testing for #1936 that GPU usage is often around 90% at 100 replicas, but still (with the changes there) I was able to run 500 replicas as well, with still better-than-linear scaling in the number of replicas. So that's not a hard limit, somehow, and perhaps we'll be able to run with even more than 2 workers per GPU.
But I would say for this PR, just make sure it's configurable and working with 2, and we'll test the limits later.

Cmurilochem and others added 27 commits on March 8, 2024 07:44

Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>
Cmurilochem merged commit 2ec03d5 into master on Mar 8, 2024
Cmurilochem deleted the mongodb_hyperopt branch on March 8, 2024 07:59