
Parallel hyperoptimization with MongoDB #1921

Merged

Cmurilochem merged 35 commits into master from mongodb_hyperopt on Mar 8, 2024

Conversation

@Cmurilochem (Collaborator) commented on Jan 26, 2024

Aim

This PR aims to implement parallel hyperoptimization using MongoDB databases and mongo workers. This will enable us to calculate several trials simultaneously.

Strategy

Similarly to FileTrials, the main idea is to implement a MongoFileTrials class that inherits from MongoTrials. This new MongoFileTrials class will then be the one we instantiate before calling hyperopt's fmin.
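
For illustration, a minimal sketch of this idea, assuming hyperopt's MongoTrials as the base class; the extra constructor arguments shown here are hypothetical, not the actual interface:

# Hypothetical sketch, not the actual n3fit implementation: a trials
# class backed by MongoDB, built on hyperopt's MongoTrials.
from hyperopt.mongoexp import MongoTrials

class MongoFileTrials(MongoTrials):
    """Trials object that stores hyperopt results in a MongoDB database."""

    def __init__(self, db_host="localhost", db_port=27017,
                 db_name="hyperopt-db", num_workers=1, **kwargs):
        # Assumed extra state: how many hyperopt-mongo-workers to launch.
        self.num_workers = num_workers
        # MongoTrials expects a mongo:// URL pointing at a jobs collection.
        mongo_url = f"mongo://{db_host}:{db_port}/{db_name}/jobs"
        super().__init__(mongo_url, **kwargs)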

Tasks

  • Implement MongoFileTrials
  • Parse MongoDB option to n3fit command and HyperScanner
  • Adapt hyper_scan_wrapper to allow for parallel evaluation of fmin trials (see the sketch after this list)
  • Add MongoDB and pymongo as dependencies
  • Add unit/integration test
  • Quantify performance improvement
  • Run test on Snellius
  • Add documentation
  • Add restarting options
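
For reference, the standard hyperopt pattern that such parallel evaluation of fmin trials relies on: fmin enqueues jobs in the Mongo-backed trials object, and the mongo workers evaluate them. The objective and search space below are toy placeholders, not the n3fit hyperopt space:

from hyperopt import fmin, hp, tpe
from hyperopt.mongoexp import MongoTrials

def objective(x):
    # With MongoTrials, the objective runs inside the mongo workers,
    # so in practice it must be importable from a module they can see.
    return (x - 3) ** 2

trials = MongoTrials("mongo://localhost:27017/hyperopt-db/jobs", exp_key="exp1")
best = fmin(
    fn=objective,
    space=hp.uniform("x", -10, 10),
    algo=tpe.suggest,
    max_evals=30,
    trials=trials,  # fmin only enqueues jobs; the mongo workers evaluate them
)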

Usage

Local Machine (for simple tests only)

First, make sure that you have MongoDB installed, either via conda (not sure if it is available in the latest conda version) or via apt-get/brew. pymongo is also necessary, but it can be easily installed via pip (and it has already been added as a dependency).

In the latest version of the code in this PR, n3fit is adapted to launch automatically (via internal subprocesses) both mongod (which creates and serves the MongoDB database) and hyperopt-mongo-worker (which launches the mongo workers).
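
A minimal sketch of what this internal subprocessing could look like (the function names, default paths, and port are illustrative assumptions, not the actual n3fit code):

import subprocess

def start_mongod(dbpath="mongodb_database", port=27017):
    # Launch the mongod server that hosts the trials database.
    return subprocess.Popen(["mongod", "--dbpath", dbpath, "--port", str(port)])

def start_mongo_workers(num_workers, db="localhost:27017/hyperopt-db"):
    # Launch num_workers hyperopt-mongo-worker processes; each one
    # pulls queued trials from the database and evaluates them.
    return [
        subprocess.Popen(["hyperopt-mongo-worker", f"--mongo={db}"])
        for _ in range(num_workers)
    ]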

To run parallel hyperopts with n3fit, do:

n3fit hyper-quickcard.yml 1 -r N_replicas -o dir_output_name --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N

where N defines the number of mongo workers you want to launch in parallel; N thereby also sets the number of trials calculated simultaneously. If you want to restart a job, make sure you have dir_output_name in your current path and do:

n3fit hyper-quickcard.yml 1 -r N_replicas -o dir_output_name --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N --restart

Snellius

Here is a complete slurm script showing how we would run a hyperopt experiment in parallel on Snellius (including restarts if needed):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition gpu
#SBATCH --gpus-per-node=4
#SBATCH --time 24:00:00
#SBATCH --output=logs/parallel_slurm-%j.out


# Print job info
echo "Job started on $(hostname) at $(date)"


# conda env
ENVNAME=py_nnpdf-master-gpu

# calc details
RUNCARD="hyper-quickcard.yml"
REPLICAS=2
TRIALS=30
DIR_OUTPUT_NAME="test_hyperopt"
RESTART=false

# number of mongo workers to launch
N_MONGOWORKERS=4


# activate conda environment
source ~/.bashrc
anaconda
conda activate $ENVNAME


# set up cudnn to run on the gpu
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
echo "CUDNN path: $CUDNN_PATH"
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH"
echo "LD_LIBRARY_PATH: $LD_LIBRARY_PATH"

# Verify GPU usage
ngpus=$(python3 -c "import tensorflow as tf; print(len(tf.config.list_physical_devices('GPU')))")
ngpus_list=$(python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))")

echo "List of physical devices '$ngpus_list'"

if [ ${ngpus} -eq 0 ]; then
    echo "GPUs not being used!"
else
    echo "Using GPUs!"
    echo "Num GPUs Available: ${ngpus}"
fi


# Run n3fit

echo "Changing directory to $TMPDIR"
cp "runcards/$RUNCARD" $TMPDIR
if [ ${RESTART} == "true" ]; then
    cp -r $DIR_OUTPUT_NAME $TMPDIR
fi
cd $TMPDIR


echo "Running n3fit..."

if [ ${RESTART} == "true" ]; then

    echo "Restarting job...."
    echo "n3fit '$TMPDIR/$RUNCARD' 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS --restart"

    n3fit "$TMPDIR/$RUNCARD" 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS --restart

else

    echo "n3fit '$TMPDIR/$RUNCARD' 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS"

    n3fit "$TMPDIR/$RUNCARD" 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS

fi


echo "Copying outputs to $SLURM_SUBMIT_DIR ..."
cp -r "$TMPDIR/$DIR_OUTPUT_NAME" $SLURM_SUBMIT_DIR


echo "Returning to $SLURM_SUBMIT_DIR ..."
cd $SLURM_SUBMIT_DIR


echo "Job completed at $(date)"

This would be run by doing (note that sbatch options must come before the script name, otherwise they are passed as arguments to the script):

sbatch --exclusive minimal_parallel_hyperopt.slurm

Here, each of the selected mongo workers (4) sees and runs on one separate GPU, as implemented here:

[Screenshots: top and nvidia-smi output showing 4 workers, one per GPU]

In this run, we are then calculating 4 trials in parallel.

We could also set up our experiment to run 2 mongo workers on each GPU (8 trials in parallel), e.g., by using N_MONGOWORKERS=8 in the script above. In this case, we would observe:

[Screenshots: top and nvidia-smi output showing 8 workers, two per GPU]

Performance assessment

Local Machine

I have just made a very quick test on my local PC to assess the possible performance improvement with parallel hyperopt. I used the hyper-quickcard.yml card from n3fit/tests/regression (with minor modifications) and ran it for 10 trials and 2 replicas, varying the number of simultaneously launched mongo workers. The results are summarised in the figure below:

[Figure: run time vs. number of mongo workers on the local machine]

The results look encouraging a priori.

Snellius

For the Snellius tests, I employed the slurm script above as a template and a more complete runcard.txt. I ran 10 trials with 2 replicas with varying numbers of mongo workers. The final results (after several rounds of fine-tuning the code) are plotted in the figure below:

[Figure: total wall-clock run time vs. number of mongo workers on Snellius]

It shows the variation of the total wall-clock run time of each job as a function of the number of launched mongo workers. The idea is that each mongo worker is responsible for one trial at a time, so the more mongo workers we launch, the more trials we calculate simultaneously.

I also tested launching more than one mongo worker per GPU; see the right (light grey) part of the figure. This is actually where we observe the best performance improvement: as seen, a job with 8 mongo workers (2 per GPU) is nearly 8x faster than a serial hyperopt.

Cmurilochem marked this pull request as draft on January 26, 2024 16:04
Cmurilochem self-assigned this on Jan 26, 2024
Cmurilochem added the labels n3fit (Issues and PRs related to n3fit), escience, and enhancement (New feature or request) on Jan 26, 2024
Cmurilochem force-pushed the mongodb_hyperopt branch 3 times, most recently from f22720e to 1300766, on January 30, 2024 15:02
Cmurilochem force-pushed the mongodb_hyperopt branch 5 times, most recently from 62027b8 to 8b07ebd, on February 6, 2024 14:46
@scarlehoff (Member) commented:

Hi @Cmurilochem do we absolutely need mongodb for this? And if so, is there no pip package (or, at worst, a conda-forge package)?

Using the defaults channel introduces licensing problems.

(If there's no other solution, so be it, but we can't add it to the conda recipe)

@Cmurilochem (Collaborator, Author) commented:

> Hi @Cmurilochem do we absolutely need mongodb for this? And if so, is there no pip package (or, at worst, a conda-forge package)?
>
> Using the defaults channel introduces licensing problems.
>
> (If there's no other solution, so be it, but we can't add it to the conda recipe)

Hi @scarlehoff. Thanks for your help. I created a test for parallel hyperopt and wanted to make it run, and the only way to do that is with mongodb. Yes, I see. Unfortunately there is no pip package for it, only conda. It runs nicely in the "Test python installation" workflow but never in the "Tests" one.

If you suggest it (mainly to avoid adding one more dependency apart from lhapdf and pandoc), I could skip the test for now and remove mongodb from the conda install.

@scarlehoff (Member) commented on Feb 8, 2024

The problem of the dependency is a separate one (if needed, we can add it).

But wouldn't it be possible to use it from conda-forge? https://anaconda.org/conda-forge/mongodb
That way it could be added to the conda recipe.

Edit: otherwise we simply don't add it to the conda recipe, and anyone who wants to run with mongodb will have to procure it themselves. It's not a big problem; I just hoped the conda-forge version worked, but I see it is failing...

@Cmurilochem (Collaborator, Author) commented on Feb 8, 2024

> The problem of the dependency is a separate one (if needed, we can add it).
>
> But wouldn't it be possible to use it from conda-forge? https://anaconda.org/conda-forge/mongodb That way it could be added to the conda recipe.
>
> Edit: otherwise we simply don't add it to the conda recipe, and anyone who wants to run with mongodb will have to procure it themselves. It's not a big problem; I just hoped the conda-forge version worked, but I see it is failing...

Yes... my test with conda install -c conda-forge mongodb --yes did indeed fail. I will try to add it anyway and see what happens. Let's see...

Edit: It worked, surprisingly...

Cmurilochem force-pushed the mongodb_hyperopt branch 3 times, most recently from 7dbd3c3 to 727b378, on February 9, 2024 16:41
@APJansen (Collaborator) commented:

> The tests have shown that there may exist an immense communication overhead, with workers acting in a kind of chaotic way. This is a point that I will investigate in more detail.

Why do you say it's due to communication? That should be very minimal. It seems to me it's memory usage: more than one worker just doesn't fit on one GPU. In the screenshot you posted, you can see that the memory usage is close to 100%. Or... is that this tensorflow thing, where it just reserves all the memory it can?
Even if it is the memory, though, I expect that to improve with coming PRs; we can discuss that later.

> I ran 10 trials with 2 replicas with varying numbers of mongo workers, using one gpu each. The results are plotted in the figure below:

I'm still confused by this plot. First of all, the parallel/sequential is always referring to parallelization over trials, right? Not over replicas?
Is it true that using MongoDB with only one worker is 1.6 times slower per trial than using the old method, for a test job that takes an hour? That seems very strange.
Then the scaling to 2 and 3 GPUs looks good, but it seems to plateau there; any idea why? Communication happens only once per trial, right, so that shouldn't be it. Are they waiting for each other to finish or something?

@scarlehoff (Member) commented:

> Or... is that this tensorflow thing, where it just reserves all the memory it can?

Tensorflow allocates all GPU memory for itself by default. You need to do this to control it: https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
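
The gist of the linked guide, for convenience (this is TensorFlow's documented API):

import tensorflow as tf

# Enable memory growth so TensorFlow allocates GPU memory as needed
# instead of reserving it all upfront; must run before GPU initialization.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)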

@APJansen (Collaborator) commented on Feb 13, 2024

Thanks Juan, I remember we did this before.
Edit: just noticed you already have it, Carlos :)

I quickly checked the effect on performance: with 100 replicas and the production runcard, it made things 1% slower, which may just be random variation.

I don't know if it will solve the problems here, but maybe it makes sense to just always use this @scarlehoff? I can make a small separate PR for it if you agree.

@Cmurilochem (Collaborator, Author) commented on Feb 13, 2024

> Thanks Juan, I remember we did this before. Edit: just noticed you already have it, Carlos :)
>
> I quickly checked the effect on performance: with 100 replicas and the production runcard, it made things 1% slower, which may just be random variation.
>
> I don't know if it will solve the problems here, but maybe it makes sense to just always use this @scarlehoff? I can make a small separate PR for it if you agree.

Hi @APJansen and @scarlehoff. Thanks for your help. I had a spelling mistake while setting TF_FORCE_GPU_ALLOW_GROWTH: instead of my_env["TF_FORCE_GPU_ALLOW_GROWTH"] = "True", this should be my_env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true". Because of this, the environment variable was not recognized and I was having memory fragmentation problems. It seems to be fine for me now: I am able to run 2 workers on the same GPU.
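
To spell out the fix (illustrative; only the variable name my_env is taken from the comment above):

import os

# Environment variable values are strings, and TensorFlow only
# recognizes the exact lowercase value "true" for this setting.
my_env = os.environ.copy()
my_env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # "True" is not recognized

# The patched environment is then passed to the spawned workers, e.g.:
# subprocess.Popen(["hyperopt-mongo-worker", ...], env=my_env)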

Just to illustrate the idea, the output of nvidia-smi gives me:

  • for the case of two workers running on two different GPUs:
    [Screenshot: nvidia-smi showing one mongo worker on each of two GPUs]
  • for two workers running on the same GPU:
    [Screenshot: nvidia-smi showing two mongo workers on one GPU]

@APJansen (Collaborator) commented:

Ah great :) Looks promising, and it's actually still roughly twice as fast as one worker per GPU?

I saw when I was testing for #1936 that GPU usage is often around 90% at 100 replicas, but still (with the changes there) I was able to run 500 replicas as well, with still better-than-linear scaling in the number of replicas. So that's not a hard limit, somehow, and perhaps we'll be able to run with even more than 2 workers per GPU.
But I would say for this PR, just make sure it's configurable and working with 2, and we'll test the limits later.

Cmurilochem and others added 27 commits on March 8, 2024 07:44

Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>
Cmurilochem merged commit 2ec03d5 into master on Mar 8, 2024
Cmurilochem deleted the mongodb_hyperopt branch on March 8, 2024 07:59