
Conversation

@cjh1
Contributor

@cjh1 cjh1 commented Jan 30, 2023

A brief description of the purpose of the changes contained in this PR.

This PR adds a new configuration environment variable (CONTAINER_ENGINE_EXECUTABLE) to allow the compute worker to be run with other container technology such as podman. It also provides a rootless podman Containerfile.
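As a rough sketch, the variable could be set when starting the worker container itself, for example (the exact startup mechanism will depend on your deployment; the image name and base flags here are the ones used for the rootless podman worker later in this thread):

 podman run -it --security-opt label=disable --device /dev/fuse --user worker \
   -e CONTAINER_ENGINE_EXECUTABLE=podman \
   codabench_compute_worker_podman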

Issues this PR resolves

Allow compute worker to run with podman

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • Ready to merge

@Didayolo
Member

Thank you for the pull request. Rest assured that we will review it soon.

@cjh1
Contributor Author

cjh1 commented Jan 31, 2023

This is the podman invocation used to run the worker:

 podman run -it --security-opt label=disable --device /dev/fuse --user worker codabench_compute_worker_podman

cjh1 added 4 commits January 31, 2023 16:42
This is in preparation for generalizing container engine support
to allow the use of podman.
Add support for configuring the container engine through an
environment variable (CONTAINER_ENGINE_EXECUTABLE).
Docker will create them, but other container engines like podman
may not.
This Containerfile allows rootless Podman in Podman (PINP).
@cjh1
Contributor Author

cjh1 commented Jan 31, 2023

I can confirm that I have tested with the docker worker, to ensure that these changes don't break running with docker.

@dtuantran
Contributor

We get this kind of error when processing a submission:

[2023-02-01 16:45:00,934: INFO/ForkPoolWorker-1] Running program = podman run --rm --name=cd633d2a-b385-4a7c-a9d8-cccdd8cd6d97 --security-opt=no-new-privileges -v /codabench/storage/tmph0ibs9ca/ingestion_program:/app/program -v /codabench/storage/tmph0ibs9ca/output:/app/output -w /app/program -e PYTHONUNBUFFERED=1 -v /codabench/storage/tmph0ibs9ca/program:/app/ingested_program -v /codabench/storage/tmph0ibs9ca/input_data:/app/input_data codalab/codalab-legacy:py37 python /app/program/ingestion.py /app/input_data /app/output /app/program /app/ingested_program
[2023-02-01 16:45:00,939: INFO/ForkPoolWorker-1] Connecting to wss://codabench-test.lri.fr/submission_input/5/496/8c8fcb5e-c150-48e2-9d9a-e2667492fac1/
[2023-02-01 16:45:01,282: WARNING/ForkPoolWorker-1] WS: b"python: can't open file '/app/program/ingestion.py': [Errno 2] No such file or directory\n"
...
[2023-02-01 16:45:01,474: ERROR/ForkPoolWorker-1] Task compute_worker_run[5e5b8040-7372-44b9-98df-0294a14a9fef] raised unexpected: FileNotFoundError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/worker/compute_worker/compute_worker.py", line 95, in run_wrapper
    run.push_output()
  File "/home/worker/compute_worker/compute_worker.py", line 779, in push_output
    with open(metadata_path, 'w') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/codabench/tmph0ibs9ca/output/metadata'

I think it's because the volume mapping /codabench/storage:/codabench is missing, as mentioned here.

After adding the volume mapping option -v /codabench/storage:/codabench, I got another error: Permission denied

[2023-02-01 16:18:05,644: ERROR/ForkPoolWorker-1] Task compute_worker_run[c7dc9a33-ebc0-479a-9389-175064c9766f] raised unexpected: PermissionError(13, 'Permission denied')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/worker/compute_worker/compute_worker.py", line 88, in run_wrapper
    run = Run(run_args)
  File "/home/worker/compute_worker/compute_worker.py", line 181, in __init__
    self.root_dir = tempfile.mkdtemp(dir=BASE_DIR)
  File "/usr/lib64/python3.8/tempfile.py", line 358, in mkdtemp
    _os.mkdir(file, 0o700)
PermissionError: [Errno 13] Permission denied: '/codabench/tmpheazjjin'

Somehow the volume /codabench inside the container is mapped to the root user.
We also need to think about the user and volume mapping for the child container, which is executed by the Podman container.
Here is a competition for testing: https://codabench-test.lri.fr/competitions/30/ and the sample submission is in the attached file.
sample_code_submission.zip

@cjh1
Contributor Author

cjh1 commented Feb 1, 2023

Can you please try the following invocation (from the above comment)? We don't need to bind mount from the host; we just use the folder from inside the worker container. The security options and device are important:

podman run -it --security-opt label=disable --device /dev/fuse --user worker codabench_compute_worker_podman

@dtuantran
Contributor

dtuantran commented Feb 2, 2023

Can you please try the following invocation (from the above comment)? We don't need to bind mount from the host; we just use the folder from inside the worker container. The security options and device are important:

podman run -it --security-opt label=disable --device /dev/fuse --user worker codabench_compute_worker_podman

I tried exactly that command. However, the child container needs access to the temporary files generated in the original container, which is launched by your command. In our case, we map the volume /codabench/storage from the host VM to /codabench inside the original container. After that, the generated command

podman run --rm --name=cd633d2a-b385-4a7c-a9d8-cccdd8cd6d97 --security-opt=no-new-privileges -v /codabench/storage/tmph0ibs9ca/ingestion_program:/app/program

maps the temporary files /codabench/storage/tmp... generated by the compute_worker container (the original one) from the host VM to /app/... inside the child container.
If you confirm that your test works, maybe we missed some configuration for Podman?
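For context, our worker startup is roughly the following (reconstructed from the description above, not the exact command we run):

 podman run -it --security-opt label=disable --device /dev/fuse --user worker \
   -v /codabench/storage:/codabench \
   codabench_compute_worker_podman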

@cjh1
Contributor Author

cjh1 commented Feb 2, 2023

@dtuantran I can confirm that my tests work. Any chance we could jump on a quick call to see what is going on at your end? Just to confirm, you aren't bind mounting any volumes into the compute_worker container?

@cjh1
Contributor Author

cjh1 commented Feb 2, 2023

@dtuantran I think I see the issue with your setup; please try setting HOST_DIRECTORY=/codabench
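Something like this, assuming the worker picks HOST_DIRECTORY up from its environment (a sketch only; with nested podman the "host" path for the child container is the worker container's own /codabench, so no bind mount from the VM is needed):

 podman run -it --security-opt label=disable --device /dev/fuse --user worker \
   -e HOST_DIRECTORY=/codabench \
   codabench_compute_worker_podman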

@dtuantran
Contributor

I've added the Containerfile for building the image. The GPU container can detect and use the GPU. However, I got this error:

$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I had to add the option --privileged to the podman run command and in the file compute_worker.py

And then, it works.

@cjh1
Contributor Author

cjh1 commented Feb 6, 2023

@dtuantran Thanks for trying this. I would prefer not to use --privileged; I think it should be possible without this option. I will take a look at this.

@dtuantran
Contributor

I've added the Containerfile for building the image. The GPU container can detect and use the GPU. However, I got this error:

$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I had to add the option --privileged to the podman run command and in the file compute_worker.py

And then, it works.

I think this option perhaps creates a security flaw; that's why I didn't commit it into the PR. @cjh1: Do you know if there is another solution?

@cjh1
Contributor Author

cjh1 commented Feb 6, 2023

Here is a write-up for rootless Podman and NVIDIA; we should try this approach: https://github.com/henrymai/podman_wsl2_cuda_rootless
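As a rough sketch of that approach (assuming nvidia-container-toolkit is already installed on the host; the linked write-up is the actual reference):

 # allow the NVIDIA hook to run without root-managed cgroups (rootless podman)
 sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
 # quick check that the GPU is visible from a rootless container
 podman run --rm --security-opt label=disable \
   --hooks-dir=/usr/share/containers/oci/hooks.d/ \
   nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi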

@dtuantran
Contributor

Here is a write-up for rootless Podman and NVIDIA; we should try this approach: https://github.com/henrymai/podman_wsl2_cuda_rootless

I already did these configurations on the VM host before testing the GPU compute worker. However, it's strange that it works when testing with the cuda image, but when testing with our image (https://hub.docker.com/r/codalab/codabench_worker_podman_gpu) it only works with the --privileged option.

@cjh1
Contributor Author

cjh1 commented Feb 6, 2023

Which cuda image works for you? You're running the cuda image inside the compute container?

-l info \
-Q compute-worker \
-n compute-worker@%n \
--concurrency=1
Contributor Author


We shouldn't duplicate this code. We should just base the GPU version on the other compute worker image and make the necessary changes.

Contributor


I built the GPU version based on your Containerfile in order to validate the GPU case. You can remove it if it isn't necessary.

@dtuantran
Contributor

Which cuda image works for you? You're running the cuda image inside the compute container?

I use this one, nvidia/cuda:11.6.2-base-ubuntu20.04, as mentioned here: https://github.com/codalab/codabench/wiki/Compute-worker-installation-with-Podman#for-gpu-case

@dtuantran
Contributor

I've added the Containerfile for building the image. The GPU container can detect and use the GPU. However, I got this error:

$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I had to add the option --privileged to the podman run command and in the file compute_worker.py

And then, it works.

No need to add --privileged to compute_worker.py; in my test it was only needed in the first podman run command.
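For clarity, the outer command I tested was roughly the following (a reconstruction, not the exact command):

 podman run -it --privileged --security-opt label=disable --device /dev/fuse --user worker \
   codalab/codabench_worker_podman_gpu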


# Include deps
RUN curl -s -L https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo | sudo tee /etc/yum.repos.d/cuda.repo && \
curl -s -L https://nvidia.github.io/nvidia-docker/rhel9.0/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo && \
Contributor Author


@dtuantran Not sure how this built for you; sudo is not set up in the container. Anyway, these steps run as root, so sudo is not needed. I will try to fix it up.

@Didayolo Didayolo changed the base branch from develop to podman February 15, 2023 14:02
@Didayolo Didayolo merged commit da51014 into codalab:podman Feb 15, 2023