
Conversation

@cjh1
Contributor

@cjh1 cjh1 commented Jan 30, 2023

A brief description of the purpose of the changes contained in this PR.

This PR adds a new configuration environment variable (CONTAINER_ENGINE_EXECUTABLE) to allow the compute worker to be run with other container technology such as podman. It also provides a rootless podman Containerfile.
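As a rough sketch, the variable could be set when starting the worker container itself, for example (the exact startup mechanism will depend on your deployment; the image name and base flags here are the ones used for the rootless podman worker later in this thread):

 podman run -it --security-opt label=disable --device /dev/fuse --user worker \
   -e CONTAINER_ENGINE_EXECUTABLE=podman \
   codabench_compute_worker_podman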

Issues this PR resolves

Allow compute worker to run with podman

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • Ready to merge

@Didayolo
Member

Thank you for the pull request. Rest assured that we will review it soon.

@cjh1
Contributor Author

cjh1 commented Jan 31, 2023

This is the podman invocation used to run the worker:

 podman run -it --security-opt label=disable --device /dev/fuse --user worker codabench_compute_worker_podman

cjh1 added 4 commits January 31, 2023 16:42
This is in preparation for generalizing container engine support
to allow the use of podman.
Add support for configuring the container engine through an
environment variable (CONTAINER_ENGINE_EXECUTABLE).
Docker will create them, but other container engines like podman
may not.
This Containerfile allows rootless Podman in Podman (PINP).
@cjh1
Contributor Author

cjh1 commented Jan 31, 2023

I can confirm that I have tested with the docker worker, to ensure that these changes don't break running with docker.

@dtuantran
Contributor

We get this kind of error when processing a submission:

[2023-02-01 16:45:00,934: INFO/ForkPoolWorker-1] Running program = podman run --rm --name=cd633d2a-b385-4a7c-a9d8-cccdd8cd6d97 --security-opt=no-new-privileges -v /codabench/storage/tmph0ibs9ca/ingestion_program:/app/program -v /codabench/storage/tmph0ibs9ca/output:/app/output -w /app/program -e PYTHONUNBUFFERED=1 -v /codabench/storage/tmph0ibs9ca/program:/app/ingested_program -v /codabench/storage/tmph0ibs9ca/input_data:/app/input_data codalab/codalab-legacy:py37 python /app/program/ingestion.py /app/input_data /app/output /app/program /app/ingested_program
[2023-02-01 16:45:00,939: INFO/ForkPoolWorker-1] Connecting to wss://codabench-test.lri.fr/submission_input/5/496/8c8fcb5e-c150-48e2-9d9a-e2667492fac1/
[2023-02-01 16:45:01,282: WARNING/ForkPoolWorker-1] WS: b"python: can't open file '/app/program/ingestion.py': [Errno 2] No such file or directory\n"
...
[2023-02-01 16:45:01,474: ERROR/ForkPoolWorker-1] Task compute_worker_run[5e5b8040-7372-44b9-98df-0294a14a9fef] raised unexpected: FileNotFoundError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/worker/compute_worker/compute_worker.py", line 95, in run_wrapper
    run.push_output()
  File "/home/worker/compute_worker/compute_worker.py", line 779, in push_output
    with open(metadata_path, 'w') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/codabench/tmph0ibs9ca/output/metadata'

I think it's because the volume mapping /codabench/storage:/codabench is missing, as mentioned here.

After adding the volume mapping option -v /codabench/storage:/codabench, I got another error: Permission denied

[2023-02-01 16:18:05,644: ERROR/ForkPoolWorker-1] Task compute_worker_run[c7dc9a33-ebc0-479a-9389-175064c9766f] raised unexpected: PermissionError(13, 'Permission denied')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/worker/compute_worker/compute_worker.py", line 88, in run_wrapper
    run = Run(run_args)
  File "/home/worker/compute_worker/compute_worker.py", line 181, in __init__
    self.root_dir = tempfile.mkdtemp(dir=BASE_DIR)
  File "/usr/lib64/python3.8/tempfile.py", line 358, in mkdtemp
    _os.mkdir(file, 0o700)
PermissionError: [Errno 13] Permission denied: '/codabench/tmpheazjjin'

Somehow the volume /codabench inside the container is mapped to the root user.
We also need to think about the user and volume mapping for the child container, which is executed by the Podman container.
Here is a competition for testing: https://codabench-test.lri.fr/competitions/30/ and the sample submission is in the attached file.
sample_code_submission.zip

@cjh1
Contributor Author

cjh1 commented Feb 1, 2023

Can you please try the following invocation (from the above comment)? We don't need to bind mount from the host; we just use the folder from inside the worker container. The security options and device are important:

podman run -it --security-opt label=disable --device /dev/fuse --user worker codabench_compute_worker_podman

@dtuantran
Contributor

dtuantran commented Feb 2, 2023

Can you please try the following invocation (from the above comment)? We don't need to bind mount from the host; we just use the folder from inside the worker container. The security options and device are important:

podman run -it --security-opt label=disable --device /dev/fuse --user worker codabench_compute_worker_podman

I tried exactly that command. However, the child container needs access to the temporary files generated in the original container, which is launched by your command. In our case, we map the volume /codabench/storage from the host VM to /codabench inside the original container. After that, the generated command

podman run --rm --name=cd633d2a-b385-4a7c-a9d8-cccdd8cd6d97 --security-opt=no-new-privileges -v /codabench/storage/tmph0ibs9ca/ingestion_program:/app/program

maps the temporary files /codabench/storage/tmp... generated by the compute_worker container (the original one) from the host VM to /app/... inside the child container.
If you confirm that your test works, maybe we missed some configuration for Podman?
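For context, our worker startup is roughly the following (reconstructed from the description above, not the exact command we run):

 podman run -it --security-opt label=disable --device /dev/fuse --user worker \
   -v /codabench/storage:/codabench \
   codabench_compute_worker_podman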

@cjh1
Contributor Author

cjh1 commented Feb 2, 2023

@dtuantran I can confirm that my tests work. Any chance we could jump on a quick call to see what is going on at your end? Just to confirm, you aren't bind mounting any volumes into the compute_worker container?

@cjh1
Contributor Author

cjh1 commented Feb 2, 2023

@dtuantran I think I see the issue with your setup; please try setting HOST_DIRECTORY=/codabench
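Something like this, assuming the worker picks HOST_DIRECTORY up from its environment (a sketch only; with nested podman the "host" path for the child container is the worker container's own /codabench, so no bind mount from the VM is needed):

 podman run -it --security-opt label=disable --device /dev/fuse --user worker \
   -e HOST_DIRECTORY=/codabench \
   codabench_compute_worker_podman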

@dtuantran
Contributor

I've added the Containerfile for building the image. The GPU container can detect and use the GPU. However, I got this error:

$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I had to add the option --privileged to the podman run command and in the file compute_worker.py

And then, it works.

@cjh1
Contributor Author

cjh1 commented Feb 6, 2023

@dtuantran Thanks for trying this. I would prefer not to use --privileged; I think it should be possible without this option. I will take a look at this.

@dtuantran
Contributor

I've added the Containerfile for building the image. The GPU container can detect and use the GPU. However, I got this error:

$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I had to add the option --privileged to the podman run command and in the file compute_worker.py

And then, it works.

I think this option perhaps creates a security flaw; that's why I didn't commit it into the PR. @cjh1: Do you know if there is another solution?

@cjh1
Contributor Author

cjh1 commented Feb 6, 2023

Here is a write-up for rootless Podman and NVIDIA; we should try this approach: https://github.com/henrymai/podman_wsl2_cuda_rootless
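As a rough sketch of that approach (assuming nvidia-container-toolkit is already installed on the host; the linked write-up is the actual reference):

 # allow the NVIDIA hook to run without root-managed cgroups (rootless podman)
 sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
 # quick check that the GPU is visible from a rootless container
 podman run --rm --security-opt label=disable \
   --hooks-dir=/usr/share/containers/oci/hooks.d/ \
   nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi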

@dtuantran
Contributor

Here is a write-up for rootless Podman and NVIDIA; we should try this approach: https://github.com/henrymai/podman_wsl2_cuda_rootless

I already did these configurations on the VM host before testing the GPU compute worker. However, it's strange that it works when testing with the cuda image, but when testing with our image (https://hub.docker.com/r/codalab/codabench_worker_podman_gpu) it only works with the --privileged option.

@cjh1
Contributor Author

cjh1 commented Feb 6, 2023

Which cuda image works for you? You're running the cuda image inside the compute container?

-l info \
-Q compute-worker \
-n compute-worker@%n \
--concurrency=1
Contributor Author


We shouldn't duplicate this code. We should just base the GPU version on the other compute worker image and make the necessary changes.

Contributor


I built the GPU version based on your Containerfile in order to validate the GPU case. You can remove it if it isn't necessary.

@dtuantran
Contributor

Which cuda image works for you? You're running the cuda image inside the compute container?

I use this one, nvidia/cuda:11.6.2-base-ubuntu20.04, as mentioned here: https://github.com/codalab/codabench/wiki/Compute-worker-installation-with-Podman#for-gpu-case

@dtuantran
Contributor

I've added the Containerfile for building the image. The GPU container can detect and use the GPU. However, I got this error:

$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I had to add the option --privileged to the podman run command and in the file compute_worker.py

And then, it works.

No need to add --privileged to compute_worker.py; in my test it was only needed in the first podman run command.
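For clarity, the outer command I tested was roughly the following (a reconstruction, not the exact command):

 podman run -it --privileged --security-opt label=disable --device /dev/fuse --user worker \
   codalab/codabench_worker_podman_gpu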


# Include deps
RUN curl -s -L https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo | sudo tee /etc/yum.repos.d/cuda.repo && \
curl -s -L https://nvidia.github.io/nvidia-docker/rhel9.0/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo && \
Contributor Author


@dtuantran Not sure how this built for you; sudo is not set up in the container. Anyway, these steps run as root, so sudo is not needed. I will try to fix it up.

@Didayolo Didayolo changed the base branch from develop to podman February 15, 2023 14:02
@Didayolo Didayolo merged commit da51014 into codalab:podman Feb 15, 2023