
Conversation

@bbearce (Collaborator) commented Apr 18, 2024

Reviewers

@Didayolo, @ihsaan-ullah

Summary

This PR transitions us from pip to Poetry for dependency management. I only bumped the Python version to 3.9, and not even for all Dockerfiles; I ran into issues with Python 3.10, since too many packages would need to change at once. I think the only realistic way to do this is one package at a time, slowly bumping up the Python version over time. For now I really just switched us from pip to Poetry while keeping the same package versions, except for ipdb, which I upgraded slightly because Python 3.9 required it.
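For reference, the general Poetry-in-Docker pattern looks roughly like the sketch below. This is a simplified illustration, not the literal contents of the Dockerfiles in this PR, which may differ in base image, flags, and layering.

    FROM python:3.9
    # Install Poetry itself, then install the locked dependencies directly into the
    # container's Python environment (no separate virtualenv inside the image).
    RUN pip install poetry
    COPY pyproject.toml poetry.lock ./
    RUN poetry config virtualenvs.create false \
        && poetry install --no-interaction --no-ansi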

Dockerfiles updated:

  • Dockerfile
    docker build --no-cache -f Dockerfile -t codabench-django:latest ./
  • Dockerfile.compute_worker
    docker build --no-cache -f Dockerfile.compute_worker -t codabench-compute_worker:latest ./
  • Dockerfile.compute_worker_gpu
    docker build --no-cache -f Dockerfile.compute_worker_gpu -t codabench-compute_worker_gpu:latest ./
  • Dockerfile.flower
    docker build --no-cache -f Dockerfile.flower -t codabench-flower:latest ./

PS: Do we use Dockerfile.celery at all?

Issues this PR resolves

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • CircleCi tests are passing
  • Ready to merge

@Didayolo changed the title from "Issue 1413" to "Update Python using Poetry (Issue #1413)" on Apr 19, 2024
@Didayolo (Member)

Nice progress. This is a fundamental change that needs to be tested thoroughly before merging.

@ihsaan-ullah (Collaborator)

Nice work, @bbearce. I see that there is Poetry for the compute worker. Will this affect the setup of a compute worker?
https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup

@Didayolo (Member)

Also, what is the file compute_worker/poetry.lock?

@Didayolo (Member) commented Jun 11, 2024

  • This currently runs on https://codabench-test.lri.fr/

  • We also need to test the compute workers

  • Building the GPU compute worker Dockerfile did NOT work

  • We are supposed to remove the requirements.txt files in this PR

@bbearce (Collaborator, Author) commented Jun 17, 2024

I was working on this over the weekend. A couple of questions:

  • I had to delete basically every Docker image to build the GPU one. Could we request maybe 50 GB total for this VM (157.136.249.152)?
  • The compute worker and the GPU worker both start from the same Python 3.9 base image, and I believe that is for consistency. However, installing the CUDA drivers and the NVIDIA toolkit is harder that way. The Dockerfile I'm currently working on builds to about 8 GB, but it starts from ubuntu:20.04. Can we start from Ubuntu, or better yet, from an nvidia/cuda image? Adding Python afterwards is easy, and I could even integrate pyenv, though I don't want to go too crazy.
    • The obvious con is that it wouldn't share a base image with the CPU compute worker, unless we switched that to ubuntu:20.04 as well (and just skipped the CUDA installation).

What do you think?

@ihsaan-ullah (Collaborator)

If you want to use a CUDA image, you can check this one, which we are using in another project:

https://github.com/FAIR-Universe/HEP-Challenge/blob/master/docker/Dockerfile

@bbearce (Collaborator, Author) commented Jun 17, 2024

Exactly. The more I think about it, this is great, but how closely should the GPU worker and the CPU worker match, Dockerfile-wise? My take is that you'd want them to be as similar as possible, and I'm worried that using an NVIDIA image for the CPU worker doesn't make sense, so the two would come from different bases. Does anyone else think having both workers start from Ubuntu is too low-level?

PS: I'm totally down to use an NVIDIA image for the GPU worker and a Python or Ubuntu image for the CPU worker, if folks don't think the different bases matter much. It would greatly simplify the GPU setup, effectively letting us skip the CUDA install altogether.

@Didayolo (Member)

Quoting @bbearce: "I'm totally down to use an NVIDIA image for the GPU worker and a Python or Ubuntu image for the CPU worker, if folks don't think the different bases matter much."

Yeah, I guess we can go for this.

@Didayolo (Member) commented Jun 19, 2024

We updated the compute worker Docker images:

  • tag test for the CPU version
  • tag gpu for the GPU version

The CPU one is currently running submissions on the test server without problems.

We are in the process of testing the GPU version. Bug:

Traceback (most recent call last):
  File "/usr/bin/celery", line 5, in <module>
    from celery.__main__ import main
ModuleNotFoundError: No module named 'celery'
Traceback (most recent call last):
  File "/usr/bin/celery", line 5, in <module>
    from celery.__main__ import main
ModuleNotFoundError: No module named 'celery'
Traceback (most recent call last):
  File "/usr/bin/celery", line 5, in <module>
    from celery.__main__ import main
ModuleNotFoundError: No module named 'celery'
Traceback (most recent call last):
  File "/usr/bin/celery", line 5, in <module>
    from celery.__main__ import main
ModuleNotFoundError: No module named 'celery'
Traceback (most recent call last):
  File "/usr/bin/celery", line 5, in <module>
    from celery.__main__ import main
ModuleNotFoundError: No module named 'celery'
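A quick sanity check for whether celery is actually installed inside the image (a hypothetical diagnostic; the image tag here is the local one from the build commands in the PR description and may differ from the deployed tag):

    docker run --rm codabench-compute_worker_gpu:latest python -c "import celery; print(celery.__version__)"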

@Didayolo (Member) commented Jun 19, 2024

Testing the GPU compute worker, I get this error on a submission of the "GPU test" bundle:

[2024-06-19 18:02:02,324: INFO/MainProcess] compute-worker@88d49e766880 ready.
[2024-06-19 18:02:02,325: INFO/MainProcess] Received task: compute_worker_run[ef4800b6-9ff8-4626-8702-260f99d2c8b5]  
[2024-06-19 18:02:02,426: INFO/ForkPoolWorker-1] Received run arguments: {'user_pk': 7, 'submissions_api_url': 'https://www.codabench.org/api', 'secret': 'fd3e705e-4a4f-4777-9507-fede640bda15', 'docker_image': 'codalab/codalab-legacy:gpu', 'execution_time_limit': 600, 'id': 71224, 'is_scoring': False, 'prediction_result': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/prediction_result/2024-06-19-1718819984/6d99e685bc54/prediction_result.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=34B6XI%2FOee1gMw1L8p88gx5UNw8%3D&content-type=application%2Fzip&Expires=1718906385', 'ingestion_only_during_scoring': False, 'program_data': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/dataset/2024-06-19-1718819979/0b3a8502929c/submission.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=Q4%2FLO4Va4yIINYzRn2JSCcz%2BgBQ%3D&Expires=1718906385', 'prediction_stdout': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/87a4537c79e3/prediction_stdout.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=BHhGAaPiK%2BBz0VgIyuBkzyy3hOM%3D&content-type=application%2Fzip&Expires=1718906385', 'prediction_stderr': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/ea6d07365e4b/prediction_stderr.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=5G2icoWHEBZgMLXJrbSxAAUKsGA%3D&content-type=application%2Fzip&Expires=1718906385', 'prediction_ingestion_stdout': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/a890fb3b7207/prediction_ingestion_stdout.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=FSOuv3o9UGq4hjRnpJTa8jKe8Nk%3D&content-type=application%2Fzip&Expires=1718906385', 'prediction_ingestion_stderr': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-06-19-1718819985/e030e979fdb9/prediction_ingestion_stderr.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=oFdZi6FjiY1BckMc%2BoXct6l%2BwpY%3D&content-type=application%2Fzip&Expires=1718906385'}
[2024-06-19 18:02:02,427: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/71224/ with data = {'status': 'Preparing', 'status_details': None, 'secret': 'fd3e705e-4a4f-4777-9507-fede640bda15'}
[2024-06-19 18:02:02,702: INFO/ForkPoolWorker-1] Submission updated successfully!
[2024-06-19 18:02:02,702: INFO/ForkPoolWorker-1] Checking if cache directory needs to be pruned...
[2024-06-19 18:02:02,703: INFO/ForkPoolWorker-1] Cache directory does not need to be pruned!
[2024-06-19 18:02:02,703: INFO/ForkPoolWorker-1] Getting bundle https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/dataset/2024-06-19-1718819979/0b3a8502929c/submission.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=Q4%2FLO4Va4yIINYzRn2JSCcz%2BgBQ%3D&Expires=1718906385 to unpack @ program
[2024-06-19 18:02:02,777: INFO/ForkPoolWorker-1] Beginning MD5 checksum of submission: /codabench/tmp42i5mcin/bundles/tmpctn6s1_z
[2024-06-19 18:02:02,777: INFO/ForkPoolWorker-1] Checksum result: 573f142dfb0c45b19c063131db595bb5
[2024-06-19 18:02:02,777: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/71224/ with data = {'md5': '573f142dfb0c45b19c063131db595bb5', 'secret': 'fd3e705e-4a4f-4777-9507-fede640bda15'}
[2024-06-19 18:02:02,870: INFO/ForkPoolWorker-1] Submission updated successfully!
[2024-06-19 18:02:02,870: INFO/ForkPoolWorker-1] Running pull for image: codalab/codalab-legacy:gpu
[2024-06-19 18:02:02,873: INFO/ForkPoolWorker-1] Destroying submission temp dir: /codabench/tmp42i5mcin
[2024-06-19 18:02:02,876: ERROR/ForkPoolWorker-1] Task compute_worker_run[ef4800b6-9ff8-4626-8702-260f99d2c8b5] raised unexpected: FileNotFoundError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/compute_worker.py", line 115, in run_wrapper
    run.prepare()
  File "/compute_worker.py", line 803, in prepare
    self._get_container_image(self.container_image)
  File "/compute_worker.py", line 367, in _get_container_image
    container_engine_pull = check_output(cmd)
  File "/usr/local/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/local/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/local/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'nvidia-docker'

This error comes from the new Dockerfile; with the older Docker image it does not appear. Indeed, we are now using the new NVIDIA Container Toolkit, so we should update the code to stop calling nvidia-docker.
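A sketch of the direction (assuming the pull in _get_container_image currently shells out to nvidia-docker on GPU workers, as the traceback suggests; the actual fix may look different):

    from subprocess import check_output

    # Pull images with the regular container engine. GPU access is requested at
    # `run` time via `--gpus` together with the NVIDIA Container Toolkit, so a
    # separate `nvidia-docker` binary is no longer needed for pulls.
    CONTAINER_ENGINE_EXECUTABLE = 'docker'  # assumption: normally set from configuration

    def pull_image(image_name):
        cmd = [CONTAINER_ENGINE_EXECUTABLE, 'pull', image_name]
        return check_output(cmd)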

@Didayolo
Copy link
Member

Now the CPU version is broken...

[2024-06-19 19:26:28,933: INFO/ForkPoolWorker-1] Connecting to wss://codabench-test.lri.fr/submission_input/6/797/d3673c8d-e345-4f34-aa44-73e01e63c530/
[2024-06-19 19:26:31,802: WARNING/ForkPoolWorker-1] WS: b'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].\n'
[2024-06-19 19:26:31,802: INFO/ForkPoolWorker-1] Process exited with 125
[2024-06-19 19:26:31,802: INFO/ForkPoolWorker-1] Disconnecting from websocket wss://codabench-test.lri.fr/submission_input/6/797/d3673c8d-e345-4f34-aa44-73e01e63c530/
[2024-06-19 19:26:33,936: INFO/ForkPoolWorker-1] [exited with 125]
[2024-06-19 19:26:33,936: INFO/ForkPoolWorker-1] [stderr]
b'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].\n'
[2024-06-19 19:26:33,936: INFO/ForkPoolWorker-1] Putting raw data b'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].\n' in https://minio-test.lri.fr/codabench-private/submission_details/2024-06-19-1718825186/183b292a2ba8/scoring_stderr.txt?AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE&Signature=08Orj1ptjBxO7cguAlMCA83YwGs%3D&content-type=application%2Fzip&Expires=1718911586

The code in compute_worker.py:

        engine_cmd = [
            CONTAINER_ENGINE_EXECUTABLE,
            'run',
            # Remove it after run
            '--rm',
            f'--name={self.ingestion_container_name if kind == "ingestion" else self.program_container_name}',

            # Don't allow subprocesses to raise privileges
            '--security-opt=no-new-privileges',

            # GPU or not
            '--gpus', 
            'all' if os.environ.get("USE_GPU") else '0',

            # Set the volumes
            '-v', f'{self._get_host_path(program_dir)}:/app/program',
            '-v', f'{self._get_host_path(self.output_dir)}:/app/output',
            '-v', f'{self.data_dir}:/app/data:ro',

            # Start in the right directory
            '-w', '/app/program',

            # Don't buffer python output, so we don't lose any
            '-e', 'PYTHONUNBUFFERED=1',
        ]
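The --gpus flag is what makes Docker request the NVIDIA device driver, so passing it on a CPU-only host fails even with the value '0'. One plausible shape for the fix (the change actually merged may differ) is to append the flag only when USE_GPU is set:

    # Sketch only: build the base command without any GPU flag, then add --gpus
    # solely when the worker is configured for GPUs, so CPU-only hosts never ask
    # Docker for the NVIDIA device driver.
    engine_cmd = [
        CONTAINER_ENGINE_EXECUTABLE,
        'run',
        '--rm',
        '--security-opt=no-new-privileges',
        # ... container name, volumes, workdir, and env as in the block above ...
    ]
    if os.environ.get("USE_GPU"):
        engine_cmd += ['--gpus', 'all']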

@Didayolo (Member)

Fixed!
