
Didayolo (Member) commented Jun 20, 2025

Test

Deployed at https://codabench-test.lri.fr/

Manual intervention

  1. Pull the latest code:
     git pull
  2. Rebuild containers (Django to 3.0 #1730):
     docker compose build
     docker compose up -d
  3. Run migrations and collect static files:
     docker compose exec django ./manage.py makemigrations
     docker compose exec django ./manage.py migrate
     docker compose exec django ./manage.py collectstatic --noinput
  4. Upgrade compute workers: every compute worker associated with the instance (default queue) or with a custom queue needs to be updated:
     <ssh into worker>
     docker compose down
     docker compose up -d --pull always

Changes

Related to compute worker

Django 3.0 upgrade

ihsaan-ullah and others added 9 commits May 5, 2025 17:22
…ion_output

Submission files duplication fixed
…6GB to 1GB; Made Dockerfile.compute_worker_gpu base image the build CPU image, adding only the necessary things to build it faster
* fix to run result submission (with copy to predictions dir)

* raise error when signing up with an email with *

* revert compute worker changes
Didayolo added the In Progress and Release PR develop --> master labels Jun 20, 2025
ihsaan-ullah and others added 11 commits June 20, 2025 15:06
* fix to run result submission (with copy to predictions dir)

* removed filter based on

* reverted compute worker changes
* filters added in public competitions page

* reward added to filters, style fixed, tests updated, empty response handled

* fixed small variable name
* fix to run result submission (with copy to predictions dir)

* task added to delete non-activated users who have not activated their accounts within 3 days of creation

* additional checks added before deletion, compute worker changes reverted
[code documentation] - public competitions api documentation added
Put back copy of submission files into prediction output in compute worker
curious-broccoli and others added 7 commits July 14, 2025 19:43
clamp length of competition search results description
Clamp length of competition search results description
* Removed num_entries because the count was slowing down the platform

* num entries completely removed

* test updated

---------

Co-authored-by: didayolo <adrien.pavao@gmail.com>
Markdown rendering problems fixed
Option added to download all participants
…om submission on soft deletion (#1911)

* improved organization delete error

* remove organization from submission on soft-delete
ihsaan-ullah and others added 4 commits July 15, 2025 13:08
* datasets query clarified, total file size added for submissions, individual file sizes added for submissions, front end modified to show file sizes in details

* flake fixes

* updated names for better readability
* compute worker allow to unzip files with parent dir

* ingestion scoring and input data handled differently

---------

Co-authored-by: Ihsan Ullah <ihsan2131@gmail.com>
Update mc command from "config host add" to "alias set"
bbearce and others added 6 commits July 17, 2025 16:43
* consumer async and template static loading changes

* flake problems

* comp participant needs creating

* ORM-based issues rectified.

* flake message removal

* CONCERN task

* Add remove button for cancelled submissions (#1808)

* Add remove button for cancelled submissions

* Allow remove of cancelled submissions

* more waits added

* flake concerns

* Update compute_worker.py

* Triggering tests with blank line deletion

* flake

* circleci resource_class: medium+

* circleci resource_class: large

* circleci resource_class: xlarge

* Add permissions check for bulk download

* flake8 fix

* Add hide_score_output option (#1838)

* Add hide_score_output option

* Update test

* Add the options for v1 bundles

* Make more generic tests (v1, v2)

* code removed that was copying submission files to predictions dir

* hail mary

* flake

* config

* version update workflow removed

* Add hide_prediction_output feature

* Calendar lock fixed, additional check added for start and end date

* Simplify code

* Version bump

* Removed time and updated date to today

* Caddy image update

* fix Caddyfile indentation

* django to 3.2.0 - but still has websocket errors for test_submissions during tests

* poetry.lock

* removing submissions to pass circleci

* Do not allow signup with email with `*` (#1882)

* fix to run result submission (with copy to predictions dir)

* raise error when signing up with an email with *

* revert compute worker changes

* User model filters - remove `deleted` (#1887)

* fix to run result submission (with copy to predictions dir)

* removed filter based on

* reverted compute worker changes

* consumer async and template static loading changes

* flake problems

* comp participant needs creating

* ORM-based issues rectified.

* flake message removal

* CONCERN task

* more waits added

* flake concerns

* Triggering tests with blank line deletion

* flake

* circleci resource_class: medium+

* circleci resource_class: large

* circleci resource_class: xlarge

* hail mary

* flake

* config

* django to 3.2.0 - but still has websocket errors for test_submissions during tests

* poetry.lock

* removing submissions to pass circleci

* integrate dev branch commit: 2883349

* config.yml for circleci

* config.yml for circleci

* spelling mistake

* timing issues

* flake

* timing issues

* timing issues

* timing issues

* timing issues

* logger.info -> logger.debug changes

* test just selenium submissions

* time adjustment

* separating out submissions to see if one is particularly troublesome

* time adjustment

* time adjustment

* docker images

* submissions in batch

* all

* final clean up

---------

Co-authored-by: Adrien Pavão <adrien.pavao@gmail.com>
Co-authored-by: Ihsan Ullah <ihsan2131@gmail.com>
Co-authored-by: Obada Haddad <obada.haddad@lisn.fr>
Co-authored-by: Moritz <moritz.mueller2@tu-dresden.de>
fix missing username in email template, improve message
Didayolo (Member, Author) commented Jul 25, 2025

On my local setup, submissions are stuck in the "Preparing" status and I get the following error logs:

codabench-compute_worker-1  | [2025-07-25 10:29:32,814: INFO/ForkPoolWorker-1] Received run arguments: {'user_pk': 1, 'submissions_api_url': 'http://django:8000/api', 'secret': 'dadaf918-5fe1-4467-ba23-9ddb07f5c515', 'docker_image': 'codalab/codalab-legacy:py39', 'execution_time_limit': 600, 'id': 9, 'is_scoring': False, 'prediction_result': 'http://docker.for.mac.localhost:9000/private/prediction_result/2025-07-25-1753439372/1297e25db90b/prediction_result.zip?AWSAccessKeyId=testkey&Signature=31F8tY2YHMWtyB9IkBixz6YmdUI%3D&content-type=application%2Fzip&Expires=1753871372', 'ingestion_program': 'http://docker.for.mac.localhost:9000/private/dataset/2025-07-24-1753371071/2f827182f1e7/ingestion_program.zip?AWSAccessKeyId=testkey&Signature=%2FmeWY4fxyJJCZ0RFBQvv729Fvqs%3D&Expires=1753871372', 'input_data': 'http://docker.for.mac.localhost:9000/private/dataset/2025-07-24-1753371071/2bb057ba491b/input_data.zip?AWSAccessKeyId=testkey&Signature=StTpIITCobxNWzlutImW%2BYbcmJw%3D&Expires=1753871372', 'ingestion_only_during_scoring': False, 'program_data': 'http://docker.for.mac.localhost:9000/private/dataset/2025-07-25-1753439366/bae0dc77281b/sample_code_submission.zip?AWSAccessKeyId=testkey&Signature=%2BT3KCgzlM9iHxlupE%2BcVotNYAq0%3D&Expires=1753871372', 'prediction_stdout': 'http://docker.for.mac.localhost:9000/private/submission_details/2025-07-25-1753439372/8c2d555f9a7c/prediction_stdout.txt?AWSAccessKeyId=testkey&Signature=d3YNrQhWn3PIfbK6d7NixGVJIgU%3D&content-type=application%2Fzip&Expires=1753871372', 'prediction_stderr': 'http://docker.for.mac.localhost:9000/private/submission_details/2025-07-25-1753439372/e0b27522a783/prediction_stderr.txt?AWSAccessKeyId=testkey&Signature=BAClDvRwgymYxezxdVZEkwApt%2Bc%3D&content-type=application%2Fzip&Expires=1753871372', 'prediction_ingestion_stdout': 
'http://docker.for.mac.localhost:9000/private/submission_details/2025-07-25-1753439372/935dac0a0f59/prediction_ingestion_stdout.txt?AWSAccessKeyId=testkey&Signature=7nI96MjM46Co0alGnL5KJNkmWoo%3D&content-type=application%2Fzip&Expires=1753871372', 'prediction_ingestion_stderr': 'http://docker.for.mac.localhost:9000/private/submission_details/2025-07-25-1753439372/721a7aca830f/prediction_ingestion_stderr.txt?AWSAccessKeyId=testkey&Signature=8%2ByAhVFFE3nXdi4Lnq1lYe10UMs%3D&content-type=application%2Fzip&Expires=1753871372'}
codabench-compute_worker-1  | [2025-07-25 10:29:32,829: INFO/ForkPoolWorker-1] Updating submission @ http://django:8000/api/submissions/9/ with data = {'status': 'Preparing', 'status_details': None, 'secret': 'dadaf918-5fe1-4467-ba23-9ddb07f5c515'}
codabench-compute_worker-1  | [2025-07-25 10:29:33,261: INFO/ForkPoolWorker-1] Submission updated successfully!
codabench-compute_worker-1  | [2025-07-25 10:29:33,266: INFO/ForkPoolWorker-1] Checking if cache directory needs to be pruned...
codabench-compute_worker-1  | [2025-07-25 10:29:33,268: INFO/ForkPoolWorker-1] Cache directory does not need to be pruned!
codabench-compute_worker-1  | [2025-07-25 10:29:33,268: INFO/ForkPoolWorker-1] Getting bundle http://docker.for.mac.localhost:9000/private/dataset/2025-07-25-1753439366/bae0dc77281b/sample_code_submission.zip?AWSAccessKeyId=testkey&Signature=%2BT3KCgzlM9iHxlupE%2BcVotNYAq0%3D&Expires=1753871372 to unpack @ program
codabench-compute_worker-1  | [2025-07-25 10:29:33,280: INFO/ForkPoolWorker-1] CODALAB_IGNORE_CLEANUP_STEP mode enabled, ignoring clean up of: /codabench/tmpscql5bqa
codabench-compute_worker-1  | [2025-07-25 10:29:33,285: ERROR/ForkPoolWorker-1] Task compute_worker_run[2a7cdcbd-12fd-4435-91c6-172cd1500ebe] raised unexpected: URLError(ConnectionRefusedError(111, 'Connection refused'))
codabench-compute_worker-1  | Traceback (most recent call last):
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/urllib/request.py", line 1346, in do_open
codabench-compute_worker-1  |     h.request(req.get_method(), req.selector, req.data, headers,
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/http/client.py", line 1285, in request
codabench-compute_worker-1  |     self._send_request(method, url, body, headers, encode_chunked)
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/http/client.py", line 1331, in _send_request
codabench-compute_worker-1  |     self.endheaders(body, encode_chunked=encode_chunked)
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/http/client.py", line 1280, in endheaders
codabench-compute_worker-1  |     self._send_output(message_body, encode_chunked=encode_chunked)
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/http/client.py", line 1040, in _send_output
codabench-compute_worker-1  |     self.send(msg)
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/http/client.py", line 980, in send
codabench-compute_worker-1  |     self.connect()
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/http/client.py", line 946, in connect
codabench-compute_worker-1  |     self.sock = self._create_connection(
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/socket.py", line 856, in create_connection
codabench-compute_worker-1  |     raise err
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/socket.py", line 844, in create_connection
codabench-compute_worker-1  |     sock.connect(sa)
codabench-compute_worker-1  | ConnectionRefusedError: [Errno 111] Connection refused
codabench-compute_worker-1  | 
codabench-compute_worker-1  | During handling of the above exception, another exception occurred:
codabench-compute_worker-1  | 
codabench-compute_worker-1  | Traceback (most recent call last):
codabench-compute_worker-1  |   File "/usr/lib/python3.9/site-packages/celery/app/trace.py", line 385, in trace_task
codabench-compute_worker-1  |     R = retval = fun(*args, **kwargs)
codabench-compute_worker-1  |   File "/usr/lib/python3.9/site-packages/celery/app/trace.py", line 650, in __protected_call__
codabench-compute_worker-1  |     return self.run(*args, **kwargs)
codabench-compute_worker-1  |   File "/app/compute_worker.py", line 117, in run_wrapper
codabench-compute_worker-1  |     run.prepare()
codabench-compute_worker-1  |   File "/app/compute_worker.py", line 856, in prepare
codabench-compute_worker-1  |     zip_file = self._get_bundle(url, path, cache=cache_this_bundle)
codabench-compute_worker-1  |   File "/app/compute_worker.py", line 462, in _get_bundle
codabench-compute_worker-1  |     urlretrieve(url, bundle_file)
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/urllib/request.py", line 239, in urlretrieve
codabench-compute_worker-1  |     with contextlib.closing(urlopen(url, data)) as fp:
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/urllib/request.py", line 214, in urlopen
codabench-compute_worker-1  |     return opener.open(url, data, timeout)
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/urllib/request.py", line 517, in open
codabench-compute_worker-1  |     response = self._open(req, data)
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/urllib/request.py", line 534, in _open
codabench-compute_worker-1  |     result = self._call_chain(self.handle_open, protocol, protocol +
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/urllib/request.py", line 494, in _call_chain
codabench-compute_worker-1  |     result = func(*args)
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/urllib/request.py", line 1375, in http_open
codabench-compute_worker-1  |     return self.do_open(http.client.HTTPConnection, req)
codabench-compute_worker-1  |   File "/usr/lib64/python3.9/urllib/request.py", line 1349, in do_open
codabench-compute_worker-1  |     raise URLError(err)
codabench-compute_worker-1  | urllib.error.URLError: <urlopen error [Errno 111] Connection refused>

@bbearce Any clue?

Didayolo added 2 commits July 25, 2025 12:38
Fix missing username in email template, improve message
ckravit commented Jul 26, 2025

@Didayolo - If it's at all helpful (LLM assisted):

Root Cause Analysis

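The primary issue is that docker.for.mac.localhost is a Docker Desktop for Mac alias for the host machine, not a name for the MinIO container. When the compute worker (running in a container) tries to reach MinIO through URLs containing docker.for.mac.localhost, the request goes to port 9000 on the host instead of to the minio service, and the connection fails when nothing accepts it there (the log shows "Connection refused", not a name-resolution error).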

Known Workarounds in the Codebase

The codebase already contains explicit workarounds for this issue:
  - Example submission script: URLs are rewritten to replace docker.for.mac.localhost with localhost for local testing (example_submission.py:88).
  - Get submission details script: the same URL replacement is performed for local development (get_submission_details.py:143, get_submission_details.py:156).

Debugging Steps

  1. Check environment configuration: ensure AWS_S3_ENDPOINT_URL is set to http://minio:9000/ in your .env file, not http://docker.for.mac.localhost:9000/.

  2. Verify network connectivity: the compute worker should reach MinIO over the internal Docker network using the service name minio (docker-compose.yml:216-237).

  3. URL generation: presigned URLs are generated by the make_url_sassy function, which uses the configured AWS_S3_ENDPOINT_URL (data.py:46-80).

  4. Compute worker environment: verify that the compute worker container inherits the correct environment variables (docker-compose.yml:230).
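For step 2, a minimal sketch, assuming Python 3 is available inside the compute worker container: it takes the host and port out of a URL (for example one of the presigned URLs from the run arguments) and attempts a plain TCP connection, which reproduces the `Connection refused` from the worker log without going through Celery or urllib. The helper name `can_reach` is hypothetical and not part of compute_worker.py.

```python
import socket
from urllib.parse import urlsplit


def can_reach(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical debugging helper; not part of compute_worker.py.
    """
    parts = urlsplit(url)
    # Fall back to the scheme's default port when the URL does not specify one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError as exc:
        # ConnectionRefusedError (errno 111): the name resolved but nothing accepted
        # the connection. socket.gaierror: the name did not resolve at all.
        print(f"{parts.hostname}:{port} -> {exc!r}")
        return False
```

Run from inside the worker container, comparing `can_reach("http://docker.for.mac.localhost:9000/")` with `can_reach("http://minio:9000/")` should show which address actually has MinIO listening; the printed exception type separates a DNS problem from a refused connection.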

Possible Solutions

  1. Update Environment Variables: Ensure your .env file uses http://minio:9000/ instead of http://docker.for.mac.localhost:9000/
  2. Network Isolation: Both the Django application and compute worker should use the internal Docker network names rather than host-accessible URLs
  3. URL Replacement: If URLs with docker.for.mac.localhost are still being generated, implement similar replacement logic as shown in the example scripts
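Solution 3 could look roughly like the sketch below. It is not the code in example_submission.py or get_submission_details.py; it only swaps the hostname of a presigned URL and leaves scheme, port, path and query string (AWSAccessKeyId, Signature, Expires) byte-for-byte intact. The mapping `HOST_REWRITES` and the function `rewrite_host` are hypothetical names, and the target `minio` assumes the service name from docker-compose.yml. Whether MinIO still accepts the signature after the host change depends on the signing scheme in use, so treat that as something to verify.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping: hostname seen in presigned URLs -> internal Docker service name.
HOST_REWRITES = {
    "docker.for.mac.localhost": "minio",
}


def rewrite_host(url: str, rewrites: dict = HOST_REWRITES) -> str:
    """Replace the hostname of `url` using `rewrites`; keep scheme, port, path, query, fragment."""
    parts = urlsplit(url)
    new_host = rewrites.get(parts.hostname)
    if new_host is None:
        return url  # host not in the mapping: return the URL unchanged
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    # The query string is passed through without decoding, so %2F, %3D etc. are preserved.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `rewrite_host("http://docker.for.mac.localhost:9000/private/dataset/x.zip?Signature=abc")` returns `"http://minio:9000/private/dataset/x.zip?Signature=abc"`. Parsing with urlsplit rather than a plain string replace avoids touching any occurrence of the hostname that might appear in the path or query.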

The key insight is that containers should communicate with each other using Docker service names (like minio) rather than host-accessible addresses like docker.for.mac.localhost.

Notes

The connection issues arise when the presigned URLs use host-accessible hostnames instead of Docker-internal service names, so requests made from inside the containers never reach MinIO.

Didayolo and others added 2 commits July 31, 2025 17:24
…ring

Revert "Compute worker - Allow to unzip files with parent directory"
Version bump 1.20.0 - 7 August 2025
Didayolo merged commit 9127d57 into master Aug 7, 2025
1 check was pending

Labels

Release PR develop --> master


7 participants