Skip to content

Allow IT Dockerfile to exit after all setup attempts fail#17592

Merged
kgyrtkirk merged 21 commits intoapache:masterfrom
GWphua:allow-IT-dockerfile-to-exit
Jan 15, 2025
Merged

Allow IT Dockerfile to exit after all setup attempts fail#17592
kgyrtkirk merged 21 commits intoapache:masterfrom
GWphua:allow-IT-dockerfile-to-exit

Conversation

@GWphua
Copy link
Copy Markdown
Contributor

@GWphua GWphua commented Dec 20, 2024

Description

Presently the Integration Test Dockerfile continues execution even after 3 failures of base-setup.sh, and do not exit immediately. This PR aims to patch that up, by forcing Docker to exit after 3 failures.

This PR has:

  • been self-reviewed.

@GWphua GWphua changed the title Allow it dockerfile to exit Allow IT Dockerfile to exit after all setup attempts fail Dec 20, 2024
@GWphua
Copy link
Copy Markdown
Contributor Author

GWphua commented Dec 20, 2024

PTAL @Akshat-Jain @cryptoe

Copy link
Copy Markdown
Contributor

@Akshat-Jain Akshat-Jain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread integration-tests/docker/Dockerfile Outdated
rm -f /root/base-setup.sh
rm -f /root/base-setup.sh; \
if [ "$i" -eq "$SETUP_RETRIES" ]; then \
exit 1; \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets put a message here mentioning gave up on retries.

Copy link
Copy Markdown
Contributor

@cryptoe cryptoe Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Message like
Tried [$i] times to contact [$url]. Unable to do so. Try a different url or check connectivity

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a fix-of-the-fix already #17543

I think this retry logic is completely misplaced; it seems to me that the original problem was that wget times out?
I think running all steps of the script will retry it even if other issues happen ...

I wonder why not specify some retry conditions for wget instead ?
https://unix.stackexchange.com/questions/227665/wget-abort-retrying-after-failure-or-timeout

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kgyrtkirk, really appreciate the tip. I will look into experimenting with wget retry conditions and update if it helps with our current issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kgyrtkirk @cryptoe @Akshat-Jain, I did some research into the wget commands, PTAL at my findings:

Error Logging

Turns out that our current wget command is called with the -q option, hence silencing all wget logs. Removing the -q option temporarily (I added it back after debugging, lmk if its better to simply not silence the logs) caused these logs to surface:

  1. wget success
#8 15.32 --2025-01-13 07:50:55--  https://downloads.apache.org/zookeeper/zookeeper-3.8.4/apache-zookeeper-3.8.4-bin.tar.gz
#8 15.33 Resolving downloads.apache.org (downloads.apache.org)... 88.99.208.237, 135.181.214.104, 2a01:4f9:3a:2c57::2, ...
#8 15.34 Connecting to downloads.apache.org (downloads.apache.org)|88.99.208.237|:443... connected.
#8 15.76 HTTP request sent, awaiting response... 200 OK
  1. wget failure
#8 14.55 --2025-01-13 07:42:57--  https://downloads.apache.org/zookeeper/zookeeper-3.8.4/apache-zookeeper-3.8.4-bin.tar.gz
#8 14.56 Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.208.237, 2a01:4f8:10a:39da::2, ...
#8 14.56 Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... failed: Connection timed out.
#8 149.2 Connecting to downloads.apache.org (downloads.apache.org)|88.99.208.237|:443... failed: Connection timed out.
#8 284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f8:10a:39da::2|:443... failed: Network is unreachable.
#8 284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f9:3a:2c57::2|:443... failed: Network is unreachable.
#8 ERROR: process "/bin/sh -c APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh" did not complete successfully: exit code: 4
------
 > [druidbase 3/3] RUN APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org /root/base-setup.sh && rm -f /root/base-setup.sh:
14.04 Setting up python (2.7.16-1) ...
14.05 Setting up python-pkg-resources (40.8.0-1) ...
14.16 Setting up python-meld3 (1.0.2-2) ...
14.23 Setting up supervisor (3.3.5-1) ...
14.42 invoke-rc.d: could not determine current runlevel
14.43 invoke-rc.d: policy-rc.d denied execution of start.
14.52 Processing triggers for libc-bin (2.28-10+deb10u1) ...
failed: Connection timed out.
284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f8:10a:39da::2|:443... failed: Network is unreachable.
284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f9:3a:2c57::2|:443... failed: Network is unreachable.
------
Dockerfile:28
--------------------
  26 |     ARG ZK_VERSION
  27 |     ARG APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org
  28 | >>> RUN APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh
  29 |     
  30 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh" did not complete successfully: exit code: 4

Findings

  1. According to the wget manual, wget will conduct up to 20 retries by default. (configurable with --retries=n)
  2. Between each retry, wget employs a linear backoff 1,2,... up to 10 seconds.
  3. The retry condition does not apply to fatal errors like "connection refused" or "not found".
  4. From the failure case, we get Network is unreachable errors when calling on IPv6 addresses.
  5. These errors may cause wget to not trigger the default retry mechanism.

Actions

  1. Restrict wget to only call IPv4 addresses helps to both focus on the right addresses to call, and allow for retries.
  2. Add --continue to prevent wget retries to download from scratch.
  3. Revert the 3 retries for the entire base-setup.sh script. Chances are that if our wget fails a total of 20 times, the host is probably overloaded with download requests. A total of 60 wget may exacerbate the situation.
  4. Added a message to indicate wget succeeded / failed.

Comment thread integration-tests/docker/base-setup.sh Outdated
local dest=$1
local host=$2

if ! wget --inet4-only --quiet --continue --output-document="$dest" "$host"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the details/changes; looks better!

I feel like the current changes will be volatile to the same issues you've seen before you've started fixing this...
you could simulate connection refused/etc error-s with iptables commands like:

iptables -A OUTPUT -d 135.181.214.104 -j DROP
iptables -A OUTPUT -d 88.99.208.237 -j DROP

or using a tool like saboteur

  • I think you could use --retry-connrefused and/or other options to force connection retries?
  • restricting to ipv4 could possibly hit back later....I don't think it should be restricted to that
  • instead of a custom message; wouldn't it be enough to use -nv ?
  • nit: why did you added --continue? could it be already there?

Copy link
Copy Markdown
Contributor Author

@GWphua GWphua Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kgyrtkirk, thanks for the tip in simulating network problems, I will try to simulate the problem with your tip. Also, good points made about using --retry-connrefused instead of restricting ipv4, and -nv instead of --quiet.

About the nit:
--continue is added as an attempt to target the timeout problem. I read that it will help wget to resume downloading the file, so I added it in to reduce the chances that wget is working, but the download speed is too slow to satisfy the timeout.

Copy link
Copy Markdown
Contributor Author

@GWphua GWphua Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the changes to the wget options, and we can now see that using --retry-connrefused will allow us to wget with [20 retries] (https://github.com/apache/druid/actions/runs/12764706329/job/35581015481?pr=17592) instead of not retrying prior to this PR.

However, I found that using -nv may confuse people. PTAL at the wget output for -nv:

#8 14.22 Processing triggers for libc-bin (2.28-10+deb10u1) ...
#8 148.7 failed: Connection timed out.
#8 283.9 failed: Connection timed out.
#8 283.9 failed: Network is unreachable.
#8 283.9 failed: Network is unreachable.
#8 419.0 failed: Connection timed out.
#8 554.2 failed: Connection timed out.
#8 554.2 failed: Network is unreachable.
#8 554.2 failed: Network is unreachable.

For example, in this case, it is not obvious that:

  1. We are using wget to download zookeeper.
  2. The IP addresses are not shown, this may give users an impression that wget is alternating between Connection timed out and Network is unreachable for the same IP host.

As such, I have changed the log settings to the default (--verbose).

Copy link
Copy Markdown
Member

@kgyrtkirk kgyrtkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks much better - I guess now it would fail correctly as well if all 20 retries are exhausted

@kgyrtkirk kgyrtkirk merged commit 863e91d into apache:master Jan 15, 2025
@GWphua GWphua deleted the allow-IT-dockerfile-to-exit branch January 20, 2025 01:40
317brian pushed a commit to 317brian/druid that referenced this pull request Jan 28, 2025
@kgyrtkirk kgyrtkirk added this to the 33.0.0 milestone Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants