Allow IT Dockerfile to exit after all setup attempts fail #17592
kgyrtkirk merged 21 commits into apache:master from
Conversation
PTAL @Akshat-Jain @cryptoe
Akshat-Jain
left a comment
LGTM, was able to confirm the new behaviour at https://github.com/apache/druid/actions/runs/12429974963/job/34707262605?pr=17592
```diff
-rm -f /root/base-setup.sh
+rm -f /root/base-setup.sh; \
+  if [ "$i" -eq "$SETUP_RETRIES" ]; then \
+    exit 1; \
```
Let's put a message here mentioning that we gave up on retries.
Message like
Tried [$i] times to contact [$url]. Unable to do so. Try a different url or check connectivity
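A minimal sketch of a retry loop with such a give-up message, following the shape of the PR's Dockerfile diff. `retry_setup`, the retry count, and the command being retried are illustrative stand-ins, not the exact merged code:

```shell
#!/bin/sh
# Sketch only: retry a command up to N times, printing a give-up message
# (as suggested above) and exiting non-zero once all attempts fail.
retry_setup() {
  retries=$1
  shift
  i=1
  while [ "$i" -le "$retries" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -eq "$retries" ]; then
      # The message asked for here: say we gave up, and after how many tries.
      echo "Tried [$i] times to run [$*]. Unable to do so. Check connectivity." >&2
      return 1
    fi
    i=$((i + 1))
  done
}

# Hypothetical usage (stand-in for: retry_setup "$SETUP_RETRIES" /root/base-setup.sh)
retry_setup 3 true && echo "setup succeeded"
```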
This seems like a fix-of-the-fix already (#17543).
I think this retry logic is misplaced; it seems to me that the original problem was that wget times out.
Retrying the whole script will re-run all of its steps even when some other issue happens...
I wonder why we don't specify retry conditions for wget instead?
https://unix.stackexchange.com/questions/227665/wget-abort-retrying-after-failure-or-timeout
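For reference, the wget knobs that control retrying look roughly like this. This is an illustrative helper with example flag values, not anything from the PR:

```shell
#!/bin/sh
# Illustrative only: a download helper showing the wget options that
# control retry behavior, per the linked discussion. Flag values are
# examples, not the PR's settings.
fetch_with_retries() {
  # --tries=20           retry budget (20 is also wget's default)
  # --waitretry=10       cap the linear backoff between retries at 10s
  # --retry-connrefused  also retry when the connection is refused
  # --timeout=30         DNS/connect/read timeout, 30s each
  wget --tries=20 --waitretry=10 --retry-connrefused --timeout=30 \
       --output-document="$1" "$2"
}
```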
Hi @kgyrtkirk, really appreciate the tip. I will look into experimenting with wget retry conditions and update if it helps with our current issue.
Hi @kgyrtkirk @cryptoe @Akshat-Jain, I did some research into the wget commands, PTAL at my findings:
Error Logging
Turns out that our current wget command is called with the -q option, which silences all wget logs. Removing the -q option temporarily (I added it back after debugging; let me know if it's better to simply not silence the logs) caused these logs to surface:
- wget success
#8 15.32 --2025-01-13 07:50:55-- https://downloads.apache.org/zookeeper/zookeeper-3.8.4/apache-zookeeper-3.8.4-bin.tar.gz
#8 15.33 Resolving downloads.apache.org (downloads.apache.org)... 88.99.208.237, 135.181.214.104, 2a01:4f9:3a:2c57::2, ...
#8 15.34 Connecting to downloads.apache.org (downloads.apache.org)|88.99.208.237|:443... connected.
#8 15.76 HTTP request sent, awaiting response... 200 OK
- wget failure
#8 14.55 --2025-01-13 07:42:57-- https://downloads.apache.org/zookeeper/zookeeper-3.8.4/apache-zookeeper-3.8.4-bin.tar.gz
#8 14.56 Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.208.237, 2a01:4f8:10a:39da::2, ...
#8 14.56 Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... failed: Connection timed out.
#8 149.2 Connecting to downloads.apache.org (downloads.apache.org)|88.99.208.237|:443... failed: Connection timed out.
#8 284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f8:10a:39da::2|:443... failed: Network is unreachable.
#8 284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f9:3a:2c57::2|:443... failed: Network is unreachable.
#8 ERROR: process "/bin/sh -c APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh" did not complete successfully: exit code: 4
------
> [druidbase 3/3] RUN APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org /root/base-setup.sh && rm -f /root/base-setup.sh:
14.04 Setting up python (2.7.16-1) ...
14.05 Setting up python-pkg-resources (40.8.0-1) ...
14.16 Setting up python-meld3 (1.0.2-2) ...
14.23 Setting up supervisor (3.3.5-1) ...
14.42 invoke-rc.d: could not determine current runlevel
14.43 invoke-rc.d: policy-rc.d denied execution of start.
14.52 Processing triggers for libc-bin (2.28-10+deb10u1) ...
failed: Connection timed out.
284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f8:10a:39da::2|:443... failed: Network is unreachable.
284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f9:3a:2c57::2|:443... failed: Network is unreachable.
------
Dockerfile:28
--------------------
26 | ARG ZK_VERSION
27 | ARG APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org
28 | >>> RUN APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh
29 |
30 |
--------------------
ERROR: failed to solve: process "/bin/sh -c APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh" did not complete successfully: exit code: 4
Findings
- According to the wget manual, wget will conduct up to 20 retries by default (configurable with `--tries=n`).
- Between each retry, wget employs a linear backoff of 1, 2, ... up to 10 seconds.
- The retry condition does not apply to fatal errors like "connection refused" or "not found".
- From the failure case, we get "Network is unreachable" errors when calling IPv6 addresses.
- These errors may cause wget to not trigger the default retry mechanism.

Actions
- Restrict `wget` to only call IPv4 addresses; this both focuses on the right addresses to call and allows for retries.
- Add `--continue` to prevent `wget` retries from downloading from scratch.
- Revert the 3 retries for the entire `base-setup.sh` script. Chances are that if our `wget` fails a total of 20 times, the host is probably overloaded with download requests, and a total of 60 `wget` attempts may exacerbate the situation.
- Add a message to indicate whether wget succeeded or failed.
```sh
local dest=$1
local host=$2

if ! wget --inet4-only --quiet --continue --output-document="$dest" "$host"
```
thank you for the details/changes; looks better!
I feel like the current changes will still be vulnerable to the same issues you saw before you started fixing this...
You could simulate connection-refused and similar errors with iptables commands like:
iptables -A OUTPUT -d 135.181.214.104 -j DROP
iptables -A OUTPUT -d 88.99.208.237 -j DROP
or using a tool like saboteur
- I think you could use `--retry-connrefused` and/or other options to force connection retries?
- Restricting to IPv4 could possibly hit back later... I don't think it should be restricted to that.
- Instead of a custom message, wouldn't it be enough to use `-nv`?
- nit: why did you add `--continue`? Could it already be there?
Hi @kgyrtkirk, thanks for the tip on simulating network problems; I will try to reproduce the issue that way. Also, good points about using --retry-connrefused instead of restricting to IPv4, and -nv instead of --quiet.
About the nit: --continue was added as an attempt to address the timeout problem. I read that it lets wget resume a partial download, so I added it to reduce the chance that wget is making progress but the download is too slow to finish within the timeout.
I have made the changes to the wget options, and we can now see that using --retry-connrefused allows wget to make [20 retries](https://github.com/apache/druid/actions/runs/12764706329/job/35581015481?pr=17592), whereas prior to this PR there were no retries at all.
However, I found that using -nv may confuse people. PTAL at the wget output for -nv:
#8 14.22 Processing triggers for libc-bin (2.28-10+deb10u1) ...
#8 148.7 failed: Connection timed out.
#8 283.9 failed: Connection timed out.
#8 283.9 failed: Network is unreachable.
#8 283.9 failed: Network is unreachable.
#8 419.0 failed: Connection timed out.
#8 554.2 failed: Connection timed out.
#8 554.2 failed: Network is unreachable.
#8 554.2 failed: Network is unreachable.
For example, in this case:
- It is not obvious that we are using `wget` to download ZooKeeper.
- The IP addresses are not shown; this may give users the impression that `wget` is alternating between "Connection timed out" and "Network is unreachable" for the same IP host.
As such, I have changed the log settings to the default (--verbose).
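Putting the revisions above together, the download helper presumably ended up shaped roughly like this. This is a hedged sketch based on the discussion (the `download_file`/`dest`/`host` names follow the diff's style); the exact merged option set may differ:

```shell
#!/bin/sh
# Sketch of the revised helper: --retry-connrefused makes refused
# connections retryable, --continue resumes partial downloads, and the
# default (verbose) logging keeps the URL and per-IP attempts visible.
download_file() {
  dest=$1
  host=$2
  if ! wget --retry-connrefused --continue --output-document="$dest" "$host"; then
    echo "Failed to download [$host]" >&2
    return 1
  fi
  echo "Downloaded [$host] to [$dest]"
}
```

Compared with the earlier `--inet4-only --quiet` version, this keeps retries working without restricting the address family and without silencing the logs.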
kgyrtkirk
left a comment
looks much better - I guess now it would fail correctly as well if all 20 retries are exhausted
Description
Presently the Integration Test Dockerfile continues execution even after 3 failures of `base-setup.sh` and does not exit immediately. This PR aims to patch that up by forcing Docker to exit after 3 failures.

This PR has: