[D] Use a fallback bootstrap script provider if dlang.org is offline #910

Merged
BanzaiMan merged 3 commits into travis-ci:master from wilzbach:increase-timeouts-for-d on Jan 11, 2017

Conversation

@wilzbach
Contributor

@wilzbach wilzbach commented Dec 8, 2016

We sometimes see timeouts in D installation that are due to connection issues to dlang.org, which is queried to receive the newest installation script, e.g. https://travis-ci.org/dlang/dmd/jobs/177747612, https://travis-ci.org/dlang/dmd/jobs/177747323, https://travis-ci.org/dlang/dmd/jobs/169949568

This PR increases the overall retry window from ~45 s to about 9.68 min ((2^9 - 1 + 10 * 7) / 60, similar to the 10-minute default timeout for stalled builds), which will hopefully help in some cases.
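
For illustration, the retry pattern behind that number could look roughly like the following hypothetical sketch (function name and parameters are mine, not the actual travis-build code): sleeping 2^0 + 2^1 + ... + 2^8 = 511 s across 10 attempts, plus up to 10 * 7 s of curl timeout, gives the ~9.68 min above.

```shell
#!/bin/bash
# Hypothetical sketch of retry-with-exponential-backoff (NOT the
# actual travis-build code). Tries to download $1 up to $2 times,
# sleeping base * 2^(attempt-1) seconds between failed attempts.
fetch_with_backoff() {
  local url="$1" max_attempts="${2:-10}" base="${3:-1}" attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    if curl --max-time 7 -fsSL "$url" -o install.sh; then
      return 0  # download succeeded
    fi
    attempt=$((attempt + 1))
    if [ "$attempt" -lt "$max_attempts" ]; then
      # back off exponentially before the next attempt
      sleep $((base * (2 ** (attempt - 1))))
    fi
  done
  return 1  # every attempt failed
}
```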

However we should think about mitigating the connection issues in the first place, by e.g. using a CDN or a more stable location for the installation script.

@ Travis CI maintainers: do you have similar experience with, or advice on, this problem?

CC @MartinNowak @klickverbot

@andralex

andralex commented Dec 8, 2016

Thank you for the thorough investigation.

andralex approved these changes Dec 8, 2016
@andralex

andralex commented Dec 8, 2016

@wilzbach who has the rights to approve this?

@BanzaiMan
Contributor

BanzaiMan commented Dec 8, 2016

Maybe I'm reading this wrong, but the current max timeout is not 45 s, is it? It runs the curl command with a timeout of 5 seconds while sleeping longer and longer between attempts. The step overall would fail after some time (I guess 45 seconds, if I am to believe your calculation), but the per-request timeout is not that.

I don't know the real cause of these problems, but I don't know if this change is going to add much benefit.

  1. Is a longer timeout going to help? If the remote end is not responding in 5 seconds, can we expect to have a better chance with 10 seconds?
  2. Does sleeping in between retries help at all? If the request failed 10 seconds ago, do we expect to have a better chance now?

The real fix should come from dlang.org. Or, store a copy of D lang install script locally, and fall back to it instead.

@wilzbach wilzbach force-pushed the increase-timeouts-for-d branch from 2f5ec6f to 8e02792 on December 9, 2016 17:07
@wilzbach wilzbach changed the title from "Increase timeouts for D installation script" to "[D] Use GitHub as fallback bootstrap script provider if dlang.org is offline" Dec 9, 2016
@wilzbach
Contributor Author

wilzbach commented Dec 9, 2016

Maybe I'm reading this wrong, but the current max timeout is not 45 s, is it? It runs the curl command with a timeout of 5 seconds while sleeping longer and longer between attempts. The step overall would fail after some time (I guess 45 seconds, if I am to believe your calculation), but the per-request timeout is not that.

Sorry, I was referring to the overall time.

I don't know the real cause of these problems, but I don't know if this change is going to add much benefit.

To be honest, me neither - it was the simplest thing that came to my mind.

Is a longer timeout going to help? If the remote end is not responding in 5 seconds, can we expect to have a better chance with 10 seconds?
Does sleeping in between retries help at all? If the request failed 10 seconds ago, do we expect to have a better chance now?

Yeah, I absolutely get your point; my hope was that within a longer sleeping loop or timeout period, dlang.org might become reachable again.

The real fix should come from dlang.org.

Unfortunately that's not within my reach ;/

Or, store a copy of D lang install script locally, and fall back to it instead.

That's a good idea! The script is hosted on GitHub, and as their service tends to be reliable, why not use that directly as a fallback?
-> I changed the PR accordingly.

@MartinNowak
Contributor

Nope! We're not going to use some random master development version of a script for all Travis-CI D users. That opens the door for potential outages and is a giant security risk.
We could create a separate installer_scripts repo and serve the script via github releases.

For now I'd propose we use https://nightlies.dlang.org/install.sh as fallback if https://dlang.org/install.sh fails.

Is a longer timeout going to help? If the remote end is not responding in 5 seconds, can we expect to have a better chance with 10 seconds?

No it wouldn't really help.

Does sleeping in between retries help at all? If the request failed 10 seconds ago, do we expect to have a better chance now?

Yes, it does help for certain transient network issues, and exponential backoff is a common retry pattern for transient faults.
Though we should try the other mirror before sleeping and retrying.
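
That mirror-first ordering could be sketched like this (a hypothetical illustration, not the actual travis-build code; the mirror list is passed as arguments):

```shell
#!/bin/bash
# Hypothetical sketch of "try the other mirror before sleeping":
# each attempt walks the full mirror list, and we only back off
# once every mirror has failed. NOT the actual travis-build code.
fetch_installer() {
  local attempt url
  for attempt in 1 2 3; do
    for url in "$@"; do
      if curl --max-time 7 -fsSL "$url" -o install.sh; then
        return 0  # got the script from one of the mirrors
      fi
    done
    # all mirrors failed on this attempt; wait before retrying
    if [ "$attempt" -lt 3 ]; then sleep "$attempt"; fi
  done
  return 1
}

# e.g.: fetch_installer https://dlang.org/install.sh \
#                       https://nightlies.dlang.org/install.sh
```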

@MartinNowak
Contributor

Did that as a PR for your PR @wilzbach ;), wilzbach#1.

add fallback mirror for more reliable downloads
@wilzbach
Contributor Author

Nope! We're not going to use some random master development version of a script for all Travis-CI D users. That opens the door for potential outages and is a giant security risk.

We could have linked to the stable branch as well. I don't understand why this should be a security risk as only a handful of people have push rights to the installer repo and it's only for the CI.

If you are worried about security, we could consider signing the install.sh - though I am not sure whether it's worth the cost/increased work.

We could create a separate installer_scripts repo and serve the script via github releases.

That would be quite nice. GitHub has awesome reliability and free traffic :)

Though we should try the other mirror before sleeping and retrying.

Thanks a lot!
@BanzaiMan should I squash the changes, or what is your policy here?

@MartinNowak
Contributor

If you are worried about security, we could consider signing the install.sh - though I am not sure whether it's worth the cost/increased work.

I do already sign all releases of the installer script.

@MartinNowak
Contributor

We could have linked to the stable branch as well.

Releases of the installer script are independent of compiler releases, so that won't work.

I don't understand why this should be a security risk as only a handful of people have push rights to the installer repo and it's only for the CI.

Those are quite high-profile access tokens that might be visible in testing environments, so we should have a solid understanding of how the script could be manipulated. Relying on a GitHub workflow not to slip any security issue opens a can of worms.

@BanzaiMan
Contributor

There is no need for squashing, unless you want to. The commit history looks fine to me.

I'll deploy to https://staging.travis-ci.org tomorrow (2016-12-13) at 14:00 UTC.

BanzaiMan added a commit that referenced this pull request Dec 13, 2016
[D] Use GitHub as fallback bootstrap script provider if dlang.org is offline
@BanzaiMan
Contributor

deployed to .org staging

@wilzbach
Contributor Author

deployed to .org staging

Thanks a lot - it seems to work quite well :)

https://staging.travis-ci.org/wilzbach/dlang-tour

@wilzbach wilzbach changed the title from "[D] Use GitHub as fallback bootstrap script provider if dlang.org is offline" to "[D] Use a fallback bootstrap script provider if dlang.org is offline" Dec 29, 2016
@wilzbach
Contributor Author

deployed to .org staging
Thanks a lot - it seems to work quite well :)

Anything else we can do to help to move this forward?

@BanzaiMan BanzaiMan merged commit c90045d into travis-ci:master Jan 11, 2017
@wilzbach wilzbach deleted the increase-timeouts-for-d branch February 24, 2017 20:56