steps: add internal for GCP and Azure #9553

abhinavdahiya · 2020-06-09T01:20:25Z

Currently, the CI pod that runs the installer, openshift-test, and oc (for gathering) needs access to the following:

- the cluster that was created: the kube API and the openshift-ingress
- the cloud APIs
- DNS resolution of both types of endpoints mentioned above

The first is about giving the CI pod access to internal network IPs. The second is not strictly necessary for internal clusters, but for certain testing, such as AWS GovCloud emulation testing, we need to make sure all requests originate from inside the network itself so that they use PrivateLink service endpoints. The last is important because an internal cluster does not publish its DNS names to the Internet: DNS resolution of api.clusterdomain, for example, is only possible from inside the cluster, and this also affects the usage of PrivateLink service endpoints.

So I started with 3 ideas and tried to implement them using what we have.

1. I started by looking into fully transparent VPNs like [sshuttle](https://github.com/sshuttle/sshuttle) or OpenVPN. Using these should force all traffic through the VPN into the internal network and also provide DNS resolution. But trying to use them in a container, especially on an OpenShift cluster, ran into problems:

- Both of them require sudo/root privileges plus NET_CAP-like capabilities in the pod. With OpenShift this is very difficult to achieve, as far as I have read.
- They also require modifying network devices, and I could not find a way to even test this on an OpenShift cluster.

2. Then I started looking into ssh + rsync. For a given step, we take the commands.sh and use rsync to copy it, plus any binaries we need, to a temporary workspace on the bastion host. We also sync the shared_dir, the artifacts_dir, and any required environment variables to ensure we can run the step properly. This is the approach I have taken in the commit, but it has a lot of downsides:

- It requires a lot of wrapping and needs an alternate step with jump handling for each step today. https://issues.redhat.com/browse/DPTP-1259 can help reduce one type of duplication, but there is still the question of which binaries to copy, which env variables to transfer, and what to do if other files need to be copied.
- The shared_dir and artifacts_dir syncing is a little brittle. The implementation should be good enough, but there is a little more to be desired in terms of retries etc.
- The tests require 3 CPUs and 4 GB of RAM, which means we can't share bastions and might have to bring one up per CI job.
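The ssh + rsync flow described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the `run_on_jump_host` helper, the remote workspace layout, and the `REMOTE`, `SHARED_DIR`, and `ARTIFACT_DIR` variable names are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a step's commands.sh on the jump host.
# REMOTE (user@host), SHARED_DIR and ARTIFACT_DIR are assumed to be set by the caller.
run_on_jump_host() {
  local script="$1" workdir rc=0
  # Create a temporary workspace on the jump host.
  workdir="$(ssh "${REMOTE}" mktemp -d)" || return 1
  # Copy the step script and the shared directory into that workspace.
  rsync -az "${script}" "${REMOTE}:${workdir}/commands.sh" || return 1
  rsync -az "${SHARED_DIR}/" "${REMOTE}:${workdir}/shared/" || return 1
  # Run the step remotely with SHARED_DIR and ARTIFACT_DIR pointing at the workspace.
  ssh "${REMOTE}" "mkdir -p '${workdir}/artifacts' && SHARED_DIR='${workdir}/shared' ARTIFACT_DIR='${workdir}/artifacts' bash '${workdir}/commands.sh'" || rc=$?
  # Sync results back even when the step failed, then propagate its exit code.
  rsync -az "${REMOTE}:${workdir}/shared/" "${SHARED_DIR}/" || true
  rsync -az "${REMOTE}:${workdir}/artifacts/" "${ARTIFACT_DIR}/" || true
  return "${rc}"
}
```

The sketch does not copy binaries or retry failed syncs, which are exactly the gaps listed in the downsides above.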

3. The last idea I have is using a SOCKS5 proxy plus a pod dnsPolicy forced to the DNS resolver of the internal network.

Most of our binaries are Go binaries, and Go supports HTTP_PROXY="socks5://<>". This would allow us to access the internal IPs over the proxy, as long as we wrap the commands.sh with a local ssh proxy.

The next requirement is to force DNS to that of the internal network. K8s pods allow defining dnsPolicy and nameservers for the pod.

dnsPolicy: None
dnsConfig:
  nameservers:
  - <the chosen one, probably needs to be dynamic per the steps>

This should technically allow the pod to use the bastion's DNS resolvers, which are already set up to resolve internal services of our pre-existing networks. It requires that ci-operator can read these values from somewhere like SHARED_DIR and then set up DNS for the pod running the step accordingly.
There is one downside: no container in that pod will have access to the CI cluster's services, so the current implementation of shared-secrets-copier might have to be updated.
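Wrapping a command with a local ssh SOCKS proxy, as idea 3 proposes, could look something like the sketch below. The helper names `socks_proxy_url` and `with_socks_proxy`, the control-socket path, and the default port are assumptions for illustration only. It also sets HTTPS_PROXY in addition to the HTTP_PROXY mentioned above, on the assumption that most of the traffic is HTTPS and Go consults HTTPS_PROXY for those requests.

```shell
#!/usr/bin/env bash
# Hypothetical helper: URL of a SOCKS5 listener on localhost.
socks_proxy_url() {
  local port="${1:-1080}"
  printf 'socks5://127.0.0.1:%s' "${port}"
}

# Hypothetical wrapper: open a dynamic (SOCKS) forward through the jump host,
# run the given command with the proxy variables set, then close the tunnel.
with_socks_proxy() {
  local remote="$1" port="$2" rc=0
  shift 2
  # Assumed location for the ssh control socket used to stop the tunnel later.
  local sock="${TMPDIR:-/tmp}/jump-socks-$$.sock"
  # -D opens the local SOCKS listener, -N runs no remote command,
  # -f backgrounds ssh once the connection is up, -M/-S create a control socket.
  ssh -f -N -M -S "${sock}" -o ExitOnForwardFailure=yes \
    -D "127.0.0.1:${port}" "${remote}" || return 1
  HTTP_PROXY="$(socks_proxy_url "${port}")" \
  HTTPS_PROXY="$(socks_proxy_url "${port}")" \
    "$@" || rc=$?
  # Ask the backgrounded ssh to exit through its control socket.
  ssh -S "${sock}" -O exit "${remote}" 2>/dev/null || true
  return "${rc}"
}
```

A step would then call something like `with_socks_proxy "${REMOTE}" 1080 bash commands.sh`. The proxy only covers clients that honor these variables, and name resolution still depends on the dnsConfig change described above.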

depends on #9640
long term improvements can be tracked in https://issues.redhat.com/browse/DPTP-1260

abhinavdahiya · 2020-06-09T05:20:53Z

/retest

openshift-ci-robot · 2020-06-18T18:09:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/config/openshift/installer/OWNERS~~ [abhinavdahiya]
~~ci-operator/jobs/openshift/installer/OWNERS~~ [abhinavdahiya]
~~ci-operator/step-registry/gather/OWNERS~~ [abhinavdahiya]
~~ci-operator/step-registry/ipi/OWNERS~~ [abhinavdahiya]
~~ci-operator/step-registry/openshift/OWNERS~~ [abhinavdahiya]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jstuever

I'm not a huge fan of forking all of these -commands.sh. When we did this with templates, we ended up with huge variations between the templates. I would be more comfortable if we could somehow wrap this into the original -commands.sh to avoid drift.

I have concerns about how the cat calls handle variables, doubly so if CI is replacing the variables prior to running the scripts (like it did in templates).

jstuever · 2020-06-18T20:58:34Z

ci-operator/step-registry/ipi/conf/azure/jump/ipi-conf-azure-jump-commands.sh

Let's make sure we have a tech debt card to handle moving these.

Maybe have this card also look for other modifications to install-config.yaml that could benefit from yq.

jstuever · 2020-06-18T21:05:31Z

ci-operator/step-registry/gather/extra/jump/gather-extra-jump-commands.sh

Are the variables in this block being translated during the cat call prior to being written to disk? If so, how does that affect them running on the jump host? The same thing applies to the other -commands.sh in this PR.

> Are the variables in this block being translated during the cat call prior to being written to disk?

https://stackoverflow.com/a/22698106
'EOF' makes it so that NO expansion will happen.

Perfect. I wasn't aware of the functionality change when adding the quotes.
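For reference, the difference between a quoted and an unquoted here-document delimiter can be seen with a small standalone example. The function names and the `GREETING` variable are made up for illustration and do not come from the scripts in this PR.

```shell
#!/usr/bin/env bash
GREETING="expanded locally"

# Unquoted delimiter: ${GREETING} is expanded while the here-document is read.
render_unquoted() {
  cat <<EOF
echo "${GREETING}"
EOF
}

# Quoted delimiter: the body is kept literally, so any expansion happens later,
# in whichever shell eventually runs the generated script.
render_quoted() {
  cat <<'EOF'
echo "${GREETING}"
EOF
}

render_unquoted
render_quoted
```

With the quoted form, variables in the generated script are resolved on the jump host rather than in the CI pod.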

jstuever · 2020-06-18T21:09:39Z

ci-operator/step-registry/gather/extra/jump/gather-extra-jump-commands.sh

gate: `if [[ ! -s "${SHARED_DIR}/jump-host.txt" ]]`

Also in the other files.

Any reason why we would do this? I don't want to skip silently.

This will prevent ssh from failing when ${REMOTE} is an empty string and throwing a possibly misleading error message. Better to gate here with a clear error of why.

added with eec6cc0
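A gate of this kind might look like the sketch below. It is not the code from eec6cc0: the `read_jump_host` helper and the error text are assumptions, and only the `${SHARED_DIR}/jump-host.txt` path comes from the discussion above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the jump host recorded in ${SHARED_DIR}/jump-host.txt,
# failing with a clear message when the file is missing or empty.
read_jump_host() {
  local file="${SHARED_DIR}/jump-host.txt"
  if [[ ! -s "${file}" ]]; then
    echo "ERROR: ${file} is missing or empty; cannot determine the jump host" >&2
    return 1
  fi
  # Print the first line of the file.
  head -n 1 "${file}"
}
```

A step could then use `REMOTE="$(read_jump_host)" || exit 1`, so that it fails loudly instead of calling ssh with an empty host.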

abhinavdahiya · 2020-06-18T23:39:30Z

> I'm not a huge fan of forking all of these -commands.sh. When we did this with templates, we ended up with huge variants between the templates. I would be more comfortable if we could somehow wrap this into the original -commands.sh to avoid drift.

See the PR description for the reasoning why this is being done this way.

> Requires a lot of wrapping and needs an alternate step with jump handling for each step today. https://issues.redhat.com/browse/DPTP-1259 can help reduce one type of duplication, but there is still the question of which binaries to copy, which env variables to transfer, and what to do if other files need to be copied.

DPTP-1259 would allow us to symlink the script to this step and use an input, and then we don't need the local copy.

Updating the original install step would just make that step too complicated.

jstuever · 2020-06-19T18:10:37Z

> > I'm not a huge fan of forking all of these -commands.sh. When we did this with templates, we ended up with huge variants between the templates. I would be more comfortable if we could somehow wrap this into the original -commands.sh to avoid drift.
>
> See the PR description for the reasoning why this is being done this way.
>
> > Requires a lot of wrapping and needs an alternate step with jump handling for each step today. https://issues.redhat.com/browse/DPTP-1259 can help reduce one type of duplication, but there is still the question of which binaries to copy, which env variables to transfer, and what to do if other files need to be copied.
>
> DPTP-1259 would allow us to symlink the script to this step and use an input, and then we don't need the local copy.
>
> Updating the original install step would just make that step too complicated.

I didn't intend to block on this; I was voicing my concern to help prioritize the work to fix it.


openshift-ci-robot · 2020-10-09T00:38:08Z

@abhinavdahiya: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/open-cluster-management/registration-operator/master/e2e | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/cri-o/cri-o/release-1.13/e2e-aws | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-aws-sdn-multi | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-vsphere | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/installer/master/e2e-gcp | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/installer/master/e2e-gcp-upi | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/installer/master/e2e-azure | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-etcd-operator/master/e2e-aws | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/installer/master/e2e-vsphere | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-windows-hybrid-network | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/opendatahub-io/odh-manifests/master/odh-manifests-e2e | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-aws-sdn-single | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-ovn-step-registry | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-ovn-hybrid-step-registry | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift-knative/serverless-operator/master/4.3-upgrade-tests-aws-ocp-43 | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/openshift-tests-private/master/e2e-aws | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift-knative/serverless-operator/master/4.3-e2e-aws-ocp-43 | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | `/test pj-rehearse` |
| ci/rehearse/openshift/installer/release-4.5/e2e-azure-internal | 2b464fab5fdaada1539208f9d62a73c1990c958e | link | `/test pj-rehearse` |
| ci/rehearse/openshift/installer/master/e2e-gcp-shared-vpc | ff736c276fb1a4b8d1182e7b3a5e02a62c510053 | link | `/test pj-rehearse` |
| ci/prow/step-registry-metadata | `eec6cc0` | link | `/test step-registry-metadata` |
| ci/prow/ci-testgrid-allow-list | `eec6cc0` | link | `/test ci-testgrid-allow-list` |
| ci/prow/yamllint | `eec6cc0` | link | `/test yamllint` |
| ci/prow/boskos-config | `eec6cc0` | link | `/test boskos-config` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-merge-robot · 2020-11-10T11:53:58Z

@abhinavdahiya: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/release-config | `eec6cc0` | link | `/test release-config` |
| ci/prow/boskos-config-generation | `eec6cc0` | link | `/test boskos-config-generation` |
| ci/prow/secret-generator-config-valid | `eec6cc0` | link | `/test secret-generator-config-valid` |
| ci/prow/deprecate-templates | `eec6cc0` | link | `/test deprecate-templates` |


openshift-ci · 2021-01-14T01:14:00Z

@abhinavdahiya: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/ci-secret-generator-config | `eec6cc0` | link | `/test ci-secret-generator-config` |


openshift-bot · 2021-04-14T12:59:48Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci · 2021-04-14T12:59:55Z

@abhinavdahiya: PR needs rebase.


mtnbikenc · 2021-04-21T20:58:38Z

/uncc

openshift-bot · 2021-05-21T22:48:29Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2021-06-21T01:35:01Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2021-06-21T01:35:10Z

@openshift-bot: Closed this PR.


openshift-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Jun 9, 2020

openshift-ci-robot requested review from mtnbikenc and patrickdillon June 9, 2020 01:20

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jun 9, 2020

abhinavdahiya force-pushed the internal-clusters branch 4 times, most recently from 09d03f3 to 11c9cf5 Compare June 9, 2020 03:10

abhinavdahiya force-pushed the internal-clusters branch 21 times, most recently from f02dc1f to 296b2f0 Compare June 10, 2020 23:14

wking mentioned this pull request Jun 18, 2020

step-registry: setup the commands to allow running on bastion #9640

Closed

abhinavdahiya force-pushed the internal-clusters branch from 3ffb142 to e76d7f7 Compare June 18, 2020 18:16

jstuever suggested changes Jun 18, 2020


openshift-ci-robot assigned jstuever Jun 18, 2020

jstuever removed their assignment Jun 18, 2020

abhinavdahiya added 6 commits June 22, 2020 14:24

config/openshift/installer: add azure internal ci

d020ff1

config/openshift/installer: add gcp internal ci

b09a136

steps: add internal configuration for Azure, GCP using jump-host

aada033

installer: makes azure, gcp internal jobs optional

9efddd7

steps: exit if jump-host.txt is empty

eec6cc0

abhinavdahiya force-pushed the internal-clusters branch from e76d7f7 to eec6cc0 Compare June 22, 2020 21:30

openshift-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) Apr 14, 2021

openshift-ci bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Apr 14, 2021

openshift-ci-robot removed the request for review from mtnbikenc April 21, 2021 20:58

openshift-ci bot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label May 21, 2021

openshift-ci bot closed this Jun 21, 2021


6 participants
