@abhinavdahiya
Contributor

@abhinavdahiya abhinavdahiya commented Jun 9, 2020

Currently, the CI pod that runs the installer/openshift-test/oc (gathering) needs access over the internet to several things:

  • the cluster that was created: the kube API and the openshift-ingress
  • the cloud APIs
  • DNS resolution of both types of endpoints mentioned above

The first is about giving the CI pod access to internal network IPs. The second is not necessary for internal clusters specifically, but for certain testing, such as AWS GovCloud emulation testing, we need to make sure that all requests originate from inside the network itself so that they use PrivateLink service endpoints. The last is important because an internal cluster does not publish its DNS names to the Internet: resolving api.clusterdomain, for example, is only possible from inside the cluster's network, and it also affects the use of PrivateLink service endpoints.

So I started with three ideas and tried to implement them using what we have.

  1. I first looked into fully transparent VPNs like sshuttle (https://github.com/sshuttle/sshuttle) or OpenVPN.
    Using these should force all traffic through the VPN into the internal network and also provide DNS resolution.
    But when trying to use them in a container, especially on an OpenShift cluster:
      • Both require sudo/root privileges plus capabilities such as NET_ADMIN in the pod. As far as I have read, this is very difficult to achieve on OpenShift.
      • They also require modifying network devices, and I could not find a way to even test this on an OpenShift cluster.
  2. Then I looked into ssh+rsync.
    For a given step, we take the commands.sh and use rsync to copy it, plus any binaries we need, to a temporary workspace on the bastion host. We also sync shared_dir, artifacts_dir, and any required environment variables to ensure the step can run properly.
    This is the approach I have taken in this commit, but it has a lot of downsides:
      • It requires a lot of wrapping and needs an alternate step with jump handling for each step we have today. https://issues.redhat.com/browse/DPTP-1259 can help reduce one type of duplication, but we still have to decide which binaries to copy, which env variables to transfer, and what to do about any other files that need to be copied.
      • The shared_dir and artifacts_dir syncing is a little brittle. The implementation should be good enough, but it leaves a little more to be desired in terms of retries etc.
      • The tests require 3 CPUs and 4 GB of RAM, which means we can't share bastions and might have to bring one up per CI job.
  3. The last idea I have is using a SOCKS5 proxy plus a pod dnsPolicy forced to the DNS resolver of the internal network.

Most of our binaries are Go binaries, and Go supports HTTP_PROXY="socks5://<>". This would allow us to access the internal IPs over the proxy, as long as we wrap the commands.sh with a local ssh proxy.
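A minimal sketch of that wrapping, assuming the bastion is reachable over SSH and that REMOTE holds its user@host. The helper names and port 1080 are hypothetical, and exporting HTTPS_PROXY alongside HTTP_PROXY is an assumption here, since Go's default transport consults HTTPS_PROXY for https requests:

```shell
#!/usr/bin/env bash

# Builds the proxy URL that Go's net/http reads from the environment.
socks_proxy_url() {
  local port="$1"
  printf 'socks5://127.0.0.1:%s' "${port}"
}

# Starts a background SSH dynamic (SOCKS) forward through the bastion.
# -D opens a local SOCKS listener, -N runs no remote command,
# -f sends ssh to the background once authentication succeeds.
start_socks_proxy() {
  local remote="$1" port="$2"
  ssh -o ExitOnForwardFailure=yes -f -N -D "127.0.0.1:${port}" "${remote}"
}

# Hypothetical wrapper around a step's commands:
#   start_socks_proxy "${REMOTE}" 1080
#   HTTP_PROXY="$(socks_proxy_url 1080)"
#   HTTPS_PROXY="${HTTP_PROXY}"
#   export HTTP_PROXY HTTPS_PROXY
#   ./commands.sh
```

Only processes that honor the proxy environment variables go through the tunnel. Anything else in the step would still need the rsync approach or another workaround.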

The next requirement is to force DNS to that of the internal network. K8s pods allow defining dnsPolicy and nameservers for the pod.

dnsPolicy: None
dnsConfig:
  nameservers:
  - <the chosen one, probably needs to be dynamic per the steps>

This should technically allow the pod to use the bastion DNS resolvers that are already set up to resolve internal services of our pre-existing networks. It requires that ci-operator can read these values from somewhere like SHARED_DIR and then set up DNS for the pod running the step accordingly.
There is one downside: no containers in that pod will have access to the CI cluster's services, so the current implementation of shared-secrets-copier might have to be updated.
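As a rough illustration of the dynamic part, the DNS fragment could be rendered from a file that an earlier step writes into SHARED_DIR. This is only a shell sketch of the idea: the real change would live in ci-operator, and the file name internal-nameservers.txt and the function name are hypothetical.

```shell
#!/usr/bin/env bash

# Emits a pod-spec DNS fragment from a file that lists one
# nameserver IP per line. Empty lines are skipped.
render_dns_config() {
  local file="$1" ns
  printf 'dnsPolicy: None\ndnsConfig:\n  nameservers:\n'
  while IFS= read -r ns || [ -n "${ns}" ]; do
    if [ -n "${ns}" ]; then
      printf '  - %s\n' "${ns}"
    fi
  done < "${file}"
  return 0
}

# Hypothetical use, once a previous step has written the resolver address:
#   render_dns_config "${SHARED_DIR}/internal-nameservers.txt"
```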

depends on #9640
long term improvements can be tracked in https://issues.redhat.com/browse/DPTP-1260

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2020
@abhinavdahiya abhinavdahiya force-pushed the internal-clusters branch 4 times, most recently from 09d03f3 to 11c9cf5 Compare June 9, 2020 03:10
@abhinavdahiya
Contributor Author

/retest

@abhinavdahiya abhinavdahiya force-pushed the internal-clusters branch 21 times, most recently from f02dc1f to 296b2f0 Compare June 10, 2020 23:14
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

Contributor

@jstuever jstuever left a comment

I'm not a huge fan of forking all of these -commands.sh. When we did this with templates, we ended up with huge variations between the templates. I would be more comfortable if we could somehow wrap this into the original -commands.sh to avoid drift.

I have a concern about how the cat calls handle variables, doubly so if CI is replacing the variables prior to running the scripts (like it did in templates).

Contributor

Let's make sure we have a tech debt card to handle moving these.

Contributor

Maybe have this card also look for other modifications to install-config.yaml that could benefit from yq.

Comment on lines +14 to +151
Contributor

Are the variables in this block being translated during the cat call prior to being written to disk? If so, how does that affect them running on the jump host? The same thing applies to the other -commands.sh in this PR.

Contributor Author

> Are the variables in this block being translated during the cat call prior to being written to disk?

https://stackoverflow.com/a/22698106
Quoting the delimiter as 'EOF' makes it so that NO expansion will happen.
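
The difference is easy to check locally. In this small sketch (CLUSTER_NAME and the file names are made up), the unquoted delimiter expands the variable while the script is being written, and the quoted one leaves it for the shell that later runs the script, for example on the jump host:

```shell
#!/usr/bin/env bash
# CLUSTER_NAME and the file names are made up for this illustration.
CLUSTER_NAME="ci-cluster"
workdir="$(mktemp -d)"

# Unquoted delimiter: ${CLUSTER_NAME} is expanded now, by the shell
# that writes the file.
cat > "${workdir}/expanded.sh" <<EOF
echo "${CLUSTER_NAME}"
EOF

# Quoted delimiter: the body is written verbatim, so ${CLUSTER_NAME}
# is only expanded later, by whichever shell runs the generated script.
cat > "${workdir}/literal.sh" <<'EOF'
echo "${CLUSTER_NAME}"
EOF

cat "${workdir}/expanded.sh"   # prints: echo "ci-cluster"
cat "${workdir}/literal.sh"    # prints: echo "${CLUSTER_NAME}"
```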

Contributor

Perfect. I wasn't aware of the functionality change when adding the quotes.

Contributor

Gate this with: if [[ ! -s "${SHARED_DIR}/jump-host.txt" ]]

Contributor

Also in the other files.

Contributor Author

Any reason why we would do this? I don't want to skip silently.

Contributor

This will prevent ssh from failing when ${REMOTE} is an empty string and throwing a possibly misleading error message. Better to gate here with a clear error of why.
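
Such a gate might look like the following sketch. It assumes jump-host.txt holds the SSH destination on its first line. The function name and the error wording are illustrative, and `[ ! -s ... ]` is the POSIX spelling of the `[[ ! -s ... ]]` test suggested above.

```shell
#!/usr/bin/env bash

# Prints the jump host when the file exists and is non-empty.
# Otherwise prints a clear error to stderr and returns 1, so the
# step fails loudly instead of calling ssh with an empty destination.
read_jump_host() {
  local file="$1"
  if [ ! -s "${file}" ]; then
    echo "ERROR: ${file} is missing or empty; cannot determine the jump host" >&2
    return 1
  fi
  head -n 1 "${file}"
}

# Typical use inside a step (SHARED_DIR comes from the CI environment):
#   REMOTE="$(read_jump_host "${SHARED_DIR}/jump-host.txt")" || exit 1
```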

Contributor Author

added with eec6cc0

@jstuever jstuever removed their assignment Jun 18, 2020
@abhinavdahiya
Contributor Author

> I'm not a huge fan of forking all of these -commands.sh. When we did this with templates, we ended up with huge variations between the templates. I would be more comfortable if we could somehow wrap this into the original -commands.sh to avoid drift.

See the PR description for the reasoning behind doing it this way:

> Requires a lot of wrapping and needs an alternate step with jump handling for each step today. https://issues.redhat.com/browse/DPTP-1259 can help reduce one type of duplication but there is still which binaries to copy, what env variables should we transfer and what if there are other files that need to be copied.

DPTP-1259 would allow us to symlink the script to this step and use an input, and then we don't need the local copy.

Updating the original install step just makes that step too complicated.

@jstuever
Contributor

I didn't intend to block on this, I was voicing my concern to help prioritize the work to fix it.

@openshift-ci-robot
Contributor

@abhinavdahiya: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/open-cluster-management/registration-operator/master/e2e | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/cri-o/cri-o/release-1.13/e2e-aws | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-aws-sdn-multi | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-vsphere | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/installer/master/e2e-gcp | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/installer/master/e2e-gcp-upi | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/installer/master/e2e-azure | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/cluster-etcd-operator/master/e2e-aws | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/installer/master/e2e-vsphere | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-windows-hybrid-network | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/opendatahub-io/odh-manifests/master/odh-manifests-e2e | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-aws-sdn-single | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-ovn-step-registry | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/cluster-network-operator/master/e2e-ovn-hybrid-step-registry | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift-knative/serverless-operator/master/4.3-upgrade-tests-aws-ocp-43 | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/openshift-tests-private/master/e2e-aws | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift-knative/serverless-operator/master/4.3-e2e-aws-ocp-43 | f5432c987cd75c45daf73cc86ae01b1fb8b5c894 | link | /test pj-rehearse |
| ci/rehearse/openshift/installer/release-4.5/e2e-azure-internal | 2b464fab5fdaada1539208f9d62a73c1990c958e | link | /test pj-rehearse |
| ci/rehearse/openshift/installer/master/e2e-gcp-shared-vpc | ff736c276fb1a4b8d1182e7b3a5e02a62c510053 | link | /test pj-rehearse |
| ci/prow/step-registry-metadata | eec6cc0 | link | /test step-registry-metadata |
| ci/prow/ci-testgrid-allow-list | eec6cc0 | link | /test ci-testgrid-allow-list |
| ci/prow/yamllint | eec6cc0 | link | /test yamllint |
| ci/prow/boskos-config | eec6cc0 | link | /test boskos-config |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot
Contributor

@abhinavdahiya: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/release-config | eec6cc0 | link | /test release-config |
| ci/prow/boskos-config-generation | eec6cc0 | link | /test boskos-config-generation |
| ci/prow/secret-generator-config-valid | eec6cc0 | link | /test secret-generator-config-valid |
| ci/prow/deprecate-templates | eec6cc0 | link | /test deprecate-templates |

@openshift-ci
Contributor

openshift-ci bot commented Jan 14, 2021

@abhinavdahiya: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/ci-secret-generator-config | eec6cc0 | link | /test ci-secret-generator-config |

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2021
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2021
@openshift-ci
Contributor

openshift-ci bot commented Apr 14, 2021

@abhinavdahiya: PR needs rebase.

@mtnbikenc
Member

/uncc

@openshift-ci-robot openshift-ci-robot removed the request for review from mtnbikenc April 21, 2021 20:58
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 21, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented Jun 21, 2021

@openshift-bot: Closed this PR.

@openshift-ci openshift-ci bot closed this Jun 21, 2021
