This repository was archived by the owner on Jun 14, 2019. It is now read-only.

Conversation

@smarterclayton (Contributor) commented May 2, 2018

Templates are passed to the command, parsed, and then used as arbitrary
execution units in the CI graph. The operator requires that each
template contain at least one restart=Never pod and will set up
dependencies on earlier stages based on the parameters passed as input.
The following parameters are automatic:

  • JOB_NAME: the name of the job from the Job spec
  • JOB_NAME_HASH: a short hash of the job name
  • JOB_NAME_SAFE: the job name converted to a Kubernetes safe name
  • NAMESPACE: the target namespace
  • IMAGE_FORMAT: requires the release stage be defined
  • RPM_REPO: a URL pointing to the hosted RPM repo
  • IMAGE_*: depends on the release stage and places the wildcard value
    into IMAGE_FORMAT as ${component}

Environment variables can underfill the job.

The pods in each template are considered successful when they reach
phase=Succeeded and considered failed if any container or init container
has a non-zero exit code. The template name is used as the unique key.

Also support --secret-dir which converts a directory into a secret with the same name as the directory.
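
As an aside, a minimal sketch of the pod success/failure rule described above, using client-go's core/v1 types; the helper names are hypothetical, not the operator's actual code:

package main

import (
    corev1 "k8s.io/api/core/v1"
)

// podSucceeded mirrors the rule above: a template pod is successful
// once it reaches phase=Succeeded.
func podSucceeded(pod *corev1.Pod) bool {
    return pod.Status.Phase == corev1.PodSucceeded
}

// podFailed mirrors the failure rule: any container or init container
// that terminated with a non-zero exit code fails the pod.
func podFailed(pod *corev1.Pod) bool {
    statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
    statuses = append(statuses, pod.Status.ContainerStatuses...)
    for _, s := range statuses {
        if t := s.State.Terminated; t != nil && t.ExitCode != 0 {
            return true
        }
    }
    return false
}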

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress and size/XL labels May 2, 2018
@smarterclayton smarterclayton force-pushed the tempalte branch 3 times, most recently from 010d38a to 95aaaf0 on May 3, 2018 at 04:43
@smarterclayton smarterclayton changed the title from "WIP - A template execution step in the graph" to "Add template execution and secret creation to the operator" May 3, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress label May 3, 2018
@openshift-ci-robot openshift-ci-robot added the size/XXL label and removed the size/XL label May 4, 2018

@smarterclayton (Contributor, Author):

Added a test step command. Fixed a bug where multiple identical Requires links caused multiple executions of the step. Moved from --build-config JSON to taking CONFIG_SPEC as an environment variable.

@smarterclayton (Contributor, Author):

@stevekuznetsov

@smarterclayton (Contributor, Author):

Added --target to restrict the graph to only steps that are dependencies of the source.

--secret-dir may be specified multiple times and creates a secret with
the contents of each directory. The secret name is the name of the
directory.
Add an example of collecting JUnit results
Only dependencies of the named targets are returned. Steps return names
that correspond to their images in general.
If the stage has already completed, we delete the pod/templateinstance and wait for the deletion to finish. If it's running, we wait for it to complete.

The assumption is that on cancellation our pods get marked deleted anyway. This guarantees a Prow cancellation stops pending work.
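
A rough sketch of the wait/cancellation behavior these notes describe, assuming client-go polling of that era (pre-context signatures); the names and timeouts here are illustrative, not the merged code:

package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    kerrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// waitForPodCompletion polls until the pod succeeds, fails, or is
// deleted. Deletion is treated as cancellation, matching the
// assumption above that cancelled jobs get their pods marked deleted.
func waitForPodCompletion(client kubernetes.Interface, namespace, name string) error {
    return wait.PollImmediate(5*time.Second, 2*time.Hour, func() (bool, error) {
        pod, err := client.CoreV1().Pods(namespace).Get(name, metav1.GetOptions{})
        if kerrors.IsNotFound(err) {
            return false, fmt.Errorf("pod %s/%s was deleted (cancelled)", namespace, name)
        }
        if err != nil {
            return false, err
        }
        switch pod.Status.Phase {
        case corev1.PodSucceeded:
            return true, nil
        case corev1.PodFailed:
            return false, fmt.Errorf("pod %s/%s failed", namespace, name)
        }
        return false, nil // still pending/running; keep polling
    })
}
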
func printBuildLogs(buildClient BuildClient, name string) {
    if s, err := buildClient.Logs(name, &buildapi.BuildLogOptions{
        NoWait:     true,
        Timestamps: true,

Contributor:
Why?

Contributor (Author):
Timestamps were confusing and ragged. Wait was an accident.

    return nil, nil
}
return api.ParameterMap{
    fmt.Sprintf("IMAGE_%s", strings.ToUpper(strings.Replace(s.config.To.As, "-", "_", -1))): func() (string, error) {

Contributor:
What is this and why is it not going to be a pain to depend on?

Contributor (Author):
This avoids complex replacement logic based on the format inside the template (which is impossible). You depend on IMAGE_ANSIBLE and you get IMAGE_FORMAT with ${component} replaced by ansible.
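
Concretely, with a made-up IMAGE_FORMAT value, the substitution looks like:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Hypothetical IMAGE_FORMAT; the real value comes from the release stage.
    format := "registry.example.com/ci-op-1234/stable:${component}"
    // Depending on IMAGE_ANSIBLE yields the format with the wildcard filled:
    fmt.Println(strings.Replace(format, "${component}", "ansible", -1))
    // Output: registry.example.com/ci-op-1234/stable:ansible
}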

func (s *releaseImagesTagStep) Provides() (api.ParameterMap, api.StepLink) {
    return api.ParameterMap{
        "IMAGE_FORMAT": func() (string, error) {
            registry := "REGISTRY"

Contributor:
When is this a valid output? Why not var registry string?

Contributor (Author):
I need to return an error actually, thanks.

    }
}
var format string
if len(s.config.Name) > 0 {

Contributor:
It's not clear to me what this implies

Contributor (Author):
This was previously added; the godoc should make it clear.

if err != nil {
    return err
}
s.template.Parameters[i].Value = strings.Replace(format, "${component}", component, -1)

Contributor:
We should make a constant for "${component}"

Contributor (Author):
Ok.


func (o *options) Complete() error {
    if err := json.Unmarshal([]byte(o.rawBuildConfig), &o.buildConfig); err != nil {
    configSpec := os.Getenv("CONFIG_SPEC")

Contributor:
os.LookupEnv?

Contributor (Author):
Yeah.
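
Roughly, the suggested variant (a sketch, not the merged code; the function name and error message are illustrative):

package main

import (
    "errors"
    "os"
)

// loadConfigSpec distinguishes "unset" from "set but empty", which
// os.Getenv alone cannot do.
func loadConfigSpec() (string, error) {
    configSpec, ok := os.LookupEnv("CONFIG_SPEC")
    if !ok || len(configSpec) == 0 {
        return "", errors.New("CONFIG_SPEC environment variable is required")
    }
    return configSpec, nil
}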

    From PipelineImageStreamTagReference `json:"from"`
    // Commands are the shell commands to run in
    // the repository root to execute tests.
    Commands string `json:"commands"`

Contributor:
Why is it on us to do word splitting?

Contributor (Author):
It's not?

Contributor:
Isn't it? We take this single string and need to feed it into a []string{} for the Pod, so isn't every entry in that array treated as a single arg?

Contributor (Author):
We’re feeding a scriptlet to bash as arg[2]. The same as the build command.

{
    Name:    "test",
    Image:   fmt.Sprintf("%s:%s", PipelineImageStream, s.config.From),
    Command: []string{"/bin/bash", "-c", "#!/bin/bash\nset -euo pipefail\n" + s.config.Commands},

Contributor:
I am wary of this

Contributor (Author):
You mean having to deal with shell? We already require this for most other things.


func bindOptions() *options {
    opt := &options{}
    flag.Var(&opt.targets, "target", "A set of names in the config to target. Only steps that are required for these targets will be run.")

Contributor:
What's the use-case for more than one leaf here?

Contributor (Author):
Build two images in the same repo.

Contributor:
When you are building images, just run the build job to completion. When you need a leaf, you are running a single test. If the test requires multiple parents, express that in the tree relationship as necessary? Using --target for non-test runs seems wrong.

    return nil, nil
}

func (s *pipelineImageCacheStep) Name() string { return string(s.config.To) }

Contributor:
Why do we want any steps other than leaves to be targeted?

Contributor (Author):
Build just a specific image.

@smarterclayton (Contributor, Author):

Addressed or commented, changes pushed.

@smarterclayton (Contributor, Author):

Added a -h and provided help.

The hash used for the namespace name needs to include the external
dependencies we pull for images as well as the build configuration and
source code in order to ensure we have reproducible builds.

Add a new Step method `Inputs(...) api.InputDefinition` which returns an
opaque list of inputs that can be combined into the final hash. After
loading the steps we resolve all inputs and calculate the hash for the
inputs, then use that in the namespace name.

Because namespace is an input to the steps but can only be determined
after we resolve inputs defined in other namespaces, we must switch
steps to lazily include their namespace (so we pass cluster scoped
clients to steps instead of namespace scoped clients).

This makes the ci-operator hermetic with respect to its inputs.
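
A sketch of how resolved inputs could be folded into one stable hash, assuming the `api.InputDefinition` values reduce to strings; the sorting, separator, and truncation here are illustrative choices, not necessarily the operator's:

package main

import (
    "crypto/sha256"
    "fmt"
    "sort"
)

// inputHash combines every resolved step input into one stable digest.
// Any change to a base image, build configuration, or source revision
// changes the hash, and therefore the namespace the job runs in.
func inputHash(inputs []string) string {
    sort.Strings(inputs) // make the digest order-independent
    h := sha256.New()
    for _, in := range inputs {
        h.Write([]byte(in))
        h.Write([]byte{0}) // separator so "a","bc" differs from "ab","c"
    }
    return fmt.Sprintf("%x", h.Sum(nil))[:12] // shortened for use in a name
}

func main() {
    fmt.Println(inputHash([]string{"base-image@sha256:abc123", "config:v1", "src:0123abcd"}))
}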

@smarterclayton (Contributor, Author) commented May 5, 2018

Made the graph include the images we tag in as prerequisites as part of the input hash, which required two changes:

  1. a pass over all steps prior to building the graph to calculate all their inputs (the steps cache the resolution)
  2. the namespace that steps create into must be lazily resolved, so switch all the clients passed into steps from *Interface to *Getter.

I debated whether to change the signature of Run and decided against it because it keeps the run action independent of the actual graph being executed. Steps are already lazily evaluating state from the source of truth (the cluster) so the Run() command just asks the JobSpec what namespace it should run from. We don't attach things to context generally.

After this commit, builds are hermetic which is a nice win (two different users with the same source, build definition, and base images will get the same output). This was required to make retries work:

  1. test job A runs the first time while a particular dependency (base image) is broken
  2. test job A fails because the base image is broken
  3. user retries test job A as job B, but if job B won't rebuild the artifacts we'll remain broken

After this change, since the base image changed, the input hash changes, which means a new namespace is used and new artifacts are created.

@smarterclayton (Contributor, Author):

@stevekuznetsov I get lonely without your comments... :(

@stevekuznetsov (Contributor):

/test pull-ci-operator-unit

1 similar comment
@stevekuznetsov (Contributor):

/test pull-ci-operator-unit

@stevekuznetsov (Contributor) left a review comment:

Incoming changes LGTM but I am still not clear on the Creates() API and why we ever want multiple leaves for --target

@smarterclayton (Contributor, Author):

I’ll make target singular. I can unify creates and provides in a follow up?

@smarterclayton (Contributor, Author):

Disabled multiple targets. Will come back to creates and provides.

@smarterclayton smarterclayton merged commit d2ea6bd into openshift:master May 11, 2018
wking added a commit to wking/ci-operator that referenced this pull request Feb 15, 2019
Because while two seconds may be sufficient teardown time for unit
tests and such, it's too short for e2e teardown (where cluster logs
should be collected, cluster resources need to be reaped, and
collected assets need to be uploaded).  The two-second timeout is from
c2ac3ab (When interrupted, mark any in progress pods/templates as
deleted, 2018-05-04, openshift#16).  And actually, I'm not sure how the
2-second default worked even for unit tests and such, because
Kubernetes pods have a 30-second grace period by default [1].

A recent aborted e2e job that did not run teardown to completion or
collect any cluster assets is [2,3].  A recent successful e2e job
showing current teardown timing is [4,5].  From [4]:

  2019/02/14 18:19:54 Container setup in pod e2e-aws completed successfully
  2019/02/14 18:46:08 Container test in pod e2e-aws completed successfully
  2019/02/14 18:51:38 Container teardown in pod e2e-aws completed successfully

So 5.5 minutes to teardown.  And from [5]:

  time="2019-02-14T18:47:40Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"1341ccc2-66ac-46bb-ae15-0295d4a126ba\"}"
  ...
  time="2019-02-14T18:51:38Z" level=debug msg="Purging asset \"Cluster\" from disk"

about 4 minutes of that is resource reaping (with the previous 1.5
minutes being log collection).

[1]: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/build-log.txt
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/artifacts/
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/build-log.txt
[5]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/artifacts/e2e-aws/installer/.openshift_install.log
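
For reference, the default 30-second grace period the message cites can be overridden per delete call; a hedged client-go sketch (pre-context signature, names illustrative):

package main

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deleteNow overrides the default 30s termination grace period
// mentioned above; zero forces immediate deletion. Illustrative only.
func deleteNow(client kubernetes.Interface, namespace, name string) error {
    grace := int64(0)
    return client.CoreV1().Pods(namespace).Delete(name, &metav1.DeleteOptions{
        GracePeriodSeconds: &grace,
    })
}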