This repository was archived by the owner on Jun 14, 2019. It is now read-only.

Conversation

@smarterclayton (Contributor) commented May 2, 2018

Templates are passed to the command, parsed, and then used as arbitrary
execution units in the CI graph. The operator requires that each
template contain at least one restart=Never pod and will set up
dependencies on earlier stages based on the parameters passed as input.
The following parameters are automatic:

  • JOB_NAME: the name of the job from the Job spec
  • JOB_NAME_HASH: a short hash of the job name
  • JOB_NAME_SAFE: the job name converted to a Kubernetes safe name
  • NAMESPACE: the target namespace
  • IMAGE_FORMAT: requires the release stage be defined
  • RPM_REPO: a URL pointing to the hosted RPM repo
  • IMAGE_*: depends on the release stage and places the wildcard value
    into IMAGE_FORMAT as ${component}

Environment variables can underfill the job.

The pods in each template are considered successful when they reach
phase=Succeeded and considered failed if any container or init container
has a non-zero exit code. The template name is used as the unique key.

Also support --secret-dir which converts a directory into a secret with the same name as the directory.
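
As an aside, a minimal sketch of the pod success/failure rule described above, using client-go's core/v1 types; the helper names are hypothetical, not the operator's actual code:

package main

import (
    corev1 "k8s.io/api/core/v1"
)

// podSucceeded mirrors the rule above: a template pod is successful
// once it reaches phase=Succeeded.
func podSucceeded(pod *corev1.Pod) bool {
    return pod.Status.Phase == corev1.PodSucceeded
}

// podFailed mirrors the failure rule: any container or init container
// that terminated with a non-zero exit code fails the pod.
func podFailed(pod *corev1.Pod) bool {
    statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
    statuses = append(statuses, pod.Status.ContainerStatuses...)
    for _, s := range statuses {
        if t := s.State.Terminated; t != nil && t.ExitCode != 0 {
            return true
        }
    }
    return false
}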

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress and size/XL labels May 2, 2018
@smarterclayton smarterclayton force-pushed the tempalte branch 3 times, most recently from 010d38a to 95aaaf0 on May 3, 2018 at 04:43
@smarterclayton smarterclayton changed the title from "WIP - A template execution step in the graph" to "Add template execution and secret creation to the operator" May 3, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress label May 3, 2018
@openshift-ci-robot openshift-ci-robot added the size/XXL label and removed the size/XL label May 4, 2018

@smarterclayton (Contributor, Author):

Added a test step command. Fixed a bug where multiple identical Requires links caused multiple executions of the step. Moved from --build-config JSON to taking CONFIG_SPEC as an environment variable.

@smarterclayton (Contributor, Author):

@stevekuznetsov

@smarterclayton (Contributor, Author):

Added --target to restrict the graph to only steps that are dependencies of the source.

--secret-dir may be specified multiple times and creates a secret with
the contents of each directory. The secret name is the name of the
directory.
Add an example of collecting JUnit results
Only dependencies of the named targets are returned. Steps return names
that correspond to their images in general.
If the stage has already completed, we delete the pod/templateinstance and wait for the deletion to finish. If it's running, we wait for it to complete.

The assumption is that on cancellation our pods get marked deleted anyway. This guarantees a Prow cancellation stops pending work.
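
A rough sketch of the wait/cancellation behavior these notes describe, assuming client-go polling of that era (pre-context signatures); the names and timeouts here are illustrative, not the merged code:

package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    kerrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// waitForPodCompletion polls until the pod succeeds, fails, or is
// deleted. Deletion is treated as cancellation, matching the
// assumption above that cancelled jobs get their pods marked deleted.
func waitForPodCompletion(client kubernetes.Interface, namespace, name string) error {
    return wait.PollImmediate(5*time.Second, 2*time.Hour, func() (bool, error) {
        pod, err := client.CoreV1().Pods(namespace).Get(name, metav1.GetOptions{})
        if kerrors.IsNotFound(err) {
            return false, fmt.Errorf("pod %s/%s was deleted (cancelled)", namespace, name)
        }
        if err != nil {
            return false, err
        }
        switch pod.Status.Phase {
        case corev1.PodSucceeded:
            return true, nil
        case corev1.PodFailed:
            return false, fmt.Errorf("pod %s/%s failed", namespace, name)
        }
        return false, nil // still pending/running; keep polling
    })
}
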
func printBuildLogs(buildClient BuildClient, name string) {
    if s, err := buildClient.Logs(name, &buildapi.BuildLogOptions{
        NoWait:     true,
        Timestamps: true,

Contributor:
Why?

Contributor (Author):
Timestamps were confusing and ragged. Wait was an accident.

    return nil, nil
}
return api.ParameterMap{
    fmt.Sprintf("IMAGE_%s", strings.ToUpper(strings.Replace(s.config.To.As, "-", "_", -1))): func() (string, error) {

Contributor:
What is this and why is it not going to be a pain to depend on?

Contributor (Author):
This avoids complex replacement logic based on the format inside the template (which is impossible). You depend on IMAGE_ANSIBLE and you get IMAGE_FORMAT with ${component} replaced by ansible.
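
Concretely, with a made-up IMAGE_FORMAT value, the substitution looks like:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Hypothetical IMAGE_FORMAT; the real value comes from the release stage.
    format := "registry.example.com/ci-op-1234/stable:${component}"
    // Depending on IMAGE_ANSIBLE yields the format with the wildcard filled:
    fmt.Println(strings.Replace(format, "${component}", "ansible", -1))
    // Output: registry.example.com/ci-op-1234/stable:ansible
}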

func (s *releaseImagesTagStep) Provides() (api.ParameterMap, api.StepLink) {
    return api.ParameterMap{
        "IMAGE_FORMAT": func() (string, error) {
            registry := "REGISTRY"

Contributor:
When is this a valid output? Why not var registry string?

Contributor (Author):
I need to return an error actually, thanks.

    }
}
var format string
if len(s.config.Name) > 0 {

Contributor:
It's not clear to me what this implies

Contributor (Author):
This was previously added; the godoc should make it clear.

if err != nil {
    return err
}
s.template.Parameters[i].Value = strings.Replace(format, "${component}", component, -1)

Contributor:
We should make a constant for "${component}"

Contributor (Author):
Ok.


func (o *options) Complete() error {
    if err := json.Unmarshal([]byte(o.rawBuildConfig), &o.buildConfig); err != nil {
    configSpec := os.Getenv("CONFIG_SPEC")

Contributor:
os.LookupEnv?

Contributor (Author):
Yeah.
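
Roughly, the suggested variant (a sketch, not the merged code; the function name and error message are illustrative):

package main

import (
    "errors"
    "os"
)

// loadConfigSpec distinguishes "unset" from "set but empty", which
// os.Getenv alone cannot do.
func loadConfigSpec() (string, error) {
    configSpec, ok := os.LookupEnv("CONFIG_SPEC")
    if !ok || len(configSpec) == 0 {
        return "", errors.New("CONFIG_SPEC environment variable is required")
    }
    return configSpec, nil
}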

    From PipelineImageStreamTagReference `json:"from"`
    // Commands are the shell commands to run in
    // the repository root to execute tests.
    Commands string `json:"commands"`

Contributor:
Why is it on us to do word splitting?

Contributor (Author):
It's not?

Contributor:
Isn't it? We take this single string and need to feed it into a []string{} for the Pod, so isn't every entry in that array treated as a single arg?

Contributor (Author):
We’re feeding a scriptlet to bash as arg[2]. The same as the build command.

{
    Name:    "test",
    Image:   fmt.Sprintf("%s:%s", PipelineImageStream, s.config.From),
    Command: []string{"/bin/bash", "-c", "#!/bin/bash\nset -euo pipefail\n" + s.config.Commands},

Contributor:
I am wary of this

Contributor (Author):
You mean having to deal with shell? We already require this for most other things.


func bindOptions() *options {
    opt := &options{}
    flag.Var(&opt.targets, "target", "A set of names in the config to target. Only steps that are required for these targets will be run.")

Contributor:
What's the use-case for more than one leaf here?

Contributor (Author):
Build two images in the same repo.

Contributor:
When you are building images, just run the build job to completion. When you need a leaf, you are running a single test. If the test requires multiple parents, express that in the tree relationship as necessary? Using --target for non-test runs seems wrong.

    return nil, nil
}

func (s *pipelineImageCacheStep) Name() string { return string(s.config.To) }

Contributor:
Why do we want any steps other than leaves to be targeted?

Contributor (Author):
Build just a specific image.

@smarterclayton (Contributor, Author):

Addressed or commented, changes pushed.

@smarterclayton (Contributor, Author):

Added a -h and provided help.

The hash used for the namespace name needs to include the external
dependencies we pull for images as well as the build configuration and
source code in order to ensure we have reproducible builds.

Add a new Step method `Inputs(...) api.InputDefinition` which returns an
opaque list of inputs that can be combined into the final hash. After
loading the steps we resolve all inputs and calculate the hash for the
inputs, then use that in the namespace name.

Because namespace is an input to the steps but can only be determined
after we resolve inputs defined in other namespaces, we must switch
steps to lazily include their namespace (so we pass cluster scoped
clients to steps instead of namespace scoped clients).

This makes the ci-operator hermetic with respect to its inputs.
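
A sketch of how resolved inputs could be folded into one stable hash, assuming the `api.InputDefinition` values reduce to strings; the sorting, separator, and truncation here are illustrative choices, not necessarily the operator's:

package main

import (
    "crypto/sha256"
    "fmt"
    "sort"
)

// inputHash combines every resolved step input into one stable digest.
// Any change to a base image, build configuration, or source revision
// changes the hash, and therefore the namespace the job runs in.
func inputHash(inputs []string) string {
    sort.Strings(inputs) // make the digest order-independent
    h := sha256.New()
    for _, in := range inputs {
        h.Write([]byte(in))
        h.Write([]byte{0}) // separator so "a","bc" differs from "ab","c"
    }
    return fmt.Sprintf("%x", h.Sum(nil))[:12] // shortened for use in a name
}

func main() {
    fmt.Println(inputHash([]string{"base-image@sha256:abc123", "config:v1", "src:0123abcd"}))
}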

@smarterclayton (Contributor, Author) commented May 5, 2018

Made the graph include the images we tag in as prerequisites as part of the input hash, which required two changes:

  1. a pass over all steps prior to building the graph to calculate all their inputs (the steps cache the resolution)
  2. the namespace that steps create into must be lazily resolved, so switch all the clients passed into steps from *Interface to *Getter.

I debated whether to change the signature of Run and decided against it because it keeps the run action independent of the actual graph being executed. Steps are already lazily evaluating state from the source of truth (the cluster) so the Run() command just asks the JobSpec what namespace it should run from. We don't attach things to context generally.

After this commit, builds are hermetic which is a nice win (two different users with the same source, build definition, and base images will get the same output). This was required to make retries work:

  1. test job A runs the first time while a particular dependency (base image) is broken
  2. test job A fails because the base image is broken
  3. user retries test job A as job B, but if job B won't rebuild the artifacts we'll remain broken

After this change, since the base image changed, the input hash changes, which means a new namespace is used and new artifacts are created.

@smarterclayton (Contributor, Author):

@stevekuznetsov I get lonely without your comments... :(

@stevekuznetsov (Contributor):

/test pull-ci-operator-unit

1 similar comment
@stevekuznetsov (Contributor):

/test pull-ci-operator-unit

@stevekuznetsov (Contributor) left a review comment:

Incoming changes LGTM but I am still not clear on the Creates() API and why we ever want multiple leaves for --target

@smarterclayton (Contributor, Author):

I’ll make target singular. I can unify creates and provides in a follow up?

@smarterclayton (Contributor, Author):

Disabled multiple targets. Will come back to creates and provides.

@smarterclayton smarterclayton merged commit d2ea6bd into openshift:master May 11, 2018
wking added a commit to wking/ci-operator that referenced this pull request Feb 15, 2019
Because while two seconds may be sufficient teardown time for unit
tests and such, it's too short for e2e teardown (where cluster logs
should be collected, cluster resources need to be reaped, and
collected assets need to be uploaded).  The two-second timeout is from
c2ac3ab (When interrupted, mark any in progress pods/templates as
deleted, 2018-05-04, openshift#16).  And actually, I'm not sure how the
2-second default worked even for unit tests and such, because
Kubernetes pods have a 30-second grace period by default [1].

A recent aborted e2e job that did not run teardown to completion or
collect any cluster assets is [2,3].  A recent successful e2e job
showing current teardown timing is [4,5].  From [4]:

  2019/02/14 18:19:54 Container setup in pod e2e-aws completed successfully
  2019/02/14 18:46:08 Container test in pod e2e-aws completed successfully
  2019/02/14 18:51:38 Container teardown in pod e2e-aws completed successfully

So 5.5 minutes to teardown.  And from [5]:

  time="2019-02-14T18:47:40Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"1341ccc2-66ac-46bb-ae15-0295d4a126ba\"}"
  ...
  time="2019-02-14T18:51:38Z" level=debug msg="Purging asset \"Cluster\" from disk"

about 4 minutes of that is resource reaping (with the previous 1.5
minutes being log collection).

[1]: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/build-log.txt
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/artifacts/
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/build-log.txt
[5]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/artifacts/e2e-aws/installer/.openshift_install.log
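
For reference, the default 30-second grace period the message cites can be overridden per delete call; a hedged client-go sketch (pre-context signature, names illustrative):

package main

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deleteNow overrides the default 30s termination grace period
// mentioned above; zero forces immediate deletion. Illustrative only.
func deleteNow(client kubernetes.Interface, namespace, name string) error {
    grace := int64(0)
    return client.CoreV1().Pods(namespace).Delete(name, &metav1.DeleteOptions{
        GracePeriodSeconds: &grace,
    })
}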