Skip to content

Conversation

@ari-becker
Copy link

This proposal outlines the addition of a manual step type.

RFC written as requested in concourse/concourse#3265.

Signed-off-by: Ari Becker <ari@coralogix.com>
@ari-becker ari-becker changed the title Initial proposal for manual steps Add manual steps Aug 17, 2019
@evanchaoli
Copy link
Contributor

Good to see this RFC. I also received a number of user asks for manual steps, because many users have Jenkins experience, and Jenkins has such functionality. I may consider to help implement this feature if I get some bandwidth.

build should proceed normally. If a user should select the "reject" option,
then the manual step should be marked as failed, and the build should respond
accordingly, as if any other step in the build had failed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not only a decision (accept/reject), the manual step may also support to take some user inputs, which is a feature Jenkins supports, and many of our users asked for that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evanchaoli In my very subjective opinion, user inputs at build-time are an anti-patttern, as the values should instead be placed in a Git repository or some other resource, which is then provided as an input to the job. Having worked in places where user inputs were abused, copying and pasting values into a web form gets very old, very fast.


It is occasionally desirable for the authority to approve `manual` steps to be
restricted to a limited subset of users. An RBAC role, `supervisor`, should be
added to Concourse teams. A user with the `supervisor` role has the power to
Copy link
Contributor

@evanchaoli evanchaoli Nov 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of introducing a new role, I think Concourse should support team level user defined roles.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evanchaoli if so, that sounds to me like a matter for a separate RFC.

It is occasionally desirable for the authority to approve `manual` steps to be
restricted to a limited subset of users. An RBAC role, `supervisor`, should be
added to Concourse teams. A user with the `supervisor` role has the power to
approve manual steps. Team admins, by virtue of having the power to alter

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the biggest challenge is that team admins could be removing the manual step, invalidating it's usefulness. Any thoughts on that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that it's reasonable to consider team admins a threat under a security model which depends upon manual steps to enforce policy. A malicious team admin, instead of altering the pipeline to remove the manual step, could do damage by altering the pipeline to add a malicious resource, and more damage would be done, with a lower probability of detection, by doing so. In my opinion, organizations which consider team admins a threat should have separate pipelines, ideally on separate Concourse instances (soon, the same Concourse instance but on different teams, with the introduction of team credential management), such that a different team has the opportunity to review the output from the pipeline which the malicious team admin is administrating.

I see the purpose of a manual step less as a measure to protect against malicious internal actors and more as a measure to review automatically generated output for which automated testing is difficult, or not worth the engineering cost, to implement.

@hrishin
Copy link

hrishin commented May 19, 2020

Wonders if this is progressing further? 🙇 🤔

Copy link
Member

@vito vito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for putting this together, and double thanks for your patience! 🙏

Just to be upfront, and I believe I'm preaching to the choir here based on some of your review comments, I think Concourse's strength is in being a continuous thing-doer, and we should avoid turning it into a general GUI driver for executing workloads interactively. It's more designed around "give me a declarative config which describes your total workflow and I'll connect the dots" - while making a conscious effort to not have Concourse be a mission-critical part of the workflow.

So I'm very thankful that this proposal is pretty limited in scope and doesn't tread into the "arbitrary user-provided input" territory. 🙂

That being said, I think it would be more Concoursey to have an external record of who approved what and when, from the standpoint that Concourse should never be the source of truth - it's the dot connector, not the dot. I understand that this approach is painful and awkward to implement in practice right now, but I would like to unpack the specifics there just to make sure we're at our wit's end and we really just need to add a manual step after all.

For example, @eedwards-sk brought up the idea of using Slack messages to convey approval. It's awfully convenient that this is what they naturally wanted to be able to do while also being more Concoursey in that it allows for an external record of approvals.

This might even be something we can support through prototypes (#37) if we figure out what the interface looks like.

I think the downside of this approach is it requires you to have some sort of external mechanism for approval, but there's at least some evidence that that is a common case, and this might be one of those things where Concourse forces you to "do the right thing" - which may feel heavy on day 1, but may pay off on day 1000 when you really need to figure out who OK'd the build on August 24, 2017 and you've since migrated your pipelines to a different Concourse instance altogether.

There are a few other benefits to it - externalizing the approval also lets "who can approve" be configured elsewhere, rather than requiring RBAC in Concourse. It also lets whoever's doing the approval stay in an appropriate context - such as replying to an e-mail or a Slack thread - rather than diving into a Concourse pipeline and clicking a button, which may feel awkward for folks who's only role in a pipeline is to approve things. I'm just theorycrafting here, though.

I would imagine viewing a build that's waiting for approval and then having to go over to Slack would be annoying. Maybe that could be addressed by still allowing approval through the Concourse UI, but then recording that action through some other event so that it can still be logged in the external service.

So, I guess what I'm asking is this: do you value having a history of who approved what and when, and does your workflow currently involve a service that could be used for such a thing? Following that, would that workflow feel natural or more like a constant burden?

Thanks again! And sorry for the wall of text. 😅

each manual step must define a timeout step modifier. Manual steps are special
in this requirement (compared to other steps) in order to ensure that forgotten
jobs with open manual approvals, including their task containers and volumes,
are garbage-collected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really think it's necessary to require this. In principle any build could run for an unbounded amount of time from a hanging task/get/put step, and this has come up before in concourse/concourse#2341 - so perhaps it's best to frame this as an overall step max execution time if such a thing is really needed, and not specifically call it out in this RFC.

I think it makes more sense to just let users configure the timeout: step modifier when necessary. Weekends and vacations shouldn't have to mean a failed/timed-out build in the pipeline - as long as the build doesn't cause some sort of resource leak by virtue of running for that long I think it's OK for it to just wait.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you're getting at here.

The design presumption I'm working from is, it's better to be explicit about requiring the user to set expectations, than to allow the user an easy way to shoot himself in the foot.

Let's take GitHub PRs as an example - it's pretty easy to end up with PRs staying open for indefinite periods of time, because priorities, and people switch their attention away and forget about stuff that was opened. The cost for a PR staying open for months on-end is basically zero (maybe some lost effort whose value won't be realized since it's not merged in), so it's fine not to add complexity to people opening PRs by asking them to define, when they open the PR, when to self-close the PR (even asking that is kind of hostile, from a design perspective).

But the cost for a manual step staying open is a resource leak, one that is sure to make Concourse operators tear their hair out: for one, because debugging silent resource leaks is annoying and difficult to pin down, and when the operators find that their users are at fault (because of laziness?), they're going to feel like there's nothing they can do to prevent it from recurring. It would make the system look broken.

Sure, the user can provide a timeout: step-modifier, but if it's not required by the system, then it'll pass fly validate-pipeline, and I don't think that's a (forgive the euphemism) healthy design. I think a better compromise is to require the user to provide timeout:, but allow the user to specify timeout: 0 (for unlimited) if the user really wants it (really this should be true of all steps, but I digress).

Allowing operators to set overall step max execution time is a good idea, but tricky - one instance may have both long-running jobs and short jobs. But probably a good idea in practice.

@vito vito self-assigned this May 20, 2020
@ari-becker
Copy link
Author

Re. @vito :

Just to be upfront, and I believe I'm preaching to the choir here based on some of your review comments, I think Concourse's strength is in being a continuous thing-doer, and we should avoid turning it into a general GUI driver for executing workloads interactively. It's more designed around "give me a declarative config which describes your total workflow and I'll connect the dots" - while making a conscious effort to not have Concourse be a mission-critical part of the workflow.

100%.

That being said, I think it would be more Concoursey to have an external record of who approved what and when, from the standpoint that Concourse should never be the source of truth - it's the dot connector, not the dot. I understand that this approach is painful and awkward to implement in practice right now, but I would like to unpack the specifics there just to make sure we're at our wit's end and we really just need to add a manual step after all.

I think the downside of this approach is it requires you to have some sort of external mechanism for approval, but there's at least some evidence that that is a common case, and this might be one of those things where Concourse forces you to "do the right thing" - which may feel heavy on day 1, but may pay off on day 1000 when you really need to figure out who OK'd the build on August 24, 2017 and you've since migrated your pipelines to a different Concourse instance altogether.

There are a few other benefits to it - externalizing the approval also lets "who can approve" be configured elsewhere, rather than requiring RBAC in Concourse. It also lets whoever's doing the approval stay in an appropriate context - such as replying to an e-mail or a Slack thread - rather than diving into a Concourse pipeline and clicking a button, which may feel awkward for folks who's only role in a pipeline is to approve things. I'm just theorycrafting here, though.

I would imagine viewing a build that's waiting for approval and then having to go over to Slack would be annoying. Maybe that could be addressed by still allowing approval through the Concourse UI, but then recording that action through some other event so that it can still be logged in the external service.

Permit me to tell a short story: once upon a time, I was a contractor, and I was doing some work for a bank. The bank had software that kept track of approvals from management for workflow like deploying to production. It was off-the-shelf from some BigEnterpriseSoftwareVendorCo and I honestly can't remember the name of it, but I do remember that everyone hated that software. That software was both bureaucratically impossible to make changes to (as it was an integral part of the bank's compliance workflow), technically impossible to make changes to (old, no open documentation, no documented API, etc.), and scapegoated the frustrations of people who work in big bank environments (needing a zillion approvals to do simple tasks). I'm sure, as any other piece of corporate IT, that software represented its own operational complexity - web servers, a load balancer, a database, monitoring, SSO login, etc.

For the sake of argument, let's say that a Concourse user who wishes to configure a pipeline wants such an external approval system, Foo to control manual approvals for resuming a pipeline. Well, that's easily modeled in Concourse today, with two resource types: a FooRequestApproval resource, which with put/out creates a request for approval in the external approval system, and a resource type FooRequireApproval, which uses the check, get/in mechanisms to poll the external system for approvals and only provide a new Concourse resource version when the necessary approval has been issued.

As an added bonus, this system does even more than what you suggested - as a dedicated system, it can provide its own rules engine, its own notifications, its own audit trail, etc. It will necessarily be much more powerful than anything Concourse could provide.

So my question is this: do you know of any systems like this that could be pointed to for people who need to implement this pattern? Something that's relatively easy for small teams to set up and maintain for the "simple" use-case (i.e. reviewing automatically generated output), but powerful enough for big enterprise teams as their needs grow and compliance becomes a genuine business concern? Even if it doesn't have Concourse resources yet (hopefully it wouldn't be too difficult to create a third-party resource, as long as the API is well-documented)? Maybe available as a SaaS so that we don't have to run the infrastructure ourselves?

So, I guess what I'm asking is this: do you value having a history of who approved what and when, and does your workflow currently involve a service that could be used for such a thing? Following that, would that workflow feel natural or more like a constant burden?

We don't even keep a durable build history. We set up our Concourse instance such that, if our Concourse instance blew up, we run a script and it sets everything back up for us, soup to nuts - deploying Concourse on Kubernetes, creating teams, setting up auth, setting up credential managers (including populating the credential managers with secrets), running a bootstrap pipeline that sets up all of the other pipelines in the system. Nobody really cares about build history because all of our pipelines are set up with idempotent-only operations in mind following a GitOps model - if something is screwed up, just re-run, and it either fixes the issue or reproduces it for you in a fresh build log. So we run PostgreSQL in a Kubernetes pod on commodity nodes, with no backups. We take the risk of being down for a few hours to give everything time to run, but that's fine with us for now.... actually, it's been fine for us for two years and counting. #yolo

We couldn't care less of keeping track of approvals and the idea of having to set up an entirely separate system to keep durable track of approvals sounds like an anti-pattern to me, considering the rest of how our infrastructure is set up.

Regarding auditing approvals, I recall touching upon the potential for Concourse to add better support for audit sinks, at least here: #41 (comment) but I'm sure the question has come up elsewhere. If Concourse had better support for external audit sinks in general (also sending more information than what's available in current audit logs) then such support would probably naturally support the kind of auditing needs that you foresee needing in a manual approval.

Thanks again! And sorry for the wall of text. sweat_smile

I seem to have repaid you in kind 😅 Thanks for taking the time to review the RFC!

@ari-becker
Copy link
Author

@vito the point about the value of an external state store for approvals made me think about another point -

Why does Concourse have a Postgres database to begin with? To keep track of pipeline configurations, build coordination, and build logs, more or less?

Allow me to draw a hypothetical:

  • All of Concourse's pipelines, team configurations, etc. are all defined in some configuration file. Concourse starts, pointing to this configuration file, and from it assembles all of the pipelines, teams, etc. Changes can be made by changing the configuration file and sending SIGHUP. For highly available setups, the file can instead be put in an object bucket somewhere.
  • Build coordination is stored in etcd / Raft-protocol distributed consensus
  • Instead of storing build logs in a Postgres database, instead allow pipelines to declare a "build log sink" which sends the build logs somewhere. This allows the user to choose to send the logs anywhere from a Postgres database (what's currently done), to an object store (after the build is completed, it's basically just an object), to an Elasticsearch cluster (better for indexing and searching logs), to /dev/null (user doesn't care about log persistence).

If you think this has merit, I can write it up in a separate RFC... this one isn't the place to debate it. The point I wish to raise with this hypothetical is, as long as Concourse maintains state internally at all, whether the scope of that state should be expanded to support a given feature (say, manual steps) becomes debatable, as it essentially represents an attempt to get a "bang for the buck" simple implementation, in the way that storing build logs in Postgres is "good enough" for most use-cases. If a design goal of Concourse is to reduce the amount of state which Concourse itself is responsible for, then perhaps we should have more interesting RFCs on the reduction in the scope of state - like build log sinks.

@vito
Copy link
Member

vito commented May 25, 2020

@ari-becker PostgreSQL is a bit of a compromise for making Concourse easy to use out-of-the-box. There's all sorts of cool tech we could utilize but I don't want to scare folks away who just want an open-source CI solution that's easy to deploy and use for side projects. Some folks even complain about the Postgres dependency, since Jenkins doesn't need such a thing. 😆

To me the most risky state we store in Postgres is pinned / disabled versions and paused jobs, but I don't know where else we could really put that information. 🤷 They're what I'd classify as "emergent state" in that pipeline operators sometimes need to be able to quickly pull the trigger on them, but it would be a lot closer to our ideals if we had some way to tie them to a source of truth.

I'd be very interested in hashing out an RFC about your first point. I think it would eliminate a lot of the need for a public HTTP API, since a lot of the demand seems to be centered around configuring Concourse. The more we can make Concourse self-configuring, the closer we can get to pipelines (or more generally, build plans) as the only API to worry about. I talked about that in concourse/concourse#3208 which I see you were also active in.

Re: build log sinks, it's probably worth checking out @aoldershaw's RFC: #53 - and associated PR: concourse/concourse#5651

@ari-becker
Copy link
Author

@ari-becker PostgreSQL is a bit of a compromise for making Concourse easy to use out-of-the-box. There's all sorts of cool tech we could utilize but I don't want to scare folks away who just want an open-source CI solution that's easy to deploy and use for side projects. Some folks even complain about the Postgres dependency, since Jenkins doesn't need such a thing. laughing

For what it's worth, running a static binary by itself is strictly less complex than running a static binary and a database. I'm not saying that Jenkins is better, it's much worse, but I would think that finding a way to run Concourse without the database would only make it easier to use out-of-the-box.

To me the most risky state we store in Postgres is pinned / disabled versions and paused jobs, but I don't know where else we could really put that information. shrug They're what I'd classify as "emergent state" in that pipeline operators sometimes need to be able to quickly pull the trigger on them, but it would be a lot closer to our ideals if we had some way to tie them to a source of truth.

I would look to Kubernetes and kubectl for inspiration here. It's a really bad anti-pattern for client code to be able to touch the database directly, and I don't think anybody would suggest that users should write SQL directly to pause/unpause pipelines or enable/disable/pin versions. But what Kubernetes did with manifests is to essentially expose most of the relevant bits of the etcd database to the user directly (working through the API server mostly as a secure proxy), therefore making it more explicit to the user what was happening, and therefore making it clear that the "quick-trigger-pull" way of doing things was not the only way of doing them.

What I'm trying to refer to by that is, let's say that I want to scale a Kubernetes deployment. On the one hand, I can execute kubectl scale deployment/example --replicas=5. This is the "quick-trigger-pull" way of scaling a deployment, and it's similar to quickly executing fly pause-pipeline etc. However, that's not the only way to scale a Kubernetes deployment. I can also edit the manifest itself:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 5
  # most fields omitted

and indeed, the only thing that actually happens when I call kubectl scale deployment/example --replicas=5 is an editing of the underlying manifest.

I point this out because, if pipeline manifests looked something like the following:

paused: false
resources:
- name: example
  versions:
    pinned: { hash: foo }
    disabled:
    - hash: bar
jobs:
- name: baz
  paused: true

then I think that you'd find that the problem more-or-less solves itself in the same way as Kubernetes manifests.

I'd be very interested in hashing out an RFC about your first point. I think it would eliminate a lot of the need for a public HTTP API, since a lot of the demand seems to be centered around configuring Concourse. The more we can make Concourse self-configuring, the closer we can get to pipelines (or more generally, build plans) as the only API to worry about. I talked about that in concourse/concourse#3208 which I see you were also active in.

Cool, I'll start writing something up... might take me some time though.

Re: build log sinks, it's probably worth checking out @aoldershaw's RFC: #53 - and associated PR: concourse/concourse#5651

I went ahead and added some comments on the RFC :)

@ari-becker
Copy link
Author

I'd be very interested in hashing out an RFC about your first point. I think it would eliminate a lot of the need for a public HTTP API, since a lot of the demand seems to be centered around configuring Concourse. The more we can make Concourse self-configuring, the closer we can get to pipelines (or more generally, build plans) as the only API to worry about. I talked about that in concourse/concourse#3208 which I see you were also active in.

For the sake of bookkeeping on this post (as I've reached out to you separately and I know you already know about it), the RFC is up here: #59

@ari-becker
Copy link
Author

I'm cleaning up my old open PRs. This hasn't gone anywhere, so I'll close it.
I would be happy to re-open if someone thinks I should.

@ari-becker ari-becker closed this Jul 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants