-
Notifications
You must be signed in to change notification settings - Fork 37
Add manual steps #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add manual steps #36
Conversation
Signed-off-by: Ari Becker <ari@coralogix.com>
71b9c7a to
002f0a1
Compare
|
Good to see this RFC. I also received a number of user asks for manual steps, because many users have Jenkins experience, and Jenkins has such functionality. I may consider to help implement this feature if I get some bandwidth. |
| build should proceed normally. If a user should select the "reject" option, | ||
| then the manual step should be marked as failed, and the build should respond | ||
| accordingly, as if any other step in the build had failed. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not only a decision (accept/reject), the manual step may also support to take some user inputs, which is a feature Jenkins supports, and many of our users asked for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@evanchaoli In my very subjective opinion, user inputs at build-time are an anti-patttern, as the values should instead be placed in a Git repository or some other resource, which is then provided as an input to the job. Having worked in places where user inputs were abused, copying and pasting values into a web form gets very old, very fast.
|
|
||
| It is occasionally desirable for the authority to approve `manual` steps to be | ||
| restricted to a limited subset of users. An RBAC role, `supervisor`, should be | ||
| added to Concourse teams. A user with the `supervisor` role has the power to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of introducing a new role, I think Concourse should support team level user defined roles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@evanchaoli if so, that sounds to me like a matter for a separate RFC.
| It is occasionally desirable for the authority to approve `manual` steps to be | ||
| restricted to a limited subset of users. An RBAC role, `supervisor`, should be | ||
| added to Concourse teams. A user with the `supervisor` role has the power to | ||
| approve manual steps. Team admins, by virtue of having the power to alter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the biggest challenge is that team admins could be removing the manual step, invalidating it's usefulness. Any thoughts on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that it's reasonable to consider team admins a threat under a security model which depends upon manual steps to enforce policy. A malicious team admin, instead of altering the pipeline to remove the manual step, could do damage by altering the pipeline to add a malicious resource, and more damage would be done, with a lower probability of detection, by doing so. In my opinion, organizations which consider team admins a threat should have separate pipelines, ideally on separate Concourse instances (soon, the same Concourse instance but on different teams, with the introduction of team credential management), such that a different team has the opportunity to review the output from the pipeline which the malicious team admin is administrating.
I see the purpose of a manual step less as a measure to protect against malicious internal actors and more as a measure to review automatically generated output for which automated testing is difficult, or not worth the engineering cost, to implement.
|
Wonders if this is progressing further? 🙇 🤔 |
vito
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for putting this together, and double thanks for your patience! 🙏
Just to be upfront, and I believe I'm preaching to the choir here based on some of your review comments, I think Concourse's strength is in being a continuous thing-doer, and we should avoid turning it into a general GUI driver for executing workloads interactively. It's more designed around "give me a declarative config which describes your total workflow and I'll connect the dots" - while making a conscious effort to not have Concourse be a mission-critical part of the workflow.
So I'm very thankful that this proposal is pretty limited in scope and doesn't tread into the "arbitrary user-provided input" territory. 🙂
That being said, I think it would be more Concoursey to have an external record of who approved what and when, from the standpoint that Concourse should never be the source of truth - it's the dot connector, not the dot. I understand that this approach is painful and awkward to implement in practice right now, but I would like to unpack the specifics there just to make sure we're at our wit's end and we really just need to add a manual step after all.
For example, @eedwards-sk brought up the idea of using Slack messages to convey approval. It's awfully convenient that this is what they naturally wanted to be able to do while also being more Concoursey in that it allows for an external record of approvals.
This might even be something we can support through prototypes (#37) if we figure out what the interface looks like.
I think the downside of this approach is it requires you to have some sort of external mechanism for approval, but there's at least some evidence that that is a common case, and this might be one of those things where Concourse forces you to "do the right thing" - which may feel heavy on day 1, but may pay off on day 1000 when you really need to figure out who OK'd the build on August 24, 2017 and you've since migrated your pipelines to a different Concourse instance altogether.
There are a few other benefits to it - externalizing the approval also lets "who can approve" be configured elsewhere, rather than requiring RBAC in Concourse. It also lets whoever's doing the approval stay in an appropriate context - such as replying to an e-mail or a Slack thread - rather than diving into a Concourse pipeline and clicking a button, which may feel awkward for folks who's only role in a pipeline is to approve things. I'm just theorycrafting here, though.
I would imagine viewing a build that's waiting for approval and then having to go over to Slack would be annoying. Maybe that could be addressed by still allowing approval through the Concourse UI, but then recording that action through some other event so that it can still be logged in the external service.
So, I guess what I'm asking is this: do you value having a history of who approved what and when, and does your workflow currently involve a service that could be used for such a thing? Following that, would that workflow feel natural or more like a constant burden?
Thanks again! And sorry for the wall of text. 😅
| each manual step must define a timeout step modifier. Manual steps are special | ||
| in this requirement (compared to other steps) in order to ensure that forgotten | ||
| jobs with open manual approvals, including their task containers and volumes, | ||
| are garbage-collected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really think it's necessary to require this. In principle any build could run for an unbounded amount of time from a hanging task/get/put step, and this has come up before in concourse/concourse#2341 - so perhaps it's best to frame this as an overall step max execution time if such a thing is really needed, and not specifically call it out in this RFC.
I think it makes more sense to just let users configure the timeout: step modifier when necessary. Weekends and vacations shouldn't have to mean a failed/timed-out build in the pipeline - as long as the build doesn't cause some sort of resource leak by virtue of running for that long I think it's OK for it to just wait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you're getting at here.
The design presumption I'm working from is, it's better to be explicit about requiring the user to set expectations, than to allow the user an easy way to shoot himself in the foot.
Let's take GitHub PRs as an example - it's pretty easy to end up with PRs staying open for indefinite periods of time, because priorities, and people switch their attention away and forget about stuff that was opened. The cost for a PR staying open for months on-end is basically zero (maybe some lost effort whose value won't be realized since it's not merged in), so it's fine not to add complexity to people opening PRs by asking them to define, when they open the PR, when to self-close the PR (even asking that is kind of hostile, from a design perspective).
But the cost for a manual step staying open is a resource leak, one that is sure to make Concourse operators tear their hair out: for one, because debugging silent resource leaks is annoying and difficult to pin down, and when the operators find that their users are at fault (because of laziness?), they're going to feel like there's nothing they can do to prevent it from recurring. It would make the system look broken.
Sure, the user can provide a timeout: step-modifier, but if it's not required by the system, then it'll pass fly validate-pipeline, and I don't think that's a (forgive the euphemism) healthy design. I think a better compromise is to require the user to provide timeout:, but allow the user to specify timeout: 0 (for unlimited) if the user really wants it (really this should be true of all steps, but I digress).
Allowing operators to set overall step max execution time is a good idea, but tricky - one instance may have both long-running jobs and short jobs. But probably a good idea in practice.
|
Re. @vito :
100%.
Permit me to tell a short story: once upon a time, I was a contractor, and I was doing some work for a bank. The bank had software that kept track of approvals from management for workflow like deploying to production. It was off-the-shelf from some BigEnterpriseSoftwareVendorCo and I honestly can't remember the name of it, but I do remember that everyone hated that software. That software was both bureaucratically impossible to make changes to (as it was an integral part of the bank's compliance workflow), technically impossible to make changes to (old, no open documentation, no documented API, etc.), and scapegoated the frustrations of people who work in big bank environments (needing a zillion approvals to do simple tasks). I'm sure, as any other piece of corporate IT, that software represented its own operational complexity - web servers, a load balancer, a database, monitoring, SSO login, etc. For the sake of argument, let's say that a Concourse user who wishes to configure a pipeline wants such an external approval system, As an added bonus, this system does even more than what you suggested - as a dedicated system, it can provide its own rules engine, its own notifications, its own audit trail, etc. It will necessarily be much more powerful than anything Concourse could provide. So my question is this: do you know of any systems like this that could be pointed to for people who need to implement this pattern? Something that's relatively easy for small teams to set up and maintain for the "simple" use-case (i.e. reviewing automatically generated output), but powerful enough for big enterprise teams as their needs grow and compliance becomes a genuine business concern? Even if it doesn't have Concourse resources yet (hopefully it wouldn't be too difficult to create a third-party resource, as long as the API is well-documented)? Maybe available as a SaaS so that we don't have to run the infrastructure ourselves?
We don't even keep a durable build history. We set up our Concourse instance such that, if our Concourse instance blew up, we run a script and it sets everything back up for us, soup to nuts - deploying Concourse on Kubernetes, creating teams, setting up auth, setting up credential managers (including populating the credential managers with secrets), running a bootstrap pipeline that sets up all of the other pipelines in the system. Nobody really cares about build history because all of our pipelines are set up with idempotent-only operations in mind following a GitOps model - if something is screwed up, just re-run, and it either fixes the issue or reproduces it for you in a fresh build log. So we run PostgreSQL in a Kubernetes pod on commodity nodes, with no backups. We take the risk of being down for a few hours to give everything time to run, but that's fine with us for now.... actually, it's been fine for us for two years and counting. #yolo We couldn't care less of keeping track of approvals and the idea of having to set up an entirely separate system to keep durable track of approvals sounds like an anti-pattern to me, considering the rest of how our infrastructure is set up. Regarding auditing approvals, I recall touching upon the potential for Concourse to add better support for audit sinks, at least here: #41 (comment) but I'm sure the question has come up elsewhere. If Concourse had better support for external audit sinks in general (also sending more information than what's available in current audit logs) then such support would probably naturally support the kind of auditing needs that you foresee needing in a manual approval.
I seem to have repaid you in kind 😅 Thanks for taking the time to review the RFC! |
|
@vito the point about the value of an external state store for approvals made me think about another point - Why does Concourse have a Postgres database to begin with? To keep track of pipeline configurations, build coordination, and build logs, more or less? Allow me to draw a hypothetical:
If you think this has merit, I can write it up in a separate RFC... this one isn't the place to debate it. The point I wish to raise with this hypothetical is, as long as Concourse maintains state internally at all, whether the scope of that state should be expanded to support a given feature (say, manual steps) becomes debatable, as it essentially represents an attempt to get a "bang for the buck" simple implementation, in the way that storing build logs in Postgres is "good enough" for most use-cases. If a design goal of Concourse is to reduce the amount of state which Concourse itself is responsible for, then perhaps we should have more interesting RFCs on the reduction in the scope of state - like build log sinks. |
|
@ari-becker PostgreSQL is a bit of a compromise for making Concourse easy to use out-of-the-box. There's all sorts of cool tech we could utilize but I don't want to scare folks away who just want an open-source CI solution that's easy to deploy and use for side projects. Some folks even complain about the Postgres dependency, since Jenkins doesn't need such a thing. 😆 To me the most risky state we store in Postgres is pinned / disabled versions and paused jobs, but I don't know where else we could really put that information. 🤷 They're what I'd classify as "emergent state" in that pipeline operators sometimes need to be able to quickly pull the trigger on them, but it would be a lot closer to our ideals if we had some way to tie them to a source of truth. I'd be very interested in hashing out an RFC about your first point. I think it would eliminate a lot of the need for a public HTTP API, since a lot of the demand seems to be centered around configuring Concourse. The more we can make Concourse self-configuring, the closer we can get to pipelines (or more generally, build plans) as the only API to worry about. I talked about that in concourse/concourse#3208 which I see you were also active in. Re: build log sinks, it's probably worth checking out @aoldershaw's RFC: #53 - and associated PR: concourse/concourse#5651 |
For what it's worth, running a static binary by itself is strictly less complex than running a static binary and a database. I'm not saying that Jenkins is better, it's much worse, but I would think that finding a way to run Concourse without the database would only make it easier to use out-of-the-box.
I would look to Kubernetes and What I'm trying to refer to by that is, let's say that I want to scale a Kubernetes deployment. On the one hand, I can execute apiVersion: apps/v1
kind: Deployment
metadata:
name: example
spec:
replicas: 5
# most fields omittedand indeed, the only thing that actually happens when I call I point this out because, if pipeline manifests looked something like the following: paused: false
resources:
- name: example
versions:
pinned: { hash: foo }
disabled:
- hash: bar
jobs:
- name: baz
paused: truethen I think that you'd find that the problem more-or-less solves itself in the same way as Kubernetes manifests.
Cool, I'll start writing something up... might take me some time though.
I went ahead and added some comments on the RFC :) |
For the sake of bookkeeping on this post (as I've reached out to you separately and I know you already know about it), the RFC is up here: #59 |
|
I'm cleaning up my old open PRs. This hasn't gone anywhere, so I'll close it. |
This proposal outlines the addition of a
manualstep type.RFC written as requested in concourse/concourse#3265.