Conversation
9ab66a4 to 1cc8f9c
docs/proposals/job_controller.md
I think this use case list is missing some of the insights from this comment here:
kubernetes/kubernetes#1624 (comment)
Can you augment it to capture the worker coordination need (or not)? How does a pod spec template vary across controller instances so that workers only work on an assigned portion of the work?
Frankly, I didn't even think about any of those use cases; somehow I missed it. I'll try to catch you on IRC to talk it through.
1cc8f9c to 63576e4
docs/proposals/job_controller.md
Why do these have to be part of the same object design? A scheduled job creates multiple jobs at different times. A single job has no link to other jobs (a scheduled job may want to be exclusive if a job is already running, but jobs can provide no such guarantee).
I want to see a justification for this being a single object vs. separate Job and ScheduledJob objects.
Primarily because upstream has already indicated they believe this is two objects
After talking to @derekwaynecarr yesterday, I'll be separating this into two objects. That was my original design some time ago, but I was foolishly convinced to merge it into one.
Is the controller going to produce events about jobs? If so, the proposal should state what events / when.
63576e4 to 31555e7
I've reworked my proposal, splitting the job controller and scheduled job controller into two separate docs. Reviews appreciated.
docs/proposals/job.md
Isn't this the same as Replicas?
Yes it is, are you suggesting renaming?
Yes, to keep it consistent, if it refers to the same thing.
The upstream PR to add types (kubernetes/kubernetes#7380) calls this Completions. I don't think it makes sense to call it Replicas unless you truly mean n identical pods.
I don't like Completions either, as it refers to actual status and not desire, IMHO.
@ncdc It will always be n identical pods, unless I'm missing some use cases, in which case please provide them.
I'm not a huge fan of Completions either, FWIW, but I at least wanted to point out what the upstream PR has.
Upstream talked about 3 different possible use cases (of which only 2 really make sense):
- n identical pod specs, all performing the exact same work (doesn't really make sense / can't think of a reason why you'd have n > 1)
- n similar pod specs, with variation in something (env vars, hostname, data in a mounted volume, etc.) to indicate what portion of the overall work the pod should take (i.e. sharding)
- n identical pod specs, where each realized pod queries some sort of centralized work dispatcher service to request a portion of work
I'm aware of that (they call those type 1, aka static, and type 2, aka dynamic), and I've spent some time trying to figure out a smart solution that addresses both. From what I followed of the thread starting from here, @davidopp and @bgrant0607 were leaning towards starting with the dynamic type and addressing the static one later on. To handle that, I was thinking about adding an additional field to assign workload on a per-pod basis; either way you don't care which pod gets assigned which part of the actual job, as long as the overall progress is achieved. That workload will then have to be part of the job execution, though, so we know what to assign to a new pod if one of the original ones dies.
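For illustration, here is a minimal, self-contained Go sketch of that per-pod assignment idea; the types, the helper, and the SHARD_INDEX/SHARD_COUNT variable names are hypothetical stand-ins, not part of the proposal:

```go
package main

import "fmt"

// Simplified stand-ins for the real pod template types.
type EnvVar struct {
	Name, Value string
}

type PodTemplate struct {
	Image string
	Env   []EnvVar
}

// podForShard returns a copy of the base template with a shard index injected,
// so each pod works only on its assigned portion of the overall job.
func podForShard(base PodTemplate, index, total int) PodTemplate {
	pod := base
	pod.Env = append(append([]EnvVar{}, base.Env...),
		EnvVar{Name: "SHARD_INDEX", Value: fmt.Sprint(index)},
		EnvVar{Name: "SHARD_COUNT", Value: fmt.Sprint(total)},
	)
	return pod
}

func main() {
	base := PodTemplate{Image: "example/worker"}
	for i := 0; i < 3; i++ {
		fmt.Printf("%+v\n", podForShard(base, i, 3))
	}
}
```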
Do not call it podCount. The name must accurately reflect the purpose: number of completions, min simultaneous pods, max simultaneous pods, ...
TaskCount would be more descriptive in the case that it is intended to represent the desired number of completed tasks.
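To keep the naming candidates from this thread side by side, a rough sketch; the field names are purely illustrative and none of them are settled here:

```go
// Illustrative only: candidate names from the discussion, not a final API.
type JobSpec struct {
	// Completions (TaskCount has also been suggested): the desired number of
	// successfully completed pods, i.e. how much work has to finish.
	Completions int

	// Parallelism: how many pods may run simultaneously; the
	// "min/max simultaneous pods" idea could instead be two separate fields.
	Parallelism int
}
```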
At this point, maybe let's move this proposal to Kube and record Brian's comments. I think the initial cleanup of the proposal was enough, but this should be in front of the wider audience (now that folks have time to review upstream).
On Wed, Jul 22, 2015 at 1:44 PM, Brian Grant notifications@github.com wrote:

> In docs/proposals/job.md (#3693 (comment)):
>
> `JobSpec` structure is defined to contain all the information how the actual job execution will look like.
>
> ```go
> // JobSpec describes how the job execution will look like.
> type JobSpec struct {
>     // PodCount specifies the desired number of pods the job should be run with.
>     PodCount int
> ```
>
> TaskCount would be more descriptive in the case that it is intended to represent the desired number of completed tasks.

Clayton Coleman | Lead Engineer, OpenShift
docs/proposals/job.md
We're not sure if this will work and/or if it's worth the added complexity. For example, while builds and deployments both follow similar conventions for moving their objects from new to pending to running and so on, there is logic that each controller implements alongside the state transitions that is specific to each resource. /cc @smarterclayton @pmorie @derekwaynecarr
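For reference, a sketch of the kind of shared new/pending/running convention being referred to; the names are illustrative, and whether a job should reuse them is exactly the open question here:

```go
// Illustrative only: a build/deployment-style phase progression. Each
// controller still carries resource-specific logic around these
// transitions, which is the added complexity mentioned above.
type JobPhase string

const (
	JobNew      JobPhase = "New"
	JobPending  JobPhase = "Pending"
	JobRunning  JobPhase = "Running"
	JobComplete JobPhase = "Complete"
	JobFailed   JobPhase = "Failed"
)
```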
I wasn't sure about that. I remember a discussion on this topic quite some time ago; that's why I put it down here.
31555e7 to afb4315
cc @timothysc
General comment: why not post this as a proposal upstream, because it's clearly going to happen upstream...
docs/proposals/job.md
Run-after dependencies are a tricky area:
- need a convenient mechanism for creating references to other jobs
- need to ensure completion status is not lost until it is consumed by dependencies
- need to kill dependent jobs in the case of failure, with clear reason attribution
- eventually need to support cross-cluster dependencies
I recommend deferring this feature to later.
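Purely to make the first bullet above concrete, a rough sketch of what a reference to another job might look like; every name here is hypothetical, and this is exactly the surface being deferred:

```go
// Hypothetical sketch, not part of the proposal: one possible shape for
// the "references to other jobs" mechanism mentioned above.
type JobDependency struct {
	// Namespace and Name identify the job that must complete first.
	Namespace string
	Name      string
	// KillOnFailure controls whether this job is killed, with the failure
	// reason attributed, when the dependency fails.
	KillOnFailure bool
}

type JobSpecDependencies struct {
	// RunAfter lists jobs whose completion gates this job's start.
	RunAfter []JobDependency
}
```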
Makes sense - you can implement dependencies in a single pod if necessary.
On Wed, Jul 22, 2015 at 1:35 PM, Brian Grant notifications@github.com wrote:

> In docs/proposals/job.md (#3693 (comment)):
>
> ## Abstract
> A proposal for implementing a new controller - Job controller - which will be responsible for managing pod(s) that require to run-once to a completion, in contrast to what ReplicationController currently offers.
>
> Several existing issues and PRs were already created regarding that particular subject:
> * Job Controller #1624
> * New Job resource #7380
>
> ## Use Cases
> 1. Be able to start of one or several pods tracked as a single entity.
> 1. Be able to implement heavy-computation model, eg. MapReduce.
> 1. Be able to get the job status.
> 1. Be able to limit the execution time for a job.
> 1. Be able to create a chain of jobs dependent one on another.

Clayton Coleman | Lead Engineer, OpenShift
I'll create a section with future directions at the end of the document, so that it won't be forgotten.
The implementation is being prototyped and is part of our short-term roadmap. The proposal should move upstream soon.
We will carry a prototype in OpenShift regardless of whether it goes upstream in our short-term roadmap. The goal is an upstream implementation, though.
2 more general comments
Where is the cron controller? Was that a recent pull? The use requirement we're focused on is specifically cron in the short term.
Agree, we are not interested in heavy batch.
@bgrant0607 would know where the controller is... The Google folks did a demo a week or two ago showing the per-node daemon controller and the cron-like controller.
I think you may be referring to this:
OK, it does not seem similar to me in intent, although it has similarities.
As @smarterclayton pointed out earlier, we're mostly interested in the cron controller, and we see that cron controller as a good candidate to sit on top of a simple batch controller. Once we refine this here (at least the batch controller; I'm not sure if upstream will be interested in cron as well), it'll be pushed upstream.
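A minimal sketch of that layering, assuming a cron-style schedule string and a job template; the shape is only indicative, not a committed API:

```go
// Illustrative only: a scheduled (cron) job that stamps out plain jobs.
type ScheduledJobSpec struct {
	// Schedule in cron format, e.g. "0 * * * *" for once an hour.
	Schedule string

	// ConcurrencyPolicy could capture the exclusivity question raised
	// earlier, e.g. skip a firing while a previous job is still running.
	ConcurrencyPolicy string

	// JobTemplate is the plain batch job created on each firing, so the
	// cron controller only layers timing on top of the job controller.
	JobTemplate JobSpec
}

// JobSpec is kept minimal here; it stands in for the batch job spec
// discussed earlier in this thread.
type JobSpec struct {
	Completions int
	Parallelism int
}
```

The point of the layering is that the cron controller would not need to know anything about pods; it only creates and tracks Job objects.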
Both controllers would be things that would go upstream.
afb4315 to 5337123
I'm closing this PR in favor of the upstream proposal kubernetes/kubernetes#11746.
This is the initial proposal for the Job controller, which was started in #2000. In the meantime I'll start working on the implementation to help figure out some details.
@nwendling @danmcp @smarterclayton @derekwaynecarr PTAL
Once we get this merged/agreed here, I'll move the proposal upstream.