
Job controller proposal #3693

Closed
soltysh wants to merge 2 commits into openshift:master from soltysh:job_controller_proposal

Conversation

@soltysh
Contributor

@soltysh soltysh commented Jul 14, 2015

This is the initial proposal for the Job controller, which was first started in #2000. In the meantime I'll start working on the implementation to help figure out some of the details.
@nwendling @danmcp @smarterclayton @derekwaynecarr ptal
Once we get this merged/agreed on here, I'll move the proposal upstream.

@soltysh soltysh force-pushed the job_controller_proposal branch from 9ab66a4 to 1cc8f9c Compare July 15, 2015 12:07
Member

I think this use case list is missing some of the insights from this comment here:

kubernetes/kubernetes#1624 (comment)

Can you augment the list to capture the worker coordination need (or explain why not)? How does a pod spec template vary across controller instances so that workers only work on an assigned portion of the work?
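For the variation-across-instances question, a minimal sketch of one way a controller could stamp out near-identical pod templates that differ only in env vars telling each worker its assigned shard. All names here (`PodTemplate`, `shardTemplates`, `SHARD_INDEX`, `SHARD_TOTAL`) are hypothetical, not from the proposal:

```go
package main

import "fmt"

// PodTemplate is a stand-in for the real pod template type; only the
// fields needed for this sketch are included.
type PodTemplate struct {
	Image string
	Env   map[string]string
}

// shardTemplates clones a base template n times, injecting SHARD_INDEX
// and SHARD_TOTAL so each worker knows which slice of the work is its own.
func shardTemplates(base PodTemplate, n int) []PodTemplate {
	out := make([]PodTemplate, 0, n)
	for i := 0; i < n; i++ {
		env := map[string]string{}
		for k, v := range base.Env {
			env[k] = v
		}
		env["SHARD_INDEX"] = fmt.Sprintf("%d", i)
		env["SHARD_TOTAL"] = fmt.Sprintf("%d", n)
		out = append(out, PodTemplate{Image: base.Image, Env: env})
	}
	return out
}

func main() {
	shards := shardTemplates(PodTemplate{Image: "worker:v1", Env: map[string]string{}}, 3)
	for _, s := range shards {
		fmt.Printf("%s shard %s/%s\n", s.Image, s.Env["SHARD_INDEX"], s.Env["SHARD_TOTAL"])
	}
}
```

Under this scheme the templates stay a single spec in the API and the per-instance variation is mechanical, which is one answer to the coordination question above.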

Contributor Author

Frankly, I didn't even think about any of those use cases; somehow I missed them. I'll try to catch you on IRC to talk it through.

@soltysh soltysh force-pushed the job_controller_proposal branch from 1cc8f9c to 63576e4 Compare July 16, 2015 14:02
Contributor

Why do these have to be part of the same object design? A scheduled job creates multiple jobs at different times. A single job has no link to other jobs (a scheduled job may want to be exclusive if a job is already running, but jobs can provide no such guarantee).

I want to see a justification for this being a single object vs. separate Job and ScheduledJob objects.

Contributor

Primarily because upstream has already indicated they believe this is two objects

Contributor Author

After talking to @derekwaynecarr yesterday, I'll be separating this into two objects. That was my original design some time ago, but I was foolishly convinced to merge them into one.

@pmorie
Contributor

pmorie commented Jul 16, 2015

Is the controller going to produce events about jobs? If so, the proposal should state what events / when.
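One way to answer this question concretely: a sketch of the lifecycle points at which such a controller could record events. The event reasons here are illustrative guesses, not from the proposal:

```go
package main

import "fmt"

// Event is a trimmed stand-in for a Kubernetes event; only the reason
// and message matter for this sketch.
type Event struct {
	Reason  string
	Message string
}

// jobEvents sketches lifecycle points at which a job controller could
// record events: each pod creation, overall completion, and hitting an
// execution deadline.
func jobEvents(podsCreated int, completed, deadlineHit bool) []Event {
	var evs []Event
	for i := 0; i < podsCreated; i++ {
		evs = append(evs, Event{"SuccessfulCreate", fmt.Sprintf("Created pod %d of the job", i+1)})
	}
	if deadlineHit {
		evs = append(evs, Event{"DeadlineExceeded", "Job was active longer than its allowed execution time"})
	} else if completed {
		evs = append(evs, Event{"Completed", "Job completed"})
	}
	return evs
}

func main() {
	for _, e := range jobEvents(2, true, false) {
		fmt.Println(e.Reason + ": " + e.Message)
	}
}
```

Whatever the final reason strings turn out to be, enumerating them like this in the proposal would address the question above.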

@soltysh soltysh force-pushed the job_controller_proposal branch from 63576e4 to 31555e7 Compare July 17, 2015 11:57
@soltysh
Contributor Author

soltysh commented Jul 17, 2015

I've reworked my proposal, splitting the job controller and the scheduled job controller into two separate docs. I'd appreciate reviews.

Contributor

Isn't this the same as Replicas?

Contributor Author

Yes, it is. Are you suggesting renaming it?

Contributor

Yes, to keep it consistent if it refers to the same thing.

Contributor

The upstream PR to add types (kubernetes/kubernetes#7380) calls this Completions. I don't think it makes sense to call it Replicas unless you truly mean n identical pods.

Contributor Author

I don't like Completions either, as it refers to the actual status and not the desired state, IMHO.
@ncdc it will always be n identical pods, unless I'm missing some use case, in which case please provide it.

Contributor

I'm not a huge fan of Completions either, FWIW, but I at least wanted to point out what the upstream PR has.

Upstream talked about 3 different possible use cases (of which only 2 really make sense):

  1. n identical pod specs, all performing the exact same work (doesn't really make sense / can't think of a reason why you'd have n > 1)
  2. n similar pod specs, with variation in something (env vars, hostname, data in a mounted volume, etc.) to indicate what portion of the overall work the pod should take (i.e. sharding)
  3. n identical pod specs, where each realized pod queries some sort of centralized work dispatcher service to request a portion of work
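The third variant (identical pods pulling work from a central dispatcher) can be sketched as a worker loop against a shared queue. Everything here is illustrative; the dispatcher would in practice be a service, not an in-process struct:

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher hands out units of work to identical workers on request,
// standing in for the "centralized work dispatcher service".
type dispatcher struct {
	mu    sync.Mutex
	tasks []string
}

// next returns the next unclaimed task, or ok=false when the work is done.
func (d *dispatcher) next() (task string, ok bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return "", false
	}
	task = d.tasks[0]
	d.tasks = d.tasks[1:]
	return task, true
}

// worker is what each of the n identical pods would run: ask for work
// until the dispatcher runs dry.
func worker(d *dispatcher, done func(string)) {
	for {
		t, ok := d.next()
		if !ok {
			return
		}
		done(t)
	}
}

func main() {
	d := &dispatcher{tasks: []string{"a", "b", "c", "d"}}
	var mu sync.Mutex
	processed := 0
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two identical "pods"
		wg.Add(1)
		go func() {
			defer wg.Done()
			worker(d, func(string) {
				mu.Lock()
				processed++
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	fmt.Println(processed, "tasks processed") // 4 tasks processed
}
```

Note that in this variant the pod specs really are identical, which is why the discussion above treats it separately from the sharded variant.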

Contributor Author

I'm aware of that (they call those type 1, aka static, and type 2, aka dynamic), and I've spent some time trying to figure out a smart solution to address both. From what I followed in the thread starting from here, @davidopp and @bgrant0607 were leaning towards starting with the dynamic type and addressing the static one later on. To address that, I was thinking about adding an additional field to assign workload on a per-pod basis; either way you don't care which pod gets assigned which part of the actual job, as long as the overall progress is achieved. That workload will then have to be part of the job execution, so we know what to assign to a new pod if one of the original ones dies.


Do not call it podCount. The name must accurately reflect the purpose: number of completions, min simultaneous pods, max simultaneous pods, ...


TaskCount would be more descriptive in the case that it is intended to represent the desired number of completed tasks.
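The naming objection above can be made concrete by splitting the overloaded count into two separately named knobs, so each name states exactly what it controls. The field names below are my own illustration, not taken from the proposal or the upstream PR:

```go
package main

import "fmt"

// JobSpec splits the overloaded "PodCount" into two fields: one for the
// required number of successful completions, one for the cap on pods
// running at the same time. (Illustrative names, not a settled API.)
type JobSpec struct {
	// Completions is the number of successful pod terminations required
	// before the job is considered done.
	Completions int
	// Parallelism is the maximum number of pods running simultaneously.
	Parallelism int
}

// podsToStart computes how many new pods the controller should launch,
// given how many completions have succeeded and how many pods are active.
func podsToStart(s JobSpec, succeeded, active int) int {
	remaining := s.Completions - succeeded
	if remaining < 0 {
		remaining = 0
	}
	want := remaining
	if want > s.Parallelism {
		want = s.Parallelism
	}
	want -= active
	if want < 0 {
		want = 0
	}
	return want
}

func main() {
	s := JobSpec{Completions: 5, Parallelism: 2}
	fmt.Println(podsToStart(s, 0, 0)) // prints 2: start up to the parallelism cap
}
```

With two fields, "number of completions" and "max simultaneous pods" no longer compete for one name, which is the ambiguity both comments above are circling.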

Contributor

At this point, maybe let's move this proposal to Kube and record Brian's comments. I think the initial cleanup of the proposal was enough, and this should be in front of the wider audience (now that folks have time to review upstream).

On Wed, Jul 22, 2015 at 1:44 PM, Brian Grant notifications@github.com wrote:

In docs/proposals/job.md #3693 (comment):

```go
	TypeMeta
	ListMeta
	Items []Job
}
```

JobSpec structure is defined to contain all the information about how the actual job execution will look:

```go
// JobSpec describes how the job execution will look like.
type JobSpec struct {
	// PodCount specifies the desired number of pods the job should be run with.
	PodCount int
}
```

TaskCount would be more descriptive in the case that it is intended to represent the desired number of completed tasks.



Clayton Coleman | Lead Engineer, OpenShift

Contributor

We're not sure if this will work and/or if it's worth the added complexity. For example, while builds and deployments both follow similar conventions for moving their objects from new to pending to running and so on, there is logic that each controller implements alongside the state transitions that is specific to each resource. /cc @smarterclayton @pmorie @derekwaynecarr

Contributor Author

I wasn't sure about that; I remember a discussion on this topic quite some time ago, which is why I put it down here.

@soltysh soltysh force-pushed the job_controller_proposal branch from 31555e7 to afb4315 Compare July 21, 2015 12:54
@soltysh
Contributor Author

soltysh commented Jul 21, 2015

I've addressed concerns from @mfojtik and @ncdc, still waiting for more.

@bgrant0607

cc @timothysc

@timothysc

General comment: why not post this as a proposal upstream, b/c this is clearly going to happen upstream...


Just call it a DAG / Workflow.


Run-after dependencies are a tricky area:

  • need a convenient mechanism for creating references to other jobs
  • need to ensure completion status is not lost until it is consumed by dependencies
  • need to kill dependent jobs in the case of failure, with clear reason attribution
  • eventually need to support cross-cluster dependencies

I recommend deferring this feature to later.
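The bullets above hint at what a deferred dependency mechanism would have to carry. A hypothetical sketch, entirely my own (none of these types are in the proposal): references to other jobs, completion status that must not be lost before dependents consume it, and failure kills with clear reason attribution:

```go
package main

import "fmt"

// JobRef names another job this one runs after. Cross-cluster references
// (the last bullet above) would need an additional cluster field later.
type JobRef struct {
	Namespace string
	Name      string
}

type DependentJob struct {
	Name     string
	RunAfter []JobRef // jobs that must complete successfully first
}

// runnable reports whether all of a job's dependencies have completed,
// given a map of known completion results (true = succeeded). A failed
// dependency blocks the job permanently, with the failure attributed in
// the returned reason.
func runnable(j DependentJob, results map[JobRef]bool) (ok bool, reason string) {
	for _, ref := range j.RunAfter {
		succeeded, known := results[ref]
		if !known {
			return false, fmt.Sprintf("waiting for %s/%s", ref.Namespace, ref.Name)
		}
		if !succeeded {
			return false, fmt.Sprintf("dependency %s/%s failed", ref.Namespace, ref.Name)
		}
	}
	return true, ""
}

func main() {
	j := DependentJob{Name: "report", RunAfter: []JobRef{{"batch", "extract"}}}
	ok, reason := runnable(j, map[JobRef]bool{})
	fmt.Println(ok, reason) // false waiting for batch/extract
}
```

Even this toy version shows why deferral is sensible: the hard part is not the check itself but guaranteeing the `results` map (completion status) survives until every dependent has consumed it.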

Contributor

Makes sense - you can implement dependencies in a single pod if necessary.

On Wed, Jul 22, 2015 at 1:35 PM, Brian Grant notifications@github.com wrote:

In docs/proposals/job.md #3693 (comment):

## Abstract
A proposal for implementing a new controller - Job controller - which will be responsible for managing pod(s) that need to run once to completion, in contrast to what ReplicationController currently offers.

Several existing issues and PRs were already created regarding that particular subject:
* Job Controller #1624
* New Job resource #7380

## Use Cases
1. Be able to start one or several pods tracked as a single entity.
1. Be able to implement a heavy-computation model, e.g. MapReduce.
1. Be able to get the job status.
1. Be able to limit the execution time for a job.
1. Be able to create a chain of jobs dependent one on another.


Contributor Author

I'll create a section with future directions at the end of the document, so that it won't be forgotten.

@smarterclayton
Contributor

The implementation is being prototyped and is part of our short-term deliverable. It's easier to prototype in OpenShift right now because master is closed upstream.

The proposal should move upstream soon.


@smarterclayton
Contributor

We will carry a prototype in OpenShift regardless of whether it goes upstream in our short-term roadmap. The goal is an upstream implementation, though.

@timothysc

Two more general comments:

  1. There already exists the diurnal/cron controller upstream.
  2. I would stick with simple batch only, because it makes far more sense to bridge in existing systems than it does to try and tread into their turf.

@smarterclayton
Contributor

Where is the cron controller? Was that a recent pull?

The use requirement we're focused on is specifically cron in the short term, expanded to small batch.


@smarterclayton
Contributor

Agree we are not interested in heavy batch.


@timothysc

@bgrant0607 would know where the controller is... The Google folks did a demo a week or two ago showing the per-node daemon controller and the cron-like controller.

@davidopp

I think you may be referring to this:
kubernetes/kubernetes#10881

@smarterclayton
Contributor

OK, it does not seem similar to me in intent, although it has similarities. It seems more in the scaler class than the job class?


@soltysh
Contributor Author

soltysh commented Jul 22, 2015

As @smarterclayton pointed out earlier, we're mostly interested in the cron controller, and we see the cron controller as a good candidate to sit on top of a simple batch controller. Once we refine this here (at least the batch controller; I'm not sure if upstream will be interested in cron as well), it'll be pushed upstream.
@timothysc I went through kubernetes/kubernetes#10881 and it's hardly what we expect from the cron controller.
Anyway, thanks for the comments, I'm always open for reviews :)

@smarterclayton
Contributor

Both controllers would be things that would go upstream.


@soltysh
Contributor Author

soltysh commented Jul 23, 2015

I'm closing this PR in favor of the upstream proposal kubernetes/kubernetes#11746.

@soltysh soltysh closed this Jul 23, 2015
@soltysh soltysh deleted the job_controller_proposal branch September 12, 2016 09:17