docs/proposals/job.md (151 additions, 0 deletions)
# Job Controller

## Abstract
A proposal for implementing a new controller - Job controller - which will be responsible
for managing pods that need to run once to completion, in contrast to the continuously
running pods that ReplicationController currently offers.

Several existing issues and PRs were already created regarding that particular subject:
* Job Controller [#1624](https://github.com/GoogleCloudPlatform/kubernetes/issues/1624)
* New Job resource [#7380](https://github.com/GoogleCloudPlatform/kubernetes/pull/7380)


## Use Cases
1. Be able to start one or several pods tracked as a single entity.
1. Be able to implement basic batch oriented tasks.
1. Be able to get the job status.
1. Be able to limit the execution time for a job.
1. Be able to specify the number of instances performing a task.
1. Be able to specify triggers on job’s success/failure.


## Motivation
Jobs are needed for executing multi-pod computation to completion; a good example
is the ability to implement a MapReduce or Hadoop style workload.
Additionally, this new controller should take over the pod management logic we currently
have in certain OpenShift controllers, namely the build controller.
> **Contributor:** We're not sure if this will work and/or if it's worth the added complexity. For example, while builds and deployments both follow similar conventions for moving their objects from new to pending to running and so on, there is logic that each controller implements alongside the state transitions that is specific to each resource. /cc @smarterclayton @pmorie @derekwaynecarr
>
> **Contributor Author:** I wasn't sure about that; I remember a discussion on this topic quite some time ago, which is why I put it down here.



## Implementation
The Job controller is similar to the replication controller in that both manage pods.
This implies it will follow the same controller framework that replication
controllers have already defined. The biggest difference between the `Job` and
`ReplicationController` objects is their purpose: `ReplicationController`
ensures that a specified number of Pods are running at any one time, whereas
`Job` is responsible for driving the desired number of Pods to the completion of
a task. This is reflected in the `RestartPolicy`, which for a Job is required
to be either `RestartPolicyNever` or `RestartPolicyOnFailure`.
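
The restart-policy constraint can be expressed as a small validation helper. This is a sketch only; the `RestartPolicyAlways` constant and the exact string values are assumed from the existing pod API rather than defined in this proposal.

```go
package main

import "fmt"

// RestartPolicy mirrors the pod-level restart policy; the string values
// used here are assumptions, not part of this proposal.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// validateJobRestartPolicy enforces the constraint described above: a Job's
// pods must use Never or OnFailure, since Always would restart finished pods
// forever and the task could never run to completion.
func validateJobRestartPolicy(p RestartPolicy) error {
	switch p {
	case RestartPolicyNever, RestartPolicyOnFailure:
		return nil
	default:
		return fmt.Errorf("job pods require RestartPolicy Never or OnFailure, got %q", p)
	}
}

func main() {
	fmt.Println(validateJobRestartPolicy(RestartPolicyAlways)) // rejected with an error
}
```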


The new `Job` object will have the following content:

```go
// Job represents the configuration of a single job.
type Job struct {
TypeMeta
ObjectMeta

// Spec is a structure defining the expected behavior of a job.
Spec JobSpec

// Status is a structure describing current status of a job.
Status JobStatus
}

// JobList is a collection of jobs.
type JobList struct {
TypeMeta
ListMeta

Items []Job
}
```

The `JobSpec` structure contains all the information describing how the actual job
execution will look.

```go
// JobSpec describes how the job execution will look.
type JobSpec struct {

// TaskCount specifies the desired number of pods the job should be run with.
TaskCount int

// Optional duration in seconds, relative to the StartTime, that the job may be active
// before the system actively tries to terminate it; the value must be a positive integer.
ActiveDeadlineSeconds *int64

// Selector is a label query over pods that should match the pod count.
Selector map[string]string

// Spec is the object that describes the pod that will be created when
// executing a job.
Spec PodSpec
}
```
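
The constraints noted in the field comments, a task count of at least one and a strictly positive `ActiveDeadlineSeconds` when set, could be checked by a helper along these lines; the `validateJobSpec` name and the trimmed-down struct are hypothetical:

```go
package main

import "fmt"

// jobSpec carries only the numerically constrained fields of the proposal's
// JobSpec; the validation rules below are assumptions drawn from the field
// comments, not a confirmed implementation.
type jobSpec struct {
	TaskCount             int
	ActiveDeadlineSeconds *int64
}

func validateJobSpec(s jobSpec) error {
	if s.TaskCount < 1 {
		return fmt.Errorf("TaskCount must be at least 1, got %d", s.TaskCount)
	}
	if s.ActiveDeadlineSeconds != nil && *s.ActiveDeadlineSeconds <= 0 {
		return fmt.Errorf("ActiveDeadlineSeconds must be a positive integer, got %d", *s.ActiveDeadlineSeconds)
	}
	return nil
}

func main() {
	deadline := int64(600)
	fmt.Println(validateJobSpec(jobSpec{TaskCount: 3, ActiveDeadlineSeconds: &deadline})) // <nil>: valid
}
```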

The `JobStatus` structure contains information about the pods currently executing
the specified job.

```go
// JobStatus represents the current state of a Job.
type JobStatus struct {
// Executions holds detailed information about each of the pods running a job.
Executions []JobExec

// Completions is the number of pods that successfully completed their job.
Completions int
// Review thread:
// Contributor: Is this needed? I mean you can already just select all Pods
// that succeeded and see their status.
// Contributor Author: True, but this is a shortcut to do that.
}

// JobExec represents the current state of a single execution of a Job.
type JobExec struct {
// CreationTime represents time when the job execution was created
CreationTime util.Time

// StartTime represents time when the job execution was started
StartTime util.Time

// CompletionTime represents time when the job execution was completed
CompletionTime util.Time

// Phase represents the point in the job execution lifecycle.
Phase JobExecPhase

// Tag is added in labels of pod(s) created for this job execution. It allows
// job object to safely group/track all pods started for one given job execution.
Tag util.UID
// Review thread:
// Contributor: Is there 1 Tag per job or multiple?
// Contributor Author: One per job execution, which results in multiple per job.
// Contributor: When would you have multiple executions for a job?
// Contributor Author: I see this as similar to what we have with deployments
// or builds: a Job represents the definition (the intention), while a job
// execution is an actual run, hence the ability to have more than one. Each
// is executed on demand, similarly to builds.
}

// JobExecPhase represents job execution phase at given point in time.
type JobExecPhase string

// These are valid JobExec phases.
const (
// JobExecPending means the job execution has been accepted by the system
// but one or more pods have not been started.
JobExecPending JobExecPhase = "Pending"
// JobExecRunning means that all pods have been started.
JobExecRunning JobExecPhase = "Running"
// JobExecComplete means that all pods have terminated with an exit code of 0.
JobExecComplete JobExecPhase = "Complete"
// JobExecFailed means that all pods have terminated and at least one of
// them terminated with non-zero exit code.
JobExecFailed JobExecPhase = "Failed"
// Review thread:
// Contributor: Can I cancel the job? (In other words, my job is taking too
// long or is stuck...)
// Contributor Author: You can, and it will fall under the "Failed" category
// in that situation. That's similar to pods: if you kill one, there's no
// canceled state there.
)
```
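
The four phase definitions above can be read as a small decision procedure over the execution's pods. A sketch, assuming a hypothetical `podResult` summary of each pod:

```go
package main

import "fmt"

type JobExecPhase string

const (
	JobExecPending  JobExecPhase = "Pending"
	JobExecRunning  JobExecPhase = "Running"
	JobExecComplete JobExecPhase = "Complete"
	JobExecFailed   JobExecPhase = "Failed"
)

// podResult is a hypothetical summary of one pod in the execution.
type podResult struct {
	started    bool
	terminated bool
	exitCode   int
}

// phaseFor derives the execution phase per the rules above: Pending until
// every pod has started, Running while any pod is still active, Complete
// when all terminated with exit code 0, otherwise Failed.
func phaseFor(pods []podResult) JobExecPhase {
	allStarted, allTerminated, anyFailed := true, true, false
	for _, p := range pods {
		if !p.started {
			allStarted = false
		}
		if !p.terminated {
			allTerminated = false
		} else if p.exitCode != 0 {
			anyFailed = true
		}
	}
	switch {
	case !allStarted:
		return JobExecPending
	case !allTerminated:
		return JobExecRunning
	case anyFailed:
		return JobExecFailed
	default:
		return JobExecComplete
	}
}

func main() {
	fmt.Println(phaseFor([]podResult{{started: true, terminated: true, exitCode: 0}})) // Complete
}
```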

## Events
The Job controller will emit the following events:
* JobStart
* JobFinish

## Future evolution
Below are the possible future extensions to the Job controller:
* Be able to create a chain of jobs dependent on one another.

## Discussion points:
* triggers
* replacing build controller (others?)
docs/proposals/scheduledjob.md (113 additions, 0 deletions)
# ScheduledJob Controller

## Abstract
A proposal for implementing a new controller - ScheduledJob controller - which
will be responsible for managing time-based jobs, namely jobs that run:
* once at a specified point in time,
* repeatedly, starting at a specified point in time.

There is already an upstream discussion regarding that particular subject:
* Distributed CRON jobs [#2156](https://github.com/GoogleCloudPlatform/kubernetes/issues/2156)

There are also similar solutions available already:
* [Mesos Chronos](https://github.com/mesos/chronos)
* [Quartz](http://quartz-scheduler.org/)


## Use Cases
1. Be able to schedule a job execution at a given point in time.
1. Be able to create a repetitive job, e.g. database backups, sending emails.


## Motivation
ScheduledJobs are needed for performing all time-related actions, such as backups,
report generation, and the like. Each of these tasks should be allowed to run
repeatedly (once a day/month, etc.) or once at a given point in time.


## Implementation
ScheduledJob controller relies heavily on the [Job Controller API](https://github.com/openshift/origin/blob/master/docs/proposals/job.md)
for running actual jobs, on top of which it adds information regarding the date
and time part according to ISO8601 format.

The new `ScheduledJob` object will have the following content:

```go
// ScheduledJob represents the configuration of a single scheduled job.
type ScheduledJob struct {
TypeMeta
ObjectMeta

// Spec is a structure defining the expected behavior of a job, including the schedule.
Spec ScheduledJobSpec

// Status is a structure describing current status of a job.
Status ScheduledJobStatus
}

// ScheduledJobList is a collection of scheduled jobs.
type ScheduledJobList struct {
TypeMeta
ListMeta

Items []ScheduledJob
}
```

The `ScheduledJobSpec` structure contains all the information describing how the actual
job execution will look, including the `JobSpec` from the [Job Controller API](https://github.com/openshift/origin/blob/master/docs/proposals/job.md)
and the schedule in ISO 8601 format.

```go
// ScheduledJobSpec describes how the job execution will look and when it will actually run.
type ScheduledJobSpec struct {

// Spec is a structure defining the expected behavior of a job.
Spec JobSpec
// Review thread:
// Contributor: Can this be just an ObjectReference for the Job?
// Contributor Author: You mean you'd like to have the option to create an
// on-demand job and an additional copy with a schedule for it? Yeah, that
// makes sense.
// Contributor: I definitely get why you're calling this Spec, but a sample
// file for this will look a bit weird:
//
//   {
//     "apiVersion": "v1",
//     "type": "ScheduledJob",
//     "metadata": {
//       "name": "myjob"
//     },
//     "spec": {
//       "spec": {
//         "podCount": 3,
//         "..."
//       },
//       "schedule": "* * * * * *"
//     }
//   }
//
// Contributor Author: That I haven't thought of. On the other hand, I can't
// think of anything more straightforward.

// Schedule contains the schedule in ISO8601 format, eg.
// - 2015-07-21T14:00:00Z - represents date and time in UTC
// - R/2015-07-21T14:00:00Z/P1D - represents endlessly repeating interval (1 day), starting from given date
Schedule string

// SkipOutdated specifies that when AllowConcurrent is false, only the newest
// pending job will be started, ignoring executions that missed their schedule (default: true).
SkipOutdated bool
// Review thread:
// Contributor: Why do we need this? I would like to see something like:
//   1. If another Job is started while the old Job is still running, I want
//      to be able to 'cancel' the current job and replace it with the new one.
//   2. I want to be able to start concurrent jobs.
// So how about a StartPolicy with values AllowConcurrent and CancelExisting?
// Also, I don't see a way to time out the job. Can I say that this Job should
// run only for 2 minutes? This might be solvable by CancelExisting.
// Contributor Author: Timeouts are handled at the Job level; I didn't put
// that field here, but it is mentioned in the use cases.
// Contributor Author: As for why we need this: assume that for some reason
// OpenShift was down, or at least the ScheduledJob controller missed a couple
// of executions. I don't want them to be rerun just because they are
// outdated; I know the next execution will still deal with the outdated data.
// A StartPolicy would make sense and would make the API more readable.
// Contributor Author: Thinking about it more, I think I'll keep the current
// version, because it allows more flexibility than a single policy, with
// which we would need multiple combinations of policies to achieve the same
// result.

// BlockOnFailure suspends scheduling of next job runs after a failed one.
BlockOnFailure bool
// Review thread:
// Contributor: I don't think this is needed. There should be something like
// a FailureCount that keeps increasing while a job is failing, similar to
// the container retry count. Once a job succeeds, the counter is reset.

// AllowConcurrent specifies whether concurrent jobs may be started.
AllowConcurrent bool
}
```

The `ScheduledJobStatus` structure contains information about scheduled
job executions (up to a limited amount). The structure holds objects in four lists:
* the "Pending" list is used to queue executions for a job when concurrent runs
are prevented, and also to trigger an execution for a job defined without time-based scheduling;
* the "Running" list contains all currently running jobs with detailed information;
* the "Completed" list tracks recently completed jobs;
* the "Failed" list contains information about recently failed jobs.

```go
// ScheduledJobStatus represents the current state of a ScheduledJob.
type ScheduledJobStatus struct {
// PendingExecutions are the job runs pending for execution
PendingExecutions []JobStatus

// RunningExecutions are the job runs currently executing
RunningExecutions []JobStatus

// CompletedExecutions tracks previously scheduled jobs (up to a limited amount)
CompletedExecutions []JobStatus

// FailedExecutions tracks previously failed jobs
FailedExecutions []JobStatus

// CompletedCount tracks the number of successful executions of this job
CompletedCount int

// FailedCount tracks the number of failures for this job
FailedCount int
}
```
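
How `AllowConcurrent` and `SkipOutdated` might combine when a scheduled time fires can be sketched as a decision helper; this is an assumption drawn from the field comments, not behavior the proposal specifies:

```go
package main

import "fmt"

// startDecision says what the controller could do when a schedule fires:
// start immediately when concurrency is allowed or nothing is running,
// drop the outdated run when SkipOutdated is set, otherwise queue it on
// the Pending list behind the running execution.
func startDecision(allowConcurrent, skipOutdated, jobRunning bool) string {
	switch {
	case allowConcurrent || !jobRunning:
		return "start"
	case skipOutdated:
		return "skip"
	default:
		return "queue"
	}
}

func main() {
	fmt.Println(startDecision(false, true, true)) // skip: a run is active and outdated runs are dropped
}
```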