
Release 1.22 camille rebase #32

Closed

xinyuche wants to merge 438 commits into release-1.22.9-lyft.1 from release-1.22-camille-rebase

Conversation


@xinyuche xinyuche commented Jun 29, 2022

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


pick 3efe7de sidecar: container ordered start/shutdown support
pick 4c01153 sidecar: kubelet: don't bother killing pods when non-sidecars are done
pick 6c8c3b4 sidecar: glog -> klog
pick cc1cb6f Allow metrics to be retrieved from CRI with CRI-O
pick 137a81e pkg/kubelet: try restoring the container spec if its nil
pick 8251b5f pkg/kubelet: fix 1.14 compat for container restore error text
pick 069517a pkg/kubelet: fix uint64 overflow when elapsed UsageCoreNanoSeconds exceeds 18446744073
pick 6b95d62 do not consider exited containers when calculating nanocores
pick 1ba9ec7 handle case where cpuacct is reset to 0 in a live container # empty
pick bd36566 remove unnecessary line
drop 428e48e Check next cron schedules in a binary-search fashion
drop 709ab5b wrap table driven tests in t.Run to allow running individual tests (#17)
drop 32a8844 fix missed starting deadline warning never being hit
drop 05e3010 create cronjob controller metrics
drop ac7b61d add Job scheduled start time annotation
drop 5d4d00a make cronjobController sync period configurable via flag
pick 03bd351 disable klog for cadvisor.GetDirFsInfo cache miss
drop 717d679 cronjob: handle invalid/unschedulable dates
drop fe822af legacy-cloud-providers/aws: add gp3 pvc support (#28)

benluddy and others added 30 commits October 21, 2021 13:32
During volume detach, the following might happen in the reconciler:

1. Pod is deleting
2. Volume is removed from reportedAsAttached, so the node status updater
   will update the volumesAttached list
3. Detach fails due to some issue
4. Volume is added back to reportedAsAttached
5. Reconciler loops over the volume again and removes it from
   reportedAsAttached
6. Detach is not triggered because of exponential backoff; the detach call
   fails with an exponential backoff error
7. Another pod using the same volume on the same node is added
8. Reconciler loops again and will NOT try to trigger detach anymore

At this point, the volume is still attached and in the actual state, but
the volumesAttached list in the node status no longer has this volume, and
this will block volume mounts from kubelet.

The first-round fix was to add the volume back into the list of volumes to
be reported as attached at step 6, when the detach call fails with an
exponential backoff error. However, this might cause performance issues if
detach keeps failing for a while: during that time the volume would be
repeatedly removed from and added back to the node status, causing a surge
of API calls.

So we changed the logic to first check whether the operation is safe to
retry, meaning there is no pending operation and it is not within the
exponential backoff period, before calling detach. This way we avoid
repeatedly removing/adding the volume in the node status.

Change-Id: I5d4e760c880d72937d34b9d3e904ecad125f802e
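
A minimal sketch of the reordered reconciler logic described above. The types and names here (volume, reconciler, safeToRetry) are illustrative stand-ins, not the real attach/detach controller API:

```go
package main

import "fmt"

type volume struct{ name, node string }

type reconciler struct {
	desired  map[volume]bool // volumes that should stay attached
	pending  map[volume]bool // volumes with an in-flight operation
	backoff  map[volume]bool // volumes still in exponential backoff
	reported map[volume]bool // reportedAsAttached (mirrors node status)
}

// safeToRetry mirrors the check added by the fix: a detach may only be
// attempted when no operation is pending and the backoff window has passed.
func (r *reconciler) safeToRetry(v volume) bool {
	return !r.pending[v] && !r.backoff[v]
}

func (r *reconciler) reconcile(attached []volume) {
	for _, v := range attached {
		if r.desired[v] {
			continue // still desired on this node; nothing to do
		}
		// Check FIRST whether detach can actually run...
		if !r.safeToRetry(v) {
			continue // keep it in reportedAsAttached; retry next loop
		}
		// ...and only then remove it from reportedAsAttached and trigger
		// detach, so a backed-off detach never strands the node status.
		delete(r.reported, v)
		fmt.Printf("detaching %s from %s\n", v.name, v.node)
	}
}

func main() {
	v := volume{"vol-1", "node-a"}
	r := &reconciler{
		desired:  map[volume]bool{},
		pending:  map[volume]bool{},
		backoff:  map[volume]bool{v: true}, // detach recently failed
		reported: map[volume]bool{v: true},
	}
	r.reconcile([]volume{v})                      // skipped: still in backoff
	fmt.Println("still reported:", r.reported[v]) // true
}
```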
… fixes

Signed-off-by: Carlos Panato <ctadeu@gmail.com>
…ck-of-#105734-upstream-release-1.22

Automated cherry pick of kubernetes#105734: Fix race condition in logging when request times out
…ick-of-#105511-upstream-release-1.22

Automated cherry pick of kubernetes#105511: Free APF seats for watches handled by an aggregated
…leged-storage-client

Cherry pick of kubernetes#104551: Run storage hostpath e2e test client pod as privileged
…pick-of-#105755-upstream-release-1.22

Automated cherry pick of kubernetes#105755: Support cgroupv2 in node problem detector test
…ick-of-#105997-release-1.22

Automated cherry pick of kubernetes#105997: Fixing how EndpointSlice Mirroring handles Service selector
…-pick-of-#105673-upstream-release-1.22

Automated cherry pick of kubernetes#105673: support more than 100 disk mounts on Windows
…ick-of-#105946-upstream-release-1.22

Automated cherry pick of kubernetes#105946: Remove nodes with Cluster Autoscaler taint from LB backends.
Update debian, debian-iptables, setcap images to pick up CVE fixes
… logging (kubernetes#105137)

* added keys for structured logging

* used KObj

Co-authored-by: Shivanshu Raj Shrivastava <shivanshu1333@gmail.com>
Signed-off-by: Carlos Panato <ctadeu@gmail.com>
The logic to detect stale endpoints was not taking endpoint readiness
into account.

We can have stale entries on UDP services for 2 reasons:
- an endpoint was receiving traffic and is removed or replaced
- a service was receiving traffic but not forwarding it, and starts
to forward it.

Add an e2e test to cover the regression
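
A toy model of the readiness-aware stale-endpoint detection described above; the real kube-proxy types and conntrack plumbing are more involved:

```go
package main

import "fmt"

type endpoint struct {
	ip    string
	ready bool
}

// staleEndpoints returns the endpoint IPs whose conntrack entries should be
// cleared after an update (ready endpoints that disappeared), and reports
// whether the service went from "no ready endpoints" to "some ready
// endpoint" (the previously blackholed UDP traffic case).
func staleEndpoints(old, curr []endpoint) (stale []string, serviceGainedReady bool) {
	currReady := map[string]bool{}
	anyOldReady, anyCurrReady := false, false
	for _, e := range curr {
		if e.ready {
			currReady[e.ip] = true
			anyCurrReady = true
		}
	}
	for _, e := range old {
		if !e.ready {
			continue // the fix: a non-ready endpoint never counts as stale
		}
		anyOldReady = true
		if !currReady[e.ip] {
			stale = append(stale, e.ip)
		}
	}
	return stale, !anyOldReady && anyCurrReady
}

func main() {
	old := []endpoint{{"10.0.0.1", true}, {"10.0.0.2", false}}
	curr := []endpoint{{"10.0.0.3", true}}
	fmt.Println(staleEndpoints(old, curr)) // [10.0.0.1] false
}
```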
Bump kube-openapi against kube-openapi/release-1.22 branch

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
The commit a8b8995
changed the content of the data kubelet writes in the checkpoint.
Unfortunately, the checkpoint restore code was not updated,
so if we upgrade kubelet from pre-1.20 to 1.20+, the
device manager can no longer restore its state correctly.

The only trace of this misbehaviour is this line in the
kubelet logs:
```
W0615 07:31:49.744770    4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA
```

If we hit this bug, the device allocation info is
indeed NOT up-to-date until the device plugins register
themselves again. This can take up to a few minutes, depending
on the specific device plugin.

While the device manager state is inconsistent:
1. the kubelet will NOT update the device availability to zero, so
   the scheduler will send pods towards the inconsistent kubelet.
2. at pod admission time, the device manager allocation will not
   trigger, so pods will be admitted without devices actually
   being allocated to them.

To fix these issues, we add support to the device manager to
read pre-1.20 checkpoint data. We retroactively call this
format "v1".

Signed-off-by: Francesco Romani <fromani@redhat.com>
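
A sketch of the two-format fallback described above, assuming simplified stand-in types (entryV1, entryV2, restoreEntry are illustrative, and the choice of -1 as the "unknown NUMA node" key is arbitrary here):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DevicesPerNUMA mirrors the post-1.20 checkpoint layout: device IDs keyed
// by NUMA node. Pre-1.20 ("v1") checkpoints stored a flat []string instead.
type DevicesPerNUMA map[int64][]string

type entryV2 struct {
	DeviceIDs DevicesPerNUMA `json:"DeviceIDs"`
}

type entryV1 struct {
	DeviceIDs []string `json:"DeviceIDs"`
}

// restoreEntry tries the current format first and, if that fails with the
// unmarshal error quoted in the kubelet log above, reinterprets the data as
// a v1 checkpoint, folding the flat ID list under an unknown NUMA node.
func restoreEntry(data []byte) (DevicesPerNUMA, error) {
	var v2 entryV2
	if err := json.Unmarshal(data, &v2); err == nil {
		return v2.DeviceIDs, nil
	}
	var v1 entryV1
	if err := json.Unmarshal(data, &v1); err != nil {
		return nil, fmt.Errorf("checkpoint matches neither format: %w", err)
	}
	return DevicesPerNUMA{-1: v1.DeviceIDs}, nil
}

func main() {
	old := []byte(`{"DeviceIDs":["gpu-0","gpu-1"]}`)
	devs, err := restoreEntry(old)
	fmt.Println(devs, err) // map[-1:[gpu-0 gpu-1]] <nil>
}
```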
smarterclayton and others added 29 commits March 16, 2022 12:26
Other components must know when the Kubelet has released critical
resources for terminal pods. Do not set the phase in the apiserver
to terminal until all containers are stopped and cannot restart.

As a consequence of this change, the Kubelet must explicitly transition
a terminal pod to the terminating state in the pod worker which is
handled by returning a new isTerminal boolean from syncPod.

Finally, if a pod with init containers hasn't been initialized yet,
don't default container statuses or not yet attempted init containers
to the unknown failure state.
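
A toy sketch of the flow described above; real kubelet signatures differ, but the shape of the change is that syncPod returns an explicit isTerminal flag and the pod worker acts on it rather than inferring terminality elsewhere:

```go
package main

import "fmt"

type pod struct {
	name                 string
	allContainersStopped bool // stopped and unable to restart
}

// syncPod reports the pod as terminal only once every container is stopped
// and cannot restart; only then may a terminal phase be written to the
// apiserver, signalling that critical resources have been released.
func syncPod(p pod) (isTerminal bool) {
	return p.allContainersStopped
}

func podWorker(p pod) {
	if syncPod(p) {
		fmt.Println(p.name, "-> terminating; terminal phase may now be written")
		return
	}
	fmt.Println(p.name, "not terminal yet; keep phase non-terminal")
}

func main() {
	podWorker(pod{name: "job-pod", allContainersStopped: false})
	podWorker(pod{name: "job-pod", allContainersStopped: true})
}
```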
Exploring termination revealed we have race conditions in certain
parts of pod initialization and termination. To better catch these
issues refactor the existing test so it can be reused, and then test
a number of alternate scenarios.
Create an E2E test that creates a job that spawns a pod that should
succeed. The job reserves a fixed amount of CPU and has a large number
of completions and parallelism. Used to repro github.com/kubernetes/issues/106884

Signed-off-by: David Porter <david@porter.me>
…er And fix test to generate UUID without dash
Signed-off-by: David Porter <david@porter.me>
…pick-of-#108366-upstream-release-1.22

Automated cherry pick of kubernetes#108366 (release-1.22): Delay writing a terminal phase until the pod is terminated
…-secret-manager

[release-1.22] Move kubelet secret and configmap manager calls to sync_Pod functions
…-pick-of-#107764-upstream-release-1.22

Automated cherry pick of kubernetes#107764: wrap error from RunCordonOrUncordon
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
…of-#108928-upstream-release-1.22

Automated cherry pick of kubernetes#108928: kube-up: use registry.k8s.io for containerd-related jobs
…k-of-#108455-upstream-release-1.22

Automated cherry pick of kubernetes#108455: Copy request in timeout handler
…erry-pick-of-#104039-upstream-release-1.22

Automated cherry pick of kubernetes#104039 upstream release 1.22
Change-Id: Iacb8530769e7a93e3bc8384cf51d7a8fd9a192e1
…erry-pick-of-#109245-upstream-release-1.22

Automated cherry pick of kubernetes#109245: Fix: abort nominating a pod that was already scheduled to a
This change turns off the ability to completely kill pods when the
non-sidecars are done. That behavior is useful for cronjobs, where the
non-sidecars finish their work and exit; this code previously would clean
up the pod and its resources.

This feature was pulled in from kubernetes#75099.

This is a feature that sounds nice in practice, but it's not what we
need. It seems to be a bit buggy, since the Pod sandbox can
potentially be deleted and recreated during the lifetime of the
Pod. That ain't good.
CRI-O properly implements the CRI interface, and is therefore
capable of returning container stats when asked for them.

There is no reason to keep CRI-O as a special case that has to
be run in legacy mode, making kubelet use cadvisor for each
container.

This patch removes the hardcoded assumption that CRI-O cannot
return container stats through CRI.

Fixes kubernetes#73750

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We're not guaranteed that the pod passed in has the ContainerSpec
we're looking for. With this, we check if the pod has the container
spec, and if it doesn't, we try to recover it one more time.
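
A sketch of that lookup-then-restore behavior, with illustrative names (podSpec, specOf, and the restore callback are stand-ins, not the actual kubelet helpers):

```go
package main

import "fmt"

type containerSpec struct{ name string }

type podSpec struct {
	containers map[string]containerSpec
}

// specOf returns the container's spec from the pod if present; otherwise it
// tries one more time to recover it from a secondary source (here a stand-in
// restore function, e.g. rebuilding the spec from runtime labels).
func specOf(p *podSpec, name string, restore func(string) (containerSpec, bool)) (containerSpec, error) {
	if spec, ok := p.containers[name]; ok {
		return spec, nil
	}
	if spec, ok := restore(name); ok {
		return spec, nil
	}
	return containerSpec{}, fmt.Errorf("no spec for container %q", name)
}

func main() {
	p := &podSpec{containers: map[string]containerSpec{}}
	spec, err := specOf(p, "app", func(name string) (containerSpec, bool) {
		return containerSpec{name: name}, true // pretend the restore succeeded
	})
	fmt.Println(spec, err)
}
```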
@xinyuche xinyuche closed this Jun 29, 2022