Replies: 3 comments 3 replies
-
Is it a container being restarted? In that case the pod spec (and the selected node) doesn't change, so the pod won't be moved to any other node, as it has already reserved its resources on this node. There is no "pod restart"; however, a pod can be recreated (if it was removed or evicted). In that case the scheduler selects a new node for the pod; see https://docs.okd.io/4.13/nodes/scheduling/nodes-scheduler-about.html for the factors it takes into account. If you want to "pin" a pod to a particular node, it needs to have nodeSelector set; otherwise the scheduler will assume that recreating the pod on a different node is a valid option.
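For illustration, a minimal sketch of such a pinned pod; the pod name, node hostname, and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod            # placeholder name
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-01   # label the kubelet sets on every node
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
```

With this selector set, the scheduler will only ever place (or re-place) the pod on the matching node.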
-
Hi,
Yes, the container gets restarted because of OOM, but the pod stays on the same node and isn't recreated on a different node. I think this behavior changed. Is this true?
Thanks for the clarifications. I should have put quotes around "restarted"; I meant recreated, sorry. The scheduler is just fine then, and thanks for the docs, assuming the default is "LowNodeUtilization".
No, that is not what I want. If a container is restarted (for whatever reason), the whole pod should be recreated. Regards, Philipp
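For completeness, one pattern that yields pod-level recreation instead of in-place container restarts is a Job with restartPolicy: Never: when a container fails, the pod is marked Failed and the Job controller creates and schedules a replacement pod. This only fits run-to-completion workloads, since Deployments require restartPolicy: Always. A minimal sketch with placeholder names:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: recreate-on-failure    # placeholder name
spec:
  backoffLimit: 3              # allow up to 3 replacement pods before giving up
  template:
    spec:
      restartPolicy: Never     # a failed container fails the pod; the Job spawns a new one
      containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder image
```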
-
Hi, investigating a little further I found out that kubelet normally evicts a pod if it uses too much memory:

```yaml
conditions:
  - type: DisruptionTarget
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2023-12-12T15:00:44Z'
    reason: TerminationByKubelet
    message: >-
      The node was low on resource: memory. Threshold quantity: 100Mi,
      available: 74416Ki. Container TESTPOD was using 25208660Ki, request is 14G,
      has larger consumption of memory.
```

But this doesn't happen to the pods that are causing us trouble: the OS OOM killer is killing the containers, not kubelet. The behavior above is what I expect, because once the OS OOM killer starts to act, I think it's too late. Is there a way to configure kubelet to be more restrictive (or faster than the OS OOM killer), so that pods get evicted rather than containers OOM-killed? The goal would be for kubelet to do the memory "management" instead of the OS. Any thoughts on this? Regards, Philipp
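One knob for making kubelet act earlier than the OS OOM killer is its eviction thresholds, which on OKD/OpenShift are set through a KubeletConfig CR bound to a machine config pool. A minimal sketch; the name and threshold values below are illustrative assumptions, not recommendations:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-memory-eviction   # placeholder name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""   # target the worker pool
  kubeletConfig:
    evictionSoft:
      memory.available: "1Gi"       # start evicting well above the 100Mi hard default
    evictionSoftGracePeriod:
      memory.available: "1m30s"     # each soft threshold needs a grace period
    evictionHard:
      memory.available: "500Mi"
    evictionPressureTransitionPeriod: "30s"
```

Note that node-level eviction only helps against node memory pressure; a container that exceeds its own memory limit is still OOM-killed by the kernel's cgroup controller, which kubelet cannot preempt.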
-
Hi,
this is just a question, not really a bug report.
We are running "4.13.0-0.okd-2023-09-30-084937". With this release I have seen that pods are no longer restarted and relocated to a "better" node with "more" resources; only the containers in the pod are restarted, and the pod stays on the same worker node.
Is this expected behavior? Can we somehow "activate" the old behavior? Or can you point us to the relevant documentation, please?
Regards, Philipp