
Commit f32bcaf

Move node shutdown section to cluster-administration

* Update doc title
* Rename file/update link
* Update api_metadata
* Update what's next
* Remove api_metadata, add tooltip

1 parent f3145e2 commit f32bcaf

File tree

2 files changed: +275 −254 lines changed

content/en/docs/concepts/architecture/nodes.md

Lines changed: 1 addition & 254 deletions
@@ -291,260 +291,6 @@ the kubelet can use topology hints when making resource assignment decisions.

See [Control Topology Management Policies on a Node](/docs/tasks/administer-cluster/topology-manager/)
for more information.
## Graceful node shutdown {#graceful-node-shutdown}

{{< feature-state feature_gate_name="GracefulNodeShutdown" >}}

The kubelet attempts to detect node system shutdown and terminates pods running on the node.
The kubelet ensures that pods follow the normal
[pod termination process](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
during the node shutdown. During node shutdown, the kubelet does not accept new
Pods (even if those Pods are already bound to the node).

The graceful node shutdown feature depends on systemd, since it takes advantage of
[systemd inhibitor locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit/) to
delay the node shutdown for a given duration.

Graceful node shutdown is controlled with the `GracefulNodeShutdown`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/), which is
enabled by default since 1.21.

Note that by default, both configuration options described below,
`shutdownGracePeriod` and `shutdownGracePeriodCriticalPods`, are set to zero,
so the graceful node shutdown functionality is not active.
To activate the feature, set both kubelet configuration settings to non-zero values.
Once systemd detects or is notified of a node shutdown, the kubelet sets a `NotReady` condition on
the Node, with the `reason` set to `"node is shutting down"`. The kube-scheduler honors this condition
and does not schedule any Pods onto the affected node; third-party schedulers are
expected to follow the same logic. This means that no new Pods are scheduled onto that node
and therefore none will start.

The kubelet **also** rejects Pods during the `PodAdmission` phase if an ongoing
node shutdown has been detected, so that even Pods with a
{{< glossary_tooltip text="toleration" term_id="toleration" >}} for
`node.kubernetes.io/not-ready:NoSchedule` do not start there.
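For illustration, such a toleration is the standard toleration syntax for the taint named above; this fragment is shown here only to make the admission behavior concrete:

```yaml
# Fragment of a Pod spec carrying the toleration mentioned above.
# Even a Pod with this toleration is rejected at PodAdmission time
# while the node is shutting down.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoSchedule
```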
At the same time as the kubelet sets that condition on its Node via the API,
it also begins terminating any Pods that are running locally.
During a graceful shutdown, kubelet terminates pods in two phases:

1. Terminate regular pods running on the node.
2. Terminate [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
   running on the node.
The graceful node shutdown feature is configured with two
[`KubeletConfiguration`](/docs/tasks/administer-cluster/kubelet-config-file/) options:

* `shutdownGracePeriod`:
  * Specifies the total duration that the node should delay the shutdown by. This is the total
    grace period for pod termination for both regular and
    [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).
* `shutdownGracePeriodCriticalPods`:
  * Specifies the duration used to terminate
    [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
    during a node shutdown. This value should be less than `shutdownGracePeriod`.
{{< note >}}

There are cases when a node shutdown is cancelled by the system (or perhaps manually
by an administrator). In either of those situations the Node returns to the `Ready` state.
However, Pods that have already started the termination process are not restored by the kubelet
and need to be re-scheduled.

{{< /note >}}
For example, if `shutdownGracePeriod=30s`, and
`shutdownGracePeriodCriticalPods=10s`, kubelet will delay the node shutdown by
30 seconds. During the shutdown, the first 20 (30 − 10) seconds would be reserved
for gracefully terminating normal pods, and the last 10 seconds would be
reserved for terminating [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).
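As a concrete sketch, that example corresponds to a kubelet configuration file along these lines (the two fields are the options described above; the `apiVersion`/`kind` header is the usual kubelet configuration file boilerplate):

```yaml
# Illustrative kubelet configuration for the example above:
# a 30-second total shutdown delay, with the last 10 seconds
# reserved for critical pods.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: "30s"
shutdownGracePeriodCriticalPods: "10s"
```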
{{< note >}}
Pods evicted during a graceful node shutdown are marked as shut down.
Running `kubectl get pods` shows the status of the evicted pods as `Terminated`.
And `kubectl describe pod` indicates that the pod was evicted because of node shutdown:

```
Reason:  Terminated
Message: Pod was terminated in response to imminent node shutdown.
```

{{< /note >}}
### Pod Priority based graceful node shutdown {#pod-priority-graceful-node-shutdown}

{{< feature-state feature_gate_name="GracefulNodeShutdownBasedOnPodPriority" >}}

To provide more flexibility around the ordering of pods during graceful node
shutdown, graceful node shutdown honors the PriorityClass for
Pods, provided that you enabled this feature in your cluster. The feature
allows cluster administrators to explicitly define the ordering of pods
during graceful node shutdown based on
[priority classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass).
The [Graceful Node Shutdown](#graceful-node-shutdown) feature, as described
above, shuts down pods in two phases: non-critical pods, followed by critical
pods. If additional flexibility is needed to explicitly define the ordering of
pods during shutdown in a more granular way, pod priority based graceful
shutdown can be used.

When graceful node shutdown honors pod priorities, this makes it possible to do
graceful node shutdown in multiple phases, each phase shutting down a
particular priority class of pods. The kubelet can be configured with the exact
phases and the shutdown time per phase.
Assuming the following custom pod
[priority classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass)
in a cluster,

| Pod priority class name | Pod priority class value |
|-------------------------|--------------------------|
| `custom-class-a`        | 100000                   |
| `custom-class-b`        | 10000                    |
| `custom-class-c`        | 1000                     |
| `regular/unset`         | 0                        |
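(For reference, a class such as `custom-class-a` would be created with a standard PriorityClass object; the following minimal sketch uses the name and value from the table above, while the `description` text is purely illustrative.)

```yaml
# Hypothetical PriorityClass matching the first row of the table above.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: custom-class-a
value: 100000
globalDefault: false
description: "Illustrative class used to order pods during node shutdown."
```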
Within the [kubelet configuration](/docs/reference/config-api/kubelet-config.v1beta1/)
the settings for `shutdownGracePeriodByPodPriority` could look like:

| Pod priority class value | Shutdown period |
|--------------------------|-----------------|
| 100000                   | 10 seconds      |
| 10000                    | 180 seconds     |
| 1000                     | 120 seconds     |
| 0                        | 60 seconds      |
The corresponding kubelet configuration YAML would be:

```yaml
shutdownGracePeriodByPodPriority:
- priority: 100000
  shutdownGracePeriodSeconds: 10
- priority: 10000
  shutdownGracePeriodSeconds: 180
- priority: 1000
  shutdownGracePeriodSeconds: 120
- priority: 0
  shutdownGracePeriodSeconds: 60
```
The above table implies that any pod with `priority` value >= 100000 gets
just 10 seconds to stop, any pod with value >= 10000 and < 100000 gets 180
seconds to stop, and any pod with value >= 1000 and < 10000 gets 120 seconds to stop.
Finally, all other pods get 60 seconds to stop.

One doesn't have to specify values corresponding to all of the classes. For
example, you could instead use these settings:
| Pod priority class value | Shutdown period |
|--------------------------|-----------------|
| 100000                   | 300 seconds     |
| 1000                     | 120 seconds     |
| 0                        | 60 seconds      |

In the above case, the pods with `custom-class-b` will go into the same bucket
as `custom-class-c` for shutdown.
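Expressed as kubelet configuration, that sparser table would correspond to a sketch like:

```yaml
# Sketch of the sparser configuration from the second table above;
# priority 10000 (custom-class-b) falls into the >= 1000 bucket.
shutdownGracePeriodByPodPriority:
- priority: 100000
  shutdownGracePeriodSeconds: 300
- priority: 1000
  shutdownGracePeriodSeconds: 120
- priority: 0
  shutdownGracePeriodSeconds: 60
```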
If there are no pods in a particular range, then the kubelet does not wait
for pods in that priority range. Instead, the kubelet immediately skips to the
next priority class value range.

If this feature is enabled and no configuration is provided, then no ordering
action will be taken.

Using this feature requires enabling the `GracefulNodeShutdownBasedOnPodPriority`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/),
and setting `ShutdownGracePeriodByPodPriority` in the
[kubelet config](/docs/reference/config-api/kubelet-config.v1beta1/)
to the desired configuration containing the pod priority class values and
their respective shutdown periods.
{{< note >}}
The ability to take Pod priority into account during graceful node shutdown was introduced
as an Alpha feature in Kubernetes v1.23. In Kubernetes {{< skew currentVersion >}}
the feature is Beta and is enabled by default.
{{< /note >}}

The metrics `graceful_shutdown_start_time_seconds` and `graceful_shutdown_end_time_seconds`
are emitted under the kubelet subsystem and can be used to monitor node shutdowns.
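If you scrape the kubelet with Prometheus, a recording rule along the following lines could derive the shutdown duration; only the two metric names come from the text above, and everything else in this sketch (the rule file layout, group and record names, and the subtraction) is an assumption:

```yaml
# Hypothetical Prometheus recording rule built on the kubelet metrics
# named above; the group and record names are illustrative.
groups:
- name: node-shutdown
  rules:
  - record: node:graceful_shutdown_duration_seconds
    expr: graceful_shutdown_end_time_seconds - graceful_shutdown_start_time_seconds
```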
## Non-graceful node shutdown handling {#non-graceful-node-shutdown}

{{< feature-state feature_gate_name="NodeOutOfServiceVolumeDetach" >}}

A node shutdown action may not be detected by kubelet's Node Shutdown Manager,
either because the command does not trigger the inhibitor locks mechanism used by
the kubelet, or because of a user error, for example if `shutdownGracePeriod` and
`shutdownGracePeriodCriticalPods` are not configured properly. Refer to the
[Graceful Node Shutdown](#graceful-node-shutdown) section above for more details.
When a node is shut down but not detected by kubelet's Node Shutdown Manager, the pods
that are part of a {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
will be stuck in terminating status on the shutdown node and cannot move to a new running node.
This is because the kubelet on the shutdown node is not available to delete the pods, so
the StatefulSet cannot create a new pod with the same name. If there are volumes used by the pods,
the VolumeAttachments will not be deleted from the original shutdown node, so the volumes
used by these pods cannot be attached to a new running node. As a result, the
application running on the StatefulSet cannot function properly. If the original
shutdown node comes up, the pods will be deleted by the kubelet and new pods will be
created on a different running node. If the original shutdown node does not come up,
these pods will be stuck in terminating status on the shutdown node forever.
To mitigate the above situation, a user can manually add the taint `node.kubernetes.io/out-of-service`
with either the `NoExecute` or `NoSchedule` effect to a Node, marking it out-of-service.
If the `NodeOutOfServiceVolumeDetach` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
is enabled on the {{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager" >}},
and a Node is marked out-of-service with this taint, the pods on the node will be forcefully deleted
if there are no matching tolerations on them, and volume detach operations for the pods terminating on
the node will happen immediately. This allows the Pods on the out-of-service node to recover quickly
on a different node.
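For illustration, marking the Node out-of-service amounts to adding a taint such as the following to the Node's spec; the key and effect come from the text above, while the `value` shown here is arbitrary:

```yaml
# Hypothetical fragment of a Node manifest with the out-of-service
# taint described above; the value is illustrative and can be anything.
spec:
  taints:
  - key: node.kubernetes.io/out-of-service
    value: nodeshutdown
    effect: NoExecute
```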
During a non-graceful shutdown, Pods are terminated in two phases:

1. Force delete the Pods that do not have matching `out-of-service` tolerations.
2. Immediately perform detach volume operations for such pods.
{{< note >}}
- Before adding the taint `node.kubernetes.io/out-of-service`, verify
  that the node is already in a shutdown or powered-off state (not in the middle of restarting).
- The user is required to manually remove the out-of-service taint after the pods have
  moved to a new node and the user has checked that the shutdown node has
  recovered, since the user was the one who originally added the taint.
{{< /note >}}
### Forced storage detach on timeout {#storage-force-detach-on-timeout}

In any situation where a pod deletion has not succeeded within 6 minutes, Kubernetes will
force detach volumes being unmounted if the node is unhealthy at that instant. Any
workload still running on the node that uses a force-detached volume will cause a
violation of the
[CSI specification](https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerunpublishvolume),
which states that `ControllerUnpublishVolume` "**must** be called after all
`NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed".
In such circumstances, volumes on the node in question might encounter data corruption.
The forced storage detach behaviour is optional; users might opt to use the "Non-graceful
node shutdown" feature instead.

Force storage detach on timeout can be disabled by setting the `disable-force-detach-on-timeout`
config field in `kube-controller-manager`. Disabling the force detach on timeout feature means
that a volume that is hosted on a node that is unhealthy for more than 6 minutes will not have
its associated
[VolumeAttachment](/docs/reference/kubernetes-api/config-and-storage-resources/volume-attachment-v1/)
deleted.
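As an illustrative sketch, on clusters where kube-controller-manager runs as a static Pod (as with kubeadm), this would be set via the corresponding command-line flag in its manifest; the manifest path and layout here are assumptions:

```yaml
# Hypothetical fragment of a kube-controller-manager static Pod manifest,
# e.g. /etc/kubernetes/manifests/kube-controller-manager.yaml on a
# kubeadm cluster, disabling force detach on timeout.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --disable-force-detach-on-timeout=true
```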
After this setting has been applied, unhealthy pods still attached to volumes must be recovered
via the [Non-Graceful Node Shutdown](#non-graceful-node-shutdown) procedure mentioned above.
{{< note >}}
- Caution must be taken while using the [Non-Graceful Node Shutdown](#non-graceful-node-shutdown) procedure.
- Deviation from the steps documented above can result in data corruption.
{{< /note >}}
## Swap memory management {#swap-memory}

{{< feature-state feature_gate_name="NodeSwap" >}}
@@ -610,6 +356,7 @@ Learn more about the following:

* [API definition for Node](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#node-v1-core).
* [Node](https://git.k8s.io/design-proposals-archive/architecture/architecture.md#the-kubernetes-node)
  section of the architecture design document.
* [Graceful/non-graceful node shutdown](/docs/concepts/cluster-administration/node-shutdown/). *(added in this commit)*
* [Cluster autoscaling](/docs/concepts/cluster-administration/cluster-autoscaling/) to
  manage the number and size of nodes in your cluster.
* [Taints and Tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/).
