@@ -291,260 +291,6 @@ the kubelet can use topology hints when making resource assignment decisions.
See [Control Topology Management Policies on a Node](/docs/tasks/administer-cluster/topology-manager/)
for more information.

- ## Graceful node shutdown {#graceful-node-shutdown}
- 
- {{< feature-state feature_gate_name="GracefulNodeShutdown" >}}
- 
- The kubelet attempts to detect node system shutdown and terminates pods running on the node.
- 
- The kubelet ensures that pods follow the normal
- [pod termination process](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
- during the node shutdown. During node shutdown, the kubelet does not accept new
- Pods (even if those Pods are already bound to the node).
- 
- The graceful node shutdown feature depends on systemd since it takes advantage of
- [systemd inhibitor locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit/) to
- delay the node shutdown by a given duration.
- 
- Graceful node shutdown is controlled with the `GracefulNodeShutdown`
- [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) which has been
- enabled by default since 1.21.
- 
- Note that by default, both configuration options described below,
- `shutdownGracePeriod` and `shutdownGracePeriodCriticalPods`, are set to zero,
- so the graceful node shutdown functionality is not active.
- To activate the feature, set both kubelet configuration options to non-zero values.
- 
- Once systemd detects or is notified of a node shutdown, the kubelet sets a `NotReady` condition on
- the Node, with the `reason` set to `"node is shutting down"`. The kube-scheduler honors this condition
- and does not schedule any Pods onto the affected node; other third-party schedulers are
- expected to follow the same logic. This means that new Pods won't be scheduled onto that node
- and therefore none will start.
- 
- The kubelet **also** rejects Pods during the `PodAdmission` phase if an ongoing
- node shutdown has been detected, so that even Pods with a
- {{< glossary_tooltip text="toleration" term_id="toleration" >}} for
- `node.kubernetes.io/not-ready:NoSchedule` do not start there.
- 
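- For illustration, such a toleration on a Pod would look like this (a sketch of a
- Pod spec fragment; only the toleration named above is shown):
- 
- ```yaml
- # Even with this toleration, the Pod is rejected at admission while the
- # node is shutting down.
- tolerations:
-   - key: node.kubernetes.io/not-ready
-     operator: Exists
-     effect: NoSchedule
- ```
- 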
- At the same time as the kubelet is setting that condition on its Node via the API,
- it also begins terminating any Pods that are running locally.
- 
- During a graceful shutdown, kubelet terminates pods in two phases:
- 
- 1. Terminate regular pods running on the node.
- 2. Terminate [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
-    running on the node.
- 
- The graceful node shutdown feature is configured with two
- [`KubeletConfiguration`](/docs/tasks/administer-cluster/kubelet-config-file/) options:
- 
- * `shutdownGracePeriod`:
-   * Specifies the total duration that the node should delay the shutdown by. This is the total
-     grace period for pod termination for both regular and
-     [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).
- * `shutdownGracePeriodCriticalPods`:
-   * Specifies the duration used to terminate
-     [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
-     during a node shutdown. This value should be less than `shutdownGracePeriod`.
- 
- {{< note >}}
- 
- There are cases when a node shutdown is cancelled by the system (or perhaps manually
- by an administrator). In either of those situations the Node returns to the `Ready` state.
- However, Pods which already started the process of termination are not restored by the kubelet
- and need to be rescheduled.
- 
- {{< /note >}}
- 
- For example, if `shutdownGracePeriod=30s` and
- `shutdownGracePeriodCriticalPods=10s`, the kubelet will delay the node shutdown by
- 30 seconds. During the shutdown, the first 20 (30-10) seconds are reserved
- for gracefully terminating normal pods, and the last 10 seconds are
- reserved for terminating [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).
- 
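- In kubelet configuration terms, that example corresponds to the following
- (a minimal sketch; only the two shutdown settings are shown, the rest of the
- `KubeletConfiguration` is omitted):
- 
- ```yaml
- apiVersion: kubelet.config.k8s.io/v1beta1
- kind: KubeletConfiguration
- # Total delay of node shutdown: 20s for regular pods, then 10s for critical pods.
- shutdownGracePeriod: 30s
- # Portion of shutdownGracePeriod reserved for critical pods; must be smaller.
- shutdownGracePeriodCriticalPods: 10s
- ```
- 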
- {{< note >}}
- Pods evicted during the graceful node shutdown are marked as shutdown.
- Running `kubectl get pods` shows the status of the evicted pods as `Terminated`,
- and `kubectl describe pod` indicates that the pod was evicted because of the node shutdown:
- 
- ```
- Reason:  Terminated
- Message: Pod was terminated in response to imminent node shutdown.
- ```
- 
- {{< /note >}}
- 
- ### Pod Priority based graceful node shutdown {#pod-priority-graceful-node-shutdown}
- 
- {{< feature-state feature_gate_name="GracefulNodeShutdownBasedOnPodPriority" >}}
- 
- To provide more flexibility around the ordering of pods during graceful node
- shutdown, graceful node shutdown honors the PriorityClass for Pods, provided that
- you enabled this feature in your cluster. The feature allows cluster administrators
- to explicitly define the ordering of pods during graceful node shutdown based on
- [priority classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass).
- 
- The [Graceful Node Shutdown](#graceful-node-shutdown) feature, as described
- above, shuts down pods in two phases: non-critical pods, followed by critical
- pods. If additional flexibility is needed to explicitly define the ordering of
- pods during shutdown in a more granular way, pod priority based graceful
- shutdown can be used.
- 
- When graceful node shutdown honors pod priorities, it becomes possible to do
- graceful node shutdown in multiple phases, each phase shutting down a
- particular priority class of pods. The kubelet can be configured with the exact
- phases and the shutdown time per phase.
- 
- Assuming the following custom pod
- [priority classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass)
- exist in a cluster,
- 
- | Pod priority class name | Pod priority class value |
- |-------------------------|--------------------------|
- | `custom-class-a`        | 100000                   |
- | `custom-class-b`        | 10000                    |
- | `custom-class-c`        | 1000                     |
- | `regular/unset`         | 0                        |
- 
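- For reference, a priority class such as `custom-class-a` could be defined with a
- manifest along these lines (a sketch; the name and value are taken from the table
- above, and the description text is illustrative):
- 
- ```yaml
- apiVersion: scheduling.k8s.io/v1
- kind: PriorityClass
- metadata:
-   name: custom-class-a
- value: 100000  # matches the "Pod priority class value" column above
- description: "Example class used to illustrate shutdown ordering."
- ```
- 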
- Within the [kubelet configuration](/docs/reference/config-api/kubelet-config.v1beta1/),
- the settings for `shutdownGracePeriodByPodPriority` could look like:
- 
- | Pod priority class value | Shutdown period |
- |--------------------------|-----------------|
- | 100000                   | 10 seconds      |
- | 10000                    | 180 seconds     |
- | 1000                     | 120 seconds     |
- | 0                        | 60 seconds      |
- 
- The corresponding kubelet configuration YAML would be:
- 
- ```yaml
- shutdownGracePeriodByPodPriority:
-   - priority: 100000
-     shutdownGracePeriodSeconds: 10
-   - priority: 10000
-     shutdownGracePeriodSeconds: 180
-   - priority: 1000
-     shutdownGracePeriodSeconds: 120
-   - priority: 0
-     shutdownGracePeriodSeconds: 60
- ```
- 
- The above table implies that any pod with `priority` value >= 100000 will get
- just 10 seconds to stop, any pod with value >= 10000 and < 100000 will get 180
- seconds to stop, and any pod with value >= 1000 and < 10000 will get 120 seconds to stop.
- Finally, all other pods will get 60 seconds to stop.
- 
- One doesn't have to specify values corresponding to all of the classes. For
- example, you could instead use these settings:
- 
- | Pod priority class value | Shutdown period |
- |--------------------------|-----------------|
- | 100000                   | 300 seconds     |
- | 1000                     | 120 seconds     |
- | 0                        | 60 seconds      |
- 
- In the above case, the pods with `custom-class-b` will go into the same bucket
- as `custom-class-c` for shutdown, as the configuration sketch below shows.
- 
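- The corresponding `shutdownGracePeriodByPodPriority` configuration would be
- (a sketch; values taken directly from the table above):
- 
- ```yaml
- shutdownGracePeriodByPodPriority:
-   # No entry for 10000: pods of custom-class-b fall into the >= 1000 bucket.
-   - priority: 100000
-     shutdownGracePeriodSeconds: 300
-   - priority: 1000
-     shutdownGracePeriodSeconds: 120
-   - priority: 0
-     shutdownGracePeriodSeconds: 60
- ```
- 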
- If there are no pods in a particular range, then the kubelet does not wait
- for pods in that priority range. Instead, the kubelet immediately skips to the
- next priority class value range.
- 
- If this feature is enabled and no configuration is provided, then no ordering
- action will be taken.
- 
- Using this feature requires enabling the `GracefulNodeShutdownBasedOnPodPriority`
- [feature gate](/docs/reference/command-line-tools-reference/feature-gates/),
- and setting `shutdownGracePeriodByPodPriority` in the
- [kubelet config](/docs/reference/config-api/kubelet-config.v1beta1/)
- to the desired configuration containing the pod priority class values and
- their respective shutdown periods.
- 
- {{< note >}}
- The ability to take Pod priority into account during graceful node shutdown was introduced
- as an Alpha feature in Kubernetes v1.23. In Kubernetes {{< skew currentVersion >}}
- the feature is Beta and is enabled by default.
- {{< /note >}}
- 
- Metrics `graceful_shutdown_start_time_seconds` and `graceful_shutdown_end_time_seconds`
- are emitted under the kubelet subsystem to monitor node shutdowns.
- 
- ## Non-graceful node shutdown handling {#non-graceful-node-shutdown}
- 
- {{< feature-state feature_gate_name="NodeOutOfServiceVolumeDetach" >}}
- 
- A node shutdown action may not be detected by kubelet's Node Shutdown Manager,
- either because the command does not trigger the inhibitor locks mechanism used by
- the kubelet or because of a user error, i.e., `shutdownGracePeriod` and
- `shutdownGracePeriodCriticalPods` are not configured properly. Refer to the
- [Graceful Node Shutdown](#graceful-node-shutdown) section above for more details.
- 
- When a node is shut down but not detected by kubelet's Node Shutdown Manager, the pods
- that are part of a {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
- will be stuck in terminating status on the shutdown node and cannot move to a new running node.
- This is because the kubelet on the shutdown node is not available to delete the pods, so
- the StatefulSet cannot create a new pod with the same name. If there are volumes used by the pods,
- the VolumeAttachments will not be deleted from the original shutdown node, so the volumes
- used by these pods cannot be attached to a new running node. As a result, the
- application running on the StatefulSet cannot function properly. If the original
- shutdown node comes up, the pods will be deleted by the kubelet and new pods will be
- created on a different running node. If the original shutdown node does not come up,
- these pods will be stuck in terminating status on the shutdown node forever.
- 
- To mitigate the above situation, a user can manually add the taint `node.kubernetes.io/out-of-service`
- with either `NoExecute` or `NoSchedule` effect to a Node, marking it out-of-service.
- If the `NodeOutOfServiceVolumeDetach` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
- is enabled on the {{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager" >}},
- and a Node is marked out-of-service with this taint, the pods on the node will be forcefully deleted
- if they have no matching tolerations, and volume detach operations for the pods terminating on
- the node will happen immediately. This allows the Pods on the out-of-service node to recover quickly
- on a different node.
- 
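- For illustration, the taint could be applied in the Node's spec along these lines
- (a hypothetical sketch; the taint key and effect come from the text above, while
- the value shown is arbitrary):
- 
- ```yaml
- # Fragment of a Node object with the out-of-service taint applied.
- # The value ("nodeshutdown") is illustrative; only the key and effect matter here.
- spec:
-   taints:
-     - key: node.kubernetes.io/out-of-service
-       value: nodeshutdown
-       effect: NoExecute
- ```
- 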
- During a non-graceful shutdown, Pods are terminated in two phases:
- 
- 1. Force delete the Pods that do not have matching `out-of-service` tolerations.
- 2. Immediately perform volume detach operations for such pods.
- 
- {{< note >}}
- - Before adding the taint `node.kubernetes.io/out-of-service`, verify that the node
-   is already in a shutdown or power-off state (not in the middle of restarting).
- - The user is required to manually remove the out-of-service taint after the pods have
-   moved to a new node and the user has checked that the shutdown node has
-   recovered, since the user was the one who originally added the taint.
- {{< /note >}}
- 
- ### Forced storage detach on timeout {#storage-force-detach-on-timeout}
- 
- In any situation where a pod deletion has not succeeded for 6 minutes, Kubernetes will
- force detach volumes being unmounted if the node is unhealthy at that instant. Any
- workload still running on the node that uses a force-detached volume will cause a
- violation of the
- [CSI specification](https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerunpublishvolume),
- which states that `ControllerUnpublishVolume` "**must** be called after all
- `NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed".
- In such circumstances, volumes on the node in question might encounter data corruption.
- 
- The forced storage detach behaviour is optional; users might opt to use the "Non-graceful
- node shutdown" feature instead.
- 
- Force storage detach on timeout can be disabled by setting the `disable-force-detach-on-timeout`
- config field in `kube-controller-manager`. Disabling the force detach on timeout feature means
- that a volume that is hosted on a node that is unhealthy for more than 6 minutes will not have
- its associated
- [VolumeAttachment](/docs/reference/kubernetes-api/config-and-storage-resources/volume-attachment-v1/)
- deleted.
- 
- After this setting has been applied, unhealthy pods still attached to volumes must be recovered
- via the [Non-Graceful Node Shutdown](#non-graceful-node-shutdown) procedure mentioned above.
- 
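- As an illustration, in a cluster where kube-controller-manager runs as a static Pod
- (for example, one created by kubeadm), the setting could be applied as a command-line
- flag along these lines (a sketch; the manifest path, surrounding fields, and the
- flag spelling of the config field are assumptions):
- 
- ```yaml
- # Fragment of a kube-controller-manager static Pod manifest, for example
- # /etc/kubernetes/manifests/kube-controller-manager.yaml (path assumed).
- spec:
-   containers:
-     - name: kube-controller-manager
-       command:
-         - kube-controller-manager
-         # Keep VolumeAttachments even when a node is unhealthy for > 6 minutes.
-         - --disable-force-detach-on-timeout=true
- ```
- 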
- {{< note >}}
- - Exercise caution when using the [Non-Graceful Node Shutdown](#non-graceful-node-shutdown) procedure.
- - Deviating from the steps documented above can result in data corruption.
- {{< /note >}}
- 
## Swap memory management {#swap-memory}

{{< feature-state feature_gate_name="NodeSwap" >}}
@@ -610,6 +356,7 @@ Learn more about the following:
* [API definition for Node](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#node-v1-core).
* [Node](https://git.k8s.io/design-proposals-archive/architecture/architecture.md#the-kubernetes-node)
  section of the architecture design document.
+ * [Graceful/non-graceful node shutdown](/docs/concepts/cluster-administration/node-shutdown/).
* [Cluster autoscaling](/docs/concepts/cluster-administration/cluster-autoscaling/) to
  manage the number and size of nodes in your cluster.
* [Taints and Tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/).