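Kubelet Image GC: How It Works

What is Image GC?

Image GC is the kubelet's image cleanup feature. When disk space runs low, it removes images that are no longer needed to free disk space, so that Pods can start and run normally.

How is Image GC used?

Image GC is enabled in the kubelet by default and is controlled through the ImageGCPolicy in the kubelet startup configuration. ImageGCPolicy has three settings: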

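  • ImageGCHighThresholdPercent: the threshold that triggers GC. GC runs once disk usage reaches or exceeds this value; when it is set to 100, the periodic GC is not started.
  • ImageGCLowThresholdPercent: the disk usage target for GC. Once triggered, GC tries to bring disk usage down to this value.
  • ImageMinimumGCAge: the minimum age of an image (the time since it was first detected). Images younger than this are not garbage collected.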
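To make the relationship between the three settings concrete, here is a minimal sketch in Go. The type gcPolicy and the helpers validate and periodicGCEnabled are hypothetical names used only for illustration; the checks mirror the ones NewImageGCManager performs in the source analysed later, and the values 85, 80 and 2 minutes are example values, not something this article prescribes.

```go
package main

import (
	"fmt"
	"time"
)

// gcPolicy is a hypothetical stand-in for the kubelet's ImageGCPolicy.
type gcPolicy struct {
	HighThresholdPercent int           // usage at or above this triggers GC
	LowThresholdPercent  int           // GC tries to bring usage down to this
	MinAge               time.Duration // images younger than this are skipped
}

// validate applies the range checks described in the article:
// both thresholds in [0, 100], and low must not exceed high.
func validate(p gcPolicy) error {
	if p.HighThresholdPercent < 0 || p.HighThresholdPercent > 100 {
		return fmt.Errorf("invalid HighThresholdPercent %d", p.HighThresholdPercent)
	}
	if p.LowThresholdPercent < 0 || p.LowThresholdPercent > 100 {
		return fmt.Errorf("invalid LowThresholdPercent %d", p.LowThresholdPercent)
	}
	if p.LowThresholdPercent > p.HighThresholdPercent {
		return fmt.Errorf("LowThresholdPercent %d is higher than HighThresholdPercent %d",
			p.LowThresholdPercent, p.HighThresholdPercent)
	}
	return nil
}

// periodicGCEnabled reports whether the periodic GC loop would be started.
func periodicGCEnabled(p gcPolicy) bool {
	return p.HighThresholdPercent != 100
}

func main() {
	p := gcPolicy{HighThresholdPercent: 85, LowThresholdPercent: 80, MinAge: 2 * time.Minute}
	fmt.Println(validate(p), periodicGCEnabled(p))
}
```

A policy with the low threshold above the high threshold is rejected, and a high threshold of 100 passes validation but switches the periodic loop off.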

Source code analysis

Initialization and startup of ImageGC

When the kubelet starts, ImageGC is started right after BirthCry has finished.

func (kl *Kubelet) StartGarbageCollection() {
	loggedContainerGCFailure := false

	// container GC flow, omitted
	...

	// When ImageGCHighThresholdPercent is set to 100, image GC is disabled
	if kl.kubeletConfiguration.ImageGCHighThresholdPercent == 100 {
		klog.V(2).Infof("ImageGCHighThresholdPercent is set 100, Disable image GC")
		return
	}

	prevImageGCFailed := false
	go wait.Until(func() {
		if err := kl.imageManager.GarbageCollect(); err != nil {
			if prevImageGCFailed {
				klog.Errorf("Image garbage collection failed multiple times in a row: %v", err)
				// Only create an event for repeated failures
				kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
			} else {
				klog.Errorf("Image garbage collection failed once. Stats initialization may not have completed yet: %v", err)
			}
			prevImageGCFailed = true
		} else {
			var vLevel klog.Level = 4
			if prevImageGCFailed {
				vLevel = 1
				prevImageGCFailed = false
			}

			klog.V(vLevel).Infof("Image garbage collection succeeded")
		}
	}, ImageGCPeriod, wait.NeverStop)
}

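As shown, ImageGC runs in its own goroutine, with a default interval of five minutes. The first time ImageGC fails, only a log line is printed; on repeated failures an ImageGCFailed event is recorded as well. This means you can tell whether GC is running normally by watching the logs or by alerting on that event.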
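The failure handling is a small state machine around prevImageGCFailed. The sketch below isolates it so the transitions are easy to follow; gcReporter, report and errSample are hypothetical names, and the returned strings stand in for the log and event calls in the real loop.

```go
package main

import (
	"errors"
	"fmt"
)

// errSample is an arbitrary error used to simulate a failed GC run.
var errSample = errors.New("gc failed")

// gcReporter is a hypothetical helper mirroring the prevImageGCFailed bookkeeping.
type gcReporter struct {
	prevFailed bool
}

// report returns what the loop would do for one GC result:
//   "log"       first failure: error log only
//   "event"     repeated failure: error log plus a warning event
//   "recovered" first success after a failure: logged at a more visible level
//   "ok"        ordinary success
func (r *gcReporter) report(err error) string {
	if err != nil {
		action := "log"
		if r.prevFailed {
			action = "event"
		}
		r.prevFailed = true
		return action
	}
	if r.prevFailed {
		r.prevFailed = false
		return "recovered"
	}
	return "ok"
}

func main() {
	r := &gcReporter{}
	for _, err := range []error{errSample, errSample, nil, nil, errSample} {
		fmt.Println(r.report(err))
	}
}
```

A single success resets the state, so an event is only produced for failures in consecutive runs.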
Next, let's look at the concrete implementation of ImageGCManager.

type ImageGCManager interface {
	// Applies the garbage collection policy. Errors include being unable to free
	// enough space as per the garbage collection policy.
	GarbageCollect() error

	// Start async garbage collection of images.
	Start()

	GetImageList() ([]container.Image, error)

	// Delete all unused images.
	DeleteUnusedImages() error
}

func NewImageGCManager(runtime container.Runtime, statsProvider StatsProvider, recorder record.EventRecorder, nodeRef *v1.ObjectReference, policy ImageGCPolicy, sandboxImage string) (ImageGCManager, error) {
	// Validate policy.
	if policy.HighThresholdPercent < 0 || policy.HighThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid HighThresholdPercent %d, must be in range [0-100]", policy.HighThresholdPercent)
	}
	if policy.LowThresholdPercent < 0 || policy.LowThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid LowThresholdPercent %d, must be in range [0-100]", policy.LowThresholdPercent)
	}
	if policy.LowThresholdPercent > policy.HighThresholdPercent {
		return nil, fmt.Errorf("LowThresholdPercent %d can not be higher than HighThresholdPercent %d", policy.LowThresholdPercent, policy.HighThresholdPercent)
	}
	im := &realImageGCManager{
		runtime:       runtime,
		policy:        policy,
		imageRecords:  make(map[string]*imageRecord),
		statsProvider: statsProvider,
		recorder:      recorder,
		nodeRef:       nodeRef,
		initialized:   false,
		sandboxImage:  sandboxImage,
	}

	return im, nil
}

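The ImageGCManager interface is very simple, with only four methods:

  • GarbageCollect: performs the actual cleanup according to the configured ImageGCPolicy;
  • Start: asynchronously collects image information;
  • GetImageList: returns the cached image list;
  • DeleteUnusedImages: deletes unused images.

When it is constructed, ImageGCManager validates the Policy parameters and then stores the runtime, stats provider, event recorder and the other arguments. Next, let's look at the logic of the Start method: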

func (im *realImageGCManager) Start() {
	go wait.Until(func() {
		// Initial detection make detected time "unknown" in the past.
		var ts time.Time
		if im.initialized {
			ts = time.Now()
		}
		_, err := im.detectImages(ts)
		if err != nil {
			klog.Warningf("[imageGCManager] Failed to monitor images: %v", err)
		} else {
			im.initialized = true
		}
	}, 5*time.Minute, wait.NeverStop)

	// Start a goroutine periodically updates image cache.
	go wait.Until(func() {
		images, err := im.runtime.ListImages()
		if err != nil {
			klog.Warningf("[imageGCManager] Failed to update image list: %v", err)
		} else {
			im.imageCache.set(images)
		}
	}, 30*time.Second, wait.NeverStop)
}

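The Start method of ImageGCManager launches two goroutines. In the first, the Manager runs image detection every five minutes; once a detection completes successfully, the Manager is marked as initialized. The other goroutine fetches the full image list from the container runtime every 30 seconds and stores it in the cached image list.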
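One detail in Start is easy to miss: before the Manager is initialized, detection is called with the zero time.Time rather than the current time. The sketch below, with the hypothetical helpers detectTimestamp, oldEnough and example, illustrates the consequence under the assumption that the minimum-age check is freeTime.Sub(firstDetected) compared against MinAge, as in the freeSpace code analysed later: images already present at the first detection count as old enough immediately, while images first seen by a later detection must wait out the minimum age.

```go
package main

import (
	"fmt"
	"time"
)

// detectTimestamp mirrors the choice made in Start: the zero time before the
// first successful detection, the current time afterwards.
func detectTimestamp(initialized bool, now time.Time) time.Time {
	if !initialized {
		return time.Time{}
	}
	return now
}

// oldEnough mirrors the minimum-age check: an image is eligible once the time
// since it was first detected is at least minAge.
func oldEnough(firstDetected, freeTime time.Time, minAge time.Duration) bool {
	return freeTime.Sub(firstDetected) >= minAge
}

// example evaluates three cases with an assumed minAge of two minutes.
func example() [3]bool {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	minAge := 2 * time.Minute

	first := detectTimestamp(false, now) // seen by the very first detection
	later := detectTimestamp(true, now)  // first seen by a later detection

	return [3]bool{
		oldEnough(first, now.Add(time.Minute), minAge),   // zero time is far in the past
		oldEnough(later, now.Add(time.Minute), minAge),   // only one minute old
		oldEnough(later, now.Add(3*time.Minute), minAge), // three minutes old
	}
}

func main() {
	fmt.Println(example())
}
```

Subtracting the zero time from a present-day timestamp overflows time.Duration, and Sub then returns the maximum duration, which is why the first case always passes the age check.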

Detecting and maintaining image information

So how does the Manager detect images?

func (im *realImageGCManager) detectImages(detectTime time.Time) (sets.String, error) {
	imagesInUse := sets.NewString()

	// Always consider the container runtime pod sandbox image in use
	imageRef, err := im.runtime.GetImageRef(container.ImageSpec{Image: im.sandboxImage})
	if err == nil && imageRef != "" {
		imagesInUse.Insert(imageRef)
	}

	images, err := im.runtime.ListImages()
	if err != nil {
		return imagesInUse, err
	}
	pods, err := im.runtime.GetPods(true)
	if err != nil {
		return imagesInUse, err
	}

	// Make a set of images in use by containers.
	for _, pod := range pods {
		for _, container := range pod.Containers {
			klog.V(5).Infof("Pod %s/%s, container %s uses image %s(%s)", pod.Namespace, pod.Name, container.Name, container.Image, container.ImageID)
			imagesInUse.Insert(container.ImageID)
		}
	}

	// Add new images and record those being used.
	now := time.Now()
	currentImages := sets.NewString()
	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()
	for _, image := range images {
		klog.V(5).Infof("Adding image ID %s to currentImages", image.ID)
		currentImages.Insert(image.ID)

		// New image, set it as detected now.
		if _, ok := im.imageRecords[image.ID]; !ok {
			klog.V(5).Infof("Image ID %s is new", image.ID)
			im.imageRecords[image.ID] = &imageRecord{
				firstDetected: detectTime,
			}
		}

		// Set last used time to now if the image is being used.
		if isImageUsed(image.ID, imagesInUse) {
			klog.V(5).Infof("Setting Image ID %s lastUsed to %v", image.ID, now)
			im.imageRecords[image.ID].lastUsed = now
		}

		klog.V(5).Infof("Image ID %s has size %d", image.ID, image.Size)
		im.imageRecords[image.ID].size = image.Size
	}

	// Remove old images from our records.
	for image := range im.imageRecords {
		if !currentImages.Has(image) {
			klog.V(5).Infof("Image ID %s is no longer present; removing from imageRecords", image)
			delete(im.imageRecords, image)
		}
	}

	return imagesInUse, nil
}

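The purpose of image detection is to find the images that are in use, so that they are not removed while GC runs. Image cleanup also relies on some per-image bookkeeping, and that information is refreshed during detection as well.
First, the Sandbox image is always treated as in use. Next, the images of all containers that the runtime reports for every Pod are added to the in-use set. Note that only containers that already exist in the runtime count here: GetPods is called with true, so exited containers are included as well, but if a Pod needs an image and the corresponding container has not been created yet, that image is not protected and can still be cleaned up.
Once the in-use images (the ones that will not be cleaned up) have been selected, the image list obtained from the container runtime is merged into the image records maintained by the Manager.
The Manager iterates over the freshly fetched image list and does the following for each image:

  • If the image is being recorded for the first time, its first-detected time is set to the time of this detection round;
  • If the previous step marked it as in use, its last-used time is refreshed to the current time;
  • Its recorded size is refreshed.

Finally, any image that no longer appears in the list returned by the container runtime is removed from the Manager's detection records.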
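The bookkeeping steps above can be sketched with plain maps. This is a simplified illustration, not the kubelet code: record, detect and example are hypothetical names, the runtime's image list is reduced to a map from image ID to size, and locking and logging are left out.

```go
package main

import (
	"fmt"
	"time"
)

// record is a simplified, hypothetical version of the per-image record.
type record struct {
	firstDetected time.Time
	lastUsed      time.Time
	size          int64
}

// detect updates records from the runtime's image list (ID -> size)
// and the set of image IDs currently in use.
func detect(records map[string]*record, images map[string]int64, inUse map[string]bool, detectTime, now time.Time) {
	for id, size := range images {
		// First time we see this image: remember when it was detected.
		if _, ok := records[id]; !ok {
			records[id] = &record{firstDetected: detectTime}
		}
		// In-use images get their last-used time refreshed.
		if inUse[id] {
			records[id].lastUsed = now
		}
		records[id].size = size
	}
	// Forget images the runtime no longer reports.
	for id := range records {
		if _, ok := images[id]; !ok {
			delete(records, id)
		}
	}
}

// example runs two detection rounds and returns the resulting records.
func example() map[string]*record {
	records := map[string]*record{}
	t1 := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	t2 := t1.Add(5 * time.Minute)

	// Round 1: first detection, so the detect time is the zero time.
	detect(records, map[string]int64{"a": 100, "b": 200}, map[string]bool{"a": true}, time.Time{}, t1)
	// Round 2: "b" is gone, "c" is new, nothing is in use.
	detect(records, map[string]int64{"a": 100, "c": 300}, map[string]bool{}, t2, t2)
	return records
}

func main() {
	for id, r := range example() {
		fmt.Println(id, r.firstDetected, r.lastUsed, r.size)
	}
}
```

After the two rounds, "a" keeps its zero first-detected time and the last-used time from round 1, "c" has a first-detected time but has never been used, and "b" has been dropped from the records.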

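How ImageGC is executed

The core method of ImageGCManager is GarbageCollect. Its main steps are as follows: first obtain the usage of the filesystem that holds the images, then compute the usage percentage and, based on the startup configuration, the amount of space to free, and finally start freeing space. If the space actually freed is less than the target, a FreeDiskSpaceFailed Warning event is recorded.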

func (im *realImageGCManager) GarbageCollect() error {
	// Get disk usage on disk holding images.
	fsStats, err := im.statsProvider.ImageFsStats()
	if err != nil {
		return err
	}

	var capacity, available int64
	if fsStats.CapacityBytes != nil {
		capacity = int64(*fsStats.CapacityBytes)
	}
	if fsStats.AvailableBytes != nil {
		available = int64(*fsStats.AvailableBytes)
	}

	if available > capacity {
		klog.Warningf("available %d is larger than capacity %d", available, capacity)
		available = capacity
	}

	// Check valid capacity.
	if capacity == 0 {
		err := goerrors.New("invalid capacity 0 on image filesystem")
		im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
		return err
	}

	// If over the max threshold, free enough to place us at the lower threshold.
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent >= im.policy.HighThresholdPercent {
		amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
		klog.Infof("[imageGCManager]: Disk usage on image filesystem is at %d%% which is over the high threshold (%d%%). Trying to free %d bytes down to the low threshold (%d%%).", usagePercent, im.policy.HighThresholdPercent, amountToFree, im.policy.LowThresholdPercent)
		freed, err := im.freeSpace(amountToFree, time.Now())
		if err != nil {
			return err
		}

		if freed < amountToFree {
			err := fmt.Errorf("failed to garbage collect required amount of images. Wanted to free %d bytes, but freed %d bytes", amountToFree, freed)
			im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
			return err
		}
	}

	return nil
}
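As a worked example of the two formulas in GarbageCollect, the sketch below reproduces the integer arithmetic in isolation. The function names usagePercent and amountToFree are chosen here to match the local variables; the capacity, available space and thresholds are made-up numbers.

```go
package main

import "fmt"

// usagePercent reproduces the integer arithmetic used in GarbageCollect.
// The available percentage is truncated, so usage is effectively rounded up.
func usagePercent(capacity, available int64) int {
	return 100 - int(available*100/capacity)
}

// amountToFree is the number of bytes that must be freed so that the
// available space corresponds to the low threshold.
func amountToFree(capacity, available int64, lowThresholdPercent int) int64 {
	return capacity*int64(100-lowThresholdPercent)/100 - available
}

func main() {
	const gib = int64(1) << 30
	capacity, available := 100*gib, 10*gib

	// 10% available means 90% used, which is over an assumed high threshold of 85.
	fmt.Println(usagePercent(capacity, available)) // 90

	// With a low threshold of 80, 20 GiB should be available; 10 GiB already is.
	fmt.Println(amountToFree(capacity, available, 80) / gib) // 10
}
```

Because of the truncation, a filesystem with 14.9% available is reported as 86% used rather than 85%, so GC can start slightly before the exact threshold is crossed.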

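Once the amount of space to free is known, how are the images to delete chosen? At the start of cleanup, the image detection described above is run. When it finishes, the resulting imagesInUse contains the Sandbox image and the images used by the containers that the runtime reported for each Pod. Next, the cleanup candidates are selected. They are stored in a structure called evictionInfo, which holds every image record that is not in imagesInUse. These records are then sorted by last-used time and first-detected time; that is, following an LRU rule, images that were last used earlier and detected earlier come first.
After sorting, the candidates are visited in order. If an image's last-used time is not earlier than the time cleanup was triggered (that is, its last-used time has just been refreshed), it is not deleted. Likewise, if the time since the image was first detected is less than the configured minimum GC age (that is, it has only just been added to the records), it is not deleted either. Otherwise the images are deleted in order; after each deletion the image is removed from the detection records and its size is added to the running total spaceFreed. Once spaceFreed is no less than the target, this round of cleanup ends normally.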

func (im *realImageGCManager) freeSpace(bytesToFree int64, freeTime time.Time) (int64, error) {
	imagesInUse, err := im.detectImages(freeTime)
	if err != nil {
		return 0, err
	}

	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()

	// Get all images in eviction order.
	images := make([]evictionInfo, 0, len(im.imageRecords))
	for image, record := range im.imageRecords {
		if isImageUsed(image, imagesInUse) {
			klog.V(5).Infof("Image ID %s is being used", image)
			continue
		}
		images = append(images, evictionInfo{
			id:          image,
			imageRecord: *record,
		})
	}
	sort.Sort(byLastUsedAndDetected(images))

	// Delete unused images until we've freed up enough space.
	var deletionErrors []error
	spaceFreed := int64(0)
	for _, image := range images {
		klog.V(5).Infof("Evaluating image ID %s for possible garbage collection", image.id)
		// Images that are currently in used were given a newer lastUsed.
		if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
			klog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
			continue
		}

		// Avoid garbage collect the image if the image is not old enough.
		// In such a case, the image may have just been pulled down, and will be used by a container right away.
		if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
			klog.V(5).Infof("Image ID %s has age %v which is less than the policy's minAge of %v, not eligible for garbage collection", image.id, freeTime.Sub(image.firstDetected), im.policy.MinAge)
			continue
		}

		// Remove image. Continue despite errors.
		klog.Infof("[imageGCManager]: Removing image %q to free %d bytes", image.id, image.size)
		err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})
		if err != nil {
			deletionErrors = append(deletionErrors, err)
			continue
		}
		delete(im.imageRecords, image.id)
		spaceFreed += image.size

		if spaceFreed >= bytesToFree {
			break
		}
	}

	if len(deletionErrors) > 0 {
		return spaceFreed, fmt.Errorf("wanted to free %d bytes, but freed %d bytes space with errors in image deletion: %v", bytesToFree, spaceFreed, errors.NewAggregate(deletionErrors))
	}
	return spaceFreed, nil
}
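The listing above relies on byLastUsedAndDetected for the eviction order, but its definition is not shown in this article. The following is only a sketch of the ordering the text describes (last-used time first, first-detected time as the tie-breaker), written with sort.Slice; candidate, evictionOrder and sampleCandidates are hypothetical names, and this is not presented as the upstream implementation.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// candidate is a hypothetical, reduced form of an eviction candidate.
type candidate struct {
	id            string
	lastUsed      time.Time
	firstDetected time.Time
}

// evictionOrder returns candidate IDs sorted by last-used time,
// breaking ties by first-detected time (earliest first in both cases).
func evictionOrder(cands []candidate) []string {
	sorted := append([]candidate(nil), cands...) // do not mutate the input
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].lastUsed.Equal(sorted[j].lastUsed) {
			return sorted[i].firstDetected.Before(sorted[j].firstDetected)
		}
		return sorted[i].lastUsed.Before(sorted[j].lastUsed)
	})
	ids := make([]string, len(sorted))
	for i, c := range sorted {
		ids[i] = c.id
	}
	return ids
}

// sampleCandidates builds a small, deliberately unordered candidate list.
func sampleCandidates() []candidate {
	t := time.Date(2024, 1, 2, 12, 0, 0, 0, time.UTC)
	return []candidate{
		{id: "used-hour-ago", lastUsed: t.Add(-time.Hour), firstDetected: t.Add(-72 * time.Hour)},
		{id: "never-used-new", firstDetected: t.Add(-time.Hour)},
		{id: "used-yesterday", lastUsed: t.Add(-24 * time.Hour), firstDetected: t.Add(-48 * time.Hour)},
		{id: "never-used-old", firstDetected: t.Add(-3 * time.Hour)},
	}
}

func main() {
	fmt.Println(evictionOrder(sampleCandidates()))
}
```

Images that were never seen in use keep the zero last-used time, so under this ordering they are considered before any image that has been used, and among them the one detected earliest goes first.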

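Disk eviction and ImageGC

Besides the GC logic above, there is in fact another trigger for image cleanup. In practice, you may occasionally find images being removed even though ImageGCHighThresholdPercent is set to 100. Looking back at the ImageGC interface shown earlier, DeleteUnusedImages is an exported method.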

func buildSignalToNodeReclaimFuncs(imageGC ImageGC, containerGC ContainerGC, withImageFs bool) map[evictionapi.Signal]nodeReclaimFuncs {
	signalToReclaimFunc := map[evictionapi.Signal]nodeReclaimFuncs{}
	// usage of an imagefs is optional
	if withImageFs {
		// with an imagefs, nodefs pressure should just delete logs
		signalToReclaimFunc[evictionapi.SignalNodeFsAvailable] = nodeReclaimFuncs{}
		signalToReclaimFunc[evictionapi.SignalNodeFsInodesFree] = nodeReclaimFuncs{}
		// with an imagefs, imagefs pressure should delete unused images
		signalToReclaimFunc[evictionapi.SignalImageFsAvailable] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalImageFsInodesFree] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
	} else {
		// without an imagefs, nodefs pressure should delete logs, and unused images
		// since imagefs and nodefs share a common device, they share common reclaim functions
		signalToReclaimFunc[evictionapi.SignalNodeFsAvailable] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalNodeFsInodesFree] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalImageFsAvailable] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalImageFsInodesFree] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
	}
	return signalToReclaimFunc
}

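In fact, when a full disk triggers a node eviction signal, the container and image GC methods are called directly; after all, a node eviction signal is the more urgent condition.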
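To see why the high threshold has no influence on this path, here is a reduced sketch of the signal-to-reclaim-function mapping. Everything in it is hypothetical scaffolding (signal, reclaimFunc, buildReclaimFuncs, reclaim), only two signals are modelled, and the way the real eviction manager invokes the functions and handles their errors is not shown in this article, so the loop in reclaim is an assumption.

```go
package main

import "fmt"

// signal and reclaimFunc are hypothetical stand-ins for the eviction types.
type signal string
type reclaimFunc func() error

const (
	nodeFsAvailable  signal = "nodefs.available"
	imageFsAvailable signal = "imagefs.available"
)

// buildReclaimFuncs mirrors the shape of buildSignalToNodeReclaimFuncs for two
// signals: with a dedicated imagefs, nodefs pressure reclaims nothing here;
// without one, both signals delete unused containers and images.
func buildReclaimFuncs(withImageFs bool, deleteContainers, deleteImages reclaimFunc) map[signal][]reclaimFunc {
	both := []reclaimFunc{deleteContainers, deleteImages}
	m := map[signal][]reclaimFunc{imageFsAvailable: both}
	if withImageFs {
		m[nodeFsAvailable] = nil
	} else {
		m[nodeFsAvailable] = both
	}
	return m
}

// reclaim runs the functions registered for a signal in order (assumed behaviour).
func reclaim(funcs map[signal][]reclaimFunc, s signal) error {
	for _, f := range funcs[s] {
		if err := f(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	deleteContainers := func() error { fmt.Println("delete unused containers"); return nil }
	deleteImages := func() error { fmt.Println("delete unused images"); return nil }

	// No threshold is consulted anywhere: the signal alone decides what runs.
	funcs := buildReclaimFuncs(false, deleteContainers, deleteImages)
	fmt.Println(reclaim(funcs, nodeFsAvailable))
}
```

The point of the sketch is that the mapping is keyed only by the eviction signal, so image deletion on this path does not depend on ImageGCHighThresholdPercent.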

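Summary

Overall, the kubelet deletes redundant images either when a node eviction signal fires or when the filesystem that holds the images is running short of space. The key points of the GC are:

  • Cleanup is triggered when usage reaches HighThresholdPercent and continues until usage is down to LowThresholdPercent. Note, however, that setting HighThresholdPercent to 100 only turns off the periodic cleanup task; it has no effect on cleanup triggered by node eviction.
  • During cleanup, three kinds of images are not removed:
    • the image required by the Sandbox;
    • images whose last-used time was just refreshed by the detection run at the start of GC;
    • images that were first detected less than MinimumGCAge ago.
  • Cleanup removes the least recently used and earliest detected images first.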