How Kubelet Image GC Works

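This article is for K8S users who want to use or understand Image GC, or who need to build custom functionality on top of it. If you only want to configure it, the first two sections and the summary at the end are enough.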

What is Image GC?

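Image GC is an image cleanup feature that starts together with the kubelet. When disk space runs low, it removes images that are no longer needed and frees disk space so that pods can start and run normally.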

How is Image GC used?

Image GC is enabled by default in the kubelet and is controlled by the kubelet's ImageGCPolicy, which has three options:

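  • ImageGCHighThresholdPercent: the disk usage percentage at which GC starts. GC is triggered once usage reaches or exceeds this value; setting it to 100 disables GC.
  • ImageGCLowThresholdPercent: the lower bound below which GC does not run. Once triggered, GC tries to bring disk usage back down to this value.
  • ImageMinimumGCAge: the minimum age of an image before it can be collected. Images younger than this threshold are not removed.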

Source code analysis

During kubelet startup, image GC is started after BirthCry has finished.

func (kl *Kubelet) StartGarbageCollection() {
	loggedContainerGCFailure := false

	// Container GC flow, omitted
	...

	// Image GC is disabled when ImageGCHighThresholdPercent is set to 100
	if kl.kubeletConfiguration.ImageGCHighThresholdPercent == 100 {
		klog.V(2).Infof("ImageGCHighThresholdPercent is set 100, Disable image GC")
		return
	}

	prevImageGCFailed := false
	go wait.Until(func() {
		if err := kl.imageManager.GarbageCollect(); err != nil {
			if prevImageGCFailed {
				klog.Errorf("Image garbage collection failed multiple times in a row: %v", err)
				// Only create an event for repeated failures
				kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
			} else {
				klog.Errorf("Image garbage collection failed once. Stats initialization may not have completed yet: %v", err)
			}
			prevImageGCFailed = true
		} else {
			var vLevel klog.Level = 4
			if prevImageGCFailed {
				vLevel = 1
				prevImageGCFailed = false
			}

			klog.V(vLevel).Infof("Image garbage collection succeeded")
		}
	}, ImageGCPeriod, wait.NeverStop)
}

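As shown, image GC runs in its own goroutine. The default ImageGCPeriod is five minutes, so GC runs every five minutes and never stops. When image GC fails repeatedly, an ImageGCFailed event is emitted; the first failure is only logged. This means you can tell whether GC is working properly by setting up log collection or alerting. Next, let's look at how imageManager is implemented.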

type ImageGCManager interface {
	// Applies the garbage collection policy. Errors include being unable to free
	// enough space as per the garbage collection policy.
	GarbageCollect() error

	// Start async garbage collection of images.
	Start()

	GetImageList() ([]container.Image, error)

	// Delete all unused images.
	DeleteUnusedImages() error
}

func NewImageGCManager(runtime container.Runtime, statsProvider StatsProvider, recorder record.EventRecorder, nodeRef *v1.ObjectReference, policy ImageGCPolicy, sandboxImage string) (ImageGCManager, error) {
	// Validate policy.
	if policy.HighThresholdPercent < 0 || policy.HighThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid HighThresholdPercent %d, must be in range [0-100]", policy.HighThresholdPercent)
	}
	if policy.LowThresholdPercent < 0 || policy.LowThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid LowThresholdPercent %d, must be in range [0-100]", policy.LowThresholdPercent)
	}
	if policy.LowThresholdPercent > policy.HighThresholdPercent {
		return nil, fmt.Errorf("LowThresholdPercent %d can not be higher than HighThresholdPercent %d", policy.LowThresholdPercent, policy.HighThresholdPercent)
	}
	im := &realImageGCManager{
		runtime:       runtime,
		policy:        policy,
		imageRecords:  make(map[string]*imageRecord),
		statsProvider: statsProvider,
		recorder:      recorder,
		nodeRef:       nodeRef,
		initialized:   false,
		sandboxImage:  sandboxImage,
	}

	return im, nil
}

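The ImageGCManager interface is very simple, with only four methods. Initialization validates the Policy parameters and then stores the runtime, stats provider, event recorder, and other related arguments. Next, let's look at startup: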
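
To see which configurations the constructor rejects, the three range checks can be tried on their own. This is a standalone sketch: the policy type and validate function are hypothetical, trimmed-down copies of the checks in NewImageGCManager, not part of the kubelet code.

```go
package main

import "fmt"

// policy is a hypothetical, trimmed-down copy of the fields the constructor validates.
type policy struct {
	HighThresholdPercent int
	LowThresholdPercent  int
}

// validate applies the same three range checks as the constructor.
func validate(p policy) error {
	if p.HighThresholdPercent < 0 || p.HighThresholdPercent > 100 {
		return fmt.Errorf("invalid HighThresholdPercent %d, must be in range [0-100]", p.HighThresholdPercent)
	}
	if p.LowThresholdPercent < 0 || p.LowThresholdPercent > 100 {
		return fmt.Errorf("invalid LowThresholdPercent %d, must be in range [0-100]", p.LowThresholdPercent)
	}
	if p.LowThresholdPercent > p.HighThresholdPercent {
		return fmt.Errorf("LowThresholdPercent %d can not be higher than HighThresholdPercent %d", p.LowThresholdPercent, p.HighThresholdPercent)
	}
	return nil
}

func main() {
	candidates := []policy{
		{HighThresholdPercent: 85, LowThresholdPercent: 80},  // accepted
		{HighThresholdPercent: 80, LowThresholdPercent: 90},  // low above high
		{HighThresholdPercent: 120, LowThresholdPercent: 80}, // high out of range
	}
	for _, p := range candidates {
		fmt.Printf("%+v -> %v\n", p, validate(p))
	}
}
```

Note that equal thresholds pass the checks, since only a low threshold strictly above the high one is rejected.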

func (im *realImageGCManager) Start() {
	go wait.Until(func() {
		// Initial detection make detected time "unknown" in the past.
		var ts time.Time
		if im.initialized {
			ts = time.Now()
		}
		_, err := im.detectImages(ts)
		if err != nil {
			klog.Warningf("[imageGCManager] Failed to monitor images: %v", err)
		} else {
			im.initialized = true
		}
	}, 5*time.Minute, wait.NeverStop)

	// Start a goroutine periodically updates image cache.
	go wait.Until(func() {
		images, err := im.runtime.ListImages()
		if err != nil {
			klog.Warningf("[imageGCManager] Failed to update image list: %v", err)
		} else {
			im.imageCache.set(images)
		}
	}, 30*time.Second, wait.NeverStop)

}

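Starting the ImageGCManager launches two goroutines. In the first, the Manager runs image detection every five minutes and marks itself as initialized once a detection succeeds. The second refreshes imageCache every 30 seconds with all images known to the local container runtime. What imageCache is for and how images are detected are covered later, in the main GC flow.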
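The core method of ImageGCManager is GarbageCollect. According to the source, it first obtains the usage of the filesystem that holds images, computes the usage percentage and the amount of space to free, and then starts freeing. If the space actually freed is less than the target, a FreeDiskSpaceFailed Warning event is recorded.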

func (im *realImageGCManager) GarbageCollect() error {
	// Get disk usage on disk holding images.
	fsStats, err := im.statsProvider.ImageFsStats()
	if err != nil {
		return err
	}

	var capacity, available int64
	if fsStats.CapacityBytes != nil {
		capacity = int64(*fsStats.CapacityBytes)
	}
	if fsStats.AvailableBytes != nil {
		available = int64(*fsStats.AvailableBytes)
	}

	if available > capacity {
		klog.Warningf("available %d is larger than capacity %d", available, capacity)
		available = capacity
	}

	// Check valid capacity.
	if capacity == 0 {
		err := goerrors.New("invalid capacity 0 on image filesystem")
		im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
		return err
	}

	// If over the max threshold, free enough to place us at the lower threshold.
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent >= im.policy.HighThresholdPercent {
		amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
		klog.Infof("[imageGCManager]: Disk usage on image filesystem is at %d%% which is over the high threshold (%d%%). Trying to free %d bytes down to the low threshold (%d%%).", usagePercent, im.policy.HighThresholdPercent, amountToFree, im.policy.LowThresholdPercent)
		freed, err := im.freeSpace(amountToFree, time.Now())
		if err != nil {
			return err
		}

		if freed < amountToFree {
			err := fmt.Errorf("failed to garbage collect required amount of images. Wanted to free %d bytes, but freed %d bytes", amountToFree, freed)
			im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
			return err
		}
	}

	return nil
}
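
The two formulas in GarbageCollect are easier to follow with numbers plugged in. The sketch below repeats the same integer arithmetic in two standalone functions; the function names and the 100 GiB disk with 10 GiB available are made-up values for illustration.

```go
package main

import "fmt"

// usagePercent repeats the integer arithmetic used to compute disk usage.
func usagePercent(capacity, available int64) int {
	return 100 - int(available*100/capacity)
}

// amountToFree returns how many bytes must be freed so that available space
// reaches (100 - lowThresholdPercent) percent of capacity.
func amountToFree(capacity, available int64, lowThresholdPercent int) int64 {
	return capacity*int64(100-lowThresholdPercent)/100 - available
}

func main() {
	const gib = int64(1) << 30
	capacity, available := 100*gib, 10*gib // hypothetical image filesystem

	fmt.Printf("usage: %d%%\n", usagePercent(capacity, available))
	fmt.Printf("to free: %d GiB\n", amountToFree(capacity, available, 80)/gib)
}
```

With these sample numbers usage is 90%, which is above a high threshold of 85, and reaching a low threshold of 80 requires freeing 10 GiB.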

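Before looking at how Image GC frees disk space, we need to understand how the Manager detects images, i.e. the detectImages method. First, sandboxImage is always treated as in use. Next, the images used by every Container of every Pod are added to the in-use set. Then, every image present on the machine is recorded in the Manager's detection records; an image seen for the first time gets the detection time of the current round as its first-detected time.
The Manager's detection records store, for each image, the time it was first detected, the time it was last used, and its size. While images are being added to the records, and regardless of whether an image is newly detected, any image in the in-use set has its last-used time refreshed to the current time. Finally, images that are no longer on the machine are removed from the records.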

func (im *realImageGCManager) detectImages(detectTime time.Time) (sets.String, error) {
	imagesInUse := sets.NewString()

	// Always consider the container runtime pod sandbox image in use
	imageRef, err := im.runtime.GetImageRef(container.ImageSpec{Image: im.sandboxImage})
	if err == nil && imageRef != "" {
		imagesInUse.Insert(imageRef)
	}

	images, err := im.runtime.ListImages()
	if err != nil {
		return imagesInUse, err
	}
	pods, err := im.runtime.GetPods(true)
	if err != nil {
		return imagesInUse, err
	}

	// Make a set of images in use by containers.
	for _, pod := range pods {
		for _, container := range pod.Containers {
			klog.V(5).Infof("Pod %s/%s, container %s uses image %s(%s)", pod.Namespace, pod.Name, container.Name, container.Image, container.ImageID)
			imagesInUse.Insert(container.ImageID)
		}
	}

	// Add new images and record those being used.
	now := time.Now()
	currentImages := sets.NewString()
	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()
	for _, image := range images {
		klog.V(5).Infof("Adding image ID %s to currentImages", image.ID)
		currentImages.Insert(image.ID)

		// New image, set it as detected now.
		if _, ok := im.imageRecords[image.ID]; !ok {
			klog.V(5).Infof("Image ID %s is new", image.ID)
			im.imageRecords[image.ID] = &imageRecord{
				firstDetected: detectTime,
			}
		}

		// Set last used time to now if the image is being used.
		if isImageUsed(image.ID, imagesInUse) {
			klog.V(5).Infof("Setting Image ID %s lastUsed to %v", image.ID, now)
			im.imageRecords[image.ID].lastUsed = now
		}

		klog.V(5).Infof("Image ID %s has size %d", image.ID, image.Size)
		im.imageRecords[image.ID].size = image.Size
	}

	// Remove old images from our records.
	for image := range im.imageRecords {
		if !currentImages.Has(image) {
			klog.V(5).Infof("Image ID %s is no longer present; removing from imageRecords", image)
			delete(im.imageRecords, image)
		}
	}

	return imagesInUse, nil
}

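As mentioned, every detection round updates the records, so what are the records for? Simply put, they guide the freeing of space. After detection completes, the resulting imagesInUse contains the sandbox image and the images used by the containers of each pod. Next, the cleanup candidates are chosen. They are stored in a structure called evictionInfo, which mainly holds the per-image detection record described above. Every image not in imagesInUse becomes a cleanup candidate. The candidates are then sorted by last-used time and then by first-detected time, so that, following LRU, the least recently used and earliest detected images come first.
After sorting, the candidates are traversed in order. An image whose last-used time is not earlier than the start time of this round (freeTime), that is, one that was just seen in use, is not deleted. An image whose time since first detection is less than the configured minimum GC age is not deleted either. Otherwise the image is removed; after removal it is deleted from the detection records and its size is added to spaceFreed. Once spaceFreed is no less than the target amount, the round ends.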

func (im *realImageGCManager) freeSpace(bytesToFree int64, freeTime time.Time) (int64, error) {
	imagesInUse, err := im.detectImages(freeTime)
	if err != nil {
		return 0, err
	}

	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()

	// Get all images in eviction order.
	images := make([]evictionInfo, 0, len(im.imageRecords))
	for image, record := range im.imageRecords {
		if isImageUsed(image, imagesInUse) {
			klog.V(5).Infof("Image ID %s is being used", image)
			continue
		}
		images = append(images, evictionInfo{
			id:          image,
			imageRecord: *record,
		})
	}
	sort.Sort(byLastUsedAndDetected(images))

	// Delete unused images until we've freed up enough space.
	var deletionErrors []error
	spaceFreed := int64(0)
	for _, image := range images {
		klog.V(5).Infof("Evaluating image ID %s for possible garbage collection", image.id)
		// Images that are currently in used were given a newer lastUsed.
		if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
			klog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
			continue
		}

		// Avoid garbage collect the image if the image is not old enough.
		// In such a case, the image may have just been pulled down, and will be used by a container right away.
		if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
			klog.V(5).Infof("Image ID %s has age %v which is less than the policy's minAge of %v, not eligible for garbage collection", image.id, freeTime.Sub(image.firstDetected), im.policy.MinAge)
			continue
		}

		// Remove image. Continue despite errors.
		klog.Infof("[imageGCManager]: Removing image %q to free %d bytes", image.id, image.size)
		err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})
		if err != nil {
			deletionErrors = append(deletionErrors, err)
			continue
		}
		delete(im.imageRecords, image.id)
		spaceFreed += image.size

		if spaceFreed >= bytesToFree {
			break
		}
	}

	if len(deletionErrors) > 0 {
		return spaceFreed, fmt.Errorf("wanted to free %d bytes, but freed %d bytes space with errors in image deletion: %v", bytesToFree, spaceFreed, errors.NewAggregate(deletionErrors))
	}
	return spaceFreed, nil
}

Summary

Overall, the kubelet deletes redundant images when the filesystem that holds images runs short of space. The key points of GC are:

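  • Cleanup is triggered when usage reaches HighThresholdPercent and continues until usage is back down to LowThresholdPercent.
  • During this process, three kinds of images are not removed:
    • images in use: the sandbox image and the images used by pod containers;
    • images first detected in the current GC round;
    • images first detected less than MinimumGCAge ago.
  • Cleanup removes the least recently used and earliest detected images first.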