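Kubelet Image GC: How It Works

What is Image GC?

Image GC is the kubelet's image cleanup feature. When disk space runs low, it removes images that are no longer needed, freeing disk space so that Pods can start and run normally.

How is Image GC used?

Image GC is enabled by default in the kubelet and is controlled through the ImageGCPolicy settings in the kubelet startup configuration. ImageGCPolicy has three parameters: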

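  • ImageGCHighThresholdPercent: the disk-usage percentage that triggers GC. GC runs once usage reaches or exceeds this value; setting it to 100 disables the periodic GC.

  • ImageGCLowThresholdPercent: the target for Image GC. Once triggered, GC tries to bring disk usage down to this value;

  • ImageMinimumGCAge: the minimum age for GC, measured from the time the image was first detected. Images younger than this are not collected.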
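
To make the three settings concrete, here is a minimal Go sketch. The gcPolicy type and its helper names are hypothetical illustrations, not kubelet types; the values 85, 80 and 2 minutes are the documented kubelet defaults for these settings.

```go
package main

import (
	"fmt"
	"time"
)

// gcPolicy is a hypothetical stand-in for the three image GC settings.
type gcPolicy struct {
	HighThresholdPercent int           // usage at or above this triggers GC
	LowThresholdPercent  int           // GC tries to bring usage down to this
	MinAge               time.Duration // images younger than this are kept
}

// defaultPolicy returns the documented kubelet default values.
func defaultPolicy() gcPolicy {
	return gcPolicy{HighThresholdPercent: 85, LowThresholdPercent: 80, MinAge: 2 * time.Minute}
}

// periodicGCDisabled reports whether the periodic image GC loop is switched off.
func (p gcPolicy) periodicGCDisabled() bool {
	return p.HighThresholdPercent == 100
}

func main() {
	p := defaultPolicy()
	fmt.Println(p.periodicGCDisabled()) // false
	p.HighThresholdPercent = 100
	fmt.Println(p.periodicGCDisabled()) // true
}
```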

Source code analysis

Initializing and starting Image GC

When the kubelet starts, Image GC is started right after BirthCry finishes.

func (kl *Kubelet) StartGarbageCollection() {
	loggedContainerGCFailure := false

	// container GC flow, omitted
	...

	// Setting ImageGCHighThresholdPercent to 100 disables image GC.
	if kl.kubeletConfiguration.ImageGCHighThresholdPercent == 100 {
		klog.V(2).Infof("ImageGCHighThresholdPercent is set 100, Disable image GC")
		return
	}

	prevImageGCFailed := false
	go wait.Until(func() {
		if err := kl.imageManager.GarbageCollect(); err != nil {
			if prevImageGCFailed {
				klog.Errorf("Image garbage collection failed multiple times in a row: %v", err)
				// Only create an event for repeated failures
				kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
			} else {
				klog.Errorf("Image garbage collection failed once. Stats initialization may not have completed yet: %v", err)
			}
			prevImageGCFailed = true
		} else {
			var vLevel klog.Level = 4
			if prevImageGCFailed {
				vLevel = 1
				prevImageGCFailed = false
			}

			klog.V(vLevel).Infof("Image garbage collection succeeded")
		}
	}, ImageGCPeriod, wait.NeverStop)
}

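As the code shows, Image GC runs in its own goroutine, with a default interval of five minutes. Every failure is logged, but only a repeated failure (one that follows a previous failure) records an ImageGCFailed event. This means you can tell whether GC is running normally by watching the logs or alerting on that event.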
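
The failure-reporting pattern can be isolated into a small state machine. This is only a sketch: the reporter type and the action strings are hypothetical names introduced for illustration, and the real loop writes klog lines and Kubernetes events rather than returning strings.

```go
package main

import (
	"errors"
	"fmt"
)

// errSample is a placeholder error used by the example.
var errSample = errors.New("gc failed")

// reporter is a hypothetical helper that mirrors the prevImageGCFailed flag.
type reporter struct {
	prevFailed bool
}

// report returns what the loop would do for one GC result:
// "log" for a first failure, "log+event" for a repeated failure,
// and "ok" for a success, which also resets the failure flag.
func (r *reporter) report(err error) string {
	if err != nil {
		action := "log"
		if r.prevFailed {
			action = "log+event"
		}
		r.prevFailed = true
		return action
	}
	r.prevFailed = false
	return "ok"
}

func main() {
	r := &reporter{}
	for _, err := range []error{errSample, errSample, nil, errSample} {
		fmt.Println(r.report(err))
	}
	// Prints: log, log+event, ok, log (one per line).
}
```

A single transient failure therefore never produces an event; only consecutive failures do.
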
Next, let's look at how ImageGCManager is implemented.

type ImageGCManager interface {
	// Applies the garbage collection policy. Errors include being unable to free
	// enough space as per the garbage collection policy.
	GarbageCollect() error

	// Start async garbage collection of images.
	Start()

	GetImageList() ([]container.Image, error)

	// Delete all unused images.
	DeleteUnusedImages() error
}

func NewImageGCManager(runtime container.Runtime, statsProvider StatsProvider, recorder record.EventRecorder, nodeRef *v1.ObjectReference, policy ImageGCPolicy, sandboxImage string) (ImageGCManager, error) {
	// Validate policy.
	if policy.HighThresholdPercent < 0 || policy.HighThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid HighThresholdPercent %d, must be in range [0-100]", policy.HighThresholdPercent)
	}
	if policy.LowThresholdPercent < 0 || policy.LowThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid LowThresholdPercent %d, must be in range [0-100]", policy.LowThresholdPercent)
	}
	if policy.LowThresholdPercent > policy.HighThresholdPercent {
		return nil, fmt.Errorf("LowThresholdPercent %d can not be higher than HighThresholdPercent %d", policy.LowThresholdPercent, policy.HighThresholdPercent)
	}
	im := &realImageGCManager{
		runtime:       runtime,
		policy:        policy,
		imageRecords:  make(map[string]*imageRecord),
		statsProvider: statsProvider,
		recorder:      recorder,
		nodeRef:       nodeRef,
		initialized:   false,
		sandboxImage:  sandboxImage,
	}

	return im, nil
}

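The ImageGCManager interface is simple, with only four methods:

  • GarbageCollect: performs the actual cleanup according to the configured ImageGCPolicy;

  • Start: asynchronously collects image information;

  • GetImageList: returns the cached image list;

  • DeleteUnusedImages: deletes unused images.

On construction, ImageGCManager validates the policy parameters and then stores the runtime, stats provider, event recorder and other dependencies it is given. Next, the logic of the Start method: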

func (im *realImageGCManager) Start() {
	go wait.Until(func() {
		// Initial detection make detected time "unknown" in the past.
		var ts time.Time
		if im.initialized {
			ts = time.Now()
		}
		_, err := im.detectImages(ts)
		if err != nil {
			klog.Warningf("[imageGCManager] Failed to monitor images: %v", err)
		} else {
			im.initialized = true
		}
	}, 5*time.Minute, wait.NeverStop)

	// Start a goroutine periodically updates image cache.
	go wait.Until(func() {
		images, err := im.runtime.ListImages()
		if err != nil {
			klog.Warningf("[imageGCManager] Failed to update image list: %v", err)
		} else {
			im.imageCache.set(images)
		}
	}, 30*time.Second, wait.NeverStop)
}

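The Start method of ImageGCManager launches two goroutines. In the first, the manager runs image detection every five minutes; once one detection succeeds, the manager is marked as initialized. The second fetches the full image list from the container runtime every 30 seconds and updates the cached image list.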
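
One consequence of the initialized flag is worth spelling out: before the first successful detection, Start passes the zero time.Time as the detection time, so images that already exist when the kubelet starts count as detected far in the past. The sketch below illustrates this under stated assumptions; detectionTime and oldEnough are hypothetical helpers, and oldEnough simply restates the minimum-age comparison used later during cleanup.

```go
package main

import (
	"fmt"
	"time"
)

// Sample values used by the example.
var sampleNow = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

const sampleMinAge = 2 * time.Minute

// detectionTime mirrors the choice made in Start: the zero time before the
// first successful detection, the current time afterwards.
func detectionTime(initialized bool, now time.Time) time.Time {
	if !initialized {
		return time.Time{}
	}
	return now
}

// oldEnough restates the minimum-age check: an image is eligible only if
// at least minAge has passed since it was first detected.
func oldEnough(firstDetected, now time.Time, minAge time.Duration) bool {
	return now.Sub(firstDetected) >= minAge
}

func main() {
	// Image first seen during the initial pass: treated as old.
	fmt.Println(oldEnough(detectionTime(false, sampleNow), sampleNow, sampleMinAge)) // true
	// Image first seen on a later pass: too young at this instant.
	fmt.Println(oldEnough(detectionTime(true, sampleNow), sampleNow, sampleMinAge)) // false
}
```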

Detecting and tracking images

So how does the manager detect images?

func (im *realImageGCManager) detectImages(detectTime time.Time) (sets.String, error) {
	imagesInUse := sets.NewString()

	// Always consider the container runtime pod sandbox image in use
	imageRef, err := im.runtime.GetImageRef(container.ImageSpec{Image: im.sandboxImage})
	if err == nil && imageRef != "" {
		imagesInUse.Insert(imageRef)
	}

	images, err := im.runtime.ListImages()
	if err != nil {
		return imagesInUse, err
	}
	pods, err := im.runtime.GetPods(true)
	if err != nil {
		return imagesInUse, err
	}

	// Make a set of images in use by containers.
	for _, pod := range pods {
		for _, container := range pod.Containers {
			klog.V(5).Infof("Pod %s/%s, container %s uses image %s(%s)", pod.Namespace, pod.Name, container.Name, container.Image, container.ImageID)
			imagesInUse.Insert(container.ImageID)
		}
	}

	// Add new images and record those being used.
	now := time.Now()
	currentImages := sets.NewString()
	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()
	for _, image := range images {
		klog.V(5).Infof("Adding image ID %s to currentImages", image.ID)
		currentImages.Insert(image.ID)

		// New image, set it as detected now.
		if _, ok := im.imageRecords[image.ID]; !ok {
			klog.V(5).Infof("Image ID %s is new", image.ID)
			im.imageRecords[image.ID] = &imageRecord{
				firstDetected: detectTime,
			}
		}

		// Set last used time to now if the image is being used.
		if isImageUsed(image.ID, imagesInUse) {
			klog.V(5).Infof("Setting Image ID %s lastUsed to %v", image.ID, now)
			im.imageRecords[image.ID].lastUsed = now
		}

		klog.V(5).Infof("Image ID %s has size %d", image.ID, image.Size)
		im.imageRecords[image.ID].size = image.Size
	}

	// Remove old images from our records.
	for image := range im.imageRecords {
		if !currentImages.Has(image) {
			klog.V(5).Infof("Image ID %s is no longer present; removing from imageRecords", image)
			delete(im.imageRecords, image)
		}
	}

	return imagesInUse, nil
}

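The purpose of detection is to find the images that are in use, so that GC does not remove them. Cleanup also relies on some per-image bookkeeping, and that bookkeeping is updated during detection as well.
First, the sandbox image is always treated as in use. Next, the image of every container in every Pod returned by the runtime is added to the in-use set. Note that GetPods is called with true, which asks the runtime for all containers, including exited ones, not only running ones. An image is therefore left unprotected only when no existing container references it, for example when a Pod needs the image but its container has not been created yet.
Once the in-use images (the ones that will not be cleaned up) are known, the image list fetched from the container runtime is merged into the image records maintained by the manager.
The manager iterates over the freshly fetched image list and, for each image:

  • if the image is being recorded for the first time, sets its first-detected time to the detection time of this pass;
  • if the image was judged to be in use in the previous step, refreshes its last-used time to the current time;
  • refreshes the recorded image size.

Finally, any image that is no longer in the list returned by the container runtime is removed from the manager's detection records.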
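
The bookkeeping steps above can be sketched without any kubelet dependencies. The record and image types and the update function below are simplified, hypothetical stand-ins for the real imageRecord handling; plain maps replace the sets package and no locking is shown.

```go
package main

import (
	"fmt"
	"time"
)

// sampleT0 is a fixed timestamp used by the example.
var sampleT0 = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// record is a simplified, hypothetical version of the per-image record.
type record struct {
	firstDetected time.Time
	lastUsed      time.Time
	size          int64
}

// image is a simplified view of what the runtime reports.
type image struct {
	id   string
	size int64
}

// update applies one detection pass to records.
func update(records map[string]*record, images []image, inUse map[string]bool, detectTime, now time.Time) {
	present := make(map[string]bool, len(images))
	for _, img := range images {
		present[img.id] = true
		// First sighting: remember when the image was detected.
		if _, ok := records[img.id]; !ok {
			records[img.id] = &record{firstDetected: detectTime}
		}
		// In use: refresh the last-used time.
		if inUse[img.id] {
			records[img.id].lastUsed = now
		}
		records[img.id].size = img.size
	}
	// Drop records for images the runtime no longer reports.
	for id := range records {
		if !present[id] {
			delete(records, id)
		}
	}
}

func main() {
	records := map[string]*record{"gone": {firstDetected: sampleT0}}
	images := []image{{"a", 10}, {"b", 20}}
	update(records, images, map[string]bool{"a": true}, sampleT0, sampleT0.Add(time.Second))
	fmt.Println(len(records), records["a"].lastUsed.IsZero(), records["b"].lastUsed.IsZero())
	// Prints: 2 false true
}
```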

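Running Image GC

The core method of ImageGCManager is GarbageCollect. Its main steps are as follows: fetch the usage statistics of the filesystem that holds the images, compute the usage percentage and, from the configured thresholds, the amount of space to free, then start freeing. If the space actually freed is less than the target, a FreeDiskSpaceFailed Warning event is recorded.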

func (im *realImageGCManager) GarbageCollect() error {
	// Get disk usage on disk holding images.
	fsStats, err := im.statsProvider.ImageFsStats()
	if err != nil {
		return err
	}

	var capacity, available int64
	if fsStats.CapacityBytes != nil {
		capacity = int64(*fsStats.CapacityBytes)
	}
	if fsStats.AvailableBytes != nil {
		available = int64(*fsStats.AvailableBytes)
	}

	if available > capacity {
		klog.Warningf("available %d is larger than capacity %d", available, capacity)
		available = capacity
	}

	// Check valid capacity.
	if capacity == 0 {
		err := goerrors.New("invalid capacity 0 on image filesystem")
		im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
		return err
	}

	// If over the max threshold, free enough to place us at the lower threshold.
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent >= im.policy.HighThresholdPercent {
		amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
		klog.Infof("[imageGCManager]: Disk usage on image filesystem is at %d%% which is over the high threshold (%d%%). Trying to free %d bytes down to the low threshold (%d%%).", usagePercent, im.policy.HighThresholdPercent, amountToFree, im.policy.LowThresholdPercent)
		freed, err := im.freeSpace(amountToFree, time.Now())
		if err != nil {
			return err
		}

		if freed < amountToFree {
			err := fmt.Errorf("failed to garbage collect required amount of images. Wanted to free %d bytes, but freed %d bytes", amountToFree, freed)
			im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
			return err
		}
	}

	return nil
}
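
The threshold arithmetic in GarbageCollect is easy to check by hand. The sketch below copies the two integer expressions into standalone functions (the function names are introduced here for illustration) and works one example: a 100 GiB image filesystem with 10 GiB available is 90% used, and with a low threshold of 80% the target is 20 GiB available, so 10 GiB must be freed.

```go
package main

import "fmt"

// usagePercent mirrors the integer arithmetic in GarbageCollect.
func usagePercent(capacity, available int64) int {
	return 100 - int(available*100/capacity)
}

// amountToFree returns how many bytes must be freed so that the available
// space reaches (100 - lowThresholdPercent) percent of capacity.
func amountToFree(capacity, available int64, lowThresholdPercent int) int64 {
	return capacity*int64(100-lowThresholdPercent)/100 - available
}

func main() {
	const gib = int64(1) << 30
	capacity, available := 100*gib, 10*gib

	fmt.Println(usagePercent(capacity, available))           // 90
	fmt.Println(amountToFree(capacity, available, 80) / gib) // 10
}
```

Because the division is integer division, usagePercent rounds in favour of reporting slightly higher usage when the available share is not a whole percent.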

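Once the amount to free is known, how are the images to delete chosen? Cleanup starts by running the detection pass described above. When it completes, imagesInUse contains the sandbox image plus the images used by the containers the runtime reported. Next, the cleanup candidates are selected. They are stored as evictionInfo values, one for every image record that is not in imagesInUse. These candidates are then sorted by last-used time and, for equal last-used times, by first-detected time. In other words they are put in LRU order, with the images used least recently and detected earliest at the front.
After sorting, the candidates are visited in order. If an image's last-used time is not earlier than the time GC was triggered (that is, its last-used time was just refreshed), it is not deleted. If the time since the image was first detected is shorter than the configured minimum GC age (that is, it was only just added to the records), it is not deleted either. Otherwise the image is removed, its record is deleted from the detection records, and its size is added to the running total spaceFreed. As soon as spaceFreed is no smaller than the target, this round of cleanup ends normally.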

func (im *realImageGCManager) freeSpace(bytesToFree int64, freeTime time.Time) (int64, error) {
	imagesInUse, err := im.detectImages(freeTime)
	if err != nil {
		return 0, err
	}

	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()

	// Get all images in eviction order.
	images := make([]evictionInfo, 0, len(im.imageRecords))
	for image, record := range im.imageRecords {
		if isImageUsed(image, imagesInUse) {
			klog.V(5).Infof("Image ID %s is being used", image)
			continue
		}
		images = append(images, evictionInfo{
			id:          image,
			imageRecord: *record,
		})
	}
	sort.Sort(byLastUsedAndDetected(images))

	// Delete unused images until we've freed up enough space.
	var deletionErrors []error
	spaceFreed := int64(0)
	for _, image := range images {
		klog.V(5).Infof("Evaluating image ID %s for possible garbage collection", image.id)
		// Images that are currently in used were given a newer lastUsed.
		if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
			klog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
			continue
		}

		// Avoid garbage collect the image if the image is not old enough.
		// In such a case, the image may have just been pulled down, and will be used by a container right away.
		if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
			klog.V(5).Infof("Image ID %s has age %v which is less than the policy's minAge of %v, not eligible for garbage collection", image.id, freeTime.Sub(image.firstDetected), im.policy.MinAge)
			continue
		}

		// Remove image. Continue despite errors.
		klog.Infof("[imageGCManager]: Removing image %q to free %d bytes", image.id, image.size)
		err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})
		if err != nil {
			deletionErrors = append(deletionErrors, err)
			continue
		}
		delete(im.imageRecords, image.id)
		spaceFreed += image.size

		if spaceFreed >= bytesToFree {
			break
		}
	}

	if len(deletionErrors) > 0 {
		return spaceFreed, fmt.Errorf("wanted to free %d bytes, but freed %d bytes space with errors in image deletion: %v", bytesToFree, spaceFreed, errors.NewAggregate(deletionErrors))
	}
	return spaceFreed, nil
}
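
The ordering and the two skip conditions in freeSpace can be exercised in isolation. This is a sketch under assumptions: candidate and pick are hypothetical names, nothing is actually removed, and the comparator (last-used time first, ties broken by first-detected time) is an assumed reading of byLastUsedAndDetected, whose body is not shown above.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Sample values used by the example.
var sampleFreeTime = time.Date(2024, 1, 3, 0, 0, 0, 0, time.UTC)

const sampleMinAge = 2 * time.Minute

// candidate is a hypothetical, simplified eviction candidate.
type candidate struct {
	id            string
	lastUsed      time.Time
	firstDetected time.Time
	size          int64
}

// sampleCandidates builds four candidates relative to now.
func sampleCandidates(now time.Time) []candidate {
	return []candidate{
		{"recent", now.Add(-time.Hour), now.Add(-48 * time.Hour), 50},
		{"old", now.Add(-24 * time.Hour), now.Add(-48 * time.Hour), 30},
		{"young", now.Add(-24 * time.Hour), now.Add(-time.Minute), 40},
		{"busy", now, now.Add(-48 * time.Hour), 60},
	}
}

// pick returns the IDs that would be removed, in order, to free bytesToFree.
func pick(cands []candidate, bytesToFree int64, freeTime time.Time, minAge time.Duration) []string {
	sorted := append([]candidate(nil), cands...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].lastUsed.Equal(sorted[j].lastUsed) {
			return sorted[i].firstDetected.Before(sorted[j].firstDetected)
		}
		return sorted[i].lastUsed.Before(sorted[j].lastUsed)
	})
	var picked []string
	var freed int64
	for _, c := range sorted {
		if !c.lastUsed.Before(freeTime) { // used at or after freeTime: skip
			continue
		}
		if freeTime.Sub(c.firstDetected) < minAge { // too young: skip
			continue
		}
		picked = append(picked, c.id)
		freed += c.size
		if freed >= bytesToFree {
			break
		}
	}
	return picked
}

func main() {
	fmt.Println(pick(sampleCandidates(sampleFreeTime), 70, sampleFreeTime, sampleMinAge))
	// Prints: [old recent]
}
```

In the example, "young" is skipped because it was detected one minute ago, "busy" is never reached because enough space is freed first, and it would be skipped anyway because its last-used time equals freeTime.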

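Disk eviction and Image GC

Besides the GC logic above, there is one more condition that triggers image cleanup. In practice you may occasionally find that images are removed even though ImageGCHighThresholdPercent is set to 100. Looking back at the ImageGC interface shown earlier, DeleteUnusedImages is an exported method, and it is referenced when the node reclaim functions for eviction signals are built: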

func buildSignalToNodeReclaimFuncs(imageGC ImageGC, containerGC ContainerGC, withImageFs bool) map[evictionapi.Signal]nodeReclaimFuncs {
	signalToReclaimFunc := map[evictionapi.Signal]nodeReclaimFuncs{}
	// usage of an imagefs is optional
	if withImageFs {
		// with an imagefs, nodefs pressure should just delete logs
		signalToReclaimFunc[evictionapi.SignalNodeFsAvailable] = nodeReclaimFuncs{}
		signalToReclaimFunc[evictionapi.SignalNodeFsInodesFree] = nodeReclaimFuncs{}
		// with an imagefs, imagefs pressure should delete unused images
		signalToReclaimFunc[evictionapi.SignalImageFsAvailable] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalImageFsInodesFree] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
	} else {
		// without an imagefs, nodefs pressure should delete logs, and unused images
		// since imagefs and nodefs share a common device, they share common reclaim functions
		signalToReclaimFunc[evictionapi.SignalNodeFsAvailable] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalNodeFsInodesFree] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalImageFsAvailable] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalImageFsInodesFree] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
	}
	return signalToReclaimFunc
}

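In fact, when a full disk triggers a node eviction signal, the container and image GC methods are called directly, since a node eviction signal is the more urgent condition.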
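
The shape of that mapping can be reproduced in a few lines. In the sketch below, reclaimFunc and buildReclaimFuncs are hypothetical simplifications: signals are plain strings using the kubelet's eviction signal names, and the reclaim functions only return a description instead of deleting anything. The point to notice is that nothing here consults HighThresholdPercent.

```go
package main

import "fmt"

// reclaimFunc is a hypothetical stand-in for a node-level reclaim function.
type reclaimFunc func() string

// buildReclaimFuncs maps eviction signal names to the reclaim functions that
// run for them, following the shape of buildSignalToNodeReclaimFuncs.
func buildReclaimFuncs(withImageFs bool) map[string][]reclaimFunc {
	deleteContainers := func() string { return "delete unused containers" }
	deleteImages := func() string { return "delete unused images" }
	both := []reclaimFunc{deleteContainers, deleteImages}

	m := map[string][]reclaimFunc{
		"imagefs.available":  both,
		"imagefs.inodesFree": both,
	}
	if withImageFs {
		// Dedicated imagefs: nodefs pressure does not touch images.
		m["nodefs.available"] = nil
		m["nodefs.inodesFree"] = nil
	} else {
		// Shared device: nodefs pressure also reclaims images.
		m["nodefs.available"] = both
		m["nodefs.inodesFree"] = both
	}
	return m
}

func main() {
	for _, withImageFs := range []bool{true, false} {
		funcs := buildReclaimFuncs(withImageFs)["nodefs.available"]
		fmt.Println(withImageFs, len(funcs))
	}
	// Prints "true 0" then "false 2".
}
```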

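Summary

Overall, the kubelet deletes redundant images either when a node eviction signal fires or when the filesystem that holds images is running short of space. The key points of the GC are:

  • Cleanup is triggered when usage reaches HighThresholdPercent and continues until usage is down to LowThresholdPercent. Note, however, that disabling GC by setting HighThresholdPercent to 100 has no effect on node eviction; it only turns off the periodic cleanup task.
  • During cleanup, three kinds of images are never removed:
    • the image required by the sandbox;
    • images that the detection pass at the start of GC found in use, whose last-used time was therefore just refreshed;
    • images whose time since first detection is less than MinimumGCAge.
  • Cleanup removes the least recently used and earliest detected images first.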