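Automatic Migration
Automatically migrate the virtual machines on a host based on the host's CPU usage or available memory.
Use cases
- When a host's CPU usage or memory crosses its threshold, automatically migrate some of the host's virtual machines to a specified set of hosts
- When a host's CPU usage or memory crosses its threshold, automatically migrate some of the host's virtual machines to the least-loaded host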
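Monitoring metrics
Every host reports the cpu.usage_active and mem.available metrics, described below:
- cpu.usage_active: overall CPU usage across all cores; the maximum is 100%, which means every core is busy
- mem.available: amount of available memory, in bytes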
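To make the scale of these two metrics concrete, here is a minimal Python sketch. The helper names (`host_usage_percent`, `busy_cores_to_reach`, `bytes_to_gib`) are hypothetical and are not part of climc or the monitor service. It assumes the only load on the host comes from fully busy cores and ignores any overhead from the host itself.

```python
import math


def host_usage_percent(busy_cores, host_cores):
    """cpu.usage_active if `busy_cores` of `host_cores` run at 100% (assumption: no other load)."""
    return 100.0 * busy_cores / host_cores


def busy_cores_to_reach(threshold_percent, host_cores):
    """Smallest number of fully busy cores that brings the host to the threshold."""
    return math.ceil(threshold_percent * host_cores / 100)


def bytes_to_gib(num_bytes):
    """Convert a mem.available reading (bytes) to GiB for easier reading."""
    return num_bytes / 1024 ** 3


print(host_usage_percent(24, 40))
print(busy_cores_to_reach(60, 40))
print(bytes_to_gib(8 * 1024 ** 3))
```

On a 40-core host, 24 fully busy cores correspond to a cpu.usage_active of 60%.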
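How it works
A monitoring alert is created for the host metric. When the alert fires for a host, the monitor service uses the difference between the metric's current value and the threshold to select virtual machines on that host to migrate.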
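The selection step can be pictured with a short Python sketch. This is an illustration only: the text above says that the gap between the current value and the threshold drives the selection, but the exact ordering used by the monitor service is not documented here. The largest-first greedy order and the name `pick_guests` are assumptions.

```python
def pick_guests(host_usage, threshold, guest_scores):
    """Pick guests whose combined score covers the gap above the threshold.

    guest_scores maps guest name -> share of host usage in percent.
    Assumption: guests are taken largest-first until the gap is covered.
    """
    gap = host_usage - threshold
    if gap <= 0:
        return []  # host is not over the threshold, nothing to migrate

    picked, shed = [], 0.0
    for name, score in sorted(guest_scores.items(), key=lambda kv: kv[1], reverse=True):
        if shed >= gap:
            break
        picked.append(name)
        shed += score
    return picked


# Host at 70% with a 60% threshold: 10 points of load need to move away.
print(pick_guests(70.0, 60.0, {"vm-a": 3.0, "vm-b": 8.0, "vm-c": 5.0}))
```

A different ordering, such as smallest-first to limit migration cost, would fit the same description; the sketch only shows how the gap bounds the amount of load to move.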
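Usage
CPU
The following example migrates virtual machines to other hosts when a host's CPU usage exceeds the threshold:
- Create a migration rule on cpu.usage_active named test-cpu that monitors the host test-66-onecloud02, triggers automatic migration when cpu.usage_active is greater than 60%, and checks every 2 minutes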
$ climc monitor-migrationalert-create \
--period 2m \
--source-host test-66-onecloud02 \
test-cpu cpu.usage_active.gt 60
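- Create a virtual machine for testing. Assuming the host has 40 CPU cores, the virtual machine needs 24 vCPUs (40 * 60%) to reach the threshold and trigger a migration. Then use the stress-ng load tool inside the virtual machine to drive all of its cores to 100%
# Create the virtual machine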
$ climc server-create --disk CentOS-7.6.1810-20190430.qcow2 \
--net your-net \
--mem-spec 1g \
--ncpu 24 \
--allow-delete \
--auto-start \
--prefer-host test-66-onecloud02 \
cpu-test-vm
# Log in to the virtual machine and stress the CPU with stress-ng
$ climc server-ssh cpu-test-vm
(cpu-test-vm)$ yum install -y stress-ng
(cpu-test-vm)$ stress-ng --cpu 24 --timeout 36000s
- After 2 minutes, check the migration records
# Optionally log in to influxdb first to inspect the host's current metrics
$ kubectl exec -ti -n onecloud $(kubectl get pods -n onecloud | grep default-influxdb | awk '{print $1}') -- influx -host 127.0.0.1 -port 30086 -type influxql -ssl -precision rfc3339 -unsafeSsl
Connected to https://127.0.0.1:30086 version 1.7.7
InfluxDB shell version: 1.7.7
> use telegraf
# climc host-list --search test-66-onecloud02 shows that the host_id is 6fc10297-eb20-4a96-86a8-4b65260d6016
# The query below shows that this host's cpu.usage_active over the past 2m is already above the 60% threshold
> select usage_active from cpu where host_id = '6fc10297-eb20-4a96-86a8-4b65260d6016' and time > now() - 2m GROUP BY "host_id"
name: cpu
tags: host_id=6fc10297-eb20-4a96-86a8-4b65260d6016
time usage_active
---- ------------
2022-06-28T03:55:00Z 62.90831581190119
2022-06-28T03:56:00Z 70.15669899594904
# List the migration alert rules
# The rule id is a7a92f4a-fed1-49bb-880b-59eae5185acc
$ climc monitor-migrationalert-list --scope system
+--------------------------------------+----------+------------------+
| id | name | metric_type |
+--------------------------------------+----------+------------------+
| a7a92f4a-fed1-49bb-880b-59eae5185acc | test-cpu | cpu.usage_active |
+--------------------------------------+----------+------------------+
# View the events recorded for the migration rule
$ climc monitor-migrationalert-event a7a92f4a-fed1-49bb-880b-59eae5185acc --scope system
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | ops_time | obj_id | obj_type | obj_name | user | tenant | action | notes |
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 428124 | 2022-06-28T07:40:02.000000Z | a7a92f4a-fed1-49bb-880b-59eae5185acc | migrationalert | test-cpu | monitoradmin | system | find_result_fail | find result to migrate: not found target for guest &balancer.cpuCandidate{guestResource:(*balancer.guestResource)(0xc001963c20), usageActive:99.46995000000001, guestCPUCount:24, hostCPUCount:40}: [host:test-69-onecloud01:current(55.313408) + guest:cpu-test-vm:score(59.681970) >= threshold(60.000000), host:a15:current(62.305391) + guest:cpu-test-vm:score(59.681970) >= threshold(60.000000)] |
| 427991 | 2022-06-28T03:56:57.000000Z | a7a92f4a-fed1-49bb-880b-59eae5185acc | migrationalert | test-cpu | sysadmin | system | create | {"id":"a7a92f4a-fed1-49bb-880b-59eae5185acc","name":"test-cpu","res_name":"migrationalert"} |
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
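# There is a find_result_fail record: the alert fired, but no suitable target host was found for the migration
# The reason is that the other two hosts in the cluster are too loaded: test-69-onecloud01 is currently at 55.313408% and a15 is at 62.305391%
# Moving the 59.681970% CPU load of cpu-test-vm to either of them would put that host over the threshold, so the migration fails
# If the load on the cluster nodes is reduced, or a new host is added, the heavily loaded virtual machine is expected to migrate there. A successful migration record is shown below
# Suppose a new rule is created with climc monitor-migrationalert-create for the host a15, triggering above 60, with id afc9468c-2cd7-4be8-83c7-92d7535a53cf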
$ climc monitor-migrationalert-event afc9468c-2cd7-4be8-83c7-92d7535a53cf --scope system
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | ops_time | obj_id | obj_type | obj_name | user | tenant | action | notes |
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 428147 | 2022-06-28T07:56:01.000000Z | afc9468c-2cd7-4be8-83c7-92d7535a53cf | migrationalert | test-cpu | monitoradmin | system | migrating | {"guest":{"host":"a15","host_id":"733b10fa-bd33-4503-836d-2ccd225bf12f","id":"a3107d1f-c46e-43cf-8aa8-55743d1533b1","name":"aisenzhe","score":10.769393333333335,"vcpu_count":8,"vmem_size":8192},"target_host":{"id":"6fc10297-eb20-4a96-86a8-4b65260d6016","name":"test-66-onecloud02","score":22.65071999099409}} |
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# The record above shows that the virtual machine aisenzhe on a15 (score 10.77%) is being migrated to the target host test-66-onecloud02 (cpu.usage_active 22.65%)
# The virtual machine's status shows that it is migrating
$ climc server-list --search aisenzhe
+--------------------------------------+----------+--------------+-----------+------------+-----------+-----------+-----------------------------+------------+---------+-----------+
| ID | Name | Billing_type | Status | vcpu_count | vmem_size | Secgrp_id | Created_at | Hypervisor | os_type | is_system |
+--------------------------------------+----------+--------------+-----------+------------+-----------+-----------+-----------------------------+------------+---------+-----------+
| a3107d1f-c46e-43cf-8aa8-55743d1533b1 | aisenzhe | postpaid | migrating | 8 | 8192 | default | 2022-01-06T08:34:45.000000Z | kvm | Linux | false |
+--------------------------------------+----------+--------------+-----------+------------+-----------+-----------+-----------------------------+------------+---------+-----------+
# A live migration takes some time, depending on the virtual machine's memory and disk size. Once the migration finishes, a success record is logged
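The numbers in the two event records above can be reproduced with a few lines of Python. This is a sketch inferred from the logged values, not the service's actual code. The guest score in the failed record matches usageActive * guestCPUCount / hostCPUCount, and a target is rejected when its current value plus the guest score reaches the threshold. The function names `guest_score` and `target_fits` are hypothetical.

```python
def guest_score(usage_active, guest_cpu_count, host_cpu_count):
    """Guest's share of host cpu.usage_active, inferred from the event log values."""
    return usage_active * guest_cpu_count / host_cpu_count


def target_fits(target_current, score, threshold):
    """A target is usable only if it stays below the threshold after receiving the guest."""
    return target_current + score < threshold


# cpu-test-vm: 24 vCPUs at 99.47% on a 40-core host
score = guest_score(99.46995000000001, 24, 40)
print(round(score, 6))

# Failed case: both candidate hosts would end up at or above 60%
print(target_fits(55.313408, score, 60.0))   # test-69-onecloud01
print(target_fits(62.305391, score, 60.0))   # a15

# Successful case: aisenzhe (score 10.77) moving to test-66-onecloud02 (22.65)
print(target_fits(22.65071999099409, 10.769393333333335, 60.0))
```

Under this reading, a guest that alone accounts for almost 60% of a host can only move to a nearly idle target, which is why the first rule reported find_result_fail.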
Other operations
# Balance the CPU load across all hosts in the cluster by omitting the --source-host parameter
$ climc monitor-migrationalert-create --period 5m all-host-cpu cpu.usage_active.gt 80
# Specify target hosts: when a host's cpu.usage_active is greater than 80, virtual machines can only be migrated to host1 and host2
$ climc monitor-migrationalert-create --period 5m --target-host host1 --target-host host2 target-host-cpu cpu.usage_active.gt 80
# Specify the source hosts to monitor: only src-host1 and src-host2 are watched
$ climc monitor-migrationalert-create --period 5m --source-host src-host1 --source-host src-host2 src-host-cpu cpu.usage_active.gt 80
# Specify the source virtual machines: when a host's cpu.usage_active is greater than 80, only gst1 and gst2 can be migrated
$ climc monitor-migrationalert-create --period 5m --source-guest gst1 --source-guest gst2 host-gst-cpu cpu.usage_active.gt 80
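Notes
This feature is currently an alpha version and may be unstable. Use it for testing only.
In addition, inaccurate migration checks could cause virtual machines to bounce back and forth between hosts and eventually produce an avalanche effect. To prevent this, only one alert-triggered migration runs at any given moment, and it migrates one batch of machines at a time.
If another migrationalert rule fires during that time, it abandons that migration attempt and must wait until no other migrationalert-triggered migration is running anywhere before it starts its own.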