Active/standby failover with keepalived in an AWS VPC network

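Nothing in the cloud strictly requires keepalived; it survives mostly because some traditional enterprises cannot shake their on-premises mindset. That said, an ELB costs money, and saving money is a real factor too, especially when the workload is not that important but still has to be HA.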

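However, AWS VPC does not support VRRP multicast, which is why many commercial firewall/LB vendors that run on-premises (F5, Radware, A10, Sangfor, and so on) had to modify their products when moving to the cloud. There is no official documentation stating whether AWS VPC does or does not support this. The official blog posts can, to some extent, be read as "you can build this yourself, but we will not provide official technical support for it", yet some people still do not get that.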

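Of course, the vagueness of the official blog is a problem too. SAs write plenty of posts about things that fall outside official support yet still run on the AWS platform, but they do not spell things out clearly. If you follow a post step by step without understanding it, you will fail.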


First, create two EC2 instances running Amazon Linux 2.

Install the required packages:
sudo yum install keepalived jq -y

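In the web console, set IMDSv2 to Optional. The switchover script below reads the instance ID from instance metadata with a plain GET and no session token, which only works when IMDSv2 is optional.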
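
If relaxing IMDSv2 is not acceptable in your account, the metadata lookup can use a session token instead. The following is a minimal sketch, not what the script in this post does: the function name `get_instance_id` and the 60-second token TTL are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: read the instance ID via IMDSv2 (session token) instead of a plain GET.
IMDS=http://169.254.169.254/latest

get_instance_id() {
    local token
    # Step 1: request a short-lived session token (TTL in seconds, assumed value).
    token=$(curl -s -X PUT "$IMDS/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
    # Step 2: present the token when reading the metadata path.
    curl -s -H "X-aws-ec2-metadata-token: $token" \
        "$IMDS/meta-data/instance-id"
}

# Hypothetical usage inside i-am-master.sh:
#   inst=$(get_instance_id)
```

With a lookup like this, the IMDSv2 setting could stay at Required.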

======

ec2 10.4.133.196 is node-1 MASTER

keepalived.conf

global_defs {
    router_id node-1
}

vrrp_script health_check {
    script /etc/keepalived/health-check.sh
    interval 2
}

vrrp_instance VI_1 {
    state MASTER
    debug 2
    interface eth0
    virtual_router_id 10
    priority 100
    advert_int 1
    unicast_peer {
        10.4.129.250
    }

    track_script {
        health_check
    }

    notify_master /etc/keepalived/i-am-master.sh
}

======

ec2 10.4.129.250 is node-2 SLAVE

keepalived.conf

global_defs {
    router_id node-2
}

vrrp_script health_check {
    script /etc/keepalived/health-check.sh
    interval 2
}

vrrp_instance VI_1 {
    state BACKUP
    debug 2
    interface eth0
    virtual_router_id 10
    priority 50
    advert_int 1
    unicast_peer {
        10.4.133.196
    }

    track_script {
        health_check
    }

    notify_master /etc/keepalived/i-am-master.sh
}
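
The two configs above say nothing about security groups. Unicast VRRP adverts travel as IP protocol 112, so they only reach the peer if the security group allows that protocol between the two nodes. If your groups do not already permit it (for example through an all-traffic rule), a rule along these lines may be needed. This is a sketch that assumes both nodes share one security group; the function name `allow_vrrp` and the group ID are hypothetical.

```shell
#!/bin/bash
# Sketch: allow VRRP (IP protocol 112) between members of one security group.
# Assumption: both keepalived nodes are in the same security group.
allow_vrrp() {
    local sg="$1"
    aws ec2 authorize-security-group-ingress \
        --group-id "$sg" \
        --ip-permissions "IpProtocol=112,UserIdGroupPairs=[{GroupId=$sg}]"
}

# Hypothetical usage (placeholder group ID):
#   allow_vrrp sg-0123456789abcdef0
# To watch the adverts arrive on a node:
#   sudo tcpdump -ni eth0 'ip proto 112'
```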

In the web console, create a new ENI (eni-032a8bec5486841fd) and assign it the static IP 10.4.128.9.
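
The same step can be scripted. The sketch below is an assumption-laden CLI equivalent, not something this post ran: the function name `create_vip_eni`, the subnet ID, and the security group ID are placeholders. Keep in mind that an ENI can only be attached to instances in its own Availability Zone, so the subnet must be in the zone where both nodes live. The call also needs broader permissions than the instance role defined later, so it would be run from an admin workstation.

```shell
#!/bin/bash
# Sketch: create the floating ENI from the CLI and print its ID.
# All IDs passed in are placeholders supplied by the caller.
create_vip_eni() {
    local subnet="$1" ip="$2" sg="$3"
    aws ec2 create-network-interface \
        --subnet-id "$subnet" \
        --private-ip-address "$ip" \
        --groups "$sg" \
        --description "keepalived floating ENI" \
        --query 'NetworkInterface.NetworkInterfaceId' \
        --output text
}

# Hypothetical usage:
#   create_vip_eni subnet-0123456789abcdef0 10.4.128.9 sg-0123456789abcdef0
```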

Add a health-check script, /etc/keepalived/health-check.sh:

------
#!/bin/bash
exit 0  # Always healthy
------
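
The script above always reports healthy, so failover only happens when the node or keepalived itself goes away. keepalived treats exit status 0 from a `vrrp_script` as healthy and any other status as failed, so a check tied to a real service could look like the sketch below. The function name `service_healthy` and the service name `nginx` are placeholders, not part of this setup.

```shell
#!/bin/bash
# Sketch of a health check that reflects a real service.
# Exit status 0 means healthy; anything else means failed.
service_healthy() {
    local svc="$1"
    # is-active --quiet prints nothing and returns 0 only if the unit is active.
    systemctl is-active --quiet "$svc"
}

# Hypothetical last line of /etc/keepalived/health-check.sh:
#   service_healthy nginx
```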

The ENI attach/detach script that keepalived calls on switchover, /etc/keepalived/i-am-master.sh:

------
#!/bin/bash
ENI=eni-032a8bec5486841fd
METADATA=http://169.254.169.254/latest/meta-data/instance-id
export AWS_DEFAULT_REGION=ap-northeast-3

# get ENI attachment information
attach=$(aws ec2 describe-network-interface-attribute --network-interface-id $ENI --attribute attachment --output json)

# check if ENI has already been attached to this instance
inst=$(curl -qs $METADATA)

if echo "$attach" | jq -e '.Attachment' >/dev/null 2>&1; then
    attachedInst=$(echo "$attach" | jq -r ".Attachment.InstanceId")

    if [ "$inst" = "$attachedInst" ]; then
        exit 0
    fi

    # get attachment ID and detach it
    id=$(echo "$attach" | jq -r ".Attachment.AttachmentId")

    if [[ $id == eni-attach-* ]]; then
        aws ec2 detach-network-interface --attachment-id $id

        # Wait for detachment to complete
        echo "Waiting for ENI detachment..."
        sleep 10

        # Wait until ENI is actually detached
        while aws ec2 describe-network-interfaces --network-interface-ids $ENI --query 'NetworkInterfaces[0].Status' --output text | grep -q "in-use"; do
            echo "Still detaching..."
            sleep 2
        done
    fi
fi

# attach to this instance
aws ec2 attach-network-interface --network-interface-id $ENI --instance-id $inst --device-index 1
------
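
The script above sleeps for a fixed 10 seconds and then polls with no upper bound, so a detach that never completes would keep it looping forever. A bounded wait is one way to tighten that up. This is a sketch only; the function name `wait_for_eni_available` and the default of 30 tries at 2-second intervals are assumptions.

```shell
#!/bin/bash
# Sketch: poll the ENI status until it is "available", giving up after max_tries.
wait_for_eni_available() {
    local eni="$1" max_tries="${2:-30}" delay="${3:-2}"
    local i=0 status
    while [ "$i" -lt "$max_tries" ]; do
        status=$(aws ec2 describe-network-interfaces \
            --network-interface-ids "$eni" \
            --query 'NetworkInterfaces[0].Status' \
            --output text)
        if [ "$status" = "available" ]; then
            return 0    # detached and ready to be attached elsewhere
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1            # still not available after max_tries polls
}

# Hypothetical usage in place of the sleep-and-loop section:
#   wait_for_eni_available "$ENI" || exit 1
```

Starting to poll right away instead of sleeping first may also shave a few seconds off the switchover, though that has not been measured here.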

Create an IAM role (called keepalived-eni here), add the trusted entities and permissions policies below, and attach it to both test instances:

------Trusted entities
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
------


------Permissions policies
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeNetworkInterfaceAttribute", 
                "ec2:AttachNetworkInterface",
                "ec2:DetachNetworkInterface"
            ],
            "Resource": "*"
        }
    ]
}
------
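
For anyone who prefers the CLI to the console, the role could be created roughly as follows. This is a sketch under assumptions: the two JSON documents above are saved as local files whose names you pass in, the inline policy name is made up, and the function name `create_keepalived_role` is hypothetical.

```shell
#!/bin/bash
# Sketch: create the role, attach the inline policy, and wrap it in an
# instance profile of the same name. Stops at the first failing step.
create_keepalived_role() {
    local role="$1" trust="$2" perms="$3"
    aws iam create-role --role-name "$role" \
        --assume-role-policy-document "file://$trust" &&
    aws iam put-role-policy --role-name "$role" \
        --policy-name "${role}-policy" \
        --policy-document "file://$perms" &&
    aws iam create-instance-profile --instance-profile-name "$role" &&
    aws iam add-role-to-instance-profile \
        --instance-profile-name "$role" --role-name "$role"
}

# Hypothetical usage (file names and instance ID are placeholders):
#   create_keepalived_role keepalived-eni trust.json perms.json
#   aws ec2 associate-iam-instance-profile --instance-id i-0123456789abcdef0 \
#       --iam-instance-profile Name=keepalived-eni
```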

Verify from inside the EC2 instance; here we can see that the role is picked up:

curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
Keepalived-ENI
[root@ip-10-4-133-196 keepalived]# aws sts get-caller-identity
{
    "Account": "123454567890", 
    "UserId": "AGHTAQ3DUCTSWKIOWBGTI:i-0a5aacedc82d1f580", 
    "Arn": "arn:aws:sts::123454567890:assumed-role/Keepalived-ENI/i-0a5aacedc82d1f580"
}

=======Start testing!

Attach the ENI to keepalived-a, then run:
systemctl enable keepalived
systemctl start keepalived
systemctl status keepalived

systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2025-11-22 08:43:32 UTC; 13s ago
  Process: 2593 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 2594 (keepalived)


Log in to keepalived-b, then run:
systemctl enable keepalived
systemctl start keepalived
systemctl status keepalived

● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2025-11-22 08:44:45 UTC; 83ms ago
  Process: 2370 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 2371 (keepalived)

======check slave======
sudo journalctl -u keepalived -f

Nov 22 08:43:37 ip-10-4-129-250.ap-northeast-3.compute.internal Keepalived_vrrp[2597]: Opening script file /etc/keepalived/i-am-master.sh
Nov 22 08:47:13 ip-10-4-129-250.ap-northeast-3.compute.internal Keepalived_vrrp[2597]: VRRP_Instance(VI_1) Received advert with higher priority 100, ours 50
Nov 22 08:47:13 ip-10-4-129-250.ap-northeast-3.compute.internal Keepalived_vrrp[2597]: VRRP_Instance(VI_1) Entering BACKUP STATE

ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.129.250  netmask 255.255.240.0  broadcast 10.4.143.255
        inet6 fe80::ce5:7fff:fedd:e065  prefixlen 64  scopeid 0x20<link>
        ether 0e:e5:7f:dd:e0:65  txqueuelen 1000  (Ethernet)
        RX packets 2000  bytes 3503391 (3.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1653  bytes 213478 (208.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

======check master======
Nov 22 08:47:12 ip-10-4-133-196.ap-northeast-3.compute.internal Keepalived_healthcheckers[2195]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 22 08:47:13 ip-10-4-133-196.ap-northeast-3.compute.internal Keepalived_vrrp[2196]: VRRP_Instance(VI_1) Transition to MASTER STATE
Nov 22 08:47:14 ip-10-4-133-196.ap-northeast-3.compute.internal Keepalived_vrrp[2196]: VRRP_Instance(VI_1) Entering MASTER STATE
Nov 22 08:47:14 ip-10-4-133-196.ap-northeast-3.compute.internal Keepalived_vrrp[2196]: Opening script file /etc/keepalived/i-am-master.sh

ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.133.196  netmask 255.255.240.0  broadcast 10.4.143.255
        inet6 fe80::c10:b2ff:febb:7d53  prefixlen 64  scopeid 0x20<link>
        ether 0e:10:b2:bb:7d:53  txqueuelen 1000  (Ethernet)
        RX packets 1294  bytes 3447754 (3.2 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1031  bytes 141789 (138.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.128.9  netmask 255.255.240.0  broadcast 10.4.143.255
        inet6 fe80::c61:7eff:fe76:2c57  prefixlen 64  scopeid 0x20<link>
        ether 0e:61:7e:76:2c:57  txqueuelen 1000  (Ethernet)
        RX packets 81  bytes 7072 (6.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 90  bytes 9050 (8.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

======failover======
reboot 10.4.133.196

Nov 22 09:10:52 ip-10-4-129-250.ap-northeast-3.compute.internal Keepalived_vrrp[3334]: VRRP_Instance(VI_1) Entering BACKUP STATE
Nov 22 09:14:20 ip-10-4-129-250.ap-northeast-3.compute.internal Keepalived_vrrp[3334]: VRRP_Instance(VI_1) Transition to MASTER STATE
Nov 22 09:14:21 ip-10-4-129-250.ap-northeast-3.compute.internal Keepalived_vrrp[3334]: VRRP_Instance(VI_1) Entering MASTER STATE
Nov 22 09:14:21 ip-10-4-129-250.ap-northeast-3.compute.internal Keepalived_vrrp[3334]: Opening script file /etc/keepalived/i-am-master.sh

The ENI switched to 10.4.129.250:

ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.129.250  netmask 255.255.240.0  broadcast 10.4.143.255
        inet6 fe80::ce5:7fff:fedd:e065  prefixlen 64  scopeid 0x20<link>
        ether 0e:e5:7f:dd:e0:65  txqueuelen 1000  (Ethernet)
        RX packets 5150  bytes 3704335 (3.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2458  bytes 391996 (382.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.128.9  netmask 255.255.240.0  broadcast 10.4.143.255
        inet6 fe80::c61:7eff:fe76:2c57  prefixlen 64  scopeid 0x20<link>
        ether 0e:61:7e:76:2c:57  txqueuelen 1000  (Ethernet)
        RX packets 12  bytes 1444 (1.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 18  bytes 2177 (2.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


When 10.4.133.196 came back online, the ENI switched back:

Nov 22 09:15:08 ip-10-4-133-196.ap-northeast-3.compute.internal systemd[1]: Started LVS and VRRP High Availability Monitor.
Nov 22 09:15:08 ip-10-4-133-196.ap-northeast-3.compute.internal Keepalived_healthcheckers[2438]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 22 09:15:08 ip-10-4-133-196.ap-northeast-3.compute.internal Keepalived_vrrp[2439]: VRRP_Instance(VI_1) Transition to MASTER STATE
Nov 22 09:15:09 ip-10-4-133-196.ap-northeast-3.compute.internal Keepalived_vrrp[2439]: VRRP_Instance(VI_1) Entering MASTER STATE
Nov 22 09:15:09 ip-10-4-133-196.ap-northeast-3.compute.internal Keepalived_vrrp[2439]: Opening script file /etc/keepalived/i-am-master.sh


ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.133.196  netmask 255.255.240.0  broadcast 10.4.143.255
        inet6 fe80::c10:b2ff:febb:7d53  prefixlen 64  scopeid 0x20<link>
        ether 0e:10:b2:bb:7d:53  txqueuelen 1000  (Ethernet)
        RX packets 1589  bytes 3497704 (3.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1264  bytes 170077 (166.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.128.9  netmask 255.255.240.0  broadcast 10.4.143.255
        inet6 fe80::c61:7eff:fe76:2c57  prefixlen 64  scopeid 0x20<link>
        ether 0e:61:7e:76:2c:57  txqueuelen 1000  (Ethernet)
        RX packets 81  bytes 7072 (6.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 90  bytes 9050 (8.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0



======ping
Sat Nov 22 17:25:04 TST 2025: 64 bytes from 10.4.128.9: icmp_seq=263 ttl=253 time=49.366 ms
Sat Nov 22 17:25:05 TST 2025: 64 bytes from 10.4.128.9: icmp_seq=264 ttl=253 time=52.408 ms
Sat Nov 22 17:25:07 TST 2025: Request timeout for icmp_seq 265
Sat Nov 22 17:25:08 TST 2025: Request timeout for icmp_seq 266
Sat Nov 22 17:25:09 TST 2025: Request timeout for icmp_seq 267
Sat Nov 22 17:25:10 TST 2025: Request timeout for icmp_seq 268
Sat Nov 22 17:25:11 TST 2025: Request timeout for icmp_seq 269
Sat Nov 22 17:25:12 TST 2025: Request timeout for icmp_seq 270
Sat Nov 22 17:25:13 TST 2025: Request timeout for icmp_seq 271
Sat Nov 22 17:25:14 TST 2025: Request timeout for icmp_seq 272
Sat Nov 22 17:25:15 TST 2025: Request timeout for icmp_seq 273
Sat Nov 22 17:25:16 TST 2025: Request timeout for icmp_seq 274
Sat Nov 22 17:25:17 TST 2025: Request timeout for icmp_seq 275
Sat Nov 22 17:25:18 TST 2025: Request timeout for icmp_seq 276
Sat Nov 22 17:25:19 TST 2025: Request timeout for icmp_seq 277
Sat Nov 22 17:25:20 TST 2025: Request timeout for icmp_seq 278
Sat Nov 22 17:25:21 TST 2025: Request timeout for icmp_seq 279
Sat Nov 22 17:25:22 TST 2025: Request timeout for icmp_seq 280
Sat Nov 22 17:25:23 TST 2025: Request timeout for icmp_seq 281
Sat Nov 22 17:25:24 TST 2025: Request timeout for icmp_seq 282
Sat Nov 22 17:25:24 TST 2025: 64 bytes from 10.4.128.9: icmp_seq=283 ttl=253 time=47.442 ms
Sat Nov 22 17:25:25 TST 2025: 64 bytes from 10.4.128.9: icmp_seq=284 ttl=253 time=50.554 ms
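
In the ping log above, icmp_seq 265 through 282 timed out: 18 consecutive lost probes between 17:25:07 and 17:25:24, so the address was unreachable for roughly 18 seconds during the switchover. To pull that number out of a longer capture, a small helper like the following could be used. It is a sketch: the function name `longest_outage` is made up, and it assumes the log uses the same "Request timeout" wording as the output above.

```shell
#!/bin/bash
# Sketch: print the longest run of consecutive "Request timeout" lines
# read from standard input.
longest_outage() {
    awk '
        /Request timeout/ { run++; if (run > max) max = run; next }
        { run = 0 }                 # any other line ends the current run
        END { print max + 0 }       # prints 0 if there were no timeouts
    '
}

# Hypothetical usage, assuming the ping output was saved to ping.log:
#   longest_outage < ping.log
# One assumed way to produce a timestamped log like the one above:
#   ping 10.4.128.9 | while read -r line; do echo "$(date): $line"; done
```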

Reference: https://aws.amazon.com/cn/blogs/china/routing-redundancy-solution-in-aws-vpc-network/