This post is installment #5 in a series of posts providing directions on installing and using Cilium for load balancing and SSL processing. Links to all of the posts in the series are provided below for convenience.
Cilium and Kubernetes - Caveats / Concepts
Cilium and Kubernetes - Installing Cilium Within Kubernetes
Cilium and Kubernetes - Configuring SSL and Load Balancing
Cilium and Kubernetes - Externally Accessing Services via ARP
Cilium and Kubernetes - Externally Accessing Services via BGP
This installment focuses on the configuration elements that make virtual IP addresses assigned for load balancing within a Kubernetes cluster reachable to clients outside the cluster using BGP. The processes for allocating IP addresses for auto-assignment and tagging Service objects with attributes to trigger BGP advertisements from within the cluster to BGP routers outside the cluster are described, along with workarounds required for some aspects of FRR behavior when using it as a BGP router on Linux operating systems.
With the configuration elements built up to now, the pods are live with private IPs assigned within the cluster and have been mapped to a Service of type LoadBalancer, which obtained an IP from a reserved pool of IPs. A layer 7 Gateway has also been defined with hostname and SSL key information to terminate incoming SSL requests and forward them through an HTTPRoute to the LoadBalancer and on to the pods.
However, the VIP address itself is not yet reachable from outside the cluster nodes. In order to expand reachability of services beyond the cluster via BGP, the following tasks must be performed:
- External BGP routers must be configured to talk to BGP router processes running within the cluster to propagate internal host routes for assigned virtual IPs to attract traffic to those IP addresses from outside the Kubernetes cluster.
- The BGP router instances within the Kubernetes cluster must be configured via a CiliumBGPClusterConfig object and by adding a label to each node desired to act as a BGP router within the cluster.
- The Cilium deployment within Kubernetes must be updated to enable its BGP control plane to integrate into each Kubernetes host IP stack.
- The Service object defining the unencrypted LoadBalancer VIP for the pod Deployment must be altered to specify the label that triggers BGP advertisements.
- The Gateway object defining the SSL processing for incoming traffic must be altered to specify that same label triggering BGP advertisement.
- After the Gateway object is activated, the auto-generated Service object tied to the Gateway must be MANUALLY tagged with the same label of mdhlabs-bgp: enable.
If the BGP approach is selected to make virtual IPs assigned to services within Kubernetes externally reachable, a diagram can be assembled to reflect all of the configuration parameters needed in both the internal Cilium configuration elements inside Kubernetes and the external BGP routers. The first key point to make is that it is not strictly necessary to use a physical router that supports BGP for this function. A Linux host running FRR (FRRouting, originally short for Free Range Routing) can provide basic BGP capabilities. In this example, FRR will run within two Linux virtual machines simulating two different physical hosts.
The diagram below illustrates the overall topology and the IP address space that will be advertised via BGP both between the two non-Kubernetes router nodes and for ranges advertised from within Kubernetes.
The diagram merits a few summary points of clarification:
- The Kubernetes cluster will act as its own Autonomous System using AS=65432 and each node will run a process that acts like a BGP router
- The external network will be treated as a distinct Autonomous System using AS=62112 and two standalone hosts fedora1 and fedora2 will use frr to implement BGP routers for that AS
- The 192.168.77.0/24 range of IP space to be used for load balancer virtual IPs within Kubernetes is NOT directly reachable to regular hosts on the 192.168.99.0/24 subnet. Those 192.168.77.x IPs will ONLY be reachable after an assignment triggers the Kubernetes BGP routers to originate a /32 advertisement for that IP which will then be advertised to the AS=62112 routers outside the cluster.
- Besides sharing those /32 host routes for VIPs between each other via BGP, the fedora1 and fedora2 hosts will incorporate those routes at their Linux OS layer, making the VIPs reachable for command line tests from those hosts.
- In this LAN, the router at 192.168.99.1 is normally the gateway IP configured for all of these hosts.
- If the 192.168.99.1 gateway router supports BGP or OSPF, it is possible to redistribute the /32 routes to the 192.168.99.1 router so all hosts on the 192.168.99.0/24 subnet can reach the VIPs.
- If the 192.168.99.1 gateway router CANNOT support altering its routing configuration, other hosts can be reconfigured to use fedora1 (192.168.99.10) or fedora2 (192.168.99.11) as their gateway IP on their Ethernet configuration.
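The reachability point made in the list above can be illustrated with a short sketch. It uses only Python's standard ipaddress module and the addresses from this lab; the helper names needs_learned_route and host_route are hypothetical and exist only for this illustration.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.99.0/24")       # subnet shared by all hosts
VIP_POOL = ipaddress.ip_network("192.168.77.0/24")  # range reserved for load balancer VIPs


def needs_learned_route(address: str) -> bool:
    """Return True when a host on the LAN cannot treat address as on-link.

    Such an address is only reachable through a route, which in this lab is
    the /32 host route learned via BGP (or a gateway that holds that route).
    """
    return ipaddress.ip_address(address) not in LAN


def host_route(vip: str):
    """Build the /32 prefix that would be advertised for a single VIP."""
    address = ipaddress.ip_address(vip)
    if address not in VIP_POOL:
        raise ValueError(f"{vip} is outside the VIP pool {VIP_POOL}")
    return ipaddress.ip_network(f"{address}/32")


if __name__ == "__main__":
    vip = "192.168.77.128"
    print(vip, "needs a learned route:", needs_learned_route(vip))
    print("prefix to advertise:", host_route(vip))
```

A host such as fedora1 at 192.168.99.10 is on-link for its neighbors, while any 192.168.77.x VIP is not, which is why the /32 advertisements matter.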
Visualized Configuration Flow
Given the number of discrete configuration elements required for a complete service deployment using Cilium and the possibility of additional environment-specific deployments, it is useful to depict ALL of the primary configuration objects in a single view that relates the references between the elements. When something isn't working, it is likely due to one of these components being overlooked or having a typo in it.
It is also useful to show all of the YAML configuration files for the overall cluster and specific application or service. Doing so illustrates the value in adopting a naming scheme for these files that reflects their content and object type and allows them to be used as a reminder to ensure all bases have been covered. Here, the files associated with using BGP advertisements for the "cd" service are:
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ ls -l cd-bgp*
-rw-r--r--. 1 mdh mdh 1514 Apr 13 22:08 cd-bgp-deploy.yaml
-rw-r--r--. 1 mdh mdh 1181 Apr 13 22:10 cd-bgp-gw.dev.yaml
-rw-r--r--. 1 mdh mdh 1171 Apr 12 18:01 cd-bgp-gw.prod.yaml
-rw-r--r--. 1 mdh mdh 361 Apr 12 17:53 cd-bgp-httproute.prod.yaml
-rw-r--r--. 1 mdh mdh 1094 Apr 27 17:21 cd-bgp-svc.yaml
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ ls -l mdhlabs-bgp*
-rw-r--r--. 1 mdh mdh 349 Apr 12 14:00 mdhlabs-bgp-advertisement.yaml
-rw-r--r--. 1 mdh mdh 883 Apr 26 11:57 mdhlabs-bgp-cluster-config.yaml
-rw-r--r--. 1 mdh mdh 1280 Apr 13 17:47 mdhlabs-bgp-fedora1frr.config.txt
-rw-r--r--. 1 mdh mdh 1280 Apr 13 17:47 mdhlabs-bgp-fedora2frr.config.txt
-rw-r--r--. 1 mdh mdh 781 Apr 11 21:30 mdhlabs-bgp-kube1-node-config.yaml
-rw-r--r--. 1 mdh mdh 781 Apr 11 21:47 mdhlabs-bgp-kube2-node-config.yaml
-rw-r--r--. 1 mdh mdh 781 Apr 11 21:48 mdhlabs-bgp-kube3-node-config.yaml
-rw-r--r--. 1 mdh mdh 328 Apr 12 14:01 mdhlabs-bgp-peer-config.yaml
-rw-r--r--. 1 mdh mdh 133 Apr 12 14:57 mdhlabs-bgp-pw-secret.yaml
-rw-r--r--. 1 mdh mdh 382 Apr 11 17:50 mdhlabs-bgp-vip-pool-dev.yaml
-rw-r--r--. 1 mdh mdh 378 Apr 11 17:50 mdhlabs-bgp-vip-pool-prod.yaml
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $
All of the steps below were first applied to two Fedora servers running as VMs atop Proxmox. The resulting "routers" properly passed routes between themselves and their peers within the Kubernetes cluster and properly routed traffic originated WITHIN them to the virtual IPs advertised via BGP and received responses. However, neither router would FORWARD incoming traffic from other hosts towards those BGP advertised destinations. The same installation was performed on a physical Fedora server running on bare metal on host IP 192.168.99.9 and worked perfectly. There is possibly some limit to how many times packets can be forwarded over virtual Ethernet adapters within virtual machines that prevented the VM based BGP router from functioning.
Installing FRR (FRRouting) on Linux
Rather than using real routers that cost real dollars, this environment will run FRR, which implements BGP, OSPF, IS-IS and other routing protocols atop Linux operating systems. The tool mimics most of the Cisco IOS command syntax, so it is relatively easy for anyone familiar with Cisco routers to use it as a virtual replacement. The tool can be installed and enabled as a daemon by running these commands as root.
dnf install frr
systemctl enable frr
systemctl start frr
In order for FRR to function completely in this use case, two additional changes are required. First, packet forwarding must be permanently enabled at the operating system level by setting net.ipv4.ip_forward=1 within sysctl. The current status can be validated via this command.
root@fedorabgp:~ # sysctl -a | grep ipv4 | grep ip_forward
net.ipv4.ip_forward = 0
net.ipv4.ip_forward_update_priority = 1
net.ipv4.ip_forward_use_pmtu = 0
root@fedorabgp:~ #
It can be changed dynamically (but the change is lost after a reboot) via this command:
root@fedorabgp:~ # sysctl -w net.ipv4.ip_forward=1
root@fedorabgp:~ # sysctl -a | grep ipv4 | grep ip_forward
net.ipv4.ip_forward = 1
net.ipv4.ip_forward_update_priority = 1
net.ipv4.ip_forward_use_pmtu = 0
root@fedorabgp:~ #
To ensure the change is applied at startup, the setting can be added to a file placed in the /etc/sysctl.d directory, which is read at startup.
root@fedorabgp:~ # cat /etc/sysctl.d/99-sysctl.conf
# added by mdh 2021-11-25 to alter this setting for elasticsearch
vm.max_map_count=2621441
# added by mdh 2026-04-25 - enables forwarding to use this
# host as a BGP router
net.ipv4.ip_forward=1
root@fedorabgp:~ #
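As a quick sanity check on a file like the one above, the following sketch reads sysctl-style text and reports the value that would win for a given key. It is a simplified model that assumes plain key=value lines where the last assignment wins; it does not model the ordering of multiple files under /etc/sysctl.d, and the function name effective_sysctl is hypothetical.

```python
def effective_sysctl(conf_text: str, key: str):
    """Return the last value assigned to key in sysctl.conf-style text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        name, separator, rest = line.partition("=")
        if separator and name.strip() == key:
            value = rest.strip()
    return value


# Contents of the 99-sysctl.conf file shown above.
SAMPLE = """\
# added by mdh 2021-11-25 to alter this setting for elasticsearch
vm.max_map_count=2621441
# added by mdh 2026-04-25 - enables forwarding to use this
# host as a BGP router
net.ipv4.ip_forward=1
"""

if __name__ == "__main__":
    print("net.ipv4.ip_forward =", effective_sysctl(SAMPLE, "net.ipv4.ip_forward"))
```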
A second change is required to the /etc/frr/daemons configuration file to enable the bgpd daemon. That change is excerpted below.
# The watchfrr, zebra and staticd daemons are always started.
#
bgpd=yes
ospfd=no
ospf6d=no
ripd=no
ripngd=no
isisd=no
After making that change, the frr daemon should be restarted via systemctl restart frr.
Configuring the fedora1 and fedora2 BGP Routers
Before adding configuration to Kubernetes for BGP, the process of configuring FRR with a BGP network with peers and policies can be demonstrated to ensure the basics are working. This example creates a network with the following characteristics:
- the routers outside of the Kubernetes cluster use Autonomous System (AS) number 62112
- the routers inside of the Kubernetes cluster use Autonomous System (AS) number 65432
- fedora1 router's BGP id will be its primary host IP of 192.168.99.10
- fedora2 router's BGP id will be its primary host IP of 192.168.99.11
- kube1 router's BGP id will be its primary host IP of 192.168.99.12
Here is the initial configuration for the fedora1 router:
fedora1# show running-config
Building configuration...
Current configuration:
!
frr version 10.5.0
frr defaults traditional
hostname fedora1
!
ip prefix-list KUBEFILTER seq 5 permit 192.168.77.0/24
ip prefix-list KUBEFILTER seq 10 deny 0.0.0.0/0 le 32
!
route-map ALLOW-ALL permit 10
exit
!
route-map INBOUNDFILTER permit 10
match ip address prefix-list KUBEFILTER
exit
!
router bgp 62112
bgp router-id 192.168.99.10
neighbor KUBE_AS peer-group
neighbor KUBE_AS remote-as 65432
neighbor KUBE_AS password weakpassword
neighbor KUBE_AS port 32179
neighbor KUBE_AS timers 30 90
neighbor MDHLABS peer-group
neighbor MDHLABS remote-as 62112
neighbor MDHLABS password weakpassword
neighbor MDHLABS timers 30 90
neighbor 192.168.99.12 peer-group KUBE_AS
neighbor 192.168.99.12 description "kube1 peer"
neighbor 192.168.99.13 peer-group KUBE_AS
neighbor 192.168.99.13 description "kube2 peer"
neighbor 192.168.99.14 peer-group KUBE_AS
neighbor 192.168.99.14 description "kube3 peer"
neighbor 192.168.99.11 peer-group MDHLABS
neighbor 192.168.99.11 description "fedora2 peer"
!
address-family ipv4 unicast
network 172.16.99.0/24
neighbor KUBE_AS soft-reconfiguration inbound
neighbor KUBE_AS route-map INBOUNDFILTER in
neighbor KUBE_AS route-map ALLOW-ALL out
neighbor MDHLABS soft-reconfiguration inbound
neighbor MDHLABS route-map ALLOW-ALL out
exit-address-family
exit
!
end
fedora1#
Here is the initial configuration for the fedora2 router:
fedora2# show running-config
Building configuration...
Current configuration:
!
frr version 10.5.0
frr defaults traditional
hostname fedora2
!
ip prefix-list KUBEFILTER seq 5 permit 192.168.77.0/24
ip prefix-list KUBEFILTER seq 10 deny 0.0.0.0/0 le 32
!
route-map ALLOW-ALL permit 10
exit
!
route-map FILTER_IN permit 10
match ip address prefix-list MDHFILTER
exit
!
route-map INBOUNDFILTER permit 10
match ip address prefix-list KUBEFILTER
exit
!
router bgp 62112
bgp router-id 192.168.99.11
neighbor KUBE_AS peer-group
neighbor KUBE_AS remote-as 65432
neighbor KUBE_AS password weakpassword
neighbor KUBE_AS timers 30 90
neighbor MDHLABS peer-group
neighbor MDHLABS remote-as 62112
neighbor MDHLABS password weakpassword
neighbor MDHLABS timers 30 90
neighbor 192.168.99.12 peer-group KUBE_AS
neighbor 192.168.99.12 description "kube1 peer"
neighbor 192.168.99.13 peer-group KUBE_AS
neighbor 192.168.99.13 description "kube2 peer"
neighbor 192.168.99.14 peer-group KUBE_AS
neighbor 192.168.99.14 description "kube3 peer"
neighbor 192.168.99.10 peer-group MDHLABS
neighbor 192.168.99.10 description "fedora1 peer"
neighbor 192.168.99.10 timers 30 90
!
address-family ipv4 unicast
network 172.16.88.0/24
neighbor KUBE_AS soft-reconfiguration inbound
neighbor KUBE_AS route-map INBOUNDFILTER in
neighbor KUBE_AS route-map ALLOW-ALL out
neighbor MDHLABS soft-reconfiguration inbound
neighbor MDHLABS route-map ALLOW-ALL out
exit-address-family
exit
!
end
fedora2#
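One detail in the KUBEFILTER prefix-list is worth modeling before peering with the cluster. The sketch below is a simplified Python model of prefix-list matching, not FRR's implementation. It assumes the usual rule that an entry without ge or le matches only the exact prefix length. Under that assumption, the seq 5 entry as listed permits the 192.168.77.0/24 aggregate but not the /32 host routes the cluster originates, so if VIP routes show up as received but are not accepted, adding le 32 to that entry is one thing to check. The class and function names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefixListEntry:
    seq: int
    action: str                 # "permit" or "deny"
    prefix: str                 # for example "192.168.77.0/24"
    ge: Optional[int] = None    # minimum prefix length, if given
    le: Optional[int] = None    # maximum prefix length, if given

    def matches(self, route: str) -> bool:
        entry_net = ipaddress.ip_network(self.prefix)
        candidate = ipaddress.ip_network(route)
        if not candidate.subnet_of(entry_net):
            return False
        if self.ge is None and self.le is None:
            # Without ge/le only the exact prefix length matches.
            return candidate.prefixlen == entry_net.prefixlen
        low = self.ge if self.ge is not None else entry_net.prefixlen
        high = self.le if self.le is not None else candidate.max_prefixlen
        return low <= candidate.prefixlen <= high


def evaluate(entries, route: str) -> str:
    """Return the action of the first matching entry; no match means deny."""
    for entry in sorted(entries, key=lambda item: item.seq):
        if entry.matches(route):
            return entry.action
    return "deny"


# KUBEFILTER exactly as listed in the router configurations above.
AS_LISTED = [
    PrefixListEntry(5, "permit", "192.168.77.0/24"),
    PrefixListEntry(10, "deny", "0.0.0.0/0", le=32),
]
# Variant with "le 32" on seq 5 (an assumption, not taken from the listing).
WITH_LE_32 = [
    PrefixListEntry(5, "permit", "192.168.77.0/24", le=32),
    PrefixListEntry(10, "deny", "0.0.0.0/0", le=32),
]

if __name__ == "__main__":
    for name, entries in (("as listed", AS_LISTED), ("with le 32", WITH_LE_32)):
        print(name, "->", evaluate(entries, "192.168.77.128/32"))
```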
With these configurations in place, querying the fedora1 router for a BGP summary and a list of IP ROUTES produces the following output:
fedora1# show ip bgp summary
IPv4 Unicast Summary:
BGP router identifier 192.168.99.10, local AS number 62112 VRF default vrf-id 0
BGP table version 2
RIB entries 3, using 384 bytes of memory
Peers 2, using 47 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
192.168.99.11 4 62112 31 32 2 0 0 00:13:55 1 1 "fedora1 internal
192.168.99.12 4 62112 0 0 0 0 0 never Active 0 "fedora1 internal
Total number of neighbors 2
fedora1# show ip route
Codes: K - kernel route, C - connected, L - local, S - static,
R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, D - SHARP, F - PBR,
f - OpenFabric, t - Table-Direct,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
IPv4 unicast VRF default:
K>* 0.0.0.0/0 [0/100] via 192.168.99.1, ens18, weight 1, 00:43:35
B>* 172.16.88.0/24 [200/0] via 192.168.99.11, ens18, weight 1, 00:05:54
K>* 172.16.99.0/24 [0/0] via 192.168.99.1, ens18, weight 1, 00:06:45
C>* 172.17.0.0/16 is directly connected, docker0, weight 1, 00:43:35
L>* 172.17.0.1/32 is directly connected, docker0, weight 1, 00:43:35
C>* 192.168.99.0/24 [0/100] is directly connected, ens18, weight 1, 00:43:35
L>* 192.168.99.10/32 is directly connected, ens18, weight 1, 00:43:35
fedora1#
Similarly, running the same commands on the fedora2 router inside vtysh shows the following.
fedora2# show ip bgp summary
IPv4 Unicast Summary:
BGP router identifier 192.168.99.11, local AS number 62112 VRF default vrf-id 0
BGP table version 2
RIB entries 3, using 384 bytes of memory
Peers 2, using 47 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
192.168.99.10 4 62112 22 22 2 0 0 00:09:18 1 1 "fedora1 internal
192.168.99.12 4 62112 0 0 0 0 0 never Active 0 "kube1 internal
Total number of neighbors 2
fedora2# show ip route
Codes: K - kernel route, C - connected, L - local, S - static,
R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, D - SHARP, F - PBR,
f - OpenFabric, t - Table-Direct,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
IPv4 unicast VRF default:
K>* 0.0.0.0/0 [0/100] via 192.168.99.1, ens18, weight 1, 00:11:06
K>* 172.16.88.0/24 [0/0] via 192.168.99.1, ens18, weight 1, 00:01:27
B>* 172.16.99.0/24 [200/0] via 192.168.99.10, ens18, weight 1, 00:02:18
C>* 172.17.0.0/16 is directly connected, docker0, weight 1, 00:11:06
L>* 172.17.0.1/32 is directly connected, docker0, weight 1, 00:11:06
C>* 192.168.99.0/24 [0/100] is directly connected, ens18, weight 1, 00:11:06
L>* 192.168.99.11/32 is directly connected, ens18, weight 1, 00:11:06
fedora2#
Together, both routers show expected / desired results:
- both fedora1 and fedora2 show a live peering session to the other with one route sent and one route received
- each shows another peering session to 192.168.99.12 with no connection since Cilium has not been configured yet
- the unique IP ranges of 172.16.99.0/24 and 172.16.88.0/24 appear in the other router's IP ROUTES summary as having been learned via BGP
These Cilium BGP policies are global and are not restricted to any particular namespace.
Configuring the Cilium BGP Advertisement Policy
The CiliumBGPAdvertisement object defines the criteria Cilium uses to decide which IP addresses are advertised via BGP. For this example, a specific label of mdhlabs-bgp: enable will be used to flag IPs needing advertisement. This label will be added as part of the configuration of Service objects and Gateways that need a LoadBalancer VIP, but the policy specifying this label is defined as shown below in the mdhlabs-bgp-advertisement.yaml file.
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: mdhlabs-bgp-advertisement
  labels:
    bgp.cilium.io/advertise: mdhlabs-bgp-advertisement
spec:
  advertisements:
    - advertisementType: Service
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchLabels:
          mdhlabs-bgp: enable
This policy acts against the entire cluster so it doesn't get applied against a particular namespace.
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ kubectl apply -f mdhlabs-bgp-advertisement.yaml
ciliumbgpadvertisement.cilium.io/mdhlabs-bgp-advertisement created
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $
Configuring the Cilium BGP Routers
In this example, connections between the external BGP routers on fedora1 and fedora2 will require a password to establish a BGP session to the Cilium BGP routers so the Cilium side needs to be provided that password. This will be done by first creating a Secret object which will then be referenced by a CiliumBGPPeerConfig object. This YAML will be housed in mdhlabs-bgp-pw-secret.yaml.
This secret must be deployed to the same namespace used to run Cilium, which normally installs itself inside the kube-system namespace. Some random examples on the Internet reference putting this secret in a namespace called cilium-secrets, which will NOT make the secret accessible in default Cilium installations.
apiVersion: v1
kind: Secret
metadata:
  name: mdhlabs-bgp-pw-secret
  namespace: kube-system
stringData:
  password: "weakpassword"
The CiliumBGPClusterConfig object is used to define all external BGP routers the Cilium BGP routers will peer with to share internal routes for virtual IPs. The configuration for this sample lab environment as saved in mdhlabs-bgp-cluster-config.yaml is shown below.
apiVersion: cilium.io/v2
kind: CiliumBGPClusterConfig
metadata:
  name: mdhlabs-bgp-cluster-config
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  bgpInstances:
    - name: "kubeas65432"
      localASN: 65432
      peers:
        - name: "fedora1"
          peerASN: 62112
          peerAddress: 192.168.99.10
          peerConfigRef:
            name: mdhlabs-bgp-peer-config
        - name: "fedora2"
          peerASN: 62112
          peerAddress: 192.168.99.11
          peerConfigRef:
            name: mdhlabs-bgp-peer-config
The CiliumBGPPeerConfig object serves the same role as a peer-group attribute in the FRR router configuration. It allows a set of common attributes used by all routers in the cluster to be defined in a single element and re-used. In this example, it is being used to supply the password that will be used to authenticate to the external routers on fedora1 and fedora2. Here is the peer configuration as saved in mdhlabs-bgp-peer-config.yaml.
apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: mdhlabs-bgp-peer-config
spec:
  authSecretRef: mdhlabs-bgp-pw-secret
  gracefulRestart:
    enabled: true
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          # must match a label on the CiliumBGPAdvertisement object
          bgp.cilium.io/advertise: mdhlabs-bgp-advertisement
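Because these objects refer to each other by name and by label, a typo in any one of them quietly breaks the chain. The sketch below checks those references on plain Python dictionaries that stand in for the parsed YAML. It is an illustration, not a Cilium tool: the function names are hypothetical and the dictionaries are hand-built from the values used in this post. It assumes a peer configuration selects CiliumBGPAdvertisement objects through the matchLabels under each address family.

```python
def labels_match(match_labels: dict, labels: dict) -> bool:
    """True when every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def check_bgp_references(cluster_config: dict, peer_configs: list, advertisements: list) -> list:
    """Return a list of problems found; an empty list means the chain resolves."""
    problems = []
    known_peer_configs = {pc["metadata"]["name"] for pc in peer_configs}

    # 1. Every peer in the cluster config must reference an existing peer config.
    for instance in cluster_config["spec"]["bgpInstances"]:
        for peer in instance["peers"]:
            ref = peer["peerConfigRef"]["name"]
            if ref not in known_peer_configs:
                problems.append(f"peer {peer['name']}: no CiliumBGPPeerConfig named {ref}")

    # 2. Every advertisement selector in a peer config must select at least
    #    one CiliumBGPAdvertisement through that object's metadata labels.
    for pc in peer_configs:
        for family in pc["spec"].get("families", []):
            selector = family.get("advertisements", {}).get("matchLabels", {})
            selected = [adv for adv in advertisements
                        if labels_match(selector, adv["metadata"].get("labels", {}))]
            if not selected:
                problems.append(
                    f"{pc['metadata']['name']}: selector {selector} matches no advertisement")
    return problems


# Hand-built stand-ins for the objects in this post (illustrative data only).
ADVERTISEMENT = {"metadata": {
    "name": "mdhlabs-bgp-advertisement",
    "labels": {"bgp.cilium.io/advertise": "mdhlabs-bgp-advertisement"}}}
CLUSTER_CONFIG = {"spec": {"bgpInstances": [{"name": "kubeas65432", "peers": [
    {"name": "fedora1", "peerConfigRef": {"name": "mdhlabs-bgp-peer-config"}},
    {"name": "fedora2", "peerConfigRef": {"name": "mdhlabs-bgp-peer-config"}},
]}]}}


def peer_config(advertise_value: str) -> dict:
    """Build a peer config whose selector looks for the given label value."""
    return {"metadata": {"name": "mdhlabs-bgp-peer-config"},
            "spec": {"families": [{"afi": "ipv4", "safi": "unicast", "advertisements": {
                "matchLabels": {"bgp.cilium.io/advertise": advertise_value}}}]}}


if __name__ == "__main__":
    good = peer_config("mdhlabs-bgp-advertisement")
    bad = peer_config("loadbalancer-services")
    print(check_bgp_references(CLUSTER_CONFIG, [good], [ADVERTISEMENT]))
    print(check_bgp_references(CLUSTER_CONFIG, [bad], [ADVERTISEMENT]))
```

A selector value that does not appear on any advertisement object, such as loadbalancer-services in the second call, leaves the sessions up but selects nothing to advertise.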
Adding the Advertising Label to the Service and Gateway
With the policy deployed to look for mdhlabs-bgp: enable to trigger advertising assigned VIPs, existing Service and Gateway definitions that create load balancer virtual IPs must be altered to specify the expected label in their metadata and redeployed. The revised cd-bgp-svc.yaml YAML for the Service object is shown below.
apiVersion: v1
kind: Service
metadata:
  labels:
    app: cd-bgp
    mdhlabs-bgp: enable
  name: cd-bgp-svc
spec:
  selector:
    app: cd-bgp
  ports:
    - protocol: TCP
      port: 6680
      targetPort: 6680
  type: LoadBalancer
  # externalTrafficPolicy controls how requests from OUTSIDE the
  # cluster are distributed WITHIN the cluster
  # Local = traffic is processed by first node / pod that attracted the request
  # but does not undergo source NAT
  # Cluster (default) = requests are balanced across all nodes/pods but source IPs are NATed
  externalTrafficPolicy: Cluster
  # internalTrafficPolicy controls how requests originating from WITHIN the
  # cluster are distributed:
  # Local - requests stay within pods on same node
  # Cluster (default) - requests are balanced across all nodes and pods
  internalTrafficPolicy: Cluster
The outer Gateway object that initiates SSL / TLS processing also needs the metadata label applied. For this Gateway defined in cd-bgp-gw.prod.yaml, the full configuration looks like this:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  labels:
    app: cd-bgp
    service-name: cd-bgp-svc
    mdhlabs-bgp: enable
    # NOTE: For VIPs assigned for Gateway and Service objects, Cilium
    # expects to match a label in the Gateway or Service to a label
    # that triggers advertisement of the IP via ARP or BGP.
    #
    # The label here (mdhlabs-bgp: enable) matches a label in a
    # CiliumBGPAdvertisement config which SHOULD trigger the IP
    # assigned here to be advertised. HOWEVER, this mechanism actually
    # works on the SERVICE object and here, this Gateway auto-generates
    # a SERVICE object but Cilium does not label that auto-generated
    # service with the (mdhlabs-bgp: enable) label here so the IP
    # is NOT advertised.
    #
    # This (mdhlabs-bgp: enable) label must be MANUALLY added to the
    # auto-generated Service for the Gateway EACH time the Gateway
    # is deployed.
  name: cd-bgp-gw
  namespace: prod
spec:
  gatewayClassName: cilium
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "api.mdhlabs.com"
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: api-secrettls
After applying the cd-bgp-gw configuration to the cluster, a new service object will show up with an IP assigned from the 192.168.77.0/24 subnet that maps to the new gateway.
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ kubectl -n prod apply -f cd-bgp-gw.prod.yaml
gateway.gateway.networking.k8s.io/cd-bgp-gw created
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ kubectl -n prod get all
NAME READY STATUS RESTARTS AGE
pod/cd-bgp-deploy-78d5554d58-lvrzz 1/1 Running 15 (6h55m ago) 15d
pod/cd-bgp-deploy-78d5554d58-rxh84 1/1 Running 12 (6h54m ago) 15d
pod/cd-bgp-deploy-78d5554d58-zp64d 1/1 Running 22 (6h54m ago) 15d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/cd-bgp-svc LoadBalancer 10.105.122.66 192.168.77.128 6680:31091/TCP 15d
service/cilium-gateway-cd-bgp-gw LoadBalancer 10.103.224.100 192.168.77.129 443:32484/TCP 7s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/cd-bgp-deploy 3/3 3 3 15d
NAME DESIRED CURRENT READY AGE
replicaset.apps/cd-bgp-deploy-78d5554d58 3 3 3 15d
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $
But as was the case with the ARP process, there's a problem with this second IP tied to the gateway. The fedora1 host is one of the external BGP routers and should have learned a route for the 192.168.77.129 IP address via BGP from the Kubernetes cluster, but no route is showing up.
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ ip route
default via 192.168.99.1 dev ens18 proto static metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.49.0/24 dev br-a18ac545ed4a proto kernel scope link src 192.168.49.1 linkdown
192.168.77.128 nhid 24 proto bgp metric 20
nexthop via 192.168.99.13 dev ens18 weight 1
nexthop via 192.168.99.14 dev ens18 weight 1
nexthop via 192.168.99.12 dev ens18 weight 1
192.168.99.0/24 dev ens18 proto kernel scope link src 192.168.99.10 metric 100
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $
Why isn't the 192.168.77.129 virtual IP being advertised from the cluster back to fedora1? Because the auto-generated service created by Cilium wasn't tagged with the mdhlabs-bgp=enable label defined on the original Gateway. Querying the cluster for the details of that service shows the missing label:
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ kubectl -n prod describe service cilium-gateway-cd-bgp-gw
Name: cilium-gateway-cd-bgp-gw
Namespace: prod
Labels: gateway.networking.k8s.io/gateway-name=cd-bgp-gw
io.cilium.gateway/owning-gateway=cd-bgp-gw
Annotations: <none>
Selector: <none>
Type: LoadBalancer
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.103.224.100
IPs: 10.103.224.100
LoadBalancer Ingress: 192.168.77.129 (VIP)
Port: port-443 443/TCP
TargetPort: 443/TCP
NodePort: port-443 32484/TCP
Endpoints:
Session Affinity: None
External Traffic Policy: Cluster
Internal Traffic Policy: Cluster
Events:
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $
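The selection behavior behind this result can be sketched in a few lines. The example is a simplified model of matchLabels semantics, not Cilium code: a Service is selected only if it carries every key/value pair in the selector. The label sets are copied from the two Services shown in this post, and the function name advertised_vips is hypothetical.

```python
SELECTOR = {"mdhlabs-bgp": "enable"}  # matchLabels from the advertisement policy


def advertised_vips(services: list, selector: dict = SELECTOR) -> list:
    """Return the VIPs of services whose labels satisfy every selector entry."""
    return [service["vip"] for service in services
            if all(service["labels"].get(key) == value
                   for key, value in selector.items())]


SERVICES = [
    {"name": "cd-bgp-svc", "vip": "192.168.77.128",
     "labels": {"app": "cd-bgp", "mdhlabs-bgp": "enable"}},
    # Auto-generated by the Gateway: the Gateway's own labels were not copied.
    {"name": "cilium-gateway-cd-bgp-gw", "vip": "192.168.77.129",
     "labels": {"gateway.networking.k8s.io/gateway-name": "cd-bgp-gw",
                "io.cilium.gateway/owning-gateway": "cd-bgp-gw"}},
]

if __name__ == "__main__":
    print("selected VIPs:", advertised_vips(SERVICES))
```

Only the VIP of cd-bgp-svc is selected until the auto-generated Service carries the same label.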
If the live auto-generated service is updated to add the expected label of mdhlabs-bgp=enable, the route will be advertised as shown below.
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ kubectl -n prod label service cilium-gateway-cd-bgp-gw mdhlabs-bgp=enable
service/cilium-gateway-cd-bgp-gw labeled
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ kubectl -n prod describe service cilium-gateway-cd-bgp-gw
Name: cilium-gateway-cd-bgp-gw
Namespace: prod
Labels: gateway.networking.k8s.io/gateway-name=cd-bgp-gw
io.cilium.gateway/owning-gateway=cd-bgp-gw
mdhlabs-bgp=enable
Annotations: <none>
Selector: <none>
Type: LoadBalancer
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.103.224.100
IPs: 10.103.224.100
LoadBalancer Ingress: 192.168.77.129 (VIP)
Port: port-443 443/TCP
TargetPort: 443/TCP
NodePort: port-443 32484/TCP
Endpoints:
Session Affinity: None
External Traffic Policy: Cluster
Internal Traffic Policy: Cluster
Events: <none>
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $ ip route
default via 192.168.99.1 dev ens18 proto static metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.49.0/24 dev br-a18ac545ed4a proto kernel scope link src 192.168.49.1 linkdown
192.168.77.128 nhid 24 proto bgp metric 20
nexthop via 192.168.99.13 dev ens18 weight 1
nexthop via 192.168.99.14 dev ens18 weight 1
nexthop via 192.168.99.12 dev ens18 weight 1
192.168.77.129 nhid 24 proto bgp metric 20
nexthop via 192.168.99.13 dev ens18 weight 1
nexthop via 192.168.99.14 dev ens18 weight 1
nexthop via 192.168.99.12 dev ens18 weight 1
192.168.99.0/24 dev ens18 proto kernel scope link src 192.168.99.10 metric 100
mdh@fedora1:~/gitwork/webservices/cdtrackerapi $
Diagnosing BGP Problems
If virtual IPs are not being advertised from Kubernetes to the exterior network via BGP as expected, the existence of the routes at each stage in the flow can be viewed with different commands to determine where in the chain something is being ignored or rejected. The summary list of places to check looks like this:
- Is the Gateway service being assigned an IP or does the assignment show <none>?
- Is the assigned IP for the LoadBalancer assigned to the gateway from the expected IP range mapped for use for BGP?
- Do the Kubernetes nodes enabled for BGP show active sessions to the external BGP routers?
- Does the BGP summary from Cilium show that routes are being advertised?
- Do the external BGP routers show the routes being RECEIVED?
- Do the external BGP routers show the routes being ACCEPTED? If they are being RECEIVED but not being ACCEPTED (appearing in the actual routing table), the external routers are either lacking a policy to accept the routes or that policy has a configuration bug that is blocking the actual routes.
- Within the external routers (here running FRR), do the expected routes appear in the show ip route output?
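The checklist above is ordered, so the first stage that fails is usually the place to start looking. A minimal sketch of that idea follows. The stage keys and the first_failing_stage function are hypothetical names, and the observations are values a person would fill in by hand after running the commands shown in this section.

```python
# Ordered stages, mirroring the checklist above.
STAGES = [
    ("vip_assigned", "the Service for the Gateway has an IP assigned"),
    ("vip_in_pool", "the assigned IP comes from the range intended for BGP"),
    ("sessions_established", "BGP-enabled nodes show sessions to the external routers"),
    ("routes_advertised", "the Cilium BGP summary shows routes being advertised"),
    ("routes_received", "the external routers show the routes as received"),
    ("routes_accepted", "the external routers accept the received routes"),
    ("routes_installed", "the routes appear in the show ip route output"),
]


def first_failing_stage(observations: dict):
    """Return (key, description) of the first stage not observed as True, or None."""
    for key, description in STAGES:
        if not observations.get(key, False):
            return key, description
    return None


if __name__ == "__main__":
    # Example: routes arrive at the external router but are filtered by policy.
    observed = {
        "vip_assigned": True,
        "vip_in_pool": True,
        "sessions_established": True,
        "routes_advertised": True,
        "routes_received": True,
        "routes_accepted": False,
    }
    print(first_failing_stage(observed))
```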
Here is an example of using the cilium bgp peers command to verify the status of BGP connections from the cluster to any configured external BGP peers. (Note that Cilium does not display the meshed connections BETWEEN the Kubernetes nodes that would normally be required within a BGP network.) In this example, each Cilium BGP router in the cluster is showing it is advertising 2 routes towards each peer router outside the cluster.
mdh@fedora1:/home $ cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
kube1 65432 62112 192.168.99.10 established 11m21s ipv4/unicast 2 2
65432 62112 192.168.99.11 established 35m42s ipv4/unicast 0 2
65432 62112 192.168.99.9 established 39m5s ipv4/unicast 0 2
kube2 65432 62112 192.168.99.10 established 11m20s ipv4/unicast 2 2
65432 62112 192.168.99.11 established 35m13s ipv4/unicast 0 2
65432 62112 192.168.99.9 established 38m49s ipv4/unicast 0 2
kube3 65432 62112 192.168.99.10 established 11m21s ipv4/unicast 2 2
65432 62112 192.168.99.11 established 38m21s ipv4/unicast 0 2
65432 62112 192.168.99.9 established 38m28s ipv4/unicast 0 2
mdh@fedora1:/home $
Here is an example of the BGP session statuses as seen from one of the FRR external BGP routers. This list shows both the "external" BGP sessions into the Cilium BGP routers on the Kubernetes nodes and the "internal" BGP peers within the 62112 AS.
root@fedora1:~ # vtysh
Hello, this is FRRouting (version 10.5.0).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
fedora1# show ip bgp summary
IPv4 Unicast Summary:
BGP router identifier 192.168.99.10, local AS number 62112 VRF default vrf-id 0
BGP table version 4
RIB entries 5, using 640 bytes of memory
Peers 5, using 118 KiB of memory
Peer groups 2, using 128 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
192.168.99.12 4 65432 44 44 4 0 0 00:20:27 2 2 "kube1 peer"
192.168.99.13 4 65432 44 44 4 0 0 00:20:27 2 2 "kube2 peer"
192.168.99.14 4 65432 44 44 4 0 0 00:20:28 2 2 "kube3 peer"
192.168.99.9 4 62112 44 45 4 0 0 00:20:30 0 2 "fedorabgp"
192.168.99.11 4 62112 44 45 4 0 0 00:20:30 0 2 "fedora2 peer"
Total number of neighbors 5
fedora1#
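When reading a summary like this, the State/PfxRcd column does double duty: it holds a prefix count once a session is established and a state name such as Active otherwise. The sketch below parses one neighbor line on that basis. It assumes the column order shown in the output above, and the function name parse_summary_line is hypothetical.

```python
def parse_summary_line(line: str) -> dict:
    """Parse one neighbor row of 'show ip bgp summary' output.

    Expected columns: Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ
    Up/Down State/PfxRcd PfxSnt Desc (Desc may contain spaces).
    """
    fields = line.split(None, 11)
    if len(fields) < 11:
        raise ValueError(f"unexpected summary line: {line!r}")
    state_or_count = fields[9]
    established = state_or_count.isdigit()
    return {
        "neighbor": fields[0],
        "remote_as": int(fields[2]),
        "up_down": fields[8],
        "established": established,
        "state": "Established" if established else state_or_count,
        "prefixes_received": int(state_or_count) if established else None,
        "description": fields[11].strip().strip('"') if len(fields) > 11 else "",
    }


if __name__ == "__main__":
    rows = [
        '192.168.99.12 4 65432 44 44 4 0 0 00:20:27 2 2 "kube1 peer"',
        '192.168.99.12 4 62112 0 0 0 0 0 never Active 0 "kube1 internal',
    ]
    for row in rows:
        info = parse_summary_line(row)
        print(info["neighbor"], info["state"], info["prefixes_received"])
```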
Here is an example of using show ip bgp neighbor 192.168.99.12 received to view actual routes SENT from a remote peer. If a route is ADVERTISED by the remote peer and ARRIVES at the local peer but is then REJECTED by policy, that route will appear here as "received" but NOT appear as an actual route.
In order for the show ip bgp neighbor xxxx received command to work for a particular peer, the local router's neighbor entry for that peer must be configured with the soft-reconfiguration inbound option. If this command returns an error, check the neighbor (or peer-group) configuration on the router where the command is being run to confirm this option is present.
fedora1# show ip bgp neighbor 192.168.99.12 received
BGP table version is 4, local router ID is 192.168.99.10, vrf id 0
Default local pref 100, local AS 62112
Status codes: s suppressed, d damped, h history, u unsorted, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*> 192.168.77.128/32
192.168.99.12 0 65432 i
*> 192.168.77.129/32
192.168.99.12 0 65432 i
Total number of prefixes 2
fedora1#
Here is an example of IP routes as shown within the FRR router reflecting routes learned from the three Kubernetes nodes.
fedora1# show ip route
Codes: K - kernel route, C - connected, L - local, S - static,
R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, D - SHARP, F - PBR,
f - OpenFabric, t - Table-Direct,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
IPv4 unicast VRF default:
K>* 0.0.0.0/0 [0/100] via 192.168.99.1, ens18, weight 1, 00:00:27
C>* 172.17.0.0/16 is directly connected, docker0, weight 1, 00:00:27
L>* 172.17.0.1/32 is directly connected, docker0, weight 1, 00:00:27
B>* 192.168.77.128/32 [20/0] via 192.168.99.12, ens18, weight 1, 00:00:20
* via 192.168.99.13, ens18, weight 1, 00:00:20
* via 192.168.99.14, ens18, weight 1, 00:00:20
B>* 192.168.77.129/32 [20/0] via 192.168.99.12, ens18, weight 1, 00:00:20
* via 192.168.99.13, ens18, weight 1, 00:00:20
* via 192.168.99.14, ens18, weight 1, 00:00:20
C>* 192.168.99.0/24 [0/100] is directly connected, ens18, weight 1, 00:00:27
L>* 192.168.99.10/32 is directly connected, ens18, weight 1, 00:00:27
fedora1#
Here is how those same routes appear after the FRR BGP router process on fedora1 propagates those routes to the operating system of fedora1. Note the command below was issued in the BASH shell of the fedora1 operating system. The command above was entered inside the vtysh shell program of FRR.
root@fedora1:~ # ip route
default via 192.168.99.1 dev ens18 proto static metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.49.0/24 dev br-a18ac545ed4a proto kernel scope link src 192.168.49.1 linkdown
192.168.77.128 nhid 19 proto bgp metric 20
nexthop via 192.168.99.14 dev ens18 weight 1
nexthop via 192.168.99.12 dev ens18 weight 1
nexthop via 192.168.99.13 dev ens18 weight 1
192.168.77.129 nhid 19 proto bgp metric 20
nexthop via 192.168.99.14 dev ens18 weight 1
nexthop via 192.168.99.12 dev ens18 weight 1
nexthop via 192.168.99.13 dev ens18 weight 1
192.168.99.0/24 dev ens18 proto kernel scope link src 192.168.99.10 metric 100
root@fedora1:~ #
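The kernel-level view above can also be checked programmatically, for example to confirm that each virtual IP has one next hop per Kubernetes node. The following is a minimal Python sketch under the assumption that the text output of ip route has been captured into a string. The function name bgp_routes is hypothetical, and only the two layouts noted in the comments are handled.

```python
def bgp_routes(ip_route_text):
    """Return {destination: [next_hop, ...]} for routes marked 'proto bgp'.

    Handles a single-path route written on one line ("... via X ... proto bgp")
    and a multipath route followed by "nexthop via X ..." lines.
    """
    routes = {}
    current = None  # destination of the BGP route being read, if any
    for raw in ip_route_text.splitlines():
        words = raw.split()
        if not words:
            continue
        if words[0] == "nexthop":
            # Continuation line belonging to the most recent route entry.
            if current is not None and "via" in words[:-1]:
                routes[current].append(words[words.index("via") + 1])
            continue
        # Any other line starts a new route entry.
        current = None
        if "proto" in words[:-1] and words[words.index("proto") + 1] == "bgp":
            current = words[0]
            routes[current] = []
            if "via" in words[:-1]:
                routes[current].append(words[words.index("via") + 1])
    return routes


if __name__ == "__main__":
    sample = """192.168.77.128 nhid 19 proto bgp metric 20
    nexthop via 192.168.99.14 dev ens18 weight 1
    nexthop via 192.168.99.12 dev ens18 weight 1"""
    print(bgp_routes(sample))
```

An empty result right after a reboot, while the BGP sessions are already established, points at the startup-ordering problem rather than at the Cilium configuration.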
If analysis shows that:
- a virtual IP address IS being assigned by Cilium for a load balancer
- that virtual IP address IS matching Cilium policies for being advertised by BGP to external BGP routers
- that route advertisement IS reaching the external BGP router
- that external BGP router is NOT accepting that route
it is likely that the external BGP router and its host operating system are treating the next hop of the route (the address of the BGP peer that advertised it) as unreachable. There are some esoteric reasons why this might take place, but the far more common cause is a lower-level problem with BPF (Berkeley Packet Filter) functionality and virtualized Ethernet interfaces that results in the TCP stack thinking the route destination is unreachable.
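Here's an example of this scenario. The external router fedorabgp (192.168.99.9) is receiving two routes from Cilium neighbor 192.168.99.12 but is not accepting them and installing them in its routing table. The first command below shows the two routes arriving from the peer. The second shows the same two routes without the * (valid) and > (best) status codes, meaning they were not selected for use.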
fedorabgp# show ip bgp neighbor 192.168.99.12 received
BGP table version is 0, local router ID is 192.168.99.9, vrf id 0
Default local pref 100, local AS 62112
Status codes: s suppressed, d damped, h history, u unsorted, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*> 192.168.77.128/32
192.168.99.12 0 65432 i
*> 192.168.77.129/32
192.168.99.12 0 65432 i
Total number of prefixes 2
fedorabgp#
fedorabgp#
fedorabgp# show ip bgp neighbor 192.168.99.12 route
BGP table version is 0, local router ID is 192.168.99.9, vrf id 0
Default local pref 100, local AS 62112
Status codes: s suppressed, d damped, h history, u unsorted, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
192.168.77.128/32
192.168.99.12 0 65432 i
192.168.77.129/32
192.168.99.12 0 65432 i
Displayed 2 routes and 9 total paths
fedorabgp#
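The comparison between the two outputs above can be automated. The following is a minimal Python sketch, assuming both command outputs have been captured as text and that each prefix appears at the start of a line, optionally preceded by status codes, as it does in this post. The function names prefix_status and received_but_not_valid are hypothetical, and other FRR output layouts may need a different pattern.

```python
import re

# Optional status codes (such as "*>"), then an IPv4 prefix in CIDR form.
ENTRY = re.compile(
    r"^\s*(?P<status>[^\d\s]*)\s*"
    r"(?P<prefix>\d{1,3}(?:\.\d{1,3}){3}/\d{1,2})(?!\S)"
)


def prefix_status(table_text):
    """Map each prefix in a BGP table listing to its status codes ('' if none)."""
    result = {}
    for line in table_text.splitlines():
        match = ENTRY.match(line)
        if match:
            result[match.group("prefix")] = match.group("status")
    return result


def received_but_not_valid(received_text, routes_text):
    """Prefixes seen in the 'received' output that are missing from the
    'route' output or listed there without the '*' (valid) status code."""
    received = prefix_status(received_text)
    accepted = prefix_status(routes_text)
    return sorted(p for p in received if "*" not in accepted.get(p, ""))


if __name__ == "__main__":
    received = "*> 192.168.77.128/32\n192.168.99.12 0 65432 i"
    routes = "192.168.77.128/32\n192.168.99.12 0 65432 i"
    print(received_but_not_valid(received, routes))
```

Run against the fedorabgp outputs above, both 192.168.77.128/32 and 192.168.77.129/32 are reported. On a healthy router where the route output carries the *> codes, the list is empty.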
Restarting the FRR service via systemctl restart frr will correct the problem and allow the BGP routes to be passed through zebra into the kernel routing table of the operating system. But what is causing the problem? Usually, it is caused by systemd starting FRR while the operating system network layer is still starting rather than AFTER the network layer is fully up. In some cases, the BGP daemon can receive routes before lower layers see accurate interface statuses. This leads the operating system to treat those routes as unreachable and reject them. This can be confirmed and corrected by looking at the /usr/lib/systemd/system/frr.service service configuration file.
Initially, that service file looks like this:
[Unit]
Description=FRRouting
Documentation=https://frrouting.readthedocs.io/en/latest/setup.html
Wants=network.target
After=network-pre.target systemd-sysctl.service
Before=network.target
OnFailure=heartbeat-failed@%n
[Service]
Nice=-5
Type=forking
NotifyAccess=all
StartLimitInterval=3m
StartLimitBurst=3
TimeoutSec=2m
WatchdogSec=60s
RestartSec=5
Restart=always
LimitNOFILE=1024
PIDFile=/run/frr/watchfrr.pid
ExecStart=/usr/libexec/frr/frrinit.sh start
ExecStop=/usr/libexec/frr/frrinit.sh stop
ExecReload=/usr/libexec/frr/frrinit.sh reload
[Install]
WantedBy=multi-user.target
To ensure frr is started AFTER the network reaches an active state, the After attribute in the [Unit] section of the service file should be changed from network-pre.target to network-online.target as shown below.
[Unit]
Description=FRRouting
Documentation=https://frrouting.readthedocs.io/en/latest/setup.html
Wants=network.target
After=network-online.target systemd-sysctl.service
Before=network.target
OnFailure=heartbeat-failed@%n
With the updated service file in place, rebooting the server and IMMEDIATELY using ip route to check routes after logging in shows all of the external BGP routes from Kubernetes immediately propagated down to the kernel level routing of the BGP router without having to manually restart frr.
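To confirm which ordering settings a given copy of the service file actually contains, the [Unit] section can be inspected with a short script. The following is a minimal Python sketch, assuming the unit file text has been read into a string (for example from the output of systemctl cat frr.service). The function names unit_ordering and starts_after_network_online are hypothetical, and the parser only understands the simple Key=Value layout shown above.

```python
def unit_ordering(unit_text):
    """Collect After=, Before= and Wants= values from the [Unit] section."""
    ordering = {"After": [], "Before": [], "Wants": []}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section != "Unit" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in ordering:
            # A setting may list several units separated by spaces.
            ordering[key].extend(value.split())
    return ordering


def starts_after_network_online(unit_text):
    """True when the unit is ordered after network-online.target."""
    return "network-online.target" in unit_ordering(unit_text)["After"]


if __name__ == "__main__":
    sample = "[Unit]\nAfter=network-online.target systemd-sysctl.service\n"
    print(starts_after_network_online(sample))
```

The Before list is returned as well because network-online.target is itself ordered after network.target, so a unit that keeps Before=network.target while adding After=network-online.target may be reported by systemd as an ordering cycle on some systems. If startup still misbehaves, searching the boot log from journalctl -b for "ordering cycle" shows whether that applies.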