BGP on Cilium: Peering Kubernetes with a Leaf‑Spine Datacenter

BGP is not just the routing protocol that powers the Internet — it has become the standard control plane inside modern data centers.

Today’s data centers are typically built using a leaf–spine architecture, where BGP is responsible for distributing reachability information between racks, spines, and endpoints. And when your endpoints are Kubernetes Pods, it makes perfect sense for Kubernetes networking to speak BGP as well.

That’s exactly where Cilium comes in.

In this post, we’ll walk through a hands‑on lab where we enable BGP on Cilium, peer Kubernetes nodes directly with a virtual leaf–spine fabric, and verify real end‑to‑end Pod connectivity across racks.


Lab Overview

In this lab we build a small but realistic virtual data center:

  • A core router (spine)

  • Two Top‑of‑Rack (ToR) switches

  • A Kubernetes cluster with:

    • 1 control‑plane node
    • 3 worker nodes
  • Nodes logically split across two racks

  • Cilium as the CNI, running in native routing mode

  • BGP peering between Kubernetes nodes and ToR switches

The goal is simple:

Kubernetes Pods in different racks should be reachable using routes learned dynamically via BGP.


Why BGP with Cilium?

Cilium’s BGP support allows Kubernetes nodes to advertise Pod CIDRs directly into your data center fabric.

That means:

  • No overlays required
  • No static routes
  • No NAT between racks
  • Your DC fabric becomes Pod‑aware

With the BGP v2 control plane (introduced in Cilium 1.16), this is configured entirely via Kubernetes CRDs — clean, declarative, and GitOps‑friendly.


Topology

At a high level, the topology looks like this:

  • A spine router peers with two ToR switches
  • Each ToR switch peers with Kubernetes nodes in its rack
  • Kubernetes nodes advertise their Pod CIDRs using BGP

Each rack maps to its own ASN:

  • Rack 0 → AS 65010
  • Rack 1 → AS 65011
  • Core → AS 65000

This mirrors how real data centers are commonly built.


Kubernetes Cluster Setup (Kind)

We deploy Kubernetes using kind, with CNI disabled so that Cilium can be installed manually.

cluster.yaml
kind: Cluster
name: kind
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: "10.1.0.0/16"
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-ip: "10.0.1.2"
            node-labels: "rack=rack0"        
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-ip: "10.0.2.2"
            node-labels: "rack=rack0"        
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-ip: "10.0.3.2"
            node-labels: "rack=rack1"        
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-ip: "10.0.4.2"
            node-labels: "rack=rack1"        
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:5000"]
      endpoint = ["http://kind-registry:5000"]    

Each node is labeled with its rack:

node-labels: "rack=rack0"
node-labels: "rack=rack1"

These labels are critical — Cilium uses them later to decide which nodes should peer with which ToR switch.
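
Once the cluster is up, a quick way to confirm the labels landed is kubectl's label-column flag, which prints each node's rack:

kubectl get nodes -L rack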


Building the Datacenter Fabric with Containerlab

To simulate the data center network, we use containerlab with FRRouting (FRR).

The topology includes:

  • router0 – the core router (spine)
  • tor0 – Top of Rack for rack0
  • tor1 – Top of Rack for rack1

Each device runs FRR and establishes BGP sessions using:

  • eBGP between spine and ToRs
  • iBGP between ToRs and Kubernetes nodes

Here is the full containerlab topology definition:

bgp-topo.yaml
name: bgp-topo
topology:
  kinds:
    linux:
      cmd: bash
  nodes:
    router0:
      kind: linux
      image: frrouting/frr:v8.2.2
      labels:
        app: frr
      exec:
        # NAT everything in here to go outside of the lab
        - iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
        # Loopback IP (IP address of the router itself)
        - ip addr add 10.0.0.0/32 dev lo
        # Terminate rest of the 10.0.0.0/8 in here
        - ip route add blackhole 10.0.0.0/8
        # Boiler plate to make FRR work
        - touch /etc/frr/vtysh.conf
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - /usr/lib/frr/frrinit.sh start
        # FRR configuration
        - >-
          vtysh -c 'conf t' -c 'frr defaults datacenter' -c 'router bgp 65000' -c '  bgp router-id 10.0.0.0' -c '  no bgp ebgp-requires-policy' -c '  neighbor ROUTERS peer-group' -c '  neighbor ROUTERS remote-as external' -c '  neighbor ROUTERS default-originate' -c '  neighbor net0 interface peer-group ROUTERS' -c '  neighbor net1 interface peer-group ROUTERS' -c '  address-family ipv4 unicast' -c '    redistribute connected' -c '  exit-address-family' -c '!'          
    tor0:
      kind: linux
      image: frrouting/frr:v8.2.2
      labels:
        app: frr
      exec:
        - ip link del eth0
        - ip addr add 10.0.0.1/32 dev lo
        - ip addr add 10.0.1.1/24 dev net1
        - ip addr add 10.0.2.1/24 dev net2
        - touch /etc/frr/vtysh.conf
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - /usr/lib/frr/frrinit.sh start
        - >-
          vtysh -c 'conf t' -c 'frr defaults datacenter' -c 'router bgp 65010' -c '  bgp router-id 10.0.0.1' -c '  no bgp ebgp-requires-policy' -c '  neighbor ROUTERS peer-group' -c '  neighbor ROUTERS remote-as external' -c '  neighbor SERVERS peer-group' -c '  neighbor SERVERS remote-as internal' -c '  neighbor net0 interface peer-group ROUTERS' -c '  neighbor 10.0.1.2 peer-group SERVERS' -c '  neighbor 10.0.2.2 peer-group SERVERS' -c '  address-family ipv4 unicast' -c '    redistribute connected' -c '  exit-address-family' -c '!'          
    tor1:
      kind: linux
      image: frrouting/frr:v8.2.2
      labels:
        app: frr
      exec:
        - ip link del eth0
        - ip addr add 10.0.0.2/32 dev lo
        - ip addr add 10.0.3.1/24 dev net1
        - ip addr add 10.0.4.1/24 dev net2
        - touch /etc/frr/vtysh.conf
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - /usr/lib/frr/frrinit.sh start
        - >-
          vtysh -c 'conf t' -c 'frr defaults datacenter' -c 'router bgp 65011' -c '  bgp router-id 10.0.0.2' -c '  bgp bestpath as-path multipath-relax' -c '  no bgp ebgp-requires-policy' -c '  neighbor ROUTERS peer-group' -c '  neighbor ROUTERS remote-as external' -c '  neighbor SERVERS peer-group' -c '  neighbor SERVERS remote-as internal' -c '  neighbor net0 interface peer-group ROUTERS' -c '  neighbor 10.0.3.2 peer-group SERVERS' -c '  neighbor 10.0.4.2 peer-group SERVERS' -c '  address-family ipv4 unicast' -c '    redistribute connected' -c '  exit-address-family' -c '!'          
    srv-control-plane:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:kind-control-plane
      exec:
        # Cilium currently doesn't support BGP Unnumbered
        - ip addr add 10.0.1.2/24 dev net0
        # Cilium currently doesn't support importing routes
        - ip route replace default via 10.0.1.1
    srv-worker:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:kind-worker
      exec:
        - ip addr add 10.0.2.2/24 dev net0
        - ip route replace default via 10.0.2.1
    srv-worker2:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:kind-worker2
      exec:
        - ip addr add 10.0.3.2/24 dev net0
        - ip route replace default via 10.0.3.1
    srv-worker3:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:kind-worker3
      exec:
        - ip addr add 10.0.4.2/24 dev net0
        - ip route replace default via 10.0.4.1
  links:
    - endpoints: ["router0:net0", "tor0:net0"]
    - endpoints: ["router0:net1", "tor1:net0"]
    - endpoints: ["tor0:net1", "srv-control-plane:net0"]
    - endpoints: ["tor0:net2", "srv-worker:net0"]
    - endpoints: ["tor1:net1", "srv-worker2:net0"]
    - endpoints: ["tor1:net2", "srv-worker3:net0"]

containerlab -t bgp-topo.yaml deploy
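
If you want to double-check the container names before exec'ing into them, containerlab can list the deployed nodes:

containerlab inspect -t bgp-topo.yaml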

Once deployed, we can already see the fabric forming. From the core router:

show bgp ipv4 summary wide
❯ docker exec -it clab-bgp-topo-router0 vtysh -c 'show bgp ipv4 summary wide'

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.0.0.0, local AS number 65000 vrf-id 0
BGP table version 12
RIB entries 23, using 4232 bytes of memory
Peers 2, using 1433 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor        V         AS    LocalAS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
tor0(net0)      4      65010      65000      4920      4919        0    0    0 04:05:24            5       13 N/A
tor1(net1)      4      65011      65000      4920      4920        0    0    0 04:05:24            5       13 N/A

Total number of neighbors 2

The spine successfully peers with both ToRs — our virtual DC backbone is alive.


Installing Cilium with BGP Enabled

Now the fun part.

We install Cilium with:

  • Native routing
  • Kubernetes IPAM
  • BGP control plane enabled

cilium install \
  --version v1.19.0-rc.0 \
  --set ipam.mode=kubernetes \
  --set routingMode=native \
  --set ipv4NativeRoutingCIDR="10.0.0.0/8" \
  --set bgpControlPlane.enabled=true \
  --set k8s.requireIPv4PodCIDR=true
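
Before inspecting the configuration, it is worth waiting for the agent and operator to settle:

cilium status --wait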

And confirm:

cilium config view | grep enable-bgp

enable-bgp-control-plane                          true
enable-bgp-control-plane-status-report            true
enable-bgp-legacy-origin-attribute                false

BGP is officially on 🔥


Cilium BGP Configuration Model

Cilium’s BGP v2 control plane is configured through three user-facing CRDs:

  1. CiliumBGPClusterConfig – defines BGP instances and peers
  2. CiliumBGPPeerConfig – defines address families and behavior
  3. CiliumBGPAdvertisement – defines what gets advertised

This separation makes the configuration extremely flexible.
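
As a quick sanity check that the BGP CRDs are registered in the cluster (a rough filter, since the exact set of CRDs varies between releases):

kubectl get crd | grep bgp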


Defining Rack‑Aware BGP Peering

We define two CiliumBGPClusterConfig resources:

  • One for rack0
  • One for rack1

Each config:

  • Selects nodes via rack labels
  • Assigns a rack‑specific ASN
  • Peers with the corresponding ToR loopback IP

Example (rack0), condensed to the key fields:

nodeSelector:
  matchLabels:
    rack: rack0
localASN: 65010
peerAddress: 10.0.0.1

We then define a CiliumBGPAdvertisement that advertises:

  • PodCIDR routes

cilium-bgp-peering-policies.yaml
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPClusterConfig
metadata:
  name: rack0
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  bgpInstances:
    - name: "instance-65010"
      localASN: 65010
      peers:
        - name: "peer-65010-rack0"
          peerASN: 65010
          peerAddress: "10.0.0.1"
          peerConfigRef:
            name: "peer-config-generic"
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPClusterConfig
metadata:
  name: rack1
spec:
  nodeSelector:
    matchLabels:
      rack: rack1
  bgpInstances:
    - name: "instance-65011"
      localASN: 65011
      peers:
        - name: "peer-65011-rack1"
          peerASN: 65011
          peerAddress: "10.0.0.2"
          peerConfigRef:
            name: "peer-config-generic"
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPPeerConfig
metadata:
  name: peer-config-generic
spec:
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "pod-cidr"
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPAdvertisement
metadata:
  name: pod-cidr
  labels:
    advertise: pod-cidr
spec:
  advertisements:
    - advertisementType: "PodCIDR"

That’s it. No per‑node config. No static routing. Just labels.
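
Under the hood, the Cilium operator expands these cluster-scoped resources into a per-node CiliumBGPNodeConfig for each selected node, so the per-node state still exists, it is just generated for you. Once the manifest is applied in the next step, the generated objects can be listed with:

kubectl get ciliumbgpnodeconfigs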


Verifying BGP Sessions

After applying the policies:

kubectl apply -f cilium-bgp-peering-policies.yaml

We immediately see Kubernetes nodes forming BGP sessions with the ToR switches.
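
The same sessions can also be checked from the Kubernetes side; the cilium CLI queries each agent's BGP state:

cilium bgp peers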

From tor0:

show bgp ipv4 summary wide
❯ docker exec -it clab-bgp-topo-tor0 vtysh -c 'show bgp ipv4 summary wide'

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.0.0.1, local AS number 65010 vrf-id 0
BGP table version 13
RIB entries 23, using 4232 bytes of memory
Peers 3, using 2149 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor                     V         AS    LocalAS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
router0(net0)                4      65000      65010      4593      4595        0    0    0 03:49:05            8       13 N/A
kind-control-plane(10.0.1.2) 4      65010      65010      4486      4492        0    0    0 03:44:11            1       11 N/A
kind-worker(10.0.2.2)        4      65010      65010      4486      4492        0    0    0 03:44:12            1       11 N/A

Total number of neighbors 3

The ToR now peers with:

  • kind-control-plane
  • kind-worker

And receives Pod CIDR routes dynamically 🎯

The same happens on tor1 for rack1 workers.

❯ docker exec -it clab-bgp-topo-tor1 vtysh -c 'show bgp ipv4 summary wide'

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.0.0.2, local AS number 65011 vrf-id 0
BGP table version 13
RIB entries 23, using 4232 bytes of memory
Peers 3, using 2149 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor               V         AS    LocalAS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
router0(net0)          4      65000      65011      4591      4592        0    0    0 03:48:59            8       13 N/A
kind-worker2(10.0.3.2) 4      65011      65011      4484      4490        0    0    0 03:44:06            1       11 N/A
kind-worker3(10.0.4.2) 4      65011      65011      4484      4490        0    0    0 03:44:06            1       11 N/A

Total number of neighbors 3
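
At this point every node's PodCIDR (a /24 carved out of 10.1.0.0/16) should also appear on the spine as a BGP route learned via the ToRs; FRR's routing table is an easy place to confirm that:

docker exec -it clab-bgp-topo-router0 vtysh -c 'show ip route bgp'
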

End‑to‑End Connectivity Test

To validate everything, we deploy netshoot as a DaemonSet.

netshoot-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: netshoot
spec:
  selector:
    matchLabels:
      app: netshoot
  template:
    metadata:
      labels:
        app: netshoot
    spec:
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      containers:
        - name: netshoot
          image: nicolaka/netshoot:latest
          command: ["sleep", "infinite"]

This gives us a debugging Pod on each worker node. (Recent Kubernetes releases taint the control plane with node-role.kubernetes.io/control-plane rather than node-role.kubernetes.io/master, so the toleration above no longer matches and the control-plane node is skipped, which is fine for this test.)

❯ kubectl rollout status ds/netshoot -w
daemon set "netshoot" successfully rolled out

❯ k get pods
NAME             READY   STATUS    RESTARTS   AGE
netshoot-ffssl   1/1     Running   0          95s
netshoot-q7l9l   1/1     Running   0          95s
netshoot-rnm8n   1/1     Running   0          95s

We then:

  1. Pick a source Pod in rack0
  2. Pick a destination Pod in rack1
  3. Ping across racks

❯ SRC_POD=$(kubectl get pods -o wide | grep "kind-worker " | awk '{ print($1); }')

❯ DST_IP=$(kubectl get pods -o wide | grep worker3 | awk '{ print($6); }')

❯ kubectl exec -it $SRC_POD -- ping -c 10 $DST_IP
PING 10.1.1.142 (10.1.1.142) 56(84) bytes of data.
64 bytes from 10.1.1.142: icmp_seq=1 ttl=58 time=0.235 ms
64 bytes from 10.1.1.142: icmp_seq=2 ttl=58 time=0.149 ms
64 bytes from 10.1.1.142: icmp_seq=3 ttl=58 time=0.284 ms
64 bytes from 10.1.1.142: icmp_seq=4 ttl=58 time=0.188 ms
^C
--- 10.1.1.142 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3097ms
rtt min/avg/max/mdev = 0.149/0.214/0.284/0.050 ms
❯ kubectl exec -it $SRC_POD -- traceroute $DST_IP
traceroute to 10.1.1.142 (10.1.1.142), 30 hops max, 46 byte packets
 1  10.1.3.115 (10.1.3.115)  0.008 ms  0.064 ms  0.008 ms
 2  10.0.2.1 (10.0.2.1)  0.008 ms  0.009 ms  0.009 ms
 3  10.0.0.0 (10.0.0.0)  0.109 ms  0.009 ms  0.008 ms
 4  10.0.0.2 (10.0.0.2)  0.007 ms  0.009 ms  0.008 ms
 5  10.0.4.2 (10.0.4.2)  0.008 ms  0.009 ms  0.008 ms
 6  *  *  *
 7  10.1.1.142 (10.1.1.142)  0.009 ms  0.009 ms  0.008 ms

And… success 🎉

Packets traverse:

Pod → Node → ToR → Spine → ToR → Node → Pod

All driven by BGP‑learned routes.
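
To see the advertisements from the node side rather than the fabric side, recent versions of the cilium CLI can also dump what each node is announcing to its peers (assuming your CLI version includes the bgp routes subcommand):

cilium bgp routes advertised ipv4 unicast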


What We Achieved

By the end of this lab, we have:

  • A Kubernetes cluster integrated directly into a DC fabric
  • Dynamic Pod CIDR advertisement via BGP
  • Rack‑aware routing using node labels
  • No overlays, no tunnels, no hacks

This is exactly how Kubernetes networking should look in a modern data center.


Final Thoughts

Cilium’s BGP support is a huge step forward for:

  • Bare‑metal Kubernetes
  • On‑prem data centers
  • Hybrid cloud networking

If your network already speaks BGP — and it almost certainly does — Cilium lets Kubernetes become a first‑class citizen of that network.

Happy routing 🚀
