BGP on Cilium: Peering Kubernetes with a Leaf‑Spine Datacenter

BGP is not just the routing protocol that powers the Internet — it has become the standard control plane inside modern data centers.

Today’s data centers are typically built using a leaf–spine architecture, where BGP is responsible for distributing reachability information between racks, spines, and endpoints. And when your endpoints are Kubernetes Pods, it makes perfect sense for Kubernetes networking to speak BGP as well.

That’s exactly where Cilium comes in.

In this post, we’ll walk through a hands‑on lab where we enable BGP on Cilium, peer Kubernetes nodes directly with a virtual leaf–spine fabric, and verify real end‑to‑end Pod connectivity across racks.


Lab Overview

In this lab we build a small but realistic virtual data center:

  • A core router (spine)

  • Two Top‑of‑Rack (ToR) switches

  • A Kubernetes cluster with:

    • 1 control‑plane node
    • 3 worker nodes
  • Nodes logically split across two racks

  • Cilium as the CNI, running in native routing mode

  • BGP peering between Kubernetes nodes and ToR switches

The goal is simple:

Kubernetes Pods in different racks should be reachable using routes learned dynamically via BGP.


Why BGP with Cilium?

Cilium’s BGP support allows Kubernetes nodes to advertise Pod CIDRs directly into your data center fabric.

That means:

  • No overlays required
  • No static routes
  • No NAT between racks
  • Your DC fabric becomes Pod‑aware

With the BGP v2 control plane (introduced in Cilium 1.16), this is configured entirely via Kubernetes CRDs — clean, declarative, and GitOps‑friendly.


Topology

At a high level, the topology looks like this:

  • A spine router peers with two ToR switches
  • Each ToR switch peers with Kubernetes nodes in its rack
  • Kubernetes nodes advertise their Pod CIDRs using BGP

Each rack maps to its own ASN:

  • Rack 0 → AS 65010
  • Rack 1 → AS 65011
  • Core → AS 65000

This mirrors how real data centers are commonly built.


Kubernetes Cluster Setup (Kind)

We deploy Kubernetes using kind, with CNI disabled so that Cilium can be installed manually.

cluster.yaml
kind: Cluster
name: kind
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: "10.1.0.0/16"
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-ip: "10.0.1.2"
            node-labels: "rack=rack0"        
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-ip: "10.0.2.2"
            node-labels: "rack=rack0"        
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-ip: "10.0.3.2"
            node-labels: "rack=rack1"        
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-ip: "10.0.4.2"
            node-labels: "rack=rack1"        
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:5000"]
      endpoint = ["http://kind-registry:5000"]    

Each node is labeled with its rack:

node-labels: "rack=rack0"
node-labels: "rack=rack1"

These labels are critical — Cilium uses them later to decide which nodes should peer with which ToR switch.
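
Once the cluster is up, a quick way to confirm the labels landed is kubectl's label-column flag, which prints each node's rack:

kubectl get nodes -L rack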


Building the Datacenter Fabric with Containerlab

To simulate the data center network, we use containerlab with FRRouting (FRR).

The topology includes:

  • router0 – the core router (spine)
  • tor0 – Top of Rack for rack0
  • tor1 – Top of Rack for rack1

Each device runs FRR and establishes BGP sessions using:

  • eBGP between spine and ToRs
  • iBGP between ToRs and Kubernetes nodes

Here is the full containerlab topology definition:

bgp-topo.yaml
name: bgp-topo
topology:
  kinds:
    linux:
      cmd: bash
  nodes:
    router0:
      kind: linux
      image: frrouting/frr:v8.2.2
      labels:
        app: frr
      exec:
        # NAT everything in here to go outside of the lab
        - iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
        # Loopback IP (IP address of the router itself)
        - ip addr add 10.0.0.0/32 dev lo
        # Terminate rest of the 10.0.0.0/8 in here
        - ip route add blackhole 10.0.0.0/8
        # Boiler plate to make FRR work
        - touch /etc/frr/vtysh.conf
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - /usr/lib/frr/frrinit.sh start
        # FRR configuration
        - >-
          vtysh -c 'conf t' -c 'frr defaults datacenter' -c 'router bgp 65000' -c '  bgp router-id 10.0.0.0' -c '  no bgp ebgp-requires-policy' -c '  neighbor ROUTERS peer-group' -c '  neighbor ROUTERS remote-as external' -c '  neighbor ROUTERS default-originate' -c '  neighbor net0 interface peer-group ROUTERS' -c '  neighbor net1 interface peer-group ROUTERS' -c '  address-family ipv4 unicast' -c '    redistribute connected' -c '  exit-address-family' -c '!'          
    tor0:
      kind: linux
      image: frrouting/frr:v8.2.2
      labels:
        app: frr
      exec:
        - ip link del eth0
        - ip addr add 10.0.0.1/32 dev lo
        - ip addr add 10.0.1.1/24 dev net1
        - ip addr add 10.0.2.1/24 dev net2
        - touch /etc/frr/vtysh.conf
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - /usr/lib/frr/frrinit.sh start
        - >-
          vtysh -c 'conf t' -c 'frr defaults datacenter' -c 'router bgp 65010' -c '  bgp router-id 10.0.0.1' -c '  no bgp ebgp-requires-policy' -c '  neighbor ROUTERS peer-group' -c '  neighbor ROUTERS remote-as external' -c '  neighbor SERVERS peer-group' -c '  neighbor SERVERS remote-as internal' -c '  neighbor net0 interface peer-group ROUTERS' -c '  neighbor 10.0.1.2 peer-group SERVERS' -c '  neighbor 10.0.2.2 peer-group SERVERS' -c '  address-family ipv4 unicast' -c '    redistribute connected' -c '  exit-address-family' -c '!'          
    tor1:
      kind: linux
      image: frrouting/frr:v8.2.2
      labels:
        app: frr
      exec:
        - ip link del eth0
        - ip addr add 10.0.0.2/32 dev lo
        - ip addr add 10.0.3.1/24 dev net1
        - ip addr add 10.0.4.1/24 dev net2
        - touch /etc/frr/vtysh.conf
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - /usr/lib/frr/frrinit.sh start
        - >-
          vtysh -c 'conf t' -c 'frr defaults datacenter' -c 'router bgp 65011' -c '  bgp router-id 10.0.0.2' -c '  bgp bestpath as-path multipath-relax' -c '  no bgp ebgp-requires-policy' -c '  neighbor ROUTERS peer-group' -c '  neighbor ROUTERS remote-as external' -c '  neighbor SERVERS peer-group' -c '  neighbor SERVERS remote-as internal' -c '  neighbor net0 interface peer-group ROUTERS' -c '  neighbor 10.0.3.2 peer-group SERVERS' -c '  neighbor 10.0.4.2 peer-group SERVERS' -c '  address-family ipv4 unicast' -c '    redistribute connected' -c '  exit-address-family' -c '!'          
    srv-control-plane:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:kind-control-plane
      exec:
        # Cilium currently doesn't support BGP Unnumbered
        - ip addr add 10.0.1.2/24 dev net0
        # Cilium currently doesn't support importing routes
        - ip route replace default via 10.0.1.1
    srv-worker:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:kind-worker
      exec:
        - ip addr add 10.0.2.2/24 dev net0
        - ip route replace default via 10.0.2.1
    srv-worker2:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:kind-worker2
      exec:
        - ip addr add 10.0.3.2/24 dev net0
        - ip route replace default via 10.0.3.1
    srv-worker3:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:kind-worker3
      exec:
        - ip addr add 10.0.4.2/24 dev net0
        - ip route replace default via 10.0.4.1
  links:
    - endpoints: ["router0:net0", "tor0:net0"]
    - endpoints: ["router0:net1", "tor1:net0"]
    - endpoints: ["tor0:net1", "srv-control-plane:net0"]
    - endpoints: ["tor0:net2", "srv-worker:net0"]
    - endpoints: ["tor1:net1", "srv-worker2:net0"]
    - endpoints: ["tor1:net2", "srv-worker3:net0"]

containerlab -t bgp-topo.yaml deploy
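
If you want to double-check the container names before exec'ing into them, containerlab can list the deployed nodes:

containerlab inspect -t bgp-topo.yaml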

Once deployed, we can already see the fabric forming. From the core router:

show bgp ipv4 summary wide
❯ docker exec -it clab-bgp-topo-router0 vtysh -c 'show bgp ipv4 summary wide'

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.0.0.0, local AS number 65000 vrf-id 0
BGP table version 12
RIB entries 23, using 4232 bytes of memory
Peers 2, using 1433 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor        V         AS    LocalAS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
tor0(net0)      4      65010      65000      4920      4919        0    0    0 04:05:24            5       13 N/A
tor1(net1)      4      65011      65000      4920      4920        0    0    0 04:05:24            5       13 N/A

Total number of neighbors 2

The spine successfully peers with both ToRs — our virtual DC backbone is alive.


Installing Cilium with BGP Enabled

Now the fun part.

We install Cilium with:

  • Native routing
  • Kubernetes IPAM
  • BGP control plane enabled

cilium install \
  --version v1.19.0-rc.0 \
  --set ipam.mode=kubernetes \
  --set routingMode=native \
  --set ipv4NativeRoutingCIDR="10.0.0.0/8" \
  --set bgpControlPlane.enabled=true \
  --set k8s.requireIPv4PodCIDR=true
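
Before inspecting the configuration, it is worth waiting for the agent and operator to settle:

cilium status --wait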

And confirm:

cilium config view | grep enable-bgp

enable-bgp-control-plane                          true
enable-bgp-control-plane-status-report            true
enable-bgp-legacy-origin-attribute                false

BGP is officially on 🔥


Cilium BGP Configuration Model

Cilium’s BGP v2 control plane is configured through three user-facing CRDs:

  1. CiliumBGPClusterConfig – defines BGP instances and peers
  2. CiliumBGPPeerConfig – defines address families and behavior
  3. CiliumBGPAdvertisement – defines what gets advertised

This separation makes the configuration extremely flexible.
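
As a quick sanity check that the BGP CRDs are registered in the cluster (a rough filter, since the exact set of CRDs varies between releases):

kubectl get crd | grep bgp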


Defining Rack‑Aware BGP Peering

We define two CiliumBGPClusterConfig resources:

  • One for rack0
  • One for rack1

Each config:

  • Selects nodes via rack labels
  • Assigns a rack‑specific ASN
  • Peers with the corresponding ToR loopback IP

Example (rack0), condensed to the key fields:

nodeSelector:
  matchLabels:
    rack: rack0
localASN: 65010
peerAddress: 10.0.0.1

We then define a CiliumBGPAdvertisement that advertises:

  • PodCIDR routes

cilium-bgp-peering-policies.yaml
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPClusterConfig
metadata:
  name: rack0
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  bgpInstances:
    - name: "instance-65010"
      localASN: 65010
      peers:
        - name: "peer-65010-rack0"
          peerASN: 65010
          peerAddress: "10.0.0.1"
          peerConfigRef:
            name: "peer-config-generic"
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPClusterConfig
metadata:
  name: rack1
spec:
  nodeSelector:
    matchLabels:
      rack: rack1
  bgpInstances:
    - name: "instance-65011"
      localASN: 65011
      peers:
        - name: "peer-65011-rack1"
          peerASN: 65011
          peerAddress: "10.0.0.2"
          peerConfigRef:
            name: "peer-config-generic"
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPPeerConfig
metadata:
  name: peer-config-generic
spec:
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "pod-cidr"
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPAdvertisement
metadata:
  name: pod-cidr
  labels:
    advertise: pod-cidr
spec:
  advertisements:
    - advertisementType: "PodCIDR"

That’s it. No per‑node config. No static routing. Just labels.
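
Under the hood, the Cilium operator expands these cluster-scoped resources into a per-node CiliumBGPNodeConfig for each selected node, so the per-node state still exists, it is just generated for you. Once the manifest is applied in the next step, the generated objects can be listed with:

kubectl get ciliumbgpnodeconfigs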


Verifying BGP Sessions

After applying the policies:

kubectl apply -f cilium-bgp-peering-policies.yaml

We immediately see Kubernetes nodes forming BGP sessions with the ToR switches.
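
The same sessions can also be checked from the Kubernetes side; the cilium CLI queries each agent's BGP state:

cilium bgp peers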

From tor0:

show bgp ipv4 summary wide
❯ docker exec -it clab-bgp-topo-tor0 vtysh -c 'show bgp ipv4 summary wide'

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.0.0.1, local AS number 65010 vrf-id 0
BGP table version 13
RIB entries 23, using 4232 bytes of memory
Peers 3, using 2149 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor                     V         AS    LocalAS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
router0(net0)                4      65000      65010      4593      4595        0    0    0 03:49:05            8       13 N/A
kind-control-plane(10.0.1.2) 4      65010      65010      4486      4492        0    0    0 03:44:11            1       11 N/A
kind-worker(10.0.2.2)        4      65010      65010      4486      4492        0    0    0 03:44:12            1       11 N/A

Total number of neighbors 3

The ToR now peers with:

  • kind-control-plane
  • kind-worker

And receives Pod CIDR routes dynamically 🎯

The same happens on tor1 for rack1 workers.

❯ docker exec -it clab-bgp-topo-tor1 vtysh -c 'show bgp ipv4 summary wide'

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.0.0.2, local AS number 65011 vrf-id 0
BGP table version 13
RIB entries 23, using 4232 bytes of memory
Peers 3, using 2149 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor               V         AS    LocalAS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
router0(net0)          4      65000      65011      4591      4592        0    0    0 03:48:59            8       13 N/A
kind-worker2(10.0.3.2) 4      65011      65011      4484      4490        0    0    0 03:44:06            1       11 N/A
kind-worker3(10.0.4.2) 4      65011      65011      4484      4490        0    0    0 03:44:06            1       11 N/A

Total number of neighbors 3
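
At this point every node's PodCIDR (a /24 carved out of 10.1.0.0/16) should also appear on the spine as a BGP route learned via the ToRs; FRR's routing table is an easy place to confirm that:

docker exec -it clab-bgp-topo-router0 vtysh -c 'show ip route bgp'
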

End‑to‑End Connectivity Test

To validate everything, we deploy netshoot as a DaemonSet.

netshoot-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: netshoot
spec:
  selector:
    matchLabels:
      app: netshoot
  template:
    metadata:
      labels:
        app: netshoot
    spec:
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      containers:
        - name: netshoot
          image: nicolaka/netshoot:latest
          command: ["sleep", "infinite"]

This gives us a debugging Pod on each worker node. (Recent Kubernetes releases taint the control plane with node-role.kubernetes.io/control-plane rather than node-role.kubernetes.io/master, so the toleration above no longer matches and the control-plane node is skipped, which is fine for this test.)

❯ kubectl rollout status ds/netshoot -w
daemon set "netshoot" successfully rolled out

❯ k get pods
NAME             READY   STATUS    RESTARTS   AGE
netshoot-ffssl   1/1     Running   0          95s
netshoot-q7l9l   1/1     Running   0          95s
netshoot-rnm8n   1/1     Running   0          95s

We then:

  1. Pick a source Pod in rack0
  2. Pick a destination Pod in rack1
  3. Ping across racks

❯ SRC_POD=$(kubectl get pods -o wide | grep "kind-worker " | awk '{ print($1); }')

❯ DST_IP=$(kubectl get pods -o wide | grep worker3 | awk '{ print($6); }')

❯ kubectl exec -it $SRC_POD -- ping -c 10 $DST_IP
PING 10.1.1.142 (10.1.1.142) 56(84) bytes of data.
64 bytes from 10.1.1.142: icmp_seq=1 ttl=58 time=0.235 ms
64 bytes from 10.1.1.142: icmp_seq=2 ttl=58 time=0.149 ms
64 bytes from 10.1.1.142: icmp_seq=3 ttl=58 time=0.284 ms
64 bytes from 10.1.1.142: icmp_seq=4 ttl=58 time=0.188 ms
^C
--- 10.1.1.142 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3097ms
rtt min/avg/max/mdev = 0.149/0.214/0.284/0.050 ms
❯ kubectl exec -it $SRC_POD -- traceroute $DST_IP
traceroute to 10.1.1.142 (10.1.1.142), 30 hops max, 46 byte packets
 1  10.1.3.115 (10.1.3.115)  0.008 ms  0.064 ms  0.008 ms
 2  10.0.2.1 (10.0.2.1)  0.008 ms  0.009 ms  0.009 ms
 3  10.0.0.0 (10.0.0.0)  0.109 ms  0.009 ms  0.008 ms
 4  10.0.0.2 (10.0.0.2)  0.007 ms  0.009 ms  0.008 ms
 5  10.0.4.2 (10.0.4.2)  0.008 ms  0.009 ms  0.008 ms
 6  *  *  *
 7  10.1.1.142 (10.1.1.142)  0.009 ms  0.009 ms  0.008 ms

And… success 🎉

Packets traverse:

Pod → Node → ToR → Spine → ToR → Node → Pod

All driven by BGP‑learned routes.
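
To see the advertisements from the node side rather than the fabric side, recent versions of the cilium CLI can also dump what each node is announcing to its peers (assuming your CLI version includes the bgp routes subcommand):

cilium bgp routes advertised ipv4 unicast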


What We Achieved

By the end of this lab, we have:

  • A Kubernetes cluster integrated directly into a DC fabric
  • Dynamic Pod CIDR advertisement via BGP
  • Rack‑aware routing using node labels
  • No overlays, no tunnels, no hacks

This is exactly how Kubernetes networking should look in a modern data center.


Final Thoughts

Cilium’s BGP support is a huge step forward for:

  • Bare‑metal Kubernetes
  • On‑prem data centers
  • Hybrid cloud networking

If your network already speaks BGP — and it almost certainly does — Cilium lets Kubernetes become a first‑class citizen of that network.

Happy routing 🚀
