Routing Protocol Failures — 'Internet Is Down, Nothing Is Working'
The Call Comes In
- "Internet is completely down — nothing is reaching us"
- "We lost connectivity to the entire data center"
- "VoIP calls are dropping every 10 minutes"
- "Started after the router upgrade last night"
- "Core switch was restarted and now half the network is unreachable"
What Makes This Outage Unique
BGP Neighbor Failures — The Real Debug Flow
What Engineers Try First (And Why It Fails)
- Reboot the edge router: this clears the BGP session but does not fix the underlying cause
- Blame the ISP: the ISP is usually delivering connectivity correctly, and the fault is in the BGP peering configuration or timers
- Check the physical interface: the interface is up, so engineers move on, but the fault is in the BGP session, not the link
Actual Debug Sequence
! Step 1: Check the BGP summary immediately
show bgp summary
! Look for:
!   Up/Down column: 00:00:XX = recently reset (flapping)
!   State column: Idle/Active = not established
!   Prefixes received: 0 when it should be thousands = session up but no routes

! Step 2: Check recent BGP events
show bgp neighbors <peer-ip> | include BGP state|uptime|reset
! Shows the reason for the last reset:
!   "BGP Notification sent: Hold time expired" = keepalives stopped arriving (link quality, MTU, or CPU load)
!   "BGP Notification received: OPEN Error" = AS number or capability mismatch

! Step 3: Check if the problem is MTU
! This is a frequently missed cause of BGP adjacency failures.
! BGP UPDATE messages fill TCP segments up to the MSS. If the MTU on the peering path is wrong,
! the SYN/SYN-ACK and keepalives work (small packets) but route transfer fails (large packets),
! so the session establishes and then stalls until the hold timer expires.
ping <peer-ip> size 1500 df-bit
! If this fails but a normal ping succeeds, MTU is the cause
! Fix: correct the MTU on the peering link.
! Note: ip tcp adjust-mss 1452 on the peering interface only rewrites transit TCP SYNs, not the
! router's own BGP session; lowering the router's own MSS (for example, the global ip tcp mss
! command on IOS) is the usual workaround.

! Step 4: Check route flap dampening
show bgp dampening parameters
show bgp dampening flap-statistics
! Dampening can suppress a prefix long after the session recovers
! This is why "the BGP is up but routes are still missing"

! Step 5: Validate that the peering config matches the ISP
show bgp neighbors <peer-ip> | include remote AS|local AS|hold time|keepalive
! remote-as and the MD5 password must match on both sides.
! Hold time is negotiated down to the lower of the two configured values, so it does not have to match.
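The three Step 1 checks can be applied mechanically to captured output. The following is a minimal Python sketch, assuming the classic IOS-style summary row layout (Neighbor, V, AS, MsgRcvd, MsgSent, TblVer, InQ, OutQ, Up/Down, State/PfxRcd). The function name `triage_summary_row` and the 300-second "recent" threshold are illustrative assumptions, not part of any vendor tool, and other platforms print different columns.

```python
import re

# Assumed row layout (IOS-style "show bgp summary"):
# Neighbor  V  AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down  State/PfxRcd
ROW = re.compile(
    r"^(?P<peer>\d+\.\d+\.\d+\.\d+)\s+\d+\s+(?P<asn>\d+)"
    r"(?:\s+\d+){5}\s+(?P<updown>\S+)\s+(?P<state>.+?)\s*$"
)


def triage_summary_row(line, min_uptime_s=300):
    """Classify one neighbor row (hypothetical helper for illustration).

    Returns (peer, verdict), or None if the line is not a neighbor row
    in the assumed format (for example the header line).
    """
    m = ROW.match(line.strip())
    if m is None:
        return None
    peer, updown, state = m["peer"], m["updown"], m["state"]

    # A word (Idle, Active, ...) in the last column means not established.
    if not state.isdigit():
        return peer, f"not established ({state})"

    # Established, but the peer has sent no prefixes.
    if int(state) == 0:
        return peer, "established but 0 prefixes received"

    # hh:mm:ss means the session has been up for less than a day;
    # longer uptimes print in a day/week form and are treated as stable.
    parts = updown.split(":")
    if len(parts) == 3 and all(p.isdigit() for p in parts):
        hours, minutes, seconds = (int(p) for p in parts)
        if hours * 3600 + minutes * 60 + seconds < min_uptime_s:
            return peer, "recently reset (possible flap)"

    return peer, "ok"
```

Feeding it each line of a saved summary and printing every verdict other than "ok" gives a quick shortlist of peers to inspect in Step 2.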
The Cascade Pattern
When a BGP session carrying a large number of prefixes drops, the reconvergence creates a CPU spike on every router that receives the withdrawals. If that spike delays OSPF hello processing past the dead interval, internal OSPF adjacencies drop as well. The outage appears total (internet plus internal), but the root cause is a single BGP session.
Always check BGP first when multiple systems fail simultaneously. OSPF is usually the secondary victim, not the primary cause.
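One way to test the "BGP first, OSPF second" hypothesis is to order the adjacency-loss messages by time. The sketch below is a rough illustration in Python: it assumes log timestamps have already been converted to seconds and that messages carry the IOS syslog mnemonics `%BGP-5-ADJCHANGE` and `%OSPF-5-ADJCHG`. The function name `first_failure` and the `(seconds, message)` tuple shape are hypothetical.

```python
import re

# IOS-style syslog mnemonics; other platforms log different strings.
BGP_DOWN = re.compile(r"%BGP-5-ADJCHANGE: neighbor \S+ Down")
OSPF_DOWN = re.compile(r"%OSPF-5-ADJCHG: .* to DOWN")


def first_failure(events):
    """Return 'bgp', 'ospf', or None for the earliest adjacency loss.

    events: iterable of (seconds, message) tuples, in any order.
    The timestamps are assumed to be pre-parsed for this sketch.
    """
    for _, message in sorted(events, key=lambda e: e[0]):
        if BGP_DOWN.search(message):
            return "bgp"
        if OSPF_DOWN.search(message):
            return "ospf"
    return None
```

If the result is "bgp" and the OSPF drops follow within roughly one dead interval, the cascade described above is the likely explanation; if OSPF dropped first, the internal network deserves attention before the peering does.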
OSPF Adjacency Failures
Why OSPF Fails in Production (Real Causes)
| Cause | Symptom | How to Confirm |
|---|---|---|
| Hello/Dead timer mismatch | No adjacency forms; hellos with mismatched timers are discarded, so the neighbor never appears | show ip ospf interface on both sides and compare the Hello/Dead intervals |
| MTU mismatch | Neighbors stuck at ExStart/Exchange, never reach Full | show ip interface on both sides and compare MTU; ip ospf mtu-ignore as a temporary workaround |
| Area ID mismatch | No adjacency forms at all | show ip ospf neighbor shows no neighbor entry for that interface; compare the area in show ip ospf interface |
| Authentication mismatch | No adjacency; authentication errors in debug | debug ip ospf adj and look for authentication type or key mismatch messages |
| Network type mismatch | Adjacency may reach Full but routes are missing, or DR/BDR election differs on each side | show ip ospf interface shows Point-to-Point on one side, Broadcast on the other |
! Full OSPF debug sequence for a stuck adjacency
show ip ospf neighbor
! If a neighbor stays in ExStart for more than 30 seconds, suspect an MTU mismatch

show ip ospf interface <interface>
! Shows: Hello/Dead intervals, Area, Network type, Authentication type
! Compare with the neighbor's output; everything must match

! If neighbors never appear at all:
debug ip ospf adj
! Look for: "mismatched hello parameters" = timer/area/auth mismatch
! End debug immediately after reproducing (undebug all); OSPF debug is verbose

! Check database synchronization after the adjacency forms
show ip ospf database | include Router|Network
! If database entries are missing compared to other routers = partial sync
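The "compare with the neighbor's output" step is easy to get wrong by eye. As a minimal sketch, assuming the values have been transcribed by hand from each side's output into plain dictionaries, the comparison could look like this in Python. The field names in `MUST_MATCH` and the function `ospf_mismatches` are hypothetical and do not correspond to any parser library.

```python
# Fields this sketch expects both ends of the link to agree on.
# The key names are illustrative; they are not produced by any tool.
MUST_MATCH = ("area", "hello", "dead", "auth_type", "network_type", "mtu")


def ospf_mismatches(local, remote):
    """Return {field: (local_value, remote_value)} for every disagreement.

    A field missing on one side is reported as a mismatch against None,
    so an incomplete transcription is visible rather than silently ignored.
    """
    return {
        field: (local.get(field), remote.get(field))
        for field in MUST_MATCH
        if local.get(field) != remote.get(field)
    }
```

An empty result means the transcribed parameters agree, which points the investigation away from configuration and toward the link itself or CPU load.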
Route Flapping and BGP Route Dampening
Interface flapping is deadly in routing protocol environments. Every flap triggers SPF recalculation in OSPF and reconvergence in BGP. Each reconvergence cycle causes a forwarding interruption. In high-flap scenarios, the network never converges — it is always reconverging.
! Identify flapping interfaces
show interfaces | include line protocol|input errors|CRC

! Check carrier transitions (times the link went up/down)
show interfaces <interface> | include carrier transitions

! Check BGP route flap history
show bgp flap-statistics
! Penalty column: above 2000 = prefix is suppressed (not advertised)
! Reuse threshold: typically 750; when the penalty drops below this, the prefix returns

! Check route dampening status for a specific prefix
show bgp <prefix> | include dampened|penalty|reuse
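The penalty figures above make more sense with the decay arithmetic written out. The penalty decays exponentially, halving once per half-life. The Python sketch below assumes a 15-minute half-life and uses the 750 reuse threshold quoted above; both are common defaults rather than values read from any device, so check `show bgp dampening parameters` for the real ones. The function names are illustrative.

```python
import math


def penalty_after(penalty, minutes, half_life=15.0):
    """Penalty remaining after `minutes` of exponential decay.

    half_life is in minutes; 15 is an assumed default, not a measured value.
    """
    return penalty * 0.5 ** (minutes / half_life)


def minutes_until_reuse(penalty, reuse=750.0, half_life=15.0):
    """Minutes until the penalty decays to the reuse threshold.

    Returns 0.0 if the penalty is already at or below the threshold.
    Assumes no further flaps occur while the penalty is decaying.
    """
    if penalty <= reuse:
        return 0.0
    return half_life * math.log2(penalty / reuse)
```

For example, a prefix with a penalty of 3000 needs two half-lives to reach 750, so `minutes_until_reuse(3000)` gives 30 minutes under these assumptions. Any new flap during that window adds to the penalty and restarts the wait, which is how a prefix can stay suppressed long after the link has stabilized.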
Dampening Trap