Module 1

Routing Protocol Failures — 'Internet Is Down, Nothing Is Working'

The Call Comes In

  • "Internet is completely down — nothing is reaching us"
  • "We lost connectivity to the entire data center"
  • "VoIP calls are dropping every 10 minutes"
  • "Started after the router upgrade last night"
  • "Core switch was restarted and now half the network is unreachable"

What Makes This Outage Unique

Routing protocol failures cascade. One BGP session dropping can cause an entire ISP edge to reconverge, which spikes CPU on every connected router, which delays OSPF hellos, which drops internal adjacencies, which removes internal routes, which makes servers unreachable — all from one event. The outage looks total but has a single root cause.

BGP Neighbor Failures — The Real Debug Flow

What Engineers Try First (And Why It Fails)

  • Reboot the edge router: this resets the BGP session but does not fix the underlying cause, so the failure usually returns
  • Blame the ISP: the ISP is often delivering connectivity correctly, and the fault is in the BGP peering configuration (AS number, authentication) or the path MTU
  • Check the physical interface: the interface is up, so engineers move on, but the fault is in the BGP session, not the link

Actual Debug Sequence

cisco-ios
! Step 1 — Check BGP summary immediately
show bgp summary
! Look for:
! Up/Down column: 00:00:XX = recently reset (flapping)
! State column: Idle/Active = not established
! Prefixes received: 0 when it should be thousands = session up but no routes

! Step 2 — Check recent BGP events
show bgp neighbors <peer-ip> | include BGP state|uptime|reset
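! Shows exact reason for last reset:
! "BGP Notification sent: Hold time expired" = keepalives stopped arriving (packet loss, MTU, or a busy peer CPU)
! The hold time is negotiated to the lower of the two configured values, so differing timers alone do not cause this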
! "BGP Notification received: OPEN Error" = AS number or capability mismatch

! Step 3 — Check if the problem is MTU
! This is the most missed cause of BGP adjacency failures:
! TCP session for BGP uses large packets — if MTU on the peering link is wrong,
! the SYN/SYN-ACK works (small packets) but data transfer fails (large packets)
ping <peer-ip> size 1500 df-bit
! If this fails but normal ping succeeds — MTU is the cause
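! Fix: correct the MTU so both ends of the peering link agree
! Note: ip tcp adjust-mss rewrites the MSS of transit TCP sessions only; for the router's own
! BGP session, check ip tcp path-mtu-discovery or lower the router's MSS with ip tcp mss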

! Step 4 — Check route flap dampening
show bgp dampening parameters
show bgp dampening flap-statistics
! Dampening can suppress a prefix for hours even after the session recovers
! This is why "the BGP is up but routes are still missing"

! Step 5 — Validate peering config matches ISP
show bgp neighbors <peer-ip> | include remote AS|local AS|hold time|keepalive
! remote-as and the MD5 password must match exactly on both sides
! Hold time does not need to match: BGP uses the lower of the two advertised values
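
Step 1 above can be sketched as a small script for when the summary has to be checked across many routers. This is a minimal sketch, not a parser for every platform: it assumes the common IOS column order (Neighbor, V, AS, MsgRcvd, MsgSent, TblVer, InQ, OutQ, Up/Down, State/PfxRcd), the sample rows are invented for illustration, and the 300-second "recent" threshold is an arbitrary assumption.

```python
import re

# Illustrative rows only (documentation addresses), in the assumed column order:
# Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
SAMPLE_ROWS = """\
192.0.2.1       4 65001   48210   47990   912345    0    0 00:00:42        0
192.0.2.5       4 65002       0       0        1    0    0 never        Idle
198.51.100.9    4 65003  120331  119870   912345    0    0 3w2d       845210
203.0.113.7     4 65004      12      15        1    0    0 00:03:10   Active
"""

HMS = re.compile(r"^(\d+):(\d{2}):(\d{2})$")


def uptime_seconds(text):
    """Return seconds for hh:mm:ss values, None for other forms (1d02h, 3w2d, never)."""
    match = HMS.match(text)
    if not match:
        return None
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def classify(row, recent_threshold=300):
    """Return (neighbor, findings) for one neighbor row of the summary."""
    fields = row.split()
    neighbor, updown, state = fields[0], fields[8], fields[9]
    findings = []
    if not state.isdigit():
        # A state name instead of a prefix count means the session is not Established.
        findings.append(f"not established ({state})")
    else:
        if int(state) == 0:
            findings.append("established but 0 prefixes received")
        secs = uptime_seconds(updown)
        if secs is not None and secs < recent_threshold:
            findings.append("recently reset")
    return neighbor, findings


for line in SAMPLE_ROWS.splitlines():
    peer, notes = classify(line)
    print(peer, "OK" if not notes else "; ".join(notes))
```

Uptimes in the day or week formats are treated as long-lived and never flagged as recent, which keeps the check conservative.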

The Cascade Pattern

When a BGP session drops with a large number of prefixes, the reconvergence creates a CPU spike on every router that receives the withdrawal. If that spike delays OSPF hello processing past the dead interval — internal OSPF adjacencies also drop. The outage appears total (internet + internal) but the root cause is a single BGP session.

Always check BGP first when multiple systems fail simultaneously. OSPF is usually the secondary victim, not the primary cause.
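
One way to confirm the cascade from collected syslog is to order the events and check whether the OSPF drops follow the first BGP drop. The sketch below assumes events have already been reduced to (seconds since midnight, router, message) tuples; the routers, addresses, timestamps, and the 120-second window are hypothetical, and only the `%BGP-5-ADJCHANGE` and `%OSPF-5-ADJCHG` message mnemonics are taken from IOS.

```python
# Hypothetical, hand-built events: (seconds since midnight, router, syslog text).
EVENTS = [
    (7325, "core1", "%OSPF-5-ADJCHG: Process 1, Nbr 10.1.1.2 on Gi0/1 "
                    "from FULL to DOWN, Neighbor Down: Dead timer expired"),
    (7000, "core2", "%LINK-3-UPDOWN: Interface Gi0/3, changed state to up"),
    (7281, "edge1", "%BGP-5-ADJCHANGE: neighbor 192.0.2.1 Down BGP Notification sent"),
    (7330, "core2", "%OSPF-5-ADJCHG: Process 1, Nbr 10.1.1.1 on Gi0/2 "
                    "from FULL to DOWN, Neighbor Down: Dead timer expired"),
]


def find_cascade(events, window=120):
    """Return (first BGP down event, OSPF down events within `window` seconds after it)."""
    ordered = sorted(events)  # tuples sort by timestamp first
    bgp_downs = [e for e in ordered
                 if "%BGP-5-ADJCHANGE" in e[2] and " Down" in e[2]]
    if not bgp_downs:
        return None, []
    trigger = bgp_downs[0]
    followers = [e for e in ordered
                 if "%OSPF-5-ADJCHG" in e[2] and "to DOWN" in e[2]
                 and 0 <= e[0] - trigger[0] <= window]
    return trigger, followers


trigger, followers = find_cascade(EVENTS)
print("Likely trigger:", trigger[1], "at", trigger[0])
for ts, router, _ in followers:
    print(f"  OSPF drop on {router}, {ts - trigger[0]}s later")
```

Ordering alone does not prove causation, so treat a match as a pointer to where to look first rather than a verdict.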

OSPF Adjacency Failures

Why OSPF Fails in Production (Real Causes)

Cause | Symptom | How to Confirm
Hello/Dead timer mismatch | Hellos are rejected, so the neighbor never appears or ages out to Down | show ip ospf interface on both sides, compare Hello and Dead intervals
MTU mismatch | Neighbors stuck at ExStart/Exchange, never reach Full | show ip interface on both sides, compare MTU; ip ospf mtu-ignore as temporary workaround
Area ID mismatch | No adjacency forms at all | show ip ospf neighbor shows no neighbor entry for that interface
Authentication mismatch | No adjacency, authentication errors in debug | debug ip ospf adj, look for authentication mismatch messages
Network type mismatch | Adjacency may reach Full but routes are missing; DR/BDR election is inconsistent | show ip ospf interface shows Point-to-Point on one side, Broadcast on the other
cisco-ios
! Full OSPF debug sequence for stuck adjacency
show ip ospf neighbor
! If neighbor stays in ExStart for more than 30 seconds, suspect an MTU mismatch first

show ip ospf interface <interface>
! Shows: Hello/Dead intervals, Area, Network type, Authentication type
! Compare with neighbor's show output — everything must match

! If neighbors never appear at all:
debug ip ospf adj
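! Look for: "mismatched hello parameters" = Hello/Dead timer or subnet mask mismatch
! Area and authentication mismatches are reported with their own messages
! End debug immediately after reproducing, because OSPF debug is verbose:
undebug all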

! Check for database synchronization after adjacency forms
show ip ospf database | include Router|Network
! If database entries are missing compared to other routers = partial sync
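
The side-by-side comparison described above can be sketched as a simple diff of the parameters that have to agree. This is an illustrative sketch only: the two dictionaries are hypothetical values copied by hand from each router (the MTU is assumed to come from the interface output rather than the OSPF output), and the symptom text is a summary of the causes listed in this section, not device output.

```python
# Hypothetical values transcribed from each side of the link.
LOCAL = {"hello": 10, "dead": 40, "area": "0",
         "network_type": "BROADCAST", "auth": "md5", "mtu": 1500}
REMOTE = {"hello": 10, "dead": 40, "area": "0",
          "network_type": "BROADCAST", "auth": "md5", "mtu": 9000}

# Symptom each mismatch is expected to produce, per the causes in this section.
EXPECTED_SYMPTOM = {
    "hello": "hellos rejected, neighbor never listed",
    "dead": "hellos rejected, neighbor never listed",
    "area": "hellos rejected, neighbor never listed",
    "auth": "hellos rejected, authentication errors in debug",
    "mtu": "neighbor stuck in ExStart/Exchange",
    "network_type": "adjacency may form but routes can be missing",
}


def compare(local, remote):
    """Return (parameter, local value, remote value, expected symptom) per mismatch."""
    return [(key, local[key], remote[key], EXPECTED_SYMPTOM[key])
            for key in EXPECTED_SYMPTOM
            if local[key] != remote[key]]


for key, ours, theirs, symptom in compare(LOCAL, REMOTE):
    print(f"{key}: local={ours} remote={theirs} -> {symptom}")
```

With the sample values only the MTU differs, which matches the ExStart symptom checked at the start of the debug sequence.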

Route Flapping and BGP Route Dampening

Interface flapping is deadly in routing protocol environments. Every flap triggers SPF recalculation in OSPF and reconvergence in BGP. Each reconvergence cycle causes a forwarding interruption. In high-flap scenarios, the network never converges — it is always reconverging.

cisco-ios
! Identify flapping interfaces
show interfaces | include line protocol|input errors|CRC

! Check carrier transitions (times the link went up/down)
show interfaces <interface> | include carrier transitions

! Check BGP route flap history
show bgp flap-statistics
! Penalty column: above the suppress limit (2000 by default) = prefix is suppressed (not advertised)
! Reuse limit: 750 by default; when the penalty decays below this, the prefix returns

! Check route dampening status for specific prefix
show bgp <prefix> | include dampened|penalty|reuse

Dampening Trap

BGP route dampening suppresses flapping prefixes to protect the network, but the prefix stays suppressed until its penalty decays below the reuse limit, typically 30-60 minutes after the flapping stops with default timers. Engineers fix the physical interface issue and wonder why BGP is still not advertising the route. Always check flap statistics and clear dampening explicitly after fixing the root cause (on Cisco IOS, for example, with clear ip bgp dampening).
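
The 30-60 minute figure follows from how the penalty decays: it halves every half-life until it falls below the reuse limit. The sketch below works through that arithmetic under assumed defaults (half-life 15 minutes, reuse limit 750, maximum suppress time 60 minutes); the function name is hypothetical, and the values on a given router should be confirmed with the dampening parameters output before relying on the result.

```python
import math


def minutes_until_reuse(penalty, reuse=750, half_life=15.0, max_suppress=60.0):
    """Estimate minutes until an already-suppressed prefix is reusable.

    Assumes pure exponential decay with no further flaps. The defaults are
    assumptions; check the configured values on the router.
    """
    if penalty <= reuse:
        return 0.0
    # The penalty is capped so suppression never exceeds max_suppress.
    ceiling = reuse * 2 ** (max_suppress / half_life)
    penalty = min(penalty, ceiling)
    # penalty * 0.5 ** (t / half_life) == reuse, solved for t.
    return half_life * math.log2(penalty / reuse)


for value in (2500, 3000, 6000, 50000):
    print(value, round(minutes_until_reuse(value), 1), "min")
```

A penalty of 3000 needs two half-lives to reach 750, so about 30 minutes; anything at or above the ceiling of 12000 waits the full 60 minutes unless dampening is cleared manually.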