QoS Misconfiguration & Interface Flapping — 'VoIP Breaks When Busy'

Two Failure Modes, One Shift

QoS misconfiguration and interface flapping often surface together during high-traffic events. A link that flaps under load looks like a hardware problem but may be a configuration or environmental issue. VoIP that only drops during business hours looks like a capacity problem but is often a queue configuration that was never validated under real load.

"VoIP calls are crystal clear in the morning, choppy and dropping after 9am"
"Interface going up/down cycling — once every few minutes"
"Video conferencing is degraded but file transfers are fine — same time of day"
"We upgraded bandwidth but VoIP is still dropping at peak"
"Link was stable for months, now cycling since the firmware update"
"Specific site loses connectivity periodically — no pattern we can find"

QoS Misconfiguration — Why It Only Breaks Under Load

QoS queuing is invisible when bandwidth is available. Every packet gets forwarded immediately regardless of queue configuration when the link is not congested. The moment utilization approaches capacity, queuing decisions start to matter. Bad QoS configuration has zero impact at 30% utilization and catastrophic impact at 85%.
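
To make the load dependence concrete, the sketch below estimates how long a voice packet waits in a single FIFO queue behind data packets. The 10 Mbps link speed, the 1500-byte data packets, and the queue depths are illustrative assumptions, not values from any particular network.

```python
def fifo_wait_ms(packets_ahead: int, packet_bytes: int, link_bps: float) -> float:
    """Milliseconds needed to serialize the packets already queued ahead."""
    bits_ahead = packets_ahead * packet_bytes * 8
    return bits_ahead / link_bps * 1000.0


if __name__ == "__main__":
    link_bps = 10_000_000  # assumed 10 Mbps uplink
    for depth in (0, 5, 40):  # assumed queue depths: idle, light load, congested
        wait = fifo_wait_ms(depth, 1500, link_bps)
        print(f"{depth:>2} data packets ahead: voice waits {wait:.1f} ms")
```

With an empty queue the wait is zero, which is why a bad policy goes unnoticed at low utilization. Once dozens of full-size packets are queued ahead, the added delay and its variation become large enough to matter for voice.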

The QoS Validation Gap

Most QoS configurations are validated in lab environments at low utilization and declared working. The problem only appears under real production load — at 9am when all employees arrive, during a backup window, or during a major file transfer. "It worked in testing" is true — it worked when queuing was not needed. Under load, the queue behavior determines performance.

Understanding Where Voice Fails

QoS Condition	Effect on VoIP	What You Hear
No QoS — all traffic in one queue	VoIP packets wait behind large data transfers	Choppy audio, one-way audio, dropped calls under load
Wrong DSCP marking — voice marked as best-effort	VoIP gets no priority treatment even if QoS is configured	Same as no QoS — voice suffers when data is competing
Correct DSCP but wrong queue — voice in wrong priority queue	Voice gets queued but wrong priority level — inconsistent behavior	Audio is OK at low load, degrades under moderate load
Priority queue starving data queues (over-provisioned voice)	Data traffic cannot get through — users report slow file transfers alongside working voice	VoIP works, everything else slows dramatically
Jitter buffer exhaustion	Packets arrive with variable delay — buffer cannot absorb it	Robotic or warbling audio — the classic VoIP quality complaint

Debug Sequence — Finding QoS Failures

cisco-ios

! Step 1: Check current queue statistics — the most important command
show policy-map interface GigabitEthernet0/0
! Shows for each class:
! Class-map: VOICE
! packets: 50000
! bytes: 8000000
! rate: 1000000 bps
! Match: dscp ef (46)
! Queueing: Strict Priority
! queue limit: 64 packets
! (output) queue drops: 0 ← this is what matters
! (tail) drop: 0 ← drops = voice is being discarded

! In a voice quality problem: look for queue drops in the VOICE class
! If drops are non-zero: VoIP packets are being discarded = quality issues

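! Step 2: Check that the VOICE class matches DSCP EF and is counting packets
! Voice should be marked DSCP EF (Expedited Forwarding = decimal 46)
show policy-map interface GigabitEthernet0/0 | include Class-map|Match|packets
! If the VOICE packet counter stays flat during active calls, voice is arriving with the wrong marking

! To confirm voice protocols are present on the link, use NBAR
! (requires "ip nbar protocol-discovery" on the interface):
show ip nbar protocol-discovery
! Shows what traffic NBAR is classifying. NBAR identifies protocols, not DSCP values;
! use a packet capture to read the actual markings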

! Step 3: Check interface utilization at peak times
show interface GigabitEthernet0/0 | include input rate|output rate
! If output rate is near line rate when VoIP quality degrades:
! Congestion is confirmed — QoS configuration is what determines voice behavior here

! Step 4: Check queue depths during congestion
! Poll this command during peak load:
show queueing interface GigabitEthernet0/0
! Look for: queue depth increasing for specific classes
! A queue that is constantly at its maximum depth = packets being dropped from that class

! Step 5: Check if trust is configured properly (Catalyst switches that use "mls qos")
show mls qos interface GigabitEthernet0/0
! Check: trust dscp is enabled (not trust cos or untrusted)
! If QoS is enabled globally and the port is untrusted, incoming DSCP markings are rewritten to 0 = no priority treatment
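
When markings are checked in a packet capture, many tools show the whole ToS byte rather than the DSCP value. The sketch below converts between the two for the code points used in this module. The helper names are hypothetical; the numeric values are the standard DSCP assignments.

```python
# Standard DSCP code points referenced in this module
DSCP = {"ef": 46, "af41": 34, "cs3": 24, "default": 0}


def dscp_to_tos(dscp: int) -> int:
    """DSCP is the upper 6 bits of the ToS byte; the lower 2 bits are ECN."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp << 2


def tos_to_dscp(tos: int) -> int:
    """Recover the DSCP value from a ToS byte seen in a capture."""
    if not 0 <= tos <= 255:
        raise ValueError("ToS must be in 0..255")
    return tos >> 2


if __name__ == "__main__":
    for name, value in DSCP.items():
        print(f"{name:<8} dscp={value:<3} tos=0x{dscp_to_tos(value):02X}")
```

A voice packet that shows ToS 0xB8 in the capture is carrying EF; a ToS of 0x00 on RTP traffic means the marking was never set or was rewritten somewhere along the path.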

Fixing QoS for Voice — What Actually Works

cisco-ios

! Correct QoS policy structure for voice traffic:

! Step 1: Create class map that matches voice DSCP marking
class-map match-any VOICE
  match dscp ef
  ! ef = Expedited Forwarding, DSCP 46 — standard voice marking

class-map match-any VIDEO
  match dscp af41
  ! af41 = Assured Forwarding — video conferencing

class-map match-any SIGNALING
  match dscp cs3
  ! cs3 = Class Selector 3 — call signaling (SIP, H.323)

! Step 2: Create policy that prioritizes voice
policy-map QOS-POLICY
  class VOICE
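    priority percent 20
    ! Reserve 20% of bandwidth for voice, always served first
    ! During congestion the priority class is policed to this rate, so size it for peak concurrent calls
    ! Keep priority traffic to roughly 33% of the link or less (a common design guideline);
    ! beyond that, the other classes see increased delay and loss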
  class VIDEO
    bandwidth percent 30
    ! Guaranteed minimum for video
  class SIGNALING
    bandwidth percent 5
  class class-default
    fair-queue
    ! Remaining bandwidth shared fairly

! Step 3: Apply policy to the WAN/uplink interface
interface GigabitEthernet0/0
  service-policy output QOS-POLICY
  ! Always apply on output — this is where congestion occurs

! Step 4: Configure IP phones to mark their own traffic
! On the switch port connected to a Cisco IP phone:
interface FastEthernet1/0/10
  mls qos trust dscp
  ! Trust the DSCP markings coming from the phone
  ! Without this, the phone's markings are ignored
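
Whether 20% is the right size for the priority class depends on how many calls run at once. The sketch below estimates that for G.711 with 20 ms packetization, counting IP, UDP, and RTP headers (40 bytes) but not Layer 2 overhead. The 10 Mbps link speed is an assumption for illustration, and the function names are hypothetical.

```python
def g711_call_bps(packetization_ms: int = 20, l3_overhead_bytes: int = 40) -> float:
    """Layer 3 bandwidth of one direction of a G.711 call (64 kbps codec)."""
    payload_bytes = 64_000 / 8 * packetization_ms / 1000  # 160 bytes at 20 ms
    packets_per_sec = 1000 / packetization_ms             # 50 pps at 20 ms
    return (payload_bytes + l3_overhead_bytes) * 8 * packets_per_sec


def max_calls(link_bps: float, priority_percent: float, call_bps: float) -> int:
    """Concurrent calls that fit inside the priority class allocation."""
    return int(link_bps * priority_percent / 100 // call_bps)


if __name__ == "__main__":
    per_call = g711_call_bps()
    calls = max_calls(10_000_000, 20, per_call)  # assumed 10 Mbps link, 20% priority
    print(f"Per call: {per_call / 1000:.0f} kbps, calls in priority class: {calls}")
```

If the peak number of concurrent calls exceeds this estimate, drops in the VOICE class under congestion are expected even though the policy itself is structured correctly. Layer 2 framing adds to the per-call figure, so the real limit is somewhat lower.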

Interface Flapping — The Link That Keeps Cycling

An interface that cycles up and down (flapping) is one of the most disruptive network events. Every flap triggers routing protocol reconvergence, breaks active sessions through the interface, and can tear down the OSPF adjacencies and BGP sessions that run across the link.

Identifying and Categorizing the Flap

cisco-ios

! Check carrier transitions — times the link physically went up/down:
show interface GigabitEthernet0/0 | include carrier transitions|line protocol

! Check the log for flap events:
show log | include changed state|line protocol
! Look for:
! "GigabitEthernet0/0 changed state to down"
! "GigabitEthernet0/0 changed state to up"
! Timestamp pattern — regular interval vs random = different causes

! Check error counters on the flapping interface:
show interface GigabitEthernet0/0
! Look for:
! Input errors: high CRC or frame errors = physical/cabling problem
! Output errors: drops or queue failures = bandwidth or hardware issue
! Runts/Giants: frame size errors = duplex mismatch or hardware fault

! Check for duplex/speed mismatch:
show interface GigabitEthernet0/0 | include duplex|speed
! "Half-duplex" on a GigabitEthernet = forced half or auto-negotiation failure
! Duplex mismatch causes errors and eventual flapping under load
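
The log check above distinguishes flaps at a regular interval from flaps at random times. A rough way to make that call from the timestamps is to compare the spread of the gaps between link-down events to their average. In the sketch below the input is a list of event times in seconds (already extracted from the log), and the 0.2 cut-off is an assumed value for illustration, not a vendor threshold.

```python
from statistics import mean, pstdev


def flap_intervals(down_times_s: list[float]) -> list[float]:
    """Seconds between consecutive link-down events (input sorted ascending)."""
    return [b - a for a, b in zip(down_times_s, down_times_s[1:])]


def classify_flaps(down_times_s: list[float], cv_threshold: float = 0.2) -> str:
    """Label the pattern using the coefficient of variation of the gaps."""
    gaps = flap_intervals(down_times_s)
    if len(gaps) < 2 or mean(gaps) == 0:
        return "insufficient data"
    cv = pstdev(gaps) / mean(gaps)
    return "regular" if cv <= cv_threshold else "irregular"


if __name__ == "__main__":
    print(classify_flaps([0, 180, 362, 540, 721]))   # gaps close to 3 minutes
    print(classify_flaps([0, 45, 610, 700, 2400]))   # gaps vary widely
```

The label is only a starting point for choosing which root cause pattern to test first; a handful of events is too few to draw firm conclusions.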

Root Cause Patterns for Interface Flapping

Cause	Log Pattern	Confirming Test
Physical cable fault — bad connector, damaged fiber, kink	Flapping correlates with physical movement (HVAC, building vibration)	Replace cable — if flapping stops immediately, cable was the cause
SFP/transceiver failure	Flapping on fiber link, optics show low receive power	show interface transceiver — check Rx power levels against threshold
Duplex mismatch	Many CRC errors, half-duplex detected, errors increase with load	Configure both sides identically: both auto-negotiating (required for 1000BASE-T) or both hard-set to the same speed/duplex
Power instability on PoE port	Flapping correlates with power consumption events (phone call initiation)	show power inline — check power draw vs budget. Increase PoE budget or move to dedicated circuit
Keepalive failure (Layer 2 keepalive or BFD)	Interface stays physically up but line protocol goes down: loopback issue or BFD problem	show interface: line protocol down while hardware is up = keepalive timeout

The Optic Power Check — Fiber Interface Diagnosis

cisco-ios

! Check optical signal strength — most specific test for fiber link flapping
show interface GigabitEthernet0/0 transceiver
! Shows:
! Tx Power: -2.5 dBm   ← transmit power (what this end is sending)
! Rx Power: -12.8 dBm  ← receive power (what this end is receiving)
! Temperature, Voltage, Current

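! Interpreting Rx power (limits vary by optic type and reach; check the transceiver datasheet):
! Example operating range for a typical SFP: -3 dBm to -20 dBm
! Above the upper limit = too much light = receiver overload, possible cause of errors
! Below the lower limit = too little light = cable or SFP failure
! Near the lower limit and fluctuating = intermittent fiber, likely to cause flapping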

! If Rx power is near threshold — check the fiber path:
! Clean the SFP ferrule and patch cable connector (fiber contamination is common)
! Test with a known-good fiber patch cable
! Test with a known-good SFP from spare inventory

! Check optical DOM (Digital Optical Monitoring) thresholds:
show interface transceiver detail
! Shows alarm thresholds — compare current values to alarm levels
! If current Rx is within 2 dB of an alarm threshold: link is marginal and prone to flapping
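
The margin check described above can be written down as a small helper. This is a sketch, assuming the readings and alarm thresholds have been copied by hand from the transceiver output; the function names are hypothetical, and the -20 dBm and -3 dBm limits in the example are the illustrative figures from this section, not values for a specific optic.

```python
def rx_margin_db(rx_dbm: float, low_alarm_dbm: float) -> float:
    """Margin in dB between the current Rx power and the low alarm threshold."""
    return rx_dbm - low_alarm_dbm


def classify_rx(rx_dbm: float, low_alarm_dbm: float, high_alarm_dbm: float,
                marginal_db: float = 2.0) -> str:
    """Classify an Rx reading against the DOM alarm thresholds."""
    if rx_dbm > high_alarm_dbm:
        return "too high"
    if rx_dbm < low_alarm_dbm:
        return "too low"
    if rx_margin_db(rx_dbm, low_alarm_dbm) <= marginal_db:
        return "marginal"
    return "ok"


if __name__ == "__main__":
    for reading in (-12.8, -18.5, -22.0, -1.0):
        print(f"Rx {reading:6.1f} dBm: {classify_rx(reading, -20.0, -3.0)}")
```

Note that the margin is a difference of two dBm values and is therefore expressed in dB. A reading classified as marginal is the case worth polling repeatedly, since a fluctuating value near the threshold matches the intermittent-fiber pattern.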

Damping Interface Flaps at the Routing Protocol Level

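While the physical issue is being investigated, interface dampening (IP Event Dampening in IOS) keeps the flapping interface from destabilizing the entire routing domain: the interface is withheld from the routing protocols until it has been stable long enough. This is a temporary measure, not a substitute for fixing the physical cause.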

cisco-ios

! Configure interface dampening to reduce routing protocol impact:
interface GigabitEthernet0/0
  dampening

! Default dampening values (IOS IP Event Dampening):
! Half-life: 5 seconds (penalty decays by half every 5 s when link is stable)
! Reuse threshold: 1000 (interface is made available again when penalty drops below this)
! Suppress threshold: 2000 (interface is suppressed when penalty exceeds this)
! Max suppress time: 20 seconds

! Custom dampening for more aggressive suppression
! Argument order: half-life (s), reuse threshold, suppress threshold, max suppress time (s)
interface GigabitEthernet0/0
  dampening 5 750 2000 20

! View current dampening status:
show interface GigabitEthernet0/0 dampening
! Shows: penalty value, flap count, suppressed status

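! On BGP, route flap dampening suppresses individual prefixes that flap:
! (in router bgp config)
bgp dampening
! Penalizes prefixes that are repeatedly withdrawn and re-advertised by eBGP peers
! IOS defaults: half-life 15 min, reuse 750, suppress 2000, max suppress 60 min
! This is separate from interface dampening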
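
How the interface dampening penalty builds up and decays is easier to see with numbers. The sketch below models exponential decay between flaps. It assumes a penalty of 1000 per flap, a 5-second half-life, a reuse threshold of 1000, and a suppress threshold of 2000, based on Cisco's documented IP Event Dampening behavior; treat these as assumptions and confirm them on the platform in use. The maximum suppress time is not modeled.

```python
def simulate_dampening(flap_times_s, half_life_s=5.0, suppress=2000.0,
                       reuse=1000.0, penalty_per_flap=1000.0):
    """Return (time, penalty, suppressed) after each flap event."""
    penalty = 0.0
    suppressed = False
    last_t = None
    history = []
    for t in flap_times_s:
        if last_t is not None:
            # Penalty halves for every half-life that passed since the last flap
            penalty *= 0.5 ** ((t - last_t) / half_life_s)
            if suppressed and penalty < reuse:
                suppressed = False
        penalty += penalty_per_flap
        if penalty > suppress:
            suppressed = True
        history.append((t, round(penalty, 1), suppressed))
        last_t = t
    return history


if __name__ == "__main__":
    print("Flaps 2 s apart: ", simulate_dampening([0, 2, 4]))
    print("Flaps 60 s apart:", simulate_dampening([0, 60, 120, 180]))
```

Under these assumptions, flaps a few seconds apart push the penalty past the suppress threshold by the third event, while flaps a minute apart decay almost completely in between and never trigger suppression. For a link that cycles every few minutes, that suggests the half-life needs to be tuned to the observed flap interval before dampening has any effect.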


A physical flap tears down OSPF adjacencies on that interface immediately, and with fast external fallover (enabled by default in IOS) directly connected eBGP sessions reset as soon as the interface goes down. Where the link stays up but traffic is lost, the timers take over: the OSPF dead interval is typically 40 seconds, and the IOS BGP hold time defaults to 180 seconds. Apply dampening as soon as flapping is confirmed, then investigate and fix the physical cause. A flapping interface that routes critical traffic is a major outage risk.

Course Summary — The Pattern Behind All Five Modules

Every production network failure in this course follows the same pattern: a specific trigger creates a specific failure signature, and the debug sequence follows the failure signature — not the complaint. The complaint is always "the network is broken." The failure signature is specific.

Module	Failure Signature	First Debug Command
BGP/OSPF Failures	Multiple systems fail simultaneously — internet AND internal	show bgp summary — check session state and uptime
MTU/MSS Mismatch	Large transfers fail, small ones succeed — works on LAN, breaks on VPN	ping <dest> -f -l 1472 — confirm MTU threshold
Spanning Tree Loop	All interfaces show up and passing traffic, but CPU is at 100%	show interfaces — check for wire-rate broadcast traffic
DHCP/DNS Failures	Users get 169.254.x.x or cannot resolve names — routing is fine	show ip dhcp pool — check pool utilization and bindings
QoS/Interface Flapping	Voice quality degrades at peak hours — link cycling	show policy-map interface — check queue drops during load

Course Complete

You have completed Network Troubleshooting in Production. Every module covered real production failure patterns — the symptoms that come in as vague complaints, the wrong first steps that waste time, and the specific debug sequence that finds the actual root cause. These are the failures that engineers encounter repeatedly in production and that take years of fieldwork to develop reliable intuition for. The debug sequences in this course replace that intuition with methodology.
