Prevention — Stop These Outages Before They Happen

TAC's View on Why Outages Happen

After working hundreds of production outage cases, a pattern becomes clear: most outages are not caused by unknown failure modes. They are caused by known, documented risks that were not validated before the change was applied. The five outage scenarios in this course — asymmetric routing, NAT hairpin, App-ID shift, SSL decryption, SSL VPN — each have a pre-change validation sequence that would have caught the problem before it became an emergency.

This module covers those sequences. Apply them before every significant firewall change and most of the calls in the previous modules will never happen.

Pre-Change Validation — Universal Steps

These steps apply before any significant NGFW configuration change:

pan-os-cli

# 1. Capture baseline — take a snapshot of current state
# The PAN-OS CLI has no shell-style output redirection: enable session
# logging in your SSH client and save the output of each command
show system resources
show session info
show counter global filter delta yes
show high-availability state

# 2. Confirm HA state is healthy before change
show high-availability state
# Active/passive: local and peer must show State: active and State: passive
# (active/active: active-primary and active-secondary)
# Running Configuration must show: synchronized
# Do NOT apply changes when HA is in a degraded or split-brain state

# 3. Test current working behavior BEFORE change
# Define a test that proves the change worked
# Example: "internal user can reach external-server.com on port 443"
# Document the current passing result — this is your rollback benchmark

# 4. Know your rollback path
# Take a config snapshot before every change
# From configuration mode, save a named snapshot:
save config to <name>
# Or via Panorama: create a named configuration snapshot

# 5. Apply during maintenance window with user impact acknowledged
# Never apply security policy or NAT changes during business hours
# without explicit written authorization
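Step 3 asks for a repeatable test whose result is recorded before the change. The following is a minimal Python sketch of such a test, assuming that a completed TCP handshake is an adequate proxy for "internal user can reach external-server.com on port 443". The function name `tcp_reachable` is hypothetical and not part of any firewall tooling.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failure, refusal, timeout, and unreachable networks.
        return False
```

Run `tcp_reachable("external-server.com", 443)` from a host in the affected subnet before the change, record the result, and run it again afterwards. A handshake only shows the port is reachable. It does not show that the application behind it works.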

Routing Change Validation — Prevent Asymmetric Routing

Before Any Routing Change

pan-os-cli

# Validate current routing table
show routing route

# For every subnet that will be affected by the new route:
# Test what path traffic currently takes
test routing fib-lookup virtual-router default ip <subnet-gateway>

# For HA environments — validate session owner distribution
show session info | match "owner"

# Confirm no existing ECMP that could be affected
show routing route | match "flags.*E"

# After routing change — immediately test return path
# This is what most engineers skip:
# From the SERVER side: traceroute back to the client subnet
# The path must pass through the firewall both ways

The Return Path Test — The Most Skipped Step

Why This Gets Skipped

Engineers test that traffic reaches the destination. They verify the server responds. They call the change successful. What they do not verify is whether the server's response takes the same path back through the firewall. This is the asymmetric routing trap. Always trace the return path explicitly from the server back to the client — not just the forward path.

Run traceroute from the server to a client IP in each affected subnet
Verify the firewall's inside interface IP appears in the path
If it does not — the return path is bypassing the firewall
Fix routing on the server gateway or upstream router before finalizing the change
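The check in the steps above can be partly automated. The sketch below assumes you have saved the text output of traceroute (or Windows tracert) run from the server. The function name `path_traverses_firewall` and the sample addresses are hypothetical.

```python
import re

# Matches dotted-quad IPv4 addresses; latency values such as "0.412" do not match.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def path_traverses_firewall(traceroute_output: str, firewall_ip: str) -> bool:
    """Return True if firewall_ip appears as a hop in saved traceroute output."""
    hops = []
    for line in traceroute_output.splitlines():
        # Skip the header line: it names the destination, not a hop.
        if line.strip().lower().startswith(("traceroute", "tracing route")):
            continue
        hops.extend(IPV4.findall(line))
    return firewall_ip in hops
```

Pass the firewall's inside interface IP as `firewall_ip`. A result of False for any affected client subnet means the return path bypasses the firewall and the change should not be finalized.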

NAT Change Validation

pan-os-cli

# Before modifying NAT rules — identify all traffic that currently uses them
show running nat-policy

# Test the existing NAT match for affected traffic BEFORE change
test nat-policy-match from trust to untrust source 10.1.1.100 destination 203.0.113.10 protocol 6 destination-port 443

# Document the result — use this to verify new rules match the same traffic

# After change — test again with identical parameters
test nat-policy-match from trust to untrust source 10.1.1.100 destination 203.0.113.10 protocol 6 destination-port 443

# For hairpin changes: test BOTH external and internal source
# External client reaching the public IP:
test nat-policy-match from untrust to untrust source 203.0.113.50 destination 203.0.113.10 protocol 6 destination-port 443

# Internal client reaching the same public IP (the hairpin case):
test nat-policy-match from trust to untrust source 10.1.1.100 destination 203.0.113.10 protocol 6 destination-port 443
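Comparing the before and after results by eye is error-prone. One possible approach, sketched here in Python, is to save the output of each test as text and diff the two captures. The helper names are hypothetical. The sketch makes no assumption about the exact output format beyond it being line-oriented text.

```python
import difflib


def normalize(output: str) -> list[str]:
    """Collapse whitespace and drop blank lines so cosmetic differences are ignored."""
    return [" ".join(line.split()) for line in output.splitlines() if line.strip()]


def nat_match_diff(before: str, after: str) -> list[str]:
    """Return unified-diff lines between two captures; an empty list means no change."""
    return list(
        difflib.unified_diff(
            normalize(before),
            normalize(after),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

An empty result means the same rule and translation matched before and after. Any non-empty diff should be explainable by the change you intended to make.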

SSL Decryption Rollout Validation

Never deploy SSL decryption to all users at once. This is the single most impactful change a security team can make and the one most likely to cause widespread application failures.

Phase	Action	Success Criteria
Phase 1	Deploy Forward Trust CA to all devices and browsers via GPO/MDM	100% of devices trust the CA certificate — verify in browser trust store
Phase 2	Enable decryption for pilot group (IT/security staff)	All pilot users can access all business applications normally for 5 business days
Phase 3	Build exclude list from pilot findings	All pinned/mTLS applications identified and excluded — zero unexplained failures
Phase 4	Expand to one department at a time	Each department runs for 3 days — monitor decryption error logs daily
Phase 5	Full production rollout	Decryption error rate below 1% — monitor for 5 business days

pan-os-cli

# Daily monitoring command during rollout
# Run every morning during expansion phases
# The PAN-OS CLI does not expand $(date ...): type today's date in start-time
show log decryption direction equal forward start-time equal <YYYY/MM/DD>@00:00:00 | match "error|fail|unsupported"

# Count decryption errors by destination to identify new exclude candidates
# If same destination appears 5+ times with errors — add to exclude list

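# Check total decryption volume vs errors
# Run the same query without the error filter, then compare the number
# of entries against the error query in your saved session output
show log decryption direction equal forward start-time equal <YYYY/MM/DD>@00:00:00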

# Healthy ratio: errors < 1% of total decrypted sessions
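The two thresholds above (5 or more errors for one destination, and an overall error ratio below 1%) are simple to compute once the log entries are exported. The sketch below assumes the entries have already been parsed into (destination, is_error) pairs. The `DecryptEvent` type and `summarize` function are hypothetical names, and parsing the log export is left out.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class DecryptEvent(NamedTuple):
    destination: str
    is_error: bool


def summarize(events: Iterable[DecryptEvent],
              min_errors: int = 5,
              max_ratio: float = 0.01) -> dict:
    """Compute the error ratio and list destinations with min_errors or more errors."""
    events = list(events)
    errors = Counter(e.destination for e in events if e.is_error)
    total = len(events)
    ratio = sum(errors.values()) / total if total else 0.0
    return {
        "error_ratio": ratio,
        "healthy": ratio < max_ratio,
        "exclude_candidates": sorted(d for d, n in errors.items() if n >= min_errors),
    }
```

Destinations listed under `exclude_candidates` are candidates only. Confirm that each one uses certificate pinning or mTLS before adding it to the exclude list.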

Content Update Validation — Prevent App-ID Outages

pan-os-cli

# Before applying a content update:
# Download but do not install
request content upgrade download latest

# Review the changelog in the support portal for:
# - New applications added to the App-ID database
# - Applications where identification was improved
# - Any application that your security rules reference

# After applying update: immediately check for new denies
# (the PAN-OS CLI does not expand $(date ...): type today's date in start-time)
show log traffic action equal deny start-time equal <YYYY/MM/DD>@00:00:00

# Compare application names in deny logs against your security rules
# Any new application name appearing in deny logs that was previously
# passing as ssl or web-browsing = content update triggered App-ID shift

# Run policy optimizer to identify rules affected by App-ID changes
# (Available in GUI: Policies > Policy Optimizer > Unused Applications)
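The comparison of application names described above reduces to a set difference. The sketch below assumes you have already extracted two sets of application names from the traffic logs: those seen before the update (allowed or denied) and those in deny logs after it. The function name and the sample application names are hypothetical.

```python
def app_id_shift_candidates(seen_before: set[str], denied_after: set[str]) -> set[str]:
    """Return application names that appear in deny logs only after the update.

    A name that was never logged before the update but is now being denied
    suggests traffic was re-identified, for example from ssl or web-browsing
    to a more specific App-ID that no rule allows.
    """
    return denied_after - seen_before


# Hypothetical example: bittorrent was already being denied, ms-teams is new.
before = {"ssl", "web-browsing", "dns", "bittorrent"}
after = {"ms-teams", "bittorrent"}
print(sorted(app_id_shift_candidates(before, after)))
```

Each name returned should be checked against the content update changelog and against the security rules that previously allowed the traffic.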

Automate Content Updates Carefully

Automated content updates are convenient but can cause unplanned App-ID shifts. Best practice: test content updates on a staging firewall or during a defined maintenance window, review the changelog, then apply to production. In critical environments, some organizations delay content updates by one week and monitor for reports of App-ID changes affecting policy behavior before applying.

SSL VPN Certificate Management

Certificate expiry is the most preventable GlobalProtect (GP) outage. Set up the checks below once and the emergency call never happens.

pan-os-cli

# Check all certificate expiry dates
show certificate all | match "not-valid-after|name"

# For each GP-related certificate, note the expiry date
# Set calendar reminders at 90, 60, and 30 days before expiry

# Validate the full certificate chain before installing new cert
# (Do this on a test machine before production deployment)
openssl s_client -connect portal.company.com:443 -showcerts

# Verify chain output shows:
# Certificate chain
#  0 s:CN=portal.company.com    ← leaf cert
#    i:CN=Intermediate CA
#  1 s:CN=Intermediate CA        ← intermediate
#    i:CN=Root CA
# Leaf and intermediate must both be present, in this order; the root
# normally comes from the client's trust store, not from the server

# After installing new certificate — test GP connection from outside
# Using a non-corporate device (personal phone) confirms what a
# remote user sees: if it connects, the cert chain is complete
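The 90, 60, and 30 day reminders can be computed from the expiry date instead of being entered by hand. This is a minimal Python sketch, assuming you have already read each certificate's expiry date from the firewall. The function names are hypothetical.

```python
from datetime import date, timedelta

REMINDER_DAYS = (90, 60, 30)


def reminder_dates(expiry: date) -> list[date]:
    """Return the calendar dates 90, 60, and 30 days before expiry."""
    return [expiry - timedelta(days=d) for d in REMINDER_DAYS]


def due_reminders(expiry: date, today: date) -> list[int]:
    """Return the reminder thresholds (in days) that have already been reached."""
    remaining = (expiry - today).days
    return [d for d in REMINDER_DAYS if remaining <= d]
```

For a certificate expiring on 2026-12-31, `reminder_dates` returns 2026-10-02, 2026-11-01, and 2026-12-01. Running `due_reminders` from a daily scheduled job is one way to avoid depending on a single person's calendar.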

The Change Record — What TAC Wishes Engineers Would Do

When a production outage happens, the first question TAC asks is: "What changed?" The answer is almost always one of these:

"We applied a content update last night"
"We replaced the SSL certificate yesterday"
"We added a new ISP connection this morning"
"We modified the security policy for a new application"
"Actually, nothing changed" — (and then we find the change that was made without a ticket)

Every change to a production firewall must be logged with: timestamp, who made it, what was changed, and the business reason. Not because of compliance — because when the outage happens, this log is the first thing that finds the cause. A firewall without a change log is a firewall that is impossible to troubleshoot efficiently.

The config audit log is your fallback when the change record is incomplete: show log config lists every configuration change made through the CLI and GUI. If a change was made and not documented, this finds it. Always check it within the first 5 minutes of a production outage.

pan-os-cli

# Check config change history — first thing in any outage
show log config

# Filter by time of suspected change
show log config start-time equal 2026/05/26@00:00:00 end-time equal 2026/05/26@23:59:59

# Look for: who made the change, what was modified, exact timestamp
# Cross-reference with the reported outage start time
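The four fields the change record needs (timestamp, who, what, and business reason) and the cross-reference against the outage start time can be sketched as follows. This is an illustrative Python sketch, not a feature of the firewall. `ChangeRecord`, `recent_changes`, and the 24-hour lookback are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    timestamp: datetime  # when the change was committed
    admin: str           # who made it
    change: str          # what was changed
    reason: str          # business reason


def recent_changes(records: list[ChangeRecord],
                   outage_start: datetime,
                   lookback: timedelta = timedelta(hours=24)) -> list[ChangeRecord]:
    """Return changes made within lookback before the outage, newest first."""
    window_start = outage_start - lookback
    hits = [r for r in records if window_start <= r.timestamp <= outage_start]
    return sorted(hits, key=lambda r: r.timestamp, reverse=True)
```

The newest change before the outage start is usually the first suspect, which is why the result is sorted newest first.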

Course Complete

You have completed Firewall Production Outage Troubleshooting. You now understand how an NGFW processes packets and how to debug asymmetric routing, NAT hairpin, App-ID shifts, SSL decryption failures, and SSL VPN outages, the five most common production emergency scenarios. You have the full TAC debugging toolkit and the prevention sequences to stop most of these before they become outages. This is the knowledge that takes engineers years of TAC casework to develop.
