Prevention — Stop These Outages Before They Happen
TAC's View on Why Outages Happen
After working hundreds of production outage cases, a pattern becomes clear: most outages are not caused by unknown failure modes. They are caused by known, documented risks that were not validated before the change was applied. The five outage scenarios in this course — asymmetric routing, NAT hairpin, App-ID shift, SSL decryption, GlobalProtect — each have a pre-change validation sequence that would have caught the problem before it became an emergency.
This module covers those sequences. Apply them before every significant firewall change and most of the calls in the previous modules will never happen.
Pre-Change Validation — Universal Steps
These steps apply before any significant PAN-OS configuration change:

    # 1. Capture baseline: take a snapshot of current state
    # The PAN-OS CLI has no shell-style output redirection, so log the SSH
    # session in your terminal client and save each command's output off-box
    # (for example as baseline-resources.txt, baseline-sessions.txt,
    # baseline-counters.txt, and baseline-ha.txt)
    show system resources
    show session info
    show counter global filter delta yes
    show high-availability state

    # 2. Confirm HA state is healthy before change
    show high-availability state
    # Both peers must show: State: active/passive (or active/active)
    # Synchronization must be: Complete
    # Do NOT apply changes when HA is in a degraded or split-brain state

    # 3. Test current working behavior BEFORE change
    # Define a test that proves the change worked
    # Example: "internal user can reach external-server.com on port 443"
    # Document the current passing result: this is your rollback benchmark

    # 4. Know your rollback path
    # Save a named config snapshot before every change (configuration mode)
    configure
    save config to <name>
    exit
    # Or via Panorama: create a named configuration snapshot

    # 5. Apply during maintenance window with user impact acknowledged
    # Never apply security policy or NAT changes during business hours
    # without explicit written authorization
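The five universal steps can also be tracked as an explicit gate, so a change cannot start while any step is still open. The following is a minimal Python sketch, not part of PAN-OS or any vendor tool; the class and field names are hypothetical and simply mirror the five steps above.

```python
from dataclasses import dataclass, fields


@dataclass
class PreChangeChecklist:
    """One flag per universal pre-change step (names are illustrative)."""

    baseline_captured: bool = False        # step 1
    ha_healthy: bool = False               # step 2
    passing_test_documented: bool = False  # step 3
    rollback_snapshot_saved: bool = False  # step 4
    window_authorized: bool = False        # step 5

    def blockers(self) -> list[str]:
        """Return the names of the steps that are not yet done, in step order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready(self) -> bool:
        """The change may proceed only when every step is done."""
        return not self.blockers()
```

Calling `blockers()` just before the maintenance window gives a short list of what is still missing, which is easier to act on than a single yes/no answer.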
Routing Change Validation — Prevent Asymmetric Routing
Before Any Routing Change

    # Validate current routing table
    show routing route

    # For every subnet that will be affected by the new route:
    # Test what path traffic currently takes
    test routing fib-lookup virtual-router default ip <subnet-gateway>

    # For HA environments: validate session owner distribution
    show session info | match "owner"

    # Confirm no existing ECMP that could be affected
    show routing route | match "flags.*E"

    # After routing change: immediately test return path
    # This is what most engineers skip:
    # From the SERVER side: traceroute back to the client subnet
    # The path must pass through the firewall both ways
The Return Path Test — The Most Skipped Step
Why This Gets Skipped
- Run traceroute from the server to a client IP in each affected subnet
- Verify the firewall's inside interface IP appears in the path
- If it does not — the return path is bypassing the firewall
- Fix routing on the server gateway or upstream router before finalizing the change
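The check in the steps above (does the firewall's inside interface appear in the server-to-client path?) is easy to automate once the traceroute output has been saved. This is a rough Python sketch that assumes the common Linux `traceroute` text layout, where each hop line begins with a hop number. The sample output and every IP address in it are made up for illustration.

```python
import re

# Matches dotted-quad IPv4 addresses; round-trip times like "0.412" do not match.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Illustrative output only: addresses and timings are invented.
SAMPLE = """traceroute to 10.1.1.100 (10.1.1.100), 30 hops max, 60 byte packets
 1  10.20.0.1  0.412 ms  0.398 ms  0.377 ms
 2  10.0.0.1  0.911 ms  0.874 ms  0.860 ms
 3  10.1.1.100  1.302 ms  1.280 ms  1.251 ms
"""


def hop_ips(traceroute_output: str) -> list[str]:
    """Collect IPv4 addresses from hop lines (lines starting with a hop number)."""
    ips = []
    for line in traceroute_output.splitlines():
        if re.match(r"\s*\d+\s", line):
            ips.extend(IPV4.findall(line))
    return ips


def return_path_uses_firewall(traceroute_output: str, firewall_inside_ip: str) -> bool:
    """True when the firewall's inside interface IP appears among the hops."""
    return firewall_inside_ip in hop_ips(traceroute_output)
```

A `False` result for any affected client subnet means the return path is bypassing the firewall and the change should not be finalized yet.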
NAT Change Validation

    # Before modifying NAT rules: identify all traffic that currently uses them
    show running nat-policy

    # Test the existing NAT match for affected traffic BEFORE change
    test nat-policy-match from trust to untrust source 10.1.1.100 destination 203.0.113.10 protocol 6 destination-port 443
    # Document the result: use this to verify new rules match the same traffic

    # After change: test again with identical parameters
    test nat-policy-match from trust to untrust source 10.1.1.100 destination 203.0.113.10 protocol 6 destination-port 443

    # For hairpin changes: test BOTH external and internal source
    test nat-policy-match from untrust to untrust source 203.0.113.50 destination 203.0.113.10 protocol 6 destination-port 443
    test nat-policy-match from trust to untrust source 10.1.1.100 destination 203.0.113.10 protocol 6 destination-port 443
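Because the before and after tests only mean something when the parameters are identical, it can help to generate the command from one stored test case instead of retyping it. The Python sketch below is one way to do that. `NatTestCase` and `changed_results` are hypothetical helper names, and the recorded results are whatever text the engineer noted from the firewall output (the rule names in any example are placeholders).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NatTestCase:
    """Parameters for one `test nat-policy-match` run, reused before and after."""

    from_zone: str
    to_zone: str
    source: str
    destination: str
    protocol: int = 6            # 6 = TCP
    destination_port: int = 443

    def command(self) -> str:
        """Render the CLI command exactly as it should be typed."""
        return (
            f"test nat-policy-match from {self.from_zone} to {self.to_zone} "
            f"source {self.source} destination {self.destination} "
            f"protocol {self.protocol} destination-port {self.destination_port}"
        )


def changed_results(before: dict, after: dict) -> list:
    """Return the test cases whose recorded result differs between the two runs."""
    return [case for case in before if after.get(case) != before[case]]
```

For a hairpin change, store two cases (one with the external source in `untrust`, one with the internal source in `trust`) and compare both; any case returned by `changed_results` needs an explanation before the change is accepted.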
SSL Decryption Rollout Validation
Never deploy SSL decryption to all users at once. This is the single most impactful change a security team can make and the one most likely to cause widespread application failures.
| Phase | Action | Success Criteria |
|---|---|---|
| Phase 1 | Deploy Forward Trust CA to all devices and browsers via GPO/MDM | 100% of devices trust the CA certificate — verify in browser trust store |
| Phase 2 | Enable decryption for pilot group (IT/security staff) | All pilot users can access all business applications normally for 5 business days |
| Phase 3 | Build exclude list from pilot findings | All pinned/mTLS applications identified and excluded — zero unexplained failures |
| Phase 4 | Expand to one department at a time | Each department runs for 3 days — monitor decryption error logs daily |
| Phase 5 | Full production rollout | Decryption error rate below 1% — monitor for 5 business days |

    # Daily monitoring command during rollout
    # Run every morning during expansion phases
    # Replace <YYYY/MM/DD> with today's date: the PAN-OS CLI does not
    # expand shell expressions such as $(date +%Y/%m/%d)
    show log decryption direction equal forward | match "error|fail|unsupported" | match "<YYYY/MM/DD>"

    # Count decryption errors by destination to identify new exclude candidates
    # If the same destination appears 5+ times with errors, add it to the exclude list

    # Check total decryption volume vs errors
    # Save the output off-box and count the lines for the day, then compare
    # that total with the error count from the first command
    show log decryption direction equal forward | match "<YYYY/MM/DD>"

    # Healthy ratio: errors < 1% of total decrypted sessions
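The two thresholds in the monitoring notes (5 or more errors for one destination, and an error rate below 1%) are simple to apply once the log lines have been exported and parsed. The Python sketch below assumes that parsing has already happened and that you hold a list of error destinations plus two counts; the function names are hypothetical.

```python
from collections import Counter


def exclude_candidates(error_destinations, threshold=5):
    """Destinations that appear in error entries at least `threshold` times."""
    counts = Counter(error_destinations)
    return sorted(dest for dest, n in counts.items() if n >= threshold)


def error_rate(error_count, total_sessions):
    """Fraction of decrypted sessions that logged an error (0.0 when idle)."""
    if total_sessions == 0:
        return 0.0
    return error_count / total_sessions


def is_healthy(error_count, total_sessions, limit=0.01):
    """Apply the 'errors below 1% of decrypted sessions' criterion."""
    return error_rate(error_count, total_sessions) < limit
```

Running `exclude_candidates` on each morning's errors produces the list to review for the exclude list, and `is_healthy` gives the pass/fail signal for moving to the next phase.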
Content Update Validation — Prevent App-ID Outages

    # Before applying a content update:
    # Download but do not install
    request content upgrade download latest

    # Review the changelog in the support portal for:
    # - New applications added to the App-ID database
    # - Applications where identification was improved
    # - Any application that your security rules reference

    # After applying update: immediately check for new denies
    # Replace <YYYY/MM/DD> with today's date: the PAN-OS CLI does not
    # expand shell expressions such as $(date +%Y/%m/%d)
    show log traffic action equal deny | match "<YYYY/MM/DD>"

    # Compare application names in deny logs against your security rules
    # Any new application name appearing in deny logs that was previously
    # passing as ssl or web-browsing = content update triggered App-ID shift

    # Run policy optimizer to identify rules affected by App-ID changes
    # (Available in GUI: Policies > Policy Optimizer > Unused Applications)
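The comparison described in the comments above comes down to set arithmetic on application names. The Python sketch below assumes you have already extracted three sets by hand or by script: application names seen in deny logs before the update, names seen after it, and the applications your security rules reference. The function name and the application names used in any example are hypothetical.

```python
def app_id_shift_candidates(denied_after, denied_before, rule_apps):
    """Application names that started appearing in deny logs after the update
    and that no security rule references.

    These are the likely App-ID shifts: traffic that used to match a broad
    application such as ssl or web-browsing and is now identified by a new,
    more specific name that no rule allows.
    """
    newly_denied = set(denied_after) - set(denied_before)
    return sorted(app for app in newly_denied if app not in set(rule_apps))
```

An empty result does not prove the update was harmless; it only means no new unreferenced application name has shown up in the deny logs so far.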
Automate Content Updates Carefully
GlobalProtect Certificate Management
Certificate expiry is the most preventable GP outage. Set up the expiry checks and reminders below once and the emergency call never happens.

    # Check all certificate expiry dates
    show certificate all | match "not-valid-after|name"

    # For each GP-related certificate, note the expiry date
    # Set calendar reminders at 90, 60, and 30 days before expiry

    # Validate the full certificate chain before installing new cert
    # (Do this on a test machine before production deployment)
    openssl s_client -connect portal.company.com:443 -showcerts

    # Verify chain output shows:
    # Certificate chain
    #  0 s:CN=portal.company.com   ← leaf cert
    #    i:CN=Intermediate CA
    #  1 s:CN=Intermediate CA      ← intermediate
    #    i:CN=Root CA
    # All three must be present and in correct order

    # After installing new certificate: test GP connection from outside
    # Using a non-corporate device (personal phone) confirms what a
    # remote user sees: if it connects, the cert chain is complete
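The 90, 60, and 30 day reminders can be computed from the expiry date instead of counted on a calendar. This is a small Python sketch using only the standard library; the function names are hypothetical and the expiry date is whatever you noted from the certificate listing.

```python
from datetime import date, timedelta


def reminder_dates(expiry: date, days_before=(90, 60, 30)) -> list[date]:
    """Dates on which a reminder should fire, earliest first."""
    return [expiry - timedelta(days=d) for d in days_before]


def due_reminders(expiry: date, today: date, days_before=(90, 60, 30)) -> list[int]:
    """Thresholds (in days) that have already been reached as of `today`."""
    remaining = (expiry - today).days
    return [d for d in days_before if remaining <= d]
```

`reminder_dates` is the one-time setup for calendar entries, while `due_reminders` suits a scheduled job that reports which thresholds a certificate has already crossed.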
The Change Record — What TAC Wishes Engineers Would Do
When a production outage happens, the first question TAC asks is: "What changed?" The answer is almost always one of these:
- "We applied a content update last night"
- "We replaced the SSL certificate yesterday"
- "We added a new ISP connection this morning"
- "We modified the security policy for a new application"
- "Actually, nothing changed" — (and then we find the change that was made without a ticket)
Every change to a production firewall must be logged with: timestamp, who made it, what was changed, and the business reason. Not because of compliance — because when the outage happens, this log is the first thing that finds the cause. A firewall without a change log is a firewall that is impossible to troubleshoot efficiently.
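The four required fields map naturally onto a small record type. The Python sketch below is illustrative only: the class, its field names, and the 24-hour lookback default are assumptions made for the example, not part of any ticketing system or PAN-OS feature.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """One logged change: timestamp, who made it, what changed, and why."""

    timestamp: datetime
    who: str
    what: str
    reason: str

    def missing_fields(self) -> list[str]:
        """Names of the text fields that were left blank."""
        return [n for n in ("who", "what", "reason") if not getattr(self, n).strip()]


def changes_before(records, outage_start, lookback=timedelta(hours=24)):
    """Changes made within `lookback` before the outage began, newest first."""
    window_start = outage_start - lookback
    hits = [r for r in records if window_start <= r.timestamp <= outage_start]
    return sorted(hits, key=lambda r: r.timestamp, reverse=True)
```

With records kept this way, the answer to "What changed?" is one call to `changes_before` with the reported outage start time.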
When the change record is missing or incomplete, the config audit log is your fallback: show log config lists the configuration changes made through the CLI and GUI. If a change was made and not documented, this finds it. Always check this within the first 5 minutes of a production outage.

    # Check config change history: first thing in any outage
    show log config

    # Filter by time of suspected change (timestamps use YYYY/MM/DD@hh:mm:ss)
    show log config start-time equal 2026/05/26@00:00:00 end-time equal 2026/05/26@23:59:59

    # Look for: who made the change, what was modified, exact timestamp
    # Cross-reference with the reported outage start time
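Typing the time filter by hand under outage pressure invites mistakes, so the command can be generated from the reported outage start time. This Python sketch assumes the `YYYY/MM/DD@hh:mm:ss` timestamp format used by PAN-OS log filters. The function name and the 24-hour default lookback are hypothetical choices for the example.

```python
from datetime import datetime, timedelta

# Assumed PAN-OS log filter timestamp layout: date and time joined by "@".
PANOS_TIME = "%Y/%m/%d@%H:%M:%S"


def config_log_command(outage_start: datetime,
                       lookback: timedelta = timedelta(hours=24)) -> str:
    """Build a `show log config` command covering the window before the outage."""
    start = outage_start - lookback
    return (
        "show log config "
        f"start-time equal {start.strftime(PANOS_TIME)} "
        f"end-time equal {outage_start.strftime(PANOS_TIME)}"
    )
```

Widen `lookback` when nothing relevant shows up: a change applied days earlier can stay dormant until traffic patterns or a scheduled job expose it.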