Troubleshoot Catalyst 9200/9300 Reloads Due to Stack Issues
Introduction
This document describes how to troubleshoot unexpected reloads due to stack issues on Catalyst 9000 switches.
Prerequisites
Requirements
Cisco recommends that you have knowledge of these topics:
- Catalyst 9000 Switches
- Catalyst 9300 Stackwise System Architecture
- Catalyst 9200 Stackwise System Architecture
Components Used
The information in this document is based on these software and hardware versions:
- Catalyst 9300 and 9300L platforms
- Cisco IOS® XE Release 17.2.1 and Cisco IOS XE Release 17.3.5
- Catalyst 9200 and 9200L switches
- Cisco IOS XE Release 17.1.1 and later
The information in this document was created from devices in a specific lab environment. All devices used started with a cleared (default) configuration. If your network is live, ensure you understand the potential impact of any command.
Background Information
The stack reset reasons are described in this table:
Reset Reason | Description |
---|---|
stack merge | This is observed when at least two stack members claim to be the active switch of the stack. This can be seen when the stack ring is broken or when Stack Discovery Protocol (SDP) messages are lost due to bad stack cables. |
stack merge due to incompatibility | Same as stack merge. Seen more frequently in half-ring stack configurations. |
lost both active and standby | When the active switch is lost and, for any reason, the standby switch is unable to assume the active role, all other stack members reload with this reset reason. This can also be seen when stacks are configured in half-ring configurations. |
stack cable authentication failure, stack adapter authentication failure | Usually seen due to a faulty stack cable, stack adapter, or stack port. It could also be seen due to a software issue. |
Troubleshoot
Validate Stack Reload Reason
Validate the last reload reason for all members of the stack.
- Switch number - switch number assigned to a stack member; every stack member has a unique number assigned.
Use the following commands:
show version
show switch
show logging onboard switch <switch number> uptime detail
In the show version command output, you can identify the reset reason for each of the stack members.
Example output:
switch#show version
<omitted output>
Last reload reason: stack merge <-- Switch 1 Reason
<omitted output>
Switch Ports Model SW Version SW Image Mode
* 1 53 C9300-48P 17.3.5 CAT9K IOSXE INSTALL
2 53 C9300-48P 17.3.5 CAT9K IOSXE INSTALL
3 53 C9300-48P 17.3.5 CAT9K IOSXE INSTALL
Switch 02
Switch uptime : 13 hours, 47 minutes
Base Ethernet MAC Address : aa:aa:aa:aa:aa:aa
Motherboard Assembly Number : 11-11111-11
Motherboard Serial Number : AAAAAAAAAAA
Model Revision Number : FO
Motherboard Revision Number : CO
Model Number : C9300-48P
System Serial Number : AAAAAAAAAAB
Last reload reason : stack merge due to incompatiblity <-- Switch 2 Reason
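When the stack has many members, it can help to pull the reload reasons out of a saved show version capture rather than read them by eye. The sketch below is a minimal, unofficial example that assumes the output layout shown above; the function name reload_reasons and the "local" label for the first block are illustrative choices, not part of any Cisco tool.

```python
import re

# Matches "Last reload reason : <text>" and ignores the "<--" annotations
# that appear only in this document's examples.
REASON = re.compile(r"Last reload reason\s*:\s*(.+?)\s*(?:<--.*)?$")
# Matches the per-member header lines such as "Switch 02".
HEADER = re.compile(r"\s*Switch\s+0*(\d+)\s*")


def reload_reasons(show_version: str) -> dict[str, str]:
    """Map each section of a 'show version' capture to its last reload reason.

    The first reason appears before any 'Switch 0N' header and is stored
    under the (assumed) label 'local'.
    """
    reasons: dict[str, str] = {}
    current = "local"
    for line in show_version.splitlines():
        header = HEADER.fullmatch(line)
        if header:
            current = f"switch {header.group(1)}"
            continue
        match = REASON.search(line)
        if match:
            reasons[current] = match.group(1)
    return reasons
```

Any member whose reason is a stack merge or an authentication failure is a candidate for the cable checks later in this document.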
The show switch command output displays the current role of the stack members.
switch#show switch
Switch/Stack Mac Address : XXXX.XXXX.XXXX
Mac persistency wait time: Indefinite
Switch# Role Mac Address Priority Version State
*1 Active XXXX.XXXX.XXXX 15 V01 Ready
2 Standby aaaa.aaaa.aaaa 14 V01 Ready
3 Member bbbb.bbbb.bbbb 13 V01 Ready
Use show logging onboard switch <switch number> uptime detail to see the record of reload reasons for a stack member.
- Current reset timestamp - Shows the time when the switch booted up. However, it does not show the time when the switch went down.
Example output for Switch 1:
UPTIME SUMMARY INFORMATION
First customer power on : 11/15/2019 22:46:33
Total uptime : 0 years 0 weeks 6 days 20 hours 15 minutes
Total downtime : 0 years 46 weeks 5 days 23 hours 42 minutes
Number of resets : 10
Number of slot changes : 0
Current reset reason : stack merge <--
Current reset timestamp : 10/15/2020 05:44:01 <--
Current slot : 1
Chassis type : 95
Current uptime : 0 years 0 weeks 0 days 13 hours 0 minutes
UPTIME CONTINUOUS INFORMATION
Time Stamp | Reset Reason | Uptime
MM/DD/YYYY HH:MM:SS | | years weeks days hours minutes
<omitted output>
10/15/2020 05:44:01 stack merge 0 0 0 1 0 <--
Example output for Switch 2:
UPTIME SUMMARY INFORMATION
Number of resets : 14
Number of slot changes : 1
Current reset reason : stack merge due to incompatiblity <--
Current reset timestamp : 10/15/2020 05:44:03
Current slot : 2
Chassis type : 95
Current uptime : 0 years 0 weeks 0 days 13 hours 0 minutes
UPTIME CONTINUOUS INFORMATION
Time Stamp | Reset Reason | Uptime
MM/DD/YYYY HH:MM:SS | | years weeks days hours minutes
<omitted output>
10/15/2020 05:44:03 stack merge due to incompatiblity 0 0 0 1 0 <--
Example output for Switch 3:
UPTIME SUMMARY INFORMATION
Number of resets : 37
Number of slot changes : 3
Current reset reason : lost both active and standby <--
Current reset timestamp : 10/15/2020 18:56:09
Current slot : 3
Chassis type : 95
Current uptime : 0 years 0 weeks 0 days 0 hours 30 minutes
UPTIME CONTINUOUS INFORMATION
Time Stamp | Reset Reason | Uptime
MM/DD/YYYY HH:MM:SS | | years weeks days hours minutes
<omitted output>
10/15/2020 18:56:09 lost both active and standby 0 0 0 0 35 <--
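To compare the current reset reason and timestamp across stack members, the UPTIME SUMMARY INFORMATION fields can be extracted from a saved capture. This is a rough sketch that assumes the field names and the MM/DD/YYYY HH:MM:SS timestamp format shown in the examples above; current_reset is a hypothetical helper name.

```python
from datetime import datetime


def current_reset(uptime_detail: str) -> tuple[str, datetime]:
    """Return (reset reason, boot timestamp) from an uptime detail capture."""
    fields: dict[str, str] = {}
    for line in uptime_detail.splitlines():
        # Split on the first colon only, so timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            # Drop the "<--" annotations used in this document's examples.
            fields[key.strip()] = value.split("<--")[0].strip()
    reason = fields["Current reset reason"]
    stamp = datetime.strptime(
        fields["Current reset timestamp"], "%m/%d/%Y %H:%M:%S"
    )
    return reason, stamp
```

In the examples above, switches 1 and 2 report timestamps two seconds apart, which suggests that both members came back up from the same event. Remember that the timestamp is the boot time, not the time the switch went down.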
Note: The errors "stack cable authentication failure" and "stack adapter authentication failure" usually do not allow the affected switch to fully boot up. Therefore, no commands can be collected for further analysis. Check the corresponding section with the steps to follow.
Check Stack Cable Hardware
Per the hardware installation guides for Catalyst 9200 and 9300 switches, ensure that the stack follows the stack cable setup described here and that the stack cables are properly installed.
Confirm Stack Cable Setup
Stack cables must be set up in this manner:
- switch 1 stack port 1 connected to switch 2
- switch 1 stack port 2 connected to switch N
- switch 2 stack port 1 connected to switch 3
- switch 2 stack port 2 connected to switch 1
- switch 3 stack port 1 connected to switch 4
- switch 3 stack port 2 connected to switch 2
- ...
- switch N stack port 1 connected to switch 1
- switch N stack port 2 connected to switch N-1
With this cabling, the stack setup looks like these images:
Catalyst 9200L and 9200
[Diagram showing the back panel connections of Catalyst 9200L and 9200 switches in a stack, illustrating the stack cable connections between ports.]
Catalyst 9300
[Diagram showing the back panel connections of Catalyst 9300 switches in a stack, illustrating the stack cable connections between ports.]
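The cabling rule listed under Confirm Stack Cable Setup can be written as a small function that produces the expected neighbor for each stack port in a full ring of N switches. This is only a sketch of the rule as stated in this document; ring_cabling is a hypothetical name.

```python
def ring_cabling(n: int) -> list[tuple[int, int, int]]:
    """Return (switch, stack port, expected neighbor) for a full ring.

    Port 1 of each switch connects to the next switch, and port 2 connects
    to the previous switch, with wrap-around between switch N and switch 1.
    """
    if n < 2:
        raise ValueError("a stack ring needs at least two switches")
    links = []
    for switch in range(1, n + 1):
        next_switch = switch % n + 1
        previous_switch = (switch - 2) % n + 1
        links.append((switch, 1, next_switch))
        links.append((switch, 2, previous_switch))
    return links
```

For a three-member stack, the result is the same neighbor layout that a healthy show switch neighbors output displays.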
Install Stack Cables
When you insert the stack adapter and/or the stack cable, follow these instructions:
Catalyst 9200L and 9200
- Ensure stack adapters are properly inserted. The Cisco logo must be on top.
- Ensure the stack cable is firmly tightened by hand.
[Diagram showing a stack adapter being inserted into a Catalyst 9200/9200L switch, with labels indicating the adapter and the switch port.]
Catalyst 9300L
- Ensure the stack adapters are properly inserted. The Cisco logo must be on top.
- Ensure the stack cable is firmly tightened by hand.
[Diagram showing a stack adapter being inserted into a Catalyst 9300L switch, with labels indicating the adapter and the switch port.]
Catalyst 9300
- The Cisco logo must be on top.
- Ensure the connector screws are firmly tightened by hand (not too loose, not too tight).
[Diagram showing a stack adapter being connected to a Catalyst 9300 switch, with labels indicating the adapter, cables, and screws.]
Check Stack Cable Health
In most cases, the unexpected reloads described in this document are triggered by bad stack cables, stack adapters, or stack ports. Regardless of the software version you run, the stack is susceptible to these reloads if the stack parts were not installed properly.
Once you have validated the stack against the Confirm Stack Cable Setup and Install Stack Cables sections, check the stack cable health with these commands:
show switch neighbors
show switch stack-ring speed
show switch stack-ports summary
show switch stack-ports detail
In this example, there is a stack of three Catalyst 9300 switches. The show switch neighbors command output displays which switches are connected to each stack member:
switch#show switch neighbors
Switch # Port 1 Port 2
1 2 3
2 3 1
3 1 2
When a stack cable is not present, wrongly inserted, or is faulty, 'None' is shown instead of the stack member:
switch#show switch neighbors
Switch # Port 1 Port 2
1 2 None <--
2 3 None
3 1 2
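A saved show switch neighbors capture can be scanned for stack ports that report 'None'. The following is a minimal sketch that assumes the three-column layout shown above; missing_neighbors is an illustrative name.

```python
def missing_neighbors(neighbors_output: str) -> list[tuple[int, int]]:
    """Return (switch, stack port) pairs whose neighbor is shown as 'None'."""
    missing = []
    for line in neighbors_output.splitlines():
        parts = line.split()
        # Data rows start with the switch number; the header row does not.
        if len(parts) >= 3 and parts[0].isdigit():
            for port, neighbor in enumerate(parts[1:3], start=1):
                if neighbor.lower() == "none":
                    missing.append((int(parts[0]), port))
    return missing
```

Each pair returned identifies a port whose cable, adapter, or seating is worth inspecting first.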
The show switch stack-ring speed command provides the stack ring status:
switch#show switch stack-ring speed
Stack Ring Speed : 480G
Stack Ring Configuration: Full
Stack Ring Protocol : StackWise
If for any reason the stack ring is broken, the output looks like this:
switch#show switch stack-ring speed
Stack Ring Speed : 240G
Stack Ring Configuration: Half
Stack Ring Protocol : StackWise
Warning: It is never expected to see Half status in a healthy Stack Ring Configuration. Though the stack works, it loses half of the bandwidth as well as redundancy.
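If the ring status is collected periodically, a check for the Half state can be automated. The sketch below assumes the "Stack Ring Configuration" field name shown in the output above; ring_is_full is a hypothetical helper.

```python
def ring_is_full(ring_output: str) -> bool:
    """Return True when 'Stack Ring Configuration' reports Full."""
    for line in ring_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Stack Ring Configuration":
            return value.strip().lower() == "full"
    raise ValueError("Stack Ring Configuration line not found")
```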
A healthy show switch stack-ports summary command output looks like this.
Note: Switch 1 stack port 1 shows two link changes. This is normal.
switch#show switch stack-ports summary
Sw#/Port# Port Status Neighbor Cable Length Link OK Link Active Sync OK #Changes to
LinkOK In Loopback
1/1 OK 2 50cm Yes Yes Yes 2 No
1/2 OK 3 100cm Yes Yes Yes 1 No
2/1 OK 3 50cm Yes Yes Yes 1 No
2/2 OK 1 50cm Yes Yes Yes 1 No
3/1 OK 1 100cm Yes Yes Yes 1 No
3/2 OK 2 50cm Yes Yes Yes 1 No
If the output shows many flaps on certain ports, it could be a signal of stack instability. This condition could trigger a stack merge. The Unknown status can be seen if the stack is not properly cabled.
switch#show switch stack-ports summary
Sw#/Port# Port Status Neighbor Cable Length Link OK Link Active Sync OK #Changes to
LinkOK In Loopback
1/1 OK 2 50cm Yes Yes Yes 16 No
<-- 16 flaps on switch 1 stack port 1 facing switch 2
1/2 OK 3 100cm Yes Yes Yes 1 No
2/1 OK 3 50cm Yes Yes Yes 1 No
2/2 OK 1 Unknown Yes Yes Yes 16 No
<-- Cable length 'unknown', 16 flaps on switch 2 stack port 2 facing switch 1
3/1 OK 1 100cm Yes Yes Yes 1 No
3/2 OK 2 50cm Yes Yes Yes 1 No
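The two symptoms described above, a high #Changes to LinkOK count and an Unknown cable length, can be flagged from a saved summary capture. This sketch assumes the column order shown in the example output. The max_changes threshold of 2 is an assumption taken from the healthy example, not a documented limit, and suspect_ports is a hypothetical name.

```python
import re

# Columns: Sw#/Port#, Port Status, Neighbor, Cable Length, Link OK,
# Link Active, Sync OK, #Changes to LinkOK, In Loopback.
ROW = re.compile(
    r"^\s*(\d+)/(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\S+)"
)


def suspect_ports(summary: str, max_changes: int = 2) -> dict[str, list[str]]:
    """Return stack ports with many link changes or an unknown cable length."""
    suspects: dict[str, list[str]] = {}
    for line in summary.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, annotation, or blank line
        port = f"{match.group(1)}/{match.group(2)}"
        cable_length = match.group(5)
        changes = int(match.group(9))
        findings = []
        if changes > max_changes:
            findings.append(f"{changes} link changes")
        if cable_length.lower() == "unknown":
            findings.append("cable length unknown")
        if findings:
            suspects[port] = findings
    return suspects
```

Ports at both ends of the same cable often show up together, as 1/1 and 2/2 do in the example, which helps narrow the problem to one link.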
When excessive link changes are seen, the next step is to check the show switch stack-ports detail command and focus on the CRC Errors counters. CRC counters that increment on an interface mean that the packets received on that port are malformed. These conditions can apply:
- Corrupt packets sent from the remote side due to a faulty port
- Either the stack adapter (if applicable) or the stack cable is not properly set
- Either the stack adapter or the stack cable is faulty
Example of show switch stack-ports detail output:
switch#show switch stack-ports detail
1/1 is OK Loopback No
Cable Length 100cm Neighbor 2
Link Ok Yes Sync Ok Yes Link Active Yes
Changes to LinkOK 16
Five minute input rate 1110 bytes/sec
Five minute output rate 47 bytes/sec
24798951 bytes input
737941 bytes output
CRC Errors
Data CRC 459731 <-- CRCS
Ringword CRC 35156 <-- CRCS
InvRingWord 54951 <-- CRCS
PcsCodeWord 35481 <-- CRCS
1/2 is OK Loopback No
Cable Length 100cm Neighbor 3
Link Ok Yes Sync Ok Yes Link Active Yes
Changes to LinkOK 1
Five minute input rate 164 bytes/sec
Five minute output rate 67 bytes/sec
0 bytes input
0 bytes output
CRC Errors
Data CRC 0
Ringword CRC 0
InvRingWord 0
PcsCodeWord 0
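To track the CRC Errors counters per stack port, the detail output can be parsed into a nested dictionary. The sketch below assumes the layout and the four counter names shown in the example above; crc_counters is an illustrative name.

```python
import re

# A port block starts with a line such as "1/1 is OK Loopback No".
PORT = re.compile(r"^\s*(\d+/\d+)\s+is\s+")
# The four counters listed under "CRC Errors" in the example output.
COUNTER = re.compile(
    r"^\s*(Data CRC|Ringword CRC|InvRingWord|PcsCodeWord)\s+(\d+)"
)


def crc_counters(detail: str) -> dict[str, dict[str, int]]:
    """Return {port: {counter name: value}} from a stack-ports detail capture."""
    counters: dict[str, dict[str, int]] = {}
    current = None
    for line in detail.splitlines():
        port = PORT.match(line)
        if port:
            current = port.group(1)
            counters[current] = {}
            continue
        counter = COUNTER.match(line)
        if counter and current is not None:
            counters[current][counter.group(1)] = int(counter.group(2))
    return counters
```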
Note: The show switch stack-ports detail command is available in the Cisco IOS XE Release 17.3.x train and later. To check the CRC Errors counters on earlier releases, use the legacy commands.
Legacy Commands
Commands that end in 0 read the CRC counters for stack port 1, and commands that end in 1 read the CRC counters for stack port 2. These commands must be entered for all stack members.
show platform hardware fed switch <switch number> fwd-asic register read register-name SifRacDataCrcErrorCnt-0
show platform hardware fed switch <switch number> fwd-asic register read register-name SifRacRwCrcErrorCnt-0
show platform hardware fed switch <switch number> fwd-asic register read register-name SifRacInvalidRingWordCnt-0
show platform hardware fed switch <switch number> fwd-asic register read register-name SifRacPcsCodeWordErrorCnt-0
show platform hardware fed switch <switch number> fwd-asic register read register-name SifRacDataCrcErrorCnt-1
show platform hardware fed switch <switch number> fwd-asic register read register-name SifRacRwCrcErrorCnt-1
show platform hardware fed switch <switch number> fwd-asic register read register-name SifRacInvalidRingWordCnt-1
show platform hardware fed switch <switch number> fwd-asic register read register-name SifRacPcsCodeWordErrorCnt-1
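Because the eight legacy commands must be entered for every stack member, it can be convenient to generate the full list for a given set of switch numbers. This sketch only builds the command strings listed above; it does not connect to any device, and legacy_crc_commands is a hypothetical name.

```python
# Register names taken from the legacy commands listed in this section.
REGISTERS = (
    "SifRacDataCrcErrorCnt",
    "SifRacRwCrcErrorCnt",
    "SifRacInvalidRingWordCnt",
    "SifRacPcsCodeWordErrorCnt",
)


def legacy_crc_commands(switch_numbers: list[int]) -> list[str]:
    """Build the legacy CRC counter commands for each stack member.

    Suffix 0 corresponds to stack port 1 and suffix 1 to stack port 2.
    """
    commands = []
    for switch in switch_numbers:
        for suffix in (0, 1):
            for register in REGISTERS:
                commands.append(
                    f"show platform hardware fed switch {switch} "
                    f"fwd-asic register read register-name {register}-{suffix}"
                )
    return commands
```

For a three-member stack, legacy_crc_commands([1, 2, 3]) returns 24 commands.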
Note: The #Changes to LinkOK counter in the show switch stack-ports summary command output and the CRC counters in the show switch stack-ports detail command output must be checked at least two times to validate if there is an increment on any of them. Static counters validate a stable stack link, whereas an increment in any of these counters validates stack link instability.
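The two-sample comparison described in the note can be expressed as a simple difference between counter snapshots. The sketch below assumes each snapshot is a dictionary keyed by stack port, with counter names mapped to integer values that you have already collected; incremented is a hypothetical helper.

```python
def incremented(
    before: dict[str, dict[str, int]],
    after: dict[str, dict[str, int]],
) -> dict[str, dict[str, int]]:
    """Return only the counters that grew between two snapshots."""
    deltas: dict[str, dict[str, int]] = {}
    for port, counters in after.items():
        for name, value in counters.items():
            # A counter missing from the first snapshot is treated as 0.
            delta = value - before.get(port, {}).get(name, 0)
            if delta > 0:
                deltas.setdefault(port, {})[name] = delta
    return deltas
```

An empty result indicates static counters between the two samples. Any entry in the result identifies a port and counter that still increment.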
Stack Syslogs
These logs are seen when stack issues are present.
Stack Port Flaps
Example logs:
Aug 9 21:54:22.911: %STACKMGR-6-STACK_LINK_CHANGE: Switch 1 R0/0: stack_mgr: Stack port 1 on Switch 1 is down
Aug 9 21:54:23.011: %STACKMGR-6-STACK_LINK_CHANGE: Switch 1 R0/0: stack_mgr: Stack port 1 on Switch 1 is up
Aug 9 21:54:35.096: %STACKMGR-6-STACK_LINK_CHANGE: Switch 1 R0/0: stack_mgr: Stack port 1 on Switch 1 is down
Aug 9 21:54:35.197: %STACKMGR-6-STACK_LINK_CHANGE: Switch 1 R0/0: stack_mgr: Stack port 1 on Switch 1 is up
Aug 9 21:54:40.334: %STACKMGR-6-STACK_LINK_CHANGE: Switch 2 R0/0: stack_mgr: Stack port 2 on Switch 2 is down
Aug 9 21:54:40.434: %STACKMGR-6-STACK_LINK_CHANGE: Switch 2 R0/0: stack_mgr: Stack port 2 on Switch 2 is up
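When the log buffer is large, counting the down transitions per stack port shows which link flaps most often. This is a minimal sketch that assumes the %STACKMGR-6-STACK_LINK_CHANGE message format shown in the example logs; flap_counts is an illustrative name.

```python
import re
from collections import Counter

LINK_CHANGE = re.compile(
    r"%STACKMGR-6-STACK_LINK_CHANGE:.*Stack port (\d+) on Switch (\d+) is (up|down)"
)


def flap_counts(log_text: str) -> dict[str, int]:
    """Count 'down' events per stack port, keyed as 'switch/port'."""
    downs: Counter[str] = Counter()
    for match in LINK_CHANGE.finditer(log_text):
        port, switch, state = match.groups()
        if state == "down":
            downs[f"{switch}/{port}"] += 1
    return dict(downs)
```

Applied to the example logs above, the function reports two down events on 1/1 and one on 2/2.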
Stack port flaps in half-ring scenarios cause the stack to split and members to be removed. In this scenario, there is a stack of six switches in a half ring. The stack link between switches 1 and 6 is not present, and the stack link between switches 5 and 6 constantly flaps. This causes switch member 6 to be removed from the stack.
Example logs of switch removal:
Apr 9 19:13:25.665: %STACKMGR-6-STACK_LINK_CHANGE: Switch 5 R0/0: stack_mgr: Stack port 1 on Switch 5 is up
Apr 9 19:13:42.513: %STACKMGR-4-SWITCH_REMOVED: Switch 2 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:13:42.588: %STACKMGR-4-SWITCH_REMOVED: Switch 1 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:13:42.827: %STACKMGR-4-SWITCH_REMOVED: Switch 5 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:13:42.999: %STACKMGR-4-SWITCH_REMOVED: Switch 4 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:13:43.031: %STACKMGR-4-SWITCH_REMOVED: Switch 3 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:13:47.666: %STACKMGR-6-STACK_LINK_CHANGE: Switch 5 R0/0: stack_mgr: Stack port 1 on Switch 5 is down
Apr 9 19:25:57.715: %STACKMGR-6-STACK_LINK_CHANGE: Switch 5 R0/0: stack_mgr: Stack port 1 on Switch 5 is up
Apr 9 19:26:15.817: %STACKMGR-4-SWITCH_REMOVED: Switch 2 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:26:15.946: %STACKMGR-4-SWITCH_REMOVED: Switch 1 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:26:16.290: %STACKMGR-4-SWITCH_REMOVED: Switch 5 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:26:16.450: %STACKMGR-4-SWITCH_REMOVED: Switch 3 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:26:16.457: %STACKMGR-4-SWITCH_REMOVED: Switch 4 R0/0: stack_mgr: Switch 6 has been removed from the stack.
Apr 9 19:26:21.717: %STACKMGR-6-STACK_LINK_CHANGE: Switch 5 R0/0: stack_mgr: Stack port 1 on Switch 5 is down
Apr 9 19:38:31.766: %STACKMGR-6-STACK_LINK_CHANGE: Switch 5 R0/0: stack_mgr: Stack port 1 on Switch 5 is up
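To see which member is repeatedly removed and which members report the removal, the %STACKMGR-4-SWITCH_REMOVED messages can be grouped. This sketch assumes the message format shown in the example logs above; removed_members is a hypothetical name.

```python
import re

REMOVED = re.compile(
    r"%STACKMGR-4-SWITCH_REMOVED: Switch (\d+) .*"
    r"Switch (\d+) has been removed from the stack"
)


def removed_members(log_text: str) -> dict[int, set[int]]:
    """Map each removed switch number to the set of switches that reported it."""
    reporters: dict[int, set[int]] = {}
    for match in REMOVED.finditer(log_text):
        reporter, removed = int(match.group(1)), int(match.group(2))
        reporters.setdefault(removed, set()).add(reporter)
    return reporters
```

In the example logs, switch 6 is the only removed member and the removal is reported by switches 1 through 5, which is consistent with a problem on the link between switches 5 and 6.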
High Hardware Interrupts
High hardware interrupts are reported when too many CRC errors are seen on the stack port.
Example logs:
Jun 9 09:28:06.723: %SIF_MGR-1-FAULTY_CABLE: Switch 1 R0/0: sif_mgr: High hardware interrupt seen on switch 1
Jun 9 09:29:06.724: %SIF_MGR-1-FAULTY_CABLE: Switch 1 R0/0: sif_mgr: High hardware interrupt seen on switch 1
Jun 9 09:30:06.725: %SIF_MGR-1-FAULTY_CABLE: Switch 1 R0/0: sif_mgr: High hardware interrupt seen on switch 1
Jun 9 09:31:06.726: %SIF_MGR-1-FAULTY_CABLE: Switch 1 R0/0: sif_mgr: High hardware interrupt seen on switch 1
Jun 9 09:33:06.727: %SIF_MGR-1-FAULTY_CABLE: Switch 1 R0/0: sif_mgr: High hardware interrupt seen on switch 1
Jun 9 09:34:06.728: %SIF_MGR-1-FAULTY_CABLE: Switch 1 R0/0: sif_mgr: High hardware interrupt seen on switch 1
Stack Authentication Issues
This kind of issue can prevent the switch from booting up, in which case show commands are not an option.
'Stack cable authentication failed' is shown when the switch reloads due to this issue.
Example logs:
Jul 5 10:43:33.520: %PMAN-3-PROCESS_NOTIFICATION: R0/0: pvp: System report /crashinfo/system-report_local_20201015-165033-Universal.tar.gz (size: 176 KB) generated
Enter the show version command after the reload.
switch#show version
<omitted output>
Last reload reason: Reload Command <-- switch 1
<omitted output>
Switch 02
Switch uptime : 60 minutes
Base Ethernet MAC Address : aa:aa:aa:aa:aa:aa
Motherboard Assembly Number : 11-11111-11
Motherboard Serial Number : AAAAAAAAAAA
Model Revision Number : FO
Motherboard Revision Number : CO
Model Number : C9300-48P
System Serial Number : AAAAAAAAAAB
Last reload reason : Reload slot command
Switch 03
Switch uptime : 56 minutes
Base Ethernet MAC Address : bb:bb:bb:bb:bb:bb
Motherboard Assembly Number : 22-22222-22
Motherboard Serial Number : BBBBBBBBBBA
Model Revision Number : EO
Motherboard Revision Number : CO
Model Number : C9300L-48P
System Serial Number : BBBBBBBBBBB
Last reload reason : stack cable authentication failure <--
switch#show logging onboard switch 3 uptime detail
UPTIME SUMMARY INFORMATION
First customer power on : 08/13/2019 23:46:07
Total uptime : 0 years 38 weeks 5 days 11 hours 54 minutes
Total downtime : 0 years 22 weeks 3 days 7 hours 45 minutes
Number of resets : 37
Number of slot changes : 3
Current reset reason : stack cable authentication failur <--
Current reset timestamp : 10/15/2020 18:56:09
Current slot : 3
Chassis type : 95
Current uptime : 0 years 0 weeks 0 days 0 hours 56 minutes
UPTIME CONTINUOUS INFORMATION
Time Stamp | Reset Reason | Uptime
MM/DD/YYYY HH:MM:SS | | years weeks days hours minutes
10/15/2020 18:56:09 stack cable authentication failur 0 0 0 0 35 <--
"Stack adapter authentication failed" looks like this when the switch gets reloaded due to this software defect.
Both links down, not waiting for other switches
Switch number is X
*** Stack adapter authentication failed on stack port <1|2> on switch X *** <--
Stack Adapter Auth Fail : SIF SERDES CABLE WESTBOUND
It also can look like this:
Both links down, not waiting for other switches
Switch number is X
*** Stack adapter authentication failed on stack port <1|2> on switch X *** <--
Stack Adapter Auth Fail : SIF SERDES CABLE EASTBOUND
Note: If stack adapter/cable authentication fail is found on the switch, the respective switch is