This is a blog post about how I thought I lost my mind this weekend and forgot how to do the spanning-trees.
The project couldn’t have been any simpler — a single 6509 core (I know, I know… but dual Sups at least) connected to some Nexus 93128 switches for the servers to land on. The 9ks function as the default gateway for the servers, and have vPCs to servers where applicable. A static default route (no licensing for dynamic routing… again I know…) pointed back to the 6k. The routing was happening over a VLAN since there was a requirement for some L2 between the 6k/9ks while servers migrated. Easy right?
Physical connectivity to each 9k was a pair of 10g ports from the SUP-2Ts. These connections landed on a 40g QSFP port with the QSFP->SFP+ adapter module. Not super relevant, but I can say I’ve implemented “40g” now 🙂
The uplink was a simple vPC, trunking a native VLAN and the VLAN that the routing is happening on. vPC Peer-Keepalive was up and happy, the Peer-Link was up and happy, and everything was looking good. Once the uplinks were connected, the 6k saw the 9ks as CDP neighbors and things were trucking along… but… the next hop for the default route was not reachable. So at this point I’m like okay, I suck and fat-fingered a VLAN or left it shut down or something, so I go through and check it all out. The L2 VLAN is created on both the 9ks and the 6k, and the L3 SVI is created and is up/up on both sides. The 9ks are looking good (can ping the standby IP, and both of the real IPs there), but still no dice on pinging the 6k. So I take a look at the ARP table for my next hop and see that I’m getting an incomplete entry. At this point it’s pretty obvious that it’s an L2 thing, so I double and triple check that the VLANs are created, and all that looks good. Next, I check spanning-tree and see that everything is operating in the same mode, and that the 6k is set up to be the root for this VLAN I’m sending between the boxes. At first glance, the 6k side seems cool: he knows he’s the root for this VLAN and the ports/port-channel are all up, so I moved on…
Over on the shiny new 9ks, though, things are not so awesome! The 9ks apparently thought because they were new and cool that they needed to be the center of the universe, and showed that they were in fact the root for this VLAN. Well that sure ain’t cool. Interestingly, for the native VLAN, the 9ks seem to recognize that the 6k is the root. Weird. Configs look fine, however, and just to be sure the 9ks REALLY aren’t the root, I bumped the priority value for that VLAN to the max (the highest value being the least preferred). Bump the vPC, and things are still weird. The 9ks still think they are the root (well, only one of them obviously, but the point being that they still don’t see the 6k as root). Jumping back over to the 6k and taking a closer look, I see that the 6k port-channel that goes to the 9ks is in blocking state for that VLAN. Okay… furthermore, it says it’s in blocking state and lists “P2p Dispute.”
After staring at this and shutting and un-shutting the ports for a few minutes to see if it’s going to magically start working, I say screw it and kill the second 9k, and whittle things down to a single trunk port without any port-channel/vPC to muddy the waters. Same exact results… okay, what the hell is happening here. Now I’m starting to wonder if there is something silly going on with the QSFP ports/adapters, so I change things up to a routed link and try to ping across. 100% success for however many pings I want to throw at it. So this yet again confirms that L1 is good to go. At this point I used the phone-a-friend option, thinking I’m just blind and/or completely forgot how to do anything with spanning-tree. Thankfully my awesome co-worker picked up and we hopped on a webex, and after about 30 minutes he verified that I was in fact not losing my mind! Big relief, but obviously things were still broken.
As a final test before calling TAC to cry we moved the 6k/9k connectivity off the QSFP ports and onto a copper port on the 9k (96 10g copper ports on the 93128TX). Boom — 9ks see the 6k as root and everything starts working as expected.
Turns out there is a nifty feature (not a bug of course) on the 93128 running this particular version of NX-OS where the 40g ports just decide that they don’t want to receive PVST BPDUs… which of course caused the 6k to not be seen as root, and the 9k to advertise itself as root, causing the dispute on the 6k side and moving that port to a blocking state.
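If it helps to see why one-way BPDU loss ends in a dispute, here is a rough Python sketch of the logic described above. It is a toy model of two bridges on a single link, not how NX-OS or IOS actually implement spanning-tree. The `Bridge` and `Bpdu` classes, the `simulate` function, the MAC addresses, and the priority values (4096 for the 6k, 61440 for the 9k) are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

BridgeId = Tuple[int, str]  # (priority, MAC address); the lower tuple wins


@dataclass
class Bpdu:
    root_id: BridgeId   # who the sender believes is the root
    designated: bool    # sender claims the designated role on this link
    forwarding: bool    # sender's port is currently forwarding


class Bridge:
    def __init__(self, bridge_id: BridgeId, drops_rx_bpdus: bool = False):
        self.bridge_id = bridge_id
        self.drops_rx_bpdus = drops_rx_bpdus  # models the faulty 40g port
        self.root_id = bridge_id              # every bridge starts out claiming root
        self.designated = True
        self.state = "forwarding"

    def send(self) -> Bpdu:
        return Bpdu(self.root_id, self.designated, self.state == "forwarding")

    def receive(self, bpdu: Bpdu) -> None:
        if self.drops_rx_bpdus:
            return  # the BPDU never reaches spanning-tree
        if bpdu.root_id < self.root_id:
            # Superior BPDU: accept the neighbor's root, port becomes the root port.
            self.root_id = bpdu.root_id
            self.designated = False
            self.state = "forwarding"
        elif bpdu.root_id > self.root_id and bpdu.designated and bpdu.forwarding:
            # Inferior BPDU from a neighbor that still claims designated and
            # forwarding: it is clearly not hearing us, so block to avoid a loop.
            self.state = "blocking (dispute)"


def simulate(nine_k_drops_bpdus: bool):
    # Hypothetical bridge IDs: the 6k is configured as root, the 9k is not.
    six_k = Bridge((4096, "00:00:00:00:00:01"))
    nine_k = Bridge((61440, "00:00:00:00:00:02"), drops_rx_bpdus=nine_k_drops_bpdus)
    for _ in range(3):  # a few hello intervals
        nine_k.receive(six_k.send())
        six_k.receive(nine_k.send())
    return six_k, nine_k


if __name__ == "__main__":
    for drops in (False, True):
        six_k, nine_k = simulate(drops)
        print(f"9k drops BPDUs={drops}: 6k port is {six_k.state}, "
              f"9k thinks it is root={nine_k.root_id == nine_k.bridge_id}")
```

With the drop flag off, the 9k accepts the 6k as root and the 6k port stays forwarding. With it on, the 9k keeps claiming root no matter what its priority is set to, and the 6k port lands in the dispute state, which matches what I saw.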
While troubleshooting this one my good friend Google was not very helpful, so I figured I’d write about it. Hopefully if you’ve found this post I can save ya some time 🙂
Bug ID: CSCup33000
Description: PVST BPDUs dropped on Nexus 9300 40G ports