Quick post here for a fun little issue I ran into this week.
I’ve been building out a greenfield ACI deployment with some Nexus 6ks “north” of the fabric acting as a core. The 6ks and the fabric are all greenfield at this point, with no connectivity to the rest of the customer’s network, the internet, or anything else. I configured the fabric to use the 6ks as the NTP server, with the intent that the 6ks will eventually point to a low-stratum clock of some kind. I set this up, was happy that NTP was synced and the NTP faults had cleared in the fabric, and went to grab some lunch. When I came back, none of the APICs could see each other (controller state unavailable), yet from each APIC I could still ping every other APIC in band…
Well, the NTP configuration worked, but since the 6ks were fresh out of the box and didn’t have their clocks set, their time was the default: something in 2001. Normally that’s probably just an annoyance (“hey! my clock is way off!”)… not this time! The APICs synced with the 6ks, so their clocks were set to that same date in 2001 as well. What I didn’t know then was that each APIC carries a certificate (automagically there from Cisco) with a “not valid before” date, and when NTP synced up it flipped the APIC clock to a date before the start of the cert’s validity. The controllers then deemed the certificate invalid and couldn’t build the cluster, so the whole APIC cluster failed. I should note here that even though I foobared the entire APIC cluster, the fabric itself was totally fine and kept passing traffic; I just couldn’t make changes to the fabric while the cluster was messed up.
You can verify the health of the cluster with the “acidiag avread” command, which lists all the APICs in the cluster. The output isn’t very friendly to read, but you should be able to see all your APICs that way.
The “show controller” command is much more friendly:
apic1# show controller
Fabric Name          : ACI-Pod-1
Operational Size     : 1
Cluster Size         : 3
Time Difference      : -27031964
Fabric Security Mode : permissive

ID   Address         In-Band IPv4 Address In-Band IPv6 Address      OOB IPv4 Address OOB IPv6 Address          Version      Flags Serial Number    Health
---- --------------- -------------------- ------------------------- ---------------- ------------------------- ------------ ----- ---------------- ----------------
1*   x.x.x.x         0.0.0.0              0.0.0.0                   x.x.x.x          0.0.0.0                   1.2(2g)      crv-  xxxxxxxxxxx      fully-fit

Flags - c:Commissioned | r:Registered | v:Valid Certificate | a:Approved
From this output you can see the controller and some important information: the in-band IP (TEP IP in the infra tenant), any in-band management IPs, the version, the serial number, and, importantly, the flags. In the above output we have the flags “c”, “r”, and “v”; for this particular scenario the one I care about is the valid certificate flag. If you do what I did and let NTP push the clock outside the cert’s validity period, that flag is the thing to look for.
Finally, and perhaps easiest of all, you can use the “acidiag verifyapic” command to check the certificate status as well as the dates between which the certificate is valid:
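If you end up checking this on more than one controller, decoding the flags column by hand gets old fast. Here is a minimal Python sketch of that check. The legend is copied from the “show controller” output above; the function name and the assumption that an unset flag shows up as “-” are mine, not anything documented by Cisco.

```python
# Flag legend copied from the "show controller" output in this post.
FLAG_LEGEND = {
    "c": "Commissioned",
    "r": "Registered",
    "v": "Valid Certificate",
    "a": "Approved",
}


def decode_flags(flags: str) -> dict:
    """Return {flag name: True/False} for a flags column such as 'crv-'.

    Assumption: a flag that is not set is printed as '-' (or left out),
    so only the letters that are present count as set.
    """
    present = set(flags.replace(" ", ""))
    return {name: letter in present for letter, name in FLAG_LEGEND.items()}


status = decode_flags("crv-")  # flags column from the output above
print(status)
if not status["Valid Certificate"]:
    print("No valid certificate flag: compare the APIC clock to the cert dates")
```

With the flags from my output, everything except “Approved” comes back as set. On a controller whose clock has been dragged back before the cert’s start date, I would expect the “v” to be the one that goes missing.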
apic1# acidiag verifyapic
openssl_check: certificate details
subject= CN=xxxxxxxxxxx,serialNumber=PID:APIC-SERVER-M1 SN:xxxxxxxxxxx
issuer= CN=Cisco Manufacturing CA,O=Cisco Systems
notBefore=Mar 21 03:21:13 2015 GMT
notAfter=Mar 21 03:31:13 2025 GMT
openssl_check: passed
ssh_check: passed
all_checks: passed
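To make the failure concrete, here is a small Python sketch of the comparison that bit me: is the current clock inside the certificate’s notBefore/notAfter window? The two dates are taken from the verifyapic output above. The 2001 and 2016 timestamps and the function names are hypothetical, picked only for illustration, and this is of course not how the APIC itself performs the check.

```python
from datetime import datetime, timezone

# Date layout used in the notBefore/notAfter lines of the output above.
OPENSSL_FMT = "%b %d %H:%M:%S %Y GMT"


def parse_openssl_date(text: str) -> datetime:
    """Parse e.g. 'Mar 21 03:21:13 2015 GMT' into a timezone-aware datetime."""
    return datetime.strptime(text, OPENSSL_FMT).replace(tzinfo=timezone.utc)


def clock_within_validity(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """True if 'now' falls inside the certificate validity window."""
    return not_before <= now <= not_after


not_before = parse_openssl_date("Mar 21 03:21:13 2015 GMT")
not_after = parse_openssl_date("Mar 21 03:31:13 2025 GMT")

# Hypothetical clocks: an unset switch default in 2001 vs. a sane date.
bad_clock = datetime(2001, 1, 1, tzinfo=timezone.utc)
good_clock = datetime(2016, 6, 1, tzinfo=timezone.utc)

print(clock_within_validity(bad_clock, not_before, not_after))   # False
print(clock_within_validity(good_clock, not_before, not_after))  # True
```

Any clock earlier than March 2015 lands before notBefore, which is exactly where a factory-default 2001 date puts you.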
So to fix all of this hot mess, the date/time had to be set manually on the 6k. By then, though, the time was too far out of range for the APICs to sync back up on their own, so we also had to set the time manually on the APIC. You can do that like so:
dbgtoken
login root
date MMDDhhmmYYYY
Note that you will need the debug token and internal access (TAC) to get this fixed.
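The MMDDhhmmYYYY argument is easy to fat-finger (month and day first, year last), so here is a tiny Python sketch that builds it from a timestamp. The helper name and the example timestamp are hypothetical; the only thing taken from above is the field order.

```python
from datetime import datetime


def date_arg(when: datetime) -> str:
    """Format a timestamp as MMDDhhmmYYYY (month, day, hour, minute, year)."""
    return when.strftime("%m%d%H%M%Y")


# Hypothetical target time: 1 June 2016, 14:30.
print("date " + date_arg(datetime(2016, 6, 1, 14, 30)))  # date 060114302016
```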