Introduction
I was in the process of upgrading ESXi servers from 5.0 to 5.5 and noticed that a number of virtual machines were randomly disconnected from the network.
In this blog post, I will go through:
- Symptom
- Workaround
Symptom
The cluster had 8 nodes (N+1), and the plan was to upgrade one ESXi server per day. The day after the first ESXi server was upgraded to 5.5, a few incidents were raised saying some virtual machines were not accessible over the network.
To summarise:
- One ESXi server was upgraded to 5.5
- Some virtual machines disconnected from network across multiple ESXi servers
- ESXi servers were manageable via vCenter Server
- Virtual machines were also manageable, e.g. reconfiguring a virtual machine worked
My first step was to check the ESXi logs, and I found some interesting lines in vmkernel.log:
2014-11-07T22:07:47.202Z cpu16:1804972)Net: 1652: connected test_vm1.eth1 eth1 to vDS, portID 0x200002e
2014-11-07T22:07:47.202Z cpu16:1804972)Net: 1985: associated dvPort 513 with portID 0x200002e
2014-11-07T22:07:47.202Z cpu16:1804972)etherswitch: L2Sec_EnforcePortCompliance:247: client test_vm1.eth1 requested mac address change to 00:00:00:00:00:00 on port 0x200002e, disallowed by vswitch policy
2014-11-07T22:07:47.202Z cpu16:1804972)etherswitch: L2Sec_EnforcePortCompliance:356: client test_vm1.eth1 has policy vialations on port 0x200002e. Port is blocked
2014-11-07T22:07:47.202Z cpu16:1804972)etherswitch: L2Sec_EnforcePortCompliance:247: client test_vm1.eth1 requested mac address change to 00:00:00:00:00:00 on port 0x200002e, disallowed by vswitch policy
2014-11-07T22:07:47.202Z cpu16:1804972)etherswitch: L2Sec_EnforcePortCompliance:356: client test_vm1.eth1 has policy vialations on port 0x200002e. Port is blocked
2014-11-07T22:07:47.202Z cpu16:1804972)WARNING: NetPort: 1245: failed to enable port 0x200002e: Bad parameter
2014-11-07T22:07:47.202Z cpu16:1804972)NetPort: 1427: disabled port 0x200002e
2014-11-07T22:07:47.202Z cpu16:1804972)WARNING: Net: vm 1804972: 377: cannot enable port 0x200002e: Bad parameter
2014-11-07T23:50:39.821Z cpu4:4151)Net: 2191: dissociate dvPort 513 from port 0x200002e
2014-11-07T23:50:39.821Z cpu4:4151)Net: 2195: disconnected client from port 0x200002e
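To spot every affected client and port at a glance rather than scrolling through the log by hand, a small script along these lines can pull out the blocked-port entries. This is just a convenience sketch: the log path and the message pattern are taken from the entries above and may differ between ESXi builds.

import re

LOG_PATH = "/var/log/vmkernel.log"   # or a copy of the log pulled off the host

# matches the L2Sec_EnforcePortCompliance "policy violation" entries shown above
blocked = re.compile(r"client (\S+) has policy \w+ on port (0x[0-9a-fA-F]+)")

with open(LOG_PATH) as log:
    for line in log:
        match = blocked.search(line)
        if match:
            client, port = match.groups()
            print(f"{client} blocked on port {port}")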
While Googling, I found a VMware KB article related to this issue, which can be found here. The problem was quite simple: ESXi 4.x and 5.0 only support DVUplink names of fewer than 31 characters, and guess what, our DVUplink names were longer than 31 characters!
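To confirm which uplink names break the limit, a short pyVmomi sketch like the one below can list every DVUplink name on each distributed switch along with its length. The vCenter hostname and credentials are placeholders, and pyVmomi is assumed to be installed (pip install pyvmomi).

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()              # lab use only; skips cert validation
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.DistributedVirtualSwitch], True)
for dvs in view.view:
    for uplink in dvs.config.uplinkPortPolicy.uplinkPortName:
        # flag names at or over the 31-character limit described in the KB
        flag = "TOO LONG" if len(uplink) >= 31 else "ok"
        print(f"{dvs.name}: {uplink} ({len(uplink)} chars) {flag}")

view.Destroy()
Disconnect(si)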
Investigating the vMotion history revealed the root cause: DRS was load balancing workloads, and in the process some virtual machines were moved from the upgraded ESXi 5.5 server onto the remaining ESXi 5.0 servers.
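The vMotion history itself can be pulled from vCenter's event log. A rough pyVmomi sketch, reusing the connection from the previous snippet and assuming the standard vSphere migration event types, looks like this:

from pyVmomi import vim

def recent_migrations(si, max_events=200):
    event_mgr = si.RetrieveContent().eventManager
    spec = vim.event.EventFilterSpec(
        eventTypeId=["VmMigratedEvent", "DrsVmMigratedEvent"])
    for ev in event_mgr.QueryEvents(spec)[:max_events]:
        # for a migration event, sourceHost is where the VM came from and
        # host is where the event was logged, i.e. the destination
        print(f"{ev.createdTime} {ev.vm.name}: "
              f"{ev.sourceHost.name} -> {ev.host.name}")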
Workaround
The resolution the above KB article suggests is to patch ESXi 5.0 to Patch 5. However, the patch requires a reboot, which means it would take approximately the same time as upgrading the host straight to 5.5. The better option was to upgrade ESXi 5.0 to 5.5 and prevent virtual machines running on ESXi 5.5 hosts from moving back to ESXi 5.0 hosts, i.e. set DRS to manual. The only problem with this was that manual vMotion would be required to evacuate virtual machines before placing each ESXi server into maintenance mode.
Alternatively, it was possible to rename the DVUplinks to fewer than 31 characters and keep fully automated DRS, but our standard was to leave the DVUplink names at their defaults.
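For reference, had we gone that route, renaming the uplinks is a single DVS reconfigure. A hedged pyVmomi sketch (we did not actually do this; the new names passed in are placeholders) would look roughly like:

from pyVmomi import vim

def rename_uplinks(dvs, new_names):
    spec = vim.DistributedVirtualSwitch.ConfigSpec()
    spec.configVersion = dvs.config.configVersion   # required for a reconfigure
    spec.uplinkPortPolicy = vim.DistributedVirtualSwitch.NameArrayUplinkPortPolicy(
        uplinkPortName=new_names)                    # e.g. ["uplink1", "uplink2"]
    return dvs.ReconfigureDvs_Task(spec)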
After a discussion, we came up with a workaround:
- Set DRS to manual
- Before working on an ESXi server, set DRS to fully automated
- Place the ESXi server into maintenance mode
- Set DRS back to manual once the maintenance mode task kicks off the evacuation of virtual machines
- Upgrade the ESXi server
- Repeat steps 2-5 for the remaining hosts
This way, I could evacuate virtual machines to other ESXi servers automatically without worrying about virtual machines being vMotioned from 5.5 back to 5.0 hosts. A rough sketch of the DRS toggle is shown below.
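For anyone who prefers to script the toggle instead of clicking through the vSphere Client, here is a minimal pyVmomi sketch of steps 2 to 4. The cluster and host objects are assumed to have been looked up already, and waiting on the returned tasks is omitted for brevity.

from pyVmomi import vim

def set_drs_behavior(cluster, behavior):
    # behavior: vim.cluster.DrsConfigInfo.DrsBehavior.manual / .fullyAutomated
    spec = vim.cluster.ConfigSpecEx(
        drsConfig=vim.cluster.DrsConfigInfo(defaultVmBehavior=behavior))
    return cluster.ReconfigureComputeResource_Task(spec, modify=True)

def drain_host(cluster, host):
    # step 2: let DRS move VMs off the host automatically
    set_drs_behavior(cluster, vim.cluster.DrsConfigInfo.DrsBehavior.fullyAutomated)
    # step 3: entering maintenance mode makes DRS evacuate the running VMs
    task = host.EnterMaintenanceMode_Task(timeout=0)
    # step 4: once the evacuation vMotions have kicked off, pin DRS back to manual
    # (in practice we watched the task before flipping this back)
    set_drs_behavior(cluster, vim.cluster.DrsConfigInfo.DrsBehavior.manual)
    return task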
Hope this helped and feel free to leave a comment.