Last weekend, I upgraded 4 ESXi servers from 5.0 U2 to 5.5.
After the upgrade everything looked fine: there were no issues with the virtual machines, the new HA agents were installed on the ESXi hosts without any problems, and so on.
A few hours later, I was called out because a number of virtual machines had dropped off the network. Whenever this happened, all I had to do was vMotion the affected virtual machines to other ESXi hosts or disconnect/reconnect their virtual NICs. That fixed the issue, but a few hours later I was called out again with the same problem!
The first thing I wanted to check was what was happening at the guest OS level, and luckily there was a Windows Server VM I had access to. I listed the ARP table first and, as expected, there were no entries at all. I also checked Device Manager, and interestingly the network adapter was being uninstalled! This gave me an idea that it might be:
- VMware Tools issue
- VMXNET3 issue
But these hunches weren't enough to prove anything; I needed something more specific, so I started looking into the logs. While going through vmkernel.log, I found some interesting lines:
2014-02-04T04:30:24.535Z cpu19:33048)MirrorThrottled.etherswitch: MirrorToPorts:3386: session legacy_promiscuous: failed to output 260 pkts to dst 0x33554441 during mirroring: Out of slots
2014-02-04T04:30:24.535Z cpu21:36469)MirrorThrottled.etherswitch: MirrorToPorts:3386: session legacy_promiscuous: failed to output 223 pkts to dst 0x33554441 during mirroring: Out of slots
2014-02-04T04:30:24.589Z cpu21:36994)MirrorThrottled.etherswitch: MirrorToPorts:3386: session legacy_promiscuous: failed to output 258 pkts to dst 0x33554441 during mirroring: Out of slots
2014-02-04T04:30:24.590Z cpu17:36997)MirrorThrottled.etherswitch: MirrorToPorts:3386: session legacy_promiscuous: failed to output 1 pkts to dst 0x33554441 during mirroring: Out of slots
2014-02-04T04:30:24.590Z cpu22:36848)MirrorThrottled.etherswitch: MirrorToPorts:3386: session legacy_promiscuous: failed to output 251 pkts to dst 0x33554441 during mirroring: Out of slots
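If you want to check your own hosts for the same messages, they can be pulled straight out of the ESXi shell (a rough sketch; /var/log/vmkernel.log is the default location on ESXi 5.x):
# Pull the port-mirroring errors out of the current vmkernel log
grep -i "Out of slots" /var/log/vmkernel.log
# Or watch for new occurrences as they happen
tail -f /var/log/vmkernel.log | grep -i MirrorToPorts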
I could see that port 0x33554441 on the dvSwitch was consuming all of the available mirroring slots. After checking, it turned out that port 0x33554441 belonged to a vADM (VMware Application Discovery Manager) collector. I powered off all of the collectors, monitored the cluster for 5-6 hours afterwards, and it stabilised.
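In case it helps anyone doing the same investigation, a port number from the log can be mapped back to a VM from the ESXi shell with standard 5.5 commands; this is only a sketch, and the port ID printed in the log may need converting between hex and the decimal IDs these tools display:
# List every switch port on the host with the attached client (VM, vmk or uplink)
net-stats -l
# Or list the VM networking worlds, then the ports owned by a particular VM
esxcli network vm list
esxcli network vm port list -w <worldID>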
In summary, if you have deployed vADM collectors in the cluster and are facing the issue above, power them off. It's not a permanent solution, but it will stop people complaining about network outages.
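If you prefer to power the collectors off from the ESXi shell rather than the vSphere Client, something along these lines works (a sketch; "collector" is just an example name filter and <vmid> is the ID returned by the first command):
# Find the VM IDs of the collector VMs (the name filter is only an example)
vim-cmd vmsvc/getallvms | grep -i collector
# Power a collector off by its VM ID, then confirm the power state
vim-cmd vmsvc/power.off <vmid>
vim-cmd vmsvc/power.getstate <vmid>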
In a few days I should get an answer from the vADM team, and I will update this post.
Update
The solution to this problem is to change the network adapter type from VMXNET2 to VMXNET3.
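As a quick way to spot which VMs are still on the older adapter, the configured type is recorded in each VM's .vmx file, so a rough check from the ESXi shell looks something like this (just a sketch; the datastore path glob is an example, and if I remember right a VMXNET2 adapter is recorded as "vmxnet" while VMXNET3 shows as "vmxnet3"):
# List the configured adapter type for every registered VM (example path glob)
grep virtualDev /vmfs/volumes/*/*/*.vmx
Note that you typically can't change the type of an existing adapter in place in the vSphere Client; you remove the VMXNET2 adapter and add a new VMXNET3 one, which means a short interruption for each collector.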
Comments
Did you ever find a resolution for this issue that allowed you to re-enable the vADM collectors? Also, are you using a dVS or a VSS? Thanks!
Thanks for the reply; I completely forgot to update this blog post with the solution.
The fix was quite simple: change the network adapter type from VMXNET2 to VMXNET3.
We were using a distributed switch 🙂