A few weeks ago, I was involved in decommissioning old ESXi servers that had gone out of warranty, and for this work I had to come up with a migration plan to evacuate the virtual machines to a new cluster.
Throughout this blog post, I will be going through the migration plan.
There was only one requirement from the virtual machine owners: no outage during the migration (losing one or two packets was acceptable).
The VMware infrastructure was set up as follows:
- Two clusters in the same vCenter server (version 5.5), each with four ESXi servers (version 5.5)
- Storage is FC-based, and the clusters are zoned in different IO groups (IBM SVC):
- Source_Cluster in IO group 1
- Destination_Cluster in IO group 0
- Each cluster has its own dvSwitch and 2 x 10GbE uplinks:
- Source_dvSwitch (Version 5.0)
- Destination_dvSwitch (Version 5.5)
- LAG is configured on both Source_Cluster and Destination_Cluster, but without LACP
There were two major areas to look at: the dvSwitch and the shared storage between the clusters.
The first attempt was to take a dedicated ESXi server, pull one uplink out of Source_dvSwitch, and add it to Destination_dvSwitch, while the management VMkernel port stayed on Source_dvSwitch. A few minutes later, the ESXi server was disconnected from the vCenter server and could no longer be pinged. What happened?
I first logged into the ESXi server via the Shell and ran `esxcli network ip neighbor list`, which produced an interesting output:
The management VMkernel port vmk0 could ping 172.27.3.252 and 172.27.3.253, but not 172.27.3.254, the gateway of the subnet. This was why the ESXi server was disconnected.
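For reference, this is the sort of check that exposes the problem from the ESXi Shell. The `vmkping` calls are illustrative additions of mine; the post's finding came from the neighbor list:

```shell
# List ARP neighbors seen by the VMkernel TCP/IP stack
esxcli network ip neighbor list

# Ping through the management VMkernel port explicitly
vmkping -I vmk0 172.27.3.253   # responds
vmkping -I vmk0 172.27.3.254   # the gateway -- no response after the uplink move
```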
I've done some math below, and I strongly recommend this blog to understand how the source and destination IP hash algorithm works.
- Source ESXi server IP Address: 172.27.2.79
- Destination IP Address #1: 172.27.3.253
- Destination IP Address #2: 172.27.3.254
After converting them into Hex values:
- Source ESXi server IP Address: 0xAC1B024F
- Destination IP Address #1: 0xAC1B03FD
- Destination IP Address #2: 0xAC1B03FE
Calculating the XOR of the source and destination IP addresses:
- Source XOR Destination #1: 0x1B2
- Source XOR Destination #2: 0x1B1
Finally, calculating MOD 2 (the number of uplinks in the team) on the results above:
- 0x1B2 MOD 2 = 0, so traffic to 172.27.3.253 uses uplink 0
- 0x1B1 MOD 2 = 1, so traffic to 172.27.3.254 uses uplink 1
Do you see the problem here? The management VMkernel port reaches the gateway 172.27.3.254 via the second uplink, but because that uplink had been removed and added to Destination_dvSwitch, the management VMkernel port lost connectivity to the gateway.
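The calculation above can be reproduced in a few lines of Python. This is a minimal sketch of the hash described in this post (XOR of the two 32-bit addresses, MOD the number of uplinks); the function name is mine, and the real ESXi implementation may differ in detail:

```python
import ipaddress


def ip_hash_uplink(src_ip: str, dst_ip: str, num_uplinks: int = 2) -> int:
    """Pick an uplink index: XOR the 32-bit source and destination
    IPv4 addresses, then MOD by the number of uplinks in the team."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % num_uplinks


# The addresses from the example above:
host = "172.27.2.79"
print(ip_hash_uplink(host, "172.27.3.253"))  # 0x1B2 % 2 -> uplink 0
print(ip_hash_uplink(host, "172.27.3.254"))  # 0x1B1 % 2 -> uplink 1
```

Two destinations one address apart land on different uplinks, which is exactly why pulling a single uplink cut off the gateway but not its neighbors.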
Since this was not a suitable solution, the decision was made to migrate each ESXi server completely from Source_dvSwitch to Destination_dvSwitch.
As the ESXi servers in Source_Cluster and Destination_Cluster were zoned in different IO groups, it wasn't possible to share a VMFS volume between the clusters for the migration.
There were two solutions to this:
- Dedicate one ESXi server in Source_Cluster and zone it in both IO groups (0 and 1) for migration purposes
- Use the new vCenter 5.5 feature, "Change both host and datastore"
To maximise the speed of the migration work, it was decided to go with the first option.
Final Migration Plan
The following was the final migration plan:
1. Dedicate one ESXi server in Source_Cluster
2. Zone it in both IO groups
3. Create a VMFS volume for the migration
4. vMotion virtual machines to the dedicated ESXi server from step 1
5. Storage vMotion the virtual machines to the VMFS volume created in step 3
6. Migrate the dedicated ESXi server from Source_dvSwitch to Destination_dvSwitch
7. vMotion the virtual machines to Destination_Cluster
8. Migrate the dedicated ESXi server back to Source_dvSwitch
9. Repeat steps 4~8 until all virtual machines are migrated
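The cycle above can be sketched as a simple driver loop. Everything here is illustrative (the host and VM names are hypothetical, and nothing talks to vCenter); it only shows the one-time setup versus the per-batch steps that get repeated:

```python
def run_migration(batches, helper_host):
    """Record the actions from the plan above: one-time setup,
    then the per-batch cycle an operator would repeat."""
    log = []
    # One-time setup (plan steps 1-3)
    log.append(f"zone {helper_host} into IO groups 0 and 1")
    log.append("create migration VMFS volume")
    # Repeated cycle (plan steps 4-8), once per batch of VMs
    for batch in batches:
        log.append(f"vMotion {batch} to {helper_host}")
        log.append(f"Storage vMotion {batch} to migration volume")
        log.append(f"move {helper_host} to Destination_dvSwitch")
        log.append(f"vMotion {batch} to Destination_Cluster")
        log.append(f"move {helper_host} back to Source_dvSwitch")
    return log


steps = run_migration([["vm01", "vm02"]], "esx01")
```

In practice each `log.append` would be a PowerCLI or pyVmomi call, but the structure stays the same: the zoning and the migration volume are set up once, while steps 4~8 repeat for every batch.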
One thing I would highlight: the migration plan above is just a guideline. Every VMware infrastructure is different, and you have to adapt it to yours.
I hope the real-life migration scenario described above helps, and if you want another example, it can be found here.
If you have a question or a problem, you are always welcome to leave a message.