Introduction
VMware® vCenter™ Site Recovery Manager™ is a disaster recovery offering that provides automated orchestration and non-disruptive testing for virtualized applications (reference: http://www.vmware.com/products/site-recovery-manager). Before SRM was in place, the disaster recovery process was performed manually by storage and virtualisation administrators. Approximately 100 vDisks were mirrored from the protected site to the recovery site, and because the process was manual, the RPO customers expected could not be delivered when a disaster happened. Introducing SRM will satisfy customers’ SLAs.
Before SRM
Critical virtual machines were protected by IBM Metro Mirror functionality. The problem was that the process was manual and involved the storage, virtualisation and OS teams. Dependencies between virtual machines, e.g. a database must be powered on before a web server, also had to be documented and referenced before and after failing over to the protected site. This consumed a lot of time because the process was so complex that people got confused easily, even though the documentation was available.
Two manual processes are outlined below:
1. To fail over virtual machine(s) in a disaster situation (assuming power or storage outage):
- Make sure there are no I/Os to the Master vDisk, i.e. power-off virtual machine, unmount VMFS volume, rescan HBAs
- Break the relationship, making the Auxiliary vDisk writeable, i.e. svctask stoprcrelationship -access "Relationship ID"
- Map the Auxiliary vDisk at the recovery site
- Mount the VMFS volume
- Register the virtual machine
- Assign a port group
- Power-on
2. To fail back virtual machine(s) to the protected site:
- Make the Auxiliary vDisk the primary and start synchronising, i.e. svctask startrcrelationship -force -primary aux "Relationship ID"
- Once the relationship is consistently synchronised, power off and unregister the virtual machine and unmap the Auxiliary vDisk from the recovery site
- Make sure there are no I/Os to the Auxiliary vDisk, i.e. power-off virtual machine, unmount VMFS volume, rescan HBAs
- Map the Master vDisk to the protected site
- Mount the VMFS volume
- Register the virtual machine
- Assign a port group
- Power-on
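The two manual runbooks above can also be written down as ordered data, which makes the sequence easier to review before a failover. This is only a minimal sketch: the function names and the relationship ID "rel0" are hypothetical, the only SVC commands used are the two quoted in the steps above, and nothing here is run against a cluster.

```python
def manual_failover_steps(relationship_id):
    """Return the manual failover steps from list 1, in order."""
    return [
        "Stop I/O to the Master vDisk (power off VM, unmount VMFS, rescan HBAs)",
        f'svctask stoprcrelationship -access "{relationship_id}"',
        "Map the Auxiliary vDisk at the recovery site",
        "Mount the VMFS volume",
        "Register the virtual machine",
        "Assign a port group",
        "Power on",
    ]


def manual_failback_steps(relationship_id):
    """Return the manual failback steps from list 2, in order."""
    return [
        f'svctask startrcrelationship -force -primary aux "{relationship_id}"',
        "Wait until the relationship is consistently synchronised",
        "Power off and unregister the VM, unmap the Auxiliary vDisk",
        "Confirm there is no I/O to the Auxiliary vDisk",
        "Map the Master vDisk to the protected site",
        "Mount the VMFS volume",
        "Register the virtual machine",
        "Assign a port group",
        "Power on",
    ]


# "rel0" is a placeholder relationship ID, not a real object.
for number, step in enumerate(manual_failover_steps("rel0"), start=1):
    print(f"{number}. {step}")
```

Seeing both lists side by side shows how many hand-offs between teams the manual process needed.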
After SRM Deployed
The manual processes explained above are automated by SRM:
- Recovery
- Re-protect
Also, because priorities and dependencies can be assigned to virtual machines during the recovery process, the processes became much simpler.
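To illustrate what a dependency such as "database before web server" means for power-on order, here is a small sketch using Python's standard graphlib module. It is not how SRM implements priorities or dependencies internally; the VM names are hypothetical and the code only shows the ordering idea.

```python
from graphlib import TopologicalSorter


def power_on_order(dependencies):
    """Return a power-on order that respects the dependencies.

    `dependencies` maps each VM name to the set of VMs that must be
    powered on before it.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical VMs: the web and app servers both need the database first.
deps = {
    "db01": set(),
    "app01": {"db01"},
    "web01": {"db01", "app01"},
}

for vm in power_on_order(deps):
    print("power on", vm)
```

A circular dependency raises graphlib.CycleError, which is the kind of mistake that was easy to miss in hand-written documentation.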
Products
The following products are used for the testing:
- vCenter server 5.5.0b
- ESXi 5.5 Build number (Releasebuild-1474528)
- Site Recovery Manager 5.5.0b
- IBM SAN Volume Controller Storage Replication Adapter 2.2.0
- SAN Volume Controller 7.1.0.7 (build 80.4.1312030000)
Pre-requisites
Pre-configured SRA environment:
- Create an equal number of FlashCopy® (target) volumes as the Remote Copy target volumes on the recovery site SAN Volume Controller.
- Create a background copy and incremental FlashCopy mapping between Remote Copy target volumes and the previously created FlashCopy target volumes on the recovery site SAN Volume Controller.
- If the remote copies are in a consistency group, create a corresponding FlashCopy consistency group and configure the corresponding FlashCopy to the FlashCopy consistency group.
- Map the Remote Copy target and FlashCopy target volumes to the recovery site vSphere servers.
- Create an equal number of FlashCopy (target) volumes as the Remote Copy source volumes on the protected site SAN Volume Controller.
- Create a background copy and incremental FlashCopy mapping between Remote Copy source volumes and the previously created FlashCopy target volumes on the protected site SAN Volume Controller.
- If the remote copies are in a consistency group, create a corresponding FlashCopy consistency group and configure the corresponding FlashCopy to the FlashCopy consistency group.
- Map the Remote Copy source and FlashCopy target volumes to the protected site vSphere servers.
- A CopyOperator privilege suffices if you pre-create the needed volumes and map them to the recovery site ESXi servers
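The checklist above is easy to get wrong with around 100 vDisks, so a quick consistency check over an exported inventory can help. The sketch below is an assumption-heavy illustration: the SiteInventory structure and the volume names are hypothetical, and it only checks two of the rules above (one FlashCopy target per Remote Copy volume, and both volumes mapped to the site's vSphere hosts).

```python
from dataclasses import dataclass


@dataclass
class SiteInventory:
    """Hypothetical per-site inventory, filled in by hand or from an export."""
    remote_copy_volumes: list   # Remote Copy source or target volume names
    flashcopy_map: dict         # Remote Copy volume -> FlashCopy target volume
    host_mapped: set            # volumes mapped to the site's vSphere hosts


def check_preconfigured(site):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for volume in site.remote_copy_volumes:
        target = site.flashcopy_map.get(volume)
        if target is None:
            problems.append(f"{volume}: no FlashCopy target volume/mapping")
            continue
        for name in (volume, target):
            if name not in site.host_mapped:
                problems.append(f"{name}: not mapped to the vSphere hosts")
    return problems


# Hypothetical recovery-site inventory with one missing FlashCopy target.
recovery = SiteInventory(
    remote_copy_volumes=["aux_vol1", "aux_vol2"],
    flashcopy_map={"aux_vol1": "fc_vol1"},
    host_mapped={"aux_vol1", "fc_vol1", "aux_vol2"},
)
print(check_preconfigured(recovery))
```

The same check can be run against the protected site by listing the Remote Copy source volumes instead.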
Architecture
The following diagram represents the Pre-configured SRA environment:
One thing to keep an eye on is the FlashCopy(s), for both Master and Auxiliary vDisks. Details will be provided in the next section.
What does SRM provide?
SRM Test Process
SRM provides functionality to simulate failover of protected virtual machines from the protected site to the recovery site. This is where FlashCopy is involved; the test does not impact the Master or Auxiliary vDisk, i.e. the actual protected virtual machine. Once the VMFS volume, backed by the FlashCopy, is presented and mounted on the recovery site:
- Dummy virtual machine(s) are registered
- A dummy portgroup is created and assigned to the virtual machine
- The dummy portgroup is a portgroup without any uplinks attached, so duplicate IP addresses are not a concern.
- The virtual machine is powered on.
Detailed steps (VMware and SVC) are shown below.
Protected Site
VMware | SVC |
None | svctask startrcrelationship -force -primary master "Relation ID" |
Recovery Site
VMware | SVC |
Rescan all HBAs | svctask prestartfcmap "FlashCopy Map ID" |
Resolve VMFS volumes | svctask startfcmap "FlashCopy Map ID" |
Refresh host storage system | |
Reconfigure virtual machine | |
Add virtual switch | |
Add port group | |
Power on virtual machine | |
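On the SVC side, the test boils down to preparing and then starting each FlashCopy mapping at the recovery site. The sketch below only builds the two command strings shown in the table for a list of mapping IDs; the function name and the IDs are hypothetical, and the commands are printed rather than executed.

```python
def srm_test_fcmap_commands(fcmap_ids):
    """Build the prepare/start command pair for each FlashCopy mapping."""
    commands = []
    for fcmap_id in fcmap_ids:
        # A mapping is prepared first and started afterwards.
        commands.append(f'svctask prestartfcmap "{fcmap_id}"')
        commands.append(f'svctask startfcmap "{fcmap_id}"')
    return commands


# "fcmap0" and "fcmap1" are placeholder mapping IDs.
for command in srm_test_fcmap_commands(["fcmap0", "fcmap1"]):
    print(command)
```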
I won’t go through the cleanup process as it’s the reverse of the steps above.
SRM Recovery Process
This is where the actual game begins. This feature fails over protected virtual machine(s) to the recovery site, and it involves the actual Master and Auxiliary vDisks.
SRM provides two types of Recovery:
- Planned Migration
- Disaster Recovery
The only difference between these two is that planned migration will stop the failover if SRM encounters any errors (the sites must be connected and the replication must be available and up-to-date), whereas disaster recovery won’t. In more detail, even if the relationship is broken and/or the data is not up-to-date, disaster recovery will use the latest synced Auxiliary vDisk, whereas planned migration will stop and throw an error.
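The difference in error handling between the two recovery types can be pictured with a short sketch. This is not SRM's implementation; the step names and the failing replication step are hypothetical and exist only to show "stop on first error" against "record the error and carry on".

```python
def run_recovery(steps, mode):
    """Run (name, action) steps in order.

    mode "planned": stop at the first error.
    mode "disaster": record the error and continue with the next step.
    Returns the list of (step name, error message) that were tolerated.
    """
    tolerated = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            if mode == "planned":
                raise RuntimeError(f"planned migration stopped at '{name}': {exc}") from exc
            tolerated.append((name, str(exc)))
    return tolerated


def sync_replication():
    # Hypothetical failure: the protected site cannot be reached.
    raise ConnectionError("replication link down")


def power_on_vms():
    pass  # stands in for the recovery-site power-on steps


steps = [("sync replication", sync_replication), ("power on VMs", power_on_vms)]
print(run_recovery(steps, "disaster"))
```

Calling run_recovery(steps, "planned") with the same steps raises at the replication step instead of continuing.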
Detailed steps (VMware and SVC) are shown below. Note that these steps are for Planned Migration. They will be different if the protected site faces a complete outage, i.e. a power/storage outage, because the Master vDisk will be offline and the relationship will be stopped as soon as the outage happens.
Protected Site
VMware | SVC |
Power-off virtual machine(s) | svctask startrcrelationship -force -primary master "Relation ID" |
Un-register virtual machine(s) | |
Un-mount VMFS volume(s) | |
Detach SCSI LUN(s) | |
Recovery Site
VMware | SVC |
Rescan all HBAs | svctask switchrcrelationship -primary aux "Relation ID" |
Attach SCSI LUN | svctask stoprcrelationship "Relation ID" |
Resolve VMFS volumes | svctask prestartfcmap "FlashCopy Map ID" |
Refresh host storage system | svctask startfcmap "FlashCopy Map ID" |
Update virtual machine files | |
Reload virtual machine from new configuration | |
Reconfigure virtual machine | |
Reload virtual machine | |
Reconfigure virtual machine | |
Power-On | |
Re-protect
The original protected site has now become the recovery site and vice versa. Re-protect will only be available if the recovery site (the original protected site) is online, i.e. when the datacentre comes back online after a disaster.
Re-protect makes sure the relationship is started and the sync is up-to-date, so that when a virtual machine fails back to the original protected site, the latest copy is available.
This process also cleans up the leftovers at the old protected site, i.e. virtual machines and VMFS volumes.
Protected Site
VMware | SVC |
Delete file | svctask startrcrelationship -force -primary aux "Relation ID" |
Recovery Site
VMware | SVC |
Create virtual machine | |
Unregister virtual machine | |
Reload virtual machine from new configuration | |
Reconfigure virtual machine | |
Delete file | |
Reconfigure virtual machine | |
Delete state information for detached SCSI LUN | |
Delete state information for un-mounted VMFS volume | |
Rescan all HBAs | |
Refresh host storage system | |
Thoughts
SRM is a great automation tool that will reduce administrators’ workloads when disaster happens. Installation and configuration aren’t hard, but documentation is. Most of the time will go into writing solid documentation before creating protection groups and recovery plans (well, some kind of manager will be writing it, right?). The list below is what needs to be documented:
- Roles and User Groups
- Protection Group(s)
- Recovery Plan(s)
- Dependencies
- Customisation, e.g. scripts, IP address change
- DR Testing
Another advantage is that, based on groups and permissions, the owners of the virtual machines can log in to SRM and start disaster recovery themselves, so virtualisation administrators do not have to be involved. This gives the owners flexibility. However, there will be a lot of political games 🙂
One problem from a business point of view is cost. As described above, FlashCopy(s) must be created for both Master and Auxiliary vDisks. This means that to protect virtual machine(s), the owners need to pay for 4 x disk. For example, if the virtual machines total 256GB, the owner will need to pay for 1024GB. To improve this, I would suggest:
- Create FlashCopy(s) with the cheapest storage
- Configure with non-preconfigured environment
With a non-preconfigured environment, a specific MDisk group is used purely for FlashCopy(s), so the administrator can put the cheapest storage in this MDisk group. A non-preconfigured environment also supports 3 volume types:
- Standard
- Thin Provisioned/Space Efficient
- Compressed Volume
I will need to do more testing once a dedicated MDisk group is ready to see which one works best, but I think thin provisioned will be the best choice. With these, the cost can possibly be reduced to 3 x disk. Another advantage of using a dedicated MDisk group for FlashCopy(s) is that it won’t impact any production/test back-end storage while FlashCopy(s) are copying.
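The disk arithmetic above can be written out as a small sketch. The function name is hypothetical, and so is the flashcopy_fraction parameter: it stands for how much of a FlashCopy target's nominal size is actually billed. The value 0.5 is not a measurement; it is simply the fraction that would give the 3 x figure mentioned above.

```python
def billed_capacity_gb(vm_size_gb, flashcopy_fraction=1.0):
    """Capacity billed for protecting VMs of the given total size.

    Master + Auxiliary vDisks are always full size (2 x).
    The two FlashCopy targets are billed at `flashcopy_fraction`
    of full size (1.0 = fully allocated volumes).
    """
    mirrored = 2 * vm_size_gb
    flashcopies = 2 * vm_size_gb * flashcopy_fraction
    return mirrored + flashcopies


print(billed_capacity_gb(256))        # 1024.0, i.e. 4 x disk
print(billed_capacity_gb(256, 0.5))   # 768.0, i.e. 3 x disk
```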
Future
In future, I will be posting another blog on SRM vSphere Replication.
Thanks for finally writing about Site Recovery Manager 5.5 with IBM SVC SRA | Steven Kang. Loved it!
Steven, great article, but I am having some issues with setting this up.
Our environment has TPC-R in place with the VMware volumes being replicated with GMCV. How would one go about incorporating TPC-R with SVC/SRM?
IBM says it “should” work, but no testing has been done, and I do not have a test facility to try this.
What I have done is set up a pre-configured environment with thin volumes as FlashCopy targets.
So on my protected site I have my traditional thick volume assigned to the VMware farm hosts, and I have thin GMCV volumes with a copyset relationship created in TPC-R; the pairs are showing prepared. The GMCV TPC-R volumes are not mapped, and cannot be. I then went ahead and created thin volumes to use for FlashCopy of the thick volumes, mapped them to the farm hosts too, created the incremental FlashCopy mappings (but not sure what copyrate I should set them to) and put them in a consistency group on the H1 side.
I did the same thing on the H2 side: thick volumes mapped to DR hosts, a thin GMCV volume for TPC-R, which was set up with the TPC-R copyset relationship, and thin FlashCopy volumes mapped to the DR hosts with the FlashCopy mapping created and placed in a consistency group on the DR side.
Reading your article multiple times, it appears maybe I cannot set up a pre-configured environment with thin FlashCopy volumes.
I am at a point where if this doesn’t work, I have to tear it apart and allow a non-preconfigured setup, and if that still doesn’t work, it looks like I have to tear out TPC-R and allow all the replication to happen via the SVC remote copy functions, still using GMCV volumes though.
If you can point me in the right direction or anything, it would be greatly appreciated.
I am not sure if this is the right forum to pose this sort of question, but once again I appreciate any assistance offered.
Thanks
James
Hi James,
Sorry for the late reply, was on a break 🙂
Reading your comment, I am afraid that I cannot advise on the configuration you are after as I have no previous experience with it. If I still had access to the test lab, I would have assisted you with this, but unfortunately I do not have access anymore.
I think the best way would be to contact IBM to do the testing; I believe they have a test lab to do this.
Apologies for not being helpful,
Regards,
Steven.