EMC VPLEX Metro: Cluster Witness Server Maintenance

Although it’s not mandatory, the Cluster Witness Server (CWS) in a VPLEX Metro cluster serves a very important purpose: it arbitrates between inter-cluster network partition failures and actual cluster/site failures. It’s a lightweight (<3GB) VMware VM deployed from an OVA that establishes itself in a three-way VPN with both VPLEX clusters and monitors the heartbeats between them. It can detect link or site failures and recover from them automatically (read: invisibly), so that apps and services can fail over while remaining online. It’s groovy.

But, like any device, the host that the CWS VM runs on sometimes needs a maintenance window. This might be for any number of reasons: updates, hardware failure, upgrades or physical moves. Since these VMs are often deployed onto a single host (with no cluster HA capabilities), this raises the question: what happens to a Metro cluster when the CWS is offline?

I had received conflicting advice from Support, who seemed to insist that I needed to disable the CW and re-deploy the CWS from the OVA on the new host. That sounds a bit drastic, given that it’s well established that a witness failure alone will not disrupt I/O flow to either cluster.

Note that if an actual failure on the WAN between the clusters occurs while the witness is offline, this may trigger a “surviving clusters must suspend I/O” event. In that case, the default preference rules would determine which cluster suspends I/O, at either the consistency group level or the distributed device level.

By disabling the CWS gracefully, I hoped to avoid that scenario.
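To make the role of the preference rules a bit more concrete, here is a deliberately simplified Python sketch of the decision for a single consistency group or distributed device. It is a toy model of the commonly described behaviour, not VPLEX's actual algorithm. The function name, the cluster names and the witness_active flag are all made up for illustration, and the model ignores the case where the witness is still enabled but unreachable.

```python
# Simplified toy model (an assumption, not VPLEX code) of which cluster keeps
# serving I/O for one consistency group / distributed device.
CLUSTERS = ("cluster-1", "cluster-2")  # hypothetical cluster names


def clusters_continuing_io(preferred, failed_cluster=None, witness_active=True):
    """Return the set of clusters that continue I/O after a failure.

    preferred       -- cluster named as the winner by the preference rule
    failed_cluster  -- the cluster that went down, or None for a pure
                       inter-cluster link partition (both clusters alive)
    witness_active  -- True if the witness is enabled and working,
                       False if it was disabled (static rules only)
    """
    if preferred not in CLUSTERS:
        raise ValueError(f"unknown cluster: {preferred}")

    if failed_cluster is None:
        # Link partition: the preferred cluster carries on, the other suspends.
        return {preferred}

    if failed_cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster: {failed_cluster}")

    survivor = next(c for c in CLUSTERS if c != failed_cluster)
    if witness_active or survivor == preferred:
        # Either the witness confirms the peer is really gone, or the
        # survivor is the preferred cluster anyway.
        return {survivor}

    # No witness and the preferred cluster is the one that failed: the
    # survivor cannot tell a site failure from a partition, so it suspends.
    return set()
```

For example, clusters_continuing_io("cluster-1", failed_cluster="cluster-1", witness_active=False) returns an empty set in this model: with the witness disabled, losing the preferred site leaves the other cluster suspended until someone steps in. That is the window you accept while the witness is out of the picture.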

My plan was simple. Disable the cluster witness, shut down the VM, move it to a new host (choose your transport mechanism), start it up and re-enable the witness functionality. Nothing was changing other than the host. The IP addresses stayed the same.
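The plan can be written down as an ordered checklist. The sketch below is only that, a checklist in Python. Of the command strings, only cluster-witness enable is quoted in this post; cluster-witness disable and cluster status are my assumption of the matching VPlexcli commands and should be confirmed against the CLI guide for your GeoSynchrony release. The execute callback is hypothetical and stands in for however you actually run each step (by hand, most likely).

```python
from typing import Callable, List, Tuple

# Ordered maintenance plan: (where the step happens, what to do).
# Command strings other than "cluster-witness enable" are assumptions.
STEPS: List[Tuple[str, str]] = [
    ("vplexcli", "cluster status"),            # confirm a healthy start
    ("vplexcli", "cluster-witness disable"),   # fall back to static rules
    ("hypervisor", "shut down the CWS VM"),
    ("hypervisor", "move the CWS VM to the new host"),
    ("hypervisor", "power on the CWS VM"),
    ("vplexcli", "cluster-witness enable"),    # re-establish the VPN
    ("vplexcli", "cluster status"),            # confirm a healthy finish
]


def run_plan(execute: Callable[[str, str], bool]) -> List[str]:
    """Run the steps in order, stopping at the first one that fails.

    execute(where, action) is a hypothetical callback that performs (or
    just confirms) one step and returns True on success.
    """
    completed: List[str] = []
    for where, action in STEPS:
        if not execute(where, action):
            raise RuntimeError(f"step failed, stopping: {action}")
        completed.append(action)
    return completed
```

A dry run such as run_plan(lambda where, action: print(f"[{where}] {action}") is None) just prints the checklist in order. The point of stopping on the first failure is that you never power off the VM if the disable step did not go through.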

So let’s verify the environment is healthy by checking the cluster status and the witness status:

[Screenshots: cluster status and witness status output]

The Unisphere view:

[Screenshot: Unisphere view]

All good. So let’s go at it, starting by disabling the cluster witness.

The Unisphere status for the witness changed while the witness was disabled, but the actual cluster status didn’t change and remained all “ok”.

The guest was shut down, moved to another host and powered on. Boot time is very quick, so it was online and operational in under 2 minutes, end-to-end.

Then run cluster-witness enable to re-establish the VPN between both clusters and the CWS.

So, what happened? Nothing unexpected. It was over in a matter of minutes, with no issues whatsoever, and the VM was running on its new host.

Handy to know for a common admin task.

 
