Here is the scenario: a power outage at 2am that lasted longer than our battery life. We have plans on our roadmap to implement scripts that do graceful shutdowns when the UPS sends a low-battery signal, but for the time being, we do not have that.
We initially had some fun bringing everything back up in order. It gets particularly fun when your AD Domain Controllers, DNS, and DHCP are all virtual. You get some nice little chicken-and-egg issues, but we have learned our lesson and are going to build a physical DNS, DHCP, and DC, so they can come up before the virtual environment.
The real meat of our problem was some of the virtual machines, and it wasn't directly related to the power outage either. Our core switches, which are on a separate UPS and did not lose power, decided to go Tango Uniform right after we got most of the boxes back up. Unfortunately, our Netgear core switches are not sending logs to a syslog server, and their log files do not persist across a reboot (I know, I think it's stupid too). This means we have no way of knowing why they went stupid on us.
So, we have learned some lessons and moved on. On to what I want this article to reflect. When we lost the switches, the virtual machines lost their connection to their VMDKs. We use NFS to connect to the datastore, so when we restored the switches, most of the virtual machines just flushed their writes and went on their merry way. Some virtual machines, however, did not. I inspected the vmware.log file stored with the virtual machine files to see what happened, and I noticed that all of the virtual machines that were locked up (most of which were Windows XP boxes) had the following log message:
Mar 23 13:34:38.731: vmx| VMX has left the building: 0.
So, we have determined that VMware just gave up on trying to talk to its VMDK file after some amount of time, and the VMX process decided to ditch this party. OK, so here is the procedure I had to go through to get the darn things back.
First, we need to get the virtual machine into an 'off' state. This is not easy, nor is it intuitive. Here is what I had to do.
From the service console of the ESX Server running the VM, find the vmid of the virtual machine in question. To do so, run the following command, grepping for the virtual machine's name:
cat /proc/vmware/vm/*/names | grep vdi-rivey
The output should start with vmid=####. Feed this number into the next command, where we are looking for the VM group.
less -S /proc/vmware/vm/####/cpu/status
You are looking for the vmgroup, which will look something like vm.####. Next, feed the VM group ID number into the following command to run the equivalent of a kill -9 on the VM within the VMKernel. Note: be sure to run this command as root (or use sudo).
/usr/lib/vmware/bin/vmkload_app --kill 9 ####
Once this command completes, the virtual machine still shows as though it's in a Powered On state. To get the VIC to figure this out, I had to restart some daemons on the ESX server to force VI to sort it all out. Please note: I had disabled VMware HA and DRS on my cluster because of some of the issues I was having, so I am not sure what HA will do with the VMs on this ESX server if you run these commands while HA is enabled.
service vmware-vpxa restart
service mgmt-vmware restart
The virtual machines running on this ESX server and the ESX server itself will gray out in your VIC while the services restart. When everything is back to normal, the VM in question will still be grayed out, with (invalid) appended to its name.
Next, I removed the virtual machine from the inventory, browsed to it in the datastore, right-clicked on the .vmx file, and added it back into my inventory. It still is not ready to boot because it has a couple of .lck directories (lock files). From the service console, I went into the virtual machine's directory and ran the following command to blow away all of the locks:
rm -rf .lck*
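Taken together, the whole unlock sequence can be sketched as a small script. This is only a sketch: the /proc/vmware paths, vmkload_app, and the service names are the ESX 3.x service console pieces used above, the vm.#### extraction assumes the status file format described earlier, and there is no error handling.

```shell
#!/bin/sh
# Sketch of the unlock sequence above. Assumes an ESX 3.x service
# console (run as root, HA/DRS disabled); paths will not exist elsewhere.

# Pull the vmid out of a /proc/vmware/vm/<vmid>/names path.
vmid_from_path() {
    echo "$1" | sed 's|.*/vm/\([0-9]*\)/names$|\1|'
}

kill_stuck_vm() {
    name="$1"                                   # e.g. vdi-rivey
    names_file=$(grep -l "$name" /proc/vmware/vm/*/names)
    vmid=$(vmid_from_path "$names_file")

    # Find the vm group (vm.####) in the cpu status node and strip
    # the "vm." prefix to get the group ID.
    group=$(grep -o 'vm\.[0-9]*' "/proc/vmware/vm/$vmid/cpu/status" | head -1)
    gid=${group#vm.}

    # Hard kill inside the VMKernel, then restart the management
    # daemons so the VIC notices the VM is really off.
    /usr/lib/vmware/bin/vmkload_app --kill 9 "$gid"
    service vmware-vpxa restart
    service mgmt-vmware restart
}
```

Defining it as functions keeps the sketch safe to paste into a console without running anything by accident.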
After this was done, I was able to boot the VM back up! Unfortunately, good ole Windows had issues on 3 of my 60+ virtual machines. These virtual machines boot, but promptly lock up. I am not sure why this happened to a small subset of VMs, but I am attributing it to disk corruption: the OS lost access to the disk, and we killed the virtual machine without flushing its writes. Luckily, we take a snapshot of the volume that holds the VMs every night at midnight. I simply copied the entire VM directory from this backup, blew away the lock files again, added it to the inventory, and bam, instant restore from backup.
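The restore itself was just a directory copy plus the same lock cleanup. A minimal sketch, assuming the nightly snapshot is mounted somewhere readable (the paths in the example are hypothetical, not our actual datastore layout):

```shell
#!/bin/sh
# Sketch: restore one VM directory from the nightly volume snapshot.

restore_vm() {
    src="$1"    # VM directory inside the backup snapshot
    dst="$2"    # where the live VM directory should go

    cp -a "$src" "$dst"     # copy the whole VM directory, preserving attributes
    rm -rf "$dst"/.lck*     # blow away any stale NFS lock directories
}

# Example (hypothetical paths):
# restore_vm /vmfs/volumes/nightly-snap/vdi-rivey /vmfs/volumes/datastore1/vdi-rivey
```

After the copy, the VM still has to be added back to the inventory through the VIC as described above.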
Everything is back up and running. We have a few infrastructure changes to make so that recovery from a down state is much quicker, we have a new reason to push for the scripts to bring everything down gracefully, and I have a procedure for unlocking a virtual machine. We have also (again) verified that our backups are working like a champ!