Tuesday, November 3, 2009

VMWare ESX Troubles

First a bit of background. We run some VDI (Virtual Desktop Infrastructure) here at GCA. We have our ESX 3.5 servers running and our Netapp SAN hooked up to it via NFS. On the ESX servers, along with our production 'server' vms, we also have a bunch of VDI vms. The VDI vms are just virtual machines running Windows XP for individual users to use. They access these vms using a few tools, such as Sun's Secure Global Desktop and Sunray Ultrathin clients. They can also be accessed with a regular RDP tool.

Let me also preface with the fact that we have nightly snapshots of the VM's and the developers check their code into subversion on a regular basis, so worst case scenario is a 20 min operation to copy the VM from last nights backup in the event one explodes.

Today I had an engineer who tried to reboot her Windows XP VDI and it would not reboot. I asked her if she had any open applications, to which she replied no, so I used VirtualCenter to send the VM a reboot command. That did not work, so I gave it a good amount of time, then got impatient and took the shotgun approach and tried the reset vm command instead. This still did not work, so I tried it again and I get an error that there is another operation in progress.

I did some googling and found information on the nuclear approach to the situation. First thing is to find the vm world id with the following command:

cat /proc/vmware/vm/*/names grep "vmname"

The result should start with vmid=####. With this number, you can browse over to /proc/vmware/vm/####/cpu and do:

less -S /proc/vmware/vm/###/cpu/status

Well, I had a problem. The cpu directory in this location was not present. I could not get the VM to stop at all. I also tried the other method of using ps -aux and grepping for the VM name, but it did not have a process either. I did not find a solution for this, but here is what I found in /var/log/vmkwarning log file. Look at the timestamps, it figured it out eventually and allowed me to do operations it, but it took well over an hour. See below for log output and please let me know if you know how I can fix this without waiting 90 minutes in the future:

Nov 3 14:54:50 gcaesx001 vmkernel: 40:00:33:25.587 cpu1:1042)WARNING: Swap: vm 1390: 7515: Swap sync read failed: status=195887193, retrying...
Nov 3 14:54:50 gcaesx001 vmkernel: 40:00:33:25.587 cpu0:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=1
Nov 3 14:54:50 gcaesx001 vmkernel: 40:00:33:25.638 cpu2:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=2
Nov 3 14:54:51 gcaesx001 vmkernel: 40:00:33:25.939 cpu2:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=4
Nov 3 14:54:54 gcaesx001 vmkernel: 40:00:33:29.144 cpu1:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=8
Nov 3 14:55:02 gcaesx001 vmkernel: 40:00:33:37.147 cpu2:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=16
Nov 3 14:55:18 gcaesx001 vmkernel: 40:00:33:53.155 cpu0:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=32
Nov 3 14:55:50 gcaesx001 vmkernel: 40:00:34:25.175 cpu3:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=64
Nov 3 14:56:54 gcaesx001 vmkernel: 40:00:35:29.224 cpu0:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=128
Nov 3 14:59:02 gcaesx001 vmkernel: 40:00:37:37.300 cpu0:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=256
Nov 3 15:03:18 gcaesx001 vmkernel: 40:00:41:53.473 cpu1:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=512
Nov 3 15:11:51 gcaesx001 vmkernel: 40:00:50:25.865 cpu2:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=1024
Nov 3 15:28:55 gcaesx001 vmkernel: 40:01:07:30.612 cpu3:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=2048
Nov 3 16:03:05 gcaesx001 vmkernel: 40:01:41:40.116 cpu1:1042)WARNING: Swap: vm 1390: 7528: Swap sync read retry failed: status=195887193, retry=4096
Nov 3 16:18:12 gcaesx001 vmkernel: 40:01:56:47.424 cpu3:1042)WARNING: Swap: vm 1390: 7550: Read failed: numPages = 1
Nov 3 16:18:12 gcaesx001 vmkernel: 40:01:56:47.424 cpu3:1042)WARNING: Alloc: vm 1390: 5226: unable to read from slot(0x102ac7c)
Nov 3 16:18:12 gcaesx001 vmkernel: 40:01:56:47.424 cpu3:1042)WARNING: World: vm 1390: 6870: vmm0:vdi-rlosaria:vmk: vcpu-0:Unable to read swapped out PPN(0xa1ef) from swap slot(0x102ac7c) for VM(1390)
Nov 3 16:18:12 gcaesx001 vmkernel: 40:01:56:47.426 cpu3:1042)WARNING: P2MCache: vm 1390: 450: Alloc_GetPhysMemRange failed for PPN 0xa1ef status = Permission denied
Nov 3 16:18:12 gcaesx001 vmkernel: 40:01:56:47.430 cpu2:1037)WARNING: World: vm 1390: 5160: Panic'd VMM world being reaped, but no core dumped.

1 comment:

  1. Thanks for a very detailed solution. Logs are very interesting :)

    ReplyDelete