Fix - Invalid Guest on Virtual Center

December 9, 2008

After an ESX host problem the other night, I ran into an issue today with a VM guest showing up as “invalid” in Virtual Center. I was able to bring the guest back into VC without an outage by following the steps below.

First some background.

Due to circumstances still being investigated, the console of an ESX box froze, disconnecting it from Virtual Center. All of the guests (approximately 40) on the host were still available and running, but VMware support confirmed that the state of the server was so degraded that fixing it would require a reboot of the host, and thus an outage of all the guests on it. Since the ESX box is in an HA cluster, after some necessary VM guest applications were shut down the ESX box was rebooted and HA promptly brought up the guest VMs on other hosts in the cluster. All the affected guests were then checked and appeared fine.

I thought I was in the clear, but today I noticed that the icon of one of the affected VMs appeared blue in Virtual Center, with the name italicized and “(invalid)” appended after it. Knowing that I had successfully started and checked this particular VM the night before, I was, needless to say, confused.

First things first: since the VM was a Linux guest, I tried to ssh to it to see if it was still running. Luckily, I was able to log in to the VM and everything looked normal. Next, I logged onto the console of the ESX host that this VM had last been registered to and issued a vmware-cmd -l. There was no entry for the invalid VM, so to double-check I issued a ps -axf | grep -i for the VM name and found that there was indeed a process running for the VM in question on this particular ESX host.
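The registration check can be scripted roughly as follows. This is a minimal sketch, assuming vmware-cmd -l prints one .vmx path per line; the function name is_registered and the datastore path in the comments are hypothetical placeholders, not values from this incident.

```shell
#!/bin/sh
# Hypothetical helper: reads a list of registered .vmx paths on stdin
# (assumed format: one path per line, as printed by `vmware-cmd -l`)
# and succeeds only if the given path appears as a whole line.
is_registered() {
    # -F fixed string, -x whole-line match, -q no output
    grep -Fxq -- "$1"
}

# Possible use on the ESX service console (placeholder path and VM name):
#   vmware-cmd -l | is_registered /vmfs/volumes/datastore1/myvm/myvm.vmx \
#       || echo "not registered on this host"
#   ps -axf | grep -i myvm | grep -v grep    # is a VM process still running?
```

Because the match is on the exact path, a VM listed under a datastore UUID path will not match a path written with the datastore's friendly name, so it is safest to copy the path format the host itself prints.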

I decided to try to re-add the VM into VC manually by first removing the invalid guest from inventory in VC and then re-adding it by browsing to the .vmx file. To do this, I clicked on the ESX host in VC and, on the Summary tab, double-clicked the datastore that the .vmx file for this VM lives on. You can then browse to the directory for the VM guest and should be able to right-click the .vmx file and choose the “Add to inventory” option. I say “should be able to” because in this particular instance that option was grayed out and not selectable.

To get more information from the ESX host logs, I then logged onto the ESX host the VM was last registered on and navigated to the /var/log/vmware directory. Issuing a grep -i for the VM name against all files there (*) gave a lot of good output. The interesting bits were some entries concerning .vmx file syntax errors. They appeared as follows:

hostd-9.log:[2008-12-07 17:28:17.388 'BaseLibs' 20241328 info] Reloading config state: /vmfs/volumes/48dabc48-573b1344-46f8-001ec939c5cb/vmabc123/vmabc123.vmx
hostd-9.log:[2008-12-07 17:28:17.435 'BaseLibs' 20241328 warning] VMHSVMLoadConfig failed: File "/vmfs/volumes/48dabc48-573b1344-46f8-001ec939c5cb/vmabc123/vmabc123.vmx" line 94: Syntax error.
hostd-9.log:[2008-12-07 17:28:17.448 'vm:/vmfs/volumes/48dabc48-573b1344-46f8-001ec939c5cb/vmabc123/vmabc123.vmx' 3076424608 info] Failed to load virtual machine.
hostd-9.log:[2008-12-07 17:28:17.466 'vm:/vmfs/volumes/48dabc48-573b1344-46f8-001ec939c5cb/vmabc123/vmabc123.vmx' 3076424608 info] Failed to load virtual machine. Marking as unavailable: vim.fault.InvalidVmConfig
hostd-9.log:[2008-12-07 17:28:17.467 'vm:/vmfs/volumes/48dabc48-573b1344-46f8-001ec939c5cb/vmabc123/vmabc123.vmx' 3076424608 info] State Transition (VM_STATE_INITIALIZING -> VM_STATE_INVALID_CONFIG)
hostd-9.log:[2008-12-07 17:28:17.467 'vm:/vmfs/volumes/48dabc48-573b1344-46f8-001ec939c5cb/vmabc123/vmabc123.vmx' 3076424608 info] Marking VirtualMachine invalid
hostd-9.log:[2008-12-07 17:28:17.467 'Vmsvc' 3076424608 info] Loaded virtual machine: /vmfs/volumes/48dabc48-573b1344-46f8-001ec939c5cb/vmabc123/vmabc123.vmx
hostd.log:[2008-12-08 09:18:04.516 'vm:/vmfs/volumes/48dabc48-573b1344-46f8-001ec939c5cb/vmabc123/vmabc123.vmx' 60660656 info] State Transition (VM_STATE_INVALID_CONFIG -> VM_STATE_UNREGISTERING)
hostd.log:[2008-12-08 09:18:04.586 'vm:/vmfs/volumes/48dabc48-573b1344-46f8-001ec939c5cb/vmabc123/vmabc123.vmx' 60660656 info] State Transition (VM_STATE_UNREGISTERING-> VM_STATE_GONE)
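To pull just the useful part out of entries like these, something along the following lines could work. It is a rough sketch that assumes the "VMHSVMLoadConfig failed ... line N: Syntax error" wording shown above; the function name vmx_syntax_errors is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads hostd log lines on stdin and prints
# "<vmx path> <line number>" for every VMHSVMLoadConfig syntax error.
# The pattern is based only on the log wording quoted in this post.
vmx_syntax_errors() {
    sed -n 's|.*VMHSVMLoadConfig failed: File [^/]*\(/.*\.vmx\)[^ ]* line \([0-9][0-9]*\): Syntax error.*|\1 \2|p'
}

# Possible use on the ESX service console:
#   cat /var/log/vmware/hostd*.log | vmx_syntax_errors | sort -u
```

The output gives the file and line number to look at, which here would have pointed straight at line 94 of the .vmx file.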

These entries are from approximately 17 hours after I successfully restarted the invalid VM after the ESX host outage. Since they specified bad .vmx entries, I navigated to the .vmx file in question and made a backup copy of the file. Then I opened the original .vmx file and noticed the last three lines of the file were:
evcCompatibilityMode = "FALSE"
0001e9ebd3fbff"
evcCompatibilityMode = "FALSE"

The .vmx file is basically the configuration of the VM, and each line should be a setting in the form name = "value". The second-to-last line, a stray run of hex digits ending in a quote, is not a valid entry, and the evcCompatibilityMode entry should appear only once. It seems I had found the syntax errors the hostd logs were complaining about. After editing the .vmx file to remove the last two lines, I decided to stop and restart the VMware management agents to see if they could now pick up the orphaned VM guest process.
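A quick way to spot this kind of damage without reading the whole file is a small lint pass. The sketch below is my own simplification, not a VMware tool: it assumes every setting is a single name = "value" line, treats lines starting with # as comments, and does not handle values that themselves contain quotes. The function name vmx_lint is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report lines in a .vmx file that are not of the
# assumed form  name = "value"  and names that appear more than once.
vmx_lint() {
    awk '
        /^[ \t]*$/ || /^[ \t]*#/ { next }      # skip blank lines and comments
        !/^[ \t]*[A-Za-z0-9_.:-]+[ \t]*=[ \t]*"[^"]*"[ \t]*$/ {
            printf "line %d: malformed: %s\n", NR, $0
            next
        }
        {
            key = $0
            sub(/[ \t]*=.*/, "", key)          # drop everything from the = sign on
            sub(/^[ \t]+/, "", key)            # drop leading whitespace
            if (key in seen)
                printf "line %d: duplicate key %s (first seen on line %d)\n", NR, key, seen[key]
            else
                seen[key] = NR
        }
    ' "$1"
}

# Possible use (work on a backup copy first):
#   vmx_lint /path/to/backup-copy.vmx
```

Run against a file ending in the three lines shown above, it would flag the stray hex line as malformed and the second evcCompatibilityMode line as a duplicate.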

This was done using the following commands:
#/etc/rc.d/init.d/vmware-vpxa stop
#service mgmt-vmware stop
#service mgmt-vmware start
#/etc/rc.d/init.d/vmware-vpxa start

After restarting the services, I tried manually registering the VM guest to the host using #vmware-cmd -s register followed by the full path to the .vmx file. This returned successfully, so I checked the VM’s state using vmware-cmd with the .vmx path followed by getstate. The command showed that the VM was powered on, which also meant that the VMware services now recognized the VM as a valid guest. I logged back into VC and, sure enough, the VM guest icon was now showing as powered on and I was able to open a console to the guest.

I’m still not sure who or what created the bad entries in the vmx file to begin with and why they didn’t cause an issue until so long after the guest was rebooted, but at least I was able to fix the issue without an outage.

Rethinking Storage Protocols

November 27, 2008

Here’s a doc I found, presented at VMworld, in which a vendor wrote a white paper comparing the performance of the three storage protocols: Fibre Channel, iSCSI, and NFS. The test lab was 2 ESX hosts running 8 guests each. Fibre Channel won on speed (and is not vulnerable to the network like the other two), but the IP-based protocols were not far behind. In this test, iSCSI and NFS are roughly equivalent, but I think the test setup (the small number of ESX hosts and guests used) had an impact on that. Here are a few quotes I’ve found from other sites as well:

http://www.vi411.org/2006/10/10/nasnfs-vs-iscsi-for-esx.html

“With a single VM and/or connection, iSCSI outperforms NFS anywhere from 10-50%. However, as the number of VMs per server increase, NFS gradually catches up then exceeds iSCSI performance at about 15 virtual machines per server. There are a few reasons, mainly in how NFS locks files compared to iSCSI and also that from the client-side of things, NFS uses much less CPU than iSCSI. There are other advantages of NFS as well - by default VMDK files on NFS are formatted as sparse volumes allowing for thin provisioning. Also, being normal file shares, they are much easier to manage and backup.”

http://blog.scottlowe.org/2007/09/21/nfs-for-vmware-storage/

“FC is always faster with 1 or 2 ESX hosts…however the more ESX host you add the faster NFS performs. This is because of FC SCSI reservations in ESX. Ideally only one host can read or write to a LUN at a time with FCP. With NFS this is not a limitation. Hence the more hosts, the better performance on NFS then presenting a number of LUNs to several hosts on FC….”

So it looks like, in everybody’s opinion, SCSI reservations are the major factor in the performance debate. With TOE and/or hardware initiators, the CPU usage of iSCSI can probably be brought down to a level comparable to NFS. And finally, NFS is just more familiar and manageable to people than iSCSI.

NetApp_Whitepaper