More VMware vSphere Troubleshooting Tips
If You Suspect a Network Performance Issue, Check Some of the Following Metrics
If droppedRx (receive) is greater than 0 for a host, look at CPU utilization. The CPU plays an important part in moving packets between the guest operating system on the virtual machine and the physical device driver. Check metrics such as CPU overhead and high CPU utilization, which can leave the virtual machine too busy to take on new packets or can delay receiving them. A possible solution is to increase the CPU reservation for the virtual machine, or to check whether the application supports adding more vCPUs.
If droppedTx (transmit) is greater than 0, this usually means congestion at the physical layer. When a virtual machine transmits packets, they are queued in the buffer of the virtual switch port until they can be sent on the physical NIC. Packets buffered on the virtual switch port while waiting to transmit can even cause incoming packets to be dropped. To prevent dropped transmit packets, look for ways to increase physical network capacity, such as adding more NICs or moving to 10 Gb Ethernet.
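The two checks above can be sketched as a small triage helper. This is only an illustration: the function name, its parameters, and the idea of passing in a "CPU is high" flag are assumptions made for the example, and the counters are expected to be read by hand from your performance charts rather than collected by this code.

```python
def triage_dropped_packets(dropped_rx, dropped_tx, cpu_high=False):
    """Return suggested next steps for a host's packet-drop counters.

    dropped_rx / dropped_tx: counter values read from the host (hypothetical inputs).
    cpu_high: your own judgment of whether VM CPU utilization is high (assumption).
    """
    hints = []
    if dropped_rx > 0:
        if cpu_high:
            # Receive drops plus a busy CPU: the VM cannot keep up with incoming packets.
            hints.append("droppedRx > 0 with high CPU: consider a larger CPU "
                         "reservation or more vCPUs")
        else:
            hints.append("droppedRx > 0: check CPU overhead and CPU utilization "
                         "for the virtual machine")
    if dropped_tx > 0:
        # Transmit drops usually point at the physical network, not the VM.
        hints.append("droppedTx > 0: likely physical-layer congestion; consider "
                     "more NICs or 10 Gb Ethernet")
    return hints


if __name__ == "__main__":
    for hint in triage_dropped_packets(dropped_rx=12, dropped_tx=3, cpu_high=True):
        print(hint)
```

A host with both counters at zero returns an empty list, which simply means these two metrics do not point to a network problem.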
Another thing to check on the network side is that the correct network device driver is installed on the virtual machine. By default, if VMware Tools is not installed or running, the Vlance network adapter is used. Vlance is a 10 Mbps NIC, which is great for older 32-bit guest operating systems but not so useful on a 1 Gb Ethernet network. Therefore, make sure that VMware Tools is installed and enabled, and that the correct network adapter for the operating system is installed in the virtual machine.
Metrics to Check for a Possible Storage Problem
It is always important to consider storage performance in your vSphere environment. esxtop/resxtop, which come with ESXi 4.x, are excellent tools for measuring performance. Among the more significant statistics is the number of queued commands. To check these metrics, open a vSphere Management Assistant (vMA) console and start resxtop. Type d to enter the Storage Adapter screen, and then type f to select the fields that you want to view. The fields to view should be A (adapter name), F (queue stats), and K (error stats). Queuing happens when there are excessive I/O operations and the buffer is full, which means that the host is waiting for the storage to complete outstanding I/O requests. The ACTV metric is the number of I/O operations that are currently active, and QUED is the number of commands waiting to be processed. QUED is the key value: this number should be zero; otherwise the host is bottlenecked on storage. Commands are aborted if the storage is overloaded, and ABRTS/s gives you the number of requests per second dropped by the overloaded storage.
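As a rough illustration of how to read those counters, the sketch below flags any adapter whose QUED or ABRTS/s value is above zero. It assumes you have already copied the values from the resxtop screen into a dictionary yourself; the function name, the dictionary layout, and the sample adapter names and numbers are all made up for the example and do not come from resxtop.

```python
def check_adapter_queues(adapters):
    """Flag adapters that show queuing or aborted commands.

    adapters: dict mapping adapter name -> dict of counters, using the
    resxtop column names ACTV, QUED, and ABRTS/s as keys (hypothetical layout).
    """
    findings = []
    for name, counters in sorted(adapters.items()):
        qued = counters.get("QUED", 0)
        aborts = counters.get("ABRTS/s", 0)
        if qued > 0:
            # Any queued command means the host is waiting on the storage.
            findings.append(f"{name}: QUED={qued}, host is waiting on storage")
        if aborts > 0:
            # Aborts indicate the storage is overloaded and dropping requests.
            findings.append(f"{name}: ABRTS/s={aborts}, storage is overloaded")
    return findings


if __name__ == "__main__":
    # Sample values invented for illustration only.
    sample = {
        "vmhba1": {"ACTV": 4, "QUED": 0, "ABRTS/s": 0},
        "vmhba2": {"ACTV": 32, "QUED": 7, "ABRTS/s": 1},
    }
    for line in check_adapter_queues(sample):
        print(line)
```

ACTV is carried along only for context; on its own, a nonzero ACTV value just means I/O is in flight.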
Log Files to View in vSphere ESXi 4.x
Because ESXi provides a Unix-like environment, the log files are similar to what you find on a Unix server:
/var/log/messages Operating System messages
/var/log/vmware/hostd.log Host agent log
/var/log/vmware/vpx/vpxa.log vCenter agent log
/var/log/sysboot.log System boot log messages
/var/log/vmkernel can be set up for storage errors
List of Configuration Files on an ESXi 4.x Host:
/etc/vmware/esx.conf Main configuration file on an ESXi host
/etc/syslog.conf Controls what syslogd logs, and where and how it logs it
/etc/vmware/hostd/proxy.xml Configuration for hostd, which controls access to the ESXi host
/etc/vmware/snmp.xml SNMP services
Unable to Configure High Availability (HA)
You may encounter errors when you first try to enable High Availability on a cluster. Although the error messages that appear are generic in nature, check name resolution first, because ninety-nine percent of the time it is the problem. "An error occurred during configuration of the HA agent on the host" is an example of a common error message that indicates a problem with name resolution. Although the message itself does not appear to point to name resolution, that is the likely culprit. High Availability setup problems are another reason to always use a fully qualified domain name (FQDN).
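A quick way to rule name resolution in or out is to test it from a management workstation. The sketch below uses Python's standard socket module to check that a host name looks like an FQDN, resolves to an address, and resolves back to the same name. Comparing forward and reverse lookups is a common sanity check rather than something this post prescribes, and the function name and the example host name are placeholders.

```python
import socket


def check_name_resolution(fqdn, forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return a list of name-resolution problems for one host (empty if none).

    forward / reverse default to the standard library resolvers; they are
    parameters only so the check can be exercised without a live DNS server.
    """
    problems = []
    if "." not in fqdn:
        problems.append(f"{fqdn!r} is a short name, not an FQDN")
    try:
        ip = forward(fqdn)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        problems.append(f"forward lookup failed for {fqdn}: {exc}")
        return problems
    try:
        name = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError as exc:  # socket.herror is a subclass of OSError
        problems.append(f"reverse lookup failed for {ip}: {exc}")
        return problems
    if name.lower().rstrip(".") != fqdn.lower().rstrip("."):
        problems.append(f"{ip} resolves back to {name}, not {fqdn}")
    return problems


if __name__ == "__main__":
    # Placeholder host name; substitute each ESXi host in the cluster.
    for problem in check_name_resolution("esx01.example.com"):
        print(problem)
```

Run it once per host in the cluster; an empty result for every host suggests looking elsewhere for the cause of the HA error.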
This post is excerpted and reused with permission from 8 tips for Troubleshooting VMware vSphere