Thursday, April 16, 2015

NUMA and VMware

ESXi uses a sophisticated NUMA scheduler to dynamically balance processor load and memory locality across NUMA nodes.

Each virtual machine managed by the NUMA scheduler is assigned a home node. A home node is one of the system’s NUMA nodes containing processors and local memory, as indicated by the System Resource Allocation Table (SRAT).

When memory is allocated to a virtual machine, the ESXi host preferentially allocates it from the home node. The virtual CPUs of the virtual machine are constrained to run on the home node to maximize memory locality.

The NUMA scheduler can dynamically change a virtual machine's home node to respond to changes in system load. The scheduler might migrate a virtual machine to a new home node to reduce processor load imbalance. Because this might cause more of its memory to be remote, the scheduler might migrate the virtual machine’s memory dynamically to its new home node to improve memory locality. The NUMA scheduler might also swap virtual machines between nodes when this improves overall memory locality.
Some virtual machines are not managed by the ESXi NUMA scheduler. For example, if you manually set the processor or memory affinity for a virtual machine, the NUMA scheduler might not be able to manage this virtual machine. Virtual machines that are not managed by the NUMA scheduler still run correctly. However, they don't benefit from ESXi NUMA optimizations.
The NUMA scheduling and memory placement policies in ESXi can manage all virtual machines transparently, so that administrators do not need to address the complexity of balancing virtual machines between nodes explicitly.
The optimizations work seamlessly regardless of the type of guest operating system. ESXi provides NUMA support even to virtual machines whose guest operating systems have no NUMA support of their own, such as Windows NT 4.0. As a result, you can take advantage of new hardware even with legacy operating systems.
A virtual machine that has more virtual processors than the number of physical processor cores available on a single hardware node can be managed automatically. The NUMA scheduler accommodates such a virtual machine by having it span NUMA nodes. That is, it is split up as multiple NUMA clients, each of which is assigned to a node and then managed by the scheduler as a normal, non-spanning client. This can improve the performance of certain memory-intensive workloads with high locality. For information on configuring the behavior of this feature, see Advanced Virtual Machine Attributes.
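As an illustrative sketch of what such configuration looks like (the attribute name below comes from the vSphere advanced-options documentation, but the value shown is an assumption for a particular host, not a recommendation), the node-splitting behavior can be influenced through an advanced setting in the virtual machine's .vmx file:

```
numa.vcpu.maxPerVirtualNode = "4"
```

The above (illustrative) value caps each NUMA client at four vCPUs, so a VM with eight vCPUs would be split into two NUMA clients, each assigned its own home node and managed as a normal, non-spanning client.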
ESXi 5.0 and later includes support for exposing virtual NUMA topology to guest operating systems. For more information about virtual NUMA control, see Using Virtual NUMA.

More on this in later posts.

Wednesday, April 15, 2015

Calculating Average Guest Latency in VMware

If you are using VMware vSphere, ESXi cannot see application latency because it occurs above the ESXi stack. What ESXi can detect are three types of latency, which are reported in esxtop and VMware vCenter.
Average guest latency (GAVG) has two major components: average disk latency (DAVG) and average kernel latency (KAVG).
DAVG measures the time that I/O commands spend in the device, from the host bus adapter (HBA) driver down to the back-end storage array.
KAVG is how much time I/O spends in the ESXi kernel. Time is measured in milliseconds. KAVG is a derived metric, meaning there is no direct measurement for it. To derive KAVG, subtract DAVG from GAVG.
The VMkernel processes I/O very efficiently, so there should be no significant wait in the kernel. In a well-configured, well-running VDI environment, KAVG should be close to zero. If KAVG is not equal to zero, the I/O might be stuck in a kernel queue inside the VMkernel. When that is the case, the time spent in the kernel queue is measured as QAVG.
To get a sense of the latency that the application can see in the guest OS, use a tool such as Perfmon to compare the GAVG and the actual latency the application is seeing. 
This comparison reveals how much latency the guest OS is adding to the storage stack. For instance, if ESXi reports a GAVG of 10 ms but the application or Perfmon in the guest OS reports a storage latency of 30 ms, then 20 ms of latency is building up in the guest OS layer, and you should focus your debugging on the guest OS storage configuration.
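The arithmetic behind these derived metrics is simple enough to sketch. A minimal illustration in Python, using the example numbers from the text (the DAVG value is an assumption added to make the example complete):

```python
# Derived latency metrics, per the definitions above:
#   KAVG = GAVG - DAVG            (time spent in the VMkernel)
#   guest overhead = in-guest latency - GAVG

def kavg(gavg_ms, davg_ms):
    """Kernel latency is derived, not measured directly."""
    return gavg_ms - davg_ms

def guest_overhead(guest_latency_ms, gavg_ms):
    """Latency added inside the guest OS storage stack."""
    return guest_latency_ms - gavg_ms

# Example from the text: ESXi reports GAVG = 10 ms while Perfmon in the
# guest reports 30 ms; assume the device accounts for 9.5 ms of the 10.
print(kavg(gavg_ms=10.0, davg_ms=9.5))    # time in the VMkernel
print(guest_overhead(30.0, 10.0))         # 20 ms built up in the guest
```

If the derived KAVG is consistently well above zero, that is the cue to look at QAVG and kernel queueing rather than at the array.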

Friday, May 9, 2014

Virtualize with Confidence - Use VMware: Storage DRS and Drmdisk


Storage DRS leverages a special kind of disk construct to facilitate more granular control over initial placement and migration recommendations. This construct also plays a major role in I/O load balancing by exposing these deeper details.

Let us read about it in detail:


vSphere Storage DRS uses the DrmDisk construct as the smallest entity it can migrate. A DrmDisk represents a consumer of datastore resources. This means that vSphere Storage DRS creates a DrmDisk for each VMDK file belonging to the virtual machine. A soft DrmDisk is created for the working directory containing the configuration files such as the .VMX file and the swap file.

  • A separate DrmDisk for each VMDK file
  • A soft DrmDisk for system files (VMX, swap, logs, and so on)
  • If a snapshot is created, both the VMDK file and the snapshot are contained in a single DrmDisk.

VMDK Anti-affinity Rule

When the datastore cluster or the virtual machine is configured with a VMDK-level anti-affinity rule, vSphere Storage DRS must keep the DrmDisk containing the virtual machine disk files on separate datastores.

Impact of VMDK Anti-Affinity Rule on Initial Placement

Initial placement benefits immensely from this increased granularity. Instead of searching for a single datastore that can fit the virtual machine as a whole, vSphere Storage DRS can seek an appropriate datastore for each DrmDisk separately. Due to the increased granularity, datastore cluster fragmentation (described in the “Initial Placement” section) is less likely to occur, and if prerequisite migrations are required, far fewer are expected.

Impact of VMDK Anti-Affinity Rule on Load Balancing

Similar to initial placement, I/O load balancing also benefits from the deeper level of detail. vSphere Storage DRS can find a better fit for the workload generated by each VMDK file. It analyzes the workload and generates a workload model for each DrmDisk. It then determines in which datastore it must place the DrmDisk to keep the load balanced within the datastore cluster while offering sufficient performance for each DrmDisk. This becomes considerably more difficult when vSphere Storage DRS must keep all the VMDK files together. Usually in that scenario, the datastore chosen is the one that provides the best performance for the most demanding workload and is able to store all the VMDK files and system files.

By enabling vSphere Storage DRS to load-balance on a more granular level, each DrmDisk of a virtual machine is placed to suit its I/O latency needs as well as its space requirements.

Virtual Machine–to–Virtual Machine Anti-Affinity Rule

An inter–virtual machine (virtual machine–to–virtual machine) anti-affinity rule forces vSphere Storage DRS to keep the virtual machines on separate datastores. This rule effectively extends the availability requirements from hosts to datastores. For example, in vSphere DRS an anti-affinity rule can be created to force two Microsoft Active Directory servers to run on separate hosts; in vSphere Storage DRS, a virtual machine–to–virtual machine anti-affinity rule guarantees that the Active Directory virtual machines are not stored on the same datastore. Note that both virtual machines participating in a virtual machine–to–virtual machine anti-affinity rule are required to be configured with an intra–virtual machine VMDK affinity rule.

Thoughts ?

vSphere Storage DRS is one cool piece of code and is continuously improving how we use traditional storage systems for the better.

Always good to dig and understand the VMware vSphere features in detail.  Until next time :)

Saturday, May 3, 2014

VMWare Horizon View Design Best Practices

This blog post talks about designing a basic VMware View deployment that will cover up to 500 desktops. 

For an enterprise setup, I plan to do a more elaborate extension to this post covering areas such as storage, networking, load balancing, and POD architecture, along with best practices around them.

Here are the basic components and the strategy to take around each of them to avoid a single point of failure.

2 vSphere Clusters 
It's not wise to run one vSphere cluster for both your server infrastructure and your desktop infrastructure. Don't be cheap; separate these clusters.

2 View Connection Servers
The View Connection Server is the brokering server; it establishes the connection to the View Agent. Redundancy is built into the product: all you have to do is install a second "replica server". Also, keep in mind that if you want to do external PCoIP connections, you will need 4 View Connection Servers: two for internal redundancy and two for external connections.

2 View Transfer and Security Servers
These follow similar lines as the Connection Servers, based on their functionality.

2 vCenter Servers
You'll need to use vCenter Heartbeat, as it's the only way I know of to make vCenter Servers redundant. It's a little expensive, and VMware doesn't say that you need to make the vCenter Server redundant; however, the vCenter service being down is a catastrophic event. For the most part you don't need vCenter, except when you need to boot a linked-clone VM.

2 SQL Servers
Use SQL 2008 with Microsoft Failover Clustering, and make this redundant as well: the vCenter DB, the Events DB, and the View Composer DB all live here. Basically every component of a View setup has a DB, so it's advisable to have these DBs on a redundant back end.

SplitRXMode in VMware vSphere 5

My previous post was on how multicasting is handled in VMware vSwitch context. You can read about it here.

Now it is most apt to mention some advanced setting around it. The advanced setting we talk about is SplitRXMode.

While multicasting worked fine for some multicast applications, it still wasn't sufficient for the more demanding multicast applications, which stalled their virtualization.

The reason is that packet replication for the VMs was processed in a single shared context, which ultimately became a constraint: with a high VM-to-ESX ratio came a correspondingly high packet rate, which often caused large packet losses and bottlenecks. VMware vSphere 5 provides the new splitRx mode to compensate for this problem and enable the virtualization of demanding multicast applications.

SplitRx mode is an ESXi feature that uses multiple physical CPUs to process network packets received in a single network queue. This feature provides a scalable and efficient platform for multicast receivers. SplitRx mode typically improves throughput and CPU efficiency for multicast traffic workloads.

VMware recommends enabling splitRx mode in situations where multiple virtual machines share a single physical NIC and receive a lot of multicast or broadcast packets.


  • SplitRx mode is supported only on vmxnet3 network adapters.
  • This feature is disabled by default.
  • SplitRx mode is individually configured for each virtual NIC.

To enable splitRx mode, do the following:

This feature, which is supported only for VMXNET3 virtual network adapters, is individually configured for each virtual NIC using the ethernetX.emuRxMode variable in each virtual machine’s .vmx file (where X is replaced with the network adapter’s ID).

The possible values for this variable are:

ethernetX.emuRxMode = "0"

The above value disables splitRx mode for ethernetX.

ethernetX.emuRxMode = "1"

The above value enables splitRx mode for ethernetX.

To change this variable through the vSphere Client:

  1. Select the virtual machine you wish to change, then click Edit virtual machine settings.
  2. Under the Options tab, select General, then click Configuration Parameters.
  3. Look for ethernetX.emuRxMode (where X is the number of the desired NIC). If the variable isn’t present, click Add Row and enter it as a new variable.
  4. Click on the value to be changed and configure it as you wish.

The change will not take effect until the virtual machine has been restarted.
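Outside the vSphere Client, the change amounts to adding or updating one key in the VM's .vmx file while the VM is powered off. A minimal sketch in Python (the helper name and the sample .vmx content are illustrative, not part of any VMware tool):

```python
def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with key set to value: update the existing line
    in place if the key is present, otherwise append a new line."""
    lines = vmx_text.splitlines()
    for i, line in enumerate(lines):
        if line.split("=")[0].strip() == key:
            lines[i] = '{} = "{}"'.format(key, value)
            break
    else:
        lines.append('{} = "{}"'.format(key, value))
    return "\n".join(lines) + "\n"

# Enable splitRx mode for the VM's first vmxnet3 adapter.
vmx = 'ethernet0.virtualDev = "vmxnet3"\n'
vmx = set_vmx_option(vmx, "ethernet0.emuRxMode", "1")
print(vmx)
```

As the text notes, the edited value only takes effect after the virtual machine is restarted.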

References and Credits: Chris Hendryx

Multicast and VMware

This post deals with Multicasting in VMware.

Lets see what multicast is and how it is deployed in VMware.

What is Multicasting ?

Multicasting is an alternate way of content delivery in which an IP packet is sent to multiple destinations identified by a multicast IP address. It is used for content delivery to multiple destinations at once by stock exchanges, video conferencing systems, and the like.
Multicast sends only one copy of the information along the network, whereby any duplication is at a point close to the recipients, consequently minimizing network bandwidth requirements.

For multicast, the Internet Group Management Protocol (IGMP) is used to establish and coordinate membership of the multicast group. This means that only single copies of the information are sent from the multicast sources over the network; it is the network that takes responsibility for replicating and forwarding the information to multiple recipients.
IGMP operates between the client and a local multicast router; layer 2 switches with IGMP snooping sit in between and derive information about the IGMP transactions. Between the local and remote multicast routers, multicast routing protocols such as PIM are then used to direct the traffic from the multicast server to the many multicast clients.
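From a guest application's point of view, joining a group is a single socket option, and it is that join which generates the IGMP membership report described above. A minimal Python sketch (the group address, port, and interface are illustrative assumptions):

```python
import socket

# Example group from the administratively scoped (239.0.0.0/8) range;
# group and port here are arbitrary illustrations.
MCAST_GROUP = "239.1.1.1"
MCAST_PORT = 5007

def make_membership_request(group, interface="0.0.0.0"):
    """Pack an ip_mreq structure (group address + local interface address).
    Passing it to IP_ADD_MEMBERSHIP is what makes the kernel emit the
    IGMP membership report that switches and routers act on."""
    return socket.inet_aton(group) + socket.inet_aton(interface)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", MCAST_PORT))
try:
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    make_membership_request(MCAST_GROUP))
    # sock.recvfrom(65535) would now receive datagrams sent to the group
except OSError:
    pass  # joining can fail on hosts with no multicast-capable interface
sock.close()
```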

Multicasting and VMware

In the context of VMware and virtual switches, there is NO need for the vSwitch to perform IGMP snooping in order to recognize which VMs have IP multicast enabled. This is because the ESX server has authoritative knowledge of the vNICs: whenever a VM's vNIC is configured for multicast, the vSwitch automatically learns the multicast Ethernet group addresses associated with that VM. The VMs use IGMP to join and leave multicast groups; the multicast routers send periodic membership queries, which the ESX server passes through to the VMs. The VMs that have multicast subscriptions respond to the multicast router with their subscribed groups via IGMP membership reports.

 NOTE: IGMP snooping in this case is done by the usual physical Layer 2 switches in the network so that they can learn which interfaces require forwarding of multicast group traffic. 

So when the vSwitch receives multicast traffic, it forwards copies of the traffic to the subscribed VMs based on destination MAC addresses, much as it does for unicast. Because the responsibility for tracking which vNIC is associated with which multicast group lies with the vSwitch, packets are delivered only to the relevant VMs.

References and Credits: Chris Hendryx