An issue is affecting our mutual customers running VMware ESXi 5 with NFS connectivity. While details are not finalized, it appears the engineering teams at VMware and NetApp have identified an issue between the NFS client in the ESXi 5 stack and the NFS service in Data ONTAP that causes the two to behave badly under high I/O load when SIOC is not in use. This issue appears to affect only vSphere 5 releases, not vSphere 4 or VI3, and only FAS arrays with fewer than 2 CPUs; thus it is seen across the FAS2000 series and lower-end systems in the FAS3000 series. I cannot state whether this issue may impact other NFS platforms like EMC Isilon, Celerra & VNX.

Massive investments in engineering resources go into assuring the quality of product releases and joint solutions, yet inevitably something falls through the cracks, and this is one of those times. For those impacted by this issue, my apologies. The NetApp and VMware engineering teams have been working furiously to identify and resolve this issue. A fix has been released by NetApp engineering, and for those unable to upgrade their storage controllers, VMware engineering has published a pair of workarounds.

 

Clarifying The Issue:

An NFS datastore disconnect issue displays the following behaviors…

  • NFS datastores are displayed as greyed out and unavailable in vCenter Server or the vSphere Client
  • Virtual Machines (VMs) on these datastores may hang during these times
  • NFS datastores often reappear after a few minutes, which allows VMs to return to normal operation
  • This issue is most often seen after ESXi 5 is introduced into an environment

This issue is documented in VMware KB 2016122 and NetApp Bug 321428.

 

The Fix:

NetApp customers can upgrade Data ONTAP to correct this issue. Versions 7.3.7P1D2 & 8.0.5 have been released, and the forthcoming 8.1.3 is expected soon. While Data ONTAP upgrades are non-disruptive, they should likely be scheduled for times of reduced I/O activity.

Note: Data ONTAP release families are defined as 7.3.x, 8.0.x, and 8.1.x, with each dot release introducing a new set of features and capabilities. To address a bug, NetApp support suggests applying the Data ONTAP version that contains the fix within the release family already installed on your array.
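To illustrate the note above, here is a minimal Python sketch of picking the fixed release that matches the family installed on an array. The version table comes straight from this post (8.1.3 is listed as forthcoming, so treat that entry as provisional); the function names and the version-parsing rule are my own assumptions for illustration, not a NetApp tool.

```python
import re

# Fixed releases named in this post, keyed by Data ONTAP release family.
# 8.1.3 is described as forthcoming, so that entry is provisional.
FIXED_RELEASE_BY_FAMILY = {
    "7.3": "7.3.7P1D2",
    "8.0": "8.0.5",
    "8.1": "8.1.3",
}


def release_family(installed_version):
    """Return the "major.minor" family of a Data ONTAP version string.

    Assumption: the version starts with two dot-separated numbers,
    e.g. "8.0.2P3" or "7.3.5.1".
    """
    match = re.match(r"\s*(\d+)\.(\d+)", installed_version)
    if match is None:
        raise ValueError("unrecognized version: %r" % installed_version)
    return "%s.%s" % (match.group(1), match.group(2))


def target_release(installed_version):
    """Return the fixed release for the installed family, or None if the
    family is not one of the three covered in this post."""
    return FIXED_RELEASE_BY_FAMILY.get(release_family(installed_version))


if __name__ == "__main__":
    for version in ("7.3.5.1", "8.0.2P3", "8.1.1"):
        print(version, "->", target_release(version))
```

This only selects a target within the installed family; it does not check whether the installed version already contains the fix, so confirm against NetApp Bug 321428 before planning an upgrade.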

 

The Workarounds:

While a Data ONTAP upgrade is non-disruptive, some VMware administrators may prefer to address the issue immediately to ensure operations. For those interested, VMware has published the following:

Workaround Option #1 – Enable SIOC

For those with a vSphere Enterprise Plus license, enabling Storage I/O Control will eliminate this issue, as it manages the value of MaxQueueDepth.

Workaround Option #2 – Limit MaxQueueDepth

For those without a vSphere Enterprise Plus license, or those who have not enabled SIOC, setting a manual limit on MaxQueueDepth will prevent the disconnect issue from occurring.

For the step-by-step procedure on how to complete this process in the vSphere Client, the vSphere 5 Web Client, and on the command line, please see VMware KB 2016122.
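As a rough illustration of the command-line route, the Python sketch below only builds the esxcli command strings an administrator might run on each ESXi 5 host; it does not execute anything. The option path /NFS/MaxQueueDepth and the value of 64 reflect my understanding of the KB; the helper names are hypothetical, and KB 2016122 remains the authority for the exact procedure, including any host reboot it calls for.

```python
# Advanced setting that caps outstanding I/O per NFS datastore.
# Assumption: this is the option path used in VMware KB 2016122.
NFS_QUEUE_DEPTH_OPTION = "/NFS/MaxQueueDepth"


def build_set_queue_depth_command(depth=64):
    """Return the esxcli argv that sets the NFS queue depth limit."""
    if isinstance(depth, bool) or not isinstance(depth, int) or depth < 1:
        raise ValueError("depth must be a positive integer")
    return [
        "esxcli", "system", "settings", "advanced", "set",
        "-o", NFS_QUEUE_DEPTH_OPTION,
        "-i", str(depth),
    ]


def build_show_queue_depth_command():
    """Return the esxcli argv that lists the current and default values."""
    return [
        "esxcli", "system", "settings", "advanced", "list",
        "-o", NFS_QUEUE_DEPTH_OPTION,
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run on a host.
    print(" ".join(build_show_queue_depth_command()))
    print(" ".join(build_set_queue_depth_command(64)))
```

Printing the commands rather than running them keeps the sketch safe to try anywhere; the same strings can be pasted into an ESXi shell session once they have been checked against the KB.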

 

Considerations of the Workarounds:

I would advise that these workarounds be implemented on a temporary basis and remain in place only until the NetApp FAS array(s) have been upgraded, at which point the workarounds should be disabled.

The reason for this suggestion is that when one implements an I/O limit, such as a queue depth of 64 instead of the default of 4,294,967,295 (roughly 4.29 billion), there is the potential of creating an artificial I/O bottleneck. vSphere is equipped to remedy such issues via data migration technologies like SDRS; however, please note that shuffling data produces a negative impact on storage and networking resources for those using disk-based backups, data deduplication, and data replication with VMware.

I will edit this post should additional information be made available.

I’d like to thank Cormac Hogan for helping raise awareness with his post.