Friday, November 23, 2007

Why VMware over Netapp NFS

  1. While today the majority of VMware ESX servers connect to their datastores using either the Fibre Channel or iSCSI protocol, I believe that using the NFS protocol is a significantly better way to access your ESX datastores. Here's why...

    An ESX datastore is simply a place to store your virtual machine files. Yes, files: nothing more than files, like a Word document. So all that's needed from an ESX host is a way to read and write these files.

    You have two options for datastores: VMFS or NFS. If you want to use advanced features like VMotion, you need a datastore that is shared across all your ESX hosts. Initially VMware only supported VMFS for datastores, hence the high number of FC implementations; however, NFS support was added in ESX 3.0 (August 2006) and is starting to catch on.

    NFS is not really a filesystem, but a protocol for accessing files on a remote file server, and it has been in use since 1984. The remote file server is where the filesystem lives and is really where all the magic happens. In the case of Netapp the filesystem is called WAFL; with Windows it's NTFS; with Linux it may be ext3 or some other filesystem.

    So basically, VMware not only has the significant burden of managing all the components of virtualization, it also has to maintain VMFS. With NFS, that burden shifts to the NFS vendor, which is also free to add features as long as it adheres to the NFS protocol.

    Here are some reasons to use the Netapp implementation of NFS for VMware instead of using VMFS volumes over FC or iSCSI:

  • You get thin provisioning by default with NFS. FC and iSCSI VMDKs are thick. This can save 50% of your disk space.

  • Adding NFS datastores is simple: mount the NFS volume using the GUI and start creating VMs.

  • Adding additional Netapp filers for datastores requires no downtime and no cabling changes.

  • You can have large datastores that span many disks: up to 16TB for Netapp.

  • You can use A-SIS to de-duplicate your datastores for a 50-80% reduction in disk space

  • You can expand AND decrease NFS volumes on the fly

  • You can use snapshots of the volumes to restore individual VMs

  • You can use SnapMirror to back up VMware volumes to a DR site over a WAN

  • You don't have to deal with VMFS or RDMs

  • You don't have to deal with FC switches, zones, LUN sizing, HBAs, and identical LUN IDs

  • You can restore multiple VMs, individual VMs, or files within VMs.

  • You can instantaneously clone (via Netapp FlexClone) a single VM or multiple VMs

  • You can also back up whole VMs, or files within VMs, using NDMP or any other backup software

  • ESX server I/O is small-block and extremely random, which means that bandwidth matters little

  • No single disk I/O queue, so your performance depends strictly on the size of the pipe and the disk array.

  • Failover to your SnapMirrored copies can be done in minutes. iSCSI/FC requires LUN resignaturing.

  • In the near future, you will be able to clone a single VM or create hundreds of VMs from a template in seconds
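On the Netapp side, the volume setup behind several of these points (thin provisioning, A-SIS deduplication, growing and shrinking on the fly) can be sketched with Data ONTAP 7.x filer commands. This is only an illustrative sketch; the aggregate, volume, and host names are hypothetical:

```shell
# Hypothetical names: aggr1 is an existing aggregate, vmware_ds the NFS datastore volume.
# Create a flexible volume with no space guarantee, i.e. thin provisioned:
vol create vmware_ds -s none aggr1 500g

# Turn on A-SIS deduplication and run a first pass over existing data:
sis on /vol/vmware_ds
sis start -s /vol/vmware_ds

# Grow or shrink the volume online:
vol size vmware_ds +100g
vol size vmware_ds -50g

# Export it to the ESX hosts (ESX needs root access to NFS datastores):
exportfs -p rw=esx1:esx2,root=esx1:esx2 /vol/vmware_ds
```

These commands run on the filer console, not on the ESX hosts, and would be adapted to your own aggregate layout and export policy.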

Some background… The previous information is based on our experience, not just theory.

In August 2006, when NFS support was announced, we were in the planning stage for a major upgrade to our VMware infrastructure. Our VMware infrastructure then consisted of about 20 ESX hosts with about 750 VMs, all using Fibre Channel to a Hitachi SAN. We are also a heavy user of Netapp filers, and knowing the benefits of NFS over SAN, we decided to investigate the possibility of using Netapp over NFS. I'm pretty sure we were the first Netapp customer to do this…

Of course the first hurdle was performance. Fortunately, we had more than a year's worth of VMware performance data on our SAN. After looking very closely at the numbers, we discovered that the throughput to the SAN was extremely low: somewhere in the 10-15MB/s range on average across all 20 ESX hosts, with spikes well under 50MB/s. Since migration to NFS is so simple, we decided to move several test servers to NFS. All we had to do was mount an NFS share on the current ESX hosts and start moving the VMs. After migrating several hundred VMs to NFS over 6 months, we decided to implement our new infrastructure completely on NFS.
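On the ESX side, the mount step described above can also be done from the service console instead of the GUI. A sketch for ESX 3.x, assuming a hypothetical filer name and volume:

```shell
# Hypothetical names: filer1 is the NetApp filer, /vol/vmware_ds its exported volume.
# Add an NFS datastore on an ESX 3.x host (equivalent to the VI Client GUI operation):
esxcfg-nas -a -o filer1 -s /vol/vmware_ds vmware_ds

# List configured NAS datastores to confirm the mount:
esxcfg-nas -l
```

Once the datastore shows as mounted, VMs can be migrated onto it with the usual tools.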

We purchased 2 dedicated Netapp 3070s and several new 8-way ESX hosts for the new project. We also used an existing Netapp R200 to keep 21 days of snapshots for online backups. The R200 also serves as a failover in case of complete corruption of our primary system. Within 6 months we had completely migrated all of our VMs off the SAN and onto Netapp NFS. We now run almost 1000 VMs in this environment.

With our current Netapp IO load on the 3070s, we estimate that we could add 2000 or more VMs to this configuration by simply adding ESX hosts. The Netapp 3070c IO is running 4MB/s on average, with a few 30MB/s spikes during the day. Not one IO performance issue has arisen. Our VMware administrators say it's even faster than our SAN when performing OS builds, VMotion, and cloning.
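The headroom estimate above can be sanity-checked with back-of-envelope arithmetic using the figures in this post (about 1000 VMs averaging 4MB/s aggregate on the filer, versus roughly 125MB/s usable on a single gigabit link); a quick sketch:

```shell
# Back-of-envelope check of the scaling estimate, using figures from the post.
# 1000 VMs average 4MB/s aggregate on the filer; a gigabit link tops out near 125MB/s.
echo "$((4 * 1024 / 1000)) KB/s per VM on average"
echo "$((4 * 3)) MB/s projected at roughly 3000 VMs"
echo "$((1000 / 8)) MB/s usable per 1Gbit link (decimal)"
```

Even tripling the VM count leaves the projected average a small fraction of one gigabit link, which is consistent with the author's claim that bandwidth is not the constraint.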

We currently don’t run Exchange or Sqlserver VMs, however with 10Gbit and Infiniband solutions on the way I believe that soon all real servers will be virtual.

So I stand by my initial statement; however, I should say that today it's really Netapp, and not NFS, that makes the difference. In the future, however, I expect to see other vendors catch up with Netapp and all the value they add to the VMware Infrastructure environment.

Additional Links to NFS for VMware


Anonymous said...

We're trying this out on a smaller scale in Jan 08 with a Netapp 2050 and (4) ESX hosts. Hopefully it works well.

I notice that ESX 3.5 supports jumbo frames but not for NAS (NFS) from what I can tell. :-(

Any thoughts on whether having jumbo frames enabled would cause problems or performance issues in this case?

Dave Wujcik said...

I'm curious about your SAN benchmarks...

I've got an EMC Clariion CX-700 and a DMX3 with VMWare volumes on each.

In my testing, I could easily pull 90-95MB/sec sustained reads w/o issue on both...and this was not on isolated systems...these were the current in-use systems.

I also discovered a little "gotchya" that destroys performance #s on the DMX3 (due to its internal layout), where the larger the LUN you create, the slower your read performance is.

I found the best setup for performance on the DMX3 was to share out bare meta LUNs (28.6GB) and then let ESX tack them all together via extents. This avoids any possibility of internal "plaiding" (stripes going more than one way at the same time, exponentially increasing I/O events).

It turned out to be a "pretty darned fast(TM)" implementation that sometimes out-paces our physical hardware (depending on the task being performed).

I did all of this over 2Gb FC.

How did you do your testing and what was your setup?


-- Dave

She said...

Dave, from what I hear, you may want to avoid using more than 2 extents per VMFS volume.

I haven't tested 3.5 for issues, but ESX 3.0.2 has been known to have issues with multiple extents. Generally this is seen as a temporary solution, and with Storage VMotion, migrating to an extent-free VMFS datastore may be much more appealing.

Dave Wujcik said...

Thanks for the response...

Do you have any other info regarding the extent issues?

Anything that you can link me to?


-- Dave

Unknown said...

Hi Adrian,

Just curious on your success. Any issues so far?

We have just implemented a solution like yours here. IBM nSeries n3600 (your FAS2050 Active/Active) but only 3 ESX hosts. 4 x GigE to each controller, multi-link IP hash load balancing implemented, and jumbo frames turned on. No issues so far.

Have you done any performance tests yet?

Anonymous said...

Here's another one....

With NFS, if you have to failover to your SnapMirrored copies, you don't have to deal with LUN resignaturing as you do with iSCSI/FC!

franklyfrank21 said...

Just set up VMware with NFS and it's working fine, but I'm unable to see disk I/O performance stats for any VMs I create. I can see disk stats fine if I use local storage. Any ideas?

Craig said...

To share with you, I run my environment with 2Gb/s FC on a CX3-80. We use 300GB per VMFS with the meta LUN concept on EMC technology. I am able to achieve 190MB/s easily with my FC. When I try the benchmark on the NFS box I have now, I only manage to get a max of 120MB/s on average. I agree with you that we may not require such high performance, but it also depends on what you are trying to virtualize. I am aiming for more high-load machines to be virtualized, and to be sincere with you, Equallogic is catching up in terms of features, pricing and performance.

Unknown said...

Rolando, what filer do you have? Also, how many disks are in your aggregate? 128MB/s is the limit on your 1Gbit network, so your bottleneck is the network... Step up to 10Gbit on your filer to raise your raw sequential speed. However, most VM data travels in 4K blocks, so you should compare FC to NFS using your real VMs and not some testing tool...

Andrew Miler said...

I'd meant to post on this earlier, as I've found this post very helpful. Besides my own use, I also linked to it here along with my own summary.

Thanks again.

Scott said...

NetApp & VMware have released a whitepaper detailing their performance testing of FC, iSCSI & NFS. You can find it here: