VMworld 2018 – HPC/Big Data/AI: What’s New, Updates and Sessions

Monday, September 10th 2018

I attended VMworld 2018 in Las Vegas the week of August 27th, 2018. As every year, there is a lot more going on at the conference than just VMware. Many storage vendors are present to show off their new technology, and others are part of the VMware ecosystem. My focus this year was on High-Performance Computing (HPC), Big Data and Artificial Intelligence (AI), more specifically Machine Learning (ML) and Deep Learning (DL) workloads, instead of the traditional virtual workloads.

In the corporate IT world, virtualization and its benefits are well understood. The efficient usage and consolidation of infrastructure, combined with flexible management tools, make virtualization an easy choice. In the HPC world it is a harder sell, as bare metal is assumed to be the default option for maximum performance. Virtualization technology has improved significantly, however, and the overhead associated with it is a fraction of what it used to be. Latency is now also very competitive, which makes the overall virtualization picture worth looking at.

In this blog I talk about the new vSphere SKU for high-performance workloads, the features within vSphere 6.7 related to hardware accelerators such as GPUs, and the upcoming 6.7 Update 1. I finish with a list of the HPC/Big Data/AI sessions at VMworld that I wanted to see. If you missed VMworld this year you are in luck, as VMware recorded all the sessions and they can be viewed online. I provided a direct link to each session for your convenience.

HPC/Big Data

The new HPC workloads demand more flexible infrastructure, change frequently and vary in the resources they need for processing. Virtualization can improve the efficiency of resource usage and reduce the time it takes to reconfigure and adapt the infrastructure. On top of that, there is the benefit of using all the tools that come with the VMware ecosystem to manage, support and run workloads securely. If you have used vSphere in the past, then you are already familiar with the tools.

The overall obtained efficiency closes the performance gap with bare metal and in some cases delivers even faster results with comparable latency.

vSphere for HPC/Big Data

Over the last few years VMware has increased its focus on the growing markets for HPC, Big Data and AI, and has been adding options to its product lines to accommodate their needs. It is a work in progress, but the improvements and added features are impressive. All the knowledge VMware has built up with virtualization over the years can be a significant competitive advantage and benefit the newer high-performance workloads driven by data.

VMware introduced, about a year ago, a new addition to their vSphere product line specifically for the HPC and Big Data market that is called “vSphere Scale-Out”. It is a version of vSphere that contains all the essential core features aimed at HPC and Big Data workloads. According to VMware it will be licensed exclusively for HPC and Big Data and at a reduced price compared to the vSphere Enterprise edition. Some of the key features included in the vSphere Scale-Out edition are the ESXi Hypervisor, vMotion, Storage vMotion, Host Profiles, Auto Deploy and Distributed Switch. For a comparison of the various vSphere editions see Figure 1 below from VMware.

VMworld 2018 - Using vSphere to Virtualize AI Infrastructures

Figure 1 – vSphere Edition Comparison – VMware

AI

The use of Artificial Intelligence with accelerated hardware such as GPUs and FPGAs is becoming mainstream, because the algorithms are complex enough to require a higher level of parallelization and there are vast amounts of data to process. GPUs are nothing new to VMware, as it has had Virtual Desktop Infrastructure (VDI) solutions for a long time. VMware has been building on this expertise and has expanded into using GPUs as workhorses for AI. Those GPUs are sometimes also called General Purpose GPUs (GPGPUs) to avoid confusion with the GPUs used for displays.

VMware has been adding new features around HPC, Big Data and AI to facilitate the use of hardware accelerators such as GPUs. vSphere 6.7, released earlier this year, and the upcoming vSphere 6.7 Update 1 are important releases for people who want to use hardware accelerators with vSphere. There are many new features and improvements, and some of them are listed below. For a full list of features please refer to the VMware website.

vSphere 6.7:

Using and enhancing NVIDIA GRID vGPU technology.
- This technology allows you to create one or more virtual GPU instances (vGPUs) on a single physical GPU. You can run a different GPU workload on each vGPU instance and attach those instances to VMs. Each vGPU is assigned a profile that defines the memory size per vGPU and the maximum number of vGPUs per physical GPU. Each VM has a GPU driver and is unaware that the attached vGPU is virtual.
- Figure 2 shows a single physical GPU accelerator with eight (8) vGPU instances, each attached to a single VM. Attaching more than one vGPU to a VM is currently not supported.
- Pause & Resume functionality for VMs that take advantage of GPU workloads. This feature has been available for CPUs and is now available for GPUs.
Another key feature is support for Remote Direct Memory Access (RDMA) for high-performance workloads that require maximum bandwidth with the lowest latency. It allows for direct memory transfers from one computer to another with minimal involvement from the CPU.
The ability to Slice & Dice a physical GPU into one or more vGPUs, combined with the Pause & Resume capabilities, makes vSphere a very competitive and attractive solution. It increases the overall usage and efficiency of GPUs and improves the ROI, which is quite important considering the cost of GPUs.
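
The slicing arithmetic described above can be illustrated with a short Python sketch. It is not based on any NVIDIA or VMware API: the class and function names are hypothetical, the 16 GB physical GPU and the 2 GB profile are assumed values chosen to reproduce the eight instances of Figure 2, and memory is treated as the only limit, which is a simplification. Real profile names and limits should be taken from the NVIDIA GRID documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VGpuProfile:
    """Hypothetical vGPU profile: a name and the memory given to each vGPU."""
    name: str
    memory_gb: int


def max_vgpus(physical_memory_gb: int, profile: VGpuProfile) -> int:
    """Upper bound on vGPUs of one profile that fit on a physical GPU.

    Assumption: GPU memory is the only constraint.
    """
    if profile.memory_gb <= 0:
        raise ValueError("profile memory must be positive")
    return physical_memory_gb // profile.memory_gb


def assign_vgpus(vm_names, physical_memory_gb, profile):
    """Attach exactly one vGPU to each VM (more than one is not supported)."""
    capacity = max_vgpus(physical_memory_gb, profile)
    if len(vm_names) > capacity:
        raise ValueError(
            f"only {capacity} vGPUs of profile {profile.name} fit on this GPU"
        )
    # Label each vGPU instance with the profile name and a running index.
    return {vm: f"{profile.name}-{i}" for i, vm in enumerate(vm_names)}


if __name__ == "__main__":
    profile = VGpuProfile(name="example-2g", memory_gb=2)  # assumed profile
    vms = [f"vm{i}" for i in range(8)]
    print(max_vgpus(16, profile))  # 8 instances on an assumed 16 GB GPU
    print(assign_vgpus(vms, 16, profile))
```

Choosing a larger profile in this sketch lowers the number of VMs that can share the GPU, which is the trade-off between memory per workload and overall GPU utilization.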

VMworld 2018 - vSphere 6.7 Architecture

Figure 2 – NVIDIA GRID vGPU – VMware

vSphere 6.7 Update 1:

The release was announced at VMworld 2018 (US) and is expected to ship by the end of the quarter (VMware’s quarter ends on 11/02/2018).
vMotion for GPUs
- On top of the Pause & Resume functionality the new release adds vMotion for NVIDIA vGPU powered VMs.
- An impressive feature that was on many customers’ wish lists.
- There are the expected limitations, such as only being able to vMotion between vGPUs of the same type and GPU technology.
- It brings the management benefits that have been available for a long time on vSphere to GPU workloads.
Support for Intel FPGA
- The release also comes with support for the Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA.
- Near bare metal performance as the card is accessed through the VMware DirectPath I/O technology.
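
The vMotion limitation mentioned above (same vGPU type and same GPU technology) can be pictured as a simple pre-check. The Python sketch below is purely illustrative: the field names, the example values and the matching rule are assumptions drawn from the text, not the actual checks that vSphere performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VGpuSlot:
    """Hypothetical description of a vGPU available on a host."""
    host: str
    gpu_technology: str  # assumed field, e.g. a GPU model or architecture
    vgpu_type: str       # assumed field, e.g. a vGPU profile name


def can_vmotion(source: VGpuSlot, target: VGpuSlot) -> bool:
    """True when both slots share the same GPU technology and vGPU type."""
    return (
        source.gpu_technology == target.gpu_technology
        and source.vgpu_type == target.vgpu_type
    )


def eligible_targets(source: VGpuSlot, candidates):
    """Filter candidate slots on other hosts that pass the pre-check."""
    return [
        c for c in candidates
        if c.host != source.host and can_vmotion(source, c)
    ]


if __name__ == "__main__":
    src = VGpuSlot("host-a", "gpu-model-x", "profile-2g")
    pool = [
        VGpuSlot("host-b", "gpu-model-x", "profile-2g"),  # matches
        VGpuSlot("host-c", "gpu-model-x", "profile-4g"),  # different type
        VGpuSlot("host-d", "gpu-model-y", "profile-2g"),  # different GPU
    ]
    print([slot.host for slot in eligible_targets(src, pool)])
```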

Roadmap:

During one of the sessions at VMworld it was announced that Distributed Resource Scheduler (DRS) support was on the roadmap for NVIDIA vGPU and for DirectPath I/O GPU. DRS is responsible for balancing computing workloads with available resources in a virtualized environment. This should make GPUs easier to consume and provision. No timeline was given for a release date yet.

Sessions

Before leaving for VMworld I made a list of the sessions that I wanted to attend, with topics around HPC, Big Data and AI. Although I couldn’t go to all sessions, I wasn’t disappointed with the content of the sessions I was able to attend. Luckily, VMware recorded all the sessions from my list and they are available on the VMworld On-Demand Video Library website for replay.

There was a good mix of HPC, Big Data and AI related sessions for beginner, intermediate and expert audiences. Some of the sessions focused on vSphere, while others focused on accelerator hardware such as GPUs and FPGAs and on interconnect acceleration such as RDMA. The content of some sessions overlapped, but I do recommend viewing all listed sessions.

For convenience, I listed the title, session ID, presenter(s), a description and a direct link to the On-Demand Video Library for each session below.

“Next-Gen Multi-Cloud Architecture for Machine Learning” (CTO1189BU)

Presenters:

o Andrea Siviero, Principal Architect, VMware

o Tom Hite, Sr. Director, PS Emerging Technology, VMware

Description:

An introduction with definitions for many of the terms used in Machine Learning, followed by a TensorFlow demo that uses the Titanic passenger list for analysis.

“Elastic AI Infrastructure on vSphere: Virtual GPU and FPGA with Bitfusion” (VAP2134BU)

Presenters:

o Michael Zimmerman, CEO, Bitfusion

o Ziv Kalmanovich, Sr. Product Manager, VMware

Description:

Describes the new demands on IT infrastructure and the need for GPU accelerated applications in the enterprise.

Also includes a presentation by the CEO of Bitfusion explaining their disaggregated platform for GPUs and FPGAs.

“How VMware vSphere and NVIDIA GPUs Accelerate Your Organization” (VIN2124BU)

Presenters:

o Raj Rao, Director, Product Management, NVIDIA

o Ziv Kalmanovich, Sr. Product Manager, VMware

Description:

Describes the new demands on IT infrastructure and the need for GPU accelerated applications in the enterprise.

Also includes a presentation by NVIDIA with use cases for HPC and AI with GPUs and vGPUs.

“Virtualize and Accelerate HPC/Big Data with SR-IOV, vGPU and RDMA” (CTO2390BU)

Presenters:

o Josh Simons, Chief Technologist for High Performance Computing, VMware

o Mohan Potheri, HPC Solutions Architect, VMware

Description:

A good overview of high-performance workloads and how to obtain maximum performance. Josh Simons talks about breaking or bending the virtual abstraction for the best possible performance. The presenters take you through the various components of high-performance workloads on vSphere. The latter portion of the session talks about benchmarking.

“High Performance Big Data and Machine Learning on VMware Cloud on AWS” (VAP1900BU)

Presenters:

o Dave Jaffe, Staff Engineer, VMware

o Justin Murray, Senior Technical Marketing Architect, VMware

Description:

This session answers the question “how do I run high-performance workloads on VMware Cloud on AWS?”. The presenters also discuss the differences between VMware on-premises and VMware Cloud on AWS.

“Architecting and Deploying Virtualized HPC Clusters” (VAP2010BU)

Presenters:

o Justin King, Senior Business Development Manager, VMware

o Mohan Potheri, HPC Solutions Architect, VMware

Description:

The presenters explain why vSphere is the optimal platform for high-performance computing. They go through the traditional HPC architectures and the reasons to virtualize HPC.

“Driving Organizational Value by Virtualizing AI/ML/DL and HPC Workloads” (VAP2340BU)

Presenters:

o Anthony Foster, Sr. Advisor, Technical Marketing, Dell EMC

o Gina Rosenthal, Sr Product Marketing Manager, VMware

Description:

Excellent session for people who are familiar with VMware but are new to HPC/Big Data/AI or have had very little exposure to it. It is explained in a way that doesn’t require a heavy technical background to grasp the concepts.

“Interconnect Acceleration for Machine Learning, Big Data, and HPC” (VAP2807BU)

Presenters:

o Adit Ranadive, Sr MTS, VMware

o Aviad Shaul Yehezkel, Mellanox

Description:

A focus on the acceleration of interconnects with RDMA (going over the basics) and how this can help GPU and FPGA based workloads achieve higher bandwidth and lower latency by offloading work from the CPU.