PyTorch sources for IBM PowerAI version with LMS support on IBM Power9

edowson · April 24, 2019, 5:42pm

Hi,

Would anyone happen to know where IBM hosts the PyTorch sources for the IBM PowerAI versions that feature LMS (Large Model Support) for Power9 systems?

Large Model Support is a feature provided in PowerAI PyTorch that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with “out of memory” errors. LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.
One or more elements of a deep learning model can lead to GPU memory exhaustion. These include:
Model depth and complexity
Base data size (for example, high-resolution images)
Batch size
Traditionally, the solution to this problem has been to modify the model until it fits in GPU memory. This approach, however, can negatively impact accuracy – especially if concessions are made by reducing data fidelity or model complexity.
With LMS, deep learning models can scale significantly beyond what was previously possible and, ultimately, generate more accurate results.

There is a Conda channel with ppc64le support, but those are all pre-built binaries.

I’d prefer that the PyTorch sources with the LMS patches be publicly available.

Are these patches included in the current PyTorch mainline? A quick search for PRs with “large model support” or “LMS” didn’t yield any results.

hartbx · June 12, 2019, 6:35pm

Hello!

I work on the PowerAI (now renamed Watson Machine Learning Community Edition (phew!)) development team. Sorry for not noticing this question sooner!

At the time you asked the LMS code wasn’t public, but we’re now prepared to offer it to the community.

It’s currently sitting in a team-member’s repo on github. The LMS implementation based on PyTorch 1.1.0 stable can be found at:

Usage and implementation notes are available in a wiki at:

edowson · June 12, 2019, 7:39pm

Great, thank you for sharing!

If you have access to an IBM AC922 8335-GTH model server (4 GV100 GPUs with 32GB each, dual core Power9 CPUs, 1TB system memory), can you tell me whether the following scenario for over-provisioning the physically available GPU memory would work?

Let’s assume that I’m running Ubuntu-18.04 LTS on the AC922 with the NVIDIA docker runtime installed. The system is configured with 1TB of physical system memory. I have a docker image which runs a DNN model that requires 16GB of GPU memory.

Now, the total physical available GPU memory is 32GB x 4 = 128GB.

Will I be able to run and launch 16 docker containers, creating a total demand for 16x16GB = 256GB of GPU memory? This is 128GB more than available physical GPU RAM.
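
The arithmetic in this scenario can be checked with a minimal Python sketch. The figures come from the post; the even spread of 4 containers per GPU is an assumption made here for illustration, since the post does not say how containers map to GPUs.

```python
# Figures from the scenario above.
GPUS = 4
GPU_MEMORY_GB = 32
CONTAINERS = 16
DEMAND_PER_CONTAINER_GB = 16

physical_gb = GPUS * GPU_MEMORY_GB                 # total physical GPU memory
demand_gb = CONTAINERS * DEMAND_PER_CONTAINER_GB   # total requested GPU memory
shortfall_gb = demand_gb - physical_gb             # how far demand exceeds supply

# Assumption (not stated in the post): containers are spread evenly over GPUs.
containers_per_gpu = CONTAINERS // GPUS
demand_per_gpu_gb = containers_per_gpu * DEMAND_PER_CONTAINER_GB
oversubscription = demand_per_gpu_gb / GPU_MEMORY_GB

print(f"physical: {physical_gb} GB, demand: {demand_gb} GB, shortfall: {shortfall_gb} GB")
print(f"per GPU: {demand_per_gpu_gb} GB requested on a {GPU_MEMORY_GB} GB device "
      f"({oversubscription:.1f}x oversubscribed)")
```

Under that assumption every GPU would have to absorb twice its physical memory, so the question is really whether LMS can keep half of each container's working set in host memory.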

Will I still be able to run the docker containers, or will PyTorch complain that it has insufficient GPU resources for some of the docker container instances?

This is for a distributed reinforcement learning environment setup and simulation within one physical node, before considering distribution of the workload over multiple nodes over an infiniband network.

hartbx · June 12, 2019, 8:42pm

That’s an interesting question. I think the answer is probably: it depends.

By default LMS tries to minimize what’s resident on the GPU; that’s controlled by the set_limit_lms tunable. The rationale for that is to favor “large model works at all” over best possible performance. Reasonable folks could prefer different defaults, so things might change in the future. Anyway, that conservative setting works in your favor for this use case.

On the other side, though, a couple of things work against you. Generally, all the tensors needed by a specific GPU operator have to be resident on the GPU while the operator is running. That means there’s some model-dependent minimum amount of GPU memory that will be needed.

And the current LMS decides where to move tensors (GPU or host memory) based only on the state of the single PyTorch instance. Multiple LMS-using PyTorch instances (containerized or not) wouldn’t coordinate. The LMS code doesn’t have any cross-instance distribution awareness. And it doesn’t, say, sense that GPU memory is nearly full and so try to move stuff to host memory. It’s currently just based on what tensors are needed by the active operations and the tunable settings.

So LMS will allow you to over-subscribe, but for now you’ll have to do some experimentation to figure out how many instances will fit. And if you’re too aggressive, yeah, allocation failures or insufficient-resource errors are the likely failure modes.
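
The behaviour described in this post can be illustrated with a toy model. This is not the real LMS algorithm: the `Tensor` class, the `plan_residency` function, the tensor sizes, and the limit value of 0 are all hypothetical, chosen only to show why a per-instance memory floor exists and why uncoordinated instances can still overfill a GPU.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_mb: int
    active: bool  # needed by the operator that is currently running


def plan_residency(tensors, limit_mb, min_swap_mb):
    """Toy residency plan for ONE instance (hypothetical policy, not LMS itself).

    - Tensors needed by the running operator always stay on the GPU.
    - Tensors smaller than min_swap_mb are never swapped.
    - Any other tensor stays on the GPU only if the total remains within limit_mb.
    Returns (resident_names, swapped_names, resident_mb).
    """
    resident, swapped, used = [], [], 0
    pinned = [t for t in tensors if t.active or t.size_mb < min_swap_mb]
    for t in pinned:
        resident.append(t.name)
        used += t.size_mb
    for t in tensors:
        if t in pinned:
            continue
        if used + t.size_mb <= limit_mb:
            resident.append(t.name)
            used += t.size_mb
        else:
            swapped.append(t.name)
    return resident, swapped, used


tensors = [
    Tensor("weights", 6000, active=True),
    Tensor("activations", 4000, active=True),
    Tensor("old_activations", 8000, active=False),
    Tensor("bias", 1, active=False),
]

# limit_mb=0 models "keep as little as possible on the GPU".
resident, swapped, floor_mb = plan_residency(tensors, limit_mb=0, min_swap_mb=2)
print(resident, swapped, floor_mb)

# Each instance plans on its own, so the floors simply add up on a shared GPU.
GPU_CAPACITY_MB = 32000
instances_that_fit = GPU_CAPACITY_MB // floor_mb
print(f"at most {instances_that_fit} such instances fit on one GPU")
```

Even with the most conservative limit, the active tensors set a floor per instance, and nothing in this model notices when the sum of those floors exceeds the device. That matches the advice above to find the workable instance count by experiment.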

edowson · June 12, 2019, 9:04pm

That’s what I thought. Do you have access to an AC922 to test this out? I think that even without trying the test, the answer would be obvious, that it would fail. But I thought I’d ask anyway.

It reminds me of a time when CPUs didn’t have hardware virtual memory management and paging support.

In this particular case though, it probably can be handled initially by a container-aware cache controller, before moving this logic into hardware on the NVLINK fabric.

hartbx · June 12, 2019, 10:07pm

That’s a harder question.

Access to Power systems with GPU is available (for a fee) in the IBM PowerAI Cloud:

https://cloud.ibm.com/catalog/services/powerai

But those are POWER8 systems (with P100) and appear to still offer PowerAI 1.5.3 pre-installed, which is older than what you’d want.

If you’re working with a common model, or a variant of one, we might be able to give you a ballpark of GPU memory usage in the non-LMS vs LMS case.

Or, the current release of PowerAI actually includes both Power and x86 packages in the conda channel. So you could A/B test for GPU memory usage LMS vs non- on an x86 machine. LMS will perform better on an AC922 due to the NVLINK connections providing higher bandwidth between GPU and host memory, but it works on x86 as well.

A few notes:

We’re preparing to update our conda channel for our upcoming 1.6.1 release, and it may take a day or two for that to settle. If you wanted to install today, you’d probably want to take care to restrict the install to 1.6.0:
- $ conda install -y -n my-pai-env python=3.6 pytorch powerai-release=1.6.0
An added caution for x86: Anaconda publishes a pytorch package in their free channel, and their installer sometimes prefers stuff from the free channel over higher priority channels. So if you go that route please see the Tip at the bottom of: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.0/navigation/pai_setupAnaconda.html
We include the necessary CUDA, etc. packages in the conda channel, but you do need to download (from NVIDIA) and install the GPU driver separately. Our 1.6.0 and 1.6.1 releases are both built against CUDA 10.1 and so require the 418-series GPU driver.
The upcoming PowerAI / WML CE 1.6.1 PyTorch package is based on 1.1.0, and it includes an LMS update that should reduce GPU memory further in the default/conservative case.

ticlazau · June 14, 2019, 3:53pm

In order for this to work on the AC922 you will need to:

launch only 4 containers, one per GPU:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 ibmcom/powerai:1.6.0-pytorch-ubuntu18.04-py3 bash
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 ibmcom/powerai:1.6.0-pytorch-ubuntu18.04-py3 bash
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=2 ibmcom/powerai:1.6.0-pytorch-ubuntu18.04-py3 bash
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=3 ibmcom/powerai:1.6.0-pytorch-ubuntu18.04-py3 bash
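
The four commands above differ only in the GPU index, so they can be generated rather than typed. A minimal sketch follows; the helper name `docker_run_command` is made up here, and the script only prints the commands instead of executing them.

```python
import shlex

# Image tag copied from the commands above.
IMAGE = "ibmcom/powerai:1.6.0-pytorch-ubuntu18.04-py3"
GPU_COUNT = 4  # GPUs in the AC922 8335-GTH discussed in this thread


def docker_run_command(gpu_id, image=IMAGE):
    """Build the argument list for one container pinned to one GPU."""
    return [
        "docker", "run", "--runtime=nvidia",
        "-e", f"NVIDIA_VISIBLE_DEVICES={gpu_id}",
        image, "bash",
    ]


commands = [docker_run_command(gpu_id) for gpu_id in range(GPU_COUNT)]
for cmd in commands:
    # shlex.join (Python 3.8+) renders the list as a shell-safe command line.
    print(shlex.join(cmd))
```

Keeping each command as an argument list means it could later be handed to `subprocess.run` unchanged, if launching from Python is preferred.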

1TB available RAM will allow expansion of your models for each GPU, up to 90% of the system capacity. With TF you may use 16x GPUs if you max limit the GPU memory utilization / session / container, however this is not available in PyTorch because PyTorch doesn’t pre-occupy the GPU’s entire memory.

Rgds,
FM

edowson · June 15, 2019, 4:40am

Setting NVIDIA_VISIBLE_DEVICES to a GPU id makes that GPU available exclusively for the docker container. A GV100 has 5120 CUDA cores and 32GB HBM2 graphics memory.

What I want to be able to do is to allocate 2560 CUDA cores and 16GB of graphics RAM, effectively splitting the GV100 resources into 2, and allocate 2 docker containers to it and get LMS to work. You could consider going a bit more aggressive and allocate say 1280 CUDA cores and 8GB of graphics RAM and allocate 4 docker containers to a single GPU and try to get LMS to work.

This requires virtualization of GPU resources.

As I understand it, virtualization of GPU resources is only supported by VMware for x86 platforms. VMware is currently not available for Power9. For the moment, if you run Ubuntu or RHEL on the AC922, the only option for running multiple docker instances is to dedicate a GPU to each container and pass the entire GPU through to it. This restricts the maximum number of docker containers to 4 on an 8335-GTH, and 6 on an 8335-GTX.

For a reinforcement learning context, I want to run between 8 to 16 docker instances with GPU acceleration to maximize distributed agent training within a single node, before scaling out horizontally across multiple AC922 nodes.

ticlazau · June 16, 2019, 7:54pm

Hi,

VMware passthrough mode for V100 GPUs is a dedicated 1:1 (GPU:VM) mode as well. There is no way today to control the GPU memory or the number of cores used. What you can do is limit your application to use no more than a given amount of HBM2 (like in TF). You can have multiple containers running on a single GPU using nvidia-docker2 and the NVIDIA_VISIBLE_DEVICES option, let’s say:
4x containers to NVIDIA_VISIBLE_DEVICES=0
4x containers to NVIDIA_VISIBLE_DEVICES=1
4x containers to NVIDIA_VISIBLE_DEVICES=2
4x containers to NVIDIA_VISIBLE_DEVICES=3
but HBM2 utilization on the GPUs will be a big issue if you don’t limit the GPU memory usage from PyTorch (e.g. 7GB per session per container); you may very quickly hit CUDA OOM errors, and all your training sessions will fail.
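
A rough per-GPU budget check for this layout, as a Python sketch. The function `fits_on_gpu` is a made-up helper, and the calculation ignores any per-process overhead on the GPU (such as the CUDA context each process creates), so real headroom will be smaller than shown.

```python
GPU_MEMORY_GB = 32  # V100 HBM2 capacity discussed in this thread


def fits_on_gpu(containers, per_container_gb, gpu_memory_gb=GPU_MEMORY_GB):
    """Return (fits, headroom_gb) for containers sharing one GPU.

    A negative headroom is the amount by which demand exceeds the device.
    """
    demand_gb = containers * per_container_gb
    return demand_gb <= gpu_memory_gb, gpu_memory_gb - demand_gb


# 4 containers per GPU, each held to about 7GB as suggested above.
print(fits_on_gpu(4, 7))
# The same 4 containers at the 16GB each from the original scenario.
print(fits_on_gpu(4, 16))
```

The first case leaves 4GB spare per GPU; the second is 32GB over, which is the situation where OOM failures would be expected.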
With PyTorch LMS you have two options to control how LMS works:

torch.cuda.set_limit_lms(limit) - defines the soft limit in bytes on GPU memory allocated for tensors
torch.cuda.set_size_lms(size) - defines the minimum tensor size in bytes that is eligible for LMS swapping
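
A hedged sketch of how these two tunables might be set from a training script. Only the two function names come from this thread; the wrapper `configure_lms`, the 7 GiB limit, and the 1 MiB size threshold are illustrative assumptions. The calls exist only in the LMS-enabled PowerAI / WML CE builds, so the sketch checks for them first and does nothing on a stock PyTorch.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def configure_lms(cuda, limit_bytes, min_tensor_bytes):
    """Apply the two LMS tunables if `cuda` provides them.

    `cuda` is expected to be torch.cuda from an LMS-enabled build.
    Returns True if both setters were found and called, False otherwise.
    """
    set_limit = getattr(cuda, "set_limit_lms", None)
    set_size = getattr(cuda, "set_size_lms", None)
    if set_limit is None or set_size is None:
        return False
    set_limit(limit_bytes)      # soft limit on GPU memory allocated for tensors
    set_size(min_tensor_bytes)  # minimum tensor size eligible for swapping
    return True


try:
    import torch
except ImportError:
    torch = None  # the sketch still loads without PyTorch installed

if torch is not None:
    # Illustrative values only; suitable settings are model-dependent.
    applied = configure_lms(torch.cuda, limit_bytes=7 * GIB, min_tensor_bytes=1 * MIB)
    print("LMS tunables applied" if applied else "no LMS tunables in this PyTorch build")
```

Note that the limit is described above as a soft limit, so setting it is not the same as a hard cap on a process's GPU memory.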

If your PyTorch GPU process uses about 7GB today, it may be worth trying (assuming V100s with 32GB HBM2 in the AC922), but because PyTorch can’t cap HBM2 usage at a specific value, say 7GB, you run a very high risk of OOM.

Rgds,
FM

edowson · June 17, 2019, 12:16am

I was referring to the use of the mediated pass-through mode using a vSphere and NVIDIA GRID vGPU driver with a vDWS license.

Machine Learning using Virtualized GPUs on VMware vSphere - 20180522