Is there a way to selectively load model params into GPU during forward pass?

jerinphilip · June 15, 2020, 10:08am

I came across some way to change the GPU for component modules of a larger model in the following link:

https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

However, my model runs in a distributed environment (DDP) and has several sub-modules. Keeping all of them on the GPU for every forward-backward cycle reduces the batch size I can fit for an update. But not all of them are active in a given forward/backward pass, so they don’t need to stay on the GPU: I could load only the ones needed for a single batch and leave the rest in non-GPU memory.

In other words, I have parameters Theta + theta[t] (for t=1…T), where t is a particular task. I want to only load a single theta[t] for a forward and backward pass into the GPU and fit larger batches. Currently I’m holding all theta[t] in the GPU.

Is it possible to use the same semantics as in the tutorial, applied to a single (sub)-module (theta[t]), to achieve the intention described above?

mrshenli · June 15, 2020, 3:27pm

Hey @jerinphilip, I believe this is possible. You can use Tensor.to(device) to move the parameters to the GPU in the forward pass. The to (i.e., copy) operator should be added to the autograd graph, so the backward pass will compute gradients for the original on-CPU parameters properly. Let me know if it doesn’t work.
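A minimal sketch of this idea, assuming a shared trunk (Theta) that lives on the compute device and per-task heads (theta[t]) that stay on the CPU. The class name `TaskSwitchedModel`, the layer sizes, and the `task` argument are hypothetical and chosen only for illustration; the sketch falls back to the CPU when no GPU is present and does not involve DDP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskSwitchedModel(nn.Module):
    """Shared trunk (Theta) on the compute device; per-task heads (theta[t]) kept on CPU."""

    def __init__(self, num_tasks, in_dim=8, hidden=16, out_dim=4, device="cpu"):
        super().__init__()
        self.device = torch.device(device)
        # Shared parameters live permanently on the compute device.
        self.shared = nn.Linear(in_dim, hidden).to(self.device)
        # Task-specific parameters are created on the CPU and left there.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, out_dim) for _ in range(num_tasks)]
        )

    def forward(self, x, task):
        h = torch.relu(self.shared(x.to(self.device)))
        head = self.heads[task]
        # Tensor.to() is recorded by autograd, so the backward pass routes
        # gradients to the original CPU-resident parameters.
        weight = head.weight.to(self.device)
        bias = head.bias.to(self.device)
        return F.linear(h, weight, bias)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = TaskSwitchedModel(num_tasks=3, device=device)

out = model(torch.randn(5, 8), task=1)
out.sum().backward()

# Only the head used in this pass receives a gradient, and it lives on the CPU.
active_grad = model.heads[1].weight.grad
inactive_grad = model.heads[0].weight.grad
```

Note that calling `model.to(device)` or `model.cuda()` afterwards would move the heads as well, so in this sketch the placement is fixed in the constructor.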

Note that, although this can reduce the GPU memory footprint, DDP would still need to communicate the same number of parameters, as that is determined at DDP construction time. And as those parameters are on the CPU, you won’t be able to use NCCL, which might cause a considerable slowdown.

jerinphilip · June 15, 2020, 6:20pm

Where can I read more on the DDP communications setup? Thanks in advance.

mrshenli · June 15, 2020, 6:32pm

Hey @jerinphilip, this page briefly describes DDP: https://pytorch.org/docs/master/notes/ddp.html

We have a paper with more comprehensive details. Let me upload that to arXiv.

jerinphilip · June 16, 2020, 12:30pm

Where do I obtain details corresponding to this particular information? Isn’t only .grad meant to be communicated and the workers applying the updates individually? If my parameters theta[t] only have gradients for the particular task, would this help the case? I’m reading the Forward Pass section of Internal Design; with find_unused_parameters, it is possible to operate on a subgraph, correct(?). I already have this enabled.

mrshenli · June 16, 2020, 2:20pm

Where do I obtain details corresponding to this particular information?

We need to go through some internal approval process to publicly disclose that paper. It will take some time. For now Distributed Data Parallel — PyTorch master documentation is the best place for overall intro. The implementation of DDP is linked below:

https://github.com/pytorch/pytorch/blob/ebd869153c6adb37507d2ecb6a9fe3fd495fbb6e/torch/nn/parallel/distributed.py

from contextlib import contextmanager
import copy
import itertools

import torch

import torch.cuda.comm
import torch.distributed as dist

if dist.is_available():
    from torch.distributed.distributed_c10d import _get_default_group

from ..modules import Module
from .replicate import replicate
from .scatter_gather import scatter_kwargs, gather
from .parallel_apply import parallel_apply
from torch.cuda._utils import _get_device_index


def _find_tensors(obj):


https://github.com/pytorch/pytorch/blob/ebd869153c6adb37507d2ecb6a9fe3fd495fbb6e/torch/csrc/distributed/c10d/reducer.cpp

#include <torch/csrc/distributed/c10d/reducer.h>

#include <functional>

#include <c10/core/DeviceGuard.h>
#include <c10/util/Exception.h>
#include <torch/csrc/autograd/engine.h>
#include <torch/csrc/autograd/function_hook.h>
#include <torch/csrc/autograd/functions/accumulate_grad.h>
#include <torch/csrc/autograd/profiler.h>
#include <torch/csrc/autograd/utils/lambda_post_hook.h>
#include <torch/csrc/distributed/c10d/comm.h>
#include <torch/csrc/utils/hash.h>
#include <torch/csrc/utils/memory.h>

namespace c10d {
namespace {

inline int64_t current_time_in_nanos() {
  return torch::autograd::profiler::getTime();


Isn’t only .grad meant to be communicated and the workers applying the updates individually?

No. Currently, at construction time, DDP creates a mapping from parameters to buckets, and it always communicates all buckets even if some gradients are not used in a given iteration. The reason is that process 1 might compute only grad A while process 2 computes only grad B. However, the AllReduce operation requires all processes to provide the same set of input tensors, so in this case both process 1 and process 2 need to communicate grads A and B. DDP could use another round of communication to first figure out which grads are used globally. However, if DDP blocks waiting for this signal, there will be no overlap between communication and computation, which could result in a >30% slowdown in some cases.
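To make the grad A / grad B case concrete, here is a toy, pure-Python simulation. It is not DDP's actual code: `allreduce_mean`, `flatten_bucket`, and the `layout` list are hypothetical stand-ins, and the sketch assumes that a parameter without a local gradient contributes zeros to its slot in the bucket.

```python
def allreduce_mean(per_rank_buckets):
    """Element-wise average of one flat bucket per rank (a stand-in for AllReduce)."""
    world_size = len(per_rank_buckets)
    lengths = {len(bucket) for bucket in per_rank_buckets}
    if len(lengths) != 1:
        # A collective only works if every rank supplies the same layout.
        raise ValueError("every rank must contribute a bucket of the same size")
    return [sum(values) / world_size for values in zip(*per_rank_buckets)]


def flatten_bucket(local_grads, layout):
    """Pack gradients in a fixed order; missing gradients are filled with zeros."""
    flat = []
    for name, size in layout:
        flat.extend(local_grads.get(name, [0.0] * size))
    return flat


# The bucket layout is fixed up front and is identical on every rank.
layout = [("A", 2), ("B", 3)]

# Rank 0 only produced a gradient for A; rank 1 only produced one for B.
rank0_grads = {"A": [1.0, 2.0]}
rank1_grads = {"B": [3.0, 4.0, 5.0]}

reduced = allreduce_mean(
    [flatten_bucket(rank0_grads, layout), flatten_bucket(rank1_grads, layout)]
)
```

Both ranks still send all five values, even though each computed only part of them, which is why skipping a sub-module saves computation but not communication in this toy model.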

If my parameters theta[t] only have gradients for the particular task, would this help the case?

It helps to skip computation but not communication. DDP always communicates all parameters in the model you passed to the DDP constructor.

I’m reading the Forward Pass section of Internal Design, with find_unused_parameters, it is possible to operate on a subgraph, correct(?)

That flag only allows DDP to skip waiting for the grads of those parameters. The communication phase is the same regardless of the value of find_unused_parameters.

jerinphilip · June 30, 2020, 2:59pm

Is this the relevant paper?

mrshenli · June 30, 2020, 3:12pm

Yep, it is the paper. Sorry, I forgot to update it here today.