DataParallel and DistributedDataParallel with fastai on AWS SageMaker performance

I am trying to use either distributed or parallel training with fastai on SageMaker notebooks or training jobs (my team is more or less committed to this service). I am running the code on an ml.p3.8xlarge with 4x V100, but I cannot get a speed-up with any of the approaches I have tried.

After spinning up the ml.p3.8xlarge notebook instance, here is the setup in my notebook using the pytorch_p36 env:

%%bash
pip install fastai==2.0.0 fastcore==1.0.0
sudo mkdir -p /opt/ml/input/data/collab
sudo chmod 777 /opt/ml/input/data/collab

Here is the code I am testing:

import fastai, fastcore, torch
print(f'fastai {fastai.__version__}')
print(f'fastcore {fastcore.__version__}')
print(f'torch {torch.__version__}')

from fastai.collab import *
from fastai.tabular.all import *
from fastai.distributed import *

path = untar_data(URLs.ML_100k, dest="/opt/ml/input/data/collab")

ratings = pd.read_csv(
    path/'u.data',
    delimiter='\t',
    header=None,
    names=['user','movie','rating','timestamp']
)

movies = pd.read_csv(
    path/'u.item',
    delimiter='|',
    encoding='latin-1',
    usecols=(0,1),
    names=['movie','title'],
    header=None,
)

ratings = ratings.merge(movies)

dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)

n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 64

model = EmbeddingDotBias(n_factors, n_users, n_movies)

learn = Learner(dls, model, loss_func=MSELossFlat())

print(learn.model)

print("rank_distrib():", rank_distrib())
print("num_distrib():", num_distrib())
print("torch.cuda.device_count():", torch.cuda.device_count())

epochs, lr = 5, 5e-3

print('learn.fit_one_cycle')
learn.fit_one_cycle(epochs, lr)

print('with learn.distrib_ctx():')
with learn.distrib_ctx():
    learn.fit_one_cycle(epochs, lr)

print('with learn.distrib_ctx(torch.cuda.device_count()-1):')
with learn.distrib_ctx(torch.cuda.device_count()-1):
    learn.fit_one_cycle(epochs, lr)

print('with learn.parallel_ctx():')
with learn.parallel_ctx():
    learn.fit_one_cycle(epochs, lr)

print('nn.DataParallel(learn.model)')
if torch.cuda.device_count() > 1:
    learn.model = nn.DataParallel(learn.model)
learn.fit_one_cycle(epochs, lr)

Here is the output from running the code as a script:

sh-4.2$ /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python /home/ec2-user/SageMaker/cf.py
fastai 2.0.0
fastcore 0.1.39
torch 1.6.0
EmbeddingDotBias(
  (u_weight): Embedding(944, 64)
  (i_weight): Embedding(1665, 64)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)
rank_distrib(): 0
num_distrib(): 0
torch.cuda.device_count(): 4
learn.fit_one_cycle
epoch     train_loss  valid_loss  time
0         1.153435    1.154428    00:11
1         0.957201    0.954827    00:11
2         0.816548    0.878350    00:11
with learn.distrib_ctx():
epoch     train_loss  valid_loss  time
0         0.999254    1.040871    00:11
1         0.821853    0.914921    00:11
2         0.658059    0.845227    00:11
with learn.distrib_ctx(torch.cuda.device_count()-1):
epoch     train_loss  valid_loss  time
0         0.749317    0.997568    00:11
1         0.580846    0.912386    00:11
2         0.381058    0.878295    00:11
with learn.parallel_ctx():
epoch     train_loss  valid_loss  time
0         0.514148    1.025872    00:25
1         0.383893    0.996381    00:18
2         0.204836    0.970403    00:18
nn.DataParallel(learn.model)
epoch     train_loss  valid_loss  time
0         0.341708    1.103849    00:16
1         0.272570    1.067705    00:16
2         0.134262    1.055507    00:16

Watching GPU usage with nvidia-smi dmon -s u, I can see that only the DataParallel runs (with learn.parallel_ctx(): and nn.DataParallel(learn.model)) show GPU ids 1, 2, 3 being used. The problem is that DataParallel is slower, even when I try increasing the batch size or embedding size.

Any help would be appreciated. I have a much larger collaborative filtering model that shows the same behavior as this MovieLens example, and I am hoping parallel/distributed training will reduce its training time.

Hey @pl3, sorry about the delay.

For DataParallel (DP), it can become slow when the model is large, as DP needs to replicate the model in every forward pass.

For DistributedDataParallel (DDP), I would expect it to be faster than local training. Which of the numbers shown above are from DistributedDataParallel? And how did you initialize the DDP module? When using DDP, did you reduce the per-process batch_size to batch_size / world_size?

For DataParallel (DP), it can become slow when the model is large, as DP needs to replicate the model in every forward pass.

Ahh, that makes sense as to why it is slower, especially for my models that have a couple of large embeddings.

The runs under with learn.distrib_ctx(): are supposed to use DDP under the hood; the context manager handles setting up and tearing down the distributed model. You can find a link to the code here, though it is a bit abstracted and a little difficult to follow (at least for me) depending on your familiarity with the fastai library.

I am guessing there might be an issue with fastai's functions/defaults for how they detect the number of distributed processes/GPUs available in SageMaker environments.

When using DDP, did you reduce the per-process batch_size to batch_size / world_size ?

I'm slightly unclear on what you mean here. I used the same batch size for each training loop, which means each GPU under DP would have been receiving 1/4 of the batch, which I assumed should have been faster.

I am not familiar with fastai’s DDP wrapper. When using the raw DDP API, applications need to spawn one process per GPU and then create one DDP instance and one dataloader in each process. With this setting, the per-process dataloader should use batch_size/world_size as the new batch size.
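
For reference, here is a minimal sketch of that pattern using the raw PyTorch APIs (toy model and random data, and names like run_worker are just for illustration):

import torch, torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import TensorDataset, DataLoader, DistributedSampler

def run_worker(rank, world_size):
    # one process per GPU; the process group must exist before wrapping the model in DDP
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(nn.Linear(10, 1).cuda(rank), device_ids=[rank])

    # per-process dataloader: a global batch of 64 on 4 GPUs becomes bs=16 here,
    # and DistributedSampler gives each rank a disjoint shard of the dataset
    ds = TensorDataset(torch.randn(10000, 10), torch.randn(10000, 1))
    dl = DataLoader(ds, batch_size=64 // world_size,
                    sampler=DistributedSampler(ds, num_replicas=world_size, rank=rank))

    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for x, y in dl:
        loss = nn.functional.mse_loss(model(x.cuda(rank)), y.cuda(rank))
        opt.zero_grad()
        loss.backward()   # DDP averages gradients across ranks during backward
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)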

From the linked code, it looks like it does not spawn subprocesses for you, and it only calls init_process_group when num_distrib() > 1. So, if you didn't spawn subprocesses explicitly in application code, it might fall back to local training?
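
A quick way to check is to print the environment variables the launcher is supposed to set; if I remember correctly, fastai's num_distrib() just reads WORLD_SIZE, and your log above shows num_distrib(): 0, which suggests it was never set:

import os, torch.distributed as dist

# WORLD_SIZE/RANK are normally set by whatever launcher spawns one process per GPU;
# if they are missing, there is nothing for the DDP setup to initialize and training stays local
print("WORLD_SIZE:", os.environ.get("WORLD_SIZE"))
print("RANK:", os.environ.get("RANK"))
print("dist initialized:", dist.is_available() and dist.is_initialized())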

In case this is helpful, here is a quick example with a brief explanation of how DDP works. And this section tries to explain the differences between DP and DDP.

Yeah, you are correct; I just found out I was not setting up DDP correctly. I knew there were extra steps needed to get DDP working, so I was hoping DP would speed things up, but with a large model that doesn't seem to be the case.

I found this example in fastai which uses the distributed context, so I am reworking my script to add the correct functionality (a rough sketch of the plan is below). I will review the links you provided as well; it seems I need to get into the docs a little more. I appreciate your help!
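
Roughly the restructured script I am planning (untested so far; the file name cf_distrib.py is mine, and I am assuming python -m fastai.launch is available in this fastai version to spawn one process per GPU and set the env vars that distrib_ctx()/num_distrib() check):

# cf_distrib.py -- run with: python -m fastai.launch cf_distrib.py
from fastai.collab import *
from fastai.tabular.all import *
from fastai.distributed import *

path = untar_data(URLs.ML_100k, dest="/opt/ml/input/data/collab")
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])

# default from_df column handling: first three columns are user, item, rating
dls = CollabDataLoaders.from_df(ratings, bs=64)

model = EmbeddingDotBias(64, len(dls.classes['user']), len(dls.classes['movie']))
learn = Learner(dls, model, loss_func=MSELossFlat())

# distrib_ctx() should pick up the rank/world size set by the launcher and wrap
# the model in DDP; without the launcher, num_distrib() is 0 and training stays local
with learn.distrib_ctx():
    learn.fit_one_cycle(5, 5e-3)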
