Error when loading a matrix in distributed training

Hi, I have run into a problem while implementing distributed training.

The dataset is an M x N matrix and the input is a vector.

The dataset is loaded as:

import linecache
import torch.utils.data as data

class ReadDataset(data.Dataset):
  def __init__(self, filename):
    self._filename = filename
    # Count the number of data rows once up front
    with open(filename, 'r') as f:
      self._total_data = len(f.readlines()) - 1

  def __getitem__(self, idx):
    # linecache uses 1-based line numbers
    line = linecache.getline(self._filename, idx + 1)
    return line

  def __len__(self):
    return self._total_data

Then it is read with:

dataset = ReadDataset(training_filename)
dataLoader = data.DataLoader(dataset)
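
Since __getitem__ returns the raw line as a string, the DataLoader yields strings, so each line still has to be parsed into a tensor before it can be used. A minimal sketch of such a parse (assuming whitespace-separated numbers; parse_line is an illustrative name, not part of my actual code):

import torch

def parse_line(line):
  # Convert one whitespace-separated row of numbers into a 1-D float tensor
  return torch.tensor([float(tok) for tok in line.split()])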

The model is:

model = nn.Sequential(nn.Linear(2000, 150), nn.Linear(150, 2000))

In distributed training:

train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
# Note: num_workers sets the number of data-loading subprocesses;
# it is unrelated to the distributed world size
dataLoader = data.DataLoader(dataset, num_workers=args['world-size'], sampler=train_sampler)

for epoch in range(10):
  train_sampler.set_epoch(epoch)
  for target in dataLoader:
    in_val = ...  # a vector whose size equals len(target)
    out = model(in_val)
The error when training with 2 GPUs is:

RuntimeError: size mismatch, m1: [1 x 1000], m2: [2000 x 150] at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCTensorMathBlas.cu:249
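
For reference, those shapes say that the first Linear layer received an input with 1000 features even though its weight expects 2000. The mismatch itself can be reproduced in isolation (the exact wording of the error varies by PyTorch version):

import torch
import torch.nn as nn

layer = nn.Linear(2000, 150)   # weight has shape [150, 2000]
bad_in = torch.randn(1, 1000)  # only 1000 features instead of 2000
out = layer(bad_in)            # raises a size-mismatch RuntimeError like the one above

So the puzzle is why an input created with size 2000 arrives at the layer with only 1000 features under distributed training.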

Something is wrong, but I do not know where to start. Can someone give some suggestions?

Thanks

It looks like your input is sized (1, 1000), while the (transposed) weight of your first linear layer is sized (2000, 150), so that layer expects 2000 input features.

Thanks for looking into this issue!
According to the profiling result, in_val is a vector of size 2000.
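
For reference, the shape that actually reaches each layer can be checked with a forward pre-hook on the model from the snippet above (a sketch; log_input_shape is just an illustrative name):

def log_input_shape(module, inputs):
  # Prints the input shape each Linear layer actually receives, per replica
  print(type(module).__name__, 'received input of shape', inputs[0].shape)

for m in model.modules():
  if isinstance(m, nn.Linear):
    m.register_forward_pre_hook(log_input_shape)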

You mention this happens when training with two GPUs. Does the error not occur when training with only one GPU?

Right, this only happens when 2 GPUs are used; with one GPU there is no error.

Just to clarify, if you run

import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(2000, 150), nn.Linear(150, 2000))
in_val = torch.randn(1, 2000)
out = model(in_val)

do you get the same error?

It runs fine without error.
I also observed that it passes when distributed training is not used, i.e. when two training processes are started with MPI but without PyTorch's distributed mode.

Could you post the full code that produces the error then, or at least a minimal, runnable example?

Hi @aplassard, thanks for the help! The following is the cleaned-up code to reproduce the error.

from __future__ import print_function
import argparse
from collections import OrderedDict
import linecache
import os
import torch
import torch.nn as nn
import torch.utils.data as data
import torch.utils.data.distributed

# Load matrix from file
class LazyDataset(data.Dataset):
  def __init__(self, filename):
    self._filename = filename
    self._total_data = 0
    with open(filename, 'r') as f:
      self._total_data = len(f.readlines()) - 1

  def __getitem__(self, idx):
    line = linecache.getline(self._filename, idx + 1)
    return idx, line

  def __len__(self):
    return self._total_data

if __name__ == "__main__":
  # Input args processing
  parser = argparse.ArgumentParser()

  parser.add_argument('-datadir', '--datadir', help='Data directory where the training dataset is located', required=False, default=None)

  args = vars(parser.parse_args())

  # The training dataset may be split into multiple files
  training_filename = args['datadir'] + '/matrixTest'

  # Load the dataset (this plain loader is replaced below by one with a distributed sampler)
  dataset = LazyDataset(training_filename)
  dataLoader = data.DataLoader(dataset)

  # Initialize the model
  model = nn.Sequential(
      nn.Linear(20, 10),
      nn.Linear(10, 20)
  )

  # Initialize the distributed process group
  torch.distributed.init_process_group(
      world_size=2,
      init_method='file:///' + os.path.join(os.environ['HOME'], 'distributedFile'),
      backend='gloo')
  # Move the model to GPU before wrapping it with DistributedDataParallel
  model.cuda()
  model = nn.parallel.DistributedDataParallel(model)

  # Recreate the loader with a distributed sampler so each rank gets its own shard
  train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
  dataLoader = data.DataLoader(dataset, num_workers=2, sampler=train_sampler)

  for epoch in range(5):
    total_loss = 0
    train_sampler.set_epoch(epoch)
    for idx, _ in dataLoader:
      # Build a one-hot input vector of length 20
      in_val = torch.zeros(20)
      in_val[idx] = 1.0
      output = model(in_val)
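
For completeness, both processes must see the same $HOME/distributedFile for the file:// rendezvous, and a stale file left over from a previous run can break or hang init_process_group, so it may help to remove it before each run (a small sketch, not part of the original script):

import os

rendezvous = os.path.join(os.environ['HOME'], 'distributedFile')
if os.path.exists(rendezvous):
  os.remove(rendezvous)  # stale rendezvous files can break the next run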

The matrixTest file can look like the following, although its contents are not actually used:

1 1 2 1 1 1 1 1 1 1 1 1 1 2 3 4 5 6 8 9
2 3 2 2 2 1 2 2 1 3 4 5 1 2 3 4 1 1 1 1

The error from one process (both processes report the same error):

Traceback (most recent call last):
  File "test.py", line 67, in <module>
    output = model(in_val)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 216, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 223, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 55, in forward
    return F.linear(input, self.weight, self.bias)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 994, in linear
    output = input.matmul(weight.t())
RuntimeError: size mismatch, m1: [1 x 10], m2: [20 x 10] at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCTensorMathBlas.cu:249
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249: driver shutting down

I am still confused about which part of this code could be wrong.

Both the input and the model sizes are as expected. At the failing call

output = model(in_val)

the length of in_val is 20, while the model is:

DistributedDataParallel(
  (module): Sequential(
    (0): Linear(in_features=20, out_features=10, bias=True)
    (1): Linear(in_features=10, out_features=20, bias=True)
  )
)

Any suggestion or comment would be appreciated.
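
One possibility worth checking (a guess, not a confirmed diagnosis): when DistributedDataParallel is constructed without device_ids in a single process that can see both GPUs, it replicates the module on both devices and scatters the input along dim 0 across the replicas, which is what the parallel_apply frames in the traceback suggest. A 1-D in_val of size 20 would then be chunked into two pieces of size 10, matching the [1 x 10] in the error. A sketch of two ways to avoid that chunking (local_rank is hypothetical and would have to come from your launcher):

# Option 1: give the input an explicit batch dimension, so dim 0 is a real batch axis
in_val = in_val.unsqueeze(0)   # shape [1, 20] instead of [20]
output = model(in_val)

# Option 2: pin each process to a single GPU so no intra-process scattering happens
torch.cuda.set_device(local_rank)
model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])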