Error when loading a matrix in distributed training

Hi, I have run into a problem while implementing distributed training.

The dataset is an M x N matrix and the input is a vector.

The dataset is loaded as:

import linecache
import torch.utils.data as data

class ReadDataset(data.Dataset):
  def __init__(self, filename):
    self._filename = filename
    # Count the number of data rows once up front
    with open(filename, 'r') as f:
      self._total_data = len(f.readlines()) - 1

  def __getitem__(self, idx):
    # linecache uses 1-based line numbers
    line = linecache.getline(self._filename, idx + 1)
    return line

  def __len__(self):
    return self._total_data

Then it is read with:

dataset = ReadDataset(training_filename)
dataLoader = data.DataLoader(dataset)
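
Since __getitem__ returns the raw line as a string, the DataLoader yields strings, so each line still has to be parsed into a tensor before it can be used. A minimal sketch of such a parse (assuming whitespace-separated numbers; parse_line is an illustrative name, not part of my actual code):

import torch

def parse_line(line):
  # Convert one whitespace-separated row of numbers into a 1-D float tensor
  return torch.tensor([float(tok) for tok in line.split()])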

The model is:

model = nn.Sequential(nn.Linear(2000, 150), nn.Linear(150, 2000))

In distributed training:

train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
# Note: num_workers sets the number of data-loading subprocesses;
# it is unrelated to the distributed world size
dataLoader = data.DataLoader(dataset, num_workers=args['world-size'], sampler=train_sampler)

for epoch in range(10):
  train_sampler.set_epoch(epoch)
  for target in dataLoader:
    in_val = ...  # a vector whose size equals len(target)
    out = model(in_val)
The error when training with 2 GPUs is:

RuntimeError: size mismatch, m1: [1 x 1000], m2: [2000 x 150] at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCTensorMathBlas.cu:249
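
For reference, those shapes say that the first Linear layer received an input with 1000 features even though its weight expects 2000. The mismatch itself can be reproduced in isolation (the exact wording of the error varies by PyTorch version):

import torch
import torch.nn as nn

layer = nn.Linear(2000, 150)   # weight has shape [150, 2000]
bad_in = torch.randn(1, 1000)  # only 1000 features instead of 2000
out = layer(bad_in)            # raises a size-mismatch RuntimeError like the one above

So the puzzle is why an input created with size 2000 arrives at the layer with only 1000 features under distributed training.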

Something is wrong, but I do not know where to start. Can someone give some suggestions?

Thanks

It looks like your input is sized (1, 1000), while the (transposed) weight of your first linear layer is sized (2000, 150), so that layer expects 2000 input features.

Thanks for looking into this issue!
According to the profiling result, in_val is a vector of size 2000.
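
For reference, the shape that actually reaches each layer can be checked with a forward pre-hook on the model from the snippet above (a sketch; log_input_shape is just an illustrative name):

def log_input_shape(module, inputs):
  # Prints the input shape each Linear layer actually receives, per replica
  print(type(module).__name__, 'received input of shape', inputs[0].shape)

for m in model.modules():
  if isinstance(m, nn.Linear):
    m.register_forward_pre_hook(log_input_shape)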

You mention this happens when training with two GPUs. Does the error not occur when training with only one GPU?

Right, this only happens when 2 GPUs are used; with one GPU there is no error.

Just to clarify, if you run

import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(2000, 150), nn.Linear(150, 2000))
in_val = torch.randn(1, 2000)
out = model(in_val)

do you get the same error?

It runs fine without error.
I also observed that it passes when distributed training is not used, i.e. when two training processes are started with MPI but without PyTorch's distributed mode.

Could you post the full code that produces the error then, or at least a minimal, runnable example?

Hi @aplassard, thanks for the help! The following is the cleaned-up code to reproduce the error.

from __future__ import print_function
import argparse
from collections import OrderedDict
import linecache
import os
import torch
import torch.nn as nn
import torch.utils.data as data
import torch.utils.data.distributed

# Load matrix from file
class LazyDataset(data.Dataset):
  def __init__(self, filename):
    self._filename = filename
    self._total_data = 0
    with open(filename, 'r') as f:
      self._total_data = len(f.readlines()) - 1

  def __getitem__(self, idx):
    line = linecache.getline(self._filename, idx + 1)
    return idx, line

  def __len__(self):
    return self._total_data

if __name__ == "__main__":
  # Input args processing
  parser = argparse.ArgumentParser()

  parser.add_argument('-datadir', '--datadir', help='Data directory where the training dataset is located', required=False, default=None)

  args = vars(parser.parse_args())

  # The training dataset may be split into multiple files
  training_filename = args['datadir'] + '/matrixTest'

  # Load the dataset (this plain loader is replaced below by one with a distributed sampler)
  dataset = LazyDataset(training_filename)
  dataLoader = data.DataLoader(dataset)

  # Initialize the model
  model = nn.Sequential(
      nn.Linear(20, 10),
      nn.Linear(10, 20)
  )

  # Initialize the distributed process group
  torch.distributed.init_process_group(
      world_size=2,
      init_method='file:///' + os.path.join(os.environ['HOME'], 'distributedFile'),
      backend='gloo')
  # Move the model to GPU before wrapping it with DistributedDataParallel
  model.cuda()
  model = nn.parallel.DistributedDataParallel(model)

  # Recreate the loader with a distributed sampler so each rank gets its own shard
  train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
  dataLoader = data.DataLoader(dataset, num_workers=2, sampler=train_sampler)

  for epoch in range(5):
    total_loss = 0
    train_sampler.set_epoch(epoch)
    for idx, _ in dataLoader:
      # Build a one-hot input vector of length 20
      in_val = torch.zeros(20)
      in_val[idx] = 1.0
      output = model(in_val)
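
For completeness, both processes must see the same $HOME/distributedFile for the file:// rendezvous, and a stale file left over from a previous run can break or hang init_process_group, so it may help to remove it before each run (a small sketch, not part of the original script):

import os

rendezvous = os.path.join(os.environ['HOME'], 'distributedFile')
if os.path.exists(rendezvous):
  os.remove(rendezvous)  # stale rendezvous files can break the next run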

The matrixTest file can look like the following, although its contents are not actually used:

1 1 2 1 1 1 1 1 1 1 1 1 1 2 3 4 5 6 8 9
2 3 2 2 2 1 2 2 1 3 4 5 1 2 3 4 1 1 1 1

The error from one process (both processes report the same error):

Traceback (most recent call last):
  File "test.py", line 67, in <module>
    output = model(in_val)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 216, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 223, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 55, in forward
    return F.linear(input, self.weight, self.bias)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 994, in linear
    output = input.matmul(weight.t())
RuntimeError: size mismatch, m1: [1 x 10], m2: [20 x 10] at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCTensorMathBlas.cu:249
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249: driver shutting down

I am still confused about which part of this code could be wrong.

Both the input and the model sizes are as expected. At the failing call

output = model(in_val)

the length of in_val is 20, while the model is:

DistributedDataParallel(
  (module): Sequential(
    (0): Linear(in_features=20, out_features=10, bias=True)
    (1): Linear(in_features=10, out_features=20, bias=True)
  )
)

Any suggestion or comment would be appreciated.
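
One possibility worth checking (a guess, not a confirmed diagnosis): when DistributedDataParallel is constructed without device_ids in a single process that can see both GPUs, it replicates the module on both devices and scatters the input along dim 0 across the replicas, which is what the parallel_apply frames in the traceback suggest. A 1-D in_val of size 20 would then be chunked into two pieces of size 10, matching the [1 x 10] in the error. A sketch of two ways to avoid that chunking (local_rank is hypothetical and would have to come from your launcher):

# Option 1: give the input an explicit batch dimension, so dim 0 is a real batch axis
in_val = in_val.unsqueeze(0)   # shape [1, 20] instead of [20]
output = model(in_val)

# Option 2: pin each process to a single GPU so no intra-process scattering happens
torch.cuda.set_device(local_rank)
model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])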