When I train my neural network using PyTorch's DataParallel, this error appears:

terminate called without an active exception
Aborted (core dumped)

The instruction that causes the error is:

network = DataParallel(network, chunk_sizes=list_chunk_sizes)
list_chunk_sizes = [4, 5, 5, 5, 5, 5, 5, 5, 5, 5]
batch_size = 49

Can anyone help me?
Are you using nn.DataParallel? The chunk_sizes argument isn't defined for it. Could you explain what this argument does (in this custom implementation)?
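For reference, a minimal sketch of the native API (the model and device setup here are placeholders, not taken from your code):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model
if torch.cuda.device_count() > 1:
    # nn.DataParallel accepts device_ids, output_device, and dim,
    # but no chunk_sizes; it splits the batch evenly across the GPUs
    model = nn.DataParallel(model.cuda(), device_ids=list(range(torch.cuda.device_count())))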
Yes, I am using torch.nn.DataParallel to train the network in parallel.
First, forgive my language, because I don't speak English well.
I use a batch size of 49, then divide it into 10 sub-batches whose sizes follow this list: [4, 5, 5, 5, 5, 5, 5, 5, 5, 5].
When the model trains, the images are fed to the network based on the list of chunk sizes.
It's for doing parallelism if you have more than one GPU card.
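A minimal sketch of that uneven split, using torch.split, which accepts a list of chunk sizes along a dimension (the tensor shape here is illustrative):

import torch

batch = torch.randn(49, 3, 511, 511)           # batch_size = 49; image shape is illustrative
chunk_sizes = [4, 5, 5, 5, 5, 5, 5, 5, 5, 5]   # one chunk per GPU
chunks = torch.split(batch, chunk_sizes, dim=0)
print([c.size(0) for c in chunks])             # [4, 5, 5, 5, 5, 5, 5, 5, 5, 5]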
Thanks for the explanation.
Did you write the DataParallel class yourself, or are you using a specific implementation from another repository? The native nn.DataParallel class shouldn't take chunk_sizes as an argument, so I'm just wondering which implementation you are using.
Ah yes, it is an implementation from CornerNet, an algorithm for object detection in images.
The problem with this error is that sometimes it doesn't appear.
And where should I put chunk_sizes, if this is not the correct place?
I'm not familiar with this implementation, so could you post a link to it, please?
Also, could you use nn.DataParallel and check if you are running into the same error?
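For example, something like this (assuming network is your model before wrapping):

import torch.nn as nn

# the native wrapper takes no chunk_sizes and splits the batch evenly across GPUs
network = nn.DataParallel(network)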
The link to CornerNet is:
https://github.com/princeton-vl/CornerNet
You can find this implementation in CornerNet-master/nnet/py_factory.py.
Here is part of that code:
import importlib
import torch

# system_configs, DummyModule, Network, and DataParallel are defined elsewhere in the repo
class NetworkFactory(object):
    def __init__(self, db):
        super(NetworkFactory, self).__init__()

        # dynamically import the model module named in the config
        module_file = "models.{}".format(system_configs.snapshot_name)
        nnet_module = importlib.import_module(module_file)

        self.model = DummyModule(nnet_module.model(db))
        self.loss = nnet_module.loss
        self.network = Network(self.model, self.loss)
        # CornerNet's custom DataParallel: chunk_sizes controls how the
        # batch is split across the GPUs (one entry per device)
        self.network = DataParallel(self.network, chunk_sizes=system_configs.chunk_sizes)

        # count the total number of model parameters
        total_params = 0
        for params in self.model.parameters():
            num_params = 1
            for x in params.size():
                num_params *= x
            total_params += num_params
        print("total parameters: {}".format(total_params))

        if system_configs.opt_algo == "adam":
            self.optimizer = torch.optim.Adam(
                filter(lambda p: p.requires_grad, self.model.parameters())
            )
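As a hypothetical sanity check (not part of the CornerNet repo), the chunk sizes should sum to the batch size, with at most one entry per visible GPU:

import torch

list_chunk_sizes = [4, 5, 5, 5, 5, 5, 5, 5, 5, 5]
batch_size = 49

assert sum(list_chunk_sizes) == batch_size, "chunk_sizes must sum to batch_size"
assert len(list_chunk_sizes) <= torch.cuda.device_count(), "at most one chunk per visible GPU"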