I am using 2 NVIDIA GPUs for image training with DistributedDataParallel, but I am getting this unexpected error:
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: flock: Input/output error
Aborted (core dumped)
With one GPU the training works fine, but I get the error above as soon as I use 2 GPUs in parallel.
Here is one of the functions used for initialization:
    def create_grids(self, img_size=416, gridsize=(13, 13), device='cpu', type=torch.float32):
        """ calculate a grid, with the defined gridsize over the input-image

        img_size: (width and height) of the input-image
        gridsize: size of the grid projected over the input-image
        device: cpu or gpu-index where to run the calculation on
        type: type of (float) nvidia or cpu to use on device
        """
        nx, ny = gridsize  # x and y grid size
        try:
            self.img_size = max(img_size)  # take the biggest dimension out of width or height, to calculate stride
        except TypeError:  # if only one dimension is given, take that
            self.img_size = int(img_size)
        self.stride = self.img_size / max(gridsize)

        # build xy offsets
        yv, xv = torch.meshgrid([torch.arange(ny), torch.arange(nx)])
        self.grid_xy = torch.stack((xv, yv), 2).to(device).type(type).view((1, 1, ny, nx, 2))

        # build wh gains
        self.anchor_vec = self.anchors.to(device) / self.stride
        self.anchor_wh = self.anchor_vec.view(1, self.number_of_anchers, 1, 1, 2).to(device).type(type)
        self.gridsize = torch.Tensor(gridsize).to(device)
        self.nx = nx
        self.ny = ny
More precisely, the error comes from the last line of the section below. I initialize my distributed training this way:
    # Initialize distributed training
    if len(gpu_list) > 1:
        # generate path of a (non-existing) file, used to set up the distributed learning
        assert distributed_folder, "The distributed-training folder isn't set"  # check if not the default 0
        distributed_learning_filename = str(Nnet_name) + "_distlearn_setup"  # remove this file when program is stopped!!
        distributed_init_filepath = os.path.join(distributed_folder, distributed_learning_filename)
        # there is more than 1 GPU-index, so use distributed training
        dist.init_process_group(backend='nccl',  # use distributed backend 'nccl'
                                init_method='file://' + str(distributed_init_filepath),  # file used to set up the distributed learning
                                world_size=distributed_world_size,  # number of nodes for distributed training
                                rank=distributed_node_rank)  # distributed training node rank
        model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
        model.yolo_layers = model.module.Get_YOLO_layers_list()  # move yolo layer indices to top level
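Because the error message mentions flock, one thing that may be worth checking is whether the folder that holds the init file accepts file locks at all. Below is a minimal diagnostic sketch, assuming a POSIX system (the fcntl module is not available on Windows). The helper name can_flock is hypothetical and not part of PyTorch, and the check only tries a lock on a throwaway file; it does not reproduce what init_process_group does internally.

```python
import fcntl
import os
import tempfile


def can_flock(folder):
    """Return True if an exclusive flock can be taken on a temp file in `folder`.

    Hypothetical diagnostic helper: it creates a throwaway file in the folder,
    tries to lock and unlock it, and always removes the file afterwards.
    """
    fd, path = tempfile.mkstemp(dir=folder)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # request an exclusive lock
        fcntl.flock(fd, fcntl.LOCK_UN)  # release it again
        return True
    except OSError:
        # The filesystem rejected the lock request (e.g. an I/O error)
        return False
    finally:
        os.close(fd)
        os.remove(path)


if __name__ == "__main__":
    # Replace the temp dir with the folder used for the 'file://' init method.
    print(can_flock(tempfile.gettempdir()))
```

If this returns False for distributed_folder, that would suggest the filesystem under that folder (some network mounts, for example) is rejecting the lock request. This is only a hint at where to look, not a confirmed cause of the error.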
Could you give me some idea of what causes the error above, or any suggestion as to why I am getting it? Please let me know if you need more information. I look forward to hearing any suggestions.