Issue with DataParallel on model with additional data structure

import torch.nn as nn


class Node(object):
    def __init__(self, name, layer, op_type=None):
        self.name = name
        self.layer = layer          # reference to the nn.Module for this node
        self.op_type = op_type
        self.output_trace = None    # cached output of this node's forward pass

        self.prev_list = []         # predecessor Nodes
        self.next_list = []         # successor Nodes


class Graph(nn.Module):
    def __init__(self, layer_dict):
        super().__init__()
        self.layer_dict = layer_dict   # nn.ModuleDict of name -> layer
        # topological structure of the network, keyed by node name
        # (Graph.__get_op_type is defined elsewhere in the class)
        self.node_dict = {
            name: Node(name, layer, Graph.__get_op_type(name))
            for name, layer in self.layer_dict.items()
        }

I define my model class in the code snippet above. self.layer_dict is an nn.ModuleDict, and I define the topological structure of the network in self.node_dict. However, when I wrapped my model in nn.DataParallel, I ran into this error:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
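
For reference, a minimal sketch of how the model is built and wrapped (the layer names, shapes, and batch size here are placeholders, not my actual network):

import torch
import torch.nn as nn

# assumes the full Graph class from above (including __get_op_type)
layer_dict = nn.ModuleDict({
    'conv1': nn.Conv2d(3, 16, 3, padding=1),
    'conv2': nn.Conv2d(16, 32, 3, padding=1),
})

model = nn.DataParallel(Graph(layer_dict)).cuda()
x = torch.randn(8, 3, 32, 32).cuda()
out = model(x)   # raises the cudnn_convolution device-mismatch error with multiple GPUs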

I came across this GitHub thread https://github.com/pytorch/pytorch/issues/8637 where the problem described seems very similar to mine. It seems that, because the attribute self.node_dict is neither a tensor nor a registered submodule, it doesn't get properly broadcast to the other GPUs: the copy of self.node_dict in every replica still points to the same layer objects, which live on GPU 0.
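
A quick way to check this (a sketch using nn.parallel.replicate; this is my assumption of what is going on, based on the linked issue):

import torch.nn as nn

model = Graph(layer_dict).cuda(0)
replicas = nn.parallel.replicate(model, [0, 1])

for i, replica in enumerate(replicas):
    layer = next(iter(replica.node_dict.values())).layer
    print(i, next(layer.parameters()).device)
# I'd expect this to print cuda:0 for every replica, because node_dict is only
# shallow-copied and its Node.layer references still point at the GPU-0 modules.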

I've tried to rectify this by matching the devices inside the self.forward method, like this:

def forward(self, x):
    process_stack = self.topological_sort(self.node_dict)

    while process_stack:
        nodename = process_stack.pop()
        current_node = self.node_dict[nodename]

        # Matching the device ids HERE!!!!
        data_device = x.device
        current_node.layer = current_node.layer.to(data_device)

        if not current_node.prev_list:
            x = current_node.layer(x)
        else:
            # out_trace_sum: aggregated output traces of the predecessor
            # nodes (its computation is elided from this snippet)
            x = current_node.layer(out_trace_sum)

        current_node.output_trace = x

Unfortunately it doesn't work, most probably because DataParallel runs the replicas in parallel threads and they all share the same self.node_dict, so the .to(data_device) calls from different threads race with each other on the very same layer objects:

Before OP, data: cuda:0, layer: cuda:0
After OP, data: cuda:0, layer: cuda:0
Before OP, data: cuda:1, layer: cuda:0
Before OP, data: cuda:2, layer: cuda:0
Before OP, data: cuda:3, layer: cuda:0
After OP, data: cuda:2, layer: cuda:2
After OP, data: cuda:1, layer: cuda:2
After OP, data: cuda:3, layer: cuda:3

I’m out of ideas…:thinking:

I solved the problem by creating self.node_dict inside the self.forward method instead:

def forward(self, x):
    # Rebuild the Node objects on every forward call, so each DataParallel
    # replica wires its Nodes to its own (per-GPU) copies of the layers.
    self.node_dict = {}

    for nodename in self.nodename_dict:
        self.node_dict[nodename] = Node(nodename, self.layer_dict[nodename], Graph.__get_op_type(nodename))

    for nodename in self.nodename_dict:
        self.node_dict[nodename].prev_list = [self.node_dict[node.name] for node in self.nodename_dict[nodename].prev_list]
        self.node_dict[nodename].next_list = [self.node_dict[node.name] for node in self.nodename_dict[nodename].next_list]

    process_stack = self.topological_sort(self.node_dict)

This uses another dictionary, self.nodename_dict, which only keeps a topological record of the node names (no references to the layer modules), so it can be copied across the different GPUs without a problem. :sunglasses:
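
For illustration, such a name-only record could be as simple as this (NodeRecord is just a placeholder name, not necessarily what I actually used):

class NodeRecord:
    def __init__(self, name):
        self.name = name
        self.prev_list = []   # NodeRecord objects of the predecessor nodes
        self.next_list = []   # NodeRecord objects of the successor nodes

# built once in Graph.__init__, e.g.:
# self.nodename_dict = {name: NodeRecord(name) for name in layer_dict}
# ...then fill in prev_list / next_list from the edge definitions...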