Sending Layers of nn.ModuleList() and tensors to GPU

I am trying to run over 6000 custom layers, held in an nn.ModuleList(), on GPUs. I can use 80-120 GPUs on a cluster. What is the best way to distribute the layers across the GPUs? Should I do it in the __init__ method or in forward? Do the layers run asynchronously when I execute them in a for loop? I don’t know how to execute over 6000 layers more efficiently with PyTorch.

Thank you very much for your help

I am initializing the layers in the following way:

    if torch.cuda.is_available():
        self.nr_gpus = torch.cuda.device_count()
        print(f'Number of GPUs available: {self.nr_gpus}')

    # initialize the Oligo Kernel Layers
    self.CMKN_layers: nn.ModuleList = nn.ModuleList()

    for gene_nr in range(self.num_genes):
        self.CMKN_layers.append(CONLayer(in_channels, self.anchor_points[gene_nr], filter_size, padding=padding,
                                subsampling=stride, kernel_func=kernel_func, kernel_args=kernel_args,
                                kernel_args_trainable=kernel_args_trainable, **kwargs))

and my forward method looks like this:

def forward(self, genes: Tuple[torch.Tensor]) -> torch.Tensor:

    x_out_temp: List[torch.Tensor] = []

    gpu_nr: int = 0
    for idx, CMKN_layer in enumerate(self.CMKN_layers):
        gene = genes[idx].cuda(gpu_nr)
        CMKN_layer = CMKN_layer.cuda(gpu_nr)
        # use the copy that was moved to the GPU, not the original tensor
        x_out: torch.Tensor = CMKN_layer(gene)

        # move on to the next GPU, wrapping around after the last one
        if gpu_nr == self.nr_gpus - 1:
            gpu_nr = 0
        else:
            gpu_nr += 1

        x_out = x_out.to('cpu')
        x_out_temp.append(x_out.view(x_out.size(0), -1))

    x_out = torch.cat(x_out_temp, dim=1)
    x_out = self.fc(x_out)

    return self.classifier(x_out)

I am trying to split my model across different GPUs, but I’m not sure whether this is the right way to do it. I want to decrease the time needed for training.
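On the __init__ versus forward question, one common approach is to decide each layer's device once, at construction time, so that forward only has to move the inputs. A minimal sketch of such a mapping follows. `assign_devices` is a hypothetical helper, not part of PyTorch, and round-robin is only one possible policy.

```python
from typing import List


def assign_devices(num_layers: int, num_gpus: int) -> List[str]:
    """Map each layer index to a device string, round-robin over the GPUs.

    Every layer is mapped to "cpu" when no GPU is available.
    """
    if num_gpus <= 0:
        return ["cpu"] * num_layers
    return [f"cuda:{idx % num_gpus}" for idx in range(num_layers)]


# Example: 6 layers spread over 4 GPUs.
devices = assign_devices(6, 4)
print(devices)
```

With a mapping like this, each layer could be moved once in __init__ (for example `layer.to(devices[idx])`), which avoids calling `.cuda()` on every layer in every forward pass.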

I would be grateful for any advice on how to improve this.
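A forward pass with fixed placement could look roughly like the sketch below. It is an illustration under stated assumptions, not a drop-in replacement: `ShardedLayers` is a hypothetical name, `nn.Linear` stands in for `CONLayer`, and the code falls back to the CPU when no GPU is visible. CUDA kernels are launched asynchronously with respect to the Python loop, so work queued on different devices can overlap as long as nothing forces a synchronization. Copying each result to the CPU inside the loop, as in the forward method above, is such a synchronization point.

```python
from typing import List, Sequence

import torch
import torch.nn as nn


class ShardedLayers(nn.Module):
    """Holds many small layers, each placed on one device at construction."""

    def __init__(self, num_layers: int, in_features: int, out_features: int):
        super().__init__()
        num_gpus = torch.cuda.device_count()
        # One device per layer, round-robin; CPU only when no GPU is present.
        self.devices: List[torch.device] = [
            torch.device(f"cuda:{i % num_gpus}") if num_gpus > 0 else torch.device("cpu")
            for i in range(num_layers)
        ]
        # nn.Linear is a stand-in for the custom layer (assumption).
        self.layers = nn.ModuleList(
            [nn.Linear(in_features, out_features).to(dev) for dev in self.devices]
        )
        # All outputs are gathered here before concatenation.
        self.out_device = self.devices[0]

    def forward(self, inputs: Sequence[torch.Tensor]) -> torch.Tensor:
        outputs = []
        for layer, dev, x in zip(self.layers, self.devices, inputs):
            # Only the input moves; the layer already lives on `dev`.
            y = layer(x.to(dev, non_blocking=True))
            outputs.append(y.flatten(start_dim=1))
        # Gather on a single device instead of going through the CPU.
        return torch.cat([y.to(self.out_device) for y in outputs], dim=1)


model = ShardedLayers(num_layers=5, in_features=8, out_features=3)
batch = [torch.randn(4, 8) for _ in range(5)]
out = model(batch)
print(out.shape)
```

Note that `torch.cuda.device_count()` only reports the GPUs visible to the current process, so a sketch like this covers a single node; spreading work over 80-120 GPUs on a cluster would additionally need a multi-process setup such as `torch.distributed`.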