Train dynamic models in parallel

Hi everyone,
I created a small dynamic neural network built from multiple feedforward NNs (it constructs the computational graph based on the input data). The input to this network consists of two dicts: one holds the tensors, and the other is used to choose the network path when computing the output. Each individual network is very small, ranging from 1 to 9 elements (each element is a feedforward NN).
With a dataset of about 150 elements, a single epoch takes about 0.3 seconds.
For each network I take the result and compare it with the 'real' value using MSELoss. I sum all the losses and compute the gradient with respect to the weights of the feedforward NN (shared-weight trick); finally, I update the weights with an optimizer.
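
Written out, the quantity I backpropagate each epoch in the training loop below is roughly the following. The notation is my own shorthand, not something from a library: $f_{W_1,W_2}(g_k)$ is the root output for the $k$-th graph, $N$ is the number of training graphs, and $W_1$, $W_2$ are the weights of `fc1` and `fc2`. The two norm terms correspond to the `weight.norm()` calls in the loop.

```latex
\mathcal{L}(W_1, W_2)
  = \sum_{k=1}^{N} \operatorname{MSE}\bigl(f_{W_1,W_2}(g_k),\, y_k\bigr)
  + \lVert W_1 \rVert_F + \lVert W_2 \rVert_F
```

Because every graph uses the same $W_1$ and $W_2$, a single backward pass on this sum accumulates the gradient contributions of all graphs.
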
Since the individual models can be called separately (they only share the weights of the feedforward NN), is there a way to reduce the computation time by running the models in parallel?
I tried using the GPU, but the computation takes longer than on the CPU (is that because the model is very small?).
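
For reference, here is a minimal sketch of how such a CPU/GPU comparison can be timed. It is not my exact benchmark: the layer sizes (3, 4, 1) and the helper name `time_forward_backward` are placeholders I made up for illustration. It times one forward/backward pass of a single tiny feedforward block and synchronizes before reading the clock, since CUDA kernels are launched asynchronously.

```python
import time

import torch
import torch.nn as nn


def time_forward_backward(device, n_iters=100):
    """Average seconds per forward/backward pass of a tiny MLP on `device`."""
    # Hypothetical sizes, chosen only to mimic a very small feedforward block.
    net = nn.Sequential(
        nn.Linear(3, 4, bias=False),
        nn.LeakyReLU(0.25),
        nn.Linear(4, 1, bias=False),
    ).to(device)
    x = torch.randn(3, device=device)
    y = torch.randn(1, device=device)
    criterion = nn.MSELoss()

    def step():
        net.zero_grad()
        criterion(net(x), y).backward()

    step()  # warm-up, so one-time setup cost is not measured
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for pending kernels before timing
    start = time.perf_counter()
    for _ in range(n_iters):
        step()
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure all timed work has finished
    return (time.perf_counter() - start) / n_iters


cpu_time = time_forward_backward(torch.device("cpu"))
print(f"cpu:  {cpu_time * 1e6:.1f} us per step")
if torch.cuda.is_available():
    gpu_time = time_forward_backward(torch.device("cuda"))
    print(f"cuda: {gpu_time * 1e6:.1f} us per step")
```

With layers this small, each step is mostly fixed per-call overhead rather than arithmetic, which is the effect I suspect I am seeing.
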

import time

import torch.nn as nn


class Net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Net, self).__init__()  # Initialize the parent class nn.Module
        # 1st fully connected layer: (input data) -> (hidden nodes)
        self.fc1 = nn.Linear(input_size, hidden_size, bias=False)
        nn.init.kaiming_normal_(self.fc1.weight)
        # 2nd fully connected layer: (hidden nodes) -> (output)
        self.fc2 = nn.Linear(hidden_size, output_size, bias=False)
        nn.init.kaiming_normal_(self.fc2.weight)
        self.LRelu = nn.LeakyReLU(0.25)
        # The shared feedforward block applied at every node of the graph
        self.fnn = nn.Sequential(
            self.fc1,
            self.LRelu,
            self.fc2
        )
    
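    def forward(self, g, gTensor):
        # g maps each node to the list of its child nodes (children must come
        # before their parents); gTensor maps each node to its input tensor.
        f = {}  # output of every node evaluated so far
        for key, children in g.items():
            if len(children) == 0:
                # Leaf node: feed its input tensor straight through the shared FNN
                f[key] = self.fnn(gTensor[key])
            else:
                # Inner node: write the children's outputs into slots 1, 2, ...
                # of a copy of the input, then apply the shared FNN
                z = gTensor[key].clone()
                for i, n in enumerate(children, start=1):
                    z[i] = f[n]
                f[key] = self.fnn(z)
        # The last key in g is the root; its output is the network output
        return f[key]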


def train(net, opt, trainGraphSet, testGraphSet, num_epochs, save, deviceName):
    criterion = nn.MSELoss()
    optimizer = opt
    RMSE = []
    start = time.time()
    # Training loop: one optimizer step per epoch over the whole training set
    for epoch in range(num_epochs):
        losses = 0
        # Sum the loss over every graph in the training set
        for g, gT, y in trainGraphSet:
            res = net(g, gT)
            losses += criterion(res, y)
        # Add the norms of the shared weights as a penalty term
        losses += net.fc1.weight.norm() + net.fc2.weight.norm()
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
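
To make the input format concrete, here is a small sketch of what one sample can look like. Everything in it is made up for illustration: the node names ("a", "b", "root"), the sizes (3 inputs, 4 hidden units, 1 output), the tensor values, and the target. `forward_graph` is a hypothetical standalone helper that follows the same traversal as `Net.forward` above, so the snippet runs on its own; the only difference is that it squeezes each child output to a scalar before writing it into its slot.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: each node has 3 input features and produces 1 output.
fnn = nn.Sequential(
    nn.Linear(3, 4, bias=False),
    nn.LeakyReLU(0.25),
    nn.Linear(4, 1, bias=False),
)

# g maps each node to the list of its child nodes. Children are listed
# before their parents, and the last key is the root.
g = {"a": [], "b": [], "root": ["a", "b"]}

# gT maps each node to its input tensor. For "root", slots 1 and 2 are
# placeholders that get overwritten by the outputs of "a" and "b".
gT = {
    "a": torch.tensor([0.1, 0.2, 0.3]),
    "b": torch.tensor([0.4, 0.5, 0.6]),
    "root": torch.tensor([1.0, 0.0, 0.0]),
}


def forward_graph(fnn, g, gT):
    """Same traversal as Net.forward, written as a standalone function."""
    f = {}
    for key, children in g.items():
        z = gT[key].clone()
        for i, child in enumerate(children, start=1):
            z[i] = f[child].squeeze()  # scalar output of the child node
        f[key] = fnn(z)
    return f[key]  # output of the last node, i.e. the root


y = torch.tensor([0.5])          # hypothetical target value
out = forward_graph(fnn, g, gT)  # tensor of shape (1,)
loss = nn.MSELoss()(out, y)
loss.backward()                  # gradients reach the shared weights
```

Each sample builds its own small graph like this one, and the samples are independent of each other apart from the shared weights in `fnn`.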

Thanks in advance to anyone who can help.