Implement Selected Sparse connected neural network

I am trying to implement the following general NN model (not a CNN) using PyTorch.

Here, the 3rd, 4th, and 5th layers are fully connected. Networks 1, 2, and 3 are each fully connected internally, but they are not connected to each other.

I don’t know how to implement this kind of selected (not random) sparse connection in PyTorch.

Any help/comments on this are much appreciated.

You could just write custom modules for the smaller networks and concatenate their outputs in a larger model.
Here is a small dummy example, which might be a good starter:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MySmallModel(nn.Module):
    def __init__(self):
        super(MySmallModel, self).__init__()
        self.fc1 = nn.Linear(5, 2)
        self.fc2 = nn.Linear(2, 1)
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.network1 = MySmallModel()
        self.network2 = MySmallModel()
        self.network3 = MySmallModel()
        self.fc1 = nn.Linear(3, 2)
        self.fc_out = nn.Linear(2, 1)
    def forward(self, x1, x2, x3):
        x1 = F.relu(self.network1(x1))
        x2 = F.relu(self.network2(x2))
        x3 = F.relu(self.network3(x3))
        x = torch.cat((x1, x2, x3), 1)
        x = F.relu(self.fc1(x))
        x = self.fc_out(x)
        return x

model = MyModel()
N = 10
x1, x2, x3 = torch.randn(N, 5), torch.randn(N, 5), torch.randn(N, 5)

output = model(x1, x2, x3)

Dear @ptrblck, following your suggestion, I tried to write a simple NN for binary classification. However, the training accuracy does not improve; it always stays around 40-50%. The training loss also gets stuck at nearly the same value after 5-10 iterations.

Here is my code:

class MySmallModel(nn.Module):
    def __init__(self,nodes):
        super(MySmallModel, self).__init__()
        self.fc1 = nn.Linear(nodes, 50)
        self.fc2 = nn.Linear(50, 10)
        self.fc3 = nn.Linear(10, 1)
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return x

class Classifier(nn.Module):
    def __init__(self,input_nodes):
        super(Classifier, self).__init__()
        self.input_nodes = input_nodes
        self.networks = []
        for i in range(len(input_nodes)):
            self.networks.append(MySmallModel(input_nodes[i]))  # number of models = 80
        self.network1 = MySmallModel(i1)
        self.network2 = MySmallModel(i2)
        self.network3 = MySmallModel(i3)
        self.fc1 = nn.Linear(len(input_nodes), 40)
        self.fc2 = nn.Linear(40, 10)
        self.fc_out = nn.Linear(10, 1)
    def forward(self, input_):
        x_list = []
        for i in range(len(self.input_nodes)):
            x_list.append(F.relu(self.networks[i](input_[:, i])))  # input_[:, i] has shape 500 x 200
        x = torch.cat(x_list, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.sigmoid(self.fc_out(x))
        return x

model = Classifier(input_nodes)
print (model)
criterion = nn.BCELoss()#nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.08)

for e in range(epochs):
    running_loss = 0
    i = 0
    print('Epochs: ', e)
    for data, label in trainloader:
        y_hat = model(data)  # data has shape (500, 80, 200)
        loss = criterion(y_hat, label)
        y_hat_class = np.where(y_hat.detach().numpy() < 0.5, 0, 1)

        accuracy = np.sum(label.numpy() == y_hat_class.flatten()) / len(label)
        print('Train Accuracy: ', accuracy)
        running_loss += loss.item()
    print(f"Training loss: {running_loss/len(trainloader)}")

Can you please check the code? Does the first small model train correctly?
Is the training process correct here?
The data has shape 500 x 80 x 200.
Each small model receives an input of shape 500 x 200 (one of the 80 slices).

Please help

The parameters of MySmallModels are most likely missing in model.parameters(), since you are storing them in a plain Python list, thus the optimizer is ignoring them.
Try to use self.networks = nn.ModuleList instead.
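
To illustrate the difference, here is a minimal sketch (the class names WithList and WithModuleList are made up for this example) that counts the parameters visible to an optimizer in each case. Submodules kept in a plain list are not registered, so model.parameters() does not return them:

```python
import torch
import torch.nn as nn

class WithList(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the Linear layers are NOT registered as submodules
        self.networks = [nn.Linear(4, 1) for _ in range(3)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer, so their parameters are tracked
        self.networks = nn.ModuleList([nn.Linear(4, 1) for _ in range(3)])

n_list = sum(p.numel() for p in WithList().parameters())
n_modulelist = sum(p.numel() for p in WithModuleList().parameters())
print(n_list, n_modulelist)
```

In the first case the optimizer would receive no parameters from the small networks, so only the layers defined directly on the parent module would be trained.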

I assume x1, x2, etc. are the input values, while each arrow would represent the connectivity pattern between these units?
If so, the pattern looks quite like a 1-dimensional convolution with a kernel size of 2 and a stride of 1.
Would that work for you?

I am sorry that my description was unclear.
For this example, I would like to implement a regular sparse neural network (the topology is 3-5-3, meaning 3 inputs, 5 neurons in the first layer, and 3 outputs). There is no CNN involved.
[attached image]

How can I implement this in PyTorch?
Thank you so much again!!!

There might not be convolutions involved yet, but the pattern would maybe make it possible to easily use convolutions for your linear layer. :wink:

I assume the connection from the bottom input unit to the first output unit is wrong.
If so, then this pattern now looks like a transposed convolution.

Oh you are completely right and I misunderstood the use case.
Somehow I mixed it up with weight reusage, but in your case your approach looks fine.
You would have to make sure to zero out the gradients using the same mask as shown here:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, mask):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(4, 5)
        self.fc2 = nn.Linear(5, 3)
        # Zero out the masked weights once at initialization
        with torch.no_grad():
            self.fc1.weight.mul_(mask)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return x

mask = torch.tensor([[1,0,0,1],[1,1,0,0],[1,1,1,0],[0,1,1,1],[0,0,1,1]])
model = Net(mask)

# Check weights
print(model.fc1.weight)

# Dummy training steps
optimizer = torch.optim.Adam(model.parameters(), lr=1.)
data = torch.randn(1, 4)
target = torch.randint(0, 3, (1,))
criterion = nn.CrossEntropyLoss()

for idx in range(3):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    print('Step {}, weight {}, weight.grad {}'.format(
        idx, model.fc1.weight, model.fc1.weight.grad))
    # Zero out gradients
    with torch.no_grad():
        model.fc1.weight.grad.mul_(mask)
    optimizer.step()

Dear Mr. Ptrblck,
Hi, I have run the example, but I still have two other questions:
1. I am not sure whether the example above only stops the gradient update for the masked fc1 weights, or whether it also sets those weights to 0.
This is my result of the example:
Step 2, weight Parameter containing:
tensor([[ 5.8677e-01, 1.6791e-02, -1.0280e-03, -1.4700e+00],
[ 1.5483e-01, -6.6316e-01, 4.2021e-01, 1.2349e+00],
[-3.2102e-02, -5.1647e-02, -2.1887e-02, -1.1000e-03],
[ 6.9480e-04, 6.0782e-02, 6.0572e-02, 5.4448e-02],
[-6.6701e-04, -4.8314e-03, 5.1907e-02, 5.2159e-02]],
weight.grad tensor([[-0.1711, -0.1190, -0.0521, -0.0074],
[ 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000],
[-0.0137, -0.0095, -0.0042, -0.0006],
[-0.0104, -0.0072, -0.0032, -0.0005]])

2. If the model is very large, the mask will be difficult to define by hand. Is there an example similar to LocallyConnected1D?

Thank you so much!!!
Infinite gratitude!!!

  1. Do you get this output running my code example?
    The weight parameter as well as the gradient should be set to zero for the marked positions.
    weight shouldn’t have any nonzero values in the masked region in step 2.

  2. torch.roll with some views should work:

in_features = 4
out_features = 5
mask_width = 3

x = torch.cat((torch.ones(mask_width), torch.zeros(out_features - mask_width)))

mask = torch.cat([torch.roll(x, i) for i in range(in_features)])
mask = mask.view(in_features, out_features).t()

This is most likely not the most efficient way, but if you only have to create the mask once, it could be good enough for now.
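
As a possible alternative, the same banded pattern can be built without a Python loop by comparing row and column indices. This is only a sketch: the roll-based part assumes x is a vector of mask_width ones followed by zeros, and the names mask_roll and mask_band are made up for the comparison.

```python
import torch

in_features, out_features, mask_width = 4, 5, 3

# Roll-based construction (x is assumed to be mask_width ones, then zeros)
x = torch.cat((torch.ones(mask_width), torch.zeros(out_features - mask_width)))
mask_roll = torch.cat([torch.roll(x, i) for i in range(in_features)])
mask_roll = mask_roll.view(in_features, out_features).t()

# Index-based construction: entry (j, i) is 1 when (j - i) mod out_features
# falls inside the band of width mask_width
rows = torch.arange(out_features).unsqueeze(1)  # shape (out_features, 1)
cols = torch.arange(in_features).unsqueeze(0)   # shape (1, in_features)
mask_band = (((rows - cols) % out_features) < mask_width).float()

print(torch.equal(mask_roll, mask_band))
print(mask_band)
```

Both masks have shape (out_features, in_features), which matches the layout of nn.Linear(in_features, out_features).weight.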

Dear Mr. Ptrblck,
Yes, I have run your code example. I agree that “The weight parameter as well as the gradient should be set to zero for the marked positions.”
However, in this part:

with torch.no_grad():
    self.fc1.weight.mul_(mask)

Does this mean that all the weights of fc1 in the masked region are set to zero?
The same question applies to this part:

# Zero out gradients
with torch.no_grad():
    model.fc1.weight.grad.mul_(mask)

In my understanding, the weights of the connections that fc1 keeps can still be updated in the backward pass.
Maybe I have misunderstood something.

Thank you again for all !!!
Thank you very much!!

Yes, this will zero out the weight and gradients of model.fc1, as I’ve used your code snippet, which only dealt with this layer.
If you need to apply the mask on other layers as well, you would need to create the masks using the appropriate shapes for each parameter and maybe write utility functions to keep your code clean.
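
One way such a utility function might look is sketched below. The helper name apply_weight_mask and the second mask are made up for illustration, and the sketch swaps the manual grad.mul_(mask) call for a tensor hook (Tensor.register_hook), which multiplies the gradient by the mask on every backward pass:

```python
import torch
import torch.nn as nn

def apply_weight_mask(layer, mask):
    """Hypothetical helper: zero the masked weights and keep them at zero."""
    mask = mask.to(device=layer.weight.device, dtype=layer.weight.dtype)
    with torch.no_grad():
        layer.weight.mul_(mask)
    # The hook's return value replaces the gradient before it is stored in .grad
    layer.weight.register_hook(lambda grad: grad * mask)

fc1 = nn.Linear(4, 5)
fc2 = nn.Linear(5, 3)
mask1 = torch.tensor([[1, 0, 0, 1], [1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]])
mask2 = torch.tensor([[1, 1, 0, 0, 0], [0, 1, 1, 1, 0], [0, 0, 0, 1, 1]])  # made-up pattern
apply_weight_mask(fc1, mask1)
apply_weight_mask(fc2, mask2)

# Plain SGD without weight decay, so a zero gradient means no update
optimizer = torch.optim.SGD([*fc1.parameters(), *fc2.parameters()], lr=0.1)
data = torch.randn(8, 4)
target = torch.randint(0, 3, (8,))
for _ in range(3):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(fc2(torch.relu(fc1(data))), target)
    loss.backward()
    optimizer.step()

print(fc1.weight)
print(fc2.weight)
```

Note that an optimizer with weight decay could still change masked weights even when their gradient is zero, so this sketch assumes no weight decay is used.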

Dear Ptrblck,

I hope you are well. I have a question. In one experiment I trained an ANN on 3000 samples all at once. In another experiment I trained the ANN in 3 steps, feeding it 1000 samples in each step until reaching 3000, and each step continued from the ANN trained in the previous step.
The result from training on all 3000 samples at once is better than the result of the stepwise training once it reaches 3000 samples. The big difference is in specificity and accuracy. Do you think this is natural? My ANN has 2 fully connected layers.

The batch size will most likely change the final result.
Smaller batch sizes usually are a bit more noisy, but might give you a better final accuracy.
Are you running the code for just one epoch (i.e. 3000 samples) or multiple epochs?

I am really thankful for your reply. In fact, I never use a batch size; I am training the ANN in Matlab and simply feed all the training data at once. In the first experiment I pass all 3000 samples, and in the second I feed them in 3 steps of 1000. I expected that in the third step (on reaching 3000 training samples) the sensitivity, specificity, and AUC would be the same as in the first experiment, but they are lower :(.

You can’t expect both training setups to return the same training result.
Let’s have a look at the extreme situation, where a) you feed the complete training dataset at once and b) you feed each sample separately.
In the first use case, the parameters will be updated only once with the gradient computed by the loss from the complete dataset (which will be the averaged loss using all samples), while the second case would compute the loss (and gradients) separately for each sample and update the model 3000 times.
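
The two extremes can be sketched in PyTorch as follows. This is a toy setup under stated assumptions: 30 random samples stand in for the 3000, and the function name train_one_pass, the linear model, and the learning rate are made up for illustration.

```python
import torch
import torch.nn as nn

def train_one_pass(batch_size, seed=0):
    # Same seed, so data and initial weights are identical for both runs
    torch.manual_seed(seed)
    X = torch.randn(30, 4)
    y = torch.randn(30, 1)
    model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    n_updates = 0
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(xb), yb)  # averaged over the batch
        loss.backward()
        optimizer.step()
        n_updates += 1
    return n_updates, model.weight.detach().clone()

updates_full, w_full = train_one_pass(batch_size=30)     # a) whole dataset at once
updates_single, w_single = train_one_pass(batch_size=1)  # b) one sample at a time
print(updates_full, updates_single)
print(w_full, w_single)
```

Even with identical data and initialization, the two runs perform a different number of updates and end with different weights, so identical metrics should not be expected.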

Indeed, in the first experiment I feed the ANN all 3000 samples, so the loss is computed and the weights are updated using all 3000 samples at once. In the second experiment I feed a new set of 1000 samples in each step and start from the ANN trained in the previous step (for step 2, I load the ANN saved from step 1 and continue training). When I reach 3000 samples in the third step, the weights have been updated 3 times, each time starting from the previously trained ANN. Do you agree?

Thanks! I was also wondering how to solve this problem.
I have one question.
Why does an error occur when I remove the line ‘with torch.no_grad():’?
I thought that line was not needed, because the code inside it is just a multiplication between the weights and the adjacency matrix (mask).
I would appreciate your help!

torch.no_grad() makes sure that Autograd is not tracking the operations inside the block, which allows you to manipulate/initialize them without raising errors.
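
A small sketch of this behavior (the tensor names w and mask are made up): an in-place multiplication on a leaf tensor that requires gradients raises a RuntimeError outside of torch.no_grad(), while the same call inside the block works and leaves requires_grad unchanged.

```python
import torch

w = torch.ones(3, requires_grad=True)  # stands in for a layer's weight
mask = torch.tensor([1., 0., 1.])

try:
    w.mul_(mask)  # in-place op on a leaf tensor that requires grad
    raised = False
except RuntimeError as err:
    raised = True
    print("Without no_grad:", err)

with torch.no_grad():
    w.mul_(mask)  # not tracked by Autograd, so it is allowed

print(w, w.requires_grad)
```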

I ran into a problem when I applied your code to my work.
I get the error ‘RuntimeError: expected device cuda:0 and dtype Float but got device cpu and dtype Float’ at line 223.

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    mask = torch.tensor([[1, 0, 0], [1, 1, 0], [1, 0, 1]], dtype=torch.float)
    model = neuralNet(input_size, hidden_size, num_classes, mask)
    # Move tensors to the configured device
    inputs = inputs.to(device)
    labels = labels.to(device)

    # Forward pass
    y_pred = model(inputs)
    loss = criterion(y_pred, labels)


    # Zero out gradients
    with torch.no_grad():
        model.fc1.weight.grad.mul_(mask)  # <--- line 223


How can I solve this problem?
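
The error message suggests that the mask still lives on the CPU while model.fc1.weight.grad is on cuda:0. A minimal sketch of a possible fix, assuming the model has been moved to the GPU elsewhere in the script, is to move the mask to the same device before using it (a standalone nn.Linear stands in for neuralNet here):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

mask = torch.tensor([[1, 0, 0], [1, 1, 0], [1, 0, 1]], dtype=torch.float)
fc1 = nn.Linear(3, 3).to(device)  # stand-in for model.fc1
mask = mask.to(device)            # keep the mask on the same device as the weights

inputs = torch.randn(2, 3, device=device)
loss = fc1(inputs).sum()          # dummy loss, only used to produce gradients
loss.backward()

# Zero out gradients
with torch.no_grad():
    fc1.weight.grad.mul_(mask)

print(fc1.weight.grad)
```

If the mask is stored inside the module, registering it with register_buffer would let model.to(device) move it along with the parameters.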