MLP results not matching with those from TF/Keras

I just wrote my first pyTorch code and have some questions.

#running on a Tesla V100-SXM2-32GB
device = torch.device('cuda')

Train, validation and test matrices are read in as sparse matrices.

#load data
print('Sparse matrices:')
train_pd = pd.read_pickle(DIR_INST+'/{}_train.pkl'.format(INST))
print('  train_pd: {:,} x {:,}'.format(*train_pd.shape))
train_csc = load_npz(DIR_INST+'/{}_train{}.npz'.format(INST,SFX))
print('  train_csc: {:,} x {:,}'.format(*train_csc.shape))

#same for validation and test

Making the instance dense before converting to pyTorch tensor.

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, X_csc, obs_pd):
        self.X = torch.tensor(X_csc.toarray(), dtype=torch.float32, device=device)
        self.y = torch.tensor(obs_pd['is_case'].values, dtype=torch.float32, device=device)

    def __getitem__(self, index):
        return self.X[index,:], self.y[index]

    def __len__(self):
        return self.X.shape[0]

data_train = MyDataset(train_csc,train_pd)
data_valid = MyDataset(valid_csc,valid_pd)
train_loader = torch.utils.data.DataLoader(data_train, batch_size=511, shuffle=True)

A flexible MLP class, so that the number and size of layers and the dropout probabilities can be varied.

class MLP(torch.nn.Module):
    def __init__(self, inputsz, hidden, drop):
        super(MLP, self).__init__()

        widths = [inputsz] + hidden
        self.linears = torch.nn.ModuleList()
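        for n in range(len(widths)-1):
            #layer sequence reconstructed from the printed model below
            self.linears.append(torch.nn.Linear(widths[n], widths[n+1]))
            self.linears.append(torch.nn.ReLU())
            self.linears.append(torch.nn.Dropout(p=drop[n]))
        #output layer: a single unit followed by a sigmoid
        self.linears.append(torch.nn.Linear(widths[-1], 1))
        self.linears.append(torch.nn.Sigmoid())
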
    def forward(self, x):
        for n in range(len(self.linears)):
            x = self.linears[n](x)
        return x

#example model
model = MLP(train_csc.shape[1],[63],[.5])
model = model.to(device)
print(model)

MLP(
  (linears): ModuleList(
    (0): Linear(in_features=1911, out_features=63, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.5)
    (3): Linear(in_features=63, out_features=1, bias=True)
    (4): Sigmoid()
  )
)

criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=.0001)

The F1 score for training and validation is computed at the end of every epoch. f1_prev is the F1 score computed at prevalence. Example: if 3 of 10 observations are cases, the prevalence is .3, so the three highest scores are labelled as cases, and tp, fp, etc. and finally F1 are computed from that labelling.

def f1_prev(label, score):
    ncase = label.sum().astype(np.uint64)
    ncont = (label.shape[0]-ncase).astype(np.uint64)
    newlab = label[np.argsort(score, kind='mergesort')]
    tp,fp = newlab[ncont:].sum(),newlab[:ncont].sum()
    fn,tn = ncase-tp,ncont-fp
    return 2.*tp/(2.*tp+fn+fp)
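
A small self-contained check of the 10-observation example, using made-up labels and scores (the data and the helper name `f1_at_prevalence` are illustrative and not part of the original code). Because exactly as many observations are flagged as there are cases, fp and fn are always equal here, so the score reduces to tp/ncase:

```python
import numpy as np

def f1_at_prevalence(label, score):
    """F1 when exactly as many observations are flagged as there are true cases."""
    label = np.asarray(label)
    score = np.asarray(score)
    ncase = int(label.sum())              # number of true cases
    ncont = label.shape[0] - ncase        # number of true controls
    # Stable ascending sort: the last ncase entries are the predicted cases.
    newlab = label[np.argsort(score, kind='mergesort')]
    tp = int(newlab[ncont:].sum())        # true cases among the top ncase scores
    fn = ncase - tp                       # true cases that were not flagged
    fp = ncase - tp                       # flagged observations that are controls
    return 2.0 * tp / (2.0 * tp + fn + fp)

# Illustrative data: 10 observations, 3 cases, so prevalence is 0.3.
label = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
score = np.array([0.90, 0.80, 0.10, 0.70, 0.20, 0.30, 0.40, 0.05, 0.15, 0.25])

# The three highest scores (0.90, 0.80, 0.70) carry labels 1, 0, 1:
# tp = 2, fp = 1, fn = 1, so F1 = 4 / (4 + 1 + 1) = 2/3.
f1 = f1_at_prevalence(label, score)
print(round(f1, 4))
```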

for epoch in range(200):
    for X_mb,y_mb in train_loader:
        yhat_mb = model(X_mb)
        loss = criterion(yhat_mb[:,0], y_mb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    yhat_train = model(data_train.X).detach().cpu().numpy().ravel()
    yhat_valid = model(data_valid.X).detach().cpu().numpy().ravel()
    print(epoch, loss.item(), f1_prev(train_pd['is_case'].values, yhat_train),
        f1_prev(valid_pd['is_case'].values, yhat_valid))


  1. As I noted earlier, this is the first time I am using pyTorch. I have the same model running in Keras, and it gives an f1_valid score that is almost 2% higher. Here is the Keras network ('learning_rate': 0.0001, 'batch_size': 511):
Layer (type)                 Output Shape              Param #
inp (InputLayer)             (None, 1911)              0
drop0 (Dropout)              (None, 1911)              0
hide1 (Dense)                (None, 63)                120456
drop1 (Dropout)              (None, 63)                0
out (Dense)                  (None, 1)                 64
model.compile(optimizer=keras.optimizers.Adam(lr=.0001), loss='binary_crossentropy')

Any idea what the reason may be?

  2. How can I speed up the training above?
    For example:
    • can the sparse matrices be used without converting to dense using toarray()?
    • can the f1 score computed as shown above be pushed to the GPU?

Or any other way?


It looks like your Keras model uses a dropout layer directly for the input.
Could you add it to your PyTorch model and check the accuracy again?

The f1_prev computation should work on the GPU after a few changes.
Try to replace all numpy methods with PyTorch ones and pass CUDA tensors to this method.
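
On the sparse-matrix question, one option that might be worth trying is to keep the input as a torch sparse tensor and apply the first linear layer with `torch.sparse.mm`. This is only a sketch under assumptions: the helper name `csc_to_torch_sparse` and the small random matrix are made up for illustration, it has not been timed on your data, and minibatching would still need to be handled (for example by slicing the SciPy matrix per batch and converting only that slice).

```python
import numpy as np
import scipy.sparse as sp
import torch

def csc_to_torch_sparse(X_csc, device='cpu'):
    """Convert a SciPy sparse matrix to a torch sparse COO tensor (float32)."""
    X_coo = X_csc.tocoo()
    indices = torch.from_numpy(np.vstack([X_coo.row, X_coo.col]).astype(np.int64))
    values = torch.from_numpy(X_coo.data.astype(np.float32))
    X = torch.sparse_coo_tensor(indices, values, size=X_coo.shape, device=device)
    return X.coalesce()

# Illustrative stand-in for train_csc: 8 rows, 20 features, about 10% non-zero.
X_csc = sp.random(8, 20, density=0.1, format='csc', dtype=np.float32, random_state=0)
X_sparse = csc_to_torch_sparse(X_csc)

# First layer applied to the sparse input: sparse @ dense gives a dense result.
lin = torch.nn.Linear(20, 5)
h_sparse = torch.sparse.mm(X_sparse, lin.weight.t()) + lin.bias

# Same layer applied to the dense matrix, for comparison.
h_dense = lin(torch.tensor(X_csc.toarray(), dtype=torch.float32))
print(torch.allclose(h_sparse, h_dense, atol=1e-5))
```

Whether this is actually faster than the dense path depends on how sparse the data is, so it would need to be measured.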

Thanks for your response. How can I add the dropout layer after the input? Would I add a Linear(1911,1911) with bias=False at the beginning, followed by Dropout(p=.5)?

No, that would also create an additional linear layer, and as far as I understand the Keras code, the InputLayer is basically not doing anything.
Just add the dropout layer right after initializing self.linears:

self.linears = nn.ModuleList()
self.linears.append(nn.Dropout(p=...))  # input dropout; use the rate of drop0 in your Keras model
for n in ...
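
A fuller sketch of what that could look like, assuming the layer sequence shown in your printed model. The extra argument `drop_in` and the value 0.2 are placeholders I made up; use whatever rate `drop0` has in the Keras model.

```python
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, inputsz, hidden, drop, drop_in):
        super().__init__()
        widths = [inputsz] + hidden
        self.linears = nn.ModuleList()
        # Input dropout, mirroring drop0 in the Keras summary (drop_in is an assumed rate).
        self.linears.append(nn.Dropout(p=drop_in))
        for n in range(len(widths) - 1):
            self.linears.append(nn.Linear(widths[n], widths[n + 1]))
            self.linears.append(nn.ReLU())
            self.linears.append(nn.Dropout(p=drop[n]))
        # Output layer: one unit with a sigmoid, as in the printed model.
        self.linears.append(nn.Linear(widths[-1], 1))
        self.linears.append(nn.Sigmoid())

    def forward(self, x):
        for layer in self.linears:
            x = layer(x)
        return x

model = MLP(1911, [63], [0.5], drop_in=0.2)  # 0.2 is a placeholder, not the Keras value
print(model)
```

Dropout has no parameters, so the parameter count stays at 120456 + 64, the same as in the Keras summary.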

Thanks for the input. Using the same arch for Keras/TF and pyTorch was very important. I did get the accuracy very close last night with the following addition.

for epoch in range(200):
    _ = model.train()
    for X_mb,y_mb in train_loader:
        ...  #minibatch update as before

    _ = model.eval()
    yhat_train = model(...)
    yhat_valid = model(...)

Adding those lines brings the accuracy up to about the same as TF. Without them (I just verified), the accuracy is lower even with the changed arch. I read somewhere that they are important if one is using Dropout or BatchNorm…

Yes, if you would like to evaluate your model, you should call model.eval(), as this will e.g. disable dropout and use the running estimates in batch norm layers.
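
A quick way to see the effect on a single dropout layer (a toy illustration with a made-up input, not your model):

```python
import torch

torch.manual_seed(0)
drop = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # entries are zeroed at random, survivors are scaled by 1/(1-p) = 2

drop.eval()
y_eval = drop(x)    # in eval mode dropout passes the input through unchanged

print(y_train.unique())         # values that occur in training mode
print(torch.equal(y_eval, x))   # whether eval mode left the input untouched
```

Without model.eval(), the scores used for the F1 computation are produced with dropout still active, so they are noisier than the ones Keras reports at evaluation time. The evaluation forward passes can also be wrapped in torch.no_grad() to avoid building the autograd graph.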

Did the additional dropout change anything at all or are you seeing approx. the same accuracies nevertheless?

The accuracy is slightly better, but that is well within the margin of what I would call random :) But it was very important to do an apples-to-apples comparison. So, thank you for catching the difference in architecture.

Regarding the other point in your response, I did convert the numpy functions to pyTorch tensor functions. The f1_prev shown below is slightly different from the one above in that it allows for groups of scores (specified using cut). In my model I want to use an age-based threshold rather than a single threshold, and this allows that.

This function gives me a HUGE speed up over what I had. Can you suggest any changes? Sorry for the newbie questions:

def f1_prev(label, score, cut):
    cmat = torch.zeros((cut.unique().shape[0],3), device=device)  #tp,fp,fn
    i = 0
    for n in cut.unique():
        ind = (cut==n).nonzero()[:,0]
        labelt,scoret = label[ind],score[ind]
        ncase = labelt.sum().int()
        ncont = (labelt.shape[0]-ncase).int()

        newlab = labelt[torch.argsort(scoret)]
        cmat[i,0] = newlab[ncont:].sum()  #tp
        cmat[i,1] = newlab[:ncont].sum()  #fp
        cmat[i,2] = ncase-cmat[i,0] #fn
        i += 1

    cmat = cmat.sum(dim=0)
    val = 2.*cmat[0]/(2.*cmat[0]+cmat[1]+cmat[2])
    return val.item()

Good to hear it’s matching the expected output!

Could you post the shapes and ranges of labels, score, and cut?
This would make it easier to have a look at potential bottlenecks.
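
In the meantime, one direction that might help if there are many groups is to drop the Python loop altogether. The sketch below is untested on your data and makes assumptions: `label` is a float tensor of 0/1 values, all three tensors are 1-D and on the same device, and the names `f1_prev_loop` and `f1_prev_vec` are made up here. It relies on the fact that at prevalence fp and fn are both ncase - tp, so the pooled F1 equals tp/ncase.

```python
import torch

def f1_prev_loop(label, score, cut):
    """Reference: per-group loop, following the f1_prev posted above."""
    tp_total, ncase_total = 0.0, 0.0
    for g in cut.unique():
        ind = (cut == g).nonzero()[:, 0]
        labelt, scoret = label[ind], score[ind]
        ncase = int(labelt.sum().item())
        ncont = labelt.shape[0] - ncase
        newlab = labelt[torch.argsort(scoret)]
        tp_total += newlab[ncont:].sum().item()
        ncase_total += ncase
    # fp == fn == ncase - tp, so 2tp / (2tp + fp + fn) == tp / ncase
    return tp_total / ncase_total

def f1_prev_vec(label, score, cut):
    """Same quantity without a Python loop over groups."""
    n = label.shape[0]
    # Group ids 0..G-1 (ordered by cut value) and the size of each group.
    _, gid, size = torch.unique(cut, return_inverse=True, return_counts=True)
    ncase = torch.zeros(size.shape[0], dtype=label.dtype, device=label.device)
    ncase.scatter_add_(0, gid, label)                      # cases per group
    # Order by group, ascending score within each group (two stable sorts).
    by_score = torch.sort(score, stable=True).indices
    by_group = torch.sort(gid[by_score], stable=True).indices
    perm = by_score[by_group]
    start = torch.cumsum(size, 0) - size                   # first position of each group
    g = gid[perm]
    pos = torch.arange(n, device=label.device) - start[g]  # rank within the group
    flagged = pos >= (size - ncase.long())[g]              # top ncase scores per group
    tp = label[perm][flagged].sum()
    return (tp / ncase.sum()).item()

# Tiny illustrative input: two groups of three observations each.
label = torch.tensor([1., 0., 1., 0., 0., 1.])
score = torch.tensor([.9, .8, .1, .2, .3, .7])
cut = torch.tensor([0, 0, 0, 1, 1, 1])
print(f1_prev_loop(label, score, cut), f1_prev_vec(label, score, cut))
```

Tied scores may be ordered differently by the two versions, so small differences are possible when scores are not distinct.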