Really strange! Different results between PyTorch, Caffe and Keras

I built three versions of the same network architecture (at least I think they are the same), and trained them on the same dataset (I swear) using the same solver, Adam, with default hyperparameters.

The Caffe and Keras versions work really well, but the PyTorch version just doesn't work.

The loss goes down, but rises again at the end of every epoch.

This is strange. What’s wrong with my code?

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

time_step = 150
batch_size = 256
input_dim = 6

lstm_size1 = 100
lstm_size2 = 512
fc1_size = 512

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.lstm1 = nn.LSTM(input_dim, lstm_size1, dropout=0.3)
        self.lstm2 = nn.LSTM(lstm_size1, lstm_size2, dropout=0.3)
        self.fc1 = nn.Linear(lstm_size2, fc1_size)
        self.fc2 = nn.Linear(fc1_size, 3755)

    def forward(self, x, num_strokes, batch_size):
        # pack the padded batch so the LSTMs skip the zero padding
        x = torch.nn.utils.rnn.pack_padded_sequence(x, num_strokes)

        hidden_cell_1 = (Variable(torch.zeros(1, batch_size, lstm_size1).cuda()),
                         Variable(torch.zeros(1, batch_size, lstm_size1).cuda()))
        allout_1, _ = self.lstm1(x, hidden_cell_1)

        hidden_cell_2 = (Variable(torch.zeros(1, batch_size, lstm_size2).cuda()),
                         Variable(torch.zeros(1, batch_size, lstm_size2).cuda()))
        _, last_hidden_cell_out = self.lstm2(allout_1, hidden_cell_2)

        # take the last hidden state h_n and classify it
        last_hidden_cell_out = Variable(last_hidden_cell_out[0].data)
        x = last_hidden_cell_out.view(batch_size, lstm_size2)
        x = self.fc1(x)
        x = F.relu(x)
        x = F.dropout(x, p=0.3)
        x = self.fc2(x)
        # out = F.log_softmax(x)
        return x

The main training part:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# one sample is of size [150, 6], so data is of size [256, 150, 6];
# I need to transpose it to get T x B x L
data = torch.transpose(data, 0, 1)
data, target = Variable(data), Variable(target)
optimizer.zero_grad()
out = model(data, num_strokes, batch_size)
loss = F.cross_entropy(out, target)
loss.backward()
optimizer.step()

The key point here is that my data is padded to 150: one sample may only have a length of 35, i.e. size [35, 6], and I pad it to [150, 6], so only the first 35 rows are non-zero; the remaining 115 rows are all zeros. So I use:
x = torch.nn.utils.rnn.pack_padded_sequence(x, num_strokes)
where x is sorted by sequence length and num_strokes holds the lengths in descending order.
Am I using it the right way?
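
For context, here is a minimal sketch of how I understand the packing is supposed to work (the batch size and lengths below are made up):

import torch
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence

num_strokes = [150, 97, 35, 12]   # lengths, already sorted descending
data = torch.zeros(150, 4, 6)     # T x B x L, zero-padded past each length
packed = pack_padded_sequence(Variable(data), num_strokes)
# packed.data holds only the real (non-padded) time steps,
# so the LSTM never sees the trailing zeros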

Could anyone help me? I really can't figure out what's wrong after debugging for several days :crying_cat_face:

Hello @Nick_Young. I have a similar problem to yours, but the difference is that I have never trained any network. What I did was transfer weights from Keras to PyTorch. The results are very different. I swear I checked every line of code to verify whether there were any bugs. Did you find any difference between Keras and PyTorch?

Well, for one, with this line

last_hidden_cell_out = Variable(last_hidden_cell_out[0].data)

you are detaching the computational graph for everything that happened to compute last_hidden_cell_out. Therefore, the parameters of lstm1 and lstm2 never get updated.

What was the point of this? Is it intentional?
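
You can see the cut in isolation with a tiny sketch (old Variable API, made-up sizes):

import torch
from torch.autograd import Variable

w = Variable(torch.randn(3, 3), requires_grad=True)
y = w.sum()                    # y is connected to w in the graph
detached = Variable(y.data)    # re-wrapping .data drops that history

print(y.requires_grad)         # True: backward() from y would reach w
print(detached.requires_grad)  # False: the graph is cut, w.grad stays None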


Wow, really? How should I implement this? Thank you!

Got it! It should be:

x = last_hidden_cell_out[0].view(batch_size, lstm_size2)

and it finally worked!

Thank you so much!!!


There are some mistakes in the official docs.

It says that h_n and c_n are tensors, but they turn out to be Variables.

Also, the shapes of the weights: weight_ih and weight_hh are transposed relative to what the doc describes.
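
You can check both points with a throwaway LSTM (the sizes here are arbitrary):

import torch
import torch.nn as nn
from torch.autograd import Variable

lstm = nn.LSTM(input_size=6, hidden_size=100)

# PyTorch stores weight_ih_l0 as (4*hidden_size, input_size), i.e. the
# transpose of the (input_dim, 4*units) kernel layout Keras uses
print(lstm.weight_ih_l0.size())     # (400, 6)
print(lstm.weight_hh_l0.size())     # (400, 100)

x = Variable(torch.randn(5, 1, 6))  # T x B x input_size
out, (h_n, c_n) = lstm(x)
print(type(h_n))                    # Variable, not a plain tensor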

Variables are just tensors with PyTorch's Variable wrapper around them, so you can automatically compute their gradients.
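
For example, a minimal sketch:

import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
y = (x * 3).sum()
y.backward()   # autograd fills in x.grad
print(x.grad)  # a 2x2 Variable filled with 3s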