Different results on CPU and GPU

I trained my model on GPU and saved the model with

# move every parameter to CPU before saving so the checkpoint is device independent
params = {k: v.cpu() for k, v in self.state_dict().items()}
torch.save(params, 'model.pt')

Then, when evaluating the model, I load it like this:

model = m(...)
params = torch.load('model.pt')
model.load_state_dict(params)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

However, I am getting much worse results on GPU than on CPU. On CPU it seems to perform correctly (I get results similar to those in the paper), but on GPU it is about 20% worse.
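For reference, a minimal way to compare the two devices on identical inputs (a rough sketch, assuming a CUDA device is available; the dummy input shape is a placeholder, not my real data):

import torch

model.eval()  # make batch norm / dropout deterministic for the comparison
dummy_input = torch.randn(1, 3, 224, 224)  # placeholder shape

with torch.no_grad():
    out_cpu = model.cpu()(dummy_input)
    out_gpu = model.cuda()(dummy_input.cuda()).cpu()

print('max abs difference:', (out_cpu - out_gpu).abs().max().item())

Differences on the order of 1e-5 are normal floating-point noise; anything much larger points to a real bug in the forward pass.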

I wrote all my code to be device-agnostic, so apart from saving/loading the model, I never specify the device explicitly. In my forward pass, when I needed a new tensor, I usually used input_tensor.new_*(...) to create it on the same device as the input.
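For example, a minimal sketch of that pattern (the make_mask helper is just an illustration, not code from my model):

import torch

def make_mask(input_tensor):
    # new_zeros() and new_full() create tensors with the same dtype and on the
    # same device as input_tensor, so no device ever has to be spelled out.
    mask = input_tensor.new_zeros(input_tensor.size(0), input_tensor.size(1))
    fill = input_tensor.new_full((input_tensor.size(0),), -1.0)
    return mask, fill

x = torch.randn(4, 8, device='cuda' if torch.cuda.is_available() else 'cpu')
mask, fill = make_mask(x)
print(mask.device, fill.device)  # both live on the same device as x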

Does anyone have an idea where the difference in performance might come from?

Can you share your model, please?

Hi, I have a similar issue which is mystifying. I started with this repo https://github.com/BelBES/crnn-pytorch and customised it for my project. But the model_loader is essentially the same as in the repo:

from collections import OrderedDict

import torch
import torch.nn as nn

from .crnn import CRNN


def load_weights(target, source_state):
    # Copy matching parameters from the checkpoint (source_state); fall back to
    # the model's own initial values when a key is missing or the shapes differ.
    new_dict = OrderedDict()
    for k, v in target.state_dict().items():
        if k in source_state:
            if v.size() == source_state[k].size():
                new_dict[k] = source_state[k]
            else:
                print('Size MISMATCH: {} vs {}'.format(v.size(), source_state[k].size()))
                new_dict[k] = v
        else:
            if 'num_batches_tracked' not in k:
                print('Layer NOT FOUND: {}'.format(k))
            new_dict[k] = v
    target.load_state_dict(new_dict)


def load_model(abc, seq_proj=[0, 0], backend='resnet18', snapshot=None, cuda=True):
    net = CRNN(abc=abc,
               seq_proj=seq_proj,
               backend=backend,
               rnn_hidden_size=128,
               rnn_num_layers=2,
               rnn_dropout=0.5)
    net = nn.DataParallel(net)
    if snapshot is not None:
        # map_location keeps the loaded tensors on CPU regardless of the
        # device the checkpoint was saved from.
        load_weights(net, torch.load(snapshot, map_location=lambda storage, loc: storage))
    if cuda:
        net = net.cuda()
    return net
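For context, load_model is called with the alphabet the model was trained on; the alphabet string and checkpoint path below are placeholders, not values from the repo:

abc = '0123456789abcdefghijklmnopqrstuvwxyz'  # placeholder character set
net = load_model(abc=abc,
                 snapshot='crnn_checkpoint.pt',  # placeholder path
                 cuda=torch.cuda.is_available())
net.eval()  # put batch norm / dropout layers into inference mode before testing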

When I train and validate the model on GPU on an Ubuntu cloud instance, it gives very good performance, but when I copy the trained checkpoint to my MacBook and run a test script, it gives terrible performance, almost as if the weights were random. Once in a while, though, a particular checkpoint works fine.

I did some investigating, but there doesn’t seem to be any problem with the load_weights() function. The only weights that are in the checkpoint but not in the model definition are the ‘num_batches_tracked’ buffers in the batch norm layers, and I don’t think these are used during eval anyway.
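A quick sanity check for that (a sketch; the checkpoint path is a placeholder and net is the model returned by load_model above):

import torch

ckpt = torch.load('crnn_checkpoint.pt', map_location='cpu')  # placeholder path
model_keys = set(net.state_dict().keys())
ckpt_keys = set(ckpt.keys())

print('in the model but not in the checkpoint:', sorted(model_keys - ckpt_keys))
print('in the checkpoint but not in the model:', sorted(ckpt_keys - model_keys))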

Any ideas? Thanks

Did you solve this? I have a similar problem: training the model on a Linux cloud instance gives nice results, but loading the checkpoint on my MacBook really tanks the performance!

I had the same problem. It was driving me crazy.

Hey guys, it’s been a long time and I don’t remember what the issue was, sorry.
It could have been something to do with the alphabet.

What does the alphabet mean?
