PyTorch model parameters change between CPU and GPU

I created the model and saved its weights using Google Colab. Now I have written a prediction script; the prediction script contains the model class, and I am trying to load the model weights using the following method:

Saving & Loading Model Across Devices

Save on GPU, Load on CPU

Save:

torch.save(model.state_dict(), PATH)

Load:

device = torch.device('cpu')
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location=device))

The above method should work, right? Yes.

But when I try to do so, the model ends up with a different number of parameters in Google Colab (prediction, runtime=None, device=cpu) than on my local machine (prediction, device=cpu).

Model Params in Colab:

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 12,490,234 trainable parameters

+-------------------------------------------------------+------------+
|                        Modules                        | Parameters |
+-------------------------------------------------------+------------+
|              encoder.tok_embedding.weight             |  2053376   |
|              encoder.pos_embedding.weight             |   25600    |
|      encoder.layers.0.self_attn_layer_norm.weight     |    256     |
|       encoder.layers.0.self_attn_layer_norm.bias      |    256     |
|         encoder.layers.0.ff_layer_norm.weight         |    256     |
|          encoder.layers.0.ff_layer_norm.bias          |    256     |
|      encoder.layers.0.self_attention.fc_q.weight      |   65536    |
|       encoder.layers.0.self_attention.fc_q.bias       |    256     |
|      encoder.layers.0.self_attention.fc_k.weight      |   65536    |
|       encoder.layers.0.self_attention.fc_k.bias       |    256     |
|      encoder.layers.0.self_attention.fc_v.weight      |   65536    |
|       encoder.layers.0.self_attention.fc_v.bias       |    256     |
|      encoder.layers.0.self_attention.fc_o.weight      |   65536    |
|       encoder.layers.0.self_attention.fc_o.bias       |    256     |
| encoder.layers.0.positionwise_feedforward.fc_1.weight |   131072   |
|  encoder.layers.0.positionwise_feedforward.fc_1.bias  |    512     |
| encoder.layers.0.positionwise_feedforward.fc_2.weight |   131072   |
|  encoder.layers.0.positionwise_feedforward.fc_2.bias  |    256     |
|      encoder.layers.1.self_attn_layer_norm.weight     |    256     |
|       encoder.layers.1.self_attn_layer_norm.bias      |    256     |
|         encoder.layers.1.ff_layer_norm.weight         |    256     |
|          encoder.layers.1.ff_layer_norm.bias          |    256     |
|      encoder.layers.1.self_attention.fc_q.weight      |   65536    |
|       encoder.layers.1.self_attention.fc_q.bias       |    256     |
|      encoder.layers.1.self_attention.fc_k.weight      |   65536    |
|       encoder.layers.1.self_attention.fc_k.bias       |    256     |
|      encoder.layers.1.self_attention.fc_v.weight      |   65536    |
|       encoder.layers.1.self_attention.fc_v.bias       |    256     |
|      encoder.layers.1.self_attention.fc_o.weight      |   65536    |
|       encoder.layers.1.self_attention.fc_o.bias       |    256     |
| encoder.layers.1.positionwise_feedforward.fc_1.weight |   131072   |
|  encoder.layers.1.positionwise_feedforward.fc_1.bias  |    512     |
| encoder.layers.1.positionwise_feedforward.fc_2.weight |   131072   |
|  encoder.layers.1.positionwise_feedforward.fc_2.bias  |    256     |
|      encoder.layers.2.self_attn_layer_norm.weight     |    256     |
|       encoder.layers.2.self_attn_layer_norm.bias      |    256     |
|         encoder.layers.2.ff_layer_norm.weight         |    256     |
|          encoder.layers.2.ff_layer_norm.bias          |    256     |
|      encoder.layers.2.self_attention.fc_q.weight      |   65536    |
|       encoder.layers.2.self_attention.fc_q.bias       |    256     |
|      encoder.layers.2.self_attention.fc_k.weight      |   65536    |
|       encoder.layers.2.self_attention.fc_k.bias       |    256     |
|      encoder.layers.2.self_attention.fc_v.weight      |   65536    |
|       encoder.layers.2.self_attention.fc_v.bias       |    256     |
|      encoder.layers.2.self_attention.fc_o.weight      |   65536    |
|       encoder.layers.2.self_attention.fc_o.bias       |    256     |
| encoder.layers.2.positionwise_feedforward.fc_1.weight |   131072   |
|  encoder.layers.2.positionwise_feedforward.fc_1.bias  |    512     |
| encoder.layers.2.positionwise_feedforward.fc_2.weight |   131072   |
|  encoder.layers.2.positionwise_feedforward.fc_2.bias  |    256     |
|              decoder.tok_embedding.weight             |  3209728   |
|              decoder.pos_embedding.weight             |   25600    |
|      decoder.layers.0.self_attn_layer_norm.weight     |    256     |
|       decoder.layers.0.self_attn_layer_norm.bias      |    256     |
|      decoder.layers.0.enc_attn_layer_norm.weight      |    256     |
|       decoder.layers.0.enc_attn_layer_norm.bias       |    256     |
|         decoder.layers.0.ff_layer_norm.weight         |    256     |
|          decoder.layers.0.ff_layer_norm.bias          |    256     |
|      decoder.layers.0.self_attention.fc_q.weight      |   65536    |
|       decoder.layers.0.self_attention.fc_q.bias       |    256     |
|      decoder.layers.0.self_attention.fc_k.weight      |   65536    |
|       decoder.layers.0.self_attention.fc_k.bias       |    256     |
|      decoder.layers.0.self_attention.fc_v.weight      |   65536    |
|       decoder.layers.0.self_attention.fc_v.bias       |    256     |
|      decoder.layers.0.self_attention.fc_o.weight      |   65536    |
|       decoder.layers.0.self_attention.fc_o.bias       |    256     |
|     decoder.layers.0.encoder_attention.fc_q.weight    |   65536    |
|      decoder.layers.0.encoder_attention.fc_q.bias     |    256     |
|     decoder.layers.0.encoder_attention.fc_k.weight    |   65536    |
|      decoder.layers.0.encoder_attention.fc_k.bias     |    256     |
|     decoder.layers.0.encoder_attention.fc_v.weight    |   65536    |
|      decoder.layers.0.encoder_attention.fc_v.bias     |    256     |
|     decoder.layers.0.encoder_attention.fc_o.weight    |   65536    |
|      decoder.layers.0.encoder_attention.fc_o.bias     |    256     |
| decoder.layers.0.positionwise_feedforward.fc_1.weight |   131072   |
|  decoder.layers.0.positionwise_feedforward.fc_1.bias  |    512     |
| decoder.layers.0.positionwise_feedforward.fc_2.weight |   131072   |
|  decoder.layers.0.positionwise_feedforward.fc_2.bias  |    256     |
|      decoder.layers.1.self_attn_layer_norm.weight     |    256     |
|       decoder.layers.1.self_attn_layer_norm.bias      |    256     |
|      decoder.layers.1.enc_attn_layer_norm.weight      |    256     |
|       decoder.layers.1.enc_attn_layer_norm.bias       |    256     |
|         decoder.layers.1.ff_layer_norm.weight         |    256     |
|          decoder.layers.1.ff_layer_norm.bias          |    256     |
|      decoder.layers.1.self_attention.fc_q.weight      |   65536    |
|       decoder.layers.1.self_attention.fc_q.bias       |    256     |
|      decoder.layers.1.self_attention.fc_k.weight      |   65536    |
|       decoder.layers.1.self_attention.fc_k.bias       |    256     |
|      decoder.layers.1.self_attention.fc_v.weight      |   65536    |
|       decoder.layers.1.self_attention.fc_v.bias       |    256     |
|      decoder.layers.1.self_attention.fc_o.weight      |   65536    |
|       decoder.layers.1.self_attention.fc_o.bias       |    256     |
|     decoder.layers.1.encoder_attention.fc_q.weight    |   65536    |
|      decoder.layers.1.encoder_attention.fc_q.bias     |    256     |
|     decoder.layers.1.encoder_attention.fc_k.weight    |   65536    |
|      decoder.layers.1.encoder_attention.fc_k.bias     |    256     |
|     decoder.layers.1.encoder_attention.fc_v.weight    |   65536    |
|      decoder.layers.1.encoder_attention.fc_v.bias     |    256     |
|     decoder.layers.1.encoder_attention.fc_o.weight    |   65536    |
|      decoder.layers.1.encoder_attention.fc_o.bias     |    256     |
| decoder.layers.1.positionwise_feedforward.fc_1.weight |   131072   |
|  decoder.layers.1.positionwise_feedforward.fc_1.bias  |    512     |
| decoder.layers.1.positionwise_feedforward.fc_2.weight |   131072   |
|  decoder.layers.1.positionwise_feedforward.fc_2.bias  |    256     |
|      decoder.layers.2.self_attn_layer_norm.weight     |    256     |
|       decoder.layers.2.self_attn_layer_norm.bias      |    256     |
|      decoder.layers.2.enc_attn_layer_norm.weight      |    256     |
|       decoder.layers.2.enc_attn_layer_norm.bias       |    256     |
|         decoder.layers.2.ff_layer_norm.weight         |    256     |
|          decoder.layers.2.ff_layer_norm.bias          |    256     |
|      decoder.layers.2.self_attention.fc_q.weight      |   65536    |
|       decoder.layers.2.self_attention.fc_q.bias       |    256     |
|      decoder.layers.2.self_attention.fc_k.weight      |   65536    |
|       decoder.layers.2.self_attention.fc_k.bias       |    256     |
|      decoder.layers.2.self_attention.fc_v.weight      |   65536    |
|       decoder.layers.2.self_attention.fc_v.bias       |    256     |
|      decoder.layers.2.self_attention.fc_o.weight      |   65536    |
|       decoder.layers.2.self_attention.fc_o.bias       |    256     |
|     decoder.layers.2.encoder_attention.fc_q.weight    |   65536    |
|      decoder.layers.2.encoder_attention.fc_q.bias     |    256     |
|     decoder.layers.2.encoder_attention.fc_k.weight    |   65536    |
|      decoder.layers.2.encoder_attention.fc_k.bias     |    256     |
|     decoder.layers.2.encoder_attention.fc_v.weight    |   65536    |
|      decoder.layers.2.encoder_attention.fc_v.bias     |    256     |
|     decoder.layers.2.encoder_attention.fc_o.weight    |   65536    |
|      decoder.layers.2.encoder_attention.fc_o.bias     |    256     |
| decoder.layers.2.positionwise_feedforward.fc_1.weight |   131072   |
|  decoder.layers.2.positionwise_feedforward.fc_1.bias  |    512     |
| decoder.layers.2.positionwise_feedforward.fc_2.weight |   131072   |
|  decoder.layers.2.positionwise_feedforward.fc_2.bias  |    256     |
|                 decoder.fc_out.weight                 |  3209728   |
|                  decoder.fc_out.bias                  |   12538    |
+-------------------------------------------------------+------------+
Total Trainable Params: 12490234

Model Params on my local machine:

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 12,506,137 trainable parameters

+-------------------------------------------------------+------------+
|                        Modules                        | Parameters |
+-------------------------------------------------------+------------+
|              encoder.tok_embedding.weight             |  2053376   |
|              encoder.pos_embedding.weight             |   25600    |
|      encoder.layers.0.self_attn_layer_norm.weight     |    256     |
|       encoder.layers.0.self_attn_layer_norm.bias      |    256     |
|         encoder.layers.0.ff_layer_norm.weight         |    256     |
|          encoder.layers.0.ff_layer_norm.bias          |    256     |
|      encoder.layers.0.self_attention.fc_q.weight      |   65536    |
|       encoder.layers.0.self_attention.fc_q.bias       |    256     |
|      encoder.layers.0.self_attention.fc_k.weight      |   65536    |
|       encoder.layers.0.self_attention.fc_k.bias       |    256     |
|      encoder.layers.0.self_attention.fc_v.weight      |   65536    |
|       encoder.layers.0.self_attention.fc_v.bias       |    256     |
|      encoder.layers.0.self_attention.fc_o.weight      |   65536    |
|       encoder.layers.0.self_attention.fc_o.bias       |    256     |
| encoder.layers.0.positionwise_feedforward.fc_1.weight |   131072   |
|  encoder.layers.0.positionwise_feedforward.fc_1.bias  |    512     |
| encoder.layers.0.positionwise_feedforward.fc_2.weight |   131072   |
|  encoder.layers.0.positionwise_feedforward.fc_2.bias  |    256     |
|      encoder.layers.1.self_attn_layer_norm.weight     |    256     |
|       encoder.layers.1.self_attn_layer_norm.bias      |    256     |
|         encoder.layers.1.ff_layer_norm.weight         |    256     |
|          encoder.layers.1.ff_layer_norm.bias          |    256     |
|      encoder.layers.1.self_attention.fc_q.weight      |   65536    |
|       encoder.layers.1.self_attention.fc_q.bias       |    256     |
|      encoder.layers.1.self_attention.fc_k.weight      |   65536    |
|       encoder.layers.1.self_attention.fc_k.bias       |    256     |
|      encoder.layers.1.self_attention.fc_v.weight      |   65536    |
|       encoder.layers.1.self_attention.fc_v.bias       |    256     |
|      encoder.layers.1.self_attention.fc_o.weight      |   65536    |
|       encoder.layers.1.self_attention.fc_o.bias       |    256     |
| encoder.layers.1.positionwise_feedforward.fc_1.weight |   131072   |
|  encoder.layers.1.positionwise_feedforward.fc_1.bias  |    512     |
| encoder.layers.1.positionwise_feedforward.fc_2.weight |   131072   |
|  encoder.layers.1.positionwise_feedforward.fc_2.bias  |    256     |
|      encoder.layers.2.self_attn_layer_norm.weight     |    256     |
|       encoder.layers.2.self_attn_layer_norm.bias      |    256     |
|         encoder.layers.2.ff_layer_norm.weight         |    256     |
|          encoder.layers.2.ff_layer_norm.bias          |    256     |
|      encoder.layers.2.self_attention.fc_q.weight      |   65536    |
|       encoder.layers.2.self_attention.fc_q.bias       |    256     |
|      encoder.layers.2.self_attention.fc_k.weight      |   65536    |
|       encoder.layers.2.self_attention.fc_k.bias       |    256     |
|      encoder.layers.2.self_attention.fc_v.weight      |   65536    |
|       encoder.layers.2.self_attention.fc_v.bias       |    256     |
|      encoder.layers.2.self_attention.fc_o.weight      |   65536    |
|       encoder.layers.2.self_attention.fc_o.bias       |    256     |
| encoder.layers.2.positionwise_feedforward.fc_1.weight |   131072   |
|  encoder.layers.2.positionwise_feedforward.fc_1.bias  |    512     |
| encoder.layers.2.positionwise_feedforward.fc_2.weight |   131072   |
|  encoder.layers.2.positionwise_feedforward.fc_2.bias  |    256     |
|              decoder.tok_embedding.weight             |  3217664   |
|              decoder.pos_embedding.weight             |   25600    |
|      decoder.layers.0.self_attn_layer_norm.weight     |    256     |
|       decoder.layers.0.self_attn_layer_norm.bias      |    256     |
|      decoder.layers.0.enc_attn_layer_norm.weight      |    256     |
|       decoder.layers.0.enc_attn_layer_norm.bias       |    256     |
|         decoder.layers.0.ff_layer_norm.weight         |    256     |
|          decoder.layers.0.ff_layer_norm.bias          |    256     |
|      decoder.layers.0.self_attention.fc_q.weight      |   65536    |
|       decoder.layers.0.self_attention.fc_q.bias       |    256     |
|      decoder.layers.0.self_attention.fc_k.weight      |   65536    |
|       decoder.layers.0.self_attention.fc_k.bias       |    256     |
|      decoder.layers.0.self_attention.fc_v.weight      |   65536    |
|       decoder.layers.0.self_attention.fc_v.bias       |    256     |
|      decoder.layers.0.self_attention.fc_o.weight      |   65536    |
|       decoder.layers.0.self_attention.fc_o.bias       |    256     |
|     decoder.layers.0.encoder_attention.fc_q.weight    |   65536    |
|      decoder.layers.0.encoder_attention.fc_q.bias     |    256     |
|     decoder.layers.0.encoder_attention.fc_k.weight    |   65536    |
|      decoder.layers.0.encoder_attention.fc_k.bias     |    256     |
|     decoder.layers.0.encoder_attention.fc_v.weight    |   65536    |
|      decoder.layers.0.encoder_attention.fc_v.bias     |    256     |
|     decoder.layers.0.encoder_attention.fc_o.weight    |   65536    |
|      decoder.layers.0.encoder_attention.fc_o.bias     |    256     |
| decoder.layers.0.positionwise_feedforward.fc_1.weight |   131072   |
|  decoder.layers.0.positionwise_feedforward.fc_1.bias  |    512     |
| decoder.layers.0.positionwise_feedforward.fc_2.weight |   131072   |
|  decoder.layers.0.positionwise_feedforward.fc_2.bias  |    256     |
|      decoder.layers.1.self_attn_layer_norm.weight     |    256     |
|       decoder.layers.1.self_attn_layer_norm.bias      |    256     |
|      decoder.layers.1.enc_attn_layer_norm.weight      |    256     |
|       decoder.layers.1.enc_attn_layer_norm.bias       |    256     |
|         decoder.layers.1.ff_layer_norm.weight         |    256     |
|          decoder.layers.1.ff_layer_norm.bias          |    256     |
|      decoder.layers.1.self_attention.fc_q.weight      |   65536    |
|       decoder.layers.1.self_attention.fc_q.bias       |    256     |
|      decoder.layers.1.self_attention.fc_k.weight      |   65536    |
|       decoder.layers.1.self_attention.fc_k.bias       |    256     |
|      decoder.layers.1.self_attention.fc_v.weight      |   65536    |
|       decoder.layers.1.self_attention.fc_v.bias       |    256     |
|      decoder.layers.1.self_attention.fc_o.weight      |   65536    |
|       decoder.layers.1.self_attention.fc_o.bias       |    256     |
|     decoder.layers.1.encoder_attention.fc_q.weight    |   65536    |
|      decoder.layers.1.encoder_attention.fc_q.bias     |    256     |
|     decoder.layers.1.encoder_attention.fc_k.weight    |   65536    |
|      decoder.layers.1.encoder_attention.fc_k.bias     |    256     |
|     decoder.layers.1.encoder_attention.fc_v.weight    |   65536    |
|      decoder.layers.1.encoder_attention.fc_v.bias     |    256     |
|     decoder.layers.1.encoder_attention.fc_o.weight    |   65536    |
|      decoder.layers.1.encoder_attention.fc_o.bias     |    256     |
| decoder.layers.1.positionwise_feedforward.fc_1.weight |   131072   |
|  decoder.layers.1.positionwise_feedforward.fc_1.bias  |    512     |
| decoder.layers.1.positionwise_feedforward.fc_2.weight |   131072   |
|  decoder.layers.1.positionwise_feedforward.fc_2.bias  |    256     |
|      decoder.layers.2.self_attn_layer_norm.weight     |    256     |
|       decoder.layers.2.self_attn_layer_norm.bias      |    256     |
|      decoder.layers.2.enc_attn_layer_norm.weight      |    256     |
|       decoder.layers.2.enc_attn_layer_norm.bias       |    256     |
|         decoder.layers.2.ff_layer_norm.weight         |    256     |
|          decoder.layers.2.ff_layer_norm.bias          |    256     |
|      decoder.layers.2.self_attention.fc_q.weight      |   65536    |
|       decoder.layers.2.self_attention.fc_q.bias       |    256     |
|      decoder.layers.2.self_attention.fc_k.weight      |   65536    |
|       decoder.layers.2.self_attention.fc_k.bias       |    256     |
|      decoder.layers.2.self_attention.fc_v.weight      |   65536    |
|       decoder.layers.2.self_attention.fc_v.bias       |    256     |
|      decoder.layers.2.self_attention.fc_o.weight      |   65536    |
|       decoder.layers.2.self_attention.fc_o.bias       |    256     |
|     decoder.layers.2.encoder_attention.fc_q.weight    |   65536    |
|      decoder.layers.2.encoder_attention.fc_q.bias     |    256     |
|     decoder.layers.2.encoder_attention.fc_k.weight    |   65536    |
|      decoder.layers.2.encoder_attention.fc_k.bias     |    256     |
|     decoder.layers.2.encoder_attention.fc_v.weight    |   65536    |
|      decoder.layers.2.encoder_attention.fc_v.bias     |    256     |
|     decoder.layers.2.encoder_attention.fc_o.weight    |   65536    |
|      decoder.layers.2.encoder_attention.fc_o.bias     |    256     |
| decoder.layers.2.positionwise_feedforward.fc_1.weight |   131072   |
|  decoder.layers.2.positionwise_feedforward.fc_1.bias  |    512     |
| decoder.layers.2.positionwise_feedforward.fc_2.weight |   131072   |
|  decoder.layers.2.positionwise_feedforward.fc_2.bias  |    256     |
|                 decoder.fc_out.weight                 |  3217664   |
|                  decoder.fc_out.bias                  |   12569    |
+-------------------------------------------------------+------------+
Total Trainable Params: 12506137

So that is why I am unable to load the model: the locally built model has a different number of parameters.

When I try to load the weights locally anyway, it gives me:

model.load_state_dict(torch.load(f"{model_name}.pt", map_location=device))

Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-f5baac4441a5> in <module>
----> 1 model.load_state_dict(torch.load(f"{model_name}_2.pt", map_location=device))

c:\anaconda\envs\lang_trans\lib\site-packages\torch\nn\modules\module.py in load_state_dict(self, state_dict, strict)
    845         if len(error_msgs) > 0:
    846             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 847                                self.__class__.__name__, "\n\t".join(error_msgs)))
    848         return _IncompatibleKeys(missing_keys, unexpected_keys)
    849 

RuntimeError: Error(s) in loading state_dict for Seq2Seq:
    size mismatch for decoder.tok_embedding.weight: copying a param with shape torch.Size([12538, 256]) from checkpoint, the shape in current model is torch.Size([12569, 256]).
    size mismatch for decoder.fc_out.weight: copying a param with shape torch.Size([12538, 256]) from checkpoint, the shape in current model is torch.Size([12569, 256]).
    size mismatch for decoder.fc_out.bias: copying a param with shape torch.Size([12538]) from checkpoint, the shape in current model is torch.Size([12569]).
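
A quick way to confirm which side disagrees with the checkpoint is to inspect the saved state_dict directly, before constructing any model. This is only a sketch (it assumes the same checkpoint file that is passed to load_state_dict above, and the locally constructed model instance), but it prints the shapes that matter:

import torch

# Load just the saved tensors; no model class is needed for this check.
state_dict = torch.load(f"{model_name}.pt", map_location='cpu')

# Shapes recorded in the checkpoint (should match the Colab table, i.e. 12538 output rows).
print(state_dict['decoder.tok_embedding.weight'].shape)
print(state_dict['decoder.fc_out.weight'].shape)
print(state_dict['decoder.fc_out.bias'].shape)

# Shapes of the freshly constructed local model (12569 output rows here).
print(model.decoder.tok_embedding.weight.shape)
print(model.decoder.fc_out.weight.shape)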

The local model params must be wrong, because in Colab (device=cpu, runtime=None) I am able to load the weights after defining the model class, while on my local machine the params change and the weights fail to load. I know it's weird; please help me find the solution.

You can check the full code of the model here:

https://gist.github.com/Dipeshpal/90c715a7b7f00845e20ef998bda35835

After this, the model params change.

Could you print the decoder.fc_out setup on both machines?
I guess you are using scripts that have diverged, where the number of output features is defined in a different way.
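
For example, something along these lines on both machines (just a sketch, using the module names from the tables above) would show where the output dimension differs:

# Compare this output between Colab and the local machine.
print(model.decoder.fc_out)         # e.g. Linear(in_features=256, out_features=12538, bias=True) vs. 12569
print(model.decoder.tok_embedding)  # e.g. Embedding(12538, 256) vs. Embedding(12569, 256)

If that output dimension is derived from a vocabulary built at runtime rather than from a fixed constant, the two environments may simply be building different vocabularies, which would account for 12538 vs. 12569.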