Training OpenNMT fails with `TypeError: 'NoneType' object is not callable`

I was trying to train the OpenNMT example on a Mac (CPU only) with the following steps:

Env: Python 3.5, PyTorch 0.1.10.1

  1. Preprocess the data, shrinking src and tgt to only the first 100 sentences by inserting the following lines after line 133 in preprocess.py:
    shrink = True
    if shrink:
        src = src[0:100]
        tgt = tgt[0:100]

Then I ran:

python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

  2. Then I trained with `python train.py -data data/demo.train.pt -save_model demo_model`

It ran fine for a while before an error appeared:

(dlnd-tf-lab)  ->python train.py -data data/demo.train.pt -save_model demo_model
Namespace(batch_size=64, brnn=False, brnn_merge='concat', curriculum=False, data='data/demo.train.pt', dropout=0.3, epochs=13, extra_shuffle=False, gpus=[], input_feed=1, layers=2, learning_rate=1.0, learning_rate_decay=0.5, log_interval=50, max_generator_batches=32, max_grad_norm=5, optim='sgd', param_init=0.1, pre_word_vecs_dec=None, pre_word_vecs_enc=None, rnn_size=500, save_model='demo_model', start_decay_at=8, start_epoch=1, train_from='', train_from_state_dict='', word_vec_size=500)
Loading data from 'data/demo.train.pt'
 * vocabulary size. source = 24999; target = 35820
 * number of training sentences. 100
 * maximum batch size. 64
Building model...
* number of parameters: 58121320
NMTModel (
  (encoder): Encoder (
    (word_lut): Embedding(24999, 500, padding_idx=0)
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): Decoder (
    (word_lut): Embedding(35820, 500, padding_idx=0)
    (rnn): StackedLSTM (
      (dropout): Dropout (p = 0.3)
      (layers): ModuleList (
        (0): LSTMCell(1000, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention (
      (linear_in): Linear (500 -> 500)
      (sm): Softmax ()
      (linear_out): Linear (1000 -> 500)
      (tanh): Tanh ()
    )
    (dropout): Dropout (p = 0.3)
  )
  (generator): Sequential (
    (0): Linear (500 -> 35820)
    (1): LogSoftmax ()
  )
)

Train perplexity: 29508.9
Train accuracy: 0.0216306
Validation perplexity: 4.50917e+08
Validation accuracy: 3.57853

Train perplexity: 1.07012e+07
Train accuracy: 0.06198
Validation perplexity: 103639
Validation accuracy: 0.944334

Train perplexity: 458795
Train accuracy: 0.031198
Validation perplexity: 43578.2
Validation accuracy: 3.42942

Train perplexity: 144931
Train accuracy: 0.0432612
Validation perplexity: 78366.8
Validation accuracy: 2.33598
Decaying learning rate to 0.5

Train perplexity: 58696.8
Train accuracy: 0.0278702
Validation perplexity: 14045.8
Validation accuracy: 3.67793
Decaying learning rate to 0.25

Train perplexity: 10045.1
Train accuracy: 0.0457571
Validation perplexity: 26435.6
Validation accuracy: 4.87078
Decaying learning rate to 0.125

Train perplexity: 10301.5
Train accuracy: 0.0490849
Validation perplexity: 24243.5
Validation accuracy: 3.62823
Decaying learning rate to 0.0625

Train perplexity: 7927.77
Train accuracy: 0.062812
Validation perplexity: 7180.49
Validation accuracy: 5.31809
Decaying learning rate to 0.03125

Train perplexity: 4573.5
Train accuracy: 0.047421
Validation perplexity: 6545.51
Validation accuracy: 5.6163
Decaying learning rate to 0.015625

Train perplexity: 3995.7
Train accuracy: 0.0549085
Validation perplexity: 6316.25
Validation accuracy: 5.4175
Decaying learning rate to 0.0078125

Train perplexity: 3715.81
Train accuracy: 0.0540765
Validation perplexity: 6197.91
Validation accuracy: 5.86481
Decaying learning rate to 0.00390625

Train perplexity: 3672.46
Train accuracy: 0.0540765
Validation perplexity: 6144.18
Validation accuracy: 6.01392
Decaying learning rate to 0.00195312

Train perplexity: 3689.7
Train accuracy: 0.0528286
Validation perplexity: 6113.55
Validation accuracy: 6.31213
Decaying learning rate to 0.000976562
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x118b19b70>
Traceback (most recent call last):
  File "/Users/Natsume/miniconda2/envs/dlnd-tf-lab/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable

Could you tell me how to fix it? Thanks!

Hello,

I think you might be seeing a bug in Python 3.5's weakref (Issue 29519: weakref spewing exceptions during finalization when combined with multiprocessing - Python tracker) that occurs during interpreter shutdown.
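For context, here is a minimal, self-contained sketch of the mechanism behind that traceback (this is illustrative only, not the actual CPython code): during interpreter shutdown, module globals are cleared to None, so a weakref callback such as `WeakValueDictionary`'s internal `remove` can end up calling a name that is no longer bound to a function. The name `helper` below is a hypothetical stand-in for such a module-level dependency.

```python
# Hypothetical stand-in for a module-level helper that a weakref
# callback depends on at finalization time.
helper = len

def remove(ref):
    # At interpreter shutdown, the module global `helper` may already
    # have been cleared to None before this callback fires.
    return helper(ref)

helper = None  # simulate shutdown clearing the module's globals

try:
    remove("dangling weakref")
except TypeError as err:
    print(err)  # 'NoneType' object is not callable
```

The patched CPython code avoids this by not relying on module globals that can disappear during finalization.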

On my machine I was able to resolve it by applying this patch (though I seem to recall that there was some fuzz in the line numbers):
https://github.com/python/cpython/commit/9cd7e17640a49635d1c1f8c2989578a8fc2c1de6.patch

Best regards

Thomas


Thanks a lot, Tom!

Following your suggestion, I switched to Python 2.7 to get the code running, and it trains without error. It works on Python 3.6 too. I also retested Python 3.5 after running `conda update python`, and now all of them train without error.

Thanks again!