Errors running OpenNMT-py with torch 1.6.0 on Tesla K80

I’m trying to run OpenNMT-py with torch==1.6.0 on Ubuntu 20.04 on a Google Cloud Tesla K80 GPU and I’m getting strange, inconsistent errors. On some runs the VM freezes and I have to restart it from the cloud console. On others training continues but switches to the CPU, and on some runs I get illegal memory access errors. How far it gets into training also varies from run to run.

I’ve also tried running in Docker containers with different versions of CUDA with no success.

I posted this on the OpenNMT Forum but I don’t think this is an OpenNMT issue: https://forum.opennmt.net/t/opennmt-py-starts-to-train-on-gpu-then-switches-to-cpu/4185

GPU Information

$ nvidia-smi    
Tue Feb  2 02:14:21 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0   162W / 149W |  10045MiB / 11441MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       766      G   /usr/lib/xorg/Xorg                  8MiB |
|    0   N/A  N/A       884      G   /usr/bin/gnome-shell                3MiB |
|    0   N/A  N/A     12494      C   /usr/bin/python3                10027MiB |
+-----------------------------------------------------------------------------+

PyTorch Information

$ pip freeze | grep torch
torch==1.6.0
torchtext==0.5.0

My code is here: https://github.com/argosopentech/onmt-models/tree/747b806262b8754fb5f8b36dd4f0ecb599c5266a
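
For reference, a minimal check like this, independent of OpenNMT-py, confirms whether this PyTorch build can see the K80 and run a kernel on it (just a sketch):

# Minimal PyTorch/CUDA sanity check, independent of OpenNMT-py.
# Prints build/device info and runs a small matmul on the GPU.
import torch

print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x
    torch.cuda.synchronize()
    print("matmul OK, mean:", y.mean().item())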

Logs

[2021-02-02 02:57:24,712 INFO] Counter vocab from -1 samples.
[2021-02-02 02:57:24,712 INFO] n_sample=-1: Build vocab on full datasets.
[2021-02-02 02:57:24,726 INFO] corpus_1's transforms: TransformPipe()
[2021-02-02 02:57:24,727 INFO] Loading ParallelCorpus(split_data/src-train.txt, split_data/tgt-train.txt, align=None)...
[2021-02-02 02:58:47,628 INFO] Counters src:2224500
[2021-02-02 02:58:47,628 INFO] Counters tgt:2137382
[2021-02-02 02:58:49,324 INFO] Counters after share:3272985
Done with tokenization
[2021-02-02 02:58:55,478 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2021-02-02 02:58:55,478 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-02-02 02:58:55,479 INFO] Missing transforms field for valid data, set to default: [].
[2021-02-02 02:58:55,479 INFO] Parsed 2 corpora from -data.
[2021-02-02 02:58:55,479 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2021-02-02 02:58:55,479 INFO] Loading vocab from text file...
[2021-02-02 02:58:55,479 INFO] Loading src vocabulary from opennmt_data/openmt.vocab
[2021-02-02 02:59:04,827 INFO] Loaded src vocab has 3272985 tokens.
[2021-02-02 02:59:06,932 INFO] Loading tgt vocabulary from opennmt_data/openmt.vocab
[2021-02-02 02:59:15,574 INFO] Loaded tgt vocab has 3272985 tokens.
[2021-02-02 02:59:17,681 INFO] Building fields with vocab in counters...
[2021-02-02 02:59:25,126 INFO]  * tgt vocab size: 50004.
[2021-02-02 02:59:32,205 INFO]  * src vocab size: 50002.
[2021-02-02 02:59:32,206 INFO]  * merging src and tgt vocab...
[2021-02-02 02:59:45,079 INFO]  * merged vocab size: 50004.
[2021-02-02 02:59:45,313 INFO]  * src vocab size = 50004
[2021-02-02 02:59:45,314 INFO]  * tgt vocab size = 50004
[2021-02-02 02:59:45,382 INFO] Building model...
[2021-02-02 03:00:25,641 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 512, padding_idx=1)
        )
      )
    )
    (transformer): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (2): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (3): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (4): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (5): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 512, padding_idx=1)
        )
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
      (2): TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
      (3): TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
      (4): TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
      (5): TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=True)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=512, out_features=50004, bias=True)
    (1): Cast()
    (2): LogSoftmax(dim=-1)
  )
)
[2021-02-02 03:00:25,756 INFO] encoder: 44517376
[2021-02-02 03:00:25,757 INFO] decoder: 25275220
[2021-02-02 03:00:25,757 INFO] * number of parameters: 69792596
[2021-02-02 03:00:26,102 INFO] Starting training on GPU: [0]
[2021-02-02 03:00:26,102 INFO] Start training loop and validate every 5000 steps...
[2021-02-02 03:00:26,132 INFO] corpus_1's transforms: TransformPipe()
[2021-02-02 03:00:26,228 INFO] Loading ParallelCorpus(split_data/src-train.txt, split_data/tgt-train.txt, align=None)...
[2021-02-02 03:06:48,606 INFO] Step 100/100000; acc:   8.36; ppl: 9026.83; xent: 9.11; lr: 0.00001; 2423/2798 tok/s;    382 sec
[2021-02-02 03:13:09,476 INFO] Step 200/100000; acc:  10.40; ppl: 4152.53; xent: 8.33; lr: 0.00002; 2491/2852 tok/s;    763 sec

To isolate the illegal memory access, you could run the code via:

cuda-gdb --args python script.py args
...
run
...
bt

and post the backtrace here.
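
Another option (a generic CUDA debugging technique, not specific to OpenNMT) is to force synchronous kernel launches so the Python traceback points at the op that actually triggered the illegal memory access:

# Must be set before the CUDA context is created, i.e. before any
# tensor touches the GPU -- e.g. at the very top of the training
# script, or as CUDA_LAUNCH_BLOCKING=1 on the command line.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
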
Also, I would recommend updating PyTorch to the latest stable release (1.7.1) or a nightly build, as you might be running into already fixed issues.

I tried installing OpenNMT-py from source with torch==1.7.1 and got the error where GPU utilization drops to 0% and it starts using one CPU again.

This issue seems to be unrelated to the illegal memory access. Were you able to create the backtrace?
For the 0% utilization: your code might be bottlenecked by data loading or some other CPU-side operation. If the GPU has little work to do and is “starving” (waiting for the next input), its utilization can drop to 0% during these phases.
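
As a rough way to check that (a sketch only; batches and train_step are placeholders for your training iterator and step function), you can time how long you wait for each batch versus how long the GPU spends on the step:

import time
import torch

t0 = time.time()
for i, batch in enumerate(batches):        # batches: placeholder iterator
    data_time = time.time() - t0           # time spent waiting for data
    step_start = time.time()
    train_step(batch)                      # placeholder forward/backward/update
    torch.cuda.synchronize()               # wait for queued GPU work to finish
    step_time = time.time() - step_start
    if i % 100 == 0:
        print(f"batch {i}: data {data_time:.3f}s, step {step_time:.3f}s")
    t0 = time.time()

If data_time dominates, the GPU is starving on input; if step_time dominates, the stall is somewhere else.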

It seems like they are related issues: the same code on the same machine behaves differently on different runs and produces different errors. I can try to reproduce the illegal memory access error and get a backtrace, though.

The 0% GPU utilization issue doesn’t look like a data-loading bottleneck. The CPU does the data loading at the start, then training runs for 100 iterations in about 300 seconds, another 100 iterations in about 300 seconds, and then GPU utilization drops to 0% with no progress for 45 minutes. I can try leaving it running longer to make sure this isn’t an intentional part of OpenNMT training, but the timing makes it look like a bug.
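
To capture what the GPU is actually doing during that 45-minute stall, I could log utilization from a separate process while training runs (a sketch, assuming the pynvml package is installed):

# Periodically log GPU utilization and memory while training runs.
# Requires the pynvml package (pip install pynvml).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
while True:
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"gpu {util.gpu}%  mem {mem.used / 2**20:.0f} MiB")
    time.sleep(10)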

It is easier for us to help you if you provide the information we ask for. Try getting the backtrace as @ptrblck suggests.

Sorry, I appreciate the help. The solution was uninstalling Xorg:

sudo apt-get remove -y xserver-xorg-core
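
After removing Xorg and rebooting, a quick check that the display processes are gone and only the training process is left on the GPU (just a small wrapper around plain nvidia-smi):

# Print the nvidia-smi process table; Xorg and gnome-shell should no
# longer appear, leaving only the python3 training process.
import subprocess

print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)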