I’m trying to run OpenNMT-py with torch==1.6.0 on Ubuntu 20.04, on a Google Cloud VM with a Tesla K80 GPU, and I’m getting strange, inconsistent errors. On some runs the VM freezes and I have to restart it from the Cloud Console. On others training continues but silently falls back to the CPU, and on some runs I get “an illegal memory access was encountered” errors. How far training gets before failing also varies from run to run.
I’ve also tried running in Docker containers with several different CUDA versions, with no success.
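Since the failures are intermittent and sometimes hang the whole VM, I’m also trying to rule out driver- or hardware-level trouble first. These are standard driver checks, nothing OpenNMT-specific:

```shell
# Look for NVIDIA Xid errors in the kernel log; these usually indicate
# driver- or hardware-level faults rather than application bugs.
sudo dmesg -T | grep -i xid

# Query temperature and power state. The K80 in the nvidia-smi output
# below is drawing 162 W against a 149 W cap, which seems worth watching.
nvidia-smi -q -d TEMPERATURE,POWER
```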
I posted this on the OpenNMT forum, but I don’t think it’s an OpenNMT-specific issue: https://forum.opennmt.net/t/opennmt-py-starts-to-train-on-gpu-then-switches-to-cpu/4185
GPU Information
$ nvidia-smi
Tue Feb 2 02:14:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:04.0 Off | 0 |
| N/A 60C P0 162W / 149W | 10045MiB / 11441MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 766 G /usr/lib/xorg/Xorg 8MiB |
| 0 N/A N/A 884 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 12494 C /usr/bin/python3 10027MiB |
+-----------------------------------------------------------------------------+
PyTorch Information
$ pip freeze | grep torch
torch==1.6.0
torchtext==0.5.0
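For reference, here’s a quick sanity check I can run before training to confirm what PyTorch reports about the GPU. This is just a minimal diagnostic sketch (the `cuda_report` helper name is mine, not part of OpenNMT-py):

```python
# Print what this PyTorch build reports about CUDA and the attached GPU.
def cuda_report():
    try:
        import torch
    except ImportError:
        return {"torch": None}  # torch not installed in this environment
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,       # CUDA version torch was built against
        "available": torch.cuda.is_available(),
    }
    if info["available"]:
        info["device"] = torch.cuda.get_device_name(0)
        # The Tesla K80 is compute capability (3, 7); older capabilities are
        # the first thing dropped from prebuilt torch wheels.
        info["capability"] = torch.cuda.get_device_capability(0)
    return info

if __name__ == "__main__":
    print(cuda_report())
```

If `available` flips to `False` mid-session, or the reported compute capability is unsupported by the installed wheel, that would explain the silent fallback to CPU.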
My code is here: https://github.com/argosopentech/onmt-models/tree/747b806262b8754fb5f8b36dd4f0ecb599c5266a
Logs
[2021-02-02 02:57:24,712 INFO] Counter vocab from -1 samples.
[2021-02-02 02:57:24,712 INFO] n_sample=-1: Build vocab on full datasets.
[2021-02-02 02:57:24,726 INFO] corpus_1's transforms: TransformPipe()
[2021-02-02 02:57:24,727 INFO] Loading ParallelCorpus(split_data/src-train.txt, split_data/tgt-train.txt, align=None)...
[2021-02-02 02:58:47,628 INFO] Counters src:2224500
[2021-02-02 02:58:47,628 INFO] Counters tgt:2137382
[2021-02-02 02:58:49,324 INFO] Counters after share:3272985
Done with tokenization
[2021-02-02 02:58:55,478 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2021-02-02 02:58:55,478 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-02-02 02:58:55,479 INFO] Missing transforms field for valid data, set to default: [].
[2021-02-02 02:58:55,479 INFO] Parsed 2 corpora from -data.
[2021-02-02 02:58:55,479 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2021-02-02 02:58:55,479 INFO] Loading vocab from text file...
[2021-02-02 02:58:55,479 INFO] Loading src vocabulary from opennmt_data/openmt.vocab
[2021-02-02 02:59:04,827 INFO] Loaded src vocab has 3272985 tokens.
[2021-02-02 02:59:06,932 INFO] Loading tgt vocabulary from opennmt_data/openmt.vocab
[2021-02-02 02:59:15,574 INFO] Loaded tgt vocab has 3272985 tokens.
[2021-02-02 02:59:17,681 INFO] Building fields with vocab in counters...
[2021-02-02 02:59:25,126 INFO] * tgt vocab size: 50004.
[2021-02-02 02:59:32,205 INFO] * src vocab size: 50002.
[2021-02-02 02:59:32,206 INFO] * merging src and tgt vocab...
[2021-02-02 02:59:45,079 INFO] * merged vocab size: 50004.
[2021-02-02 02:59:45,313 INFO] * src vocab size = 50004
[2021-02-02 02:59:45,314 INFO] * tgt vocab size = 50004
[2021-02-02 02:59:45,382 INFO] Building model...
[2021-02-02 03:00:25,641 INFO] NMTModel(
(encoder): TransformerEncoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(50004, 512, padding_idx=1)
)
)
)
(transformer): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(1): TransformerEncoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(2): TransformerEncoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(3): TransformerEncoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(4): TransformerEncoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(5): TransformerEncoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
)
(decoder): TransformerDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(50004, 512, padding_idx=1)
)
)
)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(transformer_layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(drop): Dropout(p=0.1, inplace=False)
(context_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(drop): Dropout(p=0.1, inplace=False)
(context_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
)
(2): TransformerDecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(drop): Dropout(p=0.1, inplace=False)
(context_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
)
(3): TransformerDecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(drop): Dropout(p=0.1, inplace=False)
(context_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
)
(4): TransformerDecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(drop): Dropout(p=0.1, inplace=False)
(context_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
)
(5): TransformerDecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
(layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(drop): Dropout(p=0.1, inplace=False)
(context_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=512, out_features=512, bias=True)
(linear_values): Linear(in_features=512, out_features=512, bias=True)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=512, out_features=512, bias=True)
)
(layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
)
)
)
(generator): Sequential(
(0): Linear(in_features=512, out_features=50004, bias=True)
(1): Cast()
(2): LogSoftmax(dim=-1)
)
)
[2021-02-02 03:00:25,756 INFO] encoder: 44517376
[2021-02-02 03:00:25,757 INFO] decoder: 25275220
[2021-02-02 03:00:25,757 INFO] * number of parameters: 69792596
[2021-02-02 03:00:26,102 INFO] Starting training on GPU: [0]
[2021-02-02 03:00:26,102 INFO] Start training loop and validate every 5000 steps...
[2021-02-02 03:00:26,132 INFO] corpus_1's transforms: TransformPipe()
[2021-02-02 03:00:26,228 INFO] Loading ParallelCorpus(split_data/src-train.txt, split_data/tgt-train.txt, align=None)...
[2021-02-02 03:06:48,606 INFO] Step 100/100000; acc: 8.36; ppl: 9026.83; xent: 9.11; lr: 0.00001; 2423/2798 tok/s; 382 sec
[2021-02-02 03:13:09,476 INFO] Step 200/100000; acc: 10.40; ppl: 4152.53; xent: 8.33; lr: 0.00002; 2491/2852 tok/s; 763 sec
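One thing I still want to try: because CUDA kernels launch asynchronously, an “illegal memory access” error is often reported at an unrelated point in the Python stack trace. Re-running with synchronous launches should make the trace point at the op that actually faulted (the config path below is just a placeholder for my training command):

```shell
# Force synchronous CUDA kernel launches so the stack trace identifies
# the kernel that actually faulted, at the cost of slower training.
CUDA_LAUNCH_BLOCKING=1 onmt_train -config config.yml
```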