Hi, thanks for your reply.
This is the complete error:
2019-11-10 10:56:51,904 - Parser - Current learning rate: 0.002000
2019-11-10 10:56:59,308 - Parser - Epoch 0, Batch 1, AvgCost: 2.10, CorrectSpan: 0.49, CorrectNuclear: 0.37, CorrectRelation: 0.03 - 0 mins 7 secs
2019-11-10 10:57:00,668 - Parser - Epoch 0, Batch 2, AvgCost: 2.08, CorrectSpan: 0.51, CorrectNuclear: 0.36, CorrectRelation: 0.02 - 0 mins 8 secs
2019-11-10 10:57:03,495 - Parser - Epoch 0, Batch 3, AvgCost: 2.11, CorrectSpan: 0.51, CorrectNuclear: 0.35, CorrectRelation: 0.03 - 0 mins 11 secs
2019-11-10 10:57:04,270 - Parser - Epoch 0, Batch 4, AvgCost: 2.10, CorrectSpan: 0.52, CorrectNuclear: 0.36, CorrectRelation: 0.03 - 0 mins 12 secs
2019-11-10 10:57:04,866 - Parser - Epoch 0, Batch 5, AvgCost: 2.04, CorrectSpan: 0.52, CorrectNuclear: 0.34, CorrectRelation: 0.03 - 0 mins 12 secs
2019-11-10 10:57:09,912 - Parser - Epoch 0, Batch 6, AvgCost: 2.05, CorrectSpan: 0.53, CorrectNuclear: 0.35, CorrectRelation: 0.05 - 0 mins 18 secs
2019-11-10 10:57:12,131 - Parser - Epoch 0, Batch 7, AvgCost: 2.07, CorrectSpan: 0.52, CorrectNuclear: 0.37, CorrectRelation: 0.04 - 0 mins 20 secs
2019-11-10 10:57:12,906 - Parser - Epoch 0, Batch 8, AvgCost: 2.06, CorrectSpan: 0.53, CorrectNuclear: 0.37, CorrectRelation: 0.04 - 0 mins 21 secs
2019-11-10 10:57:13,351 - Parser - Epoch 0, Batch 9, AvgCost: 2.06, CorrectSpan: 0.53, CorrectNuclear: 0.37, CorrectRelation: 0.04 - 0 mins 21 secs
2019-11-10 10:57:13,651 - Parser - Epoch 0, Batch 10, AvgCost: 2.08, CorrectSpan: 0.53, CorrectNuclear: 0.37, CorrectRelation: 0.04 - 0 mins 21 secs
Traceback (most recent call last):
  File "train.py", line 319, in <module>
    main()
  File "train.py", line 238, in main
    cost, cost_val = network.loss(subset_data, gold_subtrees, epoch=epoch)
  File "/home/ffajri/Workspace/neural_project/models/architecture.py", line 497, in loss
    cost = self.decode(encoder_output, gold_nuclear, gold_relation, gold_segmentation, span, len_golds)
  File "/home/ffajri/Workspace/neural_project/models/architecture.py", line 381, in decode
    segment_output, transformer_output = self.run_transformer_segmentation(hidden_state1, segment_mask) #output in cuda-2
  File "/home/ffajri/Workspace/neural_project/models/architecture.py", line 98, in run_transformer_segmentation
    edus_score, edus_vec = self.transformer_segmenter(segmented_encoder, segment_mask.int())
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 3.
Original Traceback (most recent call last):
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ffajri/Workspace/neural_project/modules/encoder.py", line 95, in forward
    x = self.transformer_inter[i](i, x, x, mask!=1) # all_sents * max_tokens * dim
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ffajri/Workspace/neural_project/modules/encoder.py", line 68, in forward
    mask=mask)
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ffajri/Workspace/neural_project/modules/neural.py", line 410, in forward
    query = query / math.sqrt(dim_per_head)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 3; 15.75 GiB total capacity; 11.64 GiB already allocated; 2.12 MiB free; 2.97 GiB cached)
Without DataParallel, I need at least two GPUs to run the model, and in that setup it works perfectly.
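
To make the numbers in the out-of-memory message easier to compare, here is a minimal sketch that pulls the sizes out of the message and converts them to MiB. The helper name `parse_oom` and the label `requested` are my own and are not part of PyTorch; the message string is copied from the traceback above.

```python
import re

# Hypothetical helper (not a PyTorch API): extract the sizes reported in a
# "CUDA out of memory" message and convert each one to MiB.
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return a dict mapping each label in the message to its size in MiB."""
    pattern = (
        r"([\d.]+) (KiB|MiB|GiB)"
        r"(?: (total capacity|already allocated|free|cached))?"
    )
    sizes = {}
    for value, unit, label in re.findall(pattern, message):
        # The first size has no trailing label; it is the failed request.
        sizes[label or "requested"] = float(value) * UNIT_TO_MIB[unit]
    return sizes


msg = (
    "CUDA out of memory. Tried to allocate 2.00 MiB (GPU 3; 15.75 GiB total "
    "capacity; 11.64 GiB already allocated; 2.12 MiB free; 2.97 GiB cached)"
)
sizes = parse_oom(msg)
for label, mib in sizes.items():
    print(f"{label:>18}: {mib:10.2f} MiB")
```

Read this way, GPU 3 had only 2.12 MiB free when the failing 2.00 MiB request was made, so the device was already essentially full.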