I am quite a PyTorch newbie; I hope this is the right place to post my issue. I am trying to train a transformer model with model parallelism, following closely the Megatron example from fairseq (just using a complete transformer model instead of GPT, with the same options, including --fp16).
My setup is: two nodes with 6 GPUs (Titan RTX) each.
PyTorch 1.6
CUDA 10.1.243
Ubuntu 18.04 LTS
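For reference, the launch command on each node looks roughly like the following; the data path, port, and architecture name are placeholders rather than my exact values:

    fairseq-train /path/to/data \
      --task translation --arch transformer \
      --model-parallel-size 6 \
      --distributed-world-size 12 --distributed-port 12345 \
      --fp16 --optimizer adam --max-tokens 4096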
The model trains, but only with 4 GPUs per node. When I switch to 6 GPUs per node (and tweak the model dimensions/dictionary size to keep them divisible by the number of GPUs), I get the following error on the second node, right when training should start:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fc06b9c91e2 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fc06bc17f92 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc06b9b79cd in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x5c (0x7fc0b3262d1c in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x16b2 (0x7fc0a5d8f6b2 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7fc0a5d8ffa1 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7fc0a5d88119 in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7fc0b352834a in /secondary/thies/.virtualenvs/pytorch-1.6/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xbd6ef (0x7fc0b46826ef in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x76db (0x7fc0b85746db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #10: clone + 0x3f (0x7fc0b88ad88f in /lib/x86_64-linux-gnu/libc.so.6)
It seems there is a problem with the c10 library; however, I get exactly the same error when adding the --ddp-backend=no_c10d option.
When I remove the --fp16 option, the model trains fine with the default c10d backend.
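So the problem seems tied to fp16. My rough guess is that some half-precision tensor ends up at an address the CUDA kernels consider misaligned. A minimal sketch of the kind of mechanism I have in mind (purely illustrative, not taken from fairseq): a view at an odd element offset into an fp16 buffer is only 2-byte aligned, even though the underlying allocation itself is aligned.

    import torch

    # Illustrative only: the base allocation from the caching allocator is
    # well aligned, but an fp16 view starting at an odd element offset is
    # only 2-byte aligned.
    x = torch.randn(6144, 1027, device="cuda").half()
    print(x.data_ptr() % 16)          # 0  -> base allocation is aligned
    print(x[:, 1:].data_ptr() % 16)   # 2  -> misaligned view into the same storage

Could the 6-way split of the embedding/vocabulary dimensions be producing an offset like this somewhere in the fp16 path, or am I on the wrong track?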