Thanks @ptrblck. I am posting some code here.
From main.py:
# Make the Dataloaders
train_dataloader, val_dataloader = make_dataloader(data_config, data_path)
print_rank("prepared the dataloaders")
# Make the Model
model = make_model(model_config, train_dataloader)
print_rank("prepared the model")
# Make the optimizer
optimizer = make_optimizer(optimizer_config, model)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
From the make_model subroutine:
model = Seq2Seq(input_dim=train_dataloader.dataset.input_dim,
                vocab_size=train_dataloader.dataset.vocab_size,
                model_config=model_config)
print("trying to move the model to GPU")
# Move it to GPU if you can
model.cuda() if torch.cuda.is_available() else model.cpu()
print("moved the model to GPU")
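As an aside, the conditional expression on the model.cuda() line works, but it is easy to misread because it is used only for its side effect. A clearer equivalent, sketched here with a dummy stand-in class so the snippet runs without torch (DummyModel and move_to_best_device are illustrative names, not from my code):

```python
# Hypothetical stand-in for an nn.Module, tracking which device it was moved to.
class DummyModel:
    def __init__(self):
        self.device = "cpu"

    def cuda(self):
        self.device = "cuda"
        return self

    def cpu(self):
        self.device = "cpu"
        return self


def move_to_best_device(model, cuda_available: bool):
    # A plain if/else reads better than a conditional expression used only
    # for its side effect, and makes the chosen branch easy to log per rank.
    if cuda_available:
        model = model.cuda()
    else:
        model = model.cpu()
    return model


m = move_to_best_device(DummyModel(), cuda_available=True)
print(m.device)  # -> cuda
```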
And below is the error message. Unlike before, this time I got a more descriptive flow of events.
INFO - 2018-09-08T01:01:01.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep 8 01:01:01 2018 | rank 0: prepared the dataloaders
INFO - 2018-09-08T01:01:02.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep 8 01:01:02 2018 | rank 2: prepared the dataloaders
INFO - 2018-09-08T01:01:02.000Z /container_e2206_1531767901933_63142_01_000116: trying to move the model to GPU
INFO - 2018-09-08T01:01:03.000Z /container_e2206_1531767901933_63142_01_000116: trying to move the model to GPU
INFO - 2018-09-08T01:01:04.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep 8 01:01:04 2018 | rank 1: prepared the dataloaders
INFO - 2018-09-08T01:01:05.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep 8 01:01:05 2018 | rank 3: prepared the dataloaders
INFO - 2018-09-08T01:01:05.000Z /container_e2206_1531767901933_63142_01_000116: trying to move the model to GPU
INFO - 2018-09-08T01:01:05.000Z /container_e2206_1531767901933_63142_01_000116: moved the model to GPU
INFO - 2018-09-08T01:01:05.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep 8 01:01:05 2018 | rank 0: prepared the model
INFO - 2018-09-08T01:01:06.000Z /container_e2206_1531767901933_63142_01_000116: moved the model to GPU
INFO - 2018-09-08T01:01:06.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep 8 01:01:06 2018 | rank 2: prepared the model
INFO - 2018-09-08T01:01:07.000Z /container_e2206_1531767901933_63142_01_000116: trying to move the model to GPU
INFO - 2018-09-08T01:01:08.000Z /container_e2206_1531767901933_63142_01_000116: moved the model to GPU
INFO - 2018-09-08T01:01:08.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep 8 01:01:08 2018 | rank 1: prepared the model
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorCopy.cu line=102 error=77 : an illegal memory access was encountered
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: Traceback (most recent call last):
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: File "train.py", line 90, in <module>
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: main(model_path, config, data_path, log_dir)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: File "train.py", line 35, in main
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: hvd.broadcast_parameters(model.state_dict(), root_rank=0)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: File "/usr/local/lib/python3.6/dist-packages/horovod/torch/__init__.py", line 158, in broadcast_parameters
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: synchronize(handle)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: File "/usr/local/lib/python3.6/dist-packages/horovod/torch/mpi_ops.py", line 404, in synchronize
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: mpi_lib.horovod_torch_wait_and_clear(handle)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: File "/usr/local/lib/python3.6/dist-packages/torch/utils/ffi/__init__.py", line 202, in safe_call
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: result = torch._C._safe_call(*args, **kwargs)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: torch.FatalError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt
This might be a little hard to parse. First, please ignore all the container_* strings prepended to each line. Second, notice that rank 3 never prints "prepared the model". We also see four "trying to move the model to GPU" prints but only three "moved the model to GPU" prints. So rank 3 was failing at the line "model.cuda() if torch.cuda.is_available() else model.cpu()".
Now, what could be causing THIS? There are no indices involved so far; we are just moving a model to the GPU.
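For what it's worth, one thing I am now double-checking is whether each rank gets pinned to its own GPU before the .cuda() call (e.g. via torch.cuda.set_device(hvd.local_rank())); if every rank on a node lands on cuda:0, I could imagine the broadcast hitting bad memory. A toy sketch of the mapping I would expect, with hypothetical names (pick_device, local_rank, num_visible_gpus stand in for the Horovod/torch calls):

```python
def pick_device(local_rank: int, num_visible_gpus: int) -> str:
    """Map a process's local rank to a CUDA device string.

    Without an explicit torch.cuda.set_device(local_rank), every process on a
    node defaults to device 0, so tensors from different ranks can end up
    colliding on the same GPU.
    """
    if num_visible_gpus == 0:
        return "cpu"
    return f"cuda:{local_rank % num_visible_gpus}"


# Four ranks on a 4-GPU node should each get their own device:
print([pick_device(r, 4) for r in range(4)])
# -> ['cuda:0', 'cuda:1', 'cuda:2', 'cuda:3']
```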