Arbitrary "an illegal memory access was encountered" when trying to move model to GPU

Hi,

I intermittently get the error message below.

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorCopy.cpp line=20 error=77 : an illegal memory access was encountered

Usually there is no error trace, just this message. I am running with CUDA_LAUNCH_BLOCKING=1. I see there are several posts of a similar nature, but none of them seem to have a definitive solution. I am fairly sure there is no illegal-index problem, as the code runs fine about half the time. The error happens even before the first minibatch has been processed.

The program does not crash after that, but it does not move forward either.

Any help appreciated. Thanks!

Do you have a small runnable code snippet to reproduce this error, or does it occur completely randomly?

The error was happening intermittently until yesterday. Most of my attempts today have failed, though, with no change to the source code in that time.

Regarding a small snippet: I am not sure what to share yet, because I do not know where the error is coming from. Sometimes it happens at `model.cuda() if torch.cuda.is_available() else model.cpu()`, sometimes in other places.

Do you get any errors running your code on CPU?

No errors on CPU. No errors on GPU some of the time either.

How do you debug these kinds of issues?

I’m not sure why you don’t get a proper error message even with CUDA_LAUNCH_BLOCKING=1.
Usually I check that all batches, especially the targets, have valid values.
Often one unusual batch has e.g. a target value that is too high for whatever reason, so that NLLLoss will crash.

Since the error is thrown randomly, we could try to remove the random operations, e.g. random transformations, shuffling, multiple workers (num_workers=0), etc., to narrow down the problem.

However, it’s also strange that the CPU code runs without any problems.
Can you think of an operation which might mess up your targets or the shapes of your input or target?
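
A quick sanity loop along these lines might help; it is just a sketch, and `train_dataloader` and `num_classes` are placeholders for your own objects. Running the DataLoader with num_workers=0 and shuffle=False makes a bad batch reproducible:

    # Iterate once over the data and flag any batch whose targets would
    # be out of range for NLLLoss (negative or >= num_classes).
    for i, (data, target) in enumerate(train_dataloader):
        if target.min().item() < 0 or target.max().item() >= num_classes:
            print(f"bad target in batch {i}: "
                  f"min={target.min().item()}, max={target.max().item()}")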

Thanks @ptrblck. I am posting some code here.

From main.py:

    # Make the Dataloaders
    train_dataloader, val_dataloader = make_dataloader(data_config, data_path)
    print_rank("prepared the dataloaders")    
    # Make the Model
    model = make_model(model_config, train_dataloader)
    print_rank("prepared the model")

    # Make the optimizer
    optimizer = make_optimizer(optimizer_config, model)
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

From the make_model subroutine:

        model = Seq2Seq(input_dim=train_dataloader.dataset.input_dim, 
                        vocab_size=train_dataloader.dataset.vocab_size,
                        model_config=model_config)
        print("trying to move the model to GPU")
        # Move it to GPU if you can
        model.cuda() if torch.cuda.is_available() else model.cpu()
        print("moved the model to GPU")

And below is the error message. Unlike before, this time I got a more descriptive trace of events.

INFO - 2018-09-08T01:01:01.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep  8 01:01:01 2018 | rank 0: prepared the dataloaders
INFO - 2018-09-08T01:01:02.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep  8 01:01:02 2018 | rank 2: prepared the dataloaders
INFO - 2018-09-08T01:01:02.000Z /container_e2206_1531767901933_63142_01_000116: trying to move the model to GPU
INFO - 2018-09-08T01:01:03.000Z /container_e2206_1531767901933_63142_01_000116: trying to move the model to GPU
INFO - 2018-09-08T01:01:04.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep  8 01:01:04 2018 | rank 1: prepared the dataloaders
INFO - 2018-09-08T01:01:05.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep  8 01:01:05 2018 | rank 3: prepared the dataloaders
INFO - 2018-09-08T01:01:05.000Z /container_e2206_1531767901933_63142_01_000116: trying to move the model to GPU
INFO - 2018-09-08T01:01:05.000Z /container_e2206_1531767901933_63142_01_000116: moved the model to GPU
INFO - 2018-09-08T01:01:05.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep  8 01:01:05 2018 | rank 0: prepared the model
INFO - 2018-09-08T01:01:06.000Z /container_e2206_1531767901933_63142_01_000116: moved the model to GPU
INFO - 2018-09-08T01:01:06.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep  8 01:01:06 2018 | rank 2: prepared the model
INFO - 2018-09-08T01:01:07.000Z /container_e2206_1531767901933_63142_01_000116: trying to move the model to GPU
INFO - 2018-09-08T01:01:08.000Z /container_e2206_1531767901933_63142_01_000116: moved the model to GPU
INFO - 2018-09-08T01:01:08.000Z /container_e2206_1531767901933_63142_01_000116: Sat Sep  8 01:01:08 2018 | rank 1: prepared the model
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorCopy.cu line=102 error=77 : an illegal memory access was encountered

INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: Traceback (most recent call last):
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:   File "train.py", line 90, in <module>
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:     main(model_path, config, data_path, log_dir)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:   File "train.py", line 35, in main
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:     hvd.broadcast_parameters(model.state_dict(), root_rank=0)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:   File "/usr/local/lib/python3.6/dist-packages/horovod/torch/__init__.py", line 158, in broadcast_parameters
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:     synchronize(handle)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:   File "/usr/local/lib/python3.6/dist-packages/horovod/torch/mpi_ops.py", line 404, in synchronize
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:     mpi_lib.horovod_torch_wait_and_clear(handle)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:   File "/usr/local/lib/python3.6/dist-packages/torch/utils/ffi/__init__.py", line 202, in safe_call
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116:     result = torch._C._safe_call(*args, **kwargs)
INFO - 2018-09-08T01:01:09.000Z /container_e2206_1531767901933_63142_01_000116: torch.FatalError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt

This might be a little hard to parse. First, please ignore the container_* strings prepended to each line. Second, notice that rank 3 never prints "prepared the model". We also see four "trying to move the model to GPU" prints but only three "moved the model to GPU" prints. So rank 3 was failing at the line `model.cuda() if torch.cuda.is_available() else model.cpu()`.

Now, what could be causing this? There are no indices involved up to this point; we are just moving a model to the GPU.

I set the device for each rank with `torch.cuda.set_device(hvd.local_rank())`, and hvd.local_rank() prints the correct GPU number, so Horovod does not seem to be at fault here.
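
For reference, this is roughly how the per-rank setup looks; the assert is just an illustrative sanity check added for this post, not something in my actual code:

    import torch
    import horovod.torch as hvd

    hvd.init()
    # Illustrative check: every rank's local_rank must map to a visible GPU,
    # otherwise the later model.cuda() call would target a non-existent device.
    assert hvd.local_rank() < torch.cuda.device_count(), (
        f"rank {hvd.rank()}: local_rank {hvd.local_rank()} but only "
        f"{torch.cuda.device_count()} visible GPU(s)")
    torch.cuda.set_device(hvd.local_rank())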

What else should I check? In what cases would moving the model to the GPU cause an "illegal memory access"?

Since you are using Horovod and MPI, this could be a bug in Horovod, in the PyTorch distributed code, or in MPI; the first two seem more likely. If you can come up with a short script that reproduces the error more reliably, I suggest posting it to the GitHub repos of those two projects.
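
A stripped-down repro sketch could be as small as something like this (the nn.Linear layer and its sizes are arbitrary placeholders; the point is just per-rank device placement plus the parameter broadcast):

    import torch
    import torch.nn as nn
    import horovod.torch as hvd

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())

    # Tiny placeholder model instead of the full Seq2Seq.
    model = nn.Linear(128, 10)
    model.cuda() if torch.cuda.is_available() else model.cpu()
    print(f"rank {hvd.rank()}: moved the model to GPU {hvd.local_rank()}")

    # If the illegal memory access comes from device placement or the broadcast,
    # it should surface here without the rest of the training pipeline.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    torch.cuda.synchronize()
    print(f"rank {hvd.rank()}: broadcast finished")

Launch it with the same mpirun command you normally use for training; if it still fails intermittently, it makes a much stronger bug report for those repos.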