CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect

I get this error when I try to run multiple models in parallel.

net = get_model(model_path)  # a basic ResNet-18 with the final fc layer replaced
configs = get_configs(model_path)  # hyperparameters for the model
prepped = before(image, perspective, configs["data"]["mean"], configs["data"]["std"])  # input preprocessing
with lock:  # holding a lock here fixes the error
    with torch.no_grad():
        pred = (net(prepped) > 0).type(torch.int).squeeze(0).cpu().tolist()
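For reference, the serialization pattern above can be sketched without the model specifics. This is a minimal, hypothetical example (the `run_inference` stub and the instrumentation counters stand in for the real forward pass, which is not shown in the thread) demonstrating how a shared module-level lock keeps worker threads from entering the critical section concurrently:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()          # shared by all worker threads
active_in_critical_section = 0   # instrumentation: threads currently inside
max_observed = 0                 # should never exceed 1 if the lock works

def run_inference(x):
    # Stand-in for the real `net(prepped)` call; the lock guarantees
    # only one thread executes this section at a time.
    global active_in_critical_section, max_observed
    with lock:
        active_in_critical_section += 1
        max_observed = max(max_observed, active_in_critical_section)
        result = x * 2  # pretend GPU work
        active_in_critical_section -= 1
    return result

with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(run_inference, range(100)))

print(max_observed)  # 1: the lock serialized every call
```

The trade-off is that the lock removes all inter-model parallelism on the GPU; every forward pass runs strictly one after another.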

Without the lock, I get the CUDA illegal memory access error after some time; with it, the error never occurs. I’m unable to make any sense of this.

Could you describe your use case in more detail and how you are parallelizing the run?

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = {
        executor.submit(predict, image, perspective, model_path): key
        for key, model_path in models_map.items()
    }

    return {key for future, key in futures.items() if future.result()}
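To make the dispatch pattern concrete, here is a self-contained, hypothetical sketch of that executor code with a stub `predict` standing in for the real GPU call (the stub, `run_ensemble`, and the sample `models_map` are illustrative names, not from the thread). It returns the set of keys whose model produced a truthy prediction:

```python
import concurrent.futures

def predict(image, perspective, model_path):
    # Stub for the real per-model inference: pretend models whose
    # checkpoint path contains "positive" fire on this frame.
    return [1] if "positive" in model_path else []

def run_ensemble(image, perspective, models_map):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Map each submitted future back to the key of the model that ran.
        futures = {
            executor.submit(predict, image, perspective, model_path): key
            for key, model_path in models_map.items()
        }
        # Keep the keys whose model returned a truthy result.
        return {key for future, key in futures.items() if future.result()}

models_map = {"cat": "positive_cat.pt", "dog": "negative_dog.pt", "car": "positive_car.pt"}
print(run_ensemble(None, None, models_map))  # {'cat', 'car'}
```

Note that iterating `futures.items()` and calling `future.result()` blocks until each future completes, so the set comprehension only returns once every model has finished on the frame.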

For more context, I am passing frames of a real-time video stream through an ensemble of models, each trained to predict one aspect of the frame. All the models are ResNets. The error doesn’t necessarily happen on the first frame, but after running on the stream for a while I hit this illegal memory access, and every subsequent GPU call then fails.
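Since the error message at the top warns that the failure is reported asynchronously, one standard debugging step (a general CUDA/PyTorch technique, not something established in this thread; the script name is a placeholder) is to force synchronous kernel launches so the Python stack trace points at the call that actually faulted:

```shell
# Synchronous launches make the traceback point at the faulting
# kernel instead of a later, unrelated API call. Slower, debug only.
CUDA_LAUNCH_BLOCKING=1 python your_ensemble_script.py
```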