Thanks, and you are right. The problem is in the `DataLoader`s: just 4 `num_workers` and the `prefetch_factor` are not enough. My machine can run 12 workers, and when I use all of them the speed increases a lot (see the sketch at the end of this post for roughly how the loader is set up).

However, there is a new issue: if I use `model = torch.compile(model, mode='max-autotune')` with all 12 workers, I get a lot of errors like:
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 328, in cudagraphify_impl
static_outputs = model(list(static_inputs))
File "/tmp/torchinductor_root/vi/cvigpzgiqkkwgcp62tygjtv2pdbmxoytpgcnbl52qjnqiq3m5u6u.py", line 2684, in call
triton__16.run(buf39, buf37, primals_171, primals_172, buf42, buf40, buf41, buf43, 256, 1605632, grid=grid(256), stream=stream0)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 190, in run
result = launcher(
File "<string>", line 6, in launcher
RuntimeError: Triton Error [CUDA]: operation failed due to a previous error during capture
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/syc_cache/train.py", line 71, in <module>
main()
File "/mnt/syc_cache/train.py", line 20, in main
train(args, p, epoch)
File "/mnt/syc_cache/train.py", line 38, in train
loss = forward(images, labels, args, p)
File "/mnt/syc_cache/train.py", line 56, in forward
output = p['model'](images)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 82, in forward
return self.dynamo_ctx(self._orig_mod.forward)(*args, **kwargs)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
return fn(*args, **kwargs)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torchvision/models/resnet.py", line 284, in forward
def forward(self, x: Tensor) -> Tensor:
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
return fn(*args, **kwargs)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2819, in forward
return compiled_fn(full_args)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1222, in g
return f(*args)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2386, in debug_compiled_function
return compiled_function(*args)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1898, in runtime_wrapper
all_outs = call_func_with_args(
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1247, in call_func_with_args
out = normalize_as_list(f(args))
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1222, in g
return f(*args)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2151, in forward
fw_outs = call_func_with_args(
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1247, in call_func_with_args
out = normalize_as_list(f(args))
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 248, in run
return model(new_inputs)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 265, in run
compiled_fn = cudagraphify_impl(model, new_inputs, static_input_idxs)
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 327, in cudagraphify_impl
with torch.cuda.graph(graph, stream=stream):
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/cuda/graphs.py", line 173, in __exit__
self.cuda_graph.capture_end()
File "/root/miniconda3/envs/myconda/lib/python3.10/site-packages/torch/cuda/graphs.py", line 79, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
If I just use `model = torch.compile(model)` (the default mode) with all 12 workers, there are no errors. Does `max-autotune` require extra workers?
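For reference, here is a minimal sketch of roughly how things are set up. The dataset path, transforms, batch size, and exact ResNet variant are placeholders rather than my real code (the real training loop lives in `train.py`); only the `num_workers`/`prefetch_factor` values and the two `torch.compile` calls reflect what I described above.

```python
import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms

# Placeholder dataset/transforms -- my real pipeline is more involved.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
dataset = torchvision.datasets.ImageFolder("/path/to/data", transform=transform)

# Going from 4 workers to all 12 (plus prefetching) is what fixed
# the original data-loading bottleneck.
loader = DataLoader(
    dataset,
    batch_size=256,          # placeholder batch size
    shuffle=True,
    num_workers=12,
    prefetch_factor=4,
    pin_memory=True,
)

# A torchvision ResNet, as in the traceback above.
model = torchvision.models.resnet50(num_classes=len(dataset.classes)).cuda()

# This variant triggers the CUDA graph capture error above:
# model = torch.compile(model, mode="max-autotune")

# This variant runs fine with the same 12-worker DataLoader:
model = torch.compile(model)
```

Switching between the two `torch.compile` lines is the only difference between the failing and working runs.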