Hello, I have a question about how the dataloader works. Suppose I have code that looks like this:
model.to("cuda")
for data, label in dataloader:
    data = data.to("cuda")               # .to() is not in-place, assign the result
    label = label.to("cuda")
    output = model(data)                 # on GPU
    loss = criterion(output, label)      # on GPU
    print(loss)                          # use CPU?
In my dataset code, I only use the CPU. Will the line loss = criterion(output, label) wait for the line output = model(data) to finish, or will the CPU move on and load the next batch? If the CPU does wait, what should I do to avoid it sitting idle like that?
Hi Aakira
The code you provided is sequential, so the interpreter executes it line by line; there is no parallelism on the CPU side except for asynchronous operations. But when the model is on the GPU, all the asynchronous ops will eventually hit a barrier, and only after that does the loss calculation actually complete.
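To make the asynchrony concrete, here is a minimal sketch of my own (assuming a CUDA device is available) showing that a kernel launch returns to the CPU almost immediately, while reading the result back forces a synchronization barrier:

import time
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()       # make sure setup work is finished

t0 = time.perf_counter()
c = a @ b                      # kernel is launched asynchronously
t1 = time.perf_counter()       # the CPU gets here almost immediately

value = c.sum().item()         # copying the result to the host is a barrier
t2 = time.perf_counter()

print(f"launch: {t1 - t0:.6f}s, sync and read: {t2 - t1:.6f}s")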
Thank you for the response. I always see everyone write training code like the snippet above. However, when I track the GPU usage chart, the GPU becomes idle every time a batch of data is loaded (the usage line drops very low). Is there a way to help the CPU always load data ahead of time (or at least on time) for the GPU?
If you don't print the loss or perform CUDA synchronization operations, the CUDA runtime will manage the copy stream corresponding to data.to("cuda") as well as all the streams associated with the kernels needed for model execution, and it will run these streams according to their dependency relationships.
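For instance, a small sketch of how that copy can overlap with CPU work (the model and batch here are placeholders of my own, not from the original code):

import torch

model = torch.nn.Linear(1024, 10).to("cuda")  # placeholder model
batch = torch.randn(512, 1024).pin_memory()   # page-locked host memory

# With pinned memory, non_blocking=True lets the host-to-device copy run
# asynchronously; without it, .to("cuda") blocks the CPU until the copy is done.
batch_gpu = batch.to("cuda", non_blocking=True)
output = model(batch_gpu)  # the runtime orders this kernel after the copy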
If you want to learn more about how CUDA runs work concurrently,
see this ppt:
Thank you for your response; the slides are very helpful. However, I have not grasped how your point relates to the initial question. Would you mind clarifying it further?
Kernel launches and CUDA memory copies are both asynchronous operations.
So for your code (unrolled, with the print removed):
it = iter(dataloader)
data0, label0 = next(it)
data0 = data0.to("cuda")
label0 = label0.to("cuda")
output0 = model(data0)
loss0 = criterion(output0, label0)
data1, label1 = next(it)  # when this line runs, loss0 does not yet need its
                          # result copied from device to host (it's an async
                          # operation), so data1 may be returned before loss0
                          # has been computed
data1 = data1.to("cuda")
label1 = label1.to("cuda")
output1 = model(data1)
loss1 = criterion(output1, label1)
...
While output0 and loss0 are being computed (kernels running, results not yet returned), data1 may already have been returned by the dataloader.
In this case, the CPU is not idle while model(data) is computing.
But with modern GPUs and current models, training is often memory-bound,
which means that model(data) ends up waiting for the dataloader to fetch the data and copy it to CUDA device memory.
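Coming back to the original question, the usual way to keep the dataloader ahead of the GPU is to let it prefetch in background worker processes. A minimal sketch, assuming dataset, model, and criterion are defined as in the question (the batch size and worker counts are illustrative):

from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=64,       # illustrative value
    num_workers=4,       # background processes load and collate batches
    pin_memory=True,     # page-locked host memory enables async copies
    prefetch_factor=2,   # each worker keeps 2 batches ready in advance
)

model.to("cuda")
for data, label in dataloader:
    # non_blocking=True is effective together with pin_memory=True
    data = data.to("cuda", non_blocking=True)
    label = label.to("cuda", non_blocking=True)
    output = model(data)
    loss = criterion(output, label)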