Hello, I have a question about how the dataloader works. Suppose I have code that looks like this:
model.to("cuda")
for data, label in dataloader:
    data = data.to("cuda")               # .to() is not in-place, assign the result
    label = label.to("cuda")
    output = model(data)                 # on GPU
    loss = criterion(output, label)      # on GPU
    print(loss)                          # use CPU?
In my dataset code, I only use the CPU. Will the line loss = criterion(output, label) wait for the line output = model(data) to finish, or will the CPU move on and load the next batch? If the CPU does wait, what should I do to avoid it sitting idle like that?
Hi Aakira
The code you provided is sequential, so the interpreter executes it line by line; there is no parallelism on the CPU side except for asynchronous operations. But when the model is on the GPU, all the asynchronous ops will eventually hit a barrier, and only after that does the loss calculation actually complete.
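To make the asynchrony concrete, here is a minimal sketch of my own (assuming a CUDA device is available) showing that a kernel launch returns to the CPU almost immediately, while reading the result back forces a synchronization barrier:

import time
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()       # make sure setup work is finished

t0 = time.perf_counter()
c = a @ b                      # kernel is launched asynchronously
t1 = time.perf_counter()       # the CPU gets here almost immediately

value = c.sum().item()         # copying the result to the host is a barrier
t2 = time.perf_counter()

print(f"launch: {t1 - t0:.6f}s, sync and read: {t2 - t1:.6f}s")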
Thank you for the response. I always see everyone write training code like the snippet above. However, when I track the GPU usage chart, the GPU becomes idle every time a batch of data is loaded (the usage line drops very low). Is there a way to help the CPU always load data ahead of time (or at least on time) for the GPU?
If you don't print the loss or perform CUDA synchronization operations, the CUDA runtime will manage the copy stream corresponding to data.to("cuda") as well as all the streams associated with the kernels needed for model execution, and it will run these streams according to their dependency relationships.
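For instance, a small sketch of how that copy can overlap with CPU work (the model and batch here are placeholders of my own, not from the original code):

import torch

model = torch.nn.Linear(1024, 10).to("cuda")  # placeholder model
batch = torch.randn(512, 1024).pin_memory()   # page-locked host memory

# With pinned memory, non_blocking=True lets the host-to-device copy run
# asynchronously; without it, .to("cuda") blocks the CPU until the copy is done.
batch_gpu = batch.to("cuda", non_blocking=True)
output = model(batch_gpu)  # the runtime orders this kernel after the copy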
If you want to learn more about how CUDA runs work concurrently,
see this ppt:
Thank you for your response; the slides are very helpful. However, I have not grasped how your point relates to the initial question. Would you mind clarifying it further?
Kernel launches and CUDA memory copies are both asynchronous operations.
So for your code (unrolled, with the print removed):
it = iter(dataloader)
data0, label0 = next(it)
data0 = data0.to("cuda")
label0 = label0.to("cuda")
output0 = model(data0)
loss0 = criterion(output0, label0)
data1, label1 = next(it)  # when this line runs, loss0 does not yet need its
                          # result copied from device to host (it's an async
                          # operation), so data1 may be returned before loss0
                          # has been computed
data1 = data1.to("cuda")
label1 = label1.to("cuda")
output1 = model(data1)
loss1 = criterion(output1, label1)
...
While output0 and loss0 are being computed (kernels running, results not yet returned), data1 may already have been returned by the dataloader.
In this case, the CPU is not idle while model(data) is computing.
But with modern GPUs and current models, training is often memory-bound,
which means that model(data) ends up waiting for the dataloader to fetch the data and copy it to CUDA device memory.
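Coming back to the original question, the usual way to keep the dataloader ahead of the GPU is to let it prefetch in background worker processes. A minimal sketch, assuming dataset, model, and criterion are defined as in the question (the batch size and worker counts are illustrative):

from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=64,       # illustrative value
    num_workers=4,       # background processes load and collate batches
    pin_memory=True,     # page-locked host memory enables async copies
    prefetch_factor=2,   # each worker keeps 2 batches ready in advance
)

model.to("cuda")
for data, label in dataloader:
    # non_blocking=True is effective together with pin_memory=True
    data = data.to("cuda", non_blocking=True)
    label = label.to("cuda", non_blocking=True)
    output = model(data)
    loss = criterion(output, label)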