@rtwolfe94022 It turns out that the dataloader's speed is fine. Most of the time is spent in _loss.cpu().detach().numpy(), which synchronizes the GPU, and my timing code was counting that wait as part of dataloader_time. In my case, making the batch size smaller relieves the problem.
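
For anyone hitting the same thing, here is a rough sketch of one way to keep the GPU wait out of the dataloader timing, assuming PyTorch. The helper name timed_step and the toy model and tensors are hypothetical, made up for illustration and not taken from the code discussed above. The idea is to call torch.cuda.synchronize() before starting and before stopping the timer, so queued GPU work is charged to the compute step rather than to whatever runs next.

```python
import time

import torch


def timed_step(model, batch, target, loss_fn):
    """Hypothetical helper: run one forward pass and return (loss, seconds).

    CUDA kernels are launched asynchronously, so without an explicit
    synchronize the timer would stop before the GPU has finished.
    """
    use_cuda = batch.is_cuda
    if use_cuda:
        torch.cuda.synchronize()  # drain work queued by earlier steps
    start = time.perf_counter()
    loss = loss_fn(model(batch), target)
    if use_cuda:
        torch.cuda.synchronize()  # wait until this step's kernels finish
    elapsed = time.perf_counter() - start
    # Keep the loss on its device; copying it to the CPU here would
    # force another synchronization on every step.
    return loss.detach(), elapsed


# Toy setup (assumed shapes and model, for illustration only).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(16, 1).to(device)
loss_fn = torch.nn.MSELoss()
x = torch.randn(8, 16, device=device)
y = torch.randn(8, 1, device=device)

loss, seconds = timed_step(model, x, y, loss_fn)
```

With the timing done this way, the dataloader timer only needs to wrap the call that fetches the next batch. Converting the loss to a Python or NumPy value less often (for example every N steps) should also reduce how often the training loop blocks on the GPU.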