I read through the posts you shared with me, and many other forum questions about this issue. I changed my data loaders to the following:

```
X_train, X_val, y_train, y_val = train_test_split(X.astype('float32'), Y.astype('float32'), test_size=0.1, random_state=2)
```

```
X_train = torch.tensor(X_train)
y_train = torch.tensor(y_train)
train = torch.utils.data.TensorDataset(X_train,y_train)
train_loader = torch.utils.data.DataLoader(train, batch_size = 256, shuffle = True, num_workers= 4, pin_memory= True )
```

and:

```
X_val = torch.tensor(X_val)
y_val = torch.tensor(y_val)
val = torch.utils.data.TensorDataset(X_val,y_val)
val_loader = torch.utils.data.DataLoader(val, batch_size =256, shuffle = True, num_workers= 4, pin_memory= True)
```
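One thing worth noting about these loaders: with `num_workers > 0`, the DataLoader re-spawns its worker processes at the start of every epoch by default, which is a plausible explanation for a slow first batch each epoch. If that is the cost, `persistent_workers=True` (available since PyTorch 1.7; I'm assuming a recent enough version here) keeps the workers alive across epochs. A minimal sketch with a stand-in dataset:

```python
import torch

# Stand-in tensors just for illustration; the real X_train / y_train
# from above would be used instead.
X = torch.randn(1024, 8)
y = torch.randn(1024, 1)
dataset = torch.utils.data.TensorDataset(X, y)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=2,
    # pin_memory only matters when a GPU is present; keeping it
    # conditional lets this sketch run on a CPU-only machine too.
    pin_memory=torch.cuda.is_available(),
    # Keep worker processes alive between epochs instead of
    # re-spawning them, avoiding the per-epoch start-up cost.
    persistent_workers=True,
)

for epoch in range(2):
    for inputs, targets in loader:
        pass  # the training step would go here
```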

Here’s my simple training loop:

```
for epoch in range(last_epoch, 100):
    model.train()
    end = time.time()
    for i, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.cuda(), targets.cuda()
        print("train: time for loading batch {} is {}".format(i + 1, time.time() - end))
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        running_loss_train += loss.item()
        end = time.time()

    # Validation: forward pass and loss only -- no gradient updates here.
    model.eval()
    end = time.time()
    with torch.no_grad():
        for i, (inputs, targets) in enumerate(val_loader):
            inputs, targets = inputs.cuda(), targets.cuda()
            print("val: time for loading batch {} is {}".format(i + 1, time.time() - end))
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            running_loss_val += loss.item()
            end = time.time()
```
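A caveat about the timing itself: CUDA kernels launch asynchronously, so a plain `time.time()` around GPU work mostly measures kernel *launch* time, not execution time, which may be why the per-batch numbers below are near zero. Calling `torch.cuda.synchronize()` before and after the measured region gives a meaningful wall-clock figure. A minimal sketch, where `step_fn` stands in for one training step:

```python
import time

import torch

def timed_step(step_fn):
    """Run step_fn and return (result, elapsed seconds), synchronizing
    around the call so queued CUDA work is included in the measurement."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    result = step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.time() - start

# Trivial usage example with a stand-in "step":
out, elapsed = timed_step(lambda: torch.ones(4).sum())
```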

This resulted in my GPU utilization fluctuating between 1% and 34%. From the timing output, I think loading the very first batch of each epoch is causing this bottleneck:

```
epoch 1
train: time for loading batch 1 is 0.5566799640655518
train: time for loading batch 2 is 0.0019986629486083984
train: time for loading batch 3 is 0.0019989013671875
train: time for loading batch 4 is 0.001997709274291992
train: time for loading batch 5 is 0.0009999275207519531
train: time for loading batch 6 is 0.001999378204345703
train: time for loading batch 7 is 0.0019981861114501953
train: time for loading batch 8 is 0.001999378204345703
train: time for loading batch 9 is 0.0019991397857666016
train: time for loading batch 10 is 0.0019991397857666016
train: time for loading batch 11 is 0.0019991397857666016
train: time for loading batch 12 is 0.001999378204345703
train: time for loading batch 13 is 0.0009996891021728516
train: time for loading batch 14 is 0.001998424530029297
train: time for loading batch 15 is 0.0010004043579101562
train: time for loading batch 16 is 0.0019986629486083984
train: time for loading batch 17 is 0.0010006427764892578
train: time for loading batch 18 is 0.0019989013671875
train: time for loading batch 19 is 0.0019996166229248047
train: time for loading batch 20 is 0.0
val: time for loading batch 1 is 0.4817237854003906
val: time for loading batch 2 is 0.0019991397857666016
val: time for loading batch 3 is 0.0
epoch 2
train: time for loading batch 1 is 0.46173715591430664
train: time for loading batch 2 is 0.0019996166229248047
train: time for loading batch 3 is 0.0019996166229248047
train: time for loading batch 4 is 0.0009996891021728516
.
.
.
```

Any insights? P.S.: I played around with the batch size and `num_workers`, but they don't seem to solve the issue!

Here’s another timing run using `non_blocking=True`:

```
epoch 1
train: time for loading batch 1 is 0.6221673488616943
train: time for loading batch 2 is 0.0
train: time for loading batch 3 is 0.0
train: time for loading batch 4 is 0.0
train: time for loading batch 5 is 0.0009992122650146484
train: time for loading batch 6 is 0.0
train: time for loading batch 7 is 0.0
train: time for loading batch 8 is 0.0
train: time for loading batch 9 is 0.0
train: time for loading batch 10 is 0.0
train: time for loading batch 11 is 0.0
train: time for loading batch 12 is 0.0009992122650146484
train: time for loading batch 13 is 0.0
train: time for loading batch 14 is 0.00099945068359375
train: time for loading batch 15 is 0.0009996891021728516
train: time for loading batch 16 is 0.0
train: time for loading batch 17 is 0.0009992122650146484
train: time for loading batch 18 is 0.0010004043579101562
train: time for loading batch 19 is 0.0
train: time for loading batch 20 is 0.0
val: time for loading batch 1 is 0.46076011657714844
val: time for loading batch 2 is 0.0
val: time for loading batch 3 is 0.0
epoch 2
train: time for loading batch 1 is 0.48410654067993164
train: time for loading batch 2 is 0.00099945068359375
train: time for loading batch 3 is 0.00099945068359375
train: time for loading batch 4 is 0.0
train: time for loading batch 5 is 0.0
```
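For reference, `non_blocking=True` only overlaps the host-to-device copy with GPU compute when the source tensor sits in pinned (page-locked) host memory, which is what `pin_memory=True` in the DataLoader provides. A minimal sketch of the transfer pattern (on a CPU-only machine the flag is simply a no-op):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in batch; with pin_memory=True in the DataLoader, batches already
# arrive in pinned memory, so .pin_memory() here just mimics that.
inputs = torch.randn(256, 8)
targets = torch.randn(256, 1)
if torch.cuda.is_available():
    inputs = inputs.pin_memory()
    targets = targets.pin_memory()

# With pinned source memory, these copies can run asynchronously and
# overlap with compute already queued on the GPU.
inputs = inputs.to(device, non_blocking=True)
targets = targets.to(device, non_blocking=True)
```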