Following are the specifications of my laptop:
-> RAM : 16 GB
-> Intel i7 - 9th gen (4.0GHz)
-> Nvidia RTX 2070 Max Q (8GB)
-> Windows 10
-> PyTorch version 1.5
I have not looked extremely closely at your code, but I noticed two missing settings that might have some impact on performance: both improve CPU and GPU operation interleaving by enabling asynchronous data transfers. They might not have a huge impact in your case, but since they are quick and easy to implement, I recommend giving them a shot.
First, on your data loaders, add the keyword argument pin_memory=True. Second, in your train/eval loop, where you copy data from CPU memory to GPU memory using .to(...), add the keyword argument non_blocking=True. Note that non_blocking=True only makes the copy asynchronous when the source tensor sits in pinned (page-locked) host memory, which is why the two settings go together.
Specific examples of these changes to your notebook:
# DataLoader with pin_memory=True
train_data_loader = DataLoader(train_dataset, batch_size=2048, shuffle=True, num_workers=0, pin_memory=True)
# Tensor.to(...) with non_blocking=True
xb = xb.to(device, non_blocking=True)
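Putting both changes together, a minimal end-to-end sketch might look like the following. The dataset, model, and hyperparameters here are placeholders (I cannot see your notebook's actual definitions), so substitute your own; the relevant parts are only the pin_memory and non_blocking arguments:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder random regression data; use your real dataset here.
train_dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))

# pin_memory=True stages batches in page-locked host memory,
# which is required for truly asynchronous host-to-device copies.
train_data_loader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                               num_workers=0, pin_memory=True)

model = nn.Linear(10, 1).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in train_data_loader:
    # non_blocking=True lets the copy overlap with CPU-side work;
    # it silently falls back to a synchronous copy if the tensor
    # is not pinned, so it is harmless either way.
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()
```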
How much impact this will have in your specific case is hard to determine, since I cannot see how much time the GPU spends waiting on data transfers. But again, these changes are quick to make, so I recommend testing whether they speed things up. If the impact is minimal, we will need to see more details on GPU utilization, e.g. output from nvprof or similar.
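If you want a rough in-script number before reaching for nvprof, CUDA events can time a single host-to-device copy directly. A small sketch (the helper name is mine, not a PyTorch API; it returns None when no GPU is available, so it degrades gracefully on CPU-only machines):

```python
import torch

def time_h2d_copy_ms(host_tensor):
    """Time one host-to-device copy in milliseconds via CUDA events.

    Hypothetical helper for illustration; returns None without a GPU.
    """
    if not torch.cuda.is_available():
        return None
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    host_tensor.to("cuda", non_blocking=True)
    end.record()
    # elapsed_time is only valid after both events have completed.
    torch.cuda.synchronize()
    return start.elapsed_time(end)

# Pin the source tensor first (only possible when CUDA is present)
# so the timed copy can actually run asynchronously.
src = (torch.randn(1024, 1024).pin_memory()
       if torch.cuda.is_available() else torch.randn(4))
print(time_h2d_copy_ms(src))
```

Comparing this number for pinned versus unpinned source tensors gives a quick sense of whether transfer time is worth chasing further.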