Need help with slow autograd on GPU

The following code fits a polynomial to a simple data set. The train function takes arguments that select one of five implementations: numpy, pytorch, pytorch on GPU, pytorch with autograd, and pytorch with autograd on GPU.

I'm seeing very long execution times for autograd on the GPU, and I would greatly appreciate help understanding what I'm doing that makes it so slow.

The complete code example is at http://nbviewer.ipython.org/url/www.cs.colostate.edu/~anderson/cs793/notebooks/a.ipynb
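In case the notebook is unavailable, here is a minimal sketch of the kind of loop being timed (not the exact notebook code; the data, step count, and learning rate here are made up for illustration). It fits a cubic polynomial to sin(3x) by gradient descent using PyTorch autograd, optionally on the GPU:

```python
import time
import torch

def train(use_gpu=False, n_steps=2000, lr=0.1):
    # Hypothetical stand-in for the notebook's train() function.
    device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
    x = torch.linspace(-1, 1, 100, device=device)
    y = torch.sin(3 * x)
    # Design matrix of polynomial powers [1, x, x^2, x^3]
    X = torch.stack([x ** i for i in range(4)], dim=1)
    w = torch.zeros(4, 1, device=device, requires_grad=True)
    start = time.time()
    for _ in range(n_steps):
        pred = X @ w
        error = ((pred.squeeze() - y) ** 2).mean()
        error.backward()
        with torch.no_grad():
            w -= lr * w.grad      # plain gradient-descent update
            w.grad.zero_()
    if device.type == 'cuda':
        torch.cuda.synchronize()  # include any pending GPU work in the timing
    return time.time() - start, error.item()

elapsed, final_error = train(use_gpu=False)
print(f'cpu autograd: {elapsed:.2f} seconds, final error {final_error:.4f}')
```

The `torch.cuda.synchronize()` call is there because CUDA kernels launch asynchronously, so wall-clock timings taken without it can be misleading.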

Here is a summary of the results, from the bottom of the notebook:

numpy                11.24 seconds, final error 0.0079
torch                10.40 seconds, final error 0.0079
torch-gpu            46.31 seconds, final error 0.0079
torch-autograd      102.41 seconds, final error 0.0079
torch-autograd-gpu  237.49 seconds, final error 0.0079