Is there a general relationship between the number of parameters and training time?

I’ve recently created a custom PyTorch nn.Module that uses a combination of linear and non-linear connections between layers. I am trying to train a vector of weights with this model. With about 700 weights, 20 layers, and a batch size of 100, the program takes about 11 seconds to process a single batch during training. With 10 weights, 20 layers, and a batch size of 100, a single batch takes about 6 seconds. I find it odd that the 10-weight case doesn’t run much faster. Is this a sign that I am doing something wrong, or does this seem plausible?
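
In case it matters, this is roughly how I measure the per-batch time (the model and data below are placeholders, not my actual code; the `torch.cuda.synchronize()` calls are there because GPU kernels run asynchronously):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model and batch, just to show the timing pattern.
model = torch.nn.Sequential(*[torch.nn.Linear(700, 700) for _ in range(20)]).to(device)
batch = torch.randn(100, 700, device=device)
loss_fn = torch.nn.MSELoss()

# Warm up so one-time costs (allocator growth, kernel selection) are excluded.
for _ in range(3):
    loss_fn(model(batch), batch).backward()

if device == "cuda":
    torch.cuda.synchronize()  # flush queued GPU work before starting the clock
start = time.perf_counter()
loss = loss_fn(model(batch), batch)
loss.backward()
if device == "cuda":
    torch.cuda.synchronize()  # wait for the backward kernels to finish
print(f"one batch: {time.perf_counter() - start:.3f} s")
```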

For reference, I am using the weights to create a symmetric Toeplitz matrix. In both cases the Toeplitz matrix is about 700 by 700, but in the 10-weight case only the first 10 diagonals of the matrix are filled.
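
The construction looks roughly like the sketch below (simplified, not my exact code; the function name is just for illustration):

```python
import torch

def symmetric_toeplitz(weights: torch.Tensor, n: int) -> torch.Tensor:
    """Build an n x n symmetric Toeplitz matrix whose first len(weights)
    diagonals come from `weights`; all remaining diagonals are zero."""
    # Pad the weight vector to length n so entry (i, j) can be read as
    # col[|i - j|] for every position in the matrix.
    pad = torch.zeros(n - weights.numel(), dtype=weights.dtype, device=weights.device)
    col = torch.cat([weights, pad])
    idx = torch.arange(n, device=weights.device)
    # |i - j| is symmetric, so the resulting matrix is too.
    return col[(idx[:, None] - idx[None, :]).abs()]

# 700 x 700 matrix with only the first 10 diagonals filled.
w = torch.randn(10, requires_grad=True)
T = symmetric_toeplitz(w, 700)
```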

Look at it the other way: with 700 weights it doesn’t run 70 times slower, because various parallelization mechanisms amortize the cost (until the weight count gets large enough to saturate them). Plus, there is Amdahl’s law: the fixed, serial part of each step (Python overhead, kernel launches, data movement) does not shrink with the parameter count, which is why the time/size dependency is not linear.
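
You can see the effect with a quick timing loop like the one below (device and sizes are just for illustration): up to a point, larger matrices barely cost more per multiply, because fixed per-step overhead and idle parallel units dominate.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

for n in (64, 256, 1024, 4096):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    a @ b  # warm-up so one-time setup is not measured
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"{n:4d} x {n:4d}: {(time.perf_counter() - start) / 10 * 1e3:.2f} ms per matmul")
```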