How can I stop updating some of the parameters of a layer in a CNN model (rather than the parameters of the whole layer)?

For example, if there are ten parameters (filters) in a CNN layer, how can I update only five of them and keep the rest unchanged (while keeping things efficient)?

Hi Monkey!

If your “ten parameters” are separate Tensors (or Tensors wrapped
in Parameters), set the requires_grad property of the five you wish
to keep unchanged to False (and don’t add them to your optimizer).
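As a minimal sketch (the setup below is made up for illustration: ten separate 3x3 filters, each its own Parameter):

```python
import torch

# ten separate 3x3 "filters," each stored as its own Parameter (hypothetical setup)
filters = [torch.nn.Parameter(torch.randn(3, 3)) for _ in range(10)]

# freeze the last five: autograd will not compute gradients for them
for p in filters[5:]:
    p.requires_grad = False

# hand the optimizer only the five trainable filters
optimizer = torch.optim.SGD([p for p in filters if p.requires_grad], lr=0.1)
```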

If your “ten parameters” are all values in the same Tensor, the best
approach, in my mind, is to store the values before the optimizer update
and then, after the update, restore the values you wish to keep unchanged.
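Here is a minimal sketch of that store-and-restore idea, assuming a Conv2d layer with ten output channels (ten filters) of which the last five should stay fixed:

```python
import torch

conv = torch.nn.Conv2d(3, 10, kernel_size=3)            # ten filters
optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)

x = torch.randn(8, 3, 32, 32)
loss = conv(x).sum()

optimizer.zero_grad()
loss.backward()

with torch.no_grad():
    # store the filters (and their biases) that should stay unchanged
    saved_w = conv.weight[5:].clone()
    saved_b = conv.bias[5:].clone()
    optimizer.step()                                     # updates everything
    # restore the stored values so those five filters are unchanged
    conv.weight[5:] = saved_w
    conv.bias[5:] = saved_b
```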

This approach is outlined in the following post:

Best.

K. Frank

Thanks, K. Frank

Yes, that is a useful method.
But if we only update some parameters through "store and restore," it adds extra time compared to the original training process.
Are there methods that don't add extra computation?

Best,
MonKeyBoy

Hi Monkey!

The short story is that floating-point operations != computation time.

Yes, for the typical use case, updating only some elements of, for example,
a weight tensor will take longer than updating all of them.

It is possible to avoid the extra computation. But the key point is that
avoiding computation doesn’t necessarily save time – and can actually
take more time.

This is because the floating-point pipelines in the gpu (and cpu, for
that matter), as well as the gpu “kernels” (the code that performs the
computations), are optimized for performing certain highly-structured
sequences of floating-point computations.

As a basic example, consider multiplying two large matrices together
where one of the matrices has many zero elements. You could avoid
many floating-point operations by skipping the multiply-by-zeros. But
the logic to do so would break up the optimized way in which those
structured floating-point operations are streamed through the gpu
pipeline, and the pipeline would end up spending a lot of its time not
performing any floating-point operations.

So unless your matrix is very sparse (has a very large percentage of
zero elements), it is (much) cheaper in time – on this kind of optimized
hardware – to multiply the full matrices together, rather than to use a
sparse-matrix-multiplication algorithm to perform the computation.
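As a rough illustration (timings vary a lot with hardware, matrix size, and sparsity, so treat this as a sketch rather than a benchmark), you can compare a dense matmul against torch.sparse.mm on a mostly-zero matrix:

```python
import time
import torch

n = 2000
a = torch.randn(n, n)
a[torch.rand(n, n) > 0.10] = 0.0         # roughly 90% zeros
b = torch.randn(n, n)

t0 = time.time()
c_dense = a @ b                          # full dense matmul, zeros included
t_dense = time.time() - t0

a_sparse = a.to_sparse()                 # COO sparse representation
t0 = time.time()
c_sparse = torch.sparse.mm(a_sparse, b)  # multiplies only the stored non-zeros
t_sparse = time.time() - t0

print(f"dense: {t_dense:.4f} s   sparse: {t_sparse:.4f} s")
```

On typical hardware the dense version often comes out ahead until the matrix is very sparse, which is the point made above.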

Best.

K. Frank