For example, suppose there are ten parameters (filters) in a CNN layer. How can I update only five of them and keep the rest unchanged (while considering efficiency)?

Hi Monkey!

If your “ten parameters” are separate `Tensor`s (or `Tensor`s wrapped in `Parameter`s), set the `requires_grad` property of the five you wish to keep unchanged to `False` (and don’t add them to your optimizer).
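
As a minimal sketch of that idea (the module and filter layout here are just illustrative assumptions, with each filter stored as its own `Parameter`):

```python
import torch
import torch.nn as nn

class TenFilters(nn.Module):
    # hypothetical module that stores each of its ten filters as a separate Parameter
    def __init__(self):
        super().__init__()
        self.filters = nn.ParameterList(
            nn.Parameter(torch.randn(1, 3, 3, 3)) for _ in range(10)
        )

    def forward(self, x):
        weight = torch.cat(list(self.filters), dim=0)   # shape (10, 3, 3, 3)
        return nn.functional.conv2d(x, weight)

model = TenFilters()

# freeze the last five filters so no gradients are computed for them
for p in model.filters[5:]:
    p.requires_grad = False

# hand only the still-trainable parameters to the optimizer
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01
)
```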

If your “ten parameters” are all values in the same `Tensor`, the best approach, in my mind, is to store the values before the optimizer update and then, after the update, restore the values you wish to keep unchanged. This approach is outlined in the following post:
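
Something along these lines (a minimal sketch; the layer, the dummy input, and the choice of which filters to freeze are just illustrative assumptions):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 10, kernel_size=3)       # single weight tensor of shape (10, 3, 3, 3)
optimizer = torch.optim.SGD(conv.parameters(), lr=0.01)

# hypothetical choice: keep filters 5..9 unchanged
frozen = torch.arange(5, 10)

x = torch.randn(8, 3, 16, 16)
loss = conv(x).sum()

optimizer.zero_grad()
loss.backward()

# store the values you wish to keep unchanged ...
saved = conv.weight.data[frozen].clone()
optimizer.step()
# ... and restore them after the update
with torch.no_grad():
    conv.weight[frozen] = saved
```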

Best.

K. Frank

Thanks, K. Frank

Yes, that is a useful method.

But if we only update some of the parameters through “store and restore”, it adds extra time compared to the original training process.

Is there a method that doesn’t add extra computation?

Best,

MonKeyBoy

Hi Monkey!

The short story is that floating-point operations != computation time.

Yes, for the typical use case, updating only some elements of, for example, a `weight` tensor will take longer than updating all of them.

It is possible to avoid the extra *computation.* But the key point is that avoiding computation doesn’t necessarily save time – and can actually take more time.

This is because the floating-point pipelines in the gpu (and cpu, for that matter), as well as the gpu “kernels” (the code that performs the computations), are optimized for performing certain highly-structured sequences of floating-point computations.

As a basic example, consider multiplying two large matrices together where one of the matrices has many zero elements. You could avoid many floating-point operations by skipping the multiply-by-zeros. But the logic to do so would break up the optimized way in which those structured floating-point operations are streamed through the gpu pipeline, and the pipeline would end up spending a lot of its time not performing *any* floating-point operations.

So unless your matrix is *very* sparse (has a very large percentage of zero elements), it is (much) cheaper in time – on this kind of optimized hardware – to multiply the full matrices together, rather than to use a sparse-matrix-multiplication algorithm to perform the computation.
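
If you want to see this effect for yourself, a rough sketch (exact numbers will depend on your hardware; the sizes and sparsity level are just assumptions) is to time a dense multiply against a sparse multiply on a matrix that is only moderately sparse:

```python
import time
import torch

n = 2000
a = torch.randn(n, n)
b = torch.randn(n, n)

# zero out roughly half of the elements -- moderately sparse, but not *very* sparse
a[torch.rand(n, n) < 0.5] = 0.0
a_sparse = a.to_sparse()

def bench(fn, repeats=10):
    fn()                                   # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

print("dense :", bench(lambda: a @ b))
print("sparse:", bench(lambda: torch.sparse.mm(a_sparse, b)))
```

At this level of sparsity the dense multiply will typically still be faster; the sparse path usually only starts to pay off when the matrix is very sparse.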

Best.

K. Frank