Element-wise multiplication of model parameters with a self-defined mask in the training workflow

This might be a stupid question from a PyTorch beginner.

  • Here is a base model, which I denote as $f(w_t, x)$.
  • There is also a mask with the same shape as the weights of $f(w_t, x)$, denoted as $w_m$. Each element of $w_m$ is sampled from the range $[0, 1]$.

My question is: can we integrate the mask $w_m$ into the training workflow and compute gradients only with respect to the parameters $w_t$ of the model $f(w_t, x)$? In that training workflow, the model effectively becomes $f(w_t \odot w_m, x)$.

I only want to update the $w_t$ part of the parameters, while keeping $w_m$ untrainable.
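To make it concrete, here is a minimal sketch of the setup I have in mind, using a single hypothetical `MaskedLinear` layer instead of my real model (the names are just for illustration). The mask is registered as a buffer, so it is excluded from `parameters()` and never receives a gradient, while the effective weight used in `forward` is $w_t \odot w_m$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # w_t: the trainable weight
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        # w_m: fixed mask with values in [0, 1]; a buffer is saved with the
        # module but excluded from parameters() and from gradient computation
        self.register_buffer("mask", torch.rand(out_features, in_features))

    def forward(self, x):
        # effective weight is w_t * w_m; autograd only tracks self.weight
        return F.linear(x, self.weight * self.mask)

model = MaskedLinear(10, 5)
out = model(torch.randn(3, 10))
out.sum().backward()
print(model.weight.grad is not None)  # True: only w_t gets a gradient
print(model.mask.requires_grad)       # False: w_m stays untrainable
```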

I don’t know how the mask is calculated etc., but wouldn’t masking the gradients after the backward call work?
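Something like this rough sketch is what I mean, assuming one mask tensor per parameter with the same shape (the model, masks, and data here are just placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# one mask per parameter, same shape, values in [0, 1] (created randomly here)
masks = {name: torch.rand_like(p) for name, p in model.named_parameters()}

x, target = torch.randn(8, 10), torch.randn(8, 5)
loss = F.mse_loss(model(x), target)
loss.backward()

# scale (or zero out) the gradients with the masks before the optimizer step
with torch.no_grad():
    for name, p in model.named_parameters():
        p.grad.mul_(masks[name])
optimizer.step()
```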


Hi @ptrblck, thanks for your reply, and sorry for my imprecise description.

The details are as follows.

A DNN model is defined, which we denote as model. In PyTorch, it can be represented as,

model = Model()

I want to split the parameters of model into two sets, theta and m, all with the same shape as the original weights. The relationship can be written as $w = \theta \odot m$, where $w$ denotes the weight parameters of model.

In the training workflow, I update the parameters separately and step by step. Specifically, I obtain w by combining theta and m via $w = \theta \odot m$ and set w as the parameters of model. In the first round, I want to update only theta and keep m fixed; in the second round, I want to update only m and keep theta fixed, and so on, back and forth.

What confuses me is that I have no idea how to implement this separate updating of w, theta, and m.
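The closest thing I could come up with is the sketch below, using a single hypothetical `FactorizedLinear` layer instead of my real model: theta and m are kept as two separate `nn.Parameter`s, $w = \theta \odot m$ is recomputed inside `forward`, and two optimizers alternate between rounds. I am not sure whether this is the intended way to do it in PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedLinear(nn.Module):
    """Toy layer whose effective weight is w = theta ⊙ m."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(out_features, in_features))
        self.m = nn.Parameter(torch.rand(out_features, in_features))  # values in [0, 1]

    def forward(self, x):
        # w is recomputed here, so both theta and m stay in the autograd graph
        return F.linear(x, self.theta * self.m)

model = FactorizedLinear(10, 5)
opt_theta = torch.optim.SGD([model.theta], lr=0.1)
opt_m = torch.optim.SGD([model.m], lr=0.1)

x, target = torch.randn(8, 10), torch.randn(8, 5)
for round_idx in range(4):
    opt_theta.zero_grad()
    opt_m.zero_grad()
    loss = F.mse_loss(model(x), target)
    loss.backward()
    # even rounds: update theta and keep m fixed; odd rounds: the opposite
    if round_idx % 2 == 0:
        opt_theta.step()
    else:
        opt_m.step()
```

Is this the right direction, or is there a more idiomatic way to keep one set of parameters fixed per round?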

I don’t know how the mask is calculated etc.

Another way to look at the mask is that the mask m and theta are two sets of parameters of model. The difference between theta and m is the value range: m is restricted to $[0, 1]$, while theta is unconstrained.
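If it matters, one way I was considering to keep m inside $[0, 1]$ (this is only an assumption on my side, nothing is fixed yet) is to project it back after each update, roughly like this:

```python
import torch

# mask parameters, initialized in [0, 1]
m = torch.nn.Parameter(torch.rand(5, 10))
opt_m = torch.optim.SGD([m], lr=0.1)

loss = (m ** 2).sum()   # dummy loss, just to produce a gradient for m
loss.backward()
opt_m.step()

# project the mask back into [0, 1] after the update step
with torch.no_grad():
    m.clamp_(0.0, 1.0)
```

An alternative I thought about is to store unconstrained values and pass them through torch.sigmoid inside forward, so the constraint holds by construction.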

but wouldn’t masking the gradients after the backward call work?

Thanks for your suggestions.

Masking the gradient after the backward call is not what we want, as we need all parameters of m or of theta (depending on the round) to be updated, not just a specific partition of them.
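What I would rather do is freeze the entire fixed set for a round, so it receives no gradient at all, instead of masking individual entries. A rough idea (again only a sketch with toy tensors, I am not sure it is the cleanest way) would be to toggle requires_grad per round:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

theta = nn.Parameter(torch.randn(5, 10))
m = nn.Parameter(torch.rand(5, 10))
optimizer = torch.optim.SGD([theta, m], lr=0.1)

x, target = torch.randn(8, 10), torch.randn(8, 5)
for round_idx in range(4):
    update_theta = (round_idx % 2 == 0)
    # freeze the entire set that should stay fixed in this round
    theta.requires_grad_(update_theta)
    m.requires_grad_(not update_theta)

    optimizer.zero_grad()
    loss = F.mse_loss(F.linear(x, theta * m), target)
    loss.backward()
    optimizer.step()   # parameters with no gradient are skipped by SGD
```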