I have a beginner question about PyTorch.

I have a base model, which I denote as $f(w_t, x)$.

There is also a mask with the same shape as the weights of $f(w_t, x)$; I denote it as $w_m$. Each element of $w_m$ is sampled from the range [0, 1].

My question is: can we integrate the mask $w_m$ into the training workflow and compute gradients only with respect to the parameters $w_t$ of the model? During training, the new model would be $f(w_t * w_m, x)$.

I just want to update the $w_t$ part of the parameters, while keeping $w_m$ untrainable.
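To make the setup above concrete, here is a minimal sketch for a single linear layer. It assumes the mask can be stored as a module buffer; the class name `FixedMaskLinear`, the layer sizes, and the toy loss are hypothetical and only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FixedMaskLinear(nn.Module):
    """Linear layer whose weight w_t is scaled elementwise by a fixed mask w_m."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # w_t: the trainable weight.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        # w_m: torch.rand samples from [0, 1). A buffer is saved with the
        # module but is not returned by parameters(), so the optimizer
        # never sees it and no gradient is computed for it.
        self.register_buffer("mask", torch.rand(out_features, in_features))

    def forward(self, x):
        # Effective weight is w_t * w_m; autograd reaches only w_t.
        return F.linear(x, self.weight * self.mask)


torch.manual_seed(0)
layer = FixedMaskLinear(4, 2)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)

weight_before = layer.weight.detach().clone()
mask_before = layer.mask.clone()

x = torch.randn(8, 4)
loss = layer(x).pow(2).mean()  # toy loss, for illustration only
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, `layer.weight` has changed while `layer.mask` is untouched. Using a buffer rather than a parameter with `requires_grad=False` is a design choice: both keep the mask fixed, but a buffer also stays out of `layer.parameters()`.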

Hi, @ptrblck. Thanks for your reply, and sorry for my imprecise description.

The details are as follows.

I define a DNN model and denote it as m. In PyTorch, it is created as:

m = Model()

I want to split the parameters of the model m into two tensors: theta and a mask, which I also call m. They all have the same shape. The relationship is $w = \theta \odot m$, where $w$ denotes the weight parameters of the model.
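Because $w$ is an elementwise product, the chain rule gives the gradient of each factor directly from the gradient with respect to $w$. Here $L$ is a hypothetical training loss that is not named in the post:

$$
\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial w} \odot m, \qquad \frac{\partial L}{\partial m} = \frac{\partial L}{\partial w} \odot \theta .
$$

So if $w$ is computed from theta and the mask inside the forward pass, autograd can produce both gradients without any manual bookkeeping.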

In the training workflow, I update the parameters separately, one step at a time. Specifically, I obtain w by combining theta and m via $w = \theta \odot m$, and set w as the parameters of the model. In the first round, I want to update only theta and keep m fixed. In the second round, I want to update only m and keep theta fixed. The rounds then alternate back and forth.

What confuses me is that I have no idea how to implement the separate updates of w, theta, and m.

Another way to understand the mask: the mask m and theta are two sets of parameters of the model. The difference between them is the value range: m is restricted to [0, 1], while theta is unconstrained.
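The alternating rounds described above could be sketched as follows. This is only one possible approach, not a confirmed answer from the thread: it assumes both factors are stored as parameters of a single layer, with one optimizer per factor. The names `FactoredLinear` and `train_round`, the layer sizes, and the random data are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactoredLinear(nn.Module):
    """Linear layer with effective weight w = theta * m (elementwise)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.m = nn.Parameter(torch.rand(out_features, in_features))

    def forward(self, x):
        # w is rebuilt on every forward pass, so autograd can reach both factors.
        return F.linear(x, self.theta * self.m)


def train_round(layer, trainable, frozen, optimizer, x, y):
    """One update that changes `trainable` and leaves `frozen` untouched."""
    trainable.requires_grad_(True)
    frozen.requires_grad_(False)  # no gradient is computed for the frozen factor
    optimizer.zero_grad()
    loss = F.mse_loss(layer(x), y)
    loss.backward()
    optimizer.step()  # this optimizer holds only the trainable factor
    return loss.item()


torch.manual_seed(0)
layer = FactoredLinear(4, 2)
opt_theta = torch.optim.SGD([layer.theta], lr=0.1)
opt_m = torch.optim.SGD([layer.m], lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 2)

theta_0, m_0 = layer.theta.detach().clone(), layer.m.detach().clone()
train_round(layer, layer.theta, layer.m, opt_theta, x, y)  # round 1: update theta
theta_1, m_1 = layer.theta.detach().clone(), layer.m.detach().clone()
train_round(layer, layer.m, layer.theta, opt_m, x, y)      # round 2: update m
theta_2, m_2 = layer.theta.detach().clone(), layer.m.detach().clone()
```

In each round every entry of the active factor receives a gradient, so nothing is masked out after `backward()`. Note that this sketch does not keep m inside [0, 1]; that constraint would need separate handling.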
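One common way to enforce the [0, 1] range is to clamp the mask after each optimizer step (a projection step). The sketch below is an assumption about how this could be done, not something stated in the thread; the toy loss is chosen only so that the unconstrained update leaves the range.

```python
import torch

torch.manual_seed(0)
m = torch.nn.Parameter(torch.rand(3, 3))  # mask, starts inside [0, 1)
optimizer = torch.optim.SGD([m], lr=1.0)

# Toy loss (for illustration only) that pushes every entry of m above 1.
loss = (m - 2.0).pow(2).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Projection step: bring m back into [0, 1] without recording it in autograd.
with torch.no_grad():
    m.clamp_(0.0, 1.0)
```

An alternative would be to store an unconstrained tensor and pass it through `torch.sigmoid` in the forward pass, which keeps the mask strictly between 0 and 1 without any clamping.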

but wouldn’t masking the gradients after the backward call work?

Thanks for your suggestions.

Masking the gradient after backward is not what we want: we need all entries of m or theta (depending on the round) to be updated, not just a specific subset.