The role of forward, backward, and update

Hi everyone,
I’m new to PyTorch (and to neural networks in general), and I have some doubts about the forward, backward, and update process. I have read the tutorials and searched the forums but am still a bit confused, so I’m sorry if it turns out this has been asked before.
So, from what I understand, PyTorch computes (and stores) the gradient of a tensor after backward is called on a loss, right?
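For example, my understanding is that in a toy snippet like this (made-up tensors, not my actual code), w.grad is only populated after the backward call:

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)
loss = (w * x).sum()
print(w.grad)      # None, nothing has been computed yet
loss.backward()    # computes d(loss)/dw ...
print(w.grad)      # ... and stores it here (equal to x in this toy example)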
In my case, I’m trying to do adversarial training [1], and this is a snippet of my training step:

0  | output = model(x, x_len)
1  | loss = loss_function(output, y)
2  | print('Input: {}'.format(x))
3  | model.embeds.retain_grad()
4  | if args.adv:
5  |     model.zero_grad()
6  |     output = model(x, x_len)
7  |     adv_loss = loss_function(output, y)
8  |     model.embeds.retain_grad()
9  |     adv_loss.backward()
10 |     print('First call: {}'.format(model.embeds.grad), flush=True)
11 |     embed_grad = model.embeds.grad.detach()
12 |     output = model(x, x_len, embed_grad=embed_grad)
13 |     adv_loss = loss_function(output, y)
14 |     loss += adv_loss
15 | model.embeds.retain_grad()
16 | model.zero_grad()
17 | print('Grad before: {}'.format(model.embeds.grad), flush=True)
18 | loss.backward()
19 | print('Outside: {}'.format(model.embeds.grad), flush=True)
20 | optimizer.step()

And this is the forward function of the model:

21 | def forward(self, sentence, sentence_len, embed_grad=None):
22 |     embeds = self.word_embed(sentence)
23 |     self.embeds = embeds
24 |     if embed_grad is not None:
25 |         sign_grad = embed_grad.sign()
26 |         embeds = embeds + self.epsilon * sign_grad
27 |         if self.clamp_embed:
28 |             embeds = torch.clamp(embeds, 0, 1)
29 |     batch_size = len(sentence)
30 |     max_length = max(sentence_len)
31 |     embeds = self.dropout_layer(embeds)
32 |     packed = pack_padded_sequence(embeds, sentence_len, batch_first=True, enforce_sorted=False)
33 |     packed_output, _ = self.lstm(packed)
34 |     lstm_out, input_sizes = pad_packed_sequence(packed_output, batch_first=True)
35 |     hidden_space = self.hidden_layer(lstm_out.view(-1, self.hidden_dim))
36 |     hidden_space = self.dropout_layer(F.relu(hidden_space))
37 |     tag_space = self.output_layer(hidden_space)
38 |     output = F.log_softmax(tag_space, dim=1)
39 |     return output

*the line numbers are not the actual line numbers in my script XD

I realize this implementation may not be an efficient one, but my questions for now are:

  1. For the same input and the same random seed, why is the output of line 19 when args.adv is False different from the output of line 10 when args.adv is True with epsilon=0? Is it because of randomness in the model architecture (such as dropout)?
  2. If args.adv is True but epsilon=0, should the expected output of line 19 be the output of line 10 times 2?
  3. In the current implementation I make 3 calls to the model (the 1st with the original input, the 2nd to get the gradient, the 3rd with the adversarial input), but can I call the model only twice with the implementation below? Is it equivalent and more efficient?
40 | output = model(x, x_len)
41 | loss = loss_function(output, y)
42 | model.embeds.retain_grad()
43 | loss.backward()
44 | if args.adv:
45 |     embed_grad = model.embeds.grad.detach()
46 |     output = model(x, x_len, embed_grad=embed_grad)
47 |     adv_loss = loss_function(output, y)
48 |     adv_loss.backward()
49 | optimizer.step()

Sorry for the long question, thank you very very much!

Warm regards,
Reza

[1] https://arxiv.org/abs/1605.07725

Hi,

  1. Yes, dropout layers and any randomness in general (like a random sampler to load data or the model initialization) will give different values at every run. Even if you seed them, you can still get different results if you use them differently from one version to the other (e.g. more parameters in your model means that more random numbers will be drawn) or if you use non-deterministic ops (see the note in the doc here: https://pytorch.org/docs/stable/notes/randomness.html).
  2. Not sure why this should be true? If you just compute the same loss twice and then backward, then yes, it should be true. But you seem to be doing fancy stuff when embed_grad is provided, so I guess you will get a different loss.
  3. Yes, it looks like it would do the same thing, as you simply accumulate the gradients in two steps (see the small sketch below).
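For example (a toy sketch with made-up tensors, not your model), each backward call adds into .grad:

import torch

w = torch.ones(2, requires_grad=True)
(w * 3).sum().backward()
print(w.grad)    # tensor([3., 3.])
(w * 3).sum().backward()
print(w.grad)    # tensor([6., 6.]), the gradients of the two steps are summed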

Hi albanD,

I see. I think the results differ because I didn’t set torch.backends.cudnn.deterministic = True, even though I had set torch.manual_seed. I’ll try this and remove some operations in the forward to get a fair comparison. Thank you very much for your reply!
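Concretely, what I plan to try is roughly this (just a rough sketch; the cudnn flags only matter when running on the GPU):

import torch

torch.manual_seed(42)                       # seed the CPU (and CUDA) RNGs
torch.backends.cudnn.deterministic = True   # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable the non-deterministic autotuner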