What does the backward() function do?

Thank you for your response.
I explained my confusion with loss.backward() in another topic.

I have a very similar implementation; however, it asks for retain_graph=True, and that slows the code down so much that it's impractical to train. Any thoughts?

Why is x.grad accumulated with dloss/dx? Instead, shouldn't it be multiplied by the learning rate and combined with the old weight values?
Because new_weight = old_weight - learning_rate * x.grad.
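For what it's worth, backward() only accumulates gradients into .grad; the update rule above is applied separately by the optimizer (or by hand). A minimal sketch of the two stages with plain SGD:

import torch

w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()            # accumulates dloss/dw into w.grad; does NOT touch w

lr = 0.1
with torch.no_grad():
    w -= lr * w.grad       # the optimizer's job: new_weight = old_weight - lr * grad
w.grad.zero_()             # clear the accumulated gradient before the next backward()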

  1. We pass the 1st batch through the forward pass and compute the loss for the 1st batch.
  2. We run backpropagation to compute d_loss/dx for all layers.
  3. Then, with the optimization technique, we update the weights via the optimizer.step() function for the 1st batch.

For the second batch, will the updated weights from the 1st batch be used or not? And before calling backward() for the second batch, should we call optimizer.zero_grad() or what?
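For reference, the usual per-batch ordering (a sketch assuming generic model, loss_fn, optimizer, and loader names) is zero_grad -> forward -> backward -> step. The weights updated on batch 1 are indeed used for batch 2, and optimizer.zero_grad() is called once per batch so gradients from the previous batch don't accumulate:

for inputs, targets in loader:
    optimizer.zero_grad()              # clear gradients left over from the previous batch
    outputs = model(inputs)            # forward pass uses the current (already updated) weights
    loss = loss_fn(outputs, targets)
    loss.backward()                    # accumulate fresh gradients into each p.grad
    optimizer.step()                   # update the weights in place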

Hi @colesbury

I used two loss functions, loss = loss1 + loss2, and I expected a different gradient than with just loss = loss1, but the gradient flow and the numbers were the same; adding the second loss has no effect at all. Would you please help me with that? I tried a different second loss, but the result does not change. The first loss is BCELoss and the second one is L1. I changed the sigmoid function to ReLU, but again the gradients from backward() with loss2 and without loss2 are the same!

netG = Generator(ngpu,nz,ngf).to(device)

optimizerG = optim.Adam(netG.parameters(), lr=lr2, betas=(beta1, 0.999))

netG.zero_grad()

label.fill_(real_label)  
label=label.to(device)
output = netD(fake).view(-1)
# Calculate G's loss based on this output
loss1 = criterion(output, label)


xxx=torch.histc(Gaussy.squeeze(1).view(-1).cpu(),100, min=0, max=1, out=None)
ddGaussy=xxx/xxx.sum()

xxx1=torch.histc(fake.squeeze(1).view(-1).cpu(),100, min=0, max=1, out=None)
ddFake=xxx1/xxx1.sum()
loss2=abs(ddGaussy-ddFake).sum()

# Calculate gradients for G with the 2 losses
errG = loss1 + loss2
errG.backward()

for param in netG.parameters():
    print(param.grad.data.sum())

# Update G
optimizerG.step()
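# Sketch: one quick way to check whether loss2 contributes any gradients at all.
# torch.histc is not differentiable, so ddGaussy/ddFake are detached from the
# graph; errG.backward() then only propagates loss1's gradients, which would
# explain identical gradients with and without loss2.
print(loss1.requires_grad, loss1.grad_fn is not None)   # True True   -> gradients flow
print(loss2.requires_grad, loss2.grad_fn is not None)   # False False -> detached, no gradients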

## ------------------
class Generator(nn.Module):
    def __init__(self,ngpu,nz,ngf):
        super(Generator, self).__init__()
        self.ngpu=ngpu
        self.nz=nz
        self.ngf=ngf
        self.l1= nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d(self.nz, self.ngf * 8, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 8),
            nn.ReLU(True),)
            # state size. (ngf*8) x 4 x 4
        self.l2=nn.Sequential(nn.ConvTranspose2d(self.ngf * 8, self.ngf * 4, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 4),
            nn.ReLU(True),)
            # state size. (ngf*4) x 8 x 8
        self.l3=nn.Sequential(nn.ConvTranspose2d( self.ngf * 4, self.ngf * 2, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 2),
            nn.ReLU(True),)
            # state size. (ngf*2) x 16 x 16
        self.l4 = nn.Sequential(
            nn.ConvTranspose2d(self.ngf * 2, 1, 3, 1, 0, bias=False),
            nn.Sigmoid(),
            # nn.Tanh()
            # state size. (nc) x 64 x 64
        )

    def forward(self, input):
        out=self.l1(input)
        out=self.l2(out)
        out=self.l3(out)
        out=self.l4(out)
        print(out.shape)
        return out

Double post with answer from here.


Hi @colesbury, I am trying to do a similar thing, where I have a reconstruction loss and a kernel alignment loss. They are calculated as below:

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

        self.We1 = torch.nn.Parameter(torch.Tensor(input_length, args.hidden_size).uniform_(-1.0 / math.sqrt(input_length), 1.0 / math.sqrt(input_length)))
        self.We2 = torch.nn.Parameter(torch.Tensor(args.hidden_size, args.code_size).uniform_(-1.0 / math.sqrt(args.hidden_size), 1.0 / math.sqrt(args.hidden_size)))

        self.be1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        self.be2 = torch.nn.Parameter(torch.zeros([args.code_size]))

    def encoder(self, encoder_inputs):
        hidden_1 = torch.tanh(torch.matmul(encoder_inputs.float(), self.We1) + self.be1)
        code = torch.tanh(torch.matmul(hidden_1, self.We2) + self.be2)
        return code

    def decoder(self, encoder_inputs):
        code = self.encoder(encoder_inputs)

        # ----- DECODER -----
        if tied_weights:
            # torch.transpose needs explicit dims; torch.t transposes a 2-D tensor
            Wd1 = torch.t(self.We2)
            Wd2 = torch.t(self.We1)
        else:
            Wd1 = torch.nn.Parameter(
                torch.Tensor(args.code_size, args.hidden_size).uniform_(-1.0 / math.sqrt(args.code_size),
                                                                        1.0 / math.sqrt(args.code_size)))
            Wd2 = torch.nn.Parameter(
                torch.Tensor(args.hidden_size, input_length).uniform_(-1.0 / math.sqrt(args.hidden_size),
                                                                      1.0 / math.sqrt(args.hidden_size)))

        bd1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        bd2 = torch.nn.Parameter(torch.zeros([input_length]))

        if lin_dec:
            hidden_2 = torch.matmul(code, Wd1) + bd1
        else:
            hidden_2 = torch.tanh(torch.matmul(code, Wd1) + bd1)

        dec_out = torch.matmul(hidden_2, Wd2) + bd2

        return dec_out

    def kernel_loss(self,code, prior_K):
        # kernel on codes
        code_K = torch.mm(code, torch.t(code))

        # ----- LOSS -----
        # kernel alignment loss with normalized Frobenius norm
        code_K_norm = code_K / torch.linalg.matrix_norm(code_K, ord='fro', dim=(- 2, - 1))
        prior_K_norm = prior_K / torch.linalg.matrix_norm(prior_K, ord='fro', dim=(- 2, - 1))
        k_loss = torch.linalg.matrix_norm(torch.sub(code_K_norm,prior_K_norm), ord='fro', dim=(- 2, - 1))
        return k_loss

# Initialize model
model = Model()

Now, during training I pass my training data as inputs to the encoder and decoder.

for ep in range(args.num_epochs):
    for batch in range(max_batches):
        # get input data

        dec_out = model.decoder(encoder_inputs)
        reconstruct_loss = torch.mean((dec_out - encoder_inputs) ** 2)
        enc_out = model.encoder(encoder_inputs)
        k_loss = model.kernel_loss(enc_out, prior_K)

        # reg_loss (the L2 regularization term) is computed as in the full script below
        tot_loss = reconstruct_loss + args.w_reg * reg_loss + args.a_reg * k_loss
        tot_loss = tot_loss.float()

        # Backpropagation
        optimizer.zero_grad()
        #tot_loss.backward(retain_graph=True)
        tot_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_gradient_norm)
        optimizer.step()

This always gives me an error saying “RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed).
Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need
to backward through the graph a second time”

It works only when I activate the retain_graph flag, but then training takes a huge amount of time. Can you please let me know what I am doing wrong here?

Thank you!

The issue is raised e.g. if you are keeping the computation graph alive and are then trying to calculate the gradients from the current as well as the previous iteration(s).
Your current code is unfortunately not executable as the init parameters are missing, so could you post a minimal, executable code snippet which would reproduce the issue in case you get stuck?
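For illustration, a minimal sketch of that failure mode: a tensor computed once outside the loop keeps its part of the graph alive, so the second backward() has to traverse buffers that were already freed:

import torch

w = torch.randn(5, requires_grad=True)
h = w ** 2                  # computed ONCE outside the loop; pow saves w for backward

for step in range(2):
    loss = h.sum()          # every iteration's loss shares the w -> h subgraph
    loss.backward()         # 2nd iteration: RuntimeError: Trying to backward through
                            # the graph a second time ...

Moving h = w ** 2 inside the loop lets each iteration build (and free) its own graph, which is usually the proper fix rather than retain_graph=True.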

Hi @ptrblck, thank you for your time. The code now works without the retain_graph=True flag after I moved a variable declaration (one I was reusing across training batches) inside the training loop. But that did not resolve the high training time or the difference in the number of trainable parameters. I have created two sample codes with random data. Can you please help me with: 1) why the number of trainable parameters is just half of that in the TF code, and 2) why the training takes more than double the time of the original TF code?

PYTORCH:

import torch
import torch.nn as nn
from torchvision.utils import save_image
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
import time
import matplotlib.pyplot as plt
import math
from scipy import stats
import scipy
import os
import datetime
from math import sqrt
from math import log
from torch import optim
from torch.autograd import Variable



# from tensorflow import keras as K

# dim_red = 1  # perform PCA on the codes and plot the first two components
# plot_on = 1  # plot the results, otherwise only textual output is returned
# interp_on = 0  # interpolate data (needed if the input time series have different length)
# tied_weights = 0  # train an AE where the decoder weights are the encoder weights transposed
# lin_dec = 1  # train an AE with linear activations in the decoder

# parse input data
parser = argparse.ArgumentParser()
parser.add_argument("--code_size", default=20, help="size of the code", type=int)
parser.add_argument("--w_reg", default=0.001, help="weight of the regularization in the loss function", type=float)
parser.add_argument("--a_reg", default=0.2, help="weight of the kernel alignment", type=float)
parser.add_argument("--num_epochs", default=5000, help="number of epochs in training", type=int)
parser.add_argument("--batch_size", default=25, help="number of samples in each batch", type=int)
parser.add_argument("--max_gradient_norm", default=1.0, help="max gradient norm for gradient clipping", type=float)
parser.add_argument("--learning_rate", default=0.001, help="Adam initial learning rate", type=float)
parser.add_argument("--hidden_size", default=30, help="size of the code", type=int)
args = parser.parse_args()
print(args)

# ================= DATASET =================
# (train_data, train_labels, train_len, _, K_tr,
#  valid_data, _, valid_len, _, K_vs,
#  test_data_orig, test_labels, test_len, _, K_ts) = getBlood(kernel='TCK',
#                                                             inp='zero')  # data shape is [T, N, V] = [time_steps, num_elements, num_var]

train_data = np.random.rand(9000,6)
train_labels = np.ones([9000,1])
train_len = 9000

valid_data = np.random.rand(9000,6)
valid_len = 9000

test_data = np.random.rand(1500,6)
test_labels = np.ones([1500,1])

K_tr = np.random.rand(9000,9000)
K_ts = np.random.rand(1500,1500)
K_vs =  np.random.rand(9000,9000)

#test_data = test_data_orig


print(
    '\n**** Processing Blood data: Tr{}, Vs{}, Ts{} ****\n'.format(train_data.shape, valid_data.shape, test_data.shape))

input_length = train_data.shape[1]  # same for all inputs

# ================= GRAPH =================

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

encoder_inputs = train_data
prior_k = K_tr

# ============= TENSORBOARD =============
writer = SummaryWriter()

# # ----- ENCODER -----

input_length = encoder_inputs.shape[1]
print ("INPUT ")

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

        self.We1 = torch.nn.Parameter(torch.Tensor(input_length, args.hidden_size).uniform_(-1.0 / math.sqrt(input_length), 1.0 / math.sqrt(input_length)))
        self.We2 = torch.nn.Parameter(torch.Tensor(args.hidden_size, args.code_size).uniform_(-1.0 / math.sqrt(args.hidden_size), 1.0 / math.sqrt(args.hidden_size)))

        self.be1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        self.be2 = torch.nn.Parameter(torch.zeros([args.code_size]))


    def encoder(self, encoder_inputs):
        hidden_1 = torch.tanh(torch.matmul(encoder_inputs.float(), self.We1) + self.be1)
        code = torch.tanh(torch.matmul(hidden_1, self.We2) + self.be2)
        #print ("CODE ENCODER SHAPE:", code.size())
        return code

    def decoder(self,encoder_inputs):
        code = self.encoder(encoder_inputs)

        # ----- DECODER -----
        # if tied_weights:
        #
        #     Wd1 = torch.transpose(We2)
        #     Wd2 = torch.transpose(We1)
        #
        # else:

        Wd1 = torch.nn.Parameter(
            torch.Tensor(args.code_size, args.hidden_size).uniform_(-1.0 / math.sqrt(args.code_size),
                                                                       1.0 / math.sqrt(args.code_size)))
        Wd2 = torch.nn.Parameter(
            torch.Tensor(args.hidden_size, input_length).uniform_(-1.0 / math.sqrt(args.hidden_size),
                                                                         1.0 / math.sqrt(args.hidden_size)))

        bd1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        bd2 = torch.nn.Parameter(torch.zeros([input_length]))


        #if lin_dec:
        hidden_2 = torch.matmul(code, Wd1) + bd1
        #else:
        #hidden_2 = torch.tanh(torch.matmul(code, Wd1) + bd1)

        #print("hidden SHAPE:", hidden_2.size())
        dec_out = torch.matmul(hidden_2, Wd2) + bd2

        return  dec_out

    def kernel_loss(self,code, prior_K):
        # kernel on codes
        code_K = torch.mm(code, torch.t(code))

        # ----- LOSS -----
        # kernel alignment loss with normalized Frobenius norm
        code_K_norm = code_K / torch.linalg.matrix_norm(code_K, ord='fro', dim=(- 2, - 1))
        prior_K_norm = prior_K / torch.linalg.matrix_norm(prior_K, ord='fro', dim=(- 2, - 1))
        k_loss = torch.linalg.matrix_norm(torch.sub(code_K_norm,prior_K_norm), ord='fro', dim=(- 2, - 1))
        return k_loss


# Initialize model
model = Model()

# trainable parameters count
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print('Total parameters: {}'.format(total_params))

#Optimizer
optimizer = torch.optim.Adam(model.parameters(),args.learning_rate)

# ================= TRAINING =================

# initialize training variables
time_tr_start = time.time()
batch_size = args.batch_size
max_batches = train_data.shape[0] // batch_size
loss_track = []
kloss_track = []
min_vs_loss = np.infty
model_dir = "logs/dkae_models/m_0.ckpt"

logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))

###############################################################################
# Training code
###############################################################################

try:
    for ep in range(args.num_epochs):

        # shuffle training data
        idx = np.random.permutation(train_data.shape[0])
        train_data_s = train_data[idx, :]
        K_tr_s = K_tr[idx, :][:, idx]


        for batch in range(max_batches):
            fdtr = {}
            fdtr["encoder_inputs"] = train_data_s[(batch) * batch_size:(batch + 1) * batch_size, :]
            fdtr["prior_K"] =  K_tr_s[(batch) * batch_size:(batch + 1) * batch_size,
                             (batch) * batch_size:(batch + 1) * batch_size]

            encoder_inputs = (fdtr["encoder_inputs"].astype(float))
            encoder_inputs = torch.from_numpy(encoder_inputs)
            #print("TYPE ENCODER_INP IN TRAIN:", type(encoder_inputs))

            prior_K = (fdtr["prior_K"].astype(float))
            prior_K = torch.from_numpy(prior_K)

            dec_out = model.decoder(encoder_inputs)

            #print("DEC OUT TRAIN:", dec_out)


            reconstruct_loss = torch.mean((dec_out - encoder_inputs) ** 2)
            reconstruct_loss = reconstruct_loss.float()
            #print("RECONS LOSS TRAIN:", reconstruct_loss)

            enc_out = model.encoder(encoder_inputs)
            k_loss = model.kernel_loss(enc_out,prior_K)
            k_loss = k_loss.float()
            #print ("K_LOSS TRAIN:", k_loss)


            #print ("ENTRPY LOSS:", entrpy_loss)

            # Regularization L2 loss (per variable, as in the TF script's
            # tf.trainable_variables() loop; iterating parameters_to_vector()
            # element by element builds a graph node per parameter element
            # every batch and slows training down considerably)
            reg_loss = 0
            for tf_var in model.parameters():
                reg_loss += torch.mean(torch.linalg.norm(tf_var))

            tot_loss = reconstruct_loss + args.w_reg * reg_loss + args.a_reg * k_loss
            tot_loss = tot_loss.float()

            # Backpropagation
            optimizer.zero_grad()
            #tot_loss.backward(retain_graph=True)
            tot_loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_gradient_norm)
            optimizer.step()

            #tot_loss = tot_loss.detach()

            # detach before storing, otherwise every batch's whole graph is
            # kept alive by the tracking lists (memory growth + slowdown)
            loss_track.append(reconstruct_loss.detach())
            kloss_track.append(k_loss.detach())

        #check training progress on the validations set (in blood data valid=train)
        if ep % 100 == 0:
            print('Ep: {}'.format(ep))

            # fdvs = {"encoder_inputs": valid_data,
            #         "prior_K": K_vs}

            fdvs = {}
            fdvs["encoder_inputs"] = valid_data
            fdvs["prior_K"] = K_vs


            #dec_out_val, lossvs, klossvs, vs_code_K, summary = sess.run(
             #   [dec_out, reconstruct_loss, k_loss, code_K, merged_summary], fdvs)

            encoder_inp = (fdvs["encoder_inputs"].astype(float))
            encoder_inp = torch.from_numpy(encoder_inp)

            prior_K_vs = (fdvs["prior_K"].astype(float))
            prior_K_vs = torch.from_numpy(prior_K_vs)

            enc_out_vs = model.encoder(encoder_inp)


            dec_out_val = model.decoder(encoder_inp)
            #print ("DEC OUT VAL:", dec_out_val)


            reconstruct_loss_val = torch.mean((dec_out_val - encoder_inp) ** 2)
            #print("RECONS LOSS VAL:", reconstruct_loss)

            k_loss_val = model.kernel_loss(enc_out_vs,prior_K_vs)
            #print("K_LOSS VAL:", k_loss_val)


            writer.add_scalar("reconstruct_loss", reconstruct_loss_val, ep)
            writer.add_scalar("k_loss", k_loss_val, ep)
            #writer.add_scalar("tot_loss", tot_loss, ep)


            print('VS r_loss=%.3f, k_loss=%.3f -- TR r_loss=%.3f, k_loss=%.3f' % (
            reconstruct_loss_val, k_loss_val, torch.mean(torch.stack(loss_track[-100:])), torch.mean(torch.stack(kloss_track[-100:]))))
            #reconstruct_loss_val, k_loss_val, np.mean(loss_track[-100:].detach().numpy()), np.mean(kloss_track[-100:].detach().numpy())))


            # Save model yielding best results on validation
            if reconstruct_loss_val < min_vs_loss:
                min_vs_loss = reconstruct_loss_val
                torch.save(model, model_dir)
                torch.save(model.state_dict(), 'logs/dkae_models/best-model-parameters.pt')

                #save_path = saver.save(sess, model_name)

except KeyboardInterrupt:
    print('training interrupted')

time_tr_end = time.time()
print('Tot training time: {}'.format((time_tr_end - time_tr_start) // 60))
writer.close()

TF code:

import tensorflow as tf
import tensorflow.compat.v1 as tf
tf.compat.v1.enable_eager_execution()
tf.disable_v2_behavior()
import argparse
import time
import numpy as np
import matplotlib.pyplot as plt
import math
from scipy import stats
import scipy
import os
import datetime
from math import sqrt
from math import log
import tensorflow_probability as tfp

dim_red = 1  # perform PCA on the codes and plot the first two components
plot_on = 1  # plot the results, otherwise only textual output is returned
interp_on = 0  # interpolate data (needed if the input time series have different length)
tied_weights = 0  # train an AE where the decoder weights are the encoder weights transposed
lin_dec = 1  # train an AE with linear activations in the decoder

# parse input data
parser = argparse.ArgumentParser()
parser.add_argument("--code_size", default=20, help="size of the code", type=int)
parser.add_argument("--w_reg", default=0.001, help="weight of the regularization in the loss function", type=float)
parser.add_argument("--a_reg", default=0.2, help="weight of the kernel alignment", type=float)
parser.add_argument("--num_epochs", default=5000, help="number of epochs in training", type=int)
parser.add_argument("--batch_size", default=25, help="number of samples in each batch", type=int)
parser.add_argument("--max_gradient_norm", default=1.0, help="max gradient norm for gradient clipping", type=float)
parser.add_argument("--learning_rate", default=0.001, help="Adam initial learning rate", type=float)
parser.add_argument("--hidden_size", default=30, help="size of the code", type=int)
args = parser.parse_args()
print(args)

# ================= DATASET =================
# (train_data, train_labels, train_len, _, K_tr,
#  valid_data, _, valid_len, _, K_vs,
#  test_data_orig, test_labels, test_len, _, K_ts) = getBlood(kernel='TCK',
#                                                             inp='zero')  # data shape is [T, N, V] = [time_steps, num_elements, num_var]

train_data = np.random.rand(9000,6)
train_labels = np.ones([9000,1])
train_len = 9000

valid_data = np.random.rand(9000,6)
valid_len = 9000

test_data = np.random.rand(1500,6)
test_labels = np.ones([1500,1])

K_tr = np.random.rand(9000,9000)
K_ts = np.random.rand(1500,1500)
K_vs =  np.random.rand(9000,9000)

print(
    '\n**** Processing Blood data: Tr{}, Vs{}, Ts{} ****\n'.format(train_data.shape, valid_data.shape, test_data.shape))

input_length = train_data.shape[1]  # same for all inputs

# ================= GRAPH =================

# init session
# tf.reset_default_graph() # needed when working with iPython
sess = tf.Session()


# placeholders
encoder_inputs = tf.placeholder(shape=(None, input_length), dtype=tf.float32, name='encoder_inputs')
prior_K = tf.placeholder(shape=(None, None), dtype=tf.float32, name='prior_K')

# ----- ENCODER -----
We1 = tf.Variable(
    tf.random_uniform((input_length, args.hidden_size), -1.0 / math.sqrt(input_length), 1.0 / math.sqrt(input_length)))
We2 = tf.Variable(tf.random_uniform((args.hidden_size, args.code_size), -1.0 / math.sqrt(args.hidden_size),
                                    1.0 / math.sqrt(args.hidden_size)))

be1 = tf.Variable(tf.zeros([args.hidden_size]))
be2 = tf.Variable(tf.zeros([args.code_size]))


hidden_1 = tf.nn.tanh(tf.matmul(encoder_inputs, We1) + be1)
code = tf.nn.tanh(tf.matmul(hidden_1, We2) + be2)

# kernel on codes
code_K = tf.tensordot(code, tf.transpose(code), axes=1)

print("CODE K:", code_K)

print("Shape prior_K:", (tf.shape(prior_K)))
print("Code_k shape:", tf.shape(code_K))


# ----- DECODER -----
if tied_weights:
    Wd1 = tf.transpose(We2)
    Wd2 = tf.transpose(We1)
else:
    Wd1 = tf.Variable(tf.random_uniform((args.code_size, args.hidden_size), -1.0 / math.sqrt(args.code_size),
                                        1.0 / math.sqrt(args.code_size)))
    Wd2 = tf.Variable(tf.random_uniform((args.hidden_size, input_length), -1.0 / math.sqrt(args.hidden_size),
                                        1.0 / math.sqrt(args.hidden_size)))

bd1 = tf.Variable(tf.zeros([args.hidden_size]))
bd2 = tf.Variable(tf.zeros([input_length]))



if lin_dec:
    hidden_2 = tf.matmul(code, Wd1) + bd1
else:
    hidden_2 = tf.nn.tanh(tf.matmul(code, Wd1) + bd1)

dec_out = tf.matmul(hidden_2, Wd2) + bd2

# ----- LOSS -----
# kernel alignment loss with normalized Frobenius norm
code_K_norm = code_K / tf.norm(code_K, ord='fro', axis=[-2, -1])
prior_K_norm = prior_K / tf.norm(prior_K, ord='fro', axis=[-2, -1])
k_loss = tf.norm(code_K_norm - prior_K_norm, ord='fro', axis=[-2,-1])

# reconstruction loss
parameters = tf.trainable_variables()
print ("PARAMS:", (parameters))
optimizer = tf.train.AdamOptimizer(args.learning_rate)
reconstruct_loss = tf.losses.mean_squared_error(labels=dec_out, predictions=encoder_inputs)


# L2 loss
reg_loss = 0
for tf_var in tf.trainable_variables():
    reg_loss += tf.reduce_mean(tf.nn.l2_loss(tf_var))

print ("REG_LOSS:", reg_loss)
tot_loss = reconstruct_loss + args.w_reg * reg_loss + args.a_reg * k_loss

# Calculate and clip gradients
print ("TOT LOSS:", tot_loss)

gradients = tf.gradients(tot_loss,parameters)
print ("GRADS:", gradients)
clipped_gradients, _ = tf.clip_by_global_norm(gradients, args.max_gradient_norm)
update_step = optimizer.apply_gradients(zip(clipped_gradients, parameters))

sess.run(tf.global_variables_initializer())


# trainable parameters count
total_parameters = 0
for variable in tf.trainable_variables():
    shape = variable.get_shape()
    variable_parameters = 1
    for dim in shape:
        variable_parameters *= dim.value
    total_parameters += variable_parameters
print('Total parameters: {}'.format(total_parameters))

# ============= TENSORBOARD =============
mean_grads = tf.reduce_mean([tf.reduce_mean(grad) for grad in gradients])
tf.summary.scalar('mean_grads', mean_grads)
tf.summary.scalar('reconstruct_loss', reconstruct_loss)
tf.summary.scalar('k_loss', k_loss)
tvars = tf.trainable_variables()
for tvar in tvars:
    tf.summary.histogram(tvar.name.replace(':', '_'), tvar)
merged_summary = tf.summary.merge_all()

# ================= TRAINING =================

# initialize training variables
time_tr_start = time.time()
batch_size = args.batch_size
max_batches = train_data.shape[0] // batch_size
loss_track = []
kloss_track = []
min_vs_loss = np.infty
model_name = "logs/dkae_models/m_0.ckpt"
#train_writer = tf.summary.FileWriter('/logs/tensorboard', graph=sess.graph)
saver = tf.train.Saver()


logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
train_writer = tf.compat.v1.summary.FileWriter(
    logdir,
    graph=None,
    max_queue=10,
    flush_secs=120,
    graph_def=None,
    filename_suffix=None,
    session=None
)

try:
    for ep in range(args.num_epochs):

        # shuffle training data
        idx = np.random.permutation(train_data.shape[0])
        train_data_s = train_data[idx, :]
        K_tr_s = K_tr[idx, :][:, idx]

        for batch in range(max_batches):
            fdtr = {encoder_inputs: train_data_s[(batch) * batch_size:(batch + 1) * batch_size, :],
                    prior_K: K_tr_s[(batch) * batch_size:(batch + 1) * batch_size,
                             (batch) * batch_size:(batch + 1) * batch_size]
                    }


            _, train_loss, train_kloss = sess.run([update_step, reconstruct_loss, k_loss], fdtr)

            loss_track.append(train_loss)
            kloss_track.append(train_kloss)

        # check training progress on the validations set (in blood data valid=train)
        if ep % 100 == 0:
            print('Ep: {}'.format(ep))

            fdvs = {encoder_inputs: valid_data,
                    prior_K: K_vs}
            outvs, lossvs, klossvs, vs_code_K, summary = sess.run(
                [dec_out, reconstruct_loss, k_loss, code_K, merged_summary], fdvs)
            train_writer.add_summary(summary, ep)
            print('VS r_loss=%.3f, k_loss=%.3f -- TR r_loss=%.3f, k_loss=%.3f' % (
            lossvs, klossvs, np.mean(loss_track[-100:]), np.mean(kloss_track[-100:])))

            # Save model yielding best results on validation
            if lossvs < min_vs_loss:
                min_vs_loss = lossvs
                tf.add_to_collection("encoder_inputs", encoder_inputs)
                tf.add_to_collection("dec_out", dec_out)
                tf.add_to_collection("reconstruct_loss", reconstruct_loss)
                save_path = saver.save(sess, model_name)

except KeyboardInterrupt:
    print('training interrupted')

time_tr_end = time.time()
print('Tot training time: {}'.format((time_tr_end - time_tr_start) // 60))

sess.close()

The code can be run as

!python3 filename.py --code_size 5 --w_reg 0.001 --a_reg 0.1 --num_epochs 200 --max_gradient_norm 0.5 --learning_rate 0.001 --hidden_size 30 

Appreciate your help here.

Check the number of parameters in each layer in both frameworks to narrow down which layers are different.
In PyTorch something like this should work:

for name, module in model.named_modules():
    weight = getattr(module, "weight", None)
    if weight is not None:
        print('{}.weight.nelement {}'.format(name, weight.nelement()))
    bias = getattr(module, "bias", None)
    if bias is not None:
        print('{}.bias.nelement {}'.format(name, bias.nelement()))

Hi @ptrblck, thanks for your time. It is not printing anything; my model is a custom model, and that may be the reason it cannot find the layers. Can you please have a look at it? Also, about the training time issue: the PyTorch code snippet I pasted above can be easily reproduced. Please let me know if I am doing something wrong that is causing this high training time. Thank you!
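A model like this one holds raw nn.Parameter attributes and no child modules, so the named_modules loop above finds no .weight or .bias to print. Iterating named_parameters() directly should list everything the optimizer sees; a minimal sketch:

for name, param in model.named_parameters():
    print('{}.nelement {}'.format(name, param.nelement()))
# Expected here: only We1, We2, be1, be2 appear, because the decoder weights
# are created inside decoder() and never registered on the module.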

Hi @ptrblck, I have printed all the dimensions, and both codes print the same shapes. Still, I get 731 trainable parameters for the TF code, whereas for the PyTorch code it is 365.

So the problem seems to be with how the trainable parameters are counted. Can you please explain whether this has anything to do with my training? I do not get a similar reconstruction loss optimization.
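A likely cause, judging from the decoder() posted above: Wd1, Wd2, bd1 and bd2 are created as fresh nn.Parameters on every forward call, so they are never registered on the module. model.parameters() therefore counts only the encoder (with the run command's sizes: We1 6x30 + be1 30 + We2 30x5 + be2 5 = 365), while the TF graph registers the decoder variables once and counts them too (365 + 366 = 731). It also means the decoder weights are re-randomized each batch and never updated by the optimizer, which would hurt the reconstruction loss. A sketch of registering them once in __init__ instead:

# inside Model.__init__, next to the encoder parameters:
self.Wd1 = torch.nn.Parameter(
    torch.Tensor(args.code_size, args.hidden_size).uniform_(
        -1.0 / math.sqrt(args.code_size), 1.0 / math.sqrt(args.code_size)))
self.Wd2 = torch.nn.Parameter(
    torch.Tensor(args.hidden_size, input_length).uniform_(
        -1.0 / math.sqrt(args.hidden_size), 1.0 / math.sqrt(args.hidden_size)))
self.bd1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
self.bd2 = torch.nn.Parameter(torch.zeros([input_length]))
# ... decoder() then uses self.Wd1, self.Wd2, self.bd1, self.bd2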

# trainable parameters count
total_parameters = 0
for variable in tf.trainable_variables():
    shape = variable.get_shape()
    variable_parameters = 1
    for dim in shape:
        variable_parameters *= dim.value
    total_parameters += variable_parameters
print('Total parameters: {}'.format(total_parameters))

And, the reconstruction loss is calculated as below:
TF:

reconstruct_loss = tf.losses.mean_squared_error(labels=dec_out, predictions=encoder_inputs)

Pytorch:

reconstruct_loss = torch.mean((dec_out - encoder_inputs) ** 2)
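These two should match: both compute the mean of the squared differences over all elements. For reference, the equivalent PyTorch built-in form:

reconstruct_loss = torch.nn.functional.mse_loss(dec_out, encoder_inputs)  # same as torch.mean((dec_out - encoder_inputs) ** 2)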

Thank you!

Can you guys also help me with this backpropagation matter? I am currently running some tests on a modified architecture for DCGAN, where a single generator outputs images at 4x4, 8x8, 16x16, 32x32 and 64x64. Each output goes through a dedicated discriminator (one discriminator for 4x4, another for 8x8, a third for 16x16…), and I'd like to make each discriminator independent of the others.

This is my test function’s code:

for epoch in range(epochs):
        D4.zero_grad()
        # Format batch
        real_cpu = data[np.random.randint(0, data.shape[0], size=batch_size), :, :, :].to(device)
        label = torch.full((real_cpu.shape[0],), real_label, dtype=torch.float, device=device)
        # Forward pass real batch through D
        output = D4(real_cpu).view(-1)
        # Calculate loss on all-real batch
        errD4_real = loss(output, label)
        # Calculate gradients for D in backward pass
        errD4_real.backward()

        ## Train with all-fake batch
        # Generate batch of latent vectors
        noise = torch.randn(batch_size, 100, 1, 1, device=device)
        # Generate fake image batch with G
        output1, output2, output3, output4, output = netG(noise)
        label.fill_(fake_label)
        
        # REMEMBER:
        # output1 = 4x4
        # output2 = 8x8
        # output3 = 16x16
        # output4 = 32x32
        # output = 64x64
        
        # Classify all fake batch with D
        Dout1, Dout2 = D4(output1.detach()).view(-1), D8(output2.detach()).view(-1)
        Dout3, Dout4 = D16(output3.detach()).view(-1), D32(output4.detach()).view(-1)
        Dout = D64(output.detach()).view(-1)
        
        # Calculate D's loss on the all-fake batch
        errD4_fake, errD8_fake, errD16_fake = loss(Dout1, label), loss(Dout2, label), loss(Dout3, label)
        errD32_fake, errD64_fake = loss(Dout4, label), loss(Dout, label)
        
        # Calculate the gradients for this batch
        errD4_fake.backward()
        optimizerD4.step()
        
        D8.zero_grad()
        errD8_fake.backward()
        optimizerD8.step()
        
        D16.zero_grad()
        errD16_fake.backward()
        optimizerD16.step()
        
        D32.zero_grad()
        errD32_fake.backward()
        optimizerD32.step()
        
        D64.zero_grad()
        errD64_fake.backward()
        optimizerD64.step()

        # (2) Update G network: maximize log(D(G(z)))
        netG.zero_grad()
        label.fill_(real_label)  # fake labels are real for generator cost
        # Since we just updated D, perform another forward pass of all-fake batch through D
        
        output1, output2 = D4(output1).view(-1), D8(output2).view(-1)
        output3, output4 = D16(output3).view(-1), D32(output4).view(-1)
        output = D64(output).view(-1)
        
        # Calculate G's loss based on this output
        
        errG1 = loss(output1, label)
        errG2 = loss(output2, label)
        errG3 = loss(output3, label)
        errG4 = loss(output4, label)
        errG5 = loss(output, label)
        # Calculate gradients for G
        
        errG1.backward(retain_graph=True)
        errG2.backward(retain_graph=True)
        errG3.backward(retain_graph=True)
        errG4.backward(retain_graph=True)
        errG5.backward()
        
        # Update G
        optimizerG.step()

From what I've understood from Brando Miranda and colesbury above, this function should achieve what I want, but I'm not quite sure about it. Should I make some adjustments? Is there a way for me to check whether this is working or not?
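One way to check the isolation is to verify that a backward pass on one discriminator's loss leaves the other discriminators' gradients untouched. A minimal sketch, assuming the names above:

# zero everything, backprop only D4's fake loss, then confirm D8 got no gradient
for D in (D4, D8, D16, D32, D64):
    D.zero_grad()
errD4_fake.backward(retain_graph=True)   # retain_graph only so the test can be re-run
d8_untouched = all(p.grad is None or p.grad.abs().sum() == 0 for p in D8.parameters())
print('D8 isolated from D4 loss:', d8_untouched)   # expected: True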

If I have already calculated
x.grad += dloss/dx for the penultimate step, how will the rest of the steps change?

I’m not sure if I understand your question correctly, but if you already have a gradient in x.grad, you just need to call opt.step() as per usual to update x.
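A minimal sketch of that, assuming a single tensor x and plain SGD: you can populate .grad yourself, and step() will consume it:

import torch

x = torch.zeros(3, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)

x.grad = torch.ones_like(x)   # stand-in for a gradient accumulated earlier
opt.step()                    # applies x <- x - lr * x.grad
print(x)                      # tensor([-0.1000, -0.1000, -0.1000], requires_grad=True)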

So do I have to do this for backward?

        with torch.no_grad():
            loss_score = - grads * y_true

I would need to have more context to know what you are trying to do, but since you’re doing it in no-grad mode it will not affect gradient computation.

optimizer = optim.Adam([{'params': net_0.parameters()},
                        {'params': net_t.parameters()}], lr=learning_rate)

for i, data in enumerate(train_loader):
    data_0, data_t = data[0], data[1]
    optimizer.zero_grad()

    out_0 = net_0(data_0)
    out_t = net_t(data_t)

    loss = loss_mod(out_0, out_t)
    o = loss.shape[-1] // 2

    loss_net_0 = loss[:, :o]  # gradients for net_0
    loss_net_t = loss[:, o:]  # gradients for net_t
    out_0.backward(loss_net_0)
    out_t.backward(loss_net_t)
Will this work?

You didn’t call optimizer.step() after calling backward, but looks OK otherwise.
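For reference, passing a tensor to backward() computes a vector-Jacobian product: the tensor is treated as the incoming gradient dL/d(out), which is what out_0.backward(loss_net_0) relies on. A minimal sketch:

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x ** 2                    # non-scalar output
v = torch.tensor([1.0, 0.5])  # incoming gradient dL/dy
y.backward(v)                 # computes v^T @ (dy/dx)
print(x.grad)                 # tensor([2., 2.])  == v * 2x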

Yes, there is an optimizer.step().
I don't have to do net_0.backward(loss_net_0), right?