How to implement a deep neural network with different losses for different layers?

Hi all! I’m new to PyTorch. Recently, I’ve wanted to implement a special deep neural network, but I haven’t found any related PyTorch examples to help me realize it. This special network has the following properties:

  1. Each layer has its own loss function, and the parameters and hidden representation at the current layer are learned by minimizing this loss. Note that there is no closed-form expression for the hidden representation, so we have to perform an iterative optimization process to infer it.

  2. The optimization process for one layer only affects the parameters and representation of that layer, not those of other layers.

  3. For the ith layer, the update of its parameters W^{i} and representation y^{i} depends on y^{i-1} and y^{i+1}.

I don’t know how to build the computational graph for such a model. Any comments and suggestions would be appreciated; related code examples or links would be even better. Many thanks! :grin:

There’s probably a more elegant solution, but you could do it manually, I suppose, as follows:

from torch.autograd import grad

# implement your model

# in your training loop:

    # fetch all intermediate outputs:
    outputs = model.forward(x_var, y_var)

    # compute cost for a particular layer
    layer_cost = layer_cost_fn(outputs[some_index], y_true)

    # compute gradients
    partial_derivatives = grad(layer_cost, (model.your_layer.weights, 
                                            model.your_layer.bias))

    # update model params
    model.your_layer.weights.data -= learning_rate * partial_derivatives[0]
    model.your_layer.bias.data -= learning_rate * partial_derivatives[1]




Thanks for your warm suggestion, @rasbt.

If I understand you correctly, your method first computes all layers’ representations (intermediate outputs); then uses those outputs to obtain the cost for each particular layer; and finally computes the gradients of each layer’s parameters and updates them separately.

This almost solves my problem. However, as I mentioned above, the hidden representations (outputs) don’t have closed-form expressions, and their values depend on the adjacent layers’ representations (outputs). Therefore, it seems that those representations cannot be computed directly in a forward function.

In order to compute the representations (outputs) of all layers, one has to perform an iterative optimization process, so in this case the problem seems intractable. :confused:
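To make this concrete, here is a rough, self-contained sketch of the kind of per-layer update I have in mind (the quadratic losses, the tensor sizes, and the fixed number of inner iterations are only placeholders): the representation y^{i} is itself a tensor that has to be optimized iteratively, given y^{i-1} and y^{i+1}.

import torch

# toy sizes and tensors for a single layer i (all values are placeholders)
d_prev, d_i, d_next = 64, 32, 16
y_prev = torch.randn(1, d_prev)   # y^{i-1}, held fixed during this layer's update
y_next = torch.randn(1, d_next)   # y^{i+1}, held fixed during this layer's update

W_i = torch.randn(d_prev, d_i, requires_grad=True)   # parameters W^{i} of layer i
W_next = torch.randn(d_i, d_next)                    # parameters of layer i+1 (not updated here)
y_i = torch.zeros(1, d_i, requires_grad=True)        # representation y^{i}, inferred iteratively

lr = 0.01
for _ in range(50):  # iterative inference: no closed-form solution for y^{i}
    # example layer-wise loss: y^{i} should be consistent with both neighbours
    loss_i = ((y_prev @ W_i - y_i) ** 2).sum() + ((y_i @ W_next - y_next) ** 2).sum()

    dy_i, dW_i = torch.autograd.grad(loss_i, (y_i, W_i))
    with torch.no_grad():
        y_i -= lr * dy_i   # update only this layer's representation...
        W_i -= lr * dW_i   # ...and this layer's parameters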

your method first computes all layers’ representations (intermediate outputs); then uses those outputs to obtain the cost for each particular layer; and finally computes the gradients of each layer’s parameters and updates them separately.

yeah, but you don’t need to compute all the layers’ representations upfront; this was just an example. You could, for example, fetch the output of the 1st layer, do a gradient update, fetch the output of the 2nd layer, do a gradient update, and so forth.
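Very roughly, something like this (just a toy sketch with a two-layer model; the per-layer targets and squared-error losses are placeholders for whatever your layer-wise costs are):

import torch
import torch.nn as nn

# toy two-layer model; each layer gets its own (placeholder) target and loss
layers = nn.ModuleList([nn.Linear(10, 8), nn.Linear(8, 4)])
layer_targets = [torch.randn(1, 8), torch.randn(1, 4)]
x = torch.randn(1, 10)
learning_rate = 0.01

h = x
for layer, target in zip(layers, layer_targets):
    h = layer(h.detach())                     # detach so the update cannot touch earlier layers
    layer_cost = ((h - target) ** 2).mean()   # this layer's own loss

    d_w, d_b = torch.autograd.grad(layer_cost, (layer.weight, layer.bias))
    with torch.no_grad():
        layer.weight -= learning_rate * d_w   # update only this layer's parameters
        layer.bias -= learning_rate * d_b

The detach() call is what keeps each update local to its own layer, so the gradient of one layer’s cost never propagates into the layers before it.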

Hi, Sebastian (@rasbt),

Thank you so much for your kind help. Your reply may provide a feasible solution to my problem, and I will give it a try. :grinning:

Hi, @rasbt.

Recently, I have been trying to implement my model based on your suggestions, and I have encountered a problem when computing gradients.

Specifically, in TensorFlow one can use optimizer.compute_gradients() to get the parameters’ gradients. However, in PyTorch I haven’t found such a function or module that can be used to compute the gradients of my own loss function.

So, do you have any hints?

Thank you in advance.

The grad function from the autograd submodule in PyTorch doesn’t work with your cost function?
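It should handle any scalar loss built from differentiable tensor operations; for example, here is a tiny self-contained check (the loss is arbitrary, not specific to your model):

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

# an arbitrary hand-written loss
my_loss = torch.sum((w * x - 1.0) ** 2) + 0.1 * torch.sum(torch.abs(w))

# grad returns a tuple with one gradient per input tensor, without touching .grad
(dw,) = torch.autograd.grad(my_loss, (w,))
print(dw)   # same shape as w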

Thank you again, @rasbt. I’m trying this function.

Hi @zjsong,
I am working on a similar model and wonder whether you have seen any paper proposing such a model.

Hi, @Cagri_Kaplan,

Sorry for the late reply. As far as I know, one of the most distinctive characteristics of such a model is the iterative inference mechanism. Representative models with this mechanism include:

  1. Predictive Sparse Decomposition (PSD), which introduces a regression mapping into the original sparse coding model to implement approximate inference.

  2. Deconvolutional Networks, which can be used to learn representations and generate images by the convolutional operation.

  3. Predictive Coding (PC), which imitates the hierarchical architecture and the information processing mechanism of the human visual cortex.

  4. Fast Inference Predictive Coding, which extends the basic PC model to implement approximate inference, like PSD. Another recent work on PC is the Predictive Coding Network, which disentangles the bottom-up and top-down information flows in PC to solve the object recognition task.

Hope these references help. I’m glad to share my understanding of this topic with you.

Regards,
Zengjie Song


Thank you so much, @zjsong.

How can I contact you? I’d like to consult you on some topics.

Would you please drop me a line at cagrikaplan@gmail.com?

Thank you again, I appreciate it.

Best Regards.