Online Knowledge Distillation

You could register forward hooks on the penultimate layers of M2 and M3 as described here.

During the forward pass of these models the hooks will be called and the activations will be stored, e.g. in a dict. Once this is done, you can pass the activations from the dict to M5 and continue the training.

Depending on whether you want to calculate the gradients of the loss w.r.t. M2 and M3, you could either store the intermediate activations directly or `detach()` them in the forward hook.
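A minimal sketch of this approach (the model definitions, the `m2`/`m3`/`m5` names, and the choice of penultimate layer are assumptions for illustration, since your actual architectures weren't posted). Here M2 keeps its activations attached to the graph, while M3's are detached, so the backward pass only updates M2:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for your models; replace with your own modules.
m2 = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 4))
m3 = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 4))
m5 = nn.Linear(32, 2)  # consumes the concatenated penultimate activations

acts = {}

def get_hook(name, detach=False):
    def hook(module, inp, out):
        # Store the penultimate activation; detach() it if no gradients
        # should flow back into the corresponding model.
        acts[name] = out.detach() if detach else out
    return hook

# Register the hooks on the penultimate layers (index 1 here: the ReLU).
m2[1].register_forward_hook(get_hook("m2"))
m3[1].register_forward_hook(get_hook("m3", detach=True))

x = torch.randn(8, 10)
_ = m2(x)  # hooks fire during these forward passes
_ = m3(x)

# Feed the stored activations to M5 and train as usual.
out = m5(torch.cat([acts["m2"], acts["m3"]], dim=1))
loss = out.mean()
loss.backward()
```

After `backward()`, `m2`'s parameters will have gradients (its activations stayed in the graph), while `m3`'s will not, since its hook detached the stored tensor.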