We are training the following architecture:
```python
import torch


class AbstractDenseNet(torch.nn.Module):
    def __init__(self, in_features: int, depth: int, width: int):
        super(AbstractDenseNet, self).__init__()
        self._depth = depth
        self._width = width
        self._in_features = in_features
        # Each layer appends `width` features, so the output grows linearly with depth.
        self._out_features = in_features + width * depth
        self.layers = torch.nn.ModuleList(
            [
                self.get_layer(in_features=in_features + i * width, out_features=width)
                for i in range(depth)
            ]
        )

    def get_layer(self, in_features: int, out_features: int) -> torch.nn.Module:
        raise NotImplementedError

    def forward(self, tensor: torch.Tensor) -> torch.Tensor:
        cat_dim = len(tensor.size()) - 1  # concatenate along the feature dimension
        for i in range(self._depth):
            cur_out = self.layers[i](tensor)
            # Every iteration allocates a new, strictly larger tensor.
            tensor = torch.cat([tensor, cur_out], dim=cat_dim)
        return tensor
```
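For reference, `get_layer` is bound in subclasses; a minimal hypothetical subclass (our real layers are more involved, this is only to show the usage and output shape):

```python
class LinearDenseNet(AbstractDenseNet):
    # Hypothetical subclass for illustration only.
    def get_layer(self, in_features: int, out_features: int) -> torch.nn.Module:
        return torch.nn.Sequential(
            torch.nn.Linear(in_features, out_features),
            torch.nn.ReLU(),
        )


net = LinearDenseNet(in_features=1024, depth=16, width=128)
out = net(torch.randn(32, 1024))  # shape: (32, 1024 + 16 * 128) == (32, 3072)
```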
This code calls `torch.cat` at every layer and blows up in memory (the initial tensor is very large), so we cannot train the network at sufficient depth on our GPUs.
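To put rough numbers on it (hypothetical sizes, assuming float32 at 4 bytes per element): every `torch.cat` materializes a fresh copy of the whole prefix, and all of these intermediates stay alive for the backward pass, so the concatenations alone hold

```python
# Hypothetical sizes, for illustration only; not our real configuration.
batch, in_features, depth, width = 64, 500_000, 64, 128

# After layer i the concatenated tensor has in_features + (i + 1) * width
# features, and each of these intermediates is retained until backward.
total_elems = batch * sum(in_features + (i + 1) * width for i in range(depth))
print(f"{total_elems * 4 / 2**30:.1f} GiB")  # ~7.7 GiB for these sizes
```

Since every layer re-copies the full prefix, the cost is dominated by the `depth * in_features` term when the input is large, which is exactly why a big initial tensor makes deep configurations infeasible.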
The obvious optimization here, from an algorithmic perspective, is to use one big preallocated buffer and fill it in place during the forward pass. However, this does not work with autograd, which forbids in-place modification of tensors that are needed for gradient computation.
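Concretely, the forward pass we would like to write looks like this sketch (the buffer layout mirrors the `cat` order; the buffer handling is our illustration):

```python
def forward(self, tensor: torch.Tensor) -> torch.Tensor:
    # One buffer for the input plus all layer outputs.
    buffer = tensor.new_empty(tensor.shape[:-1] + (self._out_features,))
    buffer[..., : self._in_features] = tensor
    offset = self._in_features
    for i in range(self._depth):
        cur_out = self.layers[i](buffer[..., :offset])
        # In-place write into a tensor whose views autograd has already saved:
        buffer[..., offset : offset + self._width] = cur_out
        offset += self._width
    return buffer
```

Each layer saves its input slice (a view of `buffer`) for backward, and the next slice assignment bumps the buffer's version counter, so `backward()` fails with the usual "one of the variables needed for gradient computation has been modified by an inplace operation" RuntimeError.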
How can we avoid this restriction?