Hello!
We are training the following architecture:
```python
import torch


class AbstractDenseNet(torch.nn.Module):
    def __init__(
        self,
        in_features: int,
        depth: int,
        width: int
    ):
        super(AbstractDenseNet, self).__init__()
        self._depth = depth
        self._width = width
        self._in_features = in_features
        self._out_features = in_features + width * depth
        self.layers = torch.nn.ModuleList(
            [
                self.get_layer(in_features=in_features + i * width, out_features=width)
                for i in range(depth)
            ]
        )

    def get_layer(self, in_features: int, out_features: int) -> torch.nn.Module:
        raise NotImplementedError

    def forward(self, tensor: torch.Tensor) -> torch.Tensor:
        cat_dim = len(tensor.size()) - 1
        for i in range(self._depth):
            cur_out = self.layers[i](tensor)
            tensor = torch.cat([tensor, cur_out], dim=cat_dim)
        return tensor
```
This code calls `torch.cat` at every layer and blows up in memory (the initial tensor is very large), so we can't train the network at sufficient depth on our GPUs.
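To quantify the blow-up, here is a back-of-the-envelope count (the helper name is made up): step `i` of the loop allocates a brand-new tensor of `in_features + (i + 1) * width` features, so the concatenations alone materialize roughly `depth * in_features + width * depth * (depth + 1) / 2` features per sample, quadratic in depth.

```python
def cat_activations(in_features: int, depth: int, width: int) -> int:
    """Features materialized per sample by the torch.cat chain alone
    (hypothetical helper; ignores the layers' own internal activations)."""
    total = 0
    for i in range(depth):
        # step i allocates a brand-new tensor of this many features
        total += in_features + (i + 1) * width
    return total
```

For example, `cat_activations(1024, 10, 64)` gives 13760 features per sample, versus the 1664 the final tensor actually needs.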
The obvious optimization, from an algorithmic perspective, is to preallocate one big buffer and fill it in place during the forward pass. However, this does not work with autograd, which forbids in-place modification of tensors that have been saved for the backward pass.
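For reference, a minimal sketch of the restriction in question (the buffer sizes are arbitrary): once an operation has saved a view of the buffer for its backward pass, any further in-place write into the buffer invalidates that saved tensor.

```python
import torch

x = torch.ones(3, requires_grad=True)
buf = torch.zeros(6)         # hypothetical preallocated buffer
buf[:3] = x                  # in-place write into the buffer
y = (buf[:3] ** 2).sum()     # pow() saves buf[:3] for its backward pass
buf[3:] = buf[:3] + 1        # second in-place write bumps buf's version counter
try:
    y.backward()
except RuntimeError as err:
    # "... has been modified by an inplace operation"
    print(err)
```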
How can we work around this restriction?
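Not an authoritative answer, but one direction sometimes used for dense blocks: wrap the whole block in a custom `torch.autograd.Function`. Autograd does not record operations inside `forward()`, so in-place writes into one preallocated buffer are legal there; the price is a hand-written `backward()`. Below is a sketch for the special case where every layer is a plain `Linear(in_features + i * width, width)` with no activation — all names are made up, and double backward is not supported:

```python
import torch


class DenseBlockFn(torch.autograd.Function):
    """Fill ONE preallocated buffer in place instead of chaining torch.cat.
    Sketch only: assumes each layer is a bare Linear, no double backward."""

    @staticmethod
    def forward(ctx, x, *params):
        # params are interleaved: W_0, b_0, W_1, b_1, ...
        weights, biases = params[0::2], params[1::2]
        f0, width = x.size(1), weights[0].size(0)
        buf = x.new_empty(x.size(0), f0 + len(weights) * width)
        buf[:, :f0] = x
        for i, (w, b) in enumerate(zip(weights, biases)):
            s = f0 + i * width
            # layer i reads all features so far, writes its slice in place
            buf[:, s:s + width] = buf[:, :s] @ w.t() + b
        ctx.save_for_backward(buf, *weights)
        ctx.f0, ctx.width = f0, width
        return buf

    @staticmethod
    def backward(ctx, grad_out):
        buf, *weights = ctx.saved_tensors
        f0, width = ctx.f0, ctx.width
        g = grad_out.clone()  # running total gradient over all buffer columns
        grads = []
        for i in reversed(range(len(weights))):
            s = f0 + i * width
            g_i = g[:, s:s + width]             # total grad of layer i's output
            grads.append(g_i.sum(0))            # grad of b_i
            grads.append(g_i.t() @ buf[:, :s])  # grad of W_i
            g[:, :s] += g_i @ weights[i]        # push grad to earlier features
        return (g[:, :f0], *reversed(grads))
```

It would be called as `DenseBlockFn.apply(x, W_0, b_0, W_1, b_1, ...)` with the layers' parameters, and its gradients can be checked against the `torch.cat` version (or with `torch.autograd.gradcheck` in double precision). The buffer still has to be saved for backward, but only once, instead of one growing copy per layer.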