This layer represents 4 linear layers that run in parallel, where each of them processes vectors of 16 dimensions and returns vectors of the same dimensionality.

Now, suppose that we want each of these layers to process the same input x = torch.randn(N, 16).
To do that, we need to transpose and repeat the input as follows:

x = x.t().repeat(4, 1)

The input will now have shape (64, N), and we could run layer(x).
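As a minimal sketch of the setup above (assuming the layer in question is a grouped 1x1 convolution, nn.Conv1d(64, 64, 1, groups=4), which is one common way to express 4 parallel 16-to-16 linear maps):

```python
import torch
import torch.nn as nn

# assumed layer: 4 independent 16->16 maps as one grouped 1x1 convolution
layer = nn.Conv1d(64, 64, kernel_size=1, groups=4, bias=False)

N = 8
x = torch.randn(N, 16)

# replicate the input for all 4 groups: (N, 16) -> (16, N) -> (64, N)
x = x.t().repeat(4, 1)   # shape: (64, N)

# conv1d expects (batch, channels, length), so add a batch dim
out = layer(x.unsqueeze(0))  # shape: (1, 64, N)
```

Each group of 16 output channels is then the result of one of the 4 parallel linear maps applied to the same original input.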

For very large N or vector dimensions, this repeat operation allocates a huge tensor, even though we know it contains nothing but the same data repeated four times.
My question is: Can we avoid this memory footprint?

I currently don’t see a way to use a combination of expand and view to avoid this issue.
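To illustrate why expand plus view doesn't help here: expand can broadcast a size-1 dimension without allocating memory, but flattening that broadcast dimension back into the channel dimension forces a real copy. A small sketch (the shapes follow the example above; N = 8 is an arbitrary choice):

```python
import torch

N = 8
x = torch.randn(N, 16)

# expand adds a stride-0 dimension without copying any data
x_exp = x.t().contiguous().unsqueeze(0).expand(4, 16, N)  # no allocation

# but view cannot merge the stride-0 dim into the channel dim
try:
    x_exp.view(64, N)
except RuntimeError:
    print("view fails on the expanded tensor")

# reshape succeeds, but only because it silently materializes the copy
x_full = x_exp.reshape(64, N)  # allocates the full repeated tensor
```

So expand only defers the allocation; the moment the layer needs a (64, N) tensor, the copy happens anyway.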

Yes, I don’t believe this would directly work, since you would increase the weight matrix from [64, 16, 1] to [64, 64, 1]. Of course, you could then try to repeat the 16 filters again to create a kernel with the same weights, but this approach sounds quite wasteful, since you would be:

- repeating the filter kernels,
- repeating the input tensor, and
- increasing the computational workload,

all without necessity, since nn.Conv1d(16, 64, 1, bias=False) will just work without any repetitions.
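A quick way to see the equivalence: the grouped layer nn.Conv1d(64, 64, 1, groups=4) and the plain nn.Conv1d(16, 64, 1) both have a weight of shape [64, 16, 1], and with the repeated input each group just sees a copy of the same 16 channels. Copying the weights over (a sketch, with arbitrary N = 8) shows the outputs match:

```python
import torch
import torch.nn as nn

N = 8
x = torch.randn(N, 16)

grouped = nn.Conv1d(64, 64, 1, groups=4, bias=False)  # weight: (64, 16, 1)
plain = nn.Conv1d(16, 64, 1, bias=False)              # weight: (64, 16, 1)

# identical weight shapes, so they can be copied directly
with torch.no_grad():
    plain.weight.copy_(grouped.weight)

# grouped path: needs the repeated input, shape (1, 64, N)
out_grouped = grouped(x.t().repeat(4, 1).unsqueeze(0))

# plain path: runs on the original input, shape (1, 16, N)
out_plain = plain(x.t().unsqueeze(0))

print(torch.allclose(out_grouped, out_plain, atol=1e-6))
```

So the plain convolution computes the same 64 outputs directly from the 16 input channels, with no repeated input and no repeated filters.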