Unfortunately, I still haven’t received an answer. In general it’s really a pity that it’s so hard to get an answer from authors in academia. /endofrant

Looking at Fig. 1 from that paper, I am now fairly sure that concatenation is meant:

Those 3x3, 5x5, and 7x7 separable convolutions are operations from their NAS search space. As you can see, in this example two feature maps are sampled from the input and fed through those operations, while the others are bypassed and concatenated to the output.
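To make that concrete, here is a minimal sketch of the idea (my own illustration, not code from the paper): sample a subset of the channels, apply one of the ops only to those, and concatenate the bypassed channels back. The mask and the `nn.Conv2d` standing in for the search-space op are just placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 8, 4, 4)  # N, C, H, W

# Deterministic mask for illustration: sample the first 2 of 8 channels.
mask = torch.zeros(8, dtype=torch.bool)
mask[:2] = True

op = nn.Conv2d(2, 2, 3, padding=1)  # stand-in for a search-space op

processed = op(x[:, mask])  # only the sampled channels go through the op
bypassed = x[:, ~mask]      # the rest are bypassed unchanged
out = torch.cat([processed, bypassed], dim=1)
print(out.shape)  # torch.Size([1, 8, 4, 4])
```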

Therefore, I came up with this “solution”:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SampleConv(nn.Conv2d):
    def __init__(self, in_channels, out_channels, kernel_size, sampling_ratio=0.25,
                 stride=1, padding=0, dilation=1, groups=1, bias=True):
        super().__init__(
            in_channels, out_channels, kernel_size, stride=stride, padding=padding,
            dilation=dilation, groups=groups, bias=bias)
        self.channel_sample_ratio = sampling_ratio

    def forward(self, x, sample=True):
        if sample:
            # Per-channel Bernoulli mask. Note this assumes
            # in_channels == out_channels, since the same mask indexes both
            # the input and the output dimension of the weight.
            mask = torch.rand(x.size(1), device=x.device) < self.channel_sample_ratio
            weight = self.weight[:, mask][mask]
            bias = self.bias[mask] if self.bias is not None else None
            # Convolve only the sampled channels and write the result back
            # in place; the remaining channels are bypassed unchanged.
            x[:, mask] = F.conv2d(
                x[:, mask], weight, bias, self.stride, self.padding,
                self.dilation, self.groups)
        else:
            x = F.conv2d(x, self.weight, self.bias, self.stride,
                         self.padding, self.dilation, self.groups)
        return x
```

I measured the GPU memory consumption: it requires 50–60% less memory than the regular `nn.Conv2d` counterpart (which is still a far cry from the stated 75%). Also, the L1 error when fitting a `torch.randn` target tensor is slightly higher at convergence (0.83 vs. 0.799) than with a regular `nn.Conv2d`:

```python
sample_conv = SampleConv(320, 320, 1, sampling_ratio=0.25).cuda()
normal_conv = nn.Conv2d(320, 320, 1).cuda()
optim_sample_conv = torch.optim.SGD(sample_conv.parameters(), lr=0.1, momentum=0.9)
optim_normal_conv = torch.optim.SGD(normal_conv.parameters(), lr=0.1, momentum=0.9)
target = torch.randn(1, 320, 48, 48).cuda()

for e in range(10000):
    # SampleConv
    input_t = torch.ones(1, 320, 48, 48).cuda()
    y = sample_conv(input_t)
    loss = (y - target).abs().mean()
    optim_sample_conv.zero_grad()
    loss.backward()
    optim_sample_conv.step()
    with torch.no_grad():
        # Evaluate without sampling to get the full-convolution error.
        y = sample_conv(input_t, sample=False)
        l1_error = (y - target).abs().mean()
    print("SampleConv: Loss: {} L1-Error: {}".format(loss, l1_error))

    # regular nn.Conv2d
    input_t = torch.ones(1, 320, 48, 48).cuda()
    y = normal_conv(input_t)
    loss = (y - target).abs().mean()
    optim_normal_conv.zero_grad()
    loss.backward()
    optim_normal_conv.step()
    with torch.no_grad():
        y = normal_conv(input_t)
        l1_error = (y - target).abs().mean()
    print("nn.Conv2d: Loss: {} L1-Error: {}".format(loss, l1_error))
```
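For reference, the memory numbers above can be measured with a small harness like the following (my own measurement code, not from the paper; `peak_mem_mb` is a hypothetical helper). It runs one forward/backward pass and reads PyTorch’s peak-allocation counter:

```python
import torch

def peak_mem_mb(module, inp):
    """Peak CUDA memory (MB) for one forward/backward pass of `module`."""
    torch.cuda.reset_peak_memory_stats()
    out = module(inp)
    out.mean().backward()
    return torch.cuda.max_memory_allocated() / 2**20

# Usage (requires a GPU):
# x = torch.ones(1, 320, 48, 48, device="cuda", requires_grad=True)
# print(peak_mem_mb(nn.Conv2d(320, 320, 1).cuda(), x))
```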

What do you think?