# Compute the Lipschitz constant of a deep network efficiently

Hi, I want to compute the Lipschitz constant of a network, but my implementation seems slow. Since the Lipschitz constant of a composition is bounded by the product of the blocks' constants, I compute it like this:

``````
# naive version
lc = 1
for block in self.blocks:
    lc = lc * compute_lipschitz(block)
``````

For convolution layers, the function `compute_lipschitz` looks like this:

``````
def compute_lipschitz(conv):
    x = torch.randn_like(input_of_conv)
    # power iteration to estimate the largest singular value of the conv
    for i in range(5):
        x = F.conv2d(x, conv.weight, stride, padding, ...)
        x = F.conv_transpose2d(x, conv.weight, stride, padding, ...)
        x = x / x.norm()
    return F.conv2d(x, ...).norm()
``````

For activation layers, it returns the maximum absolute value of the derivative; e.g., for ReLU it returns 1.

However, I find it very slow because 1) the network is deep, 2) each call to `compute_lipschitz` is light and cannot take full advantage of the GPU, and 3) the computation is repeated on every GPU under DDP, which seems unnecessary.

I tried two ideas to solve this problem, but neither worked.

1. Use two CUDA streams, one for the model forward and one for `compute_lipschitz`. This did not bring a noticeable speedup.

2. With DDP on 8 GPUs, let each GPU compute the Lipschitz constants of only 1/8 of the blocks:

``````
# accelerated version
lc = torch.zeros(len(self.blocks)).to(gpu_device)
for idx, block in enumerate(self.blocks):
    if idx % num_gpus == rank:
        lc[idx] = compute_lipschitz(block)

torch.distributed.all_reduce(lc)
lc = lc.prod()
``````

This method is 8 times faster, and the value of `lc` matches the naive version. However, the gradient seems incorrect: training behaves very differently from the naive version, even though `lc` itself is computed correctly.

Any ideas to solve the gradient problem or accelerate the naive version? Thanks!

This is a very interesting problem.

Some initial clarifications:

1. More precisely, is `lc` initialized as `torch.ones(1)` in the naive version? (You mention that the gradient matters, so I assume it is a `torch.Tensor`.)
2. Reading the docs for `torch.nn.utils.spectral_norm`, the function appears to modify the parameter in-place. Does `compute_lipschitz` do any in-place modification, or does it simply return a value? If it returns a value, does it return a Python `float`, a `torch.Tensor`, or something else?
3. Is there any data dependency between the model forward and computing the model Lipschitz constant?

Also, my guess is that the gradient is not correct because the autograd history of `lc = torch.zeros(len(self.blocks))` is not propagated through the all-reduce. Each rank only has the history for its fraction of the elements, so gradients will not flow from the loss back through `lc` as expected.

Finally, is there any way to write a batched `compute_lipschitz(blocks)` that can perform the `prod` in the kernel directly?


It looks like if you use the `torch.distributed.nn.functional.all_reduce` API, the autograd history will be propagated appropriately (it calls an additional all-reduce in the backward pass). Maybe try dropping it in instead of `dist.all_reduce` and see if it fixes your problem.
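
Roughly, the drop-in would look like this (a sketch reusing the names from your accelerated snippet above, i.e. `gpu_device`, `num_gpus`, `rank`, and `compute_lipschitz`):

``````
from torch.distributed.nn.functional import all_reduce as autograd_all_reduce

lc = torch.zeros(len(self.blocks)).to(gpu_device)
for idx, block in enumerate(self.blocks):
    if idx % num_gpus == rank:
        lc[idx] = compute_lipschitz(block)

# unlike dist.all_reduce, this returns a new tensor and records the
# collective in the autograd graph, so the backward all-reduces gradients
lc = autograd_all_reduce(lc)
lc = lc.prod()
``````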

Hi @agu, thanks for your reply. I made some modifications to the question to make it clearer.

1. `lc` is not `torch.ones(1)` in the naive version. I initialize it with the Python number 1 and multiply it by all blocks' Lipschitz constants.

2. `compute_lipschitz` does not do any in-place operations; it just returns a number. The number carries a gradient if the module has parameters (e.g., a conv layer), and carries no gradient if the module has none (e.g., a ReLU layer).

3. Computing the model's Lipschitz constant is independent of the model forward.

I will see if I can write a batched `compute_lipschitz` for layers whose parameters have the same shape.
I wonder what is the difference between `torch.distributed.all_reduce` and `torch.distributed.nn.functional.all_reduce`? Thanks!

> I wonder what is the difference between `torch.distributed.all_reduce` and `torch.distributed.nn.functional.all_reduce`? Thanks!

Sorry if I was unclear. The difference is that `nn.functional.all_reduce` propagates gradients across ranks, while the normal `distributed.all_reduce` does not. In my understanding, if you have a tensor `t = torch.zeros(world_size)`, run `t[rank] = <some computation tracked by autograd>` on each rank, call `nn.functional.all_reduce(t)` on each rank, and use the resulting `t` for the loss computation, then calling `loss.backward()` will provide gradients for all `world_size` elements of `t`, not just `t[rank]`. This is because `nn.functional.all_reduce` additionally all-reduces each rank's gradient in the backward pass.
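
To make this concrete, here is a minimal sketch of that behavior (assuming a process group is already initialized, e.g. via `torchrun`, with one CUDA device per rank; `w` is just a stand-in parameter):

``````
import torch
import torch.distributed as dist
from torch.distributed.nn.functional import all_reduce

rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device("cuda", rank)

w = torch.randn((), device=device, requires_grad=True)
t = torch.zeros(world_size, device=device)
t[rank] = w * 2.0   # only this rank's element has local autograd history

t = all_reduce(t)   # differentiable: the backward runs another all-reduce
loss = t.prod()
loss.backward()     # w.grad is populated on every rank
``````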

Regarding your answers to 1 and 2, my concern is that gradients cannot propagate from Python scalars (i.e., `float`s) back through to `torch.Tensor`s. Either way, I still suggest trying `nn.functional.all_reduce` first to see if that works.

@agu After switching to `torch.distributed.nn.functional.all_reduce`, the program gets stuck at the first iteration.

Would it be possible to share a minimal version of your script? I can try to take a look.

Sure, I will make it today.

@agu Sorry for the late reply, I was busy yesterday. Here is a minimal version; I removed some complicated loss terms so that the main idea is clear.

The problem is solved by batching as you suggested (a rough sketch is at the end of this post); the runtime of my model is reduced by 40%. But it is still interesting to see why `all_reduce` does not work here.

``````
import argparse
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.nn.functional as F

class conv3x3(nn.Conv2d):
    def __init__(self,
                 in_channels,
                 out_channels,
                 n_power_iterations=10,
                 eps=1e-12,
                 input_size=32) -> None:
        super(conv3x3, self).__init__(in_channels=in_channels,
                                      out_channels=out_channels,
                                      kernel_size=3,
                                      stride=1,
                                      padding=1,
                                      bias=True)
        nn.init.orthogonal_(self.weight)

        self.num_iter = n_power_iterations
        self.eps = eps

        init_x = torch.randn(1, self.in_channels, input_size, input_size)
        self.register_buffer('init_x', init_x)

    def lipschitz(self):
        x = self.init_x.data

        # power method to estimate the largest singular value of the conv
        for _ in range(self.num_iter):
            x = F.conv2d(x, self.weight, bias=None, padding=1)
            x = F.conv_transpose2d(x, self.weight, bias=None, padding=1)
            x = x / (x.norm() + self.eps)

        # persist the iterate as the starting point for the next call
        self.init_x += (x - self.init_x).detach()
        x = F.conv2d(x,
                     self.weight,
                     bias=None,
                     stride=self.stride,
                     padding=1)
        return x.norm()

class ReLU(nn.ReLU):
    def __init__(self):
        super(ReLU, self).__init__()

    def lipschitz(self):
        return 1.

class ToyModel(nn.Module):
    def __init__(self, depth):
        super(ToyModel, self).__init__()
        layers = []
        for _ in range(depth):
            layers.append(conv3x3(64, 64))
            layers.append(ReLU())

        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        out = self.layers(x)
        lc = self.lipschitz()
        return out, lc

    def lipschitz(self):
        if dist.is_initialized():
            return self.fast_lipschitz()
        lc = 1
        for module in self.layers:
            lc = lc * module.lipschitz()
        return lc

    def fast_lipschitz(self):
        lc_list = torch.zeros(len(self.layers))
        lc_list = lc_list.to(self.layers[0].weight.device)

        rank = dist.get_rank()
        num_gpus = dist.get_world_size()

        for idx, module in enumerate(self.layers):
            if idx % num_gpus == rank:
                lc_list[idx] = module.lipschitz()

        dist.all_reduce(lc_list)
        return lc_list.prod()

if __name__ == '__main__':
    dist.init_process_group(backend="nccl")
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=0)  # set by torch.distributed.launch
    args = parser.parse_args()

    local_rank = args.local_rank
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    model = ToyModel(20).to(device)
    model = nn.parallel.DistributedDataParallel(model)

    inputs = torch.randn(8, 64, 32, 32).to(device)
    targets = torch.randn(8, 64, 32, 32).to(device)

    # stand-in for the removed loss terms
    out, lc = model(inputs)
    loss = F.mse_loss(out, targets) + lc
    loss.backward()
``````
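
For reference, here is a rough sketch of the batched version that gave the speedup. It assumes all conv layers share the same weight shape; `batched_lipschitz` is my own helper, and the `init_x` buffer update is omitted for brevity. It folds all layers into one grouped convolution per power iteration, so a single kernel does the work of `depth` small ones:

``````
import torch
import torch.nn.functional as F

def batched_lipschitz(convs, n_iter=10, eps=1e-12):
    # convs: list of conv3x3 modules with identically shaped weights
    n = len(convs)
    # (n * C_out, C_in, 3, 3): one group per layer
    weight = torch.cat([conv.weight for conv in convs], dim=0)
    # (1, n * C_in, H, W): one power-iteration vector per layer
    x = torch.cat([conv.init_x for conv in convs], dim=1)

    for _ in range(n_iter):
        x = F.conv2d(x, weight, bias=None, padding=1, groups=n)
        x = F.conv_transpose2d(x, weight, bias=None, padding=1, groups=n)
        norms = x.view(n, -1).norm(dim=1) + eps           # per-layer norms, shape (n,)
        x = (x.view(n, -1) / norms.unsqueeze(1)).view_as(x)

    x = F.conv2d(x, weight, bias=None, padding=1, groups=n)
    return x.view(n, -1).norm(dim=1).prod()               # product of per-layer estimates
``````

In the toy model above this would be called with `convs = [m for m in self.layers if isinstance(m, conv3x3)]`; the ReLU factors are 1, so they do not change the product.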