Making HVP More Scalable for Larger Training Set Sizes

Ethan_R · June 17, 2021, 6:43pm

Hello Everyone,

I am currently working on a research project in which I need to calculate the Hessian Vector Product (HVP) w.r.t. my model’s parameters. I am able to successfully calculate the HVP, however the calculation slows significantly and chews up a significant amount of memory as the size of the training set I am using increases.

Let me make things more explicit:

	def GetHVPFunc_alt(self, datax, datay, testx, testy):
		params = [ p for p in self.model.parameters() if p.requires_grad ]
		predtest = self.model(testx)
		losstest = calc_loss(predtest, testy)
		grads_test = self.To_List(autograd.grad(losstest, params))

		def HVP(v):
			s = time.time()
			predtrain = self.model(datax)
			losstrain = calc_loss(predtrain, datay)

			v_t = self.Reverse_To_List(v, params)
			out = hvp_alt(losstrain, params, v_t)
			print("Time HVP: ", time.time() - s)
			return out
		return HVP, grads_test

I want to calculate H*v where H is the hessian of the loss function w.r.t. the model parameters. As I increase the number of points in datax and datay (the training set images/labels respectively) the computation time of calculating the model output and the model loss increases as expected, however the calculation time along with the memory usage of hvp_alt grows significantly faster. This is likely because the function calls autograd.grad twice, which I believe creates large graphs that take a lot of memory and that scale with the number of inputs to the model. See the code below:

def hvp(y, w, v):
    if len(w) != len(v):
        raise(ValueError("w and v must have the same length."))
    # First backprop
    first_grads = grad(y, w, retain_graph=True, create_graph=True)
    # Elementwise products
    elemwise_products = 0
    for grad_elem, v_elem in zip(first_grads, v):
        elemwise_products += torch.sum(grad_elem * v_elem)
    # Second backprop
    return_grads = grad(elemwise_products, w, create_graph=False)
    return return_grads

I should note that the code above is not my own and is taken from the following github which comes from the paper I am basing my research on: GitHub - nimarb/pytorch_influence_functions: This is a PyTorch reimplementation of Influence Functions from the ICML2017 best paper: Understanding Black-box Predictions via Influence Functions by Pang Wei Koh and Percy Liang. . I have been using 100 data points from CIFAR-10 to approximate the HVP causing each iteration of my overall code (I am approximating the inverse HVP i.e. H^-1 * v) to finish in ~6 seconds, however when I increase the training set size to 1000 the run time increases to ~60 seconds.

In short, is there a way for me to get an accurate, efficient approximation of the HVP that doesn’t depend so heavily on training set size? Is my current solution of using a subset of the data sufficient or will this result in an inaccurate HVP calculation?

Feel free to ask for clarification if anything was confusing.

Thank you so much,
Ethan

Aditya_Ranganath · June 18, 2021, 1:38am

Hello,

I could be wrong here, but it really depends on how big the graph is. How big is your model ? Also you can look into torch.jit. This can get around the complexity overhead as it tries to avoid python dependencies.

Cheers

albanD · June 18, 2021, 3:04pm

Hi,

We are actually working on forward mode right now and that should help quite a bit in your case.
Do you have a small self-contained script that shows the model you’re using?

Ethan_R · June 19, 2021, 1:42am

Hello Alban,

I’m using a smaller model I got from a tutorial right now for testing. Here is a script that describes the model:

# 3x3 convolution
def conv3x3(in_channels, out_channels, stride=1):
    return nn.Conv2d(in_channels, out_channels, kernel_size=3, 
                     stride=stride, padding=1, bias=False)

# Residual block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()
        self.conv1 = conv3x3(in_channels, out_channels, stride)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(out_channels, out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out

# ResNet
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super(ResNet, self).__init__()
        self.in_channels = 16
        self.conv = conv3x3(3, 16)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU(inplace=True)
        self.layer1 = self.make_layer(block, 16, layers[0])
        self.layer2 = self.make_layer(block, 32, layers[1], 2)
        self.layer3 = self.make_layer(block, 64, layers[2], 2)
        self.avg_pool = nn.AvgPool2d(8)
        self.fc = nn.Linear(64, num_classes)

    def make_layer(self, block, out_channels, blocks, stride=1):
        downsample = None
        if (stride != 1) or (self.in_channels != out_channels):
            downsample = nn.Sequential(
                conv3x3(self.in_channels, out_channels, stride=stride),
                nn.BatchNorm2d(out_channels))
        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels
        for i in range(1, blocks):
            layers.append(block(out_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv(x)
        out = self.bn(out)
        out = self.relu(out)
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.avg_pool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

model = ResNet(ResidualBlock, [2, 2, 2])

Is there anything else you want to see? I do agree that the model size will definitely alter the execution time as well. Also, out of curiosity how would this “forward mode” help me?