How to apply a different kernel to each sample in a batch

Hi,
I want to learn a different kernel for each input image, then convolve a feature map with that learned kernel. The feature map is also computed from the same input image.
Suppose the batch size is greater than 1. Below is pseudocode for what I tried.
The group convolution turns out to be quite slow, sometimes even slower than the for loop.

# kernel and feature
rpn_feat = net_1(input)
kernel_feat = net_2(input)
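# sizes
# rpn_feat.size() == (batch_size, rpn_feat_dim, h, w)
nc_cls_out = 2 * anchor_num
# kernel_feat needs nc_cls_out * rpn_feat_dim channels for the views below to work
# kernel_feat.size() == (batch_size, nc_cls_out * rpn_feat_dim, h_k, w_k)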

# convolve feature with kernel
# OPTION 1: group convolution
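kernel = kernel_feat.view(nc_cls_out * batch_size, rpn_feat_dim, h_k, w_k)
# fold the batch into the channel axis so that each sample becomes one group
feat_for_conv = rpn_feat.view(1, batch_size * rpn_feat_dim, h, w)
cls_feat = torch.nn.functional.conv2d(feat_for_conv, kernel, groups=batch_size)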
cls_feat = cls_feat.view(batch_size, nc_cls_out, h_cls_feat, w_cls_feat)
# OPTION 2: for loop
cls_feats = list()
for idx_in_batch in range(batch_size):
  kernel = kernel_feat[idx_in_batch:idx_in_batch+1] \
  .view(nc_cls_out, rpn_feat_dim, h_k, w_k)
  feat_for_conv = rpn_feat[idx_in_batch:idx_in_batch+1]
  cls_feat = torch.nn.functional.conv2d(feat_for_conv, kernel)
  cls_feats.append(cls_feat)
cls_feat = torch.cat(cls_feats, 0)
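
For reference, here is a minimal self-contained sketch of both options that runs on CPU and checks that they give the same result. It makes several assumptions that are not part of my real code: random tensors stand in for the outputs of net_1 and net_2, kernel_feat carries nc_cls_out * rpn_feat_dim channels per sample, the helper names conv_per_sample_grouped and conv_per_sample_loop are made up for illustration, and it is written against a recent PyTorch release (older versions may need .view() on contiguous tensors instead of .reshape()).

```python
import torch
import torch.nn.functional as F


def conv_per_sample_grouped(feat, kernel_feat, out_channels):
    """Convolve each sample with its own kernel using one grouped conv2d call."""
    batch_size, in_channels, h, w = feat.shape
    h_k, w_k = kernel_feat.shape[-2:]
    # Fold the batch into the channel axis: (1, B * C_in, h, w).
    feat = feat.reshape(1, batch_size * in_channels, h, w)
    # One group of out_channels filters per sample: (B * C_out, C_in, h_k, w_k).
    kernel = kernel_feat.reshape(batch_size * out_channels, in_channels, h_k, w_k)
    out = F.conv2d(feat, kernel, groups=batch_size)
    # Unfold the batch again: (B, C_out, h_out, w_out).
    return out.reshape(batch_size, out_channels, out.shape[-2], out.shape[-1])


def conv_per_sample_loop(feat, kernel_feat, out_channels):
    """Reference version: one conv2d call per sample."""
    batch_size, in_channels = feat.shape[:2]
    h_k, w_k = kernel_feat.shape[-2:]
    outs = []
    for i in range(batch_size):
        kernel = kernel_feat[i].reshape(out_channels, in_channels, h_k, w_k)
        outs.append(F.conv2d(feat[i:i + 1], kernel))
    return torch.cat(outs, 0)


# Illustrative sizes only (not the sizes of the real network).
torch.manual_seed(0)
batch_size, rpn_feat_dim, nc_cls_out = 4, 8, 6
rpn_feat = torch.randn(batch_size, rpn_feat_dim, 16, 16)
kernel_feat = torch.randn(batch_size, nc_cls_out * rpn_feat_dim, 3, 3)

grouped = conv_per_sample_grouped(rpn_feat, kernel_feat, nc_cls_out)
looped = conv_per_sample_loop(rpn_feat, kernel_feat, nc_cls_out)
```

The grouped version relies on conv2d assigning input channels and filters to groups in contiguous blocks, so sample i only ever sees filters i * nc_cls_out to (i + 1) * nc_cls_out - 1. This only checks correctness and says nothing about which option is faster on GPU.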

Is this the right way to do it? Or is there a faster way?
Thanks in advance.

Using PyTorch 0.3.2 with CUDA 8.0 and cuDNN 7.0…