Parallelize tensor dot-products (different sizes)

I have two lists of tensors, G1 and G2, such that G1[i] and G2[i] have the same (arbitrary) shape.
I want to compute a kind of chunked dot-product over these lists. First, I compute the dot-product of each pair of tensors given by zip(G1, G2) and collect the results into a list (or 1D tensor). Then, I split this list into chunks and sum each chunk.
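
For concreteness, here is a small hypothetical setup (the shapes and values are made up purely for illustration and are not my real data):

import torch

# hypothetical example: 14 pairs of tensors with matching but arbitrary shapes,
# so that the group sizes used below (2 + 5 + 1 + 6 = 14) add up
shapes = [(3, 4), (5,), (2, 2, 2), (7,)] * 3 + [(6, 1), (4,)]
G1 = [torch.randn(s) for s in shapes]
G2 = [torch.randn(s) for s in shapes]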

My current code is:

device = torch.device("cuda")
group_sizes = [2, 5, 1, 6]    # for instance (if len(G1) = 2 + 5 + 1 + 6 = 14)
# one dot-product per pair of tensors, collected into a 1D tensor
H = torch.tensor([torch.dot(g1.view(-1), g2.view(-1)) for g1, g2 in zip(G1, G2)], device=device)
# split into groups and sum each group
H_split = H.split(group_sizes)
H_final = torch.tensor([h.sum() for h in H_split], device=device)

I believe this code is sub-optimal because it remains partly sequential: the individual torch.dot calls are independent of one another, so in principle they could be performed in parallel.
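
For what it is worth, here is a sketch of one direction I considered (an untested idea on my part, not code I rely on): concatenate all the flattened tensors, do a single elementwise multiply, and then reduce per group with index_add_; lengths, tensor_group and elem_group are helper names I made up for this sketch.

# sketch only: fuse all the dot-products into one elementwise multiply
# followed by a grouped sum
g1_flat = torch.cat([g.view(-1) for g in G1]).to(device)
g2_flat = torch.cat([g.view(-1) for g in G2]).to(device)
prod = g1_flat * g2_flat                                   # all products at once

lengths = torch.tensor([g.numel() for g in G1])            # elements per tensor
# group index of each tensor, then of each element
tensor_group = torch.arange(len(group_sizes)).repeat_interleave(torch.tensor(group_sizes))
elem_group = tensor_group.repeat_interleave(lengths).to(device)

# grouped sum: equals the sum of the per-pair dot-products inside each group
H_final = torch.zeros(len(group_sizes), device=device, dtype=prod.dtype).index_add_(0, elem_group, prod)

I am not sure whether this is the right approach, or whether the launch of many small kernels in my original code is the actual bottleneck.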

Is there a way to parallelize this code? I suspect there is no simple way to do so, but I would like to know at what level I would have to work to achieve it (the CUDA backend? …).