How to perform "map reduce" style computations efficiently?

By “map reduce” I mean the following. Say I have three sets of parameters: A={set of tensors}, B={set of tensors}, and C={set of tensors}. There may be tens of thousands of combinations of these, or more.

Now I want to apply some custom computation function (i.e. “map”) to specific combinations of these parameters together with some input data, say { (x1, y1, z1), (x2, y2, z2), … }. The results of these computations will then be aggregated/mapped/associated back (i.e. “reduce”) based on the parameters A, B, C.

The straightforward way is to use a nested for loop. However, I’d like to perform this efficiently, using GPU acceleration where possible, since each computation is independent of the others.
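
To make the setup concrete, here is a rough sketch of the nested-loop baseline I have in mind. Everything in it is a placeholder rather than my real code: plain lists of floats stand in for tensors so it runs without any library, `compute` is a hypothetical "map" function, and I am assuming each input sample is tagged with the indices (i, j, k) of the parameter combination it belongs to.

```python
from collections import defaultdict

# Hypothetical stand-ins: each "tensor" is a short list of floats.
A = [[1.0, 2.0], [0.5, 0.5]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[2.0, 2.0]]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))


def compute(a, b, c, x, y, z):
    """Placeholder "map" step; the real function is custom."""
    return dot(a, x) + dot(b, y) + dot(c, z)


# Assumption: each sample carries the indices (i, j, k) of the parameter
# combination it is evaluated against, plus its data (x, y, z).
samples = [
    ((0, 0, 0), ([1.0, 1.0], [1.0, 0.0], [0.5, 0.5])),
    ((1, 1, 0), ([2.0, 2.0], [0.0, 3.0], [1.0, 0.0])),
    ((0, 0, 0), ([0.0, 1.0], [2.0, 2.0], [0.0, 0.0])),
]


def map_reduce_loop(A, B, C, samples):
    """Nested-loop baseline: map each matching sample, then sum per (i, j, k)."""
    results = defaultdict(float)
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            for k, c in enumerate(C):
                for key, (x, y, z) in samples:
                    if key == (i, j, k):
                        # "reduce": accumulate under the parameter combination
                        results[key] += compute(a, b, c, x, y, z)
    return dict(results)


totals = map_reduce_loop(A, B, C, samples)
```

Here the "reduce" is a plain sum per combination, which is only one possible aggregation. The loop visits every (i, j, k) combination for every sample, and that is the cost I would like to avoid.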

Thank you in advance.