Help understanding cpp codegen for inductor

    I am studying the inductor CPP codegen and I have the following questions:

    1) For the gen_loops() call, can the `loops` parameter be a list of the form
       [LoopLevel, LoopLevel, ...] only for `OuterLoopFusedKernel`?
       Otherwise, will it always be
       [LoopLevel], where that loop level can hold further loops in its `inner` member, i.e. nested loops?

    2) At the Scheduler level, say node2 depends on node1 (under any kind of fusion).
       How is this order preserved at the cpp loop level?
    
    3) Also, at the LoopNest and LoopLevel level, is it possible to track the read and write parameters of a kernel?

LoopLevel is a general abstraction for a for loop, and a LoopLevel array is used to represent the loops of one kernel, which may contain a single LoopLevel or several, regardless of OuterLoopFusedKernel. Meanwhile, OuterLoopFusedKernel is used to stack different kernels without fusing them, which is why its data field is a list of LoopNest.

Let me give an example here.

Suppose there are two kernels, named kernel0 and kernel1.

# Kernel 0
for x0 in range(0, 100):
  for y0 in range(0, 1000):
    for z0 in range(0, 1024):
      # Do something for kernel0

# Kernel 1
for x1 in range(0, 100):
  for y1 in range(0, 1000):
    for z1 in range(0, 2048):
      # Do something for kernel1

Before fusion, kernel0 and kernel1 are each represented as a LoopNest that contains three LoopLevels and a CppKernel.

  • Kernel0:
    LoopNest {
      loops: [LoopLevel_0_for_kernel0, LoopLevel_1_for_kernel0, LoopLevel_2_for_kernel0],
      kernel: CppKernel_for_kernel0
    }
    
  • Kernel1:
    LoopNest {
      loops: [LoopLevel_0_for_kernel1, LoopLevel_1_for_kernel1, LoopLevel_2_for_kernel1],
      kernel: CppKernel_for_kernel1
    }
    

During the fusion process of the Cpp/OMP backend, the two outermost loops of kernel0 and kernel1 are fused. The innermost loops of kernel0 and kernel1 cannot be fused (in this example their ranges differ, 1024 versus 2048), so they are represented as an OuterLoopFusedKernel. The resulting Fused_Kernel0_Kernel1 can be viewed as follows.

LoopNest0: {
  loops: [LoopLevel_2_for_kernel0],
  kernel: CppKernel_for_kernel0
}
LoopNest1: {
  loops: [LoopLevel_2_for_kernel1],
  kernel: CppKernel_for_kernel1
}
OuterLoopFusedKernel: {
  inner: [LoopNest0, LoopNest1]
}
LoopNest: {
  loops: [LoopLevel_0, LoopLevel_1],
  kernel: OuterLoopFusedKernel
}
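
To make the nesting concrete, here is a minimal Python sketch of the structure above. These dataclasses are simplified stand-ins that I made up for illustration, not the actual inductor classes; the fields `var` and `size`, the kernel `name`, and the helpers `build_fused_example` and `describe` are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class LoopLevel:
    var: str   # induction variable name (assumed field)
    size: int  # trip count (assumed field)


@dataclass
class CppKernel:
    name: str  # stand-in for the real kernel body


@dataclass
class LoopNest:
    loops: List[LoopLevel]  # loops[0] is outermost, loops[-1] is innermost
    kernel: Union[CppKernel, "OuterLoopFusedKernel"]


@dataclass
class OuterLoopFusedKernel:
    inner: List[LoopNest]  # stacked, unfused inner loop nests


def build_fused_example() -> LoopNest:
    """Build the Fused_Kernel0_Kernel1 structure from the example."""
    nest0 = LoopNest([LoopLevel("z0", 1024)], CppKernel("kernel0"))
    nest1 = LoopNest([LoopLevel("z1", 2048)], CppKernel("kernel1"))
    fused = OuterLoopFusedKernel([nest0, nest1])
    return LoopNest([LoopLevel("x", 100), LoopLevel("y", 1000)], fused)


def describe(nest: LoopNest, depth: int = 0) -> List[str]:
    """Render a loop nest as indented pseudo-code lines."""
    lines = []
    for level in nest.loops:  # list order gives the nesting order
        lines.append("  " * depth + f"for {level.var} in range(0, {level.size}):")
        depth += 1
    if isinstance(nest.kernel, OuterLoopFusedKernel):
        for inner in nest.kernel.inner:  # list order gives the execution order
            lines.extend(describe(inner, depth))
    else:
        lines.append("  " * depth + f"# body of {nest.kernel.name}")
    return lines
```

Calling `print("\n".join(describe(build_fused_example())))` prints the shared x and y loops once, followed by the z0 loop of kernel0 and then the z1 loop of kernel1 inside them.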

So [LoopLevel, LoopLevel, ...] is not used only for OuterLoopFusedKernel.

The order is encoded as the index into the LoopLevel array of the LoopNest. That is, the outermost loop is LoopNest.loops[0], while the innermost loop is LoopNest.loops[-1].

By design, it does not track the read and write parameters, because it serves solely to represent loops. May I know the motivation behind the question?

So all current (CPU-like) hardware architectures that use the cpp backend rely on the OpenMP runtime, where in most cases,
once an omp parallel region ends, the runtime automatically adds a barrier or sync.

However, I am targeting an architecture where we don’t have OpenMP support, so I have to decide which points are optimal for syncing (or at least better than syncing after every loop nest).
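
To make the sync-point question concrete, here is a rough sketch of one conservative policy: put a barrier before a loop nest only when it conflicts (read-after-write, write-after-write, or write-after-read) with a nest that ran since the last barrier. The per-nest read and write sets are an assumption; since LoopNest and LoopLevel do not track them, they would have to be supplied from somewhere else. `plan_barriers` and the buffer names are hypothetical.

```python
from typing import List, Set, Tuple

# Each loop nest is described by (buffers it reads, buffers it writes).
Access = Tuple[Set[str], Set[str]]


def plan_barriers(nests: List[Access]) -> List[int]:
    """Return indices i such that a barrier is placed before nests[i]."""
    barriers: List[int] = []
    pending_reads: Set[str] = set()   # reads since the last barrier
    pending_writes: Set[str] = set()  # writes since the last barrier
    for i, (reads, writes) in enumerate(nests):
        raw = reads & pending_writes   # read-after-write
        waw = writes & pending_writes  # write-after-write
        war = writes & pending_reads   # write-after-read
        if raw or waw or war:
            barriers.append(i)
            pending_reads, pending_writes = set(), set()
        pending_reads |= reads
        pending_writes |= writes
    return barriers


nests = [
    ({"a"}, {"b"}),       # nest 0
    ({"a"}, {"c"}),       # nest 1: independent of nest 0
    ({"b", "c"}, {"d"}),  # nest 2: reads what nests 0 and 1 wrote
    ({"a"}, {"e"}),       # nest 3: independent of nest 2
]
print(plan_barriers(nests))  # [2]
```

This works at whole-buffer granularity, so it can place a barrier that is not strictly needed, for example when two nests partition their iterations across workers in the same way.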

I think different runtimes should only affect how the code for a loop statement is generated, since key concepts such as parallelism should be common across runtimes. LoopNest and LoopLevel are general enough to represent the loop either way.
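
As a rough illustration of that separation, the sketch below keeps the loop representation fixed and swaps only the piece that decorates the outermost loop. None of this is the actual inductor codegen: `emit`, `openmp_header`, and `custom_runtime_header` are hypothetical names, and the non-OpenMP header is just a placeholder comment.

```python
from typing import Callable, List, Tuple

Loop = Tuple[str, int]  # (induction variable, trip count)
Header = Callable[[str, int], List[str]]


def openmp_header(var: str, size: int) -> List[str]:
    # OpenMP work-sharing directive placed before the parallel loop.
    return ["#pragma omp parallel for"]


def custom_runtime_header(var: str, size: int) -> List[str]:
    # Placeholder for a runtime without OpenMP (assumption).
    return [f"// hypothetical: split {var} in [0, {size}) across workers"]


def emit(loops: List[Loop], body: str, header: Header) -> List[str]:
    """Emit nested C++ for loops; loops[0] is outermost."""
    lines: List[str] = []
    for depth, (var, size) in enumerate(loops):
        indent = "    " * depth
        if depth == 0:  # only the outermost loop is parallelized here
            lines += [indent + h for h in header(var, size)]
        lines.append(f"{indent}for (long {var} = 0; {var} < {size}; ++{var}) {{")
    lines.append("    " * len(loops) + body)
    for depth in reversed(range(len(loops))):
        lines.append("    " * depth + "}")
    return lines


loops = [("x0", 100), ("y0", 1000)]
print("\n".join(emit(loops, "body();", openmp_header)))
print("\n".join(emit(loops, "body();", custom_runtime_header)))
```

Both calls walk the same list of loops in the same order; only the first emitted line differs between the two runtimes.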