I am studying the Inductor CPP codegen and have the following questions:
1) For the `gen_loops()` call, can the `loops` parameter be a multi-element list such as `[LoopLevel, LoopLevel, ...]` only for `OuterLoopFusedKernel`? Otherwise, will it always be `[LoopLevel]`, where that loop level holds nested loops in its `inner` member?
2) At the Scheduler level, say node2 depends on node1 (with any kind of fusion). How is this order preserved at the cpp loop level?
3) Also, at the `LoopNest` and `LoopLevel` level, is it possible to track the read and write parameters of a kernel?
`LoopLevel` is a general abstraction for a `for` loop, and the `LoopLevel` array generally represents one kernel, which contains a single or multiple `LoopLevel`s regardless of `OuterLoopFusedKernel`. Meanwhile, `OuterLoopFusedKernel` is used to stack different kernels without fusion; that's why its data field is a list of `LoopNest`.
Let me give an example here.
Suppose there are two kernels, named `kernel0` and `kernel1`.
# Kernel 0
for x0 in range(0, 100):
    for y0 in range(0, 1000):
        for z0 in range(0, 1024):
            pass  # Do something for kernel0
# Kernel 1
for x1 in range(0, 100):
    for y1 in range(0, 1000):
        for z1 in range(0, 2048):
            pass  # Do something for kernel1
Before fusion, `kernel0` and `kernel1` are each represented as a `LoopNest` that contains 3 `LoopLevel`s and a `CppKernel`.
- Kernel0:
LoopNest { loops: [LoopLevel_0_for_kernel0, LoopLevel_1_for_kernel0, LoopLevel_2_for_kernel0], kernel: CppKernel_for_kernel0 }
- Kernel1:
LoopNest { loops: [LoopLevel_0_for_kernel1, LoopLevel_1_for_kernel1, LoopLevel_2_for_kernel1], kernel: CppKernel_for_kernel1 }
During the fusion process of the Cpp/OMP backend, the two outermost loops of `kernel0` and `kernel1` are fused. The innermost loops of `kernel0` and `kernel1` cannot be fused, so they are kept as separate loop nests inside an `OuterLoopFusedKernel`. The resulting `Fused_Kernel0_Kernel1` can be viewed as follows.
LoopNest0: {
    loops: [LoopLevel_2_for_kernel0],
    kernel: CppKernel_for_kernel0
}
LoopNest1: {
    loops: [LoopLevel_2_for_kernel1],
    kernel: CppKernel_for_kernel1
}
OuterLoopFusedKernel: {
    inner: [LoopNest0, LoopNest1]
}
LoopNest: {
    loops: [LoopLevel_0, LoopLevel_1],
    kernel: OuterLoopFusedKernel
}
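To make the fused structure above concrete, here is a minimal Python sketch. The dataclasses are a simplified, hypothetical model that only borrows the names `LoopLevel`, `LoopNest`, `CppKernel` and `OuterLoopFusedKernel`; they are not the actual Inductor classes, and `render` is an illustrative helper.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class LoopLevel:
    var: str   # loop variable name
    size: int  # trip count


@dataclass
class CppKernel:
    name: str


@dataclass
class OuterLoopFusedKernel:
    # Stacks several loop nests that share the fused outer loops.
    inner: List["LoopNest"] = field(default_factory=list)


@dataclass
class LoopNest:
    loops: List[LoopLevel]  # outermost first
    kernel: Union[CppKernel, OuterLoopFusedKernel]


def render(nest, depth=0):
    """Hypothetical helper: pretty-print a loop nest as indented pseudo-loops."""
    lines = []
    for level in nest.loops:
        lines.append("    " * depth + f"for {level.var} in range({level.size}):")
        depth += 1
    if isinstance(nest.kernel, OuterLoopFusedKernel):
        # Each stacked nest is emitted one after another under the shared loops.
        for inner in nest.kernel.inner:
            lines.extend(render(inner, depth))
    else:
        lines.append("    " * depth + f"# body of {nest.kernel.name}")
    return lines


nest0 = LoopNest([LoopLevel("z0", 1024)], CppKernel("kernel0"))
nest1 = LoopNest([LoopLevel("z1", 2048)], CppKernel("kernel1"))
fused = LoopNest(
    [LoopLevel("x", 100), LoopLevel("y", 1000)],
    OuterLoopFusedKernel([nest0, nest1]),
)
fused_lines = render(fused)
print("\n".join(fused_lines))
```

The two shared outer loops appear once, and the two unfused inner loops are emitted side by side beneath them.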
So, it means `[LoopLevel, LoopLevel, ...]` does not only serve `OuterLoopFusedKernel`.
The order is encoded as the index into the `LoopLevel` array of `LoopNest`: the outermost loop is `LoopNest.loops[0]`, while the innermost loop is `LoopNest.loops[-1]`.
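As a rough illustration of how list position maps to nesting depth, the sketch below turns an ordered list of loops into nested C++-style `for` headers. `emit_cpp_loops` and its `(var, size)` tuples are hypothetical and not part of Inductor; the only point is that index 0 ends up outermost and index -1 innermost.

```python
def emit_cpp_loops(loops, body):
    """loops: list of (var, size) tuples, outermost first (hypothetical format)."""
    lines = []
    for depth, (var, size) in enumerate(loops):
        indent = "    " * depth
        # The list index directly determines the nesting depth.
        lines.append(f"{indent}for (long {var} = 0; {var} < {size}; ++{var}) {{")
    lines.append("    " * len(loops) + body)
    # Close the braces from the innermost loop outwards.
    for depth in reversed(range(len(loops))):
        lines.append("    " * depth + "}")
    return lines


cpp_lines = emit_cpp_loops(
    [("x0", 100), ("y0", 1000), ("z0", 1024)],
    "/* kernel0 body */",
)
print("\n".join(cpp_lines))
```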
By design, it does not track the read and write parameters, because it serves solely for `Loop` representation. May I know the motivation behind the question?
So all current (CPU-like) HW architectures using the cpp backend use the OpenMP runtime, where in most cases the runtime automatically adds a barrier or sync once an `omp parallel` region ends.
However, I am targeting an arch without OpenMP support, so I have to decide which points are optimal (or just better than syncing after every loop nest) for sync.
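One possible way to reason about such sync points, sketched under strong assumptions: suppose each kernel's read and write buffer names are available from some other source (they are not carried by `LoopNest`, so the input format here is hypothetical). A barrier is then only needed before a kernel that has a read-after-write, write-after-read, or write-after-write hazard with a kernel emitted since the previous barrier. `place_barriers` is an illustrative helper, not an Inductor API.

```python
def place_barriers(kernels):
    """kernels: list of (name, reads, writes) in emission order (hypothetical).

    Returns the indices i such that a barrier is needed before kernels[i].
    """
    barriers = []
    pending_reads, pending_writes = set(), set()
    for i, (_name, reads, writes) in enumerate(kernels):
        reads, writes = set(reads), set(writes)
        hazard = (
            (reads & pending_writes)      # read-after-write
            or (writes & pending_reads)   # write-after-read
            or (writes & pending_writes)  # write-after-write
        )
        if hazard:
            barriers.append(i)
            # Everything before the barrier is complete, so start a new region.
            pending_reads, pending_writes = set(), set()
        pending_reads |= reads
        pending_writes |= writes
    return barriers


kernels = [
    ("k0", {"a"}, {"b"}),
    ("k1", {"a"}, {"c"}),       # independent of k0
    ("k2", {"b", "c"}, {"d"}),  # consumes outputs of k0 and k1
    ("k3", {"a"}, {"e"}),       # independent of k2
]
barrier_points = place_barriers(kernels)
print(barrier_points)
```

This greedy scan is conservative at buffer granularity; a finer analysis over index ranges could remove more barriers.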
I think different runtimes should only impact how code is generated for the `Loop` statement, since key concepts like `parallelism` should be common among different runtimes. `LoopNest` and `LoopLevel` are general representations of the loop.
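A minimal sketch of that idea, assuming a hypothetical `emit_loop_header` helper: the loop description stays the same, and only the emitted header changes with the runtime. The `"manual"` branch assumes the target runtime provides `worker_id` and `num_workers` variables in the generated code; those names are made up for illustration.

```python
def emit_loop_header(var, size, parallel, runtime="openmp"):
    """Return C++-style header lines for one loop (illustrative only)."""
    plain = f"for (long {var} = 0; {var} < {size}; ++{var}) {{"
    if not parallel:
        return [plain]
    if runtime == "openmp":
        # OpenMP work-sharing pragma; the runtime splits the iterations.
        return ["#pragma omp for", plain]
    if runtime == "manual":
        # Hypothetical runtime without OpenMP: each worker computes its own
        # contiguous chunk from worker_id / num_workers.
        return [
            f"long {var}_chunk = ({size} + num_workers - 1) / num_workers;",
            f"long {var}_begin = worker_id * {var}_chunk;",
            f"long {var}_end = {var}_begin + {var}_chunk < {size} "
            f"? {var}_begin + {var}_chunk : {size};",
            f"for (long {var} = {var}_begin; {var} < {var}_end; ++{var}) {{",
        ]
    raise ValueError(f"unknown runtime: {runtime}")


omp_header = emit_loop_header("x0", 100, parallel=True, runtime="openmp")
manual_header = emit_loop_header("x0", 100, parallel=True, runtime="manual")
serial_header = emit_loop_header("z0", 1024, parallel=False)
print("\n".join(omp_header + manual_header + serial_header))
```

In the manual variant, any barrier after the loop would have to be emitted explicitly, since there is no implicit one at the end of a parallel region.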