Wrap Innermost Loop as neura.kernel #261
Can you provide an IR example for [...]? I thought a hyperblock would always be there when we generate [...] |
Here is a transformed IR using [...]:

    %write_outputs = taskflow.task @Task_0 read_memrefs(%arg0 : memref<?x8x6xi32>) write_memrefs(%arg5 : memref<?xi32>) [original_read_memrefs(%arg0 : memref<?x8x6xi32>), original_write_memrefs(%arg5 : memref<?xi32>)] : (memref<?x8x6xi32>, memref<?xi32>) -> (memref<?xi32>) {
    ^bb0(%arg10: memref<?x8x6xi32>, %arg11: memref<?xi32>):
      affine.for %arg12 = 0 to 4 {
        affine.for %arg13 = 0 to 8 {
          neura.kernel inputs(%arg10, %arg12, %arg13, %arg11 : memref<?x8x6xi32>, index, index, memref<?xi32>) attributes {kernel_name = "kernel_0"} {
          ^bb0(%arg14: memref<?x8x6xi32>, %arg15: index, %arg16: index, %arg17: memref<?xi32>):
            %c0 = arith.constant 0 : index
            %c6 = arith.constant 6 : index
            %c1 = arith.constant 1 : index
            scf.for %arg18 = %c0 to %c6 step %c1 {
              %1 = memref.load %arg14[%arg15, %arg16, %arg18] : memref<?x8x6xi32>
              memref.store %1, %arg17[%arg18] : memref<?xi32>
            }
            neura.yield
          }
        }
      }
      taskflow.yield writes(%arg11 : memref<?xi32>)
    }

If we use an architecture that has counters, the transformation process should use the original one:

    %write_outputs = taskflow.task @Task_0 read_memrefs(%arg0 : memref<?x8x6xi32>) write_memrefs(%arg5 : memref<?xi32>) [original_read_memrefs(%arg0 : memref<?x8x6xi32>), original_write_memrefs(%arg5 : memref<?xi32>)] : (memref<?x8x6xi32>, memref<?xi32>) -> (memref<?xi32>) {
    ^bb0(%arg10: memref<?x8x6xi32>, %arg11: memref<?xi32>):
      %1 = taskflow.counter attributes {counter_id = 0 : i32, counter_type = "root", lower_bound = 0 : index, step = 1 : index, upper_bound = 4 : index} : index
      %2 = taskflow.counter parent(%1 : index) attributes {counter_id = 1 : i32, counter_type = "relay", lower_bound = 0 : index, step = 1 : index, upper_bound = 8 : index} : index
      %3 = taskflow.counter parent(%2 : index) attributes {counter_id = 2 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 6 : index} : index
      neura.kernel inputs(%arg10, %arg11 : memref<?x8x6xi32>, memref<?xi32>) {
      ^bb0(%arg12: memref<?x8x6xi32>, %arg13: memref<?xi32>):
        %4 = neura.counter {counter_id = 0 : i32, counter_type = "root", lower_bound = 0 : index, step = 1 : index, upper_bound = 4 : index} : index
        %5 = neura.counter {counter_id = 1 : i32, counter_type = "relay", lower_bound = 0 : index, step = 1 : index, upper_bound = 8 : index} : index
        %6 = neura.counter {counter_id = 2 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 6 : index} : index
        %7 = memref.load %arg12[%4, %5, %6] : memref<?x8x6xi32>
        memref.store %7, %arg13[%6] : memref<?xi32>
        neura.yield
      }
      taskflow.yield writes(%arg11 : memref<?xi32>)
    }

We can also support architectures with counter FUs by using this IR. |
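To make the comparison between the two IR forms concrete, here is a minimal Python sketch (not part of this PR) that models both as index generators over the same 4 x 8 x 6 iteration space. The counter semantics are an assumption on my side: the leaf counter advances on every step, and each parent (relay, then root) advances only when its child wraps around. The helper names `loop_nest_indices`, `counter_chain_indices`, and `run` are hypothetical and exist only for this illustration.

```python
# Bounds copied from the IR above as (lower, upper, step) for root, relay, leaf.
BOUNDS = [(0, 4, 1), (0, 8, 1), (0, 6, 1)]


def loop_nest_indices():
    """Innermost-loop form: the two outer loops stay outside the kernel,
    and only the innermost loop lives in the kernel body."""
    for i in range(0, 4):
        for j in range(0, 8):
            # Hypothetical kernel body: just the innermost loop.
            for k in range(0, 6):
                yield (i, j, k)


def counter_chain_indices(bounds):
    """Counter form (assumed semantics): the leaf advances every step and
    a parent advances when its child wraps back to its lower bound."""
    values = [lo for lo, _, _ in bounds]
    while True:
        yield tuple(values)
        level = len(bounds) - 1              # start at the leaf counter
        while level >= 0:
            lo, hi, step = bounds[level]
            values[level] += step
            if values[level] < hi:
                break                        # no wrap, parents keep their value
            values[level] = lo               # wrap and carry into the parent
            level -= 1
        if level < 0:
            return                           # root wrapped: space exhausted


def run(indices, src):
    """Apply the load/store pair from the IR: out[k] = src[i][j][k]."""
    out = [0] * 6
    for i, j, k in indices:
        out[k] = src[i][j][k]
    return out


# Small deterministic input so each element encodes its own (i, j, k).
src = [[[i * 100 + j * 10 + k for k in range(6)] for j in range(8)]
       for i in range(4)]

loop_order = list(loop_nest_indices())
counter_order = list(counter_chain_indices(BOUNDS))
loop_out = run(loop_order, src)
counter_out = run(counter_order, src)
```

Under these assumptions both forms visit the same 192 index triples in the same order, so the final contents of the output buffer match. The sketch says nothing about how the hardware counters are actually scheduled.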
|
I still think [...] |
Hmmm, but this should diverge from a higher level. Because for counter-based [...] For the innermost-loop-based [...] They actually correspond to two different pass pipelines, which should be handled by the [...] |
|
So you mean [...] The principle should be: [...]
So both opt and compiler should have all 3 passes enabled: [...] And [...] |
Good idea~! I have already modified the conversion pipeline to this pipeline. |
Hi~ @HobbitQia, I enhanced the convert-taskflow-to-neura pass with two modes: [...] neura.kernel; this could be a starting point for your fusion