Wrap Innermost Loop as neura.kernel by ShangkunLi · Pull Request #261 · coredac/dataflow

ShangkunLi · 2026-02-03T14:29:22Z

Hi~ @HobbitQia, I enhanced the convert-taskflow-to-neura pass with two modes:

'hyperblock' mode: this mode runs the flow with counters
'innermost' mode: this mode wraps each innermost loop as the neura.kernel; this could be a starting point for your fusion

tancheng · 2026-02-03T19:56:02Z

Can you provide an IR example for convert-taskflow-to-neura?

I thought a hyperblock would always be there when we generate a task, and that it would be lowered to neura.kernel by convert-taskflow-to-neura. The neura.kernel would then be the entry point for the mapper. So why do we need two modes? The neura.kernel with counters can also be a target of our mapper, right?

ShangkunLi · 2026-02-04T07:01:39Z

> Can you provide an IR example for convert-taskflow-to-neura?
>
> I thought a hyperblock would always be there when we generate a task, and that it would be lowered to neura.kernel by convert-taskflow-to-neura. The neura.kernel would then be the entry point for the mapper. So why do we need two modes? The neura.kernel with counters can also be a target of our mapper, right?

The innermost mode is to provide a starting point for @HobbitQia, because there are no counter ops in his architecture.

Here is the transformed IR using --convert-taskflow-to-neura="mode=innermost":

    %write_outputs = taskflow.task @Task_0 read_memrefs(%arg0 : memref<?x8x6xi32>) write_memrefs(%arg5 : memref<?xi32>) [original_read_memrefs(%arg0 : memref<?x8x6xi32>), original_write_memrefs(%arg5 : memref<?xi32>)] : (memref<?x8x6xi32>, memref<?xi32>) -> (memref<?xi32>) {
    ^bb0(%arg10: memref<?x8x6xi32>, %arg11: memref<?xi32>):
      affine.for %arg12 = 0 to 4 {
        affine.for %arg13 = 0 to 8 {
          neura.kernel inputs(%arg10, %arg12, %arg13, %arg11 : memref<?x8x6xi32>, index, index, memref<?xi32>) attributes {kernel_name = "kernel_0"} {
          ^bb0(%arg14: memref<?x8x6xi32>, %arg15: index, %arg16: index, %arg17: memref<?xi32>):
            %c0 = arith.constant 0 : index
            %c6 = arith.constant 6 : index
            %c1 = arith.constant 1 : index
            scf.for %arg18 = %c0 to %c6 step %c1 {
              %1 = memref.load %arg14[%arg15, %arg16, %arg18] : memref<?x8x6xi32>
              memref.store %1, %arg17[%arg18] : memref<?xi32>
            }
            neura.yield
          }
        }
      }
      taskflow.yield writes(%arg11 : memref<?xi32>)
    }
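
To make the structure of the innermost-mode IR easier to read, here is a minimal Python model of it. This is only an illustration: `task_0` and `kernel_0` are hypothetical stand-ins for `@Task_0` and `kernel_0`, nested lists stand in for memrefs, and the loop bounds are copied from the example above. It assumes the kernel is simply invoked once per outer iteration, which is how the IR reads but is not confirmed by the pass documentation.

```python
# Illustrative model only: plain Python standing in for the IR above.
# Loop bounds (4, 8, 6) are taken from the example.

def kernel_0(src, i, j, dst):
    """Body of the wrapped innermost loop: copies src[i][j][0..5] into dst."""
    for k in range(6):          # scf.for %arg18 = %c0 to %c6 step %c1
        dst[k] = src[i][j][k]   # memref.load followed by memref.store


def task_0(src, dst):
    """The outer affine.for loops stay outside the kernel and call it per (i, j)."""
    for i in range(4):          # affine.for %arg12 = 0 to 4
        for j in range(8):      # affine.for %arg13 = 0 to 8
            kernel_0(src, i, j, dst)
    return dst


# Fill the input so every element encodes its own (i, j, k) position.
src = [[[100 * i + 10 * j + k for k in range(6)] for j in range(8)] for i in range(4)]
dst = task_0(src, [0] * 6)
print(dst)
```

Because every (i, j) iteration writes the same six output slots, the final contents come from the last outer iteration (i = 3, j = 7).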

If we use an architecture that has counters, the transformation should use the original flow, --convert-taskflow-to-neura="mode=hyperblock". The transformed IR:

    %write_outputs = taskflow.task @Task_0 read_memrefs(%arg0 : memref<?x8x6xi32>) write_memrefs(%arg5 : memref<?xi32>) [original_read_memrefs(%arg0 : memref<?x8x6xi32>), original_write_memrefs(%arg5 : memref<?xi32>)] : (memref<?x8x6xi32>, memref<?xi32>) -> (memref<?xi32>) {
    ^bb0(%arg10: memref<?x8x6xi32>, %arg11: memref<?xi32>):
      %1 = taskflow.counter attributes {counter_id = 0 : i32, counter_type = "root", lower_bound = 0 : index, step = 1 : index, upper_bound = 4 : index} : index
      %2 = taskflow.counter parent(%1 : index) attributes {counter_id = 1 : i32, counter_type = "relay", lower_bound = 0 : index, step = 1 : index, upper_bound = 8 : index} : index
      %3 = taskflow.counter parent(%2 : index) attributes {counter_id = 2 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 6 : index} : index
      neura.kernel inputs(%arg10, %arg11 : memref<?x8x6xi32>, memref<?xi32>) {
      ^bb0(%arg12: memref<?x8x6xi32>, %arg13: memref<?xi32>):
        %4 = neura.counter {counter_id = 0 : i32, counter_type = "root", lower_bound = 0 : index, step = 1 : index, upper_bound = 4 : index} : index
        %5 = neura.counter {counter_id = 1 : i32, counter_type = "relay", lower_bound = 0 : index, step = 1 : index, upper_bound = 8 : index} : index
        %6 = neura.counter {counter_id = 2 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 6 : index} : index
        %7 = memref.load %arg12[%4, %5, %6] : memref<?x8x6xi32>
        memref.store %7, %arg13[%6] : memref<?xi32>
        neura.yield
      }
      taskflow.yield writes(%arg11 : memref<?xi32>)
    }

We can also support architectures with counter FUs by using this IR.
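
As a rough illustration of how the root/relay/leaf counters could enumerate the same iteration space as the three nested loops, here is a small Python sketch. The semantics are an assumption made for this example (the leaf advances every step, and a parent advances only when its child wraps around); `counter_chain` is a hypothetical helper, not part of the neura dialect.

```python
from itertools import product


def counter_chain(bounds):
    """Yield index tuples for a chain of counters, root first.

    Assumed semantics (not confirmed by the dialect): the leaf counter
    advances every step, and each parent advances by its own step only
    when its child wraps around.
    bounds: list of (lower_bound, upper_bound, step), root first.
    """
    if any(lo >= hi for lo, hi, _ in bounds):
        return  # empty iteration space
    values = [lo for lo, _, _ in bounds]
    while True:
        yield tuple(values)
        # Advance from the leaf upward, carrying into the parent on wrap-around.
        level = len(bounds) - 1
        while level >= 0:
            lo, hi, step = bounds[level]
            values[level] += step
            if values[level] < hi:
                break
            values[level] = lo
            level -= 1
        if level < 0:  # the root wrapped: the iteration space is exhausted
            return


# Bounds copied from the three counters in the IR above.
bounds = [(0, 4, 1), (0, 8, 1), (0, 6, 1)]
indices = list(counter_chain(bounds))

# The chain visits the same (i, j, k) tuples, in the same order, as the loop nest.
same_as_loops = indices == list(product(range(4), range(8), range(6)))
print(len(indices), same_as_loops)
```

Under that assumption, the counter-based kernel and the loop nest touch the same elements in the same order; only where the iteration logic lives differs.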

tancheng · 2026-02-04T07:36:44Z

I still think the mode is unnecessary. We should have support_counter in arch_spec.yaml to determine which mode we are targeting, instead of using a different flag. How does this sound?
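
A minimal sketch of that idea, assuming the architecture spec has already been parsed into a dictionary. `select_kernel_form` is a hypothetical helper, and treating a missing `support_counter` key as "no counters" is an assumption made only for this example, not the behavior of the actual pass.

```python
def select_kernel_form(arch_spec: dict) -> str:
    """Return which neura.kernel form to generate for a parsed arch spec.

    `support_counter` is the key proposed in the comment above; defaulting
    to False when the key is absent is an assumption of this sketch.
    """
    if arch_spec.get("support_counter", False):
        return "hyperblock"   # counter-based kernel
    return "innermost"        # innermost loop wrapped as a kernel


with_counter = select_kernel_form({"support_counter": True})
without_counter = select_kernel_form({"support_counter": False})
unspecified = select_kernel_form({})
print(with_counter, without_counter, unspecified)
```

The point of the design is that the target description, rather than a per-invocation flag, decides which kernel form is produced.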

ShangkunLi · 2026-02-04T07:46:05Z

> I still think the mode is unnecessary. We should have support_counter in arch_spec.yaml to determine which mode we are targeting, instead of using a different flag. How does this sound?

Hmm, but the two flows diverge at a higher level. A counter-based neura.kernel should be produced through:

--convert-affine-to-taskflow
--contruct-hyperblock-from-task
--convert-taskflow-to-neura="mode=hyperblock"

An innermost-loop-based neura.kernel should be produced through:

--convert-affine-to-taskflow
--convert-taskflow-to-neura="mode=innermost"

These are two different pass pipelines, so shouldn't the choice be handled by neura-compiler rather than mlir-neura-opt?

tancheng · 2026-02-04T07:54:35Z

So you mean that with mode=innermost no hyperblock will exist?

The principle should be:

neura-compiler and mlir-neura-opt should have the same functionality; opt just picks some passes for verification. We should have a single compiler no matter the use case. All passes should be applied to every input; some passes just might have no effect on the IR.

So both opt and compiler should have all 3 passes enabled:

--convert-affine-to-taskflow
--contruct-hyperblock-from-task
--convert-taskflow-to-neura

And hyperblock should exist in both cases. Is this possible?

ShangkunLi · 2026-02-04T11:12:07Z

> So you mean that with mode=innermost no hyperblock will exist?
>
> The principle should be:
>
> neura-compiler and mlir-neura-opt should have the same functionality; opt just picks some passes for verification. We should have a single compiler no matter the use case. All passes should be applied to every input; some passes just might have no effect on the IR.
>
> So both opt and compiler should have all 3 passes enabled:
>
> --convert-affine-to-taskflow
> --contruct-hyperblock-from-task
> --convert-taskflow-to-neura
>
> And hyperblock should exist in both cases. Is this possible?

Good idea~!

I have already switched the conversion to this pipeline.


ShangkunLi added 5 commits February 3, 2026 21:40

prototype two mode taskflow->neura transform

53e0110

transform kenrel body to scf

8b50773

remove wrap-loop-in-kernel pass from neura

b9d1fd7

enhance taskflow.task print&parse functions

710a954

modify tests

c2cb24e

ShangkunLi requested review from HobbitQia and tancheng February 3, 2026 14:29

guosran force-pushed the wrap-loop branch from c9a4688 to c2cb24e February 4, 2026 02:07

ShangkunLi added 2 commits February 4, 2026 15:03

[fix] fix bug in constant handling

c95bdbf

[fix] fix bug

b8e042e

ShangkunLi added 6 commits February 4, 2026 17:03

construct hyperblock based on counter support

9d724a3

simplify the taskflow-to-neura logic

21c8eba

[fix] fix bugs in hyperblock construction

7df7081

[fix] fix bugs in taskflow2neura conversion

1a43db7

modify test

5413cad

add description in pass tblgen

ee3fe47

include the architecure yaml file

c7b7306

tancheng reviewed Feb 4, 2026

test/arch_spec/architecture_with_counter.yaml

tancheng reviewed Feb 4, 2026

include/NeuraDialect/NeuraPasses.td

tancheng approved these changes Feb 5, 2026

modify the name of arch_spec files

95de5b4

ShangkunLi merged commit d016fb1 into coredac:main Feb 5, 2026
1 check passed

ShangkunLi merged 15 commits into coredac:main from ShangkunLi:wrap-loop
