Introduce Affine Controller Design #266

Open
ShangkunLi wants to merge 8 commits into tancheng:master from ShangkunLi:affine-controller

@ShangkunLi
Collaborator

Add Affine Controller (AC) for Outer Loop Management

Summary

This PR introduces the Affine Controller (AC), a programmable hardware module that manages outer loop counters in the CGRA. While the existing LoopCounterRTL (DCU) handles innermost loop counting at the tile level, the AC coordinates multi-level loop nesting and cross-CGRA loop synchronization.

Architecture

The AC contains an array of Configurable Counter Units (CCUs), each representing one level of a loop nest. CCUs form a DAG topology where:

  • Root CCUs drive the outermost loop with no parent
  • Regular CCUs have a local parent CCU and report completion upward

Each CCU tracks a loop variable (lower_bound, upper_bound, step, current_value) and manages a set of targets — tile-array DCUs it must notify when advancing iterations.

State Machine: IDLE → RUNNING → DISPATCHING → RUNNING / COMPLETE

  • RUNNING: Waiting for child completion events
  • DISPATCHING: Sending commands to targets (1 cycle per target)
  • COMPLETE: Loop finished, parent notified internally
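
As a reading aid, the state machine above can be sketched in plain Python. This is a minimal model under stated assumptions, not the RTL in AffineControllerRTL.py: the `next_state` helper and its flag names (`launched`, `children_done`, `loop_finished`, `targets_left`) are hypothetical, and the RUNNING → COMPLETE shortcut is one plausible reading of the last-iteration behavior described in this PR.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    DISPATCHING = auto()
    COMPLETE = auto()


def next_state(state, *, launched=False, children_done=False,
               loop_finished=False, targets_left=0):
    """Next CCU state for one cycle (hypothetical helper, not the RTL)."""
    if state is State.IDLE:
        return State.RUNNING if launched else State.IDLE
    if state is State.RUNNING:
        if not children_done:
            return State.RUNNING  # keep waiting for child completion events
        # Assumed: the bound check happens here, so the final iteration
        # skips DISPATCHING and goes straight to COMPLETE.
        return State.COMPLETE if loop_finished else State.DISPATCHING
    if state is State.DISPATCHING:
        # One command is sent per cycle; return to RUNNING after the last.
        return State.DISPATCHING if targets_left > 1 else State.RUNNING
    return State.COMPLETE
```

With two targets, the model spends two cycles in DISPATCHING before returning to RUNNING, matching the "1 cycle per target" note.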

Key Design Decisions

  • Single-phase dispatch: Each target receives exactly one command per dispatch — either CMD_RESET_LEAF_COUNTER (leaf-mode DCU) or CMD_UPDATE_COUNTER_SHADOW_VALUE (delivery-mode DCU). This avoids redundant messages and minimizes dispatch latency.
  • Internal CCU→parent completion: When a child CCU completes, it directly increments its parent's received_complete_count in the same cycle — no external signaling needed.
  • Automatic child reset: When a parent CCU finishes dispatching and returns to RUNNING, all child CCUs are automatically reset to lower_bound.
  • Last-iteration optimization: When current_value >= upper_bound, the CCU transitions directly to COMPLETE without dispatching, preventing stale completion events from the previous iteration.
  • Backpressure on events: Both tile and remote completion events are only consumed when a matching CCU is in RUNNING state.
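
The single-phase dispatch and last-iteration rules can be illustrated with a small Python sketch. It assumes that the counter is incremented before the bound check and that the new value travels with each command; the `Target`, `Ccu`, and `advance` names are hypothetical, and commands are plain tuples rather than CgraPayloadType messages.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    ctrl_addr: int
    shadow_only: bool  # True: delivery-mode DCU, False: leaf-mode DCU


@dataclass
class Ccu:
    lower_bound: int
    upper_bound: int
    step: int
    targets: list = field(default_factory=list)

    def __post_init__(self):
        self.current_value = self.lower_bound


def advance(ccu):
    """Advance one iteration; return (commands, complete)."""
    ccu.current_value += ccu.step  # assumed: increment precedes the check
    if ccu.current_value >= ccu.upper_bound:
        return [], True  # last iteration: complete without dispatching
    cmds = []
    for t in ccu.targets:
        # Exactly one command per target, chosen by the target's mode.
        name = ("CMD_UPDATE_COUNTER_SHADOW_VALUE" if t.shadow_only
                else "CMD_RESET_LEAF_COUNTER")
        cmds.append((name, t.ctrl_addr, ccu.current_value))
    return cmds, False
```

For a CCU with bounds 0..2 and two targets, the first call yields two commands and the second call yields none and reports completion.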

Cross-CGRA Support

CCU targets can be marked as remote (is_remote=1). Dispatch commands for remote targets are sent via send_to_remote (routed through the Controller's inter-CGRA NoC). Remote completion events arrive as CMD_AC_CHILD_COMPLETE on recv_from_remote.
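
A minimal sketch of the routing decision, assuming each target carries the four fields listed for CMD_AC_CONFIG_TARGET (tile_id, ctrl_addr, is_remote, cgra_id). The `route_dispatch` helper, the `send_to_tile` list name, and the message tuple layout are hypothetical; only `send_to_remote` is named in this PR.

```python
from collections import namedtuple

Target = namedtuple("Target", "tile_id ctrl_addr is_remote cgra_id")


def route_dispatch(targets, value):
    """Split dispatch messages into local and remote queues."""
    send_to_tile, send_to_remote = [], []
    for t in targets:
        msg = (t.cgra_id, t.tile_id, t.ctrl_addr, value)  # illustrative layout
        if t.is_remote:
            send_to_remote.append(msg)  # forwarded over the inter-CGRA NoC
        else:
            send_to_tile.append(msg)    # delivered to a DCU on this CGRA
    return send_to_tile, send_to_remote
```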

Files Changed

| File | Change |
| --- | --- |
| controller/AffineControllerRTL.py | [NEW] AC implementation (~400 lines) |
| controller/test/AffineControllerRTL_test.py | [NEW] 4 test cases |
| lib/cmd_type.py | Added 11 new AC command types, NUM_CMDS 28→40 |

Test Cases

| Test | Description |
| --- | --- |
| test_basic_2_layer_loop | 1 root CCU, 1 leaf DCU. Verifies basic dispatch and completion. |
| test_sibling_barrier | 1 root CCU, 2 leaf DCUs (child_count=2). Verifies barrier synchronization. |
| test_3_layer_loop | CCU[0]→CCU[1]→DCU chain. Verifies internal CCU completion, parent dispatch, and child reset across 2×3=6 inner iterations. |
| test_cross_cgra_2_layer_loop | 1 root CCU with remote + local targets. Verifies send_to_remote / recv_from_remote paths. |

@tancheng
Owner

How do the inner loops' start/end values get updated?

@ShangkunLi
Collaborator Author

How do the inner loops' start/end values get updated?

Distributed counter units are updated through the cmds (including CMD_UPDATE_COUNTER_SHADOW_VALUE, CMD_RESET_LEAF_COUNTER)

@tancheng
Owner

How do the inner loops' start/end values get updated?

Distributed counter units are updated through the cmds (including CMD_UPDATE_COUNTER_SHADOW_VALUE, CMD_RESET_LEAF_COUNTER)

  • What does "shadow" mean here?
  • How do we make sure the inner-loops are already done with their execution before sending out the updated values/cmd?

@ShangkunLi
Collaborator Author

ShangkunLi commented Feb 25, 2026

How do the inner loops' start/end values get updated?

Distributed counter units are updated through the cmds (including CMD_UPDATE_COUNTER_SHADOW_VALUE, CMD_RESET_LEAF_COUNTER)

  • What does "shadow" mean here?

In loop delivery mode, a DCU uses shadow registers to store and deliver the outer loop indices.

s.send_out[0].msg @= s.shadow_regs[addr]

They are updated by the affine controller through the CMD_UPDATE_COUNTER_SHADOW_VALUE command.

elif s.recv_from_ctrl_mem.msg.cmd == CMD_UPDATE_COUNTER_SHADOW_VALUE:

  • How do we make sure the inner-loops are already done with their execution before sending out the updated values/cmd?

Only DCUs in loop count mode send a CMD_LEAF_COUNTER_COMPLETE command to the affine controller, and they do so when they reach the upper bound.

CMD_LEAF_COUNTER_COMPLETE, s.DataType(0, 0), 0, s.recv_opt.msg, addr

After receiving this complete signal, the affine controller will trigger outer loop counter increments and send CMD_UPDATE_COUNTER_SHADOW_VALUE & CMD_RESET_LEAF_COUNTER accordingly.
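
The handshake described in this reply can be modeled in a few lines of plain Python. This is a sketch under assumptions, not the LoopCounterRTL code: the `Dcu` class, its `tick`/`handle` methods, and the `run` driver are hypothetical, and the affine controller is reduced to a single outer counter with step 1.

```python
class Dcu:
    """Toy DCU: 'count' mode counts iterations, 'delivery' mode holds a shadow value."""

    def __init__(self, mode, upper_bound=0):
        self.mode = mode
        self.upper_bound = upper_bound
        self.count = 0
        self.shadow = 0

    def tick(self):
        """Count mode only: run one iteration, report completion at the bound."""
        if self.mode != "count":
            return None
        self.count += 1
        if self.count >= self.upper_bound:
            return "CMD_LEAF_COUNTER_COMPLETE"
        return None

    def handle(self, cmd, value=0):
        if cmd == "CMD_UPDATE_COUNTER_SHADOW_VALUE":
            self.shadow = value  # new outer-loop index to deliver
        elif cmd == "CMD_RESET_LEAF_COUNTER":
            self.count = 0       # restart the inner loop


def run(outer_upper, inner_upper):
    leaf, delivery = Dcu("count", inner_upper), Dcu("delivery")
    outer, trace = 0, []
    while outer < outer_upper:
        trace.append((delivery.shadow, leaf.count))
        if leaf.tick() == "CMD_LEAF_COUNTER_COMPLETE":
            outer += 1  # the AC advances only after the leaf reports completion
            if outer < outer_upper:
                delivery.handle("CMD_UPDATE_COUNTER_SHADOW_VALUE", outer)
                leaf.handle("CMD_RESET_LEAF_COUNTER")
    return trace
```

Because the outer index changes only in response to the completion command, the trace visits (outer, inner) pairs in the same order as the equivalent nested `for` loops.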

lib/cmd_type.py Outdated
CMD_GLOBAL_REDUCE_COUNT: "(GLOBAL_REDUCE_COUNT)",
CMD_GLOBAL_REDUCE_ADD: "(GLOBAL_REDUCE_ADD)",
CMD_GLOBAL_REDUCE_MUL: "(GLOBAL_REDUCE_MUL)",
CMD_GLOBAL_REDUCE_MUL: "(GLOBAL_REDUCE_MUL)",
Owner

one more space

Collaborator Author

edited~

DataType(mk_parent_payload(0, True), 0), 0, CtrlType(0), 0),
]

# ===== Configure CCU[1]: j = 0..2, parent = CCU[0] =====
Owner

How is CCU[1] triggered by CCU[0]?

Collaborator Author

Here we are testing a three-layer nested loop.

for i ... // CCU0
    for j ...  // CCU1
        for k ...  // DCU

As you can see in the CCU config part:

For CCU0, it is used for loop i. It is configured as:

CgraPayloadType(CMD_AC_CONFIG_CHILD_COUNT, DataType(1, 0), 0, CtrlType(0), 0),
# CCU[0] target: i-delivery DCU at ctrl_addr=1 (shadow_only!)
CgraPayloadType(CMD_AC_CONFIG_TARGET,
                    *mk_target_config(1, 0, shadow_only=True), CtrlType(0), 0),
CgraPayloadType(CMD_AC_CONFIG_PARENT,
                    DataType(mk_parent_payload(0, True), 0), 0, CtrlType(0), 0),

This means CCU0 has one child counter and is configured as a parent counter.

For CCU1, it is used for loop j. It is configured as:

CgraPayloadType(CMD_AC_CONFIG_CHILD_COUNT, DataType(1, 0), 0, CtrlType(0), 1),
# CCU[1] target 0: k-DCU at ctrl_addr=0 (leaf, needs reset + shadow)
CgraPayloadType(CMD_AC_CONFIG_TARGET,
                    *mk_target_config(0, 0, shadow_only=False), CtrlType(0), 1),
# CCU[1] target 1: j-delivery DCU at ctrl_addr=2 (shadow_only!)
CgraPayloadType(CMD_AC_CONFIG_TARGET,
                    *mk_target_config(2, 0, shadow_only=True), CtrlType(0), 1),
CgraPayloadType(CMD_AC_CONFIG_PARENT,
                    DataType(mk_parent_payload(0, False), 0), 0, CtrlType(0), 1),

This means CCU1 has a parent counter, and the parent's id is 0 (i.e., CCU0).
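
To make the CCU0 → CCU1 relationship concrete, here is a small behavioral sketch of the three-layer nest in plain Python. It assumes a step of 1 and collapses dispatch into a direct method call; the `Ccu` class and `run_three_layer` driver are hypothetical and stand in for the configured CCUs and the k-loop DCU.

```python
class Ccu:
    def __init__(self, lower, upper, parent=None, child_count=1):
        self.lower, self.upper = lower, upper
        self.parent, self.child_count = parent, child_count
        self.children = []
        self.value, self.received, self.complete = lower, 0, False
        if parent is not None:
            parent.children.append(self)

    def reset(self):
        self.value, self.received, self.complete = self.lower, 0, False
        for child in self.children:
            child.reset()

    def child_complete(self):
        """Called once per completion event from a child (CCU or leaf DCU)."""
        self.received += 1
        if self.received < self.child_count:
            return
        self.received = 0
        self.value += 1
        if self.value >= self.upper:
            self.complete = True
            if self.parent is not None:
                self.parent.child_complete()  # internal CCU-to-parent completion
        else:
            for child in self.children:
                child.reset()                 # automatic child reset


def run_three_layer(i_upper=2, j_upper=3):
    ccu0 = Ccu(0, i_upper)               # loop i, root
    ccu1 = Ccu(0, j_upper, parent=ccu0)  # loop j, parent is CCU0
    runs = []
    while not ccu0.complete:
        runs.append((ccu0.value, ccu1.value))  # one full k-loop run on the DCU
        ccu1.child_complete()                  # leaf DCU reports completion
    return runs
```

With the default bounds the leaf loop runs six times, once per (i, j) pair, which matches the 2×3=6 inner iterations mentioned for test_3_layer_loop.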

# CCU[0] child_count=2, targets at ctrl_addr=0 and ctrl_addr=1.
#-------------------------------------------------------------------------

def test_sibling_barrier():
Owner

  • What does barrier mean?
  • Does child_count matter? What is it used for?
    • Is ctrl_addr the control signal/instruction index in ctrl memory? How is it related to child_count?

Collaborator Author

  • This describes the case of 1 root counter with 2 sibling child counters. The outer loop only updates its index when both inner counters complete.
  • child_count is the number of complete signals required from child counters. The outer loop advances by one step once all the required complete signals have been received.
  • Since each CCU has a DCU on the tile array, and each leaf counter also has a DCU on the tile array, the ctrl_addr is used as follows:
    1. For a CCU's DCU, which is configured in loop delivery mode, the ctrl_addr identifies whether the DCU needs to be updated when its corresponding CCU updates.
    2. For a leaf counter's DCU, which is configured in loop count mode, the ctrl_addr identifies which outer loop a received complete signal should trigger.
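
The barrier behavior can be sketched as a counter that only advances after child_count completion events, and that refuses events while it is not in the RUNNING state. The `BarrierCcu` class and its `offer_complete` method are hypothetical names for illustration; `received_complete_count` follows the field name used in the PR description.

```python
class BarrierCcu:
    def __init__(self, child_count):
        self.child_count = child_count
        self.received_complete_count = 0
        self.running = True
        self.advances = 0  # how many times the outer loop has advanced

    def offer_complete(self):
        """Return True if the event is consumed, False if it is held back."""
        if not self.running:
            return False  # backpressure: the sender must retry later
        self.received_complete_count += 1
        if self.received_complete_count == self.child_count:
            self.received_complete_count = 0
            self.advances += 1  # barrier released: all siblings reported
        return True
```

With child_count=2, the first completion is consumed without advancing the outer loop, and the second one releases the barrier.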

==========================================================================
Affine Controller (AC) for managing outer loop counters in CGRA.

Each AC contains configurable number of Configurable Counter Units (CCUs).
Owner

configurable number -> parameterizable number?

Collaborator Author

edited~

AffineControllerRTL.py
==========================================================================
Affine Controller (AC) for managing outer loop counters in CGRA.

Owner

Can you also explain where this FU is located in terms of arch design? Is it an FU similar to Adder? It is inside the controller folder, so it seems to sit near the controller instead?

It can also consume commands (e.g., CMD_AC_CONFIG_LOWER and COMPLETE). Who sends CMD_AC_CONFIG_LOWER?

Collaborator Author

The AC is under the control of the controller.

The hierarchy looks like:
[Image: Screenshot 2026-02-26 at 16 14 07]

The CMD_AC_CONFIG_LOWER is from the controller.

The CMD_AC_CHILD_COMPLETE is from the affine controller that belongs to another CGRA. This command is introduced so that we can chain two affine controllers from two/multiple CGRAs into a bigger affine controller.

Owner

Please create a folder inside https://github.com/tancheng/VectorCGRA/tree/master/doc/figures to hold your design diagrams, and put a link to it in a comment in this .py file. Also attach the figures to this PR's description.

Moreover, what do you think of renaming AC to LC? i.e., Loop controller? IIRC, we already have a LoopCounter in our FUs. So LC would control LoopCounter, am I right?

Comment on lines +34 to +36
cmp_fn = lambda a, b: (a.cmd == b.cmd) and \
(a.data.payload == b.data.payload) and \
(a.ctrl_addr == b.ctrl_addr)
Owner

Align the indent of the (a.

Collaborator Author

edited~

tancheng requested a review from rp15 on February 25, 2026, 15:13
Comment on lines +53 to +67
CMD_AC_CONFIG_LOWER = 32 # Configures CCU lower_bound.
CMD_AC_CONFIG_UPPER = 33 # Configures CCU upper_bound.
CMD_AC_CONFIG_STEP = 34 # Configures CCU step.
CMD_AC_CONFIG_CHILD_COUNT = 35 # Configures child_complete_count.
CMD_AC_CONFIG_TARGET = 36 # Configures target (tile_id, ctrl_addr, is_remote, cgra_id).
CMD_AC_CONFIG_PARENT = 37 # Configures parent_ccu_id, is_root, is_relay.
CMD_AC_LAUNCH = 38 # Launches AC (all CCUs enter RUNNING).

# Affine Controller Inter-CGRA Sync Commands.
CMD_AC_SYNC_VALUE = 39 # Parent AC → Child AC: sync current value.
CMD_AC_CHILD_COMPLETE = 40 # Child AC → Parent AC: child complete.
CMD_AC_CHILD_RESET = 41 # Parent AC → Child AC: reset child.

# Affine Controller Status.
CMD_AC_ALL_COMPLETE = 42 # AC → Controller: all loops complete.
Owner

Briefly explain who the parent/sender is and who the receiver/child is (a tile? another AC?), and where each command comes from or is produced.
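
One way to capture the sender and receiver of each command is a small lookup table next to the command values. The sketch below copies the values and arrows from the diff comments above; the `AcCmd` enum, the `DIRECTION` table, and the `direction` helper are hypothetical. Treating every CONFIG command and CMD_AC_LAUNCH as controller-to-AC is an assumption, since this thread confirms that sender only for CMD_AC_CONFIG_LOWER.

```python
from enum import IntEnum


class AcCmd(IntEnum):
    CMD_AC_CONFIG_LOWER = 32
    CMD_AC_CONFIG_UPPER = 33
    CMD_AC_CONFIG_STEP = 34
    CMD_AC_CONFIG_CHILD_COUNT = 35
    CMD_AC_CONFIG_TARGET = 36
    CMD_AC_CONFIG_PARENT = 37
    CMD_AC_LAUNCH = 38
    CMD_AC_SYNC_VALUE = 39
    CMD_AC_CHILD_COMPLETE = 40
    CMD_AC_CHILD_RESET = 41
    CMD_AC_ALL_COMPLETE = 42


# (sender, receiver) pairs taken from the arrows in the diff comments.
DIRECTION = {
    AcCmd.CMD_AC_SYNC_VALUE: ("parent AC", "child AC"),
    AcCmd.CMD_AC_CHILD_COMPLETE: ("child AC", "parent AC"),
    AcCmd.CMD_AC_CHILD_RESET: ("parent AC", "child AC"),
    AcCmd.CMD_AC_ALL_COMPLETE: ("AC", "controller"),
}


def direction(cmd):
    # Assumed default: configuration and launch commands come from the controller.
    return DIRECTION.get(cmd, ("controller", "AC"))
```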


2 participants