TL 路径问题汇总：GEMM 性能 ~5%（CommonIR 剥离标注），复杂 kernel（FlashAttention 等）无法编译（BiShengHIR Cube-Vector 混合崩溃）

## 问题概述

**TileLang → DLCompiler CommonIR → BiShengHIR → 昇腾 910B3 路径：GEMM 可编译运行但性能极差（~12 TFLOPS，仅为 baseline 的 5%），与同硬件上 Triton → DLCompiler → BiShengHIR 路径（250 TFLOPS）差距约 20 倍。**

根因：DLCompiler 的 `target.build.tilelang_commonir` codegen 将 TileLang TVM IR 中的昇腾硬件标注（内存层次 `#L1`/`#L0C`、计算单元 `gemm_v0`、DMA 搬运 `T.copy`）系统性剥离为通用 MLIR（裸 `memref.alloc()`、`linalg.matmul`、`memref.copy`），导致下游 BiShengHIR 编译器缺乏生成高效 aicore 指令所需的关键信息。

---

## 运行环境

- **硬件**: 昇腾 Ascend 910B3（NPU，aarch64 架构），24 个 AI Core
- **软件**: CANN 8.3.RC1，Python 3.11，torch 2.7.1
- **TileLang 环境**:
  - `tilelang-upstream`（`/root/tilelang-upstream`，版本 0.1.7.post3+cuda）：唯一识别 `TILELANG_USE_DLCOMPILER=1` 的 tilelang 版本，使用旧 API（`T.alloc_shared`, `T.alloc_fragment`, `T.clear`, `T.Pipelined`, `T.gemm`, `T.GemmWarpPolicy`）
  - `tilelang-ascend`（`/root/tilelang-ascend`，版本 0.1.4）：唯一支持昇腾原生后端（`alloc_L1`, `alloc_L0C`, `gemm_v0`, `is_npu=True`）的 tilelang，但**不识别** `TILELANG_USE_DLCOMPILER`
- **DLCompiler**: `/root/dlc_venv/lib/python3.11/site-packages/triton/backends/dicp_triton/`（含 `commonir/adapter.py`、`commonir/compiler.py`）
- **编译器后端**: `target.build.tilelang_commonir`（TVM FFI codegen）→ `dicp_opt` → BiShengHIR（bishengir-compile，闭源）
- **Baseline 对比对象**:
  - TL GEMM via DLC: `torch.matmul(a, b)`（即 `torch_npu.npu_mm`）
  - 同硬件 DLC-TR GEMM: 同一批测试中 Triton 路径达到 250 TFLOPS（baseline 的 113%）
- **计时方法**: `time.perf_counter()` + `torch.npu.synchronize()`（warmup=25, rep=100），与 `do_bench` 等价（ms 级 kernel 测量差异 <1%）
- **计算量公式**: GEMM = 2 × M × N × K

---

## 测试：TL GEMM via DLCompiler（性能极差）

**测试方法**：kernel 代码**与 DLCompiler 仓库 `test/commonir/ascend/test_gemm.py` 完全一致**（逐字节相同），仅在外部包装了命令行参数、bench 计时器和 `TILELANG_USE_DLCOMPILER=1` 环境变量。设置后通过 DLCompiler CommonIR 编译。每个 shape 独立冷启动（单独 Python 进程）。

完整 benchmark 脚本 `/tmp/tl_gemm_single.py` 见附录 A。

**测试结果**：

| 矩阵规模 | FLOPs | DLC-TL TFLOPS | Baseline TFLOPS | 相对 Baseline | DLC-TR 同 shape | 状态 |
|----------|-------|---------------|-----------------|---------------|-----------------|------|
| 1024³ | 2.1 G | 10.63 | 49.36 | 21.5% | — | 正常 |
| 2048³ | 17.2 G | 12.09 | 200.94 | 6.0% | 81.42 (40.5%) | 正常 |
| 4096³ | 137.4 G | 12.07 | 221.02 | 5.5% | 250.62 (113.4%) | 正常 |
| 8192³ | 1.1 T | — | 228.62 | — | 262.20 (114.7%) | **MTE DDR address out of range** |
| 48×7168×16384 | 11.3 G | — | 21.27 | — | 24.87 (117.0%) | **MTE DDR address out of range** |
| 96×18432×7168 | 25.4 G | — | 49.69 | — | 46.53 (93.6%) | **MTE DDR address out of range** |

**关键数据点**：

1. **DLC-TL TFLOPS 锁定在 ~12 TFLOPS**——从 2.1G FLOPs 到 137G FLOPs（65× 计算量增长），性能零增长。仅为 Ascend 910B3 FP16 峰值的 4.7%。

2. **与 DLC-TR 差距约 20 倍**：同一硬件、同一算法，DLC-TR 可达 250-262 TFLOPS（baseline 的 113-115%），DLC-TL 仅 12 TFLOPS。唯一区别是中间 IR 层——DLC-TR 不经过 CommonIR，DLC-TL 经过。

3. **非方阵全部 MTE 错误**：48×7168×16384、96×18432×7168、8192³ 三个 shape 在 CommonIR 阶段生成非法 DDR 地址（Memory Transfer Engine 地址越界），推测是 CommonIR codegen 对非规则 stride 的 `memref.reinterpret_cast` 处理有 bug。

4. **TL native（不经过 DLCompiler）无法作为对照**：tilelang-ascend 原生路径目前存在 `torch.device("npu")` 注册问题（`RuntimeError: Expected one of cpu, cuda, ... device type at start of device string: npu`），暂无法运行——这进一步说明 DLCompiler 是当前 TL kernel 在昇腾上运行的唯一可用路径。

---

## 根因分析：CommonIR 剥离硬件标注

### 证据一：编译管道对比

**DLC-TR 编译管道**（GEMM 达 262 TFLOPS）：
```
Triton IR → ttadapter → linked IR（含硬件标注）→ dicp_opt → bishengir-compile → binary
```

**DLC-TL 编译管道**（GEMM 仅 12 TFLOPS）：
```
TileLang TVM IR → tilelang_commonir codegen → CommonIR（标注全部剥离）→ dicp_opt → bishengir-compile → binary
```

### 证据二：IR 对比（实测 dump）

设置 `DLC_DUMP_IR=1` + `TILELANG_PRINT_COMMONIR=1`，从同一次 TL GEMM 编译中截取了 CommonIR 前后的 IR：

**Pre-CommonIR（TVM IR，有完整硬件标注）**：
```python
C_local = T.decl_buffer((128, 128), scope="local.fragment")    # ← #L0C 累加器
A_shared = T.decl_buffer((32, 128), scope="shared.dyn")        # ← #L1 共享内存
T.gemm_py(A_shared, B_shared, C_local, ...)                    # ← Cube 矩阵乘
T.copy(A[...], A_shared)                                        # ← DMA 搬运
```

**Post-CommonIR（`kernel.commonir.mlir`，标注全部丢失）**：
```mlir
// 内存分配 — 无 L1/L0C 内存空间标注
%C_local = memref.alloc() : memref<128x128xf32>
%A_shared = memref.alloc() : memref<32x128xf16>
%B_shared = memref.alloc() : memref<128x32xf16>

// 矩阵乘 — 通用 linalg.matmul，无 Cube 标注
%21 = linalg.matmul ins(%transposed, %transposed_6 : ...) outs(...) -> ...

// 数据搬运 — 通用 memref.copy，无 DMA 语义
memref.copy %subview, %A_shared : ... to ...
```

### 证据三：CommonIR 系统性丢失的标注

| 标注类别 | Pre-CommonIR (TVM IR) | Post-CommonIR | 后果 |
|----------|----------------------|---------------|------|
| 内存空间 | `scope="shared.dyn"` (#L1) / `scope="local.fragment"` (#L0C) | 裸 `memref<...>` | BiShengHIR 不知道数据在哪个存储层次，保守假设全部放 HBM |
| 计算单元 | `T.gemm_py` (Cube 矩阵乘) | `linalg.matmul` (通用矩阵乘) | 不知道该用 Cube 还是 Vector 执行 |
| 数据搬运 | `T.copy` (DMA 语义) | `memref.copy` (通用内存拷贝) | 无 DMA 硬件指令对应 |

### 为什么这能解释只有 ~5% 的性能

昇腾 910B3 的内存层次带宽差距约一个数量级：
- **HBM**（全局内存）：~1.2 TB/s
- **L1**（AI Core 本地缓存）：~16 TB/s
- **L0C**（Cube 累加器）：更高带宽、更低延迟

当 BiShengHIR 拿到的 CommonIR 里所有 `memref.alloc()` 都没有内存空间标注时，编译器无法判断数据应该放在 L1 还是 HBM。保守策略是把所有中间 buffer 放在 HBM——每次矩阵乘的输入输出都要经过 HBM 往返，带宽立即跌到 1/10。

同时，`linalg.matmul` 没有 Cube/Vector 标注，编译器可能默认用 Vector 单元串行执行矩阵乘，而 Vector 单元的矩阵乘吞吐量远低于 Cube 单元（Cube 是专门的矩阵乘加速器）。

**HBM 带宽瓶颈 + Vector 串行执行 ≈ 5% 峰值性能**，与实测数据完全吻合。

---

## 复现步骤

```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64:$LD_LIBRARY_PATH
export TORCH_DEVICE_BACKEND_AUTOLOAD=0
export TILELANG_USE_DLCOMPILER=1
export PYTHONPATH=/root/tilelang-upstream:/root/tilelang-upstream/3rdparty/tvm/python:/root/tilelang-upstream/3rdparty/tvm/3rdparty/tvm-ffi/python:/root/dlc_venv/lib/python3.11/site-packages

# TL GEMM via DLCompiler
/root/dlc_venv/bin/python /tmp/tl_gemm_single.py 4096 4096 4096

# 查看 CommonIR 输出（设置 DLC_DUMP_IR=1 + TILELANG_PRINT_COMMONIR=1）
/root/dlc_venv/bin/python /tmp/dump_tl_gemm.py
# CommonIR 输出在 /tmp/tmp/*/kernel.commonir.mlir
```

---

## 附录 A：benchmark 脚本 `/tmp/tl_gemm_single.py`

kernel 部分（`@T.prim_func` 块内）与 [DLCompiler 仓库 `test/commonir/ascend/test_gemm.py`](https://github.com/DeepLink-org/DLCompiler/blob/main/test/commonir/ascend/test_gemm.py) **逐字节相同**。仅在外部添加了 `TILELANG_USE_DLCOMPILER=1`、命令行参数解析、`npu_do_bench` 计时和 CPU 侧 tensor 创建（避免 NPU float16 随机数 kernel bug）。

```python
import os, sys, time
os.environ['TILELANG_USE_DLCOMPILER'] = '1'
import tilelang, tilelang.language as T
import torch, torch_npu

def npu_do_bench(fn, warmup=25, rep=100):
    for _ in range(warmup):
        fn()
    torch.npu.synchronize()
    t0 = time.perf_counter()
    for _ in range(rep):
        fn()
    torch.npu.synchronize()
    t1 = time.perf_counter()
    return (t1 - t0) / rep * 1000  # ms

M, K, N = int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3])
dtype_str = "float16"
accum_str = "float"
device = torch.device("npu")
block_M, block_N, block_K = 128, 128, 32

@T.prim_func
def gemm(
    A: T.Tensor((M, K), dtype_str),
    B: T.Tensor((K, N), dtype_str),
    C: T.Tensor((M, N), dtype_str),
):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_K, block_M), dtype_str)
        B_shared = T.alloc_shared((block_N, block_K), dtype_str)
        C_local = T.alloc_fragment((block_M, block_N), accum_str)
        T.clear(C_local)
        for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
            T.copy(A[k * block_K, by * block_M], A_shared)
            T.copy(B[bx * block_N, k * block_K], B_shared)
            T.gemm(A_shared, B_shared, C_local, transpose_A=True, transpose_B=True,
                   policy=T.GemmWarpPolicy.FullRow)
        T.copy(C_local, C[by * block_M, bx * block_N])

kernel = tilelang.compile(gemm)
td = torch.float16
torch.manual_seed(0)
a = (torch.rand((M, K), dtype=td) - 0.5).to(device)
b = (torch.rand((K, N), dtype=td) - 0.5).to(device)
c = torch.zeros((M, N), dtype=td).to(device)

kernel(a, b, c)
torch.npu.synchronize()

ms_tl = npu_do_bench(lambda: kernel(a, b, c))
ms_bl = npu_do_bench(lambda: torch.matmul(a, b))
flop = 2.0 * M * K * N
tflops_tl = flop / (ms_tl * 1e-3) / 1e12
tflops_bl = flop / (ms_bl * 1e-3) / 1e12
print(f"{M}x{K}x{N} DLCompiler-TL={tflops_tl:.2f} TFLOPS Baseline={tflops_bl:.2f} TFLOPS %={tflops_tl/tflops_bl*100:.1f}%")
```

## 附录 B：IR dump 脚本 `/tmp/dump_tl_gemm.py`

与附录 A 完全相同的 kernel，额外设置 `DLC_DUMP_IR=1` + `TILELANG_PRINT_COMMONIR=1` 捕获所有 IR 阶段。

```python
import os, sys
os.environ["TILELANG_USE_DLCOMPILER"] = "1"
os.environ["DLC_DUMP_IR"] = "1"
os.environ["TILELANG_PRINT_COMMONIR"] = "1"

import tilelang, tilelang.language as T
import torch, torch_npu

M, K, N = 1024, 1024, 1024
dtype_str = "float16"
accum_str = "float"
device = torch.device("npu")
block_M, block_N, block_K = 128, 128, 32

@T.prim_func
def gemm(
    A: T.Tensor((M, K), dtype_str),
    B: T.Tensor((K, N), dtype_str),
    C: T.Tensor((M, N), dtype_str),
):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_K, block_M), dtype_str)
        B_shared = T.alloc_shared((block_N, block_K), dtype_str)
        C_local = T.alloc_fragment((block_M, block_N), accum_str)
        T.clear(C_local)
        for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
            T.copy(A[k * block_K, by * block_M], A_shared)
            T.copy(B[bx * block_N, k * block_K], B_shared)
            T.gemm(A_shared, B_shared, C_local, transpose_A=True, transpose_B=True,
                   policy=T.GemmWarpPolicy.FullRow)
        T.copy(C_local, C[by * block_M, bx * block_N])

print("Compiling TL GEMM via DLCompiler...", flush=True)
kernel = tilelang.compile(gemm)
# DLC_DUMP_IR=1 → pre-CommonIR 输出到 ./tmp/*/kernel.commonir.mlir
# TILELANG_PRINT_COMMONIR=1 → 打印每个 pass 前后的 TVM IR

td = torch.float16
torch.manual_seed(0)
a = (torch.rand((M, K), dtype=td) - 0.5).to(device)
b = (torch.rand((K, N), dtype=td) - 0.5).to(device)
c = torch.zeros((M, N), dtype=td).to(device)
kernel(a, b, c)
torch.npu.synchronize()
print("TL GEMM run OK", flush=True)
```

---

## 问题与建议

1. DLCompiler 的 `target.build.tilelang_commonir` codegen（`/root/tilelang-upstream/src/target/codegen_commonir.cc`）是否有计划保留 TileLang TVM IR 中的内存层次标注（`scope="shared.dyn"` → `#L1`、`scope="local.fragment"` → `#L0C`），并在 CommonIR 中作为 `memref` 的 memory space attribute 传递？

2. `T.gemm_py` 是否有对应的 CommonIR dialect op（如 `hivm.gemm_v0`）可以直接映射，而不是退化为通用 `linalg.matmul`？如果没有，是否有计划在 CommonIR 中增加昇腾专用 op？

3. 非方阵 shape（48×7168×16384 等）的 MTE DDR address out of range 错误，是否与 CommonIR codegen 对 `memref.reinterpret_cast` 的 offset 计算有关？

4. 如果 CommonIR 的跨平台抽象目标与昇腾的细粒度硬件标注需求存在根本矛盾，DLCompiler 是否考虑为昇腾提供绕过 CommonIR 的专用编译路径，将 TileLang TVM IR 直接（或经 DicpOpt 但不经 CommonIR codegen）送入 BiShengHIR？

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TL 路径问题汇总：GEMM 性能 ~5%（CommonIR 剥离标注），复杂 kernel（FlashAttention 等）无法编译（BiShengHIR Cube-Vector 混合崩溃） #178

问题概述

运行环境

测试：TL GEMM via DLCompiler（性能极差）

根因分析：CommonIR 剥离硬件标注

证据一：编译管道对比

证据二：IR 对比（实测 dump）

证据三：CommonIR 系统性丢失的标注

为什么这能解释只有 ~5% 的性能

复现步骤

附录 A：benchmark 脚本 `/tmp/tl_gemm_single.py`

附录 B：IR dump 脚本 `/tmp/dump_tl_gemm.py`

问题与建议

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

矩阵规模	FLOPs	DLC-TL TFLOPS	Baseline TFLOPS	相对 Baseline	DLC-TR 同 shape	状态
1024³	2.1 G	10.63	49.36	21.5%	—	正常
2048³	17.2 G	12.09	200.94	6.0%	81.42 (40.5%)	正常
4096³	137.4 G	12.07	221.02	5.5%	250.62 (113.4%)	正常
8192³	1.1 T	—	228.62	—	262.20 (114.7%)	MTE DDR address out of range
48×7168×16384	11.3 G	—	21.27	—	24.87 (117.0%)	MTE DDR address out of range
96×18432×7168	25.4 G	—	49.69	—	46.53 (93.6%)	MTE DDR address out of range

标注类别	Pre-CommonIR (TVM IR)	Post-CommonIR	后果
内存空间	`scope="shared.dyn"` (#L1) / `scope="local.fragment"` (#L0C)	裸 `memref<...>`	BiShengHIR 不知道数据在哪个存储层次，保守假设全部放 HBM
计算单元	`T.gemm_py` (Cube 矩阵乘)	`linalg.matmul` (通用矩阵乘)	不知道该用 Cube 还是 Vector 执行
数据搬运	`T.copy` (DMA 语义)	`memref.copy` (通用内存拷贝)	无 DMA 硬件指令对应

TL 路径问题汇总：GEMM 性能 ~5%（CommonIR 剥离标注），复杂 kernel（FlashAttention 等）无法编译（BiShengHIR Cube-Vector 混合崩溃） #178

Description

问题概述

运行环境

测试：TL GEMM via DLCompiler（性能极差）

根因分析：CommonIR 剥离硬件标注

证据一：编译管道对比

证据二：IR 对比（实测 dump）

证据三：CommonIR 系统性丢失的标注

为什么这能解释只有 ~5% 的性能

复现步骤

附录 A：benchmark 脚本 /tmp/tl_gemm_single.py

附录 B：IR dump 脚本 /tmp/dump_tl_gemm.py

问题与建议

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

附录 A：benchmark 脚本 `/tmp/tl_gemm_single.py`

附录 B：IR dump 脚本 `/tmp/dump_tl_gemm.py`