157 commits
bc6e0d2
upload 2nd version of sdpa backward
yuankuns Oct 3, 2025
50bb240
upload cutlass dot_do_o
yuankuns Oct 3, 2025
991b541
fix dp save issue
yuankuns Oct 6, 2025
b5d257b
update sdpa to include compat change
yuankuns Oct 9, 2025
2424515
Tiled MMA tests using XE_DPAS_TT (#550)
rishi-yadav Oct 14, 2025
4963355
Rename as SYCL*TLA
Antonyvance Oct 14, 2025
a7b4c0e
Rename: Fix python package
Antonyvance Oct 14, 2025
f6b6eb1
Rename:Minor fixes
Antonyvance Oct 14, 2025
546f320
Rename: Minor fixes
Antonyvance Oct 14, 2025
a9d9102
enable prefetch
yuankuns Oct 14, 2025
161417f
Rename as SYCL*TLA (#561)
Antonyvance Oct 15, 2025
9c13a86
fix host compiler issue
yuankuns Oct 16, 2025
d3b8d4f
fix perf issue, reach 95% of xetla in 1 shape
yuankuns Oct 16, 2025
bfc71c0
move HPs out for perf tuning
yuankuns Oct 17, 2025
35e80e1
[CuTe] [Xe] Fix make_block_2d_copy_* for batched tensors (#549)
petercad Oct 17, 2025
905bd97
[Reorder] add optimized e8m0 -> f32 upconversion (#544)
petercad Oct 17, 2025
7d7b689
Guard is_source_supported by CopyOpG2R (#565)
nsingh-habana Oct 17, 2025
1db79a9
examples: cute: tutorial: use default queue in xe_gemm (#566)
Jiexin-Zheng Oct 17, 2025
9ccb008
Update SECURITY.md
Antonyvance Oct 17, 2025
5880275
Update SECURITY.md (#567)
Antonyvance Oct 22, 2025
7feb377
Use newer version of mma_atom and copy_atom in 00_bmg_gemm (#540)
anamikac-intel Oct 22, 2025
33cae04
[CuTe] [Xe] Separate make_block_2d_copy_{C,D} APIs for loads/stores (…
petercad Oct 23, 2025
9555cbd
Resolve conflict between 572 & 540 (#576)
sanchitintel Oct 23, 2025
0b140c8
refactor the dq dk dv
yuankuns Oct 24, 2025
9e16211
grouped gemm with new APIs (#574)
jiyang1011 Oct 27, 2025
3492d71
fix error in host compilation
yuankuns Oct 27, 2025
675727d
[CI] Change PVC driver (#575)
anupren Oct 28, 2025
3509cb0
Unit tests for LOAD_2D and STORE_2D
rishi-yadav Oct 28, 2025
b113448
Update mma.cpp
rishi-yadav Oct 28, 2025
ec036ce
Changes for new cute apis prefetch transpose vnni
rishi-yadav Oct 28, 2025
e4c3276
Gemm Universal unit tests for MainloopIntelW8A8 API
rishi-yadav Oct 28, 2025
9e05683
Delete test/unit/cute/intel_xe/xe_vnni_2d.cpp
rishi-yadav Oct 28, 2025
409d9f9
Delete test/unit/cute/intel_xe/xe_copy_2d_test.cpp
rishi-yadav Oct 28, 2025
16638b9
Delete test/unit/cute/intel_xe/xe_copy_prefetch_2d.cpp
rishi-yadav Oct 28, 2025
5595511
Delete test/unit/cute/intel_xe/xe_transpose_2d.cpp
rishi-yadav Oct 28, 2025
a48b496
Update CMakeLists.txt
rishi-yadav Oct 28, 2025
610e02a
Update mma.cpp
rishi-yadav Oct 28, 2025
0ed83bb
Update test/unit/gemm/device/gemm_testbed_3x.hpp
rishi-yadav Oct 28, 2025
cb84f4e
Epilogue DataType Mismatch (#563)
amitchawla1 Oct 28, 2025
a0172fd
Fixing trD compute type in the Xe Epilogue (#580)
joyalbin Oct 28, 2025
9acfcd5
Support for CUTLASS Library generation / Ops / Xe Arch (#578)
Antonyvance Oct 29, 2025
94d8abc
fix acc issue in loading lse when m is not even
yuankuns Oct 30, 2025
80524d7
Rename python/cutlass to python/cutlass_cppgen (#587)
amitchawla1 Oct 30, 2025
9afbe85
enable is causal
yuankuns Oct 31, 2025
a08c429
[Xe] 4-bit unit stride -> VNNI reorders (#593)
petercad Oct 31, 2025
521dfcd
[Xe] Refactor split barrier functionality
petercad Oct 27, 2025
65b1a3f
[CuTe] [Xe] Reorder fixes/extensions for f32 -> bf16
petercad Oct 27, 2025
2212f1b
[CuTe] [Xe] Allow size-1 fragments in block 2D copies
petercad Oct 27, 2025
dec36a9
[CuTe] [Xe] Copy fixes
petercad Oct 27, 2025
48d82e8
[CuTe] [Xe] make_block_2d_copy_{C,D} variants with subtiling
petercad Oct 27, 2025
8819b01
[CuTe] Minor layout features/fixes
petercad Oct 27, 2025
21fb89a
[CuTe] [Xe] New make_subgroup_tensor helpers
petercad Oct 27, 2025
9a6aa27
[CuTe] [Xe] Subgroup-scope broadcast/reduction
petercad Oct 27, 2025
d4ef382
[Platform] Add missing numeric_limits<float>::lowest()
petercad Oct 28, 2025
d02c58b
[Xe] Re-implement FlashAttention with new atoms
petercad Oct 27, 2025
9f74e54
[Xe] Additional comments
petercad Oct 28, 2025
7ab29af
Re-implement FlashAttention with new Xe atoms (#547)
petercad Oct 31, 2025
177c85c
Python ops support improvements and test fixes (#595)
Antonyvance Nov 1, 2025
4d48a5b
enable tri-dao style gqa/mqa
yuankuns Nov 2, 2025
d2292f0
v0.6.0 update (#606)
anupren Nov 3, 2025
6571de8
Added PR-540 related details (#609)
kausikmaiti Nov 4, 2025
d17d407
Fix for void ElementC in epilogue. (#590)
amitchawla1 Nov 4, 2025
984e3ab
New mma_atoms and copy_atoms in bmg_grouped_gemm_fp8 (#579)
nsingh-habana Nov 4, 2025
91eaa1a
[Examples] [Xe] Improve performance for some upconversion cases in xe…
petercad Nov 5, 2025
ac1e946
[Github] New Templates (#610)
Antonyvance Nov 5, 2025
56a200d
[Xe] [Reorder] Support broadcasting reorders (#589)
petercad Nov 5, 2025
45a33fe
Merge NV 4.2.1 to SYCL-TLA Main (#592)
anamikac-intel Nov 5, 2025
6df6dd1
Add new falsh attention fp8 support on BMG (#613)
ClarkChin08 Nov 6, 2025
5a0b7a8
Not include MKL when headers only (#615)
airMeng Nov 7, 2025
ffb0d54
Unit tests for LOAD_2D and STORE_2D (#582)
aschabana Nov 10, 2025
887362d
NHD layout (#603)
sunjiweiswift Nov 10, 2025
b62b28d
[Xe] [Reorder] Cleanup (#614)
petercad Nov 11, 2025
b8612e3
Merge branch 'main' into mainloop_unit_tests
rishi-yadav Nov 11, 2025
78b3652
enable parallel over seqlen_kv
yuankuns Nov 11, 2025
9d14539
Merge branch 'main' into mainloop_unit_tests
rishi-yadav Nov 11, 2025
aba5590
align n_block calculation to m_block
yuankuns Nov 11, 2025
85b020e
Update CMakeLists.txt
rishi-yadav Nov 11, 2025
10144d7
Merge branch 'intel:mainloop_unit_tests' into mainloop_unit_tests
rishi-yadav Nov 11, 2025
6f2a837
Create xe_copy_2d_test.cpp
rishi-yadav Nov 11, 2025
62281a6
Update xe_copy_2d_test.cpp
rishi-yadav Nov 11, 2025
3ea724e
Update mma.cpp
rishi-yadav Nov 11, 2025
79eb970
Changes for new cute apis prefetch transpose vnni
rishi-yadav Oct 28, 2025
3b9fb39
Gemm Universal unit tests for MainloopIntelW8A8 API
rishi-yadav Oct 28, 2025
9a51b2d
Delete test/unit/cute/intel_xe/xe_vnni_2d.cpp
rishi-yadav Oct 28, 2025
7b1d004
Delete test/unit/cute/intel_xe/xe_copy_2d_test.cpp
rishi-yadav Oct 28, 2025
1b140d4
Delete test/unit/cute/intel_xe/xe_copy_prefetch_2d.cpp
rishi-yadav Oct 28, 2025
88fcd35
Delete test/unit/cute/intel_xe/xe_transpose_2d.cpp
rishi-yadav Oct 28, 2025
89be086
Update CMakeLists.txt
rishi-yadav Oct 28, 2025
c1834c7
Update mma.cpp
rishi-yadav Oct 28, 2025
452ac0b
Update test/unit/gemm/device/gemm_testbed_3x.hpp
rishi-yadav Oct 28, 2025
702a873
Update CMakeLists.txt
rishi-yadav Nov 11, 2025
7cf016d
Create xe_copy_2d_test.cpp
rishi-yadav Nov 11, 2025
12b4690
Update xe_copy_2d_test.cpp
rishi-yadav Nov 11, 2025
83ba434
Update mma.cpp
rishi-yadav Nov 11, 2025
fee297d
Revert "NHD layout" (#622)
rolandschulz Nov 11, 2025
e2fde37
Use newer version of copy_atom in epilogue collective (#573)
anamikac-intel Nov 11, 2025
fb8c97c
Add CausalMask support with new flash attention api (#604)
ClarkChin08 Nov 12, 2025
5ac9700
Add VarLen support to new flash attention api (#616)
ClarkChin08 Nov 12, 2025
5532d8e
[Docs] [Xe] Data ownership for sub-byte types (#627)
petercad Nov 13, 2025
de631ad
Forward -Werror to host g++ compiler (#624)
nsingh-habana Nov 13, 2025
fc4aaf5
Merge branch 'intel:mainloop_unit_tests' into mainloop_unit_tests
rishi-yadav Nov 13, 2025
18a8441
fix tile shape for 128 headdim
yuankuns Nov 13, 2025
4e2f5f8
Persistent SDPA kernel (#608)
wuxun-zhang Nov 14, 2025
aecfb09
Gemm Universal unit tests for MainloopIntelW8A8 API (#584)
aschabana Nov 14, 2025
3bc283e
Revert "fix tile shape for 128 headdim"
yuankuns Nov 17, 2025
92785e4
[DOC] Clarify what the numbers in the subgroup view mean in the re-ar…
sanchitintel Nov 18, 2025
3f2a337
Rearchitecture: Xe epilogue (#621)
petercad Nov 20, 2025
52941a9
Support multiple targets in DDPCPP_SYCL_TARGET (#630)
nsingh-habana Nov 20, 2025
466f5cb
Example of BF16/FP16 MoE Grouped GEMM with CuTe interface (#600)
sanchitintel Nov 20, 2025
b0deafd
fix oneapi 2025.3 warning; enable 64x64 tileing for 128 and converter…
yuankuns Nov 21, 2025
dbc3290
enable double buffer of p/ds in gmem
yuankuns Nov 22, 2025
34ca80e
fix 2d load gap in bmg & pvc
yuankuns Nov 22, 2025
1f49712
remove useless buff
yuankuns Nov 22, 2025
4e1ba37
reduce dq atomic add operation
yuankuns Nov 23, 2025
315cf75
refine MOE/grouped GEMM (#638)
taozha2 Nov 24, 2025
9262749
enable bottom right mask
yuankuns Nov 25, 2025
b5265a2
xe_array_epilogue with new APIs (#643)
jiyang1011 Nov 25, 2025
f26515e
set stride 1 for n_block pickup
yuankuns Nov 25, 2025
360041b
isolate each block
yuankuns Nov 25, 2025
3380023
Bug fix in the CuTe interface MoE GEMM example (#648)
sanchitintel Nov 26, 2025
3c4e137
Unit tests for prefetch transpose and vnni (#632)
rishi-yadav Nov 26, 2025
22c46ab
Miscellaneous reorder-related fixes (#635)
petercad Nov 27, 2025
9ced4c3
[CuTe] Fix atom partitioning in some edge cases (#628)
petercad Nov 27, 2025
b9a5877
EVTs, part 1 (#647)
petercad Nov 27, 2025
6dfe1b3
KCooperative dispatch policy unit tests (#646)
rishi-yadav Nov 27, 2025
a6b0b5f
Update README.md (#637)
anupren Nov 27, 2025
4cdea5a
KCooperative Cmake changes (#651)
rishi-yadav Dec 1, 2025
9ff1cc9
Changes for reorder apis (#639)
rishi-yadav Dec 2, 2025
3bb4532
Changes for fix flash attention KV cache and prefill issues (#617)
rishi-yadav Dec 2, 2025
482b40e
Updated the epilogue test to use new MMA/Atom APIs (#654)
aschabana Dec 2, 2025
88bc40a
upload 2nd version of sdpa backward
yuankuns Oct 3, 2025
e3b12eb
upload cutlass dot_do_o
yuankuns Oct 3, 2025
1e0889a
fix dp save issue
yuankuns Oct 6, 2025
ba335d5
update sdpa to include compat change
yuankuns Oct 9, 2025
c236960
enable prefetch
yuankuns Oct 14, 2025
85ccdee
fix host compiler issue
yuankuns Oct 16, 2025
bf40086
fix perf issue, reach 95% of xetla in 1 shape
yuankuns Oct 16, 2025
6c7e7cc
move HPs out for perf tuning
yuankuns Oct 17, 2025
4cf06b0
refactor the dq dk dv
yuankuns Oct 24, 2025
77240a1
fix error in host compilation
yuankuns Oct 27, 2025
33682bd
fix acc issue in loading lse when m is not even
yuankuns Oct 30, 2025
84e7b1f
enable is causal
yuankuns Oct 31, 2025
b3ec4ef
enable tri-dao style gqa/mqa
yuankuns Nov 2, 2025
9c6ba63
enable parallel over seqlen_kv
yuankuns Nov 11, 2025
87e4ca2
align n_block calculation to m_block
yuankuns Nov 11, 2025
1fff92c
fix tile shape for 128 headdim
yuankuns Nov 13, 2025
4acbaff
Revert "fix tile shape for 128 headdim"
yuankuns Nov 17, 2025
a8626c8
fix oneapi 2025.3 warning; enable 64x64 tileing for 128 and converter…
yuankuns Nov 21, 2025
dcc5395
enable double buffer of p/ds in gmem
yuankuns Nov 22, 2025
2393e51
fix 2d load gap in bmg & pvc
yuankuns Nov 22, 2025
dc09692
remove useless buff
yuankuns Nov 22, 2025
c24e792
reduce dq atomic add operation
yuankuns Nov 23, 2025
2961e29
enable bottom right mask
yuankuns Nov 25, 2025
3c0aed0
set stride 1 for n_block pickup
yuankuns Nov 25, 2025
c600aa7
isolate each block
yuankuns Nov 25, 2025
17448d7
Merge branch 'sdpabackward' of github.com:yuankuns/cutlass-sycl into …
yuankuns Dec 19, 2025
337b570
move to reorder API
yuankuns Dec 20, 2025
7 changes: 4 additions & 3 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -1,5 +1,5 @@
name: Bug Report
- description: Create a bug report to help us improve CUTLASS
+ description: Create a bug report to help us improve SYCL*TLA
title: "[BUG] "
labels: ["? - Needs Triage", "bug"]
assignees: []
@@ -10,8 +10,9 @@ body:
attributes:
label: Which component has the problem?
options:
- - CuTe DSL
- - CUTLASS C++
+ - CuTe APIs
+ - CUTLASS APIs
+ - Python (APIs or Pypi package)
validations:
required: true
- type: textarea
5 changes: 0 additions & 5 deletions .github/ISSUE_TEMPLATE/config.yml

This file was deleted.

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/documentation_request.md
@@ -1,6 +1,6 @@
---
name: Documentation request
- about: Report incorrect or needed documentation to improve CUTLASS
+ about: Report incorrect or needed documentation to improve SYCL*TLA
title: "[DOC]"
labels: "? - Needs Triage, documentation"
assignees: ''
9 changes: 5 additions & 4 deletions .github/ISSUE_TEMPLATE/feature_request.yml
@@ -1,5 +1,5 @@
name: Feature Request
- description: Suggest an idea for CUTLASS
+ description: Suggest an idea for SYCL*TLA
title: "[FEA] "
labels: ["? - Needs Triage", "feature request"]
assignees: []
@@ -10,8 +10,9 @@ body:
attributes:
label: Which component requires the feature?
options:
- - CuTe DSL
- - CUTLASS C++
+ - CuTe APIs
+ - CUTLASS APIs
+ - Python (APIs or Pypi package)
validations:
required: true
- type: textarea
@@ -21,7 +22,7 @@ body:
description: Please fill out all sections below
value: |
**Is your feature request related to a problem? Please describe.**
- A clear and concise description of what the problem is. Ex. I wish I could use CUTLASS to do [...]
+ A clear and concise description of what the problem is. Ex. I wish I could use SYCL*TLA to do [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/submit_question.md
@@ -1,6 +1,6 @@
---
name: Submit question
- about: Ask a general question about CUTLASS
+ about: Ask a general question about SYCL*TLA
title: "[QST]"
labels: "? - Needs Triage, question"
assignees: ''
19 changes: 19 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,19 @@
## Description
<!-- What does this PR do? -->

## Type
- [ ] Bug - [ ] Feature - [ ] Performance - [ ] Refactor

## Testing
- [ ] Tests pass - [ ] Xe12 - [ ] Xe20

## Performance
| Metric | Before | After |
|--------|--------|-------|
| | | |

## References
Fixes #

## Checklist
- [ ] Copyright - [ ] Co-pilot Review - [ ] Deprecated APIs not used
20 changes: 20 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE/bug_fix.md
@@ -0,0 +1,20 @@
## Bug
<!-- What's broken? -->

Severity: <!-- Critical/High/Medium/Low -->

## Root Cause
<!-- Why? -->

## Fix
<!-- How? -->

## Verification
Before: <!-- error -->
After: <!-- fixed -->

## Testing
- [ ] Regression/unit tests
- [ ] Tests pass

## Details
23 changes: 23 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE/feature.md
@@ -0,0 +1,23 @@
## Feature
<!-- What capability? -->

## Use Case
<!-- Why? Who needs it? -->

## API
```cpp
// Signature
```

## Example
```cpp
// Usage
```

## Testing
- [ ] Tests - [ ] Example - [ ] Docs

## ToDo
- [ ] Implement A
- [ ] Implement B
- [ ] Document
19 changes: 19 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE/performance.md
@@ -0,0 +1,19 @@
## Optimization
<!-- What? -->

## Profiling
Tool: <!-- Benchmark Tests or Profiling Tools -->
Bottleneck: <!-- What's slow? -->

## Results
| Case | Before | After | Gain |
|------|--------|-------|------|
| | | | |

## Changes
<!-- How? -->

## Testing
- [ ] Tests pass - [ ] Xe12 - [ ] Xe20

Related: #
23 changes: 23 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE/refactoring.md
@@ -0,0 +1,23 @@
## Refactoring
<!-- What code? -->

## Why
<!-- Motivation? -->

## Changes
<!-- Technique? -->

## Preservation
- [ ] Tests unchanged
- [ ] Perf unchanged

## Quality
| Metric | Before | After |
|--------|--------|-------|
| LOC | | |
| Performance | | |

## ToDo
- [ ] Implement A
- [ ] Implement B
- [ ] Document
24 changes: 13 additions & 11 deletions .github/actions/install-intel-graphics/action.yml
@@ -17,20 +17,22 @@ runs:
run: |
shopt -s expand_aliases
which sudo || alias sudo=""
- if [[ "${{ inputs.GPU }}" == "BMG" ]]; then
+ if [[ "${{ inputs.GPU }}" == "BMG" || "${{ inputs.GPU }}" == "PVC" ]]; then
sudo add-apt-repository ppa:kobuk-team/intel-graphics
sudo apt update
else
- . /etc/os-release
- wget https://repositories.intel.com/gpu/ubuntu/dists/${VERSION_CODENAME}/intel-gpu-ubuntu-${VERSION_CODENAME}.run
- chmod +x intel-gpu-ubuntu-${VERSION_CODENAME}.run
- sudo ./intel-gpu-ubuntu-${VERSION_CODENAME}.run
- sudo apt install -y \
- intel-media-va-driver-non-free libmfx-gen1 libvpl2 \
- libegl-mesa0 libegl1-mesa-dev libgl1-mesa-dev \
- libgles2-mesa-dev libigdgmm12 libxatracker2 mesa-va-drivers \
- mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo \
- libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev libze-dev hwinfo
+ # LTS PVC drivers
+ # . /etc/os-release
+ # wget https://repositories.intel.com/gpu/ubuntu/dists/${VERSION_CODENAME}/intel-gpu-ubuntu-${VERSION_CODENAME}.run
+ # chmod +x intel-gpu-ubuntu-${VERSION_CODENAME}.run
+ # sudo ./intel-gpu-ubuntu-${VERSION_CODENAME}.run
+ # sudo apt install -y \
+ # intel-media-va-driver-non-free libmfx-gen1 libvpl2 \
+ # libegl-mesa0 libegl1-mesa-dev libgl1-mesa-dev \
+ # libgles2-mesa-dev libigdgmm12 libxatracker2 mesa-va-drivers \
+ # mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo \
+ # libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev libze-dev hwinfo
+ exit 1
fi
sudo apt-get install -y libze-intel-gpu1 libze-dev intel-metrics-discovery \
intel-opencl-icd ocl-icd-opencl-dev clinfo intel-gsc intel-ocloc g++
167 changes: 167 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,167 @@
# Copilot Coding Agent Onboarding — SYCL*TLA

Purpose
-------
This file is a short, focused onboarding guide so a coding agent (Copilot coding agent) can make correct, CI-safe changes to the SYCL*TLA repository without long exploratory searches. Keep edits conservative: prefer small, well-tested changes and follow the PR checklist described below.

Top-level constraints (read first)
---------------------------------
- **Intel copyright headers**: Many files carry dual NVIDIA/Intel copyright headers. Do not remove or alter copyright headers on modified files.
- **Intel Xe APIs**: The codebase uses new Intel "Xe" APIs (Xe12 for PVC, Xe20 for BMG) and Intel oneAPI toolchain conventions; prefer SYCL-compatible code and avoid adding CUDA-only code paths without explicit gating.
- **CI Requirements**: Changes must build and pass CI workflows in `.github/workflows/*` (notably `intel_test.yml`, `intel_test_gpp_host.yml`, `sycl_python_test.yml`).
- **Test Coverage**: Check for test coverage before making changes. C++ tests are in `test/unit/`, Python tests in `test/python/`.
- **PR Descriptions**: Must include: what changed, why, local build/test steps performed, and expected CI/benchmark impact (see PR templates in `.github/PULL_REQUEST_TEMPLATE/`).

Quick actions to always run locally before creating a PR
------------------------------------------------------
1. **ALWAYS source Intel environment first** (required for builds that target Intel compilers; if not available, CMake configure will still catch syntax errors but linking will fail):

```bash
source /opt/intel/oneapi/setvars.sh
export CXX=icpx
export CC=icx
```

2. **ALWAYS create a clean build directory** and configure for SYCL:

```bash
rm -rf build && mkdir build && cd build
cmake .. -G Ninja \
-DCUTLASS_ENABLE_SYCL=ON \
-DDPCPP_SYCL_TARGET=intel_gpu_bmg_g21 \
-DCUTLASS_SYCL_RUNNING_CI=ON
ninja
```

**Critical Notes:**
- `-DDPCPP_SYCL_TARGET` must match your hardware: `intel_gpu_bmg_g21` for BMG (Arc B580), `intel_gpu_pvc` for PVC (Data Center Max). This affects intrinsic availability.
- Build time: ~10-20 minutes for a full build on an 8-core machine.
- If Intel oneAPI is not installed, CMake configure will still catch syntax errors but linking and target-specific checks will fail.
- **NEVER commit without running a full build locally first**.
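
As a minimal sketch of the first note above, the target value can be derived from a GPU family name instead of being typed by hand. The function name `sycl_target_for_gpu` is hypothetical and not part of the repository; it handles only the two targets named in these notes and treats anything else as an error.

```bash
# Hypothetical helper (not part of the repository): map a GPU family name
# to the -DDPCPP_SYCL_TARGET value listed in the notes above.
sycl_target_for_gpu() {
  case "$1" in
    BMG|bmg) echo "intel_gpu_bmg_g21" ;;  # Arc B580
    PVC|pvc) echo "intel_gpu_pvc" ;;      # Data Center Max
    *)
      echo "unknown GPU family: $1" >&2
      return 1
      ;;
  esac
}

# Possible usage when configuring:
#   cmake .. -G Ninja -DCUTLASS_ENABLE_SYCL=ON \
#     -DDPCPP_SYCL_TARGET="$(sycl_target_for_gpu BMG)"
```

Failing on an unrecognised family, rather than falling back to a default, is a deliberate choice here: a wrong target can change which intrinsics are available.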

Build / Test / Lint summary
---------------------------
- **Bootstrap**: No special bootstrap required. Python dependencies in `pyproject.toml` (`networkx`, `numpy`, `pydot`, `scipy`, `treelib`) are needed for Python tests. Install with `pip install -e .` in project root.
- **Build**: Use CMake 3.22+ and Ninja (see commands above). **ALWAYS** run from a clean build directory to avoid stale state.
- **C++ Unit Tests**: After build, run `cmake --build . --target test_unit` (runs all unit tests in `test/unit/`).
- **C++ Examples**: `cmake --build . --target test_examples` (builds and validates examples in `examples/`).
- **Python Tests**:
```bash
cd python
python3 -m pytest -q
```
CI runs specific tests such as `test/python/cutlass/gemm/gemm_bf16_pvc.py`. **ALWAYS** run `export CUTLASS_USE_SYCL=1` before running Python tests.
- **Linting**: No automated linter. Follow existing code style and ensure `-Werror` flag passes (set in CI).

**Environment Variables Required for Runtime:**
```bash
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
export IGC_ExtraOCLOptions="-cl-intel-256-GRF-per-thread"
export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file -gline-tables-only"
export IGC_VectorAliasBBThreshold=100000000000
```
These are set in CI workflows and should be set locally for accurate testing.
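
One way to confirm that a local shell matches the variables listed above is a small preflight check. This is a sketch only: the function name `missing_runtime_vars` is an assumption, and it checks that each variable is non-empty, not that its value matches what CI uses.

```bash
# Hypothetical preflight check (not part of the repository): print the name
# of each runtime variable from the list above that is unset or empty.
# Returns non-zero if at least one is missing.
missing_runtime_vars() {
  rc=0
  for name in ONEAPI_DEVICE_SELECTOR IGC_ExtraOCLOptions \
              SYCL_PROGRAM_COMPILE_OPTIONS IGC_VectorAliasBBThreshold; do
    # Indirect lookup via eval; the names are fixed literals, so this is safe.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}

# Possible usage before running tests:
#   missing_runtime_vars || echo "set the variables listed above first" >&2
```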

Common failure modes & mitigations
---------------------------------
- **Missing Intel environment**: builds fail at linking or with unknown compilers. Mitigation: Source `/opt/intel/oneapi/setvars.sh` or unset `CXX`/`CC` to use system compilers for syntax-only checks.
- **Wrong SYCL target**: some intrinsics are target-specific (e.g., 2D block prefetch intrinsics). Match the CI target or use conservative code paths.
- **Layout constraints in Intel Xe epilogues** (ColumnMajor/RowMajor): prefer to reuse existing epilogue code and tests to avoid violating layout constraints. If making changes, run the affected tests locally.
- **Missing libraries in LD_LIBRARY_PATH** for runtime: set `LD_LIBRARY_PATH` to include `build/tools/library` when running `python` tests that load `.so` wrappers.
- **CMake cache issues**: If you see unexpected build behavior, **ALWAYS** delete `build/` completely and reconfigure. Stale CMake cache causes many hard-to-debug issues.
- **Python import errors**: If Python tests fail with import errors, run `pip install -e .` from project root first.
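
For the `LD_LIBRARY_PATH` item above, a minimal sketch of adding a directory without duplicating it on repeated runs might look like this. The function name `prepend_ld_path` is hypothetical; the `build/tools/library` path in the usage comment is the one named in the list.

```bash
# Hypothetical helper (not part of the repository): put a directory at the
# front of LD_LIBRARY_PATH unless it is already listed.
prepend_ld_path() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$1:"*) ;;  # already present, leave the path unchanged
    *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
  export LD_LIBRARY_PATH
}

# Possible usage from the repository root:
#   prepend_ld_path "$PWD/build/tools/library"
```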

CI and validation pipelines (what will run)
-------------------------------------------
- See `.github/workflows/` for exact pipelines. Most important:
- `intel_test.yml` — primary CI build for Intel targets
- `intel_test_gpp_host.yml` — GPP host builds
- `sycl_python_test.yml` — Python test workflow
- `nvidia_test.yml` / `cuda_test.yml` — CUDA-targeted tests (keep changes SYCL-first unless explicitly modifying CUDA paths)

How the agent should validate its changes
-----------------------------------------
1. Run a local CMake configure and build (fast smoke test):

```bash
rm -rf build && mkdir build && cd build
cmake .. -G Ninja -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_bmg_g21 -DCUTLASS_SYCL_RUNNING_CI=ON
ninja -k 0
```

2. Run the Python test subset that touches modified components (or all Python tests if the change is cross-cutting):

```bash
cd python
python3 -m pytest -q
```

3. For C++ kernel changes, run unit tests: `cmake --build . --target test_unit -j 8`

4. For changes to examples, run: `cmake --build . --target test_examples -j 1`

What to include in the PR description
-------------------------------------
- Short summary of change and the files modified.
- Build steps executed locally (CMake + Ninja commands, environment variables set).
- Tests run and their results (include pytest subset names and pass/fail counts).
- If the change affects performance or kernel selection, include expected performance impact and a short benchmark (size and results).
- State whether the Intel oneAPI environment was required to fully validate the change.

Project layout (quick map)
--------------------------
- Root files: `CMakeLists.txt`, `README.md`, `CHANGELOG-SYCL.md`, `SYCL.cmake`, `pyproject.toml`
- Major directories:
- `include/` — core headers and kernel templates
- `python/` — Python wrapper, generator, tests
- `examples/` — usage examples, e.g., `11_xe20_cutlass_library`
- `test/` — C++ tests and validation kernels
- `tools/` — build/test utilities
- `media/` — documentation and architecture notes (search `media/docs/cpp/xe_rearchitecture.md`, `media/docs/python/xe_cutlass_library.md`)

Files the agent should inspect when making changes
--------------------------------------------------
- `python/cutlass_library/generator.py` — kernel generation and filtering logic
- `python/cutlass_library/arch_constants.py` — architecture detection and constants
- `include/cutlass/gemm/kernel/*` — GEMM kernel implementations
- `.github/workflows/*` — CI steps; ensure changes don't break these workflows

Search tips
-----------
- Use `grep -R "HACK\|TODO\|WORKAROUND\|FIXME"` to find fragile areas.
- Search for `intel` and `Xe` keywords to find Intel-specific code paths.

Testing coverage
----------------
- The repo contains Python tests under `test/python/` and C++ tests under `test/unit/`.
- Before assuming full coverage, run the test suite locally and include failing tests in your PR notes.

Special rules for changes
-------------------------
- Keep changes minimal and well-scoped. If modifying kernel selection or architecture constants, include tests or fallbacks.
- Preserve Intel copyright headers.
- Avoid introducing CUDA-only code paths in SYCL code.

When to run a wider search
--------------------------
Trust these instructions first. Only perform a broad code search if:
- The instructions are clearly missing information for the requested change, or
- A test or build step fails unexpectedly after following these steps.

Where to look for help
----------------------
- `README.md` and `media/docs/*` for architecture details
- `.github/workflows/` for CI expectations
- Open issues and PR templates in `.github/ISSUE_TEMPLATE` and `.github/PULL_REQUEST_TEMPLATE`

Short checklist before opening a PR
-----------------------------------
- [ ] Build configured and compiled locally (or syntax-checked if environment unavailable)
- [ ] Relevant tests run locally and passed
- [ ] PR description includes steps and validation results
- [ ] No removal of Intel copyright headers

If anything here proves inaccurate
----------------------------------
Run the minimal searches you need, then update this file with the corrected steps so future agents benefit.

---
This file is intentionally short (<= 2 pages). For deeper onboarding, consult `README.md`, `media/docs/*`, and the workflows in `.github/workflows/`.