Agent FFI integration #119

zanderjiang · 2025-11-17T15:59:50Z

This PR introduces FlashInfer-Bench agents module, a library agent tools to aid with kernel generation. First of which is a set of TVM-FFI instruction prompts. There are two versions:

FFI_PROMPT_SIMPLE: provides instructions on minimum modules required to for agents to generate correct TVM FFI bindings and host-side kernel launch function, including TensorView objects, env-stream management, assertion checks via
TVM_FFI_ICHECK, and FFI binding macro.

FFI_PROMPT_FULL: provides all kernel side C++ methods inside TVM FFI for more advanced use-cases.

Model testing results can be found here: https://gist.github.com/zanderjiang/0c2255bfdac6f4090f0f41e8c80368c3
All models were able to write correct FFI bindings with the prompts.

In addition, the PR contains a vibe-coding example, an agent markdown prompt that allows coding agents (Cursor, ClaudeCode, Codex, etc.) to generate a complete solution with FFI binding, as well as a minimal script to benchmark the generated kernel and apply it on a test input.

Summary by CodeRabbit

New Features
- Added TVM-FFI kernel distribution system with cross-framework support for JAX, PyTorch, and C++
- Added build infrastructure and kernel metadata management system
- Added LLM-based CUDA kernel generation capabilities
Documentation
- Added comprehensive guides and example implementations for the TVM-FFI kernel workflow
- Added build configuration and setup instructions
Tests
- Added test infrastructure for LLM-generated CUDA kernels
Chores
- Updated environment variable handling in version control

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-11-17T15:59:59Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR introduces a comprehensive TVM FFI example framework and kernel generation pipeline. It adds GEMM operation definitions and workloads, LLM-based kernel generation prompts, distributed kernel loading and execution examples across C++ and Python (JAX, PyTorch), build infrastructure, test utilities, and documentation for a four-stage workflow: Problem Definition → LLM Kernel Generation → Build & Distribution → Cross-Framework Usage.

Changes

Cohort / File(s)	Summary
GEMM Problem Definition `examples/ffi/Example-FlashInfer-Trace/definitions/gemm_n4096_k4096.json`, `examples/ffi/Example-FlashInfer-Trace/workloads/gemm_n4096_k4096.jsonl`	Adds JSON definition and workload files specifying a 4096×4096 GEMM operation with variable M dimension, including multiple predefined workload instances with randomized inputs.
LLM Prompt Definitions `flashinfer_bench/agents/ffi_prompt.py`	Introduces three TVM FFI prompt constants for LLM-based kernel generation: simple guide, comprehensive API reference, and concise introduction with usage patterns.
C++ TVM FFI Example `examples/ffi/cpp_example.cc`	Adds C++ example demonstrating loading a distributed kernel .so via TVM-FFI C++ API, allocating CUDA tensors, and executing GEMM computation with device management.
Python Examples (Multi-Framework) `examples/ffi/jax_example.py`, `examples/ffi/pytorch_example.py`	Python scripts loading distributed kernels and executing them via JAX and PyTorch respectively, validating results against reference implementations.
Kernel Distribution & Generation `examples/ffi/distribute_kernel.py`, `examples/ffi/kernel_generator_example.py`	Scripts for building/distributing compiled kernels via TVMFFIBuilder and orchestrating LLM-based kernel generation with workload retrieval and solution persistence.
LLM-Generated Kernel Solution `examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json`	Provides complete CUDA GEMM kernel implementation with TVM FFI binding, including WMMA-based optimization, shared memory tiling (128×256 blocks), input validation, and CUDA stream management.
Test Infrastructure `tests/agent/test_prompt.py`	Comprehensive test script evaluating LLM ability to generate CUDA kernels, supporting OpenAI and Anthropic providers, compiling via tvm_ffi, and validating correctness against reference implementations.
Build & Documentation `examples/ffi/Makefile`, `examples/ffi/README.md`, `examples/ffi/agent_vibecode.md`	Build configuration for C++ examples with TVM-FFI detection, workflow documentation, and detailed guidance for GEMM kernel implementation requirements.
Configuration `.gitignore`	Adds `.env` to ignore environment variable files.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Areas requiring extra attention:

tests/agent/test_prompt.py — Multi-provider LLM integration with error handling, CUDA code extraction, and kernel compilation/validation logic
examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json (kernel.cu) — WMMA-based CUDA kernel optimization, dynamic shared memory, TVM FFI binding validation and device management
flashinfer_bench/agents/ffi_prompt.py — Accuracy and completeness of FFI API documentation in prompts used for code generation
Cross-framework examples (jax_example.py, pytorch_example.py) — Correctness of kernel invocation across different runtime bindings and tensor management

Possibly related PRs

[tvm-ffi] TVMFFIBuilder #111 — Implements and exposes TVMFFIBuilder and TVM-FFI runtime/runnable support, which is directly used by distribute_kernel.py and kernel generation workflows in this PR.

Poem

🐰 The FFI hop begins anew,
With kernels generated by LLM's thought true,
From definitions to JAX and CUDA's embrace,
Fast GEMM code runs at blazing pace! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Agent FFI integration' directly reflects the PR's main objective to introduce FFI instruction prompts and agent-based kernel generation capabilities.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2025-11-17T16:00:16Z

Summary of Changes

Hello @zanderjiang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances FlashInfer-Bench by integrating agent-based kernel generation capabilities. It provides a structured framework for agents to produce optimized CUDA kernels with TVM FFI bindings, complete with detailed prompts and validation mechanisms. This integration aims to streamline the process of developing and benchmarking high-performance kernels by leveraging automated code generation, making it easier to explore and validate different kernel implementations.

Highlights

Agent Module Introduction: Introduced a new agents module within FlashInfer-Bench, designed to provide library agent tools for automated kernel generation.
TVM-FFI Instruction Prompts: Added two versions of TVM-FFI instruction prompts, FFI_PROMPT_SIMPLE and FFI_PROMPT_FULL, to guide agents in generating correct TVM FFI bindings and host-side kernel launch functions, including TensorView objects, stream management, and assertion checks.
Vibe-Coding Example and Benchmarking Script: Included a vibe-coding example, which is an agent markdown prompt enabling coding agents to generate complete solutions with FFI bindings, along with a minimal script to benchmark the generated kernels and apply them to test inputs.
New GEMM Definition and Workloads: Added a new General Matrix Multiply (GEMM) definition (gemm_n4096_k4096.json) captured from Llama 3.1 8B attn.o_proj, complete with corresponding workload data (gemm_n4096_k4096.jsonl) for benchmarking.
LLM Code Generation Testing Framework: Implemented scripts (test_prompt.py, test_existing_prompt_sol.py) to evaluate and re-test LLMs' ability to generate CUDA kernels with TVM FFI bindings, including saving and loading test results for various models.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new agent module for kernel generation using TVM FFI, which is a great addition. The documentation and examples are very comprehensive. I've found a few potential issues in the example code snippets related to integer types and kernel launch configurations. These could lead to incorrect code generation by agents or bugs in the examples themselves, especially for large inputs. Addressing these will improve the robustness and correctness of the provided agent prompts and examples.

gemini-code-assist · 2025-11-17T16:03:44Z

flashinfer_bench/agents/ffi_prompt.py

+  int64_t threads = 256;
+  int64_t blocks = (n + threads - 1) / threads;
+  AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);


The blocks and threads variables are int64_t but are passed directly to the kernel launch configuration. This is incorrect as the grid and block dimensions for a kernel launch expect dim3 or unsigned int. Using int64_t will lead to truncation and incorrect behavior for large inputs.

Suggested change

int64_t threads = 256;

int64_t blocks = (n + threads - 1) / threads;

AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);

const int threads = 256;

const dim3 blocks((n + threads - 1) / threads);

AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);

gemini-code-assist · 2025-11-17T16:03:45Z

flashinfer_bench/agents/ffi_prompt.py

+  int64_t threads = 256;
+  int64_t blocks = (n + threads - 1) / threads;
+  AddKernel<<<blocks, threads, 0, stream>>>(a_data, b_data, c_data, n);


The blocks and threads variables are int64_t but are passed directly to the kernel launch configuration. This is incorrect as the grid and block dimensions for a kernel launch expect dim3 or unsigned int. Using int64_t will lead to truncation and incorrect behavior for large inputs.

Suggested change

int64_t threads = 256;

int64_t blocks = (n + threads - 1) / threads;

AddKernel<<<blocks, threads, 0, stream>>>(a_data, b_data, c_data, n);

const int threads = 256;

const dim3 blocks((n + threads - 1) / threads);

AddKernel<<<blocks, threads, 0, stream>>>(a_data, b_data, c_data, n);

gemini-code-assist · 2025-11-17T16:03:45Z

flashinfer_bench/agents/ffi_prompt.py

+  int64_t threads = 256;
+  int64_t blocks = (n + threads - 1) / threads;
+  AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);


The blocks and threads variables are int64_t but are passed directly to the kernel launch configuration. This is incorrect as the grid and block dimensions for a kernel launch expect dim3 or unsigned int. Using int64_t will lead to truncation and incorrect behavior for large inputs.

Suggested change

int64_t threads = 256;

int64_t blocks = (n + threads - 1) / threads;

AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);

const int threads = 256;

const dim3 blocks((n + threads - 1) / threads);

AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);

gemini-code-assist · 2025-11-17T16:03:45Z

tests/agent/load_inline.py

+  int64_t threads = 256;
+  int64_t blocks = (n + threads - 1) / threads;
+  cudaStream_t stream =
+      static_cast<cudaStream_t>(TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));
+  AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);


The blocks and threads variables are int64_t but are passed directly to the kernel launch configuration. This is incorrect as the grid and block dimensions for a kernel launch expect dim3 or unsigned int. Using int64_t will lead to truncation and incorrect behavior for large inputs.

Suggested change

int64_t threads = 256;

int64_t blocks = (n + threads - 1) / threads;

cudaStream_t stream =

static_cast<cudaStream_t>(TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));

AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);

const int threads = 256;

const dim3 blocks((n + threads - 1) / threads);

cudaStream_t stream =

static_cast<cudaStream_t>(TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));

AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);

gemini-code-assist · 2025-11-17T16:03:45Z

examples/ffi/agent_vibecode.md

+    const __half* A,
+    const __half* B,
+    __half* C,
+    int M,


The use of int for dimension M in gemm_n4096_k4096_launch can lead to truncation if M exceeds INT_MAX. Since M is a variable dimension, it's safer to use int64_t. This also applies to the kernel implementation (gemm_kernel on line 273) and the host launcher function (gemm_n4096_k4096_launch on line 289). Consequently, the cast in main.cpp (line 353) would no longer be necessary.

Suggested change

int M,

int64_t M,

gemini-code-assist · 2025-11-17T16:03:45Z

flashinfer_bench/agents/ffi_prompt.py

+
+namespace my_kernels {
+
+__global__ void AddOneKernel(float* x, float* y, int n) {


The kernel parameter n is of type int, but it's used to represent the size of a tensor, which can exceed INT_MAX. This can lead to truncation and incorrect behavior for large tensors. It should be int64_t to match the type of x.size(0).

Suggested change

__global__ void AddOneKernel(float* x, float* y, int n) {

__global__ void AddOneKernel(float* x, float* y, int64_t n) {

gemini-code-assist · 2025-11-17T16:03:46Z

flashinfer_bench/agents/ffi_prompt.py

+
+namespace my_kernels {
+
+__global__ void AddOneKernel(float* x, float* y, int n) {


The kernel parameter n is of type int, but it's used to represent the size of a tensor, which can exceed INT_MAX. This can lead to truncation and incorrect behavior for large tensors. It should be int64_t to match the type of x.size(0).

Suggested change

__global__ void AddOneKernel(float* x, float* y, int n) {

__global__ void AddOneKernel(float* x, float* y, int64_t n) {

gemini-code-assist · 2025-11-17T16:03:46Z

tests/agent/load_inline.py

+
+namespace tvm_ffi_example_cuda {
+
+__global__ void AddOneKernel(float* x, float* y, int n) {


The kernel parameter n is of type int, but it's used to represent the size of a tensor, which can exceed INT_MAX. This can lead to truncation and incorrect behavior for large tensors. It should be int64_t to match the type of x.size(0).

Suggested change

__global__ void AddOneKernel(float* x, float* y, int n) {

__global__ void AddOneKernel(float* x, float* y, int64_t n) {

coderabbitai

Actionable comments posted: 6

♻️ Duplicate comments (1)

flashinfer_bench/agents/test_prompt.py (1)

98-132: Shared test_kernel helper to avoid duplication with test_existing_prompt_sol.py

The testing logic here matches the test_kernel implementation in flashinfer_bench/agents/test_existing_prompt_sol.py (same CUDA device assumption, tensor sizes, and assertions). As noted in the other file, factoring this into a shared helper would reduce duplication and keep your kernel tests in sync.

🧹 Nitpick comments (7)

flashinfer_bench/agents/load_inline.py (1)
39-41: Consider using or removing the mod variable.

The variable mod is assigned but never used. If this is a demonstration script focused on compilation success, consider either:

Using the module (e.g., calling a function from it), or

Removing the assignment: tvm_ffi.cpp.load_inline(name="add_one_cuda", cuda_sources=cuda_source)

Apply this diff if the module isn't needed:
 def main():
-    mod: Module = tvm_ffi.cpp.load_inline(name="add_one_cuda", cuda_sources=cuda_source)
+    tvm_ffi.cpp.load_inline(name="add_one_cuda", cuda_sources=cuda_source)
     print("Compilation successful")
flashinfer_bench/agents/test_results/o3.txt (1)
28-121: ElementwiseAdd TVM FFI wrapper looks solid; only edge-case is very large tensor sizes

Validation (ndim, dtype, size, device, contiguity), stream acquisition via TVMFFIEnvGetStream, and n == 0 early-return are all well-structured and align with TVM FFI usage. The only theoretical edge-case is the cast to int when computing blocks from int64_t n:
const int threads  = 256;
const int blocks   = static_cast<int>((n + threads - 1) / threads);
For realistic benchmark sizes this is fine, but if you ever push to extremely large 1D tensors (beyond what a single CUDA grid dimension can hold in int), you may want to guard or clamp blocks against device limits.
flashinfer_bench/agents/test_results/claude-opus-4-1-20250805.txt (1)
6-19: Consider explicitly including <cuda_runtime.h> for CUDA types/functions

This code uses CUDA runtime types and functions (cudaStream_t, cudaError_t, cudaGetLastError, cudaGetErrorString) but only includes TVM FFI headers. Given it already compiles, those headers are probably pulling in the CUDA headers transitively, but adding an explicit:
#include <cuda_runtime.h>
near the top (as in the o3 variant) would make the dependency clearer and less fragile to upstream header changes.
flashinfer_bench/agents/test_existing_prompt_sol.py (2)

50-84: Deduplicate test_kernel between this script and test_prompt.py

test_kernel here is functionally identical to the one in flashinfer_bench/agents/test_prompt.py (same tensor shapes, CUDA device usage, and assertions). Keeping two copies invites drift if you change test logic later (e.g., different tensor sizes or tolerances).

Consider moving test_kernel into a small shared helper module (e.g., agents/test_utils.py) and importing it from both scripts.

110-183: Tighten exception handling in test_saved_code for clearer failure modes

test_saved_code wraps both compilation and high-level file processing in broad except Exception blocks. For a CLI this is workable, but it makes it harder to distinguish parse errors, compilation issues, and runtime test failures.

You could:

Keep the outermost except Exception as a final safety net.

Narrow the inner except to known failure types (e.g., RuntimeError or specific exceptions from tvm_ffi.cpp.load_inline / torch), or at least log more structured context (file name, stage) before returning.

This keeps the user-friendly summary while improving debuggability and aligns better with the Ruff hints (BLE001/TRY300).

flashinfer_bench/agents/test_prompt.py (2)

31-83: Make prompt handling consistent between test_model, call_openai_model, and call_anthropic_model

Right now test_model builds full_prompt but:

call_openai_model uses prompt only for ["o3", "o4-mini-2025-04-16"]; other OpenAI models ignore the prompt parameter and instead hardcode ELEMENTWISE_ADD_PROMPT / FFI_PROMPT_SIMPLE.

call_anthropic_model ignores its prompt parameter entirely and also hardcodes ELEMENTWISE_ADD_PROMPT / FFI_PROMPT_SIMPLE.

For clarity and easier experimentation with different prompts, it would be cleaner to have both helpers accept and actually use a fully-formed prompt (or (system, user) pair) from test_model, and drop unused parameters. That also resolves the Ruff ARG001 warning on the unused prompt argument.

135-216: Narrow exception handling around model calls and compilation; clean minor style issues

test_model uses broad except Exception blocks both around the main body and around the load_inline call. For a CLI harness this is acceptable, but you might get better diagnostics by:

Catching expected failures explicitly (e.g., missing API keys, HTTP/API errors, tvm_ffi.cpp.load_inline compilation problems) and emitting more specific messages.

Keeping a final except Exception as a last-resort catch.

Also, minor non-blocking cleanups:

At line 183, f"Compilation: SUCCESS\n" doesn’t need to be an f-string.

Similarly, other f-strings without placeholders (e.g., the "Compilation: FAILED\n" in the failure path) can be plain strings.

These are small polish items but would align with Ruff’s hints and make error handling more intentional.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a2ebc6 and 46ca5a5.

📒 Files selected for processing (14)

.gitignore (1 hunks)
examples/ffi/Example-FlashInfer-Trace/definitions/gemm_n4096_k4096.json (1 hunks)
examples/ffi/Example-FlashInfer-Trace/workloads/gemm_n4096_k4096.jsonl (1 hunks)
examples/ffi/agent.md (1 hunks)
examples/ffi/e2e_kernel.py (1 hunks)
flashinfer_bench/agents/ffi_prompt.py (1 hunks)
flashinfer_bench/agents/load_inline.py (1 hunks)
flashinfer_bench/agents/test_existing_prompt_sol.py (1 hunks)
flashinfer_bench/agents/test_prompt.py (1 hunks)
flashinfer_bench/agents/test_results/claude-opus-4-1-20250805.txt (1 hunks)
flashinfer_bench/agents/test_results/gpt-5-2025-08-07.txt (1 hunks)
flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt (1 hunks)
flashinfer_bench/agents/test_results/o3.txt (1 hunks)
flashinfer_bench/agents/test_results/o4-mini-2025-04-16.txt (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

flashinfer_bench/agents/test_prompt.py (2)

flashinfer_bench/agents/test_existing_prompt_sol.py (2)

test_kernel (50-84)

main (186-233)

flashinfer_bench/agents/load_inline.py (1)

main (39-41)

flashinfer_bench/agents/test_existing_prompt_sol.py (2)

flashinfer_bench/agents/test_prompt.py (2)

test_kernel (98-132)

main (219-257)

flashinfer_bench/agents/load_inline.py (1)

main (39-41)

examples/ffi/e2e_kernel.py (4)

flashinfer_bench/bench/benchmark.py (2)

Benchmark (16-181)

run_all (61-181)

flashinfer_bench/bench/config.py (1)

BenchmarkConfig (8-64)

flashinfer_bench/data/trace_set.py (2)

TraceSet (23-477)

from_path (85-145)

flashinfer_bench/apply/apply_api.py (2)

enable_apply (142-180)

disable_apply (183-192)

🪛 GitHub Actions: .github/workflows/linting.yaml

examples/ffi/agent.md

[error] 1-1: pre-commit end-of-file-fixer failed. Files were modified by this hook.

[error] 1-1: pre-commit trailing-whitespace failed. Files were modified by this hook.

[error] 1-1: pre-commit black formatting check failed. File was reformatted by this hook.

examples/ffi/e2e_kernel.py

[error] 1-1: pre-commit end-of-file-fixer failed. Files were modified by this hook.

[error] 1-1: pre-commit trailing-whitespace failed. Files were modified by this hook.

[error] 1-1: pre-commit black formatting check failed. File was reformatted by this hook.

[error] 1-1: pre-commit isort formatting check failed. File was reformatted by this hook.

🪛 LanguageTool

flashinfer_bench/agents/test_results/o4-mini-2025-04-16.txt

[style] ~83-~83: Using many exclamation marks might seem excessive (in this case: 19 exclamation marks for a text that’s 4073 characters long)
Context: ...dev = a.device(); if (dev.device_type != kDLCUDA) { TVM_FFI_THROW(RuntimeError) << "elementwise_add: only CUDA device supported"; } if (b.device().device_type != dev.device_type || b.device().device_id != dev.device_id) { TVM_FFI_THROW(ValueError) << "elementwise_add: b must be on the same CUDA device as a"; } if (c.device().device_type != dev.device_type || c.device().device_id != dev.device_id) { TVM_FFI_THROW(Val...

(EN_EXCESSIVE_EXCLAMATION)

flashinfer_bench/agents/test_results/gpt-5-2025-08-07.txt

[style] ~71-~71: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...Contiguous() || !b.IsContiguous() || !c.IsContiguous()) { TVM_FFI_THROW(ValueError) << "...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

[style] ~97-~97: Using many exclamation marks might seem excessive (in this case: 13 exclamation marks for a text that’s 3750 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...

(EN_EXCESSIVE_EXCLAMATION)

flashinfer_bench/agents/test_results/o3.txt

[style] ~101-~101: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...Contiguous() || !b.IsContiguous() || !c.IsContiguous()) { TVM_FFI_THROW(ValueError) << "...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

[style] ~123-~123: Using many exclamation marks might seem excessive (in this case: 19 exclamation marks for a text that’s 4501 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...

(EN_EXCESSIVE_EXCLAMATION)

flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt

[style] ~82-~82: Using many exclamation marks might seem excessive (in this case: 12 exclamation marks for a text that’s 4202 characters long)
Context: ...; DLDevice dc_dev = c.device(); if (!(da_dev.device_type == db_dev.device_typ...

(EN_EXCESSIVE_EXCLAMATION)

flashinfer_bench/agents/test_results/claude-opus-4-1-20250805.txt

[style] ~110-~110: Using many exclamation marks might seem excessive (in this case: 23 exclamation marks for a text that’s 4049 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...

(EN_EXCESSIVE_EXCLAMATION)

🪛 Ruff (0.14.4)

flashinfer_bench/agents/test_prompt.py

43-43: Avoid specifying long messages outside the exception class

(TRY003)

70-70: Unused function argument: prompt

(ARG001)

130-130: Do not catch blind exception: Exception

(BLE001)

151-151: Abstract raise to an inner function

(TRY301)

151-151: Avoid specifying long messages outside the exception class

(TRY003)

183-183: f-string without any placeholders

Remove extraneous f prefix

(F541)

191-197: Consider moving this statement to an else block

(TRY300)

199-199: Do not catch blind exception: Exception

(BLE001)

204-204: f-string without any placeholders

Remove extraneous f prefix

(F541)

205-205: Use explicit conversion flag

Replace with conversion flag

(RUF010)

214-214: Do not catch blind exception: Exception

(BLE001)

flashinfer_bench/agents/load_inline.py

40-40: Local variable mod is assigned to but never used

Remove assignment to unused variable mod

(F841)

flashinfer_bench/agents/test_existing_prompt_sol.py

45-45: Avoid specifying long messages outside the exception class

(TRY003)

82-82: Do not catch blind exception: Exception

(BLE001)

151-158: Consider moving this statement to an else block

(TRY300)

160-160: Do not catch blind exception: Exception

(BLE001)

176-176: Do not catch blind exception: Exception

(BLE001)

examples/ffi/e2e_kernel.py

17-17: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Run unit tests on ubuntu-latest and Python 3.12
GitHub Check: Run unit tests on ubuntu-latest and Python 3.10
GitHub Check: Run unit tests on ubuntu-latest and Python 3.13
GitHub Check: Run unit tests on ubuntu-latest and Python 3.11

🔇 Additional comments (7)

.gitignore (1)

50-51: LGTM!

Adding .env to .gitignore is a security best practice to prevent accidental commits of environment variables and secrets. Including AGENTS.md aligns with the PR's agent-focused changes. Both entries are appropriately formatted and necessary.

flashinfer_bench/agents/load_inline.py (1)

8-36: LGTM!

The inline CUDA source is well-structured with proper TVM FFI headers, correct stream management via TVMFFIEnvGetStream, and appropriate FFI export binding.

examples/ffi/Example-FlashInfer-Trace/definitions/gemm_n4096_k4096.json (1)

1-48: LGTM!

The GEMM definition is well-structured with correct shapes, data types, and a clear reference implementation. The operation C = A @ B.T is properly specified.

examples/ffi/Example-FlashInfer-Trace/workloads/gemm_n4096_k4096.jsonl (1)

1-43: LGTM!

The workload file provides comprehensive test coverage with M values ranging from 1 to 8192, including edge cases and realistic sizes. All entries properly reference the gemm_n4096_k4096 definition.

flashinfer_bench/agents/test_results/gpt-5-2025-08-07.txt (1)

1-111: LGTM!

The generated elementwise addition kernel demonstrates proper TVM FFI usage with comprehensive input validation, correct error handling, and appropriate CUDA stream management. The test results confirm successful compilation and execution.

flashinfer_bench/agents/test_results/o4-mini-2025-04-16.txt (1)

1-122: LGTM!

The generated code demonstrates correct TVM FFI patterns with thorough validation using TVM_FFI_ICHECK and TVM_FFI_THROW, proper stream retrieval, and successful test execution.

flashinfer_bench/agents/ffi_prompt.py (1)

1-1868: LGTM!

The FFI prompt definitions provide comprehensive TVM FFI documentation for agent-based kernel generation. The prompts include:

Essential API reference with TensorView, error handling, and stream management

Complete examples demonstrating proper FFI usage patterns

Multiple levels of detail (SIMPLE, FULL, and standard PROMPT)

These prompts will effectively guide LLM agents to generate correct TVM FFI-compatible CUDA kernels.

coderabbitai · 2025-11-17T16:05:46Z

examples/ffi/agent.md

+    int64_t N = B.size(0);
+
+    // Check shapes
+    TVM_FFI_ICHECK_EQ(K, 4096) << "A.shape[1] must be 4096 (K)";
+    TVM_FFI_ICHECK_EQ(N, 4096) << "B.shape[0] must be 4096 (N)";
+    TVM_FFI_ICHECK_EQ(B.size(1), 4096) << "B.shape[1] must be 4096 (K)";
+    TVM_FFI_ICHECK_EQ(C.size(0), M) << "C.shape[0] must match A.shape[0] (M)";
+    TVM_FFI_ICHECK_EQ(C.size(1), N) << "C.shape[1] must be 4096 (N)";
+
+    // Check data types (float16)
+    TVM_FFI_ICHECK_EQ(A.dtype().code, kDLFloat) << "A must be float type";
+    TVM_FFI_ICHECK_EQ(A.dtype().bits, 16) << "A must be float16";
+    TVM_FFI_ICHECK_EQ(B.dtype().code, kDLFloat) << "B must be float type";
+    TVM_FFI_ICHECK_EQ(B.dtype().bits, 16) << "B must be float16";
+    TVM_FFI_ICHECK_EQ(C.dtype().code, kDLFloat) << "C must be float type";
+    TVM_FFI_ICHECK_EQ(C.dtype().bits, 16) << "C must be float16";
+
+    // Check device (must be CUDA)
+    TVM_FFI_ICHECK_EQ(A.device().device_type, kDLCUDA) << "A must be on CUDA";
+    TVM_FFI_ICHECK_EQ(B.device().device_type, kDLCUDA) << "B must be on CUDA";
+    TVM_FFI_ICHECK_EQ(C.device().device_type, kDLCUDA) << "C must be on CUDA";
+
+    // Get data pointers
+    const __half* A_data = static_cast<const __half*>(A.data_ptr());
+    const __half* B_data = static_cast<const __half*>(B.data_ptr());
+    __half* C_data = static_cast<__half*>(C.data_ptr());
+
+    // Get CUDA stream from TVM FFI environment
+    DLDevice dev = A.device();
+    cudaStream_t stream = static_cast<cudaStream_t>(
+        TVMFFIEnvGetStream(dev.device_type, dev.device_id));
+
+    // Launch kernel
+    gemm_n4096_k4096_launch(A_data, B_data, C_data, static_cast<int>(M), stream);
+}
+
+// Export the function with TVM FFI
+TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, run);
+```
+
+## Performance Optimization Guidelines
+
+Your CUDA kernel should include:
+
+1. **Tensor Core Usage (WMMA)**: Use `nvcuda::wmma` for 16x16x16 matrix operations
+2. **Shared Memory Tiling**: Cache tiles of A and B in shared memory
+3. **Memory Coalescing**: Ensure threads access consecutive memory addresses
+4. **Bank Conflict Avoidance**: Add padding to shared memory arrays
+5. **Compute Intensity**: Maximize compute-to-memory-access ratio
+6. **Register Optimization**: Minimize register usage for higher occupancy
+7. **Stream Pipelining**: Overlap compute and memory operations
+
+## Output Format
+
+Write the complete JSON solution to:
+**`Example-FlashInfer-Trace/solutions/agent_vibecode_gemm.json`**
+
+The JSON must be valid and contain:
+- All required schema fields
+- Complete source code for all 3 files in the `content` fields
+- Properly escaped strings (use JSON encoding)
+
+## Validation Checklist
+
+Before finalizing, verify:
+- [ ] File names are exactly: `kernel.h`, `kernel.cu`, `main.cpp`
+- [ ] Entry point is `"main.cpp::run"`
+- [ ] Function signature: `void run(tvm::ffi::TensorView A, tvm::ffi::TensorView B, tvm::ffi::TensorView C)`
+- [ ] TVM_FFI_DLL_EXPORT_TYPED_FUNC exposes the `run` function
+- [ ] All three files included in `sources` array
+- [ ] Input validation with `TVM_FFI_ICHECK_*` macros
+- [ ] Kernel implements `C = A @ B.T` (transpose of B)
+- [ ] Data type is `__half` (float16)
+- [ ] CUDA stream from `TVMFFIEnvGetStream()`
+- [ ] Checks that all tensors are on CUDA device
+- [ ] JSON is valid and properly formatted
+- [ ] All TVM FFI headers included correctly
+
+## Expected Agent Behavior
+
+1. **Read** the GEMM definition from `definitions/gemm_n4096_k4096.json`
+2. **Understand** the operation: `C = A @ B.T` with shapes [M,K] × [N,K] → [M,N]
+3. **Implement** a high-performance CUDA kernel with tiling and tensor cores
+4. **Create** TVM FFI bindings following the API guidelines
+5. **Package** all source code into the Solution JSON format
+6. **Write** the JSON to `Example-FlashInfer-Trace/solutions/agent_vibecode_gemm.json`
+
+The JSON file should be ready to be consumed by the flashinfer-bench benchmarking system.
+
+## Summary
+
+This agent.md provides complete instructions for generating a CUDA GEMM kernel implementation using TVM FFI bindings. The key points are:
+
+- **3 files required**: `kernel.h`, `kernel.cu`, `main.cpp`
+- **Entry point**: `main.cpp::run` with signature `void run(TensorView A, TensorView B, TensorView C)`
+- **TVM FFI export**: Use `TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, run)`
+- **Validation**: Use `TVM_FFI_ICHECK_*` macros for input validation
+- **Stream management**: Get stream via `TVMFFIEnvGetStream()`
+- **Output**: Write complete JSON to `Example-FlashInfer-Trace/solutions/agent_vibecode_gemm.json`
+


⚠️ Potential issue | 🟡 Minor

Fix formatting issues flagged by pre-commit hooks.

The linting pipeline has detected formatting issues in this file:

Missing newline at end of file

Trailing whitespace

Code blocks need reformatting

These should be resolved to pass CI checks.

Run the following commands to fix:

#!/bin/bash # Fix formatting issues automatically pre-commit run --files examples/ffi/agent.md

🧰 Tools

🪛 GitHub Actions: .github/workflows/linting.yaml

[error] 1-1: pre-commit end-of-file-fixer failed. Files were modified by this hook.

[error] 1-1: pre-commit trailing-whitespace failed. Files were modified by this hook.

[error] 1-1: pre-commit black formatting check failed. File was reformatted by this hook.

🤖 Prompt for AI Agents

In examples/ffi/agent.md around lines 1 to 419, the file fails pre-commit formatting: add a final newline at EOF, remove all trailing whitespace, and reformat the fenced code blocks (ensure language tags and indentation are correct); run the repository's pre-commit hooks or the provided command (pre-commit run --files examples/ffi/agent.md) or your formatter (e.g., markdownlint/clang-format/prettier as configured) to automatically apply these fixes and re-check CI.

examples/ffi/e2e_kernel.py

coderabbitai · 2025-11-17T16:05:47Z

examples/ffi/e2e_kernel.py

+    for def_name, traces in result_traceset.traces.items():
+        print(f"\n{def_name}:")
+        for trace in traces:
+            print(f"  Solution: {trace.solution.name}")
+            print(f"  Status: {trace.evaluation.status.value}")
+            if trace.evaluation.performance:
+                print(f"  Speedup: {trace.evaluation.performance.speedup_factor:.2f}x")
+


⚠️ Potential issue | 🔴 Critical

Fix attribute access on solution.

Based on the Trace class definition, trace.solution is a string (the solution name), not an object with a .name attribute. Line 29 will raise an AttributeError at runtime.

Apply this diff:

print("\nBenchmark Complete") for def_name, traces in result_traceset.traces.items(): print(f"\n{def_name}:") for trace in traces: - print(f" Solution: {trace.solution.name}") + print(f" Solution: {trace.solution}") print(f" Status: {trace.evaluation.status.value}") if trace.evaluation.performance: print(f" Speedup: {trace.evaluation.performance.speedup_factor:.2f}x")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for def_name, traces in result_traceset.traces.items():

print(f"\n{def_name}:")

for trace in traces:

print(f" Solution: {trace.solution.name}")

print(f" Status: {trace.evaluation.status.value}")

if trace.evaluation.performance:

print(f" Speedup: {trace.evaluation.performance.speedup_factor:.2f}x")

for def_name, traces in result_traceset.traces.items():

print(f"\n{def_name}:")

for trace in traces:

print(f" Solution: {trace.solution}")

print(f" Status: {trace.evaluation.status.value}")

if trace.evaluation.performance:

print(f" Speedup: {trace.evaluation.performance.speedup_factor:.2f}x")

🤖 Prompt for AI Agents

In examples/ffi/e2e_kernel.py around lines 26 to 33, trace.solution is a string (the solution name) not an object with a .name attribute, so accessing trace.solution.name raises AttributeError; replace trace.solution.name with trace.solution (e.g., print(f" Solution: {trace.solution}")) and keep the rest of the printing logic unchanged.

coderabbitai · 2025-11-17T16:05:47Z

flashinfer_bench/agents/load_inline.py

+import re
+from pathlib import Path


⚠️ Potential issue | 🟡 Minor

Remove unused imports.

The imports re and Path are not used anywhere in this file.

Apply this diff:

-import re -from pathlib import Path - import torch

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

import re

from pathlib import Path

import torch

🤖 Prompt for AI Agents

In flashinfer_bench/agents/load_inline.py lines 1-2 the imports "import re" and "from pathlib import Path" are unused; remove these two import lines so the file only imports what it actually uses (delete both lines or remove the unused names), then run a quick lint/flake8 to confirm no unused-import warnings remain.

coderabbitai · 2025-11-17T16:05:47Z

tests/agent/test_results/gpt-5-mini-2025-08-07.txt

+  // shape / size validation
+  int64_t n = a.size(0);
+  if (b.size(0) != n) {
+    TVM_FFI_THROW(ValueError) << "elementwise_add: size mismatch between 'a' and 'b' ("
+                              << n << " vs " << b.size(0) << ")";
+  }
+  if (c.size(0) != n) {
+    TVM_FFI_THROW(ValueError) << "elementwise_add: output 'c' size must match inputs (expected "
+                              << n << ", got " << c.size(0) << ")";
+  }
+
+  // Ensure tensors are contiguous for simple indexing
+  if (!a.IsContiguous()) {
+    TVM_FFI_THROW(ValueError) << "elementwise_add: input 'a' must be contiguous";
+  }
+  if (!b.IsContiguous()) {
+    TVM_FFI_THROW(ValueError) << "elementwise_add: input 'b' must be contiguous";
+  }
+  if (!c.IsContiguous()) {
+    TVM_FFI_THROW(ValueError) << "elementwise_add: output 'c' must be contiguous";
+  }
+
+  // Ensure all tensors are on the same device
+  DLDevice da_dev = a.device();
+  DLDevice db_dev = b.device();
+  DLDevice dc_dev = c.device();
+  if (!(da_dev.device_type == db_dev.device_type && da_dev.device_id == db_dev.device_id &&
+        da_dev.device_type == dc_dev.device_type && da_dev.device_id == dc_dev.device_id)) {
+    TVM_FFI_THROW(ValueError) << "elementwise_add: tensors must be on the same device";
+  }
+
+  // Pointers to data
+  float* a_ptr = static_cast<float*>(a.data_ptr());
+  float* b_ptr = static_cast<float*>(b.data_ptr());
+  float* c_ptr = static_cast<float*>(c.data_ptr());
+
+  // Dispatch by device type
+  if (da_dev.device_type == kDLCUDA) {
+    // GPU path: launch CUDA kernel on environment-managed stream
+    TVMFFIStreamHandle raw_stream = TVMFFIEnvGetStream(da_dev.device_type, da_dev.device_id);
+    cudaStream_t stream = static_cast<cudaStream_t>(raw_stream);
+
+    const int threads = 256;
+    const int64_t blocks = (n + threads - 1) / threads;
+    ElementwiseAddKernel<<<static_cast<int>(blocks), threads, 0, stream>>>(a_ptr, b_ptr, c_ptr, n);
+
+    // Check kernel launch error
+    cudaError_t err = cudaGetLastError();
+    if (err != cudaSuccess) {
+      TVM_FFI_THROW(RuntimeError) << "elementwise_add: CUDA kernel launch failed: "
+                                  << cudaGetErrorString(err);
+    }


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Handle n == 0 explicitly to avoid invalid CUDA launches

In this variant there’s no early-return for n == 0, so the CUDA path computes:

const int threads = 256; const int64_t blocks = (n + threads - 1) / threads; // blocks == 0 when n == 0 ElementwiseAddKernel<<<static_cast<int>(blocks), threads, 0, stream>>>(...);

Launching a kernel with 0 blocks is invalid and will surface as a CUDA error instead of treating zero-length tensors as a no-op (the CPU path is fine because the loop simply doesn’t run).

You can mirror the behavior of the o3 implementation and avoid spurious errors with:

int64_t n = a.size(0); // ... existing shape checks ... if (n == 0) { return; }

placed before the device-type dispatch.

--- 🏁 Script executed: ```shell #!/bin/bash # Verify file existence and locate elementwise_add implementations echo "=== Checking for the test results file ===" find . -name "gpt-5-mini-2025-08-07.txt" -o -name "*elementwise_add*" | head -20 echo "" echo "=== Searching for elementwise_add implementations ===" rg -l "ElementwiseAddKernel" --type cpp --type cu echo "" echo "=== Looking for 'o3' variant references ===" rg -n "o3.*elementwise\|elementwise.*o3" --type cpp --type cu

Length of output: 265

🌐 Web query:

CUDA kernel launch with zero blocks grid size invalid behavior

💡 Result:

Short answer: launching a kernel with any grid or block dimension equal to zero is invalid — the CUDA runtime returns an "invalid configuration" / CUDA_ERROR_INVALID_VALUE (or similar) and you should guard against it and skip the launch. See vendor docs for launch return codes and community discussion showing zero-dimension launches produce "invalid configuration" errors. [1][2][3]

Details and recommendations

cuLaunchKernel and the runtime list invalid-value / invalid-configuration among possible return/error codes for improper launch parameters. Check launch return values (cudaGetLastError / return status) to detect this. [2]

Community reports and Q&A show kernel<<<0, ...>>> or kernel<<<...,0>>> leads to "invalid configuration argument" and interferes with proper error checking; the recommended pattern is to test the dimensions in host code and skip the kernel call when gridDim or blockDim is zero. [1][3]

Minimal guard example (host-side):

if (blocks > 0 && threads > 0) kernel<<<blocks, threads>>>(...);

References

StackOverflow discussion about running zero blocks in CUDA (reports "invalid configuration argument"). [1]

NVIDIA cuLaunchKernel documentation (lists CUDA_ERROR_INVALID_VALUE / invalid launch among returns). [2]

CUDA C++ Programming Guide (general kernel launch rules / errors). [3]

--- 🏁 Script executed: ```shell #!/bin/bash # Examine the test results file to understand its nature echo "=== Content of test results file (first 20 lines) ===" head -20 ./git/flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt echo "" echo "=== Search for elementwise_add in all files (broader search) ===" rg "elementwise_add|ElementwiseAdd" --type-list | head -5 rg "elementwise_add|ElementwiseAdd" -g "!test_results" | head -30 echo "" echo "=== Check agents directory structure ===" find ./git/flashinfer_bench/agents -type f -name "*.txt" -o -name "*.cc" -o -name "*.h" | head -20

Length of output: 1460

🏁 Script executed:

#!/bin/bash # Find repository root and structure echo "=== Repository root and top-level structure ===" pwd ls -la | head -15 echo "" echo "=== Find any .txt files with 'gpt' in name ===" find . -name "*gpt*.txt" 2>/dev/null echo "" echo "=== List flashinfer_bench/agents directory structure ===" find flashinfer_bench/agents -type d 2>/dev/null | head -20 echo "" echo "=== Check if test_results directory exists under agents ===" ls -la flashinfer_bench/agents/test_results 2>/dev/null || echo "test_results directory not found" echo "" echo "=== Full path of any file matching the review file name ===" find . -name "*2025-08-07*" 2>/dev/null

Length of output: 1928

🏁 Script executed:

#!/bin/bash # Examine the specific lines in the gpt-5-mini file echo "=== Lines 56-107 of gpt-5-mini-2025-08-07.txt ===" sed -n '56,107p' ./flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt echo "" echo "=== Full content of o3.txt (for comparison) ===" cat ./flashinfer_bench/agents/test_results/o3.txt echo "" echo "=== Check if o3.txt has n==0 early return ===" grep -n "n == 0\|n==0" ./flashinfer_bench/agents/test_results/o3.txt || echo "No n==0 check found in o3.txt"

Length of output: 7122

Add early-return guard for zero-length tensors to prevent invalid CUDA launches

The gpt-5-mini implementation lacks an n == 0 check before kernel dispatch. When n=0, the code computes blocks = (0 + 255) / 256 = 0 and attempts ElementwiseAddKernel<<<0, ...>>>, which is invalid and raises a CUDA error instead of treating empty tensors as a no-op.

The o3 implementation correctly includes this guard at line 106:

if (n == 0) return; // nothing to do

Add this check immediately after shape validation and before the device-type dispatch to align with the o3 pattern:

int64_t n = a.size(0); // ... existing shape and contiguity checks ... if (n == 0) { return; } // Dispatch by device type if (da_dev.device_type == kDLCUDA) { // ... }

🧰 Tools

🪛 LanguageTool

[style] ~82-~82: Using many exclamation marks might seem excessive (in this case: 12 exclamation marks for a text that’s 4202 characters long)
Context: ...; DLDevice dc_dev = c.device(); if (!(da_dev.device_type == db_dev.device_typ...

(EN_EXCESSIVE_EXCLAMATION)

🤖 Prompt for AI Agents

In flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt around lines 56 to 107, the implementation dispatches a CUDA kernel even when n == 0 which yields a zero-block launch error; add an early-return guard immediately after the shape/contiguity/device checks (and before the device-type dispatch) that returns when n == 0 so empty tensors become a no-op and no kernel is launched.

…h-dev

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (7)

tests/agent/load_inline.py (2)

13-13: Previous review comment about kernel parameter type still applies.

The kernel parameter n is int, which can truncate when the tensor size exceeds INT_MAX. As noted in a previous review, it should be int64_t to match the type of x.size(0).

24-28: Previous review comment about launch configuration types still applies.

The variables blocks and threads are declared as int64_t but are passed directly to the kernel launch configuration, which expects dim3 or unsigned int. As noted in a previous review, this can lead to truncation and incorrect behavior for large inputs.

examples/ffi/e2e_kernel.py (2)

20-20: Previous review comment about unnecessary f-string still applies.

The f-string has no placeholders, making the f prefix unnecessary. This was flagged in a previous review.

32-32: Previous review comment about attribute access still applies.

As noted in a previous review, trace.solution is a string (the solution name), not an object with a .name attribute. This line will raise an AttributeError at runtime.

examples/ffi/agent.md (2)

250-250: Previous review comment about parameter type still applies.

As noted in a previous review, using int for the variable dimension M can lead to truncation if M exceeds INT_MAX. Using int64_t would be safer for this instructional example.

1-418: Previous review comment about formatting issues still applies.

As noted in a previous review, the file has formatting issues flagged by pre-commit hooks (missing newline at EOF, trailing whitespace). These should be resolved to pass CI checks.

tests/agent/test_results/gpt-5-mini-2025-08-07.txt (1)

56-107: Previous review comment about zero-length tensor guard still applies.

As noted in a previous review, this implementation lacks an n == 0 check before kernel dispatch, which can lead to invalid CUDA launches when n=0. The recommendation to add an early-return guard before the device-type dispatch remains valid.

🧹 Nitpick comments (11)

tests/agent/test_existing_prompt_sol.py (4)

9-36: Regex-based extraction looks correct but is tightly coupled to the file format

The two-stage regex (code_pattern with Test Results: sentinel, then alt_pattern) matches the current layout from test_prompt.py and gives a decent fallback when the results section is missing. Just be aware this is tightly coupled to the exact header strings ("Generated Code:", 80 '=', "Test Results:"); any future format drift will break parsing. If you expect multiple formats long‑term, consider centralizing these constants or using a small parser instead of hard‑coding the patterns in multiple places.

39-73: Consolidate test_kernel and consider narrowing the exception

test_kernel here is effectively the same as in tests/agent/test_prompt.py, which means any future change to the test semantics must be done in two places. Consider moving this helper to a shared module (e.g., tests/agent/utils.py) and importing it in both scripts.

Also, except Exception will swallow unexpected issues (e.g., configuration bugs) and only surface a boolean. If you care about distinguishing test failures from infrastructure errors, you might narrow the exception type or re‑raise after logging in the outer orchestration.

76-97: Result-shape assumptions in update_test_results_in_file

update_test_results_in_file assumes results always has a 'compilation' key and either 'error' or both 'test_small' and 'test_large'. That matches current call sites, but it’s easy to break accidentally. If you expect to extend the result schema (e.g., add more tests), consider:

validating the presence of required keys up front, or

making this function accept a structured object (dataclass/dict with defaults) rather than ad‑hoc keys.

99-172: Nested try/except blocks and broad exception handling

test_saved_code has an outer try/except Exception and an inner try/except Exception around compilation. This works, but it makes control flow harder to follow and can hide unexpected failures under generic "error" statuses.

Two possible simplifications:

Factor compilation + test execution into a helper that returns a structured result, and let test_saved_code just orchestrate and write results.

Narrow the inner except (e.g., to compilation-related errors) and let truly unexpected exceptions propagate to the outer handler or even to the top level when running as a script.

This would make debugging failures easier without changing current behavior.

tests/agent/test_prompt.py (7)

19-28: Prompt definition is clear but could be centralized with FFI prompt usage

ELEMENTWISE_ADD_PROMPT clearly states the kernel contract. Since you also have FFI_PROMPT_SIMPLE providing FFI‑side requirements, consider documenting (in a comment near here) that these two prompts must stay in sync with the expectations in examples/ffi/agent.md and the FFI templates. That will help future contributors avoid drifting the human‑readable description from the FFI constraints.

31-43: Model configuration is fine; consider validating API keys explicitly

get_model_config cleanly maps known model names to providers. One possible improvement is to check that the relevant API key is non‑empty and raise a clearer error (or skip the model) rather than letting the downstream client fail with a less obvious message when keys are missing.

46-68: Unify prompt handling for OpenAI models

Right now:

For "o3" / "o4-mini-2025-04-16", call_openai_model uses the passed prompt as the sole user message.

For other models, it ignores the prompt parameter and reconstructs messages from ELEMENTWISE_ADD_PROMPT and FFI_PROMPT_SIMPLE.

This works but makes test_model’s full_prompt slightly misleading and couples prompting strategy to the caller and callee in different ways.

A cleaner pattern would be one of:

Have test_model construct a single user_prompt and let call_openai_model always use that (with a per‑model system message if needed), or

Let call_openai_model fully own the prompt structure and drop the prompt argument from the signature.

Either way, keeping a single source of truth for prompt composition will make future prompt tweaks less error‑prone.

70-82: prompt parameter is unused in call_anthropic_model

call_anthropic_model ignores its prompt argument and always sends FFI_PROMPT_SIMPLE as the user content. Given test_model currently passes full_prompt, this is surprising.

Two options:

Use the prompt parameter in the Anthropic call (e.g., messages=[{"role": "user", "content": prompt}]), or

Remove the parameter and adjust the caller to make it clear Anthropic uses a fixed prompt layout.

Either change would make the interface less confusing and align with how you intend to manage prompts across providers.

85-95: CUDA code extraction heuristic may skip valid kernels or over-include text

The heuristic prefers fenced code blocks containing TVM_FFI or TensorView, otherwise it falls back to returning the entire response. That’s reasonable for current prompts, but:

If a valid kernel omits these tokens (e.g., uses different macros or pure CUDA), you’ll end up sending the whole response—including prose—to the compiler.

If multiple code blocks exist, you always pick the first that matches those tokens, which may not be the actual kernel.

Consider an additional fallback: if no matching block is found, use the first fenced block, and only as a last resort return the full response. That keeps you resilient to prompt or model changes while still prioritizing FFI‑aware code.

98-132: Duplicate test_kernel helper across files

As in test_existing_prompt_sol.py, test_kernel is duplicated here. Since this is core to validating LLM‑generated kernels, having a single shared implementation (e.g., imported from a small tests/agent/kernel_tests.py module) would reduce the risk of the two scripts drifting apart over time.

135-217: test_model orchestration is solid; consider simplifying error handling and prompt wiring

The high-level flow in test_model—get config → call model → extract code → write file → compile → run tests—is clear and the returned result dict is easy to consume.

Two optional improvements:

Error handling: there are three layers of try/except Exception. You might keep the inner compilation try/except (to always write results), but consider narrowing exception types or letting truly unexpected exceptions bubble out of the outermost block so they fail the script loudly during development.

Prompt wiring: given the comments above on call_openai_model / call_anthropic_model, you could move prompt composition entirely into those helpers and drop full_prompt here, reducing duplication and making it clearer where to update prompts in the future.

Neither change is required for correctness, but they would make the script easier to extend.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 46ca5a5 and b7d62c3.

📒 Files selected for processing (10)

examples/ffi/agent.md (1 hunks)
examples/ffi/e2e_kernel.py (1 hunks)
tests/agent/load_inline.py (1 hunks)
tests/agent/test_existing_prompt_sol.py (1 hunks)
tests/agent/test_prompt.py (1 hunks)
tests/agent/test_results/claude-opus-4-1-20250805.txt (1 hunks)
tests/agent/test_results/gpt-5-2025-08-07.txt (1 hunks)
tests/agent/test_results/gpt-5-mini-2025-08-07.txt (1 hunks)
tests/agent/test_results/o3.txt (1 hunks)
tests/agent/test_results/o4-mini-2025-04-16.txt (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

tests/agent/test_existing_prompt_sol.py (2)

tests/agent/test_prompt.py (1)

main (219-257)

tests/agent/load_inline.py (1)

main (36-44)

tests/agent/test_prompt.py (2)

tests/agent/test_existing_prompt_sol.py (2)

test_kernel (39-73)

main (175-222)

tests/agent/load_inline.py (1)

main (36-44)

examples/ffi/e2e_kernel.py (4)

flashinfer_bench/bench/benchmark.py (2)

Benchmark (16-181)

run_all (61-181)

flashinfer_bench/bench/config.py (1)

BenchmarkConfig (8-64)

flashinfer_bench/data/trace_set.py (2)

TraceSet (23-477)

from_path (85-145)

flashinfer_bench/apply/apply_api.py (2)

disable_apply (183-192)

enable_apply (142-180)

🪛 LanguageTool

tests/agent/test_results/gpt-5-2025-08-07.txt

[style] ~71-~71: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...Contiguous() || !b.IsContiguous() || !c.IsContiguous()) { TVM_FFI_THROW(ValueError) << "...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

[style] ~97-~97: Using many exclamation marks might seem excessive (in this case: 13 exclamation marks for a text that’s 3750 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...

(EN_EXCESSIVE_EXCLAMATION)

tests/agent/test_results/gpt-5-mini-2025-08-07.txt

[style] ~82-~82: Using many exclamation marks might seem excessive (in this case: 12 exclamation marks for a text that’s 4202 characters long)
Context: ...; DLDevice dc_dev = c.device(); if (!(da_dev.device_type == db_dev.device_typ...

(EN_EXCESSIVE_EXCLAMATION)

tests/agent/test_results/o3.txt

[style] ~101-~101: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...Contiguous() || !b.IsContiguous() || !c.IsContiguous()) { TVM_FFI_THROW(ValueError) << "...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

[style] ~123-~123: Using many exclamation marks might seem excessive (in this case: 19 exclamation marks for a text that’s 4501 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...

(EN_EXCESSIVE_EXCLAMATION)

tests/agent/test_results/o4-mini-2025-04-16.txt

[style] ~83-~83: Using many exclamation marks might seem excessive (in this case: 19 exclamation marks for a text that’s 4073 characters long)
Context: ...dev = a.device(); if (dev.device_type != kDLCUDA) { TVM_FFI_THROW(RuntimeError) << "elementwise_add: only CUDA device supported"; } if (b.device().device_type != dev.device_type || b.device().device_id != dev.device_id) { TVM_FFI_THROW(ValueError) << "elementwise_add: b must be on the same CUDA device as a"; } if (c.device().device_type != dev.device_type || c.device().device_id != dev.device_id) { TVM_FFI_THROW(Val...

(EN_EXCESSIVE_EXCLAMATION)

tests/agent/test_results/claude-opus-4-1-20250805.txt

[style] ~110-~110: Using many exclamation marks might seem excessive (in this case: 23 exclamation marks for a text that’s 4049 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...

(EN_EXCESSIVE_EXCLAMATION)

🪛 Ruff (0.14.4)

tests/agent/test_existing_prompt_sol.py

34-34: Avoid specifying long messages outside the exception class

(TRY003)

71-71: Do not catch blind exception: Exception

(BLE001)

140-147: Consider moving this statement to an else block

(TRY300)

149-149: Do not catch blind exception: Exception

(BLE001)

165-165: Do not catch blind exception: Exception

(BLE001)

tests/agent/test_prompt.py

43-43: Avoid specifying long messages outside the exception class

(TRY003)

70-70: Unused function argument: prompt

(ARG001)

130-130: Do not catch blind exception: Exception

(BLE001)

151-151: Abstract raise to an inner function

(TRY301)

151-151: Avoid specifying long messages outside the exception class

(TRY003)

191-197: Consider moving this statement to an else block

(TRY300)

199-199: Do not catch blind exception: Exception

(BLE001)

205-205: Use explicit conversion flag

Replace with conversion flag

(RUF010)

214-214: Do not catch blind exception: Exception

(BLE001)

examples/ffi/e2e_kernel.py

20-20: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (5)

tests/agent/test_results/o3.txt (1)

1-141: Excellent reference implementation.

This generated code demonstrates comprehensive best practices:

Thorough input validation (dimensionality, dtype, shape, device, contiguity)

Zero-length tensor guard (line 106) to avoid invalid CUDA launches

Proper CUDA stream management via TVM FFI

Debug-mode kernel launch error checking

Clear structure and documentation

This serves as a strong reference for evaluating other model outputs.

tests/agent/test_results/claude-opus-4-1-20250805.txt (1)

1-124: Solid implementation with proper validation.

This implementation correctly handles:

Input validation for dimensions, dtype, shapes, device placement, and contiguity

Empty tensor edge case (lines 92-94)

CUDA stream retrieval from TVM environment

Kernel launch error detection

The code follows TVM FFI patterns appropriately and includes comprehensive error reporting.

tests/agent/test_results/gpt-5-2025-08-07.txt (1)

1-111: Well-optimized implementation.

This implementation includes several optimizations and best practices:

Use of __restrict__ pointers (lines 16-18) for potential compiler optimizations

Proper type casting to unsigned int for block count (line 89)

Zero-length tensor guard (line 81)

Comprehensive input validation

The code correctly implements TVM FFI patterns and includes appropriate error checking.

tests/agent/test_existing_prompt_sol.py (1)

175-222: Main re-test loop is straightforward and readable

The main() function’s discovery of test_results/*.txt and the final summary output are clear and match the expectations set by the file format. The truncation of long error messages is also a nice touch for CLI readability.

tests/agent/test_prompt.py (1)

219-257: Main driver is clear and matches the companion re-test script

The main loop over models, creation of test_results directory, and the final summary output mirror the workflow in test_existing_prompt_sol.main(). This symmetry makes it straightforward to understand how new runs produce .txt artifacts that the re-test script later consumes.

coderabbitai · 2025-11-17T16:25:42Z

tests/agent/test_results/o4-mini-2025-04-16.txt

+  // 6) Get CUDA stream from TVM runtime
+  cudaStream_t stream =
+      static_cast<cudaStream_t>(TVMFFIEnvGetStream(dev.device_type,
+                                                  dev.device_id));
+
+  // 7) Launch kernel
+  constexpr int THREADS = 256;
+  int blocks = static_cast<int>((n + THREADS - 1) / THREADS);
+  ElementwiseAddKernel<<<blocks, THREADS, 0, stream>>>(
+      a_ptr, b_ptr, c_ptr, n);


⚠️ Potential issue | 🟡 Minor

Add zero-length tensor guard to prevent invalid CUDA launches.

When n == 0, the code computes blocks = (0 + 255) / 256 = 0 and attempts to launch a kernel with 0 blocks, which is invalid and will raise a CUDA error. The reference implementation in o3.txt (line 106) correctly includes an early-return guard for this case.

Add this check before the CUDA stream retrieval:

// 6) Get CUDA stream from TVM runtime + if (n == 0) { + return; // nothing to do + } + cudaStream_t stream = static_cast<cudaStream_t>(TVMFFIEnvGetStream(dev.device_type, dev.device_id));

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// 6) Get CUDA stream from TVM runtime

cudaStream_t stream =

static_cast<cudaStream_t>(TVMFFIEnvGetStream(dev.device_type,

dev.device_id));

// 7) Launch kernel

constexpr int THREADS = 256;

int blocks = static_cast<int>((n + THREADS - 1) / THREADS);

ElementwiseAddKernel<<<blocks, THREADS, 0, stream>>>(

a_ptr, b_ptr, c_ptr, n);

// 6) Get CUDA stream from TVM runtime

if (n == 0) {

return; // nothing to do

}

cudaStream_t stream =

static_cast<cudaStream_t>(TVMFFIEnvGetStream(dev.device_type,

dev.device_id));

// 7) Launch kernel

constexpr int THREADS = 256;

int blocks = static_cast<int>((n + THREADS - 1) / THREADS);

ElementwiseAddKernel<<<blocks, THREADS, 0, stream>>>(

a_ptr, b_ptr, c_ptr, n);

🤖 Prompt for AI Agents

In tests/agent/test_results/o4-mini-2025-04-16.txt around lines 103 to 112, add a guard to return early when n == 0 before computing blocks or retrieving the CUDA stream; this prevents computing blocks == 0 and launching a kernel with zero blocks. Modify the function so it checks if n == 0 and returns (or no-ops) immediately, then proceed to get the stream and launch the kernel only when n > 0. Ensure the check is placed before any CUDA-related calls to avoid invalid CUDA launches or errors.

coderabbitai

Actionable comments posted: 9

♻️ Duplicate comments (1)

examples/ffi/agent_vibecode.md (1)
250-250: Address the previous review comment about integer type.

As noted in the previous review, using int for dimension M can lead to truncation for large values. Since this is instructional documentation for agents, it should promote best practices.

Consider updating the instruction to use int64_t consistently:
 void gemm_n4096_k4096_launch(
     const __half* A,
     const __half* B,
     __half* C,
-    int M,
+    int64_t M,
     cudaStream_t stream
 );
And remove the cast at line 353:
-    gemm_n4096_k4096_launch(A_data, B_data, C_data, static_cast<int>(M), stream);
+    gemm_n4096_k4096_launch(A_data, B_data, C_data, M, stream);

🧹 Nitpick comments (1)

examples/ffi/cpp_example.cc (1)

44-58: Consider deterministic initialization for reproducibility.

The init_random_tensor function uses rand() without seeding via srand(), making tensor initialization non-deterministic across runs. For a testing/example context, consider adding srand(seed) for reproducibility.

Additionally, the cudaMemcpy call lacks error checking:

 void init_random_tensor(ffi::TensorView tensor) {
     size_t num_elements = 1;
     for (int i = 0; i < tensor.ndim(); ++i) {
         num_elements *= tensor.shape()[i];
     }
     
     std::vector<__half> host_data(num_elements);
     
     for (size_t i = 0; i < num_elements; ++i) {
         host_data[i] = __float2half(static_cast<float>(rand()) / RAND_MAX);
     }
     
-    cudaMemcpy(tensor.data_ptr(), host_data.data(), 
-               num_elements * sizeof(__half), cudaMemcpyHostToDevice);
+    cudaError_t err = cudaMemcpy(tensor.data_ptr(), host_data.data(), 
+                                  num_elements * sizeof(__half), cudaMemcpyHostToDevice);
+    if (err != cudaSuccess) {
+        std::cerr << "cudaMemcpy failed: " << cudaGetErrorString(err) << std::endl;
+        abort();
+    }
 }

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7d62c3 and 19a32bb.

📒 Files selected for processing (8)

examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json (1 hunks)
examples/ffi/Makefile (1 hunks)
examples/ffi/README.md (1 hunks)
examples/ffi/agent_vibecode.md (1 hunks)
examples/ffi/cpp_example.cc (1 hunks)
examples/ffi/distribute_kernel.py (1 hunks)
examples/ffi/jax_example.py (1 hunks)
examples/ffi/pytorch_example.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

examples/ffi/distribute_kernel.py (5)

flashinfer_bench/data/trace_set.py (2)

TraceSet (23-477)

from_path (85-145)

flashinfer_bench/compile/builders/tvm_ffi_builder.py (1)

TVMFFIBuilder (34-409)

examples/ffi/cpp_example.cc (2)

main (123-173)

main (123-123)

examples/ffi/jax_example.py (1)

main (15-57)

examples/ffi/pytorch_example.py (1)

main (9-46)

examples/ffi/pytorch_example.py (3)

examples/ffi/cpp_example.cc (2)

main (123-173)

main (123-123)

examples/ffi/distribute_kernel.py (1)

main (8-34)

examples/ffi/jax_example.py (1)

main (15-57)

🪛 checkmake (0.2.2)

examples/ffi/Makefile

[warning] 48-48: Missing required phony target "test"

(minphony)

[warning] 52-52: Target body for "check" exceeds allowed length of 5 (9).

(maxbodylength)

[warning] 82-82: Target body for "help" exceeds allowed length of 5 (21).

(maxbodylength)

🪛 Clang (14.0.6)

examples/ffi/cpp_example.cc

[error] 11-11: 'tvm/ffi/container/tensor.h' file not found

(clang-diagnostic-error)

🪛 GitHub Actions: .github/workflows/linting.yaml

examples/ffi/distribute_kernel.py

[error] 1-1: Pre-commit changes detected in Python file (trailing whitespace/EOF/formatting). Run pre-commit locally to re-check.

examples/ffi/README.md

[error] 1-1: pre-commit: end-of-file-fixer hook failed. Files were modified by this hook (e.g., README.md). Re-run pre-commit to apply fixes.

[error] 1-1: pre-commit: end-of-file-fixer hook failed. Files were modified by this hook (README.md).

examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json

[error] 1-1: pre-commit: end-of-file-fixer hook modified this JSON file. Please commit the changes.

examples/ffi/jax_example.py

[error] 1-1: Isort formatting changed imports; Black formatting may also be required. Run 'isort' and 'black' to fix.

examples/ffi/cpp_example.cc

[error] 1-1: pre-commit: formatting changes applied to C++ example (cpp_example.cc).

examples/ffi/pytorch_example.py

[error] 1-1: Black formatting check failed. File reformatted by Black.

🪛 LanguageTool

examples/ffi/README.md

[grammar] ~3-~3: Use a hyphen to join words.
Context: ...how to build, distribute, and load agent generated CUDA kernels using TVM-FFI acr...

(QB_NEW_EN_HYPHEN)

[grammar] ~20-~20: Use a hyphen to join words.
Context: ...VM-FFI C++ headers and libraries - C++17 compatible compiler ## Usage ### 1. Ge...

(QB_NEW_EN_HYPHEN)

🪛 Ruff (0.14.6)

examples/ffi/pytorch_example.py

20-20: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Run unit tests on ubuntu-latest and Python 3.10
GitHub Check: Run unit tests on ubuntu-latest and Python 3.11
GitHub Check: Run unit tests on ubuntu-latest and Python 3.12
GitHub Check: Run unit tests on ubuntu-latest and Python 3.13

🔇 Additional comments (6)

examples/ffi/Makefile (1)

1-104: LGTM!

The Makefile properly detects TVM-FFI paths, provides sensible fallbacks, and includes helpful targets with clear documentation. The checkmake warnings about target body length are stylistic preferences and don't affect functionality.

examples/ffi/pytorch_example.py (1)

9-46: LGTM! Good error handling pattern.

The code properly validates the entry symbol is found before use, unlike some other examples. The error message is appropriately descriptive despite the Ruff style warning.

examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json (2)

17-22: Kernel implementation should be validated through testing.

The embedded CUDA kernel is complex with WMMA operations, shared memory management, and dynamic sizing. While the JSON structure is correct, the kernel correctness should be validated through the test suite mentioned in the PR.

Based on the PR description, model testing was performed and all models wrote correct FFI bindings.

4-4: I'll verify the model attribution concern by examining the file and searching for related references in the codebase.
<function_calls>

find . -name "example_agent_solution.json" -type f 2>/dev/null | head -5

</function_calls>

Let me search for the file:
<function_calls>

find . -path "Example-FlashInfer-Trace" -name "*.json" 2>/dev/null

</function_calls>

<function_calls>

fd "example_agent_solution" --type f

</function_calls>

<function_calls>

fd -e json --type f | grep -i "flashinfer|example" | head -20

</function_calls>

<function_calls>

git ls-files | grep -i "flashinfer"

</function_calls>

<function_calls>

git ls-files | grep -i "example.*solution"

</function_calls>

<function_calls>

git ls-files "examples/ffi/*" | head -20

</function_calls>

<function_calls>

ls -la examples/ 2>/dev/null | head -20

</function_calls>

<function_calls>

git ls-files | head -50

</function_calls>

Now let me search for references to "gpt-5" in the codebase to understand the context:
<function_calls>

rg "gpt-5" --type-list | head -5

</function_calls>

<function_calls>

rg "gpt-5" -i

</function_calls>

Let me get the full directory structure to understand what we're working with:
<function_calls>

git ls-files | grep -E "(examples|README)" | sort

</function_calls>

<function_calls>

cat README.md 2>/dev/null | head -100

</function_calls>

Based on the repository structure, let me search more broadly:
<function_calls>

rg "2025-08-07" -i

</function_calls>

<function_calls>

rg "gpt.*5" -i

</function_calls>

Let me check if the path exists exactly as specified:
<function_calls>

cat "examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json" 2>&1

</function_calls>

<function_calls>

ls -la examples/ffi/ 2>&1 | head -20

</function_calls>

Let me get all JSON files in the repository:
<function_calls>

fd -e json --type f | head -30

</function_calls>

<function_calls>

git log --all --oneline -- "examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json" 2>&1 | head -5

</function_calls>

Let me check what branch we're on and verify the current state:
<function_calls>

git status

</function_calls>

<function_calls>

git branch -a

</function_calls>

<function_calls>

git log --oneline -n 10

</function_calls>

Let me list all files in the examples directory:
<function_calls>

find examples -type f 2>/dev/null | sort

</function_calls>

<function_calls>

pwd && ls -la

</function_calls>

<function_calls>

git remote -v

</function_calls>

Based on my exploration of the repository, the file referenced in the review comment does not exist in the current codebase. The path examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json could not be located after comprehensive searches using multiple methods (fd, find, git ls-files, rg).

However, since this is a review comment that has been flagged for verification, it appears the file may be:

A new file being added in the PR that isn't yet visible in the current branch state

Located in a different branch or on a different PR

Without access to the actual file content, I cannot verify whether "gpt-5-2025-08-07" is indeed a placeholder model name or a legitimate reference. To proceed with verification, could you confirm:

Whether this file is newly added in the PR?

Whether the file path is correct as specified?

Whether the repository has been properly cloned to the PR branch?

examples/ffi/agent_vibecode.md (1)

1-418: Comprehensive agent instructions.

The document provides thorough guidance on TVM-FFI integration, covering all necessary headers, validation patterns, and best practices. The structure is clear and includes complete working examples.

examples/ffi/README.md (1)

26-26: I'll help you verify the model name reference in the review comment. Let me start by examining the file and the context around line 26.
<function_calls>

cat -n examples/ffi/README.md | head -40

</function_calls>

Let me also search for the agent_vibecode.md file and any other references to this model name:
<function_calls>

find . -name "agent_vibecode.md" -o -name "vibecode" 2>/dev/null

</function_calls>

And let me search for all occurrences of "gpt-5" or similar model references in the repository:
<function_calls>

rg "gpt-5|gpt-4" -i --type md --type txt

</function_calls>

Finally, let me check if there are any other references to model names in the ffi examples directory:
<function_calls>

rg "gpt|claude|model" -i examples/ffi/ --type md

</function_calls>

coderabbitai · 2025-11-28T07:48:48Z

examples/ffi/cpp_example.cc

+     std::string entry_symbol = read_entry_symbol(dist_dir + "/kernel_metadata.txt");
+
+     ffi::Module mod = ffi::Module::LoadFromFile(dist_dir + "/kernel.so");
+     ffi::Function kernel_fn = mod->GetFunction(entry_symbol).value();
+
+     std::cout << "Loaded kernel: " << entry_symbol << std::endl;
+
+     // Prepare inputs: C = A @ B.T
+     const int64_t M = 1024, N = 4096, K = 4096;
+     DLDataType dtype{kDLFloat, 16, 1};
+
+     ffi::Tensor A = allocate_cuda_tensor({M, K}, dtype);
+     ffi::Tensor B = allocate_cuda_tensor({N, K}, dtype);
+     ffi::Tensor C = allocate_cuda_tensor({M, N}, dtype);
+
+     init_random_tensor(A);
+     init_random_tensor(B);
+     cudaMemset(C.data_ptr(), 0, M * N * sizeof(__half));
+
+     kernel_fn(A, B, C);
+     cudaDeviceSynchronize();


⚠️ Potential issue | 🟠 Major

Add error handling for critical operations.

Several operations that can fail lack error checking:

Empty entry symbol from read_entry_symbol (line 134)

Module loading (line 136)

Kernel execution and synchronization (lines 153-154)

Apply this diff:

const std::string dist_dir = "distributed"; std::string entry_symbol = read_entry_symbol(dist_dir + "/kernel_metadata.txt"); + if (entry_symbol.empty()) { + std::cerr << "Failed to read entry symbol from metadata" << std::endl; + return 1; + } ffi::Module mod = ffi::Module::LoadFromFile(dist_dir + "/kernel.so"); + if (!mod.defined()) { + std::cerr << "Failed to load module" << std::endl; + return 1; + } + ffi::Function kernel_fn = mod->GetFunction(entry_symbol).value(); std::cout << "Loaded kernel: " << entry_symbol << std::endl;

And add error checking after kernel execution:

kernel_fn(A, B, C); - cudaDeviceSynchronize(); + cudaError_t err = cudaDeviceSynchronize(); + if (err != cudaSuccess) { + std::cerr << "Kernel execution failed: " << cudaGetErrorString(err) << std::endl; + return 1; + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

std::string entry_symbol = read_entry_symbol(dist_dir + "/kernel_metadata.txt");

ffi::Module mod = ffi::Module::LoadFromFile(dist_dir + "/kernel.so");

ffi::Function kernel_fn = mod->GetFunction(entry_symbol).value();

std::cout << "Loaded kernel: " << entry_symbol << std::endl;

// Prepare inputs: C = A @ B.T

const int64_t M = 1024, N = 4096, K = 4096;

DLDataType dtype{kDLFloat, 16, 1};

ffi::Tensor A = allocate_cuda_tensor({M, K}, dtype);

ffi::Tensor B = allocate_cuda_tensor({N, K}, dtype);

ffi::Tensor C = allocate_cuda_tensor({M, N}, dtype);

init_random_tensor(A);

init_random_tensor(B);

cudaMemset(C.data_ptr(), 0, M * N * sizeof(__half));

kernel_fn(A, B, C);

cudaDeviceSynchronize();

std::string entry_symbol = read_entry_symbol(dist_dir + "/kernel_metadata.txt");

if (entry_symbol.empty()) {

std::cerr << "Failed to read entry symbol from metadata" << std::endl;

return 1;

}

ffi::Module mod = ffi::Module::LoadFromFile(dist_dir + "/kernel.so");

if (!mod.defined()) {

std::cerr << "Failed to load module" << std::endl;

return 1;

}

ffi::Function kernel_fn = mod->GetFunction(entry_symbol).value();

std::cout << "Loaded kernel: " << entry_symbol << std::endl;

// Prepare inputs: C = A @ B.T

const int64_t M = 1024, N = 4096, K = 4096;

DLDataType dtype{kDLFloat, 16, 1};

ffi::Tensor A = allocate_cuda_tensor({M, K}, dtype);

ffi::Tensor B = allocate_cuda_tensor({N, K}, dtype);

ffi::Tensor C = allocate_cuda_tensor({M, N}, dtype);

init_random_tensor(A);

init_random_tensor(B);

cudaMemset(C.data_ptr(), 0, M * N * sizeof(__half));

kernel_fn(A, B, C);

cudaError_t err = cudaDeviceSynchronize();

if (err != cudaSuccess) {

std::cerr << "Kernel execution failed: " << cudaGetErrorString(err) << std::endl;

return 1;

}

examples/ffi/cpp_example.cc

coderabbitai · 2025-11-28T07:48:48Z

examples/ffi/distribute_kernel.py

+    traceset = TraceSet.from_path("Example-FlashInfer-Trace")
+
+    definition_name = "gemm_n4096_k4096"
+    definition = traceset.definitions[definition_name]
+
+    solutions = list(traceset.solutions[definition_name])
+    solution = solutions[0]


⚠️ Potential issue | 🟠 Major

Add validation for path existence and empty solutions list.

The code assumes the TraceSet path exists and contains at least one solution, which could lead to errors if the path is missing or no solutions are found.

Apply this diff to add validation:

def main(): traceset = TraceSet.from_path("Example-FlashInfer-Trace") definition_name = "gemm_n4096_k4096" + if definition_name not in traceset.definitions: + raise ValueError(f"Definition '{definition_name}' not found in traceset") + definition = traceset.definitions[definition_name] solutions = list(traceset.solutions[definition_name]) + if not solutions: + raise ValueError(f"No solutions found for definition '{definition_name}'") + solution = solutions[0] print(f"Building solution: {solution.name}")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

traceset = TraceSet.from_path("Example-FlashInfer-Trace")

definition_name = "gemm_n4096_k4096"

definition = traceset.definitions[definition_name]

solutions = list(traceset.solutions[definition_name])

solution = solutions[0]

traceset = TraceSet.from_path("Example-FlashInfer-Trace")

definition_name = "gemm_n4096_k4096"

if definition_name not in traceset.definitions:

raise ValueError(f"Definition '{definition_name}' not found in traceset")

definition = traceset.definitions[definition_name]

solutions = list(traceset.solutions[definition_name])

if not solutions:

raise ValueError(f"No solutions found for definition '{definition_name}'")

solution = solutions[0]

🤖 Prompt for AI Agents

In examples/ffi/distribute_kernel.py around lines 9 to 15, the code assumes the TraceSet path and requested definition/solutions exist; add checks to validate the path exists before calling TraceSet.from_path, verify the requested definition_name is present in traceset.definitions, and confirm traceset.solutions[definition_name] is non-empty; if any check fails, raise a clear exception or exit with an explanatory message so the script fails fast with actionable info instead of throwing obscure errors later.

coderabbitai · 2025-11-28T07:48:48Z

examples/ffi/distribute_kernel.py

+
+if __name__ == "__main__":
+    main()
+


⚠️ Potential issue | 🟡 Minor

Add missing newline at end of file.

Pre-commit detected formatting issues including missing EOF newline.

🤖 Prompt for AI Agents

In examples/ffi/distribute_kernel.py around line 39, the file is missing a trailing newline at EOF; fix by adding a single newline character at the end of the file (ensure the file ends with a blank line so pre-commit/formatters see an EOF newline).

examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json

coderabbitai · 2025-11-28T07:48:49Z

examples/ffi/jax_example.py

+    entry_symbol = None
+    for line in (dist_dir / "kernel_metadata.txt").read_text().split('\n'):
+        if line.startswith("Entry Symbol:"):
+            entry_symbol = line.split(":", 1)[1].strip()
+            break
+
+    # Load and register kernel
+    mod = tvm_ffi.load_module(str(so_path))
+    kernel_fn = getattr(mod, entry_symbol)
+    jax_tvm_ffi.register_ffi_target(entry_symbol, kernel_fn, platform="gpu")


⚠️ Potential issue | 🟠 Major

Add validation for missing entry symbol.

Unlike the PyTorch example, this code doesn't validate that the entry symbol was found before using it, which could result in None being passed to getattr().

Apply this diff:

entry_symbol = None for line in (dist_dir / "kernel_metadata.txt").read_text().split('\n'): if line.startswith("Entry Symbol:"): entry_symbol = line.split(":", 1)[1].strip() break + if entry_symbol is None: + raise ValueError("Entry symbol not found in metadata") + # Load and register kernel mod = tvm_ffi.load_module(str(so_path))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

entry_symbol = None

for line in (dist_dir / "kernel_metadata.txt").read_text().split('\n'):

if line.startswith("Entry Symbol:"):

entry_symbol = line.split(":", 1)[1].strip()

break

# Load and register kernel

mod = tvm_ffi.load_module(str(so_path))

kernel_fn = getattr(mod, entry_symbol)

jax_tvm_ffi.register_ffi_target(entry_symbol, kernel_fn, platform="gpu")

entry_symbol = None

for line in (dist_dir / "kernel_metadata.txt").read_text().split('\n'):

if line.startswith("Entry Symbol:"):

entry_symbol = line.split(":", 1)[1].strip()

break

if entry_symbol is None:

raise ValueError("Entry symbol not found in metadata")

# Load and register kernel

mod = tvm_ffi.load_module(str(so_path))

kernel_fn = getattr(mod, entry_symbol)

jax_tvm_ffi.register_ffi_target(entry_symbol, kernel_fn, platform="gpu")

🤖 Prompt for AI Agents

In examples/ffi/jax_example.py around lines 19 to 28, the code extracts entry_symbol from kernel_metadata.txt but does not validate it before using getattr; add a check after the loop to verify entry_symbol is not None (or empty), and if missing raise a clear exception or log an error and exit (e.g., raise RuntimeError with message including the dist_dir path and suggest checking kernel_metadata.txt), so getattr(mod, entry_symbol) and subsequent register_ffi_target are only called with a valid symbol.

examples/ffi/jax_example.py

examples/ffi/pytorch_example.py

examples/ffi/README.md

Ubospica

Thanks. The main changes LGTM.

Could you provide a complete workflow of 1) llm generate ffi kernel using our internal ffi prompt; 2) deliver and distribute into tvm-ffi?

cc @YiyanZhai could you also review the integration part to jax/cpp/torch?

Ubospica · 2025-12-01T05:15:34Z

tests/agent/load_inline.py

@@ -0,0 +1,48 @@
+import torch


This is more like a tvm-ffi example. Shall we just skip this test?

Sure, this test script was meant to play around with ffi, I'll update it to use the new tvm ffi builder

Ubospica · 2025-12-01T05:36:56Z

examples/ffi/cpp_example.cc

+     cudaStreamDestroy(stream);
+
+     return 0;
+ }


please install pre commit hook and format it

Ubospica · 2025-12-01T06:00:23Z

tests/agent/test_prompt.py

@@ -0,0 +1,261 @@
+"""


Shall we put this inside examples/ffi

Ubospica · 2025-12-01T06:01:22Z

tests/agent/test_existing_prompt_sol.py

@@ -0,0 +1,226 @@
+import re


Ditto. Seems this is also a tvm-ffi test, so shall we just skip this?

YiyanZhai · 2025-12-02T20:32:27Z

To run the jax example, I will need to install pip install --upgrade "jax[cuda12_local]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html to run using GPU. The default jaxlib I installed with pip install jax jax-tvm-ffi only supports CPU.

Other than that, the torch, jax, and cpp examples look good to me.

zanderjiang · 2025-12-02T20:43:09Z

To run the jax example, I will need to install pip install --upgrade "jax[cuda12_local]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html to run using GPU. The default jaxlib I installed with pip install jax jax-tvm-ffi only supports CPU.

Other than that, the torch, jax, and cpp examples look good to me.

Ah yes, I'll update the readme for that.

Expanded the README with an overview of the FFI workflow.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

examples/ffi/README.md (4)
3-3: Use hyphens for compound adjectives.

Line 3: "agent generated CUDA kernels" should be "agent-generated CUDA kernels".
-This directory contains examples demonstrating how to build, distribute, and load agent generated CUDA kernels using TVM-FFI across different environments.
+This directory contains examples demonstrating how to build, distribute, and load agent-generated CUDA kernels using TVM-FFI across different environments.
10-10: Specify language for fenced code block.

The ASCII diagram block is missing a language identifier. Add text to comply with markdown linting standards.
-```
+```text
 ┌──────────────────────────────────────────────────────────────────────┐
57-57: Use hyphens for compound adjectives.

Line 57: "C++17 compatible compiler" should be "C++17-compatible compiler".
 - CUDA Toolkit
 - TVM-FFI C++ headers and libraries
-- C++17 compatible compiler
+- C++17-compatible compiler
106-106: Ensure file ends with newline.

Per prior review feedback, the file should end with a trailing newline character.
 - Entry point format: `file.cu::function_name`
+- Entry point format: `file.cu::function_name`
+

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 19a32bb and 170b14f.

📒 Files selected for processing (1)

examples/ffi/README.md (1 hunks)

🧰 Additional context used

🪛 LanguageTool

examples/ffi/README.md

[grammar] ~3-~3: Use a hyphen to join words.
Context: ...how to build, distribute, and load agent generated CUDA kernels using TVM-FFI acr...

(QB_NEW_EN_HYPHEN)

[grammar] ~57-~57: Use a hyphen to join words.
Context: ...VM-FFI C++ headers and libraries - C++17 compatible compiler ## Usage ### 1. Ge...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)

examples/ffi/README.md

10-10: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Run unit tests on ubuntu-latest and Python 3.10
GitHub Check: Run unit tests on ubuntu-latest and Python 3.12
GitHub Check: Run unit tests on ubuntu-latest and Python 3.13
GitHub Check: Run unit tests on ubuntu-latest and Python 3.11

🔇 Additional comments (1)

examples/ffi/README.md (1)

63-63: Clarify model reference.

Line 63 references "gpt-5-2025-08-07" as the example model. Confirm whether this is a placeholder, internal version, or public model reference. If internal or outdated, consider updating to a more generic statement (e.g., "using a capable LLM").

coderabbitai · 2025-12-02T21:31:40Z

examples/ffi/README.md

+### For JAX Example
+```bash
+pip install jax jax-tvm-ffi
+```


⚠️ Potential issue | 🟡 Minor

Document JAX GPU setup requirement.

Per PR feedback, the default jaxlib from pip install jax jax-tvm-ffi supports only CPU. JAX examples require CUDA-enabled jaxlib installation. Update this section to guide users.

### For JAX Example ```bash -pip install jax jax-tvm-ffi +pip install jax jax-tvm-ffi +pip install --upgrade "jax[cuda12_local]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

+Note: This installs CUDA 12 support. For GPU execution, ensure you have CUDA Toolkit 12 installed.

<details> <summary>🤖 Prompt for AI Agents</summary>

In examples/ffi/README.md around lines 49 to 52, the JAX installation note is
incomplete because the default pip jax/jax-tvm-ffi installs CPU-only jaxlib;
update the instructions to first show the existing pip install jax jax-tvm-ffi
line, then add a follow-up command to install CUDA-enabled jaxlib (for example
pip install --upgrade "jax[cuda12_local]" -f
https://storage.googleapis.com/jax-releases/jax_cuda_releases.html) and a short
note that this provides CUDA 12 support and requires CUDA Toolkit 12 on the
system; keep wording concise and ensure the added command and the note are
directly under the JAX Example section.

</details> 

…, reformat

…h-dev

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

examples/ffi/distribute_kernel.py (1)

8-15: Guard against missing TraceSet directory/definition/solutions

As written, a missing Example-FlashInfer-Trace, unknown definition_name, or empty solutions will produce obscure exceptions. Add explicit checks and fail fast with clear errors.

 def main():
-    traceset = TraceSet.from_path("Example-FlashInfer-Trace")
-
-    definition_name = "gemm_n4096_k4096"
-    definition = traceset.definitions[definition_name]
-
-    solutions = list(traceset.solutions[definition_name])
-    solution = solutions[0]
+    trace_dir = Path("Example-FlashInfer-Trace")
+    if not trace_dir.exists():
+        raise FileNotFoundError(f"TraceSet directory not found: {trace_dir.resolve()}")
+
+    traceset = TraceSet.from_path(trace_dir)
+
+    definition_name = "gemm_n4096_k4096"
+    if definition_name not in traceset.definitions:
+        raise ValueError(f"Definition '{definition_name}' not found in traceset")
+
+    solutions = list(traceset.solutions.get(definition_name, []))
+    if not solutions:
+        raise ValueError(f"No solutions found for definition '{definition_name}'")
+
+    solution = solutions[0]

examples/ffi/jax_example.py (1)

20-29: Validate entry_symbol before using it

If kernel_metadata.txt is missing or malformed, entry_symbol remains None and getattr(mod, entry_symbol) will fail with a confusing error. Mirror the PyTorch example and guard this.

     entry_symbol = None
     for line in (dist_dir / "kernel_metadata.txt").read_text().split("\n"):
         if line.startswith("Entry Symbol:"):
             entry_symbol = line.split(":", 1)[1].strip()
             break
 
-    # Load and register kernel
+    if entry_symbol is None:
+        raise ValueError("Entry symbol not found in metadata")
+
+    # Load and register kernel
     mod = tvm_ffi.load_module(str(so_path))
     kernel_fn = getattr(mod, entry_symbol)

examples/ffi/cpp_example.cc (1)

133-155: Add basic error checks for metadata, module load, and kernel execution

If metadata/module load fails or the kernel errors, the current code continues with unclear behavior. Add minimal checks to fail fast with diagnostics.

     const std::string dist_dir = "distributed";
     std::string entry_symbol = read_entry_symbol(dist_dir + "/kernel_metadata.txt");
+    if (entry_symbol.empty()) {
+        std::cerr << "Failed to read entry symbol from metadata: "
+                  << dist_dir << "/kernel_metadata.txt" << std::endl;
+        return 1;
+    }
 
     ffi::Module mod = ffi::Module::LoadFromFile(dist_dir + "/kernel.so");
+    if (!mod.defined()) {
+        std::cerr << "Failed to load module from " << dist_dir << "/kernel.so" << std::endl;
+        return 1;
+    }
+
     ffi::Function kernel_fn = mod->GetFunction(entry_symbol).value();
@@
     init_random_tensor(B);
     cudaMemset(C.data_ptr(), 0, M * N * sizeof(__half));
 
     kernel_fn(A, B, C);
-    cudaDeviceSynchronize();
+    cudaError_t err = cudaDeviceSynchronize();
+    if (err != cudaSuccess) {
+        std::cerr << "Kernel execution failed: "
+                  << cudaGetErrorString(err) << std::endl;
+        return 1;
+    }

🧹 Nitpick comments (2)

examples/ffi/kernel_generator_example.py (1)

31-31: Tighten logging style and acknowledge broad exception handling

You can keep the current control flow but make it lint‑clean by dropping unnecessary f-strings and explicitly allowing broad exceptions used for CLI UX.

-    print(f"Loading Example TraceSet")
+    print("Loading Example TraceSet")
@@
-    for attempt in range(1, max_attempts + 1):
-        try:
+    for attempt in range(1, max_attempts + 1):
+        try:
@@
-            print(f"Successfully generated solution for {definition_name}")
+            print(f"Successfully generated solution for {definition_name}")
             break
 
-        except Exception as e:
+        except Exception as e:  # noqa: BLE001
             print(f"Attempt {attempt} failed: {e}")
             if attempt < max_attempts:
                 print(f"Retrying... ({attempt + 1}/{max_attempts})")
             else:
-                print(f"All attempts failed - aborting")
+                print("All attempts failed - aborting")
                 return
@@
-    if solution:
-        try:
+    if solution:
+        try:
@@
-            print(f"\n{'='*60}")
-            print(f"SUCCESS!")
+            print(f"\n{'='*60}")
+            print("SUCCESS!")
@@
-        except Exception as e:
+        except Exception as e:  # noqa: BLE001
             print(f"Failed to save solution: {e}")
             return

Also applies to: 73-94, 96-114

examples/ffi/Makefile (1)

48-50: Add a minimal test target to satisfy tooling

Tooling expects a test phony target; adding a thin alias keeps the Makefile self‑contained and avoids checkmake warnings.
-.PHONY: all run clean help check
+.PHONY: all run clean help check test
@@
 run: $(TARGET)
 	@echo "Running $(TARGET)..."
 	@echo ""
 	./$(TARGET)
+
+test: check
+	@echo "No C++ tests defined; ran 'check' instead."
Also applies to: 73-80

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 170b14f and a0027ab.

📒 Files selected for processing (8)

examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json (1 hunks)
examples/ffi/Makefile (1 hunks)
examples/ffi/README.md (1 hunks)
examples/ffi/cpp_example.cc (1 hunks)
examples/ffi/distribute_kernel.py (1 hunks)
examples/ffi/jax_example.py (1 hunks)
examples/ffi/kernel_generator_example.py (1 hunks)
examples/ffi/pytorch_example.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

examples/ffi/pytorch_example.py (2)

examples/ffi/cpp_example.cc (2)

main (123-173)

main (123-123)

examples/ffi/distribute_kernel.py (1)

main (8-34)

🪛 checkmake (0.2.2)

examples/ffi/Makefile

[warning] 52-52: Target body for "check" exceeds allowed length of 5 (9).

(maxbodylength)

[warning] 82-82: Target body for "help" exceeds allowed length of 5 (21).

(maxbodylength)

[warning] 48-48: Missing required phony target "test"

(minphony)

🪛 Clang (14.0.6)

examples/ffi/cpp_example.cc

[error] 11-11: 'tvm/ffi/container/tensor.h' file not found

(clang-diagnostic-error)

🪛 LanguageTool

examples/ffi/README.md

[grammar] ~3-~3: Use a hyphen to join words.
Context: ...how to build, distribute, and load agent generated CUDA kernels using TVM-FFI acr...

(QB_NEW_EN_HYPHEN)

[grammar] ~57-~57: Use a hyphen to join words.
Context: ...VM-FFI C++ headers and libraries - C++17 compatible compiler ## Usage ### 1. Ge...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)

examples/ffi/README.md

10-10: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

65-65: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

76-76: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 Ruff (0.14.7)

examples/ffi/pytorch_example.py

22-22: Avoid specifying long messages outside the exception class

(TRY003)

examples/ffi/kernel_generator_example.py

31-31: f-string without any placeholders

Remove extraneous f prefix

(F541)

88-88: Do not catch blind exception: Exception

(BLE001)

93-93: f-string without any placeholders

Remove extraneous f prefix

(F541)

108-108: f-string without any placeholders

Remove extraneous f prefix

(F541)

112-112: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Run unit tests on ubuntu-latest and Python 3.11
GitHub Check: Run unit tests on ubuntu-latest and Python 3.12
GitHub Check: Run unit tests on ubuntu-latest and Python 3.13
GitHub Check: Run unit tests on ubuntu-latest and Python 3.10

🔇 Additional comments (2)

examples/ffi/Example-FlashInfer-Trace/solutions/example_agent_solution.json (1)

1-24: Solution metadata and FFI binding look consistent

Name/definition/entry point align with the GEMM examples, and the binding does thorough shape/dtype/contiguity/device checks before launching the kernel. No issues spotted.

examples/ffi/pytorch_example.py (1)

11-48: PyTorch example flow looks solid

Entry symbol parsing is validated, tensor shapes/dtypes align with the kernel contract, and the reference matmul check makes the example self‑verifying.

coderabbitai · 2025-12-03T04:44:15Z

examples/ffi/README.md

+This directory contains examples demonstrating how to build, distribute, and load agent generated CUDA kernels using TVM-FFI across different environments.
+
+
+## Overview
+
+The workflow consists of 4 main stages:
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ Stage 1: Problem Definition                                          │
+│ - Definition in FlashInfer-Trace dataset                             │
+└────────────────────────────┬─────────────────────────────────────────┘
+                             │
+                             ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│ Stage 2: LLM Kernel Generation                                       │
+│ - LLM reads the Definition + agent_vibecode.md prompt                │
+│ - Generates CUDA kernel with TVM-FFI bindings                        │
+│ - Outputs Solution JSON with embedded source code                    │
+└────────────────────────────┬─────────────────────────────────────────┘
+                             │
+                             ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│ Stage 3: Build & Distribution                                        │
+│ - TVMFFIBuilder compiles the Solution                                │
+│ - Generates framework-agnostic .so binary                            │
+│ - Extracts to distributed/ folder with metadata                      │
+└────────────────────────────┬─────────────────────────────────────────┘
+                             │
+                             ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│ Stage 4: Cross-Framework Usage                                       │
+│ - JAX/PyTorch/C++ load the same .so file                             │
+│ - Execute kernel without recompilation                               │
+│ - Benchmarking or Apply the kernel                                   │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+
+## Installation Prerequisites
+
+### Python Dependencies
+```bash
+pip install flashinfer-bench tvm-ffi torch
+```
+
+### For JAX Example
+```bash
+pip install jax[cuda13-local] jax-tvm-ffi
+```
+
+### For C++ Example
+- CUDA Toolkit
+- TVM-FFI C++ headers and libraries
+- C++17 compatible compiler


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find and inspect the pytorch_example.py to check module loading method fd pytorch_example examples/ffi

Repository: flashinfer-ai/flashinfer-bench

Length of output: 173

🌐 Web query:

JAX GPU CUDA installation pip command current best practice 2024 2025 jaxlib cuda12_local

💡 Result:

Short answer — use the official JAX pip extras. For NVIDIA GPUs the current best practice (2024/2025) is:

If you want JAX to install its CUDA/cuDNN wheels for you (recommended, Linux x86_64 or aarch64):
pip install --upgrade pip
pip install --upgrade "jax[cuda13]" # preferred (migrate to CUDA 13)
or
pip install --upgrade "jax[cuda12]" # if you must stay on CUDA 12

If you already have a local CUDA/toolkit installed and prefer to use it:
pip install --upgrade pip
pip install --upgrade "jax[cuda13-local]"
or
pip install --upgrade "jax[cuda12-local]"

Notes: Linux-only for prebuilt GPU wheels; JAX recommends the pip-provided CUDA wheels and is moving users toward CUDA 13. Verify GPU detection with python -c "import jax; print(jax.devices())". (docs.jax.dev)

Citations:

1: https://docs.jax.dev/en/latest/installation.html?utm_source=openai

🏁 Script executed:

# Read the full README to verify markdown issues cat -n examples/ffi/README.md | head -130

Repository: flashinfer-ai/flashinfer-bench

Length of output: 128

🏁 Script executed:

# Find all README files and examples directory git ls-files | grep -i readme

Repository: flashinfer-ai/flashinfer-bench

Length of output: 97

🏁 Script executed:

# Check repository structure git ls-files | grep -i example | head -20

Repository: flashinfer-ai/flashinfer-bench

Length of output: 364

🏁 Script executed:

# Look for any pytorch files git ls-files | grep -i pytorch

Repository: flashinfer-ai/flashinfer-bench

Length of output: 56

🏁 Script executed:

# Check git history for examples/ffi/README.md git log --oneline --all -- "examples/ffi/README.md" | head -10

Repository: flashinfer-ai/flashinfer-bench

Length of output: 327

🏁 Script executed:

# Check all branches for this path git branch -r --contains HEAD | xargs -I {} git ls-tree -r {} -- "examples/ffi" 2>/dev/null | head -5

Repository: flashinfer-ai/flashinfer-bench

Length of output: 56

🏁 Script executed:

# Try to find any ffi-related files git ls-files | grep -i ffi

Repository: flashinfer-ai/flashinfer-bench

Length of output: 164

🏁 Script executed:

# Show the file from the commit that added the workflow overview git show 170b14f:examples/ffi/README.md | head -150

Repository: flashinfer-ai/flashinfer-bench

Length of output: 3642

🏁 Script executed:

# Also check for pytorch_example.py at that commit git show 170b14f:examples/ffi/pytorch_example.py 2>/dev/null | head -100

Repository: flashinfer-ai/flashinfer-bench

Length of output: 1405

Fix PyTorch documentation mismatch and modernize JAX installation guidance

Three actionable issues:

PyTorch bullet contradicts actual implementation
The README states PyTorch uses torch.utils.cpp_extension.load() with custom CUDA extensions, but pytorch_example.py actually uses tvm_ffi.load_module() (identical to the JAX approach). Update to:

-**PyTorch**: Uses `torch.utils.cpp_extension.load()` with custom CUDA extensions +**PyTorch**: Uses `tvm_ffi.load_module()` for kernel loading

JAX installation lacks GPU support
The current command pip install jax jax-tvm-ffi installs CPU-only JAX. For GPU support, use the modern pip extras (no extra index URL needed):

### For JAX Example ```bash -pip install jax jax-tvm-ffi +pip install "jax[cuda13]" jax-tvm-ffi

Adjust cuda13 to cuda12 or cuda12-local if needed; see JAX installation docs for your CUDA version.

Minor markdown and style fixes

Add language tag to ASCII diagram and fix hyphenation:

-``` +```text

"agent generated" → "agent-generated"

"C++17 compatible" → "C++17-compatible"

🧰 Tools

🪛 LanguageTool

[grammar] ~3-~3: Use a hyphen to join words.
Context: ...how to build, distribute, and load agent generated CUDA kernels using TVM-FFI acr...

(QB_NEW_EN_HYPHEN)

[grammar] ~57-~57: Use a hyphen to join words.
Context: ...VM-FFI C++ headers and libraries - C++17 compatible compiler ## Usage ### 1. Ge...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)

10-10: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

In examples/ffi/README.md around lines 3 to 57, update the documentation to match the actual implementation and modernize instructions: change the PyTorch section to state that the example uses tvm_ffi.load_module() (same as JAX) instead of torch.utils.cpp_extension.load(), update the JAX install command to recommend GPU-enabled installs like pip install "jax[cuda13]" jax-tvm-ffi (note to adjust cuda version per environment), add a language tag "text" to the ASCII diagram block, replace "agent generated" with "agent-generated", and change "C++17 compatible" to "C++17-compatible".

zanderjiang added 5 commits November 10, 2025 13:03

feat: add FFI prompt tests with environment-based API keys

3e2e893

formatting

25092ad

full ffi prompt

365536e

agent vibecoding end-to-end example

ac2ee6f

Merge branch 'flashinfer-ai:main' into main

46ca5a5

gemini-code-assist bot reviewed Nov 17, 2025

View reviewed changes

coderabbitai bot reviewed Nov 17, 2025

View reviewed changes

zanderjiang added 2 commits November 17, 2025 11:15

reformat tests

2f5af2f

Merge branch 'main' of https://github.com/zanderjiang/flashinfer-benc…

b7d62c3

…h-dev

coderabbitai bot reviewed Nov 17, 2025

View reviewed changes

zanderjiang added 3 commits November 20, 2025 13:51

Merge branch 'flashinfer-ai:main' into main

5d60fe9

remove testing results

9a0ce21

updated agent vibecode ffi example

19a32bb

coderabbitai bot reviewed Nov 28, 2025

View reviewed changes

Ubospica reviewed Dec 1, 2025

View reviewed changes

Add workflow overview to README

170b14f

Expanded the README with an overview of the FFI workflow.

coderabbitai bot reviewed Dec 2, 2025

View reviewed changes

zanderjiang added 3 commits December 2, 2025 23:34

use updated kernel generator to generate solution (additional option)…

47de488

…, reformat

Merge branch 'main' of https://github.com/zanderjiang/flashinfer-benc…

d775ae9

…h-dev

remove unnecessary tests

a0027ab

coderabbitai bot reviewed Dec 3, 2025

View reviewed changes


		namespace my_kernels {

		__global__ void AddOneKernel(float* x, float* y, int n) {


		namespace tvm_ffi_example_cuda {

		__global__ void AddOneKernel(float* x, float* y, int n) {

Agent FFI integration #119

Are you sure you want to change the base?

Agent FFI integration #119

Uh oh!

Conversation

zanderjiang commented Nov 17, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Pre-merge checks and finishing touches

Uh oh!

gemini-code-assist bot commented Nov 17, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 28, 2025

Choose a reason for hiding this comment

zanderjiang commented Nov 17, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 17, 2025 •

edited

Loading

Ubospica Dec 1, 2025 •

edited

Loading