MT4G - Memory Topology 4 GPUs

MT4G is a vendor-agnostic collection of microbenchmarks and APIs, built on the HIP toolchain, that explores the compute and memory topologies of both AMD and NVIDIA GPUs. It captures system properties such as the number of SMs/CUs, warp size, memory and cache sizes, cache line sizes and load latencies, and exposes deep cache subsystems and their physical layouts. In doing so, it provides critical support for GPU performance modeling and analysis within one unified interface.

A detailed description of the concept, implementation and benchmarks can be found in this research paper.

Overview

The MT4G CLI tool enables a unified and cross-platform introspection of the hardware topology of GPUs and thus provides crucial information that is otherwise scattered throughout vendor-specific APIs, data sheets and incomplete one-off studies, or not available at all. Key features include:

Compilation of existing APIs and over 50 microbenchmarks for statistical topology attribute measurement
Unified build system for AMD and NVIDIA targets
Comprehensive report of collected benchmark results as structured JSON with optional plot generation for effortless manual inspection of the raw results

MT4G works reliably on all AMD CDNA GPUs and all recent NVIDIA microarchitectures from Pascal onwards. Currently, we do not support AMD RDNA GPUs given our primary focus on HPC/AI systems. Tested microarchitectures include:

GPU Name	Vendor	Microarch.
MI100	AMD	CDNA
MI210	AMD	CDNA2
MI300X	AMD	CDNA3
P6000	NVIDIA	Pascal
V100	NVIDIA	Volta
T1000	NVIDIA	Turing
RTX2080	NVIDIA	Turing
A100	NVIDIA	Ampere
H100-80	NVIDIA	Hopper
H100-96	NVIDIA	Hopper

Topological metrics

General & Compute Resource Information

GPU vendor and model
GPU clock rate
Compute capability
Number of SMs/CUs
Max. number of blocks per SM/CU
Max. number of threads per block and SM/CU
Number of cores and warps/SIMD per SM/CU
Warp size
Number of registers per block and SM/CU
Mapping of logical to physical CU IDs (AMD only)

Memory Resource Information

✅ = Available

❌ = Not Available

➖ = Not Applicable

AMD

Memory Element	Size	Load Latency	Read & Write Bandwidth	Cache Line Size	Fetch Granularity	Amount per SM/CU or GPU	Physically Shared With
vL1 cache	✅	✅	❌	✅	✅	✅	➖
sL1d cache	✅	✅	❌	✅	✅	➖	✅
L2 cache	✅	✅	✅	✅	✅	✅	➖
L3 cache	✅	❌	✅	✅	❌	✅	➖
LDS	✅	✅	❌	➖	➖	➖	➖
Device Memory	✅	✅	✅	➖	➖	➖	➖

NVIDIA

Memory Element	Size	Load Latency	Read & Write Bandwidth	Cache Line Size	Fetch Granularity	Amount per SM/CU or GPU	Physically Shared With
L1 cache	✅	✅	❌	✅	✅	✅	✅
L2 cache	✅	✅	✅	✅	✅	✅	➖
Texture cache	✅	✅	❌	✅	✅	✅	✅
Readonly cache	✅	✅	❌	✅	✅	✅	✅
Constant L1 cache	✅	✅	❌	✅	✅	✅	✅
Constant L1.5 cache	✅	✅	❌	✅	✅	❌	➖
Shared Memory	✅	✅	❌	➖	➖	➖	➖
Device Memory	✅	✅	✅	➖	➖	➖	➖

Installation

Dependencies

ROCm or CUDA backend including drivers, compilers and libraries for AMD or NVIDIA targets respectively
HIP SDK with the hipcc compiler
nlohmann-json for JSON output
cxxopts for CLI parsing
Python 3 including the matplotlib, pandas and numpy packages for graphical plots

A suitable HIP environment can for instance be obtained via Spack:

spack install hip           # includes ROCm backend for AMD targets
spack install hip+cuda      # includes CUDA backend for NVIDIA targets

spack load hip              # exports binaries and libraries

The HIP_PATH environment variable should be set to the HIP installation directory. If Spack does not set it automatically, export it manually, e.g.

export HIP_PATH=<path_to_spack>/opt/spack/<system_architecture>/hip-<version>-<hash>

Additionally for NVIDIA targets, the CUDA_PATH environment variable needs to be set to the CUDA installation directory.

MT4G has been tested successfully with hip@6.3.3 and cuda@12.8.
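Before configuring the build, it can help to verify that the environment variables described above are in place. The following is a minimal sketch of such a preflight check; it is not part of MT4G, and the function name `check_hip_environment` as well as the vendor labels `"amd"` and `"nvidia"` are assumptions made for illustration.

```python
import os
from pathlib import Path


def check_hip_environment(vendor, env=None):
    """Return a list of problems with the HIP-related environment variables.

    vendor: "amd" or "nvidia" (hypothetical labels used only in this sketch).
    env:    mapping to inspect; defaults to the current process environment.
    """
    env = os.environ if env is None else env
    required = ["HIP_PATH"]
    if vendor == "nvidia":
        # NVIDIA targets additionally need the CUDA installation directory.
        required.append("CUDA_PATH")

    problems = []
    for name in required:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} does not point to a directory: {value}")
    return problems
```

An empty list means both variables look plausible; the check only tests that the directories exist, not that they contain a working HIP or CUDA installation.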

Build

Use the GPU_TARGET_ARCH build flag to select the target GPU architecture, e.g. gfx90a for AMD or sm_90 for NVIDIA. Some of the LLVM target identifiers for AMD can be found here, while the compute capabilities for NVIDIA can be found here. To build and install MT4G, run

git clone https://github.com/caps-tum/mt4g.git
cd mt4g
mkdir build && cd build
cmake .. -DGPU_TARGET_ARCH=<gfxXXX|sm_XX>
# optional build flags:
# -DCMAKE_BUILD_TYPE=<Release|Debug>             -- to choose between release and debug builds
# -DCMAKE_INSTALL_PREFIX=<install_prefix>        -- to set the install destination (default on UNIX platforms: /usr/local)
make all install -j $(nproc)

Usage

<install_prefix>/bin/mt4g [options]

Options

Option	Description
`-d, --device-id <id>`	GPU device to use (default `0`)
`-f, --file <name>`	Specify name of output files (default `<GPU_NAME>`)
`-g, --graphs`	Generate graphical plots for each benchmark
`-l, --location <path>`	Specify location of output files (default `.`)
`-o, --raw`	Write raw timing data
`-p, --report`	Create Markdown report in output directory
`-r, --random`	Randomize P-Chase arrays
`-s, --stdout`	Dump final JSON result into stdout
`-q, --quiet`	Only write the final JSON to stdout
`--l1`	Run L1 cache benchmarks
`--l2`	Run L2 cache benchmarks
`--l3`	Run L3 cache benchmarks (AMD only)
`--scalar`	Run AMD scalar cache benchmarks
`--constant`	Run NVIDIA constant cache benchmarks
`--readonly`	Run NVIDIA read-only cache benchmarks
`--texture`	Run NVIDIA texture cache benchmarks
`--shared`	Run shared memory benchmarks
`--memory`	Run main memory benchmarks
`--departuredelay`	Run departure delay benchmarks
`--resourceshare`	Run resource sharing benchmarks
`-v, --version`	Display the version of MT4G and exit
`-h, --help`	Display a detailed help message and exit

If no benchmark group is chosen, all available groups are executed. Unsupported groups are disabled automatically depending on the platform. Exclusive GPU access is recommended for more reliable measurement results.
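When MT4G is driven from a script, the options above can be assembled programmatically. The sketch below builds an argument list from the long option names in the table and runs it with Python's `subprocess` module. The helper names `build_mt4g_command` and `run_mt4g` are hypothetical, and passing option values as separate arguments (`--device-id 1`) is an assumption about the CLI parser rather than documented behaviour.

```python
import subprocess


# Benchmark group switches as listed in the options table above.
BENCHMARK_GROUPS = {
    "l1", "l2", "l3", "scalar", "constant", "readonly", "texture",
    "shared", "memory", "departuredelay", "resourceshare",
}


def build_mt4g_command(binary, device_id=0, groups=(), location=None,
                       name=None, graphs=False, raw=False, report=False):
    """Assemble an argument list for the mt4g CLI (illustrative helper)."""
    unknown = set(groups) - BENCHMARK_GROUPS
    if unknown:
        raise ValueError(f"unknown benchmark groups: {sorted(unknown)}")

    cmd = [binary, "--device-id", str(device_id)]
    if location is not None:
        cmd += ["--location", str(location)]
    if name is not None:
        cmd += ["--file", str(name)]
    for flag, enabled in (("--graphs", graphs), ("--raw", raw), ("--report", report)):
        if enabled:
            cmd.append(flag)
    # Passing no group switch means that all available groups are executed.
    cmd += [f"--{group}" for group in groups]
    return cmd


def run_mt4g(cmd):
    """Run the assembled command and raise if mt4g exits with an error."""
    return subprocess.run(cmd, check=True)
```

For example, `build_mt4g_command("<install_prefix>/bin/mt4g", groups=["l1", "l2"], graphs=True)` restricts the run to the L1 and L2 cache benchmarks and requests plots.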

Output

By default, benchmark results are written as structured JSON into the file <GPU_NAME>.json of the current working directory. However, the name and path of the output file and directory may be changed through the flags -f/--file and -l/--location respectively. With -s/--stdout, the final JSON output file may be dumped into stdout instead. When --graphs, --raw or --report is enabled, additional files are written to results/<GPU_NAME>. The --report flag generates a README.md that embeds all graphs and links to the raw data.
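The JSON result can be post-processed with a few lines of Python. The sketch below assumes that the file is located at `<location>/<name>.json`, following the description above, and deliberately makes no assumption about the keys inside the report; the function names are hypothetical.

```python
import json
from pathlib import Path


def result_path(name, location="."):
    """Path of the JSON result, assuming the <location>/<name>.json convention."""
    return Path(location) / f"{name}.json"


def load_result(name, location="."):
    """Load the JSON report written by mt4g."""
    with result_path(name, location).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(report):
    """List the top-level entries without assuming any particular schema."""
    if isinstance(report, dict):
        return sorted(report)
    return []
```

Calling `summarize(load_result("<GPU_NAME>"))` is a quick way to see which top-level sections a report contains before writing more specific analysis code.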

Known Issues and Limitations

L2 segment size measurements on AMD GPUs are currently unreliable due to the platform's complex cache behaviour.
Constant L1.5 cache size detection is capped at 64 KiB. Larger caches (> 64 KiB) are reported as 64 KiB + 1 with confidence = 0.
Measured bandwidths are not optimal because we currently do not use a (dynamically determined) optimal number of blocks.
Cache line size detection uses a heuristic approach and is therefore not guaranteed to be correct.
Detection of whether the Constant L1 cache is shared with the L1 cache is not fully reliable. As a hotfix, we repeat the measurement 10 times and report "not shared" as soon as a single run is unsuccessful. We are working on a cleaner solution.
Incomplete support for CDNA3.
Runs only on Linux.
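Two of the limitations above affect how results should be read: the capped Constant L1.5 size and the conservative Constant L1 sharing check. The following sketch shows one way a consumer of the results might handle both. It assumes that sizes are reported in bytes, and all names (`describe_c15_size`, `conservative_shared`, `measure_once`) are hypothetical; it does not reproduce MT4G's actual implementation.

```python
KIB = 1024
# 64 KiB + 1, assuming the size is reported in bytes (assumption of this sketch).
C15_CAP_SENTINEL = 64 * KIB + 1


def describe_c15_size(size_bytes, confidence):
    """Render a Constant L1.5 size, marking the capped case as a lower bound."""
    if size_bytes == C15_CAP_SENTINEL and confidence == 0:
        return "> 64 KiB (detection capped)"
    return f"{size_bytes / KIB:g} KiB"


def conservative_shared(measure_once, repetitions=10):
    """Report 'shared' only if every repetition agrees.

    measure_once is a hypothetical callable that returns True when a single
    run detects sharing. all() stops at the first unsuccessful run, so one
    failing run is enough to yield 'not shared'.
    """
    return all(measure_once() for _ in range(repetitions))
```

Treating the sentinel as a lower bound avoids mistaking 64 KiB + 1 for a measured size, and the all-or-nothing repetition favours false "not shared" results over false "shared" ones.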

Repository Layout & Contribution Guidelines

mt4g
├── CMakeLists.txt        -- Build configuration
├── include               -- Header files
├── LICENSE               -- Project license
├── README.md             -- Project description
├── sample_results        -- Exemplary output files from selected hardware
└── src                   -- Benchmark implementation and CLI helpers

Adding new Measurements

Pre-measured results for selected GPUs live in the sample_results directory. If your hardware is not yet listed, we would greatly appreciate additional reports: Run the tool with --graphs --report (optionally also with --raw) and open a pull request to share your measurements.

Adding a new Benchmark

To add a new benchmark to MT4G, follow these steps:

Implement the benchmark in src/benchmarks/ and expose a suitable interface in include/.
Try to follow the pattern of measureXXX(), XXXLauncher() and XXXKernel() to keep the structure modular and readable. Every benchmark should get its own file to keep code flow as easy as possible to follow -- this is not about software engineering!
Update CMakeLists.txt if necessary.
Document the new benchmark and its command line switch in the README.md if suitable.

Coding Style

The codebase follows modern C++20 guidelines. Use -Wall -Wextra -Wpedantic for clean builds and keep functions small and well documented.

About

Developed at the Chair for Computer Architecture and Parallel Systems at the Technical University of Munich (CAPS TUM). Originally authored by Dominik Größler, completely reworked by Manuel Walter Mußbacher and currently maintained by Stepan Vanecek. The research paper surrounding this work can be found here.

This project is licensed under the Apache License 2.0.
