caps-tum/mt4g

MT4G - Memory Topology 4 GPUs

MT4G is a vendor-agnostic collection of microbenchmarks and APIs that explores the compute and memory topologies of both AMD and NVIDIA GPUs based on the HIP toolchain. By capturing system properties such as the number of SMs/CUs, warp size, memory and cache sizes, cache line sizes and load latencies as well as exposing deep cache subsystems and their physical layouts, it provides critical support for GPU performance modeling and analysis within one unified interface.

A detailed description of the concept, implementation and benchmarks can be found in this research paper.

Overview

The MT4G CLI tool enables a unified and cross-platform introspection of the hardware topology of GPUs and thus provides crucial information that is otherwise scattered throughout vendor-specific APIs, data sheets and incomplete one-off studies, or not available at all. Key features include:

  • Compilation of existing APIs and over 50 microbenchmarks for statistical topology attribute measurement
  • Unified build system for AMD and NVIDIA targets
  • Comprehensive report of collected benchmark results as structured JSON with optional plot generation for effortless manual inspection of the raw results

MT4G works reliably on all AMD CDNA GPUs and all recent NVIDIA microarchitectures from Pascal onwards. Currently, we do not support AMD RDNA GPUs given our primary focus on HPC/AI systems. Tested microarchitectures include:

GPU Name   Vendor   Microarch.
MI100      AMD      CDNA
MI210      AMD      CDNA2
MI300X     AMD      CDNA3
P6000      NVIDIA   Pascal
V100       NVIDIA   Volta
T1000      NVIDIA   Turing
RTX2080    NVIDIA   Turing
A100       NVIDIA   Ampere
H100-80    NVIDIA   Hopper
H100-96    NVIDIA   Hopper

Topological metrics

General & Compute Resource Information

  • GPU vendor and model
  • GPU clock rate
  • Compute capability
  • Number of SMs/CUs
  • Max. number of blocks per SM/CU
  • Max. number of threads per block and SM/CU
  • Number of cores and warps/SIMD per SM/CU
  • Warp size
  • Number of registers per block and SM/CU
  • Mapping of logical to physical CU IDs (AMD only)

Memory Resource Information

✅ = Available

❌ = Not Available

➖ = Not Applicable

AMD

Memory Element | Size | Load Latency | Read & Write Bandwidth | Cache Line Size | Fetch Granularity | Amount per SM/CU or GPU | Physically Shared With
vL1 cache
sL1d cache
L2 cache
L3 cache
LDS
Device Memory

NVIDIA

Memory Element | Size | Load Latency | Read & Write Bandwidth | Cache Line Size | Fetch Granularity | Amount per SM/CU or GPU | Physically Shared With
L1 cache
L2 cache
Texture cache
Readonly cache
Constant L1 cache
Constant L1.5 cache
Shared Memory
Device Memory

Installation

Dependencies

  • ROCm or CUDA backend including drivers, compilers and libraries for AMD or NVIDIA targets respectively
  • HIP SDK with the hipcc compiler
  • nlohmann-json for JSON output
  • cxxopts for CLI parsing
  • Python 3 including the matplotlib, pandas and numpy packages for graphical plots

A suitable HIP environment can for instance be obtained via Spack:

spack install hip           # includes ROCm backend for AMD targets
spack install hip+cuda      # includes CUDA backend for NVIDIA targets

spack load hip              # exports binaries and libraries

The HIP_PATH environment variable should be set to the HIP installation directory. Please export manually if not automatically set by spack, e.g.

export HIP_PATH=<path_to_spack>/opt/spack/<system_architecture>/hip-<version>-<hash>

Additionally for NVIDIA targets, the CUDA_PATH environment variable needs to be set to the CUDA installation directory.

MT4G has been tested successfully with hip@6.3.3 and cuda@12.8.
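
Before configuring the build, it can help to confirm that the variables described above are actually exported. The following is a minimal sketch, not part of MT4G: the function name mt4g_check_env and its amd/nvidia argument are hypothetical, and it only checks that HIP_PATH (and, for NVIDIA targets, CUDA_PATH) are non-empty.

```shell
# Hypothetical helper (not shipped with MT4G): report missing toolchain variables.
# Usage: mt4g_check_env <amd|nvidia>
mt4g_check_env() {
    vendor="$1"
    status=0
    # HIP_PATH is needed for both vendors.
    if [ -z "${HIP_PATH:-}" ]; then
        echo "missing: HIP_PATH"
        status=1
    fi
    # CUDA_PATH is only needed for NVIDIA targets.
    if [ "$vendor" = "nvidia" ] && [ -z "${CUDA_PATH:-}" ]; then
        echo "missing: CUDA_PATH"
        status=1
    fi
    return "$status"
}
```

Calling mt4g_check_env nvidia prints one line per unset variable and returns a non-zero status if anything is missing.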

Build

Use the GPU_TARGET_ARCH build flag to select the target GPU architecture for AMD (e.g. gfx90a) or NVIDIA (e.g. sm_90). Some of the identifiers of the LLVM targets for AMD can be found here, while the compute capabilities for NVIDIA can be found here. To build and install MT4G, run

git clone https://github.com/caps-tum/mt4g.git
cd mt4g
mkdir build && cd build
cmake .. -DGPU_TARGET_ARCH=<gfxXXX|sm_XX>
# optional build flags:
# -DCMAKE_BUILD_TYPE=<Release|Debug>             -- to choose between release and debug builds
# -DCMAKE_INSTALL_PREFIX=<install_prefix>        -- to set the install destination (default on UNIX platforms: /usr/local)
make all install -j $(nproc)
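
As an illustration of the steps above, the sketch below wraps the configure, build and install commands in a function that first performs a loose sanity check on the architecture string. Both function names are hypothetical, the default install prefix of $HOME/.local is an assumption chosen to avoid needing root, and the pattern check only looks for a gfx or sm_ prefix followed by a digit; it does not verify that the target actually exists.

```shell
# Loose sanity check of a GPU_TARGET_ARCH value (hypothetical helper):
# accepts gfx<digit>... (AMD) or sm_<digit>... (NVIDIA).
mt4g_arch_ok() {
    case "$1" in
        gfx[0-9]*|sm_[0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Configure, build and install from the repository root (hypothetical helper).
# Usage: mt4g_build <arch> [install_prefix]
mt4g_build() {
    arch="$1"
    prefix="${2:-$HOME/.local}"   # assumed default, not an MT4G convention
    if ! mt4g_arch_ok "$arch"; then
        echo "unexpected GPU_TARGET_ARCH: $arch" >&2
        return 1
    fi
    mkdir -p build && cd build || return 1
    cmake .. -DGPU_TARGET_ARCH="$arch" \
             -DCMAKE_BUILD_TYPE=Release \
             -DCMAKE_INSTALL_PREFIX="$prefix" || return 1
    make all install -j "$(nproc)"
}
```

For example, mt4g_build gfx90a would target an MI210, and mt4g_build sm_90 /opt/mt4g would target an H100 with a custom prefix.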

Usage

<install_prefix>/bin/mt4g [options]

Options

Option                   Description
-d, --device-id <id>     GPU device to use (default 0)
-f, --file <name>        Specify name of output files (default <GPU_NAME>)
-g, --graphs             Generate graphical plots for each benchmark
-l, --location <path>    Specify location of output files (default .)
-o, --raw                Write raw timing data
-p, --report             Create Markdown report in output directory
-r, --random             Randomize P-Chase arrays
-s, --stdout             Dump final JSON result into stdout
-q, --quiet              Only write the final JSON to stdout
--l1                     Run L1 cache benchmarks
--l2                     Run L2 cache benchmarks
--l3                     Run L3 cache benchmarks (AMD only)
--scalar                 Run AMD scalar cache benchmarks
--constant               Run NVIDIA constant cache benchmarks
--readonly               Run NVIDIA read-only cache benchmarks
--texture                Run NVIDIA texture cache benchmarks
--shared                 Run shared memory benchmarks
--memory                 Run main memory benchmarks
--departuredelay         Run departure delay benchmarks
--resourceshare          Run resource sharing benchmarks
-v, --version            Display the version of MT4G and exit
-h, --help               Display a detailed help message and exit

If no benchmark group is chosen, all available groups are executed. Unsupported groups are disabled automatically depending on the platform. Exclusive GPU access is recommended for more reliable measurement results.
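
To show how the options combine, here is a small sketch that restricts a run to the L1 and L2 cache groups on a chosen device and requests plots and a Markdown report. The wrapper name mt4g_cache_run and the MT4G_BIN override are hypothetical conveniences; only the flags themselves come from the table above.

```shell
# Run only the L1 and L2 cache benchmark groups on one device, writing
# output to a chosen location and requesting plots plus a Markdown report.
# MT4G_BIN is an assumed override for the binary path (defaults to mt4g on PATH).
# Usage: mt4g_cache_run <device_id> <output_dir>
mt4g_cache_run() {
    "${MT4G_BIN:-mt4g}" --device-id "$1" --location "$2" --l1 --l2 --graphs --report
}
```

For instance, mt4g_cache_run 1 ./out would benchmark the second GPU in the system.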

Output

By default, benchmark results are written as structured JSON into the file <GPU_NAME>.json of the current working directory. However, the name and path of the output file and directory may be changed through the flags -f/--file and -l/--location respectively. With -s/--stdout, the final JSON output file may be dumped into stdout instead. When --graphs, --raw or --report is enabled, additional files are written to results/<GPU_NAME>. The --report flag generates a README.md that embeds all graphs and links to the raw data.
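
For a quick look at a result file without assuming anything about its schema, the following sketch lists the top-level keys of the JSON. It relies on Python 3, which is already a dependency for plotting, and assumes only that the top level of the file is a JSON object; the function name mt4g_json_keys is hypothetical.

```shell
# Print the top-level keys of a JSON result file, one per line, sorted.
# Assumes the top level is a JSON object; nothing else about the schema.
# Usage: mt4g_json_keys <file.json>
mt4g_json_keys() {
    python3 -c 'import json, sys
with open(sys.argv[1]) as f:
    data = json.load(f)
print("\n".join(sorted(data)))' "$1"
}
```

Running mt4g_json_keys <GPU_NAME>.json also doubles as a check that the file is well-formed JSON, since malformed input makes the command fail.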

Known Issues and Limitations

  • L2 segment size measurements on AMD GPUs are currently unreliable due to the platform's complex cache behaviour.
  • Constant L1.5 cache size detection is capped at 64 KiB. Larger caches (> 64 KiB) are reported as 64 KiB + 1 with confidence = 0.
  • Measured bandwidths are not optimal because we currently do not use a (dynamically found) optimal number of blocks.
  • Cache line size detection uses a heuristic approach and is therefore not guaranteed to be correct.
  • Detecting whether the constant L1 cache is shared with the L1 cache is not fully reliable. As a hotfix, we repeat the measurement 10 times and report "not shared" if a single run is unsuccessful. We are working on a cleaner solution.
  • Incomplete support for CDNA3.
  • Runs only on Linux.

Repository Layout & Contribution Guidelines

mt4g
├── CMakeLists.txt        -- Build configuration
├── include               -- Header files
├── LICENSE               -- Project license
├── README.md             -- Project description
├── sample_results        -- Exemplary output files from selected hardware
└── src                   -- Benchmark implementation and CLI helpers

Adding new Measurements

Pre-measured results for selected GPUs live in the sample_results directory. If your hardware is not yet listed, we would greatly appreciate additional reports: Run the tool with --graphs --report (optionally also with --raw) and open a pull request to share your measurements.

Adding a new Benchmark

To add a new benchmark to MT4G, follow these steps:

  1. Implement the benchmark in src/benchmarks/ and expose a suitable interface in include/.
  2. Try to follow the pattern of measureXXX(), XXXLauncher() and XXXKernel() to keep the structure modular and readable. Every benchmark should get its own file to keep code flow as easy as possible to follow -- this is not about software engineering!
  3. Update CMakeLists.txt if necessary.
  4. Document the new benchmark and its command line switch in the README.md if suitable.

Coding Style

The codebase follows modern C++20 guidelines. Use -Wall -Wextra -Wpedantic for clean builds and keep functions small and well documented.

About

Developed at the Chair for Computer Architecture and Parallel Systems at the Technical University of Munich (CAPS TUM). Originally authored by Dominik Größler, completely reworked by Manuel Walter Mußbacher and currently maintained by Stepan Vanecek. The research paper surrounding this work can be found here.

This project is licensed under the Apache License 2.0.
