APPy

APPy (Annotated Parallelism for Python) enables users to parallelize generic Python loops and tensor expressions for execution on GPUs by adding OpenMP-like compiler directives (annotations) to Python code. With APPy, parallelizing a Python for loop on the GPU can be as simple as adding a #pragma parallel for before the loop, like the following:

#pragma parallel for
for i in range(N):
    C[i] = A[i] + B[i]

The APPy compiler will recognize the pragma, JIT-compile the loop to GPU code, and execute the loop on the GPU. A detailed description of APPy can be found in APPy: Annotated Parallelism for Python on GPUs. This document provides a quick guide to get started.

A simple example showing how to use APPy:

import numpy as np
import appy

@appy.jit
def inc_by_1(a):
    #pragma parallel for simd
    for i in range(a.shape[0]):
        a[i] += 1


a = np.arange(10)
inc_by_1(a)
print(a)  # Should print [ 1  2  3  4  5  6  7  8  9 10]

Install

APPy tries to keep its dependencies minimal. To install the minimal version of APPy, which includes only the code generator itself, run:

pip install -e .

In addition, if you want to be able to execute the generated GPU code, run:

pip install -e .[triton]

APPy currently has a Triton backend, which requires torch and triton to be installed, and a Linux platform with an NVIDIA GPU (Compute Capability 7.0+).
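
A quick sanity check of the GPU environment can be done with torch (a minimal sketch, assuming torch is installed):

import torch

# Verify that an NVIDIA GPU is visible and meets the capability requirement.
assert torch.cuda.is_available(), "no CUDA device found"
major, minor = torch.cuda.get_device_capability()
print(f"Compute Capability: {major}.{minor}")  # should be 7.0 or higher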

Quick Start

python examples/01-vec_add.py

Loop-oriented programming interface

Parallelization

A loop can be parallelized by annotating it with #pragma parallel for; the end of the loop acts as a synchronization point. Each loop iteration is assigned to a worker, and the number of workers launched is always equal to the number of loop iterations. Each worker is scheduled to a single vector processor and executes its instructions sequentially. A parallel for-loop must be a for-range loop, and the number of loop iterations must be known at kernel launch time, i.e. no dynamic parallelism.

The vector addition example below parallelizes a for loop with APPy via #pragma parallel for. A #pragma ... line is a regular comment in Python, but it is parsed and treated as a directive by APPy.

@appy.jit
def vector_add(A, B, C, N):
    #pragma parallel for
    for i in range(N):
        C[i] = A[i] + B[i]
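
A call site for this kernel might look as follows (a sketch; the array sizes are arbitrary):

import numpy as np

A = np.random.rand(1024)
B = np.random.rand(1024)
C = np.empty_like(A)
vector_add(A, B, C, A.shape[0])  # one worker per loop iteration
assert np.allclose(C, A + B)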

APPy's Machine Model

A key design decision of APPy is that it assumes a simple abstract machine model, a multi-vector processor, instead of directly exposing the complex GPU architecture to the programmer. This multi-vector processor has two layers of parallelism: 1) each vector processor can do vector processing (SIMD); 2) different vector processors run independently and simultaneously (MIMD). The #pragma parallel for pragma corresponds to the MIMD parallelism, which is also referred to as parallelization. The SIMD parallelism is referred to as vectorization, described in more detail in the next section. Maximum parallelism is achieved when a loop is both parallelized and vectorized.

Vectorization

Although #pragma parallel for parallelizes a loop, maximum parallelism is achieved when the loop body is also vectorized, where applicable. APPy provides two high-level ways to achieve vectorization: 1) use tensor/array expressions (the compiler generates a loop automatically, though this feature is not included as of v0.3.0); 2) annotate a loop with #pragma simd, which divides the loop into smaller chunks.

Vector addition example.

@appy.jit
def vector_add(A, B, C):
    #pragma parallel for simd
    for i in range(A.shape[0]):
        C[i] = A[i] + B[i]

SpMV example.

@appy.jit
def spmv(A_row, A_col, A_val, x, y, N):
    # A is in CSR format: A_row holds the N row pointers (so N - 1 rows),
    # A_col the column indices, and A_val the nonzero values
    #pragma parallel for
    for i in range(N - 1):
        yi = 0.0
        #pragma simd
        for j in range(A_row[i], A_row[i + 1]):
            yi += A_val[j] * x[A_col[j]]
        y[i] = yi
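
The CSR arrays map directly onto the kernel arguments. A sketch of a call site, assuming scipy is available:

import numpy as np
import scipy.sparse as sp

M = sp.random(1000, 1000, density=0.01, format='csr')
x = np.random.rand(1000)
y = np.empty(1000)
# M.indptr has (number of rows + 1) entries, matching range(N - 1) above
spmv(M.indptr, M.indices, M.data, x, y, len(M.indptr))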

A loop that cannot be parallelized may still be vectorizable. One example is the j loop in the SpMV example above, which has dynamic loop bounds.

Data Scope

Array variables must already be defined before the parallel region executes; their data can reside in either CPU memory or GPU memory. For CPU arrays, e.g. NumPy arrays, the compiler automatically moves the data to the device before launching the kernel and moves it back to the host after the kernel finishes. For GPU arrays, e.g. PyTorch CUDA tensors, the compiler does not move them; they stay where they are throughout the kernel.
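
For example, the inc_by_1 kernel from earlier works with either kind of array (a sketch, assuming torch with CUDA support is installed):

import numpy as np
import torch

a_cpu = np.zeros(8)                    # NumPy array: copied to the GPU and back automatically
inc_by_1(a_cpu)

a_gpu = torch.zeros(8, device='cuda')  # CUDA tensor: stays on the GPU, no copies
inc_by_1(a_gpu)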

Scalar variables may be defined either outside or inside the parallel region. If defined outside and used inside, the variable has pass-by-value argument semantics: it gets its initial value from outside when the kernel is launched, but any updates are visible only inside the kernel. To make updates visible outside the kernel, the variable must be declared in the shared clause, which tells the compiler to copy the variable to GPU memory, where it can be updated, and copy it back after the kernel finishes. Scalar variables defined inside the parallel region are local to each worker, i.e. they can be safely parallelized.
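
The sketch below illustrates these scoping rules (the function and variable names are hypothetical):

@appy.jit
def scale(A, B, alpha):    # alpha: passed by value into the kernel
    offset = 1.0           # defined outside the parallel region: also passed by value
    #pragma parallel for
    for i in range(A.shape[0]):
        t = alpha * A[i]   # defined inside: local to each worker
        B[i] = t + offset
    # without shared(offset), any update to offset inside the loop
    # would not be visible here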

Parallel reduction

A parallel reduction example.

@appy.jit
def vector_sum(A):
    s = 0.0
    #pragma parallel for simd shared(s)
    for i in range(A.shape[0]):
        s += A[i]

The compiler automatically recognizes the parallel reduction pattern and generates correct code for it, e.g. using atomic operations. The shared(s) clause makes the update to s inside the kernel visible outside the kernel, essentially treating s as a single-element array.
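
To hand the reduced value back to the caller, the function can return s after the loop, for example (a sketch extending the example above):

import numpy as np
import appy

@appy.jit
def vector_sum(A):
    s = 0.0
    #pragma parallel for simd shared(s)
    for i in range(A.shape[0]):
        s += A[i]
    return s  # the shared scalar holds the reduced value after the loop

print(vector_sum(np.ones(1000)))  # should be 1000.0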

Besides pure loops, 1D array expressions can also be used inside a parallel for loop, for example:

@appy.jit
def syrk(alpha, beta, C, A):
    # Symmetric rank-k update: C = alpha * A @ A.T + beta * C (lower triangle)
    #pragma parallel for
    for i in range(A.shape[0]):
        C[i, :i + 1] *= beta  # 1D array expression over a row slice
        for k in range(A.shape[1]):
            C[i, :i + 1] += alpha * A[i, k] * A[:i + 1, k]
    return C

Supported operations

APPy supports the following kinds of operations inside the parallel region (a combined sketch follows the list):

On scalar integer or float values or a 1D slice of an array:

Arithmetic operations
Math functions (via the math package)
Bitwise operations
Logical operations
Compare operations

On arrays of integers or floats:

Array indexing (store or load)

Control flow:

Ternary operators
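
A sketch combining several of these operations (the function name is hypothetical, and math.exp/math.log are assumed to be among the supported math functions):

import math
import appy

@appy.jit
def softplus_clip(A, B, hi):
    #pragma parallel for
    for i in range(A.shape[0]):
        v = math.log(1.0 + math.exp(A[i]))  # arithmetic and math functions
        B[i] = v if v < hi else hi          # compare and ternary operator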
