\cchapter{OpenMP Affinity}{affinity}
\label{chap:openmp_affinity}
OpenMP defines \emph{thread affinity} with respect to \emph{places}, where a
place is an abstraction that represents a set of processors (e.g., one or more
processor IDs, a hardware thread, a core, a socket, etc.). Thread affinity
control enables users to assign threads that perform computation in a parallel
region to specific places, while allowing the runtime implementation to freely
migrate threads to different execution units within a given place. A thread
that is assigned to a place for a given parallel region remains bound to that
place for the duration of that region.

The places available for thread affinity control (referred to as a \emph{place
partition}) can be set via the \kcode{OMP_PLACES} environment variable. The
binding of threads to places can be managed explicitly or handled implicitly.
Without the \kcode{OMP_PLACES} variable being set, the initial place partition
is implementation defined. The method by which threads are assigned to places
for a given parallel region is determined by the specified thread affinity
policy. This policy can be set via the \kcode{OMP_PROC_BIND} environment
variable or can be explicitly set for a particular \kcode{parallel} construct
with the \kcode{proc_bind} clause.

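
As a minimal sketch (the thread count and values are illustrative, not part of any numbered example in this chapter), both controls are typically exported in the environment before the program is launched:

```shell
# Hypothetical setup: 4 threads, one place per core,
# threads spread across the place partition.
export OMP_NUM_THREADS=4
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
```
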
The OpenMP specification document defines a \emph{processor} as a hardware
execution unit on which one or more OpenMP threads may execute. The actual
hardware mechanism that a given processor ID represents depends on the
implementation and architecture. For example, a processor could correspond to a
core on the device that does not have simultaneous multi-threading (SMT)
support or for which SMT is disabled, while for an SMT-enabled device a
processor could correspond to a hardware thread. Processor IDs are the
resulting sequential numbering of processors, starting from 0. The initial
place partition can be defined explicitly with processor IDs or using an
\emph{abstract name}. For example, \pout{OMP_PLACES="\{0,1\},\{2,3\}"}
defines two places in the initial place partition, the first place consisting
of processors 0 and 1 from the device and the second place consisting of
processors 2 and 3 from the device. Alternatively, \pout{OMP_PLACES="cores"}
defines one place per core on the host device.

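
The settings just described can be written as follows; the \pout{\{lower:length\}} interval form is an equivalent shorthand from the OpenMP specification (each line is an alternative, so the last assignment wins):

```shell
# Alternative place-partition settings -- each export replaces the previous one.
export OMP_PLACES="{0,1},{2,3}"   # two explicit places: procs 0-1 and procs 2-3
export OMP_PLACES="{0:2},{2:2}"   # the same two places in {lower:length} form
export OMP_PLACES=cores           # one place per core on the device
```
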
The processors that are available to an OpenMP process may be a subset
of the processors on the system. This restriction may be the result of a
wrapper process controlling the execution (such as \ucode{numactl} on Linux
systems), compiler options, library-specific environment variables, or default
kernel settings. For instance, multiple MPI processes launched on a single
compute node will each be restricted to a subset of the processors, as
determined by the MPI launcher or by the affinity environment variables of the
MPI library.

The threads that are under affinity control for a given parallel region include
the threads assigned to its team and additionally any free-agent threads (see
Section~\ref{sec:free_agent}) that execute tasks bound to the region. Affinity
control for threads can be disabled (i.e., allowing threads to migrate freely
across processors) by setting \kcode{OMP_PROC_BIND} to \vcode{false}. If instead
\kcode{OMP_PROC_BIND} is \vcode{true}, then threads will bind to places but the
places to which they bind are implementation defined. Finally, three affinity
policies that are more prescriptive are available via the environment variable
or the \kcode{proc_bind} clause: \kcode{spread}, \kcode{close}, and
\kcode{primary}. These are detailed in the following section.
% We need an example of using sockets, cores and threads:
% case 1 threads:
% Hyper-Threads on (2 hardware threads per core)
% 1 socket x 4 cores x 2 HW-threads
%
% export OMP_NUM_THREADS=4
% export OMP_PLACES=threads
%
% core # 0 1 2 3
% processor # 0,1 2,3 4,5 6,7
% thread # 0 * _ _ _ _ _ _ _ #mask for thread 0
% thread # 1 _ _ * _ _ _ _ _ #mask for thread 1
% thread # 2 _ _ _ _ * _ _ _ #mask for thread 2
% thread # 3 _ _ _ _ _ _ * _ #mask for thread 3
% case 2 cores:
%
% Hyper-Threads on (2 hardware threads per core)
% 1 socket x 4 cores x 2 HW-threads
%
% export OMP_NUM_THREADS=4
% export OMP_PLACES=cores
%
% core # 0 1 2 3
% processor # 0,1 2,3 4,5 6,7
% thread # 0 * * _ _ _ _ _ _ #mask for thread 0
% thread # 1 _ _ * * _ _ _ _ #mask for thread 1
% thread # 2 _ _ _ _ * * _ _ #mask for thread 2
% thread # 3 _ _ _ _ _ _ * * #mask for thread 3
% case 3 sockets:
%
% No Hyper-Threads
% 3 sockets x 4 cores
%
% export OMP_NUM_THREADS=3
% export OMP_PLACES=sockets
%
% socket # 0 1 2
% processor # 0,1,2,3 4,5,6,7 8,9,10,11
% thread # 0 * * * * _ _ _ _ _ _ _ _ #mask for thread 0
% thread # 1    _ _ _ _  * * * *  _ _ _ _      #mask for thread 1
% thread # 2    _ _ _ _  _ _ _ _  * * * *      #mask for thread 2
%===== Examples Sections =====
\input{affinity/affinity}
\input{affinity/task_affinity}
\input{affinity/affinity_display}
\input{affinity/affinity_query}