
\cchapter{OpenMP Affinity}{affinity}
\label{chap:openmp_affinity}

OpenMP Affinity consists of a \kcode{proc_bind} policy (thread affinity policy) and a specification of
places (``location units'' or \plc{processors} that may be cores, hardware
threads, sockets, etc.).
OpenMP Affinity enables users to bind computations to specific places.
The placement holds for the duration of the parallel region.
However, if two or more cores (hardware threads, sockets, etc.) have been
assigned to a given place, the runtime is free to migrate the OpenMP threads
among the cores (hardware threads, sockets, etc.) within that place.

Often the binding can be managed without resorting to explicitly setting places.
Without the specification of places in the \kcode{OMP_PLACES} variable,
the OpenMP runtime will distribute and bind threads using the entire range of processors for
the OpenMP program, according to the \kcode{OMP_PROC_BIND} environment variable
or the \kcode{proc_bind} clause.  When places are specified, the OpenMP runtime
binds threads to the places according to a default distribution policy, or
the policy specified in the \kcode{OMP_PROC_BIND} environment variable or the
\kcode{proc_bind} clause.
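
As a minimal sketch of the clause form, the following C fragment requests a
\kcode{close} binding on a \kcode{parallel} construct and counts how many
threads of the team report a valid place through \kcode{omp_get_place_num}.
The function name \plc{count\_bound\_threads} is hypothetical, and the
\kcode{_OPENMP} guard is only there so that the fragment also builds without
OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns the number of threads in the team that
 * report being bound to a place (omp_get_place_num() >= 0).
 * Returns 0 when compiled without OpenMP support. */
int count_bound_threads(void)
{
    int bound = 0;
#ifdef _OPENMP
    /* The proc_bind clause takes precedence over OMP_PROC_BIND for this
     * region, unless binding has been disabled altogether. */
    #pragma omp parallel proc_bind(close) reduction(+:bound)
    {
        /* omp_get_place_num() returns -1 if the thread is not bound. */
        if (omp_get_place_num() >= 0)
            bound += 1;
    }
#endif
    return bound;
}
```

The result depends on the implementation and on the environment: if no
places are available to the runtime, the function may return 0 even though
a binding policy was requested.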

In the OpenMP Specifications document, a processor refers to an execution unit that
is enabled for an OpenMP thread to use.  A processor is a core when there is
no SMT (Simultaneous Multi-Threading) support or SMT is disabled.  When
SMT is enabled, a processor is a hardware thread (HW-thread). (This is the
usual case; strictly speaking, the execution unit is implementation defined.) Processors
are numbered sequentially from 0 to the number of cores less one (without SMT), or
0 to the number of HW-threads less one (with SMT). OpenMP places use the processor number to designate
binding locations (unless an ``abstract name'' is used).
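
To illustrate the sequential numbering, the following sketch builds the
explicit place list that would correspond to one place per core under the
numbering described above. The helper \plc{build\_core\_places} and its
parameters are hypothetical, and the sketch assumes that the HW-threads of a
core have consecutive processor numbers; real systems may number HW-threads
differently, so the output only illustrates the place-list syntax.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes a place list such as "{0,1},{2,3}" into buf,
 * assuming processors are numbered sequentially and the hw_per_core
 * HW-threads of each core have consecutive numbers.
 * Returns 0 on success, -1 if buf is too small or arguments are invalid. */
int build_core_places(int ncores, int hw_per_core, char *buf, size_t len)
{
    size_t used = 0;
    if (ncores < 1 || hw_per_core < 1 || buf == NULL || len == 0)
        return -1;
    buf[0] = '\0';
    for (int c = 0; c < ncores; c++) {
        for (int t = 0; t < hw_per_core; t++) {
            int proc = c * hw_per_core + t;   /* sequential processor number */
            const char *open  = (t == 0) ? (c == 0 ? "{" : ",{") : ",";
            const char *close = (t == hw_per_core - 1) ? "}" : "";
            int n = snprintf(buf + used, len - used, "%s%d%s", open, proc, close);
            if (n < 0 || (size_t)n >= len - used)
                return -1;                    /* output would be truncated */
            used += (size_t)n;
        }
    }
    return 0;
}
```

For 4 cores with 2 HW-threads each, the resulting string is
\plc{\{0,1\},\{2,3\},\{4,5\},\{6,7\}}, which could be assigned to the
\kcode{OMP_PLACES} variable.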


The processors available to a process may be a subset of the system's
processors.  This restriction may be the result of a
wrapper process controlling the execution (such as \plc{numactl} on Linux systems),
compiler options, library-specific environment variables, or default
kernel settings.  For instance, multiple MPI processes
launched on a single compute node will each have a subset of processors as
determined by the MPI launcher or set by MPI affinity environment
variables for the MPI library.  %Forked threads within an MPI process
%(for a hybrid execution of MPI and OpenMP code) inherit the valid
%processor set for execution from the parent process (the initial task region)
%when a parallel region forks threads.  The binding policy set in
%\code{OMP\_PROC\_BIND} or the \code{proc\_bind} clause will be applied to
%the subset of processors available to \plc{the particular} MPI process.

%Also, setting an explicit list of processor numbers in the \code{OMP\_PLACES}
%variable before an MPI launch (which involves more than one MPI process) will
%result in unspecified behavior (and doesn't make sense) because the set of
%processors in the places list must not contain processors outside the subset
%of processors for an MPI process. A separate \code{OMP\_PLACES} variable must
%be set for each MPI process, and is usually accomplished by launching a script
%which sets \code{OMP\_PLACES} specifically for the MPI process.

Threads of a team are positioned onto places in a compact manner, in a
scattered distribution, or onto the primary thread's place, by setting the
\kcode{OMP_PROC_BIND} environment variable or the \kcode{proc_bind} clause to
\kcode{close}, \kcode{spread}, or \kcode{primary} (\kcode{master} has been deprecated), respectively.  When
\kcode{OMP_PROC_BIND} is set to \kcode{false}, no binding is enforced;
when the value is \kcode{true}, the binding is implementation defined and uses
the places in the \kcode{OMP_PLACES} variable, or places
defined by the implementation if the \kcode{OMP_PLACES} variable
is not set.
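
The difference between the \kcode{close} and \kcode{spread} policies can be
sketched with simple index arithmetic for the case in which the number of
threads $T$ does not exceed the number of places $P$. The functions below are
hypothetical illustrations, not runtime code: they assume that the primary
thread is on place 0 and that $P$ is a multiple of $T$, whereas an actual
runtime starts from the parent thread's place and handles uneven
subpartitions in an implementation-defined way.

```c
/* Hypothetical sketch of thread-to-place assignment for T <= P,
 * assuming the primary thread runs on place 0. */

/* close: thread i runs on the place following that of thread i-1. */
int close_place(int thread, int nthreads, int nplaces)
{
    (void)nthreads;                   /* not needed when T <= P */
    return thread % nplaces;
}

/* spread: the P places are split into T subpartitions of P/T consecutive
 * places; thread i runs on the first place of subpartition i.
 * Assumes nplaces is a multiple of nthreads. */
int spread_place(int thread, int nthreads, int nplaces)
{
    int width = nplaces / nthreads;   /* places per subpartition */
    return thread * width;
}
```

With 4 threads and 8 places, the \kcode{close} sketch yields places 0, 1, 2, 3
and the \kcode{spread} sketch yields places 0, 2, 4, 6.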

The \kcode{OMP_PLACES} variable can also be set to an abstract name
(\kcode{threads}, \kcode{cores}, \kcode{sockets}) to specify that a place is
either a single hardware thread, a core, or a socket, respectively.
This form of \kcode{OMP_PLACES} is most useful when the
number of threads is equal to the number of hardware threads, cores,
or sockets.  It can also be used with a \kcode{close} or \kcode{spread}
distribution policy when the equality does not hold.
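
One way to see what an abstract name expands to on a given system is to query
the place list at run time. The following sketch sums the number of
processors over all places using \kcode{omp_get_num_places} and
\kcode{omp_get_place_num_procs}; the helper name \plc{total\_place\_procs} is
hypothetical, and the \kcode{_OPENMP} guard only lets the fragment build
without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns the total number of processors listed in
 * the place list available to the program, or 0 if there are no places
 * (or OpenMP is not enabled). */
int total_place_procs(void)
{
    int total = 0;
#ifdef _OPENMP
    int nplaces = omp_get_num_places();
    for (int p = 0; p < nplaces; p++)
        total += omp_get_place_num_procs(p);   /* processors in place p */
#endif
    return total;
}
```

The value returned depends on the system and on the \kcode{OMP_PLACES}
setting; when \kcode{cores} is used on a system with SMT enabled, each place
would typically contain the HW-threads of one core.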


% We need an example of using sockets, cores and threads:

% case 1 threads:

%     Hyper-Threads on (2 hardware threads per core)
%     1 socket x 4 cores x 2 HW-threads
%
%     export OMP_NUM_THREADS=4
%     export OMP_PLACES=threads
%
%          core #      0    1    2    3
%     processor #     0,1  2,3  4,5  6,7
%     thread #     0  * _  _ _  _ _  _ _   #mask for thread 0
%     thread #     1  _ _  * _  _ _  _ _   #mask for thread 1
%     thread #     2  _ _  _ _  * _  _ _   #mask for thread 2
%     thread #     3  _ _  _ _  _ _  * _   #mask for thread 3

% case 2 cores:
%
%     Hyper-Threads on (2 hardware threads per core)
%     1 socket x 4 cores x 2 HW-threads
%
%     export OMP_NUM_THREADS=4
%     export OMP_PLACES=cores
%
%          core #      0    1    2    3
%     processor #     0,1  2,3  4,5  6,7
%     thread #     0  * *  _ _  _ _  _ _   #mask for thread 0
%     thread #     1  _ _  * *  _ _  _ _   #mask for thread 1
%     thread #     2  _ _  _ _  * *  _ _   #mask for thread 2
%     thread #     3  _ _  _ _  _ _  * *   #mask for thread 3

% case 3 sockets:
%
%     No Hyper-Threads
%     3 sockets x 4 cores
%
%     export OMP_NUM_THREADS=3
%     export OMP_PLACES=sockets
%
%        socket #        0         1          2
%     processor #     0,1,2,3   4,5,6,7   8,9,10,11
%     thread #     0  * * * *   _ _ _ _   _ _  _  _   #mask for thread 0
%     thread #     1  _ _ _ _   * * * *   _ _  _  _   #mask for thread 1
%     thread #     2  _ _ _ _   _ _ _ _   * *  *  *   #mask for thread 2


%===== Examples Sections =====
\input{affinity/affinity}
\input{affinity/task_affinity}
\input{affinity/affinity_display}
\input{affinity/affinity_query}