%\pagebreak
\section{Context-based Variant Selection}
\label{sec:context_based_variants}

\index{directives!declare variant@\kcode{declare variant}}
\index{directives!metadirective@\kcode{metadirective}}

\index{OpenMP context@OpenMP context}
\index{context selector@context selector}
\index{trait selector set@trait selector set}
\index{trait selector@trait selector}
\index{trait property@trait property}

Certain directives, including \kcode{declare variant},
\kcode{begin declare variant}, and \kcode{metadirective}
directives, specify function or directive variants for callsite or directive
substitution. They use \plc{context selectors} to specify the contexts in which
the variant may be selected for substitution. A context selector specifies
various \plc{trait selectors}, grouped into \plc{trait selector sets}. A trait
selector, for a given trait selector set, identifies a corresponding trait
(and, in some cases, its trait properties) that may or may not be active in an
\plc{OpenMP context}. A context selector is considered to be \plc{compatible}
with a given OpenMP context if all traits and trait properties corresponding to
trait selectors are active in that context.

Each context selector is a comma-separated list of trait selector sets and each
trait selector set has the form \plc{trait-selector-name}~\kcode{=\{}~
\plc{trait-selector-list}~\kcode{\}}, where \plc{trait-selector-list} is a
comma-separated list of trait selectors. Some trait selectors may in turn
specify one or more \plc{trait properties}. Additionally, a trait selector may
optionally specify a \plc{trait score} for explicit control over variant
selection.

Consider this context selector: \kcode{construct=\{teams,parallel,for\},}
\kcode{device=\{arch(nvptx)\},} \kcode{user=\{condition(\ucode{N>32})\}}.

The context selector specifies three distinct trait selector sets, a
\kcode{construct} trait selector set, a \kcode{device} trait selector set, and
a \kcode{user} trait selector set. The \kcode{construct} trait selector set
specifies three trait selectors: \kcode{teams}, \kcode{parallel}, and
\kcode{for}. The \kcode{device} trait selector set specifies one trait
selector: \kcode{arch(nvptx)}. And the \kcode{user} trait selector set
specifies one trait selector: \kcode{condition(\ucode{N>32})}.

The \kcode{teams}, \kcode{parallel}, and \kcode{for} trait selectors
respectively require that the \plc{teams}, \plc{parallel}, and \plc{for} traits
are active in the \plc{construct} trait set of the OpenMP context (i.e., the
\kcode{teams}, \kcode{parallel}, and \kcode{for} constructs are enclosing
constructs that do not appear outside any enclosing \kcode{target} construct at
the program point of interest). The \kcode{arch} trait selector specifies the
\kcode{nvptx} trait property, requiring that \plc{nvptx} is one of the
supported architectures per the \plc{arch} trait of the \plc{device} trait set
of the OpenMP context. Finally, the \kcode{condition} trait selector specifies
the \ucode{N>32} expression as a trait property, requiring that \ucode{N>32}
evaluates to \plc{true} in the OpenMP context.
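
As a minimal sketch of where such a selector is written, the C fragment below attaches it to a \kcode{declare variant} directive. The names \ucode{scale}, \ucode{scale_nvptx} and the value of \ucode{N} are hypothetical and chosen only for illustration; \ucode{N} is a macro here so that \ucode{N>32} is a constant expression. Both functions compute the same result, so the fragment behaves identically whether or not the variant is substituted.

```c
#include <stddef.h>

#define N 64 /* hypothetical size; makes N>32 a constant expression */

/* Hypothetical function variant intended for nvptx devices. */
void scale_nvptx(float *v, float s)
{
    for (size_t i = 0; i < N; i++)
        v[i] *= s;
}

/* Base function. A call to scale() may be replaced by scale_nvptx() only
   where the call is nested in teams, parallel and for constructs, nvptx is
   an active architecture of the device, and N>32 evaluates to true. */
#pragma omp declare variant(scale_nvptx) \
    match(construct={teams,parallel,for}, device={arch(nvptx)}, \
          user={condition(N>32)})
void scale(float *v, float s)
{
    for (size_t i = 0; i < N; i++)
        v[i] *= s;
}
```

A call to \ucode{scale} from sequential host code is not nested in the named constructs, so the selector is not compatible there and the base function is expected to run.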

The remainder of this section presents examples that make use of context
selectors for function and directive variant selection. Sections
\ref{subsec:declare_variant} and \ref{subsec:metadirective} cover cases where
only one context selector is compatible.
Section \ref{subsec:context_selector_scoring} covers cases where
multiple compatible context selectors exist and a scoring algorithm
determines which one of the variants is selected.

\subsection{\kcode{declare variant} Directive}
\label{subsec:declare_variant}
\index{directives!declare variant@\kcode{declare variant}}
\index{declare variant directive@\kcode{declare variant} directive}
\index{declare variant directive@\kcode{declare variant} directive!match clause@\kcode{match} clause}
\index{clauses!match@\kcode{match}}
\index{match clause@\kcode{match} clause}

A \kcode{declare variant} directive specifies an alternate function, called a
\plc{function variant}, to be used in place of the \plc{base function}
%when the trait within the \kcode{match} clause has a valid context.
when the context selector in the \kcode{match} clause is compatible with the OpenMP context at a given callsite.
The base function follows the directive in the C and C++ languages.
In Fortran, either a subroutine or function may be used as the base function,
and the \kcode{declare variant} directive must be in the specification
part of a subroutine or function (unless a \plc{base-proc-name}
modifier is used, as in the case of a procedure declaration statement). See
the OpenMP 5.0 Specification for details on the modifier.

When multiple \kcode{declare variant} directives are used,
a function variant becomes a candidate for replacing the base function if the
%base function call context matches the traits of all selectors in the \kcode{match} clause.
context at the base function call matches the traits of all selectors in the \kcode{match} clause.
If there are multiple candidates, each is assigned a score according to rules
for the selector traits, and the scores determine which candidate is used.
See Section \ref{subsec:context_selector_scoring} for details.
%The scoring algorithm can be found in the OpenMP 5.0 Specification.

In the first example the \ucode{vxv()} function is called within a \kcode{parallel} region,
a \kcode{target} region, and in a sequential part of the program. Two function variants, \ucode{p_vxv()} and \ucode{t_vxv()},
are defined for the first two regions by using \kcode{parallel} and \kcode{target} selectors (within
the \plc{construct} trait set) in a \kcode{match} clause. The \ucode{p_vxv()} function variant includes
a \kcode{for} construct (\kcode{do} construct for Fortran) for the \kcode{parallel} region,
while \ucode{t_vxv()} includes a \kcode{distribute simd} construct for the \kcode{target} region.
The \ucode{t_vxv()} function is explicitly compiled for the device using a declare target directive.

Since the two \kcode{declare variant} directives have no selectors that match traits for the context
of the base function call in the sequential part of the program, the base \ucode{vxv()} function is used there,
as expected.
(The vectors in the \ucode{p_vxv} and \ucode{t_vxv} functions have been multiplied
by 3 and 2, respectively, for checking the validity of the replacement. Normally
the purpose of a function variant is to produce the same results by a different method.)

%Note: a \code{target teams} construct is used to direct execution onto a device, with a
%\code{distribute simd} construct in the function variant. As of the OpenMP 5.0 implementation
%no intervening code is allowed between a \code{target} and \code{teams} construct. So
%using a \code{target} construct to direct execution onto a device, and including
%\code{teams distribute simd} in the variant function would produce non conforming code.

\cexample[5.1]{declare_variant}{1}

\ffreeexample[5.0]{declare_variant}{1}

In this example, traits from the \plc{device} set are used to select a function variant.
In the \kcode{declare variant} directive, an \kcode{isa} trait selector
specifies that if the implementation of the ``\vcode{core-avx512}''
instruction set is detected at compile time, the \ucode{avx512_saxpy()}
function variant is used for the call to \ucode{base_saxpy()}.

A compilation of \ucode{avx512_saxpy()} is aware of
the AVX-512 instruction set that supports 512-bit vector extensions.
Within \ucode{avx512_saxpy()}, the \kcode{parallel for simd} construct performs parallel execution, and
takes advantage of 64-byte data alignment.
When the \ucode{avx512_saxpy()} function variant is not selected, the base function \ucode{base_saxpy()},
containing only a basic \kcode{parallel for} construct, is used for the call to \ucode{base_saxpy()}.

%Note:
%An allocator is used to set the alignment to 64 bytes when an OpenMP compilation is performed.
%Details about allocator variable declarations and functions
%can be found in the allocator example of the Memory Management Chapter.

\cexample[5.0]{declare_variant}{2}

\ffreeexample[5.0]{declare_variant}{2}

The \kcode{begin declare variant} directive, with its paired \kcode{end declare variant} directive, was introduced
for C/C++ in OpenMP 5.1 to allow nesting of declare variant directives.
This example shows a practical situation where nested declare variant directives can be used
to include the correct specialized user function based on the underlying vendor \kcode{isa} trait.
The function name \ucode{my_fun()} is identical in all the header files and the version called will
differ based on the calling context. The example assumes that either NVIDIA or AMD target devices are used.

\index{directives!begin declare variant@\kcode{begin declare variant}}
\index{begin declare variant directive@\kcode{begin declare variant} directive}

\cexample[5.1]{declare_variant}{3}

%%%%%%%%%%%%%
\subsection{Metadirectives}
\label{subsec:metadirective}
\index{directives!metadirective@\kcode{metadirective}}
\index{metadirective directive@\kcode{metadirective} directive}

\index{metadirective directive@\kcode{metadirective} directive!when clause@\kcode{when} clause}
\index{metadirective directive@\kcode{metadirective} directive!otherwise clause@\kcode{otherwise} clause}
\index{clauses!when@\kcode{when}}
\index{when clause@\kcode{when} clause}
\index{clauses!otherwise@\kcode{otherwise}}
\index{otherwise clause@\kcode{otherwise} clause}
A \kcode{metadirective} directive provides a mechanism to select a directive in
a \kcode{when} clause to be used, depending upon one or more contexts:
implementation, available devices and the present enclosing construct.
The directive in an \kcode{otherwise} clause is used when a directive of the
\kcode{when} clause is not selected.

\index{context selector!construct@\plc{construct}}
In the \kcode{when} clause the \plc{context selector} (or just \plc{selector}) defines traits that are
evaluated for selection of the directive that follows the selector.
This selectable directive is called a \plc{directive variant}.
%Traits are grouped by \plc{construct}, \plc{implementation} and
%\plc{device} \plc{sets} to be used by a selector of the same name.

\index{context selector!device@\plc{device}}
In the first example the \plc{arch} trait of the
\kcode{device} selector set specifies that if an \ucode{nvptx} architecture is
active in the OpenMP context, then the \kcode{teams loop}
directive variant is selected as the directive; otherwise, the \kcode{parallel loop}
directive variant of the \kcode{otherwise} clause is selected as the directive.
That is, if a device of \ucode{nvptx} architecture is supported by the implementation within
the enclosing \kcode{target} construct, its directive variant is selected.
The architecture names, such as \ucode{nvptx}, are implementation defined.
Also, note that the \kcode{device} clause specified in a \kcode{target} construct specifies
a device number, while \kcode{device}, as used in the \kcode{metadirective}
directive as a selector set, has traits of \plc{kind}, \plc{isa} and \plc{arch}.


\cexample[5.2]{metadirective}{1}

\ffreeexample[5.2]{metadirective}{1}
\pagebreak

\index{context selector!implementation@\plc{implementation}}
In the second example, the \kcode{implementation} selector set is specified
in the \kcode{when} clause to distinguish between platforms.
Additionally, specific architectures are specified with the \kcode{device}
selector set.

In the code, different \kcode{teams} constructs are employed as determined
by the \kcode{metadirective} directive.
The number of teams is restricted by a \kcode{num_teams} clause
and a thread limit is also set by a \kcode{thread_limit} clause for
vendor platforms and specific architecture
traits. Otherwise, just the \kcode{teams} construct is used without
any clauses, as prescribed by the \kcode{otherwise} clause.


\cexample[5.2]{metadirective}{2}

\ffreeexample[5.2]{metadirective}{2}

\index{context selector!construct@\plc{construct}}

\index{directives!declare target@\kcode{declare target}}
\index{declare target directive@\kcode{declare target} directive}

\index{directives!begin declare target@\kcode{begin declare target}}
\index{begin declare target directive@\kcode{begin declare target} directive}

In the third example, a \kcode{construct} selector set is specified in the \kcode{when} clause.
Here, a \kcode{metadirective} directive is used within a function that is also
compiled as a function for a target device as directed by a declare target directive.
The \kcode{target} directive name of the \kcode{construct} selector ensures that the
\kcode{distribute parallel for/do} construct is employed for the target compilation.
Otherwise, for the host-compiled version the \kcode{parallel for/do simd} construct is used.

In the first call to the \ucode{exp_pi_diff()} routine the context is a
\kcode{target teams} construct and the \kcode{distribute parallel for/do}
construct version of the function is invoked,
while in the second call the \kcode{parallel for/do simd} construct version is used.

%%%%%%%%
This case illustrates an important point for users that may want to hoist the
\kcode{target} directive out of a function that contains the usual
\kcode{target teams distribute parallel for/do} construct
(for providing alternate constructs through the \kcode{metadirective} directive as here).
While this combined construct can be decomposed into a \kcode{target} and
\kcode{teams distribute parallel for/do} constructs, the OpenMP 5.0 specification has the restriction:
``If a \kcode{teams} construct is nested within a \kcode{target} construct, that \kcode{target} construct must
contain no statements, declarations or directives outside of the \kcode{teams} construct''.
So, the \kcode{teams} construct must immediately follow the \kcode{target} construct without any intervening
code statements (which includes function calls).
Since the \kcode{target} construct alone cannot be hoisted out of a function,
the \kcode{target teams} construct has been hoisted out of the function, and
the \kcode{distribute parallel for/do} construct is used
as the variant directive of the \kcode{metadirective} directive within the function.
%%%%%%%%

\cexample[5.2]{metadirective}{3}

\ffreeexample[5.2]{metadirective}{3}
\pagebreak

\index{context selector!user@\plc{user}}
\index{context selector!condition selector@\kcode{condition} selector}

The \kcode{user} selector set can be used in a \kcode{metadirective}
to select directives at execution time when the
\kcode{condition( \plc{boolean-expr} )} selector expression is not a constant expression.
In this case it is a \plc{dynamic} trait set, and the selection is made at run time, rather
than at compile time.

In the following example the \ucode{foo} function employs the \kcode{condition}
selector to choose a device for execution at run time.
In the \ucode{bar} routine metadirectives are nested.
At the outer level a selection between serial and parallel execution is performed
at run time, followed by another run time selection on the schedule kind in the inner
level when the active \plc{construct} trait is \kcode{parallel}.

(Note, the variable \ucode{b} in two of the ``selected'' constructs is declared private for the sole purpose
of detecting and reporting that the construct is used. Since the variable is private, its value
is unchanged outside of the construct region, whereas it is changed if the ``unselected'' construct
is used.)

%(Note: The value of \plc{b} after the \code{parallel} region remains 0 for the
%\code{guided} scheduling case, because its \code{parallel} construct also contains
%the \code{private(}~\plc{b}~\code{)} clause.
%The variable \plc{b} is employed for the sole purpose of distinguishing which
%\code{parallel} construct is selected-- for testing.)

%While there might be other ways to make these decisions at run time, such as using
%an \code{if} clause on a \code{parallel} construct, this mechanism is much more general.
%For instance, an input ``gpu\_type'' string could be used and tested in boolean expressions
%to select from one of several possible \code{target} constructs.
%Also, setting the scheduling variable (\plc{unbalanced}) within the execution through a
%``work balance'' function might be a more practical approach for setting the schedule kind.


\cexample[5.2]{metadirective}{4}

\ffreeexample[5.2]{metadirective}{4}
\pagebreak

Metadirectives can be used in conjunction with templates as shown in the C++ code below.
Here the template definition generates two versions of the Fibonacci function.
The \ucode{tasking} boolean is used in the \kcode{condition} selector to enable tasking.
The true form implements a parallel version with \kcode{task} and \kcode{taskwait}
constructs as in the \example{tasking.4.c} code in Section~\ref{sec:task_taskwait}.
The false form implements a serial version without any tasking constructs.
Note that the serial version is used in the parallel function for optimally
processing numbers less than 8.

\cppexample[5.0]{metadirective}{5}

%\pagebreak
\subsection{Context Selector Scoring}

\label{subsec:context_selector_scoring}
\index{context selector scoring@context selector scoring}

Each context selector for which all specified traits are active in the current
\plc{OpenMP context} is a \plc{compatible context selector}, and the associated
function variant or directive variant for such a context selector is a
\plc{replacement candidate}. The final \plc{score} of each of the compatible
context selectors determines which of the replacement candidates is selected for
substitution.

For a given compatible context selector, the score is calculated according
to the specified trait selectors and their corresponding traits. If the trait
selectors are a strict subset of the trait selectors specified by another
compatible context selector then the score of the context selector is zero.
Otherwise, the final score is one plus the sum of the score values of each
specified trait selector.
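
The rule above can be sketched in C as follows. This is only an illustrative model, not how an implementation represents selectors: it assumes each trait selector is encoded as one bit of a mask, and the names \ucode{selector_mask}, \ucode{is_strict_subset} and \ucode{final_score} are hypothetical.

```c
#include <stddef.h>

/* Modeling assumption: bit i of a mask is set when the context selector
   specifies trait selector i. */
typedef unsigned selector_mask;

/* Nonzero if every trait selector in a is also in b and a differs from b. */
static int is_strict_subset(selector_mask a, selector_mask b)
{
    return a != b && (a & b) == a;
}

/* Final score of compatible selector sel[self] among n compatible selectors.
   trait_score[i] is the score value of trait selector i. */
long final_score(size_t self, const selector_mask *sel, size_t n,
                 const long *trait_score, size_t n_traits)
{
    /* A strict subset of another compatible selector scores zero. */
    for (size_t j = 0; j < n; j++)
        if (j != self && is_strict_subset(sel[self], sel[j]))
            return 0;

    /* Otherwise: one plus the sum of the specified trait selector scores. */
    long score = 1;
    for (size_t i = 0; i < n_traits; i++)
        if (sel[self] & (1u << i))
            score += trait_score[i];
    return score;
}
```

With three hypothetical selectors \ucode{\{0x1, 0x2, 0x6\}} and trait scores \ucode{\{0, 0, 1\}}, the second selector is a strict subset of the third and therefore scores zero, while the first and third are scored by the sum.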

A replacement candidate is selected if no other candidate has a higher scoring
context selector. If multiple replacement candidates have a context selector
with the same highest score, the one specified first on the metadirective is
selected. If multiple function variants are replacement candidates that have
context selectors with the same highest score, the one that is selected is
implementation defined.
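
For directive variants, the selection step amounts to a first-wins maximum over the scores in the order the candidates appear on the metadirective. The sketch below models only that case; the function name \ucode{select_candidate} is hypothetical, and the tie-breaking shown does not apply to function variants, where the choice among equal highest scores is implementation defined.

```c
#include <stddef.h>

/* Index of the replacement candidate with the highest score. Candidates are
   given in the order they are specified; on a tie the earlier one is kept.
   Requires n >= 1. */
size_t select_candidate(const long *score, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (score[i] > score[best]) /* strict '>' keeps the first on ties */
            best = i;
    return best;
}
```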

If a \kcode{construct} selector set is specified in the context selector, each
active construct trait that is named in that selector set contributes a score
of $2^{p-1}$, where $p$ is the position of that trait in the current
\plc{construct} trait set (the set of traits in the OpenMP context). If a
\kcode{device} or \kcode{target_device} selector set is specified in the
selector, then an active \plc{kind}, \plc{arch}, or \plc{isa} trait that is
named in the selector set contributes a score of $2^l$, $2^{l+1}$, and
$2^{l+2}$, respectively, where $l$ is the number of traits in the
\plc{construct} trait set. For any other active traits that are named in the
context selector that are not implementation-defined extensions, the
contributed score, by default, is zero.
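
These contributions are powers of two and can be written as shifts. The following C helpers are a minimal sketch with hypothetical names (\ucode{construct_trait_score}, \ucode{device_trait_score}, \ucode{enum device_trait}); they assume the exponents stay below the width of \ucode{unsigned long}.

```c
/* Score contributed by a construct trait at 1-based position p in the
   construct trait set: 2^(p-1). Requires p >= 1. */
unsigned long construct_trait_score(unsigned p)
{
    return 1UL << (p - 1);
}

/* The enumerator values are the offsets added to l in the exponent. */
enum device_trait { DEV_KIND = 0, DEV_ARCH = 1, DEV_ISA = 2 };

/* Score contributed by an active kind, arch or isa trait, where l is the
   number of traits in the construct trait set: 2^l, 2^(l+1), 2^(l+2). */
unsigned long device_trait_score(enum device_trait t, unsigned l)
{
    return 1UL << (l + (unsigned)t);
}
```

Because every device contribution is at least $2^l$, it exceeds the sum of all possible construct contributions, which is at most $2^l-1$.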

The default score for any active traits other than \plc{construct} traits and
the \plc{kind}, \plc{arch}, or \plc{isa} traits may be overridden with an
explicit score expression. Specifying an explicit score is only recommended
for prioritizing replacement candidates for which a selection is not
dependent on construct traits. That is, none of the compatible context
selectors specify a \kcode{construct} trait selector or a \kcode{kind},
\kcode{arch}, or \kcode{isa} trait selector.

In the following example, four function variants are declared for the procedure
\ucode{f}: \ucode{fx1}, \ucode{fx2}, \ucode{fx3}, and \ucode{fx4}. Suppose that
the target device for the \kcode{target} region has the \plc{gpu} device kind,
has the \plc{nvptx} architecture, and supports the \splc{sm_70} instruction set
architecture. Hence, the context selectors for all function variants are
compatible with the context at the callsite for \ucode{f} inside the
\kcode{target} region. The \plc{construct} trait set at the callsite,
consisting of all enclosing constructs and having a count of \plc{l=6}, is:
\{\plc{target}, \plc{teams}, \plc{distribute}, \plc{parallel},
\plc{for}/\plc{do}, \plc{task}\}. Note that only \plc{context-matching}
constructs, which do not include \kcode{distribute} or \kcode{task}, may be
named by a \kcode{construct} trait selector as of OpenMP 5.2. The score for
\ucode{fx1} is $1+2^0=2$, for \ucode{fx2} is $1+2^1+2^3+2^4=27$,
for \ucode{fx3} is $1+2^6+2^8=321$, and for \ucode{fx4} is $1+2^7+2^8=385$.
Since \ucode{fx4} is the function variant that has the highest scoring
selector, it is selected by the implementation at the callsite.
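
The four scores can be reconstructed term by term from the rules above, with $l=6$ and positions $p$ counted from the outermost construct. The trait labels on the right are read off from the exponents and the construct trait set given in the text, and are meant only as a reading aid:

\[
\begin{array}{lll}
\mbox{fx1:} & 1 + 2^{1-1} = 1 + 1 = 2 & \mbox{(target, $p=1$)} \\
\mbox{fx2:} & 1 + 2^{2-1} + 2^{4-1} + 2^{5-1} = 1 + 2 + 8 + 16 = 27 & \mbox{(teams, parallel, for/do; $p=2,4,5$)} \\
\mbox{fx3:} & 1 + 2^{l} + 2^{l+2} = 1 + 64 + 256 = 321 & \mbox{(kind, isa)} \\
\mbox{fx4:} & 1 + 2^{l+1} + 2^{l+2} = 1 + 128 + 256 = 385 & \mbox{(arch, isa)}
\end{array}
\]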

\cexample[5.0]{selector_scoring}{1}

\ffreeexample[5.0]{selector_scoring}{1}

In the next example, three function variants are declared for the procedure
\ucode{kernel}: \ucode{kernel_target_ua}, \ucode{kernel_target_usm}, and
\ucode{kernel_target_usm_v2}. Suppose that the implementation supports the
\splc{unified_address} and \splc{unified_shared_memory} requirements, so that
the context selectors for all function variants are compatible. The score for
\ucode{kernel_target_ua} is 1, which is one plus the zero score associated with
the active \splc{unified_address} requirement. The score for
\ucode{kernel_target_usm} is 0, as the selector is a strict subset of the
selector for \ucode{kernel_target_usm_v2}. The score for
\ucode{kernel_target_usm_v2} is 2, which is one plus the explicit score of 1
for the \plc{condition} trait and the zero score associated with the active
\splc{unified_shared_memory} requirement. Since \ucode{kernel_target_usm_v2}
is the function variant that has the highest scoring selector, it is selected
by the implementation at the callsite.

\cexample[5.0]{selector_scoring}{2}

\ffreeexample[5.0]{selector_scoring}{2}