2024-11-13 11:07:08 -08:00

429 lines
21 KiB
TeX

\pagebreak
\section{Reduction}
\label{sec:reduction}
This section covers ways to perform reductions in parallel, task, taskloop, and SIMD regions.
\subsection{\kcode{reduction} Clause}
\label{subsec:reduction}
\index{clauses!reduction@\kcode{reduction}}
\index{reduction clause@\kcode{reduction} clause}
\index{reductions!reduction clause@\kcode{reduction} clause}
The following example demonstrates the \kcode{reduction} clause; note that some
reductions can be expressed in the loop in several ways, as shown for the \kcode{max}
and \kcode{min} reductions below:
\cexample[3.1]{reduction}{1}
\pagebreak
\ffreeexample{reduction}{1}
A common implementation of the preceding example is to treat it as if it had been
written as follows:
\cexample{reduction}{2}
\begin{fortranspecific}
\ffreenexample{reduction}{2}
The following program is non-conforming because the reduction is on the
\emph{intrinsic procedure name} \bcode{MAX} but that name has been redefined to be the variable
named \ucode{MAX}.
\ffreenexample{reduction}{3}
\topmarker{Fortran}
The following conforming program performs the reduction using the
\emph{intrinsic procedure name} \kcode{MAX} even though the intrinsic \bcode{MAX} has been renamed
to \ucode{REN}.
\ffreenexample{reduction}{4}
The following conforming program performs the reduction using
\plc{intrinsic procedure name} \kcode{MAX} even though the intrinsic \bcode{MAX} has been renamed
to \ucode{MIN}.
\ffreenexample{reduction}{5}
\end{fortranspecific}
%\pagebreak
The following example is non-conforming because the initialization (\ucode{a =
0}) of the original list item \ucode{a} is not synchronized with the update of
\ucode{a} as a result of the reduction computation in the \bcode{for} loop. Therefore,
the example may print an incorrect value for \ucode{a}.
To avoid this problem, the initialization of the original list item \ucode{a}
should complete before any update of \ucode{a} as a result of the \kcode{reduction}
clause. This can be achieved by adding an explicit barrier after the assignment
\ucode{a = 0}, or by enclosing the assignment \ucode{a = 0} in a \kcode{single}
directive (which has an implied barrier), or by initializing \ucode{a} before
the start of the \kcode{parallel} region.
\cexample[5.1]{reduction}{6}
\fexample[5.1]{reduction}{6}
The following example demonstrates the reduction of array \ucode{a}. In C/C++ this is illustrated by the explicit use of an array section \ucode{a[0:N]} in the \kcode{reduction} clause. The corresponding Fortran example uses array syntax supported in the base language. As of the OpenMP 4.5 specification the explicit use of array section in the \kcode{reduction} clause in Fortran is not permitted. But this oversight has been fixed in the OpenMP 5.0 specification.
\cexample[4.5]{reduction}{7}
\ffreeexample{reduction}{7}
\subsection{Task Reduction}
\label{subsec:task_reduction}
\index{clauses!task_reduction@\kcode{task_reduction}}
\index{task_reduction clause@\kcode{task_reduction} clause}
\index{reductions!task_reduction clause@\kcode{task_reduction} clause}
\index{clauses!in_reduction@\kcode{in_reduction}}
\index{in_reduction clause@\kcode{in_reduction} clause}
\index{reductions!in_reduction clause@\kcode{in_reduction} clause}
In OpenMP 5.0 the \kcode{task_reduction} clause was created for the \kcode{taskgroup} construct,
to allow reductions among explicit tasks that have an \kcode{in_reduction} clause.
In the \example{task_reduction.1} example below a reduction is performed as the algorithm
traverses a linked list. The reduction statement is assigned to be an explicit task using
a \kcode{task} construct and is specified to be a reduction participant with
the \kcode{in_reduction} clause.
A \kcode{taskgroup} construct encloses the tasks participating in the reduction, and
specifies, with the \kcode{task_reduction} clause, that the taskgroup has tasks participating
in a reduction. After the \kcode{taskgroup} region the original variable will contain
the final value of the reduction.
Note: The \ucode{res} variable is private in the \ucode{linked_list_sum} routine
and is not required to be shared (as in the case of a \kcode{parallel} construct
reduction).
\cexample[5.0]{task_reduction}{1}
\ffreeexample[5.0]{task_reduction}{1}
\index{reduction clause@\kcode{reduction} clause!task modifier@\kcode{task} modifier}
\index{task modifier@\kcode{task} modifier}
In OpenMP 5.0 the \kcode{task} \plc{reduction-modifier} for the \kcode{reduction} clause was
introduced to provide a means of performing reductions among implicit and explicit tasks.
The \kcode{reduction} clause of a \kcode{parallel} or worksharing construct may
specify the \kcode{task} \plc{reduction-modifier} to include explicit task reductions
within their region, provided the reduction operators (\plc{reduction-identifiers})
and variables (list items) of the participating tasks match those of the
implicit tasks.
There are 2 reduction use cases (identified by USE CASE \#) in the \example{task_reduction.2} example below.
In USE CASE 1 a \kcode{task} modifier in the \kcode{reduction} clause
of the \kcode{parallel} construct is used to include the reductions of any
participating tasks, those with an \kcode{in_reduction} clause and matching
\plc{reduction-identifiers} (\kcode{+}) and list items (\ucode{x}).
Note, a \kcode{taskgroup} construct (with a \kcode{task_reduction} clause) is not
necessary to scope the explicit task reduction (as seen in the example above).
Hence, even without the implicit task reduction statement (without the C \ucode{x++;}
and Fortran \ucode{x=x+1} statements), the \kcode{task} \plc{reduction-modifier}
in a \kcode{reduction} clause of the \kcode{parallel} construct
can be used to avoid having to create a \kcode{taskgroup} construct
(and its \kcode{task_reduction} clause) around the task generating structure.
In USE CASE 2 tasks participating in the reduction are within a
worksharing region (a parallel worksharing-loop construct).
Here, too, no \kcode{taskgroup} is required, and the \plc{reduction-identifier} (\kcode{+})
and list item (variable \ucode{x}) match as required.
\cexample[5.0]{task_reduction}{2}
\ffreeexample[5.0]{task_reduction}{2}
\subsection{Reduction on Combined Target Constructs}
\label{subsec:target_reduction}
\index{reduction clause@\kcode{reduction} clause!on target construct@on \kcode{target} construct}
\index{constructs!target@\kcode{target}}
\index{target construct@\kcode{target} construct}
When a \kcode{reduction} clause appears on a combined construct that combines
a \kcode{target} construct with another construct, there is an implicit map
of the list items with a \kcode{tofrom} map type for the \kcode{target} construct.
Otherwise, the list items (if they are scalar variables) would be
treated as firstprivate by default in the \kcode{target} construct, which
is unlikely to provide the intended behavior since the result of the
reduction that is in the firstprivate variable would be discarded
at the end of the \kcode{target} region.
In the following example, the use of the \kcode{reduction} clause on \ucode{sum1}
or \ucode{sum2} should, by default, result in an implicit \kcode{tofrom} map for
that variable. So long as neither \ucode{sum1} nor \ucode{sum2} were already
present on the device, the mapping behavior ensures the value for
\ucode{sum1} computed in the first \kcode{target} construct is used in the
second \kcode{target} construct.
Note: a \kcode{declare target} directive is needed for procedures,
\ucode{f} and \ucode{g}, called in \kcode{target} region in Fortran codes.
This directive is not required in C codes because functions, \ucode{f}
and \ucode{g}, are defined in the same compilation unit of the \kcode{target}
construct in which these functions are called.
\cexample[5.0]{target_reduction}{1}
\ffreeexample[5.0]{target_reduction}{1}
%\clearpage
In next example, the variables \ucode{sum1} and \ucode{sum2} remain on the
device for the duration of the \kcode{target data} region so that it is
their device copies that are updated by the reductions. Note the significance
of mapping \ucode{sum1} on the second \kcode{target} construct; otherwise, it
would be treated by default as firstprivate and the result computed for
\ucode{sum1} in the prior \kcode{target} region may not be used. Alternatively, a
\kcode{target update} construct could be used between the two
\kcode{target} constructs to update the host version of \ucode{sum1} with the
value that is in the corresponding device version after the completion of the
first construct.
\cexample[5.0]{target_reduction}{2}
\ffreeexample[5.0]{target_reduction}{2}
\subsection{Task Reduction with Target Constructs}
\label{subsec:target_task_reduction}
\index{in_reduction clause@\kcode{in_reduction} clause}
\index{constructs!target@\kcode{target}}
\index{target construct@\kcode{target} construct}
\index{clauses!enter@\kcode{enter}}
\index{enter clause@\kcode{enter} clause}
The following examples illustrate how task reductions can apply to target tasks
that result from a \kcode{target} construct with the \kcode{in_reduction}
clause. Here, the \kcode{in_reduction} clause specifies that the target task
participates in the task reduction defined in the scope of the enclosing
\kcode{taskgroup} construct. Partial results from all tasks participating in the
task reduction will be combined (in some order) into the original variable
listed in the \kcode{task_reduction} clause before exiting the \kcode{taskgroup}
region.
\cexample[5.2]{target_task_reduction}{1}
\ffreeexample[5.2]{target_task_reduction}{1}
\clearpage
\index{reduction clause@\kcode{reduction} clause!task modifier@\kcode{task} modifier}
\index{task modifier@\kcode{task} modifier}
In the next pair of examples, the task reduction is defined by a
\kcode{reduction} clause with the \kcode{task} modifier, rather than a
\kcode{task_reduction} clause on a \kcode{taskgroup} construct. Again, the
partial results from the participating tasks will be combined in some order
into the original reduction variable, \ucode{sum}.
\cexample[5.2]{target_task_reduction}{2a}
\ffreeexample[5.2]{target_task_reduction}{2a}
\index{in_reduction clause@\kcode{in_reduction} clause!with target construct@with \kcode{target} construct}
\index{constructs!target@\kcode{target}}
\index{target construct@\kcode{target} construct}
Next, the \kcode{task} modifier is again used to define a task reduction over
participating tasks. This time, the participating tasks are a target task
resulting from a \kcode{target} construct with the \kcode{in_reduction} clause,
and the implicit task (executing on the primary thread) that calls
\ucode{host_compute}. As before, the partial results from these participating
tasks are combined in some order into the original reduction variable.
\cexample[5.2]{target_task_reduction}{2b}
\ffreeexample[5.2]{target_task_reduction}{2b}
\subsection{Taskloop Reduction}
\label{subsec:taskloop_reduction}
\index{reduction clause@\kcode{reduction} clause!on taskloop construct@on \kcode{taskloop} construct}
\index{constructs!taskloop@\kcode{taskloop}}
\index{taskloop construct@\kcode{taskloop} construct}
In the OpenMP 5.0 Specification the \kcode{taskloop} construct
was extended to include the reductions.
The following two examples show how to implement a reduction over an array
using taskloop reduction in two different ways.
In the first
example we apply the \kcode{reduction} clause to the \kcode{taskloop} construct. As it was
explained above in the task reduction examples, a reduction over tasks is
divided in two components: the scope of the reduction, which is defined by a
\kcode{taskgroup} region, and the tasks that participate in the reduction. In this
example, the \kcode{reduction} clause defines both semantics. First, it specifies that
the implicit \kcode{taskgroup} region associated with the \kcode{taskloop} construct is the scope of the
reduction, and second, it defines all tasks created by the \kcode{taskloop} construct as
participants of the reduction. About the first property, it is important to note
that if we add the \kcode{nogroup} clause to the \kcode{taskloop} construct the code will be
nonconforming, basically because we have a set of tasks that participate in a
reduction that has not been defined.
\cexample[5.0]{taskloop_reduction}{1}
\ffreeexample[5.0]{taskloop_reduction}{1}
%In the second example, we are computing exactly the same
%value but we do it in a very different way. The first thing that we do in the
%\plc{array\_sum} function is to create a \code{taskgroup} region that defines the scope of a
%new reduction using the \code{task\_reduction} clause.
%After that, we specify that a task and also the tasks generated
%by a taskloop will participate in that reduction using the \code{in\_reduction} clause
%on the \code{task} and \code{taskloop} constructs, respectively. Note that
%we also added the \code{nogroup} clause to the \code{taskloop} construct. This is allowed
%because what we are expressing with the \code{in\_reduction} clause is different
%from what we were expressing with the \code{reduction} clause. In one case we specify
%that the generated tasks will participate in a previously declared reduction
%(\code{in\_reduction} clause) whereas in the other case we specify that we want to
%create a new reduction and also that all tasks generated by the taskloop will
%participate on it.
The second example computes exactly the same value as in the preceding \example{taskloop_reduction.1} code section,
but in a very different way.
First, in the \ucode{array_sum} function a \kcode{taskgroup} region is created
that defines the scope of a new reduction using the \kcode{task_reduction} clause.
After that, a task and also the tasks generated by a taskloop participate in
that reduction by using the \kcode{in_reduction} clause on the \kcode{task}
and \kcode{taskloop} constructs, respectively.
Note that the \kcode{nogroup} clause was added to the \kcode{taskloop} construct.
This is allowed because what is expressed with the \kcode{in_reduction} clause
is different from what is expressed with the \kcode{reduction} clause.
In one case the generated tasks are specified to participate in a previously
declared reduction (\kcode{in_reduction} clause) whereas in the other case
creation of a new reduction is specified and also all tasks generated
by the taskloop will participate on it.
\cexample[5.0]{taskloop_reduction}{2}
\ffreeexample[5.0]{taskloop_reduction}{2}
%\clearpage
In the OpenMP 5.0 Specification, \kcode{reduction} clauses for the
\kcode{taskloop simd} construct were also added.
\index{reduction clause@\kcode{reduction} clause!on taskloop simd construct@on \kcode{taskloop simd} construct}
\index{combined constructs!taskloop simd@\kcode{taskloop simd}}
\index{taskloop simd construct@\kcode{taskloop simd} construct}
The examples below compare reductions for the \kcode{taskloop} and the \kcode{taskloop simd} constructs.
These examples illustrate the use of \kcode{reduction} clauses within
``stand-alone'' \kcode{taskloop} constructs, and the use of \kcode{in_reduction} clauses for tasks of taskloops to participate
with other reductions within the scope of a parallel region.
\textbf{taskloop reductions:}
In the \plc{taskloop reductions} section of the example below,
\example{taskloop 1} uses the \kcode{reduction} clause
in a \kcode{taskloop} construct for a sum reduction, accumulated in \ucode{asum}.
The behavior is as though a \kcode{taskgroup} construct encloses the
taskloop region with a \kcode{task_reduction} clause, and each taskloop
task has an \kcode{in_reduction} clause with the specifications
of the \kcode{reduction} clause.
At the end of the taskloop region \ucode{asum} contains the result of the reduction.
The next taskloop, \example{taskloop 2}, illustrates the use of the
\kcode{in_reduction} clause to participate in a previously defined
reduction scope of a \kcode{parallel} construct.
The task reductions of \example{task 2} and \example{taskloop 2} are combined
across the \kcode{taskloop} construct and the single \kcode{task} construct, as specified
in the \kcode{reduction(task,+: \ucode{asum})} clause of the \kcode{parallel} construct.
At the end of the parallel region \ucode{asum} contains the combined result of all reductions.
\textbf{taskloop simd reductions:}
Reductions for the \kcode{taskloop simd} construct are shown in the second half of the code.
Since each component construct, \kcode{taskloop} and \kcode{simd},
can accept a reduction clause, the \kcode{taskloop simd} construct
is a composite construct, and the specific application of the reduction clause is defined
within the \docref{\kcode{taskloop simd} Construct} section of the OpenMP 5.0 Specification.
The code below illustrates use cases for these reductions.
In the \plc{taskloop simd reduction} section of the example below,
\example{taskloop simd 3} uses the \kcode{reduction} clause
in a \kcode{taskloop simd} construct for a sum reduction within a loop.
For this case a \kcode{reduction} clause is used, as one would use
for a \kcode{simd} construct.
The SIMD reductions of each task are combined, and the results of these tasks are further
combined just as in the \kcode{taskloop} construct with the \kcode{reduction} clause for \example{taskloop 1}.
At the end of the taskloop region \ucode{asum} contains the combined result of all reductions.
If a \kcode{taskloop simd} construct is to participate in a previously defined
reduction scope, the reduction participation should be specified with
a \kcode{in_reduction} clause, as shown in the \kcode{parallel} region enclosing
\example{task 4} and \example{taskloop simd 4} code sections.
Here the \kcode{taskloop simd} construct's
\kcode{in_reduction} clause specifies participation of the construct's tasks as
a task reduction within the scope of the parallel region.
That is, the results of each task of the \kcode{taskloop} construct component
contribute to the reduction in a broader level, just as in \example{parallel reduction a} code section above.
Also, each \kcode{simd}-component construct
occurs as if it has a \kcode{reduction} clause, and the
SIMD results of each task are combined as though to form a single result for
each task (that participates in the \kcode{in_reduction} clause).
At the end of the parallel region \ucode{asum} contains the combined result of all reductions.
%Just as in \plc{parallel reduction a} the
%\code{taskloop simd} construct reduction results are combined
%with the \code{task} construct reduction results
%as specified by the \code{in\_reduction} clause of the \code{task} construct
%and the \plc{task} reduction-modifier of the \code{reduction} clause of
%the \code{parallel} construct.
%At the end of the parallel region \plc{asum} contains the combined result of all reductions.
\cexample[5.1]{taskloop_simd_reduction}{1}
\ffreeexample[5.1]{taskloop_simd_reduction}{1}
\subsection{Reduction with the \kcode{scope} Construct}
\label{subsec:reduction_scope}
\index{reduction clause@\kcode{reduction} clause!on scope construct@on \kcode{scope} construct}
\index{constructs!scope@\kcode{scope}}
\index{scope construct@\kcode{scope} construct}
The following example illustrates the use of the \kcode{scope} construct
to perform a reduction in a \kcode{parallel} region. The case is useful for
producing a reduction and accessing reduction variables inside a \kcode{parallel} region
without using a worksharing-loop construct.
\cppexample[5.1]{scope_reduction}{1}
\clearpage
\ffreeexample[5.1]{scope_reduction}{1}
\subsection{Reduction on Private Variables in a \kcode{parallel} Region}
\label{subsec:priv_reduction}
\index{reduction clause@\kcode{reduction} clause!on private variables}
\index{reduction clause@\kcode{reduction} clause!original(private) modifier@\kcode{original(private)} modifier}
The following example shows reduction on a private variable (\ucode{sum_v})
for an orphaned worksharing loop in routine \ucode{do_red},
which is called in a \kcode{parallel} region.
At the end of the loop, private variable of each thread should have the same combined value.
\cexample[6.0]{priv_reduction}{1}
\ffreeexample[6.0]{priv_reduction}{1}
The following example is slightly modified from the previous example
where the \kcode{original(private)} modifier is explicitly specified
for variable \ucode{sum_v} in the \kcode{reduction} clause.
This modifier indicates that variable \ucode{sum_v} is private
for reduction as opposed to shared by default for a variable
passed as a procedure argument.
\cppexample[6.0]{priv_reduction}{2}
\ffreeexample[6.0]{priv_reduction}{2}
The following example shows the effect of nested \kcode{reduction} constructs.
For the \kcode{parallel} construct, the reduction is on the shared variable
\ucode{x}. For the worksharing loop nested inside the \kcode{parallel}
region, the reduction is performed on the private copy of \ucode{x}
for each thread.
With 4 threads assigned for the \kcode{parallel} region
(enforced by the \kcode{strict} modifier in the \kcode{num_threads} clause),
the code should print 40 at the end.
\cexample[6.0]{priv_reduction}{3}
\ffreeexample[6.0]{priv_reduction}{3}