v5.0.1 release

This commit is contained in:
Henry Jin 2020-06-26 07:54:45 -07:00
parent eaec9ede64
commit 3052c10566
406 changed files with 14118 additions and 557 deletions

View File

@ -48,15 +48,14 @@ document.
In this chapter, examples illustrate how race conditions may arise for accesses
to variables with a \plc{shared} data-sharing attribute when flush operations
are not properly employed. A race condition can exist when two or more threads
are involved in accessing a variable in which not all of the accesses are
reads; that is, a WaR, RaW or WaW condition exists (R=read, a=after, W=write).
A RaR does not produce a race condition. In particular, a data race will arise
when conflicting accesses do not have a well-defined \emph{completion order}.
The existence of data races in OpenMP programs result in undefined behavior,
and so they should generally be avoided for programs to be correct. The
completion order of accesses to a shared variable is guaranteed in OpenMP
through a set of memory consistency rules that are described in the \plc{OpenMP
Memory Consitency} section of the OpenMP Specifications document.
are involved in accessing a variable and at least one of the accesses modifies
the variable. In particular, a data race will arise when conflicting accesses
do not have a well-defined \emph{completion order}. The existence of data
races in OpenMP programs results in undefined behavior, and so they should
generally be avoided for programs to be correct. The completion order of
accesses to a shared variable is guaranteed in OpenMP through a set of memory
consistency rules that are described in the \plc{OpenMP Memory Consistency}
section of the OpenMP Specifications document.
%This chapter also includes examples that exhibit non-sequentially consistent
%(\emph{non-SC}) behavior. Sequential consistency (\emph{SC}) is the desirable

View File

@ -40,8 +40,9 @@ for executing the task to completion, even though it may leave the
execution at a scheduling point and return later. The thread is tied
to the task. Scheduling points can be introduced with the \code{taskyield}
construct. With an \code{untied} clause any other thread is allowed to continue
the task. An \code{if} clause with a \plc{true} expression allows the
generating thread to immediately execute the task as an undeferred task.
the task. An \code{if} clause with an expression that evaluates to \plc{false}
results in an \emph{undeferred} task, which instructs the runtime to suspend
the generating task until the undeferred task completes its execution.
By including the data environment of the generating task into the generated task with the
\code{mergeable} and \code{final} clauses, task generation overhead can be reduced.
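As a rough illustration (a standalone sketch, not one of the bundled tasking examples; the messages are made up), an \code{if} clause that evaluates to false produces an undeferred task:

#include <stdio.h>
#include <omp.h>

int main(void) {
   #pragma omp parallel num_threads(2)
   #pragma omp single
   {
      // if(0) is false: the task is undeferred, so the generating task
      // is suspended here until the task below completes
      #pragma omp task if(0)
      printf("undeferred task run by thread %d\n", omp_get_thread_num());

      printf("this line runs only after the undeferred task finished\n");
   }
   return 0;
}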

View File

@ -19,3 +19,8 @@ source form. \plc{ext} is one of the following:
\item \plc{f90} -- Fortran code in free form.
\end{compactitem}
Some of the example labels may include version information
(\code{\small{}omp\_\plc{verno}}) to indicate features that are illustrated
by an example for a specific OpenMP version, such as ``\plc{scan.1.c}
\;(\code{\small{}omp\_5.0}).''

View File

@ -5,9 +5,9 @@
The following example illustrates the basic use of the \code{simd} construct
to assure the compiler that the loop can be vectorized.
\cexample{SIMD}{1}
\cexample[4.0]{SIMD}{1}
\ffreeexample{SIMD}{1}
\ffreeexample[4.0]{SIMD}{1}
\clearpage
@ -40,9 +40,9 @@ In the \code{simd} constructs for the loops the \code{private(tmp)} clause is
necessary to assure that each vector operation has its own \plc{tmp}
variable.
\cexample{SIMD}{2}
\cexample[4.0]{SIMD}{2}
\ffreeexample{SIMD}{2}
\ffreeexample[4.0]{SIMD}{2}
\pagebreak
A thread that encounters a SIMD construct executes a vectorized code of the
@ -52,9 +52,9 @@ privatized and declared as reductions with clauses. The example below
illustrates the use of \code{private} and \code{reduction} clauses in a SIMD
construct.
\cexample{SIMD}{3}
\cexample[4.0]{SIMD}{3}
\ffreeexample{SIMD}{3}
\ffreeexample[4.0]{SIMD}{3}
\pagebreak
code is safe for vectors up to and including size 16. In the loop, \plc{m} must
be 16 or greater for correct code execution. If the value of \plc{m} is less
than 16, the behavior is undefined.
\cexample{SIMD}{4}
\cexample[4.0]{SIMD}{4}
\ffreeexample{SIMD}{4}
\ffreeexample[4.0]{SIMD}{4}
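A condensed sketch of this pattern (assumed array size and initialization; not the bundled \plc{SIMD.4} source):

#include <stdio.h>
#define N 128

int main(void) {
   float a[N];
   int   m = 16;                    // any m >= 16 is safe with safelen(16)
   for (int i = 0; i < N; i++) a[i] = (float)i;

   // safelen(16) asserts there is no loop-carried dependence shorter than 16
   #pragma omp simd safelen(16)
   for (int i = m; i < N; i++)
      a[i] = a[i - m] + 1.0f;

   printf("a[%d] = %f\n", N - 1, a[N - 1]);
   return 0;
}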
\pagebreak
The following SIMD construct instructs the compiler to collapse the \plc{i} and
@ -78,9 +78,9 @@ The following SIMD construct instructs the compiler to collapse the \plc{i} and
threads of the team. Within the workshared loop chunks of a thread, the SIMD
chunks are executed in the lanes of the vector units.
\cexample{SIMD}{5}
\cexample[4.0]{SIMD}{5}
\ffreeexample{SIMD}{5}
\ffreeexample[4.0]{SIMD}{5}
%%% section
@ -95,9 +95,9 @@ the other hand, the \code{inbranch} clause for the function goo indicates that
the function is always called conditionally in the SIMD loop inside
the function \plc{myaddfloat}.
\cexample{SIMD}{6}
\cexample[4.0]{SIMD}{6}
\ffreeexample{SIMD}{6}
\ffreeexample[4.0]{SIMD}{6}
In the code below, the function \plc{fib()} is called in the main program and
@ -106,9 +106,9 @@ condition. The compiler creates a masked vector version and a non-masked vector
version for the function \plc{fib()} while retaining the original scalar
version of the \plc{fib()} function.
\cexample{SIMD}{7}
\cexample[4.0]{SIMD}{7}
\ffreeexample{SIMD}{7}
\ffreeexample[4.0]{SIMD}{7}
@ -124,7 +124,7 @@ A loop can be vectorized even though the iterations are not completely independe
This test assures that the compiler preserves the loop-carried lexical forward dependence when generating correct SIMD code.
\cexample{SIMD}{8}
\cexample[4.0]{SIMD}{8}
\ffreeexample{SIMD}{8}
\ffreeexample[4.0]{SIMD}{8}

View File

@ -67,8 +67,8 @@ by thread 0 and the read from \plc{x} by thread 1, and so thread 1 must see that
\plc{x} equals 10.
\pagebreak
\cexample{acquire_release}{1}
\ffreeexample{acquire_release}{1}
\cexample[5.0]{acquire_release}{1}
\ffreeexample[5.0]{acquire_release}{1}
In the second example, the \code{critical} constructs are exchanged with
\code{atomic} constructs that have \textit{explicit} memory ordering specified. When the
@ -77,8 +77,8 @@ results in a release/acquire synchronization that in turn implies that the
assignment to \plc{x} on thread 0 happens before the read of \plc{x} on thread
1. Therefore, thread 1 will print ``x = 10''.
\cexample{acquire_release}{2}
\ffreeexample{acquire_release}{2}
\cexample[5.0]{acquire_release}{2}
\ffreeexample[5.0]{acquire_release}{2}
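A minimal sketch of the release/acquire pairing described here (it assumes at least two threads are available; it is not the bundled \plc{acquire\_release.2} source):

#include <stdio.h>
#include <omp.h>

int main(void) {
   int x = 0, flag = 0;
   #pragma omp parallel num_threads(2)
   {
      if (omp_get_thread_num() == 0) {
         x = 10;
         // release flush: orders the write to x before the flag update
         #pragma omp atomic write release
         flag = 1;
      } else {
         int f = 0;
         while (f == 0) {
            // acquire flush: pairs with the release above
            #pragma omp atomic read acquire
            f = flag;
         }
         printf("x = %d\n", x);   // guaranteed to print x = 10
      }
   }
   return 0;
}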
\pagebreak
In the third example, \code{atomic} constructs that specify relaxed atomic
@ -105,8 +105,8 @@ construct used in Example 2 for thread 1.
%}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%3
\cexample{acquire_release}{3}
\ffreeexample{acquire_release}{3}
\cexample[5.0]{acquire_release}{3}
\ffreeexample[5.0]{acquire_release}{3}
Example 4 will fail to order the write to \plc{x} on thread 0 before the read
from \plc{x} on thread 1. Importantly, the implicit release flush on exit from
@ -137,5 +137,5 @@ modifies \plc{y} and provides release semantics must be specified.
%by thread 0.
%}
\cexample{acquire_release_broke}{4}
\ffreeexample{acquire_release_broke}{4}
\cexample[5.0]{acquire_release_broke}{4}
\ffreeexample[5.0]{acquire_release_broke}{4}

View File

@ -34,9 +34,9 @@ the partition list when the number of threads is less than or equal to the numbe
of places in the parent's place partition, for the machine architecture depicted
above. Note that the threads are bound to the first place of each subpartition.
\cexample{affinity}{1}
\cexample[4.0]{affinity}{1}
\fexample{affinity}{1}
\fexample[4.0]{affinity}{1}
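The following sketch merely shows the clause and how a thread can report its binding (it is not the bundled \plc{affinity.1} source, and the output depends on the available places):

#include <stdio.h>
#include <omp.h>

int main(void) {
   // spread: threads are spread over the parent's place partition
   #pragma omp parallel proc_bind(spread) num_threads(4)
   {
      #pragma omp critical
      printf("thread %d bound to place %d of %d\n",
             omp_get_thread_num(), omp_get_place_num(), omp_get_num_places());
   }
   return 0;
}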
It is unspecified on which place the master thread is initially started. If the
master thread is initially started on p0, the following placement of threads will
@ -75,9 +75,9 @@ parent's place partition. The first \plc{T/P} threads of the team (including the
thread) execute on the parent's place. The next \plc{T/P} threads execute on the next
place in the place partition, and so on, with wrap around.
\cexample{affinity}{2}
\cexample[4.0]{affinity}{2}
\ffreeexample{affinity}{2}
\ffreeexample[4.0]{affinity}{2}
It is unspecified on which place the master thread is initially started. If the
master thread is initially started on p0, the following placement of threads will
@ -130,9 +130,9 @@ the partition list when the number of threads is less than or equal to the numbe
of places in parent's place partition, for the machine architecture depicted above.
The place partition is not changed by the \code{close} policy.
\cexample{affinity}{3}
\cexample[4.0]{affinity}{3}
\fexample{affinity}{3}
\fexample[4.0]{affinity}{3}
It is unspecified on which place the master thread is initially started. If the
master thread is initially started on p0, the following placement of threads will
@ -171,9 +171,9 @@ thread) execute on the parent's place. The next \plc{T/P} threads execute on the
place in the place partition, and so on, with wrap around. The place partition
is not changed by the \code{close} policy.
\cexample{affinity}{4}
\cexample[4.0]{affinity}{4}
\ffreeexample{affinity}{4}
\ffreeexample[4.0]{affinity}{4}
It is unspecified on which place the master thread is initially started. If the
master thread is initially running on p0, the following placement of threads will
@ -225,9 +225,9 @@ The following example shows the result of the \code{master} affinity policy on
the partition list for the machine architecture depicted above. The place partition
is not changed by the master policy.
\cexample{affinity}{5}
\cexample[4.0]{affinity}{5}
\fexample{affinity}{5}
\fexample[4.0]{affinity}{5}
It is unspecified on which place the master thread is initially started. If the
master thread is initially running on p0, the following placement of threads will

View File

@ -28,9 +28,9 @@ not changed, so affinity is NOT reported.
In the last parallel region, the thread affinities are reported
because the thread affinity has changed.
\cexample{affinity_display}{1}
\cexample[5.0]{affinity_display}{1}
\ffreeexample{affinity_display}{1}
\ffreeexample[5.0]{affinity_display}{1}
In the following example, two threads are forked, and each executes on a socket. Next,
@ -58,9 +58,9 @@ the parallel nesting level (\%L), the ancestor thread number (\%a), the thread n
and the thread affinity (\%A). In the nested parallel region within the \plc{socket\_work} routine
the affinities for the threads on each socket are printed according to this format.
\cexample{affinity_display}{2}
\cexample[5.0]{affinity_display}{2}
\ffreeexample{affinity_display}{2}
\ffreeexample[5.0]{affinity_display}{2}
The next example illustrates more details about affinity formatting.
First, the \code{omp\_get\_affinity\_format()} API routine is used to
@ -98,7 +98,7 @@ The maximum value for the number of characters (\plc{nchars}) returned by
clause and the \plc{if(nchars >= max\_req\_store) max\_req\_store=nchars} statement.
It is used to report possible truncation (if \plc{max\_req\_store} > \plc{buffer\_store}).
\cexample{affinity_display}{3}
\cexample[5.0]{affinity_display}{3}
\ffreeexample{affinity_display}{3}
\ffreeexample[5.0]{affinity_display}{3}
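A simplified sketch of these routines (buffer sizes are illustrative and smaller than in the bundled \plc{affinity\_display.3} source):

#include <stdio.h>
#include <omp.h>
#define FORMAT_STORE 80
#define BUFFER_STORE 80

int main(void) {
   char fmt[FORMAT_STORE];
   // query the current affinity format string
   size_t n = omp_get_affinity_format(fmt, (size_t)FORMAT_STORE);
   printf("affinity format = \"%s\" (requires %zu chars)\n", fmt, n);

   #pragma omp parallel
   {
      char buf[BUFFER_STORE];
      // capture this thread's affinity string; NULL means use the current format
      size_t nchars = omp_capture_affinity(buf, (size_t)BUFFER_STORE, NULL);
      #pragma omp critical
      {
         if (nchars >= BUFFER_STORE) printf("output truncated\n");
         printf("%s\n", buf);
      }
   }
   return 0;
}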

View File

@ -37,7 +37,7 @@ On some systems there are utilities, files or user guides that provide configura
information. For instance, the socket number and proc\_id's for a socket
can be found in the /proc/cpuinfo text file on Linux systems.
\cexample{affinity_query}{1}
\cexample[4.5]{affinity_query}{1}
\ffreeexample{affinity_query}{1}
\ffreeexample[4.5]{affinity_query}{1}

View File

@ -57,7 +57,7 @@ and the set of all variables used in the allocate statement is specified in the
%\pagebreak
\cexample{allocators}{1}
\ffreeexample{allocators}{1}
\cexample[5.0]{allocators}{1}
\ffreeexample[5.0]{allocators}{1}

View File

@ -8,31 +8,31 @@ on \code{target} and \code{target} \code{data} constructs.
This example shows the invalid usage of two separate sections of the same array
inside of a \code{target} construct.
\cexample{array_sections}{1}
\cexample[4.0]{array_sections}{1}
\ffreeexample{array_sections}{1}
\ffreeexample[4.0]{array_sections}{1}
\pagebreak
This example shows the invalid usage of two separate sections of the same array
inside of a \code{target} construct.
\cexample{array_sections}{2}
\cexample[4.0]{array_sections}{2}
\ffreeexample{array_sections}{2}
\ffreeexample[4.0]{array_sections}{2}
\pagebreak
This example shows the valid usage of two separate sections of the same array inside
of a \code{target} construct.
\cexample{array_sections}{3}
\cexample[4.0]{array_sections}{3}
\ffreeexample{array_sections}{3}
\ffreeexample[4.0]{array_sections}{3}
\pagebreak
This example shows the valid usage of a wholly contained array section of an already
mapped array section inside of a \code{target} construct.
\cexample{array_sections}{4}
\cexample[4.0]{array_sections}{4}
\ffreeexample{array_sections}{4}
\ffreeexample[4.0]{array_sections}{4}

View File

@ -23,5 +23,13 @@ Note the use of additional parentheses
around the shape-operator and $a$ to ensure the correct precedence
over array-section operations.
\cnexample{array_shaping}{1}
\cnexample[5.0]{array_shaping}{1}
\ccppspecificend
The shape operator is not defined for Fortran. Explicit array shaping
of procedure arguments can be used instead to achieve a similar goal.
Below is the Fortran equivalent of the above example; it illustrates
support for transferring two rows of noncontiguous boundary
data in the \code{target}~\code{update} directive.
\ffreeexample[5.0]{array_shaping}{1}

View File

@ -11,13 +11,13 @@ name \plc{b} is associated with the shared variable \plc{a}. With the predetermi
attribute rule, the associate name \plc{b} is not allowed to be specified on the \code{private}
clause.
\fnexample{associate}{1}
\fnexample[4.0]{associate}{1}
In the next example, within the \code{parallel} construct, the association name \plc{thread\_id}
is associated with the private copy of \plc{i}. The print statement should output the
unique thread number.
\fnexample{associate}{2}
\fnexample[4.0]{associate}{2}
The following example illustrates the effect of specifying a selector name on a data-sharing
attribute clause. The associate name \plc{u} is associated with \plc{v} and the variable \plc{v}
@ -27,6 +27,6 @@ The association between \plc{u} and the original \plc{v} is retained (see the Da
Attribute Rules section in the OpenMP 4.0 API Specifications). Inside the \code{parallel}
region, \plc{v} has the value of -1 and \plc{u} has the value of the original \plc{v}.
\ffreenexample{associate}{3}
\ffreenexample[4.0]{associate}{3}
\fortranspecificend

View File

@ -26,6 +26,6 @@ one thread is being used for offload generation. In the situation where
little time is spent by the \plc{target task} in setting
up and tearing down the target execution, \code{static} scheduling may be desired.
\cexample{async_target}{3}
\cexample[4.5]{async_target}{3}
\ffreeexample{async_target}{3}
\ffreeexample[4.5]{async_target}{3}

View File

@ -11,8 +11,8 @@ The last dependence is produced by array \plc{p} with the \code{out} dependence
The \code{nowait} clause on the \code{target} construct creates a deferrable \plc{target task}, allowing the encountering task to continue execution without waiting for the completion of the \plc{target task}.
\cexample{async_target}{4}
\cexample[4.5]{async_target}{4}
\ffreeexample{async_target}{4}
\ffreeexample[4.5]{async_target}{4}
%end

View File

@ -9,20 +9,20 @@ scheduling point while waiting for the execution of the \code{target} region
to complete, allowing the thread to switch back to the execution of the encountering
task or one of the previously generated explicit tasks.
\cexample{async_target}{1}
\cexample[4.0]{async_target}{1}
\pagebreak
The Fortran version has an interface block that contains the \code{declare} \code{target}.
An identical statement exists in the function declaration (not shown here).
\ffreeexample{async_target}{1}
\ffreeexample[4.0]{async_target}{1}
The following example shows how the \code{task} and \code{target} constructs
are used to execute multiple \code{target} regions asynchronously. The task dependence
ensures that the storage is allocated and initialized on the device before it is
accessed.
\cexample{async_target}{2}
\cexample[4.0]{async_target}{2}
The Fortran example below is similar to the C version above. Instead of pointers, though, it uses
the convenience of Fortran allocatable arrays on the device. In order to preserve the arrays
@ -52,4 +52,4 @@ section of the specification.)
However, the intention is to relax the restrictions on mapping of allocatable variables in the next release
of the specification so that the example will be compliant.
\ffreeexample{async_target}{2}
\ffreeexample[4.0]{async_target}{2}

View File

@ -14,9 +14,9 @@ Note that the \code{atomic} directive applies only to the statement immediately
following it. As a result, elements of \plc{y} are not updated atomically in
this example.
\cexample{atomic}{1}
\cexample[3.1]{atomic}{1}
\fexample{atomic}{1}
\fexample[3.1]{atomic}{1}
The following example illustrates the \code{read} and \code{write} clauses
for the \code{atomic} directive. These clauses ensure that the given variable
@ -26,9 +26,9 @@ another part of the variable. Note that most hardware provides atomic reads and
writes for some set of properly aligned variables of specific sizes, but not necessarily
for all the variable types supported by the OpenMP API.
\cexample{atomic}{2}
\cexample[3.1]{atomic}{2}
\fexample{atomic}{2}
\fexample[3.1]{atomic}{2}
The following example illustrates the \code{capture} clause for the \code{atomic}
directive. In this case the value of a variable is captured, and then the variable
@ -37,8 +37,8 @@ be implemented using the fetch-and-add instruction available on many kinds of ha
The example also shows a way to implement a spin lock using the \code{capture}
and \code{read} clauses.
\cexample{atomic}{3}
\cexample[3.1]{atomic}{3}
\fexample{atomic}{3}
\fexample[3.1]{atomic}{3}
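A small self-contained sketch of the \code{capture} clause used as a fetch-and-add (the ticket counter is made up for illustration):

#include <stdio.h>

int fetch_and_add(int *p) {
   int old;
   // atomically read the old value and increment the shared counter
   #pragma omp atomic capture
   { old = *p; *p += 1; }
   return old;
}

int main(void) {
   int counter = 0;
   #pragma omp parallel num_threads(4)
   {
      int ticket = fetch_and_add(&counter);
      printf("got ticket %d\n", ticket);
   }
   printf("final counter = %d\n", counter);   // 4 if four threads are granted
   return 0;
}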

View File

@ -5,21 +5,21 @@
The following non-conforming examples illustrate the restrictions on the \code{atomic}
construct.
\cexample{atomic_restrict}{1}
\cexample[3.1]{atomic_restrict}{1}
\fexample{atomic_restrict}{1}
\fexample[3.1]{atomic_restrict}{1}
\cexample{atomic_restrict}{2}
\cexample[3.1]{atomic_restrict}{2}
\fortranspecificstart
The following example is non-conforming because \code{I} and \code{R} reference
the same location but have different types.
\fnexample{atomic_restrict}{2}
\fnexample[3.1]{atomic_restrict}{2}
Although the following example might work on some implementations, this is also
non-conforming:
\fnexample{atomic_restrict}{3}
\fnexample[3.1]{atomic_restrict}{3}
\fortranspecificend

View File

@ -11,7 +11,7 @@ exception is properly handled in the sequential part. If cancellation of the \co
region has been requested, some threads might have executed \code{phase\_1()}.
However, it is guaranteed that none of the threads executed \code{phase\_2()}.
\cppexample{cancellation}{1}
\cppexample[4.0]{cancellation}{1}
The following example illustrates the use of the \code{cancel} construct in error
@ -20,7 +20,7 @@ the cancellation is activated. The encountering thread sets the shared variable
\code{err} and other threads of the binding thread set proceed to the end of
the worksharing construct after the cancellation has been activated.
\ffreeexample{cancellation}{1}
\ffreeexample[4.0]{cancellation}{1}
\clearpage
@ -34,11 +34,11 @@ task group to control the effect of the \code{cancel taskgroup} directive. The
\plc{level} argument is used to create undeferred tasks after the first ten
levels of the tree.
\cexample{cancellation}{2}
\cexample[4.0]{cancellation}{2}
The following is the equivalent parallel search example in Fortran.
\ffreeexample{cancellation}{2}
\ffreeexample[4.0]{cancellation}{2}

View File

@ -16,9 +16,9 @@ The variable \code{j} can be omitted from the \code{private} clause when the
from the \code{private} clause. In either case, \code{k} is implicitly private
and could be omitted from the \code{private} clause.
\cexample{collapse}{1}
\cexample[3.0]{collapse}{1}
\fexample{collapse}{1}
\fexample[3.0]{collapse}{1}
In the next example, the \code{k} and \code{j} loops are associated with the
loop construct. So the iterations of the \code{k} and \code{j} loops are collapsed
@ -33,9 +33,9 @@ will have the value \code{2} and \code{j} will have the value \code{3}. Since
by the sequentially last iteration of the collapsed \code{k} and \code{j} loop.
This example prints: \code{2 3}.
\cexample{collapse}{2}
\cexample[3.0]{collapse}{2}
\fexample{collapse}{2}
\fexample[3.0]{collapse}{2}
The next example illustrates the interaction of the \code{collapse} and \code{ordered}
clauses.
@ -71,8 +71,8 @@ The code prints
\\
\code{1 3 2}
\cexample{collapse}{3}
\cexample[3.0]{collapse}{3}
\fexample{collapse}{3}
\fexample[3.0]{collapse}{3}

View File

@ -10,5 +10,5 @@ The following example shows the use of reference types in data-sharing clauses i
Additionally, it shows how the data-sharing of formal arguments with a C++ reference type on an orphaned task-generating construct is determined implicitly. (See the Data-sharing Attribute Rules for Variables Referenced in a Construct Section of the 4.5 OpenMP specification.)
\cppnexample{cpp_reference}{1}
\cppnexample[4.5]{cpp_reference}{1}
\cppspecificend

View File

@ -17,4 +17,4 @@ The following example extends the previous example by adding the \code{hint} cla
\cexample{critical}{2}
\fexample{critical}{2}
\fexample[4.5]{critical}{2}

View File

@ -16,7 +16,7 @@ the \code{target} region (thus \code{fib}) will execute on the host device.
For C/C++ codes the declaration of the function \code{fib} appears between the \code{declare}
\code{target} and \code{end} \code{declare} \code{target} directives.
\cexample{declare_target}{1}
\cexample[4.0]{declare_target}{1}
The Fortran \code{fib} subroutine contains a \code{declare} \code{target} declaration
to indicate to the compiler to create a device executable version of the procedure.
@ -27,7 +27,7 @@ The program uses the \code{module\_fib} module, which presents an explicit inter
the compiler with the \code{declare} \code{target} declarations for processing
the \code{fib} call.
\ffreeexample{declare_target}{1}
\ffreeexample[4.0]{declare_target}{1}
The next Fortran example shows the use of an external subroutine. Without an explicit
interface (through module use or an interface block) the \code{declare} \code{target}
declarations within an external subroutine are unknown to the main program unit;
therefore, a \code{declare} \code{target} must be provided within the program
scope for the compiler to determine that a target binary should be available.
\ffreeexample{declare_target}{2}
\ffreeexample[4.0]{declare_target}{2}
\subsection{\code{declare} \code{target} Construct for Class Type}
\label{subsec:declare_target_class}
@ -47,7 +47,7 @@ of a variable \plc{varY} with a class type \code{typeY}. The member function \co
be accessed on a target device because its declaration did not appear between \code{declare}
\code{target} and \code{end} \code{declare} \code{target} directives.
\cppnexample{declare_target}{2}
\cppnexample[4.0]{declare_target}{2}
\cppspecificend
\subsection{\code{declare} \code{target} and \code{end} \code{declare} \code{target} for Variables}
@ -65,13 +65,13 @@ is then used to manage the consistency of the variables \plc{p}, \plc{v1}, and \
data environment of the encountering host device task and the implicit device data
environment of the default target device.
\cexample{declare_target}{3}
\cexample[4.0]{declare_target}{3}
The Fortran version of the above C code uses a different syntax. Fortran modules
use a list syntax on the \code{declare} \code{target} directive to declare
mapped variables.
\ffreeexample{declare_target}{3}
\ffreeexample[4.0]{declare_target}{3}
The following example also indicates that the function \code{Pfun()} is available on the
target device, as well as the variable \plc{Q}, which is mapped to the implicit device
@ -84,7 +84,7 @@ In the following example, the function and variable declarations appear between
the \code{declare} \code{target} and \code{end} \code{declare} \code{target}
directives.
\cexample{declare_target}{4}
\cexample[4.0]{declare_target}{4}
The Fortran version of the above C code uses a different syntax. In Fortran modules
a list syntax on the \code{declare} \code{target} directive is used to declare
@ -93,7 +93,7 @@ separated list. When the \code{declare} \code{target} directive is used to
declare just the procedure, the procedure name need not be listed -- it is implicitly
assumed, as illustrated in the \code{Pfun()} function.
\ffreeexample{declare_target}{4}
\ffreeexample[4.0]{declare_target}{4}
\subsection{\code{declare} \code{target} and \code{end} \code{declare} \code{target} with \code{declare} \code{simd}}
\label{subsec:declare_target_simd}
@ -104,7 +104,7 @@ is available on a target device. The \code{declare} \code{simd} directive indica
that there is a SIMD version of the function \code{P()} that is available on the target
device as well as one that is available on the host device.
\cexample{declare_target}{5}
\cexample[4.0]{declare_target}{5}
The Fortran version of the above C code uses a different syntax. Fortran modules
use a list syntax of the \code{declare} \code{target} declaration for the mapping.
@ -113,7 +113,7 @@ The function declaration does not use a list and implicitly assumes the function
name. In this Fortran example row and column indices are reversed relative to the
C/C++ example, as is usual for codes optimized for memory access.
\ffreeexample{declare_target}{5}
\ffreeexample[4.0]{declare_target}{5}
\subsection{\code{declare}~\code{target} Directive with \code{link} Clause}
@ -137,6 +137,6 @@ globally on the device for part of the program execution. The single precision d
are allocated and persist only for the first \code{target} region. Similarly, the
double precision data are in scope on the device only for the second \code{target} region.
\cexample{declare_target}{6}
\ffreeexample{declare_target}{6}
\cexample[4.5]{declare_target}{6}
\ffreeexample[4.5]{declare_target}{6}

View File

@ -44,6 +44,6 @@ effectively destroying the depend object.
After an object has been uninitialized it can be initialized again
with a new dependence type \emph{and} a new variable.
\cexample{depobj}{1}
\cexample[5.0]{depobj}{1}
\ffreeexample{depobj}{1}
\ffreeexample[5.0]{depobj}{1}
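A minimal sketch of a depend object's life cycle (initialize, use, destroy); it is not the bundled \plc{depobj.1} source:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int x = 0;
   omp_depend_t obj;

   #pragma omp parallel num_threads(2)
   #pragma omp single
   {
      // initialize the depend object with an inout dependence on x
      #pragma omp depobj(obj) depend(inout: x)

      #pragma omp task depend(depobj: obj)   // first task uses the stored dependence
      x = 1;

      #pragma omp task depend(depobj: obj)   // ordered after the first task
      printf("x = %d\n", x);

      #pragma omp taskwait
      // release the resources held by the depend object
      #pragma omp depobj(obj) destroy
   }
   return 0;
}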

View File

@ -10,9 +10,9 @@ can be used to query if a code is executing on the initial host device or on a
target device. The example then sets the number of threads in the \code{parallel}
region based on where the code is executing.
\cexample{device}{1}
\cexample[4.0]{device}{1}
\ffreeexample{device}{1}
\ffreeexample[4.0]{device}{1}
\subsection{\code{omp\_get\_num\_devices} Routine}
\label{subsec:device_num_devices}
@ -20,9 +20,9 @@ region based on where the code is executing.
The following example shows how the \code{omp\_get\_num\_devices} runtime library routine
can be used to determine the number of devices.
\cexample{device}{2}
\cexample[4.0]{device}{2}
\ffreeexample{device}{2}
\ffreeexample[4.0]{device}{2}
\subsection{\code{omp\_set\_default\_device} and \\
\code{omp\_get\_default\_device} Routines}
@ -32,9 +32,9 @@ The following example shows how the \code{omp\_set\_default\_device} and \code{o
runtime library routines can be used to set the default device and determine the
default device respectively.
\cexample{device}{3}
\cexample[4.0]{device}{3}
\ffreeexample{device}{3}
\ffreeexample[4.0]{device}{3}
\subsection{Target Memory and Device Pointers Routines}
@ -53,5 +53,5 @@ in a \code{target} region by exposing the device pointer in an \code{is\_device\
The example creates an array of cosine values on the default device, to be used
on the host device. The function fails if a default device is not available.
\cexample{device}{4}
\cexample[4.5]{device}{4}

View File

@ -19,9 +19,9 @@ with an \code{ordered} clause without a parameter, on the loop directive,
and a single \code{ordered} directive without the \code{depend} clause
specified for the statement executing the \plc{bar} function.
\cexample{doacross}{1}
\cexample[4.5]{doacross}{1}
\ffreeexample{doacross}{1}
\ffreeexample[4.5]{doacross}{1}
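A compact sketch of the doacross pattern (the computation is made up; it is not the bundled \plc{doacross.1} source):

#include <stdio.h>
#define N 100

int main(void) {
   float A[N], B[N];
   B[0] = 1.0f;

   #pragma omp parallel for ordered(1)
   for (int i = 1; i < N; i++) {
      A[i] = 0.5f * i;                  // independent work, runs in parallel
      // wait until iteration i-1 has passed its depend(source) point
      #pragma omp ordered depend(sink: i-1)
      B[i] = A[i] + B[i-1];             // cross-iteration dependence on B[i-1]
      // signal that this iteration's ordered work is complete
      #pragma omp ordered depend(source)
   }
   printf("B[%d] = %f\n", N - 1, B[N - 1]);
   return 0;
}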
The following code is similar to the previous example but with
\plc{doacross loop nest} extended to two nested loops, \plc{i} and \plc{j},
@ -37,9 +37,9 @@ Likewise, the \code{depend(sink:j-1,i)} and \code{depend(sink:j,i-1)} clauses
in the Fortran code define cross-iteration dependences from iterations
(\plc{j-1, i}) and (\plc{j, i-1}) to iteration (\plc{j, i}).
\cexample{doacross}{2}
\cexample[4.5]{doacross}{2}
\ffreeexample{doacross}{2}
\ffreeexample[4.5]{doacross}{2}
The following example shows the incorrect use of the \code{ordered}
@ -51,9 +51,9 @@ clauses define dependences on lexicographically later
source iterations (\plc{i+1, j}) and (\plc{i, j+1}), which could cause
a deadlock as well since they may not start to execute until the current iteration completes.
\cexample{doacross}{3}
\cexample[4.5]{doacross}{3}
\ffreeexample{doacross}{3}
\ffreeexample[4.5]{doacross}{3}
The following example illustrates the use of the \code{collapse} clause for
@ -63,6 +63,6 @@ The example also shows a compliant usage of the dependence source
directive placed before the corresponding sink directive.
Checking the completion of computation from previous iterations at the sink point can occur after the source statement.
\cexample{doacross}{4}
\cexample[4.5]{doacross}{4}
\ffreeexample{doacross}{4}
\ffreeexample[4.5]{doacross}{4}

View File

@ -22,7 +22,7 @@ to execute double precision code. Two teams are required, and
the thread limit for each team is set to 1/2 of the number of
available processors.
\cexample{host_teams}{1}
\cexample[5.0]{host_teams}{1}
\ffreeexample{host_teams}{1}
\ffreeexample[5.0]{host_teams}{1}

View File

@ -5,6 +5,6 @@
The following example demonstrates how to initialize an array of locks in a \code{parallel} region by using \code{omp\_init\_lock\_with\_hint}.
Note that hints are combined with an \code{|} or \code{+} operator in C/C++ and a \code{+} operator in Fortran.
\cppexample{init_lock_with_hint}{1}
\cppexample[4.5]{init_lock_with_hint}{1}
\fexample{init_lock_with_hint}{1}
\fexample[4.5]{init_lock_with_hint}{1}

View File

@ -11,4 +11,14 @@ sequentially.
\fexample{lastprivate}{1}
\clearpage
The next example illustrates the use of the \code{conditional} modifier in
a \code{lastprivate} clause to return the last value assigned to a variable
when that assignment may not occur in the last iteration of the loop.
The conditional lastprivate ensures that the final value of the variable after
the loop is the value it would have if the loop iterations were executed in
sequential order, preserving the serial semantics of the loop.
\cexample[5.0]{lastprivate}{2}
\ffreeexample[5.0]{lastprivate}{2}
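A small sketch of the conditional modifier (the data and the ``last negative index'' computation are made up, not the bundled \plc{lastprivate.2} source):

#include <stdio.h>
#define N 1000

int main(void) {
   float a[N];
   for (int i = 0; i < N; i++) a[i] = (i % 7 == 0) ? -1.0f : 1.0f;

   int last_neg = -1;
   // conditional: last_neg receives the value from the last iteration,
   // in sequential order, that actually assigns it
   #pragma omp parallel for lastprivate(conditional: last_neg)
   for (int i = 0; i < N; i++) {
      if (a[i] < 0.0f)
         last_neg = i;
   }
   printf("last negative element at index %d\n", last_neg);   // 994
   return 0;
}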

View File

@ -7,7 +7,7 @@ an induction variable (\plc{j}). At the end of the execution of
the loop construct, the original variable \plc{j} is updated with
the value \plc{N/2} from the last iteration of the loop.
\cexample{linear_in_loop}{1}
\cexample[4.5]{linear_in_loop}{1}
\ffreeexample{linear_in_loop}{1}
\ffreeexample[4.5]{linear_in_loop}{1}

View File

@ -0,0 +1,76 @@
%%% section
\section{\code{ref}, \code{val}, \code{uval} Modifiers for \code{linear} Clause}
\label{sec:linear_modifier}
When generating vector functions from \code{declare}~\code{simd} directives, it is important for a compiler to know the proper types of function arguments in
order to generate efficient code.
This is especially true for C++ reference types and Fortran arguments.
In the following example, the function \plc{add\_one2} has a C++ reference
parameter (or Fortran argument) \plc{p}. Variable \plc{p} gets incremented by 1 in the function.
The caller loop \plc{i} in the main program passes
a variable \plc{k} as a reference to the function \plc{add\_one2} call.
The \code{ref} modifier for the \code{linear} clause on the
\code{declare}~\code{simd} directive is used to annotate the
reference-type parameter \plc{p} to match the property of the variable
\plc{k} in the loop.
This use of reference type is equivalent to the second call to
\plc{add\_one2} with a direct passing of the array element \plc{a[i]}.
In the example, the preferred vector
length 8 is specified for both the caller loop and the callee function.
When \code{linear(ref(p))} is applied to an argument passed by reference,
it tells the compiler that the addresses in its vector argument are consecutive,
and so the compiler can generate a single vector load or store instead of
a gather or scatter. This allows more efficient SIMD code to be generated with
fewer source changes.
\cppexample[4.5]{linear_modifier}{1}
\ffreeexample[4.5]{linear_modifier}{1}
\clearpage
The following example is a variant of the above example. The function \plc{add\_one2} in the C++ code includes an additional C++ reference parameter \plc{i}.
The loop index \plc{i} of the caller loop \plc{i} in the main program
is passed as a reference to the function \plc{add\_one2} call.
The loop index \plc{i} has a uniform address with
linear value of step 1 across SIMD lanes.
Thus, the \code{uval} modifier is used for the \code{linear} clause
to annotate the C++ reference-type parameter \plc{i} to match
the property of loop index \plc{i}.
In the corresponding Fortran code, the arguments \plc{p} and
\plc{i} in the routine \plc{add\_one2} are passed by reference.
Similar modifiers are used for these variables in the \code{linear} clauses
to match with the property at the caller loop in the main program.
When \code{linear(uval(i))} is applied to an argument passed by reference, it
tells the compiler that its addresses in the vector argument are uniform
so that the compiler can generate a scalar load or scalar store and create
linear values. This allows more efficient SIMD code to be generated with
fewer source changes.
\cppexample[4.5]{linear_modifier}{2}
\ffreeexample[4.5]{linear_modifier}{2}
In the following example, the function \plc{func} takes arrays \plc{x} and \plc{y} as arguments, and accesses the array elements referenced by
the index \plc{i}.
The caller loop \plc{i} in the main program passes a linear copy of
the variable \plc{k} to the function \plc{func}.
The \code{val} modifier is used for the \code{linear} clause
in the \code{declare}~\code{simd} directive for the function
\plc{func} to annotate argument \plc{i} to match the property of
the actual argument \plc{k} passed in the SIMD loop.
Arrays \plc{x} and \plc{y} have uniform addresses across SIMD lanes.
When \code{linear(val(i):1)} is applied to an argument,
it tells the compiler that its addresses in the vector argument may not be
consecutive; however, their values are linear (with stride 1 here). When the value of \plc{i} is used
in the subscript of array references (e.g., \plc{x[i]}), the compiler can generate
a vector load or store instead of a gather or scatter. This allows more
efficient SIMD code to be generated with fewer source changes.
\cexample[4.5]{linear_modifier}{3}
\ffreeexample[4.5]{linear_modifier}{3}
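A simplified sketch of the \code{val} case (it passes the loop index directly rather than a separate linear variable \plc{k}, and the function name and array sizes are illustrative):

#include <stdio.h>
#define NN 1024

// val(i): the *values* of i are linear (stride 1) across SIMD lanes, while
// x and y are uniform pointers, so x[i] and y[i] become unit-stride accesses
#pragma omp declare simd linear(val(i):1) uniform(x, y) simdlen(8)
float add_arrays(float *x, float *y, int i) {
   return x[i] + y[i];
}

int main(void) {
   float x[NN], y[NN], z[NN];
   for (int i = 0; i < NN; i++) { x[i] = (float)i; y[i] = 2.0f * i; }

   #pragma omp simd simdlen(8)
   for (int i = 0; i < NN; i++)
      z[i] = add_arrays(x, y, i);

   printf("z[%d] = %f\n", NN - 1, z[NN - 1]);
   return 0;
}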

View File

@ -9,5 +9,5 @@ of the loop are free of data dependencies and may be executed concurrently.
It allows the compiler to use heuristics to select the parallelization scheme
and compiler-level optimizations for the concurrency.
\cexample{loop}{1}
\ffreeexample{loop}{1}
\cexample[5.0]{loop}{1}
\ffreeexample[5.0]{loop}{1}

View File

@ -3,39 +3,52 @@
\section{The OpenMP Memory Model}
\label{sec:mem_model}
In the following example, at Print 1, the value of \plc{x} could be either 2
or 5, depending on the timing of the threads, and the implementation of the assignment
to \plc{x}. There are two reasons that the value at Print 1 might not be 5.
First, Print 1 might be executed before the assignment to \plc{x} is executed.
Second, even if Print 1 is executed after the assignment, the value 5 is not guaranteed
to be seen by thread 1 because a flush may not have been executed by thread 0 since
the assignment.
The following examples illustrate two major concerns for concurrent thread
execution: ordering of thread execution and memory accesses that may or may not
lead to race conditions.
The barrier after Print 1 contains implicit flushes on all threads, as well as
a thread synchronization, so the programmer is guaranteed that the value 5 will
be printed by both Print 2 and Print 3.
In the following example, at Print 1, the value of \code{xval} could be either 2
or 5, depending on the timing of the threads. The \code{atomic} directives are
necessary for the accesses to \code{x} by threads 1 and 2 to avoid a data race.
If the atomic write completes before the atomic read, thread 1 is guaranteed to
see 5 in \code{xval}. Otherwise, thread 1 is guaranteed to see 2 in \code{xval}.
\cexample{mem_model}{1}
The barrier after Print 1 contains implicit flushes on all threads, as well as
a thread synchronization, so the programmer is guaranteed that the value 5 will
be printed by both Print 2 and Print 3. Since neither Print 2 nor Print 3 modifies
\code{x}, they may concurrently access \code{x} without requiring \code{atomic}
directives to avoid a data race.
\ffreeexample{mem_model}{1}
\cexample[3.1]{mem_model}{1}
\ffreeexample[3.1]{mem_model}{1}
\pagebreak
The following example demonstrates why synchronization is difficult to perform
correctly through variables. The value of flag is undefined in both prints on thread
1 and the value of data is only well-defined in the second print.
The following example demonstrates why synchronization is difficult to perform
correctly through variables. The write to \code{flag} on thread 0 and the read
from \code{flag} in the loop on thread 1 must be atomic to avoid a data race.
When thread 1 breaks out of the loop, \code{flag} will have the value of 1.
However, \code{data} will still be undefined at the first print statement. Only
after the flush of both \code{flag} and \code{data} after the first print
statement will \code{data} have the well-defined value of 42.
\cexample{mem_model}{2}
\cexample[3.1]{mem_model}{2}
\fexample{mem_model}{2}
\fexample[3.1]{mem_model}{2}
\pagebreak
The next example demonstrates why synchronization is difficult to perform correctly
through variables. Because the \plc{write}(1)-\plc{flush}(1)-\plc{flush}(2)-\plc{read}(2)
sequence cannot be guaranteed in the example, the statements on thread 0 and thread
1 may execute in either order.
The next example demonstrates why synchronization is difficult to perform
correctly through variables. As in the preceding example, the updates to
\code{flag} and the reading of \code{flag} in the loops on threads 1 and 2 are
performed atomically to avoid data races on \code{flag}. However, the code still
contains a data race due to the incorrect use of ``flush with a list'' after the
assignment to \code{data1} on thread 1. By not including \code{flag} in the
flush-set of that \code{flush} directive, the assignment can be reordered with
respect to the subsequent atomic update to \code{flag}. Consequently,
\code{data1} is undefined at the print statement on thread 2.
\cexample{mem_model}{3}
\cexample[3.1]{mem_model}{3}
\fexample{mem_model}{3}
\fexample[3.1]{mem_model}{3}

View File

@ -28,9 +28,9 @@ directive as selector set, has traits of \plc{kind}, \plc{isa} and \plc{arch}.
\cexample{metadirective}{1}
\cexample[5.0]{metadirective}{1}
\ffreeexample{metadirective}{1}
\ffreeexample[5.0]{metadirective}{1}
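A condensed sketch of this kind of device-trait dispatch (the trait \code{arch("nvptx")}, the array, and the loop body are illustrative; this is not the bundled \plc{metadirective.1} source):

#include <stdio.h>
#define N 64

int main(void) {
   float a[N];
   for (int i = 0; i < N; i++) a[i] = (float)i;

   #pragma omp target map(tofrom: a[0:N])
   // select a teams-based variant on NVIDIA PTX devices, otherwise a parallel loop
   #pragma omp metadirective \
               when( device={arch("nvptx")} : teams loop ) \
               default( parallel loop )
   for (int i = 0; i < N; i++)
      a[i] *= 2.0f;

   printf("a[1] = %f\n", a[1]);
   return 0;
}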
%\pagebreak
In the second example, the \plc{implementation} selector set is specified
@ -47,9 +47,9 @@ traits. Otherwise, just the \code{teams} construct is used without
any clauses, as prescribed by the \code{default} clause.
\cexample{metadirective}{2}
\cexample[5.0]{metadirective}{2}
\ffreeexample{metadirective}{2}
\ffreeexample[5.0]{metadirective}{2}
\clearpage
@ -83,6 +83,6 @@ the \code{target}~\code{teams} construct has been hoisted out of the function, a
as the \plc{variant} directive of the \code{metadirective} directive within the function.
%%%%%%%%
\cexample{metadirective}{3}
\cexample[5.0]{metadirective}{3}
\ffreeexample{metadirective}{3}
\ffreeexample[5.0]{metadirective}{3}

View File

@ -28,6 +28,6 @@ with appropriate restrictions. The combination of the \code{parallel}~\code{mast
with the \code{taskloop} or \code{taskloop}~\code{simd} construct produces no additional
restrictions.
\cexample{parallel_master_taskloop}{1}
\cexample[5.0]{parallel_master_taskloop}{1}
\ffreeexample{parallel_master_taskloop}{1}
\ffreeexample[5.0]{parallel_master_taskloop}{1}

View File

@ -5,7 +5,7 @@
The following example shows a parallel random access iterator loop.
\cppnexample{pra_iterator}{1}
\cppnexample[3.0]{pra_iterator}{1}
\cppspecificend

View File

@ -12,7 +12,7 @@ The following example demonstrates the \code{reduction} clause; note that some
reductions can be expressed in the loop in several ways, as shown for the \code{max}
and \code{min} reductions below:
\cexample{reduction}{1}
\cexample[3.1]{reduction}{1}
\pagebreak
@ -66,38 +66,148 @@ the start of the \code{parallel} region.
\fexample{reduction}{6}
The following example demonstrates the reduction of array \plc{a}. In C/C++ this is illustrated by the explicit use of an array section \plc{a[0:N]} in the \code{reduction} clause. The corresponding Fortran example uses array syntax supported in the base language. As of the OpenMP 4.5 specification the explicit use of array section in the \code{reduction} clause in Fortran is not permitted. But this oversight will be fixed in the next release of the specification.
The following example demonstrates the reduction of array \plc{a}. In C/C++ this is illustrated by the explicit use of an array section \plc{a[0:N]} in the \code{reduction} clause. The corresponding Fortran example uses array syntax supported in the base language. As of the OpenMP 4.5 specification the explicit use of array section in the \code{reduction} clause in Fortran is not permitted. But this oversight has been fixed in the OpenMP 5.0 specification.
\cexample{reduction}{7}
\cexample[4.5]{reduction}{7}
\ffreeexample{reduction}{7}
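A brief sketch of a C array-section reduction (the array length and the accumulation are made up for illustration):

#include <stdio.h>
#define N 3
#define M 100

int main(void) {
   int a[N] = {0, 0, 0};

   // each element of a acts as an independent reduction accumulator
   #pragma omp parallel for reduction(+: a[0:N])
   for (int i = 0; i < M; i++)
      for (int j = 0; j < N; j++)
         a[j] += j + 1;

   printf("%d %d %d\n", a[0], a[1], a[2]);   // 100 200 300
   return 0;
}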
\subsection{Task Reduction}
\label{subsec:task_reduction}
The following C/C++ and Fortran examples show how to implement
a task reduction over a linked list.
In OpenMP 5.0 the \code{task\_reduction} clause was created for the \code{taskgroup} construct,
to allow reductions among explicit tasks that have an \code{in\_reduction} clause.
Task reductions are supported by the \code{task\_reduction} clause, which can only be
applied to the \code{taskgroup} directive, and an \code{in\_reduction} clause,
which can be applied to the \code{task} construct, among others.
The \code{task\_reduction} clause on the \code{taskgroup} construct is used to
define the scope of a new reduction, and after the \code{taskgroup}
region the original variable will contain the final value of the reduction.
In the task-generating while loop the \code{in\_reduction} clause of the \code{task}
construct is used to specify that the task participates "in" the reduction.
In the \plc{task\_reduction.1} example below a reduction is performed as the algorithm
traverses a linked list. The reduction statement is assigned to be an explicit task using
a \code{task} construct and is specified to be a reduction participant with
the \code{in\_reduction} clause.
A \code{taskgroup} construct encloses the tasks participating in the reduction, and
specifies, with the \code{task\_reduction} clause, that the taskgroup has tasks participating
in a reduction. After the \code{taskgroup} region the original variable will contain
the final value of the reduction.
Note: The \plc{res} variable is private in the \plc{linked\_list\_sum} routine
and is not required to be shared (as in the case of a \code{parallel} construct
reduction).
\cexample{task_reduction}{1}
\cexample[5.0]{task_reduction}{1}
\ffreeexample{task_reduction}{1}
\ffreeexample[5.0]{task_reduction}{1}
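The following compact sketch shows the same taskgroup/in_reduction pattern over a short, hard-coded list (it is not the bundled \plc{task\_reduction.1} source):

#include <stdio.h>

typedef struct node { int value; struct node *next; } node_t;

int linked_list_sum(node_t *head) {
   int res = 0;                        // private accumulator; no sharing needed
   #pragma omp taskgroup task_reduction(+: res)
   {
      for (node_t *p = head; p != NULL; p = p->next) {
         // each generated task participates in the taskgroup's reduction
         #pragma omp task in_reduction(+: res)
         res += p->value;
      }
   }                                   // res holds the final reduced value here
   return res;
}

int main(void) {
   node_t n3 = {3, NULL}, n2 = {2, &n3}, n1 = {1, &n2};
   int sum = 0;
   #pragma omp parallel
   #pragma omp single
   sum = linked_list_sum(&n1);
   printf("sum = %d\n", sum);          // 6
   return 0;
}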
In OpenMP 5.0 the \code{task} \plc{reduction-modifier} for the \code{reduction} clause was
introduced to provide a means of performing reductions among implicit and explicit tasks.
The \code{reduction} clause of a \code{parallel} or worksharing construct may
specify the \code{task} \plc{reduction-modifier} to include explicit task reductions
within their region, provided the reduction operators (\plc{reduction-identifiers})
and variables (\plc{list items}) of the participating tasks match those of the
implicit tasks.
There are 2 reduction use cases (identified by USE CASE \#) in the \plc{task\_reduction.2} example below.
In USE CASE 1 a \code{task} modifier in the \code{reduction} clause
of the \code{parallel} construct is used to include the reductions of any
participating tasks, those with an \code{in\_reduction} clause and matching
\plc{reduction-identifiers} (\code{+}) and list items (\code{x}).
Note that a \code{taskgroup} construct (with a \code{task\_reduction} clause) is not
necessary to scope the explicit task reduction (as in the example above).
Hence, even without the implicit task reduction statement (without the C \code{x++\;}
and Fortran \code{x=x+1} statements), the \code{task} \plc{reduction-modifier}
in a \code{reduction} clause of the \code{parallel} construct
can be used to avoid having to create a \code{taskgroup} construct
(and its \code{task\_reduction} clause) around the task generating structure.
In USE CASE 2 tasks participating in the reduction are within a
worksharing region (a parallel worksharing-loop construct).
Here, too, no \code{taskgroup} is required, and the \plc{reduction-identifier} (\code{+})
and list item (variable \code{x}) match as required.
\cexample[5.0]{task_reduction}{2}
\ffreeexample[5.0]{task_reduction}{2}
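A compact sketch of USE CASE 1 (the thread count and increments are made up to give a deterministic result):

#include <stdio.h>

int main(void) {
   int x = 0;

   // the task modifier extends the reduction on x to explicit tasks
   // that specify in_reduction(+: x); no taskgroup construct is needed
   #pragma omp parallel reduction(task, +: x) num_threads(4)
   {
      x++;                                  // implicit-task contribution
      #pragma omp task in_reduction(+: x)   // explicit-task contribution
      x += 10;
   }
   printf("x = %d\n", x);                   // 4 * (1 + 10) = 44 with four threads
   return 0;
}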
\subsection{Reduction on Combined Target Constructs}
\label{subsec:target_reduction}
When a \code{reduction} clause appears on a combined construct that combines
a \code{target} construct with another construct, there is an implicit map
of the list items with a \code{tofrom} map type for the \code{target} construct.
Otherwise, the list items (if they are scalar variables) would be
treated as firstprivate by default in the \code{target} construct, which
is unlikely to provide the intended behavior since the result of the
reduction that is in the firstprivate variable would be discarded
at the end of the \code{target} region.
In the following example, the use of the \code{reduction} clause on \code{sum1}
or \code{sum2} should, by default, result in an implicit \code{tofrom} map for
that variable. So long as neither \code{sum1} nor \code{sum2} were already
present on the device, the mapping behavior ensures the value for
\code{sum1} computed in the first \code{target} construct is used in the
second \code{target} construct.
\cexample[5.0]{target_reduction}{1}
\ffreeexample[5.0]{target_reduction}{1}
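A minimal sketch of the implied \code{tofrom} behavior with a single combined construct (array contents are illustrative):

#include <stdio.h>
#define N 1000

int main(void) {
   float a[N];
   for (int i = 0; i < N; i++) a[i] = 1.0f;

   float sum1 = 0.0f;
   // the reduction clause implies map(tofrom: sum1), so the reduced value
   // is copied back to the host when the region ends
   #pragma omp target teams distribute parallel for reduction(+: sum1) \
               map(to: a[0:N])
   for (int i = 0; i < N; i++)
      sum1 += a[i];

   printf("sum1 = %f\n", sum1);   // 1000.0
   return 0;
}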
\clearpage
In the next example, the variables \code{sum1} and \code{sum2} remain on the
device for the duration of the \code{target}~\code{data} region so that it is
their device copies that are updated by the reductions. Note the significance
of mapping \code{sum1} on the second \code{target} construct; otherwise, it
would be treated by default as firstprivate and the result computed for
\code{sum1} in the prior \code{target} region may not be used. Alternatively, a
\code{target}~\code{update} construct could be used between the two
\code{target} constructs to update the host version of \code{sum1} with the
value that is in the corresponding device version after the completion of the
first construct.
\cexample[5.0]{target_reduction}{2}
\ffreeexample[5.0]{target_reduction}{2}
\subsection{Task Reduction with Target Constructs}
\label{subsec:target_task_reduction}
The following examples illustrate how task reductions can apply to target tasks
that result from a \code{target} construct with the \code{in\_reduction}
clause. Here, the \code{in\_reduction} clause specifies that the target task
participates in the task reduction defined in the scope of the enclosing
\code{taskgroup} construct. Partial results from all tasks participating in the
task reduction will be combined (in some order) into the original variable
listed in the \code{task\_reduction} clause before exiting the \code{taskgroup}
region.
\cexample[5.0]{target_task_reduction}{1}
\ffreeexample[5.0]{target_task_reduction}{1}
In the next pair of examples, the task reduction is defined by a
\code{reduction} clause with the \code{task} modifier, rather than a
\code{task\_reduction} clause on a \code{taskgroup} construct. Again, the
partial results from the participating tasks will be combined in some order
into the original reduction variable, \code{sum}.
\cexample[5.0]{target_task_reduction}{2a}
\ffreeexample[5.0]{target_task_reduction}{2a}
Next, the \code{task} modifier is again used to define a task reduction over
participating tasks. This time, the participating tasks are a target task
resulting from a \code{target} construct with the \code{in\_reduction} clause,
and the implicit task (executing on the master thread) that calls
\code{host\_compute}. As before, the partial results from these participating
tasks are combined in some order into the original reduction variable.
\cexample[5.0]{target_task_reduction}{2b}
\ffreeexample[5.0]{target_task_reduction}{2b}
\subsection{Taskloop Reduction}
@ -121,8 +231,8 @@ that if we add the \code{nogroup} clause to the \code{taskloop} construct the co
nonconforming, basically because we have a set of tasks that participate in a
reduction that has not been defined.
\cexample{taskloop_reduction}{1}
\ffreeexample{taskloop_reduction}{1}
\cexample[5.0]{taskloop_reduction}{1}
\ffreeexample[5.0]{taskloop_reduction}{1}
%In the second example, we are computing exactly the same
%value but we do it in a very different way. The first thing that we do in the
@ -154,8 +264,9 @@ declared reduction (\code{in\_reduction} clause) whereas in the other case
creation of a new reduction is specified and also that all tasks generated
by the taskloop will participate in it.
\cexample{taskloop_reduction}{2}
\ffreeexample{taskloop_reduction}{2}
\cexample[5.0]{taskloop_reduction}{2}
\ffreeexample[5.0]{taskloop_reduction}{2}
\clearpage
In the OpenMP 5.0 Specification, \code{reduction} clauses for the
\code{taskloop}~\code{simd} construct were also added.
@ -228,10 +339,8 @@ At the end of the parallel region \plc{asum} contains the combined result of all
%At the end of the parallel region \plc{asum} contains the combined result of all reductions.
\cexample{taskloop_simd_reduction}{1}
\cexample[5.0]{taskloop_simd_reduction}{1}
\ffreeexample{taskloop_simd_reduction}{1}
\ffreeexample[5.0]{taskloop_simd_reduction}{1}
% All other reductions

View File

@ -26,6 +26,6 @@ not updated on the host.
%\pagebreak
\cppexample{requires}{1}
\cppexample[5.0]{requires}{1}
\ffreeexample{requires}{1}
\ffreeexample[5.0]{requires}{1}

Examples_scan.tex Normal file
View File

@ -0,0 +1,38 @@
\pagebreak
\section{The \code{scan} Directive}
\label{sec:scan}
The following examples illustrate how to parallelize a loop that saves
the \emph{prefix sum} of a reduction. This is accomplished by using
the \code{inscan} modifier in the \code{reduction} clause for the input
variable of the scan, and specifying with a \code{scan} directive whether
the storage statement includes or excludes the scan input of the present
iteration (\texttt{k}).
Basically, the \code{inscan} modifier connects a loop and/or SIMD reduction to
the scan operation, and a \code{scan} construct with an \code{inclusive} or
\code{exclusive} clause specifies whether the ``scan phase'' (lexical block
after or before the directive, respectively) is to use an \plc{inclusive} or
\plc{exclusive} scan value for the list item (\texttt{x}).
The first example uses the \plc{inclusive} scan operation on a composite
loop-SIMD construct. The \code{scan} directive separates the reduction
statement on variable \texttt{x} from the use of \texttt{x} (saving to array \texttt{b}).
The order of the statements in this example indicates that
value \texttt{a[k]} (\texttt{a(k)} in Fortran) is included in the computation of
the prefix sum \texttt{b[k]} (\texttt{b(k)} in Fortran) for iteration \texttt{k}.
\cexample[5.0]{scan}{1}
\ffreeexample[5.0]{scan}{1}
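A short sketch of the inclusive case (eight made-up input values; not the bundled \plc{scan.1} source):

#include <stdio.h>
#define N 8

int main(void) {
   int a[N], b[N], x = 0;
   for (int k = 0; k < N; k++) a[k] = k + 1;

   // inclusive prefix sum: b[k] = a[0] + ... + a[k]
   #pragma omp parallel for simd reduction(inscan, +: x)
   for (int k = 0; k < N; k++) {
      x += a[k];                  // input phase: a[k] is included
      #pragma omp scan inclusive(x)
      b[k] = x;                   // scan phase: uses the inclusive value
   }

   for (int k = 0; k < N; k++) printf("%d ", b[k]);   // 1 3 6 10 15 21 28 36
   printf("\n");
   return 0;
}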
The second example uses the \plc{exclusive} scan operation on a composite
loop-SIMD construct. The \code{scan} directive separates the use of \texttt{x}
(saving to array \texttt{b}) from the reduction statement on variable \texttt{x}.
The order of the statements in this example indicates that
value \texttt{a[k]} (\texttt{a(k)} in Fortran) is excluded from the computation
of the prefix sum \texttt{b[k]} (\texttt{b(k)} in Fortran) for iteration \texttt{k}.
\cexample[5.0]{scan}{2}
\ffreeexample[5.0]{scan}{2}

View File

@ -7,7 +7,7 @@ The following example is non-conforming, because the \code{flush}, \code{barrier
\code{taskwait}, and \code{taskyield} directives are stand-alone directives
and cannot be the immediate substatement of an \code{if} statement.
\cexample{standalone}{1}
\cexample[3.1]{standalone}{1}
\pagebreak
The following example is non-conforming, because the \code{flush}, \code{barrier},
@ -15,19 +15,19 @@ The following example is non-conforming, because the \code{flush}, \code{barrier
and cannot be the action statement of an \code{if} statement or a labeled branch
target.
\ffreeexample{standalone}{1}
\ffreeexample[3.1]{standalone}{1}
The following version of the above example is conforming because the \code{flush},
\code{barrier}, \code{taskwait}, and \code{taskyield} directives are enclosed
in a compound statement.
\cexample{standalone}{2}
\cexample[3.1]{standalone}{2}
\pagebreak
The following example is conforming because the \code{flush}, \code{barrier},
\code{taskwait}, and \code{taskyield} directives are enclosed in an \code{if}
construct or follow the labeled branch target.
\ffreeexample{standalone}{2}
\ffreeexample[3.1]{standalone}{2}

View File

@ -9,9 +9,9 @@ This following example shows how the \code{target} construct offloads a code
region to a target device. The variables \plc{p}, \plc{v1}, \plc{v2}, and \plc{N} are implicitly mapped
to the target device.
\cexample{target}{1}
\cexample[4.0]{target}{1}
\ffreeexample{target}{1}
\ffreeexample[4.0]{target}{1}
\subsection{\code{target} Construct with \code{map} Clause}
\label{subsec:target_map}
@ -21,9 +21,9 @@ region to a target device. The variables \plc{p}, \plc{v1} and \plc{v2} are expl
target device using the \code{map} clause. The variable \plc{N} is implicitly mapped to
the target device.
\cexample{target}{2}
\cexample[4.0]{target}{2}
\ffreeexample{target}{2}
\ffreeexample[4.0]{target}{2}
\subsection{\code{map} Clause with \code{to}/\code{from} map-types}
\label{subsec:target_map_tofrom}
@ -46,14 +46,14 @@ the variable \plc{p} is not initialized with the value of the corresponding vari
on the host device, and at the end of the \code{target} region the variable \plc{p}
is assigned to the corresponding variable on the host device.
\cexample{target}{3}
\cexample[4.0]{target}{3}
The \code{to} and \code{from} map-types allow programmers to optimize data
motion. Since data for the \plc{v} arrays are not returned, and data for the \plc{p} array
are not transferred to the device, only one-half of the data is moved, compared
to the default behavior of an implicit mapping.
\ffreeexample{target}{3}
\ffreeexample[4.0]{target}{3}
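A condensed sketch of the to/from mapping (statically sized arrays rather than the pointers used in the bundled \plc{target.3} source):

#include <stdio.h>
#define N 1024

int main(void) {
   float p[N], v1[N], v2[N];
   for (int i = 0; i < N; i++) { v1[i] = (float)i; v2[i] = 2.0f * i; }

   // v1 and v2 are only copied to the device; p is only copied back,
   // roughly halving the data motion of a default tofrom mapping
   #pragma omp target map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
   {
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
         p[i] = v1[i] * v2[i];
   }

   printf("p[1] = %f\n", p[1]);
   return 0;
}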
\subsection{\code{map} Clause with Array Sections}
\label{subsec:target_array_section}
@ -64,14 +64,15 @@ the mapping of variables to the target device. Because variables \plc{p}, \plc{v
pointers, array section notation must be used to map the arrays. The notation \code{:N}
is equivalent to \code{0:N}.
\cexample{target}{4}
\cexample[4.0]{target}{4}
\clearpage
In C, the length of the pointed-to array must be specified. In Fortran the extent
of the array is known and the length need not be specified. A section of the array
can be specified with the usual Fortran syntax, as shown in the following example.
The value 1 is assumed for the lower bound for array section \plc{v2(:N)}.
\ffreeexample{target}{4}
\ffreeexample[4.0]{target}{4}
A more realistic situation in which an assumed-size array is passed to \code{vec\_mult}
requires that the length of the arrays be specified, because the compiler does
@ -79,7 +80,7 @@ not know the size of the storage. A section of the array must be specified with
the usual Fortran syntax, as shown in the following example. The value 1 is assumed
for the lower bound for array section \plc{v2(:N)}.
\ffreeexample{target}{4b}
\ffreeexample[4.0]{target}{4b}
\subsection{\code{target} Construct with \code{if} Clause}
\label{subsec:target_if}
@ -95,9 +96,9 @@ The \code{if} clause on the \code{parallel} construct indicates that if the
variable \plc{N} is smaller than a second threshold then the \code{parallel} region
is inactive.
\cexample{target}{5}
\cexample[4.0]{target}{5}
\ffreeexample{target}{5}
\ffreeexample[4.0]{target}{5}
The following example is a modification of the above \plc{target.5} code to show the combined \code{target}
and parallel loop directives. It uses the \plc{directive-name} modifier in multiple \code{if}
@ -107,9 +108,9 @@ The \code{if} clause with the \code{target} modifier applies to the \code{target
combined directive, and the \code{if} clause with the \code{parallel} modifier applies
to the \code{parallel} component of the combined directive.
\cexample{target}{6}
\cexample[4.5]{target}{6}
\ffreeexample{target}{6}
\ffreeexample[4.5]{target}{6}
\subsection{target Reverse Offload}
\label{subsec:target_reverse_offload}
@ -139,6 +140,6 @@ function. This feature may be necessary if the function
exists in another compile unit.
\cexample{target_reverse_offload}{7}
\cexample[5.0]{target_reverse_offload}{7}
\ffreeexample{target_reverse_offload}{7}
\ffreeexample[5.0]{target_reverse_offload}{7}

View File

@ -14,14 +14,14 @@ variables \plc{v1}, \plc{v2}, and \plc{p} from the enclosing device data environ
\plc{N} is mapped into the new device data environment from the encountering task's data
environment.
\cexample{target_data}{1}
\cexample[4.0]{target_data}{1}
\pagebreak
The Fortran code passes a reference and specifies the extent of the arrays in the
declaration. No length information is necessary in the map clause, as is required
with C/C++ pointers.
\ffreeexample{target_data}{1}
\ffreeexample[4.0]{target_data}{1}
\subsection{\code{target} \code{data} Region Enclosing Multiple \code{target} Regions}
\label{subsec:target_data_multiregion}
@ -39,7 +39,7 @@ In the following example the variables \plc{v1} and \plc{v2} are mapped at each
construct. Instead of mapping the variable \plc{p} twice, once at each \code{target}
construct, \plc{p} is mapped once by the \code{target} \code{data} construct.
\cexample{target_data}{2}
\cexample[4.0]{target_data}{2}
The Fortran code uses reference and specifies the extent of the \plc{p}, \plc{v1} and \plc{v2} arrays.
@ -48,7 +48,7 @@ C/C++ pointers. The arrays \plc{v1} and \plc{v2} are mapped at each \code{target
Instead of mapping the array \plc{p} twice, once at each target construct, \plc{p} is mapped
once by the \code{target} \code{data} construct.
\ffreeexample{target_data}{2}
\ffreeexample[4.0]{target_data}{2}
In the following example, the array \plc{Q} is mapped once at the enclosing
\code{target}~\code{data} region instead of at each \code{target} construct.
@ -58,9 +58,9 @@ the \code{tofrom} map-type at the first \code{target} construct in order to retu
its reduced value from the parallel loop construct to the host.
The variable defaults to firstprivate at the second \code{target} construct.
\cexample{target_data}{3}
\cexample[4.0]{target_data}{3}
\ffreeexample{target_data}{3}
\ffreeexample[4.0]{target_data}{3}
\subsection{\code{target} \code{data} Construct with Orphaned Call}
@ -87,7 +87,7 @@ of the storage location associated with their corresponding array sections. Note
that the following pairs of array section storage locations are equivalent (\plc{p0[:N]},
\plc{p1[:N]}), (\plc{v1[:N]},\plc{v3[:N]}), and (\plc{v2[:N]},\plc{v4[:N]}).
\cexample{target_data}{4}
\cexample[4.0]{target_data}{4}
The Fortran code maps the pointers and storage in an identical manner (same extent,
but uses indices from 1 to \plc{N}).
@ -103,7 +103,7 @@ assigned the address of the storage location associated with their corresponding
array sections. Note that the following pair of array storage locations are equivalent
(\plc{p0},\plc{p1}), (\plc{v1},\plc{v3}), and (\plc{v2},\plc{v4}).
\ffreeexample{target_data}{4}
\ffreeexample[4.0]{target_data}{4}
In the following example, the variables \plc{p1}, \plc{v3}, and \plc{v4} are references to the pointer
@ -112,7 +112,7 @@ environment inherits the pointer variables \plc{p0}, \plc{v1}, and \plc{v2} from
\code{data} construct's device data environment. Thus, \plc{p1}, \plc{v3}, and \plc{v4} are already
present in the device data environment.
\cppexample{target_data}{5}
\cppexample[4.0]{target_data}{5}
In the following example, the usual Fortran approach is used for dynamic memory.
The \plc{p0}, \plc{v1}, and \plc{v2} arrays are allocated in the main program and passed as references
@ -122,7 +122,7 @@ environment inherits the arrays \plc{p0}, \plc{v1}, and \plc{v2} from the enclos
device data environment. Thus, \plc{p1}, \plc{v3}, and \plc{v4} are already present in the device
data environment.
\ffreeexample{target_data}{5}
\ffreeexample[4.0]{target_data}{5}
\subsection{\code{target} \code{data} Construct with \code{if} Clause}
\label{subsec:target_data_if}
@ -140,7 +140,7 @@ variable \plc{p} is implicitly mapped with a map-type of \code{tofrom}, but the
location for the array section \plc{p[0:N]} will not be mapped in the device data environments
of the \code{target} constructs.
\cexample{target_data}{6}
\cexample[4.0]{target_data}{6}
\pagebreak
The \code{if} clauses work the same way for the following Fortran code. The \code{target}
@ -149,7 +149,7 @@ an \code{if} clause with the same condition, so that the \code{target} \code{dat
region and the \code{target} region are either both created for the device, or
are both ignored.
\ffreeexample{target_data}{6}
\ffreeexample[4.0]{target_data}{6}
\pagebreak
In the following example, when the \code{if} clause conditional expression on
@ -161,7 +161,7 @@ region the array section \plc{p[0:N]} will be assigned from the device data envi
to the corresponding variable in the data environment of the task that encountered
the \code{target} \code{data} construct, resulting in undefined values in \plc{p[0:N]}.
\cexample{target_data}{7}
\cexample[4.0]{target_data}{7}
\pagebreak
The \code{if} clauses work the same way for the following Fortran code. When
@ -174,5 +174,5 @@ region the \plc{p} array will be assigned from the device data environment to th
variable in the data environment of the task that encountered the \code{target}
\code{data} construct, resulting in undefined values in \plc{p}.
\ffreeexample{target_data}{7}
\ffreeexample[4.0]{target_data}{7}

View File

@ -0,0 +1,56 @@
\pagebreak
\section{\code{defaultmap} Clause}
\label{sec:defaultmap}
The implicitly determined data-mapping and data-sharing attribute
rules of variables referenced in a \code{target} construct can be
changed by the \code{defaultmap} clause introduced in OpenMP 5.0.
The implicit behavior is specified as
\code{alloc}, \code{to}, \code{from}, \code{tofrom},
\code{firstprivate}, \code{none}, \code{default} or \code{present},
and is applied to a variable-category, where
\code{scalar}, \code{aggregate}, \code{allocatable},
and \code{pointer} are the variable categories.
In OpenMP, a ``category'' has a common data-mapping and data-sharing
behavior for variable types within the category.
In C/C++, \code{scalar} refers to base-language scalar variables, except pointers.
In Fortran it refers to a scalar variable of intrinsic type, as defined by
the base language, excluding character type.
Also, \code{aggregate} refers to arrays and structures (C/C++) and
derived types (Fortran). Fortran has the additional category of \code{allocatable}.
%In the example below, the first \code{target} construct uses \code{defaultmap}
%clauses to explicitly set data-mapping attributes that reproduce
%the default implicit mapping (data-mapping and data-sharing attributes). That is,
%if the \code{defaultmap} clauses were removed, the results would be identical.
In the example below, the first \code{target} construct uses \code{defaultmap}
clauses to set data-mapping and possibly data-sharing attributes that reproduce
the default implicit mapping (data-mapping and data-sharing attributes). That is,
if the \code{defaultmap} clauses were removed, the results would be identical.
In the second \code{target} construct all implicit behavior is removed
by specifying the \code{none} implicit behavior in the \code{defaultmap} clause.
Hence, all variables must be explicitly mapped.
In the C/C++ code a scalar (\texttt{s}), an array (\texttt{A}) and a structure
(\texttt{S}) are explicitly mapped \code{tofrom}.
The Fortran code uses a derived type (\texttt{D}) in lieu of structure.
The third \code{target} construct shows another common use case for the \code{defaultmap} clause.
The default mapping for (non-pointer) scalar variables is specified as \code{tofrom}.
Here, the default implicit mapping for \texttt{s3} is \code{tofrom} as specified
in the \code{defaultmap} clause, and \texttt{s1} and \texttt{s2} are explicitly
mapped with the \code{firstprivate} data-sharing attribute.
In the fourth \code{target} construct all arrays, structures (C/C++) and derived
types (Fortran) are mapped with \code{firstprivate} data-sharing behavior by a
\code{defaultmap} clause with an \code{aggregate} variable category.
For the allocatable array \texttt{H} in the Fortran code, the \code{allocatable}
category must be used in a separate \code{defaultmap} clause to acquire
\code{firstprivate} data-sharing behavior (\texttt{H} has the Fortran allocatable attribute).
% (Common use cases for C/C++ heap storage can be found in \specref{sec:pointer_mapping}.)
\cexample[5.0]{target_defaultmap}{1}
\ffreeexample[5.0]{target_defaultmap}{1}
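
As a point of reference, a minimal hypothetical sketch of some of the
\code{defaultmap} forms discussed above might look as follows (this is
not the \plc{target\_defaultmap.1} source; names and values are illustrative):

#include <stdio.h>

int main(void)
{
   int s = 1;                          /* scalar    */
   int A[4] = {0, 0, 0, 0};            /* aggregate */

   /* all implicit behavior removed: every referenced variable
      must be explicitly mapped                                 */
   #pragma omp target defaultmap(none) map(tofrom: s, A)
   { s += A[0]; }

   /* scalars are mapped tofrom instead of acquiring the default
      firstprivate behavior                                     */
   #pragma omp target defaultmap(tofrom: scalar)
   { s += 1; }

   /* aggregates acquire firstprivate behavior; the change to A
      is not returned to the host                               */
   #pragma omp target defaultmap(firstprivate: aggregate)
   { A[0] += s; }

   printf("s=%d A[0]=%d\n", s, A[0]);
   return 0;
}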

View File

@ -26,11 +26,11 @@ full structure, plus the dynamic storage of the \plc{data} element.
%The associated Fortran allocatable \plc{data} array is automatically mapped with the derived
%type, it does not require an array section as in the C/C++ example.
\cexample{target_mapper}{1}
\cexample[5.0]{target_mapper}{1}
\ffreeexample{target_mapper}{1}
\ffreeexample[5.0]{target_mapper}{1}
\pagebreak
%\pagebreak
The next example illustrates the use of the \plc{mapper-identifier} and deep copy within a structure.
The structure, \plc{dzmat\_t}, represents a complex matrix,
with separate real (\plc{r\_m}) and imaginary (\plc{i\_m}) elements.
@ -56,11 +56,11 @@ Note, the \plc{is} and \plc{ie} scalars are firstprivate
by default for a target region, but are declared firstprivate anyway
to remind the user of important firstprivate data-sharing properties required here.
\cexample{target_mapper}{2}
\cexample[5.0]{target_mapper}{2}
\ffreeexample{target_mapper}{2}
\ffreeexample[5.0]{target_mapper}{2}
\pagebreak
%\pagebreak
In the third example \plc{myvec} structures are
nested within a \plc{mypoints} structure. The \plc{myvec\_t} type is mapped
as in the first example. Following the \plc{mypoints} structure declaration,
@ -80,7 +80,7 @@ type structure.
%Note, in the main program \plc{P} is an array of \plc{mypoints\_t} type structures,
%and hence every element of the array is mapped with the mapper prescription.
\cexample{target_mapper}{3}
\cexample[5.0]{target_mapper}{3}
\ffreeexample{target_mapper}{3}
\ffreeexample[5.0]{target_mapper}{3}

View File

@ -34,10 +34,10 @@ when the \code{OMP\_DISPLAY\_ENV}
environment variable is set to \code{TRUE} or \code{VERBOSE}.
%\pagebreak
\cexample{target_offload_control}{1}
\cexample[5.0]{target_offload_control}{1}
%\pagebreak
\ffreeexample{target_offload_control}{1}
\ffreeexample[5.0]{target_offload_control}{1}
% OMP 4.5 target offload 15:9-11

View File

@ -2,6 +2,23 @@
\section{Pointer mapping}
\label{sec:pointer_mapping}
Pointers that contain host addresses require that those addresses are translated to device addresses for them to be useful in the context of a device data environment. Broadly speaking, there are two scenarios where this is important.
The first scenario is where the pointer is mapped to the device data environment, such that references to the pointer inside a \code{target} region are to the corresponding pointer. Pointer attachment ensures that the corresponding pointer will contain a device address when all of the following conditions are true:
\begin{itemize}
\item the pointer is mapped by directive $A$ to a device;
\item a list item that uses the pointer as its base pointer (call it the \emph{pointee}) is mapped, to the same device, by directive $B$, which may be the same as $A$;
\item the effect of directive $B$ is to create either the corresponding pointer or pointee in the device data environment of the device.
\end{itemize}
Given the above conditions, pointer attachment is initiated as a result of directive $B$ and subsequent references to the pointee list item in a target region that use the pointer will access the corresponding pointee. The corresponding pointer remains in this \emph{attached} state until it is removed from the device data environment.
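
As a rough illustration of this first scenario, the following hypothetical
sketch (names are illustrative; it is not one of the example sources
referenced below) maps a pointee whose base pointer already has a device
copy, so the pointer becomes attached:

#include <stdlib.h>

#pragma omp declare target
double *devptr;                 /* directive A: devptr has a device copy */
#pragma omp end declare target

void attach_sketch(int n)
{
   devptr = (double *)malloc(n * sizeof(double));

   /* directive B: maps the pointee with devptr as its base pointer;
      the pointee becomes mapped here, so the device devptr is
      attached to the device copy of the storage                      */
   #pragma omp target enter data map(to: devptr[:n])

   #pragma omp target             /* references use the attached pointer */
   for (int i = 0; i < n; i++)
      devptr[i] = 2.0 * i;

   #pragma omp target exit data map(from: devptr[:n])
   free(devptr);
}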
The second scenario, which is only applicable for C/C++, is where the pointer is implicitly privatized inside a \code{target} construct when it appears as the base pointer to a list item on the construct and does not appear explicitly as a list item in a \code{map} clause, \code{is\_device\_ptr} clause, or data-sharing attribute clause. This scenario can be further split into two cases: the list item is a zero-length array section (e.g., \plc{p[:0]}) or it is not.
If it is a zero-length array section, this will trigger a runtime check on entry to the \code{target} region for a previously mapped list item where the value of the pointer falls within the range of its base address and ending address. If such a match is found the private pointer is initialized to the device address corresponding to the value of the original pointer, and otherwise it is initialized to NULL (or retains its original value if the \code{unified\_address} requirement is specified for that compilation unit).
If the list item (again, call it the \emph{pointee}) is not a zero-length array section, the private pointer will be initialized such that references in the \code{target} region to the pointee list item that use the pointer will access the corresponding pointee.
The following example shows the basics of mapping pointers with and without
associated storage on the host.
@ -24,18 +41,23 @@ data at the end of the \code{target} region.
As a comparison, note that the \plc{aray} array is automatically mapped,
since the compiler knows the extent of the array.
The pointer \plc{ptr3} is used in the \code{target} region and has
a data-sharing attribute of firstprivate.
The pointer is implicitly mapped to a zero-length array section.
Neither the pointer address nor any
of its locally assigned data on the device is returned
to the host.
The pointer \plc{ptr3} is used inside the \code{target} construct, but it does
not appear in a data-mapping or data-sharing clause. Nor is there a
\code{defaultmap} clause on the construct to indicate what its implicit
data-mapping or data-sharing attribute should be. For such a case, \plc{ptr3}
will be implicitly privatized within the construct and there will be a runtime
check to see if the host memory to which it is pointing has corresponding memory
in the device data environment. If this runtime check passes, the private
\plc{ptr3} would be initialized to point to the corresponding memory. But in
this case the check does not pass and so it is initialized to null.
Since \plc{ptr3} is private, the value to which it is assigned in the
\code{target} region is not returned into the original \plc{ptr3} on the host.
\cexample{target_ptr_map}{1}
\cexample[5.0]{target_ptr_map}{1}
In the following example the global pointer \plc{p} appears in a
\code{declare}~\code{target} directive. Hence, the pointer \plc{p} will
persist on the device throughout executions in all target regions.
persist on the device throughout executions in all \code{target} regions.
The pointer is also used in an array section of a \code{map} clause on
a \code{target} construct. When storage associated with
@ -50,4 +72,49 @@ pointer on the device is \emph{attached}.)
% For globals with declare target is there such a things a
% original and corresponding?
\cexample{target_ptr_map}{2}
\cexample[5.0]{target_ptr_map}{2}
The following two examples illustrate subtle differences in pointer attachment
to a device address, depending on the order of data mapping.
In example \plc{target\_ptr\_map.3a}
the global pointer \plc{p1} points to array \plc{x} and \plc{p2} points to
array \plc{y} on the host.
The array section \plc{x[:N]} is mapped by the \code{target}~\code{enter}~\code{data} directive while array \plc{y} is mapped
on the \code{target} construct.
Since the \code{declare}~\code{target} directive is applied to the declaration
of \plc{p1}, \plc{p1} is treated like a mapped variable on the \code{target}
construct and references to \plc{p1} inside the construct will be to the
corresponding \plc{p1} that exists on the device. However, the corresponding
\plc{p1} will be undefined since there is no pointer attachment for it. Pointer
attachment for \plc{p1} would require that (1) \plc{p1} (or an lvalue
expression that refers to the same storage as \plc{p1}) appears as a base
pointer to a list item in a \code{map} clause, and (2) the construct that has
the \code{map} clause causes the list item to transition from \emph{not mapped}
to \emph{mapped}. The conditions are clearly not satisfied for this example.
The problem for \plc{p2} in this example is also subtle. It will be privatized
inside the \code{target} construct, with a runtime check for whether the memory
to which it is pointing has corresponding memory that is accessible on the
device. If this check is successful then the \plc{p2} inside the construct
would be appropriately initialized to point to that corresponding memory.
Unfortunately, despite there being an implicit map of the array \plc{y} (to
which \plc{p2} is pointing) on the construct, the order of this map relative to
the initialization of \plc{p2} is unspecified. Therefore, the initial value of
\plc{p2} will also be undefined.
Thus, referencing values via either \plc{p1} or \plc{p2} inside
the \code{target} region would be invalid.
\cexample[5.0]{target_ptr_map}{3a}
In example \plc{target\_ptr\_map.3b} the mapping orders for arrays \plc{x}
and \plc{y} were rearranged to allow proper pointer attachments.
On the \code{target} construct, the \code{map(x)} clause triggers pointer
attachment for \plc{p1} to the device address of \plc{x}.
Pointer \plc{p2} is assigned the device address of the previously mapped
array \plc{y}.
Referencing values via either \plc{p1} or \plc{p2} inside the \code{target} region is now valid.
\cexample[5.0]{target_ptr_map}{3b}
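
A hypothetical sketch consistent with the description above (illustrative
only; the actual \plc{target\_ptr\_map.3b} source may differ in detail):

#define N 100

#pragma omp declare target
int *p1;
#pragma omp end declare target

int *p2;
int x[N], y[N];

void correct_order(void)
{
   p1 = x;  p2 = y;

   #pragma omp target enter data map(to: y)   /* y is mapped first */

   /* x becomes mapped on this construct, which attaches the device p1;
      the privatized p2 is initialized to the device address of the
      previously mapped y                                              */
   #pragma omp target map(tofrom: x)
   {
      for (int i = 0; i < N; i++) {
         p1[i] = i;
         p2[i] = 2 * i;
      }
   }

   #pragma omp target exit data map(from: y)
}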

View File

@ -19,7 +19,7 @@ Note: The buffer arrays and the \plc{x} variable have been grouped together, so
the components that will reside on the device are all together (without gaps).
This allows the runtime to optimize the transfer and the storage footprint on the device.
\cexample{target_struct_map}{1}
\cexample[5.0]{target_struct_map}{1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -28,7 +28,7 @@ a C++ class. In the member function \plc{SAXPY::driver}
the array section \plc{p[:N]} is \emph{attached} to the pointer member \plc{p}
on the device.
\cppexample{target_struct_map}{2}
\cppexample[5.0]{target_struct_map}{2}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View File

@ -22,7 +22,7 @@ back to the host. Note, the stand-alone \code{target}~\code{enter}~\code{data} o
after the host vector is created, and the \code{target}~\code{exit}~\code{data}
construct occurs before the host data is deleted.
\cppexample{target_unstructured_data}{1}
\cppexample[4.5]{target_unstructured_data}{1}
\pagebreak
The following C code allocates and frees the data member of a Matrix structure.
@ -33,7 +33,7 @@ and then frees the memory on the host. Note, the stand-alone
\code{target}~\code{enter}~\code{data} occurs after the host memory is allocated, and the
\code{target}~\code{exit}~\code{data} construct occurs before the host data is freed.
\cexample{target_unstructured_data}{1}
\cexample[4.5]{target_unstructured_data}{1}
\pagebreak
The following Fortran code allocates and deallocates a module array. The
@ -44,6 +44,6 @@ then deallocates the array on the host. Note, the stand-alone
\code{target}~\code{enter}~\code{data} occurs after the host memory is allocated, and the
\code{target}~\code{exit}~\code{data} construct occurs before the host data is deallocated.
\ffreeexample{target_unstructured_data}{1}
\ffreeexample[4.5]{target_unstructured_data}{1}
%end

View File

@ -27,9 +27,9 @@ region and waits for the completion of the region.
The second \code{target} region uses the updated values of \plc{v1[:N]} and \plc{v2[:N]}.
\cexample{target_update}{1}
\cexample[4.0]{target_update}{1}
\ffreeexample{target_update}{1}
\ffreeexample[4.0]{target_update}{1}
\subsection{\code{target} \code{update} Construct with \code{if} Clause}
\label{subsec:target_update_if}
@ -49,7 +49,7 @@ assigns the new values of \plc{v1} and \plc{v2} from the task's data environment
mapped array sections in the \code{target} \code{data} construct's device data
environment.
\cexample{target_update}{2}
\cexample[4.0]{target_update}{2}
\ffreeexample{target_update}{2}
\ffreeexample[4.0]{target_update}{2}

View File

@ -26,7 +26,7 @@ The use of the \plc{A} array is sufficient for this case, because one
would expect the storage for \plc{A} and \plc{B} to be physically ``close''
(as provided by the hint in the first task).
\cexample{affinity}{6}
\cexample[5.0]{affinity}{6}
\ffreeexample{affinity}{6}
\ffreeexample[5.0]{affinity}{6}

View File

@ -8,9 +8,9 @@
This example shows a simple flow dependence using a \code{depend}
clause on the \code{task} construct.
\cexample{task_dep}{1}
\cexample[4.0]{task_dep}{1}
\ffreeexample{task_dep}{1}
\ffreeexample[4.0]{task_dep}{1}
The program will always print \texttt{"}x = 2\texttt{"}, because the \code{depend}
clauses enforce the ordering of the tasks. If the \code{depend} clauses had been
@ -23,9 +23,9 @@ would have a race condition.
This example shows an anti-dependence using the \code{depend}
clause on the \code{task} construct.
\cexample{task_dep}{2}
\cexample[4.0]{task_dep}{2}
\ffreeexample{task_dep}{2}
\ffreeexample[4.0]{task_dep}{2}
The program will always print \texttt{"}x = 1\texttt{"}, because the \code{depend}
clauses enforce the ordering of the tasks. If the \code{depend} clauses had been
@ -38,9 +38,9 @@ race condition.
This example shows an output dependence using the \code{depend}
clause on the \code{task} construct.
\cexample{task_dep}{3}
\cexample[4.0]{task_dep}{3}
\ffreeexample{task_dep}{3}
\ffreeexample[4.0]{task_dep}{3}
The program will always print \texttt{"}x = 2\texttt{"}, because the \code{depend}
clauses enforce the ordering of the tasks. If the \code{depend} clauses had been
@ -55,9 +55,9 @@ In this example we show potentially concurrent execution of tasks using multiple
flow dependences expressed using the \code{depend} clause on the \code{task}
construct.
\cexample{task_dep}{4}
\cexample[4.0]{task_dep}{4}
\ffreeexample{task_dep}{4}
\ffreeexample[4.0]{task_dep}{4}
The last two tasks are dependent on the first task. However, there is no dependence
between the last two tasks, which may execute in any order (or concurrently if
@ -72,9 +72,9 @@ in any order and the program would have a race condition.
This example shows a task-based blocked matrix multiplication. Matrices are of
NxN elements, and the multiplication is implemented using blocks of BSxBS elements.
\cexample{task_dep}{5}
\cexample[4.0]{task_dep}{5}
\ffreeexample{task_dep}{5}
\ffreeexample[4.0]{task_dep}{5}
\subsection{\code{taskwait} with Dependences}
\label{subsec:taskwait_depend}
@ -105,9 +105,9 @@ Hence, immediately after the first \code{taskwait} it is unsafe to access the
The second \code{taskwait} ensures that the second child task has completed; hence
it is safe to access the \plc{y} variable in the following print statement.
\cexample{task_dep}{6}
\cexample[5.0]{task_dep}{6}
\ffreeexample{task_dep}{6}
\ffreeexample[5.0]{task_dep}{6}
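
A minimal hypothetical sketch of this pattern (the actual \plc{task\_dep.6}
source is included above):

#include <stdio.h>

int main(void)
{
   int x = 0, y = 0;
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp task depend(inout: x) shared(x)
         x++;                               /* 1st child task */
      #pragma omp task shared(y)
         y++;                               /* 2nd child task */

      #pragma omp taskwait depend(in: x)    /* waits only on the x task   */
      printf("x = %d\n", x);                /* safe: x task has completed */

      #pragma omp taskwait                  /* waits on all child tasks */
      printf("y = %d\n", y);                /* now y is safe to read    */
   }
   return 0;
}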
In this example the first two tasks are serialized, because a dependence on
the first child is produced by \plc{x} with the \code{in} dependence type
@ -123,9 +123,9 @@ included to illustrate how the child task dependences can be completely annotate
in a data-flow model.)
\cexample{task_dep}{7}
\cexample[5.0]{task_dep}{7}
\ffreeexample{task_dep}{7}
\ffreeexample[5.0]{task_dep}{7}
This example is similar to the previous one, except the generating task is
@ -149,9 +149,9 @@ the dependence type of variables in the \code{taskwait} \code{depend} clause
when selecting child tasks that the generating task must wait on, so that its execution after the
taskwait does not produce race conditions on variables accessed by non-completed child tasks.
\cexample{task_dep}{8}
\cexample[5.0]{task_dep}{8}
\ffreeexample{task_dep}{8}
\ffreeexample[5.0]{task_dep}{8}
\pagebreak
\subsection{Mutually Exclusive Execution with Dependences}
@ -168,18 +168,18 @@ to the \code{mutexinoutset} dependence type on \code{c}, T4 and T5 may be
scheduled in any order with respect to each other, but not at the same
time. Task T6 will be scheduled after both T4 and T5 have completed.
\cexample{task_dep}{9}
\cexample[5.0]{task_dep}{9}
\ffreeexample{task_dep}{9}
\ffreeexample[5.0]{task_dep}{9}
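
A hypothetical sketch of the T1--T6 structure described above (not the
actual \plc{task\_dep.9} source):

#include <stdio.h>

void mutex_sketch(void)
{
   int a, b, c, d;
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp task depend(out: c)
         c = 1;                                            /* T1 */
      #pragma omp task depend(out: a)
         a = 2;                                            /* T2 */
      #pragma omp task depend(out: b)
         b = 3;                                            /* T3 */
      #pragma omp task depend(in: a) depend(mutexinoutset: c)
         c += a;                                           /* T4 */
      #pragma omp task depend(in: b) depend(mutexinoutset: c)
         c += b;                                           /* T5 */
      #pragma omp task depend(in: c)
         d = c;                                            /* T6 */
   }   /* all child tasks complete at the implicit barriers */
   printf("d = %d\n", d);
}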
The following example demonstrates a situation where the \code{mutexinoutset}
dependence type is advantageous. If \code{shortTaskB} completes
before \code{longTaskA}, the runtime can take advantage of this by
scheduling \code{longTaskBC} before \code{shortTaskAC}.
\cexample{task_dep}{10}
\cexample[5.0]{task_dep}{10}
\ffreeexample{task_dep}{10}
\ffreeexample[5.0]{task_dep}{10}
\subsection{Multidependences Using Iterators}
\label{subsec:depend_iterator}
@ -211,6 +211,31 @@ identical nor disjoint to the storage prescribed by the elements of the
loop tasks. The iterator overcomes this restriction by effectively
creating n disjoint storage areas.
\cexample{task_dep}{11}
\cexample[5.0]{task_dep}{11}
\ffreeexample[5.0]{task_dep}{11}
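
A hypothetical sketch of an iterator multidependence (illustrative only,
not the \plc{task\_dep.11} source): the single consumer task below depends
on every element task generated by the loop.

double iterator_dep_sketch(int n, double *d)
{
   double sum = 0.0;
   #pragma omp parallel
   #pragma omp single
   {
      for (int i = 0; i < n; i++) {
         #pragma omp task depend(out: d[i])  /* i is firstprivate by default */
            d[i] = i;
      }

      /* one dependence per element d[0], ..., d[n-1] via the iterator */
      #pragma omp task depend(iterator(it = 0:n), in: d[it]) shared(sum)
      {
         for (int j = 0; j < n; j++)
            sum += d[j];
      }
      #pragma omp taskwait
   }
   return sum;
}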
\subsection{Dependence for Undeferred Tasks}
\label{subsec:depend_undefer_task}
In the following example, we show that task dependences are honored even for
an undeferred task, that is, a task whose \code{if} clause expression
evaluates to \plc{false}.
The \code{depend} clauses of the first and second explicit tasks specify that
the first task is completed before the second task.
The second explicit task has an \code{if} clause that evaluates to \plc{false}.
This means that the execution of the generating task (the implicit task of
the \code{single} region) must be suspended until the second explicit task
is completed.
But, because of the dependence, the first explicit task must complete first,
then the second explicit task can execute and complete, and only then
the generating task can resume to the print statement.
Thus, the program will always print "\texttt{x = 2}".
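
A minimal hypothetical sketch of this pattern (the full \plc{task\_dep.12}
example is included below):

#include <stdio.h>

int main(void)
{
   int x = 0;
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp task depend(out: x) shared(x)
         x = 1;                         /* 1st explicit task */

      #pragma omp task depend(inout: x) shared(x) if(0)
         x = 2;                         /* undeferred; waits on the 1st task */

      printf("x = %d\n", x);            /* always prints x = 2 */
   }
   return 0;
}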
\cexample[4.0]{task_dep}{12}
\clearpage
\ffreeexample[4.0]{task_dep}{12}
\ffreeexample{task_dep}{11}

View File

@ -16,7 +16,7 @@ The creation of tasks occurs in ascending order (according to the iteration spac
the loop) but a hint, by means of the \code{priority} clause, is provided to reverse
the execution order.
\cexample{task_priority}{1}
\cexample[4.5]{task_priority}{1}
\ffreeexample{task_priority}{1}
\ffreeexample[4.5]{task_priority}{1}

View File

@ -14,7 +14,7 @@ does not participate in the synchronization, and is left free to execute in para
This is opposed to the behavior of the \code{taskwait} construct, which would
include the background tasks in the synchronization.
\cexample{taskgroup}{1}
\cexample[4.0]{taskgroup}{1}
\ffreeexample{taskgroup}{1}
\ffreeexample[4.0]{taskgroup}{1}

View File

@ -9,17 +9,17 @@ note that the tasks will be executed in no specified order because there are no
synchronization directives. Thus, assuming that the traversal will be done in post
order, as in the sequential code, is wrong.
\cexample{tasking}{1}
\cexample[3.0]{tasking}{1}
\ffreeexample{tasking}{1}
\ffreeexample[3.0]{tasking}{1}
In the next example, we force a postorder traversal of the tree by adding a \code{taskwait}
directive. Now, we can safely assume that the left and right sons have been executed
before we process the current node.
\cexample{tasking}{2}
\cexample[3.0]{tasking}{2}
\ffreeexample{tasking}{2}
\ffreeexample[3.0]{tasking}{2}
The following example demonstrates how to use the \code{task} construct to process
elements of a linked list in parallel. The thread executing the \code{single}
@ -28,18 +28,19 @@ in the current team. The pointer \plc{p} is \code{firstprivate} by default
on the \code{task} construct so it is not necessary to specify it in a \code{firstprivate}
clause.
\cexample{tasking}{3}
\cexample[3.0]{tasking}{3}
\ffreeexample{tasking}{3}
\ffreeexample[3.0]{tasking}{3}
The \code{fib()} function should be called from within a \code{parallel} region
for the different specified tasks to be executed in parallel. Also, only one thread
of the \code{parallel} region should call \code{fib()} unless multiple concurrent
Fibonacci computations are desired.
\cexample{tasking}{4}
\cexample[3.0]{tasking}{4}
\fexample{tasking}{4}
\fexample[3.0]{tasking}{4}
\clearpage
Note: There are more efficient algorithms for computing Fibonacci numbers. This
classic recursion algorithm is for illustrative purposes.
@ -52,9 +53,9 @@ loop to suspend its task at the task scheduling point in the \code{task} directi
and start executing unassigned tasks. Once the number of unassigned tasks is sufficiently
low, the thread may resume execution of the task generating loop.
\cexample{tasking}{5}
\cexample[3.0]{tasking}{5}
\fexample{tasking}{5}
\fexample[3.0]{tasking}{5}
The following example is the same as the previous one, except that the tasks are
generated in an untied task. While generating the tasks, the implementation may
@ -69,9 +70,9 @@ to resume the task generating loop. In the previous examples, the other threads
would be forced to idle until the generating thread finishes its long task, since
the task generating loop was in a tied task.
\cexample{tasking}{6}
\cexample[3.0]{tasking}{6}
\fexample{tasking}{6}
\fexample[3.0]{tasking}{6}
The following two examples demonstrate how the scheduling rules illustrated in
Section 2.11.3 of the OpenMP 4.0 specification affect the usage of
@ -86,20 +87,20 @@ both of the task regions that modify \code{tp}. The parts of these task regions
in which \code{tp} is modified may be executed in any order so the resulting
value of \code{var} can be either 1 or 2.
\cexample{tasking}{7}
\cexample[3.0]{tasking}{7}
\fexample{tasking}{7}
\fexample[3.0]{tasking}{7}
In this example, scheduling constraints prohibit a thread in the team from executing
a new task that modifies \code{tp} while another such task region tied to the
same thread is suspended. Therefore, the value written will persist across the
task scheduling point.
\cexample{tasking}{8}
\cexample[3.0]{tasking}{8}
\fexample{tasking}{8}
\fexample[3.0]{tasking}{8}
The following two examples demonstrate how the scheduling rules illustrated in
Section 2.11.3 of the OpenMP 4.0 specification affect the usage of locks
@ -112,20 +113,20 @@ it encounters the task scheduling point at task 3, it could suspend task 1 and
begin task 2 which will result in a deadlock when it tries to enter critical region
1.
\cexample{tasking}{9}
\cexample[3.0]{tasking}{9}
\fexample{tasking}{9}
\fexample[3.0]{tasking}{9}
In the following example, \code{lock} is held across a task scheduling point.
However, according to the scheduling restrictions, the executing thread can't
begin executing one of the non-descendant tasks that also acquires \code{lock} before
the task region is complete. Therefore, no deadlock is possible.
\cexample{tasking}{10}
\cexample[3.0]{tasking}{10}
\ffreeexample{tasking}{10}
\ffreeexample[3.0]{tasking}{10}
\clearpage
The following examples illustrate the use of the \code{mergeable} clause in the
\code{task} construct. In this first example, the \code{task} construct has
@ -139,9 +140,9 @@ outcome does not depend on whether or not the task is merged (that is, the task
will always increment the same variable and will always compute the same value
for \code{x}).
\cexample{tasking}{11}
\cexample[3.1]{tasking}{11}
\ffreeexample{tasking}{11}
\ffreeexample[3.1]{tasking}{11}
This second example shows an incorrect use of the \code{mergeable} clause. In
this example, the created task will access different instances of the variable
@ -150,9 +151,9 @@ it will access the same variable \code{x} if the task is merged. As a result,
the behavior of the program is unspecified and it can print two different values
for \code{x} depending on the decisions taken by the implementation.
\cexample{tasking}{12}
\cexample[3.1]{tasking}{12}
\ffreeexample{tasking}{12}
\ffreeexample[3.1]{tasking}{12}
The following example shows the use of the \code{final} clause and the \code{omp\_in\_final}
API call in a recursive binary search program. To reduce overhead, once a certain
@ -170,9 +171,9 @@ in the stack could also be avoided but it would make this example less clear. Th
clause since all tasks created in a \code{final} task region are included tasks
that can be merged if the \code{mergeable} clause is present.
\cexample{tasking}{13}
\cexample[3.1]{tasking}{13}
\ffreeexample{tasking}{13}
\ffreeexample[3.1]{tasking}{13}
The following example illustrates the difference between the \code{if} and the
\code{final} clauses. The \code{if} clause has a local effect. In the first
@ -184,7 +185,7 @@ task itself. In the second nest of tasks, the nested tasks will be created as in
tasks. Note also that the conditions for the \code{if} and \code{final} clauses
are usually the opposite.
\cexample{tasking}{14}
\cexample[3.1]{tasking}{14}
\ffreeexample{tasking}{14}
\ffreeexample[3.1]{tasking}{14}

View File

@ -9,9 +9,9 @@ The \code{grainsize} clause specifies that each task is to execute at least 500
The \code{nogroup} clause removes the implicit taskgroup of the \code{taskloop} construct; the explicit \code{taskgroup} construct in the example ensures that the function is not exited before the long-running task and the loops have finished execution.
\cexample{taskloop}{1}
\cexample[4.5]{taskloop}{1}
\ffreeexample{taskloop}{1}
\ffreeexample[4.5]{taskloop}{1}
%\clearpage
@ -34,6 +34,6 @@ tasks. This is the common use case for the \code{taskloop} construct.)
In the example, the code thus prints \code{x1 = 16384} (\plc{T}*\plc{N}) and
\code{x2 = 1024} (\plc{N}).
\cexample{taskloop}{2}
\cexample[4.5]{taskloop}{2}
\ffreeexample{taskloop}{2}
\ffreeexample[4.5]{taskloop}{2}

View File

@ -8,7 +8,7 @@ that must be done in a critical region. By using \code{taskyield} when a task
cannot get access to the \code{critical} region the implementation can suspend
the current task and schedule some other task that can do something useful.
\cexample{taskyield}{1}
\cexample[3.1]{taskyield}{1}
\ffreeexample{taskyield}{1}
\ffreeexample[3.1]{taskyield}{1}
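
A sketch in the spirit of the example above (the helper routines
\plc{something\_useful} and \plc{something\_critical} are hypothetical
placeholders):

#include <omp.h>

void something_useful(void);
void something_critical(void);

void process(omp_lock_t *lock, int n)
{
   for (int i = 0; i < n; i++) {
      #pragma omp task
      {
         something_useful();
         while (!omp_test_lock(lock)) {
            #pragma omp taskyield     /* let the thread run other tasks */
         }
         something_critical();
         omp_unset_lock(lock);
      }
   }
}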

View File

@ -16,9 +16,9 @@ region. The \code{omp\_get\_team\_num} routine returns the team number, which is
between 0 and one less than the value returned by \code{omp\_get\_num\_teams}. The following
example manually distributes a loop across two teams.
\cexample{teams}{1}
\cexample[4.0]{teams}{1}
\ffreeexample{teams}{1}
\ffreeexample[4.0]{teams}{1}
\subsection{\code{target}, \code{teams}, and \code{distribute} Constructs}
\label{subsec:teams_distribute}
@ -47,9 +47,9 @@ created by the \code{teams} construct. At the end of the \code{teams} region,
each master thread's private copy of \plc{sum} is reduced into the final \plc{sum} that is
implicitly mapped into the \code{target} region.
\cexample{teams}{2}
\cexample[4.0]{teams}{2}
\ffreeexample{teams}{2}
\ffreeexample[4.0]{teams}{2}
\subsection{\code{target} \code{teams}, and Distribute Parallel Loop Constructs}
\label{subsec:teams_distribute_parallel}
@ -62,9 +62,9 @@ team executes the \code{teams} region.
The distribute parallel loop construct schedules the loop iterations across the
master threads of each team and then across the threads of each team.
\cexample{teams}{3}
\cexample[4.5]{teams}{3}
\ffreeexample{teams}{3}
\ffreeexample[4.5]{teams}{3}
\subsection{\code{target} \code{teams} and Distribute Parallel Loop
Constructs with Scheduling Clauses}
@ -87,9 +87,9 @@ The \code{schedule} clause indicates that the 1024 iterations distributed to
a master thread are then assigned to the threads in its associated team in chunks
of 64 iterations.
\cexample{teams}{4}
\cexample[4.0]{teams}{4}
\ffreeexample{teams}{4}
\ffreeexample[4.0]{teams}{4}
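
For reference, a hypothetical sketch of the clause combination described
above (illustrative only, not the \plc{teams.4} source):

#define N (8*1024)

void vsum(float *a, float *b, float *c)
{
   #pragma omp target teams map(to: b[:N], c[:N]) map(from: a[:N])
   #pragma omp distribute parallel for dist_schedule(static, 1024) \
                                       schedule(static, 64)
   for (int i = 0; i < N; i++)
      a[i] = b[i] + c[i];
}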
\subsection{\code{target} \code{teams} and \code{distribute} \code{simd} Constructs}
\label{subsec:teams_distribute_simd}
@ -102,9 +102,9 @@ master thread of each team executes the \code{teams} region.
The \code{distribute} \code{simd} construct schedules the loop iterations across
the master thread of each team and then uses SIMD parallelism to execute the iterations.
\cexample{teams}{5}
\cexample[4.0]{teams}{5}
\ffreeexample{teams}{5}
\ffreeexample[4.0]{teams}{5}
\subsection{\code{target} \code{teams} and Distribute Parallel Loop SIMD Constructs}
\label{subsec:teams_distribute_parallel_simd}
@ -118,7 +118,7 @@ The distribute parallel loop SIMD construct schedules the loop iterations across
the master thread of each team and then across the threads of each team where each
thread uses SIMD parallelism.
\cexample{teams}{6}
\cexample[4.0]{teams}{6}
\ffreeexample{teams}{6}
\ffreeexample[4.0]{teams}{6}

View File

@ -26,35 +26,36 @@ The initializer of the \code{declare}~\code{reduction} directive specifies
the initial value for the private variable of each implicit task.
The \code{omp\_priv} identifier is used to denote the private variable.
\cexample{udr}{1}
\cexample[4.0]{udr}{1}
\clearpage
The following example shows the corresponding code in Fortran.
The \code{declare}~\code{reduction} directives are specified as part of
the declaration in subroutine \plc{find\_enclosing\_rectangle} and
the procedures that perform the min and max operations are specified as subprograms.
\ffreeexample{udr}{1}
\ffreeexample[4.0]{udr}{1}
The following example shows the same computation as \plc{udr.1} but it illustrates that you can craft complex expressions in the user-defined reduction declaration. In this case, instead of calling the \plc{minproc} and \plc{maxproc} functions we inline the code in a single expression.
\cexample{udr}{2}
\cexample[4.0]{udr}{2}
The corresponding code of the same example in Fortran is very similar
except that the assignment expression in the \code{declare}~\code{reduction}
directive can only be used for a single variable, in this case through
a type structure constructor \plc{point($\ldots$)}.
\ffreeexample{udr}{2}
\ffreeexample[4.0]{udr}{2}
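
As a further illustration of inlining the combiner and initializer as
expressions, here is a hypothetical sketch of a minimum reduction over a
simple point structure (names are illustrative; it is not one of the
\plc{udr} sources):

#include <float.h>

struct point { double x, y; };

#pragma omp declare reduction(minpt : struct point :                    \
        (omp_out.x = omp_in.x < omp_out.x ? omp_in.x : omp_out.x,       \
         omp_out.y = omp_in.y < omp_out.y ? omp_in.y : omp_out.y))      \
        initializer(omp_priv = (struct point){ DBL_MAX, DBL_MAX })

struct point lower_left(struct point *p, int n)
{
   struct point m = { DBL_MAX, DBL_MAX };
   #pragma omp parallel for reduction(minpt : m)
   for (int i = 0; i < n; i++) {
      if (p[i].x < m.x) m.x = p[i].x;
      if (p[i].y < m.y) m.y = p[i].y;
   }
   return m;
}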
The following example shows the use of special variables in arguments for combiner (\code{omp\_in} and \code{omp\_out}) and initializer (\code{omp\_priv} and \code{omp\_orig}) routines. This example returns the maximum value of an array and the corresponding index value. The \code{declare}~\code{reduction} directive specifies a user-defined reduction operation \plc{maxloc} for data type \plc{struct} \plc{mx\_s}. The function \plc{mx\_combine} is the combiner and the function \plc{mx\_init} is the initializer.
\cexample{udr}{3}
\cexample[4.0]{udr}{3}
Below is the corresponding Fortran version of the above example. The \code{declare}~\code{reduction} directive specifies the user-defined operation \plc{maxloc} for user-derived type \plc{mx\_s}. The combiner \plc{mx\_combine} and the initializer \plc{mx\_init} are specified as subprograms.
\ffreeexample{udr}{3}
\ffreeexample[4.0]{udr}{3}
The following example explains a few details of the user-defined reduction
@ -74,16 +75,16 @@ has the \code{initializer} clause, the subroutine specified on the clause
must be accessible in the current scoping unit. In this case,
the subroutine \plc{dt\_init} is accessible by use association.
\ffreeexample{udr}{4}
\ffreeexample[4.0]{udr}{4}
The following example uses user-defined reductions to declare a plus (+) reduction for a C++ class. As the \code{declare}~\code{reduction} directive is inside the context of the \plc{V} class the expressions in the \code{declare}~\code{reduction} directive are resolved in the context of the class. Also, note that the \code{initializer} clause uses a copy constructor to initialize the private variables of the reduction, and it passes the original variable as the constructor argument by means of the special variable \code{omp\_orig}.
\cppexample{udr}{5}
\cppexample[4.0]{udr}{5}
The following example shows how user-defined reductions can be defined for some STL containers. The first \code{declare}~\code{reduction} defines the plus (+) operation for \plc{std::vector<int>} by making use of the \plc{std::transform} algorithm. The second and third define the merge (or concatenation) operation for \plc{std::vector<int>} and \plc{std::list<int>}.
%It shows how the same user-defined reduction operation can be defined to be done differently depending on the specified data type.
It shows how the user-defined reduction operation can be applied to specific STL data types.
\cppexample{udr}{6}
\cppexample[4.0]{udr}{6}

View File

@ -46,9 +46,9 @@ the purpose of a function variant is to produce the same results by a different
%\code{teams distribute simd} in the variant function would produce non conforming code.
%\pagebreak
\cexample{declare_variant}{1}
\cexample[5.0]{declare_variant}{1}
\ffreeexample{declare_variant}{1}
\ffreeexample[5.0]{declare_variant}{1}
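
As a minimal hypothetical sketch of the mechanism (the function names and
the isa string are illustrative; this is not the \plc{declare\_variant.1}
source):

void avx512_saxpy(int n, float a, float *x, float *y); /* specialized variant */

#pragma omp declare variant( avx512_saxpy ) \
            match( device={isa("core-avx512")} )
void saxpy(int n, float a, float *x, float *y)          /* base function */
{
   for (int i = 0; i < n; i++)
      y[i] = a * x[i] + y[i];
}

Where the context selector matches, a call to saxpy may be replaced by a
call to avx512_saxpy.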
%\pagebreak
@ -72,6 +72,6 @@ containing only a basic \code{parallel}~\code{for} construct is used for the cal
%can be found in the allocator example of the Memory Management Chapter.
%\pagebreak
\cexample{declare_variant}{2}
\cexample[5.0]{declare_variant}{2}
\ffreeexample{declare_variant}{2}
\ffreeexample[5.0]{declare_variant}{2}

View File

@ -5,11 +5,11 @@
The OpenMP Examples document has been updated with new features
found in the OpenMP 5.0 Specification. The additional examples and updates
are referenced in the Document Revision History of the Appendix, \specref{sec:history_45_to_50}.
are referenced in the Document Revision History of the Appendix on page~\pageref{chap:history}.
Text describing an example with a 5.0 feature specifically states
that the feature support begins in the OpenMP 5.0 Specification. Also,
an \plc{omp\_5.0} keyword has been added to metadata in the source code.
an \code{\small omp\_5.0} keyword has been added to metadata in the source code.
These distinctions are presented to remind readers that a 5.0 compliant
OpenMP implementation is necessary to use these features in codes.

View File

@ -1,6 +1,46 @@
\chapter{Document Revision History}
\label{chap:history}
\section{Changes from 5.0.0 to 5.0.1}
\label{sec:history_50_to_501}
\begin{itemize}
\item Added version tags (\code{\small{}omp\_}\plc{x.y}) to example labels
and the corresponding source code for all examples that use
features from OpenMP 3.0 and later.
\item Included additional examples for the 5.0 features:
\begin{itemize}
\item Extension to the \code{defaultmap} clause
(\specref{sec:defaultmap})
\item Transferring noncontiguous data with the \code{target}~\code{update} directive in Fortran (\specref{sec:array-shaping})
\item \code{conditional} modifier for the \code{lastprivate} clause (\specref{sec:lastprivate})
\item \code{task} modifier for the \code{reduction} clause (\specref{subsec:task_reduction})
\item Reduction on combined target constructs (\specref{subsec:target_reduction})
\item Task reduction with target constructs
(\specref{subsec:target_task_reduction})
\item \code{scan} directive for returning the \emph{prefix sum} of a reduction (\specref{sec:scan})
\end{itemize}
\item Included additional examples for the 4.x features:
\begin{itemize}
\item Dependence for undeferred tasks
(\specref{subsec:depend_undefer_task})
\item \code{ref}, \code{val}, \code{uval} modifiers for \code{linear} clause (\specref{sec:linear_modifier})
\end{itemize}
\item Clarified the description of pointer mapping and pointer attachment in
\specref{sec:pointer_mapping}.
\item Clarified the description of memory model examples
in \specref{sec:mem_model}.
\end{itemize}
\section{Changes from 4.5.0 to 5.0.0}
\label{sec:history_45_to_50}
@ -21,6 +61,8 @@
\item Combined constructs: \code{parallel}~\code{master}~\code{taskloop} and \code{parallel}~\code{master}~\code{taskloop}~\code{simd}
(\specref{sec:parallel_master_taskloop})
\item Reverse Offload through \plc{ancestor} modifier of \code{device} clause. (\specref{subsec:target_reverse_offload})
\item Pointer Mapping - behavior of mapped pointers (\specref{sec:pointer_mapping}) %Example_target_ptr_map*
\item Structure Mapping - behavior of mapped structures (\specref{sec:structure_mapping}) %Examples_target_structure_mapping.tex target_struct_map*
\item Array Shaping with the \plc{shape-operator} (\specref{sec:array-shaping})
\item The \code{declare}~\code{mapper} construct (\specref{sec:declare_mapper})
\item Acquire and Release Semantics Synchronization: Memory ordering
@ -36,12 +78,16 @@
\item \code{requires} directive specifies required features of implementation (\specref{sec:requires})
\item \code{declare}~\code{variant} directive - for function variants (\specref{sec:declare_variant})
\item \code{metadirective} directive - for directive variants (\specref{sec:metadirective})
\item \code{OMP\_TARGET\_OFFLOAD} Environment Variable - controls offload behavior (\specref{sec:target_offload})
\end{itemize}
\item Included the following additional examples for the 4.x features:
\begin{itemize}
\item more taskloop examples (\specref{sec:taskloop})
\item user-defined reduction (UDR) (\specref{subsec:UDR})
%NEW 5.0
%\item \code{target} \code{enter} and \code{exit} \code{data} unstructured data constructs (\specref{sec:target_enter_exit_data}) %Example_target_unstructured_data.* ?
\end{itemize}
\end{itemize}
@ -101,12 +147,12 @@ Added the following new examples:
\begin{itemize}
\item task dependences (\specref{sec:task_depend})
\item \code{target} construct (\specref{sec:target})
\item array sections in device constructs (\specref{sec:array_sections})
\item \code{target}~\code{data} construct (\specref{sec:target_data})
\item \code{target}~\code{update} construct (\specref{sec:target_update})
\item \code{declare}~\code{target} construct (\specref{sec:declare_target})
\item \code{teams} constructs (\specref{sec:teams})
\item asynchronous execution of a \code{target} region using tasks (\specref{subsec:async_target_with_tasks})
\item array sections in device constructs (\specref{sec:array_sections})
\item device runtime routines (\specref{sec:device})
\item Fortran ASSOCIATE construct (\specref{sec:associate})
\item cancellation constructs (\specref{sec:cancellation})

View File

@ -1,8 +1,9 @@
# Makefile for the OpenMP Examples document in LaTex format.
# For more information, see the master document, openmp-examples.tex.
version=5.0.0
version=5.0.1
default: openmp-examples.pdf
diff: openmp-diff-abridged.pdf
CHAPTERS=Title_Page.tex \
@ -25,6 +26,9 @@ INTERMEDIATE_FILES=openmp-examples.pdf \
openmp-examples.out \
openmp-examples.log
# check for branches names with "name_XXX"
DIFF_TICKET_ID=$(shell git rev-parse --abbrev-ref HEAD)
openmp-examples.pdf: $(CHAPTERS) $(SOURCES) openmp.sty openmp-examples.tex openmp-logo.png
rm -f $(INTERMEDIATE_FILES)
pdflatex -interaction=batchmode -file-line-error openmp-examples.tex
@ -34,4 +38,46 @@ openmp-examples.pdf: $(CHAPTERS) $(SOURCES) openmp.sty openmp-examples.tex openm
clean:
rm -f $(INTERMEDIATE_FILES)
rm -f openmp-diff-full.pdf openmp-diff-abridged.pdf
rm -rf *.tmpdir
ifdef DIFF_TO
VC_DIFF_TO := -r ${DIFF_TO}
else
VC_DIFF_TO :=
endif
ifdef DIFF_FROM
VC_DIFF_FROM := -r ${DIFF_FROM}
else
VC_DIFF_FROM := -r master
endif
DIFF_TO:=HEAD
DIFF_FROM:=master
DIFF_TYPE:=UNDERLINE
COMMON_DIFF_OPTS:=--math-markup=whole \
--append-safecmd=plc,code,hcode,scode,pcode,splc \
--append-textcmd=subsubsubsection
VC_DIFF_OPTS:=${COMMON_DIFF_OPTS} --force -c latexdiff.cfg --flatten --type="${DIFF_TYPE}" --git --pdf ${VC_DIFF_FROM} ${VC_DIFF_TO} --subtype=ZLABEL --graphics-markup=none
VC_DIFF_MINIMAL_OPTS:= --only-changes --force
%.tmpdir: $(wildcard *.sty) $(wildcard *.png) $(wildcard *.aux) openmp-examples.pdf
mkdir -p $@/sources
mkdir -p $@/figs
cp -f $^ "$@/"
cp -f sources/* "$@/sources"
cp -f figs/* "$@/figs"
openmp-diff-abridged.pdf: diff-fast-minimal.tmpdir openmp-examples.pdf
env PATH="$(shell pwd)/util/latexdiff:$(PATH)" latexdiff-vc ${VC_DIFF_MINIMAL_OPTS} --fast -d $< ${VC_DIFF_OPTS} openmp-examples.tex
cp $</openmp-examples.pdf $@
if [ "x$(DIFF_TICKET_ID)" != "x" ]; then cp $@ ${@:.pdf=-$(DIFF_TICKET_ID).pdf}; fi
# Slow but portable diffs
openmp-diff-minimal.pdf: diffs-slow-minimal.tmpdir
env PATH="$(shell pwd)/util/latexdiff:$(PATH)" latexdiff-vc ${VC_DIFF_MINIMAL_OPTS} -d $< ${VC_DIFF_OPTS} openmp-examples.tex
cp $</openmp-examples.pdf $@
if [ "x$(DIFF_TICKET_ID)" != "x" ]; then cp $@ ${@:.pdf=-$(DIFF_TICKET_ID).pdf}; fi

README
View File

@ -32,6 +32,7 @@ For copyright information, please see omp_copyright.txt.
@@compilable: yes|no|maybe
@@linkable: yes|no|maybe
@@expect: success|failure|nothing|rt-error
@@version: omp_<verno>
"name" is the name of an example
"type" is the source code type, which can be translated into or from
@ -43,20 +44,27 @@ For copyright information, please see omp_copyright.txt.
"rt-error" is for a case where compilation may be successful,
but the code contains potential runtime issues (such as race condition).
Alternative would be to just use "conforming" or "non-conforming".
"version" indicates features for a specific OpenMP version, such as "omp_5.0"
3) LaTeX macros for examples
- Source code with language h-rules
\cexample{<ename>}{<seq-no>} % for C/C++ examples
\cppexample{<ename>}{<seq-no>} % for C++ examples
\fexample{<ename>}{<seq-no>} % for fixed-form Fortran examples
\ffreeexample{<ename>}{<seq-no>} % for free-form Fortran examples
\cexample[<verno>]{<ename>}{<seq-no>} % for C/C++ examples
\cppexample[<verno>]{<ename>}{<seq-no>} % for C++ examples
\fexample[<verno>]{<ename>}{<seq-no>} % for fixed-form Fortran examples
\ffreeexample[<verno>]{<ename>}{<seq-no>} % for free-form Fortran examples
- Source code without language h-rules
\cnexample{<ename>}{<seq-no>}
\cppnexample{<ename>}{<seq-no>}
\fnexample{<ename>}{<seq-no>}
\ffreenexample{<ename>}{<seq-no>}
\cnexample[<verno>]{<ename>}{<seq-no>}
\cppnexample[<verno>]{<ename>}{<seq-no>}
\fnexample[<verno>]{<ename>}{<seq-no>}
\ffreenexample[<verno>]{<ename>}{<seq-no>}
Optional <verno> can be supplied in a macro to include a specific OpenMP
version in the example header. This option also indicates that an additional
tag (@@version) line is included in the corresponding source code.
If this is not the case (i.e., there is no @@version tag line), one needs to
prefix <verno> with an underscore '_' symbol in the macro.
- Language h-rules
\cspecificstart, \cspecificend

View File

@ -27,7 +27,7 @@ Source codes for OpenMP \PVER{} Examples can be downloaded from
\href{https://github.com/OpenMP/Examples/tree/v\VER}{github}.\\
\begin{adjustwidth}{0pt}{1em}\setlength{\parskip}{0.25\baselineskip}%
Copyright © 1997-2019 OpenMP Architecture Review Board.\\
Copyright © 1997-2020 OpenMP Architecture Review Board.\\
Permission to copy without fee all or part of this material is granted,
provided the OpenMP Architecture Review Board copyright notice and
the title of this document appear. Notice is given that copying is by

latexdiff.cfg Normal file
View File

@ -0,0 +1,5 @@
PICTUREENV=(?:picture|DIFnomarkup)[\w\d*@]*
VERBATIMLINEENV=(?:boxedcode|omptCallback|omptRecord|omptInquiry|omptEnum|omptOther|ompcPragma|ompfPragma|ompfSyntax|ompfFunction|ompfSubroutine|ompcEnum|ompcFunction|ompfEnum|ompEnv|ompSyntax|indentedcodelist|codepar)
COUNTERCMD=subsubsubsection
CUSTOMDIFCMD=(?:binding|comments|constraints|crossreferences|descr|argdesc|effect|format|restrictions|summary|syntax|events|tools|record|glossaryterm)
FLOATENV=(?:note|(?:c|cpp|ccpp|c90|c99|fortran)specific)

View File

@ -48,9 +48,9 @@
\documentclass[10pt,letterpaper,twoside,makeidx,hidelinks]{scrreprt}
% Text to appear in the footer on even-numbered pages:
\newcommand{\VER}{5.0.0}
\newcommand{\VER}{5.0.1}
\newcommand{\PVER}{\VER{}p1}
\newcommand{\VERDATE}{February 2018}
\newcommand{\VERDATE}{May 2020}
\newcommand{\footerText}{OpenMP Examples Version \PVER{} - \VERDATE}
% Unified style sheet for OpenMP documents:

View File

@ -49,9 +49,9 @@
\documentclass[10pt,letterpaper,twoside,makeidx,hidelinks]{scrreprt}
% Text to appear in the footer on even-numbered pages:
\newcommand{\VER}{5.0.0}
\newcommand{\VER}{5.0.1}
\newcommand{\PVER}{\VER{}}
\newcommand{\VERDATE}{November 2019}
\newcommand{\VERDATE}{June 2020}
\newcommand{\footerText}{OpenMP Examples Version \PVER{} - \VERDATE}
% Unified style sheet for OpenMP documents:
@ -119,6 +119,7 @@
\input{Chap_devices}
\input{Examples_target}
\input{Examples_target_defaultmap}
\input{Examples_target_pointer_mapping}
\input{Examples_target_structure_mapping}
\input{Examples_array_sections}
@ -146,6 +147,8 @@
% Forward Depend 370
% simdlen 476
% simd linear modifier 480
\input{Examples_linear_modifier}
\input{Chap_synchronization}
\input{Examples_critical}
@ -182,6 +185,7 @@
\input{Examples_reduction}
% User UDR 287
\input{Examples_udr}
\input{Examples_scan}
\input{Examples_copyin}
\input{Examples_copyprivate}
\input{Examples_cpp_reference}

View File

@ -417,10 +417,10 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Code example formatting for the Examples document
% This defines:
% /cexample formats blue markers, caption, and code for C examples
% /cppexample formats blue markers, caption, and code for C++ examples
% /fexample formats blue markers, caption, and code for Fortran (fixed) examples
% /ffreeexample formats blue markers, caption, and code for Fortran90 (free) examples
% \cexample formats blue markers, caption, and code for C examples
% \cppexample formats blue markers, caption, and code for C++ examples
% \fexample formats blue markers, caption, and code for Fortran (fixed) examples
% \ffreeexample formats blue markers, caption, and code for Fortran90 (free) examples
% Thanks to Jin, Haoqiang H. for the original definitions of the following:
\usepackage{color,fancyvrb} % for \VerbatimInput
@ -437,7 +437,7 @@
\newcommand{\escstr}[1]{\myreplace{_}{\_}{#1}}
\def\exampleheader#1#2#3#4{%
\def\exampleheader#1#2#3#4#5{%
\ifthenelse{ \equal{#1}{} }{
\def\cname{#2}
\def\ename\cname
@ -448,52 +448,61 @@
% Use following for mneumonics
\def\ename{\escstr{#1}.#2.#3}
}
\newcount\cnt
\cnt=#4
\ifthenelse{ \equal{#5}{} }{
\def\vername{}
}{
\def\myver##1{\toolboxSplitAt{##1}{_}\lefttext\righttext
\lefttext\toolboxIfElse{\ifx\righttext\undefined}%
{\global\advance\cnt by 1}{\expandafter{\righttext}}}
\def\vername{\;\;(\code{\small{}omp\_\myver{#5}})}
}
\noindent
\textit{Example \ename}
\textit{Example \ename}\vername
\def\fcnt{\the\cnt}
%\vspace*{-3mm}
\code{\VerbatimInput[numbers=left,numbersep=5ex,firstnumber=1,firstline=#4,fontsize=\small]%
%\code{\VerbatimInput[numbers=left,firstnumber=1,firstline=#4,fontsize=\small]%
%\code{\VerbatimInput[firstline=#4,fontsize=\small]%
\code{\VerbatimInput[numbers=left,numbersep=5ex,firstnumber=1,firstline=\fcnt,fontsize=\small]%
{sources/Example_\cname}}
}
\def\cnexample#1#2{%
\exampleheader{#1}{#2}{c}{8}
\newcommand\cnexample[3][]{%
\exampleheader{#2}{#3}{c}{8}{#1}
}
\def\cppnexample#1#2{%
\exampleheader{#1}{#2}{cpp}{8}
\newcommand\cppnexample[3][]{%
\exampleheader{#2}{#3}{cpp}{8}{#1}
}
\def\fnexample#1#2{%
\exampleheader{#1}{#2}{f}{6}
\newcommand\fnexample[3][]{%
\exampleheader{#2}{#3}{f}{6}{#1}
}
\def\ffreenexample#1#2{%
\exampleheader{#1}{#2}{f90}{6}
\newcommand\ffreenexample[3][]{%
\exampleheader{#2}{#3}{f90}{6}{#1}
}
\newcommand\cexample[2]{%
\newcommand\cexample[3][]{%
\needspace{5\baselineskip}\ccppspecificstart
\cnexample{#1}{#2}
\cnexample[#1]{#2}{#3}
\ccppspecificend
}
\newcommand\cppexample[2]{%
\newcommand\cppexample[3][]{%
\needspace{5\baselineskip}\cppspecificstart
\cppnexample{#1}{#2}
\cppnexample[#1]{#2}{#3}
\cppspecificend
}
\newcommand\fexample[2]{%
\newcommand\fexample[3][]{%
\needspace{5\baselineskip}\fortranspecificstart
\fnexample{#1}{#2}
\fnexample[#1]{#2}{#3}
\fortranspecificend
}
\newcommand\ffreeexample[2]{%
\newcommand\ffreeexample[3][]{%
\needspace{5\baselineskip}\fortranspecificstart
\ffreenexample{#1}{#2}
\ffreenexample[#1]{#2}{#3}
\fortranspecificend
}

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: no
* @@expect: success
* @@version: omp_4.0
*/
void star( double *a, double *b, double *c, int n, int *ioff )
{

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: no
! @@expect: success
! @@version: omp_4.0
subroutine star(a,b,c,n,ioff_ptr)
implicit none
double precision :: a(*),b(*),c(*)

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_4.0
*/
#include <stdio.h>

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: yes
! @@expect: success
! @@version: omp_4.0
program main
implicit none
integer, parameter :: N=32
@ -19,15 +20,15 @@ program main
end program
function add1(a,b,fact) result(c)
!$omp declare simd(add1) uniform(fact)
implicit none
!$omp declare simd(add1) uniform(fact)
double precision :: a,b,fact, c
c = a + b + fact
end function
function add2(a,b,i, fact) result(c)
!$omp declare simd(add2) uniform(a,b,fact) linear(i:1)
implicit none
!$omp declare simd(add2) uniform(a,b,fact) linear(i:1)
integer :: i
double precision :: a(*),b(*),fact, c
c = a(i) + b(i) + fact

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: no
* @@expect: success
* @@version: omp_4.0
*/
double work( double *a, double *b, int n )
{

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: no
! @@expect: success
! @@version: omp_4.0
subroutine work( a, b, n, sum )
implicit none
integer :: i, n

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: no
* @@expect: success
* @@version: omp_4.0
*/
void work( float *b, int n, int m )
{

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: no
! @@expect: success
! @@version: omp_4.0
subroutine work( b, n, m )
implicit none
real :: b(n)

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: no
* @@expect: success
* @@version: omp_4.0
*/
void work( double **a, double **b, double **c, int n )
{

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: no
! @@expect: success
! @@version: omp_4.0
subroutine work( a, b, c, n )
implicit none
integer :: i,j,n

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: no
* @@expect: success
* @@version: omp_4.0
*/
#pragma omp declare simd linear(p:1) notinbranch
int foo(int *p){

View File

@ -3,9 +3,10 @@
! @@compilable: yes
! @@linkable: no
! @@expect: success
! @@version: omp_4.0
function foo(p) result(r)
!$omp declare simd(foo) notinbranch
implicit none
!$omp declare simd(foo) notinbranch
integer :: p, r
p = p + 10
r = p
@ -26,8 +27,8 @@ function myaddint(a, b, n) result(r)
end function myaddint
function goo(p) result(r)
!$omp declare simd(goo) inbranch
implicit none
!$omp declare simd(goo) inbranch
real :: p, r
p = p + 18.5
r = p

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_4.0
*/
#include <stdio.h>
#include <stdlib.h>

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: yes
! @@expect: success
! @@version: omp_4.0
program fibonacci
implicit none
integer,parameter :: N=45
@ -25,8 +26,8 @@ program fibonacci
end program
recursive function fib(n) result(r)
!$omp declare simd(fib) inbranch
implicit none
!$omp declare simd(fib) inbranch
integer :: n, r
if (n <= 1) then

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_4.0
*/
#include <stdio.h>
#include <math.h>

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: yes
! @@expect: success
! @@version: omp_4.0
module work
integer :: P(1000)

View File

@ -1,9 +1,10 @@
/*
* @@name: acquire_release.1.c
* @@type: C
* @@compilable: yes, omp_5.0
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_5.0
*/
#include <stdio.h>

View File

@ -1,8 +1,9 @@
! @@name: acquire_release.1.f90
! @@type: F-free
! @@compilable: yes, omp_5.0
! @@compilable: yes
! @@linkable: yes
! @@expect: success
! @@version: omp_5.0
program rel_acq_ex1
use omp_lib

View File

@ -1,9 +1,10 @@
/*
* @@name: acquire_release.2.c
* @@type: C
* @@compilable: yes, omp_5.0
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_5.0
*/
#include <stdio.h>

View File

@ -1,8 +1,9 @@
! @@name: acquire_release.2.f90
! @@type: F-free
! @@compilable: yes, omp_5.0
! @@compilable: yes
! @@linkable: yes
! @@expect: success
! @@version: omp_5.0
program rel_acq_ex2
use omp_lib

View File

@ -1,9 +1,10 @@
/*
* @@name: acquire_release.3.c
* @@type: C
* @@compilable: yes, omp_5.0
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_5.0
*/
#include <stdio.h>

View File

@ -1,8 +1,9 @@
! @@name: acquire_release.3.f90
! @@type: F-free
! @@compilable: yes, omp_5.0
! @@compilable: yes
! @@linkable: yes
! @@expect: success
! @@version: omp_5.0
program rel_acq_ex3
use omp_lib

View File

@ -1,9 +1,10 @@
/*
* @@name: acquire_release.4.c
* @@type: C
* @@compilable: yes, omp_5.0
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_5.0
*/
#include <stdio.h>

View File

@ -1,8 +1,9 @@
! @@name: acquire_release.4.f90
! @@type: F-free
! @@compilable: yes, omp_5.0
! @@compilable: yes
! @@linkable: yes
! @@expect: success
! @@version: omp_5.0
program rel_acq_ex4
use omp_lib
@ -13,7 +14,7 @@ program rel_acq_ex4
!! !!! THIS CODE WILL FAIL TO PRODUCE CONSISTENT RESULTS !!!!!!!
!! !!! DO NOT PROGRAM SYNCHRONIZATION THIS WAY !!!!!!!
!$omp parallel num_threads private(thrd) private(tmp)
!$omp parallel num_threads(2) private(thrd) private(tmp)
thrd = omp_get_thread_num()
if (thrd == 0) then
!$omp critical

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_4.0
*/
void work();

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: yes
! @@expect: success
! @@version: omp_4.0
PROGRAM EXAMPLE
!$OMP PARALLEL PROC_BIND(SPREAD) NUM_THREADS(4)
CALL WORK()

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: no
* @@expect: success
* @@version: omp_4.0
*/
void work();
void foo()

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: no
! @@expect: success
! @@version: omp_4.0
subroutine foo
!$omp parallel num_threads(16) proc_bind(spread)
call work()

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_4.0
*/
void work();
int main()

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: yes
! @@expect: success
! @@version: omp_4.0
PROGRAM EXAMPLE
!$OMP PARALLEL PROC_BIND(CLOSE) NUM_THREADS(4)
CALL WORK()

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: no
* @@expect: success
* @@version: omp_4.0
*/
void work();
void foo()

View File

@ -3,6 +3,7 @@
! @@compilable: yes
! @@linkable: no
! @@expect: success
! @@version: omp_4.0
subroutine foo
!$omp parallel num_threads(16) proc_bind(close)
call work()

View File

@ -4,6 +4,7 @@
* @@compilable: yes
* @@linkable: yes
* @@expect: success
* @@version: omp_4.0
*/
void work();
int main()

Some files were not shown because too many files have changed in this diff