OpenMP-Examples/tasking/task_dep.tex

%\pagebreak
\section{Task Dependences}
\label{sec:task_depend}
\index{dependences!task dependences}

\subsection{Flow Dependence}
\label{subsec:task_flow_depend}
\index{task dependences!flow dependence}
\index{task construct@\kcode{task} construct!depend clause@\kcode{depend} clause}
\index{task construct@\kcode{task} construct}
\index{constructs!task@\kcode{task}}
\index{depend clause@\kcode{depend} clause}
\index{clauses!depend@\kcode{depend}}

This example shows a simple flow dependence using a \kcode{depend}
clause on the \kcode{task} construct.

\cexample[4.0]{task_dep}{1}

\ffreeexample[4.0]{task_dep}{1}

The program will always print \pout{x = 2}, because the \kcode{depend}
clauses enforce the ordering of the tasks. If the \kcode{depend} clauses had been
omitted, then the tasks could execute in any order and the program and the program
would have a race condition.

\subsection{Anti-dependence}
\label{subsec:task_anti_depend}
\index{task dependences!anti dependence}

This example shows an anti-dependence using the \kcode{depend}
clause on the \kcode{task} construct.

\cexample[4.0]{task_dep}{2}

\ffreeexample[4.0]{task_dep}{2}

The program will always print \pout{x = 1}, because the \kcode{depend}
clauses enforce the ordering of the tasks. If the \kcode{depend} clauses had been
omitted, then the tasks could execute in any order and the program would have a
race condition.

\subsection{Output Dependence}
\label{subsec:task_out_depend}
\index{task dependences!output dependence}

This example shows an output dependence using the \kcode{depend}
clause on the \kcode{task} construct.

\cexample[4.0]{task_dep}{3}

\ffreeexample[4.0]{task_dep}{3}

The program will always print \pout{x = 2}, because the \kcode{depend}
clauses enforce the ordering of the tasks. If the \kcode{depend} clauses had been
omitted, then the tasks could execute in any order and the program would have a
race condition.

\pagebreak
\subsection{Concurrent Execution with Dependences}
\label{subsec:task_concurrent_depend}
\index{task dependences!concurrent execution with}

In this example we show potentially concurrent execution of tasks using multiple
flow dependences expressed using the \kcode{depend} clause on the \kcode{task}
construct.

The last two tasks are dependent on the first task. However, there is no dependence
between the last two tasks, which may execute in any order (or concurrently if
more than one thread is available). Thus, the possible outputs are
\pout{x + 1 = 3. x + 2 = 4.} and \pout{x + 2 = 4. x + 1 = 3.}.
If the \kcode{depend} clauses had been omitted, then all of the tasks could execute
in any order and the program would have a race condition.

\cexample[4.0]{task_dep}{4}

\ffreeexample[4.0]{task_dep}{4}

The following example illustrates the semantic difference between \kcode{inout}
and \kcode{inoutset} dependence types.  In Case 1, tasks generated at T1
inside the loop have dependences among themselves due to
the \kcode{inout} dependence type and with task T2.
As a result, these tasks are executed sequentially before the print
statement from task T2.
In Case 2, tasks generated at T3 inside the loop have no dependences
among themselves from the \kcode{inoutset} dependence type, but have
dependences with task T4.
As a result, these tasks are executed concurrently before the print
statement from task T4.

\cexample[5.1]{task_dep}{4b}

\ffreeexample[5.1]{task_dep}{4b}

\subsection{Matrix multiplication}
\label{subsec:task_matrix_mult}
\index{task dependences!matrix multiplication}

This example shows a task-based blocked matrix multiplication. Matrices are of
\ucode{N}x\ucode{N} elements, and the multiplication is implemented using blocks
of \ucode{BS}x\ucode{BS} elements.

\cexample[4.0]{task_dep}{5}

\ffreeexample[4.0]{task_dep}{5}

\subsection{\kcode{taskwait} with Dependences}
\label{subsec:taskwait_depend}
\index{task dependences!taskwait construct with@\kcode{taskwait} construct with}
\index{taskwait construct@\kcode{taskwait} construct}
\index{constructs!taskwait@\kcode{taskwait}}
\index{taskwait construct@\kcode{taskwait} construct!depend clause@\kcode{depend} clause}
\index{depend clause@\kcode{depend} clause}
\index{clauses!depend@\kcode{depend}}

In this subsection three examples illustrate how the
\kcode{depend} clause can be applied to a \kcode{taskwait} construct to make the
generating task wait for specific child tasks to complete. This is an OpenMP 5.0 feature.
 In the same manner that
dependences can order executions among child tasks with \kcode{depend} clauses on
\kcode{task} constructs, the generating task can be scheduled to wait on child tasks
at a \kcode{taskwait} before it can proceed.

Note: Since the \kcode{depend} clause on a \kcode{taskwait} construct relaxes the
default synchronization behavior (waiting for all children to finish), it is important to
realize that child tasks that are not predecessor tasks, as determined by the \kcode{depend}
clause of the \kcode{taskwait} construct, may be running concurrently while the
generating task is executing after the taskwait.

In the first example the generating task waits at the \kcode{taskwait} construct
for the completion of the first child task because a dependence on the first task
is produced by \ucode{x} with an \kcode{in} dependence type within the \kcode{depend}
clause of the \kcode{taskwait} construct.
Immediately after the first \kcode{taskwait} construct it is safe to access the
\ucode{x} variable by the generating task, as shown in the print statement.
There is no completion restraint on the second child task.
Hence, immediately after the first \kcode{taskwait} it is unsafe to access the
\ucode{y} variable since the second child task may still be executing.
The second \kcode{taskwait} ensures that the second child task has completed; hence
it is safe to access the \ucode{y} variable in the following print statement.

\cexample[5.0]{task_dep}{6}

\ffreeexample[5.0]{task_dep}{6}

In this example the first two tasks are serialized, because a dependence on
the first child is produced by \ucode{x} with the \kcode{in} dependence type
in the \kcode{depend} clause of the second task.
However, the generating task at the first \kcode{taskwait} waits only on the
first child task to complete, because a dependence on only the first child task
is produced by \ucode{x} with an \kcode{in} dependence type within the
\kcode{depend} clause of the \kcode{taskwait} construct.
The second \kcode{taskwait} (without a \kcode{depend} clause) is included
to guarantee completion of the second task before \ucode{y} is accessed.
(While unnecessary, the \kcode{depend(inout: \ucode{y})} clause on the 2nd child task is
included to illustrate how the child task dependences can be completely annotated
in a data-flow model.)


\cexample[5.0]{task_dep}{7}

\ffreeexample[5.0]{task_dep}{7}
\clearpage


This example is similar to the previous one, except the generating task is
directed to also wait for completion of the second task.

The \kcode{depend} clause of the \kcode{taskwait} construct now includes an
\kcode{in} dependence type for \ucode{y}.  Hence the generating task must now
wait on completion of any child task having \ucode{y} with an \kcode{out}
(here \kcode{inout}) dependence type in its \kcode{depend} clause.
So, the \kcode{depend} clause of the \kcode{taskwait} construct now constrains
the second task to complete at the taskwait, too.
%--both tasks must now complete execution at the \code{taskwait}.
(This change makes the second taskwait of the previous example unnecessary--
it has been removed in this example.)

Note: While a \kcode{taskwait} construct ensures that all child tasks have completed; a depend clause on a \kcode{taskwait}
construct only waits for specific child tasks (prescribed by the dependence type and list
items in the \kcode{taskwait}'s \kcode{depend} clause).
This and the previous example illustrate the need to carefully determine
the dependence type of variables in the \kcode{depend} clause of the \kcode{taskwait} construct.
when selecting child tasks that the generating task must wait on, so that its execution after the
taskwait does not produce race conditions on variables accessed by non-completed child tasks.

\cexample[5.0]{task_dep}{8}

\ffreeexample[5.0]{task_dep}{8}

\pagebreak
\subsection{Mutually Exclusive Execution with Dependences}
\label{subsec:task_dep_mutexinoutset}
\index{task dependences!mutually exclusive execution}

In this example we show a series of tasks, including mutually exclusive
tasks, expressing dependences using the \kcode{depend} clause on the
\kcode{task} construct.

The program will always print \pout{6}. Tasks T1, T2 and T3 will be scheduled first,
in any order. Task T4 will be scheduled after tasks T1 and T2 are
completed. T5 will be scheduled after tasks T1 and T3 are completed. Due
to the \kcode{mutexinoutset} dependence type on \ucode{c}, T4 and T5 may be
scheduled in any order with respect to each other, but not at the same
time. Tasks T6 will be scheduled after both T4 and T5 are completed.

\cexample[5.0]{task_dep}{9}

\ffreeexample[5.0]{task_dep}{9}

The following example demonstrates a situation where the \kcode{mutexinoutset}
dependence type is advantageous. If \ucode{shortTaskB} completes
before \ucode{longTaskA}, the runtime can take advantage of this by
scheduling \ucode{longTaskBC} before \ucode{shortTaskAC}.

\cexample[5.0]{task_dep}{10}

\ffreeexample[5.0]{task_dep}{10}

\subsection{Multidependences Using Iterators}
\label{subsec:depend_iterator}
\index{task dependences!using iterators}
\index{depend clause@\kcode{depend} clause!iterator modifier@\kcode{iterator} modifier}
\index{iterator modifier@\kcode{iterator} modifier}

The following example uses an iterator to define a dynamic number of
dependences.

In the \kcode{single} construct of a parallel region a loop generates \ucode{n} tasks
and each task has an \kcode{out} dependence specified through an element of
the \ucode{v} array.  This is followed by a single task that defines an \kcode{in}
dependence on each element of the array.  This is accomplished by
using the \kcode{iterator} modifier in the \kcode{depend} clause, supporting a dynamic number
of dependences (\ucode{n} here).

The task for the \ucode{print_all_elements} procedure is not executed until all dependences
prescribed (or registered) by the iterator are fulfilled; that is,
after all the tasks generated by the loop have completed.

Note, one cannot simply use an array section in the \kcode{depend} clause
of the second task construct because this would violate the \kcode{depend} clause restriction:

``List items used in \kcode{depend} clauses of the same task or sibling tasks
must indicate identical storage locations or disjoint storage locations''.

In this case each of the loop tasks use a single disjoint (different storage)
element in their \kcode{depend} clause; however,
the array-section storage area prescribed in the commented directive is neither
identical nor disjoint to the storage prescribed by the elements of the
loop tasks.  The iterator overcomes this restriction by effectively
creating \ucode{n} disjoint storage areas.

\cexample[5.0]{task_dep}{11}

\ffreeexample[5.0]{task_dep}{11}

\subsection{Dependence for Undeferred Tasks}
\label{subsec:depend_undefer_task}
\index{task dependences!undeferred tasks}

In the following example, we show that even if a task is undeferred as specified
by an \kcode{if} clause that evaluates to \vcode{false}, task dependences are
still honored.

The \kcode{depend} clauses of the first and second explicit tasks specify that
the first task is completed before the second task.

The second explicit task has an \kcode{if} clause that evaluates to \vcode{false}.
This means that the execution of the generating task (the implicit task of
the \kcode{single} region) must be suspended until the second explicit task
is completed.
But, because of the dependence, the first explicit task must complete first,
then the second explicit task can execute and complete, and only then
the generating task can resume to the print statement.
Thus, the program will always print \pout{x = 2}.

\cexample[4.0]{task_dep}{12}
\clearpage

\ffreeexample[4.0]{task_dep}{12}


In OpenMP 5.1 the \kcode{omp_all_memory} \plc{reserved locator} was introduced
to specify storage of all objects in memory. In the following example,
it is used in Task 4 as a convenient way to specify that the locator
(list item) denotes the storage of all objects (locations) in memory, and
will therefore match the \ucode{a} and \ucode{d} locators of Task 2, Task 3 and Task 6.
The dependences guarantee the ordered execution of Tasks 2 and 3 before 4, and
Task 4 before Task 6.
Since there are no dependences imposed on Task 1 and Task 5, they can be
scheduled to execute at any time, with no ordering.

\cexample[5.1]{task_dep}{13}
\ffreeexample[5.1]{task_dep}{13}

\subsection{Transparent Task Dependences}
\label{subsec:depend_trans_task}
\index{task dependences!transparent tasks}


In the following example, each iteration of the \ucode{h}-loop updates all
elements of array \ucode{M} and task dependences are used to synchronize
updates across different iterations of the loop. The code uses two levels of
dependent tasks and assumes that
\ucode{N_ROWS} is evenly divisible by \ucode{ROWS_PER_TASK}.
The \ucode{h}-loop generates the first level of tasks, with
the \kcode{depend} clause serializing their execution and each task calling
\ucode{my_func}. A second level of tasks are generated by the \ucode{i}-loop in
\ucode{my_func}.

However, the dependences for this second level of tasks are between tasks from
different calls to \ucode{my_func}. In order to enforce these dependences, the
first-level tasks are specified as transparent tasks with the
\kcode{transparent(omp_impex)} clause. The \kcode{omp_impex} argument (which
is the default if not explicitly specified) indicates that the task is both an
exporting and importing task. For the purposes of dependence matching, an
exporting task is one that makes its child tasks visible to its successors and
an importing task is one that makes its preceding tasks (such as earlier
sibling tasks) visible to its child tasks. As a result of the exposed
dependences, the task generated in the $i^{th}$ iteration of the
\ucode{h}=$h_0$ instance of \ucode{my_func} is guaranteed to be ordered before
the task generated in the $i^{th}$ iteration of the \ucode{h}=$h_1$ instance of
\ucode{my_func}, where $h_0 < h_1$.

\cexample[6.0]{task_dep}{14}
\ffreeexample[6.0]{task_dep}{14}