mirror of
https://github.com/OpenMP/Examples.git
synced 2025-04-04 05:41:33 +01:00
88 lines
4.6 KiB
TeX
88 lines
4.6 KiB
TeX
\pagebreak
|
|
\section{\kcode{unroll} Construct}
|
|
\label{sec:unroll}
|
|
\index{constructs!unroll@\kcode{unroll}}
|
|
\index{unroll construct@\kcode{unroll} construct}
|
|
\index{unroll construct@\kcode{unroll} construct!full clause@\kcode{full} clause}
|
|
\index{full clause@\kcode{full} clause}
|
|
\index{clauses!full@\kcode{full}}
|
|
\index{unroll construct@\kcode{unroll} construct!partial clause@\kcode{partial} clause}
|
|
\index{partial clause@\kcode{partial} clause}
|
|
\index{clauses!partial@\kcode{partial}}
|
|
|
|
The \kcode{unroll} construct is a loop transformation that increases the
|
|
number of loop blocks in a loop, while reducing the number of iterations.
|
|
The \kcode{full} clause specifies that the loop is to be completely unrolled.
|
|
That is, a loop block for each iteration is created, and the loop is removed.
|
|
A \kcode{partial} clause with an \plc{unroll-factor} specifies that the number of
|
|
iterations will be reduced multiplicatively by the factor while the number of
|
|
blocks will be increased by the same factor.
|
|
Operationally, the loop is tiled by the factor, and the tiled loop is
|
|
fully expanded, resulting in a single loop with multiple blocks.
|
|
|
|
Unrolling can reduce control-flow overhead and provide additional
|
|
optimization opportunities for the compiler and the processor
|
|
pipeline. Nevertheless, unrolling can increase the code size, and saturate
|
|
the instruction cache. Hence, the trade-off may need to be assessed.
|
|
Unrolling a loop does not change the code's semantics. Also, compilers
|
|
may unroll loops without explicit directives, at various optimization levels.
|
|
|
|
In the example below, the \kcode{unroll} construct is used without any clause, and then
|
|
with a \kcode{full} clause, in the first two functions, respectively.
|
|
When no clause is used, it is up to the implementation (compiler)
|
|
to decide if and how the loop is to be unrolled.
|
|
The iteration count can have a run time value.
|
|
In the second function, the \kcode{unroll} construct uses a \kcode{full} clause
|
|
to completely unroll the loop. A compile-time constant is required for the iteration count.
|
|
The statements in the third function (\ucode{unroll_full_equivalent}) illustrates
|
|
equivalent code for the full unrolling in the second function.
|
|
|
|
\cexample[5.1]{unroll}{1}
|
|
\ffreeexample[5.1]{unroll}{1}
|
|
|
|
|
|
The next example shows cases when it is incorrect to use full unrolling.
|
|
|
|
\cexample[5.1]{unroll}{2}
|
|
\ffreeexample[5.1]{unroll}{2}
|
|
|
|
In many cases, when the iteration count is large and/or dynamic, it is
|
|
reasonable to partially unroll a loop by including a \kcode{partial} clause.
|
|
In the \ucode{unroll3_partial} function below, the \plc{unroll-factor} value
|
|
of 4 is used to create a tile size of 4 that is unrolled to create 4 unrolled statements.
|
|
The equivalent ``hand unrolled'' loop code is presented in the
|
|
\ucode{unroll3_partial_equivalent} function.
|
|
If the \plc{unroll-factor} is omitted, as in the \ucode{unroll3_partial_nofactor}
|
|
function, the implementation may optimally select a factor from 1
|
|
(no unrolling) to the iteration count (full unrolling).
|
|
In the latter case the construct generates a loop with a single iteration.
|
|
|
|
\cexample[5.1]{unroll}{3}
|
|
\ffreeexample[5.1]{unroll}{3}
|
|
|
|
When the iteration count is not a multiple of the \plc{unroll-factor},
|
|
iterations that should not produce executions must be conditionally
|
|
protected from execution. In this example, the first function
|
|
unrolls a loop that has a variable iteration count. Since the \kcode{unroll}
|
|
construct uses a \kcode{partial(\ucode{4})} clause, the compiler will need to
|
|
create code that can account for cases when the iteration count is not a
|
|
multiple of 4. A brute-force, simple-to-understand approach for implementing
|
|
the conditionals is shown in the \ucode{unroll_partial_remainder_option1} function.
|
|
|
|
The remaining two functions show more optimal algorithms the compiler
|
|
may select to implement the transformation.
|
|
Optimal approaches may reduce the number of conditionals as shown in
|
|
\ucode{unroll_partial_remainder_option2}, and
|
|
may eliminate conditionals completely by peeling off a ``remainder''
|
|
into a separate loop as in \ucode{unroll_partial_remainder_option3}.
|
|
|
|
Regardless of the optimization, implementations must ensure that the semantics
|
|
remain the same, especially when additional directives are applied to
|
|
the unrolled loop. For the case in the \ucode{unroll_partial_remainder_option3}
|
|
function, the fission of the worksharing-loop construct may result in a different
|
|
distribution of threads to the iterations. Since no reproducible scheduling
|
|
is specified on the work-sharing construct, the worksharing-loop and unrolling are compliant.
|
|
|
|
\cexample[5.1]{unroll}{4}
|
|
\ffreeexample[5.1]{unroll}{4}
|