\pagebreak \section{\kcode{unroll} Construct} \label{sec:unroll} \index{constructs!unroll@\kcode{unroll}} \index{unroll construct@\kcode{unroll} construct} \index{unroll construct@\kcode{unroll} construct!full clause@\kcode{full} clause} \index{full clause@\kcode{full} clause} \index{clauses!full@\kcode{full}} \index{unroll construct@\kcode{unroll} construct!partial clause@\kcode{partial} clause} \index{partial clause@\kcode{partial} clause} \index{clauses!partial@\kcode{partial}} The \kcode{unroll} construct is a loop transformation that increases the number of loop blocks in a loop, while reducing the number of iterations. The \kcode{full} clause specifies that the loop is to be completely unrolled. That is, a loop block for each iteration is created, and the loop is removed. A \kcode{partial} clause with an \plc{unroll-factor} specifies that the number of iterations will be reduced multiplicatively by the factor while the number of blocks will be increased by the same factor. Operationally, the loop is tiled by the factor, and the tiled loop is fully expanded, resulting in a single loop with multiple blocks. Unrolling can reduce control-flow overhead and provide additional optimization opportunities for the compiler and the processor pipeline. Nevertheless, unrolling can increase the code size, and saturate the instruction cache. Hence, the trade-off may need to be assessed. Unrolling a loop does not change the code's semantics. Also, compilers may unroll loops without explicit directives, at various optimization levels. In the example below, the \kcode{unroll} construct is used without any clause, and then with a \kcode{full} clause, in the first two functions, respectively. When no clause is used, it is up to the implementation (compiler) to decide if and how the loop is to be unrolled. The iteration count can have a run time value. In the second function, the \kcode{unroll} construct uses a \kcode{full} clause to completely unroll the loop. A compile-time constant is required for the iteration count. The statements in the third function (\ucode{unroll_full_equivalent}) illustrates equivalent code for the full unrolling in the second function. \cexample[5.1]{unroll}{1} \ffreeexample[5.1]{unroll}{1} The next example shows cases when it is incorrect to use full unrolling. \cexample[5.1]{unroll}{2} \ffreeexample[5.1]{unroll}{2} In many cases, when the iteration count is large and/or dynamic, it is reasonable to partially unroll a loop by including a \kcode{partial} clause. In the \ucode{unroll3_partial} function below, the \plc{unroll-factor} value of 4 is used to create a tile size of 4 that is unrolled to create 4 unrolled statements. The equivalent ``hand unrolled'' loop code is presented in the \ucode{unroll3_partial_equivalent} function. If the \plc{unroll-factor} is omitted, as in the \ucode{unroll3_partial_nofactor} function, the implementation may optimally select a factor from 1 (no unrolling) to the iteration count (full unrolling). In the latter case the construct generates a loop with a single iteration. \cexample[5.1]{unroll}{3} \ffreeexample[5.1]{unroll}{3} When the iteration count is not a multiple of the \plc{unroll-factor}, iterations that should not produce executions must be conditionally protected from execution. In this example, the first function unrolls a loop that has a variable iteration count. Since the \kcode{unroll} construct uses a \kcode{partial(\ucode{4})} clause, the compiler will need to create code that can account for cases when the iteration count is not a multiple of 4. A brute-force, simple-to-understand approach for implementing the conditionals is shown in the \ucode{unroll_partial_remainder_option1} function. The remaining two functions show more optimal algorithms the compiler may select to implement the transformation. Optimal approaches may reduce the number of conditionals as shown in \ucode{unroll_partial_remainder_option2}, and may eliminate conditionals completely by peeling off a ``remainder'' into a separate loop as in \ucode{unroll_partial_remainder_option3}. Regardless of the optimization, implementations must ensure that the semantics remain the same, especially when additional directives are applied to the unrolled loop. For the case in the \ucode{unroll_partial_remainder_option3} function, the fission of the worksharing-loop construct may result in a different distribution of threads to the iterations. Since no reproducible scheduling is specified on the work-sharing construct, the worksharing-loop and unrolling are compliant. \cexample[5.1]{unroll}{4} \ffreeexample[5.1]{unroll}{4}