shfl_scan - CUDA Parallel Prefix Sum with Shuffle Intrinsics (SHFL_Scan)

Description

This example demonstrates how to use the shuffle intrinsic __shfl_up_sync to perform a scan operation across a thread block.

Key Concepts

Data-Parallel Algorithms, Performance Strategies

Supported SM Architectures

Supported OSes

Linux, Windows

Supported CPU Architecture

x86_64, armv7l, aarch64

CUDA APIs involved

CUDA Runtime API

cudaMemcpy, cudaFree, cudaMallocHost, cudaEventSynchronize, cudaEventRecord, cudaFreeHost, cudaGetDevice, cudaMemset, cudaMalloc, cudaEventElapsedTime, cudaGetDeviceProperties, cudaEventCreate

Dependencies needed to build/run

CPP11

Prerequisites

Download and install the CUDA Toolkit 12.5 for your corresponding platform. Make sure the dependencies mentioned in Dependencies section above are installed.

References (for more details)