Efficient parallel programming on GPUs requires not just raw compute power, but careful attention to how data is stored and accessed. CUDA exposes multiple distinct memory types, each with unique properties. This post introduces each CUDA memory type and explains the compute-to-memory access ratio, a key metric for GPU performance optimization.


CUDA Memory Types

Figure 1: An overview of the CUDA device memory model.

CUDA devices provide several types of memory, each serving different roles in executing massively parallel computations:

| Memory Type | Location | Scope | Lifetime | Access Speed | Capacity | Typical Use |
| --- | --- | --- | --- | --- | --- | --- |
| Register | On-chip, per thread | Thread | Kernel | Fastest | Few KB/thread | Scalars, loop counters |
| Local | Off-chip (DRAM) | Thread | Kernel | Slow | Large | Large thread-private arrays, register spill |
| Shared | On-chip, per block | Block | Kernel | Fast (on-chip) | ~100 KB/SM | Block-wide data sharing, frequent reuse |
| Global | Off-chip (DRAM) | All | Application | Slow | GBs | Main data arrays |
| Constant | Off-chip + cached | All | Application | Fast if cached | 64 KB | Invariants, kernel parameters (read-only to GPU) |

Register Memory

  • Scope: Private to each thread
  • Performance: Fastest memory with extremely low latency and high bandwidth
  • Limitation: Limited quantity; excessive use reduces occupancy (number of simultaneous threads per SM)
  • Usage: Scalars and frequently accessed variables

Local Memory

  • Scope: Private to each thread
  • Location: Physically resides in slow off-chip global memory (DRAM)
  • Usage: Large per-thread arrays or register spillover
  • Performance: High latency; avoid for performance-critical variables

Shared Memory

  • Scope: Shared among threads in a block
  • Location: On-chip, configurable per block
  • Performance: Much faster than global memory, ideal for data reused across threads in a block
  • Usage: Cooperative loading/reuse (e.g., block-level tile in matrix multiply)
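
As an illustration, here is a minimal sketch of a block-level sum reduction: each thread performs a single global load, and all further reuse happens in shared memory. The kernel and variable names (`blockSum`, `TILE`) are illustrative, not from the text, and the kernel assumes it is launched with `TILE` threads per block.

```cuda
#define TILE 256

// Each block cooperatively loads TILE elements into shared memory, then
// reduces them to one partial sum entirely on-chip.
__global__ void blockSum(const float *in, float *partial, int n) {
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // one global load per thread
    __syncthreads();                              // wait until the tile is full

    // Tree reduction in shared memory: data loaded once from global memory
    // is reused log2(TILE) times without touching DRAM again.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];            // one global store per block
}
```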

Global Memory

  • Scope: Accessible to all threads and host
  • Location: Large off-chip DRAM
  • Performance: High latency, limited bandwidth
  • Usage: Main storage for arrays and data structures
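
For reference, the typical host-side workflow around global memory looks roughly like the sketch below: allocate device DRAM, copy the input over, launch a kernel, and copy the result back. The `scale` kernel and `runScale` helper are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every access to `data` goes through global memory.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

// Allocate device DRAM, copy input in, launch, copy the result back, free.
void runScale(float *host, int n) {
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```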

Constant Memory

  • Scope: All threads, read-only for device
  • Location: Resides in off-chip DRAM, but cached on-chip in a dedicated constant cache
  • Performance: Fast if accesses are uniform (broadcast), small total size
  • Usage: Kernel parameters, constants, look-up tables
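
A minimal sketch, assuming a small read-only coefficient table that every thread reads in the same order (the names `coeffs` and `polyEval` are illustrative). The host would fill the table once with `cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs))` before launching.

```cuda
// A small read-only table in the 64 KB constant space.
__constant__ float coeffs[16];

// Evaluates a degree-15 polynomial at each x[i] using Horner's rule.
__global__ void polyEval(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        // Every thread in the warp reads the same coeffs[k] each iteration,
        // so the constant cache serves it as a single broadcast.
        for (int k = 15; k >= 0; --k)
            acc = acc * x[i] + coeffs[k];
        y[i] = acc;
    }
}
```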

Variable Lifetime and Declaration

CUDA uses type qualifiers to associate variables with memory types:

  • __device__ places data in global memory
  • __shared__ requests shared memory within a block
  • __constant__ allocates constant memory
  • Automatic (non-array) variables default to registers; automatic arrays go to local memory
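
A compact sketch of how these declarations map to memory types; all names are illustrative, and the body exists only so each variable is actually used.

```cuda
__device__   float gStats[256];    // __device__   -> global memory, application lifetime
__constant__ float gParams[32];    // __constant__ -> constant memory (filled from the host)

__global__ void whereThingsLive(float *out, int n) {
    __shared__ float tile[128];    // __shared__   -> on-chip, one copy per block
    float x = 2.0f;                // automatic scalar -> register
    float scratch[64];             // automatic array  -> local memory (off-chip)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;

    scratch[t % 64] = gParams[t % 32] * x;         // constant read, local write
    tile[t % 128]   = scratch[t % 64];             // shared write
    __syncthreads();
    if (i < n)
        out[i] = tile[t % 128] + gStats[t % 256];  // global reads and write
}
```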

The Compute-to-Memory Access Ratio

What Is It?

The compute-to-memory access ratio (also called arithmetic intensity or computational intensity) is the number of floating-point operations (FLOPs) performed for every byte loaded from global memory (FLOP/B):

  • Low ratio: Kernel is memory-bound—performance limited by the rate at which data can be fetched from memory.
  • High ratio: Kernel is compute-bound—performance limited by computational hardware, not data movement.
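
A classic example is a naive matrix multiply: each iteration of the inner loop performs two floating-point operations (one multiply, one add) but loads two 4-byte floats from global memory, for a ratio of only 2 / 8 = 0.25 FLOP/B. A sketch of such a kernel (names are hypothetical):

```cuda
// Naive matrix multiply: C = A * B for width x width matrices,
// one output element per thread.
__global__ void matMulNaive(const float *A, const float *B, float *C, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            // 2 FLOPs per 8 bytes loaded (two 4-byte floats):
            // arithmetic intensity = 2 / 8 = 0.25 FLOP/B -> memory-bound.
            sum += A[row * width + k] * B[k * width + col];
        C[row * width + col] = sum;
    }
}
```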

The Roofline Model

Figure 2: Roofline model showing memory- vs compute-bound regions.

The roofline model visually captures the interaction:

  • X-axis: Arithmetic intensity (FLOP/B)
  • Y-axis: Throughput (GFLOPS)
  • Performance is capped by either the memory bandwidth “roof” or the compute “roof”.

Points close to the rooflines are efficient, while points far below indicate wasted compute or bandwidth. In general, writing kernels with higher computational intensity and more data reuse is the key to better performance.
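
Expressed as a formula (standard roofline notation, not anything defined earlier in this post), attainable throughput is the lesser of the compute roof and the bandwidth roof scaled by arithmetic intensity:

$$P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; B_{\text{mem}} \times I\bigr)$$

where \(P_{\text{peak}}\) is the device's peak compute throughput (FLOP/s), \(B_{\text{mem}}\) its peak memory bandwidth (B/s), and \(I\) the kernel's arithmetic intensity (FLOP/B). For example, on a hypothetical GPU with 1,500 GB/s of memory bandwidth, a kernel running at 0.25 FLOP/B cannot exceed 375 GFLOPS, no matter how much compute the hardware offers.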

How To Improve the Ratio

Optimization Checklist

  • Maximize data reuse: Use shared memory or registers to serve multiple computations with a single memory load (see the tiling sketch after this list).
  • Reduce memory traffic: Carefully coalesce accesses and avoid unnecessary loads/stores.
  • Exploit locality: Rearrange algorithms to reuse data in fast, on-chip memories.
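
Here is a minimal sketch of the first two points applied to the naive matrix multiply shown earlier, assuming square matrices whose width is a multiple of the tile size (the `TILE` value and kernel name are illustrative):

```cuda
#define TILE 16

// Tiled matrix multiply: each block stages TILE x TILE sub-matrices of A and B
// in shared memory, so every value fetched from global memory is reused TILE times.
__global__ void matMulTiled(const float *A, const float *B, float *C, int width) {
    __shared__ float Asub[TILE][TILE];
    __shared__ float Bsub[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / TILE; ++t) {       // assumes width % TILE == 0
        // Cooperative, coalesced loads of one tile of A and one tile of B.
        Asub[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
        Bsub[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)             // all reuse happens on-chip
            sum += Asub[threadIdx.y][k] * Bsub[k][threadIdx.x];
        __syncthreads();
    }
    C[row * width + col] = sum;
}
```

Each element staged in shared memory now feeds TILE multiply-add pairs instead of one, raising the arithmetic intensity from 0.25 to roughly TILE/4 FLOP/B (4 FLOP/B with TILE = 16).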

Summary

CUDA programming performance is tightly linked to memory system use:

  • Use faster on-chip memories (registers/shared/constant) wherever possible
  • Structure kernels for high compute-to-memory ratios
  • Use metrics and models like the roofline to diagnose performance bottlenecks
  • Optimize for both hardware and algorithmic efficiency for best results
