README.md



CUTLASS 3.4
CUTLASS 3.4 - February 2024
CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-matrix multiplication (GEMM) and related computations at all levels
and scales within CUDA. It incorporates strategies for hierarchical decomposition and
data movement similar to those used to implement cuBLAS and cuDNN.  CUTLASS decomposes
these "moving parts" into reusable, modular software components abstracted by C++ template
classes.  Primitives for different levels of a conceptual parallelization hierarchy
can be specialized and tuned via custom tiling sizes, data types,
and other algorithmic policy. The resulting flexibility simplifies their use
as building blocks within custom kernels and applications.
To support a wide variety of applications, CUTLASS provides extensive support for
mixed-precision computations, providing specialized data-movement and
multiply-accumulate abstractions for half-precision floating
point (FP16), BFloat16 (BF16), Tensor Float 32 (TF32),
single-precision floating point (FP32),
FP32 emulation via tensor core instruction,
double-precision floating
point (FP64) types, integer data types (4b and 8b), and binary data types (1b).
CUTLASS demonstrates warp-synchronous matrix multiply operations
targeting the programmable, high-throughput Tensor Cores implemented by
NVIDIA's Volta, Turing, Ampere, and Hopper architectures.
See the Quick Start Guide to get started quickly.
See the functionality listing for the list of operations
supported at each level of the execution model hierarchy.
CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data.
CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides Layout and Tensor objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations.
The core abstractions of CuTe are hierarchically multidimensional layouts which can be composed with data arrays to represent tensors. The representation of layouts is powerful enough to represent nearly everything we need to implement efficient dense linear algebra. Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates. This greatly simplifies the design
and improves code composability and readability. More documentation specific to CuTe can be found in its dedicated documentation directory.
In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.

What's New in CUTLASS 3.4
CUTLASS 3.4.1 is an update to CUTLASS adding:

Statically available CUTLASS Version macros that allow for handling API changes between CUTLASS releases on the users' side.
Improvements for Hopper Group-GEMM and Pointer-Array Batched GEMM.
Updates and bugfixes from the community (thanks!).

CUTLASS 3.4.0 is an update to CUTLASS adding:

Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
Beta release of Group-GEMM - commonly used in optimization of Mixture-Of-Expert models, is now available on Hopper GPUs taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above)

Ampere Sparse GEMM supports Epilogue Visitor Tree (EVT) now.
Improvements to NamedBarriers including details of ReservedNamedBarriers used within the CUTLASS library.
Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved.

Minimum requirements:

Architecture: Volta
Compiler: Must support at least C++17
CUDA Toolkit version: 11.4

Starting from CUTLASS 3.0, CUTLASS removed support for the following:

Maxwell and Pascal GPU architectures
Ubuntu 16.04
CUDA 10.2
C++ language versions less than 17.