Warp: A Python Multi-GPU Programming Model with CUDA Performance and Flexibility
Abstract
The rise of artificial intelligence has firmly established Python as the primary programming language for AI, and a similar trend is emerging in scientific applications. However, designing efficient and scalable applications in Python presents unique challenges, driving the development of innovative frameworks. For example, tools like JAX offer a tensor-based programming model inspired by NumPy that can be mapped to multi-GPU execution. While these abstractions provide significant capabilities, they often obscure low-level details, limiting the user's ability to write optimal code. In addition, tensor-based frameworks often struggle with efficient conditional logic, loops, and memory management.
NVIDIA Warp introduces a new solution: a Python multi-GPU programming model with a CUDA-like interface. This approach empowers developers to write high-performance, fine-grained parallel code in Python with CUDA-level efficiency. Warp leverages just-in-time (JIT) compilation not only to directly generate and compile CUDA code but also to provide a differentiable programming interface, which is essential for optimization and AI workflows.
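To illustrate the flavor of this CUDA-like model, the short sketch below defines a Warp kernel and launches it on a GPU; the SAXPY computation, array sizes, and device name are illustrative choices rather than part of the tutorial material.

```python
import warp as wp
import numpy as np

wp.init()

# A Warp kernel is written as annotated Python and JIT-compiled to CUDA.
@wp.kernel
def saxpy(x: wp.array(dtype=wp.float32),
          y: wp.array(dtype=wp.float32),
          a: float):
    i = wp.tid()            # global thread index, analogous to CUDA's thread/block indices
    y[i] = a * x[i] + y[i]

n = 1024
x = wp.array(np.ones(n, dtype=np.float32), device="cuda:0")
y = wp.array(np.zeros(n, dtype=np.float32), device="cuda:0")

# Launch n GPU threads, one per array element.
wp.launch(saxpy, dim=n, inputs=[x, y, 2.0], device="cuda:0")

print(y.numpy()[:4])        # -> [2. 2. 2. 2.]
```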
In this tutorial, we will introduce the Warp multi-GPU programming model and demonstrate how to write high-performance, multi-GPU applications in Python. Beginning with simple Warp kernel development and profiling, we will progress to more advanced features such as automatic differentiation and multi-GPU programming.
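As a glimpse of the differentiable programming interface, the following sketch records kernel launches on a Warp tape and runs a backward pass through a toy sum-of-squares loss; the kernel and array names are illustrative.

```python
import warp as wp
import numpy as np

wp.init()

@wp.kernel
def sum_of_squares(x: wp.array(dtype=wp.float32),
                   loss: wp.array(dtype=wp.float32)):
    i = wp.tid()
    wp.atomic_add(loss, 0, x[i] * x[i])   # loss = sum_i x_i^2

n = 8
x = wp.array(np.linspace(0.0, 1.0, n, dtype=np.float32), requires_grad=True)
loss = wp.zeros(1, dtype=wp.float32, requires_grad=True)

# Record kernel launches on a tape, then run the backward pass through them.
tape = wp.Tape()
with tape:
    wp.launch(sum_of_squares, dim=n, inputs=[x, loss])

tape.backward(loss)
print(x.grad.numpy())   # expected gradient: 2 * x
```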
Schedule
The tutorial is divided into two parts, one per day. Each part starts with a general introduction to Warp and then focuses on a different aspect of the programming model: multi-GPU programming on day one and out-of-core simulation on day two. The following schedule is tentative and subject to change; please check the tutorial website for the most up-to-date information closer to the tutorial date.
Day One
| Duration | Title | Topics |
| --- | --- | --- |
| 1h | Introduction to GPU programming in Warp | GPU architectures and the Warp programming model |
| | | Warp data types and kernel launches |
| | | Synchronization and GPU-CPU data movement |
| 1.5h | Computational fluid dynamics with Warp and XLB | A simple 2D CFD solver based on the lattice Boltzmann method (LBM) |
| | | Profiling Warp and comparison with plain CUDA applications |
| | | Optimizations |
| 1.5h | Multi-GPU fluid dynamics simulation | Managing multiple devices in Warp (see the sketch after this table) |
| | | Porting the simple CFD app to multi-GPU with 1D partitioning |
| | | Overlapping computation and data transfer |
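The last session of Day One covers managing multiple devices and overlapping computation with data transfer; the sketch below gives a flavor of these patterns in Warp. It assumes a machine with at least two CUDA devices, and the kernel, array names, and 1D partitioning are illustrative.

```python
import warp as wp
import numpy as np

wp.init()

@wp.kernel
def scale(x: wp.array(dtype=wp.float32), s: float):
    i = wp.tid()
    x[i] = s * x[i]

n = 1 << 20

# Allocate one half of a 1D-partitioned domain on each device
# (assumes at least two CUDA devices are available).
with wp.ScopedDevice("cuda:0"):
    a = wp.array(np.ones(n, dtype=np.float32))
with wp.ScopedDevice("cuda:1"):
    b = wp.array(np.ones(n, dtype=np.float32))

# Launches are asynchronous, so work on the two devices can proceed concurrently.
stream0 = wp.Stream("cuda:0")
wp.launch(scale, dim=n, inputs=[a, 2.0], stream=stream0)
wp.launch(scale, dim=n, inputs=[b, 3.0], device="cuda:1")

# Device-to-host copy enqueued on the same stream as the first launch:
# it is ordered after that kernel but overlaps with the work on cuda:1.
host = wp.zeros(n, dtype=wp.float32, device="cpu", pinned=True)
wp.copy(host, a, stream=stream0)

wp.synchronize()  # wait for all devices and streams to finish
```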
Day Two
| Duration | Title | Topics |
| --- | --- | --- |
| 1h | Introduction to GPU programming in Warp | GPU architectures and the Warp programming model |
| | | Warp data types and kernel launches |
| | | Synchronization and GPU-CPU data movement |
| 1.5h | Computational fluid dynamics with Warp and XLB | A simple 2D CFD solver based on LBM |
| | | Profiling Warp and comparison with plain CUDA applications |
| | | Optimizations |
| 1.5h | Out-of-core fluid dynamics simulation | Unified memory and NVIDIA Grace systems |
| | | Out-of-core capabilities for our CFD app |
| | | Overlapping computation and data transfer |
Prerequisites
The tutorial is designed for developers with a basic understanding of Python; participants should have some experience writing and running Python code. Familiarity with GPU architectures or CUDA is beneficial but not required.
Tutorial Setup
Attendees are required to bring a laptop with a working Python environment and Nsight Systems installed. Credentials to log into an HPC cluster with NVIDIA GPUs will be provided.
Installing Nsight Systems
Instructions to install Nsight Systems can be found here.
How to Access the Cluster
More detailed information will be provided closer to the tutorial date.