Warp: A Python Multi-GPU Programming Model with CUDA Performance and Flexibility
Abstract
The rise of artificial intelligence has firmly established Python as the primary programming language for AI, and a similar trend is emerging in scientific applications. However, designing efficient and scalable applications in Python presents unique challenges, driving the development of innovative frameworks. For example, tools like JAX offer a tensor-based programming model inspired by NumPy that can be mapped to multi-GPU execution. While these abstractions provide significant capabilities, they often obscure low-level details, limiting the user's ability to write optimal code. In addition, tensor-based frameworks often struggle with efficient conditional logic, loops, and memory management.
NVIDIA Warp introduces a new solution: a Python multi-GPU programming model with a CUDA-like interface. This approach empowers developers to write high-performance, fine-grained parallel code in Python with CUDA-level efficiency. Warp leverages just-in-time (JIT) compilation not only to directly generate and compile CUDA code but also to provide a differentiable programming interface, which is essential for optimization and AI workflows.
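To illustrate the flavor of this CUDA-like model, the short sketch below defines a Warp kernel and launches it on a GPU; the SAXPY computation, array sizes, and device name are illustrative choices rather than part of the tutorial material.

```python
import warp as wp
import numpy as np

wp.init()

# A Warp kernel is written as annotated Python and JIT-compiled to CUDA.
@wp.kernel
def saxpy(x: wp.array(dtype=wp.float32),
          y: wp.array(dtype=wp.float32),
          a: float):
    i = wp.tid()            # global thread index, analogous to CUDA's thread/block indices
    y[i] = a * x[i] + y[i]

n = 1024
x = wp.array(np.ones(n, dtype=np.float32), device="cuda:0")
y = wp.array(np.zeros(n, dtype=np.float32), device="cuda:0")

# Launch n GPU threads, one per array element.
wp.launch(saxpy, dim=n, inputs=[x, y, 2.0], device="cuda:0")

print(y.numpy()[:4])        # -> [2. 2. 2. 2.]
```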
In this tutorial, we will introduce the Warp multi-GPU programming model and demonstrate how to write high-performance, multi-GPU applications in Python. Beginning with simple Warp kernel development and profiling, we will progress to more advanced features such as automatic differentiation and multi-GPU programming.
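As a glimpse of the differentiable programming interface, the following sketch records kernel launches on a Warp tape and runs a backward pass through a toy sum-of-squares loss; the kernel and array names are illustrative.

```python
import warp as wp
import numpy as np

wp.init()

@wp.kernel
def sum_of_squares(x: wp.array(dtype=wp.float32),
                   loss: wp.array(dtype=wp.float32)):
    i = wp.tid()
    wp.atomic_add(loss, 0, x[i] * x[i])   # loss = sum_i x_i^2

n = 8
x = wp.array(np.linspace(0.0, 1.0, n, dtype=np.float32), requires_grad=True)
loss = wp.zeros(1, dtype=wp.float32, requires_grad=True)

# Record kernel launches on a tape, then run the backward pass through them.
tape = wp.Tape()
with tape:
    wp.launch(sum_of_squares, dim=n, inputs=[x, loss])

tape.backward(loss)
print(x.grad.numpy())   # expected gradient: 2 * x
```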
Schedule
The tutorial is divided into two parts, one per day. Each part starts with a general introduction to Warp and then focuses on a different aspect of the programming model: multi-GPU programming on day one and out-of-core simulation on day two. The following schedule is tentative and subject to change; please check the tutorial website for the most up-to-date information closer to the tutorial date.
Day One
| Duration | Title | Topics |
| --- | --- | --- |
| 1h | Introduction to GPU programming in Warp | GPU architectures and the Warp programming model |
| | | Warp data types and kernel launches |
| | | Synchronization and GPU-CPU data movement |
| 1.5h | Computational fluid dynamics with Warp and XLB | A simple 2D CFD solver based on the lattice Boltzmann method (LBM) |
| | | Profiling Warp and comparison with plain CUDA applications |
| | | Optimizations |
| 1.5h | Multi-GPU fluid dynamics simulation | Managing multiple devices in Warp (see the sketch after this table) |
| | | Porting the simple CFD app to multi-GPU with 1D partitioning |
| | | Overlapping computation and data transfer |
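The last session of Day One covers managing multiple devices and overlapping computation with data transfer; the sketch below gives a flavor of these patterns in Warp. It assumes a machine with at least two CUDA devices, and the kernel, array names, and 1D partitioning are illustrative.

```python
import warp as wp
import numpy as np

wp.init()

@wp.kernel
def scale(x: wp.array(dtype=wp.float32), s: float):
    i = wp.tid()
    x[i] = s * x[i]

n = 1 << 20

# Allocate one half of a 1D-partitioned domain on each device
# (assumes at least two CUDA devices are available).
with wp.ScopedDevice("cuda:0"):
    a = wp.array(np.ones(n, dtype=np.float32))
with wp.ScopedDevice("cuda:1"):
    b = wp.array(np.ones(n, dtype=np.float32))

# Launches are asynchronous, so work on the two devices can proceed concurrently.
stream0 = wp.Stream("cuda:0")
wp.launch(scale, dim=n, inputs=[a, 2.0], stream=stream0)
wp.launch(scale, dim=n, inputs=[b, 3.0], device="cuda:1")

# Device-to-host copy enqueued on the same stream as the first launch:
# it is ordered after that kernel but overlaps with the work on cuda:1.
host = wp.zeros(n, dtype=wp.float32, device="cpu", pinned=True)
wp.copy(host, a, stream=stream0)

wp.synchronize()  # wait for all devices and streams to finish
```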
Day Two
| Duration | Title | Topics |
| --- | --- | --- |
| 1h | Introduction to GPU programming in Warp | GPU architectures and the Warp programming model |
| | | Warp data types and kernel launches |
| | | Synchronization and GPU-CPU data movement |
| 1.5h | Computational fluid dynamics with Warp and XLB | A simple 2D CFD solver based on LBM |
| | | Profiling Warp and comparison with plain CUDA applications |
| | | Optimizations |
| 1.5h | Out-of-core fluid dynamics simulation | Unified memory and NVIDIA Grace systems |
| | | Out-of-core capabilities for our CFD app |
| | | Overlapping computation and data transfer |
Prerequisites
The tutorial is designed for developers with a basic understanding of Python; participants should have some experience writing and running Python code. Familiarity with GPU architectures or CUDA is beneficial but not required.
Tutorial Setup
Attendees are required to bring a laptop with a working Python environment and Nsight Systems installed. Credentials to log into an HPC cluster with NVIDIA GPUs will be provided.
Installing Nsight Systems
Instructions to install Nsight Systems can be found here.
How to Access the Cluster
More detailed information will be provided closer to the tutorial date.