My Brain Cells
DeepSpeed


Anthony Sandesh

What is DeepSpeed?

At its simplest, DeepSpeed is an open-source deep learning optimization library for PyTorch, created by Microsoft.
Its one and only goal is to solve the single biggest bottleneck in modern AI: memory. It enables you to train and run inference on massive models with billions or even trillions of parameters—models that would be impossible to fit onto even the largest state-of-the-art GPUs.
DeepSpeed is not a new training framework. It's a powerful set of tools that works with your existing PyTorch code to make it incredibly efficient in terms of memory, speed, and scale.

The Core Problem: Why Do We Need DeepSpeed?

To train a large model, a GPU must store three main categories of "model states" in its limited memory (VRAM):
  1. Model Parameters (Weights): These are the actual "knowledge" of the network (e.g., the 175 billion parameters in GPT-3).
  2. Gradients: These are the updates calculated during the backward pass for every single parameter. They are typically the same size as the parameters.
  3. Optimizer States: Modern optimizers like Adam or AdamW store "momentum" and "variance" for every parameter to work effectively, so the optimizer states are typically twice the size of the model parameters.
Let's do the math for a 1.5 billion parameter model (like GPT-2) using standard 32-bit precision (4 bytes per parameter) and the Adam optimizer:
  • Parameters: 1.5B params * 4 bytes/param = 6 GB
  • Gradients: 1.5B params * 4 bytes/param = 6 GB
  • Optimizer States: 1.5B params * 4 bytes/param * 2 (momentum + variance) = 12 GB
Total Memory for Model States: 24 GB
This is before you even store the input data, activations, and other temporary buffers. A 24 GB VRAM GPU (like an RTX 3090) is already at its limit with a relatively modest 1.5B-parameter model. A 175B model like GPT-3 would require roughly 2.8 TB of VRAM for its model states alone, which no single GPU can provide.
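The arithmetic above can be checked in a few lines of Python (a sketch; it assumes FP32 training with Adam and uses 1 GB = 1e9 bytes to match the article's round numbers):

```python
# Back-of-the-envelope check of the model-state memory math above.
# Assumes FP32 (4 bytes/param) and Adam (2 extra FP32 states per param).

def model_state_memory_gb(num_params: float) -> dict:
    bytes_per_param = 4                                   # FP32
    params_gb = num_params * bytes_per_param / 1e9        # weights
    grads_gb = num_params * bytes_per_param / 1e9         # gradients
    optim_gb = num_params * bytes_per_param * 2 / 1e9     # momentum + variance
    return {
        "parameters_gb": params_gb,
        "gradients_gb": grads_gb,
        "optimizer_gb": optim_gb,
        "total_gb": params_gb + grads_gb + optim_gb,
    }

gpt2 = model_state_memory_gb(1.5e9)
print(gpt2)                                      # 6 + 6 + 12 = 24 GB total
print(model_state_memory_gb(175e9)["total_gb"])  # 2800.0 GB, i.e. ~2.8 TB
```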

The Solution: ZeRO (The Zero Redundancy Optimizer)

This is the flagship technology of DeepSpeed. ZeRO is a family of optimizations that brilliantly solves the memory problem by partitioning (sharding) these model states across all your available GPUs instead of wastefully replicating them.
In standard Data Parallelism (DP), every GPU holds a full copy of the parameters, gradients, and optimizer states. ZeRO changes this.
Here is a detailed breakdown of its stages:

ZeRO Stage 1: Partition Optimizer States

  • What it does: This stage partitions the optimizer states (the largest memory consumer) across all GPUs.
  • How it works: Each GPU holds a full copy of the parameters and gradients but only a slice of the optimizer states. After the backward pass, each GPU updates only the parameters covered by its optimizer-state slice, and the updated values are then gathered so every GPU again holds the full, current parameters.
  • Memory Savings: ~4x (when using Adam).
  • Best for: When your model fits on a single GPU, but the Adam optimizer's memory makes it crash.

ZeRO Stage 2: Partition Optimizer States + Gradients

  • What it does: This stage partitions both the optimizer states AND the gradients.
  • How it works: Each GPU still holds the full model parameters, but now stores only a slice of the gradients and optimizer states. During the backward pass, gradients are reduced (averaged) across GPUs, and each GPU keeps only the gradient slice it is responsible for, discarding the rest.
  • Memory Savings: ~8x (when using Adam).
  • Best for: This is the most common "sweet spot." It provides massive memory savings with almost no communication overhead, making it as fast as standard data parallelism.

ZeRO Stage 3: Partition Everything

  • What it does: This is the ultimate stage. It partitions the optimizer states, the gradients, AND the model parameters themselves.
  • How it works: No single GPU holds the full model. Each GPU holds only a slice of the parameters. When a layer is needed for the forward or backward pass, all GPUs participate in an "all-gather" operation to retrieve the full layer, compute with it, and then discard the parameter slices they don't own.
  • Memory Savings: Scales linearly with the number of GPUs (e.g., 64 GPUs = ~64x memory reduction for model states).
  • Best for: Training truly massive models (100B+ parameters) where the model parameters alone cannot fit on a single GPU.
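To make the sharding concrete, here is a rough per-GPU calculator for the 1.5B-parameter example on 8 GPUs (a sketch that ignores activations and buffers; note the ~4x and ~8x figures quoted above come from the ZeRO paper's mixed-precision setup, so the FP32 ratios here differ):

```python
# Per-GPU model-state memory under each ZeRO stage (FP32 + Adam),
# following the partitioning rules described above.

def per_gpu_memory_gb(num_params: float, n_gpus: int, stage: int) -> float:
    p = num_params * 4 / 1e9   # parameters (FP32)
    g = num_params * 4 / 1e9   # gradients
    o = num_params * 8 / 1e9   # Adam momentum + variance
    if stage == 0:             # plain data parallelism: full replica per GPU
        return p + g + o
    if stage == 1:             # shard optimizer states
        return p + g + o / n_gpus
    if stage == 2:             # shard optimizer states + gradients
        return p + g / n_gpus + o / n_gpus
    if stage == 3:             # shard everything
        return (p + g + o) / n_gpus
    raise ValueError("stage must be 0-3")

for s in range(4):
    print(f"ZeRO-{s}: {per_gpu_memory_gb(1.5e9, 8, s):.2f} GB per GPU")
```

On 8 GPUs this prints 24.00, 13.50, 8.25, and 3.00 GB for stages 0 through 3.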

Beyond ZeRO: Offloading with ZeRO-Offload & ZeRO-Infinity

DeepSpeed pushes this even further by using your system's CPU RAM and NVMe (fast SSD) storage.
  • ZeRO-Offload (for Stage 2): Offloads the partitioned gradients and optimizer states from the GPU VRAM to your computer's main CPU RAM. This keeps the parameters on the GPU for fast computation but moves the less-used components off the expensive VRAM. This allows you to train models up to 13B parameters on a single GPU.
  • ZeRO-Infinity (for Stage 3): The "next generation" of offloading. Since ZeRO-3 already partitions the parameters, ZeRO-Infinity can offload all the partitioned states (parameters, gradients, and optimizer) to either CPU RAM or, for maximum scale, to NVMe SSDs. This is what enables the training of trillion-parameter models.
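For illustration, offloading is enabled through the DeepSpeed config file. The fragment below is a sketch using DeepSpeed's documented zero_optimization keys; it moves the optimizer states to CPU RAM under ZeRO-2. For ZeRO-Infinity you would set "stage": 3 and can target "device": "nvme" (with an nvme_path) for the offload_optimizer and offload_param sections instead:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```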

How to Use DeepSpeed: A Practical Guide

The easiest way to use DeepSpeed is with the Hugging Face Trainer API. It requires almost no code changes.
Here is a 3-step guide:

Step 1: Create a DeepSpeed Configuration File

Create a JSON file named ds_config.json. This file tells DeepSpeed which optimizations to use. This example enables ZeRO Stage 2, mixed-precision (FP16), and the Adam optimizer.
ds_config.json
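The configuration below is a representative sketch built from DeepSpeed's documented options; the "auto" values let the Hugging Face Trainer fill them in, and the optimizer hyperparameters are illustrative:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": {
    "enabled": true
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8
    }
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```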
Note: When using the Hugging Face Trainer, parameters like lr and betas in this file will be overridden by the values you set in TrainingArguments.

Step 2: Modify Your Python Training Script

In your normal Hugging Face script, you only need to add one argument to TrainingArguments:
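A minimal sketch of the change (the deepspeed parameter of TrainingArguments is the actual Hugging Face hook; the other arguments here are illustrative placeholders):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # placeholder
    per_device_train_batch_size=8,
    num_train_epochs=3,
    fp16=True,                       # matches the fp16 section of the config
    deepspeed="ds_config.json",      # the only DeepSpeed-specific line
)

# Then train exactly as usual:
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```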

Step 3: Launch Your Script with the deepspeed Command

You do not run your script with python train.py. You must use the DeepSpeed launcher, which handles setting up the distributed environment.
If you have 4 GPUs on your machine:
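Assuming the script above is saved as train.py (a placeholder name), a typical launch looks like this:

```shell
# The DeepSpeed launcher spawns one process per GPU and sets up
# the distributed environment for you.
deepspeed --num_gpus=4 train.py
```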
That's it. DeepSpeed will now launch 4 processes, load your configuration, and train your model using ZeRO Stage 2.

The Four Pillars of DeepSpeed

DeepSpeed is more than just ZeRO. It's a comprehensive library organized into four "pillars":
  1. DeepSpeed-Training: This is the core pillar, containing everything we've discussed: ZeRO Stages 1-3, ZeRO-Offload, and ZeRO-Infinity. It also includes 3D Parallelism, an advanced technique that combines ZeRO (Data Parallelism) with Pipeline Parallelism (splitting model layers across GPUs) and Tensor Parallelism (splitting individual math operations within a layer).
  2. DeepSpeed-Inference: Training is only half the battle. This pillar optimizes models for high-throughput, low-latency inference. It uses ZeRO-3 to partition massive models across multiple GPUs and replaces standard PyTorch operations with its own high-performance custom kernels (e.g., kernel fusion) to speed up computations.
  3. DeepSpeed-Compression: This pillar is a library of tools to make models smaller and faster after training. It includes techniques like Pruning (removing unimportant parameters), Layer Reduction, and Quantization. Its key feature is ZeroQuant, an advanced post-training quantization method that shrinks models with minimal accuracy loss.
  4. DeepSpeed4Science: This is a newer initiative that adapts DeepSpeed's technologies for unique scientific challenges. Standard AI models (like LLMs) are different from scientific models (like protein folding or genomic analysis). This pillar creates custom tools, like kernels for Evoformer (used in AlphaFold) and optimizations for extremely long sequences (500k+ tokens) found in genomic data.