2.1 - Single-GPU Training
Single-GPU Techniques
Protein localization classifier combining mixed precision, gradient checkpointing & accumulation
$\nabla_\theta \mathcal{L} = \frac{1}{K}\sum_{k=1}^{K} \nabla_\theta \mathcal{L}_k$ |
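The accumulation rule above averages the gradients of K micro-batches before applying an update. A minimal JAX sketch of that rule, using a hypothetical mean-squared-error `loss_fn` rather than the chapter's classifier:

```python
import jax
import jax.numpy as jnp


def loss_fn(theta, x, y):
    # Illustrative linear-model loss, not the chapter's actual model.
    pred = x @ theta
    return jnp.mean((pred - y) ** 2)


@jax.jit
def accumulated_grads(theta, xs, ys):
    # xs, ys hold K micro-batches stacked on a leading axis.
    # lax.scan sums the K per-micro-batch gradients, then we divide by K,
    # matching (1/K) * sum_k grad L_k.
    def step(acc, batch):
        x, y = batch
        g = jax.grad(loss_fn)(theta, x, y)
        return jax.tree_util.tree_map(jnp.add, acc, g), None

    zeros = jax.tree_util.tree_map(jnp.zeros_like, theta)
    total, _ = jax.lax.scan(step, zeros, (xs, ys))
    return jax.tree_util.tree_map(lambda g: g / xs.shape[0], total)
```

For equal-sized micro-batches this averaged gradient is identical to the gradient of the full batch, which is what makes accumulation a drop-in way to simulate larger batches.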
|
Single-GPU Transformer |
DNA sequence transformer with mixed precision, remat, gradient accumulation & layer scanning |
$\text{Attention}(Q,K,V) = \text{softmax}_{\text{f32}}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\!V$
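The f32 subscript on the softmax marks the standard mixed-precision trick: keep activations in bfloat16 but lift the softmax to float32 for numerical stability. A minimal single-head sketch (shapes and names are illustrative, not the chapter's transformer):

```python
import jax
import jax.numpy as jnp


def attention_f32_softmax(q, k, v):
    # q, k, v: [seq, d_k] in bfloat16. The logits and softmax are computed
    # in float32, then cast back to the activation dtype before the
    # value matmul, mirroring softmax_f32 in the equation above.
    d_k = q.shape[-1]
    logits = (q @ k.T).astype(jnp.float32) / jnp.sqrt(jnp.float32(d_k))
    probs = jax.nn.softmax(logits, axis=-1)
    return probs.astype(v.dtype) @ v
```

Keeping only the softmax in f32 costs little memory while avoiding the overflow/underflow that bf16 exponentials invite.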
2.2 - Multi-Device Training
Single-Device Training |
Drug response classifier baseline: model, loss, TrainState, jit-compiled step, gradient accumulation |
$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$ |
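The update rule is plain SGD. A minimal jit-compiled training step in bare JAX (a hypothetical linear-regression `loss_fn` standing in for the chapter's classifier and TrainState):

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Illustrative linear model; the chapter uses a drug response classifier.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


@jax.jit
def train_step(params, x, y, lr=0.1):
    # One compiled step: theta <- theta - alpha * grad(L)
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss
```

A library `TrainState` bundles the same pieces (params, optimizer, apply function) into one object; the arithmetic inside is exactly this update.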
|
FSDP Step by Step |
Building FSDP from first principles: sharding, all_gather, psum_scatter, and transparent module wrapping |
$W = \text{all\_gather}(W_{\text{shard}})$
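The gather/scatter pair is the heart of FSDP: each device holds a shard of W, all-gathers the full W for compute, then reduce-scatters gradients back into shards. A minimal `shard_map` sketch on virtual CPU devices (a hypothetical setup for running on one host; the "gradient" is just W itself to keep the collective pattern visible):

```python
import os
# Hypothetical single-host setup: ask XLA for 8 virtual CPU devices so the
# collectives actually run in parallel. Has no effect if XLA_FLAGS is set.
os.environ.setdefault("XLA_FLAGS", "--xla_force_host_platform_device_count=8")

from functools import partial

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P

try:
    from jax import shard_map  # newer JAX
except ImportError:
    from jax.experimental.shard_map import shard_map

mesh = Mesh(jax.devices(), axis_names=("dp",))


@partial(shard_map, mesh=mesh, in_specs=P("dp", None), out_specs=P("dp", None))
def gather_compute_scatter(w_shard):
    # Forward: all-gather parameter shards into the full W on every device.
    w_full = jax.lax.all_gather(w_shard, "dp", tiled=True)
    # (A real layer would compute with w_full here; we reuse it as a
    # stand-in for the per-device gradient.)
    grad_full = w_full
    # Backward: reduce-scatter the replicated gradient back into shards.
    return jax.lax.psum_scatter(grad_full, "dp", tiled=True)
```

Because every device contributes an identical `grad_full` here, each output shard is simply `num_devices * w_shard`, which makes the sum-then-scatter semantics of `psum_scatter` easy to check.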
|
Pipeline Parallelism Step by Step |
Splitting a model across devices by layers: micro-batching, ppermute ring communication, stage masking |
$\text{bubble} = \frac{S-1}{M+S-1}$ |
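The bubble fraction above is the share of device time lost to pipeline fill and drain in a naive (GPipe-style) schedule with S stages and M micro-batches. It is simple enough to compute directly:

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a naive pipeline schedule: (S - 1) / (M + S - 1).

    S devices each sit idle while the other stages fill and drain; more
    micro-batches amortize that fixed cost.
    """
    s, m = num_stages, num_microbatches
    return (s - 1) / (m + s - 1)
```

With one micro-batch a 4-stage pipeline wastes 3/4 of its time; at 13 micro-batches the bubble shrinks to 3/16, which is why micro-batching is inseparable from pipeline parallelism.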