PyTorch 2.12 dropped on May 13, 2026, and it brings meaningful changes for creators running AI image and video generation on AMD GPUs, plus infrastructure improvements that affect everyone using ComfyUI, Stable Diffusion, and other generation pipelines on CUDA.
What Happened
The PyTorch team released version 2.12.0 with 2,926 commits from 457 contributors. The headline improvement is up to 100x faster batched eigendecomposition on CUDA (linalg.eigh), but the bigger story for creators is on the AMD side. ROCm users see 5-26% speedups on FlexAttention pipelining, with new support for rocSHMEM symmetric collectives and expandable memory segments. If you run Stable Diffusion or ComfyUI on an AMD RX 7900 XTX or similar card, this is the update that makes AMD a more serious option for AI generation workflows.
The other notable change is torch.cond support inside CUDA Graphs. Diffusion sampling pipelines that use conditional branching can now capture the entire forward pass in a CUDA graph, eliminating CPU overhead between steps and reducing latency on longer generation runs.
Why It Matters for Creators
PyTorch underpins virtually every open-source AI generation tool you use. When PyTorch gets faster on AMD hardware, ComfyUI, Automatic1111 WebUI, and all custom diffusion pipelines inherit that speedup automatically after upgrading.
The 5-26% ROCm FlexAttention improvement is most relevant for attention-heavy models like FLUX.1 and SDXL, which spend a large share of each denoising step in attention layers. On an AMD card with 24GB VRAM, the saving applies at every step, so the total time saved grows with the number of steps per generation. The new ROCm 6.3 support ships alongside the release and includes hipSPARSELt acceleration.
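To put the 5-26% figure in perspective, the arithmetic below estimates the end-to-end gain when only the attention portion of a step gets faster (Amdahl's law). The share of step time spent in attention is an assumption chosen for illustration, not a measured value for any model, and the function name is hypothetical.

```python
def end_to_end_speedup(attention_fraction, attention_speedup):
    """Overall speedup when only the attention share of a step gets faster.

    attention_fraction: assumed share of step time spent in attention (0..1).
    attention_speedup:  1.05 for a 5% speedup, 1.26 for a 26% speedup.
    """
    if not 0.0 <= attention_fraction <= 1.0:
        raise ValueError("attention_fraction must be between 0 and 1")
    if attention_speedup <= 0:
        raise ValueError("attention_speedup must be positive")
    remaining = 1.0 - attention_fraction
    return 1.0 / (remaining + attention_fraction / attention_speedup)


if __name__ == "__main__":
    assumed_fraction = 0.5  # assumption: half of each step is attention
    for speedup in (1.05, 1.26):
        total = end_to_end_speedup(assumed_fraction, speedup)
        print(f"attention x{speedup:.2f} -> whole step x{total:.3f}")
```

Under that assumed 50% share, the whole-step gain is noticeably smaller than the attention-only figure, which is why the benefit varies by model and resolution.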
For CUDA users, the Microscaling (MX) quantization export support opens the door to deploying aggressively compressed generative models. The MXFP4, MXFP6, and MXFP8 formats are now supported in torch.export, making it easier to quantize image models and ship them to edge hardware.
Key Details
- ROCm FlexAttention: 5-26% speedup on attention pipelining for AMD RDNA3 and CDNA GPUs
- CUDA Graph torch.cond: Conditional control flow captured in GPU graphs via CUDA 12.4 conditional IF nodes
- MX Quantization export: MXFP4, MXFP6, MXFP8, and float8_e8m0fnu in torch.export.save
- Fused Adagrad: Single-kernel execution joins Adam, AdamW, and SGD, relevant for LoRA fine-tuning workflows
- Apple MPS: Metal-4 offline shader compilation for faster startup on M-series Macs running local generation
- 100x faster linalg.eigh: Batched eigendecomposition on CUDA via the updated cuSolver backend
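For readers unfamiliar with the last item, the sketch below shows what a batched eigendecomposition call looks like with torch.linalg.eigh. The batch size, matrix size, and function name are arbitrary choices for illustration. It runs on CPU by default and makes no claim about the speedup itself.

```python
import torch


def batched_eigh(batch=8, n=4, device="cpu"):
    # Build a batch of random symmetric matrices with shape (batch, n, n).
    a = torch.randn(batch, n, n, device=device, dtype=torch.float64)
    sym = a + a.transpose(-1, -2)
    # One call decomposes every matrix in the batch.
    eigenvalues, eigenvectors = torch.linalg.eigh(sym)
    # Rebuild each matrix as V @ diag(w) @ V^T to check the result.
    rebuilt = eigenvectors @ torch.diag_embed(eigenvalues) @ eigenvectors.transpose(-1, -2)
    return sym, eigenvalues, rebuilt


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    sym, eigenvalues, rebuilt = batched_eigh(device=device)
    print(eigenvalues.shape)
    print(torch.allclose(sym, rebuilt))
```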
Creator Outcome: How to Upgrade
Upgrading is a one-command operation. From your venv or conda environment:
pip install torch==2.12.0 --upgrade
For CUDA 12.4 (recommended for the CUDA Graph improvements):
pip install torch==2.12.0 --index-url https://download.pytorch.org/whl/cu124
For ROCm 6.3 (AMD GPU users):
pip install torch==2.12.0 --index-url https://download.pytorch.org/whl/rocm6.3
After upgrading, ComfyUI and most WebUI forks pick up the new version on next launch, provided you upgraded inside the same environment they run from. Check your ComfyUI terminal on startup to confirm the loaded PyTorch version. No workflow changes are needed. Full release notes are on GitHub. Select your install options at PyTorch Get Started.
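If you would rather confirm the version outside ComfyUI, a short script like the following can be run with the same Python interpreter your tool uses. This is a sketch: the helper names are hypothetical, and the (2, 12, 0) minimum simply mirrors the release discussed here.

```python
import re


def parse_version(version):
    """Turn a string like '2.12.0+cu124' into a tuple such as (2, 12, 0)."""
    core = version.split("+", 1)[0]  # drop the local build tag
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version, minimum=(2, 12, 0)):
    return parse_version(version) >= minimum


def report():
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    lines = [
        f"torch version : {torch.__version__}",
        f"CUDA build    : {torch.version.cuda}",  # None on ROCm or CPU builds
        f"ROCm/HIP build: {torch.version.hip}",   # None on CUDA or CPU builds
        f"GPU available : {torch.cuda.is_available()}",
        f"at least 2.12 : {meets_minimum(torch.__version__)}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

On ROCm builds the GPU is still reported through torch.cuda, so a True on the "GPU available" line is expected on a working AMD setup as well.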
See also our ComfyUI 2026 Workflow Guide for context on how PyTorch fits into the full generation stack, including which models benefit most from each hardware platform.