AMD ROCm 7 Enables CUDA-Free LLM Fine-Tuning

AMD ROCm 7 allows CUDA-free LLM fine-tuning on MI325X hardware. Learn how this breakthrough eliminates custom kernels and challenges NVIDIA's AI dominance.

TL;DR: AMD’s ROCm 7 software stack now allows developers to fine-tune large language models on MI325X hardware using standard PyTorch and Hugging Face tools, removing the need for custom kernels [1]. Meanwhile, the MedQA medical AI benchmark has been archived after top models consistently scored above 95%, indicating the field has reached a performance ceiling [8].

Key facts

  • Liquid AI’s LFM2.5-1.2B-Instruct model was successfully fine-tuned on an AMD MI325X using official PyTorch and Hugging Face libraries on the ROCm 7 stack without custom kernels [1].
  • The AMD MI325X hardware is considered competitive with NVIDIA’s A100 and GH200 on comparable specifications [1].
  • NVIDIA’s historical advantage stems from software tooling and kernel libraries rather than raw hardware capability [1].
  • Inference workloads that rely on speculative decoding or custom attention kernels still show larger performance gaps on ROCm relative to CUDA than training workloads do [1].
  • The MedQA benchmark has been archived because it reached saturation, with top models scoring over 95% [8].
  • Top-performing models on the archived MedQA benchmark include o1 (96.52%), GPT 5.1 (96.38%), and Gemini 3.1 Pro Preview (96.37%) [8].

The ROCm 7 Breakthrough: Fine-Tuning Without Custom Kernels

A significant shift in the AI hardware landscape occurred in early 2026, as AMD’s ROCm 7 software stack demonstrated its capability to fine-tune large language models (LLMs) without relying on NVIDIA’s CUDA ecosystem. A report from February 2026 highlighted a successful fine-tuning of the Liquid AI LFM2.5-1.2B-Instruct model on an AMD MI325X accelerator [1]. Crucially, this process utilized official PyTorch and Hugging Face libraries, eliminating the need for custom kernels or obscure software forks that previously hindered AMD adoption [1].
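
To make the claim concrete, here is a minimal supervised fine-tuning sketch using only stock PyTorch and Hugging Face libraries, the same code path that runs on CUDA. The Hub model ID and the dataset are illustrative assumptions (the source does not give the exact repository ID), and the hyperparameters are placeholders:

```python
# Minimal fine-tuning sketch with stock PyTorch + Hugging Face on ROCm.
# No custom kernels: PyTorch's ROCm build exposes the MI325X through the
# usual torch.cuda interface, so this is the same code you would run on CUDA.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "LiquidAI/LFM2.5-1.2B-Instruct"  # assumed Hub ID, for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Any instruction dataset works; this public one is just a stand-in.
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lfm-rocm-finetune",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```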

This development challenges the long-held narrative that NVIDIA holds an insurmountable advantage in AI training. While AMD’s hardware, such as the MI325X, is now considered competitive with NVIDIA’s A100 and GH200 on comparable specifications, the real barrier has historically been software [1]. NVIDIA’s lead stems from decades of investment in software tooling, kernel libraries like cuBLAS and cuDNN, and features such as PTX bytecode portability, rather than raw hardware capability alone [1]. The successful ROCm 7 fine-tuning suggests that this software gap is narrowing, making AMD a more viable option for developers seeking to avoid vendor lock-in.

Expanding ROCm Support for Training and Inference

AMD has significantly expanded its official documentation and support for AI workflows. The AMD AI Developer Hub now provides comprehensive Jupyter Notebook tutorials for training, fine-tuning, and inference on AMD GPUs [2]. These resources cover both single-accelerator and multi-accelerator systems, utilizing major frameworks like PyTorch, TensorFlow, and JAX [3].
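
A detail these tutorials rely on, and one worth knowing for readers coming from CUDA: PyTorch’s ROCm builds surface AMD GPUs through the familiar torch.cuda namespace, so device-management code written for NVIDIA hardware runs unchanged. A quick sanity check on a ROCm install:

```python
import torch

# On a ROCm build of PyTorch, AMD accelerators appear via the torch.cuda
# API, so existing CUDA-oriented scripts need no device-handling changes.
print(torch.cuda.is_available())       # True on a working ROCm setup
print(torch.cuda.get_device_name(0))   # reports the AMD accelerator's name
print(torch.version.hip)               # HIP version string; None on CUDA builds
```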

For multi-GPU setups, Hugging Face Accelerate is integrated with Transformers to simplify scaling PyTorch code across multiple accelerators on ROCm [4]. This integration reduces the complexity previously associated with distributed training on non-NVIDIA hardware. Additionally, specific guides for tools like Unsloth now provide detailed installation instructions for ROCm 7.1, including necessary environment variable adjustments for AMD MI300X hardware, such as setting HSA_OVERRIDE_GFX_VERSION=9.4.2 [5].
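
The Accelerate pattern itself is compact. The sketch below, a toy stand-in rather than a real LLM job, shows the structure: accelerator.prepare() handles device placement and distributed wrapping identically on CUDA and ROCm, and the script is launched with `accelerate launch train.py`.

```python
# Toy multi-accelerator training loop with Hugging Face Accelerate.
# The same script scales from one GPU to many; launch via `accelerate launch`.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# A tiny classifier stands in for a real language model here.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
data = TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)

# prepare() moves everything to the right device(s) and wraps the model
# for data-parallel training -- no ROCm-specific code required.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for inputs, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward() in distributed runs
    optimizer.step()
```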

Performance benchmarks published in April 2026 further validate these capabilities. Results for AI training and inference are available for AMD Instinct MI355X, MI325X, and MI300X GPUs, utilizing frameworks such as vLLM and Megatron-LM [6]. Inference benchmarks on the MI300X platform were conducted using a Docker container based on ROCm 7.0.0 and vLLM 0.11.2 [6]. ROCm inference optimization techniques include quantization, kernel optimization, and the use of libraries like Flash Attention and xFormers [7].
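
vLLM’s Python API is the same on ROCm as on CUDA, which is what makes such cross-vendor benchmarks directly comparable. A minimal offline-inference sketch follows; the model ID is a placeholder, not one taken from the benchmark reports:

```python
# Minimal vLLM offline inference; identical code on ROCm and CUDA builds.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model ID
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize what ROCm is in one sentence."], params)
for request_output in outputs:
    print(request_output.outputs[0].text)
```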

MedQA Benchmark Reaches Saturation

In the domain of medical AI, a different kind of milestone has been reached: the saturation of the MedQA benchmark. Designed to evaluate the clinical knowledge of AI models, MedQA has been officially archived because it hit a performance ceiling [8]. Analysis shows that almost all recently released models now score above 95%, making the benchmark less useful for distinguishing between top-tier systems [8].

Top-performing models on the archived MedQA benchmark include o1 (96.52%), GPT 5.1 (96.38%), and Gemini 3.1 Pro Preview (96.37%) [8]. This saturation indicates that current AI models have largely mastered the type of clinical reasoning tested by MedQA, prompting the community to seek new, more challenging benchmarks.

Remaining Challenges: Inference and Specialized Kernels

Despite these advancements, challenges remain, particularly in inference workloads. While training on ROCm has become more accessible, inference tasks involving speculative decoding or custom attention kernels like FlashAttention and PagedAttention still exhibit larger performance gaps compared to CUDA [1]. These specialized operations are critical for high-throughput, low-latency AI services, and their optimization on AMD hardware remains an area for improvement.
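
For reference, speculative decoding of the kind discussed here can be expressed with the stock Transformers API as assisted generation: a small draft model proposes tokens that the larger target model verifies. The model IDs below are illustrative (assisted generation requires the two models to share a tokenizer), and the observed speedup depends heavily on how well the underlying attention kernels are tuned for the backend:

```python
# Speculative (assisted) decoding sketch with Hugging Face Transformers:
# a small draft model drafts tokens, the target model verifies them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder target model
DRAFT_ID = "meta-llama/Llama-3.2-1B-Instruct"   # placeholder draft model

tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    DRAFT_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Speculative decoding speeds up inference by"
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

# Passing assistant_model enables assisted generation; throughput gains hinge
# on fast verification kernels, which is where the ROCm gap shows up today.
output_ids = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```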

However, the successful fine-tuning of LLMs on AMD hardware without custom kernels marks a pivotal moment. It demonstrates that the hardware itself is no longer the primary bottleneck; instead, the focus is shifting to software optimization and ecosystem maturity. As AMD continues to refine its ROCm stack and address inference-specific challenges, the AI community may see a more diversified hardware landscape, reducing reliance on a single vendor’s ecosystem.

Looking Ahead

The convergence of improved ROCm support and the saturation of existing benchmarks like MedQA signals a maturing AI industry. Developers now have more options for fine-tuning models on diverse hardware, while the community must innovate beyond current benchmark limits to drive further progress. The success of the Liquid AI fine-tuning on AMD hardware serves as a testament to the rapid evolution of open AI ecosystems, proving that high-performance AI is no longer exclusively tied to NVIDIA’s CUDA platform.

Sources

  1. Mathias Lechner, Fine-tuning LLMs on AMD without CUDA (www.linkedin.com) — 2026-02-10
  2. Tutorials for AI developers (rocm.docs.amd.com) — 2026-01-01
  3. Fine-tuning and inference (rocm.docs.amd.com) — 2026-01-28
  4. Fine-tuning and inference using multiple accelerators (rocm.docs.amd.com) — 2024-09-12
  5. Fine-tuning LLMs on AMD GPUs with Unsloth, Unsloth Documentation (unsloth.ai) — 2026-05-07
  6. Performance Results with AMD ROCm™ Software (www.amd.com) — 2026-04-15
  7. Use ROCm for AI inference optimization (rocm.docs.amd.com) — 2026-01-28
  8. Vals AI (www.vals.ai) — 2026-04-16

Frequently asked questions

Can I fine-tune LLMs on AMD MI325X without using CUDA or custom kernels?
Yes, AMD ROCm 7 enables CUDA-free fine-tuning on MI325X hardware using standard PyTorch and Hugging Face libraries. This breakthrough eliminates the previous need for custom kernels or obscure software forks, making the process more accessible.
How does AMD ROCm 7 compare to NVIDIA's CUDA ecosystem for AI training?
While NVIDIA's historical advantage stems from decades of investment in software tooling and kernel libraries, ROCm 7 is narrowing this gap. The MI325X hardware is now considered competitive with NVIDIA's A100 and GH200 for similar specifications.
What specific model was successfully fine-tuned on AMD MI325X using ROCm 7?
Liquid AI's LFM2.5-1.2B-Instruct model was successfully fine-tuned on an AMD MI325X accelerator. This demonstration proved that official PyTorch and Hugging Face libraries work effectively on the ROCm 7 stack without custom modifications.
Is the MedQA benchmark still active for evaluating medical AI models?
No, the MedQA benchmark has been officially archived because it reached a performance ceiling. Top models now consistently score above 95%, making the benchmark less useful for distinguishing between leading systems.
What are the remaining performance challenges for AMD ROCm compared to NVIDIA?
Inference workloads that rely on speculative decoding or custom attention kernels still show larger performance gaps versus CUDA than training workloads do. However, AMD has expanded support for multi-GPU setups through Hugging Face Accelerate and tools like Unsloth.