Fine-Tune LLMs on 24GB GPUs: QLoRA Step-by-Step Guide
Learn to fine-tune LLMs on 24GB GPUs using QLoRA. A step-by-step guide to adapting 7B-33B models with PEFT, Unsloth, and consumer hardware.
TL;DR: QLoRA enables fine-tuning 7B to 33B parameter LLMs on consumer 24GB GPUs like the RTX 4090 by using 4-bit quantization. This method reduces trainable parameters by over 99% while maintaining 95-98% of full fine-tuning performance, democratizing AI development for independent researchers.
Key facts
- QLoRA combines Low-Rank Adaptation with 4-bit NormalFloat (NF4) quantization to fit 33B models on 24GB VRAM GPUs.
- Standard full fine-tuning of a 7B model requires approximately 56GB of VRAM, whereas QLoRA reduces this requirement to 24GB.
- QLoRA reduces trainable parameters by over 99% while preserving 95-98% of the performance of full fine-tuning.
- Resulting QLoRA adapter files are typically 10MB to 50MB, compared to 14-26GB for full model checkpoints.
- The Unsloth framework claims to achieve 2x faster training speeds and a 60% memory reduction for LoRA/QLoRA tasks.
- Using the PagedAdamW optimizer can improve training throughput by up to 25% on consumer GPUs compared to standard AdamW.
- Sequence lengths up to 2048 tokens are achievable on 8GB VRAM hardware using optimized QLoRA strategies.
The Democratization of LLM Development
For years, fine-tuning Large Language Models (LLMs) was the exclusive domain of well-funded datacenters. Full fine-tuning of a 7B-parameter model in FP16 requires approximately 56GB of VRAM for the weights, gradients, and optimizer states alone [7]. This constraint effectively locked out independent researchers and small startups, forcing them to rely on expensive cloud clusters or API services [6].
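To see where the 56GB figure comes from, here is a rough back-of-the-envelope estimate. The accounting below (weights, gradients, and both Adam moments, each held at 2 bytes per parameter) matches the article's figure; other setups, such as FP32 optimizer states, push the total even higher, and activation memory is ignored entirely:

params = 7e9                      # 7B parameters
bytes_per_param = 2 + 2 + 2 + 2   # FP16 weights + gradients + Adam m and v
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:.0f} GB")      # ~56 GB, before counting activations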
However, the landscape has shifted. With the advent of Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) and its quantized variant, QLoRA, it is now possible to adapt models with billions of parameters on consumer-grade hardware [3]. Specifically, GPUs with 24GB of VRAM, such as the NVIDIA RTX 3090 and RTX 4090, have become the new standard for accessible AI development [7].
LoRA works by freezing the original model weights and training only small, low-rank adapter matrices injected into specific layers [3]. This approach reduces the number of trainable parameters by over 99% while maintaining 95-98% of the performance of full fine-tuning [5]. In this guide, we will walk through the process of fine-tuning a 7B model on a single 24GB GPU, and explore how QLoRA can push this limit to 33B parameters.
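To make the 99% figure concrete, consider a hypothetical Llama-7B-like model (hidden size 4096, 32 layers) with LoRA rank r = 16 applied only to q_proj and v_proj. Each adapted weight matrix W of shape d x k gains two small trainable matrices, A (r x k) and B (d x r):

d = k = 4096                         # q_proj and v_proj are 4096 x 4096 here
r, layers = 16, 32
per_matrix = r * (d + k)             # A: r x k, plus B: d x r
trainable = per_matrix * 2 * layers  # two target matrices per layer
print(trainable, f"{trainable / 7e9:.4%}")  # ~8.4M params, ~0.12% of 7B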
Why QLoRA? The Memory Advantage
Standard LoRA is memory-efficient, but QLoRA takes it further by combining LoRA with 4-bit quantization using NormalFloat 4 (NF4) [7]. This allows a 7B-parameter model to be trained comfortably on a single 24GB GPU and raises the practical ceiling to roughly 33B parameters on the same hardware [3].
The benefits extend beyond training. The resulting adapter files are significantly smaller, typically ranging from 10MB to 50MB, compared to the 14-26GB required for full model checkpoints [3]. This makes distribution and deployment trivial.
Recent profiling on consumer hardware, such as the RTX 4060, indicates that even with tighter constraints (8GB VRAM), sequence lengths up to 2048 tokens are achievable using these strategies [2]. Furthermore, using paged optimizers like PagedAdamW can improve throughput by up to 25% compared to standard AdamW on consumer GPUs [2].
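For reference, this is roughly what the NF4 setup looks like in plain Hugging Face transformers, without Unsloth. This is a minimal sketch, assuming a recent transformers/bitsandbytes install; the checkpoint name is just an example:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # example checkpoint; swap in your own
    quantization_config=bnb_config,
    device_map="auto",
)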
Step-by-Step: Fine-Tuning with Unsloth and PEFT
To maximize efficiency, we will use the Unsloth framework, which claims to achieve 2x faster LoRA/QLoRA fine-tuning and a 60% memory reduction through optimized kernels [7]. We will also leverage Hugging Face’s PEFT, TRL, and bitsandbytes libraries [3].
Step 1: Environment Setup
First, ensure you have the necessary libraries installed. You will need transformers, peft, trl, bitsandbytes, datasets, and accelerate [3]. For this walkthrough, we will use Unsloth for its speed optimizations [8].
!pip install unsloth trl peft bitsandbytes transformers datasets accelerate
Step 2: Load the Base Model with 4-Bit Quantization
We load the model in 4-bit precision via bitsandbytes; here we use Unsloth's pre-quantized checkpoint, which skips the on-the-fly quantization step. This is the core of QLoRA, letting the full model fit within a 24GB VRAM budget [7].
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048  # context length for training
dtype = None           # None lets Unsloth auto-detect (bf16 on Ampere+, else fp16)
load_in_4bit = True    # QLoRA: load the base model in 4-bit NF4

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3-8b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
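As a quick sanity check that the quantized model actually fits, Hugging Face models expose get_memory_footprint(), which reports weight memory only (not activations or optimizer state); the size in the comment is a rough expectation, not a guarantee:

print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly 5-6 GB for a 4-bit 8B model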
Step 3: Configure LoRA Adapters
Next, we apply the LoRA adapters. While the minimal setup targets only the attention projections q_proj and v_proj, the configuration below injects low-rank matrices into all of the attention and MLP projection matrices [3].
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                   # LoRA rank; higher = more capacity, more VRAM
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,          # scaling factor; commonly set equal to r
    lora_dropout = 0,         # any value works, but 0 is optimized in Unsloth
    bias = "none",            # any value works, but "none" is optimized in Unsloth
    use_gradient_checkpointing = "unsloth",  # Unsloth's memory-saving variant
    random_state = 3407,
    use_rslora = False,       # rank-stabilized LoRA is also supported
    loftq_config = None,      # as is LoftQ initialization
)
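To confirm the parameter reduction, PEFT models expose a helper that reports trainable versus total parameters; the output shown is illustrative for Llama-3-8B with the settings above:

model.print_trainable_parameters()
# e.g. trainable params: 41,943,040 || all params: ~8.07B || trainable%: ~0.52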
Step 4: Prepare the Dataset
We need a dataset in the instruction-tuning format. The datasets library is essential here [3].
from datasets import load_dataset

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Llama-3 chat template: one user turn, one assistant turn.
        text = f"<|start_header_id|>user<|end_header_id|>\n\n{instruction}\n{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{output}<|eot_id|>"
        texts.append(text)
    return { "text" : texts, }
dataset = dataset.map(formatting_prompts_func, batched = True,)
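Before training, it is worth sanity-checking the result of the mapping; printing one formatted example shows the exact string the model will see:

print(dataset[0]["text"][:300])  # inspect the first formatted prompt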
Step 5: Train the Model
We use the SFTTrainer from the TRL library to handle the supervised fine-tuning process [3].
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,  # set True to pack short sequences; can speed training up to 5x
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,   # effective batch size of 8
        warmup_steps = 5,
        max_steps = 60,                    # short demo run; raise for real training
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "paged_adamw_8bit",        # improves throughput by up to 25% [2]
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer.train()
Note the use of paged_adamw_8bit in the optimizer. As noted in recent studies, this can improve throughput by up to 25% on consumer GPUs compared to standard AdamW [2].
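To verify that the run stays within the 24GB budget, you can inspect peak VRAM after training using PyTorch's built-in counters:

peak_gb = torch.cuda.max_memory_reserved() / 1024**3
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Peak reserved VRAM: {peak_gb:.1f} GB of {total_gb:.1f} GB")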
Step 6: Save and Merge
Finally, save the adapter weights. These files will be small, typically 10-50MB [3].
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
If you wish to merge the adapters back into the base model for inference, you can do so:
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")
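To quickly test the fine-tuned model, Unsloth provides an inference toggle; the sketch below reuses the Llama-3 template from Step 4 (the prompt text itself is just an illustration):

FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference mode
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n"
          "Explain QLoRA in one sentence.<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")
inputs = tokenizer(prompt, return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))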
Conclusion
Fine-tuning LLMs on a 24GB consumer GPU is no longer a theoretical exercise. By leveraging QLoRA, PEFT, and optimized frameworks like Unsloth, independent developers can adapt models ranging from 7B to 33B parameters without breaking the bank [7]. The resulting adapters are small, fast to train, and retain the majority of the base model’s capabilities [5]. As the hardware continues to evolve, these methods will remain essential for democratizing AI development.
For those with even less VRAM, such as the 8GB RTX 4060, these techniques still allow for viable training with sequence lengths up to 2048 tokens [2]. The barrier to entry has never been lower.
Sources
- LoRA & QLoRA Fine-Tuning: Build Custom LLMs on a Single GPU [2026 Guide] | Meta Intelligence (www.meta-intelligence.tech) — 2025-08-07
- How can I fine-tune large language models on a budget using LoRA and QLoRA on cloud GPUs? (www.runpod.io) — 2025-07-03
- How to Fine-Tune LLMs with LoRA in Python — A Complete Guide - machinelearningplus (machinelearningplus.com) — 2026-03-06
- LoRA Fine-Tuning: 7 Steps to Adapt Any LLM on One GPU (decodethefuture.org) — 2026-03-16
- Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study (arxiv.org) — 2025-09
- How to Fine-Tune LLMs in 2026: Costs, GPUs, and Code | Spheron Blog (www.spheron.network) — 2026-03-05