
Local LLM Benchmark on a 48 GB Dual-GPU Rig: What Actually Runs in 2026

We ran Qwen3 27B, 32B, 35B-A3B, and 80B on an RTX 5090 + 5080 box to find the real sweet spot for local AI in 2026. Here is what we kept — and what we retired.

AI-drafted from cited sources, fact-checked and reviewed by a human editor.
Photo by KoishiAI (editor's own rig, April 2026): open-case dual-GPU workstation with the RTX 5090 (top) and Zotac RTX 5080 (bottom), RGB-lit, with a supplementary desk fan for airflow. Photographed in Thailand.

TL;DR: On a 48 GB dual-GPU rig (RTX 5090 + RTX 5080), Qwen3-Next-80B scored 100% on our 60-prompt brutal coding tier at 28 tokens/sec, while Qwen3.6-35B-A3B fit on a single 24 GB card at 138 tokens/sec and hit 95% once given a 12k reasoning budget. Dense 32B and 27B models beat the original 30B-A3B MoE by 35-40 points on hard tasks — so we retired the old MoE and standardized on a 3-model ladder.

Key facts

  • Test rig: RTX 5090 (32 GB) + RTX 5080 (16 GB) = 48 GB pooled via Ollama tensor split.
  • Test set: 60 verifiable coding and reasoning prompts with scripted reference graders — our internal “brutal tier”.
  • Qwen3-Next-80B-A3B: 46/48 GB used, 28 tokens/sec, 100% brutal at 12k budget.
  • Qwen3.6-35B-A3B (Q4 GGUF): fits in 24 GB single-GPU, 138 tokens/sec, 80% brutal baseline, 95% at 12k budget.
  • Qwen3-32B dense: 90% brutal at standard budget, the best-value pick for a single 24 GB card when quality matters most.
  • Qwen3.5-27B dense: 85% brutal, fastest of the dense tier but quality ceiling lower than 32B.
  • Qwen3-30B-A3B (original MoE): 50% brutal, 40 points behind dense 32B while burning 6.4x the tokens — retired from our stack.
  • Null-result tricks: plan injection (p = 0.85-0.89) and best-of-3 sampling (p > 0.30) both failed to lift 27B accuracy meaningfully at 3x cost.

Why we ran this bench

The local-LLM landscape in 2026 looks nothing like 2024. Three forces collided at once: consumer GPUs crossed 24 GB VRAM at enthusiast prices, Mixture-of-Experts architectures matured into 80B-parameter models that run in 48 GB of VRAM, and Qwen's April 2026 release cycle put four viable open-weight options within a single model family. That is more choice than teams can reasonably evaluate from published leaderboards alone — public benchmarks overweight easy tasks and under-reward latency, and none of them answer the question we care about: what should we actually run, today, on this hardware?

This article is our internal answer. It is not a replacement for a public eval; the tier of 60 prompts we grade against is domain-skewed toward code, reasoning, and Thai-language tasks relevant to KoishiAI’s own content pipeline. Treat the absolute numbers as one data point, not a verdict — but the relative ordering has held up across three probe runs and is stable enough that we have rebuilt our stack around it.

What it is

Our benchmark, which we internally call the “brutal tier,” is a fixed set of 60 prompts across four task families: multi-file refactors, algorithmic reasoning, structured data extraction, and translation of technical writing between Thai and English. Each prompt has a scripted grader — usually a regex check, an assertion on structured output, or a pinned exact-match reference for short answers. There is no LLM-as-judge component; graders are deterministic and re-runnable.
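To make that concrete, here is a minimal sketch of the three grader shapes, with an illustrative prompt spec. The function names and the example spec are ours for this article, not the actual harness:

```python
import json
import re

def grade_regex(output: str, pattern: str) -> bool:
    """Pass if the model output matches a pinned regex."""
    return re.search(pattern, output) is not None

def grade_exact(output: str, reference: str) -> bool:
    """Pass if the trimmed output equals a pinned short answer."""
    return output.strip() == reference.strip()

def grade_json_assertion(output: str, key: str, expected) -> bool:
    """Pass if structured output parses and a pinned field has the expected value."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get(key) == expected

# Example: one (made-up) prompt spec and its grade.
spec = {"type": "regex", "pattern": r"def\s+merge_sorted\("}
model_output = "def merge_sorted(a, b):\n    ..."
passed = grade_regex(model_output, spec["pattern"])
```

Every grader returns a plain boolean, so a re-run over the same outputs always produces the same score.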

We score two axes per model: baseline accuracy at the model’s default generation settings, and budget accuracy when we allow the model up to 12,288 output tokens and a reasoning-first system prompt. We also log wall-clock throughput in tokens per second on our specific hardware, measured in a warm-cache state after a 10-prompt priming run.
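Concretely, the two axes map to two generation configs per model. A minimal sketch against Ollama's /api/generate endpoint; the reasoning-first system prompt shown here is illustrative, and baseline runs simply leave the model's defaults in place:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def run_prompt(model: str, prompt: str, budget: bool) -> tuple[str, float]:
    """One call per prompt; 'budget' switches to the 12k reasoning config."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.65,   # our draft setting from the methodology section
            "num_ctx": 16384,
        },
    }
    if budget:
        body["options"]["num_predict"] = 12288          # budget run: up to 12k output tokens
        body["system"] = "Reason step by step before answering."  # illustrative prompt
    r = requests.post(OLLAMA_URL, json=body, timeout=600)
    r.raise_for_status()
    data = r.json()
    # eval_count / eval_duration (ns) give decode throughput in tokens/sec
    tps = data["eval_count"] / (data["eval_duration"] / 1e9)
    return data["response"], tps
```

Throughput is logged from the same response payload, which is why all speed numbers in this article are decode-side tokens per second on warm cache.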

Why it matters

Public benchmarks (MMLU, HumanEval, IFEval) are saturated or over-optimized for leaderboards; most modern LLMs score within a few points of each other on the easy 80% of any given suite. Our brutal tier is designed to separate models on the hard 10%, where production failures actually live — the prompt where a user expects a working answer and a weaker model ships a confident but broken one. That is also where AI answer engines fall down: the model doing the answering on Perplexity or ChatGPT stumbles on the same hard prompts, so a stronger underlying model produces better answers downstream.

Key concepts

Brutal tier vs. easy tier. Easy prompts saturate; brutal prompts rank. If your bench does not have a brutal tier, you cannot tell modern models apart.

Baseline vs. budget accuracy. A model that scores 80% baseline and 95% budget is a different tool than one that scores 90% baseline and 91% budget. The first trades latency for quality on demand; the second is a flat-accuracy workhorse.

Activated vs. total parameters. For MoE models, activated parameters (e.g. 3B for Qwen3.6-35B-A3B) determine speed; total parameters determine quality ceiling. The ratio matters: Qwen3.6 at 35B total / 3B active is a different beast than the original 30B-A3B.
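A rough back-of-envelope shows why, assuming about 4.5 bits per weight for a Q4-class GGUF and ignoring KV cache and runtime overhead (both are assumptions, not measurements):

```python
def q4_weight_gb(total_params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate VRAM needed for the weights of a Q4-class GGUF quant."""
    return total_params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Memory is governed by TOTAL parameters:
print(q4_weight_gb(35))  # ~19.7 GB of weights -> plausible on a 24 GB card
print(q4_weight_gb(80))  # ~45 GB of weights   -> needs the pooled 48 GB rig

# Decode speed is governed by ACTIVE parameters: each generated token reads
# roughly 3B weights for 35B-A3B versus ~32B for the dense model, which is
# where the large throughput gap comes from (bandwidth, routing overhead,
# and batch size also matter).
```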

How to use it (our production ladder)

After three probe rounds we converged on a three-model ladder that we now run in production:

  1. Fast tier — Qwen3.6-35B-A3B-Q4. Default for every request that does not explicitly need reasoning. Fits on a single 24 GB card, does 138 tokens/sec on our rig, and scores 80% on the brutal tier out of the box. This is the model we run the KoishiAI content pipeline on for drafting.

  2. Default tier — Qwen3-32B dense. When quality matters more than latency and we are inside 24 GB, we fall back to dense 32B. Scores 90% on brutal at standard generation settings and does not need a reasoning budget to hit its ceiling. We use it for fact-checking and translation inside the pipeline.

  3. Heavy tier — Qwen3-Next-80B-A3B. Only called when we detect the task is in the hardest 10% — long multi-file refactors, complex translation with cultural nuance, reasoning over long context. Needs the full 48 GB dual-GPU setup at 28 tokens/sec; we route to it sparingly.

The previous 30B-A3B MoE model, which once sat in the default slot, is gone. It scored 50% on the brutal tier against dense 32B's 90% while burning 6.4x the tokens — the worst cost/quality ratio of anything we tested.
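For illustration, the routing behind the ladder amounts to something like the sketch below. The model tags and the hardness heuristics are placeholders; our real router is wired into the content pipeline and uses pipeline-specific signals:

```python
def pick_model(task: dict) -> str:
    """Route a task to the cheapest tier that can handle it (illustrative heuristics)."""
    hard = (
        task.get("multi_file", False)              # long multi-file refactor
        or task.get("context_tokens", 0) > 16_000  # reasoning over long context
        or task.get("cultural_nuance", False)      # complex Thai/English translation
    )
    if hard:
        return "qwen3-next-80b-a3b"       # heavy tier: full 48 GB rig, ~28 t/s
    if task.get("needs_quality", False):  # fact-checking, translation
        return "qwen3-32b"                # default tier: 90% brutal, single card
    return "qwen3.6-35b-a3b-q4"           # fast tier: 138 t/s drafting default
```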

Common pitfalls

Assuming MoE is always faster. MoE is only faster when the gate makes good routing decisions under your workload. The original Qwen3-30B-A3B routed poorly on long-form reasoning — each expert choice added tokens without improving the grade. The newer Qwen3.6-35B-A3B and 80B-A3B fix this, but the label “MoE” alone does not predict speed.

Treating plan injection and best-of-3 sampling as free wins. We ran both as separate probe rounds on Qwen3.5-27B. Plan injection (system prompt: “First write a plan, then execute”) lifted coding accuracy by 0.25 points (not significant, p = 0.85-0.89) and cut throughput by 33%. Best-of-3 sampling added 0.06 points (p > 0.30) at 3x token cost. Neither survived our bench. If you need more accuracy, swap the base model — do not add test-time compute on a small one.
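For reference, those p-values come from comparing paired per-prompt pass/fail grades between two runs. A minimal sketch of one way to do that check, a sign-flip permutation test, assuming you already have the two grade lists (this is not our exact harness):

```python
import random

def paired_permutation_p(base: list[bool], variant: list[bool], iters: int = 10_000) -> float:
    """Two-sided p-value for the accuracy difference between two paired runs."""
    diffs = [int(v) - int(b) for b, v in zip(base, variant)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(iters):
        # Under the null hypothesis, each per-prompt difference is equally
        # likely to go either way, so randomly flip its sign.
        flipped = sum(d * random.choice((1, -1)) for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return hits / iters

# Usage: base = grades without plan injection, variant = grades with it.
# p = paired_permutation_p(base, variant)
```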

Running heterogeneous GPUs without tensor split sanity checks. An RTX 5080 at 16 GB and an RTX 5090 at 32 GB will happily split a 40 GB model — but memory bandwidth differs between them, and your throughput will track the slower card if the layer distribution puts too many layers on the 5080. Ollama handles this reasonably by default; vLLM needs manual tuning.
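A cheap sanity check is to watch per-GPU memory and utilization while a split model is generating; if the 16 GB card sits pinned while the bigger card idles, the layer distribution is off. A small sketch that wraps nvidia-smi's query mode (the parsing and field names we pull are just one way to do it):

```python
import subprocess

def gpu_snapshot() -> list[dict]:
    """Per-GPU memory and utilization via nvidia-smi's CSV query mode."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,name,memory.used,memory.total,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [line.split(", ") for line in out.strip().splitlines()]
    return [
        {"index": int(i), "name": n, "mem_used_mib": int(u),
         "mem_total_mib": int(t), "util_pct": int(g)}
        for i, n, u, t, g in rows
    ]

# Call this in a loop while a benchmark prompt is generating and compare cards.
print(gpu_snapshot())
```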

Alternatives and comparisons

We ran shorter side-benches on two non-Qwen families to sanity-check that we were not just “Qwen-pilled”:

  • Llama family (3.3 70B-Instruct-Q4). Fits in 48 GB, runs around 22 tokens/sec, scored 82% on our brutal tier at default settings. Respectable, but behind Qwen3-32B at a higher memory cost. Strong for general chat; weaker on multi-file code.
  • Gemma 4 31B dense. Released in April 2026 under Apache 2.0. Scored 88% on our brutal tier, close to Qwen3-32B, with the benefit of a more permissive license. We are evaluating it for the production default but have not switched yet — the small quality gap doesn't justify breaking the Qwen stack's downstream tuning.

When NOT to use local LLMs

Local is not the answer when you need the absolute highest accuracy on one-off hard reasoning (Claude or GPT-5 still win on the hardest 1%), when a variable workload means your GPU sits idle 90% of the time (API billing is cheaper per query), or when you lack an on-call person to handle model updates and CUDA driver breakage. Local shines for steady, high-volume workloads where data privacy matters and average throughput is the bottleneck.
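If you want to sanity-check the idle-GPU argument for your own setup, a rough break-even sketch helps. Every number below is a placeholder, not a measured figure from our bench; plug in your own amortized rig cost and API pricing:

```python
def breakeven_utilization(rig_cost_per_hour: float,
                          api_cost_per_mtok: float,
                          rig_tokens_per_sec: float) -> float:
    """Fraction of each hour the rig must be busy for local to beat API pricing.

    All three inputs are yours to supply; none are measurements from this article."""
    rig_mtok_per_hour = rig_tokens_per_sec * 3600 / 1e6
    api_cost_if_fully_busy = rig_mtok_per_hour * api_cost_per_mtok
    return rig_cost_per_hour / api_cost_if_fully_busy

# Example with made-up numbers: a rig amortizing at $0.60/hr, an API at $2 per
# million output tokens, and 100 t/s sustained -> break-even around 83% busy.
# print(breakeven_utilization(0.60, 2.0, 100))
```

If your real utilization lands well below the break-even fraction, the API is the cheaper option per query, which is exactly the variable-workload case above.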

Further reading

Our test prompts are not public; most are domain-specific to content generation and Thai translation. We may release a sanitized subset later. In the meantime, for a reproducible public alternative, BigCodeBench and IFEval-Strict are the closest equivalents on the coding side.

Methodology disclosure

  • This is a single operator’s bench on one specific rig.
  • Our prompts are skewed toward AI content production and Thai-language tasks; results likely do not generalize to SQL, legal, medical, or other domains.
  • Throughput numbers are Ollama-specific with our specific draft settings (temperature 0.65, num_ctx 16384 for research-grounded tasks). vLLM and TGI will differ.
  • Accuracy percentages are against our own deterministic graders, not public benchmark leaderboards.
  • If you want to replicate: every model we tested is public on Hugging Face (Qwen3.5-27B, Qwen3-32B, Qwen3.6-35B-A3B-Q4, Qwen3-Next-80B-A3B).

Frequently asked questions

Which local LLM is best for 24 GB of VRAM in 2026?
In our bench, Qwen3.6-35B-A3B-Q4 is the sweet spot for a single 24 GB card. It pushes 138 tokens/sec and scores 80% on our brutal tier out of the box — rising to 95% when you allow a 12k-token reasoning budget. Smaller dense models trade quality for speed; Qwen3.5-27B is the next step down.
Is Qwen3-Next-80B worth running on consumer hardware?
Yes, if you have 48 GB of pooled VRAM (RTX 5090 32 GB + RTX 5080 16 GB works via Ollama tensor split) and you care about the hardest 10% of prompts. It scored 100% on our brutal tier — about 10 points above the 32B dense ceiling — but throughput drops to 28 t/s. For everyday coding, the 35B-A3B is usually the better tradeoff.
Why did the original Qwen3 30B-A3B MoE lose to smaller dense models?
On the brutal tier it scored 50%, 40 points behind dense 32B, while burning 6.4x the token budget. The routing gain didn't materialize for long-form reasoning: it kept activating suboptimal expert combinations and spending compute on them. The newer 35B-A3B and 80B-A3B fix this; the old 30B-A3B is retired from our stack.
Does plan injection or best-of-3 sampling help small models?
In our tests, no — neither gave a statistically significant lift on Qwen3.5-27B (p > 0.30 across 60 prompts). Plan injection cut throughput by 33% and best-of-3 tripled token cost. We retired both. If you need more accuracy, scale the base model before reaching for test-time tricks.
How do you pool VRAM across two different-sized GPUs?
Ollama's CUDA backend supports tensor splitting across heterogeneous GPUs automatically; you just set OLLAMA_NUM_GPU=2 and it loads layers proportional to each card's free memory. vLLM is similar but more manual. The RTX 5080's lower memory bandwidth bottlenecks generation slightly — expect 70-80% of single-card throughput.