Local LLM Benchmark on a 48 GB Dual-GPU Rig: What Actually Runs in 2026
We ran Qwen3.5-27B, Qwen3-32B, Qwen3.6-35B-A3B, and Qwen3-Next-80B on an RTX 5090 + 5080 box to find the real sweet spot for local AI in 2026. Here is what we kept — and what we retired.
TL;DR: On a 48 GB dual-GPU rig (RTX 5090 + RTX 5080), Qwen3-Next-80B scored 100% on our 60-prompt brutal tier at 28 tokens/sec (given a 12k reasoning budget), while Qwen3.6-35B-A3B fit on a single 24 GB card at 138 tokens/sec and hit 95% with the same budget. Dense 32B and 27B models beat the original 30B-A3B MoE by 35-40 points on hard tasks — so we retired the old MoE and standardized on a 3-model ladder.
Key facts
- Test rig: RTX 5090 (32 GB) + RTX 5080 (16 GB) = 48 GB pooled via Ollama tensor split.
- Test set: 60 verifiable coding and reasoning prompts with scripted reference graders — our internal “brutal tier”.
- Qwen3-Next-80B-A3B: 46/48 GB used, 28 tokens/sec, 100% brutal at 12k budget.
- Qwen3.6-35B-A3B (Q4 GGUF): fits in 24 GB single-GPU, 138 tokens/sec, 80% brutal baseline, 95% at 12k budget.
- Qwen3-32B dense: 90% brutal at standard budget, the best-value pick for a single 24 GB card when quality matters most.
- Qwen3.5-27B dense: 85% brutal, fastest of the dense tier but quality ceiling lower than 32B.
- Qwen3-30B-A3B (original MoE): 50% brutal, and it burned 6.4x the tokens of dense 32B while scoring 40 points lower — retired from our stack.
- Null-result tricks: plan injection (p = 0.85-0.89, 33% throughput cut) and best-of-3 sampling (p > 0.30, 3x token cost) both failed to lift 27B accuracy meaningfully.
Why we ran this bench
The local-LLM landscape in 2026 looks nothing like 2024. Three forces collided at once: consumer GPUs crossed 24 GB VRAM at enthusiast prices, Mixture-of-Experts architectures matured into 80B-parameter models that run on 48 GB of pooled VRAM, and Qwen’s April 2026 release cycle put four viable open-weight options within a single model family. That is more choice than teams can reasonably evaluate from published leaderboards alone — public benchmarks overweight easy tasks and under-reward latency, and none of them answer the question we actually care about: what should we run, today, on this hardware?
This article is our internal answer. It is not a replacement for a public eval; the tier of 60 prompts we grade against is domain-skewed toward code, reasoning, and Thai-language tasks relevant to KoishiAI’s own content pipeline. Treat the absolute numbers as one data point, not a verdict — but the relative ordering has held up across three probe runs and is stable enough that we have rebuilt our stack around it.
What it is
Our benchmark, which we internally call the “brutal tier,” is a fixed set of 60 prompts across four task families: multi-file refactors, algorithmic reasoning, structured data extraction, and translation of technical writing between Thai and English. Each prompt has a scripted grader — usually a regex check, an assertion on structured output, or a pinned exact-match reference for short answers. There is no LLM-as-judge component; graders are deterministic and re-runnable.
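To make that concrete, here is a minimal sketch of what a grader registry like ours looks like in Python. The prompt IDs, regexes, and reference answers are illustrative placeholders, not our actual brutal-tier graders; the point is that every check is a plain pass/fail function with no model in the loop.

```python
import json
import re

# Every grader is a plain pass/fail function -- no LLM-as-judge anywhere.
# Prompt IDs, patterns, and reference answers below are illustrative placeholders.

def grade_regex(output: str, pattern: str) -> bool:
    """Pass if the model output matches a pinned regex."""
    return re.search(pattern, output) is not None

def grade_structured(output: str, required_keys: list[str]) -> bool:
    """Pass if the output parses as JSON and contains every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)

def grade_exact(output: str, reference: str) -> bool:
    """Pass if the trimmed output exactly matches a pinned short answer."""
    return output.strip() == reference.strip()

# Registry keyed by prompt ID so every run is deterministic and re-runnable.
GRADERS = {
    "refactor-007": lambda out: grade_regex(out, r"def parse_config\("),
    "extract-014": lambda out: grade_structured(out, ["title", "date", "tags"]),
    "reason-031": lambda out: grade_exact(out, "42"),
}

def score(results: dict[str, str]) -> float:
    """results maps prompt ID -> model output; returns accuracy in percent."""
    passed = sum(GRADERS[pid](out) for pid, out in results.items() if pid in GRADERS)
    return 100.0 * passed / len(GRADERS)
```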
We score two axes per model: baseline accuracy at the model’s default generation settings, and budget accuracy when we allow the model up to 12,288 output tokens and a reasoning-first system prompt. We also log wall-clock throughput in tokens per second on our specific hardware, measured in a warm-cache state after a 10-prompt priming run.
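A stripped-down view of how a baseline run and a budget run differ, assuming an Ollama backend on its default port. The model tag and the exact wording of the reasoning-first system prompt are placeholders; the option values mirror the settings quoted in the methodology disclosure.

```python
import requests

OLLAMA = "http://localhost:11434/api/generate"

def run(model: str, prompt: str, budget: bool = False) -> dict:
    """One non-streaming generation; returns the output text plus decode tokens/sec."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.65, "num_ctx": 16384},
    }
    if budget:
        # Budget runs: reasoning-first system prompt, up to 12,288 output tokens.
        payload["system"] = "Reason step by step before giving the final answer."
        payload["options"]["num_predict"] = 12288
    r = requests.post(OLLAMA, json=payload, timeout=1200)
    r.raise_for_status()
    body = r.json()
    # eval_count / eval_duration (nanoseconds) are Ollama's own decode stats.
    tokens_per_sec = body["eval_count"] / (body["eval_duration"] / 1e9)
    return {"output": body["response"], "tokens_per_sec": tokens_per_sec}

# baseline = run("qwen3:32b", some_prompt)        # default settings
# budget   = run("qwen3:32b", some_prompt, True)  # 12k reasoning budget
```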
Why it matters
Public benchmarks (MMLU, HumanEval, IFEval) are saturated or gamed by leaderboard tuning; most modern LLMs score within a few points of each other on the easy 80% of any given suite. Our brutal tier is designed to separate models on the hard 10%, where production failures actually live — the prompt where a user expects a working answer and a weaker model ships a confident but broken one. That is also where AI-answer-engine citations fail: the LLM doing the answering on Perplexity or ChatGPT also stumbles on these prompts, and a better underlying model produces better answers downstream.
Key concepts
Brutal tier vs. easy tier. Easy prompts saturate; brutal prompts rank. If your bench does not have a brutal tier, you cannot tell modern models apart.
Baseline vs. budget accuracy. A model that scores 80% baseline and 95% budget is a different tool than one that scores 90% baseline and 91% budget. The first trades latency for quality on demand; the second is a flat-accuracy workhorse.
Activated vs. total parameters. For MoE models, activated parameters (e.g. 3B for Qwen3.6-35B-A3B) determine speed; total parameters determine quality ceiling. The ratio matters: Qwen3.6 at 35B total / 3B active is a different beast than the original 30B-A3B.
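A back-of-envelope way to see why active parameters dominate decode speed: if decoding is memory-bandwidth-bound, the throughput ceiling is roughly bandwidth divided by the bytes of active weights streamed per token. The 0.56 bytes-per-weight and 1 TB/s figures below are placeholder assumptions for a Q4 quant on a fast consumer card; real throughput lands well below this bound, but the dense-vs-MoE gap it predicts points the same direction as our measurements.

```python
def decode_ceiling(active_params_b: float, bytes_per_param: float = 0.56,
                   bandwidth_gb_s: float = 1000.0) -> float:
    """Upper bound on tokens/sec if each token must stream the active weights once."""
    gb_per_token = active_params_b * bytes_per_param  # GB of weights read per decoded token
    return bandwidth_gb_s / gb_per_token

# Illustrative only -- real throughput is far lower (KV cache, activations, overhead).
print(decode_ceiling(3.0))    # ~595 tok/s ceiling for a 3B-active MoE at Q4
print(decode_ceiling(32.0))   # ~56 tok/s ceiling for a dense 32B at Q4
```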
How to use it (our production ladder)
After three probe rounds we converged on a three-model ladder that we now run in production:
- Fast tier — Qwen3.6-35B-A3B-Q4. Default for every request that does not explicitly need reasoning. Single 24 GB card, 138 tokens/sec, 80% accuracy on brutal tier out of the box. This is the model we run the KoishiAI content pipeline on for drafting.
- Default tier — Qwen3-32B dense. When quality matters more than latency and we are inside 24 GB, we fall back to dense 32B. Scores 90% on brutal at standard generation settings and does not need a reasoning budget to hit its ceiling. We use it for fact-checking and translation inside the pipeline.
- Heavy tier — Qwen3-Next-80B-A3B. Only called when we detect the task is in the hardest 10% — long multi-file refactors, complex translation with cultural nuance, reasoning over long context. Needs the full 48 GB dual-GPU setup at 28 tokens/sec; we route to it sparingly.
The previous 30B-A3B MoE model, which once sat in the default slot, is gone. It failed the brutal tier on a 50%-versus-90% gap against dense 32B while burning 6.4x the tokens — the worst cost/quality ratio of anything we tested.
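The routing itself is not complicated. Here is a minimal sketch of the ladder as a Python function; the model tags and the is_hard heuristic are illustrative stand-ins, not our production router.

```python
# Placeholder model tags -- use whatever your local registry calls them.
FAST = "qwen3.6-35b-a3b:q4"      # single-card tier, drafting
DEFAULT = "qwen3:32b"            # quality tier, fact-checking and translation
HEAVY = "qwen3-next:80b-a3b"     # dual-GPU tier, hardest ~10% of tasks

def is_hard(task: str) -> bool:
    """Crude illustrative heuristic: very long inputs or explicit refactor/proof asks."""
    return len(task) > 8000 or any(
        kw in task.lower() for kw in ("refactor", "multi-file", "prove")
    )

def pick_model(task: str, needs_quality: bool = False) -> str:
    """Route each request to the cheapest tier that can handle it."""
    if is_hard(task):
        return HEAVY
    if needs_quality:
        return DEFAULT
    return FAST
```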
Common pitfalls
Assuming MoE is always faster. MoE is only faster when the gate makes good routing decisions under your workload. The original Qwen3-30B-A3B routed poorly on long-form reasoning — each expert choice added tokens without improving grade. The newer Qwen3.6-35B-A3B and 80B-A3B fix this, but the label “MoE” alone does not predict speed.
Treating plan injection and best-of-3 sampling as free wins. We ran both as separate probe rounds on Qwen3.5-27B. Plan injection (system prompt: “First write a plan, then execute”) lifted coding accuracy by 0.25 points (not significant, p = 0.85-0.89) and cut throughput by 33%. Best-of-3 sampling added 0.06 points (p > 0.30) at 3x token cost. Neither survived our bench. If you need more accuracy, swap the base model — do not add test-time compute on a small one.
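For clarity on the pattern being tested, here is a minimal sketch of best-of-3 sampling, assuming the prompt's deterministic grader is used to pick among samples (the most generous selection rule possible; our probe's selection may have differed). However you pick, the 3x token cost is paid up front.

```python
def best_of_3(generate, grader, prompt: str) -> str:
    """generate(prompt) -> str is any single-sample call; grader(output) -> bool
    is the prompt's deterministic grader. The 3x token cost is paid regardless."""
    samples = [generate(prompt) for _ in range(3)]
    for s in samples:
        if grader(s):
            return s          # keep the first sample that passes
    return samples[0]         # none passed -- no better than a single sample
```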
Running heterogeneous GPUs without tensor split sanity checks. An RTX 5080 at 16 GB and an RTX 5090 at 32 GB will happily split a 40 GB model — but memory bandwidth differs between them, and your throughput will track the slower card if the layer distribution puts too many layers on the 5080. Ollama handles this reasonably by default; vLLM needs manual tuning.
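A quick sanity check, assuming the nvidia-ml-py bindings are installed: query per-card VRAM usage while the model is loaded, so a lopsided split onto the 5080 shows up immediately. nvidia-smi reports the same numbers; this is just the scriptable version.

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older bindings return bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i} ({name}): {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB used")
pynvml.nvmlShutdown()
```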
Alternatives and comparisons
We ran shorter side-benches on two non-Qwen families to sanity-check that we were not just “Qwen-pilled”:
- Llama family (3.3 70B-Instruct-Q4). Fits in 48 GB, runs around 22 tokens/sec, scored 82% on our brutal tier at default settings. Respectable, but behind Qwen3-32B at a higher memory cost. Strong for general chat; weaker on multi-file code.
- Gemma 4 31B dense. Released in April 2026 under Apache 2.0. Scored 88% on our brutal tier, close to Qwen3-32B, with the benefit of a more permissive license. We are evaluating it for the production default but have not switched yet — the incremental quality gap doesn’t justify breaking the Qwen stack’s downstream tuning.
When NOT to use local LLMs
Local is not the answer when you need the absolute highest accuracy on one-off hard reasoning (Claude or GPT-5 still win on the hardest 1%), when a variable workload means your GPU sits idle 90% of the time (API billing is cheaper per query), or when you lack an on-call person to handle model updates and CUDA driver breakage. Local shines for steady, high-volume workloads where data privacy matters and average throughput is the bottleneck.
Further reading
Our test prompts are not public; most are domain-specific to content generation and Thai translation. We may release a sanitized subset later. In the meantime, for a reproducible public alternative, BigCodeBench and IFEval-Strict are the closest equivalents on the coding side.
Methodology disclosure
- This is a single operator’s bench on one specific rig.
- Our prompts are skewed toward AI content production and Thai-language tasks; results likely do not generalize to SQL, legal, medical, or other domains.
- Throughput numbers are Ollama-specific with our specific draft settings (temperature 0.65, num_ctx 16384 for research-grounded tasks). vLLM and TGI will differ.
- Accuracy percentages are against our own deterministic graders, not public benchmark leaderboards.
- If you want to replicate: all four models we benchmarked are public on Hugging Face (Qwen3.5-27B, Qwen3-32B, Qwen3.6-35B-A3B-Q4, and Qwen3-Next-80B-A3B).