How I Compare LLMs: Leaderboards, Benchmarks, and Hardware Tradeoffs

A practitioner's guide to reading leaderboards without being misled, picking the right GPU, and deciding when local models actually make sense.

2024-09-01 · 7 min read · llm, benchmarks, leaderboards, gpu, model-selection

Every month a new model drops claiming state-of-the-art on some benchmark. Most of these claims are technically true and practically useless. After spending too much time chasing benchmark numbers that didn't translate to real results, I built a more deliberate process for model selection - one that starts with knowing which signals to trust and which to ignore.

Why Benchmarks Often Mislead

The core problem is that benchmark performance and production performance diverge in predictable ways. When a benchmark gets widely used, model developers start optimizing for it - sometimes by training on benchmark-adjacent data, sometimes by overfitting to evaluation formats. A score that looks impressive on paper can reflect benchmark familiarity as much as genuine capability.

The specific patterns I watch for are single-task benchmarks that don't generalize (a model fine-tuned on MMLU-style questions isn't necessarily better at reasoning), static benchmarks that haven't been refreshed (models can inadvertently train on test-set-adjacent content when benchmark questions appear in web-crawled training data), and benchmarks designed for academic milestones rather than practitioner use cases. A model that scores 85% on a legal reasoning benchmark built around appellate case summaries may handle your contract review workflow worse than a model scoring 79% if that 6-point gap reflects question-type overfitting.

The most gamed benchmarks tend to be the most cited: MMLU, HellaSwag, ARC - all useful when they were introduced, all now saturated enough that marginal improvements are noise. The NPHardEval benchmark from the University of Michigan and Rutgers University is more interesting because it regenerates its 900 logic problems monthly, making overfitting structurally harder.

Leaderboards Worth Reading

Not all leaderboards are created equal. Here are the three I actually use.

LMSYS Chatbot Arena is my primary reference for general-purpose model selection. It ranks models with the Elo rating system - the same method used for chess rankings. Human users compare two anonymous model responses and vote on which is better, and each vote moves the two models' ratings. The key advantage is that it's crowdsourced from real queries, not synthetic benchmarks. In my experience, the gap between models on Chatbot Arena has predicted real-world quality differences better than static benchmark scores. The limitation is coverage: models need enough matches to produce stable ratings, so very new or very niche models are underrepresented. Still, if a model I'm evaluating isn't even close to competitive on the Arena, I don't spend more time on it.
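To make the Elo idea concrete, here is a minimal sketch of a textbook Elo update after a single pairwise vote. The function names and the K-factor of 32 are my own illustrative choices (32 is a common chess default), and the Arena's published methodology may differ in detail, so treat this as the general technique rather than their exact computation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's response won the vote, 0.0 if B's won, 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

An upset (a low-rated model beating a high-rated one) moves both ratings more than an expected win does, which is why a model needs many votes before its rating settles.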

Hugging Face Open LLM Leaderboard is the standard reference for open-source model comparison. It runs EleutherAI's Language Model Evaluation Harness across a fixed set of tasks. It is useful for comparing models within the same weight class (7B, 13B, or 70B) and for getting a quick read on whether a newly released model is genuinely competitive or just well-marketed. I use it for shortlisting, not for final selection.

The Enterprise Scenarios Leaderboard from Patronus is specifically interesting for regulated-industry use cases. It scores models on finance, legal, customer support, and creative writing accuracy, while also measuring propensity to return toxic answers or leak confidential information. For practitioners in financial services, legal, or healthcare contexts, the safety and information-leakage scores are load-bearing data points that don't appear in general leaderboards. Each task is scored from 1 to 100, and models can be sorted by individual task category rather than by average score, which matters if your workload is heavily concentrated in one domain.

GPU Selection: What the Numbers Actually Mean

I've seen teams default to H100s for everything and teams try to run fine-tuning jobs on L4s. Both approaches waste money. The right GPU depends on what you're doing.

The critical spec is not FLOPS - it's memory. A model's weights must fit in GPU memory for inference, and training needs substantially more room for gradients, optimizer state, and activations. A 70B parameter model at FP16 uses 2 bytes per parameter, so the weights alone need roughly 140GB of VRAM. A single H100 SXM has 80GB. You cannot run that model on a single H100; you need model parallelism across multiple cards or a quantized version of the model.
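The arithmetic behind that 140GB figure is simple enough to script. This is a rough sketch that counts weights only; the function names and the 20% headroom for KV cache and runtime overhead are assumptions of mine, not measured values, and real footprints vary with context length and batch size.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def fits_on_single_gpu(n_params: float, precision: str, gpu_gb: float,
                       headroom: float = 0.2) -> bool:
    """True if weights plus an assumed headroom fraction fit in gpu_gb."""
    return weight_memory_gb(n_params, precision) * (1 + headroom) <= gpu_gb


if __name__ == "__main__":
    print(weight_memory_gb(70e9, "fp16"))        # weights-only footprint in GB
    print(fits_on_single_gpu(70e9, "fp16", 80))  # 70B at FP16 on an 80GB card
    print(fits_on_single_gpu(70e9, "int4", 80))  # same model, 4-bit quantized
```

Running the same check at several precisions is a quick way to see whether quantization alone gets a model onto one card or whether you are committed to multi-GPU serving.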

Precision format matters more than most teams realize. FP8 is the reason H100 performance numbers look dramatically better than A100 numbers on transformer workloads. The H100 has full hardware-pipeline support for FP8 - data movement from RAM to Tensor cores and back is implemented in FP8 end to end. The A100 does not. On transformer inference specifically, H100 FP8 inference runs roughly 37% faster than H100 FP16 inference, and the gap versus A100 is even larger. If your inference workload is transformer-based and latency-sensitive, H100 with FP8 is not just the best option - it's in a different performance class.

For practical decisions by workload: the H100 is the right choice for large-scale LLM fine-tuning and production transformer inference. It supports NVLink-based multi-node clustering (the "HPC" capability), which means you're not capped at a single host's GPU count. The A100 remains competitive for conventional CNN training - it has better price-to-performance on standard deep learning workloads where FP8 isn't the bottleneck, and it's often significantly cheaper to rent. The L40 targets generative AI inference combined with visual computing - it has RT cores for graphics workloads that the A100 and H100 lack entirely. The L4 is an entry-level option for teams that need GPU-accelerated compute at lower cost; fine for lightweight inference, inadequate for serious fine-tuning or large model serving.

One nuance on the L40 and L4: they have no NVLink interconnect, so they don't support hardware-level multi-node clustering. That caps your maximum model size at whatever fits in the combined memory of the GPUs on a single host. For serving very large models, this becomes a hard constraint.

Local LLM vs. Cloud API: When Each Makes Sense

The question isn't which is better in principle - it's which fits your cost structure, latency requirements, and data constraints.

Cloud API makes sense when: you're in the evaluation or prototype stage and don't want to commit to infrastructure, your query volume is unpredictable enough that idle GPU time would be wasteful, or you need the top-tier frontier models (GPT-4o, Claude Opus, Gemini Ultra) that aren't available for self-hosting. The per-token cost feels small until you're processing millions of documents.

Local deployment makes sense when: data cannot leave your infrastructure (regulated industries often have this constraint regardless of what cloud providers claim about data residency), your query volume is high enough that per-token API costs exceed hosting costs, or you need sub-100ms latency that cloud round-trips can't reliably deliver. At roughly 1–2 million tokens per day, the economics of self-hosting a well-quantized 70B model on a pair of A100s often beat API costs within a few months, depending on your API pricing and GPU rental rates - though you need to account for engineering time to operate the stack.
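The break-even point is worth computing with your own prices rather than taking anyone's rule of thumb on faith. Here is a minimal sketch; every function and parameter name is illustrative, the prices are inputs you supply, and the model deliberately ignores engineering time, storage, and networking, so it understates the real cost of self-hosting.

```python
def monthly_api_cost(tokens_per_day: float, usd_per_million_tokens: float,
                     days: int = 30) -> float:
    """API spend per month at a flat per-token price."""
    return tokens_per_day / 1e6 * usd_per_million_tokens * days


def monthly_hosting_cost(gpu_count: int, usd_per_gpu_hour: float,
                         hours_per_day: float = 24.0, days: int = 30) -> float:
    """GPU rental per month; assumes the cards are billed whether busy or idle."""
    return gpu_count * usd_per_gpu_hour * hours_per_day * days


def break_even_tokens_per_day(gpu_count: int, usd_per_gpu_hour: float,
                              usd_per_million_tokens: float,
                              hours_per_day: float = 24.0) -> float:
    """Daily token volume at which API spend equals GPU rental."""
    daily_hosting = gpu_count * usd_per_gpu_hour * hours_per_day
    return daily_hosting / usd_per_million_tokens * 1e6
```

Below the break-even volume the API is cheaper on raw compute; above it, self-hosting wins only if the GPUs can actually sustain that throughput and the operational overhead stays small.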

The middle path I use most often: cloud API for development and low-volume workloads, spot-instance GPU hosting for high-volume batch inference, and dedicated hosting only for latency-critical paths with predictable load.

My Actual Decision Process

When I'm evaluating a model for a real workload, I work through three steps in this order. First, Chatbot Arena to filter the field - if a model isn't competitive in human preference comparisons, static benchmark scores rarely save it. Second, the Open LLM Leaderboard to check within the relevant weight class - model size drives hosting cost, so I care whether I'm choosing between 7B and 13B options or between 70B options. Third, a domain-specific evaluation on a sample of my actual data. No leaderboard substitutes for 100 real examples from your specific use case with your specific output requirements.
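The third step doesn't need heavy tooling. A minimal sketch of a domain-specific evaluation loop might look like the following, where `model` is any hypothetical callable that maps a prompt string to a response string (an API wrapper, a local pipeline, whatever you are testing) and `exact_match` is a deliberately crude stand-in scorer that you would replace with one suited to your output requirements.

```python
from typing import Callable, Iterable, Tuple


def exact_match(prediction: str, reference: str) -> bool:
    """Crude scorer: case- and whitespace-insensitive string equality."""
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(model: Callable[[str], str],
             examples: Iterable[Tuple[str, str]],
             scorer: Callable[[str, str], bool] = exact_match) -> float:
    """Fraction of (prompt, reference) pairs the model gets right."""
    results = [scorer(model(prompt), reference) for prompt, reference in examples]
    if not results:
        raise ValueError("no examples provided")
    return sum(results) / len(results)
```

Run the same `examples` through each shortlisted model and compare the scores side by side. With only 100 examples, a gap of a few points is within noise, so look for differences large enough to survive a second sample.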

For hardware: start with the minimum RAM that fits the model at the precision you plan to run, then pick the GPU that delivers the required throughput at that memory footprint. Only upgrade to H100 if you're running transformer inference at scale with FP8 or need multi-node training. For most fine-tuning projects on 7–13B models, A100s offer better cost efficiency.
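As a rough sketch of that first step, the snippet below turns a memory footprint into a card count per GPU type. The memory sizes are the vendor-published capacities for the 80GB A100 and H100 SXM, the 48GB L40, and the 24GB L4; the 20% headroom and the 8-cards-per-host limit are assumptions of mine, and the function names are illustrative.

```python
import math

# Published memory capacity per card, in GB.
GPU_MEMORY_GB = {"L4": 24, "L40": 48, "A100-80GB": 80, "H100-SXM": 80}


def cards_needed(footprint_gb: float, gpu: str, headroom: float = 0.2) -> int:
    """Cards required to hold the footprint plus an assumed headroom fraction."""
    return math.ceil(footprint_gb * (1 + headroom) / GPU_MEMORY_GB[gpu])


def single_host_options(footprint_gb: float, max_cards_per_host: int = 8,
                        headroom: float = 0.2) -> dict:
    """GPU types that can hold the model within one host, mapped to card count."""
    options = {}
    for gpu in GPU_MEMORY_GB:
        n = cards_needed(footprint_gb, gpu, headroom)
        if n <= max_cards_per_host:  # assumed per-host limit
            options[gpu] = n
    return options
```

This only answers "does it fit"; throughput and price per hour still decide between the options that survive the memory filter.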

The common mistake I see is treating model selection and hardware selection as sequential decisions. They're coupled. A 70B model on quantized consumer hardware behaves differently from the same model on full-precision H100s, and that difference may outweigh model-to-model quality gaps you'd see on a leaderboard.
