A Practical Guide to LLMs
How to read model names, understand tradeoffs, and match them to your hardware
Large Language Models (LLMs) can feel confusing at first glance. You see names like Qwen3-32B-AWQ or Mixtral-8x7B, and it’s not obvious what they mean—or whether your machine can actually run them.
This guide is designed to give you a practical mental model. By the end, you should be able to:
Understand what different LLM types are
Decode model names and specs
Estimate hardware requirements quickly
Choose the right model for your setup
1. Types of LLMs (What Actually Matters)
There are three main architecture categories:
A. Decoder-only models
Examples: GPT, Llama, Qwen, Mistral.
They are used for chat, coding, reasoning, and generation, the tasks that dominate modern AI usage. If you’re running a local model, this is almost certainly what you’re using.
B. Encoder-only models
Examples: BERT.
They are used for classification, embeddings, and search, and are not typically used for chat or generation.
C. Encoder-decoder models
Examples: T5.
They are used for translation and other structured text-to-text transformations.
2. Parameters That Actually Matter
You can ignore most of the noise. These five define almost everything.
A. Model Size
Examples:
7B = 7 billion parameters
70B = 70 billion parameters
This is the single most important parameter. Larger models generally perform better, but they also need dramatically more memory.
Memory usage at FP16 (2 bytes per parameter): ~2 GB per 1B parameters, so:
7B ≈ 14 GB
70B ≈ 140 GB
B. Quantization
Quantization is the key to running a model locally, since it compresses the model by storing each weight in fewer bits.
Format | Memory Usage (relative to FP16) | Quality
FP16 | 100% | Best
INT8 | ~50% | Very close
INT4 | ~25% | Slight drop
Examples (for an 8B model):
FP16 → ~16 GB
INT4 → ~4 GB
This is the main trick that makes local LLMs possible.
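The percentages above follow directly from the number of bits stored per weight. A minimal sketch of that arithmetic, assuming 1 GB = 1e9 bytes and counting weights only (the function name and table are illustrative; real quantized files also carry a little scaling metadata, so they come out slightly larger):

```python
# Bits stored per weight for the formats in the table above.
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(params_billion: float, fmt: str = "FP16") -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = BITS_PER_WEIGHT[fmt] / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billion * bytes_per_weight


for fmt in BITS_PER_WEIGHT:
    print(f"8B at {fmt}: ~{weight_memory_gb(8, fmt):.0f} GB")
```

This covers the weights only; runtime memory such as the KV cache comes on top.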
C. Context Window
This is how much text the model can “remember” in a conversation.
Typical sizes:
4K → small
32K → standard
128K+ → long-context models
It matters because a larger context needs more runtime memory: the KV cache grows with every token kept in context.
D. KV Cache
There is a hidden memory cost: during inference, the model stores the attention keys and values of every previous token so it does not have to recompute them. This store is the KV cache.
Impact:
Grows with conversation length
Can add 30–50% extra memory usage
This is why long chats slow down or crash smaller GPUs.
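To see how quickly this grows, the cache size can be estimated from the model's shape. The sketch below uses the common formula (one key and one value tensor per layer) and, as an assumption, the published shape of Llama-3-8B: 32 layers, 8 key/value heads, head dimension 128, with FP16 cache entries. The function and parameter names are illustrative, not a library API.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for seq_len tokens."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Assumed shape: 32 layers, 8 KV heads, head_dim 128 (Llama-3-8B-like).
for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(32, 8, 128, tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB")
```

The cost is linear in context length, so moving from a 4K to a 128K context multiplies the cache by 32. Models without grouped-query attention have more KV heads and a proportionally larger cache.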
E. Architecture: Dense vs MoE
There are two main ways LLMs are built.
Dense models
Examples: Llama, Qwen, Mistral.
Every part of the model is used for every token. Think of it like using your entire brain for every thought. It is simple and predictable, but it can be expensive. This is the standard approach.
MoE (Mixture of Experts)
Examples: Mixtral, DBRX.
Only a subset of the parameters is active per token. The result is a much larger total model size at a lower effective compute cost per token.
Example
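Take Mixtral-8x7B. It has 8 experts per layer, and the name suggests 8 × 7B = 56B parameters. The real total is about 47B, because only the feed-forward blocks are replicated per expert while the attention layers are shared.
But unlike dense models, it does not use all of those parameters at once. For each token, a router selects only 2 experts per layer, which means only about 13B parameters are actually used per step.
So even though it’s a ~47B model in total size, it behaves more like a 13B model in compute cost. Memory is a different story: every expert must be loaded, so it still needs the VRAM of a ~47B model.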
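The selection step can be illustrated with a toy top-2 router. This is a simplified sketch, not Mixtral's actual implementation: the experts are plain scalar functions, and the router scores are made-up numbers passed in by hand, whereas a real model computes them from the token with a learned layer.

```python
import math


def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    # Subtract the best score before exponentiating for numerical stability.
    exps = [math.exp(gate_scores[i] - gate_scores[top[0]]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, gate_scores, k=2):
    """Weighted sum of the selected experts' outputs; the others never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))


# Eight toy experts: expert n simply multiplies its input by n + 1.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
scores = [0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.2]  # hypothetical router output

print(route_top_k(scores))
print(moe_layer(1.0, experts, scores))
```

Only two of the eight expert functions are called per input, which is where the compute saving comes from. All eight still have to exist in memory, because a different pair may be chosen for the next token.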
3. How to Decode Model Names
Let’s break down real examples:
Llama-3-8B
8B parameters
Dense model
No quantization suffix, so the weights are FP16
Needs ~20–24 GB VRAM (FP16), fits on an RTX 4090.
Qwen3-32B-AWQ
32B parameters
AWQ = 4-bit quantization
Needs ~20–24 GB VRAM, fits on an RTX 4090.
Mixtral-8x7B
8 experts × 7B each
MoE architecture
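Memory follows the total parameter count (about 47B, since the experts share attention layers), so it needs roughly 28–31 GB at INT4. Compute follows the ~13B active parameters, so it runs like a much smaller model than it looks.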
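The same decoding can be sketched as a small parser. It assumes hyphen-separated names like the three above and treats a missing quantization tag as FP16, following this guide's convention. The function name, the returned fields, and the tag table are illustrative; real model names are not standardized, so many will not parse this cleanly.

```python
import re

# Quantization tags and the bit widths assumed for them in this sketch.
QUANT_BITS = {"AWQ": 4, "INT4": 4, "INT8": 8, "FP16": 16}


def decode_model_name(name: str) -> dict:
    """Split a hyphenated model name into family, size, architecture, and bits."""
    info = {"family": "", "architecture": "dense", "size_b": None,
            "experts": 1, "bits": 16}  # no tag is treated as FP16
    family = []
    for token in name.split("-"):
        moe = re.fullmatch(r"(\d+)x(\d+(?:\.\d+)?)B", token, re.IGNORECASE)
        dense = re.fullmatch(r"(\d+(?:\.\d+)?)B", token, re.IGNORECASE)
        if moe:
            # "8x7B": 8 experts, each labelled 7B
            info.update(architecture="MoE", experts=int(moe.group(1)),
                        size_b=float(moe.group(2)))
        elif dense:
            info["size_b"] = float(dense.group(1))
        elif token.upper() in QUANT_BITS:
            info["bits"] = QUANT_BITS[token.upper()]
        elif info["size_b"] is None:
            family.append(token)  # everything before the size is the family name
    info["family"] = "-".join(family)
    return info


for model in ("Llama-3-8B", "Qwen3-32B-AWQ", "Mixtral-8x7B"):
    print(model, "->", decode_model_name(model))
```

For MoE names, `size_b` is the per-expert label from the name, not the true total parameter count.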
4. Matching Models to Hardware
Here’s a simple formula:
INT4 models:
VRAM ≈ 0.5 GB × (model size in B)
FP16 models:
VRAM ≈ 2 GB × (model size in B)
Then add 20–30% for KV cache and overhead.
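As a quick calculator, the formula might look like the sketch below. The INT8 factor of 1 GB per billion parameters is not stated in the formula above; it is inferred from the ~50% figure in the quantization table. Function names are illustrative.

```python
# GB of weights per billion parameters. INT8 is inferred from the ~50% figure.
GB_PER_BILLION = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def estimate_vram_gb(params_billion, fmt, overhead=(0.20, 0.30)):
    """Return a (low, high) VRAM estimate in GB, including KV cache and overhead."""
    weights = GB_PER_BILLION[fmt] * params_billion
    low, high = overhead
    return weights * (1 + low), weights * (1 + high)


def fits(params_billion, fmt, vram_gb):
    """True if even the pessimistic estimate fits in the given VRAM."""
    return estimate_vram_gb(params_billion, fmt)[1] <= vram_gb


low, high = estimate_vram_gb(32, "INT4")
print(f"32B INT4: {low:.1f} to {high:.1f} GB, fits in 24 GB: {fits(32, 'INT4', 24)}")
print(f"70B INT4 fits in 24 GB: {fits(70, 'INT4', 24)}")
```

Checking against the pessimistic end of the range is a deliberate choice: a model that only fits at the optimistic end tends to run out of memory once the conversation gets long.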
Practical Mapping Table
Model | Minimum VRAM | Device(s)
7B INT4 | 6–8 GB | Laptop / Mac
13B INT4 | 10–12 GB | Laptop / Mac
32B INT4 | ~22 GB | High-end GPU
70B INT4 | ~45 GB | Multi-GPU
70B FP16 | 140+ GB | Data center
5. Hardware Guidelines
MacBooks (M1/M2/M3)
Best: 7B–13B INT4
Possible: 32B (slow)
RTX 4090 (24GB)
Ideal: up to 32B (INT4)
Possible: 70B (with tricks / offloading)
RTX PRO 6000 Max-Q (96GB)
Easy: 70B (INT4 or INT8)
Ideal:
Mixtral-8x7B (MoE), quantized
Possible:
120B–140B (INT4)
70B FP16 or 100B INT8, only with offloading, since their weights alone (140 GB and ~100 GB) exceed 96 GB
Servers (A100 / H100)
70B+ models
Multi-GPU scaling
Production workloads
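These guidelines can be sanity-checked by inverting the VRAM formula: given a memory budget, what is the largest model that fits at each precision? A rough sketch, assuming the pessimistic 30% overhead and an INT8 factor of 1 GB per billion parameters (inferred from the quantization table); the device labels are only examples.

```python
# GB of weights per billion parameters. INT8 is inferred from the ~50% figure.
GB_PER_BILLION = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def max_params_billion(vram_gb, fmt, overhead=0.30):
    """Largest model size (in billions of parameters) that fits in vram_gb."""
    return vram_gb / (GB_PER_BILLION[fmt] * (1 + overhead))


for device, vram in {"24 GB GPU": 24, "96 GB GPU": 96}.items():
    sizes = ", ".join(
        f"{fmt} up to ~{max_params_billion(vram, fmt):.0f}B"
        for fmt in GB_PER_BILLION
    )
    print(f"{device}: {sizes}")
```

Treat the results as ceilings, not targets: long contexts, larger batch sizes, and memory used by the desktop all eat into the budget.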
6. Popular Models Today
Open-source:
Llama 3
Qwen 2 / 3
Mistral / Mixtral
DeepSeek
Proprietary:
GPT
Claude
Gemini
Closing Thoughts
Understanding LLMs doesn’t require deep theory, just a few practical heuristics. Once you internalize how model size scales, how quantization changes the game, and how memory really works, you will be able to quickly evaluate any model and know whether it’s worth trying.

