Quantization & acceleration is how you squeeze big models onto normal hardware and make them feel fast. Quantization shrinks weights from fp16/bf16 down to 8-bit or 4-bit (sometimes even lower), slashing VRAM without retraining. You’ll see a few flavors: bitsandbytes for drop-in 8-bit/4-bit loading and lean optimizers; AutoGPTQ for post-training 4-bit checkpoints; AWQ for activation-aware 4-bit with strong accuracy-per-GB; and ExLlamaV2/EXL2 when you want the highest tokens/sec on a single consumer GPU. These aren’t mutually exclusive: quantized weights often pair with specialized kernels or backends to move the needle.
Acceleration is the runtime half of the story: custom CUDA kernels, smart KV-cache handling, and sampler plumbing that keep the GPU busy. This guide focuses on single-box speedups via quantized formats and fast loaders. Expect trade-offs: smaller bit-widths and tighter group sizes reduce memory and increase throughput but can soften accuracy; better calibration data and modern formats (NF4, AWQ, EXL2) help retain quality. Also remember that quantization shrinks weights, not your KV cache: long contexts and big batches still eat memory.
Rule of thumb: start with 4-bit on a 7B–13B model, keep group_size=128 where applicable, use bfloat16 as the compute dtype when available, and only tighten further if VRAM forces your hand. Test on your own prompts, watch temps and clocks, and right-size context length before blaming the loader.
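The group-size knob above can be made concrete with a toy sketch. This is a pure-Python illustration of symmetric absmax group-wise quantization (helper names are ours, not any library's API): each group stores low-bit integers plus one scale, so an outlier weight only degrades its own group rather than the whole tensor.

```python
# Toy symmetric absmax group-wise quantization (pure Python, illustrative
# names): each group stores 4-bit integers plus one float scale, so an
# outlier only hurts fidelity within its own group.

def quantize_groupwise(weights, group_size=4, bits=4):
    qmax = 2 ** (bits - 1) - 1                 # symmetric range: -7..7 for 4-bit
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        quantized.append([round(w / scale) for w in group])
        scales.append(scale)
    return quantized, scales

def dequantize_groupwise(quantized, scales):
    return [q * s for qs, s in zip(quantized, scales) for q in qs]

weights = [0.01, -0.02, 0.03, 8.0,             # outlier lands in group 0
           0.02, -0.01, 0.04, -0.03]
q, s = quantize_groupwise(weights, group_size=4)
restored = dequantize_groupwise(q, s)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

Smaller groups mean more scales (more memory) but tighter error per group; group_size=128 is the usual compromise in real GPTQ/AWQ checkpoints.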
ExLlamaV2 (fast GPTQ/EXL2 inference)
What it is: a high-performance inference backend for quantized Llama-family models. It supports classic GPTQ checkpoints and its own EXL2 format, and ships custom CUDA kernels to squeeze more tokens/sec out of mid-range cards. Expect fast, VRAM-efficient decoding for 7B–70B models, with clean Python bindings and drop-in support in popular local UIs.
Why you’d pick it: strong single-GPU speed via optimized attention/MLP kernels; small-VRAM-friendly 4-bit GPTQ/EXL2; Python and UI loaders; EXL2 supports per-layer/per-tensor precision for better fidelity than plain 4-bit GPTQ at the same footprint.
Key capabilities: loads GPTQ and EXL2; mixed/variable bit-rates; fused CUDA kernels; KV-cache controls and RoPE scaling; streaming token output; optional multi-GPU sharding; selective LoRA/PEFT without duplicating full weights.
Rough VRAM guidance
- 7B @ 4-bit: ~5–7 GB for weights plus headroom for KV/cache; works on 8–12 GB cards.
- 13B @ 4-bit: ~8–12 GB; comfortable on 12–16 GB cards.
- 70B (quantized): needs sharding/offload. Footprint depends on bit-width, context length, batch size, and CPU offload.
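A back-of-envelope check on the figures above. The model shapes below are approximate Llama-7B-style values (32 layers, 32 KV heads, head_dim 128, assumptions on our part); kernels, activations, and allocator fragmentation add real-world overhead on top.

```python
# Back-of-envelope VRAM math: quantized weight footprint plus fp16 KV cache.
# Shapes are approximate Llama-7B values; treat output as a sanity check only.

def weight_gb(n_params_billion, bits):
    return n_params_billion * 1e9 * bits / 8 / 2**30

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # K and V each store batch x seq x heads x head_dim elements per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes / 2**30

w = weight_gb(7, 4)                      # 7B weights at 4-bit: ~3.3 GB
kv = kv_cache_gb(32, 32, 128, 4096, 1)   # fp16 cache at 4k context, batch 1
print(f"weights ~{w:.1f} GB + KV ~{kv:.1f} GB")
```

Adding the two terms lands in the 5–7 GB range quoted above for 7B @ 4-bit, and shows why doubling context or batch can matter more than the weights themselves.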
Quick start
A) With text-generation-webui
- Place a GPTQ or EXL2 model in models/.
- Select the ExLlamaV2 loader and pick your model.
- Set context length, GPU selection/offload, and sampling preferences. If OOM, lower context or batch size, or use a tighter quant.
B) From Python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/your/exl2-or-gptq-model"
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.95

prompt = "Explain SIMD in one paragraph."
# generate_simple returns the full completion; for token-by-token streaming,
# see ExLlamaV2StreamingGenerator in the repo examples. API details vary by
# version, so check the examples shipped with your install.
output = generator.generate_simple(prompt, settings, 200)
print(output)
Performance tips
- Prefer EXL2 when available for better accuracy per GB.
- Right-size context. KV cache scales with sequence length × batch.
- For factual tasks, lower temperature and moderate top-p; for creative work, raise them.
- Keep GPU clocks cool and stable; sustained boost clocks matter.
Troubleshooting
- CUDA OOM: reduce context or batch; use a tighter quant; close other GPU apps.
- Slow first token: warm-up/graph initialization; subsequent generations are faster.
- Tokenizer mismatch: ensure tokenizer files match the checkpoint.
- Throughput stalls on long chats: trim history or use a sliding-window/rolling cache.
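The last point can be sketched as a minimal sliding window: keep the system prompt plus as many recent turns as fit a token budget. All names here are illustrative, and count_tokens is a crude stand-in for the model's real tokenizer.

```python
# Minimal sliding-window sketch for long chats: keep the system prompt and
# as many recent turns as fit a token budget. Names are illustrative.

def count_tokens(text):
    return len(text.split())  # crude stand-in; use len(tokenizer.encode(text))

def trim_history(system_prompt, turns, budget):
    kept = []
    used = count_tokens(system_prompt)
    for turn in reversed(turns):            # walk newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return [system_prompt] + list(reversed(kept))

turns = [f"turn {i} " + "word " * 50 for i in range(20)]
window = trim_history("You are helpful.", turns, budget=300)
```

Dropping old turns keeps the KV cache bounded; a fancier variant summarizes the dropped turns instead of discarding them.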
When to use something else
- Need OpenAI-compatible HTTP out of the box? Use vLLM or TGI.
- Need CPU-only portability? Use llama.cpp.
- Need a browser-only demo? Use WebLLM.
Links: https://github.com/turboderp-org/exllamav2
AutoGPTQ (GPTQ quantization toolkit)
What it is: a Python library for post-training, weight-only quantization of LLMs using GPTQ. It shrinks memory and boosts inference speed by turning FP16/BF16 checkpoints into compact 4-bit (and other bit-width) variants with minimal quality loss.
What it does: start with a floating-point model, run a short calibration pass over representative text, then export a GPTQ-quantized checkpoint (often .safetensors). Choose bit-width, group size, and related knobs to trade fidelity for footprint. Load with AutoGPTQ’s loader or Transformers’ GPTQ path; many pair GPTQ weights with ExLlamaV2 for speed.
Notes
- Baseline: 4-bit with group_size=128 is a practical start.
- Calibration: use text similar to your prompts; better calibration gives better fidelity.
- Hardware: quantization is compute/RAM heavy; inference on the result is much lighter.
- Kernels: for best speed, use quant-optimized kernels or servers that support GPTQ.
- Housekeeping: keep tokenizer files in sync with the model.
Quick start
Quantize to 4-bit and save
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False
)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config=quant_config)
calib_texts = [
    "Your representative sample text goes here.",
    "Add more lines that resemble your real prompts."
]
# AutoGPTQ expects pre-tokenized examples, not raw strings
examples = [tokenizer(text) for text in calib_texts]
model.quantize(examples)
save_dir = "./llama3-8b-gptq-4bit"
model.save_quantized(save_dir, use_safetensors=True)
tokenizer.save_pretrained(save_dir)
Load for inference (Transformers GPTQ path)
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
import torch
save_dir = "./llama3-8b-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(save_dir, use_fast=True)
gptq_cfg = GPTQConfig(bits=4, group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    save_dir,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=gptq_cfg
)
prompt = "Explain SIMD in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Links: https://github.com/AutoGPTQ/AutoGPTQ
AWQ (activation-aware weight quantization)
What it is: a post-training, activation-aware, weight-only quantization method. A short calibration pass scores which weights matter via activations, then compresses weights, commonly to 4-bit, preserving quality well. Result: smaller checkpoints that fit on consumer GPUs and run faster than full-precision models, without retraining.
What it does: run AWQ on a small, representative text set and export a quantized model (weights plus a quant_config). Calibration computes per-channel/group scaling so sensitive layers keep more precision where it counts. Typical knobs: bit-width, group size, zero-point. Quantized models can be loaded with AWQ loaders (Python) or ecosystem tooling (AutoAWQ, various UIs/servers).
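The scaling trick can be shown with a didactic toy (real AWQ searches per-group scales over calibration data; here we hand-pick one scale to expose the mechanism, and all names are illustrative): scaling a salient channel's weight up and folding the inverse scale into its activation leaves the float product unchanged, but shrinks that channel's 4-bit rounding error.

```python
# Didactic sketch of the activation-aware idea behind AWQ. Real AWQ searches
# per-group scales over calibration data; we hand-pick s to show the effect.

QMAX = 7  # symmetric 4-bit

def q4(x, scale):
    # round-to-nearest quantize/dequantize with clamping
    return max(-QMAX, min(QMAX, round(x / scale))) * scale

w = [0.011, 0.52]    # weights for two input channels
a = [9.0, 0.1]       # channel 0 sees large activations: it is "salient"
exact = sum(wi * ai for wi, ai in zip(w, a))

# plain round-to-nearest with one shared absmax scale: the small salient
# weight rounds to zero and its large activation amplifies the error
scale = max(abs(x) for x in w) / QMAX
plain = sum(q4(wi, scale) * ai for wi, ai in zip(w, a))

# AWQ-style: scale the salient weight up by s, fold 1/s into its activation;
# the exact product is identical, but the rounded weight keeps more precision
s = 8.0
w2, a2 = [w[0] * s, w[1]], [a[0] / s, a[1]]
scale2 = max(abs(x) for x in w2) / QMAX
awq = sum(q4(wi, scale2) * ai for wi, ai in zip(w2, a2))

err_plain, err_awq = abs(plain - exact), abs(awq - exact)
```

In this toy case the activation-aware version cuts the output error by several times; calibration data is what tells real AWQ which channels deserve that treatment.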
Notes
- Why pick it: strong accuracy-per-GB at 4-bit; smooth integration in local-LLM stacks.
- Calibration matters: use text that resembles your prompts.
- Bit-width & groups: baseline is 4-bit with group_size=128 and zero_point=True.
- VRAM reality: 7B @ 4-bit ~5–7 GB; 13B @ 4-bit ~9–12 GB; context and batch inflate KV memory.
- Kernels & loaders: pair AWQ weights with a compatible inference path; choose kernel variants for your batching.
- Safety & ops: keep tokenizer files in sync; version your quant_config.json; test on your eval set.
Quick start (quantize → load, Python)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
base_id = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_out = "./llama3-8b-awq-4bit"
model = AutoAWQForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    base_id,
    use_fast=True,
    trust_remote_code=True
)
quant_config = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": "GEMM"
}
calib_texts = [
    "Your representative sample text here.",
    "Add several lines that mirror your prompts and documents."
]
# calib_data also accepts a dataset name (AutoAWQ defaults to "pileval")
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_texts
)
model.save_quantized(quant_out)
tokenizer.save_pretrained(quant_out)
Load for inference
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
quant_dir = "./llama3-8b-awq-4bit"
model = AutoAWQForCausalLM.from_quantized(
    quant_dir,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    quant_dir,
    use_fast=True,
    trust_remote_code=True
)
prompt = "Explain SIMD in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=180, temperature=0.7, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Links: Repo (LLM-AWQ): https://github.com/mit-han-lab/llm-awq • Python loader (AutoAWQ): https://github.com/casper-hansen/AutoAWQ • Example UI: https://github.com/oobabooga/text-generation-webui
bitsandbytes (8-bit/4-bit inference & optimizers)
What it is: a lightweight CUDA toolkit that lets you run and fine-tune large models on consumer GPUs by squeezing weights and optimizers down to 8-bit or 4-bit. It plugs into PyTorch and Transformers for drop-in quantized loading and memory-lean optimizers.
What it does: for inference, load models in 8-bit (load_in_8bit=True) or 4-bit (BitsAndBytesConfig) so weights live compactly in VRAM while math happens in a higher-precision compute dtype (fp16/bf16). 4-bit typically uses NF4 with optional double quantization. For training/finetuning, it provides 8-bit optimizers (Adam/AdamW/Lion) and paged variants that cut optimizer memory; this enables QLoRA-style fine-tuning on a single GPU.
Notes
- VRAM wins: ~2–4× smaller weight footprints vs fp16.
- Quality trade-offs: 8-bit is typically near-lossless; 4-bit NF4 is strong for chat/code but still a trade.
- CUDA & wheels: match CUDA/PyTorch to the wheel. Linux is most reliable; Windows wheels can lag.
- KV cache still matters: quantization shrinks weights, not the KV cache.
- Training ergonomics: use paged 8-bit optimizers with PEFT/LoRA for longer sequences and bigger batches.
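The absmax-per-block scheme the 8-bit path builds on can be sketched in pure Python (didactic only; real bitsandbytes adds outlier decomposition in LLM.int8() and fused CUDA kernels, and the helper name here is ours):

```python
# Per-block absmax int8 sketch: store int8 values plus one float scale per
# block, dequantize on the fly. Didactic; real bitsandbytes adds outlier
# handling (LLM.int8()) and fused CUDA kernels.

import random

def absmax_int8(block):
    scale = max(abs(x) for x in block) / 127 or 1.0
    return [round(x / scale) for x in block], scale

random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(256)]
q, scale = absmax_int8(block)
restored = [qi * scale for qi in q]
worst = max(abs(a - b) for a, b in zip(block, restored))
rel_err = worst / max(abs(x) for x in block)
```

With 256 levels the worst-case relative error stays under half a percent of the block's largest weight, which is why 8-bit is typically near-lossless; 4-bit NF4 closes the remaining gap with non-uniform levels shaped to the weight distribution.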
Quick starts
1) 4-bit inference (Transformers)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
prompt = "Explain SIMD in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, temperature=0.7, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))
2) 8-bit inference (single-flag load)
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto"
)
3) Fine-tuning with paged 8-bit AdamW (Trainer)
from transformers import TrainingArguments, Trainer
# assumes `model`, `tok`, `train_ds`, and `eval_ds` are defined
args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=2,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True,  # or bf16=True if supported
    optim="paged_adamw_8bit",
    logging_steps=10,
    save_steps=200
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tok
)
trainer.train()
Links: GitHub: https://github.com/bitsandbytes-foundation/bitsandbytes • Install notes: README

