Inference Backends & Servers: AI Resources 2025

Inference backends and servers are the engines that actually run models, whether on your box, your rack, or your cluster, and they expose clean HTTP APIs so everything else (chat UIs, SDKs, agents) can talk to them. If desktop apps are the “nice cockpit,” these are the turbines. They handle batching, scheduling, quantization, memory management, and streaming so you can serve real traffic quietly, quickly, and without paying the cloud tax.

You’ll see a few flavors here. llama.cpp is the lean C/C++ workhorse for GGUF models—great on CPU or Metal/CUDA/ROCm with a tiny footprint and an optional OpenAI-style server. vLLM is the throughput monster (PagedAttention + continuous batching) for GPUs and multi-GPU boxes. Text Generation Inference (TGI) brings a hardened, container-first Rust/Py server with dynamic batching, tensor parallel, and Prometheus metrics. MLC-LLM / WebLLM take the compiler route—native binaries for desktop/mobile and WebGPU for fully in-browser inference. FastChat gives you controller/worker orchestration with a web UI and OpenAI-compatible endpoints (and can plug vLLM in as a worker). OpenLLM (BentoML) wraps this all in production ergonomics—OpenAI-compatible APIs, a built-in chat UI, and one command to move from local to Docker/K8s/BentoCloud.

How to pick, fast: CPU-first or edge? llama.cpp. Single-GPU with high QPS? vLLM. Multi-GPU with a polished container and metrics? TGI. Browser/native “no server” demos? WebLLM / MLC-LLM. Multi-model routing with a familiar UI/API? FastChat. Turnkey OpenAI-style service you can promote to prod? OpenLLM. Whatever you choose, plan for weights + KV cache, keep endpoints private (reverse proxy/auth), and right-size quantization and context so your cards stay busy, not memory-starved.
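The decision guide above can be sketched as a tiny lookup. The scenario labels below are invented for illustration; they are not part of any tool's API:

```python
# Hypothetical decision helper mirroring the guidance above.
# Scenario labels are made up for this sketch.
PICKS = {
    "cpu_or_edge": "llama.cpp",
    "single_gpu_high_qps": "vLLM",
    "multi_gpu_container_metrics": "TGI",
    "browser_no_server": "WebLLM / MLC-LLM",
    "multi_model_routing_ui": "FastChat",
    "turnkey_openai_service": "OpenLLM",
}

def pick_backend(scenario: str) -> str:
    """Return the suggested backend for a deployment scenario."""
    return PICKS.get(scenario, "llama.cpp")  # lean default for unknown cases
```

Whatever the lookup says, the memory-planning and security advice above applies to every choice.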



llama.cpp (C/C++ runner for GGUF; tiny deps; OpenAI-style server)

What it is: A lean, portable C/C++ inference engine for running modern LLMs locally on Windows, macOS (Apple Silicon loves it), Linux, and even mobile. It speaks the GGUF model format (weights + tokenizer + metadata in one file), has minimal dependencies, and can serve models through a simple OpenAI-compatible HTTP API. Think: “drop in a .gguf, run a single binary, chat.”

What it does: It loads a GGUF model, allocates a KV cache, and streams tokens with full sampler control, entirely on your machine.

  • Backends & acceleration: CPU-only works out of the box; optional GPU offload via Metal (Apple), CUDA (NVIDIA), HIP/ROCm (AMD), and Vulkan (cross-vendor). You choose how many layers to push to GPU for a big speed-up without requiring full VRAM residency.
  • OpenAI-style server: Launch the built-in server and you get endpoints like /v1/chat/completions, /v1/completions, and /v1/embeddings, plus SSE streaming. Easy to point UIs and SDKs at http://localhost.
  • Quantization options: Built-in tools can quantize FP16 GGUF to a range of 2-bit–8-bit schemes (e.g., Q4_K_M, Q5_K_M, Q8_0). Smaller quants = lower RAM/VRAM and faster throughput, with some quality trade-off.
  • Long-context tricks: RoPE scaling modes (e.g., NTK/Yarn-style settings) let compatible models stretch context windows well beyond defaults—handy for RAG or codebases.
  • Constrained outputs: Grammar/JSON modes can force the model to emit valid JSON or match a custom grammar—useful for tool calls, config generation, or structured outputs.
  • Embeddings & batching: Switch to embedding mode for vector generation; batch multiple prompts to maximize throughput.
  • Speculative decoding & MoE: Optional draft-model speculative decoding for speed; supports MoE models (e.g., Mixtral), computing only the active experts at each step.
  • Model zoo friendly: Runs a wide variety of open models converted to GGUF (Llama, Mistral/Mixtral, Qwen, Phi, Gemma, etc.). Conversion tooling is included, though most popular models are already published as GGUF.

Notes:

  • Pick your quant wisely:
    • CPU boxes: Q4_K_M is a sweet spot for 7–13B models; Q5_K_M if you want extra fidelity.
    • Apple Silicon: Offload with Metal; a 7B Q4_K_M runs well on base M-series; bump VRAM for bigger models.
    • NVIDIA/AMD: Use CUDA/ROCm and offload as many layers as VRAM allows (-ngl <N>).
  • Memory reality check (rough): A 7B Q4_K_* model wants ~4–5 GB RAM/VRAM; 13B ~8–10 GB. Larger models scale accordingly.
  • Keep it private: The server is real—don’t bind it to the public internet without a reverse proxy and auth.
  • Portable builds: Prebuilt binaries exist for common platforms, but building with CMake is straightforward if you want the exact backend flags for your machine.
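The memory reality check above is easy to reproduce with back-of-envelope math: effective bits per weight for the quant, times parameter count, plus a flat allowance for KV cache and runtime overhead. A sketch, with bits-per-weight values that are approximations, not exact K-quant sizes:

```python
# Back-of-envelope GGUF memory estimate. Bits-per-weight values are
# approximate effective sizes for llama.cpp quants, not exact figures.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def estimate_gb(params_billion: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Weights footprint in GB plus a flat allowance for KV cache/overhead."""
    weight_gb = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9
    return round(weight_gb + overhead_gb, 1)
```

For a 7B model at Q4_K_M this lands around 5 GB and a 13B around 9 GB, consistent with the ranges above; real usage varies with context length.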

Common commands (quick reference):

#1) Build (generic)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
mkdir build && cd build && cmake .. && cmake --build . -j

#2) Run a simple prompt (CLI)
./llama-cli -m ~/models/llama-3-8b.Q4_K_M.gguf -p "Explain SIMD in one paragraph."

#3) Start the OpenAI-style server on localhost:8080
./llama-server -m ~/models/llama-3-8b.Q4_K_M.gguf --host 127.0.0.1 --port 8080

#4) Chat completion (curl)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-llama",
    "messages": [{"role":"user","content":"Give me a haiku about caches."}],
    "stream": true
  }'

#5) GPU offload (example: offload 35 layers)
./llama-cli -m ~/models/llama-3-8b.Q4_K_M.gguf -ngl 35 -p "Summarize this repo."
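With "stream": true (as in command #4), the server replies with Server-Sent Events: a series of data: lines each carrying a JSON chunk, terminated by data: [DONE]. A minimal pure-Python parser for that wire format, assuming the OpenAI-style chunk schema:

```python
import json

def parse_sse_chunks(lines):
    """Yield content deltas from OpenAI-style SSE lines; stop at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]
```

In a real client you would feed this the response body line by line and append each yielded delta to the UI.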

Links:
GGUF model search (ready-to-run weights): https://huggingface.co/models?search=gguf
GitHub (source & releases): https://github.com/ggml-org/llama.cpp
Python bindings (drop-in for local OpenAI-style dev): https://github.com/abetlen/llama-cpp-python
Node bindings: https://github.com/withcatai/node-llama-cpp


vLLM (high-throughput serving with PagedAttention; OpenAI-compatible)

What it is: vLLM is a production-grade inference server for large language models—built to squeeze maximum throughput out of GPUs. It runs on Linux/macOS/Windows (GPU strongly recommended), exposes OpenAI-compatible HTTP endpoints, and scales from a single desktop card to multi-GPU nodes and clusters.

What it does: The server keeps models resident on your GPU(s), then uses PagedAttention and continuous batching to feed tokens to many concurrent requests without wasting memory. You point clients and UIs at its OpenAI-style API and get streaming chat/completions/embeddings with strong GPU utilization.

  • PagedAttention & continuous batching: Efficient KV-cache paging/blocking + smart scheduling lets short and long requests co-exist, keeping GPUs busy instead of idle.
  • OpenAI-compatible API: Drop-in routes for /v1/chat/completions, /v1/completions, and /v1/embeddings, plus SSE streaming—easy to wire into existing SDKs, apps, and UIs.
  • Multi-GPU & scaling: Tensor parallelism (and pipeline options in recent builds) to split models across GPUs; run multiple workers and scale horizontally behind a proxy if needed.
  • Speculative decoding (optional): Use a smaller “draft” model to propose tokens that the large model verifies—higher tokens/sec when it fits your workload.
  • LoRA adapters (serve-time): Load one or more LoRA adapters and select them per request, enabling multi-tenant fine-tunes without duplicating full weights.
  • Embeddings: Serve embedding models alongside chat; works with the same OpenAI-style /v1/embeddings endpoint.
  • Quantized & standard weights: Run FP16/bfloat16 models or compatible HF quantizations (e.g., GPTQ/AWQ) to fit bigger models per GPU.
  • Tunable scheduler: Caps, priorities, and batching knobs to balance latency vs throughput for your traffic shape.
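The throughput win from continuous batching is easy to see with a toy accounting model: in static batching the whole batch holds its slots until the longest request finishes, while continuous batching refills a slot the moment a request completes. A simplified sketch (decode steps only, ignoring prefill and scheduling overhead):

```python
def static_batch_slot_steps(lengths):
    """Slot-steps consumed when the batch is locked until the longest finishes."""
    return len(lengths) * max(lengths)

def continuous_batch_slot_steps(lengths):
    """Slot-steps when finished requests free their slot immediately."""
    return sum(lengths)

reqs = [3, 10, 4]  # output lengths in tokens for three concurrent requests
waste = static_batch_slot_steps(reqs) - continuous_batch_slot_steps(reqs)
# static locks 3 slots for 10 steps; continuous only pays for actual tokens
```

The gap (the reclaimed slot-steps) is capacity the scheduler can hand to newly arriving requests, which is why mixed short/long traffic benefits so much.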

Notes (read before you deploy):

  • GPU first: vLLM shines on NVIDIA/AMD GPUs; CPU-only is not the target path. Start with a 7–8B model on a 12–16 GB card; scale up with more VRAM or tensor parallel.
  • Memory planning: End-to-end RAM/VRAM usage is weights plus KV cache. Longer contexts and bigger batches grow the KV. Dial parameters like max tokens, context length, and gpu-memory-utilization to stay within budget.
  • Quantization trade-offs: GPTQ/AWQ can fit larger models on your card with modest quality loss. Throughput often improves due to memory bandwidth relief.
  • Security: It’s a real HTTP server. Keep it on localhost or put it behind a reverse proxy + auth. Don’t expose directly to the public internet.
  • Ops posture: Containers + a process supervisor make life easier. For multi-GPU boxes, set explicit parallelism flags; for clusters, front with a gateway (e.g., Nginx/Envoy) and scale workers.
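The "weights plus KV cache" budget can be made concrete: per token, the cache stores a key and a value vector for every layer and KV head. For a Llama-3-8B-like shape (32 layers, 8 KV heads via GQA, head dim 128, fp16) that is 128 KiB per token, about 1 GiB at an 8K context. A sketch using those assumed shape parameters:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: key + value across all layers/KV heads."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_cache_gib(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Total KV cache in GiB for one sequence of `tokens` tokens."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes) / 2**30
```

Multiply by concurrent sequences to size batches; this is the quantity PagedAttention manages in fixed-size blocks.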

Common commands (quick reference):

# 1) Install (Python env)
pip install vllm

# 2) Serve an instruct model with the OpenAI-compatible API (localhost:8000)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000 \
  --download-dir /models \
  --gpu-memory-utilization 0.90

# 3) Call the Chat Completions endpoint (streaming)
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role":"user","content":"Give me three bullet points on SIMD."}],
    "stream": true
  }'

# 4) Multi-GPU (tensor parallel across 2 GPUs)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2

# 5) Serve with a LoRA adapter (load at startup, select per request)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-lora \
  --lora-modules mylora=/path/to/lora

# 6) Enable speculative decoding with a draft model
# (flag names have changed across vLLM releases; run with --help to confirm)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-model Qwen/Qwen2.5-3B-Instruct \
  --num-speculative-tokens 5
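The mechanic behind speculative decoding can be sketched with toy deterministic "models": the draft proposes k tokens, the target verifies them, and the accepted prefix plus one corrected token advance the sequence. Both model functions below are stand-ins for illustration, not vLLM APIs:

```python
def speculative_step(seq, draft_model, target_model, k=4):
    """One round: draft proposes k tokens; target keeps the matching prefix
    plus its own correction at the first mismatch (guaranteed >= 1 token)."""
    proposal, s = [], list(seq)
    for _ in range(k):
        t = draft_model(s)
        proposal.append(t)
        s.append(t)
    out = list(seq)
    for t in proposal:
        expected = target_model(out)  # in reality: one batched verify pass
        if t == expected:
            out.append(t)             # accepted almost for free
        else:
            out.append(expected)      # correction from the target; stop round
            break
    return out

# Toy models: the target always emits the sequence length; the draft is
# wrong whenever the current length is a multiple of 3.
target = lambda s: len(s)
draft = lambda s: len(s) if len(s) % 3 else len(s) + 1
```

When the draft agrees often, several tokens land per target pass; when it doesn't, you still make one token of progress, which is why draft/target agreement drives the speed-up.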

Links:

GitHub (source & releases): https://github.com/vllm-project/vllm
Docs: https://docs.vllm.ai/

Text Generation Inference (TGI) (Rust/Python high-perf server)

What it is: TGI is Hugging Face’s production-grade text generation server for GPUs. It’s built in Rust/Python, ships as a hardened Docker image, and focuses on throughput + stability: dynamic batching, token streaming, and tensor parallelism for multi-GPU boxes. You point clients at its HTTP endpoints and it handles the hot path (loading the model, batching requests, scheduling tokens) so your cards stay busy.

What it does: The server keeps the model resident on your GPU(s), then dynamically batches prefill/decoding across concurrent requests to minimize idle time. You get both non-streaming (/generate) and SSE streaming (/generate_stream) endpoints with a rich parameters object (max tokens, temperature/top-p/top-k, stop sequences, repetition penalty, etc.). TGI supports tensor parallel sharding (--num-shard N) to split large models across multiple GPUs, plus performance knobs (prefill/total token caps, scheduling) so you can trade a bit of latency for a lot of throughput.

  • GPU-first, multi-GPU aware: Designed for NVIDIA/ROCm stacks; scale from 1× to many GPUs with sharding.
  • Quantization & dtypes: Run BF16/FP16, and where models support it, common HF quantization paths (e.g., bitsandbytes 8-bit/4-bit; some GPTQ/AWQ builds) to fit larger models per card.
  • Token streaming: Server-sent events keep clients responsive for interactive chat UX.
  • Observability: Health endpoints and Prometheus metrics let you watch queue depth, tokens/sec, and latency.
  • Model coverage: Works with popular decoder-only LLMs hosted on the Hub (Llama/Mistral/Mixtral/Falcon/Gemma/etc.).
  • Embeddings note: TGI is for generation. If you need embeddings, use TEI (Text Embeddings Inference) alongside it.
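The native request shape is just JSON: a single inputs string plus a parameters object. A small helper that assembles a body for /generate or /generate_stream (the defaults here are illustrative, not TGI's):

```python
import json

def tgi_generate_payload(prompt, max_new_tokens=150, temperature=0.7,
                         top_p=0.95, stop=None):
    """Build a JSON body for TGI's native /generate and /generate_stream."""
    params = {
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    if stop:
        params["stop"] = stop  # list of stop sequences
    return json.dumps({"inputs": prompt, "parameters": params})
```

POST the result with Content-Type: application/json, exactly as the curl examples in the quick reference do.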

Notes (read before you deploy):

  • Plan VRAM for weights + KV cache: Longer contexts and larger batches inflate KV memory. Set sensible --max-input-length and --max-total-tokens.
  • Security: It’s a real HTTP server. Keep it on localhost or put it behind a reverse proxy + auth. Don’t expose it raw to the internet.
  • OpenAI compatibility: TGI’s native endpoints are /generate and /generate_stream; recent versions also expose an OpenAI-style Messages API at /v1/chat/completions. For broader OpenAI-style coverage, front it with a gateway (e.g., LiteLLM) or use a UI that supports TGI’s native API.
  • Containers recommended: The maintained Docker image bundles the right runtime, CUDA, and server binaries—use it unless you have a strong reason not to.

Common commands (quick reference):

#1) Pull the official container
docker pull ghcr.io/huggingface/text-generation-inference:latest

#2) Run a model on GPU with streaming, exposing port 8080
# Replace <MODEL_ID> with a Hugging Face repo id; add HF_TOKEN if the model is gated.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id <MODEL_ID> \
  --max-input-length 8192 \
  --max-total-tokens 8192

#3) Multi-GPU (tensor-parallel) across 2 GPUs
docker run --gpus '"device=0,1"' --shm-size 1g -p 8080:80 \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id <BIG_MODEL_ID> \
  --num-shard 2 \
  --max-input-length 8192 \
  --max-total-tokens 12288

#4) Non-streaming generation
curl -s http://127.0.0.1:8080/generate -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Explain SIMD in one paragraph.",
    "parameters": {"max_new_tokens": 150, "temperature": 0.7, "top_p": 0.95}
  }'

#5) Streaming generation (SSE)
curl -N http://127.0.0.1:8080/generate_stream -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Give me a haiku about caches.",
    "parameters": {"max_new_tokens": 64, "temperature": 0.8}
  }'

Throughput tips:

  • Batch shape matters: Favor many medium requests over bursts of tiny single-token chats; dynamic batching will reward steady flow.
  • Token budgets: Tighten max_new_tokens and context length for latency-sensitive endpoints; loosen for bulk generation.
  • Sharding vs. replicas: For very large models, use --num-shard. For smaller models at high QPS, run multiple TGI replicas and load-balance.
  • Quant wisely: 4-bit/8-bit often gives a bigger win from memory bandwidth relief than you lose in quality—test on your prompts.

Links:

GitHub: https://github.com/huggingface/text-generation-inference
Docs: https://huggingface.co/docs/text-generation-inference/



MLC-LLM / WebLLM (compiler-driven deployment: native & in-browser via WebGPU)

What it is: MLC-LLM is a compiler stack (built on TVM) that turns open-weight LLMs into portable, hardware-specific binaries you can run locally on Metal (Apple), CUDA (NVIDIA), ROCm/Vulkan (AMD/Intel), and even mobile. WebLLM is the browser runtime that uses WebGPU to run those same models fully client-side, no server required. The big idea: compile once, ship everywhere, and keep data on-device.

What it does: You start with a model checkpoint (e.g., Llama/Mistral/Mixtral/Qwen/Gemma), compile it with MLC-LLM into a compact, quantized artifact (e.g., q4f32_1) optimized for your target backend, then run it with the native runtimes (desktop/mobile) or with WebLLM in the browser. You get token streaming, prompt templates, chat history, and an OpenAI-style API surface in the web runtime, so you can wire it into existing front-ends with minimal fuss. For product teams, this means on-device assistants on laptops/phones and zero-backend demos that load models over HTTPS and execute purely on the user’s GPU.

Notes:

  • Privacy & offline: After the initial model download, inference is fully on-device—great for sensitive prompts or air-gapped demos.
  • WebGPU reality check: WebLLM requires WebGPU (Chromium-based browsers, newer Safari/Edge). Performance varies by device; expect better throughput on discrete GPUs and Apple Silicon.
  • Quantization vs. quality: Smaller quants (e.g., q4f32_1) fit in memory and load fast; larger quants give higher fidelity if you have VRAM/GRAM to spare.
  • Native or browser—same model family: You can compile once and target multiple runtimes (desktop, mobile, web). That’s the superpower here.
  • Asset delivery: For WebLLM, host model shards behind a CDN (range requests help) and consider prefetch/caching headers for smoother first-run UX.
  • Mobile support: iOS/Android builds are viable; expect reduced context/throughput compared to desktop GPUs but perfectly fine for many assistants.
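MLC's quantization codes pack three facts into one string: assuming the documented qNfM_V convention, q4f32_1 means 4-bit weight quantization, fp32 activations/compute, scheme variant 1. A tiny parser built on that assumption:

```python
import re

def parse_mlc_quant(code):
    """Split an MLC quant code like 'q4f32_1' into
    (weight_bits, compute_dtype, variant)."""
    m = re.fullmatch(r"q(\d+)f(\d+)_(\d+)", code)
    if not m:
        raise ValueError(f"unrecognized quant code: {code}")
    return int(m.group(1)), f"f{m.group(2)}", int(m.group(3))
```

Handy when picking between model artifacts: lower weight bits load faster and fit smaller devices; the compute dtype affects fidelity and backend support.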

Common “hello world” (quick reference):

A) Web (browser) with WebLLM

#1) Add the package
npm install @mlc-ai/web-llm

#2) Minimal usage (OpenAI-like)
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine(
  "Llama-3-8B-Instruct-q4f32_1-MLC",                   // a prebuilt model id (or one you've compiled)
  { initProgressCallback: (p) => console.log(p.text) } // optional download/progress hook
);

#3) Chat
const chunks = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are concise." },
    { role: "user", content: "Explain SIMD in one paragraph." }
  ],
  stream: true
});
for await (const chunk of chunks) {
  // append chunk.choices[0].delta.content to the UI
}

B) Native (desktop) with a prebuilt model

  • Download a precompiled model for your backend (Metal/CUDA/ROCm/Vulkan).
  • Run the sample app/CLI (often called “MLC Chat”) and select the model:
# exact CLI name/flags vary by build; recent releases expose `mlc_llm chat`:
mlc_llm chat Llama-3-8B-Instruct-q4f32_1-MLC
  • For custom targets, compile from source with MLC-LLM and pick your backend + quantization during build.

Links:

MLC-LLM (compiler + native runtimes): https://github.com/mlc-ai/mlc-llm
WebLLM (WebGPU runtime / NPM): https://github.com/mlc-ai/web-llm
Project site & guides: https://llm.mlc.ai/
NPM package: https://www.npmjs.com/package/@mlc-ai/web-llm
Model names & quantization guide: https://llm.mlc.ai/docs/ (see model zoo/quant docs)



FastChat (LMSYS chat server + tooling)

What it is: FastChat is the open-source chat stack from LMSYS—the folks behind Vicuna and the Chatbot Arena. It’s a modular server that lets you host chat models locally or on your own GPUs on Linux, macOS, and Windows, with both a web UI and OpenAI-compatible HTTP endpoints. Think controller + workers + API: you register one or more models and serve them behind a clean, familiar interface.

What it does: You launch a controller (the router/registry), attach one or more model workers (each loads a specific model/back-end), and bring up the OpenAI-style API server (and optionally the Gradio web UI). Clients hit /v1/chat/completions or /v1/completions with SSE streaming, and FastChat handles request routing, multi-model selection by name, conversation templates (Vicuna/ChatML/etc.), and per-request sampling params. You can run a worker with the native Transformers backend for simplicity, or point a worker at vLLM for high throughput on multi-GPU boxes. It’s equally handy for a single-card dev machine or a small on-prem service.

Notes:

  • Controller/worker mental model: Start the controller once; spin up as many workers as you have models/GPUs; attach the OpenAI API server and (optionally) the web UI.
  • OpenAI-compatible endpoints: Core chat/completions + SSE streaming. Easy drop-in for SDKs and UIs that expect “/v1/*”.
  • Multi-model routing: Register multiple models (e.g., vicuna-7b, llama3-8b, mixtral-8x7b) and select by model in the request.
  • Backends: Use the built-in model worker for simplicity; use the vLLM worker when you need continuous batching, tensor parallel, and higher tokens/sec.
  • GPU first: CPU is possible for tiny models, but FastChat shines on GPUs. Plan VRAM for weights + KV cache (context length grows the KV).
  • Security: It’s a real HTTP server—keep it on localhost or put it behind a reverse proxy and auth before exposing it.
  • Data & eval: Logs can be saved for later analysis/finetune datasets; this project underpins LMSYS’s evaluation workflows (Vicuna/Arena heritage).
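The controller/worker mental model is essentially a name-to-address registry with routing: workers register the models they serve, and each request is dispatched by its model field. A toy sketch of that idea (not FastChat's actual classes):

```python
class ToyController:
    """Minimal registry mapping model names to worker addresses."""
    def __init__(self):
        self.workers = {}  # model name -> list of worker addresses

    def register(self, model, address):
        self.workers.setdefault(model, []).append(address)

    def route(self, model):
        """Pick a worker for the requested model, rotating for crude balance."""
        addrs = self.workers.get(model)
        if not addrs:
            raise KeyError(f"no worker serves {model}")
        addrs.append(addrs.pop(0))  # rotate so requests spread across workers
        return addrs[-1]
```

In real FastChat the workers also heartbeat the controller and report queue length, but name-based dispatch is the core of multi-model routing.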

Common commands (quick reference):

#0) Install (Python env)
pip install fschat

#1) Start the controller (registry/router)
python -m fastchat.serve.controller

#2) Start a model worker (Transformers backend)
# Replace with your HF model id or local path
python -m fastchat.serve.model_worker \
  --model-path lmsys/vicuna-7b-v1.5 \
  --device cuda

#3) Start the OpenAI-compatible API server (localhost:8000)
python -m fastchat.serve.openai_api_server --host 127.0.0.1 --port 8000

#4) (Optional) Start the web UI
python -m fastchat.serve.gradio_web_server --host 127.0.0.1 --port 7860

#5) Call the Chat Completions endpoint (streaming)
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lmsys/vicuna-7b-v1.5",
    "messages": [{"role":"user","content":"Give me three bullet points on SIMD."}],
    "stream": true
  }'

#6) (Optional) Run a vLLM-backed worker for higher throughput
python -m fastchat.serve.vllm_worker \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2

Links:

README & examples: see repo root for controller/worker/API usage, vLLM worker, and web UI instructions
GitHub (source & issues): https://github.com/lm-sys/FastChat



OpenLLM (BentoML) (production-grade serving with OpenAI-compatible API)

What it is: OpenLLM is BentoML’s “run-any-open-weights as an OpenAI-style API” server. It ships a CLI, a built-in chat UI, and a batteries-included way to stand up local or cloud endpoints with sane defaults—then scale that same service with Docker/K8s or BentoCloud when you’re ready.

What it does: The stack works like this: you pip install, pick a model (Llama, Qwen, Phi, etc.), and openllm serve brings up an HTTP server on localhost:3000 exposing /v1/chat/completions, /v1/completions, and /v1/embeddings. You can browse a model repository, sync updates, run the /chat web UI, and—when needed—deploy the same service to BentoCloud with openllm deploy. Under the hood it integrates state-of-the-art inference backends (e.g., vLLM) while keeping the surface area OpenAI-compatible, so most SDKs and UIs work unchanged.

Notes:

  • Local first, cloud later: Start on a workstation; promote to Docker/K8s/BentoCloud without rewriting your app.
  • Model access: Gated Hub models require an HF_TOKEN. Plan VRAM for weights + KV cache based on context length.
  • Security posture: It’s a real API server—keep it on localhost or put it behind a reverse proxy + auth before exposing it.
  • Ecosystem fit: Plays well with OpenAI-client SDKs, LlamaIndex/LangChain, and BentoML services.
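Because the surface is OpenAI-compatible, any OpenAI-style client works by swapping the base URL. A stdlib-only sketch that builds such a request against OpenLLM's default port (sending it requires a running openllm serve; the model tag matches the example below):

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000/v1"  # OpenLLM's default local endpoint

def chat_request(model, user_msg, stream=False):
    """Build an OpenAI-style chat request; works for any server in this guide
    once BASE_URL points at it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": stream,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
```

Send it with urllib.request.urlopen(chat_request("llama3.2:1b", "Hello!")) and read choices[0].message.content from the JSON response; official OpenAI SDKs work the same way via their base_url setting.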

Common commands (quick reference):

#1) Install & say hello
pip install openllm
openllm hello

#2) Serve an LLM with OpenAI-compatible endpoints (localhost:3000)
export HF_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  # if the model is gated
openllm serve llama3.2:1b

#3) Use the built-in chat UI
open http://localhost:3000/chat

#4) List/sync available models from repositories
openllm model list
openllm repo update

#5) Deploy the same service to BentoCloud
openllm deploy llama3.2:1b --env HF_TOKEN

Links:

Announcement explainer: https://www.bentoml.com/blog/announcing-open-llm-an-open-source-platform-for-running-large-language-models-in-production
GitHub (source & README): https://github.com/bentoml/OpenLLM
BentoML docs (serving OpenAI-compatible APIs, vLLM backend): https://docs.bentoml.com/