If you want to run AI on your own hardware (quietly, quickly, and without paying the cloud tax), this post is your field guide. I pulled together the local apps and self-hosted chat UIs most used right now on Windows, macOS, and Linux: GPT4All, Ollama, Open WebUI, text-generation-webui, KoboldCpp, LibreChat, Chatbot-UI, Hugging Face Chat-UI, Jan, and LocalAI. Each entry is short and useful: what it is, what it does, notes you should read before you click anything, and links straight to the canonical GitHub/docs.
If you’re after “download, click run, start typing,” start with GPT4All, Jan, or Open WebUI on top of Ollama. If you want knobs and dials, text-generation-webui and KoboldCpp are where you live. Need an OpenAI-compatible API so your existing tools work without rewrites? Point them at Ollama or LocalAI on localhost and get on with your day. Pick the stack that fits your box and your tolerance for tinkering; everything here stays local unless you decide otherwise.
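Most of the stacks below expose an HTTP API on localhost, each on its own default port. As a rough orientation aid, here’s a stdlib-only Python sketch that probes those documented defaults to see what’s already running; the port numbers are each project’s defaults and may be remapped on your machine.

```python
import socket

# Documented default ports for the local servers covered below
# (verify against each project's docs; yours may be remapped).
DEFAULT_PORTS = {
    "Ollama": 11434,
    "KoboldCpp": 5001,
    "Jan (API server)": 1337,
    "text-generation-webui (--api)": 5000,
    "LocalAI": 8080,
}

def is_listening(host: str, port: int, timeout: float = 0.25) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

running = [name for name, port in DEFAULT_PORTS.items()
           if is_listening("127.0.0.1", port)]
print("Local LLM servers responding:", running or "none found")
```

A TCP connect only proves something is listening, not that it is the tool you expect; hit the actual API routes shown in each section to confirm.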
GPT4All (desktop app + SDKs)
What it is: GPT4All is a desktop app for chatting with local LLMs on Windows, macOS, and Linux. Think “download a model, click run, start typing.” It also ships SDKs so you can wire those same local models into your own scripts or apps without relying on any cloud service.
What it does: The app lists compatible local models, downloads them for you, and gives you a clean chat window with streaming replies, conversation history, and per-model settings (e.g., temperature/max tokens). You can keep multiple models installed and swap between them to compare behavior. If you’re a developer, the SDKs let you spin up the same models programmatically. It’s great for building a fully offline tool, a CLI helper, or a small internal service.
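For a taste of the SDK side, here’s a guarded Python sketch. The model filename is an illustrative pick from the catalog, and the demo is gated behind an environment variable because the first run downloads the model.

```python
import os

# Guarded sketch of the GPT4All Python SDK (pip install gpt4all).
# The model filename is an illustrative catalog entry, and the demo is
# gated behind GPT4ALL_DEMO=1 because the first run downloads the model.
def sdk_available() -> bool:
    """Return True if the gpt4all package is importable."""
    try:
        import gpt4all  # noqa: F401
        return True
    except ImportError:
        return False

if os.environ.get("GPT4ALL_DEMO") == "1" and sdk_available():
    from gpt4all import GPT4All
    model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
    with model.chat_session():
        print(model.generate("In one sentence, what is a GGUF file?",
                             max_tokens=120))
else:
    print("Set GPT4ALL_DEMO=1 (with gpt4all installed) to run the chat demo.")
```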
Notes:
- Privacy by default: Everything runs on your machine; no prompts or outputs need to leave your box.
- Pick the right model: Smaller models are snappy but less capable; larger ones need more RAM/VRAM and can feel slower on CPU-only setups.
- Not a power-tweaker’s playground: You get sensible controls, but if you want dozens of exotic samplers, routing, or complex plugin ecosystems, something like text-generation-webui or Open WebUI may suit you better.
- Good first stop: If you’re just getting into local LLMs, GPT4All is an easy, low-friction starting point, and the SDKs make it straightforward to graduate into code when you’re ready.
Links:
GitHub (source & issues): https://github.com/nomic-ai/gpt4all
Downloads & docs: https://www.nomic.ai/gpt4all
Model catalog (curated options): https://gpt4all.io/models
Developer SDKs & examples: https://docs.gpt4all.io (language bindings and usage)
Ollama (local model runner + API)
What it is: Ollama is a lightweight model manager and runner for local LLMs on Windows, macOS, and Linux. Think “pull a model, run it, and chat,” all on your own machine. It also exposes a simple HTTP API (OpenAI-style routes included) so other apps and UIs can talk to your local models without any cloud dependency.
What it does: The service keeps a local library of models, fetches them on demand, and runs them with sane defaults. You can swap models freely, set system prompts, and persist conversations. Because it speaks an OpenAI-compatible API, you can point popular front-ends (or your own scripts) at localhost and they’ll just work. It also supports Modelfiles (compose prompts/parameters, pick quantizations) so you can define repeatable “model recipes” for your workflow.
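Because the API speaks OpenAI-style routes, a plain stdlib script can talk to it. A minimal sketch, assuming Ollama’s default port 11434 and a model you’ve already pulled (swap “llama3.2” for whatever `ollama list` shows):

```python
import json
import urllib.request
import urllib.error

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("llama3.2", "Say hello in five words.")
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
except (urllib.error.URLError, OSError):
    print("Ollama is not running on localhost:11434; start it with `ollama serve`.")
```

The same request shape works against any of the OpenAI-compatible servers in this guide once you change the base URL.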
Notes:
- Privacy by default: Everything stays local; no prompts or outputs need to leave your box.
- Pairs well with UIs: Works seamlessly with Open WebUI, LibreChat, and other OpenAI-compatible front-ends.
- Model choice matters: Smaller quantized models run great on CPU; larger models shine with GPU VRAM. Plan RAM/VRAM accordingly.
- Mind the API exposure: Keep the Ollama port private (behind a reverse proxy/auth). Don’t put it on the public internet.
- Nice developer ergonomics: One-line pulls and runs; the API makes it easy to slot into existing tools and scripts.
Links:
GitHub (source & issues): https://github.com/ollama/ollama
Site & docs: https://ollama.com/ · https://ollama.com/docs
Model library: https://ollama.com/library
OpenAI-compat overview: https://ollama.com/blog/openai-compatibility
Open WebUI (self-hosted chat UI)
What it is: Open WebUI is a clean, extensible web interface for local LLMs on Windows, macOS, and Linux. It runs on your machine (or a home server) and talks to Ollama or any OpenAI-compatible backend, so you can keep everything private and offline while using a modern ChatGPT-style UI.
What it does: The app connects to one or more model backends (Ollama, vLLM, TGI, or any OpenAI-compatible server), lets you pick models per chat, and adds niceties like conversation history, prompt presets, file uploads, and built-in RAG (document search) so you can ask questions over your own PDFs, notes, and URLs. It also includes a “Model Builder” for packaging prompts/parameters into reusable personas/agents, and exposes API endpoints if you want other tools to call your local setup.
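A typical pairing is Open WebUI in front of Ollama via Docker Compose. The fragment below is a sketch, not an official file: the image names are the public ones, while the ports, volume names, and the OLLAMA_BASE_URL value are illustrative and worth checking against the docs.

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"        # UI reachable at http://localhost:3000
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama
volumes:
  ollama:
  open-webui:
```

Keeping both services on the same Compose network means the Ollama port never has to be published to the host at all.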
Notes:
- Privacy by default: Designed to operate entirely offline; point it at local backends and keep your data on-box.
- Plays well with others: Works seamlessly with Ollama and any OpenAI-compatible API. Mix local and remote if you choose.
- RAG built in: Drop in docs or crawl sources and ask questions with citations; great for “private ChatGPT” use.
- Tweakable, not fiddly: Rich features without the “everything and the kitchen sink” complexity; it makes a good daily driver. (If you want ultra-granular sampler controls and huge plugin ecosystems, text-generation-webui is the power-user alternative.)
- Mind exposure: If you host beyond localhost, put it behind auth/reverse proxy and keep ports private.
Links:
GitHub (source & issues): https://github.com/open-webui/open-webui
Docs: https://docs.openwebui.com/ (install, features, admin)
Quick start with OpenAI-compatible servers: https://docs.openwebui.com/getting-started/quick-start/starting-with-openai-compatible/
Features overview: https://docs.openwebui.com/features/
RAG feature: https://docs.openwebui.com/features/rag/
API endpoints: https://docs.openwebui.com/getting-started/api-endpoints/
Site: https://openwebui.com/
text-generation-webui (power-user web UI)
What it is: text-generation-webui is the “kitchen-sink” local LLM interface for Windows, macOS, and Linux. It’s a browser UI (Gradio-based) that loads a wide range of backends (Transformers, llama.cpp/GGUF, ExLlamaV2/V3, TensorRT-LLM), and it’s designed for folks who want all the toggles. You can run fully offline with portable builds or use the one-click installer for a full Python stack.
What it does: The app manages models, loads them with your chosen loader, and exposes a rich panel of generation settings (sampling, context window, quantization options, etc.). You can enable an API mode (--api) to let other tools call it, switch between loaders (e.g., llama.cpp for GGUF vs. ExLlama for GPTQ), and add extensions (TTS, voice input, RAG helpers, utilities) for extra capabilities. Installation paths include portable zips, a one-click installer, Conda, or Docker—pick your poison and launch at http://127.0.0.1:7860.
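Once launched with --api, the server answers OpenAI-style requests on a separate port (5000 by default, distinct from the :7860 UI). A minimal stdlib sketch, with the prompt and token budget as placeholder values:

```python
import json
import urllib.request

# Sketch against text-generation-webui started with --api; the API
# listens on port 5000 by default (separate from the :7860 UI).
def completion_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build an OpenAI-style text completion request for the local server."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        "http://127.0.0.1:5000/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

try:
    with urllib.request.urlopen(completion_request("The capital of France is"),
                                timeout=5) as resp:
        print(json.loads(resp.read())["choices"][0]["text"])
except (OSError, KeyError, ValueError):
    print("No server on 127.0.0.1:5000; launch with `python server.py --api`.")
```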
Notes:
- Power and flexibility: Multiple backends/loaders (Transformers, llama.cpp, ExLlamaV2/V3, TensorRT-LLM) with loader-specific flags for performance tuning. If you like fine-grained control, this is your playground.
- Install options: Go portable (zero-setup for GGUF), use the one-click installer, or spin it up with Conda/Docker—handy for different environments and GPUs.
- API & multi-user: Launch with --api for programmatic access; there’s also a --multi-user option in the CLI flags for shared setups. (Mind auth if you bind beyond localhost.)
- Extensions ecosystem: A robust extensions framework lets you bolt on features; check the wiki for how to write/enable them.
- Hardware reality check: GGUF/llama.cpp runs well on CPU; higher-end loaders benefit from GPU VRAM. Choose loader/quantization to match your box.
Links:
GitHub (source & issues): https://github.com/oobabooga/text-generation-webui
Releases (portable builds): https://github.com/oobabooga/text-generation-webui/releases
One-click installers: https://github.com/oobabooga/one-click-installers
Docker guide (official wiki): https://github.com/oobabooga/text-generation-webui/wiki/09-%E2%80%90-Docker
Extensions (how-to): https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions
Model tab (loader details): https://github.com/oobabooga/text-generation-webui/wiki/04-%E2%80%90-Model-Tab
KoboldCpp (single-file runner + Kobold-style UI/API)
What it is:
KoboldCpp is a “one-file, zero-install” local LLM runner built on top of llama.cpp. It targets GGUF/GGML models, ships a lightweight Kobold-style UI, and runs on Windows, macOS, and Linux with CPU or GPU acceleration. It also exposes multiple compatible APIs (including OpenAI-style) so other tools can talk to your local model.
What it does:
You launch the single binary, load a .gguf model, and it serves a local UI plus HTTP endpoints (defaulting to http://localhost:5001). You can switch “modes” (chat/adventure/instruct/storywriter), save persistent stories, and use author notes, world info, and character cards. Under the hood, you can offload layers to GPU, bump context size, and choose among sampling options. For integrations, it provides KoboldCpp/KoboldAI, OpenAI-compatible (/v1), and even Ollama-compatible routes, so front-ends like SillyTavern or custom scripts can connect without cloud dependencies.
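A quick way to script against it is the native generate route. The sketch below assumes the default port 5001 and a model already loaded; field names follow the KoboldAI-style API, so double-check them against the interactive API docs.

```python
import json
import urllib.request

# Sketch against KoboldCpp's native generate route on its default port 5001;
# the same server also exposes OpenAI-style /v1 routes.
def kobold_generate(prompt: str, max_length: int = 80) -> urllib.request.Request:
    """Build a request for KoboldCpp's native /api/v1/generate endpoint."""
    payload = {"prompt": prompt, "max_length": max_length}
    return urllib.request.Request(
        "http://localhost:5001/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

try:
    with urllib.request.urlopen(kobold_generate("Once upon a time"),
                                timeout=5) as resp:
        body = json.loads(resp.read())
        print(body["results"][0]["text"])
except (OSError, KeyError, ValueError):
    print("KoboldCpp not reachable on localhost:5001; launch the binary first.")
```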
Notes:
- Privacy by default: Everything stays local; nothing needs to leave your box.
- Binaries for your hardware: Grab CUDA builds for NVIDIA, nocuda builds (with Vulkan/OpenCL) when you don’t need CUDA, and oldpc builds for older CPUs/GPUs. macOS ARM64 binaries are provided, plus Docker images.
- Mind the port: If you bind beyond localhost or enable multi-user, protect it (reverse proxy/auth). It’s a real API server.
- Beyond text: Current builds can also wire in image generation (SD/SDXL/Flux), Whisper speech-to-text, and TTS—handy for creative stacks.
Links:
GitHub (source & issues): https://github.com/LostRuins/koboldcpp
Releases (pick CUDA / nocuda / oldpc / macOS ARM64): https://github.com/LostRuins/koboldcpp/releases
Wiki / Knowledgebase: https://github.com/LostRuins/koboldcpp/wiki
API docs (interactive): https://lite.koboldai.net/koboldcpp_api
Docker image: https://hub.docker.com/r/koboldai/koboldcpp
SillyTavern integration guide (example client): https://docs.sillytavern.app/usage/api-connections/koboldcpp/
LibreChat (self-hosted ChatGPT-style front-end)
What it is: LibreChat is a polished, self-hosted chat UI you run on Windows, macOS, or Linux. It’s designed to feel familiar (ChatGPT-like) while letting you switch among many AI providers—from OpenAI/Anthropic/Google to local backends like Ollama—through straightforward configuration.
What it does: You spin it up (Docker Compose is the recommended path), it brings along the pieces it needs (MongoDB, MeiliSearch, and built-in RAG services), and you connect one or more providers in librechat.yaml. From there you pick models per chat, search past conversations, import/export histories, and even go multimodal with image analysis and optional speech (STT/TTS). It’s a “one UI to rule them all” that can unify both cloud and local endpoints without changing your workflow.
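Connecting a local backend happens in librechat.yaml. The fragment below is a hypothetical sketch for an Ollama custom endpoint: the field names follow the custom-endpoints docs, but the version string, model names, and URL are example values to adapt.

```yaml
version: 1.0.5   # schema version; check the docs for the current one
endpoints:
  custom:
    - name: "Ollama"
      apiKey: "ollama"   # any non-empty string works; Ollama ignores it
      baseURL: "http://host.docker.internal:11434/v1"
      models:
        default: ["llama3.2"]
        fetch: true      # also list whatever `ollama list` reports
      titleConvo: true
      titleModel: "current_model"
```

The host.docker.internal address assumes LibreChat runs in Docker while Ollama runs on the host; use a service name or localhost if your layout differs.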
Notes:
- Privacy-friendly by design: Keep everything on your own box or home server; you choose which providers (and keys) to enable.
- Provider-agnostic: Pre-configured guides for OpenAI and others, plus custom endpoints for Ollama, Mistral, OpenRouter, Groq, Databricks, DeepSeek, etc., all via librechat.yaml.
- RAG & multimodal built in: Document Q&A (with its vectordb/rag_api) and image analysis with supported vision models (e.g., Claude 3, GPT-4, Gemini, LLaVA).
- Quality-of-life features: Conversation import/export, search, and optional speech (hands-free chat with STT/TTS).
- Ops tip: If you expose it beyond localhost, put it behind auth and a reverse proxy like Nginx/Traefik; stick with the Docker setup for easiest updates.
Links:
GitHub (source & issues): https://github.com/danny-avila/LibreChat
Docs (start here): https://www.librechat.ai/docs
Local install (Docker): https://www.librechat.ai/docs/local/docker
Providers & custom endpoints: https://www.librechat.ai/docs/quick_start/custom_endpoints
Pre-configured AI (OpenAI & more): https://www.librechat.ai/docs/configuration/pre_configured_ai
Features overview (multimodal, temp chat, etc.): https://www.librechat.ai/docs/features
Site: https://www.librechat.ai/
Chatbot-UI (minimal, extensible Next.js UI)
What it is: Chatbot-UI is a lightweight, open-source chat interface you can fork and run locally on Windows, macOS, and Linux. It’s built with Next.js/TypeScript and is meant to be a clean starting point for a ChatGPT-style front-end that you can point at the model/provider of your choice. The project tagline is “AI chat for any model,” and the codebase is under active revamp for simpler deployment and broader backend compatibility.
What it does: You deploy the Next.js project, configure your model endpoint (typically an OpenAI-compatible API), and you’re off to the races with a familiar multi-chat UI, streaming responses, and an easy canvas for customizing branding and prompts. Historically it targeted OpenAI models out of the box and mimicked ChatGPT’s interface; many forks wire it to local runners (e.g., Ollama) through OpenAI-compatible gateways.
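Pointing the legacy codebase at a local gateway is mostly environment configuration. A hypothetical .env sketch, using variable names from the pre-revamp era; verify them against the .env.example in the revision you actually deploy.

```
OPENAI_API_KEY=sk-local-anything        # local gateways accept any value
OPENAI_API_HOST=http://localhost:11434  # OpenAI-compatible gateway (e.g., Ollama)
DEFAULT_MODEL=gpt-3.5-turbo             # model id your gateway advertises
```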
Notes:
- Minimal & hackable: It’s a starter UI, not a kitchen-sink platform—great if you want full control over UX and code.
- Backend-agnostic via proxy: Point it at OpenAI, or route requests to local servers (Ollama/TGI/vLLM) through an OpenAI-compatible gateway. Forks like “chatbot-ollama” show the pattern.
- Mind the repo’s cadence: The maintainer notes a major refresh is in progress (simpler deploys, better backend compatibility). Expect changes.
- Alternatives/relatives: If you need an even slimmer base, Chatbot-UI Lite exists; if you want batteries-included admin/features, consider LibreChat or Open WebUI.
Links:
GitHub (source & issues): https://github.com/mckaywrigley/chatbot-ui
Chatbot-UI Lite (barebones variant): https://github.com/mckaywrigley/chatbot-ui-lite
Example fork for local models (Ollama): https://github.com/ivanfioravanti/chatbot-ollama
Next.js AI chatbot template (Vercel, related starter): https://github.com/vercel/ai-chatbot
Hugging Face Chat-UI (open, self-hosted chat interface)
What it is: Chat-UI is Hugging Face’s open-source, SvelteKit-based front end that originally powered HuggingChat. You can deploy it yourself and wire it to many providers or your own local backends, keeping everything on your box or home server. The project is still maintained even though the public HuggingChat service was closed in July 2025.
What it does: You stand up the SvelteKit app (MongoDB under the hood), point it at one or more providers, and get a modern multi-chat UI with tools/function calling, web search/RAG, and multimodal support (image uploads on supported providers). It’s built for mixing and matching endpoints (self-hosted inference, HF Inference endpoints, or OpenAI-compatible servers) without changing your day-to-day workflow.
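Configuration lives in a .env.local file. A hedged sketch wiring one OpenAI-compatible local backend: the variable names follow the Chat-UI docs (including the backtick-wrapped JSON for MODELS), while the model name and URL are placeholders.

```
MONGODB_URL=mongodb://localhost:27017
MODELS=`[
  {
    "name": "local-model",
    "displayName": "Local model",
    "endpoints": [
      { "type": "openai", "baseURL": "http://localhost:11434/v1" }
    ]
  }
]`
```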
Notes:
- Still maintained; hosted app paused: HuggingChat (the hosted service) closed July 1, 2025, but the Chat-UI codebase continues and can be one-click deployed to Spaces via a Docker template.
- Privacy by default: Self-host and keep prompts/outputs local; pick your own providers/backends. (Docs cover MongoDB + SvelteKit internals.)
- Built-in RAG & tools: Web search, scraping, embeddings, and function-calling style tools are first-class features.
- Deployment paths: “No-setup” Space template, or run it yourself (Docker/manual). Good for quick trials or a private home server.
- Ops tip: If you expose it beyond localhost, put it behind auth/reverse proxy and lock down ports—same advice as any local AI API/UI.
Links:
GitHub (source & issues): https://github.com/huggingface/chat-ui
Docs (overview, features, setup): https://huggingface.co/docs/chat-ui/en/index
Spaces Docker template (one-click deploy): https://huggingface.co/docs/hub/en/spaces-sdks-docker-chatui
Announcement on HuggingChat closure (context): https://huggingface.co/spaces/huggingchat/chat-ui/discussions/747
Jan (desktop app; offline by default)
What it is: Jan is an open-source, privacy-first desktop app for chatting with local LLMs on Windows, macOS, and Linux. It’s positioned as a “ChatGPT-alternative that runs 100% offline,” with optional connectors if you want to mix in cloud models later.
What it does: You install Jan, pick a model from its built-in hub (GGUF models from places like Hugging Face), and start chatting—no internet required once the model is downloaded. If you need more horsepower, you can also add API keys for providers like OpenAI, Anthropic, or Gemini and route specific chats to them. For developers, Jan can spin up a local OpenAI-compatible API server on localhost:1337, so scripts and tools that expect “/v1/chat/completions” can talk to your on-box model. Under the hood, Jan uses llama.cpp as its local inference engine, with hardware-optimized backends.
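With the API server enabled, anything that speaks OpenAI routes can query it. A stdlib sketch that lists available models, assuming the default localhost:1337 and a placeholder API key (Jan requires whatever key you set in its server settings):

```python
import json
import urllib.request

# Sketch of listing models from Jan's local OpenAI-compatible server on
# localhost:1337; the API key below is a placeholder for the one you
# configure in Jan's settings.
def models_request(api_key: str) -> urllib.request.Request:
    """Build a GET /v1/models request with a bearer token."""
    return urllib.request.Request(
        "http://localhost:1337/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )

try:
    with urllib.request.urlopen(models_request("your-jan-api-key"),
                                timeout=5) as resp:
        data = json.loads(resp.read())
        print([m["id"] for m in data.get("data", [])])
except (OSError, ValueError):
    print("Jan's API server is not running; enable it in Jan's settings first.")
```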
Notes:
- Privacy by default: Runs completely on your machine; work fully offline after the initial model download.
- Model hub & choice: One-click downloads; pick smaller models for speed/CPU or larger ones if you’ve got GPU VRAM.
- Optional cloud mix: You can connect cloud providers with your own keys when you want to, but it’s not required.
- Local API server: Built-in, OpenAI-compatible server with required API key and configurable host/port/CORS (keep it on localhost unless you know what you’re doing).
- MCP tool connectors: Supports the Model Context Protocol to interact with external tools (browser automation, search, notebooks, etc.).
- OS & hardware: Windows 10+, macOS 12+, Ubuntu 20.04+; works CPU-only but benefits from GPU backends (NVIDIA/CUDA, Vulkan for AMD/Intel, Apple Silicon).
Links:
GitHub (source & issues): https://github.com/menloresearch/jan
Site & downloads: https://jan.ai/
Documentation (overview/getting started): https://jan.ai/docs
Local API server (OpenAI-compatible): https://jan.ai/docs/api-server
llama.cpp engine settings/backends: https://jan.ai/docs/llama-cpp-server
MCP in Jan: https://jan.ai/docs/mcp
Offline setup primer: https://jan.ai/post/offline-chatgpt-alternative
LocalAI (drop-in OpenAI-compatible local API + simple WebUI)
What it is: LocalAI is a self-hosted, OpenAI-compatible REST API that runs LLMs and also handles image generation, speech-to-text, and text-to-speech—fully on your own machine. It aims to be a private, on-prem “/v1/*” replacement so apps expecting OpenAI can talk to your local box instead. GPU is optional; CPU-only works for many workloads.
What it does:
You run the API (Docker or binary), point it at models (from a gallery or direct URIs), and call it with the same endpoints your apps already use (chat/completions, embeddings, audio, images). There’s also a lightweight WebUI if you want a browser front-end. Via config YAMLs and “Modelfiles,” you can set defaults (samplers, templates) and swap backends per model family.
- Model install paths: use the in-app Model Gallery, local-ai run <gallery_name>, or reference models by URI (e.g., huggingface://…, ollama://…).
- Functions / tools & JSON mode: supports OpenAI-style function calling and JSON-first generation with llama.cpp-compatible models.
- Embeddings: OpenAI-compatible embeddings with llama.cpp, bert.cpp, and sentence-transformers; plugs into RAG stacks.
- Audio (STT/TTS): Whisper-based transcription and OpenAI/ElevenLabs-compatible TTS endpoints.
- Images: Stable Diffusion backends (CPU/GPU) for text-to-image via its image endpoints.
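As a concrete example of the drop-in idea, here’s a stdlib sketch against the embeddings route, assuming LocalAI’s default port 8080; the model name is a placeholder for whatever you’ve installed from the gallery.

```python
import json
import urllib.request

# Sketch of LocalAI's OpenAI-compatible embeddings route on its default
# port 8080; the model name is a placeholder for a gallery install.
def embeddings_request(text: str, model: str) -> urllib.request.Request:
    """Build an OpenAI-style /v1/embeddings request for LocalAI."""
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        "http://localhost:8080/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

try:
    req = embeddings_request("local-first AI", "text-embedding-ada-002")
    with urllib.request.urlopen(req, timeout=5) as resp:
        vec = json.loads(resp.read())["data"][0]["embedding"]
        print(f"embedding length: {len(vec)}")
except (OSError, KeyError, ValueError):
    print("LocalAI not reachable on localhost:8080; see the quickstart to run it.")
```

Because the route and payload match OpenAI’s, an existing RAG stack only needs its base URL repointed.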
Notes:
- Privacy by default: Keep everything on-box; just repoint clients to your LocalAI base URL.
- Backend flexibility: Supports multiple model families/backends; pick per model in YAML and consult the compatibility table.
- CPU works; GPU helps: You don’t need a GPU to start, but VRAM will accelerate bigger models and image jobs.
- Mind exposure: If you bind beyond localhost, protect it (reverse proxy/auth); it’s a real API server.
- Great with existing tools: Because it’s OpenAI-compatible, many SDKs and UIs (LangChain, Flowise, LibreChat, Open WebUI) can talk to it with minimal changes.
Links:
GitHub (source & issues): https://github.com/mudler/LocalAI
Docs (overview): https://localai.io/docs/overview/
Quickstart: https://localai.io/docs/getting-started/
Container images: https://localai.io/basics/container/
Model gallery: https://localai.io/models/
Model compatibility: https://localai.io/model-compatibility/
WebUI (frontend repo): https://github.com/go-skynet/LocalAI-frontend
OpenAI functions/tools: https://localai.io/features/openai-functions/
Embeddings: https://localai.io/features/embeddings/
Audio to text (Whisper): https://localai.io/features/audio-to-text/
Text to audio (TTS): https://localai.io/features/text-to-audio/
Image generation: https://localai.io/features/image-generation/

