RAG Platforms & “Private ChatGPT” Stacks: AI Resources 2025

This guide is a practical, self-hosted “private AI stack” you can run locally or on your own servers. It includes an OpenAI-compatible proxy, a visual builder for agent and RAG flows, a data-native Text-to-SQL framework, and several runtime options with quantization. Each section explains what the tool is, when to use it, its caveats, and direct links to official docs and repos.


LiteLLM local proxy (OpenAI compatible)

What it is: a lightweight gateway that exposes an OpenAI-style API while routing to many different backends you configure. Point any OpenAI SDK or CLI at the proxy and switch models in one place.

Use it when: you want a single endpoint for local and cloud models, consistent request/response formats, streaming, logging, and guardrails.

Strengths: single API over 100+ providers, per-project keys and budgets, retries and fallbacks, OpenAI-compatible errors.

Limitations: you still manage the underlying providers and credentials. Features vary per backend so pin configs per model.

1) Install

pip install "litellm[proxy]"

2) Create config.yaml

# Two models: cloud + local
model_list:
  - model_name: gpt4o
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"
  - model_name: llama3-local
    litellm_params:
      model: ollama/llama3
      api_base: "http://localhost:11434"

3) Run the gateway

litellm --config config.yaml

4) Call it like OpenAI

curl -sN http://127.0.0.1:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-local-demo" \
  -d '{
    "model": "llama3-local",
    "messages": [{"role":"user","content":"Explain SIMD in one paragraph."}],
    "stream": true
  }'
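
With "stream": true, the proxy returns Server-Sent Events in the OpenAI streaming format: each line starts with "data: " and carries a JSON chunk whose delta holds the next piece of text. A minimal parser for one such line might look like this (a sketch of the wire format above, not a LiteLLM API):

```python
import json

def parse_sse_line(line: str):
    """Parse one Server-Sent Events line from a streamed chat completion.

    Returns the delta text, None for non-data lines and the [DONE]
    terminator, or "" when the chunk carries no content (role-only chunks).
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":  # stream terminator sent after the last chunk
        return None
    chunk = json.loads(payload)
    delta = chunk["choices"][0].get("delta", {})
    return delta.get("content", "")

# Example chunk in the OpenAI streaming format
sample = 'data: {"choices":[{"delta":{"content":"Hello"}}]}'
print(parse_sse_line(sample))  # Hello
```

In practice the official OpenAI SDKs handle this parsing for you; point them at the proxy's base URL and they consume the same stream.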

Resources:
Docs home
Proxy quick start
Streaming
Usage & token accounting
GitHub


Flowise (no-code canvas for LLM flows, tools, RAG)

What it is: an open source visual builder for LLM apps. Drag nodes for models, retrievers, file/URL loaders, tools, and evaluators. Expose flows through an HTTP API, the TypeScript SDK, or an embeddable chat widget.

Use it when: you want to prototype or ship agent and RAG flows fast, collaborate with non-devs, or standardize ingestion and retrieval across projects.

Strengths: visual graphs, workspaces, tracing and eval helpers, streaming, first-class RAG nodes with chunking and citations.

Limitations: very large graphs can sprawl. Keep flow versions pinned and reviewed like code.

Quick start

  1. Clone the repo, open docker/, copy .env.example to .env.
  2. docker compose up -d then open http://localhost:3000.
  3. Build a chatflow then call it via the Prediction API or the SDK. Turn on streaming for chat UIs.
  • Supports local and cloud LLMs. Flows can call other flows and re-ingest living sources on a schedule.
  • Tune chunk size, retriever depth, and citations per flow. Persist embeddings to disk or your vector DB.
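
Step 3 calls the Prediction API, which lives at /api/v1/prediction/{chatflowId} and takes a JSON body with a "question" field. A minimal stdlib sketch of building that call (the chatflow id is a placeholder; substitute the id from your Flowise UI):

```python
import json
from urllib import request

def prediction_request(base_url: str, chatflow_id: str, question: str) -> request.Request:
    """Build a POST request for Flowise's Prediction API."""
    url = f"{base_url.rstrip('/')}/api/v1/prediction/{chatflow_id}"
    body = json.dumps({"question": question}).encode()
    return request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

req = prediction_request("http://localhost:3000", "your-chatflow-id", "What is RAG?")
print(req.full_url)
# with request.urlopen(req) as resp:          # uncomment when Flowise is running
#     print(json.load(resp)["text"])
```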

Resources:
Getting started
GitHub
TypeScript SDK
GitHub data loader
Project site


DB-GPT (agentic, data-native: Text-to-SQL, plugins, UI)

What it is: a self-hosted framework for database-aware assistants. Connect databases and optional document sources, then chat. The agent plans steps, drafts Text-to-SQL, shows the SQL, and on approval executes it. Results return as tables or charts.

Use it when: you want a “private Copilot” for live databases with reviewable SQL and pluggable tools for files, web, viz, and notifications.

Strengths: tools aimed at data work, workspaces to segment credentials and prompts, plugin ecosystem, document retrieval plus DB queries in the same conversation.

Safety: start with read-only roles, turn on query logs, and consider masking views for sensitive fields. Put approval gates on any write path.
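
One way to picture the approval gate: a coarse application-level check that refuses anything that is not a read before it ever reaches the database. The sketch below uses sqlite3 as a stand-in; it complements, and never replaces, database-level read-only roles:

```python
import sqlite3

READ_ONLY_PREFIXES = ("select", "with", "explain")

def execute_if_read_only(conn, sql: str):
    """Run a statement only if it looks like a read.

    A coarse gate: checks the leading keyword and rejects piggybacked
    statements. Real enforcement belongs in a read-only database role.
    """
    head = sql.lstrip().split(None, 1)[0].lower()
    if head not in READ_ONLY_PREFIXES or ";" in sql.rstrip().rstrip(";"):
        raise PermissionError(f"blocked non-read statement: {head!r}")
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")  # setup, outside the gate
conn.execute("INSERT INTO users VALUES (1, 'ada')")
print(execute_if_read_only(conn, "SELECT name FROM users"))  # [('ada',)]
```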

Quick start

  1. Clone the repo. Copy the sample .env. Set DB connection strings and your model backend (local or cloud).
  2. docker compose up -d and open the UI.
  3. Create a workspace, connect databases, choose an LLM and embeddings, and start chatting. Approve SQL before execution.

Resources:
Docs overview
Main repo
DB-GPT Hub
Plugins
VLDB demo paper


Model runtimes and quantization

Why quantize: to fit larger models on limited VRAM and speed up inference. Pick the path that matches your hardware and library comfort level.
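
The VRAM arithmetic is simple: weight memory is roughly parameters times bits per weight divided by 8, plus a small overhead for scales and zero-points in quantized formats (and this ignores KV cache and activations, which add more). A quick calculator:

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float, overhead: float = 0.0) -> float:
    """Approximate weight memory in GB (decimal): params * bits / 8,
    scaled by a fractional overhead for quantization metadata."""
    return n_params * bits_per_weight / 8 / 1e9 * (1 + overhead)

# An 8B-parameter model:
print(weight_footprint_gb(8e9, 16))  # 16.0  (FP16)
print(weight_footprint_gb(8e9, 4))   # 4.0   (plain 4-bit, before overhead)
```

So an 8B model that needs a 24 GB card at FP16 fits comfortably on an 8 GB card at 4-bit, leaving room for the KV cache.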

AutoAWQ: quantize and save

When to use: fast 4-bit inference on consumer GPUs with good quality when you provide a small calibration set that resembles your real prompts. Note that the upstream AutoAWQ repo is no longer actively maintained, so pin your version.

# Quantize with AWQ and save
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_out = "./llama3-8b-awq-4bit"

model = AutoAWQForCausalLM.from_pretrained(base_id, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(base_id, use_fast=True, trust_remote_code=True)

quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
calib_texts = ["Your representative sample text here.", "Add several lines that mirror your prompts."]

# Argument names (e.g. max_seq_len) vary across AutoAWQ releases; check your installed version
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts, max_seq_len=512)
model.save_quantized(quant_out, use_safetensors=True)
tokenizer.save_pretrained(quant_out)

AutoAWQ: load quantized and generate

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_dir = "./llama3-8b-awq-4bit"
model = AutoAWQForCausalLM.from_quantized(quant_dir, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True, trust_remote_code=True)

prompt = "Explain SIMD in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=180, temperature=0.7, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Resources:
AutoAWQ GitHub
Examples

AutoGPTQ: quantize and save

When to use: GPTQ has broad runtime support. The legacy AutoGPTQ project is archived; its successor, GPTQModel, is the commonly recommended loader in newer stacks. Pin versions and read release notes.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_config = BaseQuantizeConfig(bits=4, group_size=128, damp_percent=0.01, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config=quant_config)

calib_texts = ["Your representative sample text goes here.", "Add lines that resemble your real prompts."]
# AutoGPTQ expects pre-tokenized examples (dicts with input_ids / attention_mask)
examples = [tokenizer(text) for text in calib_texts]
model.quantize(examples)

save_dir = "./llama3-8b-gptq-4bit"
model.save_quantized(save_dir, use_safetensors=True)
tokenizer.save_pretrained(save_dir)

Transformers + GPTQ: load and generate

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

save_dir = "./llama3-8b-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(save_dir, use_fast=True)

# Quantization settings are read from the saved config; pass a GPTQConfig
# at load time only if you need to override them
model = AutoModelForCausalLM.from_pretrained(
    save_dir, device_map="auto", torch_dtype=torch.float16
)

prompt = "Explain SIMD in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, top_p=0.95)
print(tokenizer.decode(outs[0], skip_special_tokens=True))

Resources:
AutoGPTQ GitHub (archived)
Releases

Transformers + bitsandbytes 4-bit

When to use: quick 4-bit load through Transformers without pre-quantizing. Requires bitsandbytes and a compatible CUDA stack. If BF16 is not supported, use FP16.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

prompt = "Explain SIMD in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, temperature=0.7, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))
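
On pre-Ampere GPUs the BF16 compute dtype above is unavailable; the runtime check in PyTorch is torch.cuda.is_bf16_supported(). A tiny sketch of the fallback, with the selection logic factored out so it works without a GPU:

```python
def pick_compute_dtype(bf16_supported: bool) -> str:
    """BF16 when the GPU supports it (Ampere and newer), else FP16."""
    return "bfloat16" if bf16_supported else "float16"

# In a real setup, query support via torch and pass the result into
# BitsAndBytesConfig(bnb_4bit_compute_dtype=...):
# import torch
# dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(pick_compute_dtype(True))   # bfloat16
print(pick_compute_dtype(False))  # float16
```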

exllamav2 (EXL2 and GPTQ runtime)

When to use: a fast inference engine for EXL2 and GPTQ weights with streaming token output. Recommended server is TabbyAPI for an OpenAI-compatible HTTP surface.

# Minimal sketch using the dynamic generator API; class names and args vary by version
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

cfg = ExLlamaV2Config("/path/to/your/exl2-or-gptq-model")
model = ExLlamaV2(cfg)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(cfg)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Explain SIMD in one paragraph.", max_new_tokens=200))
# For token-by-token streaming, see ExLlamaV2DynamicJob in the repo examples

Resources:
exllamav2 GitHub
TabbyAPI GitHub
TabbyAPI getting started


Whisper and WhisperX (speech to text)

What it is: local speech-to-text. Whisper handles transcription. WhisperX adds forced alignment for word-level timestamps and optional diarization. Useful for subtitles, editing, and searchable archives without sending audio to third parties.

whisper.cpp build and quick run

# Build
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
make    # or: cmake -B build && cmake --build build -j

# Fetch a model
bash ./models/download-ggml-model.sh base.en

# Start low-latency mic transcription (newer builds: ./build/bin/whisper-stream)
./stream -m ./models/ggml-base.en.bin -t 8

# Transcribe to console and save .txt and .srt (newer builds: ./build/bin/whisper-cli)
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -otxt -osrt -t 8

Translate French to English

./main -m ./models/ggml-small.bin -f meeting_fr.wav -tr -otxt
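
To transcribe a whole folder, you can wrap the CLI in a small script. The directory name and binary path below are placeholders; adjust them to your layout (newer builds put the binary at ./build/bin/whisper-cli):

```python
import subprocess          # needed when you uncomment the run call below
from pathlib import Path

def transcribe_cmd(model: str, wav: str, threads: int = 8, binary: str = "./main"):
    """Build a whisper.cpp invocation that saves .txt and .srt next to the input."""
    return [binary, "-m", model, "-f", wav, "-otxt", "-osrt", "-t", str(threads)]

# "recordings" is a hypothetical input folder
for wav in sorted(Path("recordings").glob("*.wav")):
    cmd = transcribe_cmd("./models/ggml-base.en.bin", str(wav))
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)   # uncomment to actually transcribe
```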

WhisperX pipeline

Notes: install PyTorch that matches your CUDA first. Alignment uses an extra model. Diarization is optional.

# Install first: pip install -U whisperx
import torch, whisperx

device = "cuda" if torch.cuda.is_available() else "cpu"
audio_file = "meeting.wav"

# Load audio
audio = whisperx.load_audio(audio_file)

# Transcribe with a Whisper model
asr_model = whisperx.load_model("large-v3", device)  # choose your size
asr_result = asr_model.transcribe(audio)             # segment timestamps

# Forced alignment for word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=asr_result["language"], device=device)
aligned = whisperx.align(asr_result["segments"], align_model, metadata, audio, device)

# Optional diarization
# diarize = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
# spk_segments = diarize(audio_file)
# aligned = whisperx.assign_word_speakers(spk_segments, aligned)

# Save outputs. Writer helpers move around between WhisperX versions, so
# writing the aligned segments directly is the most portable option.
from pathlib import Path

out = Path("out"); out.mkdir(exist_ok=True)
with open(out / "transcript.txt", "w", encoding="utf-8") as f:
    for seg in aligned["segments"]:
        f.write(seg["text"].strip() + "\n")
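
The aligned segments carry start and end times in seconds, so if your WhisperX version lacks a subtitle writer you can emit SRT yourself. The only fiddly part is the timestamp format (HH:MM:SS,mmm), sketched here:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(3661.5))  # 01:01:01,500
```

Each SRT cue is then an index, a "start --> end" line with these timestamps, the text, and a blank line.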

Resources:
whisper.cpp GitHub
WhisperX GitHub