Running AI Locally
Running large language models and AI on your own hardware instead of cloud services. Performance depends on VRAM, RAM bandwidth, and quantization — not CPU speed.
What is local AI?
Local AI means running large language models (LLMs) and other AI models directly on your own computer instead of sending requests to cloud services like ChatGPT or Claude. Your prompts never leave your machine, your data stays private, there are no API costs, no rate limits, and it works offline.
In 2026, local AI has become genuinely practical. Models like Qwen 3, Llama 3, and DeepSeek run on consumer hardware with quality approaching cloud models. The software ecosystem — led by Ollama, LM Studio, and llama.cpp — has matured to the point where getting started takes a single terminal command.
The core trade-off is simple: cloud AI gives you the biggest, most capable models with zero hardware investment. Local AI gives you privacy, zero ongoing cost, unlimited usage, and offline access — but you're limited by your hardware. Understanding that hardware is what this guide is about.
How LLMs actually run on your hardware
Understanding the basics of LLM inference helps you make better hardware decisions. Here's what happens when you type a prompt:
Phase 1 — Prefill (prompt processing): Your entire input is processed in parallel. This phase is compute-bound — GPU FLOPS matter. The model reads all input tokens at once and builds an internal state called the KV cache (Key-Value cache). On an RTX 5090, prefill runs at ~7,000 tokens/second. This is why long prompts feel nearly instant.
Phase 2 — Decode (token generation): Tokens are generated one at a time. Each new token requires reading the entire model's weights from memory. This phase is memory-bandwidth-bound — it doesn't matter how fast your GPU computes if it can't read weights fast enough. This is why an RTX 4090 (1,008 GB/s bandwidth) generates tokens roughly 10× faster than DDR5 RAM (96 GB/s).
The KV cache: During generation, the model stores Key and Value vectors for every previous token so it doesn't have to recompute them. This cache grows with context length. For a 7B model: ~2 GB at 4K context, ~4.5 GB at 32K context. This is extra memory on top of the model weights — and it's why long conversations consume more VRAM than short ones.
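These figures follow from a simple formula. The sketch below uses illustrative layer and head counts typical of 7B-class architectures (assumptions, not official specs for any particular model): a classic 7B with full multi-head attention reproduces the ~2 GB at 4K figure, while a Grouped-Query Attention (GQA) variant with fewer KV heads shows why newer models need far less cache at long context.

```python
# Rough KV-cache size estimate. Layer/head counts below are illustrative
# assumptions for typical 7B-class models, not specs for any specific one.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Per-token cache = 2 (K and V) x layers x KV heads x head dim x precision."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len

GIB = 1024 ** 3

# Classic 7B with full multi-head attention (32 KV heads):
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096)
print(f"MHA 7B @ 4K context:  {mha / GIB:.1f} GiB")   # ~2.0 GiB

# Same shape with Grouped-Query Attention (8 KV heads):
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)
print(f"GQA 7B @ 32K context: {gqa / GIB:.1f} GiB")   # ~4.0 GiB
```

Note how the GQA variant fits 8× the context in roughly twice the memory — this is the mechanism behind the long-context advice later in this guide.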
Why this matters: Memory bandwidth — not compute power, not CPU speed — is the single most important spec for local AI. A used RTX 3090 with 936 GB/s bandwidth will run circles around the newest CPU with 96 GB/s DDR5. The GPU is almost always the bottleneck, and the CPU is almost irrelevant. An Intel i5 and a Ryzen 9 produce nearly identical inference speeds.
Quantization: fitting big models in small memory
Quantization is the technique that makes local AI practical. It reduces the precision of model weights from 16-bit floating point (2 bytes per parameter) to lower bit-widths, dramatically shrinking memory requirements with minimal quality loss.
A 7B parameter model at full FP16 precision needs 14 GB of memory. At Q4_K_M (4-bit), it needs only ~4 GB — a 3.5× reduction while retaining 97-99% of the original quality.
Quantization levels compared (7B model):
| Format | Size | Quality | Recommendation |
|---|---|---|---|
| FP16 (16-bit) | ~14 GB | 100% baseline | Research only — needs huge VRAM |
| Q8_0 (8-bit) | ~7 GB | 99%+ retained | Best quality if you have the VRAM |
| Q5_K_M (5-bit) | ~5 GB | ~98% | Great balance for 16GB GPUs |
| Q4_K_M (4-bit) | ~4 GB | 97-99% | The sweet spot — recommended default |
| Q3_K_S (3-bit) | ~3 GB | 85-92% | Noticeable quality loss — last resort |
| Q2_K (2-bit) | ~2.5 GB | 70-85% | Not recommended — severe degradation |
Q4_K_M uses a mixed-precision approach: attention layers and important weight matrices keep higher precision, while less critical feed-forward layers are quantized more aggressively. This is why Q4_K_M retains quality so well despite being 4-bit on average.
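The sizes in the table can be approximated from bits-per-weight alone. A sketch — the bits-per-weight averages here are approximations (K-quants mix precisions across layers, so real GGUF files vary by a few percent):

```python
# Back-of-envelope model size from parameter count and average bits per
# weight. The averages are approximate, so treat outputs as estimates.

BITS_PER_WEIGHT = {
    "FP16":   16.0,
    "Q8_0":    8.5,   # 8-bit values plus per-block scale factors
    "Q5_K_M":  5.7,
    "Q4_K_M":  4.85,
    "Q3_K_S":  3.5,
}

def model_gb(params_billions, fmt):
    """Approximate weight-file size in GB for a given quantization."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"7B at {fmt:6}: {model_gb(7, fmt):5.1f} GB")
```

Running it reproduces the table: 14.0 GB at FP16, ~4.2 GB at Q4_K_M, ~3 GB at Q3_K_S.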
All modern local AI tools use the GGUF format (developed by the llama.cpp project), which stores quantized weights efficiently and is supported by Ollama, LM Studio, and every major inference engine.
CPU vs GPU for local AI
The GPU is the most critical component for local AI — and the CPU is surprisingly unimportant. Here's the practical breakdown:
When you must use a GPU:
- You want interactive speed (20+ tokens/sec) from models 7B and larger
- You're running image generation (Stable Diffusion, Flux)
- You need to serve multiple users or run concurrent requests
When CPU-only actually works:
- Small models (1-4B parameters) for lightweight tasks
- MoE (Mixture of Experts) models like Qwen 3 30B-A3B that only activate ~3B parameters per token — achieving 22 tokens/sec on CPU with DDR5-6000
- Batch processing where speed isn't critical
- You have fast DDR5 RAM in a multi-channel configuration
The bandwidth gap explains everything:
| Memory type | Bandwidth | ~Tokens/sec (8B Q4) |
|---|---|---|
| DDR4-3600 dual-channel | 58 GB/s | ~5 t/s |
| DDR5-6000 dual-channel | 96 GB/s | ~15-22 t/s |
| GDDR6X (RTX 4090) | 1,008 GB/s | 128-150 t/s |
| GDDR7 (RTX 5090) | 1,790 GB/s | 145+ t/s |
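The table's numbers follow from a back-of-envelope bound: decode is memory-bound, so tokens per second can never exceed bandwidth divided by model size. Measured throughput in the table lands at roughly 40-75% of this ceiling (kernel overhead, KV-cache reads). A sketch using the ~5 GB Qwen 3 8B Q4_K_M figure:

```python
# Decode speed ceiling: each generated token must stream the entire
# weight file through memory, so t/s <= bandwidth / model size.
# Real throughput lands below the ceiling.

def decode_ceiling(bandwidth_gbs, model_size_gb):
    """Theoretical upper bound on tokens per second."""
    return bandwidth_gbs / model_size_gb

MODEL_GB = 5.0  # Qwen 3 8B at Q4_K_M
for name, bw in [("DDR4-3600 dual-channel", 58),
                 ("DDR5-6000 dual-channel", 96),
                 ("RTX 3090 (GDDR6X)",      936),
                 ("RTX 4090 (GDDR6X)",      1008)]:
    print(f"{name:23} ceiling ~{decode_ceiling(bw, MODEL_GB):6.1f} t/s")
```

The DDR5 ceiling of ~19 t/s explains why CPU-only measurements cluster in the 15-22 t/s range, and why no amount of CPU upgrading can push past it.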
Layer offloading — the middle ground: When a model is too large for your VRAM, you can split it: some transformer layers stay on the GPU (fast), others spill to system RAM (slow). The GPU processes its layers at full speed, then waits while the CPU-side layers crawl across the PCIe bus. Example: Qwen 3 8B with all 36 layers on GPU runs at 40.6 t/s. With only 25 layers on GPU, it drops to 8.6 t/s. The lesson: partial offload is usable but painful. Always try to fit the whole model in VRAM.
RAM requirements by model size
The amount of memory you need depends on the model's parameter count, quantization level, and context length. Here's a practical reference:
VRAM/RAM needed at Q4_K_M quantization (8K context):
| Model | Parameters | Memory needed | Fits on |
|---|---|---|---|
| Qwen 3 0.6B | 0.6B | ~0.5 GB | Anything — even phones |
| Llama 3.2 3B | 3B | ~3.6 GB | Any GPU with 4GB+ |
| Qwen 3 8B | 8B | ~5.0 GB | RTX 4060 (8GB), any 8GB+ GPU |
| Qwen 3 14B | 14B | ~10.7 GB | RTX 4070 Super (12GB), RTX 3060 (12GB) |
| Qwen 3 32B | 32B | ~19.8 GB | RTX 3090/4090 (24GB) |
| Qwen 3 30B-A3B (MoE) | 30B (3B active) | ~18.6 GB | RTX 3090/4090 (24GB), or CPU with 32GB RAM |
| Llama 3.3 70B | 70B | ~45.6 GB | RTX 5090 (32GB) with offload, or Mac with 64GB+ |
| Qwen 2.5 72B | 72B | ~50.5 GB | Mac with 64GB+ unified memory |
| Qwen 3 235B-A22B (MoE) | 235B (22B active) | ~143 GB (Q4) | Mac Studio with 192GB, or multi-GPU rigs |
Context length adds up fast: Longer conversations consume more memory for the KV cache. A 4B model needs ~0.2 GB of KV cache at 2K context, but ~3 GB at 32K context. If you're running a 14B model with 32K context on a 12GB GPU, you might overflow. The fix: reduce context length or switch to a model with Grouped-Query Attention (GQA), which uses less KV cache per token.
Running multiple model instances: Each model instance needs its own memory allocation. Running two 7B models at Q4_K_M simultaneously requires ~12 GB minimum. You can load multiple models in RAM and swap between them (Ollama does this automatically), but only active models consume VRAM. Rule of thumb: keep system RAM at 2× your VRAM for comfortable multi-model operation.
Memory bandwidth matters as much as capacity: Having 64 GB of RAM means nothing if it's slow. DDR5-6000 dual-channel delivers 96 GB/s; DDR4-3600 delivers only 58 GB/s. That's a ~65% bandwidth difference that directly translates to ~65% more tokens per second for models running on CPU/RAM. Always run dual-channel (two sticks) — a single stick halves your bandwidth.
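The bandwidth figures quoted above come straight from the memory spec: peak DRAM bandwidth is channels × transfer rate (MT/s) × 8 bytes per 64-bit transfer. A quick check:

```python
# Peak DRAM bandwidth = channels x transfer rate (MT/s) x 8 bytes
# per transfer (64-bit channel width).

def dram_bandwidth_gbs(channels, mts):
    """Theoretical peak bandwidth in GB/s."""
    return channels * mts * 8 / 1000

print(dram_bandwidth_gbs(2, 3600))  # DDR4-3600 dual-channel -> 57.6
print(dram_bandwidth_gbs(2, 6000))  # DDR5-6000 dual-channel -> 96.0
print(dram_bandwidth_gbs(1, 6000))  # single DDR5-6000 stick -> 48.0
```

The last line is the single-stick trap in numbers: one DIMM literally halves your token generation ceiling.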
Mac vs PC for local AI
Mac and PC take fundamentally different approaches to local AI, and each wins in different scenarios.
Apple Silicon's killer advantage — unified memory:
Apple's M-series chips use unified memory architecture: CPU, GPU, and Neural Engine share the same physical RAM pool. A Mac with 128 GB can allocate ~96 GB to GPU inference — far more than any consumer NVIDIA GPU (RTX 5090 maxes out at 32 GB VRAM). This means a Mac can run 70B+ parameter models on a single device, which is impossible on any single consumer PC GPU.
PC's killer advantage — raw speed when models fit in VRAM:
When the model fits entirely in GPU VRAM, NVIDIA is roughly 3× faster at token generation than Apple Silicon. An RTX 4090 generates 128-150 tokens/sec on an 8B model; an M3 Max generates 50 t/s on the same model. For models under 24-32GB, a discrete NVIDIA GPU simply wins on throughput.
Apple Silicon benchmarks (llama.cpp, Q4_K_M):
| Chip | Max RAM | Bandwidth | 8B model | 70B model |
|---|---|---|---|---|
| M1 (base) | 16 GB | 68 GB/s | 9.7 t/s | N/A |
| M2 Max | 96 GB | 400 GB/s | ~45 t/s | 7 t/s |
| M3 Max | 128 GB | 400 GB/s | 50.7 t/s | 7.5 t/s |
| M4 Max | 128 GB | 546 GB/s | ~80 t/s | 9 t/s |
| M2 Ultra | 192 GB | 800 GB/s | 76.3 t/s | 12.1 t/s |
NVIDIA GPU benchmarks (Q4_K_M):
| GPU | VRAM | Bandwidth | 8B model | 14B model |
|---|---|---|---|---|
| RTX 3090 | 24 GB | 936 GB/s | 87 t/s | 52 t/s |
| RTX 4090 | 24 GB | 1,008 GB/s | 128-150 t/s | ~80 t/s |
| RTX 5060 Ti | 16 GB | 448 GB/s | 51 t/s | 33 t/s |
| RTX 5090 | 32 GB | 1,790 GB/s | 145 t/s | 103 t/s |
Cost comparison:
- Mac Mini M4 Pro 64 GB: ~$1,800 — runs 32B models at 11-12 t/s. Silent. Best value Mac for AI.
- PC with RTX 4090: ~$2,500 total build — runs 32B models at 30+ t/s, but can't hold a 70B model in VRAM (partial offload works, slowly).
- MacBook Pro M4 Max 128 GB: ~$4,000 — runs 72B models at 9 t/s. Portable 70B inference.
- Mac Studio M3 Ultra 192 GB: ~$5,000+ — runs the largest open models (235B quantized).
- PC with RTX 5090: ~$3,500 total build — fastest consumer option for models under 32GB.
The bottom line: If you want to run models under 24-32 GB, a PC with a discrete NVIDIA GPU gives you the fastest generation at the lowest price. If you want to run 70B+ models on a single machine, Apple Silicon with high unified memory is your only practical consumer option. Apple also wins on power efficiency (40-80W vs 300-450W under load) and noise.
MLX advantage: Apple's MLX framework (built specifically for Apple Silicon) achieves 20-30% better performance than llama.cpp on Mac. Qwen 3 30B-A3B MoE exceeds 100 t/s on M4 Max with MLX. If you're on Mac, use MLX-based tools for best results.
Qwen models: the local AI all-rounder
The Qwen (pronounced "chwen") model family from Alibaba has become one of the most popular choices for local AI in 2026. Here's why, and how to run them.
Qwen 2.5: Available in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes. Supports 128K token context and 29+ languages. The 72B-Instruct version is competitive with Llama 3.1 405B on multiple benchmarks — a model over 5× larger. Specialized variants include Qwen2.5-Coder (trained on 5.5 trillion tokens of code) and Qwen2.5-Math.
Qwen 3: The latest generation with sizes from 0.6B to 235B. The breakthrough feature is MoE (Mixture of Experts): the 30B-A3B model has 30 billion parameters but only activates ~3 billion per token. This means it fits in much less memory and runs much faster than a dense 30B model — while performing like one.
Why Qwen is special for local AI:
- MoE efficiency: Qwen 3 30B-A3B achieves dense-30B quality while using 3B-model resources. It runs at 22 t/s on CPU-only with DDR5-6000 — making it one of the first large models that's genuinely usable without a GPU.
- Coding: Qwen2.5-Coder-32B is competitive with GPT-4 for code completion and debugging. HumanEval score of 84.8 on the 7B variant alone.
- Multilingual: Native support for 29+ languages including Chinese, English, Japanese, Korean, Arabic, French, German, and Spanish. Best-in-class for non-English local AI.
- Size range: From 0.6B (runs on a phone) to 235B (needs 143 GB at Q4). There's a Qwen model for every hardware config.
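The MoE arithmetic is worth seeing explicitly: memory capacity scales with total parameters, but per-token memory traffic scales with active parameters. A sketch, assuming Q4_K_M averages ~4.85 bits per weight (an approximation):

```python
# Why MoE is fast: you must STORE all 30B parameters, but each token
# only READS the ~3B active ones. Bits-per-weight is an approximation.

BITS_PER_WEIGHT = 4.85  # approximate Q4_K_M average

def gb(params_billions):
    """Approximate weight bytes, in GB, for a parameter count."""
    return params_billions * 1e9 * BITS_PER_WEIGHT / 8 / 1e9

total_gb  = gb(30)  # all experts resident in memory -> ~18.2 GB
active_gb = gb(3)   # weights streamed per token     -> ~1.8 GB
print(f"storage needed : {total_gb:.1f} GB")
print(f"read per token : {active_gb:.1f} GB")

# On 96 GB/s DDR5, a dense 30B tops out near 96 / 18.2 ≈ 5 t/s, while
# the MoE's ceiling is 96 / 1.8 ≈ 53 t/s — which is why a measured
# 22 t/s on CPU is possible at all.
```

Same memory footprint as a dense 30B, a tenth of the bandwidth demand — that asymmetry is the entire MoE value proposition for local AI.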
Running Qwen locally:
# Install Ollama, then:
ollama run qwen3:8b # Good starting point, needs ~5GB VRAM
ollama run qwen3:14b # Better quality, needs ~11GB VRAM
ollama run qwen3:30b-a3b # MoE — great quality, only ~19GB but fast
ollama run qwen2.5-coder:32b # Best local coding model, needs ~20GB
Qwen 3 performance benchmarks (Q4_K_M):
| Hardware | Qwen3 8B | Qwen3 30B-A3B (MoE) |
|---|---|---|
| RTX 5090 (32GB) | 145 t/s | 142 t/s |
| RTX 3090 (24GB) | 87 t/s | 114 t/s |
| RTX 5060 Ti (16GB) | 51 t/s | N/A (doesn't fit) |
| M4 Max with MLX | ~80 t/s | 100+ t/s |
| DDR5-6000 CPU-only | ~15 t/s | 22 t/s |
| DDR4-3600 CPU-only | ~5 t/s | 10-14 t/s |
Ollama and the local AI software stack
The local AI software ecosystem has matured dramatically. Here's the landscape and when to use each tool.
Ollama — the easiest way to start:
Ollama is a lightweight CLI tool that wraps llama.cpp and makes running local models as simple as Docker pulls. One command downloads and runs a model. It automatically detects your GPU, manages quantized GGUF models, and exposes an OpenAI-compatible REST API on port 11434.
# Install: https://ollama.com
# Run your first model:
ollama run llama3.1:8b
# Use the API (OpenAI-compatible):
curl http://localhost:11434/v1/chat/completions \
-d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello"}]}'
Ollama supports 100+ models including Llama 3, Qwen 3, DeepSeek, Gemma, Mistral, and Phi. You can also import any of the 45,000+ GGUF models on Hugging Face. Key features include Modelfiles (Docker-like customization with system prompts and parameters) and automatic GPU/CPU layer splitting.
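A Modelfile looks like this — a hypothetical example (the settings are placeholders; FROM, PARAMETER, and SYSTEM are documented Modelfile directives):

```dockerfile
# Hypothetical Modelfile: a code-review flavored variant of llama3.1:8b
FROM llama3.1:8b
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM You are a terse code reviewer. Point out bugs before style issues.
```

Build and run it with `ollama create reviewer -f Modelfile`, then `ollama run reviewer` — the custom variant shows up alongside stock models in `ollama list`.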
LM Studio — GUI + API in one:
If you prefer a visual interface, LM Studio provides a polished desktop app with chat UI, model browser, and a built-in API server. It includes "chat with documents" (local RAG), automatic GPU detection, and a headless mode for server deployment. Best for developers who want both a GUI and an API without managing the CLI.
llama.cpp — the foundation:
The C/C++ library that powers Ollama and most other tools. Zero dependencies, runs on everything from ARM devices to multi-GPU rigs. Use it directly when you need maximum control: custom batch sizes, specific layer assignments, prompt caching, or grammar-constrained generation. The llama-server binary provides a lightweight HTTP server with concurrent request support via the -np flag.
vLLM — production serving:
When you need to serve models to multiple users with high throughput, vLLM's continuous batching and PagedAttention handle 2-4× more concurrent requests than alternatives. Primarily Linux + NVIDIA. Overkill for personal use; essential for team or production deployments.
Other notable tools:
- text-generation-webui: Maximum customization, zero telemetry, multi-backend support. For power users and researchers.
- Jan: Desktop app with MCP support, browser automation, and agentic workflows.
- GPT4All: Privacy-first desktop app with offline document chat. Simplest possible setup.
- LocalAI: Universal API gateway that orchestrates multiple backends (LLM, image, audio) behind one OpenAI-compatible endpoint. Best for self-hosted "OpenAI replacement" deployments.
What you can actually do with local AI
Local AI isn't just a tech demo — here are the practical use cases that work well today:
Coding assistance: Models like Qwen2.5-Coder-32B and DeepSeek-Coder provide autocomplete, debugging, code review, and refactoring without sending proprietary code to external servers. Integrate via OpenAI-compatible APIs with VS Code extensions, Neovim plugins, or directly with tools like Continue.dev. A 14B coding model on a 16GB GPU covers most development tasks.
Private document analysis (RAG): Retrieval-Augmented Generation lets you "chat with your documents" locally. Your files are split into chunks, converted to vector embeddings, stored in a local vector database (ChromaDB, Milvus), and retrieved as context when you ask questions. Frameworks like LangChain and LlamaIndex make this straightforward. LM Studio includes built-in RAG. For models with 128K token context windows, you can skip RAG entirely and just paste entire documents into the prompt.
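The retrieval step can be sketched in a few lines of dependency-free Python. This is a toy — the term-frequency "embeddings" and linear scan stand in for a real embedding model and vector database — but the pipeline shape (chunk, embed, score, stuff the winner into the prompt) is the same one LangChain and LlamaIndex automate:

```python
# Toy RAG retrieval: bag-of-words vectors + cosine similarity stand in
# for neural embeddings and a vector DB (ChromaDB, Milvus).
from collections import Counter
from math import sqrt

def embed(text):
    """Crude term-frequency 'embedding' (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunks = [
    "The KV cache stores key and value vectors for previous tokens.",
    "Quantization reduces weight precision to shrink memory use.",
    "Ollama exposes an OpenAI-compatible API on port 11434.",
]
index = [(c, embed(c)) for c in chunks]  # "vector database"

def retrieve(question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return [c for c, v in sorted(index, key=lambda cv: -cosine(q, cv[1]))[:k]]

question = "What port does the Ollama API use?"
context = retrieve(question)[0]
prompt = f"Context: {context}\n\nQuestion: {question}"
print(context)
```

The assembled `prompt` is what gets sent to the local model — the model never sees your whole corpus, only the retrieved chunks.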
Always-on personal AI: Run a small model (3-8B) on a Mac Mini or low-power PC as a permanent local assistant. No subscription fees, no API costs. Accessible from any device on your network via the Ollama API. Models like Qwen 3 4B handle casual questions, writing assistance, and simple analysis at 50+ t/s on modest hardware.
Batch processing: Process thousands of documents, emails, or data entries through an LLM without per-token costs. Local AI makes bulk operations that would cost hundreds of dollars via cloud APIs essentially free. Speed doesn't matter as much for batch work — even CPU-only inference works if you're patient.
Creative writing: No content filtering, no usage policies. Models like Llama 3.1 70B and Qwen 2.5 72B approach GPT-4 quality for creative work. Fine-tune on your own writing style using LoRA adapters without uploading your work anywhere.
Translation: Qwen 2.5 supports 29+ languages natively. Useful for real-time translation in offline environments, translating internal documents, or processing multilingual content at scale.
Image generation: Stable Diffusion, Flux, and SDXL run locally with 8-12GB of VRAM. Use ComfyUI or Automatic1111 as the interface. This is a GPU-intensive workload — unlike text generation, image gen needs both VRAM capacity and GPU compute power.
Hardware builds for local AI
Here are three build tiers based on real-world local AI performance. The GPU is where almost all your budget should go.
Budget build (~$600) — run 8B models comfortably:
- GPU: RTX 4060 Ti 8GB (~$300) or RTX 3060 12GB used (~$200)
- CPU: Ryzen 5 7500F or Intel i5-13400F (~$130)
- RAM: 32 GB DDR5-5600 (~$80)
- Storage: 1TB NVMe PCIe 4.0 (~$70)
- PSU: 650W (~$60)
- Runs: Qwen 3 8B at ~38 t/s, Llama 3.2 3B at 80+ t/s
- Best for: Getting started, coding assistance with small models, casual AI chat
Mid-range build (~$1,200-1,500) — run 14-32B models:
- GPU: Used RTX 3090 24GB (~$700) or RTX 4070 Ti Super 16GB (~$800)
- CPU: Ryzen 7 7700X (~$250)
- RAM: 64 GB DDR5-6000 (~$150)
- Storage: 2TB NVMe (~$130)
- Motherboard: B650 (~$150)
- PSU: 750W (~$80)
- Runs: Qwen 3 32B at 30+ t/s (3090), Qwen2.5-Coder-32B for serious coding
- Sweet spot: The used RTX 3090 at ~$700 is the best value in local AI — 24GB VRAM for under half the price of a 4090
High-end build (~$2,500-3,500) — run 70B models:
- GPU: RTX 5090 32GB (~$2,000) or RTX 4090 24GB (~$1,600)
- CPU: Ryzen 7 9800X3D (~$400)
- RAM: 64-128 GB DDR5-6000 (~$200-400)
- Storage: 2-4TB NVMe
- Motherboard: X670 (~$200)
- PSU: 1000W (~$130)
- Runs: Qwen 3 30B-A3B at 142 t/s, 70B quantized models with partial offload
Mac alternative — run 70B on one device:
- Mac Mini M4 Pro 64 GB: ~$1,800 — 32B models at 11-12 t/s. Silent, compact, always-on. Best value entry to large-model local AI.
- MacBook Pro M4 Max 128 GB: ~$4,000 — 72B models at 9 t/s. The only way to run 70B portably.
- Mac Studio M3/M4 Ultra 192 GB: ~$5,000+ — 235B quantized models. The consumer capacity king.
Priority order for spending:
- VRAM capacity — determines maximum model size. The single most important spec.
- Memory bandwidth — directly determines tokens per second.
- System RAM — at least 2× VRAM. Always dual-channel. DDR5 > DDR4.
- NVMe SSD — PCIe 4.0+ for loading 70B models in under 10 seconds.
- CPU — any modern chip works. Don't overspend here.
Speed reference: tokens per second by hardware
Here's what to expect from different hardware configurations. All benchmarks use Q4_K_M quantization with 8-16K context.
Usability thresholds:
- < 5 t/s: Frustratingly slow — barely usable for interactive work
- 5-10 t/s: Usable for non-interactive tasks (batch processing, background agents)
- 10-20 t/s: Comfortable reading speed — good enough for most chat use
- 20-40 t/s: Smooth interactive experience
- 40+ t/s: Excellent — faster than you can read
NVIDIA GPUs (Q4_K_M):
| GPU | VRAM | 8B model | 14B model | 32B model |
|---|---|---|---|---|
| RTX 5090 | 32 GB | 145 t/s | 103 t/s | ~60 t/s |
| RTX 4090 | 24 GB | 128-150 t/s | ~80 t/s | ~30 t/s |
| RTX 3090 | 24 GB | 87 t/s | 52 t/s | ~25 t/s |
| RTX 5060 Ti | 16 GB | 51 t/s | 33 t/s | OOM |
| RTX 4060 | 8 GB | ~38 t/s | OOM | OOM |
Apple Silicon (llama.cpp, Q4_K_M):
| Chip | Memory | 8B model | 32B model | 70B model |
|---|---|---|---|---|
| M1 (base) | 8-16 GB | 9.7 t/s | N/A | N/A |
| M3 Max | 48-128 GB | 50.7 t/s | ~20 t/s | 7.5 t/s |
| M4 Max | 64-128 GB | ~80 t/s | ~25 t/s | 9 t/s |
| M2 Ultra | 192 GB | 76.3 t/s | ~30 t/s | 12.1 t/s |
CPU-only (no GPU, Q4_K_M):
| RAM config | 8B model | 30B-A3B MoE |
|---|---|---|
| DDR4-3600 dual-channel | ~5 t/s | 10-14 t/s |
| DDR5-6000 dual-channel | ~15-22 t/s | 22 t/s |
Key takeaway: Memory bandwidth, not compute, is the bottleneck. An M3 Max (400 GB/s) generates tokens faster than an M4 Pro (273 GB/s) despite the M4 Pro having a newer architecture. A used RTX 3090 (936 GB/s, ~$700) is faster than any Mac under $4,000 for models that fit in 24 GB.
Getting started in 5 minutes
Here's the fastest path from zero to running AI on your machine:
Step 1: Install Ollama
# macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download from https://ollama.com
Step 2: Run a model
# Start with something small to test your setup:
ollama run qwen3:4b
# If you have 8GB+ VRAM:
ollama run qwen3:8b
# If you have 16GB+ VRAM:
ollama run qwen3:14b
# If you have 24GB+ VRAM:
ollama run qwen3:32b
Step 3: Use the API
# Ollama exposes an OpenAI-compatible API on port 11434:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [{"role": "user", "content": "Explain quicksort in Python"}]
}'
Step 4: Connect your tools
Ollama's OpenAI-compatible API means you can use it as a drop-in replacement in any app that supports OpenAI. Point your IDE extension, coding assistant, or custom app to http://localhost:11434 and set the model name.
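As a sketch, here is a standard-library-only Python client for that endpoint (the model tag assumes you've pulled qwen3:8b — swap in whatever you're running; if Ollama isn't up, the call fails gracefully):

```python
# Minimal client for Ollama's OpenAI-compatible endpoint, stdlib only.
import json
import urllib.request

def build_payload(model, user_msg):
    """OpenAI-style chat completion request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": user_msg}]}

def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-style response."""
    return response_json["choices"][0]["message"]["content"]

if __name__ == "__main__":
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(build_payload("qwen3:8b", "Hello")).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(extract_reply(json.load(resp)))
    except OSError as exc:
        print(f"Ollama not reachable: {exc}")
```

Because the request and response shapes are OpenAI-compatible, the same two helper functions work unchanged against any OpenAI-style server — only the base URL differs.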
Choosing your first model:
- General chat (low VRAM): qwen3:4b — runs on anything, surprisingly capable
- General chat (8GB+ VRAM): qwen3:8b — the best quality-to-speed ratio for 8GB GPUs
- Coding: qwen2.5-coder:7b, or qwen2.5-coder:32b if you have the VRAM
- Maximum quality (24GB+): qwen3:32b or qwen3:30b-a3b
- CPU-only: qwen3:30b-a3b — the MoE architecture makes it genuinely usable on CPU with fast DDR5
Frequently Asked Questions
Can I run ChatGPT-level AI on my computer?
Models like Qwen 3 32B and Llama 3.1 70B approach GPT-4 quality on many tasks. A 32B model fits on a 24GB GPU (RTX 3090/4090) and runs at 25-30 t/s. For coding specifically, Qwen2.5-Coder-32B matches GPT-4 on benchmarks. You won't match the latest cloud models on every task, but for most practical uses the gap has narrowed dramatically.
How much VRAM do I need for local AI?
It depends on the model: 8GB runs 8B models (good for basic tasks), 16GB runs 14B models (solid general use), 24GB runs 32B models (near-GPT-4 quality). For 70B+ models, you need either an RTX 5090 (32GB) with partial offloading, or a Mac with 64GB+ unified memory. The sweet spot for most people is 24GB — a used RTX 3090 for ~$700.
Is local AI slower than ChatGPT?
For small-to-medium models on a decent GPU, local AI is actually faster. An RTX 4090 generates 128+ tokens/sec from an 8B model — faster than most cloud APIs. The trade-off is that you're running smaller models. Cloud services run 100B+ parameter models on datacenter hardware, which gives them a quality edge on the hardest tasks.
Does local AI work offline?
Yes, completely. Once the model is downloaded, everything runs on your machine with no internet connection. This is one of the biggest advantages — you can use AI on flights, in air-gapped environments, or in situations where data can't leave your network. The model files are large (4-40 GB) so download them before going offline.
Do I need an NVIDIA GPU or will AMD work?
NVIDIA is the safest choice — CUDA has the best software support and optimization. AMD GPUs work via ROCm (Linux only, with some rough edges). Apple Silicon works great via Metal. Intel GPUs have basic support through Vulkan. For the least friction, buy NVIDIA.
Can I run multiple AI models at the same time?
Yes. Each model needs its own VRAM/RAM allocation. Ollama manages this automatically — it keeps recently used models in memory and swaps them as needed. To run two models simultaneously, you need enough memory for both. Two 8B models at Q4 need ~12 GB total. System RAM should be at least 2× your VRAM for smooth multi-model operation.
What is Ollama and how do I use it?
Ollama is a free, open-source tool for running AI models locally. Install it from ollama.com, then run "ollama run qwen3:8b" to download and start chatting with a model. It handles model management, GPU detection, and provides an OpenAI-compatible API automatically. It supports 100+ models including Llama, Qwen, DeepSeek, Gemma, and Mistral.
What are MoE models and why do they matter for local AI?
MoE (Mixture of Experts) models like Qwen 3 30B-A3B have many parameters (30B) but only activate a fraction (3B) per token. This means you get 30B-level quality with 3B-level speed and memory bandwidth requirements. The catch: you still need enough memory to store all 30B parameters (~19GB at Q4). But generation speed is dramatically faster than a dense 30B model, making MoE models the best quality-per-token-per-second option for local AI.