What is the Windows Desktop AI with RAG stack for?

LM Studio + AnythingLLM = local AI on Windows with a friendly GUI. Download models visually, chat with documents, no terminal needed. Runs on a $200 used GPU. It is purpose-built for A private AI assistant on Windows with document RAG - no CLI, no Docker, all desktop GUI and runs entirely on your own hardware.

How much does the Windows Desktop AI with RAG stack cost?

Windows Desktop AI with RAG costs around $200 in hardware up front and $0/month to run, since everything is self-hosted — no per-token or subscription fees versus a cloud equivalent.

How long does it take to set up Windows Desktop AI with RAG?

Plan for roughly 15 minutes. The stack is rated beginner.

What do I need to run Windows Desktop AI with RAG?

Windows Desktop AI with RAG is built from 2 tool(s), 2 model(s), 2 hardware item(s). Each is listed below with a link.

LM Studio + AnythingLLM = local AI on Windows with a friendly GUI. Download models visually, chat with documents, no terminal needed. Runs on a $200 used GPU.

Windows Desktop AI with RAG (LM Studio + AnythingLLM)

A fully private AI stack for Windows that never touches the cloud. LM Studio serves as your local model server with a visual interface — browse, download, and run models from HuggingFace without typing a single command. AnythingLLM adds document RAG, workspace isolation, and agent skills on top. Your documents, your GPU, your data.

This stack is built for Windows users who prefer a graphical interface — no Docker, no terminal commands beyond the basics. If you've wanted to run local AI but found Ollama's CLI intimidating, this is your setup.

What you get

Visual model browser — search HuggingFace models inside LM Studio, download with one click
Drop-in document Q&A — PDF, DOCX, TXT, CSV, code files. Drag them into AnythingLLM and ask questions
Workspace isolation — separate knowledge bases for work, personal, or different projects
OpenAI-compatible local API — LM Studio exposes http://localhost:1234/v1 so AnythingLLM and any other tool can connect
No data leaves your PC — all inference and embedding runs locally, works completely offline
No Docker, no WSL, no CLI — both apps are native Windows desktop installers
$0/month — the only cost is the GPU you already own

Architecture

Component	Role
LM Studio	Model manager + local API server (port 1234)
AnythingLLM	Chat UI + RAG engine + vector DB (LanceDB)
Qwen2.5 14B	General chat model, fits 12GB at Q4
LanceDB (built-in)	Local vector database for document embeddings

LM Studio runs the model and exposes an OpenAI-compatible endpoint. AnythingLLM connects to it, handles document chunking and embeddings, and provides the chat interface. Everything runs as native Windows apps — no containers.

Recommended GPU: RTX 3060 12GB (~$200 used) for running 7B-14B models. If you have an RTX 4060 (8GB), stick with 3B-7B models for good speed. No GPU required — both apps fall back to CPU, though responses will be slower.

Prerequisites

Windows 11 (64-bit, Home or Pro)
GPU with 4GB+ VRAM recommended (6GB+ preferred). CPU-only works but is slower
16GB RAM (8GB minimum, 16GB for comfortable multitasking)
10-30GB free disk for models
LM Studio (free) and AnythingLLM Desktop (free, MIT license)

Setup

Step 1: Install LM Studio

Go to lmstudio.ai and download the Windows installer
Run the installer — default path is fine
Open LM Studio

You'll see a clean interface with a search bar and model browser. No terminal commands needed.

Step 2: Download a model

In LM Studio, go to the Discover tab (magnifying glass icon in left sidebar)
Search for Qwen2.5-14B — a capable 14B model for general chat and RAG
Look for a Q4_K_M quantized version — best balance of quality and size
Click Download and wait (~8 GB for Q4_K_M)

If you have 8GB VRAM or less, search for Qwen2.5-7B or Llama 3.2 3B instead. The download process is the same.

Step 3: Start the local server

In LM Studio, go to the Developer tab (</> icon in left sidebar)
Select your downloaded model from the dropdown at the top
Click Start Server
You should see: Server listening on http://localhost:1234

Leave LM Studio open and the server running — AnythingLLM connects to it over this local port.

Step 4: Install AnythingLLM

Go to anythingllm.com/desktop and download the Windows installer
Important: When prompted, install for Current User only — not "All Users". Installing to Program Files causes a known spawn error
Open AnythingLLM Desktop

Step 5: Connect AnythingLLM to LM Studio

In AnythingLLM, open Settings (gear icon, bottom left)
Go to LLM Preference
Select LM Studio as the provider
Set the base URL to http://localhost:1234
Click Save changes
Go to Embedding Model in the same settings panel and set it to AnythingLLM built-in (or LM Studio if you prefer)

AnythingLLM will show a green indicator if it can reach the LM Studio server.

Step 6: Create a workspace and chat

Click New Workspace
Name it — for example "Work" or "Personal"
Optionally set a system prompt
Click Save

You can now chat with your local model. Type any question to confirm the connection is working.

Step 7: Add documents (RAG)

Inside your workspace, click the paperclip icon or drag a file into the chat area
Supported formats: .txt, .pdf, .md, .docx, .csv
AnythingLLM splits the document into chunks and creates embeddings locally
Once indexed (green indicator next to filename), ask questions about the document
The model retrieves relevant chunks and answers using your document as context

Workspaces are isolated — documents in one workspace are invisible to others. This lets you keep separate knowledge bases for different projects.

Use it

Once set up, you can:

Chat with local models — full conversational AI, no internet needed
Query your documents — upload PDFs, code files, meeting notes, and ask questions
Switch models on the fly — download a different model in LM Studio and select it from the dropdown
Use any OpenAI-compatible tool — since LM Studio exposes http://localhost:1234/v1, you can connect other tools (SillyTavern, Obsidian plugins, etc.)

Performance tips

GPU	Max model size	Typical speed
RTX 3060 12GB	14B at Q4	15-20 tok/s
RTX 4060 8GB	7B at Q4	20-30 tok/s
CPU-only (16GB)	3B at Q4	3-5 tok/s
CPU-only (32GB)	7B at Q4	2-4 tok/s

CPU-only inference works but expect slower responses. The RAG pipeline (embedding + retrieval) runs fine on CPU.

Cost vs cloud

	Local (this stack)	Cloud ChatGPT Plus
Monthly	$0	$20
GPU (used)	$200 (RTX 3060)	$0 (included)
Data privacy	Complete	Shared with OpenAI
Offline capable	Yes	No
Model selection	Any open model	GPT-4o only

Troubleshooting

AnythingLLM won't start — spawn error This happens when installed under Program Files (All Users). Uninstall and reinstall using "Current User" option.

Model loads but responds very slowly You're likely running on CPU. Check that LM Studio detects your GPU in the Developer tab. Try a smaller model (3B or 7B).

"Failed to connect to LM Studio" in AnythingLLM Make sure LM Studio's server is running (Developer tab > Start Server). Verify the port: http://localhost:1234 should show a page in your browser.

Documents are uploaded but the model ignores them Ensure the document has a green indicator in the workspace sidebar. Check that the embedding model is active in Settings > Embedding Model.

Windows Desktop AI with RAG