Local AI knowledge base - Ollama + RAG over your own notes

NoteBrain Desktop turns a folder of notes into a private, searchable AI knowledge base. Local embeddings, hybrid retrieval, and a local Ollama model answer your questions - 100% offline, free.

Local Ollama + local RAG is a Desktop feature. Browsers block direct connections to localhost Ollama (CORS), so the full local AI stack - Ollama, TransformersJS embeddings, on-disk SQLite vector index - runs in the free Desktop app (macOS, Windows, Linux). The Web app uses cloud AI (demo or Cloud plan) and falls back to keyword search.

How the local RAG stack works

Index. NoteBrain chunks your notes and generates embeddings using a local TransformersJS model (WebGPU-accelerated when available).
Store. Vectors and metadata live in a local SQLite database (Desktop) or in your browser’s IndexedDB (Web). On Desktop, nothing leaves your machine.
Retrieve. Each query runs a hybrid search: full-text search (Lunr / FTS5) plus cosine similarity over the vectors, with a reranking step on top.
Answer. Retrieved chunks are passed as context to a local Ollama model (Llama, Mistral, Qwen, Gemma, Phi…). The LLM never sees the open internet.
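
To make the Retrieve step concrete, here is a minimal sketch of hybrid scoring in Python. It is not NoteBrain's actual code: the keyword score is a simple term-overlap stand-in for Lunr / FTS5, the two-dimensional vectors are toy values rather than real embeddings, and `alpha` is a hypothetical blending weight.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query, text):
    """Fraction of distinct query terms that appear in the text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0

def hybrid_search(query, query_vec, chunks, alpha=0.5, top_k=3):
    """Rank chunks by alpha * cosine + (1 - alpha) * keyword score.

    `chunks` is a list of (text, vector) pairs. `alpha` is an assumed
    blending weight for illustration, not a NoteBrain setting.
    """
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy data: two chunks with made-up 2-D vectors.
chunks = [
    ("invoice terms for the acme contract", [1.0, 0.0]),
    ("recipe for sourdough bread", [0.0, 1.0]),
]
results = hybrid_search("acme contract", [0.9, 0.1], chunks)
```

A reranker would then reorder the top results before they are handed to the model; that step is omitted here.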

Why local matters

Confidential data stays confidential. Medical records, client briefs, source code, and draft contracts never leave the machine.
No tokens to budget. Local inference is free; ask as many follow-ups as you want.
Works on a flight. Embeddings, retrieval and inference all run offline.
GPU-aware. The Electron build prefers Vulkan and the discrete GPU on Desktop for fast local embeddings.
Voice notes stay local too. Speech-to-text runs on a local model (Whisper or NVIDIA Parakeet) - record a meeting, dictate an idea, transcribe a voice memo without sending audio anywhere.

Pick the model that fits

Goal	Try (Ollama)
Fast, small footprint	`llama3.2:3b`, `phi3`, `qwen2.5:3b`
Balanced quality	`llama3.1:8b`, `mistral`, `qwen2.5:7b`
Best answers (needs RAM)	`qwen2.5:14b`, `llama3.1:70b` (large VRAM)
Embeddings (already bundled)	Local TransformersJS - nothing to install
Speech-to-text (local)	Whisper or NVIDIA Parakeet - transcribe voice notes offline
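
As a rough guide to which row fits your machine, the memory needed for model weights is approximately parameter count times bits per weight. The sketch below assumes roughly 4-bit quantized weights (an assumption, not a NoteBrain or Ollama guarantee) and ignores the context cache and runtime overhead, so treat the results as lower bounds.

```python
def approx_weight_gb(params_billions, bits_per_weight=4):
    """Rough size of the model weights in GB (1 GB = 1e9 bytes).

    Covers weights only; the KV cache and runtime overhead come on top.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Parameter counts taken from the model tags in the table above.
for name, size in [("3b", 3), ("8b", 8), ("14b", 14), ("70b", 70)]:
    print(f"{name}: ~{approx_weight_gb(size):.1f} GB of weights at 4-bit")
```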

Setup in three steps

Install Ollama and pull a model: `ollama pull llama3.1:8b`.
Install NoteBrain Desktop from GitHub.
Open NoteBrain → Settings → AI → pick Ollama and select your model.

That’s it. Ask your first question - NoteBrain retrieves the right notes and answers from local context.
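
If you want to check that Ollama is answering before you open NoteBrain, a minimal Python sketch against Ollama's local HTTP endpoint looks like this. The prompt template and the `build_prompt` / `ask_ollama` helpers are hypothetical illustrations, not NoteBrain's internals; the endpoint is Ollama's default local address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_prompt(question, chunks):
    """Join retrieved chunks into a numbered context block followed by the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only the notes below.\n\n{context}\n\nQuestion: {question}"

def ask_ollama(question, chunks, model="llama3.1:8b"):
    """Send one non-streaming generate request to a locally running Ollama."""
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(question, chunks),
        "stream": False,  # return one JSON object instead of a token stream
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `ask_ollama("When is the invoice due?", ["Invoice is due on the 1st."])` only works while Ollama is running and the model has been pulled.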

What you don’t need

No OpenAI key.
No vector-database service.
No "embed in the cloud" step.
No monthly AI bill (unless you opt in to managed cloud).

FAQ

Do I need a GPU?

For Ollama: a recent CPU works for 3B–7B models; a GPU helps for larger ones. For NoteBrain embeddings: TransformersJS uses WebGPU when available and falls back to WASM otherwise.

How big can my note collection be?

The local SQLite index scales to tens of thousands of notes with sub-second retrieval. Embeddings are computed in the background; you can keep writing while indexing runs.
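
For a feel of what an on-disk vector index in SQLite can look like, here is a minimal Python sketch. The table layout and the `open_index`, `add_chunk` and `nearest` helpers are hypothetical; NoteBrain's real schema is not documented here, and a production index would avoid the brute-force scan used below.

```python
import math
import sqlite3
from array import array

def open_index(path=":memory:"):
    """Create a hypothetical chunk table; vectors are stored as float32 BLOBs."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS chunks "
        "(id INTEGER PRIMARY KEY, note TEXT, text TEXT, vec BLOB)"
    )
    return db

def add_chunk(db, note, text, vec):
    """Store one chunk with its embedding packed as raw float32 bytes."""
    db.execute(
        "INSERT INTO chunks (note, text, vec) VALUES (?, ?, ?)",
        (note, text, array("f", vec).tobytes()),
    )

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(db, query_vec, top_k=3):
    """Brute-force cosine scan over every stored vector."""
    scored = []
    for note, text, blob in db.execute("SELECT note, text, vec FROM chunks"):
        vec = array("f")
        vec.frombytes(blob)  # unpack the BLOB back into floats
        scored.append((cosine(query_vec, vec), note, text))
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]
```

Passing a file path instead of `":memory:"` keeps the index on disk between sessions.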

What about retrieval quality?

NoteBrain uses hierarchical RAG: chunk-level vector and full-text search, an optional rerank, and document-level context expansion so answers keep their surrounding context.
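
One common way to do context expansion is to pull in the chunks next to each hit and merge overlapping ranges. The Python sketch below illustrates that idea under stated assumptions: the `window` parameter and the neighbour-merging strategy are illustrative choices, not a description of NoteBrain's actual implementation.

```python
def expand_context(doc_chunks, hit_indexes, window=1):
    """Return merged text spans covering each hit plus `window` neighbours.

    `doc_chunks` is the ordered list of chunks for one document and
    `hit_indexes` are the positions that matched the query.
    """
    if not hit_indexes:
        return []
    # Clamp each hit's neighbourhood to the bounds of the document.
    ranges = sorted(
        (max(0, i - window), min(len(doc_chunks) - 1, i + window))
        for i in hit_indexes
    )
    merged = [list(ranges[0])]
    for start, end in ranges[1:]:
        if start <= merged[-1][1] + 1:  # overlapping or adjacent: extend the span
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [" ".join(doc_chunks[s:e + 1]) for s, e in merged]

doc = ["c0", "c1", "c2", "c3", "c4", "c5", "c6"]
spans = expand_context(doc, [1, 2, 6])
```

Here the hits at positions 1 and 2 collapse into one span, while the hit at position 6 stays separate, so the model sees two coherent passages rather than three fragments.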