Local AI knowledge base - Ollama + RAG over your own notes

NoteBrain Desktop turns a folder of notes into a private, searchable AI knowledge base. Local embeddings, hybrid retrieval, and a local Ollama model answer your questions, fully offline and free of charge.

Local Ollama + local RAG is a Desktop feature. Browsers block direct connections to localhost Ollama (CORS), so the full local AI stack - Ollama, TransformersJS embeddings, on-disk SQLite vector index - runs in the free Desktop app (macOS, Windows, Linux). The Web app uses cloud AI (demo or Cloud plan) and falls back to keyword search.

How the local RAG stack works

  1. Index. NoteBrain chunks your notes and generates embeddings using a local TransformersJS model (WebGPU-accelerated when available).
  2. Store. Vectors and metadata live in a local SQLite database (Desktop) or in your browser’s IndexedDB (Web). On Desktop, nothing leaves your machine.
  3. Retrieve. Each query runs a hybrid search: full-text search (Lunr / FTS5) plus cosine similarity over the embedding vectors, with a reranking step on top.
  4. Answer. Retrieved chunks are passed as context to a local Ollama model (Llama, Mistral, Qwen, Gemma, Phi…). The LLM never sees the open internet.
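
The Retrieve step can be illustrated with a small, dependency-free sketch. It is not NoteBrain's actual code: the term-count keyword score stands in for Lunr / FTS5, and reciprocal rank fusion (RRF) is an assumed way of combining the two rankings, since the page does not say how the full-text and vector results are merged. All function names here are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query, text):
    """Toy full-text score: how often the query terms occur in the chunk."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in set(query.lower().split()))


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_search(query, query_vec, chunks, top_k=3):
    """chunks maps chunk id -> (text, embedding vector)."""
    by_keyword = sorted(
        chunks, key=lambda cid: (-keyword_score(query, chunks[cid][0]), cid)
    )
    by_vector = sorted(
        chunks, key=lambda cid: (-cosine(query_vec, chunks[cid][1]), cid)
    )
    return rrf_fuse([by_keyword, by_vector])[:top_k]


# Made-up chunks with 2-dimensional "embeddings" for illustration only.
CHUNKS = {
    "a": ("ollama runs local models", [1.0, 0.0]),
    "b": ("sqlite stores vectors on disk", [0.0, 1.0]),
    "c": ("local models answer questions offline", [0.9, 0.1]),
}

if __name__ == "__main__":
    print(hybrid_search("local models", [1.0, 0.0], CHUNKS, top_k=2))
```

Rank fusion is convenient here because full-text scores and cosine similarities live on different scales; fusing ranks avoids having to normalize them against each other.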

Why local matters

Pick the model that fits

Goal                          | Try (Ollama)
------------------------------|----------------------------------------------------------
Fast, small footprint         | llama3.2:3b, phi3, qwen2.5:3b
Balanced quality              | llama3.1:8b, mistral, qwen2.5:7b
Best answers (needs RAM)      | qwen2.5:14b, llama3.1:70b (large VRAM)
Embeddings (already bundled)  | Local TransformersJS - nothing to install
Speech-to-text (local)        | Whisper or NVIDIA Parakeet - transcribe voice notes offline

Setup in three steps

  1. Install Ollama and pull a model: ollama pull llama3.1:8b.
  2. Install NoteBrain Desktop from GitHub.
  3. Open NoteBrain → Settings → AI → pick Ollama and select your model.

That’s it. Ask your first question - NoteBrain retrieves the right notes and answers from local context.
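
To confirm that Ollama itself is reachable before pointing NoteBrain at it, a short script can send one request to Ollama's local HTTP API. This is a minimal sketch that assumes Ollama is listening on its default port (11434) and that llama3.1:8b has been pulled; the prompt wording and the helper names are illustrative and are not taken from NoteBrain.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(question, chunks):
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the notes below. "
        "If they do not contain the answer, say so.\n\n"
        f"Notes:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def build_request(model, question, chunks):
    """Build the POST request; stream=False asks for a single JSON reply."""
    payload = {
        "model": model,
        "prompt": build_prompt(question, chunks),
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, question, chunks, timeout=120):
    """Send the request to the local Ollama server and return its answer text."""
    request = build_request(model, question, chunks)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    notes = ["RAG means retrieval-augmented generation."]
    print(ask("llama3.1:8b", "What does RAG stand for?", notes))
```

If the script raises a connection error, Ollama is probably not running; if it reports an unknown model, pull it first with ollama pull.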

What you don’t need

FAQ

Do I need a GPU?

For Ollama: a recent CPU works for 3B–7B models; a GPU helps for larger ones. For NoteBrain embeddings: TransformersJS uses WebGPU when available and falls back to WASM otherwise.

How big can my note collection be?

The local SQLite index scales to tens of thousands of notes with sub-second retrieval. Embeddings are computed in the background; you can keep writing while indexing runs.
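
For intuition about what a SQLite-backed vector index involves, here is a minimal sketch using Python's standard sqlite3 module. The table layout, the float32 blob encoding, and the brute-force cosine scan are all assumptions made for illustration; NoteBrain's real schema and search strategy are not documented on this page.

```python
import math
import sqlite3
from array import array


def to_blob(vec):
    """Pack a list of floats into a compact float32 byte string."""
    return array("f", vec).tobytes()


def from_blob(blob):
    """Unpack a float32 byte string back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return list(values)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def open_index(path=":memory:"):
    """Open (or create) the index; ':memory:' keeps this demo off disk."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, note TEXT NOT NULL, "
        "text TEXT NOT NULL, embedding BLOB NOT NULL)"
    )
    return db


def add_chunk(db, note, text, vec):
    db.execute(
        "INSERT INTO chunks (note, text, embedding) VALUES (?, ?, ?)",
        (note, text, to_blob(vec)),
    )


def nearest(db, query_vec, top_k=3):
    """Brute-force scan: score every stored chunk, keep the best top_k."""
    rows = db.execute("SELECT note, text, embedding FROM chunks").fetchall()
    scored = [(cosine(query_vec, from_blob(e)), note, text) for note, text, e in rows]
    scored.sort(key=lambda row: -row[0])
    return [(note, text) for _, note, text in scored[:top_k]]


if __name__ == "__main__":
    db = open_index()
    add_chunk(db, "a.md", "alpha", [1.0, 0.0])
    add_chunk(db, "b.md", "beta", [0.0, 1.0])
    print(nearest(db, [0.9, 0.1], top_k=1))
```

A linear scan like this is the simplest option; whether it stays fast enough at a given collection size depends on the embedding dimension and hardware, which this sketch does not measure.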

What about retrieval quality?

NoteBrain uses hierarchical RAG: chunk-level vector and full-text search, an optional reranking pass, and document-level context expansion, so answers keep the context surrounding each matched chunk.
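
One way to picture document-level context expansion: each matched chunk pulls in its neighbours from the same note, and overlapping or adjacent ranges are merged so the model receives contiguous passages. The sketch below is an assumed illustration of that idea, not NoteBrain's algorithm; the window parameter and the function name are hypothetical.

```python
def expand_hits(hits, chunks_per_doc, window=1):
    """Expand each (doc_id, chunk_index) hit by `window` neighbouring chunks.

    Returns a dict mapping doc_id to a sorted list of merged, inclusive
    (start, end) chunk ranges, clamped to the bounds of each document.
    """
    spans_by_doc = {}
    for doc_id, idx in hits:
        last = chunks_per_doc[doc_id] - 1
        span = (max(0, idx - window), min(last, idx + window))
        spans_by_doc.setdefault(doc_id, []).append(span)

    merged = {}
    for doc_id, spans in spans_by_doc.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            prev_start, prev_end = out[-1]
            if start <= prev_end + 1:
                # Overlapping or touching ranges collapse into one passage.
                out[-1] = (prev_start, max(prev_end, end))
            else:
                out.append((start, end))
        merged[doc_id] = out
    return merged


if __name__ == "__main__":
    hits = [("n1", 0), ("n1", 2), ("n1", 8), ("n2", 4)]
    print(expand_hits(hits, {"n1": 10, "n2": 5}))
```

Merging ranges avoids sending the same chunk twice when two hits in one note sit close together.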