Local AI knowledge base - Ollama + RAG over your own notes
NoteBrain Desktop turns a folder of notes into a private, searchable AI knowledge base. Local embeddings, hybrid retrieval, and a local Ollama model answer your questions - 100% offline, free.
Local Ollama + local RAG is a Desktop feature. Browsers block direct connections to localhost Ollama (CORS), so the full local AI stack - Ollama, TransformersJS embeddings, on-disk SQLite vector index - runs in the free Desktop app (macOS, Windows, Linux). The Web app uses cloud AI (demo or Cloud plan) and falls back to keyword search.
How the local RAG stack works
- Index. NoteBrain chunks your notes and generates embeddings using a local TransformersJS model (WebGPU-accelerated when available).
- Store. Vectors and metadata live in a local SQLite database (Desktop) or in your browser’s IndexedDB (Web). On Desktop, nothing leaves your machine.
- Retrieve. Each query runs a hybrid search that combines full-text matching (Lunr / FTS5) with cosine similarity over the vectors, then optionally reranks the merged results.
- Answer. Retrieved chunks are passed as context to a local Ollama model (Llama, Mistral, Qwen, Gemma, Phi…). The LLM never sees the open internet.
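The retrieve step above can be sketched in a few lines. This is an illustration only, not NoteBrain's actual code: the toy vectors stand in for TransformersJS embeddings, the term-overlap score stands in for Lunr / FTS5, the reranking step is omitted, and the `alpha` blend weight and all function names are assumptions.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query terms found in the text (crude stand-in for full-text search)."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    t_terms = set(re.findall(r"\w+", text.lower()))
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def hybrid_search(query, query_vec, chunks, alpha=0.5, top_k=2):
    """Rank chunks by a weighted blend of vector and keyword scores.

    chunks: list of (text, vector) pairs. alpha is a hypothetical blend weight.
    """
    scored = []
    for text, vec in chunks:
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
chunks = [
    ("Invoice terms for the Acme contract", [0.9, 0.1, 0.0]),
    ("Sourdough starter feeding schedule", [0.0, 0.2, 0.9]),
    ("Acme kickoff meeting notes", [0.7, 0.3, 0.1]),
]
results = hybrid_search("acme contract terms", [1.0, 0.0, 0.0], chunks)
print(results)
```

Blending the two scores lets an exact keyword match lift a chunk that the embedding alone would rank lower, and the reverse. A weighted sum is only one way to merge the two rankings.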
Why local matters
- Confidential data stays confidential. Medical records, client briefs, source code, draft contracts - never leave the machine.
- No tokens to budget. Local inference is free; ask as many follow-ups as you want.
- Works on a flight. Embeddings, retrieval and inference all run offline.
- GPU-aware. The Electron build prefers Vulkan and a discrete GPU on desktop for fast local embeddings.
- Voice notes stay local too. Speech-to-text runs on a local model (Whisper or NVIDIA Parakeet) - record a meeting, dictate an idea, transcribe a voice memo without sending audio anywhere.
Pick the model that fits
| Goal | Try (Ollama) |
|---|---|
| Fast, small footprint | llama3.2:3b, phi3, qwen2.5:3b |
| Balanced quality | llama3.1:8b, mistral, qwen2.5:7b |
| Best answers (needs RAM) | qwen2.5:14b, llama3.1:70b (large VRAM) |
| Embeddings (already bundled) | Local TransformersJS - nothing to install |
| Speech-to-text (local) | Whisper or NVIDIA Parakeet - transcribe voice notes offline |
Setup in three steps
- Install Ollama and pull a model:
  ollama pull llama3.1:8b
- Install NoteBrain Desktop from GitHub.
- Open NoteBrain → Settings → AI → pick Ollama and select your model.
That’s it. Ask your first question - NoteBrain retrieves the right notes and answers from local context.
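For readers curious what the answer step looks like on the wire, here is a hedged sketch of a request to Ollama's local REST API (`POST /api/generate` on the default port 11434). The prompt template and the function names are illustrative assumptions, not NoteBrain's actual prompt.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_prompt(question, chunks):
    """Assemble a context-grounded prompt (hypothetical template)."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the notes below.\n\n"
        f"Notes:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def build_request(model, question, chunks):
    """Create the HTTP request without sending it."""
    payload = {"model": model, "prompt": build_prompt(question, chunks), "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, question, chunks):
    """Send the request; requires a running Ollama with the model already pulled."""
    with urllib.request.urlopen(build_request(model, question, chunks)) as resp:
        return json.loads(resp.read())["response"]


req = build_request(
    "llama3.1:8b",
    "When is the Acme invoice due?",
    ["Acme invoice is due on the 15th."],
)
print(req.get_method(), req.full_url)
```

Calling `ask(...)` with the same arguments returns the model's answer as a string once Ollama is running locally. Setting `"stream": False` asks for a single JSON reply rather than a stream of partial tokens.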
What you don’t need
- No OpenAI key.
- No vector-database service.
- No "embed in the cloud" step.
- No monthly AI bill (unless you want managed cloud - optional).
FAQ
Do I need a GPU?
For Ollama: a recent CPU works for 3B–7B models; a GPU helps for larger ones. For NoteBrain embeddings: TransformersJS uses WebGPU when available and falls back to WASM otherwise.
How big can my note collection be?
The local SQLite index scales to tens of thousands of notes with sub-second retrieval. Embeddings are computed in the background; you can keep writing while indexing runs.
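To make the on-disk index concrete, here is a minimal sketch of storing chunk vectors in SQLite and scanning them by cosine similarity. The table name, the schema, and the float32 BLOB packing are assumptions for illustration; NoteBrain's real schema is not described here.

```python
import math
import sqlite3
import struct


def pack(vec):
    """Serialize a list of floats as a float32 BLOB."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob):
    """Inverse of pack(): BLOB back to a tuple of floats."""
    return struct.unpack(f"{len(blob) // 4}f", blob)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


db = sqlite3.connect(":memory:")  # a file path here would make the index persistent
db.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, note TEXT, body TEXT, vec BLOB)")
rows = [
    ("acme.md", "Invoice terms for the Acme contract", [0.9, 0.1, 0.0]),
    ("bread.md", "Sourdough starter feeding schedule", [0.0, 0.2, 0.9]),
]
db.executemany(
    "INSERT INTO chunks (note, body, vec) VALUES (?, ?, ?)",
    [(note, body, pack(vec)) for note, body, vec in rows],
)


def nearest(query_vec, top_k=1):
    """Brute-force scan: score every stored vector against the query."""
    scored = [
        (cosine(query_vec, unpack(blob)), note)
        for note, blob in db.execute("SELECT note, vec FROM chunks")
    ]
    scored.sort(reverse=True)
    return [note for _, note in scored[:top_k]]


print(nearest([0.1, 0.1, 1.0]))
```

A linear scan is the simplest approach and keeps everything in one file. Whether NoteBrain scans linearly or uses an approximate index is not stated, so treat this as a picture of the idea rather than of the implementation.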
What about retrieval quality?
NoteBrain uses a hierarchical RAG: chunk-level vector + full-text search, optional rerank, plus document-level context expansion so answers don’t lose the surrounding context.
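Document-level context expansion can be pictured as widening each hit to include its neighboring chunks from the same note. A minimal sketch, assuming a fixed window of one chunk on each side; the `window` parameter, the function name, and the data layout are hypothetical.

```python
def expand_hits(hits, notes, window=1):
    """Widen each (note_id, chunk_index) hit to include neighboring chunks.

    notes maps note_id -> ordered list of chunk texts.
    Overlapping windows within a note are merged so no chunk repeats.
    """
    wanted = {}
    for note_id, idx in hits:
        chunks = notes[note_id]
        lo, hi = max(0, idx - window), min(len(chunks), idx + window + 1)
        wanted.setdefault(note_id, set()).update(range(lo, hi))
    return {
        note_id: [notes[note_id][i] for i in sorted(indices)]
        for note_id, indices in wanted.items()
    }


notes = {
    "trip.md": ["Packing list", "Flight lands 9am", "Hotel check-in 3pm", "Dinner booking"],
}
expanded = expand_hits([("trip.md", 1), ("trip.md", 2)], notes)
print(expanded)
```

Merging overlapping windows keeps the prompt free of duplicated text when several hits land close together in the same note.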