The dream is running GraphRAG with locally-hosted LLMs. And at least for now, the dream is on hold for me.

In case you missed it, GraphRAG is a way of getting more useful results with LLMs by working with data you provide (in addition to whatever they’ve been trained on). The system uses LLMs to build knowledge graphs from documents you provide, and then uses those graphs to power retrieval-augmented generation (RAG) queries.

This opens lots of possibilities. For information architecture work, it lets you ask useful questions of your own content. I’ve written about my experiments in that scenario. In that case, I used OpenAI’s models to power Microsoft’s GraphRAG application.

But I’m especially excited about the possibilities for personal knowledge management. Imagine an LLM tuned to and focused on your personal notes, journals, calendars, etc. That’s primarily why I’m dreaming of GraphRAG powered by local models.

There are several reasons why local models would be preferable. For one, there’s the cost: GraphRAG indexing runs are expensive. There’s also a privacy angle. Yes, I’ve told OpenAI I don’t want them to train their models using my data, but some of this stuff is extremely personal and I’m not comfortable with it leaving my computer at all.

But an even larger concern is dependency. I’m building a lifelong thinking assistant. (An amanuensis, as I outlined in Duly Noted.) It’s risky to delegate such a central part of this system to a party that could turn off the spigot at any time.

So I’ve been experimenting with GraphRAG using local models. There’s good news and bad news.

Before I tell you about them, let me explain my setup. I’m using a 16” 2023 M2 Max MacBook Pro with 32GB of RAM. It’s not an entry-level machine, but not a monster either. I’m using Ollama to run local models. I’ve tried around half a dozen models at this point and have successfully set up one automated (non-GraphRAG) workflow using mistral-small3.1.
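For a sense of what that workflow does under the hood, here’s a minimal sketch of a call to a locally served model through the ollama Python client. It assumes ollama serve is running and that the mistral-small3.1 model has already been pulled; the prompt is just a placeholder.

```python
# Minimal sketch: calling a locally hosted model via the ollama Python client.
import ollama

response = ollama.chat(
    model="mistral-small3.1",  # the local model tag from my setup; swap in your own
    messages=[
        {"role": "user", "content": "Summarize this note in one sentence: ..."},
    ],
)

print(response["message"]["content"])
```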

GraphRAG is extremely flexible. There are dozens of parameters to configure, including a different LLM for each step in the process. Out of the box, its prompts are optimized specifically for GPT-4-turbo; other models require tweaking. And because indexing runs (where the model converts texts into knowledge graphs) can take a long time, each round of tweaking is slow.
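Part of what makes the swap possible is that Ollama exposes an OpenAI-compatible endpoint, which GraphRAG’s model settings can be pointed at. Before committing to a long indexing run, a quick sanity check along these lines (the endpoint and model tag are assumptions from my setup) confirms the local model responds at all:

```python
# Sanity check against Ollama's OpenAI-compatible endpoint, the kind of
# interface a GraphRAG configuration can be pointed at for local models.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # the client requires a key; Ollama ignores it
)

reply = client.chat.completions.create(
    model="mistral-small3.1",
    messages=[{"role": "user", "content": "List the named entities in: ..."}],
)

print(reply.choices[0].message.content)
```

If that call stalls or errors out, the indexing run will fail too, only much more slowly.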

I’ve had a go at it several times, but given up after a bit. I don’t have much free time these days, and most experiments have ended in failed (and long!) indexing runs. But a few things have changed in recent weeks:

  • GraphRAG itself keeps evolving
  • There are now more powerful small local models that run better within my machine’s limitations
  • ChatGPT o3 came out

That last one may sound like a non-sequitur. Aren’t I trying to get away from cloud-hosted models for this use case? Well, yes — but in this case, I’m not using o3 to power GraphRAG. Instead, I’m using it to help me debug failed runs.

While o3 is certainly nothing like AGI, as some have claimed, it has proven excellent for dealing with the sort of tech-related issues that would’ve sent me off to Stack Overflow in the past. Debugging GraphRAG runs is one such task. I’ve been feeding o3 the log files after each run, and it has recommended helpful tweaks. It’s been the single most important factor in my recent progress.

Yes, there’s been some progress: yesterday, after many tries, I finally got two local models to successfully complete an indexing run. Mind you, that doesn’t mean I can yet successfully query GraphRAG. But finishing the indexing run without issues is progress. That’s the good news.

Alas, the indexing run took around thirty-six hours to process nineteen relatively short Markdown files. To put that in perspective, the same indexing run using cloud-hosted models would likely have taken under ten minutes. My machine also ran at full throttle the whole time. (It’s the first time I’ve felt an M-series Mac get hot.)

The reduced processing speed isn’t just because the models themselves are slower: it’s also due to my machine’s limitations. After analyzing the log files, ChatGPT suggested reducing the number of concurrent API calls. The successful run specified just one call at a time for both models.
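To be clear, that setting just throttles how many requests are in flight at once. The sketch below illustrates the general idea, a semaphore serializing calls to the local model; it’s an illustration of the pattern, not GraphRAG’s actual internals:

```python
# Illustration of the throttling idea (not GraphRAG's internals): a semaphore
# keeps at most one request to the local model in flight at a time.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
limit = asyncio.Semaphore(1)  # one concurrent call; raise this to probe the machine's limits


async def extract_entities(chunk: str) -> str:
    async with limit:  # queue up here if another request is already running
        reply = await client.chat.completions.create(
            model="mistral-small3.1",
            messages=[{"role": "user", "content": f"List the entities in:\n{chunk}"}],
        )
        return reply.choices[0].message.content


async def main() -> None:
    chunks = ["note one…", "note two…", "note three…"]
    print(await asyncio.gather(*(extract_entities(c) for c in chunks)))


asyncio.run(main())
```

Serializing the calls is what finally let the run finish; it’s also part of why it took a day and a half.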

The upshot is that even though the indexing run finished successfully, the process is impractical for real-world use. At roughly two hours per file, indexing the thousands of Markdown files in my PKM would take months. ChatGPT keeps suggesting further tweaks, but progress is frustratingly slow when cycles are measured in days.

I’ve considered upgrading to a MacBook Pro with more RAM, or increasing the number of concurrent processes to find my machine’s upper limit. But based on these results, I suspect the improvements would be marginal given the amount of data I’m looking to process.

So that’s the bad news. For now, I’ll keep working with local models for other uses, such as OCRing handwritten notes, the workflow I alluded to above. (More on that soon!) And of course, I’ll continue experimenting with cloud-based models for other use cases. In any case, I’ll share what I learn here.