Getting ready for local inference

28 June 2026

Recently I’ve been experimenting with local LLMs for coding. Up until a couple of months ago I was a happy GitHub Copilot subscriber. I’d been quite satisfied to let other companies take on the cost of buying and managing hardware, particularly while everything is moving so quickly in the field. Local inference would be ideal for privacy/security/reliability reasons but it’s not a hard requirement for the work I tend to do. Copilot was particularly nice because I could bounce freely between OpenAI’s and Anthropic’s models depending on what felt smarter at the time.

It was around then we started seeing big hikes in request cost multipliers, and finally the announcement that Copilot would be moving to usage-based billing. Up to this point I had a distinct feeling that my Copilot bill was oddly good value but I hadn’t realised the degree to which Microsoft had been subsidising the old request-based scheme. The new costs were not unreasonable but they were high, to the point that I was motivated to find a different way to fulfil my AI needs.

Please don’t misunderstand me at this juncture: I’m not suggesting that inference isn’t an economically viable service, or that the bubble’s going to pop any day now. It’s simply become clear that I’m not the target customer for the kind of inference that GitHub Copilot is selling. Today you buy your LLMs through Copilot for the same reason you use Azure. It’s because you’re already up to your neck in Microsoft contracts and it’s easier, or because you need the enterprise data security guarantees, or their filter protecting against accidentally regurgitating copyrighted material. There are plenty of customers for whom it’s not bonkers to pay the Copilot premium. They are not me, and that’s fine.

For now, resolving this has been easier than expected. The clients for whom I’m doing a significant amount of coding are providing or funding AI services for their staff anyway, and all I had to do was ask. It turns out that if you think AI is a force multiplier for your employees, extending that to your contractors is a no-brainer.

Before that happened, though, I was already in the market for a new work Mac because I needed higher specs. I’d also been watching antirez post about his project DwarfStar, in which he gets a pretty serious MoE open-weights model running on Macs with 96 or 128 GB of RAM through clever quantisation. I wanted a hedge against token prices going crazy, and the best VRAM I had locally was an RTX 3060 with 12 GB. Not great. So, I made the call to buy the MacBook with the totally unreasonable 128 GB RAM configuration. Luckily for me, I did so before it became 33% more expensive the other day.

My initial report: I’ve had considerable success using ds4 with OpenCode, running the q2-q4 imatrix variant of DeepSeek v4 Flash. The model is neither brilliant nor fast, but it’s pretty good and pretty reliable at tool calls. I’ve used it for all kinds of tasks, from writing Rust, to sysadmin over SSH, to general Q&A. It feels a lot like when I was first using Sonnet 3.5 last year. Now it’s permanently available on my machine at marginal energy cost. That’s quite nice.

It really is a wild time. I was a kid in the 90s and 00s when ever-new CPUs were coming out with greater MHz and games changed year on year as video cards (as we called them then) leapt forward in capability. LLMs are the first tech I’ve seen since then where there’s such a rapid pace of development. If I look at my local ds4 setup today, in one sense it’s mildly disappointing. GPT-5.5 and Opus 4.8 run rings around it, especially on the fast and expensive rigs OpenAI and Anthropic use for inference. But if you’d shown this to me 3 years ago I’d have said, “dude, this is fricking incredible!” In this kind of timeline it’s hard to keep perspective.

However, life is short. Every second you spend staring at a screen waiting is a second you’re not going to get back. Nowhere is this felt more keenly than when you’re waiting for a response from your local AI, and you know that Claude would already have finished 30 seconds ago. Oh, the dependability and privacy, but at what cost? I’m neither able nor willing to spend five figures on the kind of GPU setup that would let me run a good open-weights model at the context size and token rate offered by the cloud vendors. It’s a bit different from the typical cloud vs on-prem deliberation: yes, cloud is better for casual use, but cloud is also better for heavy use. If a situation required it I could justify a beefy Dell PowerEdge; I could not justify the kind of hardware they’re using for these workloads.

Until there are breakthroughs in inference performance, we have to assume that a good model and context will be relatively slow when run on local equipment. This calls for a different way of working. I really want some kind of framework that I haven’t found yet: something in the spirit of Gas Town but optimised for running locally, where workers are slow and single-threaded. I want to be able to chat interactively about the requirements for my project, report bugs, request features, and manipulate what’s being worked on, while leaving it to background processes to actually work on the code, raise PRs or some equivalent, perform quality control, and leave neat work for me to review when I’m ready.

This isn’t because I’m eager to relinquish control of my code and go full vibes. It’s because talking about code blow-by-blow with a local AI is slow and not a good use of my time on this planet. It’s better if the AI takes on larger chunks of work and I get to interact with it less frequently and more efficiently. I just need some sort of thoughtful scheduling engine and CLI for doing this. I haven’t found it yet. In principle I could try to build it but I don’t want to. There must be thousands of other developers yearning for exactly this thing, and I don’t have a lot of free time. It’s my explicit plan to let somebody else figure this out, then use their tool when it’s done. If you see it then please drop me an email.
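For what it’s worth, the core of such a scheduler is small, even if the thoughtful parts are not. The following is only a rough sketch of the shape, not an existing tool: one background thread stands in for the slow, single-threaded local model, requests are queued with a priority, and finished work is parked in a list until a human is ready to review it. Every name here (`SlowWorkerScheduler`, `run_task`, `review_queue`) is hypothetical, and `run_task` is a placeholder for whatever would actually drive the model and produce a branch or PR.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class Task:
    # Lower number = more urgent; seq keeps equal-priority tasks in FIFO order.
    priority: int
    seq: int
    description: str = field(compare=False)


class SlowWorkerScheduler:
    """Queue work for a single slow worker and collect results for later review."""

    def __init__(self, run_task: Callable[[str], str]):
        # run_task is a hypothetical hook: it would call the local model,
        # edit code, run checks, and return a summary of what it did.
        self._run_task = run_task
        self._pending: queue.PriorityQueue = queue.PriorityQueue()
        self._seq = 0
        self._lock = threading.Lock()
        self.review_queue: list[tuple[str, str]] = []

    def submit(self, description: str, priority: int = 10) -> None:
        """Called from the interactive side: report a bug, request a feature."""
        with self._lock:
            self._seq += 1
            seq = self._seq
        self._pending.put(Task(priority, seq, description))

    def start(self) -> None:
        """Start the one background worker; the model only does one thing at a time."""
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self) -> None:
        while True:
            task = self._pending.get()  # blocks until there is work
            result = self._run_task(task.description)
            with self._lock:
                self.review_queue.append((task.description, result))
            self._pending.task_done()

    def wait_idle(self) -> None:
        """Block until every submitted task has been processed."""
        self._pending.join()
```

In use, you would call `submit()` whenever something occurs to you, leave `start()` running in the background, and read `review_queue` when you have time. Everything that makes the real tool hard, such as persistence, quality control, and reprioritising work already in flight, is deliberately left out.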

Until then, ds4 and OpenCode are pretty neat. If there were an AI or AI-pricing apocalypse tomorrow, I might not be able to work the same way I do with codex/claude, but I have plenty of local capability I could draw on. And even if I’m never forced to rely on it for economic wellbeing, at least I’ll have plenty of RAM for Docker and Electron apps. So far, even without any AI workloads, I’ve reached at least 54 GB RAM in use. Isn’t modern computing grand?


Serious Computer Business Blog by Thomas Karpiniec