Retrospective

Running Local LLMs on a Mac Studio Taught Me How to Use Slow Models Powerfully

Anonymous
10 min read

Introduction

Stories about running LLMs locally usually start with installation. Which runtime to use, which model to download, which port to open.

But once you live with them for a while, a more important question shows up: how do you make large local LLMs genuinely useful? Getting them to run and building a workflow around them are completely different problems.

I started with Ubuntu + vLLM + dual 3090 Ti. Now I have moved to a Mac Studio M3 Ultra (512 GB) and LM Studio, where I run large DeepSeek, Qwen, and GLM family models locally.

This is not an installation guide. I want to briefly explain why I ran into a ceiling with two 3090s, why I moved to a Mac Studio, and how I actually put slow local LLMs to work.

I Started with Two 3090 Ti Cards

The initial setup was fairly simple. I bought two used 3090 Ti GPUs for about KRW 2.2 million, put them on Ubuntu, and used vLLM as the runtime.

At the time, Gemma 3 was one of the local models that felt genuinely good to use. If all you wanted was an assistant for simple instructions or a small workflow, it was impressive enough. Its multilingual performance was decent, and its token generation speed felt satisfying by the standards of the time.

I built and tested several assistants in that environment.

  • A chat agent
  • A tarot card reader
  • A translation-oriented assistant
  • A simple workflow runner

The conclusion at that stage was clear. Local LLMs were no longer at the toy stage. But that still did not mean they were ready to support meaningful agents without major compromises.

The Real Problem Was Hardware, Not Software

At first, I chose vLLM for a simple reason. It was relatively easy to set up, and it was much more widely known than SGLang.

Over time, though, I realized that the real bottleneck in running multi-GPU 3090 systems at home had much less to do with runtime choice and much more to do with practical constraints.

  • The electricity bill was painful.
  • The heat output was excessive.
  • Running it continuously at home was hard.
  • Most importantly, the large models I actually wanted were still out of reach.

What I wanted at the time was to handle a DeepSeek-V3-class model locally with enough context to be useful. Two 3090s clearly hit a wall there. Running small models well and turning large models into reliable working assets are completely different things.

In the end, I sold the GPUs and bought a Mac Studio M3 Ultra (512 GB). That was not just a hardware swap. It changed the way I think about local LLMs altogether.

What Became Possible After Moving to a Mac Studio

After switching to macOS + LM Studio, the biggest change was that models I previously could not seriously consider became realistic options.

Once I could load large models like DeepSeek, GLM, and Qwen3-235B-A22B directly and inspect the results, I was honestly surprised by how strong they were. Their reasoning quality was better than I expected. Even with moderately complex instructions, they often produced outputs that were much more precise and stable than what I had seen before.

At the same time, the limitations also became sharper. Prompt processing (pp) performance was slower than I expected, and the setup was clearly not a good match for use cases that need immediate responses, like real-time chat. I had already read that it would feel slow, but once you use it every day, that limit becomes much more obvious.

That is where I changed my standard. Instead of asking "Is this model fast?", I started asking "Where does this model actually belong?"

The Work That Still Felt Great, Even When It Was Slow

Interestingly, once I stopped expecting real-time responses, local LLMs became much more useful.

The areas where I was happiest looked very similar.

  • Work like data analysis, where the result needs to be careful and precise
  • Work like game story writing, where long runtimes are acceptable
  • Complex analytical workflows with many steps
  • OpenHands-style automated development work that can run overnight instead of a person

For example, generating a single story can take anywhere from 40 minutes to well over an hour. By real-time chat standards, that is painfully slow. But if you change the pattern to queue up three to five jobs before going to sleep and collect the results in the morning, the entire equation changes.

The process is slow, but the outputs are much larger and more refined. You do not spend the night waiting. You review the results the next morning, then queue up the next batch.
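The queue-before-sleep pattern is simple enough to sketch. The snippet below is a minimal Python illustration, not code from my actual service; `run_overnight_batch` and the injected `generate` callable are hypothetical names, with the model call kept behind `generate` so the queue logic stays backend-agnostic.

```python
import json
import time
from pathlib import Path

def run_overnight_batch(jobs, generate, out_dir="overnight_results"):
    """Run each queued job sequentially and save results to disk.

    `generate` is whatever calls your local model (for example an
    OpenAI-compatible chat completion against LM Studio); it is
    injected here so the queue logic does not care about the backend.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    results = []
    for i, prompt in enumerate(jobs):
        started = time.time()
        text = generate(prompt)  # a single job may take 40+ minutes
        record = {
            "job": i,
            "prompt": prompt,
            "seconds": round(time.time() - started, 1),
            "output": text,
        }
        # One file per job, so partial progress survives a crash.
        (out / f"job_{i}.json").write_text(json.dumps(record, ensure_ascii=False))
        results.append(record)
    return results
```

Queue three to five prompts before going to sleep, and in the morning the `job_*.json` files are waiting for review.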

OpenHands-type work felt similar. Instead of using it like an instantly responsive copilot, it fit much better as a queued automation worker. From that angle, a local LLM is not really a slow chatbot. It is closer to a quiet worker that keeps going for hours.

It Does Not Have to Run Like a Rabbit to Win Like a Tortoise

When I think about local LLMs, I often come back to the story of the tortoise and the hare.

Cloud LLMs are usually closer to the hare. They are fast, responsive, and easy to use on demand. Large local models are closer to the tortoise. They are slow, they feel frustrating if you optimize for immediacy, and they disappoint quickly if you compare them on the wrong axis.

But once you change the evaluation criteria, the result changes too.

If the contest is real-time chat, local LLMs often lose. But if you structure the work around queues, repeated jobs, batch execution, and overnight processing, where consistency wins, the picture changes. That is where I got the highest satisfaction. The power draw was also far more stable than my multi-GPU setup. In practice, it felt like I could run larger models at well under one-seventh of the power consumption of the old GPU rig.

In the end, the important thing was not model speed. It was how I designed the rhythm of my work.

My Own Code Eventually Shifted to a Queue-Centered Design

This is not just an impression. It shows up directly in the implementation.

If you look at my ai-service module, it is pretty clear that I treat local LLMs less like real-time chat and more like background workers.

The model connection itself is simpler than it sounds. In RunnableAiModel, I treat vLLM and LM Studio as OpenAI-compatible providers and make them swappable by changing only the baseUrl. From the application's perspective, if the backend exposes an OpenAI-compatible endpoint, a local backend can be plugged in fairly naturally.
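The swap-by-baseUrl idea can be sketched in a few lines. This is a hedged Python illustration rather than the actual RunnableAiModel code; `client_config` is a name I made up, and the ports reflect the common defaults for vLLM and LM Studio's local servers, so adjust them to however your runtimes are launched.

```python
# Default local endpoints; change these to match your own setup.
PROVIDERS = {
    "vllm":      "http://localhost:8000/v1",   # vLLM's OpenAI-compatible server
    "lm_studio": "http://localhost:1234/v1",   # LM Studio's local server
}

def client_config(provider: str, model: str) -> dict:
    """Build kwargs for an OpenAI-compatible client.

    Swapping backends only changes base_url; the calling code
    (chat completions, streaming, and so on) stays identical.
    """
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    return {
        "base_url": PROVIDERS[provider],
        "api_key": "not-needed-locally",  # local servers typically ignore the key
        "model": model,
    }
```

With a config like this, the rest of the service never needs to know which runtime is behind the endpoint.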

There is also a separate usage example for LM Studio. LMStudioVisionClient is implemented against LM Studio's /v1 endpoint for vision models. In other words, I wired local VLM support in a way that can extend all the way to image analysis.
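For illustration, the message shape a vision call needs looks roughly like this. This is a sketch, not the actual LMStudioVisionClient; `vision_message` is a hypothetical helper, and the payload follows the OpenAI-style image_url content format that LM Studio's /v1 chat completions endpoint accepts for vision-capable models.

```python
import base64

def vision_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build one OpenAI-style chat message that attaches an image.

    The image is inlined as a base64 data URL alongside the text,
    which is how OpenAI-compatible vision endpoints expect it.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }
```

The resulting dict goes straight into the `messages` list of a chat completion request against the local /v1 endpoint.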

The more important part is the execution model.

AiAgentExecutor does not run a workflow immediately. It registers it as a DTE job in a queue. LoopJobService and LoopWorkflowExecutor are designed to process long-running repeated tasks in the background, and they include checkpoint and recovery flows. WritingToolExecutor, the multi-agent workflow, and the marketing generation pipeline also lean much more toward asynchronous execution than instant-response interaction.

That is not accidental. Instead of forcing slow models into a real-time interface, it is much more practical to send latency-tolerant work through a queue.
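The checkpoint-and-recovery idea behind those long-running loops can also be sketched briefly. This is not the actual LoopJobService code; `run_with_checkpoints` and its file format are hypothetical, shown only to illustrate how a multi-hour job can survive a restart.

```python
import json
from pathlib import Path

def run_with_checkpoints(steps, do_step, ckpt_path="loop.ckpt.json"):
    """Execute `steps` in order, persisting progress after each one.

    If the process dies mid-run, calling this again skips the steps
    already recorded in the checkpoint file and resumes from there.
    """
    ckpt = Path(ckpt_path)
    done = json.loads(ckpt.read_text()) if ckpt.exists() else []
    for step in steps:
        if step in done:
            continue                       # completed in a prior run
        do_step(step)                      # the slow model call lives here
        done.append(step)
        ckpt.write_text(json.dumps(done))  # checkpoint after every step
    return done
```

For overnight workloads this is the difference between losing a night's work to one crash and losing, at worst, a single step.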

In short, this is how I now understand local LLMs.

  • They fit work queues better than chat response loops.
  • They fit tasks where difficulty and refinement matter more than raw speed.
  • They work especially well with the pattern of running overnight and reviewing in the morning.
  • If they expose an OpenAI-compatible API, they are not hard to integrate into an existing service structure.

So Who Are Local LLMs Actually For?

If you have read this far, you are probably asking the same question I asked: are local LLMs truly useful, and are they worth serious money?

My answer is fairly clear.

Local LLMs are a strong fit for people like this.

  • People who do more batch work than real-time chat
  • People who want complex generation and analysis jobs to run overnight
  • People who want continued access to large models with lower operating costs
  • People who can redesign their workflow around queues

On the other hand, if the most important thing for you is an immediate conversational experience, local LLMs will probably feel frustrating at first. In that case, choosing one is likely a decision built on the wrong expectations.

Closing

Running large local LLMs tends to move from installation success to utilization failure faster than people expect. Just because a model loads does not mean it becomes useful right away.

I started with dual 3090 Ti + Ubuntu + vLLM, and now I am on a Mac Studio M3 Ultra (512 GB) with LM Studio. After going through that path, my conclusion is simple.

For individuals and small teams rather than enterprise-scale operations, local LLMs shine far more as slow but powerful workers than as replacements for fast chat.

If you already have a Mac, and especially if you have something in the Mac Studio class, I think it is worth trying local LLMs this way at least once. You do not need to start with anything grand. Skip real-time chat for now, and just pick one difficult task to drop into tonight's queue.

That is where more possibilities begin than most people expect.