April 28, 2026 · 12 min read

The Privilege Stack: How to Run Legal AI on Your Own Hardware

Two months ago, a federal judge ruled that documents drafted with consumer Claude are not protected by attorney-client privilege. Six months before that, Anthropic changed its consumer policy so that Pro and Max accounts are opted in to training by default. The math has shifted. For absolute privilege protection, no third-party AI processor is acceptable. This is the deployment guide for running Claude-quality legal AI entirely on your firm's own hardware.

Why even Claude Enterprise might not be enough

On February 10, 2026, Judge Jed Rakoff of the Southern District of New York ruled from the bench that documents Bradley Heppner created using Anthropic's Claude consumer tool — drafts of his defense strategy, prepared after he had been served a grand jury subpoena and engaged counsel — were not protected by attorney-client privilege or the work product doctrine. The memorandum opinion followed on February 17.

The court's reasoning, in three parts:

  • Claude is not a licensed attorney. No fiduciary attorney-client relationship can form between a defendant and a generative AI tool.
  • The communications were not confidential. Anthropic's privacy policy explicitly notifies consumer users that data may be used for training and disclosed to third parties. That destroys any reasonable expectation of confidentiality.
  • The communications were not made for legal advice. Heppner initiated them on his own. They didn't reflect counsel's mental processes or strategy.

The court left an explicit door open in dicta:

"Whether privilege is protected within a closed, enterprise-grade AI system, where counsel directs the use of AI platforms and/or where inputs and outputs are not used to train models, remains an open question."

Open question. Not yes. Not no.

Consumer Claude is now off the table for privileged work — that part is settled. Claude Enterprise and Claude API with zero data retention are likely defensible, but no court has yet ruled. A firm relying on Enterprise is betting that a future SDNY (or a different circuit) will agree that Anthropic's contractual no-training, no-retention promises are enough. That's a reasonable bet for general work. It's not a bet you want to make on capital criminal defense, FOIA-sensitive litigation, or work where opposing counsel will challenge every piece of metadata.

Self-hosted inference forecloses the question entirely. No third-party processor, no policy to argue over, no terms of service to scrutinize. The data never leaves the firm.

What "no third-party processor" actually means

It means three things, in order:

  1. The model runs on hardware your firm owns or leases. Mac Studio in your office. Linux box in a closet. Server in a rack you control. Not on Anthropic's servers, not on AWS Bedrock, not on Azure OpenAI.
  2. The model weights are open. You downloaded them once from Hugging Face or Meta's site, they sit on local storage, and inference happens against your local copy. No API calls go anywhere during a chat.
  3. The connector and the data layer also run locally. Our open-source Clio MCP connector is stdio-only by design — there is no relay server, no cloud component. Clio's API is the one network endpoint, and you authenticate to it with your own OAuth credentials.

If you can disconnect the Wi-Fi after the model is downloaded and Clio is authenticated, and the AI assistant still works for queries that don't need fresh Clio data, you have a true self-hosted stack. The headline test for the install we walk through below is exactly this: run tcpdump during a session and confirm zero outbound traffic to AI provider domains.
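
One way to run that disconnect test concretely on macOS (a sketch; the Wi-Fi device is usually en0, but confirm yours with networksetup -listallhardwareports):

# Cut the network, then ask the local model something that doesn't need fresh Clio data
networksetup -setairportpower en0 off
# ... chat in LM Studio; responses should stream normally ...
networksetup -setairportpower en0 on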

Hardware reality check

Running a 70-billion-parameter model is the entry point for Claude-quality output. At 4-bit quantization, a 70B model needs roughly 40-45 GB of memory just to hold the weights, plus working memory for the KV cache and OS overhead. That puts the realistic minimum at 64 GB of unified memory on Apple Silicon, or a discrete GPU with 32 GB+ of VRAM. Most firms will land on one of two configurations.
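
The arithmetic behind that 40-45 GB figure, as a back-of-envelope sketch (Q4_K_M quantization averages roughly 4.8 bits per weight):

# Weights alone for a 70B model at ~4.8 bits per weight
awk 'BEGIN { printf "%.0f GB\n", 70e9 * 4.8 / 8 / 1e9 }'   # prints: 42 GB
# The KV cache for a long conversation plus OS overhead add several more GB,
# which is why 64 GB is the practical floor.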

For solo lawyers and small firms: Mac Studio M4 Max, 128 GB unified memory

Apple's M4 Max chip with 128 GB of unified memory ($5,000-$7,000 depending on storage) is the current sweet spot for self-hosted legal AI. The unified memory architecture means CPU, GPU, and Neural Engine share one memory pool. A 70B model fits comfortably with room for a long conversation context. Real-world benchmarks put Llama 4 70B at 10-15 tokens per second on this hardware, fast enough that responses feel like a consumer chat experience.

One practical note: as of April 2026, Apple is in the middle of a DRAM shortage. The 512 GB option was eliminated in March, and high-memory configs frequently show as out of stock. If a 128 GB Mac Studio is unavailable, the M3 Ultra at $3,999 (96 GB starting) is the fallback.

For firm-wide multi-user deployments: M3 Ultra or Linux + RTX 5090

If multiple attorneys need concurrent access, you have two paths:

  • Mac Studio M3 Ultra with 128 GB+ unified memory. Higher memory bandwidth (the M3 Ultra is faster than the M4 Max for inference, despite being a generation older), supports more concurrent users, and exposing the model server to the rest of the office over macOS networking is straightforward.
  • Linux server with an RTX 5090 (32 GB VRAM) and 64 GB system RAM. More raw compute, but VRAM is the constraint: models that exceed 32 GB after quantization need techniques like CPU offloading, which slows inference (see the sketch after this list). Better suited to running smaller models very fast (e.g., Llama 4 Scout, the MoE variant, runs at 17B active parameters per token).
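
For the Linux path, this is roughly what partial offload looks like with llama.cpp. A sketch, not a turnkey command: the model filename is illustrative, and the right -ngl value depends on how many layers actually fit in 32 GB of VRAM.

# -ngl: how many transformer layers to place on the GPU; raise it until VRAM is full
# -c: context window; the KV cache also consumes VRAM, so budget for it
./llama-cli -m ./models/llama-4-70b-instruct-q4_k_m.gguf -ngl 48 -c 8192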

Either path costs roughly $5,000-$10,000 depending on configuration. Compare that to Harvey AI at $1,200 per attorney per month: a 10-attorney firm spends $12,000 a month on subscriptions, so the hardware pays for itself inside the first month.

Model options for legal work

The open-weight model landscape changed a lot in 2025. Three models are realistic candidates for legal work as of April 2026:

Model | License | Best for | Hardware fit
----- | ------- | -------- | ------------
Llama 4 70B | Llama 4 Community License | General versatility, drafting, summarization. Easiest to install; most stable for tool calling. | Mac Studio M4 Max, 128 GB
DeepSeek V4 Pro | MIT | Reasoning, chain-of-thought, complex contract analysis. Top BenchLM score. | Mac Studio M3 Ultra, 256 GB (heavier)
Llama 4 Scout (MoE) | Llama 4 Community License | Fast, responsive chat. 109B parameters total but only 17B active per token, so much faster. | Mac Studio M4 Max 64 GB or RTX 5090

For most firms starting out, Llama 4 70B is the right primary choice. It's the best-documented model in the LM Studio and Ollama ecosystems, has the most stable tool-calling behavior, and the license terms are permissive enough for firm use. Once the install is working, swap in DeepSeek V4 Pro for reasoning-heavy queries (multi-document contract analysis, chain-of-thought drafting) and keep Llama 4 70B as the default.

For multilingual practices, Mistral Large is the strongest open-weight option for non-English work. For coding-heavy legal tech work (drafting compliance scripts, automating exports), GLM-5 from Zhipu AI scored 77.8% on SWE-Bench Verified and is licensed under Apache 2.0.

MCP clients that work with local models

The Model Context Protocol (MCP) is what makes this stack possible. MCP is an open standard from Anthropic for connecting AI models to external data sources — it doesn't lock you to Claude. Any MCP-compatible client can call our open-source Clio MCP connector, regardless of which model is running.

Two paths work today.

Recommended primary: LM Studio (native MCP)

LM Studio has shipped native MCP support since v0.3.17 (July 2025). The April 2026 release (v0.4.11) added OAuth support and improved tool-call reliability for newer models including Gemma 4. LM Studio acts as both an MCP client (consuming external MCP servers like our Clio connector) and an MCP server (exposing the local model to other tools).
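
A side benefit of that server role: LM Studio exposes an OpenAI-compatible API locally (default port 1234, switched on from the app's Developer tab in recent releases), which makes the same model scriptable from the command line. A quick smoke test once the server is running:

curl -s http://127.0.0.1:1234/v1/models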

The configuration format is the same mcp.json notation that Cursor and Claude Desktop use. Tool-call confirmation is built in: when Llama 4 wants to call our list_matters tool, LM Studio shows the user a confirmation dialog with the proposed call so they can inspect, approve, or deny.

Secondary path: Continue.dev + Ollama + bridge

If you'd rather use Ollama as the local inference runtime (it's lighter-weight than LM Studio and runs as a daemon), you'll need a bridge. Ollama doesn't natively speak MCP yet (issue #7865 on the Ollama repo is still open as of April 2026), but the community has built several working bridges that translate between Ollama's API and MCP.

Pair one of these bridges with Continue.dev in VS Code or JetBrains, and you have a working stack. Continue.dev natively supports MCP servers and works with local Ollama backends.
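
For reference, a minimal Continue configuration pointing at a local Ollama model. This sketch uses Continue's JSON config format with an illustrative model tag (llama4:70b); newer Continue releases use a YAML config, so check the docs for your installed version:

{
  "models": [
    {
      "title": "Llama 4 70B (local)",
      "provider": "ollama",
      "model": "llama4:70b",
      "apiBase": "http://127.0.0.1:11434"
    }
  ]
}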

Step-by-step deployment (LM Studio path)

This is the cleanest install. Plan 30-45 minutes the first time, mostly for the model download.

1. Install LM Studio

Download LM Studio for macOS or Windows from lmstudio.ai/download. Open the app. The first run downloads the runtime — about 200 MB.

2. Download Llama 4 70B

Inside LM Studio, go to the search tab and look for Llama-4-70B-Instruct-Q4_K_M (the 4-bit quantized version, ~40 GB download). On a fast connection this takes 30-60 minutes. The model lives in ~/.cache/lm-studio/models/ after download.
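
To confirm the weights landed where LM Studio expects them:

ls -lh ~/.cache/lm-studio/models/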

3. Install our Clio MCP connector

From any terminal:

npm install -g @oktopeak/clio-mcp

This installs version 1.0.1 from the official MCP Registry. Verify with clio-mcp --version.

4. Register a Clio Developer App

Log in at developers.clio.com. Create a new application:

  • Name: Local Legal AI
  • Redirect URI: http://127.0.0.1:5678/callback (use 127.0.0.1, not localhost — Clio rejects localhost)
  • Permissions: read on Activities, Billing, Calendars, Contacts, Documents, Users; read+write on Matters and Tasks

Save the Client ID and Client Secret. Generate a 64-character encryption key:

node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"

5. Configure LM Studio's mcp.json

Open LM Studio's settings, find the MCP configuration. Add this entry:

{
  "mcpServers": {
    "clio": {
      "command": "npx",
      "args": ["-y", "@oktopeak/clio-mcp"],
      "env": {
        "CLIO_CLIENT_ID": "your-client-id",
        "CLIO_CLIENT_SECRET": "your-client-secret",
        "ENCRYPTION_KEY": "your-64-char-hex-key",
        "CLIO_REGION": "us"
      }
    }
  }
}

Restart LM Studio. The Clio tools should appear in the tool list. The first time you use a Clio tool, your browser opens to Clio's OAuth page — log in, authorize, and the tokens are encrypted and stored at ~/.clio-mcp/tokens.enc.
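
You can confirm the encrypted token file was created:

ls -l ~/.clio-mcp/tokens.enc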

6. Validate

Open a chat with Llama 4 70B in LM Studio. Type:

"List my open matters."

The model should call our list_matters tool, you'll see the confirmation dialog, approve it, and the response should come back with the matters from your Clio account. The audit log at ~/.clio-mcp/audit.log will have a new line documenting the call.
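
To see that audit entry for yourself:

tail -n 1 ~/.clio-mcp/audit.log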

Validation: confirming zero outbound leakage

The headline claim of this stack is that no third party ever sees your Clio data. Verify it.

macOS: Little Snitch

Little Snitch shows every outbound network connection in real time. During a representative session, you should see:

  • Allowed: connections to app.clio.com and eu.app.clio.com (the Clio API)
  • Allowed: connections to 127.0.0.1 (the OAuth callback during initial auth)
  • Should never appear: *.anthropic.com, api.openai.com, *.azure.com, *.aws.amazon.com, or any other AI provider domain
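
If you want a command-line spot check alongside Little Snitch, lsof lists currently established connections. It's a point-in-time snapshot, not a continuous log, and it prints raw IPs, so compare them against Clio's published addresses rather than grepping for hostnames:

sudo lsof -nP -iTCP -sTCP:ESTABLISHED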

Linux: tcpdump

Run during a session:

# tcpdump's -n flag prints raw IPs, so a hostname grep would never match anything;
# resolve Clio's address first and exclude it in the capture filter instead
CLIO_IP=$(dig +short app.clio.com | tail -n 1)
sudo tcpdump -n -i any "port 443 and not host ${CLIO_IP}"

Clio's API may sit behind several load-balanced addresses, so re-run the lookup and add a "not host" clause for each one you see. If any other HTTPS traffic shows up during a Llama-driven Clio query, you have a leak. Investigate.

When to use this stack vs Claude Enterprise

Honest comparison. Both are defensible. Pick based on the matter, not the marketing.

Factor | Privilege Stack (self-hosted) | Claude Enterprise
------ | ----------------------------- | -----------------
Privilege | Strongest (no third-party processor) | Open question per US v. Heppner dicta
Model quality | Llama 4 70B / DeepSeek V4, within a few points of Claude on legal benchmarks | Claude 3.5 Sonnet / Opus, frontier quality on complex reasoning
Speed | 10-15 tok/s on Mac Studio M4 Max | ~30-50 tok/s
Cost (10-attorney firm) | $5K-$10K hardware, once; $0/month after | $300-$600/seat/month, ~$36K-$72K/year
Long context | 128K tokens (Llama 4); some models reach 200K | 200K tokens; can process full case files
Best for | Criminal defense, FOIA-sensitive work, regulated jurisdictions, paranoia | Multi-jurisdictional research, large-context contract review

Many firms run both. Privilege stack for the most sensitive work, Claude Enterprise for general productivity. Our open-source connector works identically with both — same tools, same audit log format. The only thing that changes is which client you point it at.

Beyond Clio

The same stack works with any MCP-capable connector. As the open-source MCP ecosystem matures, expect to see practice management, document management, and research tools all connectable to your local model the same way Clio is. We're tracking iManage, MyCase, and PracticePanther for our next open-source releases. The investment in hardware and the local model carries forward — only the connector at the data layer changes.

Need help deploying this for your firm?

We deploy privilege-safe AI stacks for firms that need them. Hardware spec, model selection, install, validation, training. 30 minutes with a co-founder. No pitch.

See Our Legal AI Integration Service →

Sources verified April 28, 2026: Harvard Law Review on US v. Heppner, Gibson Dunn alert, Anthropic Sept 2025 consumer policy update, Anthropic privacy center, LM Studio MCP docs, Ollama MCP issue tracker, BenchLM April 2026 model rankings, Compute Market hardware guide. We will update this post when LM Studio publishes more comprehensive stdio configuration documentation and when we have benchmarked Llama 4 70B + Clio MCP end-to-end on a clean Mac Studio install. The deployment instructions above are based on documented behavior; verifying them before relying on them in production work is the responsibility of each firm's IT.
