April 28, 2026 · 12 min read

The Privilege Stack: How to Run Legal AI on Your Own Hardware

Two months ago, a federal judge ruled that documents drafted with consumer Claude are not protected by attorney-client privilege. Six months before that, Anthropic changed its consumer policy so that Pro and Max accounts are opted in to training by default. The math has shifted. For absolute privilege protection, no third-party AI processor is acceptable. This is the deployment guide for running Claude-quality legal AI entirely on your firm's own hardware.

Why even Claude Enterprise might not be enough

On February 10, 2026, Judge Jed Rakoff of the Southern District of New York ruled from the bench that documents Bradley Heppner created using Anthropic's Claude consumer tool — drafts of his defense strategy, prepared after he had been served a grand jury subpoena and engaged counsel — were not protected by attorney-client privilege or the work product doctrine. The memorandum opinion followed on February 17.

The court's reasoning, in three parts:

  • Claude is not a licensed attorney. No fiduciary attorney-client relationship can form between a defendant and a generative AI tool.
  • The communications were not confidential. Anthropic's privacy policy explicitly notifies consumer users that data may be used for training and disclosed to third parties. That destroys any reasonable expectation of confidentiality.
  • The communications were not made for legal advice. Heppner initiated them on his own. They didn't reflect counsel's mental processes or strategy.

The court left an explicit door open in dicta:

"Whether privilege is protected within a closed, enterprise-grade AI system, where counsel directs the use of AI platforms and/or where inputs and outputs are not used to train models, remains an open question."

Open question. Not yes. Not no.

Consumer Claude is now off the table for privileged work — that part is settled. Claude Enterprise and Claude API with zero data retention are likely defensible, but no court has yet ruled. A firm relying on Enterprise is betting that a future SDNY (or a different circuit) will agree that Anthropic's contractual no-training, no-retention promises are enough. That's a reasonable bet for general work. It's not a bet you want to make on capital criminal defense, FOIA-sensitive litigation, or work where opposing counsel will challenge every piece of metadata.

Self-hosted inference forecloses the question entirely. No third-party processor, no policy to argue over, no terms of service to scrutinize. The data never leaves the firm.

What "no third-party processor" actually means

It means three things, in order:

  1. The model runs on hardware your firm owns or leases. Mac Studio in your office. Linux box in a closet. Server in a rack you control. Not on Anthropic's servers, not on AWS Bedrock, not on Azure OpenAI.
  2. The model weights are open. You downloaded them once from Hugging Face or Meta's site, they sit on local storage, and inference happens against your local copy. No API calls go anywhere during a chat.
  3. The connector and the data layer also run locally. Our open-source Clio MCP connector is stdio-only by design — there is no relay server, no cloud component. Clio's API is the one network endpoint, and you authenticate to it with your own OAuth credentials.

If you can disconnect the Wi-Fi after the model is downloaded and Clio is authenticated, and the AI assistant still works for queries that don't need fresh Clio data, you have a true self-hosted stack. The headline test for the install we walk through below is exactly this: run tcpdump during a session and confirm zero outbound traffic to AI provider domains.
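
One way to run that disconnect test concretely on macOS (a sketch; the Wi-Fi device is usually en0, but confirm yours with networksetup -listallhardwareports):

# Cut the network, then ask the local model something that doesn't need fresh Clio data
networksetup -setairportpower en0 off
# ... chat in LM Studio; responses should stream normally ...
networksetup -setairportpower en0 on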

Hardware reality check

Running a 70-billion-parameter model is the entry point for Claude-quality output. At 4-bit quantization, a 70B model needs roughly 40-45 GB of memory just to hold the weights, plus working memory for the KV cache and OS overhead. That puts the realistic minimum at 64 GB of unified memory on Apple Silicon, or a discrete GPU with 32 GB+ of VRAM. Most firms will land on one of two configurations.
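
The arithmetic behind that 40-45 GB figure, as a back-of-envelope sketch (Q4_K_M quantization averages roughly 4.8 bits per weight):

# Weights alone for a 70B model at ~4.8 bits per weight
awk 'BEGIN { printf "%.0f GB\n", 70e9 * 4.8 / 8 / 1e9 }'   # prints: 42 GB
# The KV cache for a long conversation plus OS overhead add several more GB,
# which is why 64 GB is the practical floor.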

For solo lawyers and small firms: Mac Studio M4 Max, 128 GB unified memory

Apple's M4 Max chip with 128 GB of unified memory ($5,000-$7,000 depending on storage) is the current sweet spot for self-hosted legal AI. The unified memory architecture means CPU, GPU, and Neural Engine share one memory pool. A 70B model fits comfortably with room for a long conversation context. Real-world benchmarks put Llama 4 70B at 10-15 tokens per second on this hardware, fast enough that responses feel like a consumer chat experience.

One practical note: as of April 2026, Apple is in the middle of a DRAM shortage. The 512 GB option was eliminated in March, and high-memory configs frequently show as out of stock. If a 128 GB Mac Studio is unavailable, the M3 Ultra at $3,999 (96 GB starting) is the fallback.

For firm-wide multi-user deployments: M3 Ultra or Linux + RTX 5090

If multiple attorneys need concurrent access, you have two paths:

  • Mac Studio M3 Ultra with 128 GB+ unified memory. Higher memory bandwidth (the M3 Ultra is faster than the M4 Max for inference, despite being a generation older), supports more concurrent users, and exposing the model server to the rest of the office over macOS networking is straightforward.
  • Linux server with an RTX 5090 (32 GB VRAM) and 64 GB system RAM. More raw compute, but VRAM is the constraint: models that exceed 32 GB after quantization need techniques like CPU offloading, which slows inference (see the sketch after this list). Better suited to running smaller models very fast (e.g., Llama 4 Scout, the MoE variant, runs at 17B active parameters per token).
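
For the Linux path, this is roughly what partial offload looks like with llama.cpp. A sketch, not a turnkey command: the model filename is illustrative, and the right -ngl value depends on how many layers actually fit in 32 GB of VRAM.

# -ngl: how many transformer layers to place on the GPU; raise it until VRAM is full
# -c: context window; the KV cache also consumes VRAM, so budget for it
./llama-cli -m ./models/llama-4-70b-instruct-q4_k_m.gguf -ngl 48 -c 8192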

Either path costs roughly $5,000-$10,000 depending on configuration. Compare that to Harvey AI at $1,200 per attorney per month: a 10-attorney firm spends $12,000 a month on subscriptions, so the hardware pays for itself inside the first month.

Model options for legal work

The open-weight model landscape changed a lot in 2025. Three models are realistic candidates for legal work as of April 2026:

Model | License | Best for | Hardware fit
----- | ------- | -------- | ------------
Llama 4 70B | Llama 4 Community License | General versatility, drafting, summarization. Easiest to install; most stable for tool calling. | Mac Studio M4 Max, 128 GB
DeepSeek V4 Pro | MIT | Reasoning, chain-of-thought, complex contract analysis. Top BenchLM score. | Mac Studio M3 Ultra, 256 GB (heavier)
Llama 4 Scout (MoE) | Llama 4 Community License | Fast, responsive chat. 109B parameters total but only 17B active per token, so much faster. | Mac Studio M4 Max 64 GB or RTX 5090

For most firms starting out, Llama 4 70B is the right primary choice. It's the best-documented model in the LM Studio and Ollama ecosystems, has the most stable tool-calling behavior, and the license terms are permissive enough for firm use. Once the install is working, swap in DeepSeek V4 Pro for reasoning-heavy queries (multi-document contract analysis, chain-of-thought drafting) and keep Llama 4 70B as the default.

For multilingual practices, Mistral Large is the strongest open-weight option for non-English work. For coding-heavy legal tech work (drafting compliance scripts, automating exports), GLM-5 from Zhipu AI scored 77.8% on SWE-Bench Verified and is licensed under Apache 2.0.

MCP clients that work with local models

The Model Context Protocol (MCP) is what makes this stack possible. MCP is an open standard from Anthropic for connecting AI models to external data sources — it doesn't lock you to Claude. Any MCP-compatible client can call our open-source Clio MCP connector, regardless of which model is running.

Two paths work today.

Recommended primary: LM Studio (native MCP)

LM Studio has shipped native MCP support since v0.3.17 (July 2025). The April 2026 release (v0.4.11) added OAuth support and improved tool-call reliability for newer models including Gemma 4. LM Studio acts as both an MCP client (consuming external MCP servers like our Clio connector) and an MCP server (exposing the local model to other tools).
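
A side benefit of that server role: LM Studio exposes an OpenAI-compatible API locally (default port 1234, switched on from the app's Developer tab in recent releases), which makes the same model scriptable from the command line. A quick smoke test once the server is running:

curl -s http://127.0.0.1:1234/v1/models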

The configuration format is the same mcp.json notation that Cursor and Claude Desktop use. Tool-call confirmation is built in: when Llama 4 wants to call our list_matters tool, LM Studio shows the user a confirmation dialog with the proposed call so they can inspect, approve, or deny.

Secondary path: Continue.dev + Ollama + bridge

If you'd rather use Ollama as the local inference runtime (it's lighter-weight than LM Studio and runs as a daemon), you'll need a bridge. Ollama doesn't natively speak MCP yet (issue #7865 on the Ollama repo is still open as of April 2026), but the community has built several working bridges that translate between Ollama's API and MCP.

Pair one of these bridges with Continue.dev in VS Code or JetBrains, and you have a working stack. Continue.dev natively supports MCP servers and works with local Ollama backends.
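
For reference, a minimal Continue configuration pointing at a local Ollama model. This sketch uses Continue's JSON config format with an illustrative model tag (llama4:70b); newer Continue releases use a YAML config, so check the docs for your installed version:

{
  "models": [
    {
      "title": "Llama 4 70B (local)",
      "provider": "ollama",
      "model": "llama4:70b",
      "apiBase": "http://127.0.0.1:11434"
    }
  ]
}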

Step-by-step deployment (LM Studio path)

This is the cleanest install. Plan 30-45 minutes the first time, mostly for the model download.

1. Install LM Studio

Download LM Studio for macOS or Windows from lmstudio.ai/download. Open the app. The first run downloads the runtime — about 200 MB.

2. Download Llama 4 70B

Inside LM Studio, go to the search tab and look for Llama-4-70B-Instruct-Q4_K_M (the 4-bit quantized version, ~40 GB download). On a fast connection this takes 30-60 minutes. The model lives in ~/.cache/lm-studio/models/ after download.
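
To confirm the weights landed where LM Studio expects them:

ls -lh ~/.cache/lm-studio/models/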

3. Install our Clio MCP connector

From any terminal:

npm install -g @oktopeak/clio-mcp

This installs version 1.0.1 from the official MCP Registry. Verify with clio-mcp --version.

4. Register a Clio Developer App

Log in at developers.clio.com. Create a new application:

  • Name: Local Legal AI
  • Redirect URI: http://127.0.0.1:5678/callback (use 127.0.0.1, not localhost — Clio rejects localhost)
  • Permissions: read on Activities, Billing, Calendars, Contacts, Documents, Users; read+write on Matters and Tasks

Save the Client ID and Client Secret. Generate a 64-character encryption key:

node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"

5. Configure LM Studio's mcp.json

Open LM Studio's settings, find the MCP configuration. Add this entry:

{
  "mcpServers": {
    "clio": {
      "command": "npx",
      "args": ["-y", "@oktopeak/clio-mcp"],
      "env": {
        "CLIO_CLIENT_ID": "your-client-id",
        "CLIO_CLIENT_SECRET": "your-client-secret",
        "ENCRYPTION_KEY": "your-64-char-hex-key",
        "CLIO_REGION": "us"
      }
    }
  }
}

Restart LM Studio. The Clio tools should appear in the tool list. The first time you use a Clio tool, your browser opens to Clio's OAuth page — log in, authorize, and the tokens are encrypted and stored at ~/.clio-mcp/tokens.enc.
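
You can confirm the encrypted token file was created:

ls -l ~/.clio-mcp/tokens.enc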

6. Validate

Open a chat with Llama 4 70B in LM Studio. Type:

"List my open matters."

The model should call our list_matters tool, you'll see the confirmation dialog, approve it, and the response should come back with the matters from your Clio account. The audit log at ~/.clio-mcp/audit.log will have a new line documenting the call.
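
To see that audit entry for yourself:

tail -n 1 ~/.clio-mcp/audit.log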

Validation: confirming zero outbound leakage

The headline claim of this stack is that no third party ever sees your Clio data. Verify it.

macOS: Little Snitch

Little Snitch shows every outbound network connection in real time. During a representative session, you should see:

  • Allowed: connections to app.clio.com and eu.app.clio.com (the Clio API)
  • Allowed: connections to 127.0.0.1 (the OAuth callback during initial auth)
  • Should never appear: *.anthropic.com, api.openai.com, *.azure.com, *.aws.amazon.com, or any other AI provider domain
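
If you want a command-line spot check alongside Little Snitch, lsof lists currently established connections. It's a point-in-time snapshot, not a continuous log, and it prints raw IPs, so compare them against Clio's published addresses rather than grepping for hostnames:

sudo lsof -nP -iTCP -sTCP:ESTABLISHED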

Linux: tcpdump

Run during a session:

# tcpdump's -n flag prints raw IPs, so a hostname grep would never match anything;
# resolve Clio's address first and exclude it in the capture filter instead
CLIO_IP=$(dig +short app.clio.com | tail -n 1)
sudo tcpdump -n -i any "port 443 and not host ${CLIO_IP}"

Clio's API may sit behind several load-balanced addresses, so re-run the lookup and add a "not host" clause for each one you see. If any other HTTPS traffic shows up during a Llama-driven Clio query, you have a leak. Investigate.

When to use this stack vs Claude Enterprise

Honest comparison. Both are defensible. Pick based on the matter, not the marketing.

Factor | Privilege Stack (self-hosted) | Claude Enterprise
------ | ----------------------------- | -----------------
Privilege | Strongest (no third-party processor) | Open question per US v. Heppner dicta
Model quality | Llama 4 70B / DeepSeek V4, within a few points of Claude on legal benchmarks | Claude 3.5 Sonnet / Opus, frontier quality on complex reasoning
Speed | 10-15 tok/s on Mac Studio M4 Max | ~30-50 tok/s
Cost (10-attorney firm) | $5K-$10K hardware, once; $0/month after | $300-$600/seat/month, ~$36K-$72K/year
Long context | 128K tokens (Llama 4); some models reach 200K | 200K tokens; can process full case files
Best for | Criminal defense, FOIA-sensitive work, regulated jurisdictions, paranoia | Multi-jurisdictional research, large-context contract review

Many firms run both. Privilege stack for the most sensitive work, Claude Enterprise for general productivity. Our open-source connector works identically with both — same tools, same audit log format. The only thing that changes is which client you point it at.

Beyond Clio

The same stack works with any MCP-capable connector. As the open-source MCP ecosystem matures, expect to see practice management, document management, and research tools all connectable to your local model the same way Clio is. We're tracking iManage, MyCase, and PracticePanther for our next open-source releases. The investment in hardware and the local model carries forward — only the connector at the data layer changes.

Need help deploying this for your firm?

We deploy privilege-safe AI stacks for firms that need them. Hardware spec, model selection, install, validation, training. 30 minutes with a co-founder. No pitch.

See Our Legal AI Integration Service →

Sources verified April 28, 2026: Harvard Law Review on US v. Heppner, Gibson Dunn alert, Anthropic Sept 2025 consumer policy update, Anthropic privacy center, LM Studio MCP docs, Ollama MCP issue tracker, BenchLM April 2026 model rankings, Compute Market hardware guide. We will update this post when LM Studio publishes more comprehensive stdio configuration documentation and when we have benchmarked Llama 4 70B + Clio MCP end-to-end on a clean Mac Studio install. The deployment instructions above are based on documented behavior; verifying them before relying on them in production work is the responsibility of each firm's IT.
