Run a free 35B AI coding agent on Apple Silicon Macs using local LLMs via llama.cpp or MLX with web search, shell, and file tools.
---
name: mac-code-local-ai-agent
description: Run a free 35B AI coding agent on Apple Silicon Macs using local LLMs via llama.cpp or MLX with web search, shell, and file tools.
triggers:
- "set up mac code local AI agent"
- "run Claude Code alternative on Mac for free"
- "local LLM agent on Apple Silicon"
- "35B model on 16GB Mac"
- "llama.cpp agent with tools on Mac"
- "MLX local coding agent"
- "out of RAM model inference Mac"
- "mac-code setup and usage"
---
# mac-code — Free Local AI Agent on Apple Silicon
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
Run a 35B reasoning model locally on your Mac for $0/month. mac-code is a CLI AI coding agent (Claude Code alternative) that routes tasks — web search, shell commands, file edits, chat — through a local LLM. Supports llama.cpp (30 tok/s) and MLX (64K context, persistent KV cache) backends on Apple Silicon.
---
## What It Does
- **LLM-as-router**: The model classifies every prompt as `search`, `shell`, or `chat` and routes accordingly
- **35B MoE at 30 tok/s** via llama.cpp + IQ2_M quantization (fits in 16 GB RAM)
- **35B full Q4 on 16 GB** via custom MoE Expert Sniper (1.54 tok/s, only 1.42 GB RAM used)
- **9B at 64K context** via quantized KV cache (`q4_0` keys/values)
- **MLX backend** adds persistent KV cache save/load, context compression, R2 sync
- **Tools**: DuckDuckGo search, shell execution, file read/write
---
## Installation
### Prerequisites
```bash
brew install llama.cpp
pip3 install rich ddgs huggingface-hub mlx-lm --break-system-packages
```
### Clone the repo
```bash
git clone https://github.com/walter-grace/mac-code
cd mac-code
```
### Download models
**35B MoE — fast daily driver (10.6 GB, fits in 16 GB RAM):**
```bash
mkdir -p ~/models
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
'unsloth/Qwen3.5-35B-A3B-GGUF',
'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
local_dir='$HOME/models/'
)
"
```
**9B — 64K context, long documents (5.3 GB):**
```bash
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
'unsloth/Qwen3.5-9B-GGUF',
'Qwen3.5-9B-Q4_K_M.gguf',
local_dir='$HOME/models/'
)
"
```
---
## Starting the Backend
### Option A: llama.cpp + 35B MoE (recommended, 30 tok/s)
```bash
llama-server \
--model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 12288 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -np 1 -t 4
```
### Option B: llama.cpp + 9B (64K context)
```bash
llama-server \
--model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 65536 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -t 4
```
### Option C: MLX backend (persistent context, 9B)
```bash
# Starts server on port 8000, downloads model on first run
python3 mlx/mlx_engine.py
```
### Start the agent (all options)
```bash
python3 agent.py
```
---
## Agent CLI Commands
Inside the agent REPL, type `/` for all commands:
| Command | Action |
|---|---|
| `/agent` | Agent mode with tools (default) |
| `/raw` | Direct streaming, no tools |
| `/model 9b` | Switch to 9B model (64K context) |
| `/model 35b` | Switch to 35B MoE |
| `/search <query>` | Quick DuckDuckGo search |
| `/bench` | Run speed benchmark |
| `/stats` | Session statistics |
| `/cost` | Show cost savings vs cloud |
| `/good` / `/bad` | Grade the last response |
| `/improve` | View response grading stats |
| `/clear` | Reset conversation |
| `/quit` | Exit |
### Example prompts
```
> find all Python files modified in the last 7 days
→ routes to "shell", generates: find . -name "*.py" -mtime -7
> who won the NBA finals
→ routes to "search", queries DuckDuckGo, summarizes
> explain how attention works
→ routes to "chat", streams directly
```
---
## MLX Backend — Persistent KV Cache API
The MLX engine exposes a REST API on `localhost:8000`.
### Save context after processing a large codebase
```bash
curl -X POST localhost:8000/v1/context/save \
-H "Content-Type: application/json" \
-d '{"name": "my-project", "prompt": "$(cat README.md)"}'
```
### Load saved context instantly (0.0003s)
```bash
curl -X POST localhost:8000/v1/context/load \
-H "Content-Type: application/json" \
-d '{"name": "my-project"}'
```
### Download context from Cloudflare R2 (cross-Mac sync)
```bash
# Requires R2 credentials in environment
export R2_ACCOUNT_ID=your_account_id
export R2_ACCESS_KEY_ID=your_key_id
export R2_SECRET_ACCESS_KEY=your_secret
export R2_BUCKET=your_bucket_name
curl -X POST localhost:8000/v1/context/download \
-H "Content-Type: application/json" \
-d '{"name": "my-project"}'
```
### Standard OpenAI-compatible chat
```python
import requests
response = requests.post("http://localhost:8000/v1/chat/completions", json={
"model": "local",
"messages": [{"role": "user", "content": "Write a Python quicksort"}],
"stream": False
})
print(response.json()["choices"][0]["message"]["content"])
```
### Streaming chat
```python
import requests, json
with requests.post("http://localhost:8000/v1/chat/completions", json={
"model": "local",
"messages": [{"role": "user", "content": "Explain transformers"}],
"stream": True
}, stream=True) as r:
for line in r.iter_lines():
if line.startswith(b"data: "):
chunk = json.loads(line[6:])
delta = chunk["choices"][0]["delta"].get("content", "")
print(delta, end="", flush=True)
```
---
## KV Cache Compression (MLX)
Compress context 4x with 99.3% similarity:
```python
from mlx.turboquant import compress_kv_cache
from mlx.kv_cache import save_kv_cache, load_kv_cache
# After building a KV cache from a long document
compressed = compress_kv_cache(kv_cache, bits=4) # 26.6 MB → 6.7 MB
save_kv_cache(compressed, "my-project-compressed")
# Load later
kv = load_kv_cache("my-project-compressed")
```
---
## Flash Streaming — Out-of-Core Inference
For models larger than your RAM (research mode):
```bash
cd research/flash-streaming
# Run 35B MoE Expert Sniper (22 GB model, 1.42 GB RAM)
python3 moe_expert_sniper.py
# Run 32B dense flash stream (18.4 GB model, 4.5 GB RAM)
python3 flash_stream_v2.py
```
### How F_NOCACHE direct I/O works
```python
import os, fcntl
# Open model file bypassing macOS Unified Buffer Cache
fd = os.open("model.bin", os.O_RDONLY)
fcntl.fcntl(fd, fcntl.F_NOCACHE, 1) # bypass page cache
# Aligned read (16KB boundary for DART IOMMU)
ALIGN = 16384
offset = (layer_offset // ALIGN) * ALIGN
data = os.pread(fd, layer_size + ALIGN, offset)
weights = data[layer_offset - offset : layer_offset - offset + layer_size]
```
### MoE Expert Sniper pattern
```python
# Router predicts which 8 of 256 experts activate per token
active_experts = router_forward(hidden_state) # returns [8] indices
# Load only those experts from SSD (8 threads, parallel pread)
from concurrent.futures import ThreadPoolExecutor
def load_expert(expert_idx):
offset = expert_offsets[expert_idx]
return os.pread(fd, expert_size, offset)
with ThreadPoolExecutor(max_workers=8) as pool:
expert_weights = list(pool.map(load_expert, active_experts))
# ~14 MB loaded per layer instead of 221 MB (dense)
```
---
## Common Patterns
### Use as a Python library (direct API calls)
```python
import requests
BASE = "http://localhost:8000/v1"
def ask(prompt: str, system: str = "You are a helpful coding assistant.") -> str:
r = requests.post(f"{BASE}/chat/completions", json={
"model": "local",
"messages": [
{"role": "system", "content": system},
{"role": "user", "content": prompt}
]
})
return r.json()["choices"][0]["message"]["content"]
# Examples
print(ask("Write a Python function to parse JSON safely"))
print(ask("Explain this error: AttributeError: NoneType has no attribute split"))
```
### Process a large file with paged inference
```python
from mlx.paged_inference import PagedInference
engine = PagedInference(model="mlx-community/Qwen3.5-9B-4bit")
with open("large_codebase.txt") as f:
content = f.read() # beyond single context window
# Automatically pages through content
result = engine.summarize(content, question="What does this codebase do?")
print(result)
```
### Monitor server performance
```bash
python3 dashboard.py
```
---
## Model Selection Guide
| Your Mac RAM | Best Option | Command |
|---|---|---|
| 8 GB | 9B Q4_K_M | `--model ~/models/Qwen3.5-9B-Q4_K_M.gguf --ctx-size 4096` |
| 16 GB | 35B IQ2_M (30 tok/s) | Default Option A above |
| 16 GB (quality) | 35B Q4 Expert Sniper | `python3 research/flash-streaming/moe_expert_sniper.py` |
| 48 GB | 35B Q4_K_M native | Download full Q4, `--n-gpu-layers 99` |
| 192 GB | 397B frontier | Any large GGUF, full offload |
---
## Troubleshooting
### Server not responding on port 8000
```bash
# Check if server is running
curl http://localhost:8000/health
# Check what's on port 8000
lsof -i :8000
# Restart llama-server with verbose logging
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --verbose
```
### Model download fails / incomplete
```bash
# Resume interrupted download
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
'unsloth/Qwen3.5-35B-A3B-GGUF',
'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
local_dir='$HOME/models/',
resume_download=True
)
"
```
### Slow inference / RAM pressure on 16 GB Mac
```bash
# Reduce context size to free RAM
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --ctx-size 4096 \ # reduced from 12288
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 -t 4
# Or switch to 9B for lower RAM usage
python3 agent.py
# Then: /model 9b
```
### MLX engine crashes with memory error
```bash
# MLX uses unified memory — check pressure
vm_stat | grep "Pages free"
# Reduce batch size in mlx_engine.py
# Edit: max_batch_size = 512 → max_batch_size = 128
```
### F_NOCACHE not bypassing page cache (macOS Sonoma+)
```python
# Verify F_NOCACHE is active
import fcntl, os
fd = os.open(model_path, os.O_RDONLY)
result = fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
assert result == 0, "F_NOCACHE failed — check macOS version and SIP status"
```
### `ddgs` search fails
```bash
pip3 install --upgrade ddgs --break-system-packages
# ddgs uses DuckDuckGo — no API key required, but may rate-limit
# Retry after 60 seconds if you get a 202 response
```
### Wrong reshape on GGUF dequantization
```python
# GGUF tensors are column-major — correct reshape:
weights = dequantized_flat.reshape(ne[1], ne[0]) # CORRECT
# NOT: dequantized_flat.reshape(ne[0], ne[1]).T # WRONG
```
---
## Architecture Summary
```
agent.py
├── Intent classification → "search" | "shell" | "chat"
├── search → ddgs.DDGS().text() → summarize
├── shell → generate command → subprocess.run()
└── chat → stream directly
Backends (both expose OpenAI-compatible API on :8000)
├── llama.cpp → fast, standard, no persistence
└── mlx/ → KV cache save/load/compress/sync
Flash Streaming (research/)
├── moe_expert_sniper.py → 35B Q4, 1.42 GB RAM
└── flash_stream_v2.py → 32B dense, 4.5 GB RAM
└── F_NOCACHE + pread + 16KB alignment
```
Creator's repository · aradotso/trending-skills