I'm on an Apple Silicon Mac running macOS 27 beta 1 (Apple Intelligence enabled) with a local LLM serving an OpenAI-compatible API via llama.cpp/Ollama. Build me an "on-device reflex layer":
A Swift app exposing Apple's Foundation Models framework as an OpenAI-compatible server on localhost (/v1/chat/completions, /v1/models, /health, /stats). Compile with swiftc directly targeting macOS 27 (SwiftPM may be broken under Command Line Tools), assemble a .app bundle with usage strings, ad-hoc codesign with a stable bundle identifier so TCC permissions persist, bind explicitly to 127.0.0.1 (NWListener defaults to IPv6), and run it as a LaunchAgent.
A cascade model id "auto": try the on-device model first, auto-escalate to my local LLM when the prompt exceeds the 4096-token context, the model errors or guardrail-refuses, or the output fails validation (empty, or invalid JSON when JSON was requested). Log escalations as JSONL training pairs.
A device-agent model id: the on-device model parses requests into a JSON action protocol (don't use
@Generable macros — they need full Xcode) and Swift executes via EventKit (create/list calendar events reminders), Contacts, and the shortcuts CLI. Inject a precomputed weekday→date table into the instructions — small models fail at calendar math, give them the answers.
Vision in 3 tiers: Vision-framework OCR first for text-heavy images (it's digit-exact, generative vision isn't), native ImageAttachment multimodal second, escalate to my local vision model as the floor.
Keep the model warm with session.prewarm() every 4 minutes — macOS evicts it when idle.
Wire my agent's lightweight tasks (triage, classification, titles, casual replies) to the bridge and keep heavy reasoning on my local LLM.
Verify each piece end-to-end with curl before moving on, and clean up any test calendar entries you create.
That prompt encodes every pothole we hit so you don't have to. Have fun. 🍏🤝🦙