Documentation – Arbiter

1. Getting Started

Arbiter is a local-first AI assistant. Once you install a model, core chat runs entirely on your device with no account, no cloud dependency, and no ongoing inference costs. Network features like web search, model downloads, and local-network server connections are explicit opt-ins.

First Launch

Open Arbiter and walk through the onboarding screens.
Head to the Model Catalog and install a model. If you are not sure where to start, look for models tagged Recommended. These are sized to run well on most devices.
Once the download finishes, the model loads automatically. Start a new chat and send a message.

Arbiter welcome screen with onboarding steps

Device Requirements

Platform	Minimum	Recommended
iOS	iOS 16, A14 Bionic, 6 GB RAM	iPhone 13 Pro or later; more RAM for larger models
macOS	macOS 14, Apple Silicon (M1)	M1 Pro / M2 or later with 16+ GB RAM

Models vary in size from under 1 GB to over 8 GB. Arbiter checks your device’s available memory before loading and warns you if a model is likely to exceed what your hardware can handle.

2. Understanding Model Formats

Arbiter supports four model runtime paths. Each has different tradeoffs around compatibility, performance, and setup.

GGUF

GGUF is a quantized model format popularized by the llama.cpp ecosystem. GGUF models are single-file downloads that run on both iOS and macOS through Arbiter’s built-in llama.cpp engine.

Compatibility: Works on both iPhone and Mac.
Performance: Efficient memory usage through quantization (most catalog entries are Q4 or Q8). Good balance of speed and quality on devices with limited RAM.
Best for: Compact models on iPhone, general chat, coding, and reasoning tasks.

MLX

MLX is Apple’s machine learning framework for Apple Silicon. MLX models are stored as a set of config, tokenizer, and weight files and run through mlx-swift. They also work on both iOS and macOS.

Compatibility: Works on both iPhone and Mac. Requires Apple Silicon.
Performance: Takes advantage of the unified memory architecture on Apple Silicon. Larger MLX models that would not fit comfortably on an iPhone can run well on a Mac with more RAM.
Vision models: All vision-capable models in Arbiter’s catalog use the MLX format through the MLXVLM runtime.
Server support: The macOS model server feature serves installed MLX models.
Best for: Larger models on Mac, vision tasks, and serving models to other devices.

Apple Foundation Model

Apple’s on-device Foundation Model is available on devices running iOS 26 or macOS 26 with Apple Intelligence enabled. This is a system-level model provided by Apple, so no download or storage is required.

Compatibility: Requires Apple Intelligence eligibility and iOS 26 / macOS 26.
Performance: Runs natively through Apple’s FoundationModels framework. Zero disk usage.
Best for: Quick responses without downloading a model, or as a lightweight default alongside open-source models.

Remote OpenAI-Compatible Servers

Arbiter can also use models served by another device on your local network. This includes Arbiter for macOS, LM Studio, Ollama-style servers, and other OpenAI-compatible endpoints. Remote models appear in the picker as remote:model-id.

Compatibility: Works on iOS and macOS when the server exposes OpenAI-style model and chat endpoints.
Performance: Lets an iPhone use larger models running on a nearby Mac or PC while keeping prompts inside the user’s own network.
Best for: Larger local-network models, desktop-hosted MLX models, and development workflows that expect an OpenAI-compatible API.

Which format should I pick? If you are on iPhone and want a fast, compact model, start with a recommended GGUF model. If you have a Mac with 16+ GB of RAM and want to explore larger or vision-capable models, try MLX. If your device supports Apple Intelligence, the Foundation Model is available with no downloads at all. If the best model is running on another computer, connect to it as a remote server.

3. Model Catalog & Downloads

Arbiter ships with a curated catalog of 44 models spanning multiple families: Gemma, Llama, DeepSeek, Qwen, Mistral, Phi, Granite, and others. The catalog includes 24 GGUF models, 20 MLX models, 9 vision-capable models, and 10 reasoning models.

Browsing and Filtering

The model browser supports search, filtering, and sorting:

Filters: Installed, MLX, GGUF, Recommended, Vision, Reasoning, and individual model families.
Sorting: Recommended fit (default), installed first, file size, alphabetical, and popularity.

Model catalog with filter chips and search

Downloading a Model

Tap a model in the catalog to see its details, including size, format, capabilities, and a link to its Hugging Face page.
Tap Download. Progress is tracked in the UI. Large models (4 to 8 GB) may take several minutes depending on your connection.
Once downloaded, the model is stored locally in the app’s sandbox. GGUF models download as a single file. MLX models download config, tokenizer, and weight shards from Hugging Face.

Memory Fit Checks

Arbiter checks your device’s available RAM against the model’s requirements. If a model is likely to exceed your device’s memory, you will see a warning before downloading. This is especially relevant on iPhones with 6 GB of RAM, where larger models may crash during inference.

Recommendations are device-aware. Arbiter considers minimum and maximum memory guidance from the catalog, avoids tight-memory models during onboarding, and adjusts recommendations for devices such as 8 GB iPhones that can run stronger models than the smallest phone-friendly defaults.
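
As a rough illustration of a device-aware fit check, the sketch below filters a catalog by device RAM. The field names (`min_ram_gb`, `max_ram_gb`), the reading of "maximum" as the point above which a stronger model is a better recommendation, the 1 GB headroom that marks a model as tight, and the sample entries are all assumptions made for this example. They are not Arbiter's actual catalog schema or thresholds.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    min_ram_gb: float  # hypothetical: least device RAM needed to load the model
    max_ram_gb: float  # hypothetical: above this, a stronger model is a better fit


def fit_status(entry, device_ram_gb, headroom_gb=1.0):
    """Classify a model as 'too_large', 'tight', or 'ok' for a device."""
    if device_ram_gb < entry.min_ram_gb:
        return "too_large"
    if device_ram_gb < entry.min_ram_gb + headroom_gb:
        return "tight"
    return "ok"


def recommend(entries, device_ram_gb, onboarding=False):
    """Return names of models worth recommending on this device."""
    picks = []
    for entry in entries:
        status = fit_status(entry, device_ram_gb)
        if status == "too_large":
            continue  # would trigger the memory warning
        if onboarding and status == "tight":
            continue  # avoid tight-memory models during onboarding
        if device_ram_gb > entry.max_ram_gb:
            continue  # device can run something stronger
        picks.append(entry.name)
    return picks


CATALOG = [
    CatalogEntry("small", min_ram_gb=4, max_ram_gb=6),
    CatalogEntry("medium", min_ram_gb=6, max_ram_gb=12),
    CatalogEntry("large", min_ram_gb=12, max_ram_gb=64),
]
```

With these sample values, a 6 GB device is offered only the small model during onboarding, while an 8 GB device is steered to the medium one.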

Deleting Models

Installed models can be deleted from the catalog screen to free up storage. Deleting a model removes all associated files from the app sandbox.

4. On-Device Inference

When you send a message with a locally installed model selected, inference happens entirely on your device. No network call is made, and your prompt never leaves the device.

How It Works

GGUF models run through a local llama.cpp engine compiled for Apple platforms.
MLX models run through mlx-swift and mlx-swift-lm, using the Apple Silicon GPU and unified memory.
Responses stream token-by-token into the chat UI. You can tap Stop at any time to halt generation.

Performance Factors

Token generation speed depends on your hardware, the model size, quantization level, and current thermal state. A few guidelines:

Smaller quantized models (1 to 3 GB) run smoothly on most modern iPhones.
Larger models (4 to 8 GB) perform best on Mac or high-RAM iPhones.
Sustained generation on iPhone can trigger thermal throttling. Shorter conversations or smaller models help here.

Active chat with a streaming model response

Troubleshooting: Model Issues

Model won’t load or crashes: The model may be too large for your device. Try a smaller model. Models tagged “Recommended” in the catalog are sized for most devices. Close background apps to free up memory, especially on iPhone. If a download was interrupted, the model file may be corrupt. Delete and re-download it from the catalog.
Slow generation: Larger models generate more slowly, especially on iPhone. Switch to a smaller or more heavily quantized variant. Extended generation can heat up the device and reduce speed. Very long conversations also increase processing time per token, so start a new chat if things slow down significantly.

5. Chat & Conversations

All chat sessions are stored locally in Core Data. Arbiter automatically titles new conversations from your first message and reopens your most recent session on launch.

Chat Features

Streaming responses with stop-generation control.
Retry the last assistant response to get a different answer.
Edit a previous user message and regenerate from that point.
Delete the most recent message pair.
Markdown rendering for formatted output and code blocks with syntax highlighting.
Rename, delete, and search chat sessions from the history sidebar.

Switching Models Mid-Conversation

You can switch the active model at any time. The new model picks up the existing conversation context. Keep in mind that different models have different context window sizes. Switching to a smaller model mid-conversation may trigger context management (see Context Management).

6. Files & Vision

File Uploads

Arbiter supports uploading PDF and plain text files for summarization and analysis. Files are copied into the app’s sandbox and processed locally.

Maximum file size: 2 MB.
PDF text is extracted via PDFKit. Plain text is read as UTF-8.
For smaller local models, Arbiter can pre-summarize the file to reduce context usage. For MLX and remote models, file excerpts can be included directly.
Previous file attachments can be represented by stored summaries in follow-up messages to conserve tokens.

Vision & Image Input

Vision-capable MLX models can process images alongside text prompts. Arbiter supports image input from the photo library and, on iOS, directly from the camera.

Only MLX models tagged as vision-capable support image input. Text-only models are protected from receiving image data.
Images are resized to 448×448 pixels and converted to JPEG before processing.
Vision models in the catalog include Gemma 4, LFM 2.5 VL, Ministral, Qwen2 VL, Llama 3.2 Vision, and SmolVLM.

iOS tip: Use the camera flow to snap a photo of a document, whiteboard, receipt, or label and ask Arbiter about it directly.

7. Web Search

Web search is an optional feature that gives your local model access to current information from the internet. It is disabled by default and must be explicitly toggled on per-message.

How It Works

Enable the search toggle in the chat input bar before sending your message.
Arbiter sends your query to search.askarbiter.ai and receives structured results.
Results are formatted into a compact, token-aware summary and injected into the model’s prompt alongside your question.
The model generates a response grounded in both its training data and the live search results.
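
The formatting step above could look roughly like the sketch below. It assumes each search result carries `title`, `url`, and `snippet` fields and uses a character budget as a stand-in for a token budget. The field names, limits, and prompt wording are hypothetical and do not describe the actual response format of search.askarbiter.ai.

```python
def format_search_context(results, max_chars=600, snippet_chars=160):
    """Pack as many results as fit into a compact, size-limited summary."""
    lines = []
    used = 0
    for index, result in enumerate(results, start=1):
        snippet = result["snippet"].strip()
        if len(snippet) > snippet_chars:
            snippet = snippet[:snippet_chars].rstrip() + "..."
        line = f"[{index}] {result['title']} ({result['url']}): {snippet}"
        if used + len(line) > max_chars:
            break  # stop before the summary outgrows its budget
        lines.append(line)
        used += len(line) + 1  # +1 for the newline separator
    return "\n".join(lines)


def build_prompt(question, results):
    """Inject the summary into the prompt alongside the user's question."""
    context = format_search_context(results)
    return f"Search results:\n{context}\n\nQuestion: {question}"
```

Capping both the per-result snippet and the total length keeps the injected text predictable, which matters most for small-context local models.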

Privacy Considerations

When search is enabled, your search query is sent to Arbiter’s search endpoint. We do not log or store queries beyond what is needed to return results. Your full conversation history is not transmitted. Only the specific search query is sent.

When to Use Search

Current events, news, or recent information.
Facts that may have changed since the model was trained.
Product prices, release dates, weather, sports scores.

For topics covered well by the model’s training data (general knowledge, coding, math), search is usually unnecessary and adds latency.

Troubleshooting: Web Search

Search toggle: Search must be explicitly toggled on in the chat input bar for each message.
Internet required: Search needs an active internet connection.
Rate limits: If you see rate-limit errors, wait a moment and try again.

8. Importing & Exporting Chats

Arbiter supports JSON-based chat export and import for backups, migration between devices, or archival.

Exporting

Open the chat you want to export.
Use the export option to generate a JSON file containing the full message history, metadata, attachments, and summaries.
Save or share the file through the system share sheet.

Importing

Open Arbiter and use the import option from the chat history screen.
Select a previously exported JSON file. Arbiter creates a new chat session from the imported data.

Note: Exported chats include message content, timestamps, model references, file summaries, and session metadata. Imported chats appear as new sessions in your history. Third-party chat formats are not supported.

9. Connecting to Local Servers

Arbiter can connect to any OpenAI-compatible API server running on your local network. This lets you run larger models on a nearby computer and chat from your iPhone or Mac without sending prompts to a cloud provider.

Supported Servers

Arbiter for macOS (see Running a Model Server)
LM Studio with the local server enabled in settings
Ollama, which runs an OpenAI-compatible endpoint by default
Any OpenAI-compatible server that exposes /v1/models and /v1/chat/completions

Setup

Make sure the server is running and accessible on the same Wi-Fi network as your device.
In Arbiter, go to Settings → Remote Server.
Choose Arbiter Server for another Arbiter app on your network, or Third-Party Server for LM Studio, Ollama, or another OpenAI-compatible server.
Enter the server’s host (IP address or hostname) and port. Arbiter Server defaults to port 8080; the Third-Party Server option defaults to 1234, which matches LM Studio. Ollama normally listens on port 11434, so change the port when connecting to it.
Tap Test Connection. Arbiter queries /v1/models to discover available models and, when supported, /v1/active_model to identify the server’s current model.
Select a remote model. It appears in the model picker as remote:model-id.
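
To check a server outside of Arbiter, the discovery step can be approximated with a short script. This is a minimal sketch: it assumes the server returns the usual OpenAI-style list shape (`{"data": [{"id": ...}]}`) from /v1/models, and the function names are made up for this example.

```python
import json
import urllib.request


def models_url(host, port, scheme="http"):
    """Build the model-listing URL for a server on the local network."""
    return f"{scheme}://{host}:{port}/v1/models"


def picker_labels(body):
    """Turn an OpenAI-style /v1/models response body into picker labels."""
    payload = json.loads(body)
    return [f"remote:{model['id']}" for model in payload.get("data", [])]


def discover(host, port, timeout=5.0):
    """Fetch /v1/models and return labels; raises on timeouts or HTTP errors."""
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        return picker_labels(resp.read().decode("utf-8"))
```

Calling `discover("192.168.1.20", 8080)` (an example address) against a running server should return labels in the same remote:model-id form the picker shows. An empty list corresponds to the empty-model-list error that Test Connection reports.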

Remote server configuration with Bonjour discovery

Bonjour Discovery (iOS → Mac)

When Arbiter for macOS is serving a model, it advertises itself on the local network via Bonjour (_arbiter._tcp) with host and port metadata. Arbiter for iOS can automatically discover nearby Mac servers without manual IP entry. Discovery runs in a timed search window and falls back to manual host and port entry when local network permission or Wi-Fi configuration blocks discovery.

Remote Model Switching

Arbiter-compatible servers can expose /v1/active_model. When available, Arbiter can show the active model and request a model switch before chatting. Third-party servers that only support /v1/models and /v1/chat/completions still work for normal remote chat.

Privacy

Local-network connections stay on your network. Prompts are sent directly between devices and do not pass through Arbiter’s servers or any external endpoint. The privacy of this path depends on your own network configuration.

Troubleshooting: Server Connections

Local Network permission: On iOS, Arbiter requires the Local Network permission to discover and connect to servers on your Wi-Fi. Go to Settings → Privacy & Security → Local Network and make sure Arbiter is enabled. Without this permission, Bonjour discovery will not work, and manual connections may fail.
Same network: Both devices must be on the same Wi-Fi network.
Server running: Confirm the server (LM Studio, Ollama, Arbiter macOS) is actively running and not paused.
Correct host and port: Double-check the IP address and port number. Use Arbiter’s Test Connection to diagnose the issue. It reports specific errors for timeouts, refused connections, empty model lists, and HTTP failures.
Diagnostics: Remote connection screens include expandable diagnostics and copyable debug logs. These are useful when comparing Bonjour discovery, manual host/port entry, and the server’s own connection details.
Firewall: Make sure your Mac’s firewall allows incoming connections on the configured port.

10. Running a Model Server (macOS)

Arbiter for macOS can expose an installed MLX model as an OpenAI-compatible local API server. This turns your Mac into a private inference endpoint for your iPhone, other apps, IDE plugins, or any client that speaks the OpenAI chat completions format.

Starting the Server

Open Arbiter for macOS and install at least one MLX model.
Navigate to the Serve Model section.
Choose the installed MLX model you want the server to expose. Arbiter can load the selected model before serving.
Click Start Server. The default port is 8080, but you can change it.
Arbiter displays both the localhost URL (for the Mac itself) and the local network URL (for other devices).

macOS Serve Model interface with server running

API Endpoints

Endpoint	Method	Description
`/v1/models`	GET	Lists installed MLX models and identifies the currently loaded model.
`/v1/active_model`	GET	Returns the server’s active loaded model.
`/v1/active_model`	POST	Requests a model switch to another installed MLX model.
`/v1/chat/completions`	POST	Sends a chat completion request. Supports server-sent event streaming.

Connecting from iPhone

With the macOS server running, open Arbiter on your iPhone. If both devices are on the same Wi-Fi network, Arbiter for iOS can discover the Mac automatically via Bonjour. You can also enter the Mac’s IP and port manually under Settings → Remote Server.

Using with Other Clients

The server includes CORS headers and follows the OpenAI chat completions format, so you can point other tools at it. The Serve Model screen also shows copyable connection values, external API details, connected clients, and diagnostics.

# List available models
curl http://192.168.1.x:8080/v1/models

# Send a chat completion request
curl http://192.168.1.x:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
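
When "stream" is true, the reply arrives as server-sent events rather than a single JSON body. The sketch below shows one way a client might reassemble the text. It assumes the server follows the common OpenAI streaming conventions: `data:` lines carrying JSON chunks with `choices[0].delta.content`, and a final `data: [DONE]` marker. Whether Arbiter's server emits every one of these details is an assumption here.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
reply = "".join(iter_stream_text(sample))
```

In a real client the lines would come from the HTTP response, decoded as UTF-8 and read one at a time, so fragments can be shown as they arrive.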

Limitations

The server serves installed MLX models. GGUF models can still run in local chat, but they cannot be served over the macOS API yet.
One generation at a time. If a request is in progress, additional requests wait until the current one finishes.
No built-in authentication. The server is accessible to any device on your local network.

11. Apple Foundation Models

Starting with iOS 26 and macOS 26, Arbiter integrates Apple’s on-device Foundation Models through the FoundationModels framework. These are system-level models provided by Apple Intelligence, with no download required.

Requirements

iOS 26 or macOS 26.
Apple Intelligence must be available and enabled on your device.
Device must meet Apple’s eligibility requirements.

Usage

When available, Apple Foundation Model appears in the model picker alongside installed GGUF and MLX models. Select it like any other model. Responses stream into the same chat interface and integrate with Arbiter’s personalization and role system.

Troubleshooting: Apple Foundation Model

Requires iOS 26 or macOS 26. Earlier OS versions do not support the FoundationModels framework.
Apple Intelligence must be enabled in Settings → Apple Intelligence & Siri.
Not all devices support Apple Intelligence. Check Apple’s compatibility list for your hardware.
If the model is not ready or the device is ineligible, Arbiter shows a clear error and suggests switching to an installed local model.

12. Roles & Personalization

Assistant Roles

Arbiter includes 10 built-in assistant roles, each with a role-specific system prompt tuned for different tasks:

General Assistant
Language Translator
Meal Planner
Fitness Coach
Mindfulness Guide
Study Buddy
Career Advisor
Travel Planner
Coding Helper
Shopping Assistant

The Language Translator role includes a selectable target language (Spanish, French, Chinese, Japanese, Hindi). Starter prompts in the chat input update based on the selected role.

Personalization Settings

You can adjust how the assistant responds without writing a custom system prompt:

Setting	Options
Nickname	Custom name the assistant uses for you
Custom Instructions	Free-text instructions appended to the system prompt
Warmth	Direct, Balanced, Warm
Enthusiasm	Calm, Balanced, Energetic
Emoji Preference	None, Occasional, Frequent
Response Style	Concise, Balanced, Detailed

Only non-default settings are included in the prompt to save tokens on smaller models.
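
A minimal sketch of that token-saving rule is shown below. The default values, key names, and sentence wording are assumptions chosen for the example. Only the idea of skipping settings that still match their defaults comes from the description above.

```python
# Hypothetical defaults; the real app's defaults may differ.
DEFAULTS = {
    "warmth": "Balanced",
    "enthusiasm": "Balanced",
    "emoji_preference": "Occasional",
    "response_style": "Balanced",
}


def personalization_lines(settings):
    """Return prompt lines for settings that differ from their defaults."""
    lines = []
    nickname = settings.get("nickname", "").strip()
    if nickname:
        lines.append(f"Address the user as {nickname}.")
    for key, default in DEFAULTS.items():
        value = settings.get(key, default)
        if value != default:
            label = key.replace("_", " ").capitalize()
            lines.append(f"{label}: {value}.")
    custom = settings.get("custom_instructions", "").strip()
    if custom:
        lines.append(custom)
    return lines
```

A user who changes nothing adds no lines to the system prompt, which is the behavior that matters on small-context models.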

13. Siri & Shortcuts

Arbiter registers an App Intent called Ask Arbiter that you can invoke through Siri or the Shortcuts app.

Using with Siri

Say “Hey Siri, Ask Arbiter” followed by your question. Siri routes the query to Arbiter, which generates a short spoken response using the currently selected model. The app does not need to be open.

Using with Shortcuts

Add the Ask Arbiter action to any Shortcut workflow. The action accepts a text input and returns the model’s response as text, which you can pipe into other Shortcut actions.

Requirement: A local model must be loaded for the Siri intent to work (unless you are using the Apple Foundation Model). If no model is loaded, the shortcut prompts you to open Arbiter first.

14. Context Management

Different models have different context window sizes, and devices have different memory ceilings. Arbiter manages both: it fits each conversation into the model’s token window and, for on-device MLX models, keeps prompts below the memory level that could cause iOS to terminate the app.

Context Stages

Full: The conversation is below the soft threshold, so Arbiter sends the full history unchanged.
Approaching: The prompt is above the soft threshold but still inside the safe input budget. Arbiter can warn you and, when useful, prepare a summary in the background.
Hybrid: The full conversation no longer fits safely. Arbiter drops older turns, keeps recent turns in full, optionally prepends a saved summary, and always preserves the current user message.
Exceeded: The required prompt still cannot fit after trimming. Arbiter shows an error instead of sending a request that is likely to fail or crash.

Arbiter reserves output space before deciding how much input can be sent. The effective input budget is maxContextTokens - reservedResponseTokens, and the soft threshold is that safe budget multiplied by a model-specific fraction. This leaves room for the reply and starts trimming before the hard limit.
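
The budget arithmetic and the four stages can be sketched as follows. The formula for the input budget comes from the paragraph above. The 0.8 soft fraction, the function names, and the split between `history_tokens` (the full prompt with all history) and `required_tokens` (the minimum that must be sent, such as the system prompt plus the current message) are assumptions for illustration.

```python
def input_budget(max_context_tokens, reserved_response_tokens):
    """Tokens available for input once reply space is reserved."""
    return max(0, max_context_tokens - reserved_response_tokens)


def context_stage(history_tokens, required_tokens, max_context_tokens,
                  reserved_response_tokens, soft_fraction=0.8):
    """Classify a prompt as full, approaching, hybrid, or exceeded."""
    budget = input_budget(max_context_tokens, reserved_response_tokens)
    soft = int(budget * soft_fraction)  # soft threshold below the hard limit
    if required_tokens > budget:
        return "exceeded"  # even a trimmed prompt cannot fit
    if history_tokens > budget:
        return "hybrid"  # must drop older turns
    if history_tokens > soft:
        return "approaching"  # still fits, but time to prepare a summary
    return "full"
```

For example, with a 2,048-token window and a 512-token reserve, the input budget is 1,536 tokens and a 0.8 fraction puts the soft threshold at 1,228 tokens.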

Context Budgets by Runtime

Model Type	Max Context	Response Reserve	Recent Turns Kept
GGUF	2,048 tokens	512 tokens	6
MLX	Read from `config.json`, then capped by device memory	1,024 tokens, clamped to fit the model	8
Apple Foundation	4,096 tokens	1,024 tokens	8
Remote server	32,768 tokens	4,096 tokens	50
Unknown fallback	4,096 tokens	1,024 tokens	8
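
The fixed rows of the table can be expressed as data, and the "Recent Turns Kept" column then drives hybrid trimming. In this sketch each list item counts as one turn, the summary is prepended as a system message, and the dictionary keys and message shape are illustrative choices rather than Arbiter's internal representation. MLX is omitted because its limits are read from the model and capped by device memory.

```python
BUDGETS = {
    "gguf": {"max_context": 2048, "reserve": 512, "recent_turns": 6},
    "apple_foundation": {"max_context": 4096, "reserve": 1024, "recent_turns": 8},
    "remote": {"max_context": 32768, "reserve": 4096, "recent_turns": 50},
    "unknown": {"max_context": 4096, "reserve": 1024, "recent_turns": 8},
}


def budget_for(runtime):
    """Look up a runtime's budget, falling back to the unknown row."""
    return BUDGETS.get(runtime, BUDGETS["unknown"])


def hybrid_messages(history, current, runtime, summary=None):
    """Keep recent turns, optionally prepend a summary, always keep current."""
    keep = budget_for(runtime)["recent_turns"]
    messages = []
    if summary:
        messages.append(
            {"role": "system", "content": f"Summary of earlier conversation: {summary}"}
        )
    messages.extend(history[-keep:])
    messages.append(current)
    return messages
```

Subtracting the reserve from the maximum gives input budgets of 1,536 tokens for GGUF, 3,072 for Apple Foundation and the fallback, and 28,672 for remote servers.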

MLX Memory Caps

Some MLX models advertise very large trained context lengths, such as 131,072 tokens, but an iPhone cannot always hold the required KV cache and temporary prefill tensors in memory. Arbiter therefore converts available device RAM into a safer token cap and uses the smaller of the model’s trained length and the memory-derived cap.

The cap accounts for model weights, a 500 MB activation reserve, KV-cache bytes per token from the model geometry, and extra prefill memory for vision-loaded MLX models. This is why a vision model may enter hybrid mode much earlier than its advertised context length suggests: the memory ceiling can be stricter than the token window.
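
A worked sketch of this calculation follows. The KV-cache size per token uses the standard transformer estimate (keys and values for every layer and KV head). Treating 500 MB as 500 × 1024² bytes, the 16-bit cache values, and the sample model geometry are assumptions. Arbiter's exact formula is not documented here.

```python
ACTIVATION_RESERVE_BYTES = 500 * 1024 * 1024  # assumed reading of "500 MB"


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token: keys plus values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def memory_token_cap(available_bytes, weight_bytes, kv_per_token,
                     vision_prefill_bytes=0):
    """Largest token count whose KV cache fits in the remaining memory."""
    free = (available_bytes - weight_bytes
            - ACTIVATION_RESERVE_BYTES - vision_prefill_bytes)
    if free <= 0:
        return 0
    return free // kv_per_token


def effective_context(trained_length, available_bytes, weight_bytes,
                      kv_per_token, vision_prefill_bytes=0):
    """Use the smaller of the trained length and the memory-derived cap."""
    cap = memory_token_cap(available_bytes, weight_bytes, kv_per_token,
                           vision_prefill_bytes)
    return min(trained_length, cap)


GIB = 1024 ** 3
# Hypothetical geometry: 28 layers, 8 KV heads, head dimension 128.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128)
text_cap = effective_context(131072, 3 * GIB, 2 * GIB, per_token)
vision_cap = effective_context(131072, 3 * GIB, 2 * GIB, per_token,
                               vision_prefill_bytes=256 * 1024 * 1024)
```

With these sample numbers, each token costs 114,688 bytes of cache, so a model trained for 131,072 tokens is capped at 4,790 tokens with 3 GiB available, and at 2,450 tokens once 256 MiB of vision prefill memory is set aside.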

Token Estimation

Arbiter estimates tokens without running a tokenizer on every prompt. The estimate is script-aware: Latin text is counted at roughly 3.5 characters per token, CJK and similar dense scripts closer to one token per character, and expansive scripts such as Hindi or Thai more conservatively. This keeps translation and multilingual chats from being under-counted by a factor of several.
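
A script-aware estimate along these lines might look like the sketch below. The 3.5 characters per token for Latin text and one token per character for CJK come from the description above. The 1.5 tokens per character used for Devanagari and Thai, and the specific Unicode ranges checked, are assumptions for illustration.

```python
import math

CJK_RANGES = [(0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF)]  # kana, Han, Hangul
EXPANSIVE_RANGES = [(0x0900, 0x097F), (0x0E00, 0x0E7F)]  # Devanagari, Thai


def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)


def estimate_tokens(text):
    """Estimate token count from character counts per script class."""
    latin = cjk = expansive = 0
    for ch in text:
        code_point = ord(ch)
        if _in_ranges(code_point, CJK_RANGES):
            cjk += 1
        elif _in_ranges(code_point, EXPANSIVE_RANGES):
            expansive += 1
        else:
            latin += 1  # Latin letters, digits, spaces, punctuation, other
    # 1.5 tokens per character for expansive scripts is a hypothetical rate.
    return math.ceil(latin / 3.5 + cjk * 1.0 + expansive * 1.5)
```

Counting characters per class and dividing once avoids accumulating floating-point error, and rounding up keeps the estimate on the safe side of the budget.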

Summaries and Files

When a conversation enters hybrid mode, Arbiter can summarize older messages after the active response finishes and store that summary with the chat. Future turns can then include a compact summary plus recent messages instead of repeatedly sending the entire history. File attachments can also be represented by stored summaries in follow-up turns to reduce context pressure.

Reasoning and Search

Reasoning-capable models receive a larger output allowance when thinking mode is enabled, because the thinking trace uses reply tokens before the final answer. For search-grounded prompts, Arbiter disables thinking so smaller models do not spend their budget repeatedly reasoning over injected search snippets.

Tip: If you notice the model losing track of earlier parts of the conversation, it is likely in hybrid mode. Arbiter will keep recent turns and summaries, but start a new chat for topics that require precise recall of every earlier message.

15. iOS vs. macOS Differences

Both apps share the same core: local model execution, chat, file uploads, web search, personalization, and roles. The main differences are driven by platform capabilities and form factor.

Feature	iOS	macOS
GGUF models	Supported	Supported
MLX models	Supported	Supported
Apple Foundation Model	iOS 26+	macOS 26+
Camera input	Yes	No (photo library only)
Haptic feedback	Yes (configurable)	No
Serve model as API	No	Yes (MLX models)
Bonjour discovery	Discovers Mac servers	Advertises as server
Connect to remote servers	Yes	Yes
Practical model size	1 to 4 GB typical	4 to 8+ GB with more RAM
Siri / Shortcuts	Yes	Yes