AnyLanguageModel 0.4.0 is out with support for multi-modal inputs.
Vision language models are incredibly useful: they can extract text from receipts, analyze diagrams, describe images for accessibility, answer questions about photos, and much more.
You can now pass images directly to models running with MLX and Ollama, as well as to the Anthropic, Gemini, and OpenAI providers. Core ML support is coming soon.
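import Foundation
import AnyLanguageModel

// Assumes `session` is a LanguageModelSession already set up with a vision-capable model.
let response = try await session.respond(
    to: "Extract the text from this image",
    image: .init(
        data: try Data(contentsOf: URL(fileURLWithPath: "path/to/receipt.png")),
        mimeType: "image/png"
    )
)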
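The snippet above hardcodes `mimeType: "image/png"`. When the image path varies, a small helper can pick the MIME type from the file extension instead. This is a minimal sketch that uses only Foundation. `imageMimeType(for:)` is a hypothetical name, not part of AnyLanguageModel, and which formats a given provider actually accepts is not stated here.

```swift
import Foundation

// Hypothetical helper (not an AnyLanguageModel API): maps a file extension
// to a MIME type string suitable for the `mimeType:` argument.
// Returns nil for extensions it does not recognize.
func imageMimeType(for url: URL) -> String? {
    switch url.pathExtension.lowercased() {
    case "png": return "image/png"
    case "jpg", "jpeg": return "image/jpeg"
    case "gif": return "image/gif"
    case "webp": return "image/webp"
    case "heic": return "image/heic"
    default: return nil
    }
}

let receipt = URL(fileURLWithPath: "path/to/receipt.png")
print(imageMimeType(for: receipt) ?? "unknown")  // prints "image/png"
```

Returning an optional instead of falling back to a default type surfaces unsupported files before a request is sent.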
Introducing AnyLanguageModel: A Swift package that provides a drop-in replacement for Apple's Foundation Models framework with support for custom language model providers.
github.com/mattt/AnyLanguage…
Just change your import statement: