My take on the current language model API market after a few days of research, and what I ultimately chose for my application. Did I miss anything notable? All of these have OpenAI-compatible APIs, though they don't necessarily support all of the same features 👇
@cerebras - makes custom chips that appear to outperform all others by a large margin. Faster than every other option by 1-2 orders of magnitude. Generally double or triple the cost of other providers, but if you need the speed it's probably a great value. Full support for enforcing strict JSON schema on output. Currently has a wait list for direct access, but you can work around this by using it through HF Inference Providers (see below).
@GroqInc - also makes custom chips for inference, extremely fast. An order of magnitude faster than pretty much everyone except Cerebras. Really great price for Llama 4 Scout, other prices are average. Currently no "true" support for enforcing strict JSON schema on output. Default rate limits might be too low for some use cases. You can request higher ones, but I have yet to have my request fulfilled.
@awscloud Bedrock - proprietary API format (Converse, InvokeModel) but offers an OpenAI-compatible shim called bedrock-access-gateway. Awful prices for everything except Amazon Nova models. Strict output schema only via "tool use", which works but feels hacky. In my experience the API is flaky, sporadically returning errors. Amazon Nova Micro is particularly inexpensive and surprisingly good for some tasks.
@huggingface Inference Providers - new "inference marketplace" that is a superset of the previous "HuggingFace Inference API" product. Gives you the ability to source inference from any of the supported providers. Inference sourced from a provider costs the same $ as the provider would charge you directly. Right now there are 10 providers supported, including Cerebras. Requires you to have a $9/month Pro membership to go beyond the modest "free" tier limits (in addition to paying the per token cost). Currently doesn't appear to be able to tell you the cost of each provider/model in advance, so I've been using OpenRouter to determine pricing.
@AnthropicAI - great models, but they seem horrendously overpriced. Sonnet 3.7 is literally 107x the cost of Amazon Nova Micro per output token.
@novita_labs - great prices, no "true" output schema support right now. Offers a more "precise" Llama 3.3 than most other providers (bf16) at a price that is among the most inexpensive for Llama 3.3
@FireworksAI_HQ - average prices, full support for output schema enforcement
@togethercompute (TogetherAI) - average prices, full support for output schema enforcement
@openrouter - also an "inference marketplace" - one API for many providers. Far more providers supported than HuggingFace Inference Providers. Has Groq but does not appear to have Cerebras
@OpenAI - the OG. IMO the only reasonably priced model is GPT-4o-mini, the rest are severely overpriced
NLP Cloud - a bit of an outlier in that they offer a flat monthly rate for a fixed number of requests per minute. Might be a good fit for high token count use cases that have a predictable request volume. Limited model options, but they have some SOTA models.
Self hosting on AWS Spot instances - this is always going to be the most cost effective option assuming you can completely saturate the compute power 24/7. It can be challenging to find the right instance type for a specific model. Smaller models are easy to accommodate via GPUs with modest amounts of VRAM. But 70B models are challenging because to get the necessary A100, H100 or H200 GPUs you have to provision an instance that *has 8 of them*, which is very expensive. Inferentia is a bit of an oddball - you have to go through a proprietary model transformation process to use it.
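To make the "saturate the compute 24/7" caveat concrete, here is a rough sketch of the cost per million generated tokens on a self-hosted instance. The hourly price and throughput below are made-up placeholders, not real AWS Spot quotes or benchmarks, and `cost_per_million_tokens` is a hypothetical helper.

```python
def cost_per_million_tokens(hourly_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Effective USD per 1M generated tokens on a self-hosted instance.

    hourly_usd        -- what the instance costs per hour (assumed figure)
    tokens_per_second -- aggregate throughput when fully busy (assumed figure)
    utilization       -- fraction of each hour the instance is actually serving
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1_000_000


# Placeholder numbers: a $10/hour instance pushing 2000 tokens/s when saturated.
saturated = cost_per_million_tokens(10.0, 2000.0)
mostly_idle = cost_per_million_tokens(10.0, 2000.0, utilization=0.1)
print(f"saturated: ${saturated:.2f} per 1M tokens")
print(f"10% busy:  ${mostly_idle:.2f} per 1M tokens")
```

The same instance gets 10x more expensive per token at 10% utilization, which is why self-hosting only wins when you can keep it busy. Compare the result against a provider's per-token price before committing.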
For the time being, I decided on @huggingface Inference Providers for my application. I have it route through Fireworks for cases where I need output to conform to a strict JSON schema, and Novita for cases where I do not. I am using Llama 3.3 70B, Llama 4 Scout and Llama 4 Maverick. I might use Groq directly for some scenarios if it supported strict schemas and accepted my rate limit increase request.