Too many people say that MCP is a game changer: that with MCP, LMs are now real agents, hallucinations are no longer an issue, and more nonsense like that.
At first, I wanted to write an article on MCP, but really, there isn't enough substance for an article, so a post will do.
What is MCP? You might have already heard that it's USB Type-C for LMs, a protocol for connecting LMs to tools and resources, etc. But what exactly does that mean? Why do so many people scream that it changes everything?
Let's answer the latter question: It doesn't change everything. LMs still hallucinate and agents based on them aren't agentic.
Now, let's see what it is.
As you likely know, LMs can be finetuned to "use tools." That simply means that they can be finetuned to print something like "USE TOOL: CALCULATOR" when the user input is something like "what is 3 + 2?"
Let the LM's output be "USE TOOL: CALCULATOR, ADD 2, 3."
This output must be parsed, the tool name and the arguments extracted, and then submitted to an actual calculator API. This is done by a handcrafted parser script.
As the number of tools you want the LM to be able to use grows, this parser script becomes more and more complex, and it gets hard to maintain every time the LM output format or a tool's input API changes.
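To make the maintenance problem concrete, here is a minimal sketch of such a handcrafted parser. The output format is the one from the example above; the regex, the `calculator` stand-in, and the `MULTIPLY` operation are my assumptions for illustration, not anyone's real API.

```python
import re

# Hypothetical format, taken from the example: "USE TOOL: CALCULATOR, ADD 2, 3."
TOOL_CALL = re.compile(r"^USE TOOL: (?P<tool>\w+), (?P<op>\w+) (?P<args>.+)$")


def calculator(op, a, b):
    # Stand-in for an actual calculator API.
    if op == "ADD":
        return a + b
    if op == "MULTIPLY":
        return a * b
    raise ValueError(f"unsupported operation: {op}")


def parse_and_run(lm_output):
    # Drop surrounding whitespace and the trailing period before matching.
    match = TOOL_CALL.match(lm_output.strip().rstrip("."))
    if match is None:
        raise ValueError(f"cannot parse tool call: {lm_output!r}")
    if match["tool"] != "CALCULATOR":
        raise ValueError(f"unknown tool: {match['tool']}")
    args = [float(x) for x in match["args"].split(",")]
    return calculator(match["op"], *args)


print(parse_and_run("USE TOOL: CALCULATOR, ADD 2, 3."))  # prints 5.0
```

Every new tool means another branch and probably another regex, and any change in how the LM words its output breaks the match.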
This is where MCP comes in. It says: the LM must be responsible for outputting the tool call request in a format that the tool API understands, without needing a handcrafted parser.
So, the process is as follows:
1. The user says "what is 3 + 2?"
2. The MCP client (coded by the chatbot creator) connects to one or several MCP servers (coded by the tool providers) and pulls the list of supported tools from the MCP servers.
2.1. Each tool comes with a verbal description of what the tool can do and what input format it accepts (written by the MCP server provider, so the quality can vary from one provider to another).
3. The MCP client submits all tools and their descriptions to the LM, together with the user's prompt "what is 3 + 2?"
4. The LM looks at the tools and decides that to answer "what is 3 + 2?" it needs the tool CALC that takes a specific JSON as input (the input JSON format for CALC is provided by the MCP server provider).
5. The LM outputs "TOOL: CALC, WORKLOAD: [JSON]".
5.1. "CALC" is the name of the tool, as provided by an MCP server, and "[JSON]" stands for an actual JSON object in the format the tool expects, for example {"operation": "addition", "argument1": 3, "argument2": 2}.
5.2. This JSON structure wasn't invented by the LM. It was taken by the LM from the information that came with the list of tools.
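The steps above can be sketched in a few lines of plain Python. This is not the real MCP SDK: the tool list layout, the `input_schema` key, `fake_lm` (a canned stand-in for a real LM), and the handler table are all assumptions made up for illustration.

```python
import json

# Steps 2 and 2.1: what an MCP server might advertise. The tool, its
# description, and the "input_schema" layout are made up for this sketch.
TOOLS = [
    {
        "name": "CALC",
        "description": "Basic arithmetic on two numbers.",
        "input_schema": {
            "operation": ["addition", "subtraction"],
            "argument1": "number",
            "argument2": "number",
        },
    }
]


def run_calc(args):
    # Stand-in for the real tool behind the MCP server.
    operations = {
        "addition": lambda a, b: a + b,
        "subtraction": lambda a, b: a - b,
    }
    return operations[args["operation"]](args["argument1"], args["argument2"])


HANDLERS = {"CALC": run_calc}


def build_prompt(user_message, tools):
    # Step 3: the tools and their descriptions go to the LM with the prompt.
    return "Available tools:\n" + json.dumps(tools, indent=2) + "\n\nUser: " + user_message


def fake_lm(prompt):
    # Steps 4 and 5: a real LM would pick the tool and fill in the JSON.
    # Here the answer is canned so the sketch runs without a model.
    return 'TOOL: CALC, WORKLOAD: {"operation": "addition", "argument1": 3, "argument2": 2}'


def dispatch(lm_output, tools):
    # The workload is passed to the tool as-is; only the envelope is split.
    header, _, workload = lm_output.partition(", WORKLOAD: ")
    name = header[len("TOOL: "):]
    tool = next((t for t in tools if t["name"] == name), None)
    if tool is None:
        raise ValueError(f"unknown tool: {name!r}")
    args = json.loads(workload)
    missing = set(tool["input_schema"]) - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return HANDLERS[name](args)


prompt = build_prompt("what is 3 + 2?", TOOLS)
print(dispatch(fake_lm(prompt), TOOLS))  # prints 5
```

Note that the only tool-specific code left on the client side is the lookup by name. Everything about the argument format lives in the tool description that the LM has to read and follow.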
Now, the elephants in the room that no one is talking about:
1. The LM must be finetuned to understand the list of available tools and all information about these tools that comes from the MCP server. So not all LMs can be used with MCP, but only those that were finetuned for this. Finetuning for effective tool usage is not a simple task, so different models will handle MCP with different levels of success.
2. Even if the LM was finetuned for understanding MCP tool specifications, the JSON structure it will generate isn't guaranteed to be as the MCP server expects because the LM wasn't finetuned for this specific MCP server and its tools.
2.1. Even if the chatbot creator uses decoding that enforces the JSON structure (using Outlines, for example), there's no guarantee that a generated JSON with all the right keys will also contain all the right values.
3. Some tools might require a very complex input structure, with lists, lists of lists, lists of dictionaries, etc. The description of what a tool can do can be very complex too or just poorly written by the MCP server provider.
3.1. There's zero guarantee that the LM will be good enough at choosing the right tool and then generating the right JSON-formatted input for this tool.
4. It all looks nice when you code a small demo of MCP with one server and two tools on it. But to build a general-purpose chatbot, the number of tools needed to answer all possible user demands can grow to hundreds, thousands, or even millions.
4.1. Putting all of them in the conversation context all the time will fill the context window very quickly, and the LM will not be able to choose the right tool or fill the JSON structure with the right values.
4.2. Not putting all of them in the context will require a classifier trained to select the right tools, or using some sort of k-nearest neighbors search in the embedding space.
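For point 4.2, here is a minimal sketch of what k-nearest neighbors tool selection could look like. The tool names and descriptions are invented, and `embed` is a toy bag-of-words stand-in for a real embedding model, so treat it as an illustration of the idea only.

```python
import math
from collections import Counter

# Hypothetical tool descriptions, as they might come from MCP servers.
TOOL_DESCRIPTIONS = {
    "CALC": "add subtract multiply divide numbers arithmetic",
    "WEATHER": "current weather forecast temperature for a city",
    "CALENDAR": "create list calendar events meetings schedule",
}


def embed(text):
    # Toy embedding: word counts. A real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tools(query, k=2):
    # Rank all tools by similarity to the query and keep the k closest.
    query_vec = embed(query)
    ranked = sorted(
        TOOL_DESCRIPTIONS,
        key=lambda name: cosine(query_vec, embed(TOOL_DESCRIPTIONS[name])),
        reverse=True,
    )
    return ranked[:k]


print(select_tools("what is the weather forecast for Paris", k=1))  # prints ['WEATHER']
```

Only the selected tools would then be placed in the context. The catch is that this retrieval step can fail too: with this toy embedding, "what is 3 + 2?" shares no words with the CALC description, so the right tool has no better score than the wrong ones.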
So, again, agents look good when there's only one of them using 2-3 simple tools. Try to scale it to hundreds of agents using thousands of tools and fail miserably.