Evaluating MCP servers - a quick guide
Building a well-designed MCP server is an art. In the MCP community, this craft is increasingly described as "tool ergonomics": designing MCP tools so that an LLM understands how to use them.
Good tool ergonomics requires continuous iteration. As a server developer, you must fine-tune your tool descriptions, reduce unnecessary tool exposure, and structure interfaces in ways that minimize hallucinations. Even small changes like adding a new tool or rephrasing a description can significantly shift how well models interact with your server.
Running evals is the most effective way to measure, iterate, and improve these ergonomics.
Measuring your tool ergonomics with evals
Good tool ergonomics is an art, but quantifying it is a science. Here's how MCP evals work:
- A mock agent is launched and connected to your MCP server, simulating how clients like Claude Code, Cursor, or ChatGPT would interact with it.
- The agent is exposed to your server's entire toolset.
- We give the agent a prompt, simulating a user asking the LLM a real-world question.
- The agent runs the query, decides whether or not to call tools, executes them, and produces an output.
- We examine the agent's output and evaluate whether the right tools were called.
In addition to tool-selection accuracy, evals track token usage, run duration estimates, and performance across multiple LLMs. Since MCP servers are used by diverse clients, it's crucial to understand how different models behave with your tools.
Keep in mind that evals are non-deterministic. Running them multiple times yields a useful snapshot of your server's performance at a given moment, but results will naturally vary.
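The steps above can be sketched in code. This is a minimal, hypothetical harness: the `EvalCase` and `EvalResult` shapes and the `runEval` function are illustrative assumptions, not an MCPJam API, and the stub agent stands in for an LLM connected to your MCP server.

```typescript
// Hypothetical shapes for a single eval run (not MCPJam's actual schema).
interface EvalCase {
  query: string;          // the simulated user prompt
  expectedTools: string[]; // tools the agent should call for this prompt
}

interface EvalResult {
  toolsCalled: string[];
  passed: boolean;
}

// Run one eval: hand the query to the agent, record which tools it called,
// and pass only if the calls match the expected set exactly. In a real
// harness, `agent` would be an LLM driving your MCP server's toolset.
function runEval(
  evalCase: EvalCase,
  agent: (query: string) => string[]
): EvalResult {
  const toolsCalled = agent(evalCase.query);
  const passed =
    evalCase.expectedTools.every((t) => toolsCalled.includes(t)) &&
    toolsCalled.every((t) => evalCase.expectedTools.includes(t));
  return { toolsCalled, passed };
}
```

Because the underlying model is non-deterministic, a harness like this would typically repeat each case several times and aggregate the pass rate rather than trust a single run.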

Setting up prompts (test cases)
Choosing the right eval prompts is essential. Aim to mirror the workflows your users actually rely on. You likely know these scenarios best, but AI assistants can also help generate additional test cases.
For example, a popular action in Asana is checking which tasks are due today. We can set up a test case that captures that:
Example
Query
What tasks are due today?
Expected tools
asana_get_user
{
id: "me"
}
asana_get_tasks
{
id: "41812493"
}
Evals don't just measure whether the agent picks the right tools; they also evaluate how well the agent fills out tool parameters, offering direct insight into the clarity and quality of your tool descriptions.
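In code, the Asana test case above can be represented as a plain object pairing the query with the expected tool calls and their parameters. The shape here is a hypothetical illustration, not MCPJam's exact test-case schema:

```typescript
// Hypothetical test-case definition mirroring the Asana example.
// Each expected tool records both the tool name and the parameters the
// agent should fill in, so evals can score parameter quality too.
const dueTodayCase = {
  query: "What tasks are due today?",
  expectedTools: [
    { name: "asana_get_user", params: { id: "me" } },
    { name: "asana_get_tasks", params: { id: "41812493" } },
  ],
};
```

Capturing parameters alongside tool names is what lets an eval flag cases where the agent picked the right tool but filled it out incorrectly.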
Interpreting eval results
Once your evals have run, you can extract meaningful insights using common machine-learning classification metrics:
Accuracy
Accuracy measures the overall health of your MCP server's ergonomics: the percentage of all runs that passed with the expected tool calls.
True Positive Rate (TPR or Recall)
Measures how discoverable a tool is, serving as a judge of an individual tool's clarity. A high TPR means the tool is described clearly and the agent knows when to use it.
TPR = correctly called expected tool / total times that tool was expected
False Positive Rate (FPR)
Measures how often a tool is called when it shouldn't be. A high FPR for a given tool might mean it's too generic and needs a tighter scope.
FPR = tool called when not expected / total runs where tool should NOT be used
Precision
Of all cases where the tool was used, what proportion was correct. High precision means the tool was used in the right contexts and isn't being overused.
Precision = correct uses of tool / all uses of tool
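These formulas fall out directly from per-run records of whether a given tool was expected and whether the agent actually called it. A minimal sketch, assuming a hypothetical `Run` record shape (not an MCPJam type):

```typescript
// One record per eval run, from the perspective of a single tool.
interface Run {
  expected: boolean; // the test case expected this tool to be called
  called: boolean;   // the agent actually called it
}

// Compute per-tool TPR, FPR, and precision from a set of run records,
// following the formulas above.
function toolMetrics(runs: Run[]) {
  const tp = runs.filter((r) => r.expected && r.called).length;
  const fp = runs.filter((r) => !r.expected && r.called).length;
  const fn = runs.filter((r) => r.expected && !r.called).length;
  const tn = runs.filter((r) => !r.expected && !r.called).length;
  return {
    tpr: tp / (tp + fn),       // correctly called / total times expected
    fpr: fp / (fp + tn),       // called when not expected / runs where not expected
    precision: tp / (tp + fp), // correct uses / all uses
  };
}
```

Reading the metrics together is what makes them useful: a tool with high TPR but low precision is discoverable yet over-triggered, which usually points to a description that is too broad.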
Run evals within MCPJam
You can run your MCP server evals directly inside the MCPJam Inspector. The platform allows you to:
- View Accuracy, TPR, FPR, Precision, token usage, and cross-model performance
- Create many test cases with queries, expected tool calls, and input parameters (or auto-generate them)
- Iterate on a single test case using the detailed test-case viewer, ideal for debugging failures
- Review complete agent traces to understand decision-making step-by-step
Setting up the MCPJam inspector is a single npx command:
npx @mcpjam/inspector@latest
You can then go to the evals page to set up evals.

OpenAI apps SDK
MCPJam also supports the OpenAI Apps SDK and MCP-UI. You can run evals on MCP servers built for the Apps SDK and view their UI rendering.
