Overview
The Tool Invocation evaluator determines whether an LLM invoked a tool correctly with proper arguments, formatting, and safe content. This evaluator focuses on the how of tool calling - validating that the invocation itself is well-formed - rather than whether the right tool was selected.
When to Use
Use the Tool Invocation evaluator when you need to:
- Validate tool call arguments - Ensure all required parameters are present with correct values
- Check JSON formatting - Verify tool calls are properly structured
- Detect hallucinated fields - Identify when the LLM invents parameters not in the schema
- Audit for unsafe content - Check that arguments don’t contain PII or sensitive data
- Evaluate multi-tool invocations - Validate when the LLM calls multiple tools at once
This evaluator validates tool invocation correctness, not tool selection. For evaluating whether the right tool was chosen, use the Tool Selection evaluator instead.
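Several of the checks above (required parameters, hallucinated fields) are deterministic, so a cheap schema-based pre-check can complement the LLM judge. A minimal sketch, where the helper name and schema shape are hypothetical and loosely follow JSON Schema:

```python
# Hypothetical pre-check (not part of phoenix-evals): deterministic argument
# validation that can run before or alongside the LLM-based evaluator.
def precheck_invocation(call_args: dict, schema: dict) -> list[str]:
    """Flag missing required parameters and hallucinated fields."""
    required = set(schema.get("required", []))
    allowed = set(schema.get("properties", {}))
    issues = [f"missing required parameter: {name}" for name in sorted(required - set(call_args))]
    issues += [f"hallucinated parameter: {name}" for name in sorted(set(call_args) - allowed)]
    return issues

schema = {
    "properties": {"origin": {}, "destination": {}, "date": {}, "time_preference": {}},
    "required": ["origin", "destination", "date"],
}

# The call below is missing "date" and invents a "cabin" parameter.
print(precheck_invocation({"origin": "JFK", "destination": "LAX", "cabin": "economy"}, schema))
# ['missing required parameter: date', 'hallucinated parameter: cabin']
```

A pre-check like this catches mechanical errors without spending an LLM call, leaving the evaluator to judge the harder questions of argument values and content safety.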
Supported Levels
The level of an evaluator determines the scope of the evaluation in OpenTelemetry terms. Some evaluations apply to individual spans, some to full traces or sessions, and some apply at multiple levels.
| Level | Supported | Notes |
|---|---|---|
| Span | Yes | For LLM spans that contain tool calls. Evaluate individual tool-calling decisions. |
Relevant span kinds: Tool spans or LLM spans with tool calls, particularly in agentic applications.
Required Inputs
The Tool Invocation evaluator requires three inputs:
| Field | Type | Description |
|---|---|---|
| input | string | The conversation context (can include multi-turn history) |
| available_tools | string | Tool schemas (JSON schema or human-readable format) |
| tool_selection | string | The LLM's tool invocation(s) with arguments |
In TypeScript, the fields use camelCase: availableTools and toolSelection.
While you can pass full JSON representations for each field, human-readable formats typically produce more accurate evaluations.
input (conversation context adapted from input messages):
```
User: I need to book a flight from New York to Los Angeles
Assistant: I'd be happy to help you book a flight. When would you like to travel?
User: Tomorrow morning, the earliest available
```
available_tools (tool descriptions adapted from JSON schemas):
```
book_flight: Book a flight between two cities
- origin (required): Departure city code (e.g., "JFK", "LAX")
- destination (required): Arrival city code
- date (required): Flight date in YYYY-MM-DD format
- time_preference (optional): "morning", "afternoon", or "evening"

search_hotels: Search for hotel accommodations
- city (required): City name or code
- check_in (required): Check-in date in YYYY-MM-DD format
- check_out (required): Check-out date in YYYY-MM-DD format
```
tool_selection (the LLM’s tool invocation adapted from tool_calls in the output):
```
book_flight(origin="JFK", destination="LAX", date="2024-01-15", time_preference="morning")
```
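If your tool definitions live as JSON schemas, a small helper can flatten them into the readable form shown above before evaluation. A sketch assuming OpenAI-style function definitions; the helper itself is hypothetical:

```python
def schema_to_readable(tool: dict) -> str:
    """Flatten an OpenAI-style function definition (hypothetical input shape)
    into the human-readable format shown above."""
    params = tool.get("parameters", {})
    required = set(params.get("required", []))
    lines = [f"{tool['name']}: {tool.get('description', '')}"]
    for name, spec in params.get("properties", {}).items():
        req = "required" if name in required else "optional"
        lines.append(f"- {name} ({req}): {spec.get('description', spec.get('type', ''))}")
    return "\n".join(lines)

book_flight = {
    "name": "book_flight",
    "description": "Book a flight between two cities",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": 'Departure city code (e.g., "JFK", "LAX")'},
            "destination": {"type": "string", "description": "Arrival city code"},
            "date": {"type": "string", "description": "Flight date in YYYY-MM-DD format"},
            "time_preference": {"type": "string", "description": '"morning", "afternoon", or "evening"'},
        },
        "required": ["origin", "destination", "date"],
    },
}

print(schema_to_readable(book_flight))
```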
Additional tips:
- Include full conversation context - The evaluator considers the entire conversation history to validate argument values
- Multi-tool invocations are supported - If the LLM calls multiple tools, include all invocations in the tool_selection field
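Since tool_selection is a free-form string, one readable convention for multi-tool calls is one invocation per line. A sketch with hypothetical values; the exact layout is up to you:

```python
# Two invocations from one LLM turn, listed one per line in tool_selection.
eval_input = {
    # "input" and "available_tools" as in the earlier example ...
    "tool_selection": (
        'book_flight(origin="JFK", destination="LAX", date="2024-01-15", time_preference="morning")\n'
        'search_hotels(city="LAX", check_in="2024-01-15", check_out="2024-01-18")'
    ),
}
print(eval_input["tool_selection"])
```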
Output Interpretation
The evaluator returns a Score object with the following properties:
| Property | Value | Description |
|---|---|---|
| label | "correct" or "incorrect" | Classification result |
| score | 1.0 or 0.0 | Numeric score (1.0 = correct, 0.0 = incorrect) |
| explanation | string | LLM-generated reasoning for the classification |
| direction | "maximize" | Higher scores are better |
| metadata | object | Additional information such as the model name. When tracing is enabled, includes the trace_id for the evaluation. |
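Because the score is binary, batch results aggregate cleanly into a pass rate. A sketch using plain dictionaries as stand-ins for the Score objects described above (real Score objects expose these as attributes):

```python
# Stand-in results with the fields from the table above (illustrative values).
results = [
    {"label": "correct", "score": 1.0, "explanation": "All required arguments present."},
    {"label": "incorrect", "score": 0.0, "explanation": "Hallucinated 'cabin' parameter."},
    {"label": "correct", "score": 1.0, "explanation": "Valid multi-tool invocation."},
]

pass_rate = sum(r["score"] for r in results) / len(results)
failures = [r["explanation"] for r in results if r["label"] == "incorrect"]
print(f"pass rate: {pass_rate:.2f}")  # pass rate: 0.67
print(failures)
```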
Usage Examples
```python
from phoenix.evals import LLM
from phoenix.evals.metrics import ToolInvocationEvaluator

# Initialize the LLM client
llm = LLM(provider="openai", model="gpt-4o")

# Create the evaluator
tool_invocation_eval = ToolInvocationEvaluator(llm=llm)

# Inspect the evaluator's requirements
print(tool_invocation_eval.describe())

# Evaluate a tool invocation using human-readable format
eval_input = {
    "input": """User: I need to book a flight from New York to Los Angeles
Assistant: I'd be happy to help you book a flight. When would you like to travel?
User: Tomorrow morning, the earliest available""",
    "available_tools": """book_flight: Book a flight between two cities
- origin (required): Departure city code (e.g., "JFK", "LAX")
- destination (required): Arrival city code
- date (required): Flight date in YYYY-MM-DD format
- time_preference (optional): "morning", "afternoon", or "evening"

search_hotels: Search for hotel accommodations
- city (required): City name or code
- check_in (required): Check-in date in YYYY-MM-DD format
- check_out (required): Check-out date in YYYY-MM-DD format""",
    "tool_selection": 'book_flight(origin="JFK", destination="LAX", date="2024-01-15", time_preference="morning")',
}

scores = tool_invocation_eval.evaluate(eval_input)
print(scores[0])
# Score(name='tool_invocation', score=1.0, label='correct', ...)
```
```typescript
import { createToolInvocationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

// Create the evaluator
const toolInvocationEvaluator = createToolInvocationEvaluator({
  model: openai("gpt-4o"),
});

// Evaluate a tool invocation using human-readable format
const result = await toolInvocationEvaluator.evaluate({
  input: `User: I need to book a flight from New York to Los Angeles
Assistant: I'd be happy to help you book a flight. When would you like to travel?
User: Tomorrow morning, the earliest available`,
  availableTools: `book_flight: Book a flight between two cities
- origin (required): Departure city code (e.g., "JFK", "LAX")
- destination (required): Arrival city code
- date (required): Flight date in YYYY-MM-DD format
- time_preference (optional): "morning", "afternoon", or "evening"

search_hotels: Search for hotel accommodations
- city (required): City name or code
- check_in (required): Check-in date in YYYY-MM-DD format
- check_out (required): Check-out date in YYYY-MM-DD format`,
  toolSelection: 'book_flight(origin="JFK", destination="LAX", date="2024-01-15", time_preference="morning")',
});

console.log(result);
// { score: 1, label: "correct", explanation: "..." }
```
When your data has different field names, use input mapping.
```python
from phoenix.evals import LLM
from phoenix.evals.metrics import ToolInvocationEvaluator

llm = LLM(provider="openai", model="gpt-4o")
tool_invocation_eval = ToolInvocationEvaluator(llm=llm)

eval_input = {
    "conversation": """User: Search for hotels in Paris
Assistant: I can help you find hotels. What are your check-in and check-out dates?
User: March 15th to March 20th""",
    "tools_schema": """search_hotels: Search for hotel accommodations
- city (required): City name or code
- check_in (required): Check-in date in YYYY-MM-DD format
- check_out (required): Check-out date in YYYY-MM-DD format""",
    "llm_tool_call": 'search_hotels(city="Paris", check_in="2024-03-15", check_out="2024-03-20")',
}

input_mapping = {
    "input": "conversation",
    "available_tools": "tools_schema",
    "tool_selection": "llm_tool_call",
}

scores = tool_invocation_eval.evaluate(eval_input, input_mapping)
```
For more details on input mapping options, see Input Mapping.

```typescript
import { bindEvaluator, createToolInvocationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const toolInvocationEvaluator = createToolInvocationEvaluator({
  model: openai("gpt-4o"),
});

const boundEvaluator = bindEvaluator(toolInvocationEvaluator, {
  inputMapping: {
    input: "conversation",
    availableTools: "toolsSchema",
    toolSelection: "llmToolCall",
  },
});

const result = await boundEvaluator.evaluate({
  conversation: `User: Search for hotels in Paris
Assistant: I can help you find hotels. What are your check-in and check-out dates?
User: March 15th to March 20th`,
  toolsSchema: `search_hotels: Search for hotel accommodations
- city (required): City name or code
- check_in (required): Check-in date in YYYY-MM-DD format
- check_out (required): Check-out date in YYYY-MM-DD format`,
  llmToolCall: 'search_hotels(city="Paris", check_in="2024-03-15", check_out="2024-03-20")',
});
```
For more details on input mapping options, see Input Mapping.
Configuration
For LLM client configuration options, see Configuring the LLM.
Viewing and Modifying the Prompt
You can view the latest versions of our prompt templates on GitHub. The evaluators are designed to work well in a variety of contexts, but we highly recommend adapting the prompt to be more specific to your use case.
```python
from phoenix.evals import LLM, ClassificationEvaluator
from phoenix.evals.metrics import ToolInvocationEvaluator

llm = LLM(provider="openai", model="gpt-4o")
evaluator = ToolInvocationEvaluator(llm=llm)

# View the prompt template
print(evaluator.prompt_template)

# Create a custom evaluator based on the built-in template
custom_evaluator = ClassificationEvaluator(
    name="tool_invocation",
    prompt_template=evaluator.prompt_template,  # Modify as needed
    llm=llm,
    choices={"correct": 1.0, "incorrect": 0.0},
    direction="maximize",
)
```
```typescript
import {
  TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG,
  createToolInvocationEvaluator,
} from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

// View the prompt template
console.log(TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.template);

// Create a custom evaluator with a modified template
const customEvaluator = createToolInvocationEvaluator({
  model: openai("gpt-4o"),
  promptTemplate: TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.template, // Modify as needed
});
```
Using with Phoenix
Evaluating Traces
Run evaluations on traces collected in Phoenix and log results as annotations:
Running Experiments
Use the Tool Invocation evaluator in Phoenix experiments:
API Reference