# Tool Call Accuracy Evaluator

### Getting Started
This sample demonstrates how to use the Tool Call Accuracy evaluator on agent data. The supported input formats include:
- simple data such as strings and `dict` objects describing tool calls;
- user-agent conversations in the form of a list of agent messages.

Before you begin:
```bash
pip install azure-ai-evaluation
```
Set these environment variables with your own values:
1) **MODEL_DEPLOYMENT_NAME** - The deployment name of the model for this AI-assisted evaluator, as found under the "Name" column in the "Models + endpoints" tab in your Azure AI Foundry project.
2) **AZURE_OPENAI_ENDPOINT** - The Azure OpenAI endpoint to be used for evaluation.
3) **AZURE_OPENAI_API_KEY** - The Azure OpenAI API key to be used for evaluation.
4) **AZURE_OPENAI_API_VERSION** - The Azure OpenAI API version to be used for evaluation.
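
Before initializing the evaluator, it can help to confirm that all four variables are set. The following is a minimal sketch that uses only the standard library; the helper name `missing_env_vars` is hypothetical and not part of `azure-ai-evaluation`.

```python
import os

# The four variables listed above.
REQUIRED_VARS = (
    "MODEL_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

An empty list means every variable has a non-empty value; it does not verify that the endpoint, key, or deployment name is actually valid.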

The Tool Call Accuracy evaluator assesses how accurately an AI uses tools by examining:
- Relevance to the conversation
- Parameter correctness according to tool definitions
- Parameter value extraction from the conversation
- Potential usefulness of the tool call
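
The evaluator itself is AI-assisted, so the checks above are judged by a model. As a rough illustration of what "parameter correctness according to tool definitions" can mean, here is a deterministic pre-check sketch that flags argument names a tool definition does not declare. The function `unknown_arguments` is hypothetical and is not how the evaluator works internally; it only assumes the `dict` shapes used in the samples in this notebook.

```python
def unknown_arguments(tool_call, tool_definitions):
    """Return argument names in the call that the matching definition does not declare.

    Returns None when no definition matches the tool call's name.
    """
    # Accept either a single definition dict or a list of definitions.
    if isinstance(tool_definitions, dict):
        tool_definitions = [tool_definitions]
    for definition in tool_definitions:
        if definition.get("name") == tool_call.get("name"):
            declared = definition.get("parameters", {}).get("properties", {})
            return sorted(set(tool_call.get("arguments", {})) - set(declared))
    return None


weather_definition = {
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
    },
}

good_call = {"type": "tool_call", "name": "fetch_weather", "arguments": {"location": "Seattle"}}
bad_call = {"type": "tool_call", "name": "fetch_weather", "arguments": {"city": "Seattle"}}

print(unknown_arguments(good_call, weather_definition))  # []
print(unknown_arguments(bad_call, weather_definition))   # ['city']
```

A check like this only catches structural mismatches; relevance and whether values were really taken from the conversation still need the evaluator.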

The evaluator uses binary scoring (0 or 1) for each tool call:

- Score 0: The tool call is irrelevant or contains information not in the conversation/definition
- Score 1: The tool call is relevant with properly extracted parameters from the conversation

If there are multiple tool calls, the final score is the **average** of the individual tool call scores, which can be interpreted as the **passing rate** of tool calls.
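
The averaging step can be sketched as plain arithmetic. The helper `passing_rate` below is hypothetical and only illustrates how per-call binary scores combine; the actual aggregation happens inside the evaluator.

```python
def passing_rate(scores):
    """Average a non-empty list of binary (0 or 1) per-tool-call scores."""
    if not scores:
        raise ValueError("at least one tool call score is required")
    if any(score not in (0, 1) for score in scores):
        raise ValueError("scores must be 0 or 1")
    return sum(scores) / len(scores)


# Three of four tool calls scored 1, so the passing rate is 0.75.
print(passing_rate([1, 0, 1, 1]))  # 0.75
```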

This evaluation focuses on measuring whether tool calls meaningfully contribute to addressing the query while properly following tool definitions and using information present in the conversation history.

Tool Call Accuracy requires the following inputs:
- Query - This can be a single query or a list of messages (conversation history with the agent). The latter helps determine whether the agent used information from the history to make the right tool calls.
- Tool Calls - (Optional) Tool call(s) made by the agent to answer the query. If not provided, the evaluator looks for tool calls in the response.
- Response - (Optional) Response from the agent (or any GenAI app). This can be a single text response or a list of messages generated as part of the agent response. If tool calls are not provided, the evaluator looks for tool calls in the response.
- Tool Definitions - Definition(s) of the tool(s) available to the agent to answer the query.


### Initialize Tool Call Accuracy Evaluator


In [None]:
import os
from azure.ai.evaluation import ToolCallAccuracyEvaluator, AzureOpenAIModelConfiguration
from pprint import pprint

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)


tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

### Samples

#### Evaluating Single Tool Call

In [None]:
query = "How is the weather in Seattle?"
tool_call = {
    "type": "tool_call",
    "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
    "name": "fetch_weather",
    "arguments": {"location": "Seattle"},
}

tool_definition = {
    "id": "fetch_weather",
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
    },
}

In [None]:
response = tool_call_accuracy(query=query, tool_calls=tool_call, tool_definitions=tool_definition)
pprint(response)

#### Multiple Tool Calls used by Agent to respond

In [None]:
query = "How is the weather in Seattle?"
tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
        "name": "fetch_weather",
        "arguments": {"location": "Seattle"},
    },
    {
        "type": "tool_call",
        "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
        "name": "fetch_weather",
        "arguments": {"location": "London"},
    },
]

tool_definition = {
    "id": "fetch_weather",
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
    },
}

In [None]:
response = tool_call_accuracy(query=query, tool_calls=tool_calls, tool_definitions=tool_definition)
pprint(response)

#### Tool Calls passed as part of `Response` (common for agent case)
- The Tool Call Accuracy evaluator extracts tool calls from the response.
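
To make the extraction idea concrete, here is a minimal sketch of pulling `tool_call` items out of a list of agent messages. It assumes the message shape used in this notebook (assistant messages whose `content` is a list of typed items); `extract_tool_calls` is a hypothetical helper, not the library's implementation.

```python
def extract_tool_calls(messages):
    """Collect tool_call content items from assistant messages, in order."""
    tool_calls = []
    for message in messages:
        if message.get("role") != "assistant":
            continue  # tool results and user turns carry no tool calls
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content has nothing to extract
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_call":
                tool_calls.append(item)
    return tool_calls


sample_messages = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_example_1",  # illustrative ID
                "name": "fetch_weather",
                "arguments": {"location": "Seattle"},
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_example_1",
        "content": [{"type": "tool_result", "tool_result": {"weather": "Rainy"}}],
    },
    {"role": "assistant", "content": [{"type": "text", "text": "It is rainy in Seattle."}]},
]

print([call["name"] for call in extract_tool_calls(sample_messages)])  # ['fetch_weather']
```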

In [None]:
query = "Can you send me an email with weather information for Seattle?"
response = [
    {
        "createdAt": "2025-03-26T17:27:35Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
                "name": "fetch_weather",
                "arguments": {"location": "Seattle"},
            }
        ],
    },
    {
        "createdAt": "2025-03-26T17:27:37Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
        "role": "tool",
        "content": [{"type": "tool_result", "tool_result": {"weather": "Rainy, 14\u00b0C"}}],
    },
    {
        "createdAt": "2025-03-26T17:27:38Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_iq9RuPxqzykebvACgX8pqRW2",
                "name": "send_email",
                "arguments": {
                    "recipient": "your_email@example.com",
                    "subject": "Weather Information for Seattle",
                    "body": "The current weather in Seattle is rainy with a temperature of 14\u00b0C.",
                },
            }
        ],
    },
    {
        "createdAt": "2025-03-26T17:27:41Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "tool_call_id": "call_iq9RuPxqzykebvACgX8pqRW2",
        "role": "tool",
        "content": [
            {"type": "tool_result", "tool_result": {"message": "Email successfully sent to your_email@example.com."}}
        ],
    },
    {
        "createdAt": "2025-03-26T17:27:42Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "I have successfully sent you an email with the weather information for Seattle. The current weather is rainy with a temperature of 14\u00b0C.",
            }
        ],
    },
]

tool_definitions = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
        },
    },
    {
        "name": "send_email",
        "description": "Sends an email with the specified subject and body to the recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "recipient": {"type": "string", "description": "Email address of the recipient."},
                "subject": {"type": "string", "description": "Subject of the email."},
                "body": {"type": "string", "description": "Body content of the email."},
            },
        },
    },
]

In [None]:
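# Store the evaluation output under a separate name so the `response` input is not
# overwritten if this cell is run again.
result = tool_call_accuracy(query=query, response=response, tool_definitions=tool_definitions)
pprint(result)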

#### Response as String (str)

In [None]:
query = "Book a flight to New York for tomorrow"

# Response as a simple string instead of a list of messages
response = "I've found several flight options to New York for tomorrow. I'll use the booking tool to reserve your seat."

tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_book_flight_456",
        "name": "book_flight",
        "arguments": {
            "destination": "New York",
            "date": "tomorrow"
        },
    }
]

tool_definitions = [
    {
        "name": "book_flight",
        "description": "Books a flight to the specified destination on the given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "destination": {"type": "string", "description": "The destination city."},
                "date": {"type": "string", "description": "The date of the flight."},
            },
        },
    }
]

result = tool_call_accuracy(query=query, response=response, tool_calls=tool_calls, tool_definitions=tool_definitions)
pprint(result)

#### Response as List[dict] with Tool Definition as Single Dict

In [None]:
query = "What's the weather in San Francisco?"

response = [
    {
        "createdAt": "2025-03-26T18:15:22Z",
        "run_id": "run_abc123",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_sf_weather_789",
                "name": "fetch_weather",
                "arguments": {"location": "San Francisco"},
            }
        ],
    },
    {
        "createdAt": "2025-03-26T18:15:24Z",
        "run_id": "run_abc123",
        "tool_call_id": "call_sf_weather_789",
        "role": "tool",
        "content": [{"type": "tool_result", "tool_result": {"weather": "Foggy, 18°C"}}],
    },
    {
        "createdAt": "2025-03-26T18:15:25Z",
        "run_id": "run_abc123",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The weather in San Francisco is currently foggy with a temperature of 18°C.",
            }
        ],
    },
]

tool_definition_dict = {
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
    },
}

result = tool_call_accuracy(query=query, response=response, tool_definitions=tool_definition_dict)
pprint(result)

#### Query as Conversation History (List of Messages)
The evaluator also supports query as a list of messages representing conversation history. This helps determine if the Agent used the information in the conversation history to make the right tool calls.

In [None]:
# Query as conversation history instead of a single string
query_as_conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can fetch weather information and send emails."
    },
    {
        "role": "user",
        "content": "Hi, can you check the weather in Seattle for me?"
    },
    {
        "role": "user",
        "content": "Actually, could you also send me an email with that weather information to john@example.com?"
    }
]

tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_weather_123",
        "name": "fetch_weather",
        "arguments": {"location": "Seattle"},
    },
    {
        "type": "tool_call",
        "tool_call_id": "call_email_456",
        "name": "send_email",
        "arguments": {
            "recipient": "john@example.com",
            "subject": "Weather Information for Seattle",
            "body": "Here is the weather information you requested."
        },
    },
]

tool_definitions = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
        },
    },
    {
        "name": "send_email",
        "description": "Sends an email with the specified subject and body to the recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "recipient": {"type": "string", "description": "Email address of the recipient."},
                "subject": {"type": "string", "description": "Subject of the email."},
                "body": {"type": "string", "description": "Body content of the email."},
            },
        },
    },
]

In [None]:
response = tool_call_accuracy(query=query_as_conversation, tool_calls=tool_calls, tool_definitions=tool_definitions)
pprint(response)