# Task Navigation Efficiency Evaluator

### Getting Started

This sample demonstrates how to use the Task Navigation Efficiency Evaluator to evaluate whether an agent's sequence of actions follows optimal decision-making patterns.

Before running the sample:
```bash
pip install azure-ai-projects azure-identity azure-ai-evaluation
```
Note: The Task Navigation Efficiency Evaluator does not require Azure OpenAI configuration as it's a rule-based evaluator.

The Task Navigation Efficiency Evaluator measures how efficiently an agent navigates through a sequence of actions compared to an optimal task completion path.

The evaluator returns a binary match result along with detailed precision, recall, and F1 scores:

**Primary Result:**
- **Binary Match Result**: Pass/Fail based on the selected matching mode

**Available Matching Modes:**
- **Exact Match**: Agent's tool calls must exactly match the ground truth (default)
- **In-Order Match**: All ground truth steps must appear in correct order (allows extra steps)
- **Any-Order Match**: All ground truth steps must appear at least as often as they occur in the ground truth, in any order (most lenient)
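
The three modes above can be illustrated with a minimal sketch over plain lists of tool names. This is not the evaluator's implementation; the function names `exact_match`, `in_order_match`, and `any_order_match` are hypothetical and exist only for this illustration.

```python
from collections import Counter


def exact_match(agent_steps, ground_truth):
    # Same steps in the same order, with nothing extra.
    return list(agent_steps) == list(ground_truth)


def in_order_match(agent_steps, ground_truth):
    # Ground truth must be a subsequence of the agent's steps.
    # The shared iterator is consumed as each required step is found.
    remaining = iter(agent_steps)
    return all(step in remaining for step in ground_truth)


def any_order_match(agent_steps, ground_truth):
    # Every required step must occur at least as often as in the ground truth.
    agent_counts = Counter(agent_steps)
    required = Counter(ground_truth)
    return all(agent_counts[step] >= count for step, count in required.items())


ground_truth = ["search", "analyze", "report"]
with_extra = ["search", "validate", "analyze", "report"]
wrong_order = ["report", "search", "analyze"]

print(exact_match(with_extra, ground_truth))       # False
print(in_order_match(with_extra, ground_truth))    # True
print(in_order_match(wrong_order, ground_truth))   # False
print(any_order_match(wrong_order, ground_truth))  # True
```

Under these assumptions, a path with extra steps fails exact matching but passes in-order matching, while a reordered path passes only any-order matching.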

**Additional Metrics in the Properties Bag (0.0 to 1.0):**
- **Precision**: Fraction of the agent's steps that were necessary (relevant to ground truth)
- **Recall**: Fraction of the required steps that the agent executed
- **F1 Score**: Harmonic mean of precision and recall
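
As a worked illustration of these metrics, the sketch below computes them over lists of tool names using multiset counts. It assumes each step is credited at most as many times as it appears in the ground truth; the evaluator's internal calculation may differ, and `precision_recall_f1` is a hypothetical helper name.

```python
from collections import Counter


def precision_recall_f1(agent_steps, ground_truth):
    agent_counts = Counter(agent_steps)
    truth_counts = Counter(ground_truth)
    # Multiset intersection: a step is matched min(agent count, truth count) times.
    true_positives = sum((agent_counts & truth_counts).values())
    precision = true_positives / len(agent_steps) if agent_steps else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One unnecessary step ("validate") lowers precision but not recall.
precision, recall, f1 = precision_recall_f1(
    ["search", "validate", "analyze", "report"],
    ["search", "analyze", "report"],
)
print(precision, recall, round(f1, 3))  # 0.75 1.0 0.857
```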

The evaluation requires the following inputs:
- **Response**: The agent's response containing tool calls, as a list of messages or a string
- **Ground Truth**: List of expected tool/action steps as strings, or a tuple `(tool_names, parameters_dict)` when parameters must also match
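
To make the response format concrete, the following sketch pulls the ordered tool names out of a message list shaped like the samples in this notebook. The helper `extract_tool_names` is hypothetical and only approximates what an evaluator would need to do before comparing against the ground truth.

```python
def extract_tool_names(messages):
    """Return tool names, in call order, from assistant messages."""
    names = []
    for message in messages:
        if message.get("role") != "assistant":
            continue
        content = message.get("content")
        # Plain-text content carries no structured tool calls.
        if not isinstance(content, list):
            continue
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_call":
                names.append(item.get("name"))
    return names


messages = [
    {"role": "user", "content": "Please summarize the data."},
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "search", "arguments": {}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "analyze", "arguments": {}}],
    },
]
print(extract_tool_names(messages))  # ['search', 'analyze']
```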

### Initialize Task Navigation Efficiency Evaluator

In [None]:
from azure.ai.evaluation._evaluators._task_navigation_efficiency import _TaskNavigationEfficiencyEvaluator, _TaskNavigationEfficiencyMatchingMode
from pprint import pprint

# Initialize with exact match mode
task_navigation_efficiency_evaluator = _TaskNavigationEfficiencyEvaluator(
    matching_mode=_TaskNavigationEfficiencyMatchingMode.EXACT_MATCH
)

# Other examples:
# For in-order matching (allows extra steps but requires correct order)
# task_navigation_efficiency_evaluator = _TaskNavigationEfficiencyEvaluator(matching_mode=_TaskNavigationEfficiencyMatchingMode.IN_ORDER_MATCH)

# For any-order matching (most lenient - allows extra steps and different order)  
# task_navigation_efficiency_evaluator = _TaskNavigationEfficiencyEvaluator(matching_mode=_TaskNavigationEfficiencyMatchingMode.ANY_ORDER_MATCH)

# Or use defaults (exact match mode)
# task_navigation_efficiency_evaluator = _TaskNavigationEfficiencyEvaluator()

### Task Navigation Efficiency Examples

#### Sample 1: Perfect Path (Exact Match)

In [None]:
# Agent follows the exact optimal path
response = [
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "search", "arguments": {}}],
    },
    {
        "role": "assistant", 
        "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "analyze", "arguments": {}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_3", "name": "report", "arguments": {}}],
    },
]

ground_truth = ["search", "analyze", "report"]

result = task_navigation_efficiency_evaluator(response=response, ground_truth=ground_truth)
print("Perfect Path Results:")
pprint(result)

#### Sample 2: Efficient Path with Extra Steps

In [None]:
# Agent performs all required steps but adds an extra, unnecessary step
response = [
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "search", "arguments": {}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "validate", "arguments": {}}],
    },
    {
        "role": "assistant", 
        "content": [{"type": "tool_call", "tool_call_id": "call_3", "name": "analyze", "arguments": {}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_4", "name": "report", "arguments": {}}],
    },
]

ground_truth = ["search", "analyze", "report"]

result = task_navigation_efficiency_evaluator(response=response, ground_truth=ground_truth)
print("\nPath with Extra Steps Results:")
pprint(result)

#### Sample 3: Inefficient Path (Wrong Order)

In [None]:
# Agent performs all required steps but in the wrong order
response = [
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "report", "arguments": {}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "search", "arguments": {}}],
    },
    {
        "role": "assistant", 
        "content": [{"type": "tool_call", "tool_call_id": "call_3", "name": "analyze", "arguments": {}}],
    },
]

ground_truth = ["search", "analyze", "report"]

# Using in-order matching mode to demonstrate the difference
in_order_task_navigation_efficiency_evaluator = _TaskNavigationEfficiencyEvaluator(matching_mode=_TaskNavigationEfficiencyMatchingMode.IN_ORDER_MATCH)

result = in_order_task_navigation_efficiency_evaluator(response=response, ground_truth=ground_truth)
print("\nWrong Order Results:")
pprint(result)

#### Sample 4: Incomplete Path (Missing Steps)

In [None]:
# Agent performs only some of the required steps (incomplete)
response = [
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "search", "arguments": {}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "analyze", "arguments": {}}],
    },
]

ground_truth = ["search", "analyze", "report"]

result = task_navigation_efficiency_evaluator(response=response, ground_truth=ground_truth)
print("\nMissing Steps Results:")
pprint(result)

#### Sample 5: Real-World Customer Service Scenario

In [None]:
# Real-world example: Customer service agent handling a refund request
response = [
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "lookup_order", "arguments": {"order_id": "12345"}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "check_inventory", "arguments": {"product_id": "ABC123"}}],
    },
    {
        "role": "assistant", 
        "content": [{"type": "tool_call", "tool_call_id": "call_3", "name": "calculate_refund", "arguments": {"order_id": "12345"}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_4", "name": "process_refund", "arguments": {"order_id": "12345", "amount": "29.99"}}],
    },
]

ground_truth = ["lookup_order", "calculate_refund", "process_refund"]

result = task_navigation_efficiency_evaluator(response=response, ground_truth=ground_truth)
print("\nCustomer Service Results:")
pprint(result)

#### Sample 6: Complex Path with Duplicates

In [None]:
# Agent repeats some steps and includes extra ones
response = [
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "search", "arguments": {}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "search", "arguments": {}}],  # duplicate
    },
    {
        "role": "assistant", 
        "content": [{"type": "tool_call", "tool_call_id": "call_3", "name": "validate", "arguments": {}}],  # extra step
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_4", "name": "analyze", "arguments": {}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_5", "name": "report", "arguments": {}}],
    },
]

ground_truth = ["search", "analyze", "report"]

result = task_navigation_efficiency_evaluator(response=response, ground_truth=ground_truth)
print("\nComplex Path with Duplicates Results:")
pprint(result)

#### Sample 7: Edge Cases and Error Scenarios

In [None]:
# Test edge cases

# Test with empty response
try:
    response = []
    ground_truth = ["search", "analyze", "report"]
    
    result = task_navigation_efficiency_evaluator(response=response, ground_truth=ground_truth)
    print("\nEmpty Response Results:")
    pprint(result)
except Exception as e:
    print(f"Error with empty response: {e}")

# Test with empty ground truth (should raise error)
try:
    response = [
        {
            "role": "assistant",
            "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "search", "arguments": {}}],
        }
    ]
    ground_truth = []
    
    result = task_navigation_efficiency_evaluator(response=response, ground_truth=ground_truth)
    print("\nEmpty Ground Truth Results:")
    pprint(result)
except Exception as e:
    print(f"Error with empty ground truth: {e}")

#### Sample 8: Tuple Format with Parameters

In [None]:
# _TaskNavigationEfficiencyEvaluator also supports a tuple format with parameters for exact parameter matching
response_with_params = [
    {
        "role": "assistant",
        "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "search", "arguments": {"query": "test"}}],
    },
]

# Ground truth using tuple format: (tool_names, parameters_dict)
# Parameters must match exactly for tools to be considered matching
ground_truth_with_params = (["search"], {"search": {"query": "test"}})

result = task_navigation_efficiency_evaluator(response=response_with_params, ground_truth=ground_truth_with_params)
print("\nTuple Format with Parameters Results:")
pprint(result)
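
One way to picture parameter-aware matching is to compare `(name, arguments)` pairs instead of bare tool names. The sketch below is an assumption about how such a comparison could work, not the evaluator's code; `step_key` and `steps_from_ground_truth` are hypothetical helpers.

```python
import json


def step_key(name, arguments):
    # Serialize arguments with sorted keys so dict ordering does not affect equality.
    return (name, json.dumps(arguments, sort_keys=True))


def steps_from_ground_truth(ground_truth):
    # Ground truth is a tuple: (tool_names, parameters_dict).
    tool_names, parameters = ground_truth
    return [step_key(name, parameters.get(name, {})) for name in tool_names]


expected = steps_from_ground_truth((["search"], {"search": {"query": "test"}}))

matching_call = [step_key("search", {"query": "test"})]
different_call = [step_key("search", {"query": "other"})]

print(matching_call == expected)   # True
print(different_call == expected)  # False
```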

#### Sample 9: String Response Input Type

In [None]:
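# Demonstrate the string response input type
# The string response should contain structured tool call information that can be parsed.
# Note: the sample string below is plain prose, so the evaluator may find no tool calls in it
# and report a failure or raise an error.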
string_response = "I'll help you with that. Let me search for information, then analyze the results, and finally provide a report."
ground_truth = ["search", "analyze", "report"]

result = task_navigation_efficiency_evaluator(response=string_response, ground_truth=ground_truth)
print("\nString Response Results:")
pprint(result)

### Evaluation Analysis Helper Function

In [None]:
# Helper functions for analysis

def analyze_task_navigation_efficiency(response, ground_truth, scenario_name, evaluator=None):
    """
    Helper function to analyze and display task navigation efficiency results
    """
    if evaluator is None:
        evaluator = task_navigation_efficiency_evaluator
        
    result = evaluator(response=response, ground_truth=ground_truth)
    
    print(f"\n{'='*50}")
    print(f"Analysis for: {scenario_name}")
    print(f"{'='*50}")
    
    print(f"Ground Truth Steps: {ground_truth}")
    print(f"Evaluator Matching Mode: {evaluator.matching_mode.value}")
    print(f"{'='*50}")
    
    # Display the returned results
    for key, value in result.items():
        if key == "task_navigation_efficiency_details":
            print(f"  {key}:")
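            for prop_key, prop_value in value.items():
                # Format numeric metrics to three decimals; print any other value as-is
                if isinstance(prop_value, (int, float)) and not isinstance(prop_value, bool):
                    print(f"    {prop_key}: {prop_value:.3f}")
                else:
                    print(f"    {prop_key}: {prop_value}")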
        else:
            print(f"  {key}: {value}")

    return result

# Example with different matching modes
def compare_matching_modes(response, ground_truth, scenario_name):
    """
    Compare results across different matching modes for the same scenario
    """
    print(f"\n{'='*60}")
    print(f"Matching Mode Comparison for: {scenario_name}")
    print(f"{'='*60}")
    
    matching_modes_to_test = [
        _TaskNavigationEfficiencyMatchingMode.EXACT_MATCH,
        _TaskNavigationEfficiencyMatchingMode.IN_ORDER_MATCH,
        _TaskNavigationEfficiencyMatchingMode.ANY_ORDER_MATCH
    ]
    
    for mode in matching_modes_to_test:
        evaluator = _TaskNavigationEfficiencyEvaluator(matching_mode=mode)
        result = evaluator(response=response, ground_truth=ground_truth)
        
        # Get the main result value
        result_value = result.get("task_navigation_efficiency_result", "N/A")
        print(f"  {mode.value.upper():15}: {result_value}")
    

### Example Usage of Helper Function

In [None]:
# Example: Using the helper function to analyze different scenarios

# Scenario 1: Perfect efficiency
perfect_response = [
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "authenticate", "arguments": {}}]},
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "fetch_data", "arguments": {}}]},
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_3", "name": "process_result", "arguments": {}}]},
]
perfect_ground_truth = ["authenticate", "fetch_data", "process_result"]

analyze_task_navigation_efficiency(perfect_response, perfect_ground_truth, "Perfect Efficiency Example")

# Scenario 2: Inefficient with extra steps
inefficient_response = [
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "authenticate", "arguments": {}}]},
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "log_attempt", "arguments": {}}]},  # extra
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_3", "name": "fetch_data", "arguments": {}}]},
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_4", "name": "validate_data", "arguments": {}}]},  # extra
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_5", "name": "process_result", "arguments": {}}]},
]
inefficient_ground_truth = ["authenticate", "fetch_data", "process_result"]

analyze_task_navigation_efficiency(inefficient_response, inefficient_ground_truth, "Inefficient Path with Extra Steps")

# Demonstrate different matching modes
print("\n" + "="*60)
print("COMPARING DIFFERENT MATCHING MODES")
print("="*60)

compare_matching_modes(inefficient_response, inefficient_ground_truth, "Inefficient Path Analysis")

# Example: Creating evaluators with different matching modes
print(f"\n{'='*60}")
print("INDIVIDUAL MATCHING MODE EXAMPLES")
print("="*60)

# Exact match evaluator
exact_match_evaluator = _TaskNavigationEfficiencyEvaluator(matching_mode=_TaskNavigationEfficiencyMatchingMode.EXACT_MATCH)
exact_result = exact_match_evaluator(response=perfect_response, ground_truth=perfect_ground_truth)
print(f"Exact Match Evaluator: {exact_result}")

# In-order match evaluator
in_order_evaluator = _TaskNavigationEfficiencyEvaluator(matching_mode=_TaskNavigationEfficiencyMatchingMode.IN_ORDER_MATCH)
in_order_result = in_order_evaluator(response=inefficient_response, ground_truth=inefficient_ground_truth)
print(f"In-Order Match Evaluator: {in_order_result}")

# Any-order match evaluator (most lenient)
any_order_evaluator = _TaskNavigationEfficiencyEvaluator(matching_mode=_TaskNavigationEfficiencyMatchingMode.ANY_ORDER_MATCH)
any_order_result = any_order_evaluator(response=inefficient_response, ground_truth=inefficient_ground_truth)
print(f"Any-Order Match Evaluator: {any_order_result}")