hyogrin/agent-operator-lab

Agent Operator Lab

This repo is a hands-on workshop for operating AI agents at scale in Azure AI Foundry. It covers the full lifecycle of agent operations, from initial setup and fleet management to production-grade reliability and observability.

What you'll learn:

  • Control Plane Operations: Register agents and workflows, run fleet-wide monitoring, and connect both Microsoft Agent Framework (MAF) agents and Hosted Agents to Azure AI Foundry
  • Workload Optimization: Deep dive into anti-patterns vs best practices for context management, model migration strategies (canary rollout via APIM), multi-region routing, and caching techniques to reduce latency and cost
  • Reliability Patterns: Real-world retry strategies (exponential backoff, 429/5xx handling), multi-backend fallback, and timeout management (TTFT, streaming, adaptive) with practical error injection scenarios
  • Observability: End-to-end tracing with OpenTelemetry, Azure Application Insights integration, and Grafana dashboard visualization for AI workloads
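As a rough illustration of the retry strategy named under Reliability Patterns, here is a minimal sketch of exponential backoff with jitter that retries only on 429 and 5xx responses. It is not taken from the notebooks: TransientError, is_retryable, call_with_backoff, and the default delays are all hypothetical names and values chosen for this example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # 429 (rate limited) and 5xx (server errors) are usually worth retrying.
    return status == 429 or 500 <= status < 600


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call fn(), retrying retryable failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc.status):
                raise
            # Full jitter: wait a random time in [0, min(max_delay, base * 2^attempt)].
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The sleep parameter is injectable so the behaviour can be exercised without real waiting. A production client would typically also honour a Retry-After header when the service returns one.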

What's Included

0. Setup

  • 0_setup/1_setup.ipynb: Bootstrap the minimum Foundry resources (Resource Group, AIServices account, Project), discover the Project endpoint/API key, and write a local config file for reuse.

1. Control Plane

2. Workload Optimization

3. Reliability

4. Observability

  • 4_observability/1_tracing_and_logging.ipynb: Observability patterns with OpenTelemetry and Azure Application Insights—distributed tracing, structured logging, metrics collection, and Grafana dashboard visualization.

  • 4_observability/2_evaluation_pipeline.ipynb: Stage-by-stage evaluation pipeline using the Azure AI Foundry Evals API: register custom evaluators (intent/agent/method exact-match), run built-in evaluators (groundedness, coherence, relevance, similarity), and generate HTML dashboards with Foundry portal integration.

  • 4_observability/segment-eval-pipeline.py: Standalone CLI for evaluating Application Insights traces. Supports csv-import (parse App Insights CSV), evaluate (live/offline with Foundry or local mode), and full (combined). See SK Backend README for details.
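To make the intent/agent/method exact-match idea concrete, the sketch below shows only the scoring logic such an evaluator might apply. It does not use the Foundry Evals API, and the row shape (expected and actual dictionaries) and the function names are assumptions made for this example rather than the notebook's actual schema.

```python
def exact_match(expected, actual) -> float:
    """Return 1.0 if both labels match after trimming and case-folding, else 0.0."""
    if expected is None or actual is None:
        return 0.0  # a missing label counts as a mismatch
    return 1.0 if expected.strip().casefold() == actual.strip().casefold() else 0.0


def evaluate_rows(rows, fields=("intent", "agent", "method")):
    """Average the exact-match score per field over all rows.

    Each row is assumed to look like:
        {"expected": {"intent": ..., "agent": ..., "method": ...},
         "actual":   {"intent": ..., "agent": ..., "method": ...}}
    """
    totals = {field: 0.0 for field in fields}
    for row in rows:
        for field in fields:
            totals[field] += exact_match(
                row["expected"].get(field), row["actual"].get(field)
            )
    count = max(len(rows), 1)  # avoid dividing by zero on empty input
    return {field: totals[field] / count for field in fields}
```

Normalizing whitespace and case before comparing is a design choice; a stricter evaluator could compare the raw strings instead.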

[Screenshots: Evaluation Result; MS Foundry Evaluation Result Analysis]

Prerequisites

  • Python 3.12+
  • Azure CLI (az) and an Azure account with access to Azure AI Foundry
  • For Hosted Agents: Docker and an Azure Container Registry (ACR)
  • For observability: Application Insights resource
  • Foundry and Project resources are created via the az CLI during setup

Setup

Install dependencies with uv, then activate the virtual environment:

uv sync --prerelease=allow
source .venv/bin/activate

Copy environment variables:

cp sample.env .env

Update .env values as needed. The variables are organized by category:

Azure Core

  • AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID: Your Azure tenant and subscription identifiers
  • AZURE_CONTAINER_REGISTRY: Required for Hosted Agent container builds

Azure OpenAI

  • AZURE_OPENAI_ENDPOINT: Azure OpenAI endpoint URL
  • AZURE_OPENAI_API_KEY: Azure OpenAI API key
  • AZURE_AI_MODEL_DEPLOYMENT_NAME, AZURE_OPENAI_CHAT_DEPLOYMENT_NAME: Model deployment names
  • AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME: Embedding model for semantic caching (e.g., text-embedding-3-large)

Model Migration (Multi-Backend)

  • BACKEND_A_AZURE_OPENAI_ENDPOINT, BACKEND_A_AZURE_OPENAI_API_KEY, BACKEND_A_DEPLOYMENT: Primary backend (e.g., GPT-4o)
  • BACKEND_B_AZURE_OPENAI_ENDPOINT, BACKEND_B_AZURE_OPENAI_API_KEY, BACKEND_B_DEPLOYMENT: Secondary backend (e.g., GPT-5.x)
  • APIM_SUBSCRIPTION_KEY: Azure API Management subscription key for routing labs
  • APIM_TEST_RUNS, APIM_TIMEOUT_S: Test configuration parameters
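In the labs the canary split between the two backends is handled by Azure API Management. Purely to illustrate the idea, here is a client-side sketch that sends a fixed percentage of requests to the secondary backend using a stable hash of a request ID. The function name, the request_id parameter, and the returned labels are hypothetical and do not reflect the APIM policy used in the notebooks.

```python
import hashlib


def pick_backend(request_id: str, canary_percent: int) -> str:
    """Route roughly canary_percent% of request IDs to backend B, deterministically."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the ID to a stable bucket in [0, 100); the same ID always gets the same bucket.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "BACKEND_B" if bucket < canary_percent else "BACKEND_A"
```

Hashing instead of random sampling keeps a given request ID (or user ID, if that is used as the key) on the same backend for the whole rollout, which makes side-by-side comparison of the two models easier.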

Observability

  • APPLICATIONINSIGHTS_CONNECTION_STRING: Azure Application Insights connection string
  • APPLICATIONINSIGHTS_RESOURCE_ID: Application Insights resource ID for Grafana integration

Caching

  • REDIS_ENDPOINT, REDIS_PASSWORD: Azure Cache for Redis credentials for response/semantic caching
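For orientation, exact-match response caching can be sketched as follows: build a deterministic key from the deployment name and the chat messages, then store the response with a time-to-live. An in-memory dictionary stands in for Redis here, and cache_key, TTLCache, and the "resp:" key prefix are assumptions made for this example, not the notebook's implementation. Semantic caching, which matches on embedding similarity instead of an exact key, is not shown.

```python
import hashlib
import json
import time


def cache_key(deployment: str, messages: list) -> str:
    """Build a deterministic key from the deployment name and chat messages."""
    payload = json.dumps(
        {"deployment": deployment, "messages": messages},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "resp:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop expired entries lazily on read
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl_s)
```

Serializing with sort_keys=True means two requests with the same content but different dictionary key order map to the same cache entry.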

Optional

  • BING_GROUNDING_CONNECTION_NAME: Only required for the web-search hosted agent scenario

Note: Authentication is typically done via DefaultAzureCredential (e.g., az login). Some flows may also use AZURE_OPENAI_API_KEY depending on the notebook/sample.
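Before running a notebook it can help to confirm that the variables it needs are actually set. The sketch below is one way to do that with the standard library only. Which variables count as required differs per notebook, so the REQUIRED tuple is an assumption for illustration, and missing_vars is a hypothetical helper, not part of the repo.

```python
import os

# Assumed minimal set for illustration; adjust per notebook.
REQUIRED = (
    "AZURE_TENANT_ID",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_AI_MODEL_DEPLOYMENT_NAME",
)


def missing_vars(required=REQUIRED, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

This reads the process environment, so the values from .env must already be loaded (for example by the notebook or your shell) before the check runs.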

Workshop Snapshots

This section showcases what you will experience throughout the hands-on lab.

Agent Fleet Registration & Workflow Management

Register multiple agents and workflows in Azure AI Foundry for centralized management:

Registered Agent Fleet

Registered Workflow

Real-Time Agent Simulation

Run agent simulations and monitor live metrics directly in the Azure AI Foundry portal:

Agent Simulation Results in Portal

Distributed Tracing with OpenTelemetry

Trace agent execution flows with detailed span information via Application Insights:

Agent Tracing Detail

Application Insights Metrics

View aggregated metrics (latency, error rates, token usage) exported from OpenTelemetry:

Simulation Results in Application Insights

Grafana Dashboard Visualization

Visualize AI gateway metrics in Azure Managed Grafana with custom dashboards:

Grafana Dashboard

Hosted Agent

Sample agents live under 1_controlplane/1.1_hosted-agent_sdk/:

  • calculator-agent: LangGraph + Hosting Adapter (simple arithmetic tools)
  • msft-docs-agent: MAF-based example
  • workflow-agent: concurrent workflow example
  • web-search-agent: grounding with Bing Search connection

Security Notes

  • Do not commit .env or 0_setup/.foundry_config.json (they can contain secrets like API keys).
  • If a key was ever committed, rotate the key in Azure and rewrite Git history before sharing the repo.

⚠️ Resource Cleanup Warning

Important: Running these notebooks creates Azure resources that incur costs. After completing the workshop, make sure to clean up resources to avoid unexpected charges:

  1. Azure Managed Grafana: Delete via the Azure portal or run the cleanup cell in 4_observability/1_tracing_and_logging.ipynb
  2. Azure Cache for Redis: Delete via the Azure portal or run the cleanup cell in 2_workload_optimization/5_caching_strategies.ipynb
  3. Azure API Management: Delete via the Azure portal if created during the model migration labs
  4. Azure Container Registry images: Remove unused container images
  5. Application Insights: Retain or delete based on your monitoring needs

To delete all resources in the resource group at once:

az group delete --name <your-resource-group> --yes --no-wait

Tip: Each notebook includes a "Cleanup Resources" section at the end. Set the cleanup flag (e.g., DELETE_RESOURCES=True) and run the cell to remove resources created by that notebook.
