Troubleshooting

MAF Agent Fails to Start / "Failed to acquire Foundry auth token"

The backend logs show an auth error when trying to invoke a hosted agent.

Cause: DefaultAzureCredential cannot acquire a token for the Foundry Responses API.

Fix:

Local dev (Docker Compose): Ensure you are logged in to Azure CLI:
```
az login
az account set --subscription <your-subscription-id>
```
Azure (production): Verify the backend Container App's managed identity has the CognitiveServicesOpenAIUser role on the Foundry account, and the Foundry project's managed identity has Cognitive Services OpenAI Contributor and Azure AI User roles on the Foundry account:
- Check infra/modules/role-assignments.bicep
- Re-run azd provision to reapply role assignments if missing
Both: Confirm AZURE_AI_PROJECT_ENDPOINT is set correctly:
```
https://<account>.services.ai.azure.com/api/projects/<project-name>
```
The /api/projects/<project-name> segment is required — the bare account endpoint will not work.

PubMed MCP: "Session terminated" / search_articles Fails

PubMed literature search fails with search_articles: PubMed search failed but other MCP tools (ICD-10, Clinical Trials, NPI, CMS) work fine.

Cause: PubMed's MCP server (pubmed.mcp.claude.com) terminates idle sessions after ~10 minutes. The agent container reuses the same MCP session across requests. If the session has been idle too long between user submissions, PubMed responds with McpError("Session terminated").

Fix (already applied): The clinical agent uses _ReconnectingMCPTool — a subclass of MCPStreamableHTTPTool that catches Session terminated errors and automatically reconnects with a fresh session. See agents/clinical/main.py.

If you still see this error, check:

The container image was rebuilt after the fix (azd up)
The agent version includes the _ReconnectingMCPTool change (check image tag in az cognitiveservices agent show)

Agent Returns "ID cannot be null or empty" / status: "failed"

All agent calls fail with 400 - ID cannot be null or empty or return status: "failed" with empty output.

Cause: MCPTool definitions in HostedAgentDefinition.tools cause the agentserver-core adapter to inject a UserInfoContextMiddleware that calls /agents/{name}/tools/resolve. This API is not available in all Foundry regions (returns 404), which crashes the entire ASGI pipeline.

Fix: Ensure agents are registered with tools=[] in scripts/register_agents.py. MCP tools are handled directly by MCPStreamableHTTPTool in each agent's main.py, not via Foundry's MCPTool proxy. See the comments in scripts/register_agents.py for details.

Agents Return Empty or Error Responses

Agents connect but return {"error": "...", "tool_results": []} instead of structured output.

Cause 1 — Wrong project endpoint: AZURE_AI_PROJECT_ENDPOINT points to the bare account endpoint instead of the project endpoint.

Cause 2 — Agent not registered: The hosted agent was not successfully deployed/registered with Foundry Agent Service.

Cause 3 — Model deployment missing: The AZURE_OPENAI_DEPLOYMENT_NAME (e.g., gpt-5.4) doesn't exist in the Foundry project.

Fix:

Verify agents are registered:

python scripts/register_agents.py --list

Confirm the endpoint format in AZURE_AI_PROJECT_ENDPOINT includes /api/projects/<project>.
In the Foundry portal under Build → Deployments, confirm the gpt-5.4 deployment exists and its name matches AZURE_OPENAI_DEPLOYMENT_NAME in each agent.yaml.

"Failed to proxy" / ECONNREFUSED / "Review failed"

The frontend shows an error when submitting a review.

Cause: The backend server is not running, or the frontend is not configured to reach it.

Fix:

Ensure the backend is running:

cd backend
uvicorn app.main:app --reload

Ensure frontend/.env.local has the correct backend URL:
```
NEXT_PUBLIC_API_BASE=http://localhost:8000/api
```
Restart the frontend dev server after changing .env.local:
```
cd frontend
npm run dev
```

Agent phase fails immediately

The review fails as soon as a specialist phase starts.

Cause: One or more hosted agent endpoint URLs are missing or unreachable.

Fix: Check which dispatch mode is active:

Docker Compose (local dev): Verify all URL variables are set in backend/.env or docker-compose.yml:

HOSTED_AGENT_COMPLIANCE_URL
HOSTED_AGENT_CLINICAL_URL
HOSTED_AGENT_COVERAGE_URL
HOSTED_AGENT_SYNTHESIS_URL

Make sure docker-compose.yml is running all four agent containers and that their ports match the URLs above.

Foundry Hosted Agents (production / azd up): Verify these variables are set (injected automatically by Bicep):

AZURE_AI_PROJECT_ENDPOINT
HOSTED_AGENT_CLINICAL_NAME
HOSTED_AGENT_COMPLIANCE_NAME
HOSTED_AGENT_COVERAGE_NAME
HOSTED_AGENT_SYNTHESIS_NAME

If set manually, confirm the project endpoint format: https://<account>.services.ai.azure.com/api/projects/<project-name>

Also confirm the agents were successfully registered by scripts/register_agents.py in the postprovision hook (check azd provision output for registration errors).

You can verify all deployment health at any time with:

python scripts/check_agents.py

This checks agent registration, App Insights connectivity, MCP tool connections, backend health, and frontend availability.

Hosted-agent authentication returns 401 or 403

The backend reaches the hosted endpoint but receives an authorization failure.

Cause: The outbound auth header configuration does not match the hosted agent deployment.

Fix: Depends on the dispatch mode:

Docker Compose: Direct HTTP to containers — no auth configured by default. If containers are behind a proxy requiring auth, set HOSTED_AGENT_AUTH_HEADER, HOSTED_AGENT_AUTH_SCHEME, and HOSTED_AGENT_AUTH_TOKEN in .env.

Foundry Hosted Agents (production): Credentials come from DefaultAzureCredential. Common causes:

The backend ACA managed identity is missing the CognitiveServicesOpenAIUser role on the Foundry account — check infra/modules/role-assignments.bicep and re-run azd provision
The Foundry project managed identity is missing Cognitive Services OpenAI Contributor or Azure AI User on the Foundry account — these roles are required for hosted agent containers to call gpt-5.4 and use Agent Service data actions
The deployer user is missing the Azure AI User role on the Foundry project (required by scripts/register_agents.py to register agents) — this role is auto-assigned by az role assignment create in the postprovision hook; re-run azd up to fix
AZURE_AI_PROJECT_ENDPOINT is pointing to the wrong project or account
The agents were not successfully registered — check scripts/register_agents.py output in the postprovision hook logs

Agent Registration Fails with PermissionDenied on First Run

register_agents.py fails with:

ERROR: (PermissionDenied) The principal ... lacks the required data action
Microsoft.CognitiveServices/accounts/AIServices/agents/write

Cause: Azure RBAC propagation delay. The postprovision hook assigns the Azure AI User role immediately before running register_agents.py, but Azure's role cache can take up to several minutes to update.

Automatic handling: The hook automatically detects newly assigned roles and retries register_agents.py every 10 seconds (up to 12 attempts / ~2 minutes). You'll see "Waiting for RBAC propagation (attempt N/12)..." messages in the output — this is expected on first deployment.

If all 12 retries fail: RBAC propagation took unusually long. Simply re-run azd up — the role already exists so registration will proceed without retries.

Hosted agent returns an unexpected payload shape

The backend reaches the hosted agent, but parsing or downstream validation fails.

Expected payload (Foundry Responses API envelope):

{
  "output": [{"content": [{"text": "{\"field\": \"value\", ...}"}]}]
}

The backend uses response.output_text from the OpenAI SDK and parses the JSON result directly.

Fix: Confirm the agent container is returning the standard Foundry Responses API envelope. MAF's from_agent_framework(agent).run() produces this format automatically.

Port Stuck After Killing Server (Windows)

After killing a server process, the port remains in LISTENING state.

Cause: Windows TCP socket lingering.

Fix: Wait 2-4 minutes for the socket to clear, or use a different port.

Agent Returns Truncated/Incomplete Response

One or more agents return partial data with missing top-level keys.

Cause: The agent's response_format structured output was not fully populated by the model response, typically due to token limits or a model timeout.

Symptoms in server logs:

WARNING app.agents.orchestrator: Clinical Reviewer Agent returned incomplete result (attempt 1/2). Missing keys: clinical_extraction, clinical_summary. Retrying...
INFO app.agents.orchestrator: Clinical Reviewer Agent succeeded on retry (attempt 2/2)

A normal Clinical result has 3 expected top-level keys (diagnosis_validation, clinical_extraction, clinical_summary). Additional fields like procedure_validation, tool_results, and clinical_trials are also present but not checked by the validation gate.

Mitigations (in place):

Result validation — checks for expected top-level keys via _EXPECTED_KEYS in orchestrator.py
Automatic retry — retries once (_MAX_AGENT_RETRIES = 1) if validation fails
SSE warnings — surfaces validation warnings to the frontend

If retries consistently fail:

The agent's HOSTED_AGENT_TIMEOUT_SECONDS (default 180) may be too low — increase it in backend/.env
Check the agent container logs in Foundry portal for model errors or context overflow