[Research] Add prompt and infrastructure failure clustering to agent-analysis #19

@neubig


Summary

We need a repeatable way to group failed conversations into prompt, workflow, and infrastructure buckets before proposing fixes. OpenHands/agent-analysis is the right place for that clustering and reporting logic.

Target repo

  • OpenHands/agent-analysis

Dependencies

  • Merge after critic-scored failure datasets are available.

Scope

Add clustering and reporting code that turns failed conversations into named failure classes and emits actionable recommendations. Do not edit OpenHands prompt files in this issue.

Files to update

  • README.md
  • analysis/__main__.py
  • analysis/performance_gap.py
  • analysis/usage.py
  • New module under analysis/ for failure clustering
  • New tests under tests/

Acceptance criteria

  • Failed conversations are grouped into named failure classes.
  • The output separates prompt, workflow, and infrastructure failures.
  • Each cluster includes representative conversation IDs and suggested target files.
  • The report is specific enough to drive follow-up issues in OpenHands/OpenHands or other public repos.
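
To make the criteria above concrete, here is a minimal sketch of what the grouping and report could look like. It is only an illustration, not a proposed design: the `FailureCluster` type, the `RULES` table, the `cluster_failures` and `render_report` functions, and the `id` / `failure_summary` input fields are all hypothetical names, and plain keyword matching stands in for whatever the critic-scored datasets end up supporting. The only target file listed is taken from the References section of this issue; the others are left empty rather than guessed.

```python
from dataclasses import dataclass, field


@dataclass
class FailureCluster:
    name: str                       # named failure class
    bucket: str                     # "prompt", "workflow", or "infrastructure"
    conversation_ids: list = field(default_factory=list)
    target_files: tuple = ()        # suggested files for a follow-up fix


# Hypothetical rules: (class name, bucket, keyword, suggested target files).
# Keyword matching is a placeholder for a real clustering method.
RULES = [
    ("ignored_instruction", "prompt", "ignored instruction",
     ("OpenHands/OpenHands/openhands/agenthub/codeact_agent/prompts/system_prompt.j2",)),
    ("stuck_in_loop", "workflow", "repeated action", ()),
    ("runtime_timeout", "infrastructure", "timed out", ()),
]


def cluster_failures(conversations):
    """Group failed conversations into named clusters by first matching rule."""
    clusters = {}
    for conv in conversations:
        text = conv["failure_summary"].lower()
        for name, bucket, keyword, files in RULES:
            if keyword in text:
                break
        else:
            # No rule matched: keep the conversation visible instead of dropping it.
            name, bucket, files = "unclassified", "unknown", ()
        cluster = clusters.setdefault(
            name, FailureCluster(name, bucket, target_files=files)
        )
        cluster.conversation_ids.append(conv["id"])
    return clusters


def render_report(clusters, max_representatives=3):
    """Render a plain-text report with one section per bucket."""
    lines = []
    for bucket in ("prompt", "workflow", "infrastructure", "unknown"):
        members = [c for c in clusters.values() if c.bucket == bucket]
        if not members:
            continue
        lines.append(f"{bucket}:")
        # Largest clusters first within each bucket.
        for c in sorted(members, key=lambda c: -len(c.conversation_ids)):
            reps = ", ".join(c.conversation_ids[:max_representatives])
            files = ", ".join(c.target_files) or "(none suggested)"
            lines.append(f"  {c.name}: {len(c.conversation_ids)} failure(s)")
            lines.append(f"    representative conversations: {reps}")
            lines.append(f"    suggested target files: {files}")
    return "\n".join(lines)
```

Calling `render_report(cluster_failures(conversations))` on a list of dicts with `id` and `failure_summary` keys yields one section per bucket, which maps directly onto the "separates prompt, workflow, and infrastructure failures" criterion. Keeping an explicit `unclassified` cluster makes gaps in the rules visible in the report.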

References

  • OpenHands/OpenHands/openhands/agenthub/codeact_agent/prompts/system_prompt.j2
  • OpenHands/OpenHands/openhands/utils/prompt.py

This issue was drafted by an AI assistant (OpenHands) on behalf of the user.
