possible refactor #84

Description

@brownsarahm

goals:

  • make benchtools more extensible to novel models/runners (e.g. so that a user can define their own model type without having to modify benchtools, and so that it's faster for us to add more)
  • work toward enabling agents (i.e. a loop-based model that will iterate multiple times and pass multiple responses)
  • enable passing more options to the apicall/runner (e.g. max tokens)

likely path to implement:

  • add a run* method to the bench runner object
  • move the logic out of the cases in task.run to bench runner
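A minimal sketch of that path, to make the shape concrete. Everything here is hypothetical: `BenchRunner`, its `call_model` and `options` fields, and the simplified `Task` are stand-ins for illustration, not the current benchtools classes or signatures.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BenchRunner:
    """Hypothetical runner that owns the model-specific call logic."""

    # assumed callable: (prompt, **options) -> response text
    call_model: Callable[..., str]
    # extra options forwarded to the call, e.g. {"max_tokens": 256}
    options: dict = field(default_factory=dict)

    def run(self, prompt: str) -> str:
        # the logic that currently sits in the cases of task.run would move here
        return self.call_model(prompt, **self.options)


@dataclass
class Task:
    """Simplified stand-in for a benchtools task (not its real signature)."""

    name: str
    prompts: list

    def run(self, runner: BenchRunner) -> list:
        # task.run only iterates prompts; the runner decides how to call the model
        return [runner.run(prompt) for prompt in self.prompts]


def echo_model(prompt: str, max_tokens: int = 16) -> str:
    # toy model for illustration: truncates the prompt to max_tokens characters
    return prompt[:max_tokens]


task = Task(name="demo", prompts=["what is 2 + 2?", "name a color"])
runner = BenchRunner(call_model=echo_model, options={"max_tokens": 10})
responses = task.run(runner)
```

With this shape, a user-defined model type would only need to supply a callable (or a runner subclass), and options such as max tokens travel through `options` rather than through task.run.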

recommendation:
start by creating the bench runner object's run* method and moving the case logic out of task.run into it. Begin with task.run calling that new method, and see how well that works and how easy it is to explain. Then consider whether the following ideas might make it easier to modify.

*this method might be named something else and even the object name might be changed

ideas to consider:

  • possibly task.run yields responses to benchmark (or other call mechanism) instead of doing the logging in here
  • possibly benchmark calls the logger; this could be more modular and might enable logging more benchmark details with responses without passing benchmark info into the task in weird ways.
  • should runner object be passed to the task or the task to the runner?
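One way the first two ideas could fit together, as a rough sketch only. The `Task` and `Benchmark` classes, the record dict layout, and the `log` callable are all assumptions made up for this example, not existing benchtools code.

```python
from typing import Callable, Iterator


class Task:
    """Hypothetical task whose run() yields records instead of logging them."""

    def __init__(self, name: str, prompts: list):
        self.name = name
        self.prompts = prompts

    def run(self, call_model: Callable[[str], str]) -> Iterator[dict]:
        for prompt in self.prompts:
            # no logging here: the caller decides what to do with each record
            yield {"task": self.name, "prompt": prompt, "response": call_model(prompt)}


class Benchmark:
    """Hypothetical benchmark that owns the logger and adds its own details."""

    def __init__(self, name: str, tasks: list, log: Callable[[dict], None]):
        self.name = name
        self.tasks = tasks
        self.log = log  # any callable that accepts a record dict

    def run(self, call_model: Callable[[str], str]) -> None:
        for task in self.tasks:
            for record in task.run(call_model):
                # benchmark info is attached here, not passed into the task
                self.log({"benchmark": self.name, **record})


# a task run on its own, with no benchmark and no log files
standalone = list(Task("t0", ["a"]).run(str.upper))

# the same kind of task run through a benchmark, logging to an in-memory list
records = []
Benchmark("demo", [Task("t1", ["a", "b"])], records.append).run(str.upper)
```

In this arrangement the task never sees the benchmark, so it can still be consumed directly for testing or for running a subset.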

requirements:

  • tasks can be run without a benchmark (e.g. for testing or for running a subset of a benchmark); task.run does not necessarily need to generate log files

context:

  • i think this would help support agents, because we can set the spec so that the agent class yields the interim responses, which then get logged; that would be a way for benchtools to not need to know the loop. The benchmark or runner would need to store that it is
  • see how the custom scorers/custom responses are implemented. what would it take to support a user-provided model interface?
  • see improve log structure #71
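A sketch of the agent idea from the first context point, under stated assumptions: `LoopAgent`, `SingleCallModel`, `collect`, and the stop-token convention are invented for illustration. The only shared contract is that `run(prompt)` yields one or more responses, so the caller does not need to know whether a loop is involved.

```python
from typing import Callable, Iterator


class LoopAgent:
    """Hypothetical agent: calls step_fn repeatedly and yields each interim response."""

    def __init__(self, step_fn: Callable[[str], str], max_steps: int = 3,
                 stop_token: str = "DONE"):
        self.step_fn = step_fn
        self.max_steps = max_steps
        self.stop_token = stop_token

    def run(self, prompt: str) -> Iterator[str]:
        context = prompt
        for _ in range(self.max_steps):
            response = self.step_fn(context)
            yield response  # the caller can log every interim response
            if self.stop_token in response:
                break
            # feed the response back in for the next iteration
            context = context + "\n" + response


class SingleCallModel:
    """Hypothetical one-shot model exposing the same yielding interface."""

    def __init__(self, call_model: Callable[[str], str]):
        self.call_model = call_model

    def run(self, prompt: str) -> Iterator[str]:
        yield self.call_model(prompt)


def collect(model, prompt: str) -> list:
    # stand-in for the benchmark/logger side: it only iterates over responses
    return [{"step": i, "response": r} for i, r in enumerate(model.run(prompt))]


def toy_step(context: str) -> str:
    # toy agent step: finishes once two earlier responses are in the context
    return "DONE" if context.count("\n") >= 2 else "thinking"


agent_log = collect(LoopAgent(toy_step, max_steps=5), "solve")
single_log = collect(SingleCallModel(str.upper), "solve")
```

A user-provided model interface could follow the same contract: any object with a yielding `run` would be accepted, whether it makes one call or many.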
