goals:
make benchtools more extensible to novel models/runners (e.g. so that a user can define their own model type without having to modify benchtools, and so that it's faster for us to add more)
work toward enabling agents (i.e. a loop-based model that iterates multiple times and passes back multiple responses)
enable passing more options to the API call/runner (e.g. max tokens)
likely path to implement:
add a run* method to the bench runner object
move the logic out of the cases in task.run to bench runner
recommendation:
start by creating the bench runner object's run* method and moving the case logic out of task.run into it. Initially, have task.run call that new method, and see how that works and how easy it is to explain. Then consider whether the following ideas might make it easier to modify.
*this method might be named something else and even the object name might be changed
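A minimal sketch of that first step, under stated assumptions: `BenchRunner`, its `run` method, the `options` dict, and `echo_model` are all hypothetical names for illustration, not the current benchtools API. The point is only that `task.run` delegates to the runner, and that options such as max tokens travel with the runner rather than with the task.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BenchRunner:
    """Hypothetical runner: owns the model call and its options."""
    call_model: Callable[..., str]  # any callable taking (prompt, **options)
    options: dict[str, Any] = field(default_factory=dict)  # e.g. {"max_tokens": 256}

    def run(self, prompt: str) -> str:
        # The per-model-type cases now in task.run would move into this method.
        return self.call_model(prompt, **self.options)


@dataclass
class Task:
    """Hypothetical stand-in for a benchtools task."""
    prompt: str

    def run(self, runner: BenchRunner) -> str:
        # task.run only delegates; it no longer needs to know the model type.
        return runner.run(self.prompt)


def echo_model(prompt: str, max_tokens: int = 100) -> str:
    # Toy model for illustration: returns the first max_tokens words of the prompt.
    return " ".join(prompt.split()[:max_tokens])


runner = BenchRunner(call_model=echo_model, options={"max_tokens": 2})
result = Task(prompt="what is two plus two").run(runner)
```

With this shape, a user could supply their own callable as `call_model` without touching benchtools, which is the first goal above.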
ideas to consider:
possibly task.run yields responses to the benchmark (or another calling mechanism) instead of doing the logging itself
possibly the benchmark calls the logger; this could be more modular and might enable logging more benchmark details with responses without passing benchmark info into the task in weird ways.
should the runner object be passed to the task, or the task to the runner?
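One way the first two ideas could fit together, as a rough sketch: `task.run` becomes a generator and the benchmark decides what to do with each record. The class shapes, the record dict, and the `log` callable here are assumptions made for illustration, not existing benchtools code.

```python
from typing import Callable, Iterator


class Task:
    """Hypothetical task that yields records instead of writing logs."""

    def __init__(self, name: str, prompts: list[str]):
        self.name = name
        self.prompts = prompts

    def run(self, model: Callable[[str], str]) -> Iterator[dict]:
        for prompt in self.prompts:
            # No logging here: the caller decides what happens to each record.
            yield {"task": self.name, "prompt": prompt, "response": model(prompt)}


class Benchmark:
    """Hypothetical benchmark that owns the logger."""

    def __init__(self, name: str, tasks: list[Task], log: Callable[[dict], None]):
        self.name = name
        self.tasks = tasks
        self.log = log

    def run(self, model: Callable[[str], str]) -> None:
        for task in self.tasks:
            for record in task.run(model):
                # Benchmark details are attached here, not passed into the task.
                self.log({"benchmark": self.name, **record})


def upper_model(prompt: str) -> str:
    # Toy model for illustration only.
    return prompt.upper()


# A task run on its own, with no benchmark and no log files.
standalone = list(Task("greet", ["hi"]).run(upper_model))

# The same kind of task run through a benchmark that logs to a list.
logged: list[dict] = []
Benchmark("demo", [Task("greet", ["hi", "yo"])], logged.append).run(upper_model)
```

Because the task never touches the logger, it can be run without a benchmark for testing or for a subset run.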
requirements:
tasks can be run without a benchmark (e.g. for testing or for running a subset of a benchmark); task.run does not necessarily need to generate log files
context:
I think this would help with supporting agents, because I think we can set the spec so that the agent class yields the interim responses, which then get logged; that would be a way for benchtools to not need to know the loop. The benchmark or runner would need to store that it is
see how the custom scorers/custom responses are implemented. What would it take to support a user-provided model interface?
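To make the agent idea concrete, here is a sketch of what a user-provided model interface might look like if every model, single-shot or looping, yields its responses. `ModelInterface`, `respond`, and both example classes are hypothetical and are not taken from benchtools.

```python
from typing import Iterator, Protocol


class ModelInterface(Protocol):
    """Hypothetical contract a user-provided model would satisfy."""

    def respond(self, prompt: str) -> Iterator[str]: ...


class SingleShotModel:
    """An ordinary model: yields exactly one response."""

    def respond(self, prompt: str) -> Iterator[str]:
        yield f"answer to: {prompt}"


class RevisingAgent:
    """Toy loop-based agent: yields one interim response per iteration."""

    def __init__(self, steps: int):
        self.steps = steps

    def respond(self, prompt: str) -> Iterator[str]:
        draft = prompt
        for step in range(1, self.steps + 1):
            draft = f"{draft} [revision {step}]"
            yield draft  # interim response; the caller can log each one


def collect(model: ModelInterface, prompt: str) -> list[str]:
    # The caller only iterates; it never needs to know how the loop works.
    return list(model.respond(prompt))


single = collect(SingleShotModel(), "q")
interim = collect(RevisingAgent(steps=2), "q")
```

Under this assumption the caller treats both cases the same way, so adding an agent would not require a new case in the runner.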