Popular repositories
scalable-eval-aar (Public)
Adversarial evaluation framework for detecting gaming in automated alignment research. Tests alignment methods across 4 tasks with adversarial prompts designed to catch shortcuts.
Python