Fix: Frontier cs 2.0 blackbox evaluator #117

Merged
joyemang33 merged 3 commits into main from frontier-cs-2.0-blackbox-evaluator
May 27, 2026
Conversation

@joyemang33
Contributor

This pull request introduces a new "Erdos Unit Distance Demo" problem as a small-scale, visually inspectable variant of the main erdos_unit_distance problem, and makes several improvements to the evaluation and documentation for both problems. The demo problem is designed for quick agent workflow checks and uses only 10 points. Additionally, evaluator security and robustness are improved, and the main problem's constraints are tightened for greater rigor.

Key changes:

New Erdos Unit Distance Demo Problem

  • Added a new problem, erdos_demo, consisting of a configuration for agent workflows, a problem statement with a detailed README, an evaluator script, and a baseline reference solution. [1] [2] [3] [4] [5] [6]

Evaluator Security and Robustness

  • Hardened both erdos_unit_distance and erdos_demo evaluators:
    • The evaluator source is now protected from unprivileged solution code by restricting file permissions if running as root in a container.
    • Solutions are executed as the nobody user where possible, and solution files are copied to isolated directories with controlled permissions.
    • The solution runner logic is now more robust, with improved error handling and clearer failure messages. [1] [2] [3] [4] [5] [6]
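The hardening steps above can be sketched roughly as follows. This is a minimal illustration, not the evaluator code from this PR: the function names `protect_evaluator`, `stage_solution`, and `run_solution`, the permission modes, and the timeout are all assumptions made for the example.

```python
import os
import pwd
import shutil
import stat
import subprocess
import sys
import tempfile
from pathlib import Path


def protect_evaluator(evaluator_path: Path) -> bool:
    """Restrict the evaluator source to its owner (hypothetical helper).

    Only applied when running as root, e.g. inside a container.
    Returns True if the permissions were changed.
    """
    if os.geteuid() != 0:
        return False
    os.chmod(evaluator_path, stat.S_IRUSR | stat.S_IWUSR)  # mode 0o600
    return True


def stage_solution(solution_path: Path) -> Path:
    """Copy the solution into a fresh directory with fixed permissions."""
    workdir = Path(tempfile.mkdtemp(prefix="solution_"))
    staged = workdir / solution_path.name
    shutil.copy(solution_path, staged)
    os.chmod(workdir, 0o755)  # others may enter and list, not write
    os.chmod(staged, 0o644)   # others may read, not write
    return staged


def run_solution(staged: Path, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run the staged solution, dropping to `nobody` where possible."""
    drop = {}
    if os.geteuid() == 0:
        try:
            nobody = pwd.getpwnam("nobody")
            drop = {"user": nobody.pw_uid, "group": nobody.pw_gid, "extra_groups": []}
        except KeyError:
            pass  # no `nobody` account: fall back to the current user
    return subprocess.run(
        [sys.executable, str(staged)],
        cwd=staged.parent,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        **drop,
    )
```

When the process is not root, the sketch skips both the permission change and the user switch, which mirrors the "where possible" wording above.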

Constraint and Scoring Adjustments

  • Tightened the minimum allowed separation between points for erdos_unit_distance from 1e-6 to 1e-3 and made the floating-point tolerance for unit distances stricter, both in the evaluator and documentation. [1] [2]
  • Clarified and updated scoring and validity constraint documentation for both problems. [1] [2]
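To make the tightened constraints concrete, here is a minimal sketch of a validity check and unit-distance count. Only the `1e-3` minimum separation comes from the description above. The tolerance value `UNIT_TOLERANCE`, the function name `count_unit_distances`, and the assumption that the score is the number of unit-distance pairs are illustrative and may differ from the actual evaluator.

```python
import math
from itertools import combinations

MIN_SEPARATION = 1e-3   # minimum allowed distance between points (from the PR description)
UNIT_TOLERANCE = 1e-9   # assumed value; the PR does not state the actual tolerance


def count_unit_distances(points):
    """Count pairs of points at distance 1, within UNIT_TOLERANCE.

    Raises ValueError if any two points are closer than MIN_SEPARATION.
    """
    count = 0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        d = math.hypot(x1 - x2, y1 - y2)
        if d < MIN_SEPARATION:
            raise ValueError(f"points too close: distance {d:.3e}")
        if abs(d - 1.0) <= UNIT_TOLERANCE:
            count += 1
    return count


# A unit square has four unit-length sides; its diagonals have length sqrt(2).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(count_unit_distances(square))  # 4
```

A stricter tolerance rejects configurations that only approximate unit distances numerically, and a larger minimum separation rules out near-coincident points that could otherwise be used to inflate the pair count.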

Documentation and Usability

  • Updated root and adapter READMEs to reflect the addition of the demo problem, including usage examples and new badge counts. [1] [2] [3]
  • Added a section to the main 2.0 README introducing the demo and its intended use as a quick workflow check.

References:
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14]

Summary

Please read CONTRIBUTING.md before submitting.

Type of Change

  • New research problem
  • New algorithmic problem
  • New Frontier-CS 2.0 problem
  • Bug fix
  • Documentation update
  • Other:

Testing

Checklist

  • Code follows the project structure and conventions
  • Self-review completed
  • Documentation updated (if applicable)

CI Validation (for new problems)

When adding new problems, CI will automatically validate that your reference solution achieves score > 0.

  • Algorithmic problems: Include reference.cpp in your problem directory
  • Research problems: Include reference.py (or reference.cpp if language: cpp in config.yaml)
  • 2.0 problems: Include reference.py unless the problem config declares another language

joyemang33 marked this pull request as ready for review May 27, 2026 16:25
joyemang33 merged commit a9cac5a into main May 27, 2026
joyemang33 added a commit that referenced this pull request May 27, 2026
* feat: isolate Frontier-CS 2.0 evaluator

* feat: add Erdos demo task

* fix: regenerate default Harbor tasks before trials
