A Chinese-language benchmark of high-pressure, complex tasks. Its main purpose is to test whether a model would cause costly mistakes in real-world work.
benchmark decision-making stress-testing ai-safety reasoning ai-evaluation llm-evaluation llm-arena llm-benchmark chinese-benchmark
Updated May 9, 2026 - HTML