HCEF-LLM (Human-Centric Evaluation Framework for Large Language Models) is a comprehensive evaluation framework designed to assess Large Language Models as "Ideal Intelligent Partners" from a human-centric perspective. This project contains an interactive web-based leaderboard and visualization platform for comparing model performance across multiple dimensions.
- IIP-F1: Precise Instruction & Constraint Comprehension
  - Meticulous interpretation of directives
  - Understanding explicit & implicit constraints
  - Key aspects:
    - Accurate interpretation of explicit directives
    - Identification of implicit constraints and assumptions
    - Scope definition and ambiguity resolution
- IIP-F2: Deep Goal & Intent Understanding
  - Discerning underlying purpose
  - Inferring user intent beyond literal requests
  - Key aspects:
    - Discernment of ultimate objectives
    - Inferring user intent beyond literal requests
    - Prioritization of multiple or conflicting goals
- IIP-F3: Capable & Adaptive Task Execution
  - Effective and efficient task execution
  - Resilience and adaptability in dynamic environments
  - Key aspects:
    - Effective task decomposition and strategic planning
    - Intelligent resource allocation and tool utilization
    - Robust exception handling and dynamic adaptation
- IIP-F4: Quality-Assured Outcome Delivery & Responsible Closure
  - High-quality deliverable production
  - Comprehensive and responsible task completion
  - Key aspects:
    - Adherence to delivery standards and formats
    - Output self-assessment and iterative refinement
The Core Capability Dimensions (CDs) are fundamental abilities required for proficient and human-aligned performance, derived from foundational aspects of intelligence. These dimensions are organized into six primary categories, each with specific sub-dimensions:
- CD1: Input Processing & Comprehension
  Essential for accurate information acquisition and deep understanding.
  - CD1.1: Multimodal Information Acquisition
    - Processing various input modalities (text, visual, auditory)
    - Initial gateway for subsequent processing
    - Comprehensive input interpretation
  - CD1.2: Contextual & Intentional Understanding
    - Deep contextual relevance discernment
    - User intent inference
    - Pragmatic nuance comprehension
    - Implicit assumption identification
- CD2: Knowledge Retention & Application
  Mechanisms for storing, retrieving, and utilizing information effectively.
  - CD2.1: Dynamic Working Memory
    - Active information maintenance
    - Context tracking capability
    - Multi-step reasoning support
    - Complex instruction processing
  - CD2.2: Accessible Long-Term Knowledge
    - Vast knowledge repository access
    - Factual information retrieval
    - Procedural knowledge application
    - Common-sense understanding
- CD3: Logical Reasoning & Problem Solving
  Essential for resolving ambiguities, planning, and solution assessment.
  - CD3.1: Analytical & Inferential Reasoning
    - Inductive reasoning for pattern recognition
    - Deductive reasoning for rule application
    - Evidence-based inference generation
    - Logical consequence derivation
  - CD3.2: Structured Problem Decomposition
    - Complex problem breakdown into sub-components
    - Component relationship analysis
    - Systematic solution development
    - Strategic execution planning
- CD4: Imaginative & Creative Cognition
  Non-standard thinking for novel ideas and abstract thought.
  - CD4.1: Predictive & Generative Foresight
    - Future state anticipation
    - Outcome prediction modeling
    - Scenario generation
    - Pattern-based completion
  - CD4.2: Novel Solution Ideation & Abstract Representation
    - Divergent thinking application
    - Analogical connection making
    - Abstract concept representation
    - Unconventional solution generation
- CD5: Human-Centricity & Ethical Alignment
  Understanding and aligning with human values and social norms.
  - CD5.1: Empathetic & Social Awareness
    - Emotional cue recognition
    - Social dynamic interpretation
    - Interpersonal nuance understanding
    - Socially intelligent interaction
  - CD5.2: Value Comprehension & Ethical Consideration
    - Human value understanding
    - Ethical principle application
    - Responsible AI framework alignment
    - Ethical situation recognition
- CD6: Output Generation & Delivery
  Production of clear, appropriate, and effective outputs.
  - CD6.1: Clear & Coherent Communication
    - Grammatical correctness
    - Logical structure formation
    - Message clarity optimization
    - Information conveyance effectiveness
  - CD6.2: Adaptive & Purposeful Expression
    - Context-appropriate style adaptation
    - Audience-specific tone adjustment
    - Tool utilization effectiveness
    - Action execution precision
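
For tooling built around this taxonomy, the six dimensions and their sub-dimensions could be encoded as a plain mapping. The following is a minimal sketch in Python, not the project's actual data model: the identifiers and titles are copied from the list above, while the names `CAPABILITY_DIMENSIONS`, `sub_dimension_ids`, and `lookup` are hypothetical.

```python
# Hypothetical encoding of the Core Capability Dimensions listed above.
# Only the IDs and titles come from the framework; the structure is an assumption.
CAPABILITY_DIMENSIONS: dict[str, dict] = {
    "CD1": {
        "name": "Input Processing & Comprehension",
        "sub_dimensions": {
            "CD1.1": "Multimodal Information Acquisition",
            "CD1.2": "Contextual & Intentional Understanding",
        },
    },
    "CD2": {
        "name": "Knowledge Retention & Application",
        "sub_dimensions": {
            "CD2.1": "Dynamic Working Memory",
            "CD2.2": "Accessible Long-Term Knowledge",
        },
    },
    "CD3": {
        "name": "Logical Reasoning & Problem Solving",
        "sub_dimensions": {
            "CD3.1": "Analytical & Inferential Reasoning",
            "CD3.2": "Structured Problem Decomposition",
        },
    },
    "CD4": {
        "name": "Imaginative & Creative Cognition",
        "sub_dimensions": {
            "CD4.1": "Predictive & Generative Foresight",
            "CD4.2": "Novel Solution Ideation & Abstract Representation",
        },
    },
    "CD5": {
        "name": "Human-Centricity & Ethical Alignment",
        "sub_dimensions": {
            "CD5.1": "Empathetic & Social Awareness",
            "CD5.2": "Value Comprehension & Ethical Consideration",
        },
    },
    "CD6": {
        "name": "Output Generation & Delivery",
        "sub_dimensions": {
            "CD6.1": "Clear & Coherent Communication",
            "CD6.2": "Adaptive & Purposeful Expression",
        },
    },
}


def sub_dimension_ids() -> list[str]:
    """Return every sub-dimension ID in declaration order."""
    return [
        sub_id
        for dimension in CAPABILITY_DIMENSIONS.values()
        for sub_id in dimension["sub_dimensions"]
    ]


def lookup(sub_id: str) -> str:
    """Resolve a sub-dimension ID such as 'CD3.2' to its title."""
    parent_id = sub_id.split(".")[0]
    return CAPABILITY_DIMENSIONS[parent_id]["sub_dimensions"][sub_id]
```

Keeping the parent ID recoverable from the sub-dimension ID (`CD3.2` belongs to `CD3`) means per-sub-dimension results can be rolled up to their dimension without a second lookup table.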
- Characteristics: Basic functionality with significant limitations
- Performance:
  - Can only handle simple, explicit tasks
  - Requires detailed guidance and structured input
  - Unstable output quality
  - Limited error handling capabilities
- Characteristics: Improving capabilities with some constraints
- Performance:
  - Can handle moderately complex tasks
  - Requires moderate guidance
  - Gradually improving output quality
  - Basic error handling capabilities
- Characteristics: Solid performance meeting most requirements
- Performance:
  - Can handle complex tasks
  - Strong autonomy
  - Stable output quality
  - Good error handling capabilities
- Characteristics: Exceptional performance exceeding expectations
- Performance:
  - Can handle highly complex tasks
  - Exceptional autonomy and creativity
  - Excellent output quality
  - Outstanding error handling and recovery capabilities
The leaderboard includes evaluation results for various state-of-the-art language models:
- OpenAI o4 mini
- OpenAI GPT 4.1
- Claude 4 Sonnet
- Gemini 2.5 Pro
- Qwen3 235B
Each model is evaluated across all six capability dimensions and assigned an overall proficiency level.
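
One way a leaderboard entry could be represented is sketched below in Python. This is an illustration under stated assumptions, not the framework's scoring method: the 0-100 scale, the unweighted mean, the cut-offs in `LEVEL_THRESHOLDS`, the labels "Level 1" through "Level 4", and the names `ModelResult` and `rank` are all hypothetical, and the example scores are placeholders rather than real results.

```python
from dataclasses import dataclass
from statistics import mean

DIMENSIONS = ("CD1", "CD2", "CD3", "CD4", "CD5", "CD6")

# Hypothetical cut-offs on an assumed 0-100 scale, checked from highest to lowest.
# The framework's real aggregation rule and level names are not stated here.
LEVEL_THRESHOLDS = ((85.0, "Level 4"), (70.0, "Level 3"), (50.0, "Level 2"))
LOWEST_LEVEL = "Level 1"


@dataclass(frozen=True)
class ModelResult:
    """Scores for one model, keyed by capability dimension ID."""

    name: str
    scores: dict[str, float]

    def __post_init__(self) -> None:
        # Every model must be scored on all six dimensions.
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing scores for: {missing}")

    def overall(self) -> float:
        """Unweighted mean across the six dimensions (an assumption)."""
        return mean(self.scores[d] for d in DIMENSIONS)

    def level(self) -> str:
        """Map the overall score to a proficiency level label."""
        score = self.overall()
        for cutoff, label in LEVEL_THRESHOLDS:
            if score >= cutoff:
                return label
        return LOWEST_LEVEL


def rank(results: list[ModelResult]) -> list[ModelResult]:
    """Order results from highest to lowest overall score."""
    return sorted(results, key=lambda r: r.overall(), reverse=True)


# Placeholder entries; these are not measured results for any real model.
example_a = ModelResult(
    "example-model-a",
    {"CD1": 80, "CD2": 75, "CD3": 70, "CD4": 65, "CD5": 90, "CD6": 85},
)
example_b = ModelResult("example-model-b", {d: 40 for d in DIMENSIONS})
```

Validating that all six dimensions are present at construction time keeps incomplete evaluations from silently skewing the overall score.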
- Pingfan
- Dolly
If you use this framework in your research, please cite it using the following BibTeX entry:
@misc{wang2025hcef,
title = {HCEF-LLM: A Human-Centric Evaluation Framework for Advancing Large Language Models as Ideal Intelligent Partners},
author = {Wang, Pingfan and Deng, Linyuan},
year = {2025},
month = {July},
publisher = {TechRxiv},
doi = {10.36227/techrxiv.175329267.71975279/v1},
url = {https://doi.org/10.36227/techrxiv.175329267.71975279/v1}
}