Merged
17 changes: 0 additions & 17 deletions .github/CODEOWNERS

This file was deleted.

160 changes: 79 additions & 81 deletions skills/cufolio/BENCHMARK.md
@@ -1,81 +1,79 @@
# cufolio — Skill Evaluation Benchmark

<!--
SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

How the `cufolio` skill was evaluated, and the measured uplift it provides over an
agent reasoning from scratch. Required for catalog publication.

> Status: methodology is final; result cells marked _TBD_ are filled from a GPU run
> (see "Reproducing" below). The numbers must be regenerated whenever SKILL.md or the
> `cufolio` product changes materially.

## Setup

| | |
|---|---|
| Skill | `cufolio` (instruction-only; drives the installed `cufolio` package) |
| Agents | Claude Code **and** Codex (evaluate both per the publishing guide) |
| Model(s) | _TBD_ (record exact model + version) |
| Harness | NV-BASE (NV-ACES / Harbor) |
| Dataset | [`evals/evals.json`](evals/evals.json) — 5 positive + 4 negative cases |
| Hardware | NVIDIA GPU (cuOpt + cuML); record GPU model |
| Data | S&P 500 daily prices via `cufolio.utils.download_data` |

## Metrics

NV-BASE emits five evaluators that roll up into the five NVIDIA dimensions:

| Evaluator | Kind | Dimension |
|---|---|---|
| `skill_execution` | deterministic | Correctness |
| `skill_efficiency` | deterministic | Efficiency |
| `accuracy` | LLM judge (5-criterion) | Correctness |
| `goal_accuracy` | full-conversation judge | Effectiveness |
| `behavior_check` | per-step YES/NO | Effectiveness |
| (scan: prompt-injection/secrets/PII) | NV-CARPS | Security |
| (trigger on positive / silence on negative) | discoverability | Discoverability |

## Track A — Agent uplift (with vs. without the skill)

Each task is run with the skill installed and again with it removed (baseline).

| Metric | Without skill | With skill |
|---|---|---|
| Positive tasks completed (goal_accuracy) | _TBD_ | _TBD_ |
| Behavior steps passed (behavior_check) | _TBD_ | _TBD_ |
| Trigger accuracy — fires on the 5 positives | _TBD_ | _TBD_ |
| Trigger accuracy — silent on the 4 negatives | _TBD_ | _TBD_ |
| Avg tokens / task | _TBD_ | _TBD_ |
| Avg wall-clock / task | _TBD_ | _TBD_ |

Expected qualitative uplift (what the skill encodes that a baseline agent misses):
forcing `c_max=0.0` to avoid the all-cash optimum (Trap 2), passing
`show_discretized_portfolios=False` (Trap 4), using the manual loop only when weights
are needed (Trap 3), and always solving on GPU with cuOpt (`SOLVER_SETTINGS`).

## Track B — Skill performance standards (Layer 3)

Deterministic end-to-end runs of the documented workflows, graded against
[`tests/benchmarks/thresholds.toml`](../../tests/benchmarks/thresholds.toml). Source: `tests/test_skill_benchmarks.py`.

| Workflow | Standard | Result |
|---|---|---|
| build-optimal | non-degenerate (not all-cash), sum(w)≈1, cuOpt, < 60s | _TBD_ |
| efficient-frontier | 25 points, return monotonic in CVaR, no `sum_to_one` crash | _TBD_ |
| weights-table | per-asset weight columns present | _TBD_ |
| backtest | optimized Sharpe > equal-weight Sharpe | _TBD_ |
| rebalance | ≥1 rebalance date, cumulative value series produced | _TBD_ |
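
Two of the standards above (a non-degenerate allocation with sum(w)≈1, and return monotonic in CVaR) can be sketched as plain checks. This is a minimal illustration, not the code in `tests/test_skill_benchmarks.py`: the function names, the tolerances, the treatment of cash as a separate value, and the sample tickers are all assumptions, and the real limits live in `thresholds.toml`.

```python
import math


def is_non_degenerate(asset_weights: dict[str, float], cash: float,
                      tol: float = 1e-6) -> bool:
    """True when the budget sums to ~1 and is not held entirely in cash.

    Hypothetical helper: treating cash separately from the asset weights
    is an assumption made for this sketch.
    """
    total = sum(asset_weights.values()) + cash
    fully_allocated = math.isclose(total, 1.0, abs_tol=1e-4)
    holds_assets = any(w > tol for w in asset_weights.values())
    return fully_allocated and holds_assets


def return_monotonic_in_cvar(points: list[tuple[float, float]],
                             tol: float = 1e-9) -> bool:
    """True when expected return never decreases as CVaR increases.

    `points` holds (cvar, expected_return) pairs, one per frontier point.
    """
    ordered = sorted(points, key=lambda p: p[0])
    return all(b[1] >= a[1] - tol for a, b in zip(ordered, ordered[1:]))


# Illustrative values only; these are not measured results.
print(is_non_degenerate({"AAPL": 0.6, "MSFT": 0.4}, cash=0.0))
print(return_monotonic_in_cvar([(0.020, 0.001), (0.025, 0.002), (0.030, 0.004)]))
```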

## Reproducing

```bash
# Track B (no API key; needs GPU). Prints a metrics table + PASS/FAIL:
uv run pytest -m gpu tests/test_skill_benchmarks.py -v
uv run python tests/benchmarks/benchmark_workflows.py --check

# Track A (needs NVIDIA_INFERENCE_KEY + GPU), per evals/EVAL.md:
nv-base validate --external skills/cufolio
```
# Evaluation Report

Evaluation of the `cufolio` skill before publication through NVSkills-Eval.

This benchmark summarizes the 3-tier NVSkills-Eval results for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use.

## Evaluation Summary

- Skill: `cufolio`
- Evaluation date: 2026-06-11
- NVSkills-Eval profile: `external`
- Environment: `astra-sandbox`
- Dataset: 4 evaluation tasks
- Attempts per task: 1
- Pass threshold: 50%
- Overall verdict: PASS

## Agents Used

- `claude-code`
- `codex`

## Metrics Used

Reported benchmark dimensions:

- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
- Correctness: checks whether the agent follows the expected workflow and produces the correct final output.
- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
- Effectiveness: checks whether the agent performs measurably better with the skill than without it.
- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work.

Underlying evaluation signals used in this run:

- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.

## Test Tasks

The benchmark dataset contained 4 evaluation tasks:

- Positive tasks: 2 tasks where the skill was expected to activate.
- Negative tasks: 2 tasks where no skill was expected.
- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.

Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases.
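
The labeling rule described above can be sketched in a few lines. This is an illustration rather than NVSkills-Eval's own code: the helper names are hypothetical, and treating a missing `expected_skill` key as "unlabeled" is an assumption about how uninferable cases are detected.

```python
from collections import Counter


def classify_case(case: dict) -> str:
    """Label one eval case from its `expected_skill` field."""
    if "expected_skill" not in case:
        return "unlabeled"  # assumption: a missing key means intent is unknown
    return "negative" if case["expected_skill"] is None else "positive"


def task_composition(cases: list[dict]) -> dict[str, int]:
    """Count positive, negative, and unlabeled cases in a dataset."""
    counts = Counter(classify_case(c) for c in cases)
    return {k: counts.get(k, 0) for k in ("positive", "negative", "unlabeled")}


# Inline sample mirroring the ids of the 4-case gate set.
sample_cases = [
    {"id": "build-optimal-cvar", "expected_skill": "cufolio"},
    {"id": "efficient-frontier-plot", "expected_skill": "cufolio"},
    {"id": "neg-vehicle-routing", "expected_skill": None},
    {"id": "neg-nn-price-forecast", "expected_skill": None},
]
print(task_composition(sample_cases))
# prints {'positive': 2, 'negative': 2, 'unlabeled': 0}
```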

## Results

| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | 4 | 100% (+0%) | 100% (+0%) |
| Correctness | 4 | 76% (+26%) | 78% (+14%) |
| Discoverability | 4 | 93% (+27%) | 87% (+15%) |
| Effectiveness | 4 | 46% (+20%) | 44% (+3%) |
| Efficiency | 4 | 88% (+29%) | 75% (+16%) |

Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
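
To read a cell such as `76% (+26%)`, the sketch below recovers the implied no-skill baseline. It assumes the uplift is an absolute difference in percentage points (score minus baseline); the report does not state this explicitly, so treat the derived baselines as an interpretation. The function names are hypothetical.

```python
import re

# Matches cells like "76% (+26%)" or "100% (+0%)".
_CELL = re.compile(r"^\s*(\d+)%\s*\(([+-]\d+)%\)\s*$")


def parse_cell(cell: str) -> tuple[int, int]:
    """Split a results cell into (score, uplift), both in percent."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized results cell: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def implied_baseline(cell: str) -> int:
    """Baseline score, ASSUMING uplift = score - baseline in percentage points."""
    score, uplift = parse_cell(cell)
    return score - uplift


print(implied_baseline("76% (+26%)"))  # prints 50
print(implied_baseline("46% (+20%)"))  # prints 26
```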

## Tier 1: Static Validation Summary

Tier 1 validation passed. NVSkills-Eval ran 1 check and found 0 total findings.

Notable observations:

- SCHEMA: Found skill manifest: SKILL.md

## Tier 2: Deduplication Summary

This tier was not run or did not produce findings in this report.

## Publication Recommendation

The skill is suitable to proceed toward NVSkills-Eval publication based on this benchmark. Skill owners should keep this file with the skill and refresh it when the evaluation dataset, skill behavior, or target agents materially change.
1 change: 0 additions & 1 deletion skills/cufolio/SKILL.md
@@ -1,6 +1,5 @@
---
name: cufolio
version: "25.10.00"
description: Use when a user asks to build, optimize, backtest, rebalance, or analyze a stock portfolio with Mean-CVaR, efficient frontiers, scenario generation, or NVIDIA cuOpt.
license: Apache-2.0
metadata:
14 changes: 13 additions & 1 deletion skills/cufolio/evals/EVAL.md
@@ -19,6 +19,15 @@ described in `tests/benchmarks/benchmark_workflows.py` / `tests/benchmarks/thres

## Dataset

There are two datasets, same schema:

- `evals.json` — the **CI publish-gate set (P0, 4 cases)**: 2 positives
(`build-optimal-cvar`, `efficient-frontier-plot`) + 2 strong negatives
(`neg-vehicle-routing`, `neg-nn-price-forecast`). Sized to finish inside the
~1h NV-CARPS CI cap (see Notes).
- `evals-full.json` — the **full set (9 cases)**: all positives and negatives,
run on the nightly/manual job (longer timeout) for the published catalog benchmark.

`evals.json` follows the NV-BASE / agentskills.io eval format. Each case has:

- `id` — unique identifier
@@ -58,7 +67,10 @@ Discoverability, Effectiveness, Efficiency). Paste/auto-fill the results into `.
## Notes

- Keep this CI-gated set small (P0). NV-CARPS CI runners support evals up to ~1 hour, and the
positive cases each run a full GPU solve.
positive cases each run a full GPU solve. The publish gate runs `evals.json` (4 cases); the
full `evals-full.json` (9 cases) is for the longer nightly/manual run. With the default
`claude-code,codex` × 2 attempts × with/without arms (~8 pods/case), the full set overran the
cap — the gate set keeps the pod count low enough to finish.
- The positive cases download S&P 500 prices on first run. If a sandboxed runner has no network,
use the guide's `evals/files/` mechanism to stage a small price CSV (not shipped here — the
eval host is expected to install `cufolio` and have network/data access).
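
The pod arithmetic in the first note can be made explicit. The sketch below multiplies agents, attempts, and with/without arms, which reproduces the ~8 pods/case figure quoted above. The helper names are hypothetical, and the totals only count pods; they say nothing about per-pod runtime or how many pods run in parallel.

```python
def pods_per_case(agents: list[str], attempts: int, arms: int = 2) -> int:
    """Pods launched for one eval case: agents x attempts x arms.

    `arms` defaults to 2 for the with-skill and without-skill runs.
    """
    return len(agents) * attempts * arms


def total_pods(num_cases: int, agents: list[str], attempts: int) -> int:
    """Pods launched for a whole dataset."""
    return num_cases * pods_per_case(agents, attempts)


agents = ["claude-code", "codex"]
print(pods_per_case(agents, attempts=2))   # prints 8
print(total_pods(4, agents, attempts=2))   # gate set: prints 32
print(total_pods(9, agents, attempts=2))   # full set: prints 72
```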
127 changes: 127 additions & 0 deletions skills/cufolio/evals/evals-full.json
@@ -0,0 +1,127 @@
[
{
"id": "build-optimal-cvar",
"question": "Using the cufolio package, build the optimal Mean-CVaR portfolio from the S&P 500 dataset and show me the allocation, expected return, and CVaR.",
"expected_skill": "cufolio",
"expected_script": null,
"should_trigger": true,
"ground_truth": "The agent returns a non-degenerate long-only allocation across multiple S&P 500 names (not 100% cash), solved on GPU with cuOpt, and reports per-asset weights summing to ~1 along with the expected daily return (roughly 0.1%-0.4%) and the CVaR (roughly 0.02-0.03 at 0.95 confidence).",
"expected_behavior": [
"The agent uses the installed cufolio package API (imports from cufolio and calls its functions), not a from-scratch reimplementation.",
"The agent ensures the price data exists, downloading it with cufolio.utils.download_data when data/stock_data/sp500.csv is missing.",
"The agent computes returns with calculate_returns (LOG) and generates KDE scenarios on GPU with generate_cvar_data.",
"The agent sets CvarParameters with w_min=0.0, w_max=1.0 and c_max=0.0 so the portfolio is fully invested and not a degenerate all-cash result.",
"The agent solves with the cuOpt SOLVER_SETTINGS (cp.CUOPT, solver_method PDLP) and never falls back to a CPU solver.",
"The agent's final answer reports a diversified allocation with its expected return and CVaR.",
"The agent does not leak secrets, run destructive commands, or access resources outside the workspace."
]
},
{
"id": "efficient-frontier-plot",
"question": "Plot the efficient frontier for the S&P 500 universe using cufolio.",
"expected_skill": "cufolio",
"expected_script": null,
"should_trigger": true,
"ground_truth": "The agent produces an efficient-frontier plot plus a metrics table across about 25 risk-aversion levels in which expected return is non-decreasing as CVaR increases, from a single create_efficient_frontier call.",
"expected_behavior": [
"The agent uses the installed cufolio package API (imports from cufolio and calls its functions), not a from-scratch reimplementation.",
"The agent calls create_efficient_frontier with ra_num around 25 and the cuOpt SOLVER_SETTINGS.",
"The agent uses the returned (results_df, fig, ax) for the plot and metrics.",
"The agent's final answer presents the frontier and confirms return rises with CVaR.",
"The agent does not leak secrets, run destructive commands, or access resources outside the workspace."
]
},
{
"id": "efficient-frontier-weights-table",
"question": "Give me a table of per-asset portfolio weights across a range of risk-aversion levels using cufolio.",
"expected_skill": "cufolio",
"expected_script": null,
"should_trigger": true,
"ground_truth": "The agent produces a table with one row per risk-aversion level and per-asset weight columns (plus cash), obtained by expanding the 'weights' column that create_efficient_frontier returns in results_df.",
"expected_behavior": [
"The agent uses the installed cufolio package API (imports from cufolio and calls its functions), not a from-scratch reimplementation.",
"The agent calls create_efficient_frontier (cuOpt SOLVER_SETTINGS) across a range of risk-aversion levels.",
"The agent expands the results_df 'weights' column into a per-asset table with one row per risk-aversion level (plus cash).",
"The agent does not leak secrets, run destructive commands, or access resources outside the workspace."
]
},
{
"id": "backtest-vs-benchmarks",
"question": "Backtest the optimal cufolio portfolio against some benchmark portfolios and report the risk-adjusted performance.",
"expected_skill": "cufolio",
"expected_script": null,
"should_trigger": true,
"ground_truth": "The agent runs a historical backtest of the optimized portfolio against benchmark portfolios and reports cumulative return, Sharpe, Sortino, and max drawdown, with the optimized portfolio achieving a higher Sharpe than a naive equal-weight benchmark.",
"expected_behavior": [
"The agent uses the installed cufolio package API (imports from cufolio and calls its functions), not a from-scratch reimplementation.",
"The agent first builds an optimal portfolio with the standard GPU CVaR workflow.",
"The agent runs portfolio_backtester / backtest_against_benchmarks with test_method='historical' against benchmark portfolios.",
"The agent's final answer reports Sharpe, Sortino, and max drawdown and shows the optimized portfolio beating the naive benchmark on Sharpe.",
"The agent does not leak secrets, run destructive commands, or access resources outside the workspace."
]
},
{
"id": "rebalance-monthly",
"question": "Set up a monthly rebalancing strategy with cufolio and backtest it with transaction costs.",
"expected_skill": "cufolio",
"expected_script": null,
"should_trigger": true,
"ground_truth": "The agent sets up a monthly rebalancing backtest with rebalance_portfolio and re_optimize using re_optimize_criteria of type drift_from_optimal with threshold 0, applies transaction costs, and reports the results table, the rebalance dates, and the cumulative portfolio value series.",
"expected_behavior": [
"The agent uses the installed cufolio package API (imports from cufolio and calls its functions), not a from-scratch reimplementation.",
"The agent uses rebalance_portfolio with re_optimize_criteria={'type': 'drift_from_optimal', 'threshold': 0, 'norm': 1} for a fixed monthly schedule rather than an integer trigger code.",
"The agent calls re_optimize with a transaction_cost_factor and a plot_title reflecting monthly rebalancing.",
"The agent solves each re-optimization with the cuOpt SOLVER_SETTINGS.",
"The agent's final answer reports the results table, the rebalance dates, and the cumulative portfolio value.",
"The agent does not leak secrets, run destructive commands, or access resources outside the workspace."
]
},
{
"id": "neg-vehicle-routing",
"question": "I have 12 delivery trucks and 300 stops. Solve the vehicle routing problem to minimize total distance.",
"expected_skill": null,
"expected_script": null,
"should_trigger": false,
"ground_truth": "The agent helps model and solve the vehicle routing problem (for example with a routing/VRP optimizer such as NVIDIA cuOpt's routing API), minimizing total distance across the 12 trucks and 300 stops.",
"expected_behavior": [
"The agent does not read or activate the cufolio skill.",
"The agent handles the request as a vehicle routing / VRP problem using an appropriate routing optimizer or general knowledge."
]
},
{
"id": "neg-reverse-linked-list",
"question": "Write a Python function to reverse a singly linked list in place.",
"expected_skill": null,
"expected_script": null,
"should_trigger": false,
"ground_truth": "The agent writes a correct Python function that reverses a singly linked list in place and briefly explains the pointer manipulation.",
"expected_behavior": [
"The agent does not read or activate the cufolio skill.",
"The agent answers using general data-structures coding knowledge."
]
},
{
"id": "neg-summarize-earnings",
"question": "Summarize the key risks and guidance from this company's latest quarterly earnings report.",
"expected_skill": null,
"expected_script": null,
"should_trigger": false,
"ground_truth": "The agent summarizes the key risks and forward guidance from the earnings report in clear prose.",
"expected_behavior": [
"The agent does not read or activate the cufolio skill.",
"The agent handles the request as document summarization using general knowledge or a summarization skill."
]
},
{
"id": "neg-nn-price-forecast",
"question": "Train a neural network on GPU to forecast next-week stock prices for these tickers.",
"expected_skill": null,
"expected_script": null,
"should_trigger": false,
"ground_truth": "The agent helps design and train a neural-network time-series model to forecast next-week prices (data preparation, model, training loop, evaluation) using general ML knowledge or an appropriate ML skill.",
"expected_behavior": [
"The agent does not read or activate the cufolio skill.",
"The agent treats the request as a time-series / ML forecasting task distinct from Mean-CVaR portfolio optimization."
]
}
]
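
A dataset like the one above can be sanity-checked before it is handed to the harness. The sketch below is a hypothetical validator, not part of NV-BASE: the required-key set is taken from the fields visible in the cases above, and the rule that `should_trigger` must agree with `expected_skill` is an assumption inferred from those cases.

```python
REQUIRED_KEYS = {
    "id", "question", "expected_skill", "expected_script",
    "should_trigger", "ground_truth", "expected_behavior",
}


def validate_cases(cases: list[dict]) -> list[str]:
    """Return a list of human-readable problems; empty means the set looks sane."""
    errors: list[str] = []
    seen_ids: set[str] = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<index {index}>")
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            errors.append(f"{label}: missing keys {sorted(missing)}")
            continue
        if label in seen_ids:
            errors.append(f"{label}: duplicate id")
        seen_ids.add(label)
        # Assumed rule: a case should trigger exactly when a skill is expected.
        if case["should_trigger"] != (case["expected_skill"] is not None):
            errors.append(f"{label}: should_trigger disagrees with expected_skill")
        if not case["expected_behavior"]:
            errors.append(f"{label}: expected_behavior is empty")
    return errors


# Inline example; in practice the list would come from json.load on the dataset file.
example = [{
    "id": "neg-reverse-linked-list",
    "question": "Write a Python function to reverse a singly linked list in place.",
    "expected_skill": None,
    "expected_script": None,
    "should_trigger": False,
    "ground_truth": "A correct in-place reversal function.",
    "expected_behavior": ["The agent does not read or activate the cufolio skill."],
}]
print(validate_cases(example))  # prints []
```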