
Expand evals to 25 and improve SKILL.md #22

Merged
CybotTM merged 1 commit into main from feature/evals-and-improvements
Apr 1, 2026

CybotTM commented Apr 1, 2026

Summary

  • Expanded eval suite from 2 to 25 tests covering all skill capabilities
  • Improved SKILL.md with better trigger categorization, inline preferred-tools table, troubleshooting quick reference, and tighter workflow steps (413 words, under 500 limit)

Eval categories added

  • Reactive install (4): command-not-found for rg, batcat, jq, ripgrep+fd
  • Preferred tool recommendations (7): grep->rg, find->fd, JSON->jq, YAML->yq, diff->difft, benchmark->hyperfine, CSV->qsv
  • Project type detection (3): Python, Node.js, Docker
  • Troubleshooting (3): PATH issues, hash cache, permission-blocked installs
  • Audit/Update (3): audit dependencies, batch update, environment PATH check
  • Catalog/Mapping (3): catalog lookup, binary name mapping, install via script
  • Integration (2): security tool suggestion, fd+rg pipeline
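As a quick sanity check, the per-category counts above can be tallied against the stated total of 25. The sketch below copies the counts and the tool pairs from the list; the names `CATEGORY_COUNTS`, `PREFERRED_TOOLS`, `total_evals`, and `recommend` are hypothetical and are not taken from the skill itself.

```python
# Counts copied from the "Eval categories added" list above.
CATEGORY_COUNTS = {
    "Reactive install": 4,
    "Preferred tool recommendations": 7,
    "Project type detection": 3,
    "Troubleshooting": 3,
    "Audit/Update": 3,
    "Catalog/Mapping": 3,
    "Integration": 2,
}

# Legacy command or task -> preferred tool, as listed in the
# "Preferred tool recommendations" bullet.
PREFERRED_TOOLS = {
    "grep": "rg",
    "find": "fd",
    "JSON": "jq",
    "YAML": "yq",
    "diff": "difft",
    "benchmark": "hyperfine",
    "CSV": "qsv",
}


def total_evals(counts):
    """Sum the per-category eval counts."""
    return sum(counts.values())


def recommend(task):
    """Return the preferred tool for a task, or None if none is listed."""
    return PREFERRED_TOOLS.get(task)


if __name__ == "__main__":
    print(total_evals(CATEGORY_COUNTS))
    print(recommend("grep"))
```

The seven categories add up to the 25 evals named in the title, and the seven tool pairs match the count given for the recommendation category.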

SKILL.md improvements

  • Added "Advisory" trigger category for modern tool recommendations
  • Compact 2-column preferred tools table (saves ~40% space vs old format)
  • Inline troubleshooting table with Debian alias fixes
  • Explicit hash -r guidance in resolution workflow
  • All content under 500-word limit
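`hash -r` is a shell builtin that clears the shell's cached command locations, so it has to be run in the user's own shell. The other half of the "installed but not found" fix, a missing PATH entry, can be checked programmatically. A minimal sketch, assuming `~/.local/bin` and `~/.cargo/bin` as typical user-level install directories (an assumption, not something the PR states); `dirs_missing_from_path` is a hypothetical helper.

```python
import os
from pathlib import Path


def dirs_missing_from_path(candidates, path_value=None):
    """Return the candidate directories that are not listed in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Normalize entries so "/a/" and "/a" compare equal.
    on_path = {os.path.normpath(p) for p in path_value.split(os.pathsep) if p}
    return [d for d in candidates if os.path.normpath(str(d)) not in on_path]


if __name__ == "__main__":
    # Assumed typical install locations for `pip install --user` and `cargo install`.
    likely = [Path.home() / ".local" / "bin", Path.home() / ".cargo" / "bin"]
    for missing in dirs_missing_from_path(likely):
        print(f"not on PATH: {missing}")
```

If a directory is reported as missing, adding it to PATH is the fix; if nothing is missing and the tool is still not found, the stale hash cache is the more likely cause.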

Test plan

  • Verify evals/evals.json is valid JSON with 25 entries
  • Verify SKILL.md renders correctly and stays under 500 words
  • Spot-check that eval prompts match actual skill capabilities
  • Confirm all referenced scripts and reference files exist
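The first two test-plan items lend themselves to a small script. The sketch below is one way to do it under stated assumptions: the top-level layout of `evals/evals.json` is not shown in this PR, so `count_evals` accepts either a bare list or an object with an `"evals"` list (both hypothetical), and `word_count` uses a plain whitespace split, which may differ from whatever tool produced the 413-word figure.

```python
import json
from pathlib import Path


def count_evals(text):
    """Parse JSON text and return the number of eval entries.

    Raises json.JSONDecodeError if the text is not valid JSON.
    The two accepted layouts are assumptions, not the confirmed schema.
    """
    data = json.loads(text)
    if isinstance(data, list):
        return len(data)
    if isinstance(data, dict) and isinstance(data.get("evals"), list):
        return len(data["evals"])
    raise ValueError("unrecognized evals layout")


def word_count(markdown):
    """Count whitespace-separated tokens in a Markdown document."""
    return len(markdown.split())


if __name__ == "__main__":
    evals = count_evals(Path("evals/evals.json").read_text(encoding="utf-8"))
    words = word_count(Path("SKILL.md").read_text(encoding="utf-8"))
    print(f"evals: {evals} (expected 25)")
    print(f"SKILL.md words: {words} (limit 500)")
```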

Evals expanded from 2 to 25 covering all skill capabilities:
reactive install, proactive audit, preferred tool recommendations,
binary name mapping, PATH troubleshooting, project type detection,
batch updates, permission workarounds, and tool integration pipelines.

SKILL.md improvements: added Advisory trigger category, inline
preferred-tools table, troubleshooting quick reference, tighter
workflow steps with hash -r guidance. 413 words (under 500 limit).

gemini-code-assist bot left a comment

Code Review

This pull request significantly expands the evaluation test suite in evals/evals.json by adding numerous test cases for CLI tool detection, installation, and modern tool recommendations. Additionally, the SKILL.md documentation has been refactored to improve clarity regarding workflows, preferred tools, and troubleshooting. The review feedback suggests using more specific assertion values in the evaluation file, such as 'ripgrep' and 'fd-find' instead of 'rg' and 'fd', to prevent false positives during testing, and refining the troubleshooting instructions for Debian-specific tool aliases.

},
{
"type": "content_contains",
"value": "fd",
medium

The assertion value "fd" is very short and likely to cause false positives in evaluation results, as it can appear as a substring in many common words or paths. Consider using a more specific string like "fd-find" or "fdfind", which are the package and binary names mentioned in the documentation.

Suggested change
- "value": "fd",
+ "value": "fd-find",

"assertions": [
{
"type": "content_contains",
"value": "rg",
medium

The assertion value "rg" is prone to false positives as it frequently appears in common words (e.g., "large", "target", "merge"). Since the description explicitly mentions "ripgrep", using "ripgrep" as the assertion value would be much more reliable and consistent with other tests in this file.
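The false-positive risk is easy to reproduce. The sketch below assumes `content_contains` is a plain substring check, which is what the comment implies but the PR does not confirm; `contains_word` is a hypothetical stricter alternative that only matches the value as a standalone token.

```python
import re


def content_contains(output, value):
    """Assumed semantics of the assertion type: plain substring match."""
    return value in output


def contains_word(output, value):
    """Match value only when it is not embedded in a longer word.

    Word characters and hyphens on either side count as part of a word,
    so "rg" does not match inside "target" or "rg-like".
    """
    pattern = r"(?<![\w-])" + re.escape(value) + r"(?![\w-])"
    return re.search(pattern, output) is not None


if __name__ == "__main__":
    unrelated = "merge the large target"
    print(content_contains(unrelated, "rg"))       # substring check on unrelated text
    print(contains_word(unrelated, "rg"))          # token check on the same text
    print(content_contains(unrelated, "ripgrep"))  # the longer value suggested here
```

Either a longer value such as "ripgrep" or a token-level match avoids the accidental hit; the longer value needs no change to the evaluator.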

Suggested change
- "value": "rg",
+ "value": "ripgrep",

"assertions": [
{
"type": "content_contains",
"value": "fd",
medium

Using "fd" as a content assertion is risky due to potential false positives. Recommending the full name "fd-find" or checking for the specific binary name "fdfind" would improve the robustness of this evaluation case.

Suggested change
- "value": "fd",
+ "value": "fd-find",

|---------|-----|
| Installed but not found | `hash -r` or add dir to PATH |
| No sudo | `cargo install`, `pip install --user`, manual binary |
| Debian `bat`=`batcat`, `fd`=`fdfind` | Symlink to `~/.local/bin/` |
medium

The troubleshooting fix for Debian aliases is a bit vague. It would be more helpful to explicitly state that the symlink should be created using the standard tool name (e.g., bat or fd) to point to the Debian-specific binary.

Suggested change
- | Debian `bat`=`batcat`, `fd`=`fdfind` | Symlink to `~/.local/bin/` |
+ | Debian bat=batcat, fd=fdfind | Symlink to standard name in ~/.local/bin/ |
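The direction of the symlink is the point of this comment: the link carries the standard name and points at the Debian-specific binary. A minimal sketch of that fix, assuming the Debian binaries are already installed and discoverable on PATH; `link_standard_names` and its `which` parameter are hypothetical and exist only to keep the example testable.

```python
import shutil
from pathlib import Path

# Standard name -> Debian binary name, as given in the troubleshooting row.
DEBIAN_NAMES = {"bat": "batcat", "fd": "fdfind"}


def link_standard_names(bin_dir, names=DEBIAN_NAMES, which=shutil.which):
    """Create bin_dir/<standard> symlinks pointing at the Debian binaries.

    Skips a name when the Debian binary is not found or when something
    already exists at the link path. Returns the links that were created.
    """
    bin_dir = Path(bin_dir)
    bin_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for standard, debian in names.items():
        target = which(debian)
        link = bin_dir / standard
        if target is None or link.exists() or link.is_symlink():
            continue
        link.symlink_to(target)
        created.append(link)
    return created


if __name__ == "__main__":
    for link in link_standard_names(Path.home() / ".local" / "bin"):
        print(f"created {link}")
```

The links only take effect if `~/.local/bin` is on PATH, and an existing shell may need `hash -r` before it picks up the new names.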

"assertions": [
{
"type": "content_contains",
"value": "fd",
medium

As noted in previous cases, "fd" is too generic for a reliable content assertion. Using "fd-find" or "fdfind" is preferred to ensure the evaluation correctly identifies the tool recommendation.

Suggested change
- "value": "fd",
+ "value": "fd-find",

},
{
"type": "content_contains",
"value": "rg",
medium

Using "rg" as an assertion value here is likely to trigger false positives. Switching to "ripgrep" provides a safer and more accurate verification of the tool's presence in the output.
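Since the same concern recurs across several evals, it may be simpler to scan the whole file for short `content_contains` values than to fix them one by one. The sketch below walks any JSON structure recursively, so it does not depend on the file's top-level layout; the `min_len` threshold of 4 is an arbitrary assumption, and `short_assertion_values` is a hypothetical helper. It would also flag short names such as "jq", which may or may not be a problem in practice.

```python
import json


def short_assertion_values(node, min_len=4):
    """Collect content_contains values shorter than min_len, in document order."""
    found = []
    if isinstance(node, dict):
        value = node.get("value")
        if (
            node.get("type") == "content_contains"
            and isinstance(value, str)
            and len(value) < min_len
        ):
            found.append(value)
        for child in node.values():
            found.extend(short_assertion_values(child, min_len))
    elif isinstance(node, list):
        for child in node:
            found.extend(short_assertion_values(child, min_len))
    return found


if __name__ == "__main__":
    with open("evals/evals.json", encoding="utf-8") as handle:
        for value in short_assertion_values(json.load(handle)):
            print(f"short assertion value: {value!r}")
```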

Suggested change
- "value": "rg",
+ "value": "ripgrep",


github-actions bot commented Apr 1, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA d433f22.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

CybotTM merged commit 9e9d48f into main on Apr 1, 2026
7 of 8 checks passed
CybotTM deleted the feature/evals-and-improvements branch on April 1, 2026 at 09:05