Programmatically optimize Skill.md prompts using DSPy's prompt optimization framework.
Key finding: DSPy optimization improved a local model (Qwen3, +12.5%) but not a frontier model (GPT-4o, no gain).
Quick start (local Qwen3 via Ollama):

```bash
# Install Ollama: https://ollama.ai
ollama pull qwen3

# Install dependencies
pip install -r requirements.txt

# Run REAL DSPy BootstrapFewShot optimization
python scripts/optimize_qwen.py
```

Result: real, automated DSPy optimization with BootstrapFewShot in ~3-4 minutes:
- Baseline: 42.5%
- BootstrapFewShot Optimized: 55.1%
- +12.5% improvement from automated example selection
DSPy automatically selects the best examples and generates optimized prompts, with no manual prompt engineering.
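Under the hood, `optimize_qwen.py` follows the standard DSPy recipe. Here is a minimal sketch; the signature, training example, and metric below are illustrative stand-ins, not the repo's actual code:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Point DSPy at the local Qwen3 model served by Ollama
# ("ollama_chat/<model>" is LiteLLM's naming convention).
dspy.configure(lm=dspy.LM("ollama_chat/qwen3", api_base="http://localhost:11434", api_key=""))

class CodeReview(dspy.Signature):
    """Review the code and report any security vulnerabilities."""
    code = dspy.InputField(desc="source code to review")
    vulnerabilities = dspy.OutputField(desc="security issues found")

program = dspy.ChainOfThought(CodeReview)

# One hypothetical training example; the repo ships its own set of ten.
trainset = [
    dspy.Example(
        code='query = "SELECT * FROM users WHERE id=" + user_id',
        vulnerabilities="SQL injection via string concatenation",
    ).with_inputs("code"),
]

def metric(example, prediction, trace=None):
    # Crude check: does the prediction mention the expected vulnerability name?
    expected = example.vulnerabilities.lower().split(" via ")[0]
    return expected in prediction.vulnerabilities.lower()

# BootstrapFewShot runs the program over the trainset, keeps the traces that
# pass the metric, and attaches them to the program as few-shot demos.
optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(program, trainset=trainset)
```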
To run the same optimization against GPT-4o on Azure:

```bash
# Set environment variables
export AZURE_API_KEY="your-key"
export AZURE_API_BASE="your-endpoint"
export AZURE_DEPLOYMENT="your-deployment"
```
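Inside the script, these variables presumably feed DSPy's LiteLLM-backed client. A rough sketch of that wiring (the actual code in `run_azure_optimization.py` may differ):

```python
import os
import dspy

# "azure/<deployment>" is LiteLLM's model string for Azure OpenAI deployments.
# Depending on your endpoint, an API version (e.g. AZURE_API_VERSION) may also be required.
lm = dspy.LM(
    model=f"azure/{os.environ['AZURE_DEPLOYMENT']}",
    api_key=os.environ["AZURE_API_KEY"],
    api_base=os.environ["AZURE_API_BASE"],
)
dspy.configure(lm=lm)
```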
```bash
# Run optimization
python scripts/run_azure_optimization.py
```

Input: `code-review.md` - a Skill that finds security vulnerabilities
Process:
- Convert Skill.md → DSPy Signature (see the sketch below)
- Optimize with 10 vulnerable code examples
- Extract improvements → Skill-optimized.md
Output: Model-specific optimized variants
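A rough illustration of that first step, turning the prose in Skill.md into a DSPy Signature. The helper, path, and field names are assumptions for illustration, not the repo's code:

```python
import dspy
from pathlib import Path

def skill_to_signature(skill_path: str):
    """Use a Skill.md's instructions as the task description DSPy will optimize around."""
    instructions = Path(skill_path).read_text()
    # "code -> findings" declares one input and one output field;
    # with_instructions attaches the skill text as the signature's instructions.
    return dspy.Signature("code -> findings").with_instructions(instructions)

ReviewSignature = skill_to_signature("skills/code-review.md")  # hypothetical path
reviewer = dspy.ChainOfThought(ReviewSignature)
```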
| Model | Baseline | Optimized | Improvement |
|---|---|---|---|
| GPT-4o (Azure) | 40.6% | 38.8% | -1.8% (no gain) |
| Qwen3 (Ollama) | 42.5% | 55.1% | +12.5% |
Key insight: DSPy's BootstrapFewShot delivers a significant gain on the local model but not on the frontier model. GPT-4o shows no improvement (its baseline prompting is already strong for this task), while Qwen3 gains +12.5% from automated optimization (+29% relative improvement).
Running `python scripts/optimize_qwen.py`:

```
======================================================================
REAL DSPy OPTIMIZATION WITH QWEN
======================================================================
This uses BootstrapFewShot to AUTOMATICALLY:
• Select which examples work best
• Generate optimized prompts
• Find the best reasoning strategy
======================================================================
STEP 1: BASELINE (No optimization)
======================================================================
Testing 1/3... 61.6%
Testing 2/3... 56.0%
Testing 3/3... 10.0%
Baseline: 42.5%
======================================================================
STEP 2: DSPy BootstrapFewShot OPTIMIZATION
======================================================================
🔄 Running optimization...
Bootstrapped 2 full traces after 2 examples for up to 1 rounds.
Optimization complete!
======================================================================
STEP 3: TESTING OPTIMIZED VERSION
======================================================================
Testing 1/3... 77.0%
Testing 2/3... 65.7%
Testing 3/3... 22.5%
Optimized: 55.1%
======================================================================
RESULTS
======================================================================
Baseline: 42.5%
Optimized: 55.1%
IMPROVEMENT: +12.5% (+29.4%)
DSPy's BootstrapFewShot successfully improved the skill!
```
This is real, automated DSPy optimization: BootstrapFewShot selected the demonstration examples on its own, with no manual prompt engineering.
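After compilation, the chosen instructions and demos can be serialized back into a model-specific skill file (step 3 of the process above). A hedged sketch, reusing the illustrative field names from earlier:

```python
from pathlib import Path

def export_optimized_skill(optimized_program, out_path="Skill-optimized.md"):
    """Write the optimizer's chosen instructions and demos back out as markdown."""
    predictor = optimized_program.predictors()[0]   # the single tuned predictor
    parts = [predictor.signature.instructions, "", "## Examples"]
    for demo in predictor.demos:                    # demos selected by BootstrapFewShot
        parts += ["", "### Input", demo.code, "", "### Expected findings", demo.vulnerabilities]
    Path(out_path).write_text("\n".join(parts))
```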
Further reading:
- Full Blog Post - Complete writeup with findings
- Optimization Results - Detailed metrics
Limitations:
- Tested on one skill type (code review)
- Only two models tested (GPT-4o, Qwen3)
- Small training set (10 examples) and simple evaluation metrics
Open questions:
- Do optimizations transfer across models?
- Can we auto-generate training data?
- What's the right standardized format?
Contributions welcome! Particularly interested in:
- Testing with other skill types (data analysis, API design, etc.)
- Testing with other models (Llama, Mistral, etc.)
- Better evaluation metrics
- Transfer learning experiments
- Auto-generation of training data
MIT License - See LICENSE