Skip to content

instavm/skill-optimization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Skill.md Optimization with DSPy

Programmatically optimize Skill.md prompts using DSPy's prompt optimization framework.

Key finding: DSPy optimization shows improvement on local models (+12.5% on Qwen) but not frontier models (0% on GPT-4o).

Read the full writeup


Quick Start

Option 1: Test with Qwen (Local, No API Key)

# Install Ollama: https://ollama.ai
ollama pull qwen3

# Install dependencies
pip install -r requirements.txt

# Run REAL DSPy BootstrapFewShot optimization
python scripts/optimize_qwen.py

Result: See REAL automated DSPy optimization with BootstrapFewShot in ~3-4 minutes.

  • Baseline: 42.5%
  • BootstrapFewShot Optimized: 55.1%
  • +12.5% improvement from automated example selection

DSPy automatically selects best examples and generates optimized prompts - no manual prompt engineering.

Option 2: Test with GPT-4o (Requires Azure/OpenAI)

# Set environment variables
export AZURE_API_KEY="your-key"
export AZURE_API_BASE="your-endpoint"
export AZURE_DEPLOYMENT="your-deployment"

# Run optimization
python scripts/run_azure_optimization.py

What This Does

Input: code-review.md - A Skill that finds security vulnerabilities

Process:

  1. Convert Skill.md → DSPy Signature
  2. Optimize with 10 vulnerable code examples
  3. Extract improvements → Skill-optimized.md

Output: Model-specific optimized variants


Results

Model Baseline Optimized Improvement
GPT-4o (Azure) 40.6% 38.8% 0%
Qwen3 (Ollama) 42.5% 55.1% +12.5%

Key insight: DSPy's BootstrapFewShot shows significant improvement on local models. GPT-4o sees no improvement (already optimal), Qwen shows +12.5% with automated optimization (+29% relative improvement).


Example Output

Running python scripts/optimize_qwen.py:

======================================================================
REAL DSPy OPTIMIZATION WITH QWEN
======================================================================

This uses BootstrapFewShot to AUTOMATICALLY:
  • Select which examples work best
  • Generate optimized prompts
  • Find the best reasoning strategy

======================================================================
STEP 1: BASELINE (No optimization)
======================================================================

Testing 1/3... 61.6%
Testing 2/3... 56.0%
Testing 3/3... 10.0%

  Baseline: 42.5%

======================================================================
STEP 2: DSPy BootstrapFewShot OPTIMIZATION
======================================================================

🔄 Running optimization...
Bootstrapped 2 full traces after 2 examples for up to 1 rounds.
  Optimization complete!

======================================================================
STEP 3: TESTING OPTIMIZED VERSION
======================================================================

Testing 1/3... 77.0%
Testing 2/3... 65.7%
Testing 3/3... 22.5%

  Optimized: 55.1%

======================================================================
RESULTS
======================================================================

Baseline:  42.5%
Optimized: 55.1%

  IMPROVEMENT: +12.5% (+29.4%)

DSPy's BootstrapFewShot successfully improved the skill!

Real automated DSPy optimization - BootstrapFewShot automatically selects best examples.


Documentation


Limitations

  • Tested on one skill type (code review)
  • Two models only (GPT-4o, Qwen3)
  • Simple metrics (10 training examples)

Open questions:

  • Do optimizations transfer across models?
  • Can we auto-generate training data?
  • What's the right standardized format?

Contributing

Contributions welcome! Particularly interested in:

  1. Testing with other skill types (data analysis, API design, etc.)
  2. Testing with other models (Llama, Mistral, etc.)
  3. Better evaluation metrics
  4. Transfer learning experiments
  5. Auto-generation of training data

License

MIT License - See LICENSE

About

Demonstration of DSPy optimization for Skill.md files

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages