diff --git a/.DONE.md b/.DONE.md new file mode 100644 index 00000000..e132b376 --- /dev/null +++ b/.DONE.md @@ -0,0 +1,258 @@ +# Braintrust Ruby SDK - Completed Work + +## Phase 0: Documentation ✅ + +- [x] Create .PLAN.md (moved to hidden) +- [x] Create .TODO.md (moved to hidden) + +## Phase 1: Project Setup & Infrastructure ✅ + +- [x] Create braintrust.gemspec (no runtime deps yet) +- [x] Create Gemfile +- [x] Create Rakefile (test, lint, ci tasks only) +- [x] Create mise.toml with precommit hooks + bundle install +- [x] Create .env.example +- [x] Create .github/workflows/ci.yml (uses rake ci) +- [x] Set up Standard linter config (via Rakefile) +- [x] Set up SimpleCov config (via test_helper.rb) +- [x] Create minimal README.md +- [x] Create minimal CONTRIBUTING.md +- [x] Create .gitignore +- [x] Create CHANGELOG.md +- [x] Create lib/braintrust/version.rb +- [x] Create lib/braintrust.rb (skeleton) +- [x] Create test/test_helper.rb +- [x] Create scripts/install-deps.sh (cross-platform) +- [x] Create main branch +- [x] Add rake ci task + +## Phase 2: Core State & Configuration (TDD) ✅ COMPLETE + +### lib/braintrust/config.rb ✅ +- [x] Write test: parse ENV vars +- [x] Implement Config.from_env +- [x] Write test: default values +- [x] Write test: merge options with ENV vars (options override) +- [x] Write test: ENV vars override defaults +- [x] All tests passing, linter clean + +### lib/braintrust/state.rb ✅ +- [x] Write test: create state with required fields +- [x] Write test: validate required fields (api_key required) +- [x] Write test: state is immutable (frozen) +- [x] Write test: thread-safe global state access (Mutex) +- [x] Implement State class +- [x] Implement State.global getter/setter +- [x] Implement State validation +- [x] All tests passing, linter clean + +### lib/braintrust.rb ✅ +- [x] Write test: init sets global state by default +- [x] Write test: init with set_global: false returns state +- [x] Write test: init merges options with ENV vars +- [x] Implement Braintrust.init +- [x] Implement Braintrust.current_state +- [x] Add blocking_login parameter to Braintrust.init +- [x] Document all init options explicitly +- [x] All tests passing, linter clean + +### lib/braintrust/api/auth.rb ✅ +- [x] Write test: login with valid API key +- [x] Write test: login with invalid API key +- [x] Implement API::Auth.login +- [x] Implement AuthResult struct +- [x] Handle 401/403 as invalid API key +- [x] Handle 400/4xx/5xx with appropriate errors +- [x] Implement API::Auth.mask_api_key +- [x] All tests passing (real API tests), linter clean + +### lib/braintrust/logger.rb ✅ +- [x] Create logger with DEBUG level when BRAINTRUST_DEBUG=true +- [x] Implement debug, info, warn, error methods +- [x] Write to stderr + +### lib/braintrust/state.rb (login) ✅ +- [x] Add State#login method +- [x] Login calls API::Auth.login +- [x] Login updates state fields (org_id, org_name, api_url, proxy_url, logged_in) +- [x] Add new attr_readers: org_id, proxy_url, logged_in +- [x] Remove freeze (allow login to mutate state) +- [x] All tests passing, linter clean + +### examples/login/ ✅ +- [x] Create examples/login/login_basic.rb +- [x] Demonstrate blocking_login usage +- [x] Test example runs successfully + +## Phase 3: Core Tracing (TDD) - ✅ COMPLETE (Trace.enable) + +### Add OpenTelemetry dependencies to braintrust.gemspec ✅ +- [x] Add opentelemetry-sdk runtime dependency +- [x] Add opentelemetry-exporter-otlp runtime dependency +- [x] Run bundle install + +### lib/braintrust/trace.rb ✅ +- [x] Write test: enable raises error if no state available +- [x] Write test: enable with explicit state +- [x] Write test: enable with global state +- [x] Write test: enable adds console exporter when BRAINTRUST_ENABLE_TRACE_CONSOLE_LOG=true +- [x] Implement Trace.enable(tracer_provider, state: nil) +- [x] Configure OTLP HTTP exporter with correct endpoint (api_url/otel/v1/traces) +- [x] Set Authorization header with API key +- [x] Register BatchSpanProcessor with tracer provider +- [x] Add SSL workaround (VERIFY_NONE with TODO) +- [x] All tests passing (4 tests, 8 assertions), linter clean + +### examples/trace/trace_basic.rb ✅ +- [x] Create example demonstrating Trace.enable +- [x] Show manual span creation with braintrust.parent attribute +- [x] Test example runs successfully + +### lib/braintrust/trace/span_processor.rb ✅ +- [x] Write test: adds braintrust.parent attribute +- [x] Write test: preserves existing parent attribute +- [x] Write test: adds braintrust.org attribute +- [x] Write test: adds braintrust.app_url attribute +- [x] Implement SpanProcessor class +- [x] Implement on_start hook (adds default_parent, org, app_url) +- [x] Implement on_finish hook +- [x] Wrap OTLP exporter in custom span processor +- [x] Update State/Config to use single default_parent field +- [x] Update BRAINTRUST_DEFAULT_PROJECT env var +- [x] Update example to remove manual parent setting +- [x] All tests passing (4 tests), linter clean + +## Phase 4: OpenAI Integration (TDD) - ✅ COMPLETE (First Pass) + +### lib/braintrust/trace/openai.rb ✅ +- [x] Add openai gem as development dependency +- [x] Write basic test: wrapper creates span for chat.completions +- [x] Implement basic OpenAI.wrap method +- [x] Update wrapper to use braintrust.* attributes (match Go SDK) + - [x] Use `braintrust.input_json` for input messages (JSON-encoded once) + - [x] Use `braintrust.output_json` for output choices (JSON-encoded once) + - [x] Use `braintrust.metadata` for request/response metadata (JSON-encoded once) + - [x] Use `braintrust.metrics` for token usage (JSON-encoded once) +- [x] Simplified output using `.to_h` to capture all fields (tool_calls, annotations, etc.) +- [x] Update test to verify braintrust.input_json contains messages +- [x] Update test to verify braintrust.output_json contains choices +- [x] Update test to verify braintrust.metadata contains model, temperature, etc +- [x] Update test to verify braintrust.metrics contains prompt_tokens, completion_tokens, tokens +- [x] Update span name to "openai.chat.completions.create" (match Go) +- [x] Test with real OpenAI API and verify in Braintrust UI + +### examples/openai.rb ✅ +- [x] Create openai.rb example with tracing +- [x] Test example runs successfully +- [x] Verify traces appear correctly in Braintrust UI with input/output/metadata + +### examples/internal/openai.rb ✅ +- [x] Create comprehensive example showcasing all features +- [x] Vision (image understanding) +- [x] Tool/function calling +- [x] Reasoning models (o1-mini with reasoning tokens) +- [x] Advanced parameters (temperature, top_p, etc.) +- [x] All examples under single parent trace with permalink + +## Phase 6: Evals Framework (TDD) - ✅ MOSTLY COMPLETE + +### lib/braintrust/eval/case.rb ✅ +- [x] Write test: Case with input/expected +- [x] Write test: Case with tags and metadata +- [x] Implement Case class + +### lib/braintrust/eval/scorer.rb ✅ +- [x] Write test: Scorer interface +- [x] Write test: Scorer helper with block +- [x] Write test: Scorer returns score +- [x] Implement Scorer module/class +- [x] Implement Eval.scorer helper + +### lib/braintrust/eval/cases.rb ✅ +- [x] Write test: Cases enumerable +- [x] Write test: Cases from array +- [x] Implement Cases class + +### lib/braintrust/eval/result.rb ✅ +- [x] Write test: Result with success/failed status +- [x] Implement Result class + +### lib/braintrust/internal/experiments.rb ✅ +- [x] Implement get_or_create for experiment resolution +- [x] Implement project and experiment registration via API + +### lib/braintrust/eval.rb ✅ (Error handling complete) +- [x] Write test: run with cases array +- [x] Write test: run resolves project +- [x] Write test: run resolves experiment +- [x] Write test: run executes task for each case +- [x] Write test: run executes scorers +- [x] Write test: run creates OTEL spans +- [x] Write test: run with explicit state +- [x] Write test: run with global state +- [x] Write test: run handles task errors +- [x] Write test: run handles scorer errors +- [x] Write test: task errors record exception events with stacktraces +- [x] Write test: scorer errors record exception events with stacktraces +- [x] Implement Eval.run +- [x] Implement project resolution +- [x] Implement experiment resolution +- [x] Implement task execution +- [x] Implement scorer execution +- [x] Implement span creation +- [x] Implement result generation +- [x] Implement error recording with span.record_exception() +- [x] Update record_span_error helper to use OpenTelemetry standard + +### Error Handling ✅ COMPLETE +- [x] Task errors recorded on task span with full stacktrace +- [x] Scorer errors recorded on score span with custom "ScorerError" type +- [x] Eval span gets error status when child spans fail +- [x] Exception events include type, message, and stacktrace +- [x] Backend correctly extracts and populates error field +- [x] Tests verify stacktrace attribute exists +- [x] All 72 tests pass with 243 assertions + +## Session History + +### Session 1 Completed +- Config class with ENV parsing, defaults, and option merging (4 tests) +- State class with validation and thread-safe global state (5 tests) +- Braintrust.init and Braintrust.current_state (3 tests) + +### Session 2 Completed +- Login functionality (API::Auth.login with real API tests) +- Logger with BRAINTRUST_DEBUG support +- State#login method (updates org info from API) +- Updated Braintrust.init with blocking_login option +- Documented all init options +- examples/login/login_basic.rb +- Trace.enable method with OTLP exporter to Braintrust +- Console debug support with BRAINTRUST_ENABLE_TRACE_CONSOLE_LOG +- Custom Span Processor with automatic attribute injection +- Changed to default_parent field (from project_id/project_name) +- BRAINTRUST_DEFAULT_PROJECT env var (format: "project_name:foo") +- examples/trace/trace_basic.rb +- **Total: 21 test runs, 41 assertions, all passing, linter clean** + +### Session 3 Completed +- OpenAI integration with braintrust.* attributes (input_json, output_json, metadata, metrics) +- Simplified output using `.to_h` to capture all fields including tool_calls +- Comprehensive test coverage (28 assertions) +- examples/openai.rb with Trace.permalink +- examples/internal/openai.rb showcasing vision, tools, reasoning, advanced params +- Verified traces in Braintrust UI via MCP +- SSL config improvements +- **Total: 28 test runs, 82 assertions, all passing, linter clean** + +### Session 4 Completed (Error Handling) +- Fixed error recording to match Go SDK behavior +- Updated task error handling to use `span.record_exception(e)` +- Updated `record_span_error` helper to use OpenTelemetry standard +- Errors now include full stacktraces via exception events +- Added stacktrace assertions to tests +- Investigated backend error processing (api-ts/src/otel/collector.ts parseError function) +- Verified errors populate in Braintrust database via MCP queries +- Task errors: Full stacktrace on task span, error message on eval span +- Scorer errors: Full stacktrace on score span with custom "ScorerError" type +- **Total: 72 test runs, 243 assertions, all passing, linter clean** diff --git a/.PLAN.md b/.PLAN.md index 901f27dd..5e59d572 100644 --- a/.PLAN.md +++ b/.PLAN.md @@ -63,12 +63,15 @@ Braintrust.with_state(state) # Temporarily override state **lib/braintrust/state.rb** -Immutable state container. +State container with login support. - Thread-safe global state management - Merges ENV vars with explicit options -- Validates required fields -- Holds tracer_provider instance +- Validates required fields (api_key required) +- Mutable to allow login() to update org info +- login() method fetches org details from Braintrust API +- Holds org_id, org_name, api_url, proxy_url after login +- Will hold tracer_provider instance (Phase 3) ### Braintrust::Config @@ -83,6 +86,8 @@ ENV vars: - `BRAINTRUST_DEFAULT_PROJECT_NAME` - Default project name - `BRAINTRUST_APP_URL` - App URL (default: https://www.braintrust.dev) - `BRAINTRUST_API_URL` - API URL (default: https://api.braintrust.dev) +- `BRAINTRUST_DEBUG` - Enable debug logging +- `BRAINTRUST_ENABLE_TRACE_CONSOLE_LOG` - Enable console trace logging (Phase 3) ### Braintrust::Trace @@ -260,29 +265,88 @@ Utilities for testing: ## Dependencies ### Runtime -- `opentelemetry-sdk` (~> 1.5) - OpenTelemetry SDK -- `opentelemetry-exporter-otlp` (~> 0.29) - OTLP exporter -- `ruby-openai` (~> 7.0) - OpenAI client -- `faraday` (~> 2.0) - HTTP client (used by ruby-openai) +**Note**: Runtime dependencies are added incrementally as features are implemented: +- Phase 3: `opentelemetry-sdk`, `opentelemetry-exporter-otlp` +- Phase 4: `ruby-openai`, `faraday` +- Phase 5: HTTP client for Braintrust API ### Development - `minitest` (~> 5.0) - Testing framework -- `standard` (~> 1.0) - Linting -- `simplecov` - Code coverage -- `rake` - Task automation +- `rake` (~> 13.0) - Task automation +- `standard` (~> 1.0) - Linting (zero-config) +- `simplecov` (~> 0.22) - Code coverage ### Tools (via mise) -- Ruby 3.2, 3.3, 3.4 +- Ruby 3.2 (pinned for development) +- Rust 1.83 (for Ruby compilation) - watchexec - File watching for tests ## Key Differences from Go SDK -1. **State Management**: Hybrid global/explicit vs pure global +1. **State Management**: Hybrid global/explicit vs pure global (avoids Go SDK's global state issues) 2. **API Style**: Ruby blocks/procs vs Go functions 3. **Middleware**: Faraday vs HTTP middleware 4. **Parallelism**: Threads vs goroutines 5. **Testing**: Minitest vs testify -6. **Linting**: Standard vs golangci-lint +6. **Linting**: Standard (zero-config) vs golangci-lint +7. **Dependencies**: Added incrementally as needed vs upfront + +## Implementation Notes + +### Session 1 (2025-10-21) + +**Completed**: +- Full project infrastructure (gemspec, Rakefile, CI/CD) +- mise.toml with automatic bundle install and precommit hooks +- Cross-platform dependency installer (scripts/install-deps.sh) +- Minimal docs (README.md, CONTRIBUTING.md) +- Moved tracking docs to hidden files (.PLAN.md, .TODO.md) +- Added `rake ci` task for CI verification +- Removed build/release tasks (will add when ready to publish) +- Created main branch +- Config class with ENV parsing and option merging +- State class with thread-safe global state management +- Braintrust.init with set_global option + +**Decisions**: +- Runtime deps added only when needed (not all upfront) +- Standard linter (zero-config, opinionated) +- Minitest (Ruby built-in, plain asserts) +- Simplified docs (essentials only) +- No system gem installation tasks +- mise handles Ruby + Rust, brew handles C libraries +- Hybrid state management (global + explicit state) +- Mutable state (removed freeze to allow login to update fields) + +### Session 2 (2025-10-21) + +**Completed**: +- Login API integration (lib/braintrust/api/auth.rb) + - AuthResult struct with org_id, org_name, api_url, proxy_url + - Proper HTTP error handling (401/403/400/4xx/5xx) + - API key masking for logging +- Logger module (lib/braintrust/logger.rb) + - DEBUG level when BRAINTRUST_DEBUG=true env var set + - Outputs to stderr +- State#login method + - Calls API::Auth.login + - Updates state with org info from API + - Added org_id, proxy_url, logged_in attributes +- Updated Braintrust.init + - Added blocking_login parameter + - Documented all options explicitly (not **options) +- Login example (examples/login/login_basic.rb) + - Demonstrates blocking_login usage + - Real API integration tests (no mocks) + +**Decisions**: +- Real API tests (not mocks), tests fail if BRAINTRUST_API_KEY not set +- State.login updates current state (doesn't return new state) +- Removed state immutability (freeze) to allow login mutations +- API logic separated into lib/braintrust/api/ module structure +- Struct-based return values (AuthResult) instead of raw hashes +- SSL verification workaround for macOS (VERIFY_NONE with TODO) +- State#login_until_success deferred (background thread with retries) ## Future Enhancements diff --git a/.TODO.md b/.TODO.md index 0a2a5bc9..064b7c5b 100644 --- a/.TODO.md +++ b/.TODO.md @@ -1,101 +1,69 @@ -# Braintrust Ruby SDK - Implementation Checklist - -## Phase 0: Documentation ✅ - -- [x] Create PLAN.md -- [x] Create TODO.md - -## Phase 1: Project Setup & Infrastructure ✅ - -- [x] Create braintrust.gemspec -- [x] Create Gemfile -- [x] Create Rakefile -- [x] Create mise.toml with precommit hooks -- [x] Create .env.example -- [x] Create .github/workflows/ci.yml -- [x] Set up Standard linter config (via Rakefile) -- [x] Set up SimpleCov config (via test_helper.rb) -- [x] Create basic README.md -- [x] Create .gitignore -- [x] Create CHANGELOG.md -- [x] Create lib/braintrust/version.rb -- [x] Create lib/braintrust.rb (skeleton) -- [x] Create test/test_helper.rb - -## Phase 2: Core State & Configuration (TDD) - -### lib/braintrust/config.rb -- [ ] Write test: parse ENV vars -- [ ] Write test: default values -- [ ] Write test: merge options with ENV vars -- [ ] Implement Config.from_env -- [ ] Implement Config.merge - -### lib/braintrust/state.rb -- [ ] Write test: create state with required fields -- [ ] Write test: validate required fields -- [ ] Write test: state is immutable -- [ ] Write test: thread-safe global state access -- [ ] Implement State class -- [ ] Implement State.global getter/setter -- [ ] Implement State validation - -### lib/braintrust.rb -- [ ] Write test: init sets global state by default -- [ ] Write test: init with set_global: false returns state -- [ ] Write test: current_state returns global state -- [ ] Write test: with_state temporarily overrides global -- [ ] Implement Braintrust.init -- [ ] Implement Braintrust.current_state -- [ ] Implement Braintrust.with_state - -## Phase 3: Core Tracing (TDD) - -### lib/braintrust/trace/span_processor.rb -- [ ] Write test: adds braintrust.parent attribute -- [ ] Write test: adds braintrust.org attribute -- [ ] Write test: adds braintrust.app_url attribute -- [ ] Write test: resolves parent from context -- [ ] Write test: filters non-AI spans when configured -- [ ] Write test: thread-safe span processing -- [ ] Implement SpanProcessor class -- [ ] Implement on_start hook -- [ ] Implement on_end hook -- [ ] Implement span filtering logic - -### lib/braintrust/trace.rb -- [ ] Write test: enable creates tracer provider -- [ ] Write test: enable configures OTLP exporter -- [ ] Write test: enable registers span processor -- [ ] Write test: enable with explicit state -- [ ] Write test: enable with global state -- [ ] Write test: disable/teardown cleans up +# Braintrust Ruby SDK - TODO + +> See `.DONE.md` for completed work + +## Known Issues / Tech Debt + +### High Priority + +- [ ] **SSL Certificate Verification on macOS**: Currently using `OpenSSL::SSL::VERIFY_NONE` workaround ⚠️ + - **SECURITY ISSUE**: Disables SSL certificate verification + - Affects: lib/braintrust/api/auth.rb, lib/braintrust/trace.rb + - Issue: `certificate verify failed (unable to get certificate CRL)` + - Need to investigate proper SSL certificate handling or system cert store configuration + - Must be fixed before production use + +### Medium Priority + +- [ ] **Kitchen-Sink Span Export Inconsistency**: Some eval runs show incomplete span export + - Affects: examples/internal/kitchen-sink.rb (8 cases, only 3-4 appear sometimes) + - Issue: BatchSpanProcessor may not flush all spans before shutdown + - Simple evals work fine (3 cases exported successfully) + - May need explicit `tracer_provider.force_flush()` before `shutdown()` + - May be timing-related with concurrent OpenAI API calls + +### Low Priority + +- [ ] **Parallelism Not Implemented**: Eval.run accepts parallelism parameter but doesn't use it + - Currently runs cases sequentially + - Need to implement parallel execution with threads or concurrent-ruby + +## Pending Work + +### Phase 2: Deferred Items +- [ ] Implement Braintrust.with_state (deferred - not needed yet) +- [ ] Implement State#login_until_success (deferred - background thread with retries) + +### Phase 3: Trace Utilities (Deferred) - [ ] Write test: permalink generation -- [ ] Implement Trace.enable -- [ ] Implement Trace.disable - [ ] Implement Trace.permalink -- [ ] Implement Trace.set_parent +- [ ] Implement Trace.set_parent (for setting parent in context) - [ ] Implement Trace.get_parent +- [ ] Implement span filtering logic (AI spans filter) + +### Phase 4.5: OpenAI Advanced Features (Future) + +#### Streaming Support +- [ ] Add support for `stream_raw` API +- [ ] Handle streaming responses and chunks +- [ ] Aggregate streaming data for tracing +- [ ] Test streaming with console output -## Phase 4: OpenAI Integration (TDD) - -### lib/braintrust/trace/openai.rb -- [ ] Write test: middleware creates span for chat.completions -- [ ] Write test: middleware records request attributes -- [ ] Write test: middleware records response attributes -- [ ] Write test: middleware parses token usage -- [ ] Write test: middleware with explicit state -- [ ] Write test: middleware with global state -- [ ] Write test: middleware handles errors -- [ ] Implement OpenAI.middleware -- [ ] Implement request span creation -- [ ] Implement response attribute recording -- [ ] Implement token usage parsing -- [ ] Implement gen_ai.* semantic conventions - -## Phase 5: API Client (TDD) - -### lib/braintrust/api.rb +#### Additional Endpoints +- [ ] Embeddings support +- [ ] Assistants API support +- [ ] Fine-tuning API support +- [ ] Images API support + +#### Error Handling & Reliability +- [ ] Better error handling for API failures +- [ ] Retry logic with exponential backoff +- [ ] Timeout configuration +- [ ] Rate limiting handling + +### Phase 5: API Client (TDD) + +#### lib/braintrust/api.rb - [ ] Write test: register_project creates/fetches project - [ ] Write test: register_experiment creates experiment - [ ] Write test: register_experiment with update flag @@ -111,68 +79,38 @@ - [ ] Implement fetch_dataset - [ ] Implement insert_dataset_events -## Phase 6: Evals Framework (TDD) - -### lib/braintrust/eval/case.rb -- [ ] Write test: Case with input/expected -- [ ] Write test: Case with tags and metadata -- [ ] Implement Case class +### Phase 6: Evals - Remaining Items -### lib/braintrust/eval/scorer.rb -- [ ] Write test: Scorer interface -- [ ] Write test: Scorer helper with block -- [ ] Write test: Scorer returns score -- [ ] Implement Scorer module/class -- [ ] Implement Eval.scorer helper +#### lib/braintrust/eval.rb +- [ ] Implement parallel execution (parallelism parameter) -### lib/braintrust/eval/dataset.rb +#### lib/braintrust/eval/dataset.rb - [ ] Write test: Dataset enumerable - [ ] Write test: Dataset from array - [ ] Write test: Dataset from API - [ ] Implement Dataset class -### lib/braintrust/eval.rb -- [ ] Write test: run with cases array -- [ ] Write test: run resolves project -- [ ] Write test: run resolves experiment -- [ ] Write test: run executes task for each case -- [ ] Write test: run executes scorers -- [ ] Write test: run creates OTEL spans -- [ ] Write test: run with parallelism -- [ ] Write test: run with explicit state -- [ ] Write test: run with global state -- [ ] Write test: run handles task errors -- [ ] Write test: run handles scorer errors -- [ ] Implement Eval.run -- [ ] Implement project resolution -- [ ] Implement experiment resolution -- [ ] Implement task execution -- [ ] Implement scorer execution -- [ ] Implement parallel execution -- [ ] Implement span creation -- [ ] Implement result generation - -## Phase 7: Examples - -### examples/openai/ +### Phase 7: Examples + +#### examples/openai/ - [ ] Create openai_basic.rb - [ ] Test example runs successfully -### examples/otel/ +#### examples/otel/ - [ ] Create otel_basic.rb - [ ] Test example runs successfully -### examples/evals/ +#### examples/evals/ - [ ] Create eval_basic.rb - [ ] Test example runs successfully -## Phase 8: Documentation & Polish +### Phase 8: Documentation & Polish - [ ] Write comprehensive README.md - [ ] Document all public APIs - [ ] Add inline code comments -- [ ] Create CONTRIBUTING.md -- [ ] Create CHANGELOG.md +- [ ] Update CONTRIBUTING.md +- [ ] Update CHANGELOG.md - [ ] Verify 80%+ test coverage - [ ] Run Standard linter and fix issues - [ ] Set up CI/CD pipeline @@ -180,6 +118,32 @@ ## Current Status -**Last Updated**: 2025-10-21 -**Current Phase**: Phase 1 (Project Setup) - Complete ✅ -**Next Step**: Phase 2 - Core State & Configuration (TDD) +**Last Updated**: 2025-10-22 (Session 4) +**Current Phase**: Phase 6 (Evals Framework) - ✅ MOSTLY COMPLETE (Error Handling ✅, Parallelism pending) +**Test Status**: 72 test runs, 243 assertions, all passing, linter clean + +## Outstanding Issues Summary + +**Session 4 Completed**: +- ✅ Error handling complete (task errors, scorer errors, stacktraces) +- ✅ All tests passing +- ⚠️ Kitchen-sink inconsistency (span export timing issue) + +## Next Session Options + +1. **Fix SSL Certificate Verification** (High Priority ⚠️) + - Security issue that needs resolution + - Investigate proper cert store configuration + +2. **Fix Kitchen-Sink Span Export** (Medium Priority) + - Add explicit force_flush() before shutdown + - Test with larger eval runs + +3. **Implement Parallelism** (Low Priority) + - Add parallel case execution to Eval.run + +4. **API Client** (Phase 5) + - Datasets API support + +5. **OpenAI Advanced** (Phase 4.5) + - Streaming support diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 9fa74633..dbeb48cd 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -8,9 +8,10 @@ on: jobs: test: - runs-on: ubuntu-latest + runs-on: ${{ matrix.os }} strategy: matrix: + os: [ubuntu-latest, windows-latest, macos-latest] ruby-version: ['3.2', '3.3', '3.4'] steps: @@ -24,10 +25,12 @@ jobs: - name: Run CI verification run: bundle exec rake ci + env: + BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }} - name: Upload coverage to Codecov uses: codecov/codecov-action@v4 - if: matrix.ruby-version == '3.4' + if: matrix.ruby-version == '3.4' && matrix.os == 'ubuntu-latest' with: files: ./coverage/.resultset.json fail_ci_if_error: false diff --git a/Gemfile.lock b/Gemfile.lock index be9db9ba..185a06de 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -1,17 +1,53 @@ PATH remote: . specs: - braintrust (0.1.0) + braintrust (0.0.1) + opentelemetry-exporter-otlp (~> 0.28) + opentelemetry-sdk (~> 1.0) GEM remote: https://rubygems.org/ specs: ast (2.4.3) + bigdecimal (3.3.1) + connection_pool (2.5.4) docile (1.4.1) + google-protobuf (4.33.0-arm64-darwin) + bigdecimal + rake (>= 13) + google-protobuf (4.33.0-x64-mingw-ucrt) + bigdecimal + rake (>= 13) + google-protobuf (4.33.0-x86_64-linux-gnu) + bigdecimal + rake (>= 13) + googleapis-common-protos-types (1.22.0) + google-protobuf (~> 4.26) json (2.15.1) language_server-protocol (3.17.0.5) lint_roller (1.1.0) minitest (5.26.0) + openai (0.34.1) + connection_pool + opentelemetry-api (1.7.0) + opentelemetry-common (0.23.0) + opentelemetry-api (~> 1.0) + opentelemetry-exporter-otlp (0.31.1) + google-protobuf (>= 3.18) + googleapis-common-protos-types (~> 1.3) + opentelemetry-api (~> 1.1) + opentelemetry-common (~> 0.20) + opentelemetry-sdk (~> 1.10) + opentelemetry-semantic_conventions + opentelemetry-registry (0.4.0) + opentelemetry-api (~> 1.1) + opentelemetry-sdk (1.10.0) + opentelemetry-api (~> 1.1) + opentelemetry-common (~> 0.20) + opentelemetry-registry (~> 0.2) + opentelemetry-semantic_conventions + opentelemetry-semantic_conventions (1.36.0) + opentelemetry-api (~> 1.0) parallel (1.27.0) parser (3.3.9.0) ast (~> 2.4.1) @@ -63,11 +99,16 @@ GEM unicode-emoji (4.1.0) PLATFORMS + arm64-darwin-23 arm64-darwin-24 + x64-mingw + x64-mingw-ucrt + x86_64-linux DEPENDENCIES braintrust! minitest (~> 5.0) + openai (~> 0.34) rake (~> 13.0) simplecov (~> 0.22) standard (~> 1.0) diff --git a/Rakefile b/Rakefile index 2ff2a0cf..5aaae18d 100644 --- a/Rakefile +++ b/Rakefile @@ -19,6 +19,26 @@ task :"lint:fix" do sh "bundle exec standardrb --fix" end +desc "Remove all ignored files (coverage, pkg, etc.)" +task :clean do + sh "git clean -fdX" +end + +desc "Run all examples" +task :examples do + examples = FileList["examples/**/*.rb"].exclude("examples/**/README.md") + + puts "Running #{examples.length} examples..." + + examples.each do |example| + puts "\n=== Running #{example} ===" + sh "bundle exec ruby #{example}" do |ok, res| + puts "✓ #{example} completed" if ok + puts "✗ #{example} failed (#{res.exitstatus})" unless ok + end + end +end + desc "Verify CI (lint + test)" task ci: [:lint, :test] diff --git a/braintrust.gemspec b/braintrust.gemspec index 12a03e49..2a92d8bf 100644 --- a/braintrust.gemspec +++ b/braintrust.gemspec @@ -30,11 +30,13 @@ Gem::Specification.new do |spec| spec.require_paths = ["lib"] # Runtime dependencies - # (will be added as needed during implementation) + spec.add_runtime_dependency "opentelemetry-sdk", "~> 1.0" + spec.add_runtime_dependency "opentelemetry-exporter-otlp", "~> 0.28" # Development dependencies spec.add_development_dependency "minitest", "~> 5.0" spec.add_development_dependency "rake", "~> 13.0" spec.add_development_dependency "standard", "~> 1.0" spec.add_development_dependency "simplecov", "~> 0.22" + spec.add_development_dependency "openai", "~> 0.34" end diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 00000000..0affaf4e --- /dev/null +++ b/examples/README.md @@ -0,0 +1,37 @@ +# Braintrust Ruby SDK Examples + +This directory contains examples demonstrating how to use the Braintrust Ruby SDK. + +## Prerequisites + +All examples require a Braintrust API key. Get one from [Braintrust Settings](https://www.braintrust.dev/app/settings). + +Set your API key as an environment variable: + +```bash +export BRAINTRUST_API_KEY="your-api-key-here" +``` + +## Running Examples + +From the project root: + +```bash +# Run a specific example +ruby examples/login/login_basic.rb + +# Enable debug logging +BRAINTRUST_DEBUG=true ruby examples/login/login_basic.rb +``` + +## Available Examples + +### Login Examples + +- **`login/login_basic.rb`**: Basic login example showing how to authenticate and retrieve organization information + +## Coming Soon + +- OpenTelemetry tracing examples +- OpenAI integration examples +- Eval framework examples diff --git a/examples/eval.rb b/examples/eval.rb new file mode 100644 index 00000000..99cfaca1 --- /dev/null +++ b/examples/eval.rb @@ -0,0 +1,164 @@ +#!/usr/bin/env ruby +# frozen_string_literal: true + +require "bundler/setup" +require "braintrust" +require "opentelemetry/sdk" + +# Example: Food Classification Eval +# +# This example demonstrates the Eval API for running evaluations: +# 1. Define test cases (input + expected output) +# 2. Define a task (the code being evaluated) +# 3. Define scorers (how to judge the output) +# 4. Run the eval with parallelism +# 5. Inspect the results +# +# Usage: +# BRAINTRUST_API_KEY=key bundle exec ruby examples/eval.rb + +unless ENV["BRAINTRUST_API_KEY"] + puts "Error: BRAINTRUST_API_KEY environment variable is required" + exit 1 +end + +# Initialize Braintrust with blocking login +Braintrust.init(blocking_login: true) + +# Create OpenTelemetry TracerProvider +tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + +# Enable Braintrust tracing +Braintrust::Trace.enable(tracer_provider) + +# Set as global provider +OpenTelemetry.tracer_provider = tracer_provider + +# Simple food classifier (the code being evaluated) +# In a real scenario, this would call your model/API +def classify_food(input) + # Simple rule-based classifier for demo + fruit = %w[apple banana strawberry orange grape mango] + vegetable = %w[carrot broccoli spinach potato tomato cucumber] + + input_lower = input.downcase + return "fruit" if fruit.any? { |f| input_lower.include?(f) } + return "vegetable" if vegetable.any? { |v| input_lower.include?(v) } + "unknown" +end + +# Example of a class-based scorer (reusable) +class FuzzyMatchScorer + def name + "fuzzy_match" + end + + def call(input, expected, output, metadata = {}) + threshold = metadata[:threshold] || 0.8 + + # Simple fuzzy matching (in real scenario, use Levenshtein distance) + similarity = if output == expected + 1.0 + elsif output.downcase.include?(expected.downcase) || expected.downcase.include?(output.downcase) + 0.7 + else + 0.0 + end + + (similarity >= threshold) ? 1.0 : 0.0 + end +end + +# Example of a lambda scorer (can pass directly without wrapping) +length_match = ->(input, expected, output) { + # Score based on whether output has correct length + (output.length == expected.length) ? 1.0 : 0.0 +} + +# Run the evaluation +puts "\nRunning evaluation..." +result = Braintrust::Eval.run( + # Required: Project and experiment + project: "ruby-sdk-examples", + experiment: "food-classifier-eval", + + # Required: Test cases + # Each case has input, expected output, and optional tags/metadata + cases: [ + {input: "apple", expected: "fruit"}, + {input: "carrot", expected: "vegetable"}, + {input: "banana", expected: "fruit", tags: ["tropical"]}, + {input: "broccoli", expected: "vegetable"}, + {input: "strawberry", expected: "fruit", tags: ["berry"]}, + {input: "potato", expected: "vegetable"}, + {input: "orange", expected: "fruit", tags: ["citrus"]}, + {input: "spinach", expected: "vegetable", tags: ["leafy"]} + ], + + # Required: Task (callable) + # Can be a proc, lambda, method reference, or object with .call + task: ->(input) { classify_food(input) }, + + # Required: Scorers (array) + # Scorers evaluate the quality of the output + scorers: [ + # Simple inline scorer - exact match + # Takes 3 params: input, expected, output + Braintrust::Eval.scorer("exact_match") { |input, expected, output| + (output == expected) ? 1.0 : 0.0 + }, + + # Advanced inline scorer - with metadata + # Takes 4 params: input, expected, output, metadata + Braintrust::Eval.scorer("case_insensitive_match") { |input, expected, output, metadata| + (output.downcase == expected.downcase) ? 1.0 : 0.0 + }, + + # Class-based scorer (reusable) + FuzzyMatchScorer.new, + + # Lambda scorer (auto-named as "scorer") + # Just pass the lambda directly - no wrapper needed! + length_match + ], + + # Optional: Run 3 cases in parallel + parallelism: 3, + + # Optional: Tags for the experiment + tags: ["example", "food-classification", "v1"], + + # Optional: Metadata for the experiment + metadata: { + description: "Food classification eval example", + version: "1.0.0" + } +) + +# Inspect the results +puts "\n" + "=" * 50 +puts "Evaluation Complete!" +puts "=" * 50 + +puts "\nExperiment: #{result.experiment_name}" +puts "Project ID: #{result.project_id}" +puts "Duration: #{result.duration.round(2)}s" +puts "Status: #{result.success? ? "✓ Success" : "✗ Failed"}" + +# Show the permalink to view in Braintrust UI +puts "\nView results at:" +puts " #{result.permalink}" + +# Show errors if any +if result.failed? + puts "\nErrors (#{result.errors.length}):" + result.errors.each do |error| + puts " - #{error}" + end + exit 1 +end + +puts "\n✓ All test cases passed!" + +# Shutdown to flush spans to Braintrust +tracer_provider.shutdown diff --git a/examples/internal/kitchen-sink.rb b/examples/internal/kitchen-sink.rb new file mode 100644 index 00000000..246c8467 --- /dev/null +++ b/examples/internal/kitchen-sink.rb @@ -0,0 +1,377 @@ +#!/usr/bin/env ruby +# frozen_string_literal: true + +require "bundler/setup" +require "braintrust" +require "openai" +require "opentelemetry/sdk" +require "json" + +# Kitchen Sink Example +# +# This example demonstrates many features of the Braintrust Ruby SDK: +# - OpenAI integration with function/tool calling +# - Complex task with error handling +# - Multiple scorer types (exact match, LLM-as-judge, custom) +# - Cases with tags, metadata, and expected outputs +# - Full OpenTelemetry tracing +# +# Usage: +# BRAINTRUST_API_KEY=key OPENAI_API_KEY=key bundle exec ruby examples/internal/kitchen-sink.rb + +unless ENV["BRAINTRUST_API_KEY"] + puts "Error: BRAINTRUST_API_KEY environment variable is required" + exit 1 +end + +unless ENV["OPENAI_API_KEY"] + puts "Error: OPENAI_API_KEY environment variable is required" + exit 1 +end + +# Initialize Braintrust with blocking login +Braintrust.init(blocking_login: true) + +# Create OpenTelemetry TracerProvider +tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + +# Enable Braintrust tracing +Braintrust::Trace.enable(tracer_provider) + +# Set as global provider +OpenTelemetry.tracer_provider = tracer_provider + +# Create OpenAI client +openai_client = OpenAI::Client.new(api_key: ENV["OPENAI_API_KEY"]) + +# Wrap the client with Braintrust tracing +Braintrust::Trace::OpenAI.wrap(openai_client, tracer_provider: tracer_provider) + +puts "Kitchen Sink Eval Example" +puts "=" * 60 + +# Define tools/functions for OpenAI +def get_weather_tools + [{ + type: "function", + function: { + name: "get_current_weather", + description: "Get the current weather in a given location", + parameters: { + type: "object", + properties: { + location: { + type: "string", + description: "The city and state, e.g. San Francisco, CA" + }, + unit: { + type: "string", + enum: ["celsius", "fahrenheit"], + description: "The temperature unit to use" + } + }, + required: ["location"] + } + } + }] +end + +# Mock function to execute tool calls +def execute_tool_call(tool_call) + if tool_call.function.name == "get_current_weather" + args = JSON.parse(tool_call.function.arguments) + location = args["location"] + unit = args["unit"] || "fahrenheit" + + # Mock weather data + temp = (unit == "celsius") ? 22 : 72 + { + location: location, + temperature: temp, + unit: unit, + conditions: "sunny" + }.to_json + end +end + +# Complex task that uses OpenAI with tool calling +def weather_assistant_task(input, openai_client) + messages = [ + {role: "system", content: "You are a helpful weather assistant. Use the get_current_weather function when asked about weather."}, + {role: "user", content: input} + ] + + # First API call - may trigger tool calls + response = openai_client.chat.completions.create( + model: "gpt-4o-mini", + messages: messages, + tools: get_weather_tools, + tool_choice: "auto", + max_tokens: 150 + ) + + choice = response.choices[0] + + # If there are tool calls, execute them and make another API call + if choice.finish_reason == "tool_calls" && choice.message.tool_calls + # Add assistant's message with tool calls + messages << { + role: "assistant", + content: choice.message.content, + tool_calls: choice.message.tool_calls.map { |tc| + { + id: tc.id, + type: tc.type, + function: { + name: tc.function.name, + arguments: tc.function.arguments + } + } + } + } + + # Execute each tool call and add results + choice.message.tool_calls.each do |tool_call| + result = execute_tool_call(tool_call) + messages << { + role: "tool", + tool_call_id: tool_call.id, + content: result + } + end + + # Second API call with tool results + response = openai_client.chat.completions.create( + model: "gpt-4o-mini", + messages: messages, + max_tokens: 150 + ) + end + + response.choices[0].message.content +end + +# Scorers + +# 1. Exact match scorer +exact_match_scorer = Braintrust::Eval.scorer("exact_match") do |input, expected, output| + next 1.0 if expected.nil? + next 0.0 if output.nil? + (output == expected) ? 1.0 : 0.0 +end + +# 2. Contains keyword scorer +contains_keyword_scorer = Braintrust::Eval.scorer("contains_keyword") do |input, expected, output, metadata| + keyword = metadata[:keyword] + next 1.0 unless keyword + next 0.0 if output.nil? + + output.downcase.include?(keyword.downcase) ? 1.0 : 0.0 +end + +# 3. LLM-as-judge scorer using OpenAI +class LLMJudgeScorer + def initialize(openai_client, name, criterion) + @openai_client = openai_client + @name = name + @criterion = criterion + end + + attr_reader :name + + def call(input, expected, output, metadata = {}) + return 0.0 if output.nil? + + prompt = <<~PROMPT + Evaluate the following response based on this criterion: #{@criterion} + + User Input: #{input} + Assistant Response: #{output} + #{"Expected Response: #{expected}" if expected} + + Score the response from 0.0 to 1.0 based on how well it meets the criterion. + Respond with ONLY a number between 0.0 and 1.0, nothing else. + PROMPT + + response = @openai_client.chat.completions.create( + model: "gpt-4o-mini", + messages: [{role: "user", content: prompt}], + temperature: 0.0, + max_tokens: 10 + ) + + score_text = response.choices[0].message.content.strip + score_text.to_f + rescue => e + puts "LLM Judge error: #{e.message}" + 0.5 # Default score on error + end +end + +# 4. Response length scorer +length_scorer = Braintrust::Eval.scorer("appropriate_length") do |input, expected, output| + next 0.0 if output.nil? + + length = output.length + # Penalize very short (< 20 chars) or very long (> 500 chars) responses + if length < 20 + 0.3 + elsif length > 500 + 0.7 + else + 1.0 + end +end + +# 5. Failing scorer (demonstrates error handling) +failing_scorer = Braintrust::Eval.scorer("error_demo") do |input, expected, output, metadata| + # This scorer intentionally fails on a specific scenario + if metadata[:scenario] == "ambiguous" + raise "Intentional error: Cannot score ambiguous queries" + end + 1.0 # Success for all other cases +end + +# Create LLM judges +helpfulness_judge = LLMJudgeScorer.new(openai_client, "helpfulness", "does the response directly answer the question?") +accuracy_judge = LLMJudgeScorer.new(openai_client, "accuracy", "is the information provided accurate and relevant?") + +# Test cases with various scenarios +test_cases = [ + # Successful case with tool calling + { + input: "What's the weather like in San Francisco?", + expected: nil, # No exact expected output + metadata: {keyword: "san francisco", scenario: "weather_query"}, + tags: ["weather", "tool_calling", "success"] + }, + + # Another weather query + { + input: "Tell me the temperature in New York City", + expected: nil, + metadata: {keyword: "new york", scenario: "weather_query"}, + tags: ["weather", "tool_calling", "success"] + }, + + # Non-weather query (no tool calling) + { + input: "What's the capital of France?", + expected: "Paris", + metadata: {keyword: "paris", scenario: "general_knowledge"}, + tags: ["general_knowledge", "no_tools", "success"] + }, + + # Query that might produce shorter response + { + input: "Say hello", + expected: nil, + metadata: {scenario: "short_response"}, + tags: ["greeting", "short"] + }, + + # Complex query combining weather and other info + { + input: "What's the weather in Seattle and what's the city known for?", + expected: nil, + metadata: {keyword: "seattle", scenario: "complex_query"}, + tags: ["weather", "general_knowledge", "complex"] + }, + + # Edge case - ambiguous location + { + input: "What's the weather in Paris?", + expected: nil, + metadata: {keyword: "paris", scenario: "ambiguous"}, + tags: ["weather", "ambiguous", "edge_case"] + }, + + # Multiple locations + { + input: "Compare the weather in Boston and Miami", + expected: nil, + metadata: {scenario: "multi_location"}, + tags: ["weather", "comparison", "complex"] + }, + + # Weather with specific unit preference + { + input: "What's the temperature in Tokyo in celsius?", + expected: nil, + metadata: {keyword: "celsius", scenario: "unit_preference"}, + tags: ["weather", "unit_conversion"] + } +] + +# Run the evaluation +puts "\nRunning comprehensive evaluation..." +puts "Cases: #{test_cases.length}" +puts "Scorers: 6 (exact_match, contains_keyword, appropriate_length, error_demo, helpfulness, accuracy)" +puts + +result = Braintrust::Eval.run( + project: "ruby-sdk-examples", + experiment: "ruby-kitchen-sink-eval", + + cases: test_cases, + + # Task wraps the OpenAI call + task: ->(input) { weather_assistant_task(input, openai_client) }, + + # Multiple scorers of different types + scorers: [ + exact_match_scorer, + contains_keyword_scorer, + length_scorer, + failing_scorer, + helpfulness_judge, + accuracy_judge + ], + + # Run 3 cases in parallel for speed + parallelism: 3, + + # Tags for the experiment + tags: ["kitchen-sink", "comprehensive", "openai", "tools"], + + # Metadata for the experiment + metadata: { + description: "Comprehensive eval demonstrating all SDK features", + model: "gpt-4o-mini", + sdk_version: Braintrust::VERSION, + features: [ + "openai_integration", + "tool_calling", + "llm_as_judge", + "custom_scorers", + "error_handling", + "tracing" + ] + } +) + +# Print results +puts "\n" + "=" * 60 +puts "Evaluation Complete!" +puts "=" * 60 + +puts "\nExperiment: #{result.experiment_name}" +puts "Project ID: #{result.project_id}" +puts "Duration: #{result.duration.round(2)}s" +puts "Status: #{result.success? ? "✓ Success" : "✗ Failed"}" + +puts "\nView detailed results at:" +puts " #{result.permalink}" + +if result.failed? + puts "\n⚠ Errors encountered (#{result.errors.length}):" + result.errors.each_with_index do |error, i| + puts " #{i + 1}. #{error}" + end + exit 1 +end + +puts "\n✓ All test cases completed successfully!" + +# Shutdown to flush spans +tracer_provider.shutdown diff --git a/examples/internal/openai.rb b/examples/internal/openai.rb new file mode 100755 index 00000000..ca149c15 --- /dev/null +++ b/examples/internal/openai.rb @@ -0,0 +1,187 @@ +#!/usr/bin/env ruby +# frozen_string_literal: true + +require "bundler/setup" +require "braintrust" +require "openai" +require "opentelemetry/sdk" +require "json" + +# Internal example: Comprehensive OpenAI features with Braintrust tracing +# +# This example demonstrates all major OpenAI chat completion features: +# 1. Vision (image understanding) +# 2. Tool/function calling +# 3. Streaming responses +# 4. Reasoning models (o1-mini) +# +# Usage: +# BRAINTRUST_API_KEY=key OPENAI_API_KEY=key bundle exec ruby examples/internal/openai.rb + +unless ENV["BRAINTRUST_API_KEY"] + puts "Error: BRAINTRUST_API_KEY environment variable is required" + exit 1 +end + +unless ENV["OPENAI_API_KEY"] + puts "Error: OPENAI_API_KEY environment variable is required" + exit 1 +end + +# Initialize Braintrust with blocking login to get org info +Braintrust.init(blocking_login: true) + +# Create OpenTelemetry TracerProvider +tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + +# Enable Braintrust tracing +Braintrust::Trace.enable(tracer_provider) + +# Set as global provider +OpenTelemetry.tracer_provider = tracer_provider + +# Get a tracer for this example +tracer = OpenTelemetry.tracer_provider.tracer("openai-comprehensive-example") + +# Create OpenAI client and wrap it +client = OpenAI::Client.new(api_key: ENV["OPENAI_API_KEY"]) +Braintrust::Trace::OpenAI.wrap(client, tracer_provider: tracer_provider) + +puts "OpenAI Comprehensive Features Example" +puts "=" * 50 + +# Wrap all examples under a single parent trace +root_span = nil +tracer.in_span("examples/internal/openai.rb") do |span| + root_span = span + # Example 1: Vision - Image Understanding + puts "\n1. Vision (Image Understanding)" + puts "-" * 50 + tracer.in_span("example-vision") do + response = client.chat.completions.create( + model: "gpt-4o-mini", + messages: [ + { + role: "user", + content: [ + {type: "text", text: "What's in this image?"}, + { + type: "image_url", + image_url: { + url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/320px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" + } + } + ] + } + ], + max_tokens: 100 + ) + puts "✓ Vision response: #{response.choices[0].message.content[0..100]}..." + puts " Tokens: #{response.usage.total_tokens}" + end + + # Example 2: Tool/Function Calling + puts "\n2. Tool/Function Calling" + puts "-" * 50 + tracer.in_span("example-tools") do + response = client.chat.completions.create( + model: "gpt-4o-mini", + messages: [ + {role: "user", content: "What's the weather like in San Francisco?"} + ], + tools: [ + { + type: "function", + function: { + name: "get_weather", + description: "Get the current weather in a given location", + parameters: { + type: "object", + properties: { + location: { + type: "string", + description: "The city and state, e.g. San Francisco, CA" + }, + unit: { + type: "string", + enum: ["celsius", "fahrenheit"] + } + }, + required: ["location"] + } + } + } + ], + tool_choice: "auto", + max_tokens: 100 + ) + + message = response.choices[0].message + if message.tool_calls&.any? + tool_call = message.tool_calls[0] + puts "✓ Tool called: #{tool_call.function.name}" + puts " Arguments: #{tool_call.function.arguments}" + else + puts "✓ Response: #{message.content}" + end + puts " Tokens: #{response.usage.total_tokens}" + end + + # Example 3: Streaming (TODO: requires wrapper support for stream_raw) + # Skipping for now - requires different API in OpenAI gem + puts "\n3. Streaming Response" + puts "-" * 50 + puts "⊘ Skipped: Streaming requires wrapper updates (stream_raw API)" + + # Example 4: Reasoning Model (o1-mini) + puts "\n4. Reasoning Model (o1-mini)" + puts "-" * 50 + tracer.in_span("example-reasoning") do + response = client.chat.completions.create( + model: "o1-mini", + messages: [ + { + role: "user", + content: "If I have 3 apples and buy 2 more, then give away 1, how many do I have?" + } + ] + ) + puts "✓ Reasoning response: #{response.choices[0].message.content}" + puts " Tokens: #{response.usage.total_tokens}" + puts " Reasoning tokens: #{response.usage.completion_tokens_details&.reasoning_tokens}" if response.usage.respond_to?(:completion_tokens_details) + end + + # Example 5: Multiple parameters + puts "\n5. Advanced Parameters" + puts "-" * 50 + tracer.in_span("example-advanced-params") do + response = client.chat.completions.create( + model: "gpt-4o-mini", + messages: [ + {role: "system", content: "You are a helpful assistant. Be concise."}, + {role: "user", content: "What is Ruby?"} + ], + temperature: 0.7, + top_p: 0.9, + frequency_penalty: 0.5, + presence_penalty: 0.5, + max_tokens: 50, + n: 1, + seed: 12345 + ) + puts "✓ Response: #{response.choices[0].message.content[0..80]}..." + puts " Model: #{response.model}" + puts " System fingerprint: #{response.system_fingerprint}" + puts " Tokens: #{response.usage.total_tokens}" + end +end # End of parent trace + +puts "\n" + "=" * 50 +puts "✓ All examples completed!" +puts "✓ View this trace at:" +puts " #{Braintrust::Trace.permalink(root_span)}" + +# Shutdown to flush spans +tracer_provider.shutdown + +puts "\n✓ Trace sent to Braintrust!" diff --git a/examples/login.rb b/examples/login.rb new file mode 100644 index 00000000..54006a00 --- /dev/null +++ b/examples/login.rb @@ -0,0 +1,38 @@ +#!/usr/bin/env ruby +# frozen_string_literal: true + +require "bundler/setup" +require "braintrust" + +# Basic login example +# +# This example demonstrates how to: +# - Initialize the Braintrust SDK +# - Log in to retrieve organization information +# - Access the state fields after login +# +# Prerequisites: +# - Set BRAINTRUST_API_KEY environment variable +# +# Run with: +# bundle exec ruby examples/login.rb + +# Check for API key +unless ENV["BRAINTRUST_API_KEY"] + puts "Error: BRAINTRUST_API_KEY environment variable is required" + puts "Get your API key from: https://www.braintrust.dev/app/settings" + exit 1 +end + +# Initialize Braintrust with blocking login +puts "Initializing and logging in to Braintrust..." +state = Braintrust.init(blocking_login: true) + +puts "\n✓ Successfully logged in!" +puts "\nOrganization Information:" +puts " Org ID: #{state.org_id}" +puts " Org Name: #{state.org_name}" +puts " API URL: #{state.api_url}" +puts " Proxy URL: #{state.proxy_url}" +puts " Logged In: #{state.logged_in}" +puts " App URL: #{state.app_url}" diff --git a/examples/openai.rb b/examples/openai.rb new file mode 100644 index 00000000..b001fa88 --- /dev/null +++ b/examples/openai.rb @@ -0,0 +1,91 @@ +#!/usr/bin/env ruby +# frozen_string_literal: true + +require "bundler/setup" +require "braintrust" +require "openai" +require "opentelemetry/sdk" + +# Example: OpenAI chat completion with Braintrust tracing +# +# This example demonstrates how to automatically trace OpenAI API calls with Braintrust. +# +# Note: The openai gem is a development dependency. To run this example: +# 1. Install dependencies: bundle install +# 2. Run from the SDK root: bundle exec ruby examples/openai.rb +# +# Usage: +# BRAINTRUST_API_KEY=your-bt-key OPENAI_API_KEY=your-openai-key bundle exec ruby examples/openai.rb +# +# Optional: Set a default project for traces +# BRAINTRUST_DEFAULT_PROJECT=project_name:my-project bundle exec ruby examples/openai.rb + +# Check for API keys +unless ENV["BRAINTRUST_API_KEY"] + puts "Error: BRAINTRUST_API_KEY environment variable is required" + puts "Get your API key from: https://www.braintrust.dev/app/settings" + exit 1 +end + +unless ENV["OPENAI_API_KEY"] + puts "Error: OPENAI_API_KEY environment variable is required" + puts "Get your API key from: https://platform.openai.com/api-keys" + exit 1 +end + +# Initialize Braintrust with blocking login to ensure org name is available for permalinks +Braintrust.init(blocking_login: true) + +# Create OpenTelemetry TracerProvider +tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + +# Enable Braintrust tracing +Braintrust::Trace.enable(tracer_provider) + +# Set as global provider +OpenTelemetry.tracer_provider = tracer_provider + +# Create OpenAI client +client = OpenAI::Client.new(api_key: ENV["OPENAI_API_KEY"]) + +# Wrap the client with Braintrust tracing +# This automatically creates spans for all chat completion requests +Braintrust::Trace::OpenAI.wrap(client, tracer_provider: tracer_provider) + +# Create a root span to capture the entire operation +tracer = tracer_provider.tracer("openai-example") +root_span = nil + +# Make a chat completion request (automatically traced!) +puts "Sending chat completion request to OpenAI..." +response = tracer.in_span("examples/openai.rb") do |span| + root_span = span + + client.chat.completions.create( + messages: [ + {role: "system", content: "You are a helpful assistant."}, + {role: "user", content: "Say hello and tell me a short joke."} + ], + model: "gpt-4o-mini", + max_tokens: 100 + ) +end + +# Print the response +puts "\n✓ Response received!" +puts "\nAssistant: #{response.choices[0].message.content}" + +# Print usage stats +puts "\nToken usage:" +puts " Prompt tokens: #{response.usage.prompt_tokens}" +puts " Completion tokens: #{response.usage.completion_tokens}" +puts " Total tokens: #{response.usage.total_tokens}" + +# Print permalink to view this trace in Braintrust +puts "\n✓ View this trace in Braintrust:" +puts " #{Braintrust::Trace.permalink(root_span)}" + +# Shutdown to flush spans to Braintrust +tracer_provider.shutdown + +puts "\n✓ Trace sent to Braintrust!" diff --git a/examples/trace.rb b/examples/trace.rb new file mode 100644 index 00000000..f635f2cc --- /dev/null +++ b/examples/trace.rb @@ -0,0 +1,77 @@ +#!/usr/bin/env ruby +# frozen_string_literal: true + +require "bundler/setup" +require "braintrust" +require "opentelemetry/sdk" + +# Example: Enable Braintrust tracing and send a span manually +# +# This example demonstrates how to: +# 1. Initialize Braintrust with a project +# 2. Create an OpenTelemetry TracerProvider +# 3. Enable Braintrust tracing (automatically adds braintrust.parent, org, app_url) +# 4. Create spans manually +# 5. Send the spans to Braintrust +# +# Usage: +# BRAINTRUST_API_KEY=your-key bundle exec ruby examples/trace.rb +# +# Optional: Set a default project for traces +# BRAINTRUST_DEFAULT_PROJECT=project_name:ruby-sdk-examples bundle exec ruby examples/trace.rb +# +# With console debug logging: +# BRAINTRUST_ENABLE_TRACE_CONSOLE_LOG=true BRAINTRUST_API_KEY=your-key bundle exec ruby examples/trace.rb + +# Check for API key +unless ENV["BRAINTRUST_API_KEY"] + puts "Error: BRAINTRUST_API_KEY environment variable is required" + puts "Get your API key from: https://www.braintrust.dev/app/settings" + exit 1 +end + +# Initialize Braintrust with blocking login to ensure org name is available for permalinks +Braintrust.init(blocking_login: true) + +# Create a TracerProvider +tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + +# Enable Braintrust tracing (adds OTLP exporter) +Braintrust::Trace.enable(tracer_provider) + +# Set as global provider +OpenTelemetry.tracer_provider = tracer_provider + +# Get a tracer +tracer = OpenTelemetry.tracer_provider.tracer("my-app") + +# Create a span manually +# Note: braintrust.parent, braintrust.org, and braintrust.app_url are automatically added! +root_span = nil +tracer.in_span("examples/trace.rb") do |span| + root_span = span + + # Set custom attributes + span.set_attribute("user.id", "123") + span.set_attribute("operation.type", "manual_test") + span.set_attribute("environment", "example") + + puts "Inside span - doing some work..." + sleep 0.1 + + # You can create nested spans - they also get Braintrust attributes automatically + tracer.in_span("nested-operation") do |nested_span| + nested_span.set_attribute("step", "1") + puts " Inside nested span..." + sleep 0.05 + end +end + +# Print permalink to view this trace in Braintrust +puts "\n✓ View this trace in Braintrust:" +puts " #{Braintrust::Trace.permalink(root_span)}" + +# Shutdown to flush spans to Braintrust +tracer_provider.shutdown + +puts "\n✓ Success! Trace sent to Braintrust!" diff --git a/lib/braintrust.rb b/lib/braintrust.rb index c834ac9d..c5895a5d 100644 --- a/lib/braintrust.rb +++ b/lib/braintrust.rb @@ -1,6 +1,14 @@ # frozen_string_literal: true +# Load SSL config first to configure OpenSSL defaults before any connections +require_relative "braintrust/ssl_config" + require_relative "braintrust/version" +require_relative "braintrust/config" +require_relative "braintrust/state" +require_relative "braintrust/trace" +require_relative "braintrust/internal/experiments" +require_relative "braintrust/eval" # Braintrust Ruby SDK # @@ -20,5 +28,37 @@ module Braintrust class Error < StandardError; end - # TODO: Implementation coming in Phase 2 + # Initialize Braintrust SDK + # Creates a State from config (ENV + options) and optionally sets it as global + # + # @param set_global [Boolean] whether to set as global state (default: true) + # @param blocking_login [Boolean] whether to block and login immediately (default: false) + # @param api_key [String, nil] Braintrust API key (overrides BRAINTRUST_API_KEY env var) + # @param org_name [String, nil] Organization name (overrides BRAINTRUST_ORG_NAME env var) + # @param default_parent [String, nil] Default parent for spans (overrides BRAINTRUST_DEFAULT_PROJECT env var, format: "project_name:my-project" or "project_id:uuid") + # @param app_url [String, nil] App URL (overrides BRAINTRUST_APP_URL env var, default: https://www.braintrust.dev) + # @param api_url [String, nil] API URL (overrides BRAINTRUST_API_URL env var, default: https://api.braintrust.dev) + # @return [State] the created state + def self.init(set_global: true, blocking_login: false, **options) + config = Config.from_env(**options) + state = State.new( + api_key: config.api_key, + org_name: config.org_name, + default_parent: config.default_parent, + app_url: config.app_url, + api_url: config.api_url + ) + + State.global = state if set_global + + state.login if blocking_login + + state + end + + # Get the current global state + # @return [State, nil] the global state, or nil if not set + def self.current_state + State.global + end end diff --git a/lib/braintrust/api/auth.rb b/lib/braintrust/api/auth.rb new file mode 100644 index 00000000..4131f9a8 --- /dev/null +++ b/lib/braintrust/api/auth.rb @@ -0,0 +1,100 @@ +# frozen_string_literal: true + +require "net/http" +require "json" +require "uri" +require_relative "../logger" + +module Braintrust + module API + module Auth + # Result of a successful login + AuthResult = Struct.new(:org_id, :org_name, :api_url, :proxy_url, keyword_init: true) + + # Mask API key for logging (show first 8 chars) + def self.mask_api_key(api_key) + return "nil" if api_key.nil? + return api_key if api_key.length <= 8 + "#{api_key[0...8]}...#{api_key[-4..]}" + end + + # Login to Braintrust API + # @param api_key [String] Braintrust API key + # @param app_url [String] Braintrust app URL + # @param org_name [String, nil] Optional org name to filter by + # @return [AuthResult] org info + # @raise [Braintrust::Error] if login fails + def self.login(api_key:, app_url:, org_name: nil) + masked_key = mask_api_key(api_key) + Log.debug("Login: attempting login with API key #{masked_key}, org #{org_name.inspect}, app URL #{app_url}") + + uri = URI("#{app_url}/api/apikey/login") + request = Net::HTTP::Post.new(uri) + request["Authorization"] = "Bearer #{api_key}" + + http = Net::HTTP.new(uri.hostname, uri.port) + if uri.scheme == "https" + http.use_ssl = true + # TODO: This should be VERIFY_PEER but macOS has CRL issues + # Need to update system certs or configure ca_file properly + http.verify_mode = OpenSSL::SSL::VERIFY_NONE + end + + response = http.start do |http_session| + http_session.request(request) + end + + Log.debug("Login: received response [#{response.code}]") + + # Handle different status codes + case response + when Net::HTTPUnauthorized, Net::HTTPForbidden + raise Error, "Invalid API key: [#{response.code}]" + when Net::HTTPBadRequest + raise Error, "Bad request: [#{response.code}] #{response.body}" + when Net::HTTPClientError + raise Error, "Client error: [#{response.code}] #{response.message}" + when Net::HTTPServerError + raise Error, "Server error: [#{response.code}] #{response.message}" + when Net::HTTPSuccess + # Success - continue processing + else + raise Error, "Unexpected response: [#{response.code}] #{response.message}" + end + + data = JSON.parse(response.body) + org_info_list = data["org_info"] + + if org_info_list.nil? || org_info_list.empty? + raise Error, "No organizations found for API key" + end + + # Select org: filter by org_name if present, else take first + org_info = if org_name + found = org_info_list.find { |org| org["name"] == org_name } + if found + Log.debug("Login: selected org '#{org_name}' (id: #{found["id"]})") + found + else + available = org_info_list.map { |o| o["name"] }.join(", ") + raise Error, "Organization '#{org_name}' not found. Available: #{available}" + end + else + selected = org_info_list.first + Log.debug("Login: selected first org '#{selected["name"]}' (id: #{selected["id"]})") + selected + end + + result = AuthResult.new( + org_id: org_info["id"], + org_name: org_info["name"], + api_url: org_info["api_url"], + proxy_url: org_info["proxy_url"] + ) + + Log.debug("Login: successfully logged in as org '#{result.org_name}' (#{result.org_id})") + result + end + end + end +end diff --git a/lib/braintrust/config.rb b/lib/braintrust/config.rb new file mode 100644 index 00000000..5d0834a2 --- /dev/null +++ b/lib/braintrust/config.rb @@ -0,0 +1,30 @@ +# frozen_string_literal: true + +module Braintrust + # Configuration object that reads from environment variables + # and allows overriding with explicit options + class Config + attr_reader :api_key, :org_name, :default_parent, :app_url, :api_url + + def initialize(api_key: nil, org_name: nil, default_parent: nil, app_url: nil, api_url: nil) + @api_key = api_key + @org_name = org_name + @default_parent = default_parent + @app_url = app_url + @api_url = api_url + end + + # Create a Config from environment variables, with option overrides + # Passed-in options take priority over ENV vars + def self.from_env(**options) + defaults = { + api_key: ENV["BRAINTRUST_API_KEY"], + org_name: ENV["BRAINTRUST_ORG_NAME"], + default_parent: ENV["BRAINTRUST_DEFAULT_PROJECT"], + app_url: ENV["BRAINTRUST_APP_URL"] || "https://www.braintrust.dev", + api_url: ENV["BRAINTRUST_API_URL"] || "https://api.braintrust.dev" + } + new(**defaults.merge(options)) + end + end +end diff --git a/lib/braintrust/eval.rb b/lib/braintrust/eval.rb new file mode 100644 index 00000000..a2c770e2 --- /dev/null +++ b/lib/braintrust/eval.rb @@ -0,0 +1,303 @@ +# frozen_string_literal: true + +require_relative "eval/case" +require_relative "eval/cases" +require_relative "eval/scorer" +require_relative "eval/result" +require_relative "internal/experiments" +require "opentelemetry/sdk" +require "json" + +module Braintrust + module Eval + class << self + # Create a scorer with a name and callable + # @param name [String] The scorer name + # @param callable [#call, nil] Optional callable (if not using block) + # @param block [Proc] The scorer block + # @return [Scorer] + def scorer(name, callable = nil, &block) + Scorer.new(name, callable, &block) + end + + # Run an evaluation + # @param project [String] The project name + # @param experiment [String] The experiment name + # @param cases [Array, Enumerable] The test cases + # @param task [#call] The task to evaluate (must be callable) + # @param scorers [Array] The scorers to use (Scorer objects or callables) + # @param parallelism [Integer] Number of parallel workers (default: 1) + # @param tags [Array] Optional experiment tags + # @param metadata [Hash] Optional experiment metadata + # @param update [Boolean] If true, allow reusing existing experiment (default: false) + # @param state [State, nil] Braintrust state (defaults to global state) + # @param tracer_provider [TracerProvider, nil] OpenTelemetry tracer provider (defaults to global) + # @return [Result] + def run(project:, experiment:, cases:, task:, scorers:, + parallelism: 1, tags: nil, metadata: nil, update: false, + state: nil, tracer_provider: nil) + # Validate required parameters + validate_params!(project: project, experiment: experiment, + cases: cases, task: task, scorers: scorers) + + # Get state from parameter or global + state ||= Braintrust.current_state + raise Error, "No state available" unless state + + # Register project and experiment via API + result = Internal::Experiments.get_or_create( + experiment, project, state: state, + tags: tags, metadata: metadata, update: update + ) + + experiment_id = result[:experiment_id] + project_id = result[:project_id] + project_name = result[:project_name] + + # Run the eval with resolved experiment info + run_internal( + experiment_id: experiment_id, + experiment_name: experiment, + project_id: project_id, + project_name: project_name, + cases: cases, + task: task, + scorers: scorers, + state: state, + tracer_provider: tracer_provider + ) + end + + private + + # Internal eval runner that doesn't touch the API + # @param experiment_id [String] Resolved experiment ID + # @param experiment_name [String] Experiment name + # @param project_id [String] Resolved project ID + # @param project_name [String] Project name + # @param cases [Array, Enumerable, Cases] Test cases + # @param task [#call] Task callable + # @param scorers [Array] Scorers + # @param state [State] Braintrust state + # @param tracer_provider [TracerProvider, nil] OpenTelemetry tracer provider + # @return [Result] + def run_internal(experiment_id:, experiment_name:, project_id:, project_name:, + cases:, task:, scorers:, state:, tracer_provider: nil) + start_time = Time.now + + # Get tracer for creating spans + tracer_provider ||= OpenTelemetry.tracer_provider + tracer = tracer_provider.tracer("braintrust-eval") + + # Parent attribute for all eval spans + parent_attr = "experiment_id:#{experiment_id}" + + # Normalize cases to Cases wrapper + normalized_cases = normalize_cases(cases) + + # Normalize scorers to Scorer objects + normalized_scorers = normalize_scorers(scorers) + + # Collect errors + errors = [] + + # Run each case with tracing + normalized_cases.each do |test_case| + run_case(test_case, task, normalized_scorers, errors, + tracer, parent_attr) + end + + # Calculate duration + duration = Time.now - start_time + + # Generate permalink: {app_url}/app/{org}/object?object_type=experiment&object_id={experiment_id} + permalink = "#{state.app_url}/app/#{state.org_name}/object?object_type=experiment&object_id=#{experiment_id}" + + # Return result + Result.new( + experiment_id: experiment_id, + experiment_name: experiment_name, + project_id: project_id, + permalink: permalink, + errors: errors, + duration: duration + ) + end + + # Validate required parameters + # @raise [ArgumentError] if validation fails + def validate_params!(project:, experiment:, cases:, task:, scorers:) + raise ArgumentError, "project is required" unless project + raise ArgumentError, "experiment is required" unless experiment + raise ArgumentError, "cases is required" unless cases + raise ArgumentError, "task is required" unless task + raise ArgumentError, "scorers is required" unless scorers + + # Validate task is callable + unless task.respond_to?(:call) + raise ArgumentError, "task must be callable (respond to :call)" + end + end + + # Normalize cases input to Cases wrapper + # @param cases_input [Array, Enumerable, Cases] The cases input + # @return [Cases] + def normalize_cases(cases_input) + case cases_input + when Cases + cases_input + when Array, Enumerable + Cases.new(cases_input) + else + if cases_input.respond_to?(:each) + Cases.new(cases_input) + else + raise ArgumentError, "cases must be Array or Enumerable" + end + end + end + + # Normalize scorers to Scorer objects + # @param scorers_input [Array] The scorers input (Scorer objects or callables) + # @return [Array] + def normalize_scorers(scorers_input) + scorers_input.map do |scorer| + case scorer + when Scorer + # Already a Scorer + scorer + else + # Wrap callable in Scorer (auto-detects name) + Scorer.new(scorer) + end + end + end + + # Run a single test case with OpenTelemetry tracing + # Creates eval span (parent) with task and score as children + # @param test_case [Case] The test case + # @param task [#call] The task + # @param scorers [Array] The scorers + # @param errors [Array] Error collection array + # @param tracer [Tracer] OpenTelemetry tracer + # @param parent_attr [String] Parent attribute (experiment_id:project/exp_id) + def run_case(test_case, task, scorers, errors, tracer, parent_attr) + # Create eval span (parent) + tracer.in_span("eval") do |eval_span| + eval_span.set_attribute("braintrust.parent", parent_attr) + + # Set tags early so they're present even if task fails + eval_span.set_attribute("braintrust.tags", test_case.tags) if test_case.tags + + # Run task + output = nil + begin + output = run_task(test_case, task, tracer, parent_attr) + rescue => e + # Error already recorded on task span, set eval span status + eval_span.status = OpenTelemetry::Trace::Status.error(e.message) + errors << "Task failed for input '#{test_case.input}': #{e.message}" + next + end + + # Run scorers + begin + run_scorers(test_case, output, scorers, tracer, parent_attr) + rescue => e + # Error already recorded on score span, set eval span status + eval_span.status = OpenTelemetry::Trace::Status.error(e.message) + errors << "Scorers failed for input '#{test_case.input}': #{e.message}" + end + + # Set eval span attributes (after task and scorers complete) + set_json_attr(eval_span, "braintrust.span_attributes", {type: "eval"}) + set_json_attr(eval_span, "braintrust.input_json", test_case.input) + set_json_attr(eval_span, "braintrust.output_json", output) + set_json_attr(eval_span, "braintrust.expected", test_case.expected) if test_case.expected + end + end + + # Run task with OpenTelemetry tracing + # Creates task span with input and output + # @param test_case [Case] The test case + # @param task [#call] The task + # @param tracer [Tracer] OpenTelemetry tracer + # @param parent_attr [String] Parent attribute + # @return [Object] Task output + def run_task(test_case, task, tracer, parent_attr) + tracer.in_span("task") do |task_span| + task_span.set_attribute("braintrust.parent", parent_attr) + set_json_attr(task_span, "braintrust.span_attributes", {type: "task"}) + set_json_attr(task_span, "braintrust.input_json", test_case.input) + + begin + output = task.call(test_case.input) + set_json_attr(task_span, "braintrust.output_json", output) + output + rescue => e + # Record exception event with stacktrace, then set error status + task_span.record_exception(e) + task_span.status = OpenTelemetry::Trace::Status.error(e.message) + raise + end + end + end + + # Run scorers with OpenTelemetry tracing + # Creates single score span for all scorers + # @param test_case [Case] The test case + # @param output [Object] Task output + # @param scorers [Array] The scorers + # @param tracer [Tracer] OpenTelemetry tracer + # @param parent_attr [String] Parent attribute + def run_scorers(test_case, output, scorers, tracer, parent_attr) + tracer.in_span("score") do |score_span| + score_span.set_attribute("braintrust.parent", parent_attr) + set_json_attr(score_span, "braintrust.span_attributes", {type: "score"}) + + scores = {} + scorer_error = nil + scorers.each do |scorer| + score_value = scorer.call(test_case.input, test_case.expected, output, test_case.metadata || {}) + scores[scorer.name] = score_value + rescue => e + # Record first error but continue processing other scorers + scorer_error ||= "Scorer '#{scorer.name}' failed: #{e.message}" + record_span_error(score_span, e, "ScorerError") + end + + # Always set scores attribute, even if some scorers failed + set_json_attr(score_span, "braintrust.scores", scores) + + # Raise after setting scores so we can see which scorers succeeded + raise scorer_error if scorer_error + end + end + + # Record error on span with exception event and error status + # @param span [OpenTelemetry::Trace::Span] The span to record error on + # @param error [Exception] The error that occurred + # @param error_type [String] The error type name (optional, used for custom error classification) + def record_span_error(span, error, error_type = nil) + # Record exception with stacktrace (OpenTelemetry standard) + if error_type + # For custom error types, add type override + span.record_exception(error, attributes: {"exception.type" => error_type}) + else + span.record_exception(error) + end + + # Set span status to error + span.status = OpenTelemetry::Trace::Status.error(error.message) + end + + # Set a span attribute by JSON encoding the value + # @param span [OpenTelemetry::Trace::Span] The span + # @param key [String] The attribute key + # @param value [Object] The value to JSON encode + def set_json_attr(span, key, value) + span.set_attribute(key, JSON.dump(value)) + end + end + end +end diff --git a/lib/braintrust/eval/.eval-design.md b/lib/braintrust/eval/.eval-design.md new file mode 100644 index 00000000..ad053b6c --- /dev/null +++ b/lib/braintrust/eval/.eval-design.md @@ -0,0 +1,628 @@ +# Braintrust Ruby SDK - Eval API Design + +**Created**: 2025-10-21 +**Status**: Design Complete, Ready for Implementation + +## Overview + +The Eval API provides a framework for evaluating AI model outputs against expected results using custom scoring functions. It handles: +- Running tasks (code being evaluated) on test cases +- Scoring outputs against expected values +- Parallel execution for performance +- Error collection and reporting +- Integration with Braintrust experiments for tracking + +## Design Decisions + +### Decision 1: Tasks (How Users Define Code Being Evaluated) + +**Decision: Hybrid - Accept anything callable** + +**Rationale:** +- Maximum flexibility for simple and complex use cases +- Allows inline procs/lambdas for simple tasks +- Supports classes for reusable, configurable tasks +- Very Ruby-idiomatic (like Rack, Rails routing) + +**Implementation:** +- Validate with `responds_to?(:call)` +- Task receives one parameter: `input` +- Task returns output value (any type) + +**Examples:** +```ruby +# Inline proc +task: ->(input) { classify_food(input) } + +# Class with configuration +class APIClassifier + def initialize(api_key, endpoint) + @api_key = api_key + @endpoint = endpoint + end + + def call(input) + HTTP.auth("Bearer #{@api_key}") + .post(@endpoint, json: {text: input}) + .parse["result"] + end +end + +task: APIClassifier.new(ENV["API_KEY"], "https://api.example.com/classify") + +# Method reference +task: method(:classify_food) +``` + +--- + +### Decision 2: Scorer Parameters (Arity Detection) + +**Decision: Optional metadata with arity detection (3 or 4 params)** + +**Rationale:** +- Most scorers don't need metadata - keep them simple +- Advanced scorers can access metadata when needed +- Arity detection is idiomatic Ruby (like Rails callbacks) +- Clear intent from parameter count + +**Implementation:** +```ruby +case block.arity +when 3 + # Block takes (input, expected, output) + ->(i, e, o, m) { block.call(i, e, o) } +when 4, -4 # -4 means optional 4th param + # Block takes (input, expected, output, metadata) + block +else + raise ArgumentError, "Scorer block must accept 3 or 4 parameters" +end +``` + +**Examples:** +```ruby +# Simple - 3 params +Eval.scorer("exact_match") { |input, expected, output| + output == expected ? 1.0 : 0.0 +} + +# Advanced - 4 params with metadata +Eval.scorer("threshold_match") { |input, expected, output, metadata| + threshold = metadata[:threshold] || 0.8 + similarity(output, expected) >= threshold ? 1.0 : 0.0 +} +``` + +--- + +### Decision 3: Scorer Return Values (Normalize All Formats) + +**Decision: Accept float, hash, or array - normalize internally** + +**Rationale:** +- Simple case stays simple (return a float) +- Advanced case supported (return multiple scores) +- Progressive complexity - users can grow into features +- Very Ruby - duck typing, "make it work" + +**Implementation:** +```ruby +def normalize_scores(result, scorer_name) + case result + when Numeric + [{name: scorer_name, score: result.to_f}] + when Hash + name = result[:name] || result["name"] || scorer_name + score = result[:score] || result["score"] + [{name: name, score: score.to_f}] + when Array + result.map { |r| normalize_scores(r, scorer_name).first } + else + raise ArgumentError, "Invalid scorer return value: #{result.class}" + end +end +``` + +**Examples:** +```ruby +# Simple - return float +Eval.scorer("exact_match") { |i, e, o| + o == e ? 1.0 : 0.0 +} + +# Return hash with custom name +Eval.scorer("similarity") { |i, e, o| + {name: "cosine_similarity", score: 0.85} +} + +# Multiple scores from one scorer +Eval.scorer("nlp_metrics") { |i, e, o| + [ + {name: "bleu", score: calculate_bleu(e, o)}, + {name: "rouge", score: calculate_rouge(e, o)}, + {name: "meteor", score: calculate_meteor(e, o)} + ] +} +``` + +--- + +### Decision 4: Scorer Definition (Helper Method) + +**Decision: `Eval.scorer(name, &block)` helper + class support** + +**Rationale:** +- Clean helper for inline scorers (matches Go SDK) +- Supports custom classes for reusable scorers +- Name is explicit and associated with scorer +- Flexible - block or callable argument + +**Implementation:** +```ruby +def self.scorer(name, callable = nil, &block) + scorer_impl = callable || block + raise ArgumentError, "Must provide callable or block" unless scorer_impl + Scorer.new(name, scorer_impl) +end + +class Scorer + attr_reader :name + + def initialize(name, callable) + @name = name + @callable = callable + end + + def call(input, expected, output, metadata = {}) + # Handle arity detection and normalization + end +end +``` + +**Examples:** +```ruby +scorers: [ + # Block form (inline) + Eval.scorer("exact_match") { |i, e, o| o == e ? 1.0 : 0.0 }, + + # Callable form (reusable class) + Eval.scorer("fuzzy", FuzzyScorer.new), + + # Or if class already has .name method: + LLMJudgeScorer.new(ENV["OPENAI_API_KEY"]) +] +``` + +--- + +### Decision 5: Cases (Test Data Input) + +**Decision: Hybrid - Accept Array, Enumerable, or Cases wrapper** + +**Rationale:** +- Simple stays simple (array of hashes) +- Advanced supported (lazy enumerators, dataset fetchers) +- Progressive complexity +- Memory efficient for large datasets + +**Implementation:** +```ruby +def self.normalize_cases(cases_input) + case cases_input + when Array + # Simple array → wrap in Cases iterator + Cases.new(cases_input) + when Cases + # Already wrapped + cases_input + else + # Assume it's Enumerable (enumerator, dataset, etc.) + if cases_input.respond_to?(:each) + Cases.new(cases_input) + else + raise ArgumentError, "cases must be Array or Enumerable" + end + end +end +``` + +**Examples:** +```ruby +# Simple - array of hashes +cases: [ + {input: "apple", expected: "fruit"}, + {input: "carrot", expected: "vegetable", tags: ["root"], metadata: {category: "produce"}} +] + +# Dataset (lazy loading) +cases: Eval.dataset("my-dataset", project: "my-project") + +# Custom enumerator +cases: Enumerator.new do |y| + CSV.foreach("test_cases.csv") do |row| + y << {input: row[0], expected: row[1]} + end +end +``` + +--- + +### Decision 6: Result Object + +**Decision: Result class with methods** + +**Rationale:** +- Explicit interface (IDE autocomplete) +- Methods like `success?`, `failed?`, `to_s` +- Matches Go SDK +- Extensible for future stats/summaries + +**Implementation:** +```ruby +class Result + attr_reader :experiment_id, :experiment_name, :project_id, + :permalink, :errors, :duration + + def initialize(experiment_id:, experiment_name:, project_id:, + permalink:, errors:, duration:) + @experiment_id = experiment_id + @experiment_name = experiment_name + @project_id = project_id + @permalink = permalink + @errors = errors + @duration = duration + end + + def success? + errors.empty? + end + + def failed? + !success? + end + + def to_s + <<~MSG + + === Experiment: #{experiment_name} === + Project: #{project_id} + Duration: #{duration.round(1)}s + Link: #{permalink} + #{errors.any? ? "\nErrors:\n #{errors.join("\n ")}" : ""} + MSG + end +end +``` + +**Usage:** +```ruby +result = Eval.run(...) +puts result.permalink +puts "Success: #{result.success?}" +puts "Duration: #{result.duration}s" +result.errors.each { |err| puts "Error: #{err}" } +``` + +--- + +### Decision 7: Error Handling + +**Decision: Collect all errors, don't raise** + +**Rationale:** +- See all failures, not just the first one +- Parallel-friendly (other threads continue) +- Matches Go SDK (`errors.Join`) +- User can raise if desired: `raise result.errors.first unless result.success?` + +**Implementation:** +```ruby +errors = [] + +# Collect task errors +begin + output = task.call(input) +rescue => e + errors << "Task failed for input '#{input}': #{e.message}" + next # Continue to next case +end + +# Collect scorer errors +scorers.each do |scorer| + begin + score = scorer.call(input, expected, output, metadata) + rescue => e + errors << "Scorer '#{scorer.name}' failed: #{e.message}" + # Continue with other scorers + end +end + +# Return result with all collected errors +Result.new(..., errors: errors) +``` + +--- + +### Decision 8: Parallelism + +**Decision: Ruby Threads (stdlib)** + +**Rationale:** +- No dependencies (stdlib only) +- Good for evals (most are I/O bound - API calls, LLM judge) +- Simple thread pool pattern +- Matches Go's goroutines conceptually + +**Implementation:** +```ruby +parallelism = opts[:parallelism] || 1 +queue = Queue.new +results = [] +mutex = Mutex.new + +# Fill queue +cases.each { |test_case| queue << test_case } + +# Spawn worker threads +threads = parallelism.times.map do + Thread.new do + while (test_case = queue.pop(true) rescue nil) + result = run_case(test_case) + mutex.synchronize { results << result } + end + end +end + +# Wait for all threads +threads.each(&:join) +``` + +**Usage:** +```ruby +Eval.run( + ..., + parallelism: 5 # Run 5 cases concurrently +) +``` + +--- + +## Complete API Example + +```ruby +result = Eval.run( + # Required: Project and experiment + project: "ruby-sdk-examples", + experiment: "food-classifier-v1", + + # Required: Test cases + # Simple array of hashes + cases: [ + {input: "apple", expected: "fruit"}, + {input: "carrot", expected: "vegetable"}, + {input: "banana", expected: "fruit", tags: ["tropical"]}, + {input: "broccoli", expected: "vegetable", metadata: {category: "cruciferous"}} + ], + + # Required: Task (callable) + # Can be proc, lambda, or object with .call + task: ->(input) { + # Call your model/API/function + classify_food(input) + }, + + # Required: Scorers (array) + scorers: [ + # Simple inline scorer (3 params) + Eval.scorer("exact_match") { |input, expected, output| + output == expected ? 1.0 : 0.0 + }, + + # Advanced scorer with metadata (4 params) + Eval.scorer("fuzzy_match") { |input, expected, output, metadata| + threshold = metadata[:threshold] || 0.8 + similarity(output, expected) >= threshold ? 1.0 : 0.0 + }, + + # Multi-score scorer (returns array) + Eval.scorer("nlp_metrics") { |i, e, o| + [ + {name: "bleu", score: calculate_bleu(e, o)}, + {name: "rouge", score: calculate_rouge(e, o)} + ] + }, + + # Class-based scorer (reusable) + LLMJudgeScorer.new(ENV["OPENAI_API_KEY"]) + ], + + # Optional: Parallelism (default: 1) + parallelism: 5, + + # Optional: Tags for the experiment + tags: ["example", "food-classifier", "v1"], + + # Optional: Metadata for the experiment + metadata: { + model: "gpt-4o-mini", + version: "1.0.0", + description: "Food classification eval" + } +) + +# Use the result +puts result.to_s # Pretty-printed summary +puts result.permalink # Link to Braintrust UI +puts "Success: #{result.success?}" +puts "Duration: #{result.duration.round(2)}s" + +unless result.success? + puts "\nErrors:" + result.errors.each { |e| puts " - #{e}" } +end +``` + +--- + +## HTTP/API Usage Examples + +### LLM-as-Judge Pattern + +```ruby +class LLMJudgeScorer + def initialize(api_key, model: "gpt-4o-mini") + @client = OpenAI::Client.new(api_key: api_key) + @model = model + end + + def name + "llm_judge" + end + + def call(input, expected, output, metadata = {}) + prompt = <<~PROMPT + Rate the quality of this output on a scale of 0.0 to 1.0. + + Input: #{input} + Expected: #{expected} + Got: #{output} + + Return only a number between 0.0 and 1.0. + PROMPT + + response = @client.chat.completions.create( + model: @model, + messages: [{role: "user", content: prompt}], + max_tokens: 10 + ) + + response.choices[0].message.content.to_f + end +end + +# Usage +result = Eval.run( + project: "my-project", + experiment: "llm-judge-eval", + cases: [...], + task: ->(input) { my_model.generate(input) }, + scorers: [ + LLMJudgeScorer.new(ENV["OPENAI_API_KEY"]) + ] +) +``` + +### API-Based Task + +```ruby +class APIClassifier + def initialize(api_key, endpoint) + @api_key = api_key + @endpoint = endpoint + end + + def call(input) + response = HTTP.auth("Bearer #{@api_key}") + .post(@endpoint, json: {text: input}) + response.parse["classification"] + end +end + +# Usage +result = Eval.run( + project: "api-eval", + experiment: "classification-test", + cases: [{input: "text", expected: "label"}], + task: APIClassifier.new(ENV["API_KEY"], "https://api.example.com/classify"), + scorers: [ + Eval.scorer("exact") { |i, e, o| o == e ? 1.0 : 0.0 } + ], + parallelism: 10 # Run 10 API calls concurrently +) +``` + +--- + +## Implementation Notes + +### Key Classes + +1. **Eval** (`lib/braintrust/eval.rb`) + - Module with `Eval.run` and `Eval.scorer` methods + - Main entry point for users + - Handles options parsing and orchestration + +2. **Result** (`lib/braintrust/eval/result.rb`) + - Value object containing eval results + - Methods: `success?`, `failed?`, `permalink`, `to_s` + +3. **Case** (`lib/braintrust/eval/case.rb`) + - Struct representing a test case + - Fields: `input`, `expected`, `tags`, `metadata` + +4. **Cases** (`lib/braintrust/eval/cases.rb`) + - Iterator wrapper for test cases + - Wraps arrays or enumerables + - Provides `each` method + +5. **Scorer** (`lib/braintrust/eval/scorer.rb`) + - Wrapper for scorer callables + - Handles arity detection + - Normalizes return values + +### OpenTelemetry Spans + +Following Go SDK pattern, create spans for: +- **eval span**: One per test case (parent span) + - Attributes: `braintrust.input_json`, `braintrust.output_json`, `braintrust.expected`, `braintrust.span_attributes` (type: "eval") +- **task span**: Child of eval span + - Attributes: `braintrust.input_json`, `braintrust.output_json`, `braintrust.span_attributes` (type: "task") +- **score span**: Child of eval span + - Attributes: `braintrust.scores` (map of score name → value), `braintrust.span_attributes` (type: "score") + +### API Integration + +Need to implement API methods: +- `API.register_project(name)` → returns `{id:, name:}` +- `API.register_experiment(name, project_id, opts)` → returns `{id:, name:}` + +### Thread Safety + +- Use `Queue` for work distribution +- Use `Mutex` for shared state (errors array, results) +- Each thread runs independently on its own case + +--- + +## Future Enhancements + +### Dataset Support +```ruby +cases: Eval.dataset("my-dataset", project: "my-project") +``` +Lazy-loads dataset from Braintrust API. + +### Built-in Scorers +```ruby +Eval::Scorers::ExactMatch.new +Eval::Scorers::Levenshtein.new(threshold: 0.8) +``` + +### Summary Statistics +```ruby +result.summary # Returns score averages, percentiles, etc. +``` + +### Streaming Progress +```ruby +Eval.run(..., progress: true) # Shows progress bar +``` + +--- + +## References + +- Go SDK: `braintrust-x-go/braintrust/eval/eval.go` +- Go Example: `braintrust-x-go/examples/evals/evals.go` +- Ruby Test Frameworks: RSpec (blocks) vs Minitest (classes) diff --git a/lib/braintrust/eval/case.rb b/lib/braintrust/eval/case.rb new file mode 100644 index 00000000..d549aa16 --- /dev/null +++ b/lib/braintrust/eval/case.rb @@ -0,0 +1,12 @@ +# frozen_string_literal: true + +module Braintrust + module Eval + # Case represents a single test case in an evaluation + # @attr input [Object] The input to the task + # @attr expected [Object, nil] The expected output (optional) + # @attr tags [Array, nil] Optional tags for filtering/grouping + # @attr metadata [Hash, nil] Optional metadata for the case + Case = Struct.new(:input, :expected, :tags, :metadata, keyword_init: true) + end +end diff --git a/lib/braintrust/eval/cases.rb b/lib/braintrust/eval/cases.rb new file mode 100644 index 00000000..8e605f68 --- /dev/null +++ b/lib/braintrust/eval/cases.rb @@ -0,0 +1,58 @@ +# frozen_string_literal: true + +require_relative "case" + +module Braintrust + module Eval + # Cases wraps test case data (arrays or enumerables) and normalizes them to Case objects + # Supports lazy evaluation for memory-efficient processing of large datasets + class Cases + include Enumerable + + # Create a new Cases wrapper + # @param enumerable [Array, Enumerable] The test cases (hashes or Case objects) + def initialize(enumerable) + unless enumerable.respond_to?(:each) + raise ArgumentError, "Cases must be enumerable (respond to :each)" + end + + @enumerable = enumerable + end + + # Iterate over cases, normalizing each to a Case object + # @yield [Case] Each test case + def each + return enum_for(:each) unless block_given? + + @enumerable.each do |item| + yield normalize_case(item) + end + end + + # Get the count of cases + # Note: For lazy enumerators, this will force evaluation + # @return [Integer] + def count + @enumerable.count + end + + private + + # Normalize a case item to a Case object + # @param item [Hash, Case] The case item + # @return [Case] + def normalize_case(item) + case item + when Case + # Already a Case object + item + when Hash + # Convert hash to Case object + Case.new(**item) + else + raise ArgumentError, "Case must be a Hash or Case object, got #{item.class}" + end + end + end + end +end diff --git a/lib/braintrust/eval/result.rb b/lib/braintrust/eval/result.rb new file mode 100644 index 00000000..214d242f --- /dev/null +++ b/lib/braintrust/eval/result.rb @@ -0,0 +1,60 @@ +# frozen_string_literal: true + +module Braintrust + module Eval + # Result represents the outcome of an evaluation run + # Contains experiment metadata, errors, and timing information + class Result + attr_reader :experiment_id, :experiment_name, :project_id, + :permalink, :errors, :duration + + # Create a new result + # @param experiment_id [String] The experiment ID + # @param experiment_name [String] The experiment name + # @param project_id [String] The project ID + # @param permalink [String] Link to view the experiment in Braintrust UI + # @param errors [Array] List of errors that occurred + # @param duration [Float] Duration in seconds + def initialize(experiment_id:, experiment_name:, project_id:, + permalink:, errors:, duration:) + @experiment_id = experiment_id + @experiment_name = experiment_name + @project_id = project_id + @permalink = permalink + @errors = errors + @duration = duration + end + + # Check if the evaluation was successful (no errors) + # @return [Boolean] + def success? + errors.empty? + end + + # Check if the evaluation failed (has errors) + # @return [Boolean] + def failed? + !success? + end + + # Format the result as a human-readable string + # @return [String] + def to_s + output = <<~MSG + + === Experiment: #{experiment_name} === + Project: #{project_id} + Duration: #{duration.round(1)}s + Link: #{permalink} + MSG + + if errors.any? + output += "\nErrors:\n" + errors.each { |err| output += " - #{err}\n" } + end + + output + end + end + end +end diff --git a/lib/braintrust/eval/scorer.rb b/lib/braintrust/eval/scorer.rb new file mode 100644 index 00000000..16519ba4 --- /dev/null +++ b/lib/braintrust/eval/scorer.rb @@ -0,0 +1,108 @@ +# frozen_string_literal: true + +module Braintrust + module Eval + # Scorer wraps a scoring function that evaluates task output against expected values + # Scorers can accept 3 params (input, expected, output) or 4 params (input, expected, output, metadata) + # They can return a float, hash, or array of hashes + class Scorer + attr_reader :name + + # Create a new scorer + # @param name_or_callable [String, Symbol, #call] Name or callable (if callable, name is auto-detected) + # @param callable [#call, nil] Callable if name was provided separately + # @param block [Proc, nil] Block if no callable provided + def initialize(name_or_callable = nil, callable = nil, &block) + # Determine name and callable from arguments + if name_or_callable.nil? && callable.nil? && block.nil? + raise ArgumentError, "Must provide callable or block" + end + + # If first arg is a string/symbol, it's the name + if name_or_callable.is_a?(String) || name_or_callable.is_a?(Symbol) + @name = name_or_callable.to_s + @callable = callable || block + raise ArgumentError, "Must provide callable or block" unless @callable + else + # First arg is the callable, try to auto-detect name + @callable = name_or_callable || callable || block + @name = detect_name(@callable) + end + + # Validate callable + unless @callable.respond_to?(:call) + raise ArgumentError, "Scorer must be callable (respond to :call)" + end + + # Detect arity and wrap callable if needed + @wrapped_callable = wrap_callable(@callable) + end + + # Call the scorer + # @param input [Object] The input to the task + # @param expected [Object] The expected output + # @param output [Object] The actual output from the task + # @param metadata [Hash] Optional metadata + # @return [Float, Hash, Array] Score value(s) + def call(input, expected, output, metadata = {}) + @wrapped_callable.call(input, expected, output, metadata) + end + + private + + # Detect the name from a callable object + # @param callable [#call] The callable + # @return [String] The detected name + def detect_name(callable) + # Method objects have .name + if callable.is_a?(Method) + return callable.name.to_s + end + + # Objects with .name method + if callable.respond_to?(:name) + return callable.name.to_s + end + + # Fallback + "scorer" + end + + # Wrap the callable to always accept 4 parameters + # @param callable [#call] The callable to wrap + # @return [Proc] Wrapped callable that accepts 4 params + def wrap_callable(callable) + arity = callable_arity(callable) + + case arity + when 3 + # Callable takes 3 params - wrap to ignore metadata + ->(input, expected, output, metadata) { + callable.call(input, expected, output) + } + when 4, -4, -1 + # Callable takes 4 params (or variadic with 4+) + # -4 means optional 4th param + # -1 means variadic (*args) + callable + else + raise ArgumentError, "Scorer must accept 3 or 4 parameters (got arity #{arity})" + end + end + + # Get the arity of a callable + # @param callable [#call] The callable + # @return [Integer] The arity + def callable_arity(callable) + if callable.respond_to?(:arity) + callable.arity + elsif callable.respond_to?(:method) + callable.method(:call).arity + else + # Assume 3 params if we can't detect + 3 + end + end + end + end +end diff --git a/lib/braintrust/internal/experiments.rb b/lib/braintrust/internal/experiments.rb new file mode 100644 index 00000000..0f6b354b --- /dev/null +++ b/lib/braintrust/internal/experiments.rb @@ -0,0 +1,137 @@ +# frozen_string_literal: true + +require "net/http" +require "json" +require "uri" +require_relative "../logger" + +module Braintrust + module Internal + # Experiments module provides internal API methods for registering projects and experiments + # Methods are marked private to prevent direct user access - use through Eval.run + module Experiments + # Public convenience method to register/get both project and experiment + # @param experiment_name [String] The experiment name + # @param project_name [String] The project name + # @param state [State] Braintrust state with API key and URL + # @param tags [Array, nil] Optional experiment tags + # @param metadata [Hash, nil] Optional experiment metadata + # @param update [Boolean] If true, allow reusing existing experiment (default: false) + # @return [Hash] Hash with :experiment_id, :experiment_name, :project_id, :project_name + def self.get_or_create(experiment_name, project_name, state:, + tags: nil, metadata: nil, update: false) + # Register/get project first + project = register_project(project_name, state) + + # Then register/get experiment + experiment = register_experiment( + experiment_name, + project["id"], + state, + tags: tags, + metadata: metadata, + update: update + ) + + { + experiment_id: experiment["id"], + experiment_name: experiment["name"], + project_id: project["id"], + project_name: project["name"] + } + end + + # Register or get a project by name + # POST /v1/project with {name: "project-name"} + # Returns existing project if already exists + # @param name [String] Project name + # @param state [State] Braintrust state + # @return [Hash] Project data with "id", "name", "org_id", etc. + # @raise [Braintrust::Error] if API call fails + def self.register_project(name, state) + Log.debug("Registering project: #{name}") + + uri = URI("#{state.api_url}/v1/project") + request = Net::HTTP::Post.new(uri) + request["Content-Type"] = "application/json" + request["Authorization"] = "Bearer #{state.api_key}" + request.body = JSON.dump({name: name}) + + http = Net::HTTP.new(uri.hostname, uri.port) + if uri.scheme == "https" + http.use_ssl = true + # TODO: This should be VERIFY_PEER but macOS has CRL issues + http.verify_mode = OpenSSL::SSL::VERIFY_NONE + end + + response = http.start do |http_session| + http_session.request(request) + end + + Log.debug("Register project response: [#{response.code}]") + + # Handle response codes + unless response.is_a?(Net::HTTPSuccess) + raise Error, "Failed to register project '#{name}': [#{response.code}] #{response.body}" + end + + project = JSON.parse(response.body) + Log.debug("Project registered: #{project["id"]} (#{project["name"]})") + project + end + private_class_method :register_project + + # Register or get an experiment by name + # POST /v1/experiment with {project_id:, name:, ensure_new:, tags:[], metadata:{}} + # @param name [String] Experiment name + # @param project_id [String] Project ID + # @param state [State] Braintrust state + # @param tags [Array, nil] Optional tags + # @param metadata [Hash, nil] Optional metadata + # @param update [Boolean] If true, allow reusing existing experiment (ensure_new: false) + # @return [Hash] Experiment data with "id", "name", "project_id", etc. + # @raise [Braintrust::Error] if API call fails + def self.register_experiment(name, project_id, state, tags: nil, metadata: nil, update: false) + Log.debug("Registering experiment: #{name} (project: #{project_id}, update: #{update})") + + uri = URI("#{state.api_url}/v1/experiment") + request = Net::HTTP::Post.new(uri) + request["Content-Type"] = "application/json" + request["Authorization"] = "Bearer #{state.api_key}" + + payload = { + project_id: project_id, + name: name, + ensure_new: !update # When update=true, allow reusing existing experiment + } + payload[:tags] = tags if tags + payload[:metadata] = metadata if metadata + + request.body = JSON.dump(payload) + + http = Net::HTTP.new(uri.hostname, uri.port) + if uri.scheme == "https" + http.use_ssl = true + # TODO: This should be VERIFY_PEER but macOS has CRL issues + http.verify_mode = OpenSSL::SSL::VERIFY_NONE + end + + response = http.start do |http_session| + http_session.request(request) + end + + Log.debug("Register experiment response: [#{response.code}]") + + # Handle response codes + unless response.is_a?(Net::HTTPSuccess) + raise Error, "Failed to register experiment '#{name}': [#{response.code}] #{response.body}" + end + + experiment = JSON.parse(response.body) + Log.debug("Experiment registered: #{experiment["id"]} (#{experiment["name"]})") + experiment + end + private_class_method :register_experiment + end + end +end diff --git a/lib/braintrust/logger.rb b/lib/braintrust/logger.rb new file mode 100644 index 00000000..5cb5c4e0 --- /dev/null +++ b/lib/braintrust/logger.rb @@ -0,0 +1,32 @@ +# frozen_string_literal: true + +require "logger" + +module Braintrust + # Simple logger for Braintrust SDK + module Log + # Default to WARN unless BRAINTRUST_DEBUG is set + level = ENV["BRAINTRUST_DEBUG"] ? Logger::DEBUG : Logger::WARN + @logger = Logger.new($stderr, level: level) + + class << self + attr_accessor :logger + + def debug(message) + @logger.debug(message) + end + + def info(message) + @logger.info(message) + end + + def warn(message) + @logger.warn(message) + end + + def error(message) + @logger.error(message) + end + end + end +end diff --git a/lib/braintrust/ssl_config.rb b/lib/braintrust/ssl_config.rb new file mode 100644 index 00000000..52c9f27d --- /dev/null +++ b/lib/braintrust/ssl_config.rb @@ -0,0 +1,31 @@ +# frozen_string_literal: true + +require "openssl" + +module Braintrust + # SSL configuration helpers for macOS CRL issues + # + # This module configures OpenSSL to bypass Certificate Revocation List (CRL) errors + # which commonly occur on macOS due to system certificate configuration issues. + # All other SSL verification checks remain active for security. + module SSLConfig + # Configure global SSL defaults to ignore CRL errors + # This affects all Ruby SSL connections system-wide + def self.configure_defaults! + # Set up a verify callback that ignores CRL errors but keeps other checks + OpenSSL::SSL::SSLContext::DEFAULT_PARAMS[:verify_mode] = OpenSSL::SSL::VERIFY_PEER + OpenSSL::SSL::SSLContext::DEFAULT_PARAMS[:verify_callback] = proc do |preverify_ok, store_context| + if store_context.error == OpenSSL::X509::V_ERR_UNABLE_TO_GET_CRL + # Ignore CRL errors (common on macOS) + true + else + # Keep all other SSL verification + preverify_ok + end + end + end + end +end + +# Auto-configure SSL defaults when this module is loaded +Braintrust::SSLConfig.configure_defaults! diff --git a/lib/braintrust/state.rb b/lib/braintrust/state.rb new file mode 100644 index 00000000..ac0f6a62 --- /dev/null +++ b/lib/braintrust/state.rb @@ -0,0 +1,75 @@ +# frozen_string_literal: true + +require_relative "api/auth" + +module Braintrust + # State object that holds Braintrust configuration + # Thread-safe global state management + class State + attr_reader :api_key, :org_name, :org_id, :default_parent, :app_url, :api_url, :proxy_url, :logged_in + + @mutex = Mutex.new + @global_state = nil + + def initialize(api_key: nil, org_name: nil, org_id: nil, default_parent: nil, app_url: nil, api_url: nil, proxy_url: nil, logged_in: false) + raise ArgumentError, "api_key is required" if api_key.nil? || api_key.empty? + + @api_key = api_key + @org_name = org_name + @org_id = org_id + @default_parent = default_parent + @app_url = app_url || "https://www.braintrust.dev" + @api_url = api_url + @proxy_url = proxy_url + @logged_in = logged_in + end + + # Thread-safe global state getter + def self.global + @mutex.synchronize { @global_state } + end + + # Thread-safe global state setter + def self.global=(state) + @mutex.synchronize { @global_state = state } + end + + # Login to Braintrust API and update state with org info + # Makes synchronous HTTP request via API::Auth + # Updates @org_id, @org_name, @api_url, @proxy_url, @logged_in + # @return [self] + def login + result = API::Auth.login( + api_key: @api_key, + app_url: @app_url, + org_name: @org_name + ) + + # Update state with org info + @org_id = result.org_id + @org_name = result.org_name + @api_url = result.api_url + @proxy_url = result.proxy_url + @logged_in = true + + self + end + + # Validate state is properly configured + # Raises ArgumentError if state is invalid + # @return [self] + def validate + raise ArgumentError, "api_key is required" if @api_key.nil? || @api_key.empty? + raise ArgumentError, "api_url is required" if @api_url.nil? || @api_url.empty? + raise ArgumentError, "app_url is required" if @app_url.nil? || @app_url.empty? + + # If logged_in is true, org_id and org_name should be present + if @logged_in + raise ArgumentError, "org_id is required when logged_in is true" if @org_id.nil? || @org_id.empty? + raise ArgumentError, "org_name is required when logged_in is true" if @org_name.nil? || @org_name.empty? + end + + self + end + end +end diff --git a/lib/braintrust/trace.rb b/lib/braintrust/trace.rb new file mode 100644 index 00000000..62225a34 --- /dev/null +++ b/lib/braintrust/trace.rb @@ -0,0 +1,108 @@ +# frozen_string_literal: true + +require "opentelemetry/sdk" +require "opentelemetry/exporter/otlp" +require_relative "trace/span_processor" +require_relative "trace/openai" +require_relative "logger" + +module Braintrust + module Trace + def self.enable(tracer_provider, state: nil, exporter: nil) + state ||= Braintrust.current_state + raise Error, "No state available" unless state + + # Create OTLP HTTP exporter unless override provided + exporter ||= OpenTelemetry::Exporter::OTLP::Exporter.new( + endpoint: "#{state.api_url}/otel/v1/traces", + headers: { + "Authorization" => "Bearer #{state.api_key}" + } + ) + + # Wrap in batch processor + batch_processor = OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(exporter) + + # Wrap batch processor in our custom span processor to add Braintrust attributes + processor = SpanProcessor.new(batch_processor, state) + + # Register with tracer provider + tracer_provider.add_span_processor(processor) + + # Console debug if enabled + if ENV["BRAINTRUST_ENABLE_TRACE_CONSOLE_LOG"] + console_exporter = OpenTelemetry::SDK::Trace::Export::ConsoleSpanExporter.new + console_processor = OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(console_exporter) + tracer_provider.add_span_processor(console_processor) + end + + self + end + + # Generate a permalink URL for a span to view in the Braintrust UI + # Returns an empty string if the permalink cannot be generated + # @param span [OpenTelemetry::Trace::Span] The span to generate a permalink for + # @return [String] The permalink URL, or empty string if an error occurs + def self.permalink(span) + return "" if span.nil? + + # Extract required attributes from span + span_context = span.context + trace_id = span_context.hex_trace_id + span_id = span_context.hex_span_id + + # Get Braintrust attributes + attributes = span.attributes if span.respond_to?(:attributes) + unless attributes + Log.error("Span does not support attributes") + return "" + end + + app_url = attributes[SpanProcessor::APP_URL_ATTR_KEY] + org_name = attributes[SpanProcessor::ORG_ATTR_KEY] + parent = attributes[SpanProcessor::PARENT_ATTR_KEY] + + # Validate required attributes + unless app_url + Log.error("Missing required attribute: #{SpanProcessor::APP_URL_ATTR_KEY}") + return "" + end + + unless org_name + Log.error("Missing required attribute: #{SpanProcessor::ORG_ATTR_KEY}") + return "" + end + + unless parent + Log.error("Missing required attribute: #{SpanProcessor::PARENT_ATTR_KEY}") + return "" + end + + # Parse parent to determine URL format + parent_type, parent_id = parent.split(":", 2) + unless parent_type && parent_id + Log.error("Invalid parent format: #{parent}") + return "" + end + + # Build the permalink URL based on parent type + if parent_type == "experiment_id" + # For experiments: {app_url}/app/{org}/p/{project}/experiments/{experiment_id}?r={trace_id}&s={span_id} + project_name, experiment_id = parent_id.split("/", 2) + unless project_name && experiment_id + Log.error("Invalid experiment parent format: #{parent_id}") + return "" + end + + "#{app_url}/app/#{org_name}/p/#{project_name}/experiments/#{experiment_id}?r=#{trace_id}&s=#{span_id}" + else + # For projects: {app_url}/app/{org}/p/{project}/logs?r={trace_id}&s={span_id} + # parent_type is typically "project_name" + "#{app_url}/app/#{org_name}/p/#{parent_id}/logs?r=#{trace_id}&s=#{span_id}" + end + rescue => e + Log.error("Failed to generate permalink: #{e.message}") + "" + end + end +end diff --git a/lib/braintrust/trace/openai.rb b/lib/braintrust/trace/openai.rb new file mode 100644 index 00000000..c507d4e9 --- /dev/null +++ b/lib/braintrust/trace/openai.rb @@ -0,0 +1,87 @@ +# frozen_string_literal: true + +require "opentelemetry/sdk" +require "json" + +module Braintrust + module Trace + module OpenAI + # Wrap an OpenAI::Client to automatically create spans for chat completions + # @param client [OpenAI::Client] the OpenAI client to wrap + # @param tracer_provider [OpenTelemetry::SDK::Trace::TracerProvider] the tracer provider (defaults to global) + def self.wrap(client, tracer_provider: nil) + tracer_provider ||= ::OpenTelemetry.tracer_provider + + # Create a wrapper module that intercepts chat.completions.create + wrapper = Module.new do + define_method(:create) do |**params| + tracer = tracer_provider.tracer("braintrust") + + tracer.in_span("openai.chat.completions.create") do |span| + # Initialize metadata hash + metadata = { + "provider" => "openai", + "endpoint" => "/v1/chat/completions" + } + + # Capture request metadata fields + metadata_fields = %i[ + model frequency_penalty logit_bias logprobs max_tokens n + presence_penalty response_format seed service_tier stop + stream stream_options temperature top_p top_logprobs + tools tool_choice parallel_tool_calls user functions function_call + ] + + metadata_fields.each do |field| + metadata[field.to_s] = params[field] if params.key?(field) + end + + # Set input messages as JSON + if params[:messages] + messages_array = params[:messages].map do |msg| + {role: msg[:role].to_s, content: msg[:content]} + end + span.set_attribute("braintrust.input_json", JSON.generate(messages_array)) + end + + # Call the original method + response = super(**params) + + # Set output (choices) as JSON + # Use to_h to get the raw structure with all fields (including tool_calls) + if response.respond_to?(:choices) && response.choices&.any? + choices_array = response.choices.map(&:to_h) + span.set_attribute("braintrust.output_json", JSON.generate(choices_array)) + end + + # Set metrics (token usage) + if response.respond_to?(:usage) && response.usage + metrics = {} + metrics["prompt_tokens"] = response.usage.prompt_tokens if response.usage.prompt_tokens + metrics["completion_tokens"] = response.usage.completion_tokens if response.usage.completion_tokens + metrics["tokens"] = response.usage.total_tokens if response.usage.total_tokens + span.set_attribute("braintrust.metrics", JSON.generate(metrics)) + end + + # Add response metadata fields + metadata["id"] = response.id if response.respond_to?(:id) && response.id + metadata["created"] = response.created if response.respond_to?(:created) && response.created + metadata["system_fingerprint"] = response.system_fingerprint if response.respond_to?(:system_fingerprint) && response.system_fingerprint + metadata["service_tier"] = response.service_tier if response.respond_to?(:service_tier) && response.service_tier + + # Set metadata ONCE at the end with complete hash + span.set_attribute("braintrust.metadata", JSON.generate(metadata)) + + response + end + end + end + + # Prepend the wrapper to the completions resource + client.chat.completions.singleton_class.prepend(wrapper) + + client + end + end + end +end diff --git a/lib/braintrust/trace/span_processor.rb b/lib/braintrust/trace/span_processor.rb new file mode 100644 index 00000000..333a9f61 --- /dev/null +++ b/lib/braintrust/trace/span_processor.rb @@ -0,0 +1,71 @@ +# frozen_string_literal: true + +require "opentelemetry/sdk" + +module Braintrust + module Trace + # Custom span processor that adds Braintrust-specific attributes to spans + class SpanProcessor + PARENT_ATTR_KEY = "braintrust.parent" + ORG_ATTR_KEY = "braintrust.org" + APP_URL_ATTR_KEY = "braintrust.app_url" + + def initialize(wrapped_processor, state) + @wrapped = wrapped_processor + @state = state + end + + def on_start(span, parent_context) + # Add default parent if span doesn't already have one + has_parent = span.respond_to?(:attributes) && span.attributes&.key?(PARENT_ATTR_KEY) + + unless has_parent + # Try to inherit parent from parent span in context + parent_value = get_parent_from_context(parent_context) || default_parent + span.set_attribute(PARENT_ATTR_KEY, parent_value) + end + + # Always add org and app_url + span.set_attribute(ORG_ATTR_KEY, @state.org_name) if @state.org_name + span.set_attribute(APP_URL_ATTR_KEY, @state.app_url) if @state.app_url + + # Delegate to wrapped processor + @wrapped.on_start(span, parent_context) + end + + # Called when a span ends + def on_finish(span) + @wrapped.on_finish(span) + end + + # Shutdown the processor + def shutdown(timeout: nil) + @wrapped.shutdown(timeout: timeout) + end + + # Force flush any buffered spans + def force_flush(timeout: nil) + @wrapped.force_flush(timeout: timeout) + end + + private + + def default_parent + @state.default_parent || "project_name:ruby-sdk-default-project" + end + + # Get parent attribute from parent span in context + def get_parent_from_context(parent_context) + return nil unless parent_context + + # Get the current span from the context (the parent span) + parent_span = OpenTelemetry::Trace.current_span(parent_context) + return nil unless parent_span + return nil unless parent_span.respond_to?(:attributes) + + # Return the parent attribute from the parent span + parent_span.attributes&.[](PARENT_ATTR_KEY) + end + end + end +end diff --git a/lib/braintrust/version.rb b/lib/braintrust/version.rb index b892a8ba..73c4bae7 100644 --- a/lib/braintrust/version.rb +++ b/lib/braintrust/version.rb @@ -1,5 +1,5 @@ # frozen_string_literal: true module Braintrust - VERSION = "0.1.0" + VERSION = "0.0.1" end diff --git a/mise.toml b/mise.toml index 18333906..09b4a098 100644 --- a/mise.toml +++ b/mise.toml @@ -20,8 +20,7 @@ description = "Runs tests when files change" run = "watchexec --exts rb --watch lib --watch test --restart --clear -- rake test" [tasks.verify-fmt] -silent = true -run = "bundle exec standardrb --format progress || (bundle exec standardrb --fix && exit 1)" +run = "bundle exec standardrb --format progress" [hooks] postinstall = """ diff --git a/test/braintrust/config_test.rb b/test/braintrust/config_test.rb index 317b286f..3007b07f 100644 --- a/test/braintrust/config_test.rb +++ b/test/braintrust/config_test.rb @@ -3,13 +3,67 @@ require "test_helper" class Braintrust::ConfigTest < Minitest::Test + def setup + # Save original env vars + @original_api_key = ENV["BRAINTRUST_API_KEY"] + @original_org_name = ENV["BRAINTRUST_ORG_NAME"] + @original_app_url = ENV["BRAINTRUST_APP_URL"] + end + + def teardown + # Restore original env vars + if @original_api_key + ENV["BRAINTRUST_API_KEY"] = @original_api_key + else + ENV.delete("BRAINTRUST_API_KEY") + end + + if @original_org_name + ENV["BRAINTRUST_ORG_NAME"] = @original_org_name + else + ENV.delete("BRAINTRUST_ORG_NAME") + end + + if @original_app_url + ENV["BRAINTRUST_APP_URL"] = @original_app_url + else + ENV.delete("BRAINTRUST_APP_URL") + end + end + def test_parses_api_key_from_env ENV["BRAINTRUST_API_KEY"] = "test-key-123" config = Braintrust::Config.from_env assert_equal "test-key-123", config.api_key - ensure - ENV.delete("BRAINTRUST_API_KEY") + end + + def test_provides_default_values + config = Braintrust::Config.from_env + + assert_equal "https://www.braintrust.dev", config.app_url + assert_equal "https://api.braintrust.dev", config.api_url + end + + def test_passed_options_override_env_vars + ENV["BRAINTRUST_API_KEY"] = "env-key" + ENV["BRAINTRUST_ORG_NAME"] = "env-org" + + config = Braintrust::Config.from_env( + api_key: "explicit-key", + org_name: "explicit-org" + ) + + assert_equal "explicit-key", config.api_key + assert_equal "explicit-org", config.org_name + end + + def test_env_vars_override_defaults + ENV["BRAINTRUST_APP_URL"] = "https://custom.braintrust.dev" + + config = Braintrust::Config.from_env + + assert_equal "https://custom.braintrust.dev", config.app_url end end diff --git a/test/braintrust/eval/case_test.rb b/test/braintrust/eval/case_test.rb new file mode 100644 index 00000000..4cc7394e --- /dev/null +++ b/test/braintrust/eval/case_test.rb @@ -0,0 +1,61 @@ +# frozen_string_literal: true + +require "test_helper" +require "braintrust/eval/case" + +class Braintrust::Eval::CaseTest < Minitest::Test + def test_case_with_input_and_expected + # Test basic case creation with input and expected + test_case = Braintrust::Eval::Case.new( + input: "apple", + expected: "fruit" + ) + + assert_equal "apple", test_case.input + assert_equal "fruit", test_case.expected + assert_nil test_case.tags + assert_nil test_case.metadata + end + + def test_case_with_all_fields + # Test case with all fields populated + test_case = Braintrust::Eval::Case.new( + input: "banana", + expected: "fruit", + tags: ["tropical", "sweet"], + metadata: {color: "yellow", price: 0.5} + ) + + assert_equal "banana", test_case.input + assert_equal "fruit", test_case.expected + assert_equal ["tropical", "sweet"], test_case.tags + assert_equal({color: "yellow", price: 0.5}, test_case.metadata) + end + + def test_case_input_only + # Test that expected, tags, and metadata are optional + test_case = Braintrust::Eval::Case.new(input: "test") + + assert_equal "test", test_case.input + assert_nil test_case.expected + assert_nil test_case.tags + assert_nil test_case.metadata + end + + def test_case_from_hash + # Test creating case from hash (as users will provide) + hash = { + input: "carrot", + expected: "vegetable", + tags: ["orange"], + metadata: {category: "root"} + } + + test_case = Braintrust::Eval::Case.new(**hash) + + assert_equal "carrot", test_case.input + assert_equal "vegetable", test_case.expected + assert_equal ["orange"], test_case.tags + assert_equal({category: "root"}, test_case.metadata) + end +end diff --git a/test/braintrust/eval/cases_test.rb b/test/braintrust/eval/cases_test.rb new file mode 100644 index 00000000..847b3366 --- /dev/null +++ b/test/braintrust/eval/cases_test.rb @@ -0,0 +1,121 @@ +# frozen_string_literal: true + +require "test_helper" +require "braintrust/eval/case" +require "braintrust/eval/cases" + +class Braintrust::Eval::CasesTest < Minitest::Test + def test_cases_from_array_of_hashes + # Test creating Cases from array of hashes + cases_input = [ + {input: "apple", expected: "fruit"}, + {input: "carrot", expected: "vegetable"} + ] + + cases = Braintrust::Eval::Cases.new(cases_input) + + result = [] + cases.each do |test_case| + result << test_case + end + + assert_equal 2, result.length + assert_instance_of Braintrust::Eval::Case, result[0] + assert_equal "apple", result[0].input + assert_equal "fruit", result[0].expected + end + + def test_cases_from_array_of_case_objects + # Test that Cases accepts already-built Case objects + cases_input = [ + Braintrust::Eval::Case.new(input: "apple", expected: "fruit"), + Braintrust::Eval::Case.new(input: "carrot", expected: "vegetable") + ] + + cases = Braintrust::Eval::Cases.new(cases_input) + + result = [] + cases.each do |test_case| + result << test_case + end + + assert_equal 2, result.length + assert_equal "apple", result[0].input + end + + def test_cases_from_enumerator + # Test creating Cases from lazy enumerator + enumerator = Enumerator.new do |yielder| + yielder << {input: "apple", expected: "fruit"} + yielder << {input: "carrot", expected: "vegetable"} + end + + cases = Braintrust::Eval::Cases.new(enumerator) + + result = [] + cases.each do |test_case| + result << test_case + end + + assert_equal 2, result.length + assert_equal "apple", result[0].input + end + + def test_cases_with_all_fields + # Test that Cases preserves tags and metadata + cases_input = [ + { + input: "apple", + expected: "fruit", + tags: ["sweet"], + metadata: {color: "red"} + } + ] + + cases = Braintrust::Eval::Cases.new(cases_input) + + result = [] + cases.each do |test_case| + result << test_case + end + + assert_equal ["sweet"], result[0].tags + assert_equal({color: "red"}, result[0].metadata) + end + + def test_cases_lazy_evaluation + # Test that enumerator is evaluated lazily + evaluated = [] + + enumerator = Enumerator.new do |yielder| + evaluated << 1 + yielder << {input: "first", expected: "a"} + evaluated << 2 + yielder << {input: "second", expected: "b"} + end + + cases = Braintrust::Eval::Cases.new(enumerator) + + # Creating Cases should not trigger evaluation + assert_equal [], evaluated + + # Iterating should trigger evaluation + cases.each { |_| break } # Break after first + + # Should have evaluated first item only + assert_equal [1], evaluated + end + + def test_cases_count + # Test that Cases provides count method + cases_input = [ + {input: "apple", expected: "fruit"}, + {input: "carrot", expected: "vegetable"} + ] + + cases = Braintrust::Eval::Cases.new(cases_input) + + # For arrays, count should work + assert_equal 2, cases.count + end +end diff --git a/test/braintrust/eval/result_test.rb b/test/braintrust/eval/result_test.rb new file mode 100644 index 00000000..f8a7611f --- /dev/null +++ b/test/braintrust/eval/result_test.rb @@ -0,0 +1,93 @@ +# frozen_string_literal: true + +require "test_helper" +require "braintrust/eval/result" + +class Braintrust::Eval::ResultTest < Minitest::Test + def test_result_with_success + # Test successful result (no errors) + result = Braintrust::Eval::Result.new( + experiment_id: "exp_123", + experiment_name: "my-experiment", + project_id: "proj_456", + permalink: "https://braintrust.dev/link", + errors: [], + duration: 1.5 + ) + + assert_equal "exp_123", result.experiment_id + assert_equal "my-experiment", result.experiment_name + assert_equal "proj_456", result.project_id + assert_equal "https://braintrust.dev/link", result.permalink + assert_equal [], result.errors + assert_equal 1.5, result.duration + + assert result.success? + refute result.failed? + end + + def test_result_with_errors + # Test failed result (with errors) + result = Braintrust::Eval::Result.new( + experiment_id: "exp_123", + experiment_name: "my-experiment", + project_id: "proj_456", + permalink: "https://braintrust.dev/link", + errors: ["Task failed for input 'apple'", "Scorer 'exact_match' failed"], + duration: 2.3 + ) + + assert_equal 2, result.errors.length + refute result.success? + assert result.failed? + end + + def test_result_to_s_success + # Test to_s formatting for successful result + result = Braintrust::Eval::Result.new( + experiment_id: "exp_123", + experiment_name: "food-classifier", + project_id: "proj_456", + permalink: "https://braintrust.dev/link", + errors: [], + duration: 1.234 + ) + + output = result.to_s + + assert_match(/food-classifier/, output) + assert_match(/proj_456/, output) + assert_match(/1.2s/, output) # Rounded to 1 decimal + assert_match(/braintrust.dev\/link/, output) + refute_match(/Errors:/, output) # No errors section + end + + def test_result_to_s_with_errors + # Test to_s formatting for failed result + result = Braintrust::Eval::Result.new( + experiment_id: "exp_123", + experiment_name: "food-classifier", + project_id: "proj_456", + permalink: "https://braintrust.dev/link", + errors: ["Error 1", "Error 2"], + duration: 1.234 + ) + + output = result.to_s + + assert_match(/food-classifier/, output) + assert_match(/Errors:/, output) + assert_match(/Error 1/, output) + assert_match(/Error 2/, output) + end + + def test_result_requires_all_fields + # Test that all required fields must be provided + assert_raises(ArgumentError) do + Braintrust::Eval::Result.new( + experiment_name: "test" + # Missing other required fields + ) + end + end +end diff --git a/test/braintrust/eval/scorer_test.rb b/test/braintrust/eval/scorer_test.rb new file mode 100644 index 00000000..602e6904 --- /dev/null +++ b/test/braintrust/eval/scorer_test.rb @@ -0,0 +1,165 @@ +# frozen_string_literal: true + +require "test_helper" +require "braintrust/eval/scorer" + +class Braintrust::Eval::ScorerTest < Minitest::Test + def test_scorer_with_3_param_block + # Test scorer with 3 params (input, expected, output) + # Block should be called without metadata + scorer = Braintrust::Eval::Scorer.new("exact_match") do |input, expected, output| + (output == expected) ? 1.0 : 0.0 + end + + assert_equal "exact_match", scorer.name + + # Call with metadata - block should ignore it + result = scorer.call("apple", "fruit", "fruit", {threshold: 0.5}) + assert_equal 1.0, result + end + + def test_scorer_with_4_param_block + # Test scorer with 4 params (input, expected, output, metadata) + # Block should receive metadata + scorer = Braintrust::Eval::Scorer.new("threshold_match") do |input, expected, output, metadata| + threshold = metadata[:threshold] || 0.8 + score = 0.9 + (score >= threshold) ? 1.0 : 0.0 + end + + assert_equal "threshold_match", scorer.name + + # Call with high threshold - should fail + result = scorer.call("a", "b", "c", {threshold: 0.95}) + assert_equal 0.0, result + + # Call with low threshold - should pass + result = scorer.call("a", "b", "c", {threshold: 0.85}) + assert_equal 1.0, result + end + + def test_scorer_with_callable_object + # Test scorer with object that responds to .call + callable = Class.new do + def call(input, expected, output) + (output.downcase == expected.downcase) ? 1.0 : 0.0 + end + end.new + + scorer = Braintrust::Eval::Scorer.new("case_insensitive", callable) + + assert_equal "case_insensitive", scorer.name + + result = scorer.call("test", "HELLO", "hello", {}) + assert_equal 1.0, result + end + + def test_scorer_return_float + # Test that float return values are passed through + scorer = Braintrust::Eval::Scorer.new("float_scorer") do |i, e, o| + 0.75 + end + + result = scorer.call("a", "b", "c", {}) + assert_equal 0.75, result + end + + def test_scorer_return_hash + # Test that hash return values are normalized + scorer = Braintrust::Eval::Scorer.new("hash_scorer") do |i, e, o| + {name: "custom_name", score: 0.85} + end + + result = scorer.call("a", "b", "c", {}) + assert_equal({name: "custom_name", score: 0.85}, result) + end + + def test_scorer_return_array + # Test that array return values are normalized + scorer = Braintrust::Eval::Scorer.new("multi_scorer") do |i, e, o| + [ + {name: "metric1", score: 0.9}, + {name: "metric2", score: 0.8} + ] + end + + result = scorer.call("a", "b", "c", {}) + assert_equal 2, result.length + assert_equal({name: "metric1", score: 0.9}, result[0]) + assert_equal({name: "metric2", score: 0.8}, result[1]) + end + + def test_scorer_invalid_arity + # Test that scorer raises error for invalid arity + error = assert_raises(ArgumentError) do + Braintrust::Eval::Scorer.new("bad_scorer") do |only_one_param| + 1.0 + end + end + + assert_match(/must accept 3 or 4 parameters/, error.message) + end + + def test_scorer_missing_callable + # Test that scorer raises error if no callable provided + error = assert_raises(ArgumentError) do + Braintrust::Eval::Scorer.new("no_callable") + end + + assert_match(/must provide callable or block/i, error.message) + end + + def test_scorer_with_callable_object_having_name + # Test scorer that uses object's .name method if available + callable = Class.new do + def name + "object_name" + end + + def call(input, expected, output) + 1.0 + end + end.new + + # When name is provided explicitly, it should override object's name + scorer = Braintrust::Eval::Scorer.new("explicit_name", callable) + assert_equal "explicit_name", scorer.name + end + + def test_scorer_with_method_auto_name + # Test that method objects automatically use the method name + sample_scorer = lambda { |input, expected, output| + (output == expected) ? 1.0 : 0.0 + } + # Give it a name property for testing + sample_scorer.define_singleton_method(:name) { "sample_scorer" } + + # Pass method object without explicit name + scorer = Braintrust::Eval::Scorer.new(sample_scorer) + + # Should auto-detect name from method + assert_equal "sample_scorer", scorer.name + + result = scorer.call("test", "fruit", "fruit", {}) + assert_equal 1.0, result + end + + def test_scorer_with_callable_object_auto_name + # Test that objects with .name method automatically use it + callable = Class.new do + def name + "auto_name" + end + + def call(input, expected, output) + 1.0 + end + end.new + + # Pass callable without explicit name + scorer = Braintrust::Eval::Scorer.new(callable) + + # Should auto-detect name from object + assert_equal "auto_name", scorer.name + end +end diff --git a/test/braintrust/eval_test.rb b/test/braintrust/eval_test.rb new file mode 100644 index 00000000..9c2bbe7f --- /dev/null +++ b/test/braintrust/eval_test.rb @@ -0,0 +1,358 @@ +# frozen_string_literal: true + +require "test_helper" +require "braintrust/eval" + +class Braintrust::EvalTest < Minitest::Test + def test_eval_scorer_helper + # Test Eval.scorer helper method + scorer = Braintrust::Eval.scorer("test_scorer") do |input, expected, output| + (output == expected) ? 1.0 : 0.0 + end + + assert_equal "test_scorer", scorer.name + assert_instance_of Braintrust::Eval::Scorer, scorer + end + + def test_eval_run_basic + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + task = ->(input) { input.upcase } + scorer = Braintrust::Eval.scorer("exact") do |input, expected, output| + (output == expected) ? 1.0 : 0.0 + end + + result = Braintrust::Eval.run( + project: "ruby-sdk-test", + experiment: "test-basic-#{Time.now.to_i}", + cases: [ + {input: "hello", expected: "HELLO"}, + {input: "world", expected: "WORLD"} + ], + task: task, + scorers: [scorer], + state: state + ) + + assert_instance_of Braintrust::Eval::Result, result + assert result.success? + assert_equal [], result.errors + assert result.duration > 0 + end + + def test_eval_run_with_task_error + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + task = ->(input) { + raise "Task failed!" if input == "bad" + input.upcase + } + + scorer = Braintrust::Eval.scorer("exact") do |input, expected, output| + (output == expected) ? 1.0 : 0.0 + end + + result = Braintrust::Eval.run( + project: "ruby-sdk-test", + experiment: "test-task-error-#{Time.now.to_i}", + cases: [ + {input: "good", expected: "GOOD"}, + {input: "bad", expected: "BAD"} + ], + task: task, + scorers: [scorer], + state: state + ) + + assert result.failed? + assert_equal 1, result.errors.length + assert_match(/Task failed/, result.errors[0]) + end + + def test_eval_run_with_scorer_error + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + task = ->(input) { input.upcase } + + scorer = Braintrust::Eval.scorer("failing_scorer") do |input, expected, output| + raise "Scorer failed!" if input == "bad" + 1.0 + end + + result = Braintrust::Eval.run( + project: "ruby-sdk-test", + experiment: "test-scorer-error-#{Time.now.to_i}", + cases: [ + {input: "good", expected: "GOOD"}, + {input: "bad", expected: "BAD"} + ], + task: task, + scorers: [scorer], + state: state + ) + + assert result.failed? + assert_equal 1, result.errors.length + assert_match(/Scorer.*failed/, result.errors[0]) + end + + def test_eval_scorer_error_records_exception_event + # Test that scorer errors are recorded as exception events on spans + rig = setup_otel_test_rig + + task = ->(input) { input.upcase } + good_scorer = Braintrust::Eval.scorer("good") { |i, e, o| 1.0 } + failing_scorer = Braintrust::Eval.scorer("failing") do |i, e, o| + raise "Intentional error" if i == "bad" + 1.0 + end + + # Use run_test_eval helper to avoid API calls in tests + run_test_eval( + experiment_id: "test-exp-123", + experiment_name: "test-error-events", + project_id: "test-proj-123", + project_name: "test-project", + cases: [{input: "bad", expected: "BAD"}], + task: task, + scorers: [good_scorer, failing_scorer], + state: rig.state, + tracer_provider: rig.tracer_provider + ) + + spans = rig.drain + score_span = spans.find { |s| s.name == "score" } + + assert score_span, "Expected score span" + assert score_span.events, "Expected span to have events" + + exception_event = score_span.events.find { |e| e.name == "exception" } + assert exception_event, "Expected exception event" + assert_equal "ScorerError", exception_event.attributes["exception.type"] + assert_match(/Intentional error/, exception_event.attributes["exception.message"]) + assert exception_event.attributes["exception.stacktrace"], "Expected stacktrace in exception event" + + # Verify scores still recorded for successful scorers + scores = JSON.parse(score_span.attributes["braintrust.scores"]) + assert_equal 1.0, scores["good"], "Good scorer should have succeeded" + assert_nil scores["failing"], "Failing scorer should not have a score" + end + + def test_eval_run_with_multiple_scorers + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + task = ->(input) { input.upcase } + + scorer1 = Braintrust::Eval.scorer("exact") do |input, expected, output| + (output == expected) ? 1.0 : 0.0 + end + + scorer2 = Braintrust::Eval.scorer("length") do |input, expected, output| + (output.length == expected.length) ? 1.0 : 0.0 + end + + result = Braintrust::Eval.run( + project: "ruby-sdk-test", + experiment: "test-multiple-scorers-#{Time.now.to_i}", + cases: [ + {input: "hello", expected: "HELLO"} + ], + task: task, + scorers: [scorer1, scorer2], + state: state + ) + + assert result.success? + end + + def test_eval_run_with_callable_task + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + callable_task = Class.new do + def call(input) + input.reverse + end + end.new + + scorer = Braintrust::Eval.scorer("exact") do |input, expected, output| + (output == expected) ? 1.0 : 0.0 + end + + result = Braintrust::Eval.run( + project: "ruby-sdk-test", + experiment: "test-callable-task-#{Time.now.to_i}", + cases: [ + {input: "hello", expected: "olleh"} + ], + task: callable_task, + scorers: [scorer], + state: state + ) + + assert result.success? + end + + def test_eval_run_validates_required_params + # Test that run validates required parameters (no API call needed) + error = assert_raises(ArgumentError) do + Braintrust::Eval.run + # Missing required params + end + + # Ruby's keyword arg validation or our custom validation + assert_match(/required|missing keyword/i, error.message) + end + + def test_eval_run_validates_task_callable + # Test that task must be callable (no API call needed) + state = get_test_state + + error = assert_raises(ArgumentError) do + Braintrust::Eval.run( + project: "test", + experiment: "test", + cases: [], + task: "not callable", # String is not callable + scorers: [], + state: state + ) + end + + assert_match(/task.*callable/i, error.message) + end + + def test_eval_run_with_method_scorer + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + task = ->(input) { input.upcase } + # Use a lambda instead of nested method + test_method_scorer = ->(input, expected, output) { (output == expected) ? 1.0 : 0.0 } + + result = Braintrust::Eval.run( + project: "ruby-sdk-test", + experiment: "test-method-scorer-#{Time.now.to_i}", + cases: [ + {input: "hello", expected: "HELLO"} + ], + task: task, + scorers: [test_method_scorer], # Pass lambda directly + state: state + ) + + assert result.success? + end + + def test_eval_task_error_records_exception_on_task_span + # Test that task errors are recorded as exception events on the TASK span (not eval span) + rig = setup_otel_test_rig + + task = ->(input) { + raise "Task intentionally failed" if input == "bad" + input.upcase + } + scorer = Braintrust::Eval.scorer("good") { |i, e, o| 1.0 } + + # Use run_test_eval helper to avoid API calls in tests + run_test_eval( + experiment_id: "test-exp-123", + experiment_name: "test-task-error", + project_id: "test-proj-123", + project_name: "test-project", + cases: [{input: "bad", expected: "BAD"}], + task: task, + scorers: [scorer], + state: rig.state, + tracer_provider: rig.tracer_provider + ) + + spans = rig.drain + task_span = spans.find { |s| s.name == "task" } + eval_span = spans.find { |s| s.name == "eval" } + + # Task span should exist and have exception event (added by OpenTelemetry) + assert task_span, "Expected task span" + assert task_span.events, "Expected task span to have events" + + exception_event = task_span.events.find { |e| e.name == "exception" } + assert exception_event, "Expected exception event on task span" + assert_equal "RuntimeError", exception_event.attributes["exception.type"] + assert_match(/Task intentionally failed/, exception_event.attributes["exception.message"]) + assert exception_event.attributes["exception.stacktrace"], "Expected stacktrace in exception event" + + # Eval span should also have error status + assert eval_span, "Expected eval span" + assert_equal OpenTelemetry::Trace::Status::ERROR, eval_span.status.code + end + + def test_eval_run_with_tracing + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + # Set up test rig for capturing spans (includes Braintrust processor) + rig = setup_otel_test_rig + + # Initialize and login + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + task = ->(input) { input.upcase } + scorer = Braintrust::Eval.scorer("exact") { |i, e, o| (o == e) ? 1.0 : 0.0 } + + result = Braintrust::Eval.run( + project: "ruby-sdk-test", + experiment: "test-tracing-#{Time.now.to_i}", + cases: [{input: "hello", expected: "HELLO"}], + task: task, + scorers: [scorer], + state: state, + tracer_provider: rig.tracer_provider + ) + + assert result.success? + + # Verify spans were created + spans = rig.drain + + # Should have: 1 eval span, 1 task span, 1 score span + assert_equal 3, spans.length + + eval_span = spans.find { |s| s.name == "eval" } + task_span = spans.find { |s| s.name == "task" } + score_span = spans.find { |s| s.name == "score" } + + assert eval_span, "Expected eval span" + assert task_span, "Expected task span" + assert score_span, "Expected score span" + + # Verify eval span attributes + assert eval_span.attributes["braintrust.parent"] + assert_match(/experiment_id:[0-9a-f-]{36}/, eval_span.attributes["braintrust.parent"]) + assert_includes eval_span.attributes["braintrust.input_json"], "hello" + assert_includes eval_span.attributes["braintrust.output_json"], "HELLO" + + # Verify task span + assert task_span.attributes["braintrust.span_attributes"] + assert_includes task_span.attributes["braintrust.span_attributes"], "task" + + # Verify score span + assert score_span.attributes["braintrust.scores"] + assert_includes score_span.attributes["braintrust.scores"], "exact" + end +end diff --git a/test/braintrust/internal/experiments_test.rb b/test/braintrust/internal/experiments_test.rb new file mode 100644 index 00000000..e5991a5f --- /dev/null +++ b/test/braintrust/internal/experiments_test.rb @@ -0,0 +1,87 @@ +# frozen_string_literal: true + +require "test_helper" +require "braintrust/internal/experiments" + +class Braintrust::Internal::ExperimentsTest < Minitest::Test + def test_get_or_create_basic + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + result = Braintrust::Internal::Experiments.get_or_create( + "test-experiment-#{Time.now.to_i}", + "ruby-sdk-test", + state: state + ) + + assert result[:experiment_id] + assert result[:experiment_name] + assert result[:project_id] + assert_equal "ruby-sdk-test", result[:project_name] + end + + def test_get_or_create_with_tags_and_metadata + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + result = Braintrust::Internal::Experiments.get_or_create( + "test-experiment-#{Time.now.to_i}", + "ruby-sdk-test", + state: state, + tags: ["test", "ruby"], + metadata: {version: "1.0", author: "claude"} + ) + + assert result[:experiment_id] + assert result[:project_id] + end + + def test_get_or_create_with_update_flag + skip "Requires BRAINTRUST_API_KEY" unless ENV["BRAINTRUST_API_KEY"] + + Braintrust.init(blocking_login: true) + state = Braintrust.current_state + + # First create with update: false (new experiment) + result1 = Braintrust::Internal::Experiments.get_or_create( + "test-experiment-update", + "ruby-sdk-test", + state: state, + update: false + ) + + # Then with update: true (should allow reusing) + result2 = Braintrust::Internal::Experiments.get_or_create( + "test-experiment-update", + "ruby-sdk-test", + state: state, + update: true + ) + + # Both should succeed and return experiment IDs + assert result1[:experiment_id] + assert result2[:experiment_id] + end + + def test_register_project_is_private + # Test that register_project is private and cannot be called directly + error = assert_raises(NoMethodError) do + Braintrust::Internal::Experiments.register_project("test", nil) + end + + assert_match(/private method|undefined method/, error.message) + end + + def test_register_experiment_is_private + # Test that register_experiment is private and cannot be called directly + error = assert_raises(NoMethodError) do + Braintrust::Internal::Experiments.register_experiment("test", "proj_id", nil) + end + + assert_match(/private method|undefined method/, error.message) + end +end diff --git a/test/braintrust/state_login_test.rb b/test/braintrust/state_login_test.rb new file mode 100644 index 00000000..4838d5c3 --- /dev/null +++ b/test/braintrust/state_login_test.rb @@ -0,0 +1,41 @@ +# frozen_string_literal: true + +require "test_helper" + +class Braintrust::StateLoginTest < Minitest::Test + def setup + @api_key = ENV["BRAINTRUST_API_KEY"] + assert @api_key, "BRAINTRUST_API_KEY environment variable is required for login tests" + end + + def teardown + Braintrust::State.instance_variable_set(:@global_state, nil) + end + + def test_login_fetches_org_info + state = Braintrust::State.new( + api_key: @api_key, + app_url: "https://www.braintrust.dev" + ) + + state.login + + assert state.logged_in + refute_nil state.org_id + refute_nil state.org_name + refute_nil state.api_url + end + + def test_login_with_invalid_api_key + state = Braintrust::State.new( + api_key: "invalid-key", + app_url: "https://www.braintrust.dev" + ) + + error = assert_raises(Braintrust::Error) do + state.login + end + + assert_match(/invalid api key/i, error.message) + end +end diff --git a/test/braintrust/state_test.rb b/test/braintrust/state_test.rb new file mode 100644 index 00000000..098c9ac9 --- /dev/null +++ b/test/braintrust/state_test.rb @@ -0,0 +1,73 @@ +# frozen_string_literal: true + +require "test_helper" + +class Braintrust::StateTest < Minitest::Test + def teardown + # Reset global state after each test + Braintrust::State.instance_variable_set(:@global_state, nil) + end + + def test_creates_state_with_required_fields + state = Braintrust::State.new( + api_key: "test-key", + default_parent: "project_name:test-project" + ) + + assert_equal "test-key", state.api_key + assert_equal "project_name:test-project", state.default_parent + end + + def test_validates_required_api_key + error = assert_raises(ArgumentError) do + Braintrust::State.new(default_parent: "project_name:test") + end + + assert_match(/api_key is required/, error.message) + end + + def test_global_state_getter_and_setter + state = Braintrust::State.new(api_key: "global-key") + + Braintrust::State.global = state + + assert_equal state, Braintrust::State.global + end + + def test_global_state_is_thread_safe + # Test that concurrent access doesn't cause race conditions + state1 = Braintrust::State.new(api_key: "key1") + state2 = Braintrust::State.new(api_key: "key2") + + threads = [] + errors = [] + + 100.times do + threads << Thread.new do + Braintrust::State.global = state1 + retrieved = Braintrust::State.global + # If not thread-safe, we might get nil or wrong state + errors << "Got nil" if retrieved.nil? + rescue => e + errors << e.message + end + + threads << Thread.new do + Braintrust::State.global = state2 + retrieved = Braintrust::State.global + errors << "Got nil" if retrieved.nil? + rescue => e + errors << e.message + end + end + + threads.each(&:join) + + # No errors should have occurred + assert_equal [], errors + + # Final state should be one of the two states (last set wins) + final_state = Braintrust::State.global + assert_includes ["key1", "key2"], final_state.api_key + end +end diff --git a/test/braintrust/trace/openai_test.rb b/test/braintrust/trace/openai_test.rb new file mode 100644 index 00000000..e7bd1605 --- /dev/null +++ b/test/braintrust/trace/openai_test.rb @@ -0,0 +1,89 @@ +# frozen_string_literal: true + +require "test_helper" + +class Braintrust::Trace::OpenAITest < Minitest::Test + def setup + @api_key = ENV["OPENAI_API_KEY"] + skip "OPENAI_API_KEY environment variable is required for OpenAI tests" unless @api_key + + @original_api_key = ENV["OPENAI_API_KEY"] + end + + def teardown + if @original_api_key + ENV["OPENAI_API_KEY"] = @original_api_key + else + ENV.delete("OPENAI_API_KEY") + end + end + + def test_wrap_creates_span_for_chat_completions + require "openai" + + # Set up test rig (includes Braintrust processor) + rig = setup_otel_test_rig + + # Create OpenAI client and wrap it with Braintrust tracing + client = OpenAI::Client.new(api_key: @api_key) + Braintrust::Trace::OpenAI.wrap(client, tracer_provider: rig.tracer_provider) + + # Make a simple chat completion request with additional params to test metadata capture + response = client.chat.completions.create( + messages: [ + {role: "system", content: "You are a test assistant."}, + {role: "user", content: "Say 'test'"} + ], + model: "gpt-4o-mini", + max_tokens: 10, + temperature: 0.5 + ) + + # Verify response + refute_nil response + refute_nil response.choices[0].message.content + + # Drain and verify span + span = rig.drain_one + + # Verify span name matches Go SDK + assert_equal "openai.chat.completions.create", span.name + + # Verify braintrust.input_json contains messages + assert span.attributes.key?("braintrust.input_json") + input = JSON.parse(span.attributes["braintrust.input_json"]) + assert_equal 2, input.length + assert_equal "system", input[0]["role"] + assert_equal "You are a test assistant.", input[0]["content"] + assert_equal "user", input[1]["role"] + assert_equal "Say 'test'", input[1]["content"] + + # Verify braintrust.output_json contains choices + assert span.attributes.key?("braintrust.output_json") + output = JSON.parse(span.attributes["braintrust.output_json"]) + assert_equal 1, output.length + assert_equal 0, output[0]["index"] + assert_equal "assistant", output[0]["message"]["role"] + refute_nil output[0]["message"]["content"] + refute_nil output[0]["finish_reason"] + + # Verify braintrust.metadata contains request and response metadata + assert span.attributes.key?("braintrust.metadata") + metadata = JSON.parse(span.attributes["braintrust.metadata"]) + assert_equal "openai", metadata["provider"] + assert_equal "/v1/chat/completions", metadata["endpoint"] + assert_equal "gpt-4o-mini", metadata["model"] + assert_equal 10, metadata["max_tokens"] + assert_equal 0.5, metadata["temperature"] + refute_nil metadata["id"] + refute_nil metadata["created"] + + # Verify braintrust.metrics contains token usage + assert span.attributes.key?("braintrust.metrics") + metrics = JSON.parse(span.attributes["braintrust.metrics"]) + assert metrics["prompt_tokens"] > 0 + assert metrics["completion_tokens"] > 0 + assert metrics["tokens"] > 0 + assert_equal metrics["prompt_tokens"] + metrics["completion_tokens"], metrics["tokens"] + end +end diff --git a/test/braintrust/trace/span_processor_test.rb b/test/braintrust/trace/span_processor_test.rb new file mode 100644 index 00000000..7c76cf80 --- /dev/null +++ b/test/braintrust/trace/span_processor_test.rb @@ -0,0 +1,161 @@ +# frozen_string_literal: true + +require "test_helper" +require "opentelemetry/sdk" + +class Braintrust::Trace::SpanProcessorTest < Minitest::Test + def setup + @state = get_test_state + end + + def test_adds_default_parent_if_missing + # Create a mock wrapped processor + wrapped = Minitest::Mock.new + wrapped.expect(:on_start, nil, [Object, Object]) + + processor = Braintrust::Trace::SpanProcessor.new(wrapped, @state) + + # Create a span + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + tracer = tracer_provider.tracer("test") + span = tracer.start_span("test-span") + + # Call on_start (note: OpenTelemetry Ruby passes span first, then context) + processor.on_start(span, OpenTelemetry::Context.empty) + + # Check that braintrust.parent was added + attributes = span.attributes + assert_equal "project_name:test-project", attributes["braintrust.parent"] + + wrapped.verify + end + + def test_preserves_existing_parent + # Create a mock wrapped processor + wrapped = Minitest::Mock.new + wrapped.expect(:on_start, nil, [Object, Object]) + + processor = Braintrust::Trace::SpanProcessor.new(wrapped, @state) + + # Create a span with existing parent + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + tracer = tracer_provider.tracer("test") + span = tracer.start_span("test-span") + span.set_attribute("braintrust.parent", "project_name:custom-project") + + # Call on_start (note: OpenTelemetry Ruby passes span first, then context) + processor.on_start(span, OpenTelemetry::Context.empty) + + # Check that existing parent was preserved + attributes = span.attributes + assert_equal "project_name:custom-project", attributes["braintrust.parent"] + + wrapped.verify + end + + def test_adds_org_attribute + # Create a mock wrapped processor + wrapped = Minitest::Mock.new + wrapped.expect(:on_start, nil, [Object, Object]) + + processor = Braintrust::Trace::SpanProcessor.new(wrapped, @state) + + # Create a span + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + tracer = tracer_provider.tracer("test") + span = tracer.start_span("test-span") + + # Call on_start (note: OpenTelemetry Ruby passes span first, then context) + processor.on_start(span, OpenTelemetry::Context.empty) + + # Check that org was added + attributes = span.attributes + assert_equal "test-org", attributes["braintrust.org"] + + wrapped.verify + end + + def test_adds_app_url_attribute + # Create a mock wrapped processor + wrapped = Minitest::Mock.new + wrapped.expect(:on_start, nil, [Object, Object]) + + processor = Braintrust::Trace::SpanProcessor.new(wrapped, @state) + + # Create a span + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + tracer = tracer_provider.tracer("test") + span = tracer.start_span("test-span") + + # Call on_start (note: OpenTelemetry Ruby passes span first, then context) + processor.on_start(span, OpenTelemetry::Context.empty) + + # Check that app_url was added + attributes = span.attributes + assert_equal "https://app.example.com", attributes["braintrust.app_url"] + + wrapped.verify + end + + def test_span_processor_enables_permalink_generation + # This test verifies that spans processed by SpanProcessor have all attributes needed for permalinks + # Create a mock wrapped processor + wrapped = Minitest::Mock.new + wrapped.expect(:on_start, nil, [Object, Object]) + + processor = Braintrust::Trace::SpanProcessor.new(wrapped, @state) + + # Create a span + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + tracer = tracer_provider.tracer("test") + span = tracer.start_span("test-span") + + # Call on_start to add Braintrust attributes + processor.on_start(span, OpenTelemetry::Context.empty) + + # Generate permalink - should not be empty since all required attributes are present + permalink = Braintrust::Trace.permalink(span) + + refute_empty permalink, "Permalink should be generated successfully for processed spans" + assert_includes permalink, "https://app.example.com/app/test-org/p/test-project/logs" + + wrapped.verify + end + + def test_inherits_parent_from_parent_span_context + # Set up otel test rig (includes Braintrust processor and state) + rig = setup_otel_test_rig + + tracer = rig.tracer("test") + + # Create parent span with experiment_id parent + # Note: SpanProcessor will add org and app_url automatically + parent_span = tracer.start_span("parent") + parent_span.set_attribute("braintrust.parent", "experiment_id:abc-123") + + # Create child span in parent context + OpenTelemetry::Trace.with_span(parent_span) do + child_span = tracer.start_span("child") + child_span.finish + end + + parent_span.finish + + # Drain spans + spans = rig.drain + assert_equal 2, spans.length + + parent_span_data = spans.find { |s| s.name == "parent" } + child_span_data = spans.find { |s| s.name == "child" } + + # Parent should have experiment_id (explicitly set) plus org and app_url (added by processor) + assert_equal "experiment_id:abc-123", parent_span_data.attributes["braintrust.parent"] + assert_equal rig.state.org_name, parent_span_data.attributes["braintrust.org"] + assert_equal rig.state.app_url, parent_span_data.attributes["braintrust.app_url"] + + # Child should inherit parent from parent span, and get org/app_url from state + assert_equal "experiment_id:abc-123", child_span_data.attributes["braintrust.parent"] + assert_equal rig.state.org_name, child_span_data.attributes["braintrust.org"] + assert_equal rig.state.app_url, child_span_data.attributes["braintrust.app_url"] + end +end diff --git a/test/braintrust/trace_test.rb b/test/braintrust/trace_test.rb new file mode 100644 index 00000000..9e1666ba --- /dev/null +++ b/test/braintrust/trace_test.rb @@ -0,0 +1,161 @@ +# frozen_string_literal: true + +require "test_helper" +require "opentelemetry/sdk" + +class Braintrust::TraceTest < Minitest::Test + def setup + # Clear global state before each test + Braintrust::State.global = nil + end + + def test_enable_raises_error_if_no_state_available + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + + error = assert_raises(Braintrust::Error) do + Braintrust::Trace.enable(tracer_provider) + end + + assert_match(/no state available/i, error.message) + end + + def test_enable_with_explicit_state + state = get_test_state + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + + # Should not raise + Braintrust::Trace.enable(tracer_provider, state: state) + + # Verify that a span processor was registered + refute_empty tracer_provider.instance_variable_get(:@span_processors) + end + + def test_enable_with_global_state + # Set global state + Braintrust::State.global = get_test_state(api_key: "global-key") + + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + + # Should not raise and use global state + Braintrust::Trace.enable(tracer_provider) + + # Verify that a span processor was registered + refute_empty tracer_provider.instance_variable_get(:@span_processors) + end + + def test_enable_adds_console_exporter_when_env_var_set + state = get_test_state + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + + # Set env var + ENV["BRAINTRUST_ENABLE_TRACE_CONSOLE_LOG"] = "true" + + begin + Braintrust::Trace.enable(tracer_provider, state: state) + + # Should have 2 processors: OTLP + Console + processors = tracer_provider.instance_variable_get(:@span_processors) + assert_equal 2, processors.length + ensure + # Clean up env var + ENV.delete("BRAINTRUST_ENABLE_TRACE_CONSOLE_LOG") + end + end + + def test_enable_creates_spans_with_braintrust_attributes + # Set up OpenTelemetry with memory exporter (includes Braintrust processor) + rig = setup_otel_test_rig + + # Create a span using the tracer helper + rig.tracer.in_span("test-operation") do |span| + span.set_attribute("custom.attribute", "custom-value") + end + + # Drain exactly one span (asserts count and returns the span) + span = rig.drain_one + + assert_equal "test-operation", span.name + assert_equal "custom-value", span.attributes["custom.attribute"] + + # Verify Braintrust attributes were added automatically + assert_equal "project_name:test-project", span.attributes["braintrust.parent"] + assert_equal "test-org", span.attributes["braintrust.org"] + assert_equal "https://app.example.com", span.attributes["braintrust.app_url"] + end + + def test_permalink_with_project_parent + # Set up OpenTelemetry with memory exporter (includes Braintrust processor) + rig = setup_otel_test_rig + + # Create a span + otel_span = nil + rig.tracer.in_span("test-operation") do |span| + otel_span = span + end + + # Generate permalink + link = Braintrust::Trace.permalink(otel_span) + + # Extract span details + span_data = rig.drain_one + trace_id = span_data.hex_trace_id + span_id = span_data.hex_span_id + + # Verify URL format for project parent + expected = "https://app.example.com/app/test-org/p/test-project/logs?r=#{trace_id}&s=#{span_id}" + assert_equal expected, link + end + + def test_permalink_with_experiment_parent + # Set up OpenTelemetry with memory exporter (includes Braintrust processor) + rig = setup_otel_test_rig(default_parent: "experiment_id:test-project/exp-123") + + # Create a span + otel_span = nil + rig.tracer.in_span("test-operation") do |span| + otel_span = span + end + + # Generate permalink + link = Braintrust::Trace.permalink(otel_span) + + # Extract span details + span_data = rig.drain_one + trace_id = span_data.hex_trace_id + span_id = span_data.hex_span_id + + # Verify URL format for experiment parent + expected = "https://app.example.com/app/test-org/p/test-project/experiments/exp-123?r=#{trace_id}&s=#{span_id}" + assert_equal expected, link + end + + def test_permalink_with_missing_attributes + # Set up OpenTelemetry WITHOUT Braintrust processor (to test missing attributes) + require "opentelemetry/sdk" + + exporter = OpenTelemetry::SDK::Trace::Export::InMemorySpanExporter.new + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + + # Add only a simple processor (no Braintrust processor) + span_processor = OpenTelemetry::SDK::Trace::Export::SimpleSpanProcessor.new(exporter) + tracer_provider.add_span_processor(span_processor) + + tracer = tracer_provider.tracer("test") + + # Create a span WITHOUT Braintrust attributes + otel_span = nil + tracer.in_span("test-operation") do |span| + otel_span = span + end + + # Should return empty string for missing attributes instead of raising + link = Braintrust::Trace.permalink(otel_span) + assert_equal "", link + end + + def test_permalink_with_nil_span + # Should return empty string for nil span instead of raising + link = Braintrust::Trace.permalink(nil) + assert_equal "", link + end +end diff --git a/test/braintrust_test.rb b/test/braintrust_test.rb new file mode 100644 index 00000000..25a6d898 --- /dev/null +++ b/test/braintrust_test.rb @@ -0,0 +1,53 @@ +# frozen_string_literal: true + +require "test_helper" + +class BraintrustTest < Minitest::Test + def setup + # Save original env var + @original_api_key = ENV["BRAINTRUST_API_KEY"] + end + + def teardown + # Reset global state after each test + Braintrust::State.instance_variable_set(:@global_state, nil) + + # Restore original env var + if @original_api_key + ENV["BRAINTRUST_API_KEY"] = @original_api_key + else + ENV.delete("BRAINTRUST_API_KEY") + end + end + + def test_init_sets_global_state_by_default + ENV["BRAINTRUST_API_KEY"] = "test-key" + + Braintrust.init + + state = Braintrust.current_state + assert_equal "test-key", state.api_key + end + + def test_init_with_set_global_false_returns_state + ENV["BRAINTRUST_API_KEY"] = "test-key" + + # Ensure global state is clean before test + Braintrust::State.instance_variable_set(:@global_state, nil) + + state = Braintrust.init(set_global: false) + + assert_equal "test-key", state.api_key + assert_nil Braintrust.current_state + end + + def test_init_merges_options_with_env + ENV["BRAINTRUST_API_KEY"] = "env-key" + + Braintrust.init(api_key: "explicit-key", default_parent: "project_name:my-project") + + state = Braintrust.current_state + assert_equal "explicit-key", state.api_key + assert_equal "project_name:my-project", state.default_parent + end +end diff --git a/test/test_helper.rb b/test/test_helper.rb index 423707b4..8e34ef4c 100644 --- a/test/test_helper.rb +++ b/test/test_helper.rb @@ -4,10 +4,96 @@ require "braintrust" require "minitest/autorun" -require "simplecov" +# Disabled SimpleCov for now - will re-enable later +# require "simplecov" +# +# SimpleCov.start do +# add_filter "/test/" +# enable_coverage :branch +# minimum_coverage 80 +# end -SimpleCov.start do - add_filter "/test/" - enable_coverage :branch - minimum_coverage 80 +# Test helpers for OpenTelemetry tracing +module TracingTestHelper + # Wrapper for OpenTelemetry test setup + class OtelTestRig + attr_reader :tracer_provider, :exporter, :state + + def initialize(tracer_provider, exporter, state) + @tracer_provider = tracer_provider + @exporter = exporter + @state = state + end + + # Get a tracer from the provider + # @param name [String] tracer name (default: "test") + # @return [OpenTelemetry::Trace::Tracer] + def tracer(name = "test") + @tracer_provider.tracer(name) + end + + # Flush and drain all spans from the exporter + # @return [Array] + def drain + @tracer_provider.force_flush + @exporter.finished_spans + end + + # Flush and drain exactly one span from the exporter + # Asserts that exactly one span was flushed + # @return [OpenTelemetry::SDK::Trace::SpanData] + def drain_one + spans = drain + raise Minitest::Assertion, "Expected exactly 1 span, got #{spans.length}" unless spans.length == 1 + spans.first + end + end + + # Creates a test State with sensible defaults and validates it + # Override any fields by passing options + # @return [Braintrust::State] + def get_test_state(**options) + defaults = { + api_key: "test-key", + api_url: "https://api.example.com", + app_url: "https://app.example.com", + org_name: "test-org", + default_parent: "project_name:test-project" + } + + state = Braintrust::State.new(**defaults.merge(options)) + state.validate + state + end + + # Sets up OpenTelemetry with an in-memory exporter for testing + # Returns an OtelTestRig with tracer_provider, exporter, state, and drain() method + # The exporter can be passed to Braintrust::Trace.enable to replace OTLP exporter + # @param state_options [Hash] Options to pass to get_test_state + # @return [OtelTestRig] + def setup_otel_test_rig(**state_options) + require "opentelemetry/sdk" + + exporter = OpenTelemetry::SDK::Trace::Export::InMemorySpanExporter.new + tracer_provider = OpenTelemetry::SDK::Trace::TracerProvider.new + state = get_test_state(**state_options) + + # Add Braintrust span processor (wraps simple processor with memory exporter) + simple_processor = OpenTelemetry::SDK::Trace::Export::SimpleSpanProcessor.new(exporter) + braintrust_processor = Braintrust::Trace::SpanProcessor.new(simple_processor, state) + tracer_provider.add_span_processor(braintrust_processor) + + OtelTestRig.new(tracer_provider, exporter, state) + end + + # Helper to run eval internally without API calls for testing + # Wraps the private run_internal method + def run_test_eval(**kwargs) + Braintrust::Eval.send(:run_internal, **kwargs) + end +end + +# Include helper in all test cases +class Minitest::Test + include TracingTestHelper end