Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
112 changes: 112 additions & 0 deletions .DONE.md
Original file line number Diff line number Diff line change
Expand Up @@ -256,3 +256,115 @@
- Task errors: Full stacktrace on task span, error message on eval span
- Scorer errors: Full stacktrace on score span with custom "ScorerError" type
- **Total: 72 test runs, 243 assertions, all passing, linter clean**

### Session 5 Completed (API Client + Datasets) ✅
- **API Client Foundation** (`lib/braintrust/api.rb`)
- Clean API class with memoized resource accessors
- Works with explicit state or global state
- Comprehensive tests (5 tests)
- **Datasets API** (`lib/braintrust/api/datasets.rb`)
- Complete implementation with 7 methods: `list`, `get`, `get_by_id`, `create`, `insert`, `fetch`, `permalink`
- Consolidated HTTP request logic into single `http_request()` function
- Debug logging with timing information (controlled by `BRAINTRUST_DEBUG`)
- BTQL-based record fetching with pagination support
- Permalink generation for Braintrust UI links
- Real integration tests (9 tests, not mocked)
- **Namespace Organization**
- Moved `api/auth.rb` → `api/internal/auth.rb` to avoid conflicts
- Updated references in `state.rb`
- **Test Infrastructure**
- Added `unique_name()` helper for parallel-safe tests
- Tests use `set_global: false` for thread safety
- Tests fail (not skip) when API key missing
- **Example** (`examples/api/dataset.rb`)
- Demonstrates create, insert, fetch, pagination, and permalinks
- Working end-to-end example with real API calls
- **Total: 86 test runs, 273 assertions, all passing, linter clean**

### Session 6 Completed (Dataset Integration + Auto-print Results) ✅
- **Dataset Integration** (Eval.run)
- Added `dataset:` parameter to Eval.run (string or hash)
- Support dataset by name (same project as experiment)
- Support dataset by name + explicit project
- Support dataset by ID
- Support dataset with limit and version options
- Auto-pagination (fetch all records by default)
- Validation: dataset and cases are mutually exclusive
- Comprehensive tests (8 tests covering all dataset features)
- **Auto-print Results**
- Added `quiet:` parameter to Eval.run (defaults to false)
- Updated Result#to_s to match Go SDK format
- Auto-print results via `puts result` unless quiet: true
- Format: Experiment name, ID, Link, Duration, Error count
- Updated all tests to use quiet: true
- Updated examples to rely on auto-printing
- **Example** (`examples/eval/dataset.rb`)
- Demonstrates dataset usage in Eval.run
- Shows all dataset resolution methods
- **Total: 99 test runs, 299 assertions, all passing, linter clean**

### Session 7 Completed (Remote Functions) ✅
- **API::Functions class** (`lib/braintrust/api/functions.rb`)
- `list(project_name:)` - List functions by project
- `create(project_name:, slug:, function_data:, prompt_data:)` - Create remote functions
- `invoke(id:, input:)` - Invoke functions server-side with input, returns output
- `delete(id:)` - Delete functions (for test cleanup)
- Proper separation of `function_data` and `prompt_data` parameters
- Automatic project ID resolution from project name
- Comprehensive integration tests (4 tests)
- **Eval::Functions module** (`lib/braintrust/eval/functions.rb`)
- `Functions.task(project:, slug:, state:)` - Get remote task callable for Eval.run
- `Functions.scorer(project:, slug:, state:)` - Get remote scorer for evaluations
- Full OpenTelemetry tracing with `type: "function"` spans
- Proper error handling and span status reporting
- Function metadata attributes (function.name, function.id, function.slug)
- Integration tests (4 tests covering task, scorer, and Eval.run integration)
- **State#login improvements**
- Made `State#login` idempotent (returns early if already logged in)
- Added automatic `state.login` in `Eval.run` to ensure org_name is populated
- Fixed experiment URL generation (no more double slashes)
- **Remote Scorer Support**
- LLM classifier with `parser.type: "llm_classifier"`
- Choice scores mapping (`choice_scores: {"correct" => 1.0, "incorrect" => 0.0}`)
- Chain-of-thought reasoning with `use_cot: true`
- **Example** (`examples/eval/remote_functions.rb`)
- Demonstrates creating remote task function (food classifier)
- Demonstrates creating remote scorer function with LLM classifier
- Shows usage of both in Eval.run
- Includes proper tracer provider setup and shutdown
- Documents benefits of remote functions
- **Total: 99 test runs, 299 assertions, all passing, linter clean**

### Session 8 Completed (Background Login with Retry) ✅
- **Background Login** (`State#login_in_thread`)
- Non-blocking async login in background thread (internal, not returned)
- Indefinite retry with exponential backoff: 1ms → 2ms → 4ms → ... → 5s max
- Thread-safe implementation with mutex protection
- Returns `self` immediately without blocking
- Gracefully handles network issues during SDK initialization
- **Thread-Safe Login** (`State#login`)
- Wrapped with mutex for concurrent access from multiple threads
- Idempotent (returns early if already logged in)
- Safe to call from multiple threads simultaneously
- **Braintrust.init Default Behavior**
- Now calls `login_in_thread` by default (async, non-blocking)
- Use `blocking_login: true` for synchronous login (needed for tracing examples)
- Updated documentation to reflect new default behavior
- **Test Helper** (`State#wait_for_login`)
- Added helper method for tests to wait for background login completion
- Accepts optional timeout parameter
- **Test Improvements**
- Added 6 comprehensive tests for background login functionality
- Removed flaky timing test (exponential backoff timing assertions)
- Updated all Braintrust.init tests to use `set_global: false` to avoid state pollution
- Added proper setup/teardown to reset tracer provider between tests
- Tests stable across different execution orders
- **Code Quality**
- Fixed StandardRB linter issues (private class methods)
- Moved `setup_tracing` to `class << self` block with proper `private`
- Changed "Created OpenTelemetry tracer provider" from stdout to debug log
- **Example Updates**
- Updated tracing examples to use `blocking_login: true` (trace.rb, openai.rb, internal/openai.rb)
- Fixed tracer_provider references to use `OpenTelemetry.tracer_provider`
- Removed unnecessary comments from init calls
- **Total: 109 test runs, 328 assertions, all passing, linter clean**
137 changes: 84 additions & 53 deletions .TODO.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,24 +15,28 @@

### Medium Priority

- [ ] **Kitchen-Sink Span Export Inconsistency**: Some eval runs show incomplete span export
- Affects: examples/internal/kitchen-sink.rb (8 cases, only 3-4 appear sometimes)
- Issue: BatchSpanProcessor may not flush all spans before shutdown
- Simple evals work fine (3 cases exported successfully)
- May need explicit `tracer_provider.force_flush()` before `shutdown()`
- May be timing-related with concurrent OpenAI API calls
- [x] **Kitchen-Sink Span Export Inconsistency**: ✅ RESOLVED (2025-10-22)
- Issue was timing-related with concurrent OpenAI API calls
- Now working correctly

### Low Priority

- [ ] **Parallelism Not Implemented**: Eval.run accepts parallelism parameter but doesn't use it
- Currently runs cases sequentially
- Need to implement parallel execution with threads or concurrent-ruby

- [ ] **Testing with/without OpenTelemetry**: Test SDK behavior with optional dependencies
- Test with OpenTelemetry installed (current default)
- Test without OpenTelemetry installed (graceful degradation)
- Test with `tracing: false` parameter
- Ensure API client, login, and non-tracing features work independently
- Consider making OpenTelemetry an optional dependency

## Pending Work

### Phase 2: Deferred Items
- [ ] Implement Braintrust.with_state (deferred - not needed yet)
- [ ] Implement State#login_until_success (deferred - background thread with retries)
- [x] Implement State#login_in_thread ✅ COMPLETE (2025-10-23) - background thread with retries

### Phase 3: Trace Utilities (Deferred)
- [ ] Write test: permalink generation
Expand Down Expand Up @@ -61,34 +65,78 @@
- [ ] Timeout configuration
- [ ] Rate limiting handling

### Phase 5: API Client (TDD)

#### lib/braintrust/api.rb
### Phase 5: API Client (TDD) - ✅ DATASETS COMPLETE

#### lib/braintrust/api.rb ✅
- [x] Write test: API with explicit state
- [x] Write test: API with global state
- [x] Write test: API#datasets returns Datasets instance
- [x] Implement API class with memoized resource accessors
- [x] Add unique_name() test helper for parallel-safe tests

#### lib/braintrust/api/datasets.rb ✅
- [x] Write test: Datasets#list with project_name
- [x] Write test: Datasets#get by project + name
- [x] Write test: Datasets#get_by_id
- [x] Write test: Datasets#create (idempotent)
- [x] Write test: Datasets#insert events
- [x] Write test: Datasets#fetch with pagination
- [x] Implement Datasets class with all methods
- [x] Implement list, get, get_by_id, create, insert, fetch, permalink
- [x] Implement consolidated http_request() function
- [x] Add debug logging with timing information
- [x] Create examples/api/dataset.rb

#### Deferred (API Projects/Experiments)
- [ ] Write test: register_project creates/fetches project
- [ ] Write test: register_experiment creates experiment
- [ ] Write test: register_experiment with update flag
- [ ] Write test: create_dataset creates dataset
- [ ] Write test: fetch_dataset fetches dataset
- [ ] Write test: insert_dataset_events inserts events
- [ ] Write test: API with explicit state
- [ ] Write test: API with global state
- [ ] Implement API class
- [ ] Implement register_project
- [ ] Implement register_experiment
- [ ] Implement create_dataset
- [ ] Implement fetch_dataset
- [ ] Implement insert_dataset_events
- [ ] Implement API::Projects
- [ ] Implement API::Experiments
- [ ] Move from Internal::Experiments to public API

### Phase 6: Evals - Remaining Items

#### lib/braintrust/eval.rb
- [ ] Implement parallel execution (parallelism parameter)

#### lib/braintrust/eval/dataset.rb
- [ ] Write test: Dataset enumerable
- [ ] Write test: Dataset from array
- [ ] Write test: Dataset from API
- [ ] Implement Dataset class
#### Auto-print Results ✅ COMPLETE (2025-10-23)
- [x] Add `quiet:` parameter to Eval.run (defaults to false)
- [x] Update Result#to_s to Go SDK format
- [x] Auto-print results via `puts result` unless quiet: true
- [x] Format: Experiment name, ID, Link, Duration, Error count
- [x] Updated all tests to use quiet: true
- [x] Updated examples to rely on auto-printing

#### Dataset Integration ✅ COMPLETE (2025-10-22)
- [x] Add `dataset:` parameter to Eval.run (string or hash)
- [x] Support dataset by name (same project as experiment)
- [x] Support dataset by name + explicit project
- [x] Support dataset by ID
- [x] Support dataset with limit option
- [x] Support dataset with version option
- [x] Auto-pagination (fetch all records by default)
- [x] Validation: dataset and cases are mutually exclusive
- [x] Tests for all dataset features
- [x] Example: examples/eval/dataset.rb

#### Remote Functions ✅ COMPLETE (2025-10-23)
- [x] Write test: API::Functions#list with project_name
- [x] Write test: API::Functions#create with function_data and prompt_data
- [x] Write test: API::Functions#invoke by ID
- [x] Write test: API::Functions#delete
- [x] Implement API::Functions class (lib/braintrust/api/functions.rb)
- [x] Write test: Functions.task returns callable
- [x] Write test: Functions.task invokes remote function
- [x] Write test: Functions.scorer returns Scorer
- [x] Write test: Use remote task in Eval.run
- [x] Implement Eval::Functions module (lib/braintrust/eval/functions.rb)
- [x] Add OpenTelemetry tracing for function invocations (type: "function")
- [x] Make State#login idempotent (returns early if already logged in)
- [x] Add automatic state.login in Eval.run to populate org_name
- [x] Create example: examples/eval/remote_functions.rb
- [x] Add remote scorer with LLM classifier and choice_scores
- [x] Tests for all remote function features (4 API tests, 4 Eval tests)

### Phase 7: Examples

Expand Down Expand Up @@ -118,32 +166,15 @@

## Current Status

**Last Updated**: 2025-10-22 (Session 4)
**Current Phase**: Phase 6 (Evals Framework) - ✅ MOSTLY COMPLETE (Error Handling ✅, Parallelism pending)
**Test Status**: 72 test runs, 243 assertions, all passing, linter clean

## Outstanding Issues Summary

**Session 4 Completed**:
- ✅ Error handling complete (task errors, scorer errors, stacktraces)
- ✅ All tests passing
- ⚠️ Kitchen-sink inconsistency (span export timing issue)

## Next Session Options

1. **Fix SSL Certificate Verification** (High Priority ⚠️)
- Security issue that needs resolution
- Investigate proper cert store configuration

2. **Fix Kitchen-Sink Span Export** (Medium Priority)
- Add explicit force_flush() before shutdown
- Test with larger eval runs

3. **Implement Parallelism** (Low Priority)
- Add parallel case execution to Eval.run
**Last Updated**: 2025-10-23 (Session 8)
**Current Phase**: Phase 2 - Background Login with Retry ✅ COMPLETE
**Test Status**: 109 test runs, 328 assertions, all passing, linter clean

4. **API Client** (Phase 5)
- Datasets API support
## Deferred Items

5. **OpenAI Advanced** (Phase 4.5)
- Streaming support
- API::Projects (move from Internal::Experiments)
- API::Experiments (move from Internal::Experiments)
- Eval.run integration with datasets
- Dataset examples
- Implement Parallelism (Eval.run parallelism parameter)
- OpenAI Advanced Features (streaming, embeddings, etc.)
64 changes: 64 additions & 0 deletions examples/api/dataset.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

# Example: Using the Braintrust Datasets API
#
# This example demonstrates:
# - Creating a dataset
# - Inserting records
# - Fetching records with pagination
# - Using the low-level API client

require_relative "../../lib/braintrust"

# Initialize Braintrust
Braintrust.init(blocking_login: true)

# Create API client
api = Braintrust::API.new

# Create a new dataset
puts "Creating dataset..."
response = api.datasets.create(
project_name: "ruby-sdk-examples",
name: "example-dataset-#{Time.now.to_i}",
description: "Example dataset created from Ruby SDK"
)

dataset_id = response["dataset"]["id"]
dataset_name = response["dataset"]["name"]
puts "Created dataset: #{dataset_name} (#{dataset_id})"
puts " Link: #{api.datasets.permalink(id: dataset_id)}"

# Insert some records
puts "\nInserting records..."
events = [
{input: "hello", expected: "HELLO"},
{input: "world", expected: "WORLD"},
{input: "foo", expected: "FOO"},
{input: "bar", expected: "BAR"}
]

api.datasets.insert(id: dataset_id, events: events)
puts "Inserted #{events.length} records"

# Fetch records back
puts "\nFetching records..."
result = api.datasets.fetch(id: dataset_id, limit: 10)

puts "Retrieved #{result[:records].length} records:"
result[:records].each do |record|
puts " - input: #{record["input"]}, expected: #{record["expected"]}"
end

# Fetch by project + name
puts "\nFetching dataset by name..."
metadata = api.datasets.get(project_name: "ruby-sdk-examples", name: dataset_name)
puts "Found dataset: #{metadata["name"]} (#{metadata["id"]})"

# List all datasets in project
puts "\nListing all datasets..."
list_result = api.datasets.list(project_name: "ruby-sdk-examples")
puts "Found #{list_result["objects"].length} datasets in project"

puts "\nDone!"
7 changes: 1 addition & 6 deletions examples/eval.rb
Original file line number Diff line number Diff line change
Expand Up @@ -15,12 +15,7 @@
# 5. Inspect the results
#
# Usage:
# BRAINTRUST_API_KEY=key bundle exec ruby examples/eval.rb

unless ENV["BRAINTRUST_API_KEY"]
puts "Error: BRAINTRUST_API_KEY environment variable is required"
exit 1
end
# bundle exec ruby examples/eval.rb

# Initialize Braintrust with blocking login
Braintrust.init(blocking_login: true)
Expand Down
Loading
Loading