braintrustdata · Matt Perpick (clutchski) · Oct 23, 2025
diff --git a/.DONE.md b/.DONE.md
@@ -256,3 +256,115 @@
 - Task errors: Full stacktrace on task span, error message on eval span
 - Scorer errors: Full stacktrace on score span with custom "ScorerError" type
 - **Total: 72 test runs, 243 assertions, all passing, linter clean**
+
+### Session 5 Completed (API Client + Datasets) ✅
+- **API Client Foundation** (`lib/braintrust/api.rb`)
+  - Clean API class with memoized resource accessors
+  - Works with explicit state or global state
+  - Comprehensive tests (5 tests)
+- **Datasets API** (`lib/braintrust/api/datasets.rb`)
+  - Complete implementation with 7 methods: `list`, `get`, `get_by_id`, `create`, `insert`, `fetch`, `permalink`
+  - Consolidated HTTP request logic into single `http_request()` function
+  - Debug logging with timing information (controlled by `BRAINTRUST_DEBUG`)
+  - BTQL-based record fetching with pagination support
+  - Permalink generation for Braintrust UI links
+  - Real integration tests (9 tests, not mocked)
+- **Namespace Organization**
+  - Moved `api/auth.rb` → `api/internal/auth.rb` to avoid conflicts
+  - Updated references in `state.rb`
+- **Test Infrastructure**
+  - Added `unique_name()` helper for parallel-safe tests
+  - Tests use `set_global: false` for thread safety
+  - Tests fail (not skip) when API key missing
+- **Example** (`examples/api/dataset.rb`)
+  - Demonstrates create, insert, fetch, pagination, and permalinks
+  - Working end-to-end example with real API calls
+- **Total: 86 test runs, 273 assertions, all passing, linter clean**
+
+### Session 6 Completed (Dataset Integration + Auto-print Results) ✅
+- **Dataset Integration** (Eval.run)
+  - Added `dataset:` parameter to Eval.run (string or hash)
+  - Support dataset by name (same project as experiment)
+  - Support dataset by name + explicit project
+  - Support dataset by ID
+  - Support dataset with limit and version options
+  - Auto-pagination (fetch all records by default)
+  - Validation: dataset and cases are mutually exclusive
+  - Comprehensive tests (8 tests covering all dataset features)
+- **Auto-print Results**
+  - Added `quiet:` parameter to Eval.run (defaults to false)
+  - Updated Result#to_s to match Go SDK format
+  - Auto-print results via `puts result` unless quiet: true
+  - Format: Experiment name, ID, Link, Duration, Error count
+  - Updated all tests to use quiet: true
+  - Updated examples to rely on auto-printing
+- **Example** (`examples/eval/dataset.rb`)
+  - Demonstrates dataset usage in Eval.run
+  - Shows all dataset resolution methods
+- **Total: 99 test runs, 299 assertions, all passing, linter clean**
+
+### Session 7 Completed (Remote Functions) ✅
+- **API::Functions class** (`lib/braintrust/api/functions.rb`)
+  - `list(project_name:)` - List functions by project
+  - `create(project_name:, slug:, function_data:, prompt_data:)` - Create remote functions
+  - `invoke(id:, input:)` - Invoke functions server-side with input, returns output
+  - `delete(id:)` - Delete functions (for test cleanup)
+  - Proper separation of `function_data` and `prompt_data` parameters
+  - Automatic project ID resolution from project name
+  - Comprehensive integration tests (4 tests)
+- **Eval::Functions module** (`lib/braintrust/eval/functions.rb`)
+  - `Functions.task(project:, slug:, state:)` - Get remote task callable for Eval.run
+  - `Functions.scorer(project:, slug:, state:)` - Get remote scorer for evaluations
+  - Full OpenTelemetry tracing with `type: "function"` spans
+  - Proper error handling and span status reporting
+  - Function metadata attributes (function.name, function.id, function.slug)
+  - Integration tests (4 tests covering task, scorer, and Eval.run integration)
+- **State#login improvements**
+  - Made `State#login` idempotent (returns early if already logged in)
+  - Added automatic `state.login` in `Eval.run` to ensure org_name is populated
+  - Fixed experiment URL generation (no more double slashes)
+- **Remote Scorer Support**
+  - LLM classifier with `parser.type: "llm_classifier"`
+  - Choice scores mapping (`choice_scores: {"correct" => 1.0, "incorrect" => 0.0}`)
+  - Chain-of-thought reasoning with `use_cot: true`
+- **Example** (`examples/eval/remote_functions.rb`)
+  - Demonstrates creating remote task function (food classifier)
+  - Demonstrates creating remote scorer function with LLM classifier
+  - Shows usage of both in Eval.run
+  - Includes proper tracer provider setup and shutdown
+  - Documents benefits of remote functions
+- **Total: 99 test runs, 299 assertions, all passing, linter clean**
+
+### Session 8 Completed (Background Login with Retry) ✅
+- **Background Login** (`State#login_in_thread`)
+  - Non-blocking async login in background thread (internal, not returned)
+  - Indefinite retry with exponential backoff: 1ms → 2ms → 4ms → ... → 5s max
+  - Thread-safe implementation with mutex protection
+  - Returns `self` immediately without blocking
+  - Gracefully handles network issues during SDK initialization
+- **Thread-Safe Login** (`State#login`)
+  - Wrapped with mutex for concurrent access from multiple threads
+  - Idempotent (returns early if already logged in)
+  - Safe to call from multiple threads simultaneously
+- **Braintrust.init Default Behavior**
+  - Now calls `login_in_thread` by default (async, non-blocking)
+  - Use `blocking_login: true` for synchronous login (needed for tracing examples)
+  - Updated documentation to reflect new default behavior
+- **Test Helper** (`State#wait_for_login`)
+  - Added helper method for tests to wait for background login completion
+  - Accepts optional timeout parameter
+- **Test Improvements**
+  - Added 6 comprehensive tests for background login functionality
+  - Removed flaky timing test (exponential backoff timing assertions)
+  - Updated all Braintrust.init tests to use `set_global: false` to avoid state pollution
+  - Added proper setup/teardown to reset tracer provider between tests
+  - Tests stable across different execution orders
+- **Code Quality**
+  - Fixed StandardRB linter issues (private class methods)
+  - Moved `setup_tracing` to `class << self` block with proper `private`
+  - Changed "Created OpenTelemetry tracer provider" from stdout to debug log
+- **Example Updates**
+  - Updated tracing examples to use `blocking_login: true` (trace.rb, openai.rb, internal/openai.rb)
+  - Fixed tracer_provider references to use `OpenTelemetry.tracer_provider`
+  - Removed unnecessary comments from init calls
+- **Total: 109 test runs, 328 assertions, all passing, linter clean**
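Note on Session 8: the background-login behavior described above (mutex-guarded idempotent login, a retry loop with exponential backoff from 1ms up to a 5s cap, and a wait helper for tests) can be sketched roughly as follows. This is a minimal illustration, not the SDK's `State` class: `LoginState`, its constructor block, and the constant names are hypothetical, and the block passed to `new` stands in for the real network login.

```ruby
# Hypothetical sketch of background login with retry; not the SDK's code.
class LoginState
  INITIAL_DELAY = 0.001 # 1ms first retry delay (per the changelog)
  MAX_DELAY = 5.0       # 5s cap (per the changelog)

  # The block is an assumed stand-in for the real login request.
  def initialize(&login_call)
    @login_call = login_call
    @mutex = Mutex.new
    @logged_in = false
    @thread = nil
  end

  def logged_in?
    @mutex.synchronize { @logged_in }
  end

  # Synchronous login: mutex-guarded and idempotent.
  def login
    @mutex.synchronize do
      return self if @logged_in
      @login_call.call
      @logged_in = true
    end
    self
  end

  # Starts at most one background thread that retries until login succeeds.
  # Returns self immediately; the thread itself is not exposed.
  def login_in_thread
    @mutex.synchronize do
      @thread ||= Thread.new do
        delay = INITIAL_DELAY
        begin
          login
        rescue StandardError
          sleep(delay)
          delay = [delay * 2, MAX_DELAY].min # double the delay, capped
          retry
        end
      end
    end
    self
  end

  # Test helper: waits for the background thread, with an optional timeout.
  def wait_for_login(timeout = nil)
    thread = @mutex.synchronize { @thread }
    thread&.join(timeout)
    logged_in?
  end
end

# Usage: a login that fails three times before succeeding.
attempts = 0
state = LoginState.new do
  attempts += 1
  raise IOError, "network down" if attempts < 4
end
state.login_in_thread
state.wait_for_login(2)
```

Capping the delay keeps an indefinite retry loop from backing off so far that recovery after an outage goes unnoticed for a long time.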
diff --git a/.TODO.md b/.TODO.md
@@ -15,24 +15,28 @@
 
 ### Medium Priority
 
-- [ ] **Kitchen-Sink Span Export Inconsistency**: Some eval runs show incomplete span export
-  - Affects: examples/internal/kitchen-sink.rb (8 cases, only 3-4 appear sometimes)
-  - Issue: BatchSpanProcessor may not flush all spans before shutdown
-  - Simple evals work fine (3 cases exported successfully)
-  - May need explicit `tracer_provider.force_flush()` before `shutdown()`
-  - May be timing-related with concurrent OpenAI API calls
+- [x] **Kitchen-Sink Span Export Inconsistency**: ✅ RESOLVED (2025-10-22)
+  - Issue was timing-related with concurrent OpenAI API calls
+  - Now working correctly
 
 ### Low Priority
 
 - [ ] **Parallelism Not Implemented**: Eval.run accepts parallelism parameter but doesn't use it
   - Currently runs cases sequentially
   - Need to implement parallel execution with threads or concurrent-ruby
 
+- [ ] **Testing with/without OpenTelemetry**: Test SDK behavior with optional dependencies
+  - Test with OpenTelemetry installed (current default)
+  - Test without OpenTelemetry installed (graceful degradation)
+  - Test with `tracing: false` parameter
+  - Ensure API client, login, and non-tracing features work independently
+  - Consider making OpenTelemetry an optional dependency
+
 ## Pending Work
 
 ### Phase 2: Deferred Items
 - [ ] Implement Braintrust.with_state (deferred - not needed yet)
-- [ ] Implement State#login_until_success (deferred - background thread with retries)
+- [x] Implement State#login_in_thread ✅ COMPLETE (2025-10-23) - background thread with retries
 
 ### Phase 3: Trace Utilities (Deferred)
 - [ ] Write test: permalink generation
@@ -61,34 +65,78 @@
 - [ ] Timeout configuration
 - [ ] Rate limiting handling
 
-### Phase 5: API Client (TDD)
-
-#### lib/braintrust/api.rb
+### Phase 5: API Client (TDD) - ✅ DATASETS COMPLETE
+
+#### lib/braintrust/api.rb ✅
+- [x] Write test: API with explicit state
+- [x] Write test: API with global state
+- [x] Write test: API#datasets returns Datasets instance
+- [x] Implement API class with memoized resource accessors
+- [x] Add unique_name() test helper for parallel-safe tests
+
+#### lib/braintrust/api/datasets.rb ✅
+- [x] Write test: Datasets#list with project_name
+- [x] Write test: Datasets#get by project + name
+- [x] Write test: Datasets#get_by_id
+- [x] Write test: Datasets#create (idempotent)
+- [x] Write test: Datasets#insert events
+- [x] Write test: Datasets#fetch with pagination
+- [x] Implement Datasets class with all methods
+- [x] Implement list, get, get_by_id, create, insert, fetch, permalink
+- [x] Implement consolidated http_request() function
+- [x] Add debug logging with timing information
+- [x] Create examples/api/dataset.rb
+
+#### Deferred (API Projects/Experiments)
 - [ ] Write test: register_project creates/fetches project
 - [ ] Write test: register_experiment creates experiment
 - [ ] Write test: register_experiment with update flag
-- [ ] Write test: create_dataset creates dataset
-- [ ] Write test: fetch_dataset fetches dataset
-- [ ] Write test: insert_dataset_events inserts events
-- [ ] Write test: API with explicit state
-- [ ] Write test: API with global state
-- [ ] Implement API class
-- [ ] Implement register_project
-- [ ] Implement register_experiment
-- [ ] Implement create_dataset
-- [ ] Implement fetch_dataset
-- [ ] Implement insert_dataset_events
+- [ ] Implement API::Projects
+- [ ] Implement API::Experiments
+- [ ] Move from Internal::Experiments to public API
 
 ### Phase 6: Evals - Remaining Items
 
 #### lib/braintrust/eval.rb
 - [ ] Implement parallel execution (parallelism parameter)
 
-#### lib/braintrust/eval/dataset.rb
-- [ ] Write test: Dataset enumerable
-- [ ] Write test: Dataset from array
-- [ ] Write test: Dataset from API
-- [ ] Implement Dataset class
+#### Auto-print Results ✅ COMPLETE (2025-10-23)
+- [x] Add `quiet:` parameter to Eval.run (defaults to false)
+- [x] Update Result#to_s to Go SDK format
+- [x] Auto-print results via `puts result` unless quiet: true
+- [x] Format: Experiment name, ID, Link, Duration, Error count
+- [x] Updated all tests to use quiet: true
+- [x] Updated examples to rely on auto-printing
+
+#### Dataset Integration ✅ COMPLETE (2025-10-22)
+- [x] Add `dataset:` parameter to Eval.run (string or hash)
+- [x] Support dataset by name (same project as experiment)
+- [x] Support dataset by name + explicit project
+- [x] Support dataset by ID
+- [x] Support dataset with limit option
+- [x] Support dataset with version option
+- [x] Auto-pagination (fetch all records by default)
+- [x] Validation: dataset and cases are mutually exclusive
+- [x] Tests for all dataset features
+- [x] Example: examples/eval/dataset.rb
+
+#### Remote Functions ✅ COMPLETE (2025-10-23)
+- [x] Write test: API::Functions#list with project_name
+- [x] Write test: API::Functions#create with function_data and prompt_data
+- [x] Write test: API::Functions#invoke by ID
+- [x] Write test: API::Functions#delete
+- [x] Implement API::Functions class (lib/braintrust/api/functions.rb)
+- [x] Write test: Functions.task returns callable
+- [x] Write test: Functions.task invokes remote function
+- [x] Write test: Functions.scorer returns Scorer
+- [x] Write test: Use remote task in Eval.run
+- [x] Implement Eval::Functions module (lib/braintrust/eval/functions.rb)
+- [x] Add OpenTelemetry tracing for function invocations (type: "function")
+- [x] Make State#login idempotent (returns early if already logged in)
+- [x] Add automatic state.login in Eval.run to populate org_name
+- [x] Create example: examples/eval/remote_functions.rb
+- [x] Add remote scorer with LLM classifier and choice_scores
+- [x] Tests for all remote function features (4 API tests, 4 Eval tests)
 
 ### Phase 7: Examples
 
@@ -118,32 +166,15 @@
 
 ## Current Status
 
-**Last Updated**: 2025-10-22 (Session 4)
-**Current Phase**: Phase 6 (Evals Framework) - ✅ MOSTLY COMPLETE (Error Handling ✅, Parallelism pending)
-**Test Status**: 72 test runs, 243 assertions, all passing, linter clean
-
-## Outstanding Issues Summary
-
-**Session 4 Completed**:
-- ✅ Error handling complete (task errors, scorer errors, stacktraces)
-- ✅ All tests passing
-- ⚠️ Kitchen-sink inconsistency (span export timing issue)
-
-## Next Session Options
-
-1. **Fix SSL Certificate Verification** (High Priority ⚠️)
-   - Security issue that needs resolution
-   - Investigate proper cert store configuration
-
-2. **Fix Kitchen-Sink Span Export** (Medium Priority)
-   - Add explicit force_flush() before shutdown
-   - Test with larger eval runs
-
-3. **Implement Parallelism** (Low Priority)
-   - Add parallel case execution to Eval.run
+**Last Updated**: 2025-10-23 (Session 8)
+**Current Phase**: Phase 2 - Background Login with Retry ✅ COMPLETE
+**Test Status**: 109 test runs, 328 assertions, all passing, linter clean
 
-4. **API Client** (Phase 5)
-   - Datasets API support
+## Deferred Items
 
-5. **OpenAI Advanced** (Phase 4.5)
-   - Streaming support
+- API::Projects (move from Internal::Experiments)
+- API::Experiments (move from Internal::Experiments)
+- Eval.run integration with datasets
+- Dataset examples
+- Implement Parallelism (Eval.run parallelism parameter)
+- OpenAI Advanced Features (streaming, embeddings, etc.)
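Note on the `dataset:` parameter: the checklist above says it accepts a string or a hash, resolves by name (defaulting to the experiment's project), by name plus explicit project, or by ID, and is mutually exclusive with `cases`. One way such normalization could look is sketched below. The helper name `normalize_dataset` and the hash keys (`:name`, `:project`, `:id`, `:limit`, `:version`) are assumptions for illustration, not the SDK's confirmed interface.

```ruby
# Hypothetical helper; key names are assumed, not taken from the SDK.
def normalize_dataset(dataset:, cases:, project:)
  if dataset && cases
    raise ArgumentError, "dataset and cases are mutually exclusive"
  end
  return nil if dataset.nil?

  case dataset
  when String
    # A bare name resolves within the experiment's project.
    {name: dataset, project: project}
  when Hash
    spec = dataset.transform_keys(&:to_sym)
    unless spec[:id] || spec[:name]
      raise ArgumentError, "dataset hash needs :id or :name"
    end
    # Only name-based lookups need a project; default to the experiment's.
    spec[:project] ||= project unless spec[:id]
    spec
  else
    raise ArgumentError, "dataset must be a String or Hash"
  end
end

by_name = normalize_dataset(dataset: "my-dataset", cases: nil, project: "my-project")
by_id = normalize_dataset(dataset: {id: "abc123", limit: 5}, cases: nil, project: "my-project")
```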
diff --git a/examples/api/dataset.rb b/examples/api/dataset.rb
@@ -0,0 +1,64 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+# Example: Using the Braintrust Datasets API
+#
+# This example demonstrates:
+# - Creating a dataset
+# - Inserting records
+# - Fetching records with pagination
+# - Using the low-level API client
+
+require_relative "../../lib/braintrust"
+
+# Initialize Braintrust
+Braintrust.init(blocking_login: true)
+
+# Create API client
+api = Braintrust::API.new
+
+# Create a new dataset
+puts "Creating dataset..."
+response = api.datasets.create(
+  project_name: "ruby-sdk-examples",
+  name: "example-dataset-#{Time.now.to_i}",
+  description: "Example dataset created from Ruby SDK"
+)
+
+dataset_id = response["dataset"]["id"]
+dataset_name = response["dataset"]["name"]
+puts "Created dataset: #{dataset_name} (#{dataset_id})"
+puts "  Link: #{api.datasets.permalink(id: dataset_id)}"
+
+# Insert some records
+puts "\nInserting records..."
+events = [
+  {input: "hello", expected: "HELLO"},
+  {input: "world", expected: "WORLD"},
+  {input: "foo", expected: "FOO"},
+  {input: "bar", expected: "BAR"}
+]
+
+api.datasets.insert(id: dataset_id, events: events)
+puts "Inserted #{events.length} records"
+
+# Fetch records back
+puts "\nFetching records..."
+result = api.datasets.fetch(id: dataset_id, limit: 10)
+
+puts "Retrieved #{result[:records].length} records:"
+result[:records].each do |record|
+  puts "  - input: #{record["input"]}, expected: #{record["expected"]}"
+end
+
+# Fetch by project + name
+puts "\nFetching dataset by name..."
+metadata = api.datasets.get(project_name: "ruby-sdk-examples", name: dataset_name)
+puts "Found dataset: #{metadata["name"]} (#{metadata["id"]})"
+
+# List all datasets in project
+puts "\nListing all datasets..."
+list_result = api.datasets.list(project_name: "ruby-sdk-examples")
+puts "Found #{list_result["objects"].length} datasets in project"
+
+puts "\nDone!"
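Note on pagination: the example above reads a single page via `limit:`. Fetching every record means looping until the server reports no further page. The sketch below shows that loop against an in-memory stand-in. The `:records` key matches the example above, but the `:cursor` key, the `fetch_all` helper, and the block signature are assumptions, not the Datasets API's confirmed shape.

```ruby
# Hypothetical auto-pagination loop. The block plays the role of one
# page request and must return {records: [...], cursor: next_or_nil}.
def fetch_all(page_size: 2)
  records = []
  cursor = nil
  loop do
    page = yield(cursor, page_size)
    records.concat(page[:records])
    cursor = page[:cursor]
    # Stop when there is no next page, or on an empty page as a safeguard.
    break if cursor.nil? || page[:records].empty?
  end
  records
end

# In-memory stand-in for a paginated endpoint (assumed cursor = offset).
rows = %w[hello world foo bar baz].map { |s| {"input" => s, "expected" => s.upcase} }

all_records = fetch_all(page_size: 2) do |cursor, limit|
  offset = cursor || 0
  slice = rows[offset, limit] || []
  next_offset = offset + slice.length
  {records: slice, cursor: (next_offset < rows.length ? next_offset : nil)}
end
```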
diff --git a/examples/eval.rb b/examples/eval.rb
@@ -15,12 +15,7 @@
 # 5. Inspect the results
 #
 # Usage:
-#   BRAINTRUST_API_KEY=key bundle exec ruby examples/eval.rb
-
-unless ENV["BRAINTRUST_API_KEY"]
-  puts "Error: BRAINTRUST_API_KEY environment variable is required"
-  exit 1
-end
+#   bundle exec ruby examples/eval.rb
 
 # Initialize Braintrust with blocking login
 Braintrust.init(blocking_login: true)