DataDrip is a Rails engine that provides a robust framework for running data backfills and migrations in your Rails application. It offers a web interface to monitor, schedule and manage batch processing jobs with built-in error handling and progress tracking.
- 🔄 Batch Processing: Process large datasets in configurable batches to avoid memory issues
- 📊 Web Interface: Monitor backfill progress with a built-in dashboard
- ⏰ Scheduled Execution: Schedule backfills to run at specific times
- 🛡️ Error Handling: Comprehensive error tracking and recovery
- 🔧 Flexible Processing: Choose between batch-level or element-level processing
- 📈 Progress Tracking: Real-time progress updates and batch monitoring
- 🎯 Scoped Processing: Define custom scopes for targeted data processing
Add DataDrip to your application's Gemfile:
```ruby
gem 'data_drip'
```

Then execute:
```bash
bundle install
```

Run the installation generator:
```bash
rails generate data_drip:install
```

This will:
- Mount the DataDrip engine at `/data_drip` in your routes (see the sketch after this list)
- Create the `app/backfills` directory
- Generate and run migrations for the `data_drip_backfill_runs` and `data_drip_backfill_run_batches` tables
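The generator wires up the route for you; the resulting entry in your routes file looks roughly like this (a sketch assuming the conventional `DataDrip::Engine` class name):

```ruby
# config/routes.rb
Rails.application.routes.draw do
  mount DataDrip::Engine, at: '/data_drip'
end
```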
- Ruby: >= 3.1.0
- Rails: >= 7.1
- Dependencies:
  - `importmap-rails` >= 1.2.1
  - `stimulus-rails` >= 1.0
  - `turbo-rails`
DataDrip can be configured to work with your application's authentication and authorization system. While the gem works out of the box with standard Rails conventions, you may need to customize these settings to match your application's architecture.
Create a DataDrip initializer file to customize the gem's behavior:
```ruby
# config/initializers/data_drip.rb
Rails.application.configure do
  # The model class that represents users who can run backfills
  DataDrip.backfiller_class = 'Access' # default: "::User"

  # The attribute on the backfiller model to display the user's name
  DataDrip.backfiller_name_attribute = 'first_name' # default: :name

  # The controller method that returns the current authenticated user
  DataDrip.current_backfiller_method = :current_access # default: :current_user

  # The base controller class that DataDrip controllers should inherit from
  DataDrip.base_controller_class = 'ApiController' # default: "::ApplicationController"

  # The class that handles lifecycle hooks for backfill runs and batches
  DataDrip.hooks_handler_class_name = 'HookHandler' # default: nil

  # Optional: Sleep time between batches in seconds (default: 0.1)
  DataDrip.sleep_time = 0.5
end
```

- `backfiller_class`: The ActiveRecord model that represents the user running backfills. This model is referenced in the `data_drip_backfill_runs` table to track who initiated each backfill.
- `backfiller_name_attribute`: The attribute on your backfiller model that contains the user's display name. It is used in the web interface to show who is running or has run backfills.
- `current_backfiller_method`: The method name that your base controller uses to get the currently authenticated user. DataDrip controllers call this method to determine who is creating backfills.
- `base_controller_class`: The controller class that DataDrip's controllers should inherit from. This is crucial for ensuring that DataDrip controllers have access to your application's authentication, authorization, and other controller concerns.
- `hooks_handler_class_name`: The name of the class that handles lifecycle hooks for backfill runs and batches. When configured, this class receives callbacks when backfills change status (e.g., `on_run_completed`, `on_batch_failed`). This is useful for sending notifications, tracking metrics, or integrating with external systems. See the Hooks section for more details.
This configuration is particularly useful when your application uses custom authentication systems, non-standard naming conventions, or when you need DataDrip to integrate with existing API controllers or admin interfaces.
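For instance, if your app authenticates through an `Access` model, a base controller along these lines lets DataDrip reuse that authentication (a minimal sketch; `ApiController`, `Access`, and `authenticate_access!` are illustrative names from your host app, not part of DataDrip):

```ruby
# app/controllers/api_controller.rb
class ApiController < ActionController::Base
  before_action :authenticate_access!

  # Exposed to DataDrip via current_backfiller_method = :current_access
  def current_access
    @current_access ||= Access.find_by(id: session[:access_id])
  end

  private

  def authenticate_access!
    head :unauthorized unless current_access
  end
end
```

With `base_controller_class = 'ApiController'`, every DataDrip controller runs the same `before_action` and resolves the current backfiller through `current_access`.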
DataDrip provides a powerful hooks system that allows you to respond to lifecycle events during backfill execution. Hooks are triggered when backfill runs or batches change status, enabling you to integrate with external systems, send notifications, track metrics, or perform any custom logic.
The recommended approach is to create a global hook handler class that responds to lifecycle events across all backfills.
First, configure the hook handler in your initializer:
```ruby
# config/initializers/data_drip.rb
DataDrip.hooks_handler_class_name = 'HookHandler'
```

Then create your hook handler class:
```ruby
# app/services/hook_handler.rb
class HookHandler
  # Run hooks - triggered when a BackfillRun changes status
  def self.on_run_pending(run)
    # Called when a run is created
  end

  def self.on_run_enqueued(run)
    # Called when a run is enqueued for execution
    # Example: Track metrics
    Metrics.increment('backfill.enqueued', tags: { backfill: run.backfill_class_name })
  end

  def self.on_run_running(run)
    # Called when a run starts processing
  end

  def self.on_run_completed(run)
    # Called when a run completes successfully
    # Example: Send notification
    SlackNotifier.notify("Backfill #{run.backfill_class_name} completed!")
  end

  def self.on_run_failed(run)
    # Called when a run fails
    # Example: Send error alert
    ErrorTracker.notify("Backfill failed: #{run.error_message}")
  end

  def self.on_run_stopped(run)
    # Called when a run is manually stopped
  end

  # Batch hooks - triggered when a BackfillRunBatch changes status
  def self.on_batch_pending(batch)
    # Called when a batch is created
  end

  def self.on_batch_enqueued(batch)
    # Called when a batch is enqueued
  end

  def self.on_batch_running(batch)
    # Called when a batch starts processing
  end

  def self.on_batch_completed(batch)
    # Called when a batch completes
    # Example: Update progress tracking
    ProgressTracker.update(batch.backfill_run_id, batch.id)
  end

  def self.on_batch_failed(batch)
    # Called when a batch fails
  end

  def self.on_batch_stopped(batch)
    # Called when a batch is stopped
  end
end
```

Note: You can implement any logic inside these hooks, such as sending Slack messages, tracking metrics in DataDog, updating external systems, or logging to custom services.
DataDrip provides hooks for both backfill runs (entire backfill execution) and batches (individual batch processing):
These hooks receive a `BackfillRun` object as a parameter:

| Hook | Triggered When |
|---|---|
| `on_run_pending` | Run is created |
| `on_run_enqueued` | Run is enqueued for execution |
| `on_run_running` | Run starts processing batches |
| `on_run_completed` | All batches complete successfully |
| `on_run_failed` | Run encounters an error |
| `on_run_stopped` | Run is manually stopped |
These hooks receive a `BackfillRunBatch` object as a parameter:

| Hook | Triggered When |
|---|---|
| `on_batch_pending` | Batch is created |
| `on_batch_enqueued` | Batch is enqueued for processing |
| `on_batch_running` | Batch starts processing records |
| `on_batch_completed` | Batch completes successfully |
| `on_batch_failed` | Batch encounters an error |
| `on_batch_stopped` | Batch is stopped |
You can also define hooks directly in your backfill classes for backfill-specific behavior. These hooks take precedence over the global handler:
```ruby
class SendWelcomeEmails < DataDrip::Backfill
  def scope
    User.where(welcome_email_sent: false)
  end

  def process_batch(batch)
    batch.each { |user| WelcomeMailer.send_welcome(user).deliver_later }
    batch.update_all(welcome_email_sent: true)
  end

  # Backfill-specific hook
  def self.on_run_completed(run)
    AdminMailer.backfill_summary(run).deliver_now
  end

  def self.on_batch_completed(batch)
    # Track progress for this specific backfill
    Rails.logger.info("Sent #{batch.batch_size} welcome emails")
  end
end
```

When a status change occurs, DataDrip checks for hooks in the following order:
1. Backfill class hooks - If the backfill class defines the hook method, it will be called
2. Global handler hooks - If the backfill class doesn't define the hook and a global handler is configured, the handler's method will be called
This allows you to provide default behavior in the global handler while still allowing individual backfills to override specific hooks when needed.
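As an illustration (the class names below are hypothetical), the global handler supplies the default notification while a single backfill overrides just one hook:

```ruby
class HookHandler
  def self.on_run_completed(run)
    # Default behaviour for every backfill
    SlackNotifier.notify("#{run.backfill_class_name} finished")
  end
end

class ArchiveOldOrders < DataDrip::Backfill
  def scope
    Order.where(archived: false).where('created_at < ?', 1.year.ago)
  end

  def process_batch(batch)
    batch.update_all(archived: true)
  end

  # Takes precedence over HookHandler.on_run_completed for this backfill only;
  # all other lifecycle events still fall through to the global handler.
  def self.on_run_completed(run)
    ReportMailer.archive_summary(run).deliver_later
  end
end
```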
Generate a new backfill:
```bash
rails generate data_drip:backfill FixUserEmails
```

You can add the `--sorbet` flag to use Sorbet in your backfills.
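For example, to generate the same backfill with Sorbet support:

```bash
rails generate data_drip:backfill FixUserEmails --sorbet
```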
This creates a backfill class in `app/backfills/fix_user_emails.rb`:

```ruby
class FixUserEmails < DataDrip::Backfill
  # Define the scope of records to process
  def scope
    User.where(email_verified: false)
  end

  # Option 1: Process entire batches (recommended for simple updates)
  def process_batch(batch)
    batch.update_all(email_verified: true)
  end

  # Option 2: Process individual elements (for complex logic)
  def process_element(element)
    element.update!(email_verified: true)
  end
end
```

You can make your backfills configurable by adding attributes that users can set when creating a backfill run. This allows for dynamic filtering and lets you customize your backfill runs:
```ruby
class AddRoleToEmployee < DataDrip::Backfill
  # Define configurable attributes
  attribute :age, :integer
  attribute :name, :string
  attribute :department, :string
  attribute :active_only, :boolean, default: true

  def scope
    scope = Employee.where(role: nil)

    # Now you can use the attributes to filter the scope dynamically
    scope = scope.where(age: age) if age.present?
    scope = scope.where(name: name) if name.present?
    scope = scope.where(department: department) if department.present?
    scope = scope.where(active: true) if active_only
    scope
  end

  def process_batch(batch)
    batch.update_all(role: 'intern')
  end
end
```

When creating a backfill run through the form in the UI, you will see form fields for each attribute, allowing you to customize the backfill's behavior without modifying code. This way, it's easy to create a backfill run for different scenarios with different scopes.
DataDrip supports various attribute types that automatically generate appropriate form fields:
- `:string` - Text input field
- `:integer` - Number input (whole numbers)
- `:decimal` / `:float` - Number input with decimal support
- `:boolean` - Checkbox
- `:date` - Date picker
- `:time` - Time picker
- `:datetime` - Date and time picker
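As a sketch, the (hypothetical) backfill below mixes several of these types; each attribute shows up as the corresponding form field when you create a run:

```ruby
class PurgeStaleSessions < DataDrip::Backfill
  attribute :last_seen_before, :date             # date picker
  attribute :guest_only, :boolean, default: true # checkbox
  attribute :provider, :string                   # text input

  def scope
    scope = Session.where('updated_at < ?', last_seen_before || 30.days.ago)
    scope = scope.where(user_id: nil) if guest_only
    scope = scope.where(provider: provider) if provider.present?
    scope
  end

  def process_batch(batch)
    batch.delete_all
  end
end
```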
Every backfill must inherit from `DataDrip::Backfill` and implement:

- `scope` - Returns an ActiveRecord relation of records to process
- Either `process_batch` OR `process_element` (not both):
  - `process_batch(batch)` - Processes entire batches at once (more efficient)
  - `process_element(element)` - Processes individual records (more flexible)
```ruby
class AddDefaultRoleToUsers < DataDrip::Backfill
  def scope
    User.where(role: nil)
  end

  def process_batch(batch)
    batch.update_all(role: 'member')
  end
end
```

Navigate to `/data_drip/backfill_runs` in your application to access the DataDrip dashboard, where you can:
- View all available backfills
- Create new backfill runs with configurable options
- Monitor progress and status
- View error messages and logs
- Stop running backfills
- Schedule backfills for future execution
When creating a new backfill run, the interface dynamically generates form fields based on the attributes defined in your backfill class, making it easy to customize each run without code changes.
Bug reports and pull requests are welcome on GitHub at https://github.com/factorialco/data_drip.
The gem is available as open source under the terms of the MIT License.