Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
58 changes: 31 additions & 27 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -123,8 +123,8 @@ All options can be set via environment variables or CLI flags. CLI flags take pr
|---------------------|----------|---------|-------------|
| `MIGRATION_RESOURCES` | `--resources`, `-r` | `all` | Comma-separated list of resources to migrate. Options: `all`, `ai_secrets`, `roles`, `groups`, `datasets`, `project_tags`, `span_iframes`, `functions`, `prompts`, `project_scores`, `experiments`, `logs`, `views` |
| `MIGRATION_PROJECTS` | `--projects`, `-p` | *(all projects)* | Comma-separated list of project names to migrate |
| `MIGRATION_CREATED_AFTER` | `--created-after` | *(none)* | Only migrate data created on or after this date (**inclusive**: `>=`). Format: `YYYY-MM-DD` or ISO-8601 |
| `MIGRATION_CREATED_BEFORE` | `--created-before` | *(none)* | Only migrate data created before this date (**exclusive**: `<`). Format: `YYYY-MM-DD` or ISO-8601 |
| `MIGRATION_CREATED_AFTER` | `--created-after` | *(none)* | Only applies to resources that support created-time filtering. Currently this affects project logs event streaming and experiment listing. Migrates items with `created >=` this value (**inclusive**). Format: `YYYY-MM-DD` or ISO-8601 |
| `MIGRATION_CREATED_BEFORE` | `--created-before` | *(none)* | Only applies to resources that support created-time filtering. Currently this affects project logs event streaming and experiment listing. Migrates items with `created <` this value (**exclusive**). Format: `YYYY-MM-DD` or ISO-8601 |

#### Logging

Expand All @@ -146,7 +146,7 @@ All options can be set via environment variables or CLI flags. CLI flags take pr
| `MIGRATION_BATCH_SIZE` | — | `100` | Number of resources to process per batch |
| `MIGRATION_RETRY_ATTEMPTS` | — | `3` | Number of retry attempts for failed operations (0 = no retries) |
| `MIGRATION_RETRY_DELAY` | — | `1.0` | Initial retry delay in seconds (exponential backoff) |
| `MIGRATION_MAX_CONCURRENT` | — | `10` | Maximum concurrent **projects** (bounded project-level parallelism) |
| `MIGRATION_MAX_CONCURRENT` | — | `1` | Maximum concurrent **projects** (bounded project-level parallelism) |
| `MIGRATION_CHECKPOINT_INTERVAL` | — | `50` | Write checkpoint every N successful operations |

#### Parallelization Tuning
Expand All @@ -160,6 +160,7 @@ All options can be set via environment variables or CLI flags. CLI flags take pr
#### Streaming Migration (Logs, Experiments, Datasets)

These settings control BTQL-based streaming for high-volume resources.
`MIGRATION_CREATED_AFTER` and `MIGRATION_CREATED_BEFORE` are not universal streaming filters: they currently affect project logs event migration and experiment selection, but not dataset event migration.

| Environment Variable | CLI Flag | Default | Description |
|---------------------|----------|---------|-------------|
Expand Down Expand Up @@ -403,45 +404,48 @@ On resume: skips 1-30 (done), resumes experiment 31 from saved `_pagination_key`

## Parallelization

The migration tool uses **three levels of parallelization** that work together to reduce migration time. All levels are enabled by default and are controlled by the three env vars described in [Parallelization Tuning](#parallelization-tuning).
The migration tool currently uses **two active levels of concurrency** plus pipelined event streaming. The env vars in [Parallelization Tuning](#parallelization-tuning) still matter, but the within-project resource-type DAG concurrency described below has not been implemented yet.

### How It Works

```
┌─────────────────────────────────────────────────────────────┐
│ Level 1: Inter-Project (MIGRATION_MAX_CONCURRENT=10)
│ Projects migrate concurrently (up to 10 at a time)
│ Level 1: Inter-Project (MIGRATION_MAX_CONCURRENT=1)
│ Projects migrate concurrently (up to 1 at a time by default)
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Level 2: Inter-Resource-Type (DAG scheduler) │ │
│ │ Within each project, independent resource types │ │
│ │ (e.g. Datasets + Tags + Iframes) run concurrently. │ │
│ │ Dependent types wait for prerequisites to finish. │ │
│ │ Level 2: Intra-Resource-Type │ │
│ │ Within a single resource type, some migrators can │ │
│ │ process multiple items concurrently │ │
│ │ (MIGRATION_MAX_CONCURRENT_RESOURCES). │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ Level 3: Intra-Resource-Type │ │ │
│ │ │ Within each type, individual items migrate │ │ │
│ │ │ concurrently (MIGRATION_MAX_CONCURRENT_RESOURCES) │ │ │
│ │ │ │ │ │
│ │ │ For streaming resources (logs/datasets/exps): │ │ │
│ │ │ - Multiple event streams run in parallel │ │ │
│ │ │ - Each stream prefetches the next page while │ │ │
│ │ │ inserting the current (STREAMING_PIPELINE) │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ Resource types themselves still run sequentially │ │
│ │ within each project. │ │
│ │ │ │
│ │ For streaming resources (logs/datasets/exps): │ │
│ │ - Each stream can prefetch the next page while │ │
│ │ inserting the current one │ │
│ │ - Dataset/experiment streams are grouped for fetch │ │
│ │ efficiency, not scheduled as independent parallel │ │
│ │ DAG tasks within a project │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```

### Level Details

**Inter-Project Concurrency** (`MIGRATION_MAX_CONCURRENT`, default 10)
Multiple projects migrate at the same time. Each project runs its own DAG-scheduled resource pipeline independently. This was the only parallelism the tool had before the current changes.

**Inter-Resource-Type Concurrency** (automatic, DAG-based)
Within each project, resource types are organized in a dependency graph. Types whose dependencies are satisfied start immediately instead of waiting in a fixed sequence. For example, Datasets, Project Tags, and Span Iframes all start at the same time because they have no mutual dependencies.
**Inter-Project Concurrency** (`MIGRATION_MAX_CONCURRENT`, default 1)
Multiple projects migrate at the same time. Each project runs independently, bounded by this semaphore.

**Intra-Resource-Type Concurrency** (`MIGRATION_MAX_CONCURRENT_RESOURCES`, default 5)
Within a single resource type, multiple individual items migrate concurrently. For example, if a project has 50 functions, up to 5 are created at once instead of one-by-one. For streaming resources (datasets, experiments), this also controls how many event streams run in parallel -- e.g. events for 5 datasets copy simultaneously.
Within a single resource type, migrators that use the base batch executor can process multiple individual items concurrently. For example, if a project has 50 functions, up to 5 may be created at once instead of one-by-one.

Resource types themselves are still processed sequentially within a project in the current implementation.

For streaming resources:
- `logs` is a single stream per project.
- `datasets` and `experiments` group multiple ids into one BTQL fetch stream for efficiency.
- `MIGRATION_MAX_CONCURRENT_RESOURCES` does not currently create multiple independent resource-type DAG tasks within a single project.

**Pipelined Event Streaming** (`MIGRATION_STREAMING_PIPELINE`, default true)
For each individual event stream (logs, dataset records, experiment events), the next BTQL page is prefetched from the source while the current page's batches are being inserted into the destination. This overlaps source reads with destination writes, reducing idle time for large migrations.
Expand All @@ -458,7 +462,7 @@ State mutations (ID mappings, checkpoint files) are protected by `asyncio.Lock`
|----------|---------------------|
| **Small migration** (<5 projects, <100 resources) | Defaults work well. No tuning needed. |
| **Many small projects** (50+ projects, small data) | Increase `MIGRATION_MAX_CONCURRENT=20` for more project-level parallelism. |
| **Few projects with many resources** (e.g. 500 experiments in one project) | Increase `MIGRATION_MAX_CONCURRENT_RESOURCES=10` for more intra-type parallelism. |
| **Few projects with many resources** (e.g. 500 experiments in one project) | `MIGRATION_MAX_CONCURRENT_RESOURCES` helps only for migrators that support per-item fanout. Streaming resources within one project still run mostly as a single grouped stream. |
| **Large event streams** (TB-scale logs) | Defaults are good. Pipeline is on by default. Consider increasing `MIGRATION_MAX_CONCURRENT_REQUESTS=40` if the API can handle it. |
| **Rate-limited API** (frequent 429s) | *Decrease* `MIGRATION_MAX_CONCURRENT_RESOURCES=2` and `MIGRATION_MAX_CONCURRENT_REQUESTS=10`. The tool handles 429s with backoff, but fewer concurrent requests reduces throttling. |
| **Debugging or sequential run** | Set `MIGRATION_MAX_CONCURRENT_RESOURCES=1` and `MIGRATION_STREAMING_PIPELINE=false` for deterministic, sequential execution. |
Expand Down
Loading
Loading