Merged (26 commits)
All 26 commits by serithemage, Apr 18, 2026:

- `4eaea13` docs(issues/3): add SPEC, PLAN, TASKS, RESEARCH for serverless-migrat…
- `2a1efd4` research(serverless-migration): add WA Serverless Lens design princip…
- `58be6b3` research(serverless-migration): add SnapStart/Aurora/DynamoDB/S3-Expr…
- `384c159` research(serverless-migration): extract 15 autoresearch insights with…
- `53ff40c` research(serverless-migration): extract openclaw architecture princip…
- `bbdc25e` research(serverless-migration): fix Stage A review issues — O6, §8/§1…
- `970d238` scaffold(serverless-migration-advisor): create skill + 13 reference s…
- `f4757ec` scaffold(serverless-migration-advisor): register in marketplace (#3)
- `faa9e47` docs(serverless-migration-advisor): tradeoffs-compute from AWS Docs +…
- `59232f4` docs(serverless-migration-advisor): tradeoffs-spot with placement sco…
- `43c6779` docs(serverless-migration-advisor): tradeoffs-data-layer for Aurora/D…
- `e07f588` docs(serverless-migration-advisor): tradeoffs-event-driven EventBridg…
- `6709134` docs(serverless-migration-advisor): serverless-lens 7 principles + 5 …
- `803a0a9` docs(serverless-migration-advisor): patterns-tier1-batch (3 patterns …
- `c10f798` docs(serverless-migration-advisor): patterns-tier2-api (4 patterns + …
- `f1976a6` docs(serverless-migration-advisor): patterns-tier3-monolith (warning …
- `1b029e2` docs(serverless-migration-advisor): patterns-tier3-data (warning + 3 …
- `b5d55d9` docs(serverless-migration-advisor): case-study-autoresearch (Tier 1 v…
- `6cae1b4` docs(serverless-migration-advisor): case-study-openclaw (Tier 2 valid…
- `266d6f5` docs(serverless-migration-advisor): source-insights consolidated (#1-…
- `7f537de` docs(serverless-migration-advisor): interview-bank with Phase 1/3 JSO…
- `7f1eee9` feat(serverless-migration-advisor): write 5-Phase SKILL.md body (#3)
- `b0ac407` docs(readme): list serverless-migration-advisor in Workflow skills (#3)
- `88240cf` docs(changelog): note serverless-migration-advisor skill (#3)
- `5bc89ae` docs(issues/3): check off PLAN implementation deliverables through St…
- `23fd467` docs(issues/3): check off Stage H verification results on PLAN (#3)
3 changes: 2 additions & 1 deletion .claude-plugin/marketplace.json
@@ -48,7 +48,8 @@
"skills": [
"./skills/git-workflow/SKILL.md",
"./skills/tidd/SKILL.md",
"./skills/intent-engineering/SKILL.md"
"./skills/intent-engineering/SKILL.md",
"./skills/serverless-migration-advisor/SKILL.md"
]
},
{
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- **serverless-migration-advisor** (workflow): Upstream advisor for migrating AWS always-on architectures to serverless+Spot patterns. 5-Phase interview → target-architecture mapping → staged checklist report. Mandatory AWS Docs citations, plus validated case references from serverless-autoresearch ($3.94/48 experiments) and serverless-openclaw ($1/month). Implementation how-to is delegated to `sagemaker-spot-training` and similar skills. (#3)

## [0.2.0-beta] - 2026-04-07

### Added
1 change: 1 addition & 0 deletions README.md
@@ -41,6 +41,7 @@ Git workflow and collaboration
| [git-workflow](plugins/workflow/skills/git-workflow) | Guide to Git branching strategy and commit conventions |
| [tidd](plugins/workflow/skills/tidd) | TiDD (Ticket Driven Development) - hook enforcing the No Ticket, No Commit principle |
| [intent-engineering](plugins/workflow/skills/intent-engineering) | Creates and manages Intent Documents (INTENT.md) - project intent documentation based on Why/What/Not/Learnings |
| [serverless-migration-advisor](plugins/workflow/skills/serverless-migration-advisor) | Upstream advisor for migrating AWS always-on architectures to serverless+Spot patterns. Evaluates tradeoffs, flags risks, and generates a staged migration plan, then delegates to implementation skills (sagemaker-spot-training, etc.). |

### Documentation

224 changes: 224 additions & 0 deletions issues/3-serverless-migration/HANDOFF.md
@@ -0,0 +1,224 @@
# Issue #3: Serverless Migration Skill — Handoff Document

> **To the agent picking this up:** this is a handoff from the
> `serverless-autoresearch` project. Read this end-to-end before writing any
> code. The source repository (see §3) contains the lessons that should power
> this skill — extract, don't reinvent. Write `SPEC.md` and `PLAN.md` in this
> directory once you've decided scope and architecture.

## 1. Goal

Build a Claude Code skill (or skill family) that helps users **migrate an
existing, always-on architecture to a serverless + Spot pattern on AWS**, with
particular emphasis on the lessons proven out in the source project.

The skill should let a user describe their current workload (batch training,
ETL job, long-running service, ML inference, etc.) and receive:

1. A feasibility assessment (is this workload a good serverless+Spot candidate?)
2. A concrete migration plan (target AWS services, cost/time estimate, risks)
3. Drop-in IaC / code patterns where applicable
4. Pitfalls to avoid, drawn from real experience

## 2. Why this skill

The source project (`serverless-autoresearch`) demonstrated that a workload
most people assume needs a dedicated H100 for 8 hours can actually run on
Spot for **229 seconds at $0.16**, matching the reference result. Same
technique transfers to many batch-style workloads. The lessons are concrete
and hard-won; putting them behind a skill lets other teams apply them without
re-learning the same failures.

## 3. Source project — what to mine

Repository: `/Users/dohyunjung/Workspace/roboco-io/research/serverless-autoresearch/`
(GitHub: `roboco-io/serverless-autoresearch`, main branch)

### 3.1 Primary sources (read these first)

| Path | What it contains | Why it matters |
|------|-----------------|----------------|
| `docs/insights.md` | 15 numbered lessons from 48 real Spot experiments | **Core knowledge base**. Every insight is battle-tested. Several directly contradict intuition (e.g. #3/#12 vs #13). |
| `docs/comparison-report.md` | Sequential-dedicated vs parallel-Spot architecture analysis | Migration narrative: before/after |
| `docs/spot-capacity-guide.md` | How to find Spot capacity by region, quota flow | First-mile problem for every Spot migration |
| `docs/gpu-cost-analysis.md` | P5/P6 pricing, per-experiment cost math | Concrete cost templates |
| `experiments/003-h100-comparison/results-summary.md` | Full multi-phase experiment report | End-to-end worked example |
| `experiments/001-baseline-l40s/`, `002-optimization-l40s/` | Earlier experiment reports | Shows the incremental path — useful for the skill's suggested migration stages |
| `README.md` | Project summary, 48-experiment program, $3.94 total | Headline numbers for the skill's "what's possible" pitch |

### 3.2 Code patterns (reusable snippets)

| Path | Pattern demonstrated |
|------|---------------------|
| `src/pipeline/batch_launcher.py` | Submitting N parallel SageMaker Spot jobs with `use_spot_instances=True`, `max_wait`, retry-safe structure |
| `src/pipeline/result_collector.py` | Polling multiple async SageMaker jobs, extracting metrics from CloudWatch logs |
| `src/pipeline/orchestrator.py` | Generation-loop coordination across parallel Spot jobs |
| `src/sagemaker/entry_point.py`, `train_wrapper.py` | Container entry point that works under SageMaker's S3 input-channel convention |
| `infrastructure/setup_iam.sh` | Minimal IAM role for SageMaker training (execution + S3 access) |
| `infrastructure/requirements-train.txt` | Pinned deps compatible with SageMaker PyTorch DLC 2.8.0 |
| `config.yaml.example` | Config template: profile, region, instance type, Spot flags, time budget |
| `Makefile` | `dry-run`, `run-single`, `run`, `cost` commands — the UX template the skill should replicate |
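The launcher pattern in `batch_launcher.py` rests on a small set of estimator arguments. A minimal sketch of just those kwargs (argument names follow the SageMaker Python SDK; the durations are illustrative placeholders, not values from the repo):

```python
def spot_job_kwargs(max_run_s: int = 1800, wait_margin_s: int = 600) -> dict:
    """Spot-related kwargs to pass to a SageMaker estimator.

    max_wait must be >= max_run; the margin between them is how long
    the job will sit waiting for Spot capacity before giving up.
    """
    return {
        "use_spot_instances": True,
        "max_run": max_run_s,                   # hard cap on training time
        "max_wait": max_run_s + wait_margin_s,  # training time + Spot wait
    }
```

Passing these into the estimator constructor is what makes the job retry-safe to resubmit: a capacity miss fails fast at `max_wait` rather than hanging.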

### 3.3 Diagrams (for reference / inspiration, don't copy verbatim)

- `docs/architecture.svg` — system architecture
- `docs/comparison-diagrams.svg` — sequential vs parallel visual

## 4. Core knowledge to surface

Distilled from the source project's `docs/insights.md`, the skill must teach
or apply at minimum these principles:

### 4.1 Spot-specific operational knowledge
- **Region/AZ matters more than instance size** — Spot placement scores
vary 1–9 across regions for the same SKU. Always check
`aws ec2 get-spot-placement-scores` before committing to a region.
- **Larger instances can be cheaper on Spot** — less demand. Don't assume
smaller = cheaper.
- **Quota is the first-mile problem** — g-family auto-approves in minutes,
p-family (H100, B200) needs days of lead time.
- **Spot interrupt risk is workload-length-dependent** — under ~30 min,
  interrupts are rare enough to ignore. Beyond that, interrupt-handling matters.
- **Billable ≠ wall clock** — SageMaker Spot only bills training time,
not Spot wait, not startup. This is the real cost model.
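The region-selection step above reduces to ranking the placement-score response. A sketch, assuming the `SpotPlacementScores` list shape returned by `aws ec2 get-spot-placement-scores` (each entry carrying a `Region` and a `Score`); the cutoff value is an arbitrary example:

```python
def rank_spot_regions(scores: list, min_score: int = 5) -> list:
    """Rank candidate regions by Spot placement score, best first.

    `scores` is the SpotPlacementScores list from the placement-scores
    API; regions scoring below `min_score` are dropped as poor Spot
    candidates for this SKU.
    """
    kept = [(s["Region"], s["Score"]) for s in scores if s["Score"] >= min_score]
    return sorted(kept, key=lambda rs: -rs[1])
```

For example, `rank_spot_regions([{"Region": "us-east-1", "Score": 3}, {"Region": "us-east-2", "Score": 9}])` keeps only `("us-east-2", 9)`.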

### 4.2 Cost & time patterns
- **HUGI (Hurry Up and Get Idle)** — only pay when compute is active;
let the platform release the instance the instant the job ends. Always-on
servers lose to HUGI on any bursty workload.
- **Parallelism cuts wall clock, not cost** — N-way parallel Spot is the
same billable time as sequential, at 1/N the wall clock.
- **Startup overhead matters at short durations** — SageMaker startup is
~3 min. For 5-min jobs, that's 60% overhead. Either amortize (batch more
work per job) or choose a platform with lower cold-start (Lambda, Fargate
Spot, Batch).
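The three cost/time patterns above compose into one back-of-envelope model. A sketch, assuming the SageMaker Spot billing rule stated earlier (startup and Spot wait are not billed); the hourly rate is a hypothetical placeholder, not a real AWS price:

```python
def spot_batch_model(n_jobs: int, job_min: float,
                     startup_min: float = 3.0, rate_per_hr: float = 2.0) -> dict:
    """Rough model for N identical jobs launched N-way parallel on Spot."""
    billable_min = n_jobs * job_min           # same as running sequentially
    wall_clock_min = startup_min + job_min    # all jobs run at once
    startup_overhead = startup_min / job_min  # why short jobs hurt
    return {
        "billable_min": billable_min,
        "wall_clock_min": wall_clock_min,
        "startup_overhead": startup_overhead,
        "cost_usd": round(billable_min / 60 * rate_per_hr, 2),
    }
```

With a 3-min startup and 5-min jobs, `startup_overhead` comes out at 0.6, matching the 60% figure above, while wall clock stays at 8 minutes regardless of N.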

### 4.3 Migration-specific pitfalls (from experience)
- **Hyperparameters and batch size interact across hardware** (insight
#13). When migrating a workload, don't assume "same config" — LR-like
settings that depended on the old hardware's memory limits may need to
be re-examined after the move.
- **Cheap-GPU-evolved configs often transfer to expensive GPUs** (insight
#14). For iteration-heavy workloads, run exploration on cheap Spot, then
promote the winner. Don't start on the expensive tier.
- **Transient CUDA / Spot errors happen** at ~5% rate on H100 Spot. Any
migration plan needs automatic retry baked in; one failure out of 20 is
normal, not a signal to abandon Spot.
- **Kernel/accelerator support varies by GPU** (FA3 sm_90 only, etc.). A
serverless migration across GPU types needs a fallback path.
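The "automatic retry baked in" requirement is a plain backoff wrapper around the job launch. A minimal sketch, with illustrative names and delays (not taken from the source repo); at a ~5% transient failure rate, 2-3 attempts make end-to-end failure negligible:

```python
import time

def run_with_retry(launch_job, max_attempts: int = 3, base_delay_s: float = 1.0):
    """Retry a Spot/CUDA-prone job launch with exponential backoff.

    `launch_job` is any zero-argument callable that raises on a
    transient failure and returns normally on success.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return launch_job()
        except RuntimeError:
            if attempt == max_attempts:
                raise  # persistent failure: surface it, don't loop forever
            time.sleep(base_delay_s * 2 ** (attempt - 1))
```

The key design point is that a single failure resubmits silently; only repeated failure is surfaced to the operator, matching "one failure out of 20 is normal, not a signal to abandon Spot".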

### 4.4 The concrete achievement to cite

The source project reproduced Karpathy's upstream autoresearch result
(val_bpb ~0.998) to **val_bpb = 0.9951 on a single 229-second H100 Spot
run costing $0.16**, versus the upstream's estimated **$7–24 over 8 hours**.
Total program: 48 experiments across L40S + H100 for **$3.94**. These are
the numbers the skill should be able to quote as the "what's actually
possible" anchor.

## 5. Suggested scope (the agent decides)

The ask — "help migrate any architecture to serverless" — is broad. A
plausible scoping, for the agent to accept, refine, or reject:

**Tier 1 (must-have):** batch/training/ETL workloads → Spot + S3 + managed
service (SageMaker Training / AWS Batch / Fargate Spot). This is where the
source project's experience is deepest.

**Tier 2 (extend if time permits):** always-on API → Lambda / API Gateway
or ECS Fargate. Source project doesn't cover this directly but HUGI
principles apply.

**Tier 3 (probably out of scope):** full legacy monolith decomposition,
data layer migration (RDS → DynamoDB), event-driven re-architecture.
These are bigger than a single skill.

Reasonable options for skill structure:

- **Option A — single skill** `serverless-migration-advisor`. One SKILL.md
with workload-type branches internally. Simpler to maintain, harder to
load selectively.
- **Option B — orchestrator + sub-skills**, like the `aws-well-architected`
family (see `issues/2-aws-well-architected-review/SPEC.md`). Main skill
routes to `serverless-migration-batch`, `serverless-migration-api`, etc.
Matches the existing plugin repo pattern.
- **Option C — complement an existing skill**. The source project's
CLAUDE.md mentions a `sagemaker-spot-training` skill at
`github.com/roboco-io/claude-skills`. If that skill already covers the
training-workload case, this new skill should focus on migration framing
(before-state analysis, service selection, cost/risk call-out) rather
than reimplement SageMaker-specific how-to.

**Recommendation:** investigate Option C first. If the existing skill is
hands-on-how-to, this new skill should be the upstream migration advisor
that eventually *delegates* to it. Don't duplicate.

Category placement in the plugin repo: either `development/` or
`workflow/`. `workflow/` fits "migration process" framing better.

## 6. Non-goals

- **Not a cost calculator** — skill should cite cost *patterns and ranges*
from real experiments, not compute live AWS prices.
- **Not a Terraform generator** — emit *snippets and structural guidance*,
not a full IaC suite.
- **Not a multi-cloud story** — AWS-only, matching source project scope.
- **Not silent about limits** — must surface the source project's actual
failure modes (CUDA errors, quota delays, FA3 compatibility, etc.),
not only the wins.

## 7. Success criteria

A user arrives with "I have a nightly batch job running on an EC2 H100 that
costs $N/month and I want it cheaper." The skill should, within one
interaction:

1. Classify the workload (batch-training / batch-ETL / inference / long-running)
2. Identify the target serverless pattern (SageMaker Spot / Batch / Fargate Spot / Lambda)
3. Estimate cost savings range (citing source-project precedents where relevant)
4. Flag top 3 risks specific to that workload type (Spot interrupts, cold start, quota, etc.)
5. Produce a staged migration plan: validation-on-cheap-Spot → production-on-target
6. Cite the source project's insights (numbered) where the advice originates — traceable, not black-box.

A secondary marker: if the user has an IaC file (Terraform, CDK, CloudFormation),
the skill can read it and produce a *diff-style* migration sketch rather than
starting from a blank page.
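Criteria 1-2 amount to a classify-then-map lookup. A sketch of the mapping this document implies; the table below is distilled from §4-§5 as an advisory default, not a published decision tree:

```python
# Workload class -> candidate serverless target(s), per the tiers above.
TARGET_PATTERNS = {
    "batch-training": ["SageMaker Training (Spot)"],
    "batch-etl":      ["AWS Batch", "Fargate Spot"],
    "inference":      ["SageMaker Spot", "Lambda"],
    "long-running":   ["Lambda / API Gateway", "ECS Fargate"],
}

def suggest_targets(workload_class: str) -> list:
    """Return candidate target patterns, or flag for manual review."""
    return TARGET_PATTERNS.get(workload_class, ["needs manual review"])
```

An unknown class falls through to manual review rather than a guess, which keeps the advisor honest about the Tier 3 out-of-scope cases.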

## 8. Implementation constraints

Follow the plugin repo's conventions (`plugins/tools/plugins/CLAUDE.md`):

- YAML frontmatter + Markdown in every SKILL.md
- Progressive disclosure via `references/` subdirectory
- Update `.claude-plugin/marketplace.json` `plugins` array when adding the plugin
- Keep SKILL.md under ~500 lines; offload detail to `references/`
- Add the plugin under `plugins/{category}/.claude-plugin/plugin.json`

## 9. Deliverables (for the agent)

In this directory (`issues/3-serverless-migration/`):

1. **`SPEC.md`** — your refined scope, architecture, skill structure,
interview/output contracts. Follow the style of `issues/2-aws-well-architected-review/SPEC.md`.
2. **`PLAN.md`** — implementation steps, ordered, with rough effort estimates.
3. (optional) **`RESEARCH.md`** — your investigation of the existing
`sagemaker-spot-training` skill (referenced in §5 Option C) and how
this new skill should complement or supersede it.

Then proceed with implementation in `plugins/{chosen-category}/skills/...`.

## 10. Open questions for the agent to resolve

- Does `github.com/roboco-io/claude-skills` have a live `sagemaker-spot-training`
skill? If yes, what's its exact scope? (Drives §5 Option A vs C.)
- Should the skill interview the user (like `aws-well-architected`) or work
from a single declarative input (paste your Terraform)? Probably both.
- How to represent "staged migration" — as a checklist skill output? A
generated issue tree? A Terraform-diff format?
- Where do the numbered source-project insights live in the skill? Inline
citations in output, or an appended `references/source-insights.md`?

---

*Handoff prepared 2026-04-18 from serverless-autoresearch commit `5435b37`.*
*Source project contact: `/Users/dohyunjung/Workspace/roboco-io/research/serverless-autoresearch/`.*