85 changes: 85 additions & 0 deletions STRATEGY.md
@@ -0,0 +1,85 @@
# Strategy: Historical Data & Differentiation for Tinygrid

## Executive Summary
To compete with Grid Status's paid offering ("pre-built historical database"), Tinygrid should adopt a **"Client-Side Data Warehousing"** approach. Instead of managing expensive database servers (Postgres/Snowflake), we will pre-process ERCOT historical data into highly optimized **Parquet** files stored in **Cloudflare R2** (zero egress fees). The `tinygrid` Python client will be updated to transparently fetch and query these files using **DuckDB**, offering performance comparable to paid APIs at a fraction of the cost.

## 1. Architecture: The "Serverless" Historical Database

### Storage Layer: S3-Compatible Object Storage (Parquet)
We will store historical data in an S3-compatible bucket (Cloudflare R2 recommended for free egress).

**Schema Structure:**
```text
s3://tinygrid-ercot/
├── real_time_spp/
│   └── year=2023/month=12/day=01/data.parquet
├── real_time_lmp_bus/
│   └── year=2023/month=12/day=01/part-001.parquet
├── dam_spp/
│   └── year=2023/month=12/day=01/data.parquet
└── ...
```

**Why Parquet?**
- **Compression**: typically around 10x smaller than the equivalent CSV.
- **Speed**: Columnar storage means fast reads for specific columns (e.g., just "LZ_HOUSTON" prices).
- **Queryability**: Directly queryable by DuckDB, Polars, and AWS Athena/BigQuery.
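
For example, the columnar layout lets a client read only the columns and zone it cares about. A minimal sketch, assuming the bucket layout above and the column names used elsewhere in this PR:

```python
import duckdb

# Column pruning sketch: only two columns of one zone are read, so DuckDB
# skips the rest of each file. Bucket path and column names are assumptions
# taken from the schema sketch above, not a final API.
duckdb.sql("INSTALL httpfs; LOAD httpfs;")  # needed for s3:// reads
df = duckdb.sql("""
    SELECT DeliveryDate, SettlementPointPrice
    FROM read_parquet('s3://tinygrid-ercot/real_time_spp/*/*/*/*.parquet',
                      hive_partitioning=1)
    WHERE SettlementPoint = 'LZ_HOUSTON'
""").df()
```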

### Access Layer: The `tinygrid` Client
The Python client becomes the query engine. We leverage **DuckDB** (a pip-installable, fast in-process OLAP SQL database) inside the library.

**Workflow:**
1. User calls `ercot.get_spp(start="2022-01-01", end="2022-01-05")`.
2. `tinygrid` detects that the request is historical (older than ~90 days or not available from the live API).
3. Instead of hitting ERCOT's slow Archive API (zips), it generates S3 URLs:
   - `s3://tinygrid-ercot/real_time_spp/year=2022/month=01/day=01/data.parquet`
   - ...
4. `tinygrid` uses DuckDB (or Pandas with pyarrow) to download and filter these files in parallel.
5. Data is returned as a Pandas DataFrame, matching the live API format.

**Cost to Us**: $0 compute (client pays), ~$5/month storage.
**Cost to User**: Free (public bucket) or Freemium (presigned URLs).
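
A rough sketch of steps 3–5 above, using a hypothetical helper name and the bucket layout shown earlier:

```python
import duckdb
import pandas as pd

def fetch_spp_history(start: str, end: str) -> pd.DataFrame:
    """Hypothetical helper: build one Parquet URL per day in the range,
    then let DuckDB download and read them; the result is a DataFrame."""
    duckdb.sql("INSTALL httpfs; LOAD httpfs;")  # required for s3:// reads
    paths = [
        "s3://tinygrid-ercot/real_time_spp/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/data.parquet"
        for d in pd.date_range(start, end, freq="D")
    ]
    return duckdb.read_parquet(paths).df()
```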

## 2. Implementation Plan

### Phase 1: Ingestion & Backfill (The "Factory")
We need a robust pipeline to scrape ERCOT and build the Parquet lake.

**Stack**: Python script running on GitHub Actions (scheduled daily) or a small VPS.
**Tasks**:
1. **Backfill**: Script to iterate 2011-2023, call `ERCOTArchive` (existing logic), download zips, convert to Parquet, upload to R2.
2. **Daily Cron**: Script to run daily at ~2 AM CST.
- Fetch "Yesterday's" data from ERCOT Public API (while it's still there).
- **Capture Ephemeral Data**: This is key. Capture data that ERCOT deletes (e.g., real-time fuel mix, outages) and save to Parquet. This builds our unique moat.
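
Both tasks end with the same conversion/upload step. A minimal sketch, where the R2 endpoint, credentials, and bucket name are placeholders (R2 is addressed through its S3-compatible API):

```python
import boto3
import pandas as pd

def publish_day(csv_path: str, dataset: str, date: pd.Timestamp) -> None:
    """Hypothetical step: one day's CSV (already pulled from ERCOT) ->
    partitioned Parquet -> upload to the R2 bucket."""
    df = pd.read_csv(csv_path)
    key = (
        f"{dataset}/year={date.year}/month={date.month:02d}/"
        f"day={date.day:02d}/data.parquet"
    )
    df.to_parquet("/tmp/data.parquet", engine="pyarrow", index=False)

    s3 = boto3.client(
        "s3",
        endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # placeholder
        aws_access_key_id="<r2-access-key>",      # placeholder
        aws_secret_access_key="<r2-secret-key>",  # placeholder
    )
    s3.upload_file("/tmp/data.parquet", "tinygrid-ercot", key)
```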

### Phase 2: Client Upgrade
Update `tinygrid` library to prefer the S3 Parquet lake over ERCOT's Archive API.

- Add dependency: `duckdb` or `pyarrow`.
- Implement a `ParquetArchive` class that exposes the same interface as `ERCOTArchive`.
- Add configuration: `ERCOT(use_cloud_archive=True)`.
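
Sketch of the proposed switch (pseudocode; the real client wiring and the existing `ERCOTArchive` import are simplified away):

```python
from tinygrid.ercot.parquet_archive import ParquetArchive  # module added in this PR

class ERCOT:  # simplified stand-in for the real client
    def __init__(self, use_cloud_archive: bool = False):
        # Prefer the S3 Parquet lake when enabled; otherwise keep using the
        # existing ERCOTArchive (zip downloads), which is omitted here.
        self.archive = (
            ParquetArchive(bucket_url="s3://tinygrid-ercot")
            if use_cloud_archive
            else None  # would be ERCOTArchive() in the real client
        )

    def get_spp(self, start, end):
        # Historical ranges are served from whichever archive backend is active.
        return self.archive.fetch_historical("/np6-905-cd/spp_node_zone_hub", start, end)
```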

### Phase 3: Differentiation Features

1. **In-Process SQL**: Allow users to run SQL on the data locally.
```python
ercot.query("SELECT avg(price) FROM 's3://.../spp.parquet' WHERE zone='LZ_HOUSTON'")
```
2. **Ephemeral Data Replay**: "See the grid as it was". Replay real-time conditions (SCED bindings, outages) that are lost in official archives.
3. **BigQuery/Snowflake Integration**: Since the data is in Parquet/S3, we can easily offer "External Tables" for enterprise users to mount our bucket into their warehouse.
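
One way the in-process SQL feature (item 1) could be backed, sketched with illustrative names: a single DuckDB connection with a view per dataset, so users query friendly table names instead of raw S3 globs.

```python
import duckdb
import pandas as pd

class LocalSQL:  # illustrative name, not the final API
    def __init__(self, bucket: str = "s3://tinygrid-ercot"):
        self.con = duckdb.connect()
        self.con.sql("INSTALL httpfs; LOAD httpfs;")
        # Expose the lake under a friendly table name
        self.con.sql(f"""
            CREATE VIEW real_time_spp AS
            SELECT * FROM read_parquet('{bucket}/real_time_spp/*/*/*/*.parquet',
                                       hive_partitioning=1)
        """)

    def query(self, sql: str) -> pd.DataFrame:
        return self.con.sql(sql).df()

# e.g. LocalSQL().query("SELECT avg(SettlementPointPrice) FROM real_time_spp")
```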

## 3. Comparison to Grid Status

| Feature | Grid Status (Paid) | Tinygrid (Proposed) |
| :--- | :--- | :--- |
| **Historical Data** | Hosted Database | S3 Parquet Lake |
| **Access Method** | REST API / Snowflake | Python Client (DuckDB) / S3 Direct |
| **Cost Model** | Contact Sales | Free / Low Fixed Cost |
| **Performance** | API Latency | Network Bandwidth + Local CPU (Fast) |
| **Ephemeral Data** | Yes | Yes (if we start capturing now) |
| **Bus-Level LMPs** | Since Dec 2023 | Since Dec 2023 (same constraint) |

## 4. Immediate Next Steps
1. **Proof of Concept**: Write `scripts/archive_to_parquet.py` to demonstrate downloading one day of archive, converting to Parquet, and querying with DuckDB.
2. **Storage Setup**: Set up Cloudflare R2 bucket.
3. **Client Integration**: Prototype `ParquetArchive` in `tinygrid`.
65 changes: 65 additions & 0 deletions scripts/archive_to_parquet.py
@@ -0,0 +1,65 @@
import pandas as pd
import duckdb
import shutil
from pathlib import Path

# Setup paths
DATA_DIR = Path("data_lake")
if DATA_DIR.exists():
    shutil.rmtree(DATA_DIR)
DATA_DIR.mkdir(parents=True)

print("=== 1. Simulating ERCOT Archive Download ===")
# Simulate downloading data for 3 days
dates = pd.date_range("2023-01-01", "2023-01-03", freq="D")

for date in dates:
    print(f"Processing {date.date()}...")
    # Create dummy SPP data
    df = pd.DataFrame({
        "DeliveryDate": [date.date()] * 24,
        "DeliveryHour": range(1, 25),
        "SettlementPoint": ["LZ_HOUSTON"] * 24,
        "SettlementPointPrice": [50.0 + i for i in range(24)],
        "DSTFlag": ["N"] * 24
    })

    # 2. Write to Parquet (Partitioned)
    # Structure: data_lake/spp/year=YYYY/month=MM/day=DD/data.parquet
    output_path = DATA_DIR / "spp" / f"year={date.year}" / f"month={date.month:02d}" / f"day={date.day:02d}"
    output_path.mkdir(parents=True, exist_ok=True)

    # Write parquet using pyarrow engine
    file_path = output_path / "data.parquet"
    df.to_parquet(file_path, engine="pyarrow", index=False)
    print(f"  Saved to {file_path}")

print("\n=== 2. Querying with DuckDB (Client-Side) ===")
print("Simulating tinygrid client fetching data for Jan 2-3...")

# DuckDB can query hive-partitioned parquet files directly using glob patterns
# In a real S3 scenario, we'd use 's3://bucket/spp/**/data.parquet'
# For local fs, we use the path.

query = """
SELECT
DeliveryDate,
avg(SettlementPointPrice) as avg_price
FROM read_parquet('data_lake/spp/*/*/*/data.parquet', hive_partitioning=1)
WHERE DeliveryDate >= '2023-01-02'
GROUP BY DeliveryDate
ORDER BY DeliveryDate
"""

con = duckdb.connect()
result = con.execute(query).df()
print("\nQuery Result:")
print(result)

print("\n=== 3. Benefits ===")
print(f"Total Lake Size: {sum(f.stat().st_size for f in DATA_DIR.rglob('*.parquet')) / 1024:.2f} KB (Tiny!)")
print("1. Zero Database Cost (Just S3 storage)")
print("2. Fast Queries (Columnar scan)")
print("3. Unified Interface (Client sees a DataFrame)")
118 changes: 118 additions & 0 deletions tinygrid/ercot/parquet_archive.py
@@ -0,0 +1,118 @@
"""Parquet-based historical archive access using DuckDB."""

from __future__ import annotations

import logging
from typing import TYPE_CHECKING, Any

import pandas as pd
from attrs import define, field

from ..constants.ercot import ERCOT_TIMEZONE

if TYPE_CHECKING:
from .client import ERCOTBase

logger = logging.getLogger(__name__)

# Default S3 bucket for Tinygrid's public archive
DEFAULT_ARCHIVE_BUCKET = "s3://tinygrid-ercot"


@define
class ParquetArchive:
"""Access historical data via S3 Parquet lake using DuckDB.

This acts as a drop-in replacement for ERCOTArchive but queries
pre-processed Parquet files in S3 instead of downloading zip files
from ERCOT's API.

Requires:
duckdb: for efficient S3 querying
fsspec: for S3 filesystem access (via duckdb or pandas)
"""

client: ERCOTBase | None = field(default=None)
bucket_url: str = field(default=DEFAULT_ARCHIVE_BUCKET)
use_duckdb: bool = field(default=True)

    def fetch_historical(
        self,
        endpoint: str,
        start: pd.Timestamp,
        end: pd.Timestamp,
        add_post_datetime: bool = False,
    ) -> pd.DataFrame:
        """Fetch historical data from the Parquet archive.

        Args:
            endpoint: API endpoint (e.g., "/np6-905-cd/spp_node_zone_hub")
            start: Start timestamp
            end: End timestamp
            add_post_datetime: If True, add postDatetime column (mocked or from parquet)

        Returns:
            DataFrame with all historical data
        """
        # Map endpoint to parquet path structure
        # Example: /np6-905-cd/spp_node_zone_hub -> real_time_spp/
        # This mapping needs to be robust. For now, we use a simple explicit map.
        dataset_path = self._map_endpoint_to_path(endpoint)
        if not dataset_path:
            logger.warning(f"No Parquet mapping found for {endpoint}")
            return pd.DataFrame()

        s3_path = f"{self.bucket_url}/{dataset_path}/*/*/*/*.parquet"

        # Format dates for query
        start_str = start.strftime("%Y-%m-%d")
        end_str = end.strftime("%Y-%m-%d")

        try:
            if self.use_duckdb:
                return self._query_duckdb(s3_path, start_str, end_str)
            else:
                return self._query_pandas(s3_path, start, end)
        except Exception as e:
            logger.warning(f"Failed to fetch from Parquet archive: {e}")
            return pd.DataFrame()

    def _query_duckdb(self, s3_path: str, start_date: str, end_date: str) -> pd.DataFrame:
        """Query S3 using DuckDB."""
        try:
            import duckdb
        except ImportError:
            logger.warning("duckdb not installed, falling back to empty result")
            return pd.DataFrame()

        # DuckDB query with hive partitioning.
        # We assume partition keys are year, month, day; DeliveryDate is inside the file.
        query = f"""
            SELECT *
            FROM read_parquet('{s3_path}', hive_partitioning=1)
            WHERE DeliveryDate >= '{start_date}'
              AND DeliveryDate <= '{end_date}'
        """

        # httpfs is required for s3:// paths; credentials (only needed if the
        # bucket is not public) would be configured here as well.
        duckdb.sql("INSTALL httpfs; LOAD httpfs;")

        return duckdb.sql(query).df()

    def _query_pandas(self, s3_path: str, start: pd.Timestamp, end: pd.Timestamp) -> pd.DataFrame:
        """Fallback using Pandas (reads all files, slow!)."""
        # This is inefficient without precise file listing, included only as fallback.
        # Real implementation should list keys in S3 matching the date range first.
        logger.warning("Pandas fallback for S3 Parquet is not fully implemented (inefficient)")
        return pd.DataFrame()

    def _map_endpoint_to_path(self, endpoint: str) -> str | None:
        """Map ERCOT endpoint to S3 prefix."""
        # Simple mapping for POC
        mapping = {
            "/np6-905-cd/spp_node_zone_hub": "real_time_spp",
            "/np4-190-cd/dam_stlmnt_pnt_prices": "dam_spp",
            "/np6-788-cd/lmp_node_zone_hub": "real_time_lmp_node",
        }
        return mapping.get(endpoint)