diff --git a/STRATEGY.md b/STRATEGY.md
new file mode 100644
index 0000000..1065c11
--- /dev/null
+++ b/STRATEGY.md
@@ -0,0 +1,85 @@
+# Strategy: Historical Data & Differentiation for Tinygrid
+
+## Executive Summary
+To compete with Grid Status's paid offering ("pre-built historical database"), Tinygrid should adopt a **"Client-Side Data Warehousing"** approach. Instead of managing expensive database servers (Postgres/Snowflake), we will pre-process ERCOT historical data into highly optimized **Parquet** files stored in **Cloudflare R2** (zero egress fees). The `tinygrid` Python client will be updated to transparently fetch and query these files using **DuckDB**, offering performance comparable to paid APIs at a fraction of the cost.
+
+## 1. Architecture: The "Serverless" Historical Database
+
+### Storage Layer: S3-Compatible Object Storage (Parquet)
+We will store historical data in an S3-compatible bucket (Cloudflare R2 recommended for free egress).
+
+**Bucket Layout:**
+```text
+s3://tinygrid-ercot/
+  ├── real_time_spp/
+  │     └── year=2023/month=12/day=01/data.parquet
+  ├── real_time_lmp_bus/
+  │     └── year=2023/month=12/day=01/part-001.parquet
+  ├── dam_spp/
+  │     └── year=2023/month=12/day=01/data.parquet
+  └── ...
+```
+
+**Why Parquet?**
+- **Compression**: typically around 10x smaller than the equivalent CSV.
+- **Speed**: Columnar storage lets readers scan only the columns they need, and row-group statistics let engines skip irrelevant data (e.g., pulling just the "LZ_HOUSTON" prices).
+- **Queryability**: Directly queryable by DuckDB, Polars, and AWS Athena/BigQuery.
+
+### Access Layer: The `tinygrid` Client
+The Python client becomes the query engine. We leverage **DuckDB** (pip-installable, fast in-process SQL OLAP database) inside the library.
+
+**Workflow:**
+1. User calls `ercot.get_spp(start="2022-01-01", end="2022-01-05")`.
+2. `tinygrid` detects that the range is "historical" (older than ~90 days, or otherwise unavailable from the live API).
+3. Instead of hitting ERCOT's slow Archive API (zips), it generates S3 URLs:
+   - `s3://tinygrid-ercot/real_time_spp/year=2022/month=01/day=01/data.parquet`
+   - ...
+4. `tinygrid` uses DuckDB (or Pandas with pyarrow) to download and filter these files in parallel.
+5. Data is returned as a Pandas DataFrame, matching the live API format.
+
+**Cost to Us**: $0 compute (the client pays), ~$5/month storage.
+**Cost to User**: Free (public bucket) or freemium (presigned URLs).
+
+## 2. Implementation Plan
+
+### Phase 1: Ingestion & Backfill (The "Factory")
+We need a robust pipeline to scrape ERCOT and build the Parquet lake.
+
+**Stack**: Python script running on GitHub Actions (scheduled daily) or a small VPS.
+**Tasks**:
+1. **Backfill**: Script to iterate over 2011-2023, call `ERCOTArchive` (existing logic), download zips, convert to Parquet, upload to R2.
+2. **Daily Cron**: Script to run daily at ~2 AM Central time.
+   - Fetch yesterday's data from the ERCOT Public API (while it is still available).
+   - **Capture Ephemeral Data**: This is key. Capture data that ERCOT deletes (e.g., real-time fuel mix, outages) and save it to Parquet. This builds our unique moat.
+
+### Phase 2: Client Upgrade
+Update the `tinygrid` library to prefer the S3 Parquet lake over ERCOT's Archive API; a sketch of the routing logic follows the list below.
+
+- Add dependency: `duckdb` or `pyarrow`.
+- Implement a `ParquetArchive` class exposing the same interface as `ERCOTArchive`.
+- Add configuration: `ERCOT(use_cloud_archive=True)`.
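+
+As a rough sketch (the names `use_cloud_archive` and `fetch_spp_live` are illustrative, not existing tinygrid API), the routing could look like this:
+
+```python
+import pandas as pd
+
+from tinygrid.ercot.parquet_archive import ParquetArchive  # added in this PR
+
+HISTORICAL_CUTOFF = pd.Timedelta(days=90)
+
+
+def get_spp(start, end, use_cloud_archive=True):
+    """Proposed routing; `fetch_spp_live` stands in for today's live-API code path."""
+    start, end = pd.Timestamp(start), pd.Timestamp(end)
+    if use_cloud_archive and start < pd.Timestamp.now() - HISTORICAL_CUTOFF:
+        # Historical range: read straight from the public Parquet lake.
+        return ParquetArchive().fetch_historical("/np6-905-cd/spp_node_zone_hub", start, end)
+    # Recent range: keep hitting ERCOT's live API exactly as today.
+    return fetch_spp_live(start, end)
+```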
+
+### Phase 3: Differentiation Features
+
+1. **In-Process SQL**: Allow users to run SQL on the data locally.
+   ```python
+   ercot.query("SELECT avg(price) FROM 's3://.../spp.parquet' WHERE zone='LZ_HOUSTON'")
+   ```
+2. **Ephemeral Data Replay**: "See the grid as it was": replay real-time conditions (SCED bindings, outages) that are lost from the official archives.
+3. **BigQuery/Snowflake Integration**: Since the data is in Parquet on S3-compatible storage, we can easily offer "external tables" so enterprise users can mount our bucket into their warehouse; the direct-access path this relies on is sketched after this list.
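+
+For example, a power user could point DuckDB straight at the bucket today, and a BigQuery or Snowflake external table would read the same files. The snippet below is a sketch only: the R2 endpoint is a placeholder, and it assumes the DuckDB `httpfs` extension plus read credentials (or a public-read policy on the bucket) are configured.
+
+```python
+import duckdb
+
+con = duckdb.connect()
+con.execute("INSTALL httpfs")
+con.execute("LOAD httpfs")
+# Placeholder endpoint: R2 exposes an S3-compatible API per account.
+con.execute("SET s3_endpoint = '<account-id>.r2.cloudflarestorage.com'")
+con.execute("SET s3_region = 'auto'")
+
+df = con.execute("""
+    SELECT DeliveryDate, avg(SettlementPointPrice) AS avg_price
+    FROM read_parquet('s3://tinygrid-ercot/real_time_spp/*/*/*/*.parquet', hive_partitioning=1)
+    WHERE SettlementPoint = 'LZ_HOUSTON'
+    GROUP BY DeliveryDate
+    ORDER BY DeliveryDate
+""").df()
+```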
+
+## 3. Comparison to Grid Status
+
+| Feature | Grid Status (Paid) | Tinygrid (Proposed) |
+| :--- | :--- | :--- |
+| **Historical Data** | Hosted Database | S3 Parquet Lake |
+| **Access Method** | REST API / Snowflake | Python Client (DuckDB) / S3 Direct |
+| **Cost Model** | Contact Sales | Free / Low Fixed Cost |
+| **Performance** | API Latency | Network Bandwidth + Local CPU (Fast) |
+| **Ephemeral Data** | Yes | Yes (if we start capturing now) |
+| **Bus-Level LMPs** | Since Dec 2023 | Since Dec 2023 (same constraint) |
+
+## 4. Immediate Next Steps
+1. **Proof of Concept**: Write `scripts/archive_to_parquet.py` to demonstrate downloading one day of archive data, converting it to Parquet, and querying it with DuckDB.
+2. **Storage Setup**: Set up the Cloudflare R2 bucket.
+3. **Client Integration**: Prototype `ParquetArchive` in `tinygrid`.
diff --git a/scripts/archive_to_parquet.py b/scripts/archive_to_parquet.py
new file mode 100644
index 0000000..09d3ccb
--- /dev/null
+++ b/scripts/archive_to_parquet.py
@@ -0,0 +1,65 @@
+import os
+import shutil
+from pathlib import Path
+
+import duckdb
+import pandas as pd
+
+# Set up a clean local data lake directory
+DATA_DIR = Path("data_lake")
+if DATA_DIR.exists():
+    shutil.rmtree(DATA_DIR)
+DATA_DIR.mkdir(parents=True)
+
+print("=== 1. Simulating ERCOT Archive Download ===")
+# Simulate downloading data for 3 days
+dates = pd.date_range("2023-01-01", "2023-01-03", freq="D")
+
+for date in dates:
+    print(f"Processing {date.date()}...")
+    # Create dummy SPP data
+    df = pd.DataFrame({
+        "DeliveryDate": [date.date()] * 24,
+        "DeliveryHour": range(1, 25),
+        "SettlementPoint": ["LZ_HOUSTON"] * 24,
+        "SettlementPointPrice": [50.0 + i for i in range(24)],
+        "DSTFlag": ["N"] * 24
+    })
+
+    # 2. Write to Parquet (partitioned)
+    # Structure: data_lake/spp/year=YYYY/month=MM/day=DD/data.parquet
+    output_path = DATA_DIR / "spp" / f"year={date.year}" / f"month={date.month:02d}" / f"day={date.day:02d}"
+    output_path.mkdir(parents=True, exist_ok=True)
+
+    # Write parquet using the pyarrow engine
+    file_path = output_path / "data.parquet"
+    df.to_parquet(file_path, engine="pyarrow", index=False)
+    print(f"  Saved to {file_path}")
+
+print("\n=== 2. Querying with DuckDB (Client-Side) ===")
+print("Simulating tinygrid client fetching data for Jan 2-3...")
+
+# DuckDB can query hive-partitioned parquet files directly using glob patterns.
+# In a real S3 scenario, we'd use 's3://bucket/spp/**/data.parquet';
+# for the local filesystem, we use the path.
+query = """
+SELECT
+    DeliveryDate,
+    avg(SettlementPointPrice) as avg_price
+FROM read_parquet('data_lake/spp/*/*/*/data.parquet', hive_partitioning=1)
+WHERE DeliveryDate >= '2023-01-02'
+GROUP BY DeliveryDate
+ORDER BY DeliveryDate
+"""
+
+con = duckdb.connect()
+result = con.execute(query).df()
+print("\nQuery Result:")
+print(result)
+
+print("\n=== 3. Benefits ===")
+print(f"Total Lake Size: {sum(f.stat().st_size for f in DATA_DIR.rglob('*.parquet')) / 1024:.2f} KB (Tiny!)")
+print("1. Zero Database Cost (Just S3 storage)")
+print("2. Fast Queries (Columnar scan)")
+print("3. Unified Interface (Client sees a DataFrame)")
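+
+print("\n=== 4. (Optional) Upload to R2 ===")
+# Illustrative only: the bucket name is a placeholder, boto3 is not otherwise a
+# dependency, and credentials are expected via the standard AWS_* environment
+# variables alongside R2_ENDPOINT_URL.
+if os.environ.get("R2_ENDPOINT_URL"):
+    import boto3
+
+    s3 = boto3.client("s3", endpoint_url=os.environ["R2_ENDPOINT_URL"])
+    for path in sorted(DATA_DIR.rglob("*.parquet")):
+        key = path.relative_to(DATA_DIR).as_posix()
+        s3.upload_file(str(path), "tinygrid-ercot", key)
+        print(f"  Uploaded s3://tinygrid-ercot/{key}")
+else:
+    print("  Skipped (set R2_ENDPOINT_URL to upload the sample lake).")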
diff --git a/tinygrid/ercot/parquet_archive.py b/tinygrid/ercot/parquet_archive.py
new file mode 100644
index 0000000..5145200
--- /dev/null
+++ b/tinygrid/ercot/parquet_archive.py
@@ -0,0 +1,118 @@
+"""Parquet-based historical archive access using DuckDB."""
+
+from __future__ import annotations
+
+import logging
+from typing import TYPE_CHECKING
+
+import pandas as pd
+from attrs import define, field
+
+from ..constants.ercot import ERCOT_TIMEZONE
+
+if TYPE_CHECKING:
+    from .client import ERCOTBase
+
+logger = logging.getLogger(__name__)
+
+# Default S3 bucket for Tinygrid's public archive
+DEFAULT_ARCHIVE_BUCKET = "s3://tinygrid-ercot"
+
+
+@define
+class ParquetArchive:
+    """Access historical data via the S3 Parquet lake using DuckDB.
+
+    This acts as a drop-in replacement for ERCOTArchive but queries
+    pre-processed Parquet files in S3 instead of downloading zip files
+    from ERCOT's API.
+
+    Requires:
+        duckdb: for efficient S3 querying
+        fsspec: for S3 filesystem access (via duckdb or pandas)
+    """
+
+    client: ERCOTBase | None = field(default=None)
+    bucket_url: str = field(default=DEFAULT_ARCHIVE_BUCKET)
+    use_duckdb: bool = field(default=True)
+
+    def fetch_historical(
+        self,
+        endpoint: str,
+        start: pd.Timestamp,
+        end: pd.Timestamp,
+        add_post_datetime: bool = False,
+    ) -> pd.DataFrame:
+        """Fetch historical data from the Parquet archive.
+
+        Args:
+            endpoint: API endpoint (e.g., "/np6-905-cd/spp_node_zone_hub")
+            start: Start timestamp
+            end: End timestamp
+            add_post_datetime: If True, add a postDatetime column (mocked or read from the Parquet files)
+
+        Returns:
+            DataFrame with all historical data
+        """
+        # Map the endpoint to the Parquet path structure.
+        # Example: /np6-905-cd/spp_node_zone_hub -> real_time_spp/
+        # This mapping needs to be robust; for now, we use a simple lookup table.
+        dataset_path = self._map_endpoint_to_path(endpoint)
+        if not dataset_path:
+            logger.warning(f"No Parquet mapping found for {endpoint}")
+            return pd.DataFrame()
+
+        s3_path = f"{self.bucket_url}/{dataset_path}/*/*/*/*.parquet"
+
+        # Format dates for the query
+        start_str = start.strftime("%Y-%m-%d")
+        end_str = end.strftime("%Y-%m-%d")
+
+        try:
+            if self.use_duckdb:
+                return self._query_duckdb(s3_path, start_str, end_str)
+            else:
+                return self._query_pandas(s3_path, start, end)
+        except Exception as e:
+            logger.warning(f"Failed to fetch from Parquet archive: {e}")
+            return pd.DataFrame()
+
+    def _query_duckdb(self, s3_path: str, start_date: str, end_date: str) -> pd.DataFrame:
+        """Query S3 using DuckDB."""
+        try:
+            import duckdb
+        except ImportError:
+            logger.warning("duckdb not installed, falling back to empty result")
+            return pd.DataFrame()
+
+        # DuckDB query with hive partitioning. We assume the partition keys are
+        # year, month, day; DeliveryDate lives inside the files, so this filter
+        # does not prune partitions (filtering on year/month/day would).
+        query = f"""
+        SELECT *
+        FROM read_parquet('{s3_path}', hive_partitioning=1)
+        WHERE DeliveryDate >= '{start_date}'
+          AND DeliveryDate <= '{end_date}'
+        """
+
+        # In a real implementation, we would configure S3 credentials here if needed:
+        # duckdb.sql("INSTALL httpfs; LOAD httpfs;")
+
+        return duckdb.sql(query).df()
+
+    def _query_pandas(self, s3_path: str, start: pd.Timestamp, end: pd.Timestamp) -> pd.DataFrame:
+        """Fallback using Pandas (reads all files, slow!)."""
+        # This is inefficient without a precise file listing and is included only as
+        # a fallback; a real implementation should first list the S3 keys matching
+        # the date range.
+        logger.warning("Pandas fallback for S3 Parquet is not fully implemented (inefficient)")
+        return pd.DataFrame()
+
+    def _map_endpoint_to_path(self, endpoint: str) -> str | None:
+        """Map an ERCOT endpoint to an S3 prefix."""
+        # Simple mapping for the POC
+        mapping = {
+            "/np6-905-cd/spp_node_zone_hub": "real_time_spp",
+            "/np4-190-cd/dam_stlmnt_pnt_prices": "dam_spp",
+            "/np6-788-cd/lmp_node_zone_hub": "real_time_lmp_node",
+        }
+        return mapping.get(endpoint)
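+
+
+# Illustrative usage (assumes the public tinygrid-ercot bucket is populated and
+# reachable by duckdb, e.g. via the httpfs extension and appropriate credentials):
+#
+#   archive = ParquetArchive()
+#   df = archive.fetch_historical(
+#       endpoint="/np6-905-cd/spp_node_zone_hub",
+#       start=pd.Timestamp("2022-01-01"),
+#       end=pd.Timestamp("2022-01-05"),
+#   )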