85 changes: 85 additions & 0 deletions STRATEGY.md
@@ -0,0 +1,85 @@
# Strategy: Historical Data & Differentiation for Tinygrid

## Executive Summary
To compete with Grid Status's paid offering ("pre-built historical database"), Tinygrid should adopt a **"Client-Side Data Warehousing"** approach. Instead of managing expensive database servers (Postgres/Snowflake), we will pre-process ERCOT historical data into highly optimized **Parquet** files stored in **Cloudflare R2** (zero egress fees). The `tinygrid` Python client will be updated to transparently fetch and query these files using **DuckDB**, offering performance comparable to paid APIs at a fraction of the cost.

## 1. Architecture: The "Serverless" Historical Database

### Storage Layer: S3-Compatible Object Storage (Parquet)
We will store historical data in an S3-compatible bucket (Cloudflare R2 recommended for free egress).

**Schema Structure:**
```text
s3://tinygrid-ercot/
├── real_time_spp/
│   └── year=2023/month=12/day=01/data.parquet
├── real_time_lmp_bus/
│   └── year=2023/month=12/day=01/part-001.parquet
├── dam_spp/
│   └── year=2023/month=12/day=01/data.parquet
└── ...
```

**Why Parquet?**
- **Compression**: typically around 10x smaller than the equivalent CSV.
- **Speed**: Columnar storage means fast reads for specific columns (e.g., just "LZ_HOUSTON" prices).
- **Queryability**: Directly queryable by DuckDB, Polars, and AWS Athena/BigQuery.
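
For example, the columnar layout lets a client read only the columns and zone it cares about. A minimal sketch, assuming the bucket layout above and the column names used elsewhere in this PR:

```python
import duckdb

# Column pruning sketch: only two columns of one zone are read, so DuckDB
# skips the rest of each file. Bucket path and column names are assumptions
# taken from the schema sketch above, not a final API.
duckdb.sql("INSTALL httpfs; LOAD httpfs;")  # needed for s3:// reads
df = duckdb.sql("""
    SELECT DeliveryDate, SettlementPointPrice
    FROM read_parquet('s3://tinygrid-ercot/real_time_spp/*/*/*/*.parquet',
                      hive_partitioning=1)
    WHERE SettlementPoint = 'LZ_HOUSTON'
""").df()
```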

### Access Layer: The `tinygrid` Client
The Python client becomes the query engine. We leverage **DuckDB** (a pip-installable, fast in-process OLAP SQL database) inside the library.

**Workflow:**
1. User calls `ercot.get_spp(start="2022-01-01", end="2022-01-05")`.
2. `tinygrid` detects that the request is historical (older than ~90 days or not available from the live API).
3. Instead of hitting ERCOT's slow Archive API (zips), it generates S3 URLs:
   - `s3://tinygrid-ercot/real_time_spp/year=2022/month=01/day=01/data.parquet`
   - ...
4. `tinygrid` uses DuckDB (or Pandas with pyarrow) to download and filter these files in parallel.
5. Data is returned as a Pandas DataFrame, matching the live API format.

**Cost to Us**: $0 compute (client pays), ~$5/month storage.
**Cost to User**: Free (public bucket) or Freemium (presigned URLs).
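
A rough sketch of steps 3–5 above, using a hypothetical helper name and the bucket layout shown earlier:

```python
import duckdb
import pandas as pd

def fetch_spp_history(start: str, end: str) -> pd.DataFrame:
    """Hypothetical helper: build one Parquet URL per day in the range,
    then let DuckDB download and read them; the result is a DataFrame."""
    duckdb.sql("INSTALL httpfs; LOAD httpfs;")  # required for s3:// reads
    paths = [
        "s3://tinygrid-ercot/real_time_spp/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/data.parquet"
        for d in pd.date_range(start, end, freq="D")
    ]
    return duckdb.read_parquet(paths).df()
```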

## 2. Implementation Plan

### Phase 1: Ingestion & Backfill (The "Factory")
We need a robust pipeline to scrape ERCOT and build the Parquet lake.

**Stack**: Python script running on GitHub Actions (scheduled daily) or a small VPS.
**Tasks**:
1. **Backfill**: Script to iterate 2011-2023, call `ERCOTArchive` (existing logic), download zips, convert to Parquet, upload to R2.
2. **Daily Cron**: Script to run daily at ~2 AM CST.
- Fetch "Yesterday's" data from ERCOT Public API (while it's still there).
- **Capture Ephemeral Data**: This is key. Capture data that ERCOT deletes (e.g., real-time fuel mix, outages) and save to Parquet. This builds our unique moat.
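
Both tasks end with the same conversion/upload step. A minimal sketch, where the R2 endpoint, credentials, and bucket name are placeholders (R2 is addressed through its S3-compatible API):

```python
import boto3
import pandas as pd

def publish_day(csv_path: str, dataset: str, date: pd.Timestamp) -> None:
    """Hypothetical step: one day's CSV (already pulled from ERCOT) ->
    partitioned Parquet -> upload to the R2 bucket."""
    df = pd.read_csv(csv_path)
    key = (
        f"{dataset}/year={date.year}/month={date.month:02d}/"
        f"day={date.day:02d}/data.parquet"
    )
    df.to_parquet("/tmp/data.parquet", engine="pyarrow", index=False)

    s3 = boto3.client(
        "s3",
        endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # placeholder
        aws_access_key_id="<r2-access-key>",      # placeholder
        aws_secret_access_key="<r2-secret-key>",  # placeholder
    )
    s3.upload_file("/tmp/data.parquet", "tinygrid-ercot", key)
```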

### Phase 2: Client Upgrade
Update `tinygrid` library to prefer the S3 Parquet lake over ERCOT's Archive API.

- Add dependency: `duckdb` or `pyarrow`.
- Implement a `ParquetArchive` class that exposes the same interface as `ERCOTArchive`.
- Add configuration: `ERCOT(use_cloud_archive=True)`.
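
Sketch of the proposed switch (pseudocode; the real client wiring and the existing `ERCOTArchive` import are simplified away):

```python
from tinygrid.ercot.parquet_archive import ParquetArchive  # module added in this PR

class ERCOT:  # simplified stand-in for the real client
    def __init__(self, use_cloud_archive: bool = False):
        # Prefer the S3 Parquet lake when enabled; otherwise keep using the
        # existing ERCOTArchive (zip downloads), which is omitted here.
        self.archive = (
            ParquetArchive(bucket_url="s3://tinygrid-ercot")
            if use_cloud_archive
            else None  # would be ERCOTArchive() in the real client
        )

    def get_spp(self, start, end):
        # Historical ranges are served from whichever archive backend is active.
        return self.archive.fetch_historical("/np6-905-cd/spp_node_zone_hub", start, end)
```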

### Phase 3: Differentiation Features

1. **In-Process SQL**: Allow users to run SQL on the data locally.
```python
ercot.query("SELECT avg(price) FROM 's3://.../spp.parquet' WHERE zone='LZ_HOUSTON'")
```
2. **Ephemeral Data Replay**: "See the grid as it was". Replay real-time conditions (SCED bindings, outages) that are lost in official archives.
3. **BigQuery/Snowflake Integration**: Since the data is in Parquet/S3, we can easily offer "External Tables" for enterprise users to mount our bucket into their warehouse.
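
One way the in-process SQL feature (item 1) could be backed, sketched with illustrative names: a single DuckDB connection with a view per dataset, so users query friendly table names instead of raw S3 globs.

```python
import duckdb
import pandas as pd

class LocalSQL:  # illustrative name, not the final API
    def __init__(self, bucket: str = "s3://tinygrid-ercot"):
        self.con = duckdb.connect()
        self.con.sql("INSTALL httpfs; LOAD httpfs;")
        # Expose the lake under a friendly table name
        self.con.sql(f"""
            CREATE VIEW real_time_spp AS
            SELECT * FROM read_parquet('{bucket}/real_time_spp/*/*/*/*.parquet',
                                       hive_partitioning=1)
        """)

    def query(self, sql: str) -> pd.DataFrame:
        return self.con.sql(sql).df()

# e.g. LocalSQL().query("SELECT avg(SettlementPointPrice) FROM real_time_spp")
```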

## 3. Comparison to Grid Status

| Feature | Grid Status (Paid) | Tinygrid (Proposed) |
| :--- | :--- | :--- |
| **Historical Data** | Hosted Database | S3 Parquet Lake |
| **Access Method** | REST API / Snowflake | Python Client (DuckDB) / S3 Direct |
| **Cost Model** | Contact Sales | Free / Low Fixed Cost |
| **Performance** | API Latency | Network Bandwidth + Local CPU (Fast) |
| **Ephemeral Data** | Yes | Yes (if we start capturing now) |
| **Bus-Level LMPs** | Since Dec 2023 | Since Dec 2023 (same constraint) |

## 4. Immediate Next Steps
1. **Proof of Concept**: Write `scripts/archive_to_parquet.py` to demonstrate downloading one day of archive, converting to Parquet, and querying with DuckDB.
2. **Storage Setup**: Set up Cloudflare R2 bucket.
3. **Client Integration**: Prototype `ParquetArchive` in `tinygrid`.
65 changes: 65 additions & 0 deletions scripts/archive_to_parquet.py
@@ -0,0 +1,65 @@
import pandas as pd
import duckdb
import shutil
from pathlib import Path

# Setup paths
DATA_DIR = Path("data_lake")
if DATA_DIR.exists():
    shutil.rmtree(DATA_DIR)
DATA_DIR.mkdir(parents=True)

print("=== 1. Simulating ERCOT Archive Download ===")
# Simulate downloading data for 3 days
dates = pd.date_range("2023-01-01", "2023-01-03", freq="D")

for date in dates:
    print(f"Processing {date.date()}...")
    # Create dummy SPP data
    df = pd.DataFrame({
        "DeliveryDate": [date.date()] * 24,
        "DeliveryHour": range(1, 25),
        "SettlementPoint": ["LZ_HOUSTON"] * 24,
        "SettlementPointPrice": [50.0 + i for i in range(24)],
        "DSTFlag": ["N"] * 24
    })

    # 2. Write to Parquet (Partitioned)
    # Structure: data_lake/spp/year=YYYY/month=MM/day=DD/data.parquet
    output_path = DATA_DIR / "spp" / f"year={date.year}" / f"month={date.month:02d}" / f"day={date.day:02d}"
    output_path.mkdir(parents=True, exist_ok=True)

    # Write parquet using pyarrow engine
    file_path = output_path / "data.parquet"
    df.to_parquet(file_path, engine="pyarrow", index=False)
    print(f"  Saved to {file_path}")

print("\n=== 2. Querying with DuckDB (Client-Side) ===")
print("Simulating tinygrid client fetching data for Jan 2-3...")

# DuckDB can query hive-partitioned parquet files directly using glob patterns
# In a real S3 scenario, we'd use 's3://bucket/spp/**/data.parquet'
# For local fs, we use the path.

query = """
SELECT
DeliveryDate,
avg(SettlementPointPrice) as avg_price
FROM read_parquet('data_lake/spp/*/*/*/data.parquet', hive_partitioning=1)
WHERE DeliveryDate >= '2023-01-02'
GROUP BY DeliveryDate
ORDER BY DeliveryDate
"""

con = duckdb.connect()
result = con.execute(query).df()
print("\nQuery Result:")
print(result)

print("\n=== 3. Benefits ===")
print(f"Total Lake Size: {sum(f.stat().st_size for f in DATA_DIR.rglob('*.parquet')) / 1024:.2f} KB (Tiny!)")
print("1. Zero Database Cost (Just S3 storage)")
print("2. Fast Queries (Columnar scan)")
print("3. Unified Interface (Client sees a DataFrame)")
118 changes: 118 additions & 0 deletions tinygrid/ercot/parquet_archive.py
@@ -0,0 +1,118 @@
"""Parquet-based historical archive access using DuckDB."""

from __future__ import annotations

import logging
from typing import TYPE_CHECKING, Any

import pandas as pd
from attrs import define, field

from ..constants.ercot import ERCOT_TIMEZONE

if TYPE_CHECKING:
from .client import ERCOTBase

logger = logging.getLogger(__name__)

# Default S3 bucket for Tinygrid's public archive
DEFAULT_ARCHIVE_BUCKET = "s3://tinygrid-ercot"


@define
class ParquetArchive:
"""Access historical data via S3 Parquet lake using DuckDB.

This acts as a drop-in replacement for ERCOTArchive but queries
pre-processed Parquet files in S3 instead of downloading zip files
from ERCOT's API.

Requires:
duckdb: for efficient S3 querying
fsspec: for S3 filesystem access (via duckdb or pandas)
"""

client: ERCOTBase | None = field(default=None)
bucket_url: str = field(default=DEFAULT_ARCHIVE_BUCKET)
use_duckdb: bool = field(default=True)

    def fetch_historical(
        self,
        endpoint: str,
        start: pd.Timestamp,
        end: pd.Timestamp,
        add_post_datetime: bool = False,
    ) -> pd.DataFrame:
        """Fetch historical data from the Parquet archive.

        Args:
            endpoint: API endpoint (e.g., "/np6-905-cd/spp_node_zone_hub")
            start: Start timestamp
            end: End timestamp
            add_post_datetime: If True, add postDatetime column (mocked or from parquet)

        Returns:
            DataFrame with all historical data
        """
        # Map endpoint to parquet path structure
        # Example: /np6-905-cd/spp_node_zone_hub -> real_time_spp/
        # This mapping needs to be robust. For now, we use a simple explicit map.
        dataset_path = self._map_endpoint_to_path(endpoint)
        if not dataset_path:
            logger.warning(f"No Parquet mapping found for {endpoint}")
            return pd.DataFrame()

        s3_path = f"{self.bucket_url}/{dataset_path}/*/*/*/*.parquet"

        # Format dates for query
        start_str = start.strftime("%Y-%m-%d")
        end_str = end.strftime("%Y-%m-%d")

        try:
            if self.use_duckdb:
                return self._query_duckdb(s3_path, start_str, end_str)
            else:
                return self._query_pandas(s3_path, start, end)
        except Exception as e:
            logger.warning(f"Failed to fetch from Parquet archive: {e}")
            return pd.DataFrame()

    def _query_duckdb(self, s3_path: str, start_date: str, end_date: str) -> pd.DataFrame:
        """Query S3 using DuckDB."""
        try:
            import duckdb
        except ImportError:
            logger.warning("duckdb not installed, falling back to empty result")
            return pd.DataFrame()

        # DuckDB query with hive partitioning.
        # We assume partition keys are year, month, day; DeliveryDate is inside the file.
        query = f"""
            SELECT *
            FROM read_parquet('{s3_path}', hive_partitioning=1)
            WHERE DeliveryDate >= '{start_date}'
              AND DeliveryDate <= '{end_date}'
        """

        # httpfs is required for s3:// paths; credentials (only needed if the
        # bucket is not public) would be configured here as well.
        duckdb.sql("INSTALL httpfs; LOAD httpfs;")

        return duckdb.sql(query).df()

    def _query_pandas(self, s3_path: str, start: pd.Timestamp, end: pd.Timestamp) -> pd.DataFrame:
        """Fallback using Pandas (reads all files, slow!)."""
        # This is inefficient without precise file listing, included only as fallback.
        # Real implementation should list keys in S3 matching the date range first.
        logger.warning("Pandas fallback for S3 Parquet is not fully implemented (inefficient)")
        return pd.DataFrame()

    def _map_endpoint_to_path(self, endpoint: str) -> str | None:
        """Map ERCOT endpoint to S3 prefix."""
        # Simple mapping for POC
        mapping = {
            "/np6-905-cd/spp_node_zone_hub": "real_time_spp",
            "/np4-190-cd/dam_stlmnt_pnt_prices": "dam_spp",
            "/np6-788-cd/lmp_node_zone_hub": "real_time_lmp_node",
        }
        return mapping.get(endpoint)