blue-marble · anamileva · Jun 11, 2026 · Jun 8, 2026 · Jun 8, 2026 · Jun 9, 2026
diff --git a/data_toolkit/load_raw_data.py b/data_toolkit/load_raw_data.py
@@ -15,9 +15,46 @@
 """
 Load data into the GridPath raw data database. See the documentation of each
 GridPath Data Toolkit module for data prerequisites. Use the
-files_to_import.csv file to tell GridPath which CSV files should be loaded
+``files_to_import.csv`` file to tell GridPath which CSV files should be loaded
 into which database table.
 
+===================
+What this step does
+===================
+
+This module is a generic bulk loader for raw CSV data into the GridPath
+database. It reads a file named ``files_to_import.csv`` located in the
+directory given by ``--csv_location``. Each row of that file describes one
+CSV file: an import flag (whether the file should be loaded), the CSV
+filename (relative to ``--csv_location``), and the database table the file
+should be loaded into.
+
+The loader iterates over the rows of ``files_to_import.csv`` and, for each row
+whose import flag is True, reads the corresponding CSV from ``--csv_location``
+and appends its contents to the named database table (existing rows are
+preserved; data is inserted with ``if_exists="append"``). Rows whose import
+flag is False are skipped.
+
+This generic loader is used throughout the Data Toolkit workflow to populate
+``raw_data`` tables (e.g., VER profiles and their unit mapping, hydro operating
+characteristics) that later Data Toolkit steps depend on.
+
+=====
+Usage
+=====
+
+>>> python -m data_toolkit.load_raw_data --database PATH/TO/DATABASE --csv_location PATH/TO/CSV/DIRECTORY
+
+=========
+Settings
+=========
+    * database
+    * csv_location
+
+The ``--csv_location`` directory must contain a ``files_to_import.csv``
+manifest with columns for the import flag, the CSV filename, and the
+destination database table, in that order.
+
 """
 
 import sys
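
Illustrative sketch (not part of the patch): the manifest-driven loading described in the docstring above could look roughly like this. It uses only the standard library (`csv` and `sqlite3`) instead of the pandas `if_exists="append"` call the docstring mentions, reads the three manifest columns by position, and assumes the import flag is spelled `True` or `1`. The function name `load_from_manifest` is hypothetical.

```python
import csv
import sqlite3
from pathlib import Path


def load_from_manifest(conn, csv_location):
    """Append every flagged CSV listed in files_to_import.csv to its table."""
    csv_location = Path(csv_location)
    loaded = []
    with open(csv_location / "files_to_import.csv", newline="") as manifest:
        reader = csv.reader(manifest)
        next(reader)  # skip the header row; columns are read by position
        for row in reader:
            if not row:
                continue  # tolerate blank lines in the manifest
            import_flag, filename, table = row
            # Assumption: the flag is spelled like a boolean ("True" or "1").
            if import_flag.strip().lower() not in ("true", "1"):
                continue
            with open(csv_location / filename, newline="") as data_file:
                rows = list(csv.reader(data_file))
            header, records = rows[0], rows[1:]
            columns = ", ".join(f'"{c}"' for c in header)
            placeholders = ", ".join("?" for _ in header)
            # Plain INSERT: rows already in the table are left untouched.
            conn.executemany(
                f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})',
                records,
            )
            loaded.append((filename, table, len(records)))
    conn.commit()
    return loaded
```

The destination tables are assumed to exist already; the sketch returns a list of `(filename, table, row_count)` tuples so a caller can report what was loaded.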

diff --git a/data_toolkit/project/availability/outages/create_availability_iteration_input_csvs.py b/data_toolkit/project/availability/outages/create_availability_iteration_input_csvs.py
@@ -25,6 +25,86 @@
 
 Run unit outage simulation and create availability iteration inputs.
 
+===================
+What this step does
+===================
+
+This module runs a Monte Carlo unit-outage simulation and writes the resulting
+exogenous availability (derate) input CSVs. Using the per-unit availability
+parameters loaded from ``--outage_params_input_csv`` (into the
+``raw_data_unit_availability_params`` table) -- forced-outage rates
+(``unit_for``), mean time to repair (``unit_mttr``), the number of units
+(``n_units``), the unit weight, and the per-unit outage model
+(``unit_fo_model``) -- it simulates ``--n_iterations`` independent outage
+timelines for each project, drawing random forced and (under the sequential
+model) repair/maintenance transitions.
+
+The outage model is selected per unit via the ``unit_fo_model`` column and may
+be one of:
+
+    * ``Derate`` -- a static derate ``1 - unit_for`` applied in every timepoint.
+    * ``MC_independent`` -- each timepoint's outage state is drawn independently
+      from a uniform distribution against the forced-outage rate.
+    * ``MC_sequential`` -- a sequential (exponential) failure/repair process
+      driven by the forced-outage rate and ``unit_mttr`` (the implied mean time
+      to failure is ``mttr * (1 / for - 1)``), preserving outage persistence
+      across timepoints.
+    * ``historical_year`` -- instead of simulating, a random historical year is
+      sampled for the unit from ``--historical_availability_csv`` and that
+      year's hourly derate series is used directly. (This can be used for
+      units whose availability is taken from a historical record rather than
+      simulated; the choice is driven by the unit's ``unit_fo_model`` value,
+      not by project type.)
+
+For each project the per-unit availability adjustments are combined using each
+unit's ``unit_weight`` to form a weighted project-level derate. Hybrid-storage
+projects (``hybrid_stor`` set) additionally get a separately simulated derate
+for the storage component. By default only rows whose derate differs from 1 are
+written (as default availability in GridPath is 1); pass ``--print_ones`` to
+retain all rows.
+
+Output is written to ``--output_directory`` as one CSV per project, named
+``<project>-<project_availability_scenario_id>-<project_availability_scenario_name>.csv``.
+``--n_parallel_projects`` parallelizes the simulation across projects and
+``--overwrite`` replaces existing files (otherwise existing files are appended
+to). ``--sort`` re-sorts each output file at the end. These outage iterations
+are intended to align with the weather/hydro iterations to form complete Monte
+Carlo draws.
+
+=========================
+Reproducibility (seeding)
+=========================
+
+By default seeding is OFF: ``--user_provided_seeding`` is not set, so the
+outage simulation is fully random and non-reproducible from run to run. When
+seeding is off, *all* of the seeding flags below are ignored -- the seed
+arguments are replaced with ``None`` before the simulation runs, and NumPy's
+global RNG is never explicitly seeded.
+
+To get reproducible outages, set ``--user_provided_seeding`` together with a
+``--starting_project_iteration_seed <int>`` (defaults to ``0``). With seeding
+on:
+
+    * **Per-project, non-overlapping seed ranges.** Each project is assigned a
+      starting seed of ``starting_project_iteration_seed + project_idx *
+      n_iterations``. Within a project the per-iteration seed starts at that
+      value and is incremented by 1 for each of the ``n_iterations``
+      iterations, so the seed ranges of distinct projects do not overlap.
+    * **Per-unit seeds within an iteration.** For a given project iteration, the
+      per-iteration seed is used to seed NumPy's RNG, which then draws one
+      distinct integer seed per unit, without replacement, from ``np.arange(1,
+      max_integer_for_unit_outage_seeding, hybrid_storage_seed_increment + 1)``.
+      Each unit's outage timeline is then simulated from its own seed.
+      ``--max_integer_for_unit_outage_seeding`` defaults to ``1000000``.
+    * **Hybrid-storage offset.** For hybrid-storage projects, the storage
+      component is simulated with a seed offset from the generator component's
+      unit seed by ``--hybrid_storage_seed_increment`` (defaults to ``1000``).
+
+Every project / unit / iteration still draws its own independent random outage
+timeline, but the whole simulation reproduces exactly when re-run with the same
+seed settings. Again, these flags are ignored unless ``--user_provided_seeding``
+is set. Proceed with caution when seeding.
+
 =====
 Usage
 =====
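
Illustrative sketch (not part of the patch): the `MC_sequential` relationship quoted in the docstring, `mttr * (1 / for - 1)`, follows from the steady-state identity FOR = MTTR / (MTTF + MTTR). The example below checks that identity numerically with a continuous-time failure/repair process. The module itself works per timepoint, so the function names and the continuous-time simplification here are assumptions made for illustration only.

```python
import random


def mean_time_to_failure(unit_for, unit_mttr):
    """Mean time to failure implied by a forced-outage rate and an MTTR."""
    return unit_mttr * (1 / unit_for - 1)


def simulate_outage_fraction(unit_for, unit_mttr, n_hours, seed):
    """Share of the horizon spent on outage in an up/down renewal process."""
    rng = random.Random(seed)
    mttf = mean_time_to_failure(unit_for, unit_mttr)
    elapsed = 0.0
    hours_out = 0.0
    while elapsed < n_hours:
        up = rng.expovariate(1 / mttf)  # time until the next failure
        down = rng.expovariate(1 / unit_mttr)  # time needed to repair
        elapsed += up
        # Count only the part of the outage that falls inside the horizon.
        hours_out += min(down, max(n_hours - elapsed, 0.0))
        elapsed += down
    return hours_out / n_hours


mttf = mean_time_to_failure(unit_for=0.05, unit_mttr=24)  # about 456 hours
fraction = simulate_outage_fraction(0.05, 24, n_hours=2_000_000, seed=1)
```

Over a long horizon the simulated outage fraction should settle near the forced-outage rate, while individual outages persist for roughly `unit_mttr` hours on average.
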
@@ -35,19 +115,34 @@
 Input prerequisites
 ===================
 
-This module assumes the following raw input database tables have been populated:
+This module assumes the following raw input database table has been populated:
     * raw_data_unit_availability_params
-    * raw_data_var_project_units
+
+This table can be populated ahead of time, or loaded at run time by passing
+``--outage_params_input_csv``. Units that use the ``historical_year`` outage
+model additionally read their derate series from the CSV passed via
+``--historical_availability_csv``.
 
 =========
 Settings
 =========
     * database
-    * output_directory
+    * outage_params_input_csv
+    * historical_availability_csv
+    * stage_id
+    * n_iterations
+    * study_year
     * project_availability_scenario_id
     * project_availability_scenario_name
+    * output_directory
     * overwrite
+    * sort
+    * print_ones
     * n_parallel_projects
+    * user_provided_seeding
+    * starting_project_iteration_seed
+    * max_integer_for_unit_outage_seeding
+    * hybrid_storage_seed_increment
 
 """
 
@@ -119,6 +214,7 @@ def parse_arguments(args):
         "-max_unit_seed_int",
         "--max_integer_for_unit_outage_seeding",
         default=1000000,
+        type=int,
         help="The max integer for assigning seeds to each unit outage "
         "simulation for a given project. The --user_provided_seeding flag must "
         "be set to True for this to take effect. Proceed with caution.",
@@ -211,9 +307,37 @@ def get_weighted_availability_adjustment(
         # For each project iteration, we assign a seed to each unit outage
         # simulation based on the project_iteration_seed and a
         # max_integer_for_unit_seeding number set by the user
+        # Draw a distinct generator-component seed per unit (without
+        # replacement) so no two units share a seed and thus an identical
+        # outage timeline. To also keep each hybrid unit's storage-component
+        # seed (generator seed + hyb_stor_seed_unit_increment, assigned below)
+        # from coinciding with another unit's generator or storage seed, draw
+        # the generator seeds from values spaced (increment + 1) apart: no two
+        # then differ by exactly the increment, so {seeds} and
+        # {seeds + increment} are disjoint.
+        n_units_in_project = len(project_df.index)
+        seed_population = np.arange(
+            1,
+            max_integer_for_unit_outage_seeding,
+            hyb_stor_seed_unit_increment + 1,
+        )
+        # Guard: np.random.choice(replace=False) cannot draw more values than
+        # are available. Fail with an actionable message rather than an opaque
+        # ValueError.
+        if n_units_in_project > len(seed_population):
+            raise ValueError(
+                f"Cannot assign {n_units_in_project} distinct, "
+                f"non-colliding outage seeds to project "
+                f"'{project_df['project'].iloc[0]}': only "
+                f"{len(seed_population)} are available given "
+                f"max_integer_for_unit_outage_seeding="
+                f"{max_integer_for_unit_outage_seeding} and "
+                f"hybrid_storage_seed_increment={hyb_stor_seed_unit_increment}. "
+                f"Increase --max_integer_for_unit_outage_seeding."
+            )
         np.random.seed(project_iteration_seed)
-        unit_seeds = np.random.randint(
-            1, max_integer_for_unit_outage_seeding, size=len(project_df.index)
+        unit_seeds = np.random.choice(
+            seed_population, size=n_units_in_project, replace=False
         )
     else:
         unit_seeds = [None for n in project_df.index]
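
Illustrative sketch (not part of the patch): the hunk above spaces candidate seeds `increment + 1` apart so that no generator seed can equal another unit's storage seed. The example below reproduces that property with the standard library, using `random.Random(...).sample` as a stand-in for `np.random.seed` plus `np.random.choice(..., replace=False)`. The function name `draw_unit_seeds` is hypothetical.

```python
import random


def draw_unit_seeds(project_iteration_seed, n_units, max_seed, storage_increment):
    """Draw distinct generator seeds whose storage offsets never collide."""
    # Candidates are spaced (increment + 1) apart, like np.arange in the patch.
    population = range(1, max_seed, storage_increment + 1)
    if n_units > len(population):
        raise ValueError(
            f"Need {n_units} seeds but only {len(population)} are available."
        )
    rng = random.Random(project_iteration_seed)  # stand-in for np.random.seed
    generator_seeds = rng.sample(population, n_units)  # without replacement
    storage_seeds = [seed + storage_increment for seed in generator_seeds]
    return generator_seeds, storage_seeds


gen_seeds, stor_seeds = draw_unit_seeds(
    project_iteration_seed=0, n_units=5, max_seed=1_000_000, storage_increment=1000
)
```

Any two candidates differ by a multiple of `increment + 1`, which can never equal `increment` itself, so the generator and storage seed sets stay disjoint for any positive increment.
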
@@ -507,10 +631,13 @@ def simulate_all_project_iterations(pool_datum):
     # Loop through all iterations for this project
     project_iteration_seed = starting_project_iteration_seed
     for iteration_n in range(1, n_iterations + 1):
+        # ORDER BY unit so each unit maps to the same drawn seed (unit_seeds is
+        # indexed positionally) across runs; required for reproducible draws
         project_df = pd.read_sql(
             f"""
                 SELECT * FROM raw_data_unit_availability_params
                 WHERE project = '{project}'
+                ORDER BY unit
                 ;""",
             conn,
         )
@@ -592,9 +719,12 @@ def main(args=None):
     if not os.path.exists(parsed_args.output_directory):
         os.makedirs(parsed_args.output_directory)
 
-    # Get projects
+    # Get projects. ORDER BY so that project_idx (and therefore each project's
+    # seed base, starting_project_iteration_seed + project_idx * n_iterations)
+    # is stable across runs -- otherwise seeded results are not reproducible.
     projects = [i[0] for i in conn.execute("""
-        SELECT DISTINCT project FROM raw_data_unit_availability_params;
+        SELECT DISTINCT project FROM raw_data_unit_availability_params
+        ORDER BY project;
         """).fetchall()]
 
     all_files = []
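
Illustrative sketch (not part of the patch): why the `ORDER BY project` above matters for seeding. Each project's seed base is `starting_project_iteration_seed + project_idx * n_iterations`, so the project order must be stable for the same project to receive the same seeds on every run. The sketch assumes `project_idx` is the position of the project in the sorted list; the function name is hypothetical.

```python
def project_iteration_seeds(projects, n_iterations, starting_seed=0):
    """Map each project to the list of seeds used for its iterations."""
    seeds = {}
    # Sorting mirrors ORDER BY project: project_idx must not change between runs.
    for project_idx, project in enumerate(sorted(projects)):
        first_seed = starting_seed + project_idx * n_iterations
        # One seed per iteration, incremented by 1 from the project's base.
        seeds[project] = list(range(first_seed, first_seed + n_iterations))
    return seeds


seeds = project_iteration_seeds(["Wind_B", "Gas_A", "Hydro_C"], n_iterations=3)
# Gas_A gets 0-2, Hydro_C gets 3-5, Wind_B gets 6-8: the ranges do not overlap.
```

Without a fixed ordering, the database could return the projects in a different sequence, shifting every project's seed range and changing the seeded results.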

diff --git a/data_toolkit/project/opchar/hydro/create_hydro_iteration_input_csvs.py b/data_toolkit/project/opchar/hydro/create_hydro_iteration_input_csvs.py
@@ -42,6 +42,84 @@
     * hydro_operational_chars_scenario_name
     * overwrite
     * n_parallel_projects
+
+===================
+What this step does
+===================
+
+This module builds GridPath hydro operational-characteristics input CSVs from
+the year/month hydro data loaded earlier
+(``raw_data_project_hydro_opchars_by_year_month``, ``raw_data_hydro_years``,
+and the user-defined balancing-type horizons in
+``user_defined_balancing_type_horizons``). For each hydro iteration it derives
+the per-horizon hydro operating parameters -- the average, minimum, and maximum
+power fractions -- and writes them to ``--output_directory`` under the given
+``hydro_operational_chars_scenario_id`` and
+``hydro_operational_chars_scenario_name``. ``--n_parallel_projects N`` runs up
+to ``N`` projects at once, and ``--overwrite`` replaces existing CSVs.
+
+===========
+Methodology
+===========
+
+The distinct projects to process are read from
+``raw_data_project_hydro_opchars_by_year_month``, and one CSV is written per
+project, named ``<project>-<scenario_id>-<scenario_name>.csv`` in
+``--output_directory``. Projects are processed in a multiprocessing pool sized
+by ``--n_parallel_projects`` (defaults to ``1``).
+
+--------------------------------------------
+Hydro iterations and balancing-type horizons
+--------------------------------------------
+
+The set of hydro years is read from ``raw_data_hydro_years`` and each year is
+treated as one hydro iteration (written into the ``hydro_iteration`` column).
+The set of ``(balancing_type, horizon)`` pairs is read from
+``user_defined_balancing_type_horizons``; if ``--hydro_balancing_type`` is
+supplied, the pairs are filtered to that single balancing type (e.g. ``day``,
+``week``, ``month``), otherwise all balancing types are included. For every
+combination of hydro year and balancing-type horizon, one output row is
+produced.
+
+------------------------------------
+Deriving per-horizon power fractions
+------------------------------------
+
+The ``average_power_fraction``, ``min_power_fraction``, and
+``max_power_fraction`` for each horizon are computed by month-weighting the raw
+monthly opchar values. For a given balancing-type horizon, the module reads its
+``hour_ending_of_year_start`` and ``hour_ending_of_year_end`` from
+``user_defined_balancing_type_horizons`` and walks each hour of the year in that
+range, mapping the hour to a calendar month (via a ``pandas.Timestamp`` anchored
+at January 1 of the hydro year) and counting the number of hours that fall in
+each month. These hour counts become the per-month weights for the horizon.
+
+For each month touched by the horizon, the module looks up the project's
+``average_power_fraction``, ``min_power_fraction``, and ``max_power_fraction``
+for that hydro year and month in
+``raw_data_project_hydro_opchars_by_year_month``, multiplies each by the month's
+hour-count weight, sums across months, and divides by the total number of hours
+in the horizon. The result is an hours-weighted average of the monthly
+fractions for each of the three parameters, written as a single row keyed by
+``balancing_type_project`` and ``horizon`` (with ``weather_iteration`` set to
+``0``, i.e. no weather iteration). Note we take the weighted averages of the
+mins and maxes, not the mins of the mins or the maxes of the maxes.
+
+----------------------------------
+Writing and overwriting output
+----------------------------------
+
+Rows are appended to the project's CSV as they are generated, with the header
+written only when the file does not yet exist. When ``--overwrite`` is set, any
+existing CSV for the project is deleted before processing begins so it is
+rebuilt from scratch; without ``--overwrite``, new rows are appended to any
+existing file.
+
+If the corresponding ``--*_input_csv`` paths are provided, the raw-data tables
+(``raw_data_project_hydro_opchars_by_year_month``, ``raw_data_hydro_years``,
+``user_defined_balancing_type_horizons``) are loaded from those CSVs before the
+inputs are built; otherwise the data is assumed to already be present in the
+database.
 """
 
 from argparse import ArgumentParser
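
Illustrative sketch (not part of the patch): the hours-weighted averaging described in the "Deriving per-horizon power fractions" section. It uses `datetime` in place of `pandas.Timestamp` and assumes that hour ending 1 covers 00:00 to 01:00 on January 1 of the hydro year; the module's exact hour-to-timestamp convention is not shown in this diff. The function names and the sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def month_hour_counts(hydro_year, hour_start, hour_end):
    """Count how many hours of the horizon fall in each calendar month."""
    year_start = datetime(hydro_year, 1, 1)
    counts = Counter()
    for hour_ending in range(hour_start, hour_end + 1):
        # Assumption: hour ending 1 covers 00:00-01:00 on January 1.
        counts[(year_start + timedelta(hours=hour_ending - 1)).month] += 1
    return counts


def weighted_fraction(monthly_values, counts):
    """Hours-weighted average of a monthly parameter over the horizon."""
    total_hours = sum(counts.values())
    return sum(monthly_values[m] * n for m, n in counts.items()) / total_hours


# A one-week horizon straddling January and February of a hypothetical year:
# hours 697-744 are the last 48 of January, 745-864 the first 120 of February.
counts = month_hour_counts(2020, 697, 864)
avg = weighted_fraction({1: 0.30, 2: 0.50}, counts)  # (0.3*48 + 0.5*120) / 168
```

The same `weighted_fraction` call would be repeated for the minimum and maximum fractions, which is why the result is a weighted average of the monthly mins and maxes and not their extremes.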

diff --git a/data_toolkit/project/opchar/var_profiles/create_monte_carlo_var_gen_input_csvs.py b/data_toolkit/project/opchar/var_profiles/create_monte_carlo_var_gen_input_csvs.py
@@ -20,6 +20,40 @@
 this module,you will need to create weather draws with the
 ``create_monte_carlo_draws`` module (see :ref:`monte-carlo-draws-section-ref`).
 
+===================
+What this step does
+===================
+
+This is the variable energy resource (VER) counterpart to the load-CSV step. It
+reads the synthetic per-iteration variable generation profiles -- assembled from
+``raw_data_var_profiles`` (the raw hourly unit-level ``cap_factor`` data) and
+``raw_data_var_project_units`` (the project-to-unit mapping and per-unit
+weights), resampled according to the weather draws stored in
+``aux_weather_iterations`` -- and writes them out as GridPath variable-generator
+profile input CSVs in ``--output_directory``, tagged with the given
+``--variable_generator_profile_scenario_id`` and
+``--variable_generator_profile_scenario_name``. These CSVs are the files the
+GridPath model consumes for variable generation.
+
+===========
+Methodology
+===========
+
+For each project, the per-unit ``cap_factor`` values from ``raw_data_var_profiles``
+are multiplied by their ``unit_weight`` and summed to produce a single
+project-level ``cap_factor`` time series. The weather draws in
+``aux_weather_iterations`` (selected by ``--weather_bins_id`` and
+``--weather_draws_id``) determine, for each Monte Carlo ``weather_iteration`` and
+``draw_number``, which historical day's data to pull, and the draw number is used
+to compute the ``timepoint`` ID. One output CSV is written per project, named
+``{project}-{scenario_id}-{scenario_name}.csv``, with an accompanying iterations
+CSV written to an ``iterations`` subdirectory of ``--output_directory``.
+
+``--n_parallel_projects N`` processes up to ``N`` projects concurrently (via a
+multiprocessing pool over the project pool) to speed things up. ``--overwrite``
+deletes any existing CSVs with the matching project/scenario filename before
+writing; without it, output is appended to existing files.
+
 =====
 Usage
 =====
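
Illustrative sketch (not part of the patch): the unit-to-project aggregation described in the Methodology section above, where each unit's `cap_factor` is multiplied by its `unit_weight` and summed per hour. The nested-dictionary layout, the unit names, and the function name are assumptions for illustration; the module itself reads these values from `raw_data_var_profiles` and `raw_data_var_project_units`.

```python
from collections import defaultdict


def project_cap_factors(unit_profiles, unit_weights):
    """Sum unit cap factors, scaled by unit weight, for each hour."""
    project_profile = defaultdict(float)
    for unit, profile in unit_profiles.items():
        for hour, cap_factor in profile.items():
            project_profile[hour] += unit_weights[unit] * cap_factor
    return dict(project_profile)


# Two hypothetical units with equal weights and two hours of data each.
profiles = {"unit_1": {1: 0.5, 2: 0.25}, "unit_2": {1: 0.0, 2: 0.75}}
weights = {"unit_1": 0.5, "unit_2": 0.5}
project = project_cap_factors(profiles, weights)
```

With weights that sum to 1 the project series stays within the range of the unit series, which is the usual expectation for a capacity factor.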