39 changes: 38 additions & 1 deletion data_toolkit/load_raw_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,46 @@
"""
Load data into the GridPath raw data database. See the documentation of each
GridPath Data Toolkit module for data prerequisites. Use the
files_to_import.csv file to tell GridPath which CSV files should be loaded
``files_to_import.csv`` file to tell GridPath which CSV files should be loaded
into which database table.

===================
What this step does
===================

This module is a generic bulk loader for raw CSV data into the GridPath
database. It reads a file named ``files_to_import.csv`` located in the
directory given by ``--csv_location``. Each row of that file describes one
CSV file: an import flag (whether the file should be loaded), the CSV
filename (relative to ``--csv_location``), and the database table the file
should be loaded into.

The loader iterates over the rows of ``files_to_import.csv`` and, for each row whose import flag
is True, reads the corresponding CSV from ``--csv_location`` and appends its
contents to the named database table (existing rows are preserved; data is
inserted with ``if_exists="append"``). Rows whose import flag is False are
skipped.
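
The loop described above can be sketched with the standard library alone. This
is an illustration under stated assumptions, not GridPath's implementation: the
helper name ``load_manifest`` is hypothetical, the manifest is assumed to have a
header row and exactly three columns, the flag is assumed to be spelled
``True``/``False`` (or ``1``/``0``), and the destination tables are assumed to
exist already.

```python
import csv
import os
import sqlite3


def load_manifest(conn, csv_location):
    # Hypothetical helper: walk files_to_import.csv and append each flagged
    # CSV to its destination table. Assumes a header row in the manifest and
    # in every data CSV, and that the destination tables already exist.
    manifest_path = os.path.join(csv_location, "files_to_import.csv")
    loaded = []
    with open(manifest_path, newline="") as manifest:
        reader = csv.reader(manifest)
        next(reader)  # skip the manifest header
        for import_flag, filename, table in reader:
            if import_flag.strip().lower() not in ("true", "1"):
                continue  # rows flagged False are skipped
            with open(os.path.join(csv_location, filename), newline="") as f:
                rows = list(csv.reader(f))
            header, data = rows[0], rows[1:]
            columns = ", ".join(header)
            placeholders = ", ".join("?" for _ in header)
            # Table and column names come from trusted local files here;
            # the values themselves are bound as parameters.
            conn.executemany(
                f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
                data,
            )
            loaded.append(table)
    conn.commit()
    return loaded
```

Because rows are inserted rather than replaced, running the sketch twice on the
same manifest would load the same data twice, which is the same consequence the
append behavior described above has.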

This generic loader is used throughout the Data Toolkit workflow to populate
``raw_data`` tables (e.g., VER profiles and their unit mapping, hydro operating
characteristics) that later Data Toolkit steps depend on.

=====
Usage
=====

>>> python -m data_toolkit.load_raw_data --database PATH/TO/DATABASE --csv_location PATH/TO/CSV/DIRECTORY

=========
Settings
=========
* database
* csv_location

The ``--csv_location`` directory must contain a ``files_to_import.csv``
manifest with columns for the import flag, the CSV filename, and the
destination database table, in that order.

"""

import sys
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,86 @@

Run unit outage simulation and create availability iteration inputs.

===================
What this step does
===================

This module runs a Monte Carlo unit-outage simulation and writes the resulting
exogenous availability (derate) input CSVs. Using the per-unit availability
parameters loaded from ``--outage_params_input_csv`` (into the
``raw_data_unit_availability_params`` table) -- forced-outage rates
(``unit_for``), mean time to repair (``unit_mttr``), the number of units
(``n_units``), the unit weight, and the per-unit outage model
(``unit_fo_model``) -- it simulates ``--n_iterations`` independent outage
timelines for each project, drawing random forced and (under the sequential
model) repair/maintenance transitions.

The outage model is selected per unit via the ``unit_fo_model`` column and may
be one of:

* ``Derate`` -- a static derate ``1 - unit_for`` applied in every timepoint.
* ``MC_independent`` -- each timepoint's outage state is drawn independently
from a uniform distribution against the forced-outage rate.
* ``MC_sequential`` -- a sequential (exponential) failure/repair process
driven by the forced-outage rate and ``unit_mttr`` (the implied mean time
to failure is ``mttr * (1 / for - 1)``), preserving outage persistence
across timepoints.
* ``historical_year`` -- instead of simulating, a random historical year is
sampled for the unit from ``--historical_availability_csv`` and that
year's hourly derate series is used directly. (This can be used for
units whose availability is taken from a historical record rather than
simulated; the choice is driven by the unit's ``unit_fo_model`` value,
not by project type.)
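
The three simulated models above can be sketched as follows. This is a minimal
illustration using Python's ``random`` module, not the module's own code: the
function names are hypothetical, the state is sampled once per hour, and every
unit is assumed to start in the available state.

```python
import random


def mean_time_to_failure(forced_outage_rate, mttr):
    # Implied MTTF from the text above: mttr * (1 / for - 1).
    return mttr * (1.0 / forced_outage_rate - 1.0)


def derate(forced_outage_rate, n_hours):
    # Static derate: the same availability in every timepoint.
    return [1.0 - forced_outage_rate] * n_hours


def mc_independent(forced_outage_rate, n_hours, rng):
    # Each hour is an independent uniform draw against the outage rate.
    return [
        0.0 if rng.random() < forced_outage_rate else 1.0
        for _ in range(n_hours)
    ]


def mc_sequential(forced_outage_rate, mttr, n_hours, rng):
    # Two-state process with exponentially distributed up and down durations.
    mttf = mean_time_to_failure(forced_outage_rate, mttr)
    availability = []
    up = True  # assumption: the unit starts available
    time_left = rng.expovariate(1.0 / mttf)
    for _ in range(n_hours):
        # Switch state as many times as needed before sampling this hour.
        while time_left <= 0:
            up = not up
            time_left += rng.expovariate(1.0 / (mttf if up else mttr))
        availability.append(1.0 if up else 0.0)
        time_left -= 1.0
    return availability
```

With ``unit_for = 0.1`` and ``unit_mttr = 10``, the implied mean time to failure
is 90 hours, so the long-run availability of the sequential process is
``90 / (90 + 10) = 0.9``, the same as under the independent model; the
difference is that outages persist for several hours at a time.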

For each project the per-unit availability adjustments are combined using each
unit's ``unit_weight`` to form a weighted project-level derate. Hybrid-storage
projects (``hybrid_stor`` set) additionally get a separately simulated derate
for the storage component. By default only rows whose derate differs from 1 are
written (as default availability in GridPath is 1); pass ``--print_ones`` to
retain all rows.
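
A minimal sketch of the weighting and filtering described above. It assumes the
combination is a plain weighted sum and that the weights of a project's units
sum to 1; the function names and the 1-based timepoint numbering are
hypothetical.

```python
def project_derate(unit_availability, unit_weights):
    # unit_availability: {unit: [availability per timepoint]}
    # unit_weights: {unit: weight}; assumed here to sum to 1.
    n_timepoints = len(next(iter(unit_availability.values())))
    return [
        sum(unit_weights[u] * series[t] for u, series in unit_availability.items())
        for t in range(n_timepoints)
    ]


def rows_to_write(derates, print_ones=False):
    # Keep (timepoint, derate) rows; drop derate == 1 unless print_ones is
    # set, since 1 is the default availability.
    return [
        (t, d) for t, d in enumerate(derates, start=1) if print_ones or d != 1
    ]
```

For two equally weighted units where only the first is out in the second
timepoint, the project derate is ``[1.0, 0.5, 1.0]`` and only the middle row is
written by default.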

Output is written to ``--output_directory`` as one CSV per project, named
``<project>-<project_availability_scenario_id>-<project_availability_scenario_name>.csv``.
``--n_parallel_projects`` parallelizes the simulation across projects and
``--overwrite`` replaces existing files (otherwise existing files are appended
to). ``--sort`` re-sorts each output file at the end. These outage iterations
are intended to align with the weather/hydro iterations to form complete Monte
Carlo draws.

=========================
Reproducibility (seeding)
=========================

By default seeding is OFF: ``--user_provided_seeding`` is not set, so the
outage simulation is fully random and non-reproducible from run to run. When
seeding is off, *all* of the seeding flags below are ignored -- the seed
arguments are replaced with ``None`` before the simulation runs, and NumPy's
global RNG is never explicitly seeded.

To get reproducible outages, set ``--user_provided_seeding`` together with a
``--starting_project_iteration_seed <int>`` (defaults to ``0``). With seeding
on:

* **Per-project, non-overlapping seed ranges.** Each project is assigned a
starting seed of ``starting_project_iteration_seed + project_idx *
n_iterations``. Within a project the per-iteration seed starts at that
value and is incremented by 1 for each of the ``n_iterations``
iterations, so the seed ranges of distinct projects do not overlap.
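* **Per-unit seeds within an iteration.** For a given project iteration, the
per-iteration seed is used to seed NumPy's RNG, which then draws one
distinct integer seed per unit, without replacement, via
``np.random.choice(seed_population, size=n_units_in_project,
replace=False)``, where ``seed_population`` is ``np.arange(1,
max_integer_for_unit_outage_seeding, hybrid_storage_seed_increment + 1)``.
Each unit's outage timeline is then simulated from its own seed. If a
project has more units than there are values in ``seed_population``, a
``ValueError`` is raised.
``--max_integer_for_unit_outage_seeding`` defaults to ``1000000``.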
* **Hybrid-storage offset.** For hybrid-storage projects, the storage
component is simulated with a seed offset from the generator component's
unit seed by ``--hybrid_storage_seed_increment`` (defaults to ``1000``).
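
The seed arithmetic above can be sketched with the standard library. This is a
stand-in, not the module's code: it follows the draw-without-replacement logic
of the seeding code changed in this diff, but uses ``random.Random.sample``
instead of NumPy, so the actual seed values differ from what the module
produces. The function names are hypothetical.

```python
import random


def project_iteration_seeds(starting_seed, project_idx, n_iterations):
    # Seeds for one project: a contiguous block that no other project uses.
    base = starting_seed + project_idx * n_iterations
    return list(range(base, base + n_iterations))


def unit_seeds(project_iteration_seed, n_units, max_seed_int, storage_increment):
    # Stdlib analogue of a seeded draw without replacement. Candidates are
    # spaced (storage_increment + 1) apart so that a generator seed plus the
    # storage increment never equals another generator seed.
    population = range(1, max_seed_int, storage_increment + 1)
    if n_units > len(population):
        raise ValueError("not enough distinct seeds; raise max_seed_int")
    return random.Random(project_iteration_seed).sample(population, n_units)
```

With ``n_iterations = 5`` and a starting seed of ``0``, the first project uses
seeds 0 through 4 and the second uses 5 through 9. These ranges stay aligned
across runs only if projects are always enumerated in the same order.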

Every project / unit / iteration still draws its own independent random outage
timeline, but the whole simulation reproduces exactly when re-run with the same
seed settings. Again, these flags are ignored unless ``--user_provided_seeding``
is set. Proceed with caution when seeding.

=====
Usage
=====
Expand All @@ -35,19 +115,34 @@
Input prerequisites
===================

This module assumes the following raw input database tables have been populated:
This module assumes the following raw input database table has been populated:
* raw_data_unit_availability_params
* raw_data_var_project_units

This table can be populated ahead of time, or loaded at run time by passing
``--outage_params_input_csv``. Units that use the ``historical_year`` outage
model additionally read their derate series from the CSV passed via
``--historical_availability_csv``.

=========
Settings
=========
* database
* output_directory
* outage_params_input_csv
* historical_availability_csv
* stage_id
* n_iterations
* study_year
* project_availability_scenario_id
* project_availability_scenario_name
* output_directory
* overwrite
* sort
* print_ones
* n_parallel_projects
* user_provided_seeding
* starting_project_iteration_seed
* max_integer_for_unit_outage_seeding
* hybrid_storage_seed_increment

"""

Expand Down Expand Up @@ -119,6 +214,7 @@ def parse_arguments(args):
"-max_unit_seed_int",
"--max_integer_for_unit_outage_seeding",
default=1000000,
type=int,
help="The max integer for assigning seeds to each unit outage "
"simulation for a given project. The --user_provided_seeding flag must "
"be set to True for this to take effect. Proceed with caution.",
Expand Down Expand Up @@ -211,9 +307,37 @@ def get_weighted_availability_adjustment(
# For each project iteration, we assign a seed to each unit outage
# simulation based on the project_iteration_seed and a
# max_integer_for_unit_seeding number set by the user
# Draw a distinct generator-component seed per unit (without
# replacement) so no two units share a seed and thus an identical
# outage timeline. To also keep each hybrid unit's storage-component
# seed (generator seed + hyb_stor_seed_unit_increment, assigned below)
# from coinciding with another unit's generator or storage seed, draw
# the generator seeds from values spaced (increment + 1) apart: no two
# then differ by exactly the increment, so {seeds} and
# {seeds + increment} are disjoint.
n_units_in_project = len(project_df.index)
seed_population = np.arange(
1,
max_integer_for_unit_outage_seeding,
hyb_stor_seed_unit_increment + 1,
)
# Guard: np.random.choice(replace=False) cannot draw more values than
# are available. Fail with an actionable message rather than an opaque
# ValueError.
if n_units_in_project > len(seed_population):
raise ValueError(
f"Cannot assign {n_units_in_project} distinct, "
f"non-colliding outage seeds to project "
f"'{project_df['project'].iloc[0]}': only "
f"{len(seed_population)} are available given "
f"max_integer_for_unit_outage_seeding="
f"{max_integer_for_unit_outage_seeding} and "
f"hybrid_storage_seed_increment={hyb_stor_seed_unit_increment}. "
f"Increase --max_integer_for_unit_outage_seeding."
)
np.random.seed(project_iteration_seed)
unit_seeds = np.random.randint(
1, max_integer_for_unit_outage_seeding, size=len(project_df.index)
unit_seeds = np.random.choice(
seed_population, size=n_units_in_project, replace=False
)
else:
unit_seeds = [None for n in project_df.index]
Expand Down Expand Up @@ -507,10 +631,13 @@ def simulate_all_project_iterations(pool_datum):
# Loop through all iterations for this project
project_iteration_seed = starting_project_iteration_seed
for iteration_n in range(1, n_iterations + 1):
# ORDER BY unit so each unit maps to the same drawn seed (unit_seeds is
# indexed positionally) across runs; required for reproducible draws
project_df = pd.read_sql(
f"""
SELECT * FROM raw_data_unit_availability_params
WHERE project = '{project}'
ORDER BY unit
;""",
conn,
)
Expand Down Expand Up @@ -592,9 +719,12 @@ def main(args=None):
if not os.path.exists(parsed_args.output_directory):
os.makedirs(parsed_args.output_directory)

# Get projects
# Get projects. ORDER BY so that project_idx (and therefore each project's
# seed base, starting_project_iteration_seed + project_idx * n_iterations)
# is stable across runs -- otherwise seeded results are not reproducible.
projects = [i[0] for i in conn.execute("""
SELECT DISTINCT project FROM raw_data_unit_availability_params;
SELECT DISTINCT project FROM raw_data_unit_availability_params
ORDER BY project;
""").fetchall()]

all_files = []
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,84 @@
* hydro_operational_chars_scenario_name
* overwrite
* n_parallel_projects

===================
What this step does
===================

This module builds GridPath hydro operational-characteristics input CSVs from
the year/month hydro data loaded earlier
(``raw_data_project_hydro_opchars_by_year_month``, ``raw_data_hydro_years``,
and the user-defined balancing-type horizons in
``user_defined_balancing_type_horizons``). For each hydro iteration it derives
the per-horizon hydro operating parameters -- the average, minimum, and maximum
power fractions -- and writes them to ``--output_directory`` under the given
``hydro_operational_chars_scenario_id`` and
``hydro_operational_chars_scenario_name``. ``--n_parallel_projects N`` runs up
to ``N`` projects at once, and ``--overwrite`` replaces existing CSVs.

===========
Methodology
===========

The distinct projects to process are read from
``raw_data_project_hydro_opchars_by_year_month``, and one CSV is written per
project, named ``<project>-<scenario_id>-<scenario_name>.csv`` in
``--output_directory``. Projects are processed in a multiprocessing pool sized
by ``--n_parallel_projects`` (defaults to ``1``).

--------------------------------------------
Hydro iterations and balancing-type horizons
--------------------------------------------

The set of hydro years is read from ``raw_data_hydro_years`` and each year is
treated as one hydro iteration (written into the ``hydro_iteration`` column).
The set of ``(balancing_type, horizon)`` pairs is read from
``user_defined_balancing_type_horizons``; if ``--hydro_balancing_type`` is
supplied, the pairs are filtered to that single balancing type (e.g. ``day``,
``week``, ``month``), otherwise all balancing types are included. For every
combination of hydro year and balancing-type horizon, one output row is
produced.
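
As a rough sketch of the enumeration just described (the function name and the
tuple layout are hypothetical, and the database reads are replaced by plain
lists):

```python
from itertools import product


def output_keys(hydro_years, bt_horizons, hydro_balancing_type=None):
    # bt_horizons: list of (balancing_type, horizon) pairs.
    if hydro_balancing_type is not None:
        bt_horizons = [p for p in bt_horizons if p[0] == hydro_balancing_type]
    # One output row per (hydro year, balancing type, horizon).
    return [
        (year, bt, hz) for year, (bt, hz) in product(hydro_years, bt_horizons)
    ]
```

Two hydro years and three horizons yield six rows; filtering to a balancing type
that has two horizons yields four.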

------------------------------------
Deriving per-horizon power fractions
------------------------------------

The ``average_power_fraction``, ``min_power_fraction``, and
``max_power_fraction`` for each horizon are computed by month-weighting the raw
monthly opchar values. For a given balancing-type horizon, the module reads its
``hour_ending_of_year_start`` and ``hour_ending_of_year_end`` from
``user_defined_balancing_type_horizons`` and walks each hour of the year in that
range, mapping the hour to a calendar month (via a ``pandas.Timestamp`` anchored
at January 1 of the hydro year) and counting the number of hours that fall in
each month. These hour counts become the per-month weights for the horizon.

For each month touched by the horizon, the module looks up the project's
``average_power_fraction``, ``min_power_fraction``, and ``max_power_fraction``
for that hydro year and month in
``raw_data_project_hydro_opchars_by_year_month``, multiplies each by the month's
hour-count weight, sums across months, and divides by the total number of hours
in the horizon. The result is an hours-weighted average of the monthly
fractions for each of the three parameters, written as a single row keyed by
``balancing_type_project`` and ``horizon`` (with ``weather_iteration`` set to
``0``, i.e. no weather iteration). Note that the horizon minimum and maximum are
hours-weighted averages of the monthly minimums and maximums, not the minimum of
the minimums or the maximum of the maximums.
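
The hours-weighting can be sketched as below, using ``datetime`` in place of
``pandas.Timestamp``. The function names are hypothetical, and the mapping of an
hour ending to a clock hour (hour ending ``h`` starts ``h - 1`` hours after
midnight on January 1) is an assumption of this sketch rather than something
stated above.

```python
from collections import Counter
from datetime import datetime, timedelta


def month_hour_weights(hydro_year, hour_start, hour_end):
    # Count the hours of the horizon that fall in each calendar month.
    # Assumption: hour ending h covers the hour that begins h - 1 hours after
    # midnight on January 1 of the hydro year.
    jan1 = datetime(hydro_year, 1, 1)
    return Counter(
        (jan1 + timedelta(hours=h - 1)).month
        for h in range(hour_start, hour_end + 1)
    )


def horizon_fraction(monthly_fraction, weights):
    # Hours-weighted average of the monthly values.
    total_hours = sum(weights.values())
    return sum(monthly_fraction[m] * n for m, n in weights.items()) / total_hours
```

A two-day horizon spanning January 31 and February 1 gets 24 hours of weight in
each month, so monthly fractions of 0.2 and 0.4 average to 0.3. The same
function would be applied separately to the average, minimum, and maximum
fractions.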

----------------------------------
Writing and overwriting output
----------------------------------

Rows are appended to the project's CSV as they are generated, with the header
written only when the file does not yet exist. When ``--overwrite`` is set, any
existing CSV for the project is deleted before processing begins so it is
rebuilt from scratch; without ``--overwrite``, new rows are appended to any
existing file.
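
A minimal sketch of this append-with-header pattern, assuming plain ``csv``
writes (the helper names are hypothetical and this is not the module's own
code):

```python
import csv
import os


def prepare_output(path, overwrite):
    # With overwrite, drop any existing file so it is rebuilt from scratch.
    if overwrite and os.path.exists(path):
        os.remove(path)


def append_rows(path, header, rows):
    # Write the header only when the file does not exist yet, then append.
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)
```

Without ``--overwrite``, a second run against the same output directory adds
rows to the existing files, so duplicate rows are possible if the same
iterations are generated again.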

If the corresponding ``--*_input_csv`` paths are provided, the raw-data tables
(``raw_data_project_hydro_opchars_by_year_month``, ``raw_data_hydro_years``,
``user_defined_balancing_type_horizons``) are loaded from those CSVs before the
inputs are built; otherwise the data is assumed to already be present in the
database.
"""

from argparse import ArgumentParser
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,40 @@
this module, you will need to create weather draws with the
``create_monte_carlo_draws`` module (see :ref:`monte-carlo-draws-section-ref`).

===================
What this step does
===================

This is the variable energy resource (VER) counterpart to the load-CSV step. It
reads the synthetic per-iteration variable generation profiles -- assembled from
``raw_data_var_profiles`` (the raw hourly unit-level ``cap_factor`` data) and
``raw_data_var_project_units`` (the project-to-unit mapping and per-unit
weights), resampled according to the weather draws stored in
``aux_weather_iterations`` -- and writes them out as GridPath variable-generator
profile input CSVs in ``--output_directory``, tagged with the given
``--variable_generator_profile_scenario_id`` and
``--variable_generator_profile_scenario_name``. These CSVs are the files the
GridPath model consumes for variable generation.

===========
Methodology
===========

For each project, the per-unit ``cap_factor`` values from ``raw_data_var_profiles``
are multiplied by their ``unit_weight`` and summed to produce a single
project-level ``cap_factor`` time series. The weather draws in
``aux_weather_iterations`` (selected by ``--weather_bins_id`` and
``--weather_draws_id``) determine, for each Monte Carlo ``weather_iteration`` and
``draw_number``, which historical day's data to pull, and the draw number is used
to compute the ``timepoint`` ID. One output CSV is written per project, named
``{project}-{scenario_id}-{scenario_name}.csv``, with an accompanying iterations
CSV written to an ``iterations`` subdirectory of ``--output_directory``.
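
The unit-to-project aggregation in the first sentence above can be sketched as
follows. The function name and the ``(unit, hour, cap_factor)`` row layout are
hypothetical, and the weather-draw resampling step is not shown.

```python
from collections import defaultdict


def project_cap_factors(unit_rows, unit_weights):
    # unit_rows: iterable of (unit, hour, cap_factor) tuples for one project.
    # unit_weights: {unit: unit_weight}.
    totals = defaultdict(float)
    for unit, hour, cap_factor in unit_rows:
        totals[hour] += unit_weights[unit] * cap_factor
    # Return the project-level series ordered by hour.
    return dict(sorted(totals.items()))
```

For two equally weighted units with capacity factors 0.5 and 1.0 in the same
hour, the project-level value for that hour is 0.75.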

``--n_parallel_projects N`` processes up to ``N`` projects concurrently (via a
multiprocessing pool over the project pool) to speed things up. ``--overwrite``
deletes any existing CSVs with the matching project/scenario filename before
writing; without it, output is appended to existing files.

=====
Usage
=====
Expand Down