Exploratory Data Analysis¶
The purpose of this notebook is to explore potential relationships and build an intuition about the distribution of curtailment events. Naively, we are looking to find dimensions on which we might cluster or categorize different curtailment events, and to understand any seasonality or time dependence in the data.
[1]:
import pandas as pd
import numpy as np
import altair as alt
from src.conf import settings
[2]:
# Enable Altair to Serve data from disk since we expect to have at least 8760 rows or more
alt.data_transformers.enable('json')
# Uncomment if using JupyterLab (see: https://github.com/altair-viz/altair/issues/1867)
# alt.data_transformers.enable("data_server")
[2]:
DataTransformerRegistry.enable('json')
[3]:
df = pd.concat(
[
pd.read_parquet(fp_) for fp_ in settings.DATA_DIR.glob("processed/caiso/*.parquet")
]
).tz_convert(tz="US/Pacific")
# Single Year for debugging
# year = 2019
# df = pd.read_parquet(settings.DATA_DIR / f"processed/caiso/{year}.parquet").tz_convert(tz="US/Pacific")
[4]:
column_map = zip(
df.columns.tolist(),
df.columns.str.replace("\W+", "_").str.lower().tolist()
)
df = df.rename(columns=dict(column_map))
Data Dictionary
load (MW): Total demand across the CAISO for a given interval.
solar (MW): Average interval Solar production
wind (MW): Average interval Wind production
net_load (MW): Load - solar - wind
renewables (MW): Average interval production from solar, wind, biomass, biogas, geothermal and small hydropower
nuclear (MW): Average nuclear production
large_hydro (MW): Average large hydro production
imports (MW): Imports coming into the ISO; note that exports are NOT deducted from imports.
generation (MW): Total generation across all generator types
thermal (MW): non-nuclear and non-geothermal thermal production
load_less_generation_imports_ (MW): data validation column to ensure supply ~ demand
wind_curtailment (MW): Curtailed wind in a given interval
solar_curtailment (MW): Curtailed solar in a given interval
[5]:
df.head()
[5]:
load | solar | wind | net_load | renewables | nuclear | large_hydro | imports | generation | thermal | load_less_generation_imports_ | wind_curtailment | solar_curtailment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | |||||||||||||
2018-01-01 00:00:00-08:00 | 21552.671558 | 0.0 | 243.496014 | 21309.175544 | 2032.994475 | 2259.991374 | 2098.405092 | 7586.553984 | 13964.462727 | 7573.071786 | 1.654847 | NaN | NaN |
2018-01-01 00:05:00-08:00 | 21486.779943 | 0.0 | 238.499132 | 21248.280811 | 2026.481236 | 2259.952873 | 2143.557374 | 7543.249094 | 13944.799274 | 7514.807791 | -1.268424 | NaN | NaN |
2018-01-01 00:10:00-08:00 | 21391.303108 | 0.0 | 227.229805 | 21164.073304 | 2014.528291 | 2259.809808 | 2130.224136 | 7666.711084 | 13727.433629 | 7322.871394 | -2.841604 | NaN | NaN |
2018-01-01 00:15:00-08:00 | 21301.872170 | 0.0 | 228.026545 | 21073.845624 | 2016.257246 | 2259.802584 | 2093.382906 | 7701.705829 | 13598.816375 | 7229.373638 | 1.349966 | NaN | NaN |
2018-01-01 00:20:00-08:00 | 21242.270230 | 0.0 | 232.213448 | 21010.056782 | 2024.277588 | 2260.008342 | 2160.635742 | 7677.751619 | 13560.236628 | 7115.314955 | 4.281984 | NaN | NaN |
Since our interval timeseries are not continuous and complete (i.e. significant irregular gaps exist across years), we must convert this to MWh to get additivity. All columns are in MW at 5 minute intervals.
MWh are calculated by taking
\(MWh = MW 5[min]\frac{1[hour]}{60[min]}\)
[6]:
df_mwh = df * (5/60.)
df_mwh.head()
[6]:
load | solar | wind | net_load | renewables | nuclear | large_hydro | imports | generation | thermal | load_less_generation_imports_ | wind_curtailment | solar_curtailment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | |||||||||||||
2018-01-01 00:00:00-08:00 | 1796.055963 | 0.0 | 20.291334 | 1775.764629 | 169.416206 | 188.332615 | 174.867091 | 632.212832 | 1163.705227 | 631.089315 | 0.137904 | NaN | NaN |
2018-01-01 00:05:00-08:00 | 1790.564995 | 0.0 | 19.874928 | 1770.690068 | 168.873436 | 188.329406 | 178.629781 | 628.604091 | 1162.066606 | 626.233983 | -0.105702 | NaN | NaN |
2018-01-01 00:10:00-08:00 | 1782.608592 | 0.0 | 18.935817 | 1763.672775 | 167.877358 | 188.317484 | 177.518678 | 638.892590 | 1143.952802 | 610.239283 | -0.236800 | NaN | NaN |
2018-01-01 00:15:00-08:00 | 1775.156014 | 0.0 | 19.002212 | 1756.153802 | 168.021437 | 188.316882 | 174.448576 | 641.808819 | 1133.234698 | 602.447803 | 0.112497 | NaN | NaN |
2018-01-01 00:20:00-08:00 | 1770.189186 | 0.0 | 19.351121 | 1750.838065 | 168.689799 | 188.334029 | 180.052978 | 639.812635 | 1130.019719 | 592.942913 | 0.356832 | NaN | NaN |
We roll up our data to an hourly basis to remove effects from sub-hourly markets and highly localized or temporal weather events.
[7]:
# Roll up to hourly 8760 by year
hourly_mwh = df_mwh.groupby(
by=pd.Grouper(freq="H")
)[["solar_curtailment", "solar", "net_load", "load", "generation", "renewables", "wind_curtailment"]].sum()
# Calculate some other potentially interesting metrics
# How much of our load in a given hour was fulfilled by solar generation?
hourly_mwh["pct_solar"] = hourly_mwh["solar"]/hourly_mwh["generation"]
# How much curtailment is occurring for each unit of raw solar generation?
hourly_mwh["curtailment_intensity"] = (hourly_mwh["solar_curtailment"].fillna(0)/hourly_mwh["solar"]).fillna(0)
# Calculate Pct of Total Solar Potential
hourly_mwh["pct_solar_potential"] = (hourly_mwh["solar_curtailment"]/(hourly_mwh["solar"] + hourly_mwh["solar_curtailment"])).fillna(0)
# Calculate Pct of Total Renewable Potential
hourly_mwh["pct_renewable_potential"] = (hourly_mwh["solar_curtailment"]/(
hourly_mwh["renewables"] + hourly_mwh["solar_curtailment"] + hourly_mwh["wind_curtailment"])
).fillna(0)
For convenience, we define a subset of data with some non-zero amount of curtailment.
[8]:
curtailment_days = hourly_mwh.query("solar_curtailment > 0").reset_index()
curtailment_days = pd.concat(
[_dat.sort_values("solar_curtailment", ascending=False).reset_index(drop=True).reset_index() for (grp, _dat) in curtailment_days.groupby(by=curtailment_days["timestamp"].dt.year)],
ignore_index=True
)
Due to correlation between nearby hours, we are also interested in a daily totals.
[9]:
daily_curtailment = curtailment_days.groupby(
by=pd.Grouper(key="timestamp", freq="D")
)[["solar", "solar_curtailment", "net_load", "load", "generation", "renewables", "wind_curtailment"]].sum()
daily_curtailment["curtailment_intensity"] = daily_curtailment.eval("solar_curtailment/solar")
daily_curtailment.reset_index(inplace=True)
Exploration: How much annual Curtailment is there?¶
We are primarily interested in finding explanatory variables for curtailment. We are interested specifically in time dependence at the hour-scale. The goal of this study is to identify exogenous data sets which could explain a given curtailment day.
Chart Definitions
[10]:
# Interactive version to Drill Down Over Fixed Time periods
brush = alt.selection(type='interval', encodings=['x'])
base = alt.Chart(
hourly_mwh.reset_index()
).mark_rect().encode(
alt.Y("hours(timestamp):T", title="Hour of Day", sort="descending"),
alt.X("yearmonthdate(timestamp):T", title="Date", sort="ascending"),
alt.Color(
alt.repeat("row"),
type="quantitative",
sort="descending",
scale=alt.Scale(scheme="redgrey"),
legend=alt.Legend(orient="top")
)
).properties(
width=1500,
height=250
)
background = base.encode(
color=alt.value('#ddd')
).add_selection(brush)
highlight = base.transform_filter(brush)
brush_chart = alt.layer(
background,
highlight
).repeat(
row=["solar_curtailment", "solar", "curtailment_intensity"]
).resolve_scale(color="independent")
[11]:
# A better version of the chart above with units and things but not interactive
basechart = alt.Chart(
hourly_mwh.reset_index()
).mark_rect().encode(
alt.Y("hours(timestamp):T", title="Hour of Day", sort="descending"),
alt.X("yearmonthdate(timestamp):T", title="Date", sort="ascending"),
).properties(
width=1500,
height=250
)
fixed_chart = alt.vconcat(
basechart.encode(
alt.Color(
"solar_curtailment:Q",
scale=alt.Scale(type="pow", exponent=1/4, scheme="redgrey"),
sort="descending",
title="Solar Curtailment (MWh)",
legend=alt.Legend(orient="top"),
)
),
basechart.encode(
alt.Color(
"solar:Q",
scale=alt.Scale(type="pow", exponent=2, scheme="redgrey"),
sort="descending",
title="Solar Output (MW)",
legend=alt.Legend(orient="top"),
)
),
basechart.encode(
alt.Color(
"curtailment_intensity:Q",
scale=alt.Scale(scheme="redgrey", type="pow", exponent=1/4, domain=[0,1]),
sort="descending",
title="MWh Cur. per MWh Load",
legend=alt.Legend(orient="top"),
)
)
).resolve_scale(color="independent")
[12]:
# Let's inspect our curtailed hours a bit more closely
load_chart = alt.Chart(curtailment_days).mark_line().encode(
alt.X("index", title="rank"),
alt.Y("solar_curtailment", title="Solar Curtailment (MW)"),
).facet("year(timestamp):N")
[13]:
curtail_v_solar_production = alt.Chart(curtailment_days).mark_point().encode(
alt.X("solar", title="Solar Production"),
alt.Y("solar_curtailment", title="Solar Curtailment (MW)"),
alt.Color("month(timestamp):N", scale=alt.Scale(scheme="tableau20")),
alt.Facet("year(timestamp):O", columns=3)
)
[14]:
base = alt.Chart().mark_point().encode(
alt.X("monthdate(timestamp):T", title="Date (hour)"),
alt.Y("curtailment_intensity", title="Curtailment Intensity (MW/MW)"),
alt.Color("load:Q", scale=alt.Scale(scheme="redyellowblue"), sort="descending", title="Load (MW)"),
# alt.Facet("year(timestamp):O", columns=3)
)
curtailment_intensity = alt.layer(base, data=curtailment_days.query("curtailment_intensity < 1")).facet("year(timestamp):O")
daily_curtailment_intensity = alt.layer(base, data=daily_curtailment).facet("year(timestamp):O")
[15]:
daily_curtailment_vs_load = alt.Chart(daily_curtailment).mark_point().encode(
alt.X("load", title="Load (MWh)"),
alt.Y("solar_curtailment", title="Daily Solar Curtailment (MWh)"),
alt.Color("month(timestamp):N", scale=alt.Scale(scheme="tableau20")),
alt.Facet("year(timestamp):O"),
)
Curtailment Characteristics¶
[16]:
alt.Chart(daily_curtailment).mark_bar().encode(
alt.X("yearmonth(timestamp):T", title="Month"),
alt.Y("mean(solar_curtailment)", title="Mean Curtailment (MWh)")
).interactive()
[16]:
[17]:
fixed_chart
[17]:
Shown above are
Hourly Solar curtailment in \(x^{.25}\) scale to exaggerate visual differences in curtailment scale.
Hourly Solar Output (MW) in \(x^{2}\) scale.
Hourly Curtailment “Intensity” w.r.t. Load (MW) in \(x^{.25}\) scale.
By visual inspection, two large curtailment “zones” or “clusters” appear to exist – one centered around mid-March, and the other around mid-October. Another prominent feature is the fast adoption of solar over time.
[18]:
daily_curtailment_vs_load
[18]:
By inspection, we can see a very rough pattern of high curtailment events during periods of “medium” load. Intuitively, we might imagine these correspond to days with very high solar production, and very low demand (e.g. sunny in the desert, and temperate in populated centers.)
[19]:
load_chart
[19]:
Hourly curtailment by year in MW - Over time, the available curtailed capacity has increased drastically (by a factor of 3 from 2016 to 2019).
[20]:
curtail_v_solar_production
[20]:
One metric of interest may be the amount of solar curtailment per unit of solar production. Each mark represents a single hour in a year. By visual inspection, a vague envelope is visible on the right hand side. The sum of curtailment and production might represent the total solar capacity in a given hour. By inspection, we can also observe a larger portion of curtailment is occurring year on year.
[21]:
curtailment_intensity
[21]:
Hourly curtailment per unit of solar production. A value of 1 represents an hour where 50% of solar capacity is curtailed.
[22]:
daily_curtailment_intensity
[22]:
Total daily curtailment per unit of solar production. A value of 1 represents a day where 50% of solar capacity is curtailed.