Exploratory Data Analysis

The purpose of this notebook is to explore potential relationships and build an intuition about the distribution of curtailment events. Naively, we are looking to find dimensions on which we might cluster or categorize different curtailment events, and to understand any seasonality or time dependence in the data.

[1]:
import pandas as pd
import numpy as np
import altair as alt

from src.conf import settings
[2]:
# Enable Altair to Serve data from disk since we expect to have at least 8760 rows or more
alt.data_transformers.enable('json')

# Uncomment if using JupyterLab (see: https://github.com/altair-viz/altair/issues/1867)
# alt.data_transformers.enable("data_server")
[2]:
DataTransformerRegistry.enable('json')
[3]:
df = pd.concat(
    [
        pd.read_parquet(fp_) for fp_ in settings.DATA_DIR.glob("processed/caiso/*.parquet")
    ]
).tz_convert(tz="US/Pacific")

# Single Year for debugging

# year = 2019
# df = pd.read_parquet(settings.DATA_DIR / f"processed/caiso/{year}.parquet").tz_convert(tz="US/Pacific")
[4]:
column_map = zip(
    df.columns.tolist(),
    df.columns.str.replace("\W+", "_").str.lower().tolist()
)

df = df.rename(columns=dict(column_map))

Data Dictionary

[5]:
df.head()
[5]:
load solar wind net_load renewables nuclear large_hydro imports generation thermal load_less_generation_imports_ wind_curtailment solar_curtailment
timestamp
2018-01-01 00:00:00-08:00 21552.671558 0.0 243.496014 21309.175544 2032.994475 2259.991374 2098.405092 7586.553984 13964.462727 7573.071786 1.654847 NaN NaN
2018-01-01 00:05:00-08:00 21486.779943 0.0 238.499132 21248.280811 2026.481236 2259.952873 2143.557374 7543.249094 13944.799274 7514.807791 -1.268424 NaN NaN
2018-01-01 00:10:00-08:00 21391.303108 0.0 227.229805 21164.073304 2014.528291 2259.809808 2130.224136 7666.711084 13727.433629 7322.871394 -2.841604 NaN NaN
2018-01-01 00:15:00-08:00 21301.872170 0.0 228.026545 21073.845624 2016.257246 2259.802584 2093.382906 7701.705829 13598.816375 7229.373638 1.349966 NaN NaN
2018-01-01 00:20:00-08:00 21242.270230 0.0 232.213448 21010.056782 2024.277588 2260.008342 2160.635742 7677.751619 13560.236628 7115.314955 4.281984 NaN NaN

Since our interval timeseries are not continuous and complete (i.e. significant irregular gaps exist across years), we must convert this to MWh to get additivity. All columns are in MW at 5 minute intervals.

MWh are calculated by taking

\(MWh = MW 5[min]\frac{1[hour]}{60[min]}\)

[6]:
df_mwh = df * (5/60.)
df_mwh.head()
[6]:
load solar wind net_load renewables nuclear large_hydro imports generation thermal load_less_generation_imports_ wind_curtailment solar_curtailment
timestamp
2018-01-01 00:00:00-08:00 1796.055963 0.0 20.291334 1775.764629 169.416206 188.332615 174.867091 632.212832 1163.705227 631.089315 0.137904 NaN NaN
2018-01-01 00:05:00-08:00 1790.564995 0.0 19.874928 1770.690068 168.873436 188.329406 178.629781 628.604091 1162.066606 626.233983 -0.105702 NaN NaN
2018-01-01 00:10:00-08:00 1782.608592 0.0 18.935817 1763.672775 167.877358 188.317484 177.518678 638.892590 1143.952802 610.239283 -0.236800 NaN NaN
2018-01-01 00:15:00-08:00 1775.156014 0.0 19.002212 1756.153802 168.021437 188.316882 174.448576 641.808819 1133.234698 602.447803 0.112497 NaN NaN
2018-01-01 00:20:00-08:00 1770.189186 0.0 19.351121 1750.838065 168.689799 188.334029 180.052978 639.812635 1130.019719 592.942913 0.356832 NaN NaN

We roll up our data to an hourly basis to remove effects from sub-hourly markets and highly localized or temporal weather events.

[7]:
# Roll up to hourly 8760 by year
hourly_mwh = df_mwh.groupby(
    by=pd.Grouper(freq="H")
)[["solar_curtailment", "solar", "net_load", "load", "generation", "renewables", "wind_curtailment"]].sum()

# Calculate some other potentially interesting metrics

# How much of our load in a given hour was fulfilled by solar generation?
hourly_mwh["pct_solar"] = hourly_mwh["solar"]/hourly_mwh["generation"]

# How much curtailment is occurring for each unit of raw solar generation?
hourly_mwh["curtailment_intensity"] = (hourly_mwh["solar_curtailment"].fillna(0)/hourly_mwh["solar"]).fillna(0)

# Calculate Pct of Total Solar Potential
hourly_mwh["pct_solar_potential"] = (hourly_mwh["solar_curtailment"]/(hourly_mwh["solar"] + hourly_mwh["solar_curtailment"])).fillna(0)

# Calculate Pct of Total Renewable Potential
hourly_mwh["pct_renewable_potential"] = (hourly_mwh["solar_curtailment"]/(
    hourly_mwh["renewables"] + hourly_mwh["solar_curtailment"] + hourly_mwh["wind_curtailment"])
).fillna(0)

For convenience, we define a subset of data with some non-zero amount of curtailment.

[8]:
curtailment_days = hourly_mwh.query("solar_curtailment > 0").reset_index()
curtailment_days = pd.concat(
    [_dat.sort_values("solar_curtailment", ascending=False).reset_index(drop=True).reset_index() for (grp, _dat) in curtailment_days.groupby(by=curtailment_days["timestamp"].dt.year)],
    ignore_index=True
)

Due to correlation between nearby hours, we are also interested in a daily totals.

[9]:
daily_curtailment = curtailment_days.groupby(
    by=pd.Grouper(key="timestamp", freq="D")
)[["solar", "solar_curtailment", "net_load", "load", "generation", "renewables", "wind_curtailment"]].sum()
daily_curtailment["curtailment_intensity"] = daily_curtailment.eval("solar_curtailment/solar")
daily_curtailment.reset_index(inplace=True)

Exploration: How much annual Curtailment is there?

We are primarily interested in finding explanatory variables for curtailment. We are interested specifically in time dependence at the hour-scale. The goal of this study is to identify exogenous data sets which could explain a given curtailment day.

Chart Definitions

[10]:
# Interactive version to Drill Down Over Fixed Time periods
brush = alt.selection(type='interval', encodings=['x'])

base = alt.Chart(
    hourly_mwh.reset_index()
).mark_rect().encode(
    alt.Y("hours(timestamp):T", title="Hour of Day", sort="descending"),
    alt.X("yearmonthdate(timestamp):T", title="Date", sort="ascending"),
    alt.Color(
        alt.repeat("row"),
        type="quantitative",
        sort="descending",
        scale=alt.Scale(scheme="redgrey"),
        legend=alt.Legend(orient="top")
    )
).properties(
    width=1500,
    height=250
)

background = base.encode(
    color=alt.value('#ddd')
).add_selection(brush)

highlight = base.transform_filter(brush)

brush_chart = alt.layer(
    background,
    highlight
).repeat(
    row=["solar_curtailment", "solar", "curtailment_intensity"]
).resolve_scale(color="independent")
[11]:
# A better version of the chart above with units and things but not interactive

basechart = alt.Chart(
    hourly_mwh.reset_index()
).mark_rect().encode(
    alt.Y("hours(timestamp):T", title="Hour of Day", sort="descending"),
    alt.X("yearmonthdate(timestamp):T", title="Date", sort="ascending"),
).properties(
    width=1500,
    height=250
)

fixed_chart = alt.vconcat(
    basechart.encode(
        alt.Color(
            "solar_curtailment:Q",
              scale=alt.Scale(type="pow", exponent=1/4, scheme="redgrey"),
              sort="descending",
              title="Solar Curtailment (MWh)",
              legend=alt.Legend(orient="top"),
         )
    ),
    basechart.encode(
        alt.Color(
            "solar:Q",
              scale=alt.Scale(type="pow", exponent=2, scheme="redgrey"),
              sort="descending",
              title="Solar Output (MW)",
              legend=alt.Legend(orient="top"),
        )
    ),
    basechart.encode(
        alt.Color(
            "curtailment_intensity:Q",
            scale=alt.Scale(scheme="redgrey", type="pow", exponent=1/4, domain=[0,1]),
            sort="descending",
            title="MWh Cur. per MWh Load",
            legend=alt.Legend(orient="top"),
        )
    )
).resolve_scale(color="independent")
[12]:
# Let's inspect our curtailed hours a bit more closely
load_chart = alt.Chart(curtailment_days).mark_line().encode(
    alt.X("index", title="rank"),
    alt.Y("solar_curtailment", title="Solar Curtailment (MW)"),
).facet("year(timestamp):N")
[13]:
curtail_v_solar_production = alt.Chart(curtailment_days).mark_point().encode(
    alt.X("solar", title="Solar Production"),
    alt.Y("solar_curtailment", title="Solar Curtailment (MW)"),
    alt.Color("month(timestamp):N", scale=alt.Scale(scheme="tableau20")),
    alt.Facet("year(timestamp):O", columns=3)
)
[14]:
base = alt.Chart().mark_point().encode(
    alt.X("monthdate(timestamp):T", title="Date (hour)"),
    alt.Y("curtailment_intensity", title="Curtailment Intensity (MW/MW)"),
    alt.Color("load:Q", scale=alt.Scale(scheme="redyellowblue"), sort="descending", title="Load (MW)"),
    # alt.Facet("year(timestamp):O", columns=3)
)

curtailment_intensity = alt.layer(base, data=curtailment_days.query("curtailment_intensity < 1")).facet("year(timestamp):O")
daily_curtailment_intensity = alt.layer(base, data=daily_curtailment).facet("year(timestamp):O")
[15]:
daily_curtailment_vs_load = alt.Chart(daily_curtailment).mark_point().encode(
    alt.X("load", title="Load (MWh)"),
    alt.Y("solar_curtailment", title="Daily Solar Curtailment (MWh)"),
    alt.Color("month(timestamp):N", scale=alt.Scale(scheme="tableau20")),
    alt.Facet("year(timestamp):O"),
)

Curtailment Characteristics

[16]:
alt.Chart(daily_curtailment).mark_bar().encode(
    alt.X("yearmonth(timestamp):T", title="Month"),
    alt.Y("mean(solar_curtailment)", title="Mean Curtailment (MWh)")
).interactive()
[16]:
[17]:
fixed_chart
[17]:

Shown above are

By visual inspection, two large curtailment “zones” or “clusters” appear to exist – one centered around mid-March, and the other around mid-October. Another prominent feature is the fast adoption of solar over time.

[18]:
daily_curtailment_vs_load
[18]:

By inspection, we can see a very rough pattern of high curtailment events during periods of “medium” load. Intuitively, we might imagine these correspond to days with very high solar production, and very low demand (e.g. sunny in the desert, and temperate in populated centers.)

[19]:
load_chart
[19]:

Hourly curtailment by year in MW - Over time, the available curtailed capacity has increased drastically (by a factor of 3 from 2016 to 2019).

[20]:
curtail_v_solar_production
[20]:

One metric of interest may be the amount of solar curtailment per unit of solar production. Each mark represents a single hour in a year. By visual inspection, a vague envelope is visible on the right hand side. The sum of curtailment and production might represent the total solar capacity in a given hour. By inspection, we can also observe a larger portion of curtailment is occurring year on year.

[21]:
curtailment_intensity
[21]:

Hourly curtailment per unit of solar production. A value of 1 represents an hour where 50% of solar capacity is curtailed.

[22]:
daily_curtailment_intensity
[22]:

Total daily curtailment per unit of solar production. A value of 1 represents a day where 50% of solar capacity is curtailed.