Example: Predicting Daily Curtailment Events

Process Overview:

  1. Label curtailment events (i.e. define a decision boundary)

  2. Partition historical data into training and test sets

    1. Restrict to the spring season (Feb-May)

  3. Fit a statistical model to the training data

  4. Predict against the test data

  5. Evaluate the performance of the model against known labels in the test data.

[1]:
import pandas as pd
from dask import dataframe as dd
import statsmodels.formula.api as smf
import statsmodels.api as sm

from src.conf import settings

Early Spring

In our exploratory data analysis, we found evidence that seasonality strongly influences curtailment intensity, particularly in early spring. Here we examine only February-May curtailment and derive a model accordingly.

[2]:
# Read curtailment data, aggregate to dailies.
df = pd.concat(
    [
        pd.read_parquet(
            settings.DATA_DIR / f"processed/caiso/{y}.parquet"
        ) for y in range(2017, 2020)
    ]
)

df.columns = df.columns.str.lower().str.replace(" ", "_")
df.index = df.index.tz_convert("US/Pacific")
df = df.groupby(pd.Grouper(freq="D")).sum()
# Hourly resampling is a bad idea because hour-to-hour effects are autocorrelated
# df = df[columns].groupby(pd.Grouper(freq="H")).sum()
df.reset_index(inplace=True)
[3]:
df.dtypes
[3]:
timestamp                         datetime64[ns, US/Pacific]
load                                                 float64
solar                                                float64
wind                                                 float64
net_load                                             float64
renewables                                           float64
nuclear                                              float64
large_hydro                                          float64
imports                                              float64
generation                                           float64
thermal                                              float64
load_less_(generation+imports)                       float64
wind_curtailment                                     float64
solar_curtailment                                    float64
dtype: object
[4]:
df.head()
[4]:
timestamp load solar wind net_load renewables nuclear large_hydro imports generation thermal load_less_(generation+imports) wind_curtailment solar_curtailment
0 2017-01-01 00:00:00-08:00 6.451973e+06 502302.914737 656544.662054 5.293125e+06 1.665067e+06 652721.169749 653481.452160 2.200843e+06 4.251096e+06 1.279826e+06 33.747653 10107.108141 26760.716117
1 2017-01-02 00:00:00-08:00 6.838524e+06 203004.927518 443945.262752 6.191574e+06 1.156723e+06 653163.556261 705478.766030 2.372526e+06 4.465951e+06 1.950586e+06 47.689558 401.789387 43.705500
2 2017-01-03 00:00:00-08:00 7.391597e+06 256791.945920 279703.466170 6.855101e+06 1.053297e+06 654631.092257 793118.552495 1.885086e+06 5.506580e+06 3.005534e+06 -69.892993 0.000000 54.841500
3 2017-01-04 00:00:00-08:00 7.304337e+06 351737.160810 339561.662186 6.613038e+06 1.213044e+06 653906.204819 779211.177282 1.877239e+06 5.427051e+06 2.780890e+06 46.565469 0.000000 20.247000
4 2017-01-05 00:00:00-08:00 7.246891e+06 343183.228333 629678.639548 6.274029e+06 1.493144e+06 653460.137940 865964.297512 1.239591e+06 6.007303e+06 2.994735e+06 -3.179387 804.842781 191.525216
[5]:
# Subset curtailment data to the spring months (Feb=2 through May=5)
df = df[df["timestamp"].dt.month.isin(range(2, 6))]
[6]:
# Analysis Period of Dataset
df["timestamp"].describe()
[6]:
count                           360
unique                          360
top       2019-05-24 00:00:00-07:00
freq                              1
first     2017-02-01 00:00:00-08:00
last      2019-05-31 00:00:00-07:00
Name: timestamp, dtype: object
[7]:
# Label data - based on our EDA, we start by "guessing" an importance threshold of .1
# (a curtailment event is a day on which more than 10% of available solar is curtailed).
# Later methods will be less biased, and allow for more variance.
# TODO: Try to find natural clusterings through an unsupervised process to label the dataset, and try to predict those labels.
df["curtailment_event"] = pd.Categorical(df["solar_curtailment"] / df["solar"] > .1)

# NOTE: dt.weekday 5 and 6 are Saturday and Sunday, so despite its name this flag
# is True on weekends; the name is kept to match the model terms fitted below.
df["is_weekday"] = pd.Categorical(df["timestamp"].dt.weekday.isin([5, 6]))
[8]:
# Merge weather data, using GFS day-ahead forecasts from the 0000 cycle.
# The glob targets Feb-May files for 2017-2019 (it also matches month 12,
# but the inner join below drops those dates); i in 5..9 selects the
# 15-27 hour forecast steps ({i*3:03} -> 015 ... 027).
forecasts = [
    settings.DATA_DIR / f"interim/gfs/ca/gfs_3_201[7-9][01][2-5]*_0000_{i*3:03}.parquet"
    for i in range(5, 10)
]
dayahead_weather = dd.read_parquet(forecasts).compute()
[9]:
dayahead_weather["timestamp"] = dayahead_weather["valid_time"].dt.tz_localize("UTC").dt.tz_convert("US/Pacific")
[10]:
dayahead_weather.head()
[10]:
latitude longitude t gust tp dswrf uswrf SUNSD al sp ... MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT ALAND AWATER INTPTLAT INTPTLON timestamp
index
17517 42.0 237.0 269.899994 2.9 0.0 0.0 0.0 0.0 0.0 86861.039062 ... G4020 None None None A 16261974847 179108278 +41.5879861 -122.5332868 2017-02-01 07:00:00-08:00
17518 42.0 238.0 266.000000 1.9 0.0 0.0 0.0 0.0 0.0 86630.640625 ... G4020 None None None A 16261974847 179108278 +41.5879861 -122.5332868 2017-02-01 07:00:00-08:00
17876 41.0 236.0 278.600006 2.9 0.0 0.0 0.0 0.0 0.0 97531.437500 ... G4020 None 21700 None A 9240992572 1254297982 +40.7066731 -123.9258181 2017-02-01 07:00:00-08:00
17877 41.0 237.0 270.100006 2.7 0.0 0.0 0.0 0.0 0.0 81915.437500 ... G4020 None None None A 8234265201 73407949 +40.6477241 -123.1144043 2017-02-01 07:00:00-08:00
18237 40.0 237.0 272.000000 2.8 0.0 0.0 0.0 0.0 0.0 83081.843750 ... G4020 None None None A 8234265201 73407949 +40.6477241 -123.1144043 2017-02-01 07:00:00-08:00

5 rows × 37 columns

[11]:
dayahead_weather.iloc[0]
[11]:
latitude                              42
longitude                            237
t                                  269.9
gust                                 2.9
tp                                     0
dswrf                                  0
uswrf                                  0
SUNSD                                  0
al                                     0
sp                                 86861
csnow                                  0
cicep                                  0
cfrzr                                  0
crain                                  0
sde                                 0.34
surface                                0
time                 2017-02-01 00:00:00
valid_time           2017-02-01 15:00:00
index_right                           54
STATEFP                               06
COUNTYFP                             093
COUNTYNS                        00277311
GEOID                              06093
NAME                            Siskiyou
NAMELSAD                 Siskiyou County
LSAD                                  06
CLASSFP                               H1
MTFCC                              G4020
CSAFP                               None
CBSAFP                              None
METDIVFP                            None
FUNCSTAT                               A
ALAND                        16261974847
AWATER                         179108278
INTPTLAT                     +41.5879861
INTPTLON                    -122.5332868
timestamp      2017-02-01 07:00:00-08:00
Name: 17517, dtype: object
[12]:
dayahead_weather.dtypes
[12]:
latitude                          float64
longitude                         float64
t                                 float32
gust                              float32
tp                                float32
dswrf                             float32
uswrf                             float32
SUNSD                             float32
al                                float32
sp                                float32
csnow                             float32
cicep                             float32
cfrzr                             float32
crain                             float32
sde                               float32
surface                             int64
time                       datetime64[ns]
valid_time                 datetime64[ns]
index_right                         int64
STATEFP                            object
COUNTYFP                           object
COUNTYNS                           object
GEOID                              object
NAME                               object
NAMELSAD                           object
LSAD                               object
CLASSFP                            object
MTFCC                              object
CSAFP                              object
CBSAFP                             object
METDIVFP                           object
FUNCSTAT                           object
ALAND                               int64
AWATER                              int64
INTPTLAT                           object
INTPTLON                           object
timestamp      datetime64[ns, US/Pacific]
dtype: object
[13]:
# Take an unweighted average over all grid points.
# FIXME: We need to be more thoughtful about how to integrate this data.
# It should be weighted somehow, or perhaps certain locations should enter as their own IVs.
# Look into utility CZs.
dayahead_hourly = (
    dayahead_weather
    .groupby(pd.Grouper(key="timestamp", freq="H"))[["t", "dswrf", "uswrf", "gust", "SUNSD"]]
    .mean()
    .interpolate()
    .reset_index()
)
[14]:
# Daily aggregates: mean temperature and radiation, total sunshine duration
# (SUNSD is reported in seconds, so the daily sum is seconds of sunshine).
dayahead_daily = dayahead_weather.groupby(pd.Grouper(key="timestamp", freq="D")).agg(
    {"t": "mean", "dswrf": "mean", "uswrf": "mean", "SUNSD": "sum"}
)
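
One way to act on the FIXME above: weight the grid points instead of averaging them flatly. A sketch follows, using county land area (ALAND, from the shapefile join) as a stand-in weight; the weighting scheme and the dayahead_daily_weighted name are illustrative assumptions, not a validated choice.

[ ]:
# Sketch: area-weighted daily means instead of a flat average, weighting each
# grid point by the land area of the county it falls in (ALAND).
import numpy as np

def weighted_daily_mean(g):
    w = g["ALAND"].astype(float)
    return pd.Series({c: np.average(g[c], weights=w) for c in ["t", "dswrf", "uswrf"]})

dayahead_daily_weighted = (
    dayahead_weather
    .groupby(pd.Grouper(key="timestamp", freq="D"))
    .apply(weighted_daily_mean)
)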
[15]:
data = df.merge(dayahead_daily, on="timestamp", how="inner")
data
[15]:
timestamp load solar wind net_load renewables nuclear large_hydro imports generation thermal load_less_(generation+imports) wind_curtailment solar_curtailment curtailment_event is_weekday t dswrf uswrf SUNSD
0 2017-02-01 00:00:00-08:00 7.404797e+06 6.365488e+05 8.839707e+04 6.679851e+06 1.241231e+06 652218.031425 1.058698e+06 2.043459e+06 5.361323e+06 2.409176e+06 14.109963 64.670000 8480.663464 False False 282.630219 204.382217 46.266666 2130985.0
1 2017-02-02 00:00:00-08:00 7.367050e+06 5.002730e+05 2.195530e+05 6.647224e+06 1.235706e+06 651180.678103 1.077320e+06 1.941564e+06 5.425467e+06 2.461260e+06 19.334575 2691.564962 288.956667 False False 282.440430 155.364441 36.537777 1721099.0
2 2017-02-03 00:00:00-08:00 7.272859e+06 5.318559e+05 2.889443e+05 6.452059e+06 1.332687e+06 650538.682373 1.103827e+06 1.913683e+06 5.359143e+06 2.272091e+06 32.670400 488.852142 869.203799 False False 283.654236 162.995560 35.911110 1762346.0
3 2017-02-04 00:00:00-08:00 6.611103e+06 6.363877e+05 4.979092e+05 5.476806e+06 1.657910e+06 649935.253088 1.102862e+06 1.667016e+06 4.944096e+06 1.533388e+06 -9.417728 8260.820991 69858.576657 True True 283.166656 197.639999 42.088890 1947685.0
4 2017-02-05 00:00:00-08:00 6.661248e+06 4.002112e+05 2.128107e+05 6.048226e+06 1.123362e+06 649679.457890 1.093060e+06 2.046784e+06 4.614450e+06 1.748349e+06 12.990172 0.000000 25.348000 False True 283.199097 174.813339 35.973331 2019408.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
355 2019-05-27 00:00:00-07:00 5.672302e+06 7.720765e+05 1.113156e+06 3.787069e+06 2.453673e+06 656141.389061 9.970034e+05 1.169736e+06 4.502609e+06 3.957918e+05 -43.080195 18796.425323 444354.402287 True False 292.328644 457.533325 73.480003 3091204.0
356 2019-05-28 00:00:00-07:00 6.436942e+06 1.194693e+06 9.471656e+05 4.295084e+06 2.701523e+06 656950.514763 1.010069e+06 1.417568e+06 5.019443e+06 6.509010e+05 -69.383111 924.968950 63554.570744 False False 296.436951 507.760010 74.080002 3122509.0
357 2019-05-29 00:00:00-07:00 6.762152e+06 1.239861e+06 5.052720e+05 5.017019e+06 2.309672e+06 657307.503346 1.089729e+06 1.752092e+06 5.010103e+06 9.533939e+05 -42.688343 10658.002569 72766.329577 False False 299.156006 511.271118 73.702225 3112164.0
358 2019-05-30 00:00:00-07:00 6.865228e+06 1.295425e+06 5.069673e+05 5.062835e+06 2.359094e+06 657488.225968 1.192374e+06 1.771105e+06 5.094174e+06 8.852177e+05 -50.745769 13743.059075 7661.455448 False False 299.053741 506.617767 73.831108 3118246.0
359 2019-05-31 00:00:00-07:00 6.929919e+06 1.244073e+06 4.013874e+05 5.284459e+06 2.219221e+06 657313.148060 1.212114e+06 1.889459e+06 5.040392e+06 9.517434e+05 67.953527 909.440812 48430.745252 False False 298.764374 538.822205 76.894447 2290201.0

360 rows × 20 columns

Single Model Run

[16]:
test_data = data.sample(int(len(data) * .3))  # hold out 30% as a test set
training_data = data[~data.index.isin(test_data.index)]
model = "C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + t"
result = smf.glm(
    model,
    training_data,
    family=sm.families.Binomial()
).fit()
result.summary()
[16]:
Generalized Linear Model Regression Results
Dep. Variable: ['C(curtailment_event)[False]', 'C(curtailment_event)[True]'] No. Observations: 248
Model: GLM Df Residuals: 242
Model Family: Binomial Df Model: 5
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -72.320
Date: Thu, 30 Apr 2020 Deviance: 144.64
Time: 12:32:09 Pearson chi2: 191.
No. Iterations: 6
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -29.0311 17.196 -1.688 0.091 -62.735 4.673
C(timestamp.dt.month)[T.3] 0.0594 0.736 0.081 0.936 -1.384 1.502
C(timestamp.dt.month)[T.4] -1.3746 0.860 -1.599 0.110 -3.059 0.310
C(timestamp.dt.month)[T.5] -1.6381 1.027 -1.595 0.111 -3.652 0.375
C(is_weekday)[T.True] -2.4666 0.472 -5.223 0.000 -3.392 -1.541
t 0.1142 0.061 1.869 0.062 -0.006 0.234
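
Note that the coefficients are on the log-odds of the first response level (False, i.e. no curtailment event). The strongly negative C(is_weekday)[T.True] term, which flags Saturdays and Sundays, therefore implies that weekends are markedly more likely to see curtailment events, consistent with lower weekend load.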
[17]:
predictions = result.predict(test_data.drop(columns=["curtailment_event"]))
predictions.name = "probability"
# statsmodels models the first level of C(curtailment_event) (False), so
# predict() returns P(no event); flip to get P(event).
predictions = 1 - predictions
predictions = test_data.merge(predictions, left_index=True, right_index=True)

cutoff = .7
positive = predictions.query("probability > @cutoff")["curtailment_event"].value_counts()
negative = predictions.query("probability <= @cutoff")["curtailment_event"].value_counts()

true_positives = positive.loc[True]
false_positives = positive.loc[False]
true_negatives = negative.loc[False]
false_negatives = negative.loc[True]

accuracy = (true_positives + true_negatives) / len(predictions)
precision = true_positives / (true_positives + false_positives)
print(f"Accuracy: {accuracy}; Precision: {precision}")
Accuracy: 0.19444444444444445; Precision: 1.0
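
The manual value_counts() bookkeeping above is easy to mislabel; as a sketch, the same confusion matrix can be read off a single crosstab:

[ ]:
# Sketch: the full confusion matrix in one call, predicted vs. actual
predicted = predictions["probability"] > cutoff
pd.crosstab(predicted, predictions["curtailment_event"],
            rownames=["predicted"], colnames=["actual"])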
[18]:
predictions["curtailment_event"].astype(bool).sum()/len(predictions["curtailment_event"])
[18]:
0.06481481481481481
[19]:
# Mean predicted event probability across the test set
predictions["probability"].mean()
[19]:
0.10371318777291871

Multi-Model Run

TODO:

Other Notes:

[30]:
def simulate(model, data, cutoff=.8):
    # Fresh random 80/20 train/test split on every call
    test_data = data.sample(int(len(data) * .2))
    training_data = data[~data.index.isin(test_data.index)]
    result = smf.glm(
        model,
        training_data,
        family=sm.families.Binomial()
    ).fit()

    predictions = result.predict(test_data.drop(columns=["curtailment_event"]))
    predictions.name = "probability"
    # As above: statsmodels models P(no event), so flip to get P(event)
    predictions = 1 - predictions
    predictions = test_data.merge(predictions, left_index=True, right_index=True)

    positive_predictions = predictions.query("probability > @cutoff")["curtailment_event"].value_counts()
    negative_predictions = predictions.query("probability <= @cutoff")["curtailment_event"].value_counts()

    true_positives = positive_predictions.loc[True]
    false_positives = positive_predictions.loc[False]

    true_negatives = negative_predictions.loc[False]
    false_negatives = negative_predictions.loc[True]

    accuracy = (true_positives + true_negatives) / len(predictions)
    # Precision is nan (0/0) on runs where the model makes no positive predictions
    precision = true_positives / (true_positives + false_positives)
    return {"results": result, "accuracy": accuracy, "precision": precision}
[31]:
# Run this 100 times:
model1 = []
for i in range(100):
    model1.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + load + t + dswrf", data, cutoff=.7))
results1 = pd.DataFrame(model1)
print("Accuracy : {}".format(results1["accuracy"].mean()), "Precision : {}".format(results1["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.8873611111111112 Precision : 0.8489247311827957
[32]:
model1[1]["results"].summary()
[32]:
Generalized Linear Model Regression Results
Dep. Variable: ['C(curtailment_event)[False]', 'C(curtailment_event)[True]'] No. Observations: 283
Model: GLM Df Residuals: 275
Model Family: Binomial Df Model: 7
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -68.257
Date: Thu, 30 Apr 2020 Deviance: 136.51
Time: 13:25:21 Pearson chi2: 212.
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -57.6466 22.202 -2.596 0.009 -101.162 -14.132
C(timestamp.dt.month)[T.3] 1.1928 0.808 1.477 0.140 -0.390 2.776
C(timestamp.dt.month)[T.4] 1.0888 0.993 1.097 0.273 -0.857 3.035
C(timestamp.dt.month)[T.5] 0.9277 1.139 0.814 0.415 -1.305 3.160
C(is_weekday)[T.True] 1.2901 0.866 1.490 0.136 -0.407 2.987
load 4.799e-06 1.14e-06 4.213 0.000 2.57e-06 7.03e-06
t 0.1028 0.074 1.384 0.166 -0.043 0.248
dswrf -0.0069 0.003 -2.295 0.022 -0.013 -0.001
[33]:
# Run this 100 times:
model2 = []
for i in range(100):
    model2.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + load", data, cutoff=.7))
results2 = pd.DataFrame(model2)
print("Accuracy : {}".format(results2["accuracy"].mean()), "Precision : {}".format(results2["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.9038888888888887 Precision : 0.8506289308176099
[36]:
# Run this 100 times:
model3 = []
for i in range(100):
    model3.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + t * dswrf", data, cutoff=.8))
results3 = pd.DataFrame(model3)
print("Accuracy : {}".format(results3["accuracy"].mean()), "Precision : {}".format(results3["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.8783333333333332 Precision : nan
[25]:
# Run this 100 times:
model4 = []
for i in range(100):
    model4.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + t + dswrf", data, cutoff=.7))
results4 = pd.DataFrame(model4)
print("Accuracy : {}".format(results4["accuracy"].mean()), "Precision : {}".format(results4["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.8748611111111111 Precision : 0.15
[26]:
# Run this 100 times:
model5 = []
for i in range(100):
    model5.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + SUNSD", data, cutoff=.7))
results5 = pd.DataFrame(model5)
print("Accuracy : {}".format(results5["accuracy"].mean()), "Precision : {}".format(results5["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.9001388888888888 Precision : nan
[27]:
# Run this 100 times:
model_ = []
for i in range(100):
    model_.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + t", data, cutoff=.7))
results_ = pd.DataFrame(model_)
print("Accuracy : {}".format(results_["accuracy"].mean()), "Precision : {}".format(results_["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.8776388888888887 Precision : 0.0
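
For a side-by-side view of the candidate formulas, the per-model means can be collected into one frame (sketch; all models share the C(timestamp.dt.month) and C(is_weekday) terms, so the labels show only the extra regressors):

[ ]:
# Sketch: mean accuracy/precision for each candidate formula
summary = pd.DataFrame({
    "load + t + dswrf": results1[["accuracy", "precision"]].mean(),
    "load": results2[["accuracy", "precision"]].mean(),
    "t * dswrf": results3[["accuracy", "precision"]].mean(),
    "t + dswrf": results4[["accuracy", "precision"]].mean(),
    "SUNSD": results5[["accuracy", "precision"]].mean(),
    "t": results_[["accuracy", "precision"]].mean(),
}).T
summary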

Visualize Results

TODO:

[28]:
model_[0]
[28]:
{'results': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7efcb07d8910>,
 'accuracy': 0.9166666666666666,
 'precision': nan}
[29]:
import altair as alt

# Distribution of simulated accuracies across the 100 runs
alt.Chart(results_[["accuracy"]]).mark_bar().encode(
    alt.X("accuracy:Q", bin=True), y="count()"
)