Example: Predicting Daily Curtailment Events

Process Overview:

  1. Label curtailment events (i.e. define a decision boundary)

  2. Partition historical data into training and test sets

    1. Restrict to the spring season (Feb-May)

  3. Fit a statistical model to the training data

  4. Predict against the test data

  5. Evaluate the performance of the model against known labels in the test data.

[1]:
import pandas as pd
from dask import dataframe as dd
import statsmodels.formula.api as smf
import statsmodels.api as sm

from src.conf import settings

Early Spring

In our exploratory data analysis, we found evidence that seasonality strongly influences curtailment intensity, particularly in early spring. Here we examine only February-May curtailment and derive a model accordingly.

[2]:
# Read curtailment data, aggregate to dailies.
df = pd.concat(
    [
        pd.read_parquet(
            settings.DATA_DIR / f"processed/caiso/{y}.parquet"
        ) for y in range(2017, 2020)
    ]
)

df.columns = df.columns.str.lower().str.replace(" ", "_")
df.index = df.index.tz_convert("US/Pacific")
df = df.groupby(pd.Grouper(freq="D")).sum()
# Hourly resampling is a bad idea because hour-to-hour effects are autocorrelated
# df = df[columns].groupby(pd.Grouper(freq="H")).sum()
df.reset_index(inplace=True)
[3]:
df.dtypes
[3]:
timestamp                         datetime64[ns, US/Pacific]
load                                                 float64
solar                                                float64
wind                                                 float64
net_load                                             float64
renewables                                           float64
nuclear                                              float64
large_hydro                                          float64
imports                                              float64
generation                                           float64
thermal                                              float64
load_less_(generation+imports)                       float64
wind_curtailment                                     float64
solar_curtailment                                    float64
dtype: object
[4]:
df.head()
[4]:
timestamp load solar wind net_load renewables nuclear large_hydro imports generation thermal load_less_(generation+imports) wind_curtailment solar_curtailment
0 2017-01-01 00:00:00-08:00 6.451973e+06 502302.914737 656544.662054 5.293125e+06 1.665067e+06 652721.169749 653481.452160 2.200843e+06 4.251096e+06 1.279826e+06 33.747653 10107.108141 26760.716117
1 2017-01-02 00:00:00-08:00 6.838524e+06 203004.927518 443945.262752 6.191574e+06 1.156723e+06 653163.556261 705478.766030 2.372526e+06 4.465951e+06 1.950586e+06 47.689558 401.789387 43.705500
2 2017-01-03 00:00:00-08:00 7.391597e+06 256791.945920 279703.466170 6.855101e+06 1.053297e+06 654631.092257 793118.552495 1.885086e+06 5.506580e+06 3.005534e+06 -69.892993 0.000000 54.841500
3 2017-01-04 00:00:00-08:00 7.304337e+06 351737.160810 339561.662186 6.613038e+06 1.213044e+06 653906.204819 779211.177282 1.877239e+06 5.427051e+06 2.780890e+06 46.565469 0.000000 20.247000
4 2017-01-05 00:00:00-08:00 7.246891e+06 343183.228333 629678.639548 6.274029e+06 1.493144e+06 653460.137940 865964.297512 1.239591e+06 6.007303e+06 2.994735e+06 -3.179387 804.842781 191.525216
[5]:
# Subset curtailment data to the spring months (Feb=2 through May=5)
df = df[df["timestamp"].dt.month.isin(range(2, 6))]
[6]:
# Analysis Period of Dataset
df["timestamp"].describe()
[6]:
count                           360
unique                          360
top       2019-05-24 00:00:00-07:00
freq                              1
first     2017-02-01 00:00:00-08:00
last      2019-05-31 00:00:00-07:00
Name: timestamp, dtype: object
[7]:
# Label data - based on our EDA, we start by "guessing" an importance threshold of .1
# (a curtailment event is a day on which more than 10% of available solar is curtailed).
# Later methods will be less biased, and allow for more variance.
# TODO: Try to find natural clusterings through an unsupervised process to label the dataset, and try to predict those labels.
df["curtailment_event"] = pd.Categorical(df["solar_curtailment"] / df["solar"] > .1)

# NOTE: dt.weekday 5 and 6 are Saturday and Sunday, so despite its name this flag
# is True on weekends; the name is kept to match the model terms fitted below.
df["is_weekday"] = pd.Categorical(df["timestamp"].dt.weekday.isin([5, 6]))
[8]:
# Merge weather data, using GFS day-ahead forecasts from the 0000 cycle.
# The glob targets Feb-May files for 2017-2019 (it also matches month 12,
# but the inner join below drops those dates); i in 5..9 selects the
# 15-27 hour forecast steps ({i*3:03} -> 015 ... 027).
forecasts = [
    settings.DATA_DIR / f"interim/gfs/ca/gfs_3_201[7-9][01][2-5]*_0000_{i*3:03}.parquet"
    for i in range(5, 10)
]
dayahead_weather = dd.read_parquet(forecasts).compute()
[9]:
dayahead_weather["timestamp"] = dayahead_weather["valid_time"].dt.tz_localize("UTC").dt.tz_convert("US/Pacific")
[10]:
dayahead_weather.head()
[10]:
latitude longitude t gust tp dswrf uswrf SUNSD al sp ... MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT ALAND AWATER INTPTLAT INTPTLON timestamp
index
17517 42.0 237.0 269.899994 2.9 0.0 0.0 0.0 0.0 0.0 86861.039062 ... G4020 None None None A 16261974847 179108278 +41.5879861 -122.5332868 2017-02-01 07:00:00-08:00
17518 42.0 238.0 266.000000 1.9 0.0 0.0 0.0 0.0 0.0 86630.640625 ... G4020 None None None A 16261974847 179108278 +41.5879861 -122.5332868 2017-02-01 07:00:00-08:00
17876 41.0 236.0 278.600006 2.9 0.0 0.0 0.0 0.0 0.0 97531.437500 ... G4020 None 21700 None A 9240992572 1254297982 +40.7066731 -123.9258181 2017-02-01 07:00:00-08:00
17877 41.0 237.0 270.100006 2.7 0.0 0.0 0.0 0.0 0.0 81915.437500 ... G4020 None None None A 8234265201 73407949 +40.6477241 -123.1144043 2017-02-01 07:00:00-08:00
18237 40.0 237.0 272.000000 2.8 0.0 0.0 0.0 0.0 0.0 83081.843750 ... G4020 None None None A 8234265201 73407949 +40.6477241 -123.1144043 2017-02-01 07:00:00-08:00

5 rows × 37 columns

[11]:
dayahead_weather.iloc[0]
[11]:
latitude                              42
longitude                            237
t                                  269.9
gust                                 2.9
tp                                     0
dswrf                                  0
uswrf                                  0
SUNSD                                  0
al                                     0
sp                                 86861
csnow                                  0
cicep                                  0
cfrzr                                  0
crain                                  0
sde                                 0.34
surface                                0
time                 2017-02-01 00:00:00
valid_time           2017-02-01 15:00:00
index_right                           54
STATEFP                               06
COUNTYFP                             093
COUNTYNS                        00277311
GEOID                              06093
NAME                            Siskiyou
NAMELSAD                 Siskiyou County
LSAD                                  06
CLASSFP                               H1
MTFCC                              G4020
CSAFP                               None
CBSAFP                              None
METDIVFP                            None
FUNCSTAT                               A
ALAND                        16261974847
AWATER                         179108278
INTPTLAT                     +41.5879861
INTPTLON                    -122.5332868
timestamp      2017-02-01 07:00:00-08:00
Name: 17517, dtype: object
[12]:
dayahead_weather.dtypes
[12]:
latitude                          float64
longitude                         float64
t                                 float32
gust                              float32
tp                                float32
dswrf                             float32
uswrf                             float32
SUNSD                             float32
al                                float32
sp                                float32
csnow                             float32
cicep                             float32
cfrzr                             float32
crain                             float32
sde                               float32
surface                             int64
time                       datetime64[ns]
valid_time                 datetime64[ns]
index_right                         int64
STATEFP                            object
COUNTYFP                           object
COUNTYNS                           object
GEOID                              object
NAME                               object
NAMELSAD                           object
LSAD                               object
CLASSFP                            object
MTFCC                              object
CSAFP                              object
CBSAFP                             object
METDIVFP                           object
FUNCSTAT                           object
ALAND                               int64
AWATER                              int64
INTPTLAT                           object
INTPTLON                           object
timestamp      datetime64[ns, US/Pacific]
dtype: object
[13]:
# Take an unweighted average over all grid points.
# FIXME: We need to be more thoughtful about how to integrate this data.
# It should be weighted somehow, or perhaps certain locations should enter as their own IVs.
# Look into utility CZs.
dayahead_hourly = (
    dayahead_weather
    .groupby(pd.Grouper(key="timestamp", freq="H"))[["t", "dswrf", "uswrf", "gust", "SUNSD"]]
    .mean()
    .interpolate()
    .reset_index()
)
[14]:
# Daily aggregates: mean temperature and radiation, total sunshine duration
# (SUNSD is reported in seconds, so the daily sum is seconds of sunshine).
dayahead_daily = dayahead_weather.groupby(pd.Grouper(key="timestamp", freq="D")).agg(
    {"t": "mean", "dswrf": "mean", "uswrf": "mean", "SUNSD": "sum"}
)
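
One way to act on the FIXME above: weight the grid points instead of averaging them flatly. A sketch follows, using county land area (ALAND, from the shapefile join) as a stand-in weight; the weighting scheme and the dayahead_daily_weighted name are illustrative assumptions, not a validated choice.

[ ]:
# Sketch: area-weighted daily means instead of a flat average, weighting each
# grid point by the land area of the county it falls in (ALAND).
import numpy as np

def weighted_daily_mean(g):
    w = g["ALAND"].astype(float)
    return pd.Series({c: np.average(g[c], weights=w) for c in ["t", "dswrf", "uswrf"]})

dayahead_daily_weighted = (
    dayahead_weather
    .groupby(pd.Grouper(key="timestamp", freq="D"))
    .apply(weighted_daily_mean)
)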
[15]:
data = df.merge(dayahead_daily, on="timestamp", how="inner")
data
[15]:
timestamp load solar wind net_load renewables nuclear large_hydro imports generation thermal load_less_(generation+imports) wind_curtailment solar_curtailment curtailment_event is_weekday t dswrf uswrf SUNSD
0 2017-02-01 00:00:00-08:00 7.404797e+06 6.365488e+05 8.839707e+04 6.679851e+06 1.241231e+06 652218.031425 1.058698e+06 2.043459e+06 5.361323e+06 2.409176e+06 14.109963 64.670000 8480.663464 False False 282.630219 204.382217 46.266666 2130985.0
1 2017-02-02 00:00:00-08:00 7.367050e+06 5.002730e+05 2.195530e+05 6.647224e+06 1.235706e+06 651180.678103 1.077320e+06 1.941564e+06 5.425467e+06 2.461260e+06 19.334575 2691.564962 288.956667 False False 282.440430 155.364441 36.537777 1721099.0
2 2017-02-03 00:00:00-08:00 7.272859e+06 5.318559e+05 2.889443e+05 6.452059e+06 1.332687e+06 650538.682373 1.103827e+06 1.913683e+06 5.359143e+06 2.272091e+06 32.670400 488.852142 869.203799 False False 283.654236 162.995560 35.911110 1762346.0
3 2017-02-04 00:00:00-08:00 6.611103e+06 6.363877e+05 4.979092e+05 5.476806e+06 1.657910e+06 649935.253088 1.102862e+06 1.667016e+06 4.944096e+06 1.533388e+06 -9.417728 8260.820991 69858.576657 True True 283.166656 197.639999 42.088890 1947685.0
4 2017-02-05 00:00:00-08:00 6.661248e+06 4.002112e+05 2.128107e+05 6.048226e+06 1.123362e+06 649679.457890 1.093060e+06 2.046784e+06 4.614450e+06 1.748349e+06 12.990172 0.000000 25.348000 False True 283.199097 174.813339 35.973331 2019408.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
355 2019-05-27 00:00:00-07:00 5.672302e+06 7.720765e+05 1.113156e+06 3.787069e+06 2.453673e+06 656141.389061 9.970034e+05 1.169736e+06 4.502609e+06 3.957918e+05 -43.080195 18796.425323 444354.402287 True False 292.328644 457.533325 73.480003 3091204.0
356 2019-05-28 00:00:00-07:00 6.436942e+06 1.194693e+06 9.471656e+05 4.295084e+06 2.701523e+06 656950.514763 1.010069e+06 1.417568e+06 5.019443e+06 6.509010e+05 -69.383111 924.968950 63554.570744 False False 296.436951 507.760010 74.080002 3122509.0
357 2019-05-29 00:00:00-07:00 6.762152e+06 1.239861e+06 5.052720e+05 5.017019e+06 2.309672e+06 657307.503346 1.089729e+06 1.752092e+06 5.010103e+06 9.533939e+05 -42.688343 10658.002569 72766.329577 False False 299.156006 511.271118 73.702225 3112164.0
358 2019-05-30 00:00:00-07:00 6.865228e+06 1.295425e+06 5.069673e+05 5.062835e+06 2.359094e+06 657488.225968 1.192374e+06 1.771105e+06 5.094174e+06 8.852177e+05 -50.745769 13743.059075 7661.455448 False False 299.053741 506.617767 73.831108 3118246.0
359 2019-05-31 00:00:00-07:00 6.929919e+06 1.244073e+06 4.013874e+05 5.284459e+06 2.219221e+06 657313.148060 1.212114e+06 1.889459e+06 5.040392e+06 9.517434e+05 67.953527 909.440812 48430.745252 False False 298.764374 538.822205 76.894447 2290201.0

360 rows × 20 columns

Single Model Run

[16]:
test_data = data.sample(int(len(data) * .3))  # hold out 30% as a test set
training_data = data[~data.index.isin(test_data.index)]
model = "C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + t"
result = smf.glm(
    model,
    training_data,
    family=sm.families.Binomial()
).fit()
result.summary()
[16]:
Generalized Linear Model Regression Results
Dep. Variable: ['C(curtailment_event)[False]', 'C(curtailment_event)[True]'] No. Observations: 248
Model: GLM Df Residuals: 242
Model Family: Binomial Df Model: 5
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -72.320
Date: Thu, 30 Apr 2020 Deviance: 144.64
Time: 12:32:09 Pearson chi2: 191.
No. Iterations: 6
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -29.0311 17.196 -1.688 0.091 -62.735 4.673
C(timestamp.dt.month)[T.3] 0.0594 0.736 0.081 0.936 -1.384 1.502
C(timestamp.dt.month)[T.4] -1.3746 0.860 -1.599 0.110 -3.059 0.310
C(timestamp.dt.month)[T.5] -1.6381 1.027 -1.595 0.111 -3.652 0.375
C(is_weekday)[T.True] -2.4666 0.472 -5.223 0.000 -3.392 -1.541
t 0.1142 0.061 1.869 0.062 -0.006 0.234
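
Note that the coefficients are on the log-odds of the first response level (False, i.e. no curtailment event). The strongly negative C(is_weekday)[T.True] term, which flags Saturdays and Sundays, therefore implies that weekends are markedly more likely to see curtailment events, consistent with lower weekend load.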
[17]:
predictions = result.predict(test_data.drop(columns=["curtailment_event"]))
predictions.name = "probability"
# statsmodels models the first level of C(curtailment_event) (False), so
# predict() returns P(no event); flip to get P(event).
predictions = 1 - predictions
predictions = test_data.merge(predictions, left_index=True, right_index=True)

cutoff = .7
positive = predictions.query("probability > @cutoff")["curtailment_event"].value_counts()
negative = predictions.query("probability <= @cutoff")["curtailment_event"].value_counts()

true_positives = positive.loc[True]
false_positives = positive.loc[False]
true_negatives = negative.loc[False]
false_negatives = negative.loc[True]

accuracy = (true_positives + true_negatives) / len(predictions)
precision = true_positives / (true_positives + false_positives)
print(f"Accuracy: {accuracy}; Precision: {precision}")
Accuracy: 0.19444444444444445; Precision: 1.0
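
The manual value_counts() bookkeeping above is easy to mislabel; as a sketch, the same confusion matrix can be read off a single crosstab:

[ ]:
# Sketch: the full confusion matrix in one call, predicted vs. actual
predicted = predictions["probability"] > cutoff
pd.crosstab(predicted, predictions["curtailment_event"],
            rownames=["predicted"], colnames=["actual"])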
[18]:
predictions["curtailment_event"].astype(bool).sum()/len(predictions["curtailment_event"])
[18]:
0.06481481481481481
[19]:
# Mean predicted event probability across the test set
predictions["probability"].mean()
[19]:
0.10371318777291871

Multi-Model Run

TODO:

Other Notes:

[30]:
def simulate(model, data, cutoff=.8):
    # Fresh random 80/20 train/test split on every call
    test_data = data.sample(int(len(data) * .2))
    training_data = data[~data.index.isin(test_data.index)]
    result = smf.glm(
        model,
        training_data,
        family=sm.families.Binomial()
    ).fit()

    predictions = result.predict(test_data.drop(columns=["curtailment_event"]))
    predictions.name = "probability"
    # As above: statsmodels models P(no event), so flip to get P(event)
    predictions = 1 - predictions
    predictions = test_data.merge(predictions, left_index=True, right_index=True)

    positive_predictions = predictions.query("probability > @cutoff")["curtailment_event"].value_counts()
    negative_predictions = predictions.query("probability <= @cutoff")["curtailment_event"].value_counts()

    true_positives = positive_predictions.loc[True]
    false_positives = positive_predictions.loc[False]

    true_negatives = negative_predictions.loc[False]
    false_negatives = negative_predictions.loc[True]

    accuracy = (true_positives + true_negatives) / len(predictions)
    # Precision is nan (0/0) on runs where the model makes no positive predictions
    precision = true_positives / (true_positives + false_positives)
    return {"results": result, "accuracy": accuracy, "precision": precision}
[31]:
# Run this 100 times:
model1 = []
for i in range(100):
    model1.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + load + t + dswrf", data, cutoff=.7))
results1 = pd.DataFrame(model1)
print("Accuracy : {}".format(results1["accuracy"].mean()), "Precision : {}".format(results1["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.8873611111111112 Precision : 0.8489247311827957
[32]:
model1[1]["results"].summary()
[32]:
Generalized Linear Model Regression Results
Dep. Variable: ['C(curtailment_event)[False]', 'C(curtailment_event)[True]'] No. Observations: 283
Model: GLM Df Residuals: 275
Model Family: Binomial Df Model: 7
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -68.257
Date: Thu, 30 Apr 2020 Deviance: 136.51
Time: 13:25:21 Pearson chi2: 212.
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -57.6466 22.202 -2.596 0.009 -101.162 -14.132
C(timestamp.dt.month)[T.3] 1.1928 0.808 1.477 0.140 -0.390 2.776
C(timestamp.dt.month)[T.4] 1.0888 0.993 1.097 0.273 -0.857 3.035
C(timestamp.dt.month)[T.5] 0.9277 1.139 0.814 0.415 -1.305 3.160
C(is_weekday)[T.True] 1.2901 0.866 1.490 0.136 -0.407 2.987
load 4.799e-06 1.14e-06 4.213 0.000 2.57e-06 7.03e-06
t 0.1028 0.074 1.384 0.166 -0.043 0.248
dswrf -0.0069 0.003 -2.295 0.022 -0.013 -0.001
[33]:
# Run this 100 times:
model2 = []
for i in range(100):
    model2.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + load", data, cutoff=.7))
results2 = pd.DataFrame(model2)
print("Accuracy : {}".format(results2["accuracy"].mean()), "Precision : {}".format(results2["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.9038888888888887 Precision : 0.8506289308176099
[36]:
# Run this 100 times:
model3 = []
for i in range(100):
    model3.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + t * dswrf", data, cutoff=.8))
results3 = pd.DataFrame(model3)
print("Accuracy : {}".format(results3["accuracy"].mean()), "Precision : {}".format(results3["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.8783333333333332 Precision : nan
[25]:
# Run this 100 times:
model4 = []
for i in range(100):
    model4.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + t + dswrf", data, cutoff=.7))
results4 = pd.DataFrame(model4)
print("Accuracy : {}".format(results4["accuracy"].mean()), "Precision : {}".format(results4["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.8748611111111111 Precision : 0.15
[26]:
# Run this 100 times:
model5 = []
for i in range(100):
    model5.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + SUNSD", data, cutoff=.7))
results5 = pd.DataFrame(model5)
print("Accuracy : {}".format(results5["accuracy"].mean()), "Precision : {}".format(results5["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.9001388888888888 Precision : nan
[27]:
# Run this 100 times:
model_ = []
for i in range(100):
    model_.append(simulate("C(curtailment_event) ~ C(timestamp.dt.month) + C(is_weekday) + t", data, cutoff=.7))
results_ = pd.DataFrame(model_)
print("Accuracy : {}".format(results_["accuracy"].mean()), "Precision : {}".format(results_["precision"].mean()))
/home/ttu/.local/share/virtualenvs/CaReCur-b3qbtQ7S/lib/python3.7/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in long_scalars
Accuracy : 0.8776388888888887 Precision : 0.0
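
For a side-by-side view of the candidate formulas, the per-model means can be collected into one frame (sketch; all models share the C(timestamp.dt.month) and C(is_weekday) terms, so the labels show only the extra regressors):

[ ]:
# Sketch: mean accuracy/precision for each candidate formula
summary = pd.DataFrame({
    "load + t + dswrf": results1[["accuracy", "precision"]].mean(),
    "load": results2[["accuracy", "precision"]].mean(),
    "t * dswrf": results3[["accuracy", "precision"]].mean(),
    "t + dswrf": results4[["accuracy", "precision"]].mean(),
    "SUNSD": results5[["accuracy", "precision"]].mean(),
    "t": results_[["accuracy", "precision"]].mean(),
}).T
summary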

Visualize Results

TODO:

[28]:
model_[0]
[28]:
{'results': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7efcb07d8910>,
 'accuracy': 0.9166666666666666,
 'precision': nan}
[29]:
import altair as alt

# Distribution of simulated accuracies across the 100 runs
alt.Chart(results_[["accuracy"]]).mark_bar().encode(
    alt.X("accuracy:Q", bin=True), y="count()"
)