Note

This lecture is going to:

  • Demonstrate how to load complex time series data with pandas

  • Introduce interactive plotting with the plotly python library

  • Review MAE, RMSE, and parity plots for evaluating model fits

  • Use lmfit to try various non-linear fits, and motivate feature and model choices

  • Introduce and demonstrate train/val/test splits

Intro to Supervised Learning#

Let’s start with some common tasks in engineering or science that you might come across in your classes, internships, or future jobs:

  • insights into complex datasets

  • predict properties or behavior of a system

  • find patterns in data

  • learn how properties are related to outputs or behaviors

  • identify where we should add additional data or experiments

  • control a process

All of these are important tasks that you learn about in your undergrad ChE classes. However, we are limited in how well we can solve these problems by our understanding of the underlying science and by our ability to solve/model complex systems.

Let’s start with the most common type of machine learning technique in engineering - supervised learning!

Tip

If you have a physical model (based on mole/mass balances, kinetics, engineering principles, etc.) that describes your data sufficiently well and accomplishes your goals, stick with it!

Example dataset: northern hemisphere monthly temperature#

To demonstrate important concepts in supervised learning, let’s use the northern hemisphere land-surface average temperature dataset prepared by the Berkeley Earth non-profit organization. First, let’s grab the dataset using wget from http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/northern-hemisphere-TAVG-Trend.txt

Note

wget is a very useful linux/unix utility for downloading files from the internet! You can read more about wget by asking for its help documentation with

!wget --help
!wget http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/northern-hemisphere-TAVG-Trend.txt
--2022-10-13 15:13:39--  http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/northern-hemisphere-TAVG-Trend.txt
Resolving berkeleyearth.lbl.gov (berkeleyearth.lbl.gov)... 
128.3.29.26
Connecting to berkeleyearth.lbl.gov (berkeleyearth.lbl.gov)|128.3.29.26|:80... 
connected.
HTTP request sent, awaiting response... 
200 OK
Length: 216468 (211K) [text/plain]
Saving to: ‘northern-hemisphere-TAVG-Trend.txt’


northern-hemisphere 100%[===================>] 211.39K   567KB/s    in 0.4s    

2022-10-13 15:13:40 (567 KB/s) - ‘northern-hemisphere-TAVG-Trend.txt’ saved [216468/216468]

Now, we’re going to use pandas to load this dataset into our notebook and look at the data.

import pandas as pd

# Read in the dataset using pandas, and explicitly include the column names
# since pandas has trouble with the header lines in this file
df = pd.read_csv(
    "northern-hemisphere-TAVG-Trend.txt",
    delim_whitespace=True,
    skiprows=70,
    names=[
        "year",
        "month",
        "monthly_anomaly",
        "monthly_anomaly_unc",
        "annual_anomaly",
        "annual_anomaly_unc",
        "fiveyear_anomaly",
        "fiveyear_anomaly_unc",
        "tenyear_anomaly",
        "tenyear_anomaly_unc",
        "twentyyear_anomaly",
        "twentyyear_anomaly_unc",
    ],
).dropna(subset="monthly_anomaly")

# calculate the total temperature from the anomaly column and the monthly
# average temperatures listed in the header of the data file
df["temperature"] = (
    df["monthly_anomaly"]
    + (df["month"] == 1) * -2.06
    + (df["month"] == 2) * -0.04
    + (df["month"] == 3) * 4.57
    + (df["month"] == 4) * 10.35
    + (df["month"] == 5) * 15.74
    + (df["month"] == 6) * 19.67
    + (df["month"] == 7) * 21.32
    + (df["month"] == 8) * 20.21
    + (df["month"] == 9) * 16.63
    + (df["month"] == 10) * 10.99
    + (df["month"] == 11) * 4.52
    + (df["month"] == 12) * -0.38
)

# display the resulting dataframe
df
year month monthly_anomaly monthly_anomaly_unc annual_anomaly annual_anomaly_unc fiveyear_anomaly fiveyear_anomaly_unc tenyear_anomaly tenyear_anomaly_unc twentyyear_anomaly twentyyear_anomaly_unc temperature
0 1840 1 0.447 1.333 NaN NaN NaN NaN NaN NaN NaN NaN -1.613
1 1840 2 -0.821 1.100 NaN NaN NaN NaN NaN NaN NaN NaN -0.861
2 1840 3 -1.050 1.503 -0.722 0.520 NaN NaN NaN NaN NaN NaN 3.520
3 1840 4 -0.455 0.622 -0.792 0.483 NaN NaN NaN NaN NaN NaN 9.895
4 1840 5 -0.002 0.829 -0.794 0.471 NaN NaN NaN NaN NaN NaN 15.738
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2167 2020 8 1.208 0.147 NaN NaN NaN NaN NaN NaN NaN NaN 21.418
2168 2020 9 1.339 0.161 NaN NaN NaN NaN NaN NaN NaN NaN 17.969
2169 2020 10 1.228 0.147 NaN NaN NaN NaN NaN NaN NaN NaN 12.218
2170 2020 11 1.874 0.123 NaN NaN NaN NaN NaN NaN NaN NaN 6.394
2171 2020 12 1.440 0.244 NaN NaN NaN NaN NaN NaN NaN NaN 1.060

2171 rows × 13 columns
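As an aside, the same temperature column can be built more compactly by mapping each month to its mean temperature. This is just a sketch equivalent to the chain of comparisons above, using the same monthly means:

# Equivalent construction of the temperature column with a month -> mean mapping
monthly_means = {
    1: -2.06, 2: -0.04, 3: 4.57, 4: 10.35, 5: 15.74, 6: 19.67,
    7: 21.32, 8: 20.21, 9: 16.63, 10: 10.99, 11: 4.52, 12: -0.38,
}
df["temperature"] = df["monthly_anomaly"] + df["month"].map(monthly_means)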

For convenience, we’ll start by turning the year and month columns into a single fractional-year time variable.

# convert the years and months into a single time
time = df["year"].values + df["month"].values / 12

First, let’s plot the data as a function of the year to see what’s going on! We’ll use plotly to get an interactive graph!

See also

More tips and graph types for using plotly! https://plotly.com/python/basic-charts/

import plotly.graph_objects as go

# Make a plotly graph
fig = go.Figure()
fig.add_scatter(x=time, y=df["temperature"], name="Actual Data")

# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]")
fig.update_yaxes(title_text="Northern Hemisphere Average Temperature [C]")

# Add a horizontal line at y=0
fig.add_hline(y=0.0)

# Set the plot size
fig.update_layout(autosize=False, width=1000, height=600)
fig.show()

Baseline constant model#

The very simplest thing we can try is to fit a constant number to the data. Let’s try using the lmfit tool that we reviewed last class to fit this function.

\[\begin{align*} \hat{y}(X) = c \end{align*}\]

The constant model is a surprisingly helpful baseline - if we can’t do better than this with a model, we haven’t done anything at all!

Note

  • \(\hat{y}\) is the output (targets or labels) of the model we want to learn

  • \(X\) is a feature matrix. Each row represents one data point. The columns in \(X\) are called features, as they are numbers that together represent each data point. For this example, X is an Nx1 matrix with a single column - the time we calculated above.

The data that we use to fit the model is called training data, so we’ll refer to these as X_train and y_train.

from lmfit.models import ConstantModel

X_train = time
y_train = df["temperature"].values

# Make a constant model
model = ConstantModel(independent_vars=["x"])
params = model.make_params(c=1)
result = model.fit(y_train, params, x=X_train)

# The constant model just predicts a number, so we do this to make sure the prediction is the same length for all points
y_pred = [result.eval(x=X_train)] * len(y_train)

# Show the result
result

Model

Model(constant)

Fit Statistics

fitting method: leastsq
# function evals: 4
# data points: 2171
# variables: 1
chi-square: 150571.160
reduced chi-square: 69.3876313
Akaike info crit.: 9205.40720
Bayesian info crit.: 9211.09014

Variables

name  value       standard error  relative error  initial value  min   max  vary
c     10.1153505  0.17877678      (1.77%)         1              -inf  inf  True
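Before moving on, one quick sanity check (not part of the fit report above): for a least-squares loss, the optimal constant is just the mean of the training targets, so it should match the fitted value of c.

import numpy as np

# The least-squares optimal constant is the mean of the targets;
# this should reproduce the fitted c above (about 10.115)
print(np.mean(y_train))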

Evaluating fits (MAE, RMSE, parity plots)#

We’ve completed our initial fit, so we now need to evaluate how well we’ve done. Let’s calculate the mean absolute error (MAE) and the root mean squared error (RMSE) of the fit. Remember, these are defined as:

\[\begin{align*} \text{Mean Absolute Error (MAE)} &= \frac{1}{N}\sum_i^N |y_i-\hat{y}_i|\\ \text{Root Mean Squared Error (RMSE)} &= \left[\frac{1}{N}\sum_i^N (y_i-\hat{y}_i)^2\right]^{1/2} \end{align*}\]

where

  • \(y_i\) is the actual (training) data (e.g. the monthly temperature)

  • \(\hat{y}_i\) is the prediction from the model

To evaluate these, we’re going to use the very helpful scikit-learn package (which we’ll be using a lot!)

Tip

  • The MAE indicates the average residual error and tends to focus on the majority of the points.

  • The RMSE weights larger residuals more heavily and so is more sensitive to outliers or large residuals. The RMSE is always at least as large as the MAE.

from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_train, y_pred)
rmse = mean_squared_error(y_train, y_pred) ** 0.5
print(f"The MAE for the constant fit is {mae:0.2f} C")
print(f"The RMSE for the constant fit is {rmse:0.2f} C")
The MAE for the constant fit is 7.41 C
The RMSE for the constant fit is 8.33 C
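If you want to see exactly what these metrics are doing, they’re easy to compute by hand with numpy. This sketch should reproduce the scikit-learn numbers above:

import numpy as np

# Compute the same metrics directly from the residuals
residuals = y_train - np.asarray(y_pred)
print(np.mean(np.abs(residuals)))  # MAE: the average absolute residual
print(np.sqrt(np.mean(residuals**2)))  # RMSE: weights large residuals more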

We should also plot the fit to see how it looks visually. In addition to plotting the output, we’re also going to use a parity plot.

from plotly.subplots import make_subplots

# Make a plotly graph for the temperature data
fig = make_subplots(rows=1, cols=2)
fig.add_scatter(x=X_train, y=y_train, name="Actual Data", row=1, col=1)

# Add a line for the constant fit results
fig.add_scatter(
    x=X_train,
    y=y_pred,
    name=f"Constant Fit, MAE={mae:,.2f}, RMSE={rmse:0.02f}",
    row=1,
    col=1,
)

# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]", row=1, col=1)
fig.update_yaxes(title_text="Northern Hemisphere Average Temperature [C]", row=1, col=1)

# Add a horizontal line at y=0
fig.add_hline(y=0.0, row=1, col=1)

# Make the parity plot!
fig.add_scatter(
    x=y_train,
    y=y_pred,
    mode="markers",
    name="Constant Fit",
    row=1,
    col=2,
)

# Update the x/y axis labels
fig.update_xaxes(
    title_text="Actual Temperature [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)
fig.update_yaxes(
    title_text="Predicted Temperature [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)

fig.add_shape(
    type="line",
    x0=df["temperature"].min(),
    y0=df["temperature"].min(),
    x1=df["temperature"].max(),
    y1=df["temperature"].max(),
    row=1,
    col=2,
)
# Set the plot size
fig.update_layout(autosize=False, width=1200, height=600)
fig.show()

Note the shape of the parity plot - the model predicts the same number for every point, so this shows up as a horizontal line. This is not a very good fit!

Tip

Parity plots like the one just above (a horizontal line) often indicate that the input data has no relationship to the output data, so the best the model can do is predict the average. This is a surprisingly common issue in machine learning. One common cause is that the data inputs were shuffled or re-ordered compared to the targets (labels), so there really is no correlation in the data.
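As a quick (assumed) demonstration of this failure mode: if we shuffle the targets so they no longer correspond to their time points, even a flexible polynomial fit collapses toward predicting the average.

import numpy as np

# Shuffle the targets relative to the inputs to destroy any real correlation
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(df["temperature"].values)

# Fit a 4th-order polynomial (centering time for numerical stability)
t_centered = time - time.mean()
coeffs = np.polyfit(t_centered, y_shuffled, 4)
y_fit = np.polyval(coeffs, t_centered)

# The fitted curve varies far less than the data: it is essentially the mean
print(np.std(y_fit), np.std(y_shuffled))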

Polynomial regression#

Let’s make this one step more complex by fitting a 4th order polynomial instead. If you remember from 06-262, polynomial models are linear in the fitted parameters since they’re of the form \(f(x)=c_0+c_1x+c_2x^2+\cdots\), so we expect this minimization to be well-behaved. We could even fit this model with ordinary linear regression, as the sketch below shows.
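Since the model is linear in the coefficients, a plain linear least-squares solve works too. Here is a minimal sketch with numpy.polyfit, centering the time to keep the problem well-conditioned (non-centered powers of the year are nearly collinear, which is part of why the parameter correlations in the lmfit report below are so large):

import numpy as np

# Fit the same 4th-order polynomial by ordinary linear least squares
t_centered = time - time.mean()
coeffs = np.polyfit(t_centered, df["temperature"].values, 4)
y_poly = np.polyval(coeffs, t_centered)
print(coeffs)  # highest-degree coefficient first

Now let’s do the same fit with lmfit so we get the full fit report: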

from lmfit.models import PolynomialModel

X_train = time
y_train = df["temperature"].values

model = PolynomialModel(degree=4, independent_vars=["x"])
params = model.make_params(c0=1, c1=1, c2=1, c3=1, c4=1)
result = model.fit(y_train, params, x=X_train)
y_pred = result.eval(x=X_train)
result

Model

Model(polynomial)

Fit Statistics

fitting method: leastsq
# function evals: 32
# data points: 2171
# variables: 5
chi-square: 149969.299
reduced chi-square: 69.2379035
Akaike info crit.: 9204.71192
Bayesian info crit.: 9233.12664

Variables

name  value        standard error  relative error  initial value  min   max  vary
c0    149918.342   487410.820      (325.12%)       1              -inf  inf  True
c1    -315.214900  1010.72067      (320.64%)       1              -inf  inf  True
c2    0.24848163   0.78573653      (316.22%)       1              -inf  inf  True
c3    -8.7033e-05  2.7141e-04      (311.84%)       1              -inf  inf  True
c4    1.1429e-08   3.5146e-08      (307.52%)       1              -inf  inf  True

Correlations (unreported correlations are < 0.100)

c3, c4: -1.0000
c2, c3: -1.0000
c1, c2: -1.0000
c0, c1: -1.0000
c2, c4: 0.9999
c1, c3: 0.9999
c0, c2: 0.9999
c1, c4: -0.9998
c0, c3: -0.9998
c0, c4: 0.9997

To evaluate the fit, we’ll use the same methods (MAE, RMSE, parity plot) as above.

mae = mean_absolute_error(y_train, y_pred)
rmse = mean_squared_error(y_train, y_pred) ** 0.5
print(f"The MAE for the polynomial fit is {mae:0.2f} C")
print(f"The RMSE for the polynomial fit is {rmse:0.2f} C")
The MAE for the polynomial fit is 7.39 C
The RMSE for the polynomial fit is 8.31 C
# Make a plotly graph for the temperature data
fig = make_subplots(rows=1, cols=2)
fig.add_scatter(x=time, y=df["temperature"], name="Actual Data", row=1, col=1)

# Add a line for the polynomial fit results
fig.add_scatter(
    x=X_train,
    y=y_pred,
    name=f"Polynomial Fit, MAE={mae:,.2f}, RMSE={rmse:0.02f}",
    row=1,
    col=1,
)

# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]", row=1, col=1)
fig.update_yaxes(title_text="Northern Hemisphere Average Temperature [C]", row=1, col=1)

# Add a horizontal line at y=0
fig.add_hline(y=0.0, row=1, col=1)

# Make the parity plot!
fig.add_scatter(
    x=y_train,
    y=y_pred,
    mode="markers",
    name="Polynomial Fit",
    row=1,
    col=2,
)

# Update the x/y axis labels
fig.update_xaxes(
    title_text="Actual Temperature Anomaly [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)
fig.update_yaxes(
    title_text="Predicted Temperature [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)

fig.add_shape(
    type="line",
    x0=df["temperature"].min(),
    y0=df["temperature"].min(),
    x1=df["temperature"].max(),
    y1=df["temperature"].max(),
    row=1,
    col=2,
)
# Set the plot size
fig.update_layout(autosize=False, width=1200, height=600)
fig.show()

We can see a bit of a trend with the years now, which is interesting! The parity plot has a bit more structure. The MAE/RMSE have decreased very slightly.

Improving the fit by changing the model, or adding features#

Using a fourth-order polynomial to predict the temperature got us a fit with an RMSE of about 8.3 C. We probably want to do better than that. We have two main options:

  1. Try a different model (more/less polynomial degrees? completely different type of model? etc.)

  2. Add more inputs besides just the time

There’s not one right or wrong answer here. Both are feasible, and both will take some creativity on our part. Either we need some insight into why another model might be better suited to the dataset, or some insight into why the absolute time alone is not the best feature.

Adding month as a feature (Month/Year Polynomial)#

To start, let’s add the month as a feature. Essentially, we’re changing our feature matrix \(X\) from a single column of the time, to two columns:

  • X[:,0] = year

  • X[:,1] = month

To keep things simple, let’s make it a similar polynomial!

import numpy as np
from lmfit import Model

X_train = np.vstack((df["year"].values, df["month"].values)).T
y_train = df["temperature"].values


def polynomial_year_month(year, month, c0, c1, c2, c3, d1, d2, d3, d4):
    return (
        c0
        + year * c1
        + year**2 * c2
        + year**3 * c3
        + d1 * month
        + d2 * month**2
        + d3 * month**3
        + d4 * month**4
    )


model = Model(
    polynomial_year_month,
    independent_vars=["year", "month"],
    param_names=["c0", "c1", "c2", "c3", "d1", "d2", "d3", "d4"],
)
params = model.make_params(c0=1, c1=1, c2=1, c3=1, d1=1, d2=1, d3=1, d4=1)

result = model.fit(y_train, params, year=X_train[:, 0], month=X_train[:, 1])
y_pred = result.eval(year=X_train[:, 0], month=X_train[:, 1])
mae = mean_absolute_error(y_train, y_pred)
rmse = mean_squared_error(y_train, y_pred) ** 0.5
print(f"The MAE for the constant fit is {mae:0.2f} C")
print(f"The RMSE for the constant fit is {rmse:0.2f} C")
The MAE for the constant fit is 0.37 C
The RMSE for the constant fit is 0.49 C
from plotly.subplots import make_subplots

# Make a plotly graph for the temperature data
fig = make_subplots(rows=1, cols=2)
fig.add_scatter(x=time, y=df["temperature"], name="Actual Data", row=1, col=1)

# Add a line for the polynomial fit results
fig.add_scatter(
    x=time,
    y=y_pred,
    name=f"Polynomial Fit (Month/Time), MAE={mae:,.2f}, RMSE={rmse:0.02f}",
    row=1,
    col=1,
)

# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]", row=1, col=1)
fig.update_yaxes(title_text="Northern Hemisphere Average Temperature [C]", row=1, col=1)

# Add a horizontal line at y=0
fig.add_hline(y=0.0, row=1, col=1)

# Make the parity plot!
fig.add_scatter(
    x=df["temperature"],
    y=y_pred,
    mode="markers",
    name="Polynomial (year/month) parity",
    row=1,
    col=2,
)

# Update the x/y axis labels
fig.update_xaxes(
    title_text="Actual Temperature Anomaly [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)
fig.update_yaxes(
    title_text="Predicted Temperature [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)

fig.add_shape(
    type="line",
    x0=df["temperature"].min(),
    y0=df["temperature"].min(),
    x1=df["temperature"].max(),
    y1=df["temperature"].max(),
    row=1,
    col=2,
)
# Set the plot size
fig.update_layout(autosize=False, width=1200, height=600)
fig.show()

Now we’re getting somewhere! The MAE/RMSE have fallen to ~0.37/0.49 C respectively. The parity plot also looks much nicer now, with results scattering around the parity line.

Practice 1#

What’s the best polynomial order to use for this fit for the month term?
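One way to start exploring this (a sketch under the same setup as above; the variable names are illustrative): build the design matrix by hand for several month-polynomial orders and compare the training RMSE. Keep in mind that training RMSE alone keeps improving as you add terms; the train/validation splits introduced later in this lecture are the better way to choose the order.

import numpy as np

year_c = df["year"].values - df["year"].values.mean()  # centered for conditioning
month = df["month"].values
y = df["temperature"].values

for order in range(1, 9):
    # constant + cubic in year, plus an `order`-degree polynomial in month
    columns = [year_c**p for p in range(4)] + [month**p for p in range(1, order + 1)]
    A = np.vstack(columns).T
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    rmse = np.sqrt(np.mean((A @ coeffs - y) ** 2))
    print(f"month polynomial order {order}: training RMSE = {rmse:0.3f} C")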

Change the model (Polynomial+Sin)#

It’s not a huge surprise that there is some periodicity in the data. A reasonable guess for a better functional form includes a periodic (sine) component. Let’s try:

\[\begin{align*} f(t) = c_0+c_1t+c_2t^2+c_3t^3+c_4t^4+A\sin (\omega t+\phi) \end{align*}\]

Note that we aren’t explicitly using the month info here.

import numpy as np
from lmfit.models import PolynomialModel, SineModel

X_train = time
y_train = df["temperature"].values

poly_model = PolynomialModel(degree=4, independent_vars=["x"])
sine_model = SineModel(independent_vars=["x"])
model = sine_model + poly_model
params = model.make_params(
    c0=1, c1=1, c2=1, c3=1, c4=1, frequency=2 * np.pi, amplitude=10, shift=0 * np.pi
)
result = model.fit(y_train, params, x=X_train)
y_pred = result.eval(x=X_train)
mae = mean_absolute_error(df["temperature"], y_pred)
rmse = mean_squared_error(df["temperature"], y_pred) ** 0.5
print(f"The MAE for the polynomial+sin fit is {mae:0.2f} C")
print(f"The RMSE for the polynomial+sin fit is {rmse:0.2f} C")
The MAE for the polynomial+sin fit is 0.63 C
The RMSE for the polynomial+sin fit is 0.79 C
from plotly.subplots import make_subplots

# Make a plotly graph for the temperature data
fig = make_subplots(rows=1, cols=2)
fig.add_scatter(x=time, y=df["temperature"], name="Actual Data", row=1, col=1)

# Add a line for the polynomial fit results
fig.add_scatter(
    x=X_train,
    y=y_pred,
    name=f"f(t)+sin(t) Fit (Time), MAE={mae:,.2f}, RMSE={rmse:0.02f}",
    row=1,
    col=1,
)

# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]", row=1, col=1)
fig.update_yaxes(title_text="Northern Hemisphere Average Temperature [C]", row=1, col=1)

# Add a horizontal line at y=0
fig.add_hline(y=0.0, row=1, col=1)

# Make the parity plot!
fig.add_scatter(
    x=y_train,
    y=y_pred,
    mode="markers",
    name="Polynomial+Sin Parity",
    row=1,
    col=2,
)

# Update the x/y axis labels
fig.update_xaxes(
    title_text="Actual Temperature Anomaly [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)
fig.update_yaxes(
    title_text="Predicted Temperature [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)

fig.add_shape(
    type="line",
    x0=df["temperature"].min(),
    y0=df["temperature"].min(),
    x1=df["temperature"].max(),
    y1=df["temperature"].max(),
    row=1,
    col=2,
)

# Set the plot size
fig.update_layout(autosize=False, width=1200, height=600)
fig.show()

This fit is also pretty good, but not quite as good as our month/year polynomial from above.

Warning

This fit is quite sensitive to the initial conditions! Try other sets of parameter guesses and see how the answers change!
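Here is a minimal sketch of that experiment (reusing model, X_train, and y_train from the cell above): refit from a grid of frequency/shift guesses and compare the resulting RMSE.

import numpy as np

# Sweep a few initial guesses for the sine component and compare fit quality
for freq_guess in [np.pi, 2 * np.pi, 4 * np.pi]:
    for shift_guess in [0, np.pi / 2, np.pi]:
        params = model.make_params(
            c0=1, c1=1, c2=1, c3=1, c4=1,
            amplitude=10, frequency=freq_guess, shift=shift_guess,
        )
        trial = model.fit(y_train, params, x=X_train)
        rmse = np.sqrt(np.mean((trial.eval(x=X_train) - y_train) ** 2))
        print(f"freq={freq_guess:5.2f}, shift={shift_guess:4.2f}: RMSE={rmse:0.2f} C")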

Pitfalls with fitting models using all of the available data#

So far, we’ve evaluated modeling techniques by fitting the model using all of the available data. Based on the metrics we calculated, it seems like we should be able to predict the temperature for any month/year within about 0.5 degrees C.

However, there are a few major issues:

  • We have no idea how the model will work for predicting things it hasn’t seen before

  • We have no idea if (or how well) the model will extrapolate

  • We can perfectly fit most data by just throwing more polynomial features and fitted parameters at the problem

Model extrapolation#

As a demonstration, let’s check how our last model from above works for predicting temperatures in the future, or before the first available data (1840).

import numpy as np
from lmfit.models import PolynomialModel, SineModel

# Make a plotly graph for the temperature data
fig = go.Figure()
fig.add_scatter(x=time, y=df["temperature"], name="Actual Data")

# Add a line for the polynomial fit results
fig.add_scatter(
    x=np.linspace(1700, 2100, 12 * 400),
    y=result.eval(x=np.linspace(1700, 2100, 12 * 400)),
    name=f"Polynomial Fit (Month/Time), MAE={mae:,.2f}, RMSE={rmse:0.02f}",
)

# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]")
fig.update_yaxes(title_text="Northern Hemisphere Average Temperature [C]")

# Add a horizontal line at y=0
fig.add_hline(y=0.0)

# Set the plot size
fig.update_layout(autosize=False, width=1200, height=600)
fig.show()

Notice that although the model fits reasonably well within the range of the dataset, if we ask it to make predictions in the future, or for data points before the start of the data, the predictions are far too high. The temperature in 1700 was definitely not 20 degrees C higher than in the 1900s, and global warming is expected to be significantly less than the ~15 C predicted here.

We want models that can predict things that we haven’t seen before, so we need to change our way of evaluating the models.

Train/validation/test splits#

We want our models to be predictive! That is, they should work for data they haven’t seen before and be useful in practice. To accomplish this, we use train/validation/test data splits to estimate how our fitted models will perform in scenarios that mimic predicting unseen data.

Note

  • Train: The data that we actually show our models during the fitting process

  • Validation: Data that we predict on with our trained models and use to decide which model to use, or how to adjust various parameters in the fitting procedure

  • Test: Data that we set aside and look at as infrequently as possible. This should represent some data that the model has never seen before and is indicative of the real world task

Random train/val/test split#

The most common approach in supervised learning is a random train/validation/test split, in ratios of something like 80%/10%/10%. That is, we fit a bunch of models to the same 80% of the data. We pick the best one by asking which works best on the validation 10%. And then we try that model on the test 10% to see if it works in practice! (In the code below, we first hold out 10% for test, then take 1/9 of the remaining 90% for validation, which works out to the same 80/10/10 split.)

Let’s try this with our sample dataset. To do so, we’ll use more scikit-learn tools.

See also

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split

# Make one big 2d array that holds all of the data we might want to use
X = np.vstack((time, df["year"], df["month"])).T

# split the data into train/val/test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, df["temperature"], test_size=0.1, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=1 / 9, random_state=42
)

# Check the train/val/test shapes
print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)

print(X_val.shape)
print(y_val.shape)
(1736, 3)
(1736,)
(218, 3)
(218,)
(217, 3)
(217,)

Now that we have our train/validation/test split, let’s see how this works in practice!

First, we fit the model using the training data using our competing models and predict on the validation set to pick the best model.

import numpy as np
from lmfit.models import PolynomialModel, SineModel

poly_model = PolynomialModel(degree=4, independent_vars=["x"])
sine_model = SineModel(independent_vars=["x"])
model = sine_model + poly_model
params = model.make_params(
    c0=1, c1=1, c2=1, c3=1, c4=1, frequency=2 * np.pi, amplitude=10, shift=-2 * np.pi
)
result = model.fit(y_train, params, x=X_train[:, 0])
y_val_pred = result.eval(x=X_val[:, 0])

mae = mean_absolute_error(y_val, y_val_pred)
rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
print(f"The MAE for the polynomial+sin model fit is {mae:0.2f} C")
print(f"The RMSE for the  polynomial+sin model fit is {rmse:0.2f} C")
The MAE for the polynomial+sin model fit is 0.52 C
The RMSE for the  polynomial+sin model fit is 0.66 C
import numpy as np
from lmfit import Model


def polynomial_year_month(year, month, c0, c1, c2, c3, d1, d2, d3, d4):
    return (
        c0
        + year * c1
        + year**2 * c2
        + year**3 * c3
        + d1 * month
        + d2 * month**2
        + d3 * month**3
        + d4 * month**4
    )


gmodel = Model(
    polynomial_year_month,
    independent_vars=["year", "month"],
    param_names=["c0", "c1", "c2", "c3", "d1", "d2", "d3", "d4"],
)
params = gmodel.make_params(c0=1, c1=1, c2=1, c3=1, d1=1, d2=1, d3=1, d4=1)

monthyear_result = gmodel.fit(y_train, params, year=X_train[:, 1], month=X_train[:, 2])
y_val_pred = monthyear_result.eval(year=X_val[:, 1], month=X_val[:, 2])

mae = mean_absolute_error(y_val, y_val_pred)
rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
print(f"The MAE for the month/year polynomial model fit is {mae:0.2f} C")
print(f"The RMSE for the  month/year polynomial model fit is {rmse:0.2f} C")
The MAE for the month/year polynomial model fit is 0.39 C
The RMSE for the  month/year polynomial model fit is 0.52 C
# Make a plotly graph for the temperature data
fig = make_subplots(rows=1, cols=2)
fig.add_scatter(
    x=X_train[:, 0], y=y_train, name="Train Data", row=1, col=1, mode="markers"
)
fig.add_scatter(
    x=X_val[:, 0], y=y_val, name="Validation Data", row=1, col=1, mode="markers"
)

# Add a line for the polynomial fit results
fig.add_scatter(
    x=X_val[:, 0],
    y=result.eval(x=X_val[:, 0]),
    name=f"Polynomial+Sin Fit",
    mode="markers",
    row=1,
    col=1,
)

fig.add_scatter(
    x=X_val[:, 0],
    y=monthyear_result.eval(year=X_val[:, 1], month=X_val[:, 2]),
    name=f"Month/Year Polynomial Fit",
    mode="markers",
    row=1,
    col=1,
)

# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]", row=1, col=1)
fig.update_yaxes(title_text="Northern Hemisphere Average Temperature [C]", row=1, col=1)

# Add a horizontal line at y=0
fig.add_hline(y=0.0, row=1, col=1)

# Make the parity plot!
fig.add_scatter(
    x=y_val,
    y=result.eval(x=X_val[:, 0]),
    mode="markers",
    name="Polynomial+Sin Fit",
    row=1,
    col=2,
)

# Make the parity plot!
fig.add_scatter(
    x=y_val,
    y=monthyear_result.eval(year=X_val[:, 1], month=X_val[:, 2]),
    mode="markers",
    name="Month/Year Polynomial Fit",
    row=1,
    col=2,
)

# Update the x/y axis labels
fig.update_xaxes(
    title_text="Actual Temperature Anomaly [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)
fig.update_yaxes(
    title_text="Predicted Temperature [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)

fig.add_shape(
    type="line",
    x0=df["temperature"].min(),
    y0=df["temperature"].min(),
    x1=df["temperature"].max(),
    y1=df["temperature"].max(),
    row=1,
    col=2,
)
# Set the plot size
fig.update_layout(autosize=False, width=1200, height=600)
fig.show()

Surprisingly, the polynomial+sin fit does even better on the validation data than it did when fit to all of the data earlier! This probably means that we could have done better with other initial parameter guesses in the optimization above.

Time series train/val/test splits#

We are getting pretty close to a standard machine learning pipeline. However, we need to think a bit about the task we’ve indicated with the random train/val/test split. We chose the validation and test randomly from the entire time series, so the task is basically whether we can fill in missing time points.

A more useful model would be one that could predict the average temperature in the future. Let’s try repeating the train/val/test split, but without shuffling, so the splits stay ordered in time. This represents the following task:

  • Models are fit on data from 1840-1985

  • The models are asked to predict the validation data from 1985-2003, and the best model is chosen based on its ability to predict 20 years into the future

  • The best model is fit on the train and validation data and used to predict the test data from 2003-2022

from sklearn.model_selection import train_test_split

# Make one big 2d array that holds all of the data we might want to use
X = np.vstack((time, df["year"], df["month"])).T

# split the data into train/val/test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, df["temperature"], test_size=0.1, random_state=42, shuffle=False
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=1 / 9, random_state=42, shuffle=False
)

# Check the train/val/test shapes
print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)

print(X_val.shape)
print(y_val.shape)
(1736, 3)
(1736,)
(218, 3)
(218,)
(217, 3)
(217,)
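Since we split without shuffling, the three blocks should now be in chronological order. A quick check of the boundary times (the first column of X is the fractional-year time):

# The end of train, start of validation, and start of test should increase in time
# (roughly 1985 and 2003 for this dataset, matching the task description above)
print(X_train[-1, 0], X_val[0, 0], X_test[0, 0])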
import numpy as np
from lmfit.models import PolynomialModel, SineModel

poly_model = PolynomialModel(degree=4, independent_vars=["x"])
sine_model = SineModel(independent_vars=["x"])
model = sine_model + poly_model
params = model.make_params(
    c0=1, c1=1, c2=1, c3=1, c4=1, frequency=2 * np.pi, amplitude=10, shift=-2 * np.pi
)
result = model.fit(y_train, params, x=X_train[:, 0])
y_val_pred = result.eval(x=X_val[:, 0])

mae = mean_absolute_error(y_val, y_val_pred)
rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
print(f"The MAE for the polynomial+sin model fit is {mae:0.2f} C")
print(f"The RMSE for the  polynomial+sin model fit is {rmse:0.2f} C")
The MAE for the polynomial+sin model fit is 8.40 C
The RMSE for the  polynomial+sin model fit is 9.36 C
import numpy as np
from lmfit import Model


def polynomial_year_month(year, month, c0, c1, c2, c3, d1, d2, d3, d4):
    return (
        c0
        + year * c1
        + year**2 * c2
        + year**3 * c3
        + d1 * month
        + d2 * month**2
        + d3 * month**3
        + d4 * month**4
    )


gmodel = Model(
    polynomial_year_month,
    independent_vars=["year", "month"],
    param_names=["c0", "c1", "c2", "c3", "d1", "d2", "d3", "d4"],
)
params = gmodel.make_params(c0=1, c1=1, c2=1, c3=1, d1=1, d2=1, d3=1, d4=1)

result = gmodel.fit(y_train, params, year=X_train[:, 1], month=X_train[:, 2])
y_val_pred = result.eval(year=X_val[:, 1], month=X_val[:, 2])

mae = mean_absolute_error(y_val, y_val_pred)
rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
print(f"The MAE for the month/year polynomial model fit is {mae:0.2f} C")
print(f"The RMSE for the  month/year polynomial model fit is {rmse:0.2f} C")
The MAE for the month/year polynomial model fit is 0.52 C
The RMSE for the  month/year polynomial model fit is 0.69 C

The MAE for the second model is far better. Clearly the first has trouble extrapolating to the future, or possibly the initial parameter guesses don’t work very well for this new data!

Practice 2#

What’s the best polynomial order to use for this fit for the month term?

Train/val/test continued!#

The last thing to do is to train this model on the combined train/val data and predict on the test set to see how well it actually does for the most recent 20 years or so!

result = gmodel.fit(
    np.hstack((y_train, y_val)),
    params,
    year=np.vstack((X_train, X_val))[:, 1],
    month=np.vstack((X_train, X_val))[:, 2],
)
# predict on the test set with the refit model
y_test_pred = result.eval(year=X_test[:, 1], month=X_test[:, 2])

mae = mean_absolute_error(y_test, y_test_pred)
rmse = mean_squared_error(y_test, y_test_pred) ** 0.5
print(f"The MAE for the month/year polynomial model fit is {mae:0.2f} C")
print(f"The RMSE for the  month/year polynomial model fit is {rmse:0.2f} C")
The MAE for the month/year polynomial model fit is 0.34 C
The RMSE for the  month/year polynomial model fit is 0.45 C
# Make a plotly graph for the temperature data
fig = make_subplots(rows=1, cols=2)
fig.add_scatter(
    x=X_train[:, 0], y=y_train, name="Train Data", row=1, col=1, mode="markers"
)
fig.add_scatter(
    x=X_val[:, 0], y=y_val, name="Validation Data", row=1, col=1, mode="markers"
)
fig.add_scatter(
    x=X_test[:, 0], y=y_test, name="Test Data", row=1, col=1, mode="markers"
)

# Add a line for the polynomial fit results
fig.add_scatter(
    x=X_test[:, 0],
    y=result.eval(year=X_test[:, 1], month=X_test[:, 2]),
    name=f"Polynomial Fit (Month/Time), MAE={mae:,.2f}, RMSE={rmse:0.02f}",
    mode="markers",
    row=1,
    col=1,
)


# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]", row=1, col=1)
fig.update_yaxes(title_text="Northern Hemisphere Average Temperature [C]", row=1, col=1)

# Add a horizontal line at y=0
fig.add_hline(y=0.0, row=1, col=1)

# Make the parity plot!
fig.add_scatter(
    x=y_test,
    y=result.eval(year=X_test[:, 1], month=X_test[:, 2]),
    mode="markers",
    name=f"Polynomial Fit (Month/Time), MAE={mae:,.2f}, RMSE={rmse:0.02f}",
    row=1,
    col=2,
)

# Update the x/y axis labels
fig.update_xaxes(
    title_text="Actual Temperature Anomaly [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)
fig.update_yaxes(
    title_text="Predicted Temperature [C]",
    range=[df["temperature"].min(), df["temperature"].max()],
    row=1,
    col=2,
)

fig.add_shape(
    type="line",
    x0=df["temperature"].min(),
    y0=df["temperature"].min(),
    x1=df["temperature"].max(),
    y1=df["temperature"].max(),
    row=1,
    col=2,
)
# Set the plot size
fig.update_layout(autosize=False, width=1200, height=600)
fig.show()

We now have a pretty good idea of what level of accuracy we can expect moving forward. The parity plot looks reasonable, and the RMSE is ~0.45 C for the test set. Assuming that the system does not have major new forces acting on it (like significant climate action), we can probably use this model to forecast the next 10 years. Hopefully this isn’t a good assumption!

Note

To recap, we demonstrated some fundamental concepts of supervised learning using a historical temperature record:

  • We loaded and visualized the data using pandas and plotly

  • We tried simple constant and polynomial fits using the time as a single input

  • We used MAE/RMSE and parity plots to investigate the fits

  • We tried adding more features (month) and changing the model to improve the fit

  • We then showed how random and time series train/val/test splits worked, and used them to estimate the error we would have seen if we had used the dataset to predict the last 20 years of temperatures!