HW2 - Global CO2 concentration prediction and practice with scikit-learn [100 pt]#

In class we showed that we can build predictive models for the northern hemisphere monthly average temperature. We achieved an MAE of about 0.5 °C.

You’re going to help me:

  • Fit and predict the CO2 concentration measured at the Mauna Loa Observatory in Hawaii

  • Analyze the temperature anomaly (temperature variation compared to the monthly average) instead of the absolute temperature

  • Use time series splits to analyze your model

  • Plot the temperature and your best fit

Data download#

Go to the Mauna Loa data website and find the link to the “Mauna Loa CO2 weekly mean and historical comparisons” text or csv file. Download it with wget.

!wget https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_weekly_mlo.txt
--2022-10-13 15:06:00--  https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_weekly_mlo.txt
Resolving gml.noaa.gov (gml.noaa.gov)... 140.172.200.41, 2610:20:8800:6101::29
Connecting to gml.noaa.gov (gml.noaa.gov)|140.172.200.41|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 194372 (190K) [text/plain]
Saving to: ‘co2_weekly_mlo.txt’


co2_weekly_mlo.txt  100%[===================>] 189.82K  --.-KB/s    in 0.09s   

2022-10-13 15:06:00 (2.17 MB/s) - ‘co2_weekly_mlo.txt’ saved [194372/194372]

Load and visualize the data#

First, load the data with pandas. You will probably have to change the column names and the number of rows that are skipped compared to the example from class. Depending on whether you download the csv or the txt file you may also have to change delim_whitespace.

import pandas as pd

# Read in the dataset using pandas, and explicitly include the column names
# since pandas has trouble with the header lines in this file
df = pd.read_csv(
    "co2_weekly_mlo.txt",
    delim_whitespace=True,
    skiprows=70,
    names=[
        "year",
        "month",
        "day",
        "decimal",
        "ppm",
        "days",
        "1year",
        "10year",
        "1800",
    ],
)

# display the resulting dataframe
df
year month day decimal ppm days 1year 10year 1800
0 1974 10 13 1974.7808 327.23 5 -999.99 -999.99 50.53
1 1974 10 20 1974.8000 327.40 5 -999.99 -999.99 50.51
2 1974 10 27 1974.8192 327.64 7 -999.99 -999.99 50.50
3 1974 11 3 1974.8384 327.80 7 -999.99 -999.99 50.39
4 1974 11 10 1974.8575 328.50 6 -999.99 -999.99 50.79
... ... ... ... ... ... ... ... ... ...
2499 2022 9 4 2022.6740 416.33 7 413.41 391.85 139.62
2500 2022 9 11 2022.6932 416.05 6 413.07 391.10 139.60
2501 2022 9 18 2022.7123 415.57 7 413.32 391.02 139.27
2502 2022 9 25 2022.7315 415.57 7 413.32 391.07 139.33
2503 2022 10 2 2022.7507 415.26 7 413.52 391.06 138.96

2504 rows × 9 columns
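If you grabbed the csv file instead, a minimal sketch (assuming, as is common for NOAA GML data files, that the header block is prefixed with '#' and the file carries its own column names - check the file to confirm):

# Hedged alternative for the csv download: skip '#'-prefixed header lines
# and let pandas take the column names from the file itself
df_csv = pd.read_csv("co2_weekly_mlo.csv", comment="#")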

Now, plot the CO2 concentration in ppm vs the time (the decimal column in the sheet).

import plotly.graph_objects as go

# Make a plotly graph
fig = go.Figure()
fig.add_scatter(x=df["decimal"], y=df["ppm"], name="Actual Data")

# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]")
fig.update_yaxes(title_text="Mauna Loa CO2 Conc Measurement [ppm]")

# Add a horizontal line at y=0
fig.add_hline(y=0.0)

fig.show()

Filter the data#

Note that some of the data is reported as -999.99. Those are days when there was no data at Mauna Loa for various reasons. Filter the dataframe to only include rows with positive ppm measurements. Repeat the plot from above to confirm the data looks good now!

See also

https://www.google.com/search?q=pandas+filter+values+greater+than

df = df[df["ppm"] > 0]

import plotly.graph_objects as go

# Make a plotly graph
fig = go.Figure()
fig.add_scatter(x=df["decimal"], y=df["ppm"], name="Actual Data")

# Update the x/y axis labels
fig.update_xaxes(title_text="Time [Years]")
fig.update_yaxes(title_text="Mauna Loa CO2 Conc Measurement [ppm]")

fig.show()

Train/val-test split#

To start, split your data into train/val (90%) and test (10%). Use sklearn.model_selection.train_test_split like we did in class. Make sure you don’t shuffle when you do it, so that the test data is the most recent 10% of the data!

Tip

sklearn.model_selection.train_test_split can split pandas dataframes directly!

from sklearn.model_selection import train_test_split

df_trainval, df_test = train_test_split(df, test_size=0.1, shuffle=False)
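As a quick sanity check (a small sketch, not part of the assignment), you can confirm the split is chronological: the earliest test time should come after the latest train/val time.

# With shuffle=False the test set should begin where train/val ends
print(df_trainval["decimal"].max(), df_test["decimal"].min())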

Your first scikit-learn model#

Scikit-learn can handle all of the things we talked about in class!

  • Take data and featurize it

  • Implement and fit a machine learning model

  • Create various train/test splits

  • Calculate the average MAE/RMSE of the splits

To implement this, let’s use a linear model with polynomial features, like we had in class!

Tip

Helpful resources!

  • https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate

  • https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline

  • https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit

  • https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py
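If TimeSeriesSplit is new to you, here is a tiny illustrative sketch of its behavior: it produces expanding training windows, and each validation fold is strictly later in time than the data it was trained on.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Each successive split trains on a longer prefix of the data and
# validates on the chunk that immediately follows it in time
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(np.arange(12)):
    print("train:", train_idx, "val:", val_idx)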

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# We expect a dataframe df_trainval generated above that contains the train and validation data
df_train, df_val = train_test_split(df_trainval, test_size=0.1, shuffle=False)

# Make a pipeline where we first generate quadratic polynomial features from the time data
# then fit using linear regression
model = make_pipeline(PolynomialFeatures(2), LinearRegression())

# Evaluate the model by generating 5 different train/val splits using TimeSeriesSplit,
# fitting the model from above on each train split, and evaluating the MAE on each validation split
cross_validate(
    model,
    df_train["decimal"].values.reshape(-1, 1),
    df_train["ppm"].values,
    cv=TimeSeriesSplit(),
    scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
)
{'fit_time': array([0.03108954, 0.00190473, 0.00165319, 0.00157523, 0.00358939]),
 'score_time': array([0.00141287, 0.00082779, 0.00082111, 0.00082707, 0.00084805]),
 'test_neg_mean_absolute_error': array([-3.80430561, -2.0159317 , -2.13827431, -2.64710001, -2.31575849]),
 'test_neg_root_mean_squared_error': array([-4.57438318, -2.3744048 , -2.51921687, -3.19661145, -2.71270215])}
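scikit-learn returns these errors negated (scores are maximized), so reporting the average MAE/RMSE across the five splits takes one extra step. A short sketch that stores the returned dictionary and flips the sign:

# Store the cross-validation results so we can average the (negated) errors
results = cross_validate(
    model,
    df_train["decimal"].values.reshape(-1, 1),
    df_train["ppm"].values,
    cv=TimeSeriesSplit(),
    scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
)
print(f"Mean MAE:  {-results['test_neg_mean_absolute_error'].mean():.2f} ppm")
print(f"Mean RMSE: {-results['test_neg_root_mean_squared_error'].mean():.2f} ppm")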

Underfitting/overfitting#

Vary the degree of the polynomial features above, and find the degree that minimizes the mean absolute error on the final train/val split. Plot the MAE as a function of the polynomial degree.

import numpy as np

degrees = list(range(1, 10))
mae_save = []
for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

    # Evaluate the model by generating 5 different train/val splits using TimeSeriesSplit,
    # fitting the model from above on each train split, and evaluating the MAE on each validation split
    results = cross_validate(
        model,
        df_train["decimal"].values.reshape(-1, 1),
        df_train["ppm"].values,
        cv=TimeSeriesSplit(),
        scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
    )
    mae_save.append(results["test_neg_mean_absolute_error"][-1])

mae_save = np.array(mae_save)
# Make a plotly graph
fig = go.Figure()
fig.add_scatter(x=degrees, y=-mae_save, name="Polynomial Fit Val MAE")

# Update the x/y axis labels
fig.update_xaxes(title_text="Polynomial Degree")
fig.update_yaxes(title_text="Polynomial Fit Validation MAE [ppm]")

fig.show()
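To pull out the winning degree programmatically rather than reading it off the plot, a one-line sketch (remember mae_save holds negated errors, so the lowest error is the maximum entry):

# mae_save holds negative MAE values, so the best degree maximizes it
best_degree = degrees[np.argmax(mae_save)]
print(f"Best polynomial degree: {best_degree}")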

Visualize your best model#

Now that you’ve optimized the degree of the polynomial to be most predictive for the train/val splits you identified, let’s see how it does on the test data you set aside earlier!

Make a plot with plotly that shows:

  • The train/val data and test concentration data

  • The model you identified above, fitted on all of the train/val data and predicting the test data. Include predictions for the next 5 years.

model = make_pipeline(PolynomialFeatures(3), LinearRegression())
model.fit(df_trainval["decimal"].values.reshape((-1,1)),df_trainval["ppm"])
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=3)),
                ('linearregression', LinearRegression())])
# Make a plotly graph
fig = go.Figure()

# Plot train/val, validation, and test data 
fig.add_scatter(x=df_trainval["decimal"], y=df_trainval["ppm"], name="Train/Val Data")
fig.add_scatter(x=df_val["decimal"], y=df_val["ppm"], name="Val Data")
fig.add_scatter(x=df_test["decimal"], y=df_test["ppm"], name="Test Data")

# Plot fit for train/val data and val fit
fig.add_scatter(x=df_trainval["decimal"], y=model.predict(df_trainval["decimal"].values.reshape((-1,1))), name="Train/Val Fit")
fig.add_scatter(x=df_val["decimal"], y=model.predict(df_val["decimal"].values.reshape((-1,1))), name="Val Fit")

# Plot best polynomial fit and future predictions
fig.add_scatter(x=df_test["decimal"], y=model.predict(df_test["decimal"].values.reshape((-1,1))), name="Best Polynomial Fit Prediction")
future = np.linspace(2022,2030).reshape((-1,1))
fig.add_scatter(x=future[:,0], y=model.predict(future), name="Future Prediction")

# Update the x/y axis labels
fig.update_xaxes(range=(1970,2030),title_text="Time [Years]")
fig.update_yaxes(title_text="Mauna Loa CO2 Conc Measurement [ppm]")

fig.show()
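To put a number on the test performance as well (a quick sketch using sklearn.metrics.mean_absolute_error):

from sklearn.metrics import mean_absolute_error

# Evaluate the model fitted on train/val against the held-out test measurements
test_mae = mean_absolute_error(
    df_test["ppm"], model.predict(df_test["decimal"].values.reshape((-1, 1)))
)
print(f"Test MAE: {test_mae:.2f} ppm")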

Sources of bias and limitations#

What are potential sources of bias or difficulties with this data? Is it possible to predict CO2 concentrations in the future without knowing whether the world will take drastic action on CO2 emissions? Why or why not?

Some potential answers for sources of bias or difficulties with the data:

  • There is the possibility of outlier bias - the data is given as weekly averages, which do not contain all information about the CO2 levels for that week. It is possible that a few days out of the week had very high or low levels of CO2, which can sway the average and what can be concluded from the results.

  • How is the data collected - automated or non-automated systems? If the collection is not automated, there is the possibility of human error in the data.

  • This data is location specific - we cannot confidently conclude anything about CO2 level trends elsewhere in the world based solely on this data.

Potential answer for predicting CO2 concentrations in the future:

  • It is possible to predict future CO2 concentrations in the sense that the model extrapolates the expected trend in the absence of intervention. However, the prediction is based only on the provided data and cannot capture what happens if drastic action is taken to reduce CO2 emissions.

Bonus [10 pt]#

We used quite simple features, and most likely your best fit has none of the annual cyclic variation present in the original dataset.

Try incorporating strategies from one of these resources to see if you can improve on the fit above:

  • Periodic spline features (https://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html#periodic-splines)

  • https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html#sphx-glr-auto-examples-applications-plot-cyclical-feature-engineering-py

What’s the lowest MAE you can obtain using the time series splits?

from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge


# Use cubic splines (degree=3) with linear extrapolation outside the data range
model2 = make_pipeline(SplineTransformer(degree=3, n_knots=2, extrapolation="linear"), Ridge(alpha=1e-3))
model2.fit(df_trainval["decimal"].values.reshape((-1,1)),df_trainval["ppm"])

results = cross_validate(
        model2,
        df_train["decimal"].values.reshape(-1, 1),
        df_train["ppm"].values,
        cv=TimeSeriesSplit(),
        scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
    )
print(f'MAE using linear spline features = {-results["test_neg_mean_absolute_error"][-1]}')

# Make a plotly graph to show fit
fig2 = go.Figure()

# Plot train/val, validation, and test data 
fig2.add_scatter(x=df_trainval["decimal"], y=df_trainval["ppm"], name="Train/Val Data")
fig2.add_scatter(x=df_val["decimal"], y=df_val["ppm"], name="Val Data")
fig2.add_scatter(x=df_test["decimal"], y=df_test["ppm"], name="Test Data")

# Plot fit for train/val data 
fig2.add_scatter(x=df_trainval["decimal"], y=model2.predict(df_trainval["decimal"].values.reshape((-1,1))), name="Train/Val Fit")
fig2.add_scatter(x=df_val["decimal"], y=model2.predict(df_val["decimal"].values.reshape((-1,1))), name="Val Fit")

# Plot linear spline 
fig2.add_scatter(x=df_test["decimal"], y=model2.predict(df_test["decimal"].values.reshape((-1,1))), name="Linear Spline Prediction")
MAE using linear spline features = 2.0205841508504445

Good MAE, but the fit still does not capture the annual cyclic pattern, so let’s try periodic splines.

# Trying again with periodic spline features 

# Use periodic spline features
model3 = make_pipeline(SplineTransformer(degree=3, n_knots=100, extrapolation="periodic"), Ridge(alpha=1e-3))
model3.fit(df_trainval["decimal"].values.reshape((-1,1)),df_trainval["ppm"])

results = cross_validate(
        model3,
        df_train["decimal"].values.reshape(-1, 1),
        df_train["ppm"].values,
        cv=TimeSeriesSplit(),
        scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
    )
print(f'MAE using periodic spline features = {-results["test_neg_mean_absolute_error"][-1]}')

# Make a plotly graph to show fit
fig3 = go.Figure()

# Plot train/val, validation, and test data 
fig3.add_scatter(x=df_trainval["decimal"], y=df_trainval["ppm"], name="Train/Val Data")
fig3.add_scatter(x=df_val["decimal"], y=df_val["ppm"], name="Val Data")
fig3.add_scatter(x=df_test["decimal"], y=df_test["ppm"], name="Test Data")

# Plot fit for train/val data 
fig3.add_scatter(x=df_trainval["decimal"], y=model3.predict(df_trainval["decimal"].values.reshape((-1,1))), name="Train/Val Fit")
fig3.add_scatter(x=df_val["decimal"], y=model3.predict(df_val["decimal"].values.reshape((-1,1))), name="Val Fit")

# Plot periodic spline prediction
fig3.add_scatter(x=df_test["decimal"], y=model3.predict(df_test["decimal"].values.reshape((-1,1))), name="Periodic Spline Prediction")
MAE using periodic spline features = 54.98389813688461

The fit now captures the annual cyclic pattern within the training data, but the prediction on the test data is poor: periodic extrapolation just repeats the training-range pattern and loses the upward trend.
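The second resource above suggests another route: keep a low-degree polynomial for the long-term trend and add explicit sin/cos features for the annual cycle. A hedged sketch of that idea (the annual_sin_cos helper and model4 name are illustrative, not from the assignment):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

# Illustrative helper: the fractional part of the decimal year encodes the
# season, so sin/cos of 2*pi*year gives smooth periodic features
def annual_sin_cos(X):
    return np.hstack([np.sin(2 * np.pi * X), np.cos(2 * np.pi * X)])

# Quadratic trend plus annual seasonality, fit with light regularization
model4 = make_pipeline(
    FeatureUnion([
        ("trend", PolynomialFeatures(2)),
        ("season", FunctionTransformer(annual_sin_cos)),
    ]),
    Ridge(alpha=1e-3),
)

results = cross_validate(
    model4,
    df_train["decimal"].values.reshape(-1, 1),
    df_train["ppm"].values,
    cv=TimeSeriesSplit(),
    scoring=["neg_mean_absolute_error"],
)
print(f"Mean MAE with trend + sin/cos features = {-results['test_neg_mean_absolute_error'].mean():.2f}")

Unlike periodic spline extrapolation, the sin/cos features stay well-behaved outside the training range, since the polynomial carries the trend and the trigonometric terms only carry the cycle.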
