```{margin} Adaptation!
Some of this homework was inspired by, or uses content from, lectures by Prof. AJ Medford (Georgia Tech) for ChBE4745: https://github.com/medford-group/data_analytics_ChE

The dataset came from Dow Chemical and was released publicly as part of Prof. Medford's class.
```

# HW4 (due Monday 10/3 noon) [100 pts]

## Dow chemical process [50 pt]

![DOW process](dow_process.png)

The dataset contains a number of operating conditions for each of the units in the process, as well as the concentration of impurities in the output stream. Let's take a look:



In [None]:
import pandas as pd
import numpy as np

df = pd.read_excel('impurity_dataset-training.xlsx')

df.head()

To use this data, we need to remove some problematic rows. We also want to select all the columns with "x" as features, and select the impurity as "y". This will leave 10297 data points with 40 features, with the goal being to predict the impurity in the output stream! You should use this code to generate the features (X) and target (y). 

In [None]:
def is_real_and_finite(x):
    if not np.isreal(x):
        return False
    elif not np.isfinite(x):
        return False
    else:
        return True

all_data = df[df.columns[1:]].values #drop the first column (date)
numeric_map = df[df.columns[1:]].applymap(is_real_and_finite)
real_rows = numeric_map.all(axis=1).copy().values #True if all values in a row are real numbers
X = np.array(all_data[real_rows,:-5], dtype='float') #drop the last 5 cols that are not inputs
y = np.array(all_data[real_rows,-3], dtype='float')

### Train/validation/test split

Split the dataset into an 80/10/10 train/val/test split. 
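One common way to get three splits is to call scikit-learn's `train_test_split` twice. The sketch below uses a synthetic stand-in (`X_demo`, `y_demo`, both made up for illustration) rather than the real `X` and `y`, and the `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features/target (assumption: 1000 rows, 40 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 40))
y_demo = rng.normal(size=1000)

# Step 1: hold out 20% of the rows for validation + test
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: split the held-out rows in half, giving 10% validation and 10% test
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Fixing `random_state` keeps the split reproducible, so every model below is compared on the same validation rows.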

### Polynomial features with LASSO

Using polynomials up to second order, fit a LASSO model. Print the validation MAE and make a parity plot for your model compared to the experiments!
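A minimal sketch of this kind of model, assuming a scikit-learn pipeline that standardizes the inputs, expands them to second-order polynomial terms, and fits `Lasso`. The data are synthetic, and `alpha=0.01` is an arbitrary illustrative value, not a recommended setting for the impurity dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with one quadratic and one linear dependence (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scale first so the L1 penalty treats all polynomial terms comparably
lasso_model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Lasso(alpha=0.01, max_iter=10_000),
)
lasso_model.fit(X_train, y_train)

# Validation MAE and how many polynomial terms LASSO kept
val_mae = mean_absolute_error(y_val, lasso_model.predict(X_val))
n_terms = lasso_model[-1].coef_.size
n_kept = int(np.sum(lasso_model[-1].coef_ != 0))
print(f"validation MAE: {val_mae:.3f}")
print(f"non-zero coefficients: {n_kept} of {n_terms}")
```

Keeping the scaler and the polynomial expansion inside the pipeline means they are fit on the training rows only and then reused unchanged on the validation rows.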

### Decision Tree

Fit a decision tree to the dataset using just the features. Print the validation MAE and make a parity plot for your model compared to the experiments!
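As a rough illustration of the workflow (fit, validation MAE, parity plot), here is a sketch on synthetic data. The arrays, `max_depth=5`, and the axis labels are all assumptions for the example; the same plotting lines should work for any fitted regressor.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a step in the first feature plus a weak linear trend (assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(600, 4))
y_demo = (
    np.where(X_demo[:, 0] > 0, 2.0, -1.0)
    + 0.5 * X_demo[:, 1]
    + rng.normal(scale=0.1, size=600)
)

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_val)

val_mae = mean_absolute_error(y_val, y_pred)
print(f"validation MAE: {val_mae:.3f}")

# Parity plot: predictions vs. measurements, with the y = x line for reference
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(y_val, y_pred, s=10, alpha=0.6)
lims = [min(y_val.min(), y_pred.min()), max(y_val.max(), y_pred.max())]
ax.plot(lims, lims, "k--")
ax.set_xlabel("measured")
ax.set_ylabel("predicted")
```

Points close to the dashed line are well predicted; systematic curvature or horizontal bands (typical for shallow trees) show up immediately in this view.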

### Decision Tree w/ Polynomial features

Repeat the above using second order polynomial features. Print the validation MAE and make a parity plot for your model. Is this better or worse than just the original decision tree?

### Decision Tree Analysis

For your best decision tree above, analyze the first few splits in the tree (either by making a tree plot like we did in class, or with the text analysis). You can also use the feature_importances_ attribute on the fitted model.

Which columns/features are most important for predicting the output impurity?
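The text analysis and the importance ranking might look like the sketch below. It uses synthetic data and made-up feature names (`x1` to `x4`); `export_text` and `feature_importances_` are the scikit-learn tools mentioned above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data where the first feature dominates the target (assumption)
rng = np.random.default_rng(0)
feature_names = ["x1", "x2", "x3", "x4"]  # hypothetical column names
X_demo = rng.uniform(-1, 1, size=(600, 4))
y_demo = (
    3.0 * (X_demo[:, 0] > 0)
    + 0.5 * X_demo[:, 1]
    + rng.normal(scale=0.1, size=600)
)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

# Text view of the tree, truncated to the first few levels of splits
tree_text = export_text(tree, feature_names=feature_names, max_depth=2)
print(tree_text)

# Rank features by impurity-based importance (the values sum to 1)
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With polynomial features the columns are products of the original inputs, so the names passed to `export_text` need to come from the polynomial transformer rather than the raw column list.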

`````{seealso}
The examples for the classification problem here (export_text, plot_tree) also work for regression models: https://scikit-learn.org/stable/modules/tree.html
`````

### Pick your best model from above and evaluate your final impurity error on the test set.

### Bonus [10pt]

Try some other models from sklearn to see if you can do better than the decision tree. You can also try using feature selection (like in [Prof. Medford's lecture](https://github.com/medford-group/data_analytics_ChE/blob/master/2-regression/Topic4-High-dimensional_Regression.ipynb)). Remember, when you try different models you can only use the train/validation sets. 

If you find a model with a better validation score than your best model above, report its test set error.

## Materials dataset practice [50 pt]

We saw in class how we could use matminer to generate composition features. In class we built some simple models to predict the band gap of a material.

I want your help classifying whether a particular material/composition is a metal based on experimental data!

### Load the dataset and train/val/test split

Load the "matbench_expt_is_metal" dataset and generate 60/20/20 train/val/test splits.

What fraction of the materials in the dataset are metals?
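For a classification target it can help to stratify the splits so each one keeps roughly the same class balance. The sketch below shows a stratified 60/20/20 split and the class-fraction check on synthetic boolean labels; the arrays and the 40% positive rate are invented for illustration and say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples with a boolean "is metal"-style label (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.random(1000) < 0.4

# Fraction of positive labels in the full dataset
frac_all = y_demo.mean()
print(f"fraction positive: {frac_all:.3f}")

# 60% train, then split the remaining 40% in half for validation and test
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0
)

print(len(y_train), len(y_val), len(y_test))
print(f"train fraction positive: {y_train.mean():.3f}")
```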

### Generate composition features

Use the magpie features like we did in class.

### Fit a logistic regression model to the features

Report the accuracy on the validation set. How does this compare to the fraction of metals in the training dataset?
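The comparison matters because a classifier that always predicts the majority class already reaches an accuracy equal to the majority fraction. A sketch of that check on synthetic data (the features, labels, and `max_iter=1000` are assumptions for the example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary labels driven mostly by the first two features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
score = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=1000)
y_demo = (score > 0.5).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scaling helps the solver converge when features have very different ranges
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
val_acc = accuracy_score(y_val, clf.predict(X_val))

# Baseline: accuracy of always predicting the majority class of the training set
pos_frac = y_train.mean()
baseline = max(pos_frac, 1 - pos_frac)

print(f"validation accuracy: {val_acc:.3f}")
print(f"majority-class baseline: {baseline:.3f}")
```

A model is only doing useful work to the extent that its validation accuracy exceeds this baseline.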

### Fit a decision tree classifier


A decision tree classifier is very similar to the decision tree regressor we used in class. Fit one to your data, and report the accuracy on the validation set. Try playing with the decision tree classifier parameters (like max_depth) to see if you can get better results.
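One simple way to explore `max_depth` is a small loop that refits the tree at each depth and records the validation accuracy. This sketch uses synthetic data with an interaction between two features and some label noise; the depth grid is an arbitrary choice.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic labels: sign of x0 * x1, with 10% of the labels flipped (assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(1000, 6))
flipped = rng.random(1000) < 0.1
y_demo = ((X_demo[:, 0] * X_demo[:, 1] > 0) ^ flipped).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Refit at each depth; None lets the tree grow until its leaves are pure
depths = [1, 2, 3, 5, 8, None]
val_acc = {}
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    val_acc[depth] = accuracy_score(y_val, clf.predict(X_val))
    print(f"max_depth={depth!s:>4}: validation accuracy {val_acc[depth]:.3f}")

# Depth with the highest validation accuracy
best_depth = max(val_acc, key=val_acc.get)
print("best max_depth:", best_depth)
```

Only the validation split is used to pick the depth here; the test split stays untouched until the final model is chosen.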

`````{seealso}
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
`````


### Model explainability

Look at the feature_importances_ attribute of your fitted decision tree classifier. Which features matter the most in your model? Does this make sense based on what you remember from general chemistry?

### Test error

Pick your best model from above and evaluate the test accuracy. 

### Materials prediction

Go to the Materials Project website, then the Materials Explorer app. Filter by "Is Metal" under the electronic structure toggle in the filters list.

Pick three compositions that simulations predict to be metals, and three that are predicted to be non-metals. How well does your best classifier from above work on these?

A few things to keep in mind:
* Simulations are not perfect at predicting experimental properties
* The experimental dataset you used is probably not representative of all possible structures in the Materials Project