\[\require{mhchem}\]

HW3 (due 9/26)#

Challenger Rocket O-ring Failures [40 pt]#

(Image: the Space Shuttle Challenger)

The Space Shuttle Challenger flew nine successful missions between 1983 and 1986. On its tenth and final launch, Challenger exploded a little more than a minute after liftoff. Among the astronauts on board were Judith Resnik, a CMU alum (ECE), for whom the Resnik Award at CMU for exceptional senior women is named, and Ronald McNair, the second African American to fly to space, for whom the national McNair Scholars program is named.

The Rogers Commission was tasked with identifying the factors that led to the disaster; notably, Nobel-prize-winning physicist Richard Feynman served on the commission. The cause of the failure was eventually determined to be a failed O-ring that led to an explosion in the fuel tank, and the temperature on the day of the launch was one of the primary considerations.

This was a particularly dark period for NASA and engineers as the Challenger incident exposed major issues around reliability and engineering statistics. The disaster was not just a technical failure in the O-ring, but a failure in the engineering organization at NASA.

See also

  • Feynman’s special report is in the appendix to the Rogers commission report (Appendix F) https://www.govinfo.gov/content/pkg/GPO-CRPT-99hrpt1016/pdf/GPO-CRPT-99hrpt1016.pdf

  • Mr Feynman Goes to Washington, available here: https://calteches.library.caltech.edu/3570/1/Feynman.pdf

  • https://www.youtube.com/watch?v=ZOzoLdfWyKw (Feynman testimony starts at 1:53)

  • https://www.cmu.edu/education-office/academic-resources/resnik-award.html


Download the o-ring dataset from UC Irvine’s website#

Find the data file on the website (you want the O-ring-erosion-only one), download it using wget, and load it into pandas. https://archive.ics.uci.edu/ml/datasets/Challenger+USA+Space+Shuttle+O-Ring
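A minimal sketch of the download-and-load step is below. The UCI path and the column names are assumptions based on the dataset's description page; check the "Data Folder" link there if the URL or file name has changed.

```python
# In a notebook cell, download with wget (path assumed; verify on the UCI page):
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/space-shuttle/o-ring-erosion-only.data

import pandas as pd

# The file is whitespace-delimited with no header row; these names follow the
# dataset's .names description (number of O-rings, number in thermal distress,
# launch temperature, leak-check pressure, flight order).
cols = ["n_orings", "n_distress", "temperature_F", "pressure_psi", "flight_order"]
df = pd.read_csv("o-ring-erosion-only.data", sep=r"\s+", header=None, names=cols)
df.head()
```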

Binary variable and plot#

Make a binary variable that represents whether the number of O-ring failures is greater than or equal to 1 (e.g. 0 if no failure, 1 if failure). Plot the binary variable as a function of the temperature.
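One way to do this, continuing from the loading sketch above (the column names `n_distress` and `temperature_F` are the assumed names from that sketch):

```python
import matplotlib.pyplot as plt

# 1 if at least one O-ring showed thermal distress on that flight, 0 otherwise
df["failure"] = (df["n_distress"] >= 1).astype(int)

plt.scatter(df["temperature_F"], df["failure"])
plt.xlabel("Launch temperature (F)")
plt.ylabel("O-ring failure (1 = yes)")
plt.show()
```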

Train/test split#

Make a simple 80/20 train/test split using the available data. Don’t shuffle, so that this is effectively a time-series split (e.g. the first 80% of the launches is train, the last 20% is test).
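A sketch using scikit-learn, assuming the `failure` and `temperature_F` columns from the earlier steps:

```python
from sklearn.model_selection import train_test_split

X = df[["temperature_F"]]
y = df["failure"]

# shuffle=False keeps the launches in chronological order, so the first 80%
# become the train set and the last 20% the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
```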

Logistic regression#

Fit a simple LogisticRegression model to the data and evaluate the accuracy on the test set. Is the result statistically significant with such a small dataset?

Plot whether a failure is predicted at all temperatures from 20F to 80F.
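A minimal sketch of the fit, test accuracy, and temperature sweep, continuing from the split above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Predicted class (failure / no failure) over a grid of temperatures 20-80 F
temps = pd.DataFrame({"temperature_F": np.linspace(20, 80, 61)})
plt.plot(temps["temperature_F"], model.predict(temps))
plt.xlabel("Temperature (F)")
plt.ylabel("Predicted failure (1 = yes)")
plt.show()
```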

Probability of failure#

One of the helpful things about logistic regression is that you can make predictions for the probability of outcomes. Use the predict_proba method on your classifier to predict the probability of a failure at 32F.
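For example, with the fitted model from above:

```python
import pandas as pd

# predict_proba returns [P(no failure), P(failure)] for each row
p = model.predict_proba(pd.DataFrame({"temperature_F": [32]}))
print("P(failure at 32 F) =", p[0, 1])
```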

See also

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Problem analysis#

Skim the “Mr. Feynman Goes to Washington” article or the Wikipedia article and answer the following questions in a few sentences each:

  • What was the primary chemical engineering / material cause for O-ring failures at low temperatures?

  • What factors led to such large disparities in risk assessment for the launch? (Estimates ranged from near certainty, to 1-in-100, to 1-in-100,000 odds of a catastrophic failure.)

Perovskite classification and discovery [60 pt]#

(Figure: NREL PV cell-efficiency chart, https://www.nrel.gov/pv/cell-efficiency.html/)

Perovskites are a special type of inorganic crystal that are now one of the most efficient and popular research materials for photovoltaics, as shown in the chart above!

Perovskite refers to a specific material (\(\ce{CaTiO3}\)). There are many substitutions that you can make for Ca, Ti, and O, giving compounds of the form \(\ce{ABX3}\); some will be stable in the same perovskite crystal structure, and some will not be.

  • A is a replacement for Ca, usually an alkali, alkaline-earth, or rare-earth metal

  • B is a replacement for Ti, usually a transition metal

  • X is a replacement for O, often O, S, Se, F, Cl, Br, I

One factor (but not the only one) is that the ratio of the ionic radii needs to be similar to that of the original Ca/Ti/O combination. For example, one proposed criterion (commonly referred to as the Goldschmidt tolerance factor \(t\)) is:

\[\begin{align*} t=\frac{R_A+R_X}{\sqrt{2}(R_B+R_X)} \end{align*}\]

where \(R_A\) is the radius of element A, \(R_B\) is the radius of element \(B\), and \(R_X\) is the radius of element X (the units don’t matter as long as you are consistent with all three!). If \(t\) is between 0.75 and 1.0, the combination is often stable in the perovskite structure.
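As a quick worked example, here is the tolerance factor for \(\ce{CaTiO3}\) itself, using approximate Shannon ionic radii (the numerical values below are illustrative, not taken from the homework dataset):

```python
import numpy as np

def tolerance_factor(rA, rB, rX):
    """Goldschmidt tolerance factor t = (rA + rX) / (sqrt(2) * (rB + rX))."""
    return (rA + rX) / (np.sqrt(2) * (rB + rX))

# Approximate Shannon ionic radii in Angstroms:
# Ca2+ (12-coordinate) ~1.34, Ti4+ (6-coordinate) ~0.605, O2- ~1.40
t = tolerance_factor(1.34, 0.605, 1.40)
print(t)  # ~0.97, inside the 0.75-1.0 window, as expected for CaTiO3
```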

One way to identify whether a particular combination of elements A, B, and X is stable is to do a quantum chemistry simulation like the guest speakers last week discussed. This often works but is not perfectly accurate.

See also

  • https://en.wikipedia.org/wiki/Perovskite

  • https://en.wikipedia.org/wiki/Perovskite_(structure)

I want your help to identify some polynomial features that correlate with whether a particular combination will form a perovskite!

Download the dataset and load with pandas#

You can find an experimental dataset for whether the perovskite structure is stable for 576 A/B/X compounds here: https://github.com/CJBartel/perovskite-stability

Download the file Table1.csv using wget and use pandas to load as a dataframe.

Note

The url for the “raw” file in github is one that you can use with wget.
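A sketch of the download and load; the raw-file URL below assumes the file sits at the top of the default branch, so adjust the branch or path if it lives elsewhere in the repository:

```python
# In a notebook cell (URL pattern assumed; check the repo's "Raw" button):
# !wget https://raw.githubusercontent.com/CJBartel/perovskite-stability/master/Table1.csv

import pandas as pd

df = pd.read_csv("Table1.csv")
df.head()
```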

Train/val/test split#

Split the data into a random 90/10 split using sklearn: train+validation (90%) and test (10%). The label we want to predict (y) is “exp_label”. The allowed features for X are the columns [A, B, X, nA, nB, nX, rA (Ang), rB (Ang), rX (Ang)].
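A minimal sketch, assuming the column names appear in the CSV exactly as listed above; the size of the validation carve-out from the 90% is your choice (here roughly 80/10/10 overall):

```python
from sklearn.model_selection import train_test_split

feature_cols = ["A", "B", "X", "nA", "nB", "nX", "rA (Ang)", "rB (Ang)", "rX (Ang)"]
X = df[feature_cols]
y = df["exp_label"]

# 90% train+validation, 10% test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# carve a validation set out of the 90% (~1/9 of it gives a 10% slice overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=1/9, random_state=0
)
```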

Visualize the Goldschmidt ratio#

Plot a histogram for the t-values in the data set. Highlight the desired range of 0.75-1.00. Color code the histogram based on the “exp_label” (Hint: This should look similar to Fig. 2A in the paper that the data was generated for).

Define a simple ‘Goldschmidt classifier’, which uses the t-ratio to predict ‘exp_label’. What is its accuracy on the full data set? Is this significantly better than a naive classifier?
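A sketch of both steps, assuming the radius column names above; the comparison in the last line assumes “exp_label” is coded 0/1, so adjust it if the dataset uses a different encoding (e.g. -1/+1):

```python
import numpy as np
import matplotlib.pyplot as plt

# Goldschmidt tolerance factor computed from the radius columns
t = (df["rA (Ang)"] + df["rX (Ang)"]) / (np.sqrt(2) * (df["rB (Ang)"] + df["rX (Ang)"]))

# Histogram color-coded by the experimental label
for label, group in t.groupby(df["exp_label"]):
    plt.hist(group, bins=40, alpha=0.6, label=f"exp_label = {label}")
plt.axvspan(0.75, 1.00, color="gray", alpha=0.2)  # highlight the "stable" window
plt.xlabel("Goldschmidt tolerance factor t")
plt.legend()
plt.show()

# Rule-based "Goldschmidt classifier": predict perovskite when t is in [0.75, 1.0]
pred = ((t >= 0.75) & (t <= 1.00)).astype(int)
print("Goldschmidt classifier accuracy:", (pred == df["exp_label"]).mean())
```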

Logistic regression#

Fit a logistic regression model using the allowed feature columns above. Report the accuracy and compare it to the Goldschmidt-ratio classifier.
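A sketch continuing from the split above. For simplicity it uses only the numeric columns; the element-symbol columns A, B, X would need one-hot encoding to be included.

```python
from sklearn.linear_model import LogisticRegression

num_cols = ["nA", "nB", "nX", "rA (Ang)", "rB (Ang)", "rX (Ang)"]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train[num_cols], y_train)
print("Validation accuracy:", clf.score(X_val[num_cols], y_val))
```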

Add features#

First, add some features to the input matrix X: 1/rA and 1/rB. Then add polynomial features, varying the polynomial degree from 1 to 3.

Like we did in class in the features lecture, vary the polynomial degree from 1 to 3 and study the effect on the train and validation accuracy. Which degree is the most predictive for your train/validation split?
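One possible sketch of the degree sweep, building on the numeric columns and split from the previous sketches (the feature helper and scaling step are choices, not requirements):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def make_features(X):
    """Numeric columns plus the suggested inverse-radius features."""
    Xf = X[num_cols].copy()
    Xf["1/rA"] = 1 / X["rA (Ang)"]
    Xf["1/rB"] = 1 / X["rB (Ang)"]
    return Xf

for degree in [1, 2, 3]:
    model = make_pipeline(
        PolynomialFeatures(degree=degree),
        StandardScaler(),
        LogisticRegression(max_iter=5000),
    )
    model.fit(make_features(X_train), y_train)
    print(
        f"degree={degree}: "
        f"train={model.score(make_features(X_train), y_train):.3f}, "
        f"val={model.score(make_features(X_val), y_val):.3f}"
    )
```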

Regularization#

Try varying the type of regularization (“penalty”), and the strength of the regularization (“C”) in LogisticRegression. For the highest validation set accuracy you can find, fit the model to the entire train+validation dataset and predict on the test set to report your final model accuracy.
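A simplified sketch of the search, reusing `make_features` and the train/validation/test splits from above (it omits the polynomial expansion for brevity; you could wrap each model in the same pipeline). The liblinear solver is used because it supports both l1 and l2 penalties:

```python
from sklearn.linear_model import LogisticRegression

best = None
for penalty in ["l1", "l2"]:
    for C in [0.01, 0.1, 1, 10, 100]:
        clf = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        clf.fit(make_features(X_train), y_train)
        val_acc = clf.score(make_features(X_val), y_val)
        if best is None or val_acc > best[0]:
            best = (val_acc, penalty, C)

val_acc, penalty, C = best
print("Best validation accuracy:", val_acc, "with", penalty, "C =", C)

# Refit on train + validation with the best settings, then report test accuracy
final = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
final.fit(make_features(X_trainval), y_trainval)
print("Test accuracy:", final.score(make_features(X_test), y_test))
```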

Bonus#

This problem is based on a paper that used a special method (SISSO) to generate and screen billions of possible algebraic formulae to see which ones correlated best with perovskite stability. If you’re interested, the code for that is here: https://github.com/rouyang2017/SISSO

Read the paper to find the tolerance factor \(\tau\), and use it as a single logistic regression feature. How accurate is your model compared to \(\tau\) on your test set? How do your results compare to what the paper claims for accuracy?