Making a config#
Catlas is highly configurable. There are some examples in configs/automated_screens and configs/tests. Here we will go through each section of the config and detail the options.
Memory cache location#
This entry is required. It should be a string giving the path to the folder in which results will be cached.
ex.
memory_cache_location: '/home/jovyan/test_cache'
Output options#
Run name#
This will be used to make a folder in outputs, which will house all outputs from your run. Choose a descriptive name so that you can track down old results later.
ex.
run_name: "OH_on_binary_intermetallics"
Pickle final outputs#
This boolean specifies whether or not to pickle a final dataframe of your results into the outputs folder. If you are running something so large that it cannot fit in local memory, you may want to set this to False.
ex.
pickle_final_output: True
Pickle intermediate outputs#
This is useful for edge cases. If set to True, each work partition processed by a worker will be pickled. This means that many (O(100)-O(1000)) pickle files will populate a subfolder in your outputs folder as the partitions finish.
ex.
pickle_intermediate_outputs: False
Make parity plots#
If True, this will perform the same functionality as bin/get_parities.py at the beginning of the inference run. If data is available for the model you selected, parity plots will be generated and saved in the outputs folder.
ex.
make_parity_plots: True
Verbose#
If True, this will print the summary dataframe to the terminal at the conclusion of a run. This (as well as pickle final outputs) pushes all results to local memory. If you are running something so large that it cannot fit in local memory, you may want to set this to False.
Get all structures#
If True, this will include all adslab and slab objects. Sometimes it is advantageous to set this to False for large runs because adslabs are memory intensive.
ex.
output_all_structures: True
Input Options#
This is where you specify the file paths to the bulk and adsorbate input files. This is required.
ex.
input_options:
  bulk_file: 'catlas/bulk_structures/ocp_bulks_w_properties.db'
  adsorbate_file: 'catlas/adsorbate_structures/adsorbates.pkl'
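The required entries described so far can be checked before launching a run. The following is a minimal sketch in Python, not part of Catlas itself: the helper name `check_required_entries` is hypothetical, and it assumes the config has already been loaded into a plain dictionary.

```python
def check_required_entries(config):
    """Return a list of problems found in a config dict (empty if none).

    Hypothetical helper for illustration only; Catlas may validate
    configs differently.
    """
    problems = []

    # memory_cache_location is required and should be a string path.
    if not isinstance(config.get("memory_cache_location"), str):
        problems.append("memory_cache_location must be a string path")

    # input_options is required and should name both input files.
    input_options = config.get("input_options")
    if not isinstance(input_options, dict):
        input_options = {}
    for key in ("bulk_file", "adsorbate_file"):
        if not isinstance(input_options.get(key), str):
            problems.append(f"input_options.{key} must be a string path")

    return problems


example_config = {
    "memory_cache_location": "/home/jovyan/test_cache",
    "input_options": {
        "bulk_file": "catlas/bulk_structures/ocp_bulks_w_properties.db",
        "adsorbate_file": "catlas/adsorbate_structures/adsorbates.pkl",
    },
}
print(check_required_entries(example_config))
```

An empty list means both required entries are present in the form described above.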
Bulk Filters#
This is where you specify how you would like to downselect the material design space. There are several options which are listed here and detailed below:
filter_by_bulk_ids
filter_by_acceptable_elements
filter_ignore_bulk_ids
filter_by_num_elements
filter_by_required_elements
filter_by_object_size
filter_by_elements_active_host
filter_by_element_groups
filter_by_bulk_e_above_hull
filter_by_bulk_band_gap
filter_by_pourbaix_stability
filter_fraction
1. Bulk ids to include:#
a list of bulk ids to include
filter_by_bulk_ids: ['mp-126','mp-30', 'mp-81', 'mp-13', 'mp-79']
2. Acceptable elements:#
a list of element symbols to include. If you say “Au” and “Ag”, then only materials containing no elements other than Au and Ag will be included.
filter_by_acceptable_elements: ["Au", "Ag"]
3. Bulk ids to ignore:#
a list of bulk ids to exclude
filter_ignore_bulk_ids: ['mp-126','mp-30', 'mp-81', 'mp-13', 'mp-79']
4. Number of elements:#
a list of the acceptable numbers of elements, e.g. [2, 3] gives you binary and ternary materials
filter_by_num_elements: [2, 3]
5. Required elements:#
a list of elements which must be in each material. If you say “Cu”, then only copper alloys will be considered.
filter_by_required_elements: ["Cu"]
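The element-based filters (acceptable elements, number of elements, required elements) can be read as simple set checks on a material's elements. The sketch below illustrates that reading; the function `passes_element_filters` and its signature are hypothetical and are not the Catlas implementation.

```python
def passes_element_filters(elements, acceptable=None, num_elements=None, required=None):
    """Check one material's elements against three optional filters.

    Illustrative only: the function name and arguments are assumptions.
    """
    elements = set(elements)

    # Acceptable elements: every element must come from the list.
    if acceptable is not None and not elements <= set(acceptable):
        return False

    # Number of elements: the count must be one of the listed values.
    if num_elements is not None and len(elements) not in num_elements:
        return False

    # Required elements: every listed element must be present.
    if required is not None and not set(required) <= elements:
        return False

    return True


# A Cu-Zn binary passes "binary or ternary" and "must contain Cu".
print(passes_element_filters(["Cu", "Zn"], num_elements=[2, 3], required=["Cu"]))
```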
6. Number of atoms:#
this is useful to avoid very large structures which are costly to compute. It filters out all unit cell structures with more atoms than the number specified.
filter_by_object_size: 60
7. Active-host paradigm:#
allows you to find materials in which at least one element comes from one list (the active elements) and at least one element comes from a second list (the host elements). The example here would give zinc alloys with Pd, Ag, and/or Cu.
filter_by_elements_active_host:
  active: ["Pd", "Ag", "Cu"]
  host: ["Zn"]
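One way to read the active-host rule is sketched below. It assumes that a material must contain at least one active element, at least one host element, and no elements outside the two lists; that last condition is an interpretation of the zinc alloy example and is not confirmed here. The function name `passes_active_host` is hypothetical.

```python
def passes_active_host(elements, active, host):
    """Illustrative active-host check (an assumed interpretation)."""
    elements = set(elements)
    active, host = set(active), set(host)

    has_active = bool(elements & active)  # at least one active element
    has_host = bool(elements & host)  # at least one host element
    # ASSUMPTION: no elements outside the two lists are allowed.
    only_listed = elements <= (active | host)

    return has_active and has_host and only_listed


active = ["Pd", "Ag", "Cu"]
host = ["Zn"]
print(passes_active_host(["Zn", "Pd"], active, host))
print(passes_active_host(["Pd", "Ag"], active, host))
```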
8. Filter by element groups:#
A list of periodic table groups which are of interest. The groups should be specified as a list of any of the following: ["transition metal", "post-transition metal", "metalloid", "rare earth metal", "alkali", "alkaline", "alkali earth", "chalcogen", "halogen"]
filter_by_element_groups: ["transition metal"]
9. Energy above hull:#
a float of the maximum energy above hull you would like to consider in eV/atom
filter_by_bulk_e_above_hull: 0.05
10. Band gap:#
This requires the minimum band gap, the maximum band gap, or both (in eV). If both are specified, materials with a band gap within the range are kept. If only the maximum is specified, materials with a band gap less than that value are kept. If only the minimum is specified, materials with a band gap greater than that value are kept.
filter_by_bulk_band_gap:
  min_gap: 0.1
  max_gap: 0.3
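The three cases for the band gap filter reduce to a range check with optional bounds. The sketch below shows this; the function name `passes_band_gap` is hypothetical, and treating the bounds as inclusive is an assumption.

```python
import math


def passes_band_gap(band_gap, min_gap=None, max_gap=None):
    """Keep a material if its band gap (eV) lies within the optional bounds.

    Illustrative only. ASSUMPTION: bounds are treated as inclusive.
    """
    if min_gap is None and max_gap is None:
        raise ValueError("specify min_gap, max_gap, or both")

    # A missing bound is treated as unbounded on that side.
    lower = -math.inf if min_gap is None else min_gap
    upper = math.inf if max_gap is None else max_gap
    return lower <= band_gap <= upper


print(passes_band_gap(0.2, min_gap=0.1, max_gap=0.3))
print(passes_band_gap(0.5, max_gap=0.3))
```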
11. Pourbaix stability:#
This is currently supported only for Materials Project materials. It selects the materials that are Pourbaix stable under the conditions you specify. Conditions may be specified as a list of specific values, or as a range with an interval, where the step size is the increment used across the range. You may also specify a maximum decomposition energy in eV/atom. If the material is stable at any of the conditions considered, it will not be filtered out. A path to an lmdb file containing Pourbaix info for your materials is required. If the file path doesn't exist, the Pourbaix info will be queried from MP as part of the run. This does take a bit of time. Be careful of having too many workers querying.
filter_by_pourbaix_stability:
  max_decomposition_energy: 0.05
  lmdb_path: "catlas/pourbaix_diagrams/20220222_query.lmdb"
  V_lower: -0.5
  V_upper: 1
  V_step: 0.05
  pH_lower: 0
  pH_upper: 2
OR
filter_by_pourbaix_stability:
  max_decomposition_energy: 0.5
  lmdb_path: "catlas/pourbaix_diagrams/20220222_query.lmdb"
  conditions_list:
    - pH: 1
      V: -1.2
    - pH: 14
      V: 1.2
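The "stable at any of the conditions" rule and the voltage interval can be sketched as follows. Everything here is illustrative: `voltage_grid` and `is_pourbaix_stable` are hypothetical names, including both endpoints of the interval is an assumption, and the `decomposition_energy` callable is a toy stand-in for a real Pourbaix lookup.

```python
def voltage_grid(V_lower, V_upper, V_step):
    """Voltages from V_lower to V_upper in increments of V_step.

    ASSUMPTION: both endpoints are included.
    """
    n_steps = int(round((V_upper - V_lower) / V_step))
    return [round(V_lower + i * V_step, 10) for i in range(n_steps + 1)]


def is_pourbaix_stable(decomposition_energy, conditions, max_decomposition_energy):
    """Keep a material if it is stable at ANY of the listed conditions.

    decomposition_energy: callable (pH, V) -> eV/atom, a hypothetical
    stand-in for querying a Pourbaix diagram.
    """
    return any(
        decomposition_energy(c["pH"], c["V"]) <= max_decomposition_energy
        for c in conditions
    )


def toy_energy(pH, V):
    # Toy model, not real chemistry: energy grows with |V|.
    return 0.1 * abs(V)


conditions = [{"pH": 1, "V": -1.2}, {"pH": 14, "V": 1.2}]
print(len(voltage_grid(-0.5, 1, 0.05)))
print(is_pourbaix_stable(toy_energy, conditions, max_decomposition_energy=0.5))
```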
12. Randomly sample:#
specify the fraction of materials you would like to randomly select. The example here randomly selects 5% of materials.
filter_fraction: 0.05
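Random sampling of a fraction can be sketched with the standard library. The function `sample_fraction` is hypothetical, and both the rounding of the number kept and the optional seed are assumptions rather than documented Catlas behavior.

```python
import random


def sample_fraction(items, fraction, seed=None):
    """Randomly keep roughly `fraction` of the items, without replacement.

    Illustrative only. ASSUMPTION: the number kept is rounded to the
    nearest integer.
    """
    items = list(items)
    rng = random.Random(seed)  # a seed makes the selection reproducible
    n_keep = round(len(items) * fraction)
    return rng.sample(items, n_keep)


bulk_ids = [f"mp-{i}" for i in range(100)]  # made-up ids for illustration
kept = sample_fraction(bulk_ids, 0.05, seed=0)
print(len(kept))
```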
Adsorbate filters#
There is only one option for adsorbate filters which is to filter by their SMILES string. Provide a list of the SMILES strings for adsorbates you would like to consider. An exhaustive list of those in OC20 is included here in our example.
ex.
adsorbate_filters:
  filter_by_smiles: ['*C', '*C*C', '*CCH', '*CCH2', '*CCH2OH', '*CCH3',
    '*CCHO', '*CCHOH', '*CCO', '*CH', '*CH*CH', '*CH*COH', '*CH2', '*CH2*O',
    '*CH2CH2OH', '*CH2CH3', '*CH2OH', '*CH3', '*CH4', '*CHCH2', '*CHCH2OH',
    '*CHCHO', '*CHCHOH', '*CHCO', '*CHO', '*CHO*CHO', '*CHOCH2OH', '*CHOCHOH',
    '*CHOH', '*CHOHCH2', '*CHOHCH2OH', '*CHOHCH3', '*CHOHCHOH', '*CN',
    '*COCH2O', '*COCH2OH', '*COCH3', '*COCHO', '*COH', '*COHCH2OH', '*COHCH3',
    '*COHCHO', '*COHCHOH', '*COHCOH', '*H', '*N', '*N*NH', '*N*NO', '*N2',
    '*NH', '*NH2', '*NH2N(CH3)2', '*NH3', '*NHNH', '*NO', '*NO2', '*NO2NO2',
    '*NO3', '*NONH', '*O', '*OCH2CH3', '*OCH2CHOH', '*OCH3', '*OCHCH3', '*OH',
    '*OH2', '*OHCH2CH3', '*OHCH3', '*OHNH2', '*OHNNCH3', '*ONH', '*ONN(CH3)2',
    '*ONNH2', '*ONOH', 'CH2*CO', '*CO', '*CH2*CH2', '*COHCH2', '*NHN2',
    '*NNCH3', '*OCHCH2', '*ONNO2']
Slab Filters#
Downselect slabs before performing adsorbate placement and inference.
Object size: filters out any slabs with more atoms than the number specified here. This is useful to avoid running many calculations for very large surfaces.
filter_by_object_size: 100
Maximum Miller index: will enumerate all slabs up to the provided index. If not provided, this defaults to 2.
filter_by_max_miller_index: 1
Broken bond model / surface density: there are two ways of using the broken bond model to select which slabs to run inference on:
a. Selecting the best surfaces per bulk. Here you can select by two criteria:
i. top_k: the number of surfaces per bulk you would like to consider (e.g. if top_k = 10, then the 10 surfaces with the lowest surface energy proxy values will be selected).
ii. top_proportion: the proportion of surfaces to choose (e.g. if top_proportion = 0.5 and 50 surfaces are enumerated, then 25 will be filtered out).
filter_by_broken_bonds:
  top_k: 10
OR
filter_by_surface_density:
  top_proportion: 0.15
b. Choosing the best shift(s) for a given Miller index. You can specify a difference threshold so that surfaces with similar broken bond densities will still be considered (i.e. if surface_now - best_surface <= difference_threshold * best_surface, the surface will also be considered). If unspecified, this defaults to 0.1.
filter_best_shift_by_broken_bonds:
  difference_threshold: 0.2
OR
filter_best_shift_by_surface_density:
  difference_threshold: 0.2
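The two selection modes can be sketched on a plain list of surface energy proxy values (lower is better). The functions `select_top_surfaces` and `select_best_shifts` are hypothetical names; rounding the kept count up for `top_proportion` is an assumption, while the threshold test follows the inequality stated above.

```python
import math


def select_top_surfaces(proxy_values, top_k=None, top_proportion=None):
    """Return indices of the surfaces with the lowest proxy values.

    Illustrative only. ASSUMPTION: with top_proportion, the number of
    surfaces kept is rounded up.
    """
    if top_k is None and top_proportion is None:
        raise ValueError("specify top_k or top_proportion")
    order = sorted(range(len(proxy_values)), key=lambda i: proxy_values[i])
    if top_k is not None:
        n_keep = top_k
    else:
        n_keep = math.ceil(len(proxy_values) * top_proportion)
    return order[:n_keep]


def select_best_shifts(proxy_values, difference_threshold=0.1):
    """Return indices of shifts close enough to the best (lowest) value.

    Uses: surface_now - best_surface <= difference_threshold * best_surface
    """
    best = min(proxy_values)
    return [
        i
        for i, value in enumerate(proxy_values)
        if value - best <= difference_threshold * best
    ]


values = [float(v) for v in range(1, 51)]  # 50 made-up proxy values
print(len(select_top_surfaces(values, top_proportion=0.5)))
print(select_best_shifts([1.0, 1.15, 1.3], difference_threshold=0.2))
```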
Optionally, you may specify a neighbor factor, which the models use to determine nearest neighbors. This defaults to 1.1.
neighbor_factor: 1.1
Adslab Prediction Steps#
Runs may be set up so that inference is performed sequentially. The idea here is that you may want to use a cheap, less accurate model to downselect first, and then perform more expensive, accurate inference. If you would not like to do this, simply use one step instead.
ex.
adslab_prediction_steps:
  - step_type: inference
    gpu: true
    batch_size: 8
    label: 'dE_gemnet_is2re_finetuned'
    checkpoint_path: 'ocp_checkpoints/private_checkpoints/gemnet-is2re-finetuned-11-01.pt'
OR
adslab_prediction_steps:
  - step_type: inference
    gpu: true
    batch_size: 8
    label: 'dE_gemnet_is2re_finetuned'
    checkpoint_path: 'ocp_checkpoints/private_checkpoints/gemnet-is2re-finetuned-11-01.pt'
  - step_type: filter_by_adsorption_energy_target
    adsorbate_smiles: '*CO'
    target_value: -0.6
    range_value: 0.2
    filter_column: 'min_dE_gemnet_is2re_finetuned'
  - step_type: inference
    gpu: true
    batch_size: 8
    label: 'dE_gemnet_oc_large_s2ef_all_md'
    checkpoint_path: 'ocp_checkpoints/public_checkpoints/gemnet_oc_large_s2ef_all_md.pt'
OR
adslab_prediction_steps:
  - step_type: inference
    gpu: true
    batch_size: 8
    label: 'dE_gemnet_is2re_finetuned'
    checkpoint_path: 'ocp_checkpoints/private_checkpoints/gemnet-is2re-finetuned-11-01.pt'
  - step_type: filter_by_adsorption_energy
    adsorbate_smiles: '*CO'
    min_value: -0.8
    max_value: -0.4
    filter_column: 'min_dE_gemnet_is2re_finetuned'
  - step_type: inference
    gpu: true
    batch_size: 8
    label: 'dE_gemnet_oc_large_s2ef_all_md'
    checkpoint_path: 'ocp_checkpoints/public_checkpoints/gemnet_oc_large_s2ef_all_md.pt'
    number_steps: 98
step_type: this should be inference, filter_by_adsorption_energy, or filter_by_adsorption_energy_target. As their names imply, inference is an inference step, filter_by_adsorption_energy_target filters surfaces that are within a range near your target value, and filter_by_adsorption_energy filters surfaces by whether a previous inference step predicts within a range of values you specify.
gpu: (inference step only) A boolean for whether an inference step should use GPUs.
batch_size: (inference step only) The number of adslab configs which will be considered in an inference batch. If you have very large objects you may have to decrease this so they fit in memory.
label: (inference step only) What the inference step will be named in the output dataframe. Should be the name of the model or something useful to you.
checkpoint_path: (inference step only) File path to the desired pretrained model checkpoint file.
number_steps: (relaxation inference step only) The number of relaxation steps to take. If unspecified, this defaults to 200.
adsorbate_smiles: (filter step only) SMILES string of the adsorbate to filter on.
filter_column: (filter step only) The dataframe column to which the filter should be applied. For most use cases, this should be “min_” + the previous inference step’s label.
min_value: (filter_by_adsorption_energy only) The minimum value in the desired range. If unspecified, this defaults to -infinity.
max_value: (filter_by_adsorption_energy only) The maximum value in the desired range. If unspecified, this defaults to +infinity.
target_value: (filter_by_adsorption_energy_target only) The target value.
range_value: (filter_by_adsorption_energy_target only) The range to consider (i.e. target_value - range_value -> target_value + range_value will be used). If a range value is not specified, then 0.5 is used by default.
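The two filter step types differ only in how the accepted energy window is built. The sketch below derives that window from a step dictionary using the defaults listed above; the function names `energy_window` and `passes_energy_filter` are hypothetical, and treating the window bounds as inclusive is an assumption.

```python
import math


def energy_window(step):
    """Return (lower, upper) for a filter step given as a dict.

    Illustrative only; not the Catlas implementation.
    """
    step_type = step["step_type"]
    if step_type == "filter_by_adsorption_energy_target":
        half_width = step.get("range_value", 0.5)  # documented default
        target = step["target_value"]
        return (target - half_width, target + half_width)
    if step_type == "filter_by_adsorption_energy":
        # Missing bounds default to -infinity / +infinity.
        return (step.get("min_value", -math.inf), step.get("max_value", math.inf))
    raise ValueError(f"not a filter step: {step_type}")


def passes_energy_filter(energy, step):
    """ASSUMPTION: window bounds are inclusive."""
    lower, upper = energy_window(step)
    return lower <= energy <= upper


target_step = {
    "step_type": "filter_by_adsorption_energy_target",
    "adsorbate_smiles": "*CO",
    "target_value": -0.6,
    "range_value": 0.2,
}
print(passes_energy_filter(-0.7, target_step))
print(passes_energy_filter(-0.3, target_step))
```

With `target_value: -0.6` and `range_value: 0.2`, the window is roughly -0.8 to -0.4, which matches the `min_value` and `max_value` used in the `filter_by_adsorption_energy` form of the example.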