Converting ASE objects to PyG Data objects

This notebook provides an overview of converting ASE Atoms objects to PyTorch Geometric Data objects. To better understand the raw data contained within OC20, check out the following tutorial first: https://github.com/Open-Catalyst-Project/ocp/blob/master/docs/source/tutorials/data_visualization.ipynb

from ocpmodels.preprocessing import AtomsToGraphs
import ase.io
from ase.build import bulk
from ase.build import fcc100, add_adsorbate, molecule
from ase.constraints import FixAtoms
from ase.calculators.emt import EMT
from ase.optimize import BFGS

Generate toy dataset: Relaxation of CO on Cu

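# Build a 2x2x3 Cu(100) slab and add a CO adsorbate (height argument: 3 Å).
adslab = fcc100("Cu", size=(2, 2, 3))
ads = molecule("CO")
add_adsorbate(adslab, ads, 3, offset=(1, 1))
# Fix the bottom layer: fcc100 tags layers 1, 2, 3 from the top down.
cons = FixAtoms(indices=[atom.index for atom in adslab if (atom.tag == 3)])
adslab.set_constraint(cons)
adslab.center(vacuum=13.0, axis=2)
adslab.set_pbc(True)
adslab.calc = EMT()  # replaces the deprecated adslab.set_calculator(EMT())
dyn = BFGS(adslab, trajectory="CuCO_adslab.traj", logfile=None)
# fmax=0 cannot be satisfied in practice, so the optimizer runs all 1000 steps.
dyn.run(fmax=0, steps=1000)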
False
raw_data = ase.io.read("CuCO_adslab.traj", ":")
print(len(raw_data))
1001

Convert Atoms object to Data object

The AtomsToGraphs class takes several arguments that control how Data objects are created:

  • max_neigh (int): Maximum number of neighbors a given atom is allowed to have; when more are found, the furthest are discarded

  • radius (float): Cutoff radius (in Å) within which neighbors of each atom are searched

  • r_energy (bool): Write energy to Data object

  • r_forces (bool): Write forces to Data object

  • r_distances (bool): Write distances between neighbors to Data object

  • r_edges (bool): Write neighbor edge indices to Data object

  • r_fixed (bool): Write indices of fixed atoms to Data object
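
To make the interplay between radius and max_neigh concrete, here is a minimal pure-Python sketch. It is not the AtomsToGraphs implementation: the helper nearest_neighbors and the five-atom chain are hypothetical, periodic images are ignored, and whether the real cutoff is inclusive is an assumption.

```python
import math


def nearest_neighbors(positions, center, radius, max_neigh):
    """Indices of atoms within `radius` of atom `center`, closest first,
    truncated to at most `max_neigh` entries (hypothetical helper)."""
    candidates = []
    for j, pos in enumerate(positions):
        if j == center:
            continue  # an atom is not its own neighbor
        dist = math.dist(positions[center], pos)
        if dist <= radius:  # assumption: inclusive cutoff
            candidates.append((dist, j))
    candidates.sort()  # closest first, so truncation drops the furthest
    return [j for _, j in candidates[:max_neigh]]


# Hypothetical chain of five atoms spaced 1 Å apart along x.
chain = [(float(i), 0.0, 0.0) for i in range(5)]

within_cutoff = nearest_neighbors(chain, center=0, radius=3.5, max_neigh=50)
capped = nearest_neighbors(chain, center=0, radius=3.5, max_neigh=2)
print(within_cutoff)  # [1, 2, 3]
print(capped)         # [1, 2]
```

With a generous max_neigh the cutoff radius alone decides the neighbor list; lowering max_neigh below the number of atoms inside the cutoff trims the list to the closest ones.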

a2g = AtomsToGraphs(
    max_neigh=50,
    radius=6,
    r_energy=True,
    r_forces=True,
    r_distances=False,
    r_edges=True,
    r_fixed=True,
)
data_objects = a2g.convert_all(raw_data, disable_tqdm=True)
data = data_objects[0]
data
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 636], cell_offsets=[636, 3], y=3.989314410668539, force=[14, 3], fixed=[14])
data.atomic_numbers
tensor([29., 29., 29., 29., 29., 29., 29., 29., 29., 29., 29., 29.,  8.,  6.])
data.cell
tensor([[[ 5.1053,  0.0000,  0.0000],
         [ 0.0000,  5.1053,  0.0000],
         [ 0.0000,  0.0000, 32.6100]]])
data.edge_index  # row 0: neighbor indices, row 1: source (center atom) indices
tensor([[ 1,  2,  2,  ...,  4,  6,  3],
        [ 0,  0,  0,  ..., 13, 13, 13]])
from torch_geometric.utils import degree
# Degree corresponds to the number of neighbors a given node has. Note that no node
# has more than max_neigh neighbors.

degree(data.edge_index[1])
tensor([45., 45., 45., 46., 49., 49., 49., 49., 50., 49., 49., 50., 26., 35.])
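
As a sanity check on what the degree call computes, the count can be reproduced by hand. The sketch below uses plain Python lists and a hypothetical three-node edge list rather than the tensors above; it assumes the same layout, with the second row holding the source node of each edge.

```python
from collections import Counter

# Hypothetical edge list for a 3-node graph, laid out like data.edge_index:
# row 0 holds neighbor indices, row 1 holds source (center) node indices.
edge_index = [
    [1, 2, 0, 2, 0],  # neighbors
    [0, 0, 1, 1, 2],  # sources
]
num_nodes = 3

# A node's degree is the number of times it appears as a source.
counts = Counter(edge_index[1])
node_degree = [counts.get(i, 0) for i in range(num_nodes)]
print(node_degree)  # [2, 2, 1]

# Every degree stays at or below the neighbor cap (here a hypothetical cap of 2).
max_neigh = 2
print(all(d <= max_neigh for d in node_degree))  # True
```

The same bookkeeping holds for the real graph: the 14 degrees printed above sum to 636, which matches the second dimension of edge_index.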
data.fixed
tensor([1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
data.force
tensor([[ 9.9356e-16,  4.5465e-15,  1.1354e-01],
        [ 2.6749e-15,  3.7696e-15,  1.1344e-01],
        [ 8.4481e-16,  2.7062e-16,  1.1344e-01],
        [-6.6623e-18,  6.6196e-17,  1.1294e-01],
        [-8.5221e-03, -8.5221e-03, -1.1496e-02],
        [ 8.5221e-03, -8.5221e-03, -1.1496e-02],
        [-8.5221e-03,  8.5221e-03, -1.1496e-02],
        [ 8.5221e-03,  8.5221e-03, -1.1496e-02],
        [ 1.9082e-17,  9.6277e-16, -1.0431e-01],
        [-2.0583e-15, -4.3021e-16, -6.6610e-02],
        [-5.5511e-17, -2.3592e-15, -6.6610e-02],
        [-2.9409e-17, -4.3038e-15, -3.3250e-01],
        [ 3.3204e-19,  6.7763e-21, -3.4247e-01],
        [-4.5103e-17, -5.2042e-17,  5.0512e-01]])
data.pos
tensor([[ 0.0000,  0.0000, 13.0000],
        [ 2.5527,  0.0000, 13.0000],
        [ 0.0000,  2.5527, 13.0000],
        [ 2.5527,  2.5527, 13.0000],
        [ 1.2763,  1.2763, 14.8050],
        [ 3.8290,  1.2763, 14.8050],
        [ 1.2763,  3.8290, 14.8050],
        [ 3.8290,  3.8290, 14.8050],
        [ 0.0000,  0.0000, 16.6100],
        [ 2.5527,  0.0000, 16.6100],
        [ 0.0000,  2.5527, 16.6100],
        [ 2.5527,  2.5527, 16.6100],
        [ 2.5527,  2.5527, 19.6100],
        [ 2.5527,  2.5527, 18.4597]])
data.y
3.989314410668539

Adding additional info to your Data objects

In addition to the above information, the OCP repo requires several other pieces of information for your data to work with the provided trainers:

  • sid (int): A unique identifier for a particular system. It does not affect model performance; it is used when saving predictions.

  • fid (int) (S2EF only): If training for the S2EF task, your data must also contain a unique frame identifier for atoms objects coming from the same system.

  • tags (tensor): Tag information - 0 for adsorbate, 1 for surface, 2 for subsurface. Optional, can be used for training.

Other information may be added here as well if you choose to incorporate it in your models or frameworks.

data_objects = []
for idx, system in enumerate(raw_data):
    data = a2g.convert(system)
    data.fid = idx
    data.sid = 0  # All data points come from the same system; arbitrarily define this as 0
    data_objects.append(data)
data = data_objects[100]
data
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 635], cell_offsets=[635, 3], y=3.968355893395739, force=[14, 3], fixed=[14], fid=100, sid=0)
data.sid
0
data.fid
100
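
When a dataset contains more than one system, each system needs its own sid while fid restarts for every trajectory. The following is a minimal sketch of that bookkeeping only: the system names and frame labels are hypothetical placeholders, and SimpleNamespace stands in for the Data object that a2g.convert would return in a real workflow.

```python
from types import SimpleNamespace

# Hypothetical stand-ins: each system maps to the frames of its trajectory.
systems = {
    "CuCO": ["frame_a", "frame_b", "frame_c"],
    "CuOH": ["frame_a", "frame_b"],
}

records = []
for sid, (name, frames) in enumerate(systems.items()):
    for fid, frame in enumerate(frames):
        # In the real workflow this would be: data = a2g.convert(frame)
        data = SimpleNamespace(frame=frame)
        data.sid = sid  # one identifier per system
        data.fid = fid  # frame index within that system
        records.append(data)

ids = [(d.sid, d.fid) for d in records]
print(ids)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
```

Each (sid, fid) pair is unique across the dataset, so a saved prediction can be traced back to a specific frame of a specific system.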

Resources:

  • https://github.com/Open-Catalyst-Project/ocp/blob/6604e7130ea41fabff93c229af2486433093e3b4/ocpmodels/preprocessing/atoms_to_graphs.py

  • https://github.com/Open-Catalyst-Project/ocp/blob/master/scripts/preprocess_ef.py