Converting ASE objects to PyG Data objects
This notebook provides an overview of converting ASE Atoms objects to PyTorch Geometric Data objects. To better understand the raw data contained within OC20, check out the following tutorial first: https://github.com/Open-Catalyst-Project/ocp/blob/master/docs/source/tutorials/data_visualization.ipynb
from ocpmodels.preprocessing import AtomsToGraphs
import ase.io
from ase.build import bulk
from ase.build import fcc100, add_adsorbate, molecule
from ase.constraints import FixAtoms
from ase.calculators.emt import EMT
from ase.optimize import BFGS
Generate toy dataset: Relaxation of CO on Cu
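# Build a 2x2x3 Cu(100) slab and place CO 3 Å above the surface.
adslab = fcc100("Cu", size=(2, 2, 3))
ads = molecule("CO")
add_adsorbate(adslab, ads, 3, offset=(1, 1))
# Fix the bottom slab layer (fcc100 tags the layers 1..3 from the top down).
cons = FixAtoms(indices=[atom.index for atom in adslab if (atom.tag == 3)])
adslab.set_constraint(cons)
adslab.center(vacuum=13.0, axis=2)
adslab.set_pbc(True)
adslab.calc = EMT()  # replaces the deprecated adslab.set_calculator(EMT())
dyn = BFGS(adslab, trajectory="CuCO_adslab.traj", logfile=None)
# fmax=0 is never reached, so the optimizer runs all 1000 steps and returns False.
dyn.run(fmax=0, steps=1000)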
False
raw_data = ase.io.read("CuCO_adslab.traj", ":")
print(len(raw_data))
1001
Convert Atoms object to Data object
The AtomsToGraphs class takes several arguments that control how Data objects are created:
max_neigh (int): Maximum number of neighbors a given atom is allowed to have, discarding the furthest
radius (float): Cutoff radius within which neighbors of each atom are searched
r_energy (bool): Write energy to Data object
r_forces (bool): Write forces to Data object
r_distances (bool): Write distances between neighbors to Data object
r_edges (bool): Write neighbor edge indices to Data object
r_fixed (bool): Write indices of fixed atoms to Data object
a2g = AtomsToGraphs(
    max_neigh=50,
    radius=6,
    r_energy=True,
    r_forces=True,
    r_distances=False,
    r_edges=True,
    r_fixed=True,
)
data_objects = a2g.convert_all(raw_data, disable_tqdm=True)
/home/runner/work/ml_catalysis_tutorials/ml_catalysis_tutorials/ocp/ocpmodels/preprocessing/atoms_to_graphs.py:139: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /opt/conda/conda-bld/pytorch_1639180507909/work/torch/csrc/utils/tensor_new.cpp:201.)
cell = torch.Tensor(atoms.get_cell()).view(1, 3, 3)
data = data_objects[0]
data
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 636], cell_offsets=[636, 3], y=3.989314410668539, force=[14, 3], fixed=[14])
data.atomic_numbers
tensor([29., 29., 29., 29., 29., 29., 29., 29., 29., 29., 29., 29., 8., 6.])
data.cell
tensor([[[ 5.1053, 0.0000, 0.0000],
[ 0.0000, 5.1053, 0.0000],
[ 0.0000, 0.0000, 32.6100]]])
data.edge_index  # row 0: neighbor idx, row 1: source idx
tensor([[ 1, 2, 2, ..., 4, 6, 3],
[ 0, 0, 0, ..., 13, 13, 13]])
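Because r_distances=False in the conversion above, edge lengths are not stored, but they can be recovered from pos, cell, edge_index and cell_offsets. The following is a minimal pure-Python sketch on a toy two-atom cell. The helper name edge_distances is hypothetical, and the sketch assumes that each row of cell_offsets counts whole lattice vectors added to the neighbor atom (row 0 of edge_index) before the distance to the atom in row 1 is measured. Check that sign convention against the AtomsToGraphs source before relying on it.

```python
import math


def edge_distances(pos, cell, edge_index, cell_offsets):
    """Length of every edge, shifting the neighbor by its periodic offset.

    Hypothetical helper for illustration; not part of ocpmodels.
    """
    neighbors, centers = edge_index
    distances = []
    for j, i, offset in zip(neighbors, centers, cell_offsets):
        # Cartesian shift = integer offset times the lattice vectors (rows of cell).
        shift = [sum(offset[k] * cell[k][d] for k in range(3)) for d in range(3)]
        image = [pos[j][d] + shift[d] for d in range(3)]
        distances.append(math.dist(image, pos[i]))
    return distances


# Toy system: two atoms near opposite faces of a cubic cell of side 4.
pos = [(0.5, 0.0, 0.0), (3.5, 0.0, 0.0)]
cell = [(4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
edge_index = [[1, 1], [0, 0]]  # neighbor 1 -> atom 0, listed twice
cell_offsets = [(-1, 0, 0), (0, 0, 0)]  # periodic image vs. same-cell copy

dists = edge_distances(pos, cell, edge_index, cell_offsets)
print(dists)
```

With a real Data object, plain lists can be obtained through the tensors' tolist() method, for example data.pos.tolist() and data.cell[0].tolist().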
from torch_geometric.utils import degree
# Degree is the number of neighbors a given node has. Note that no node has more than
# max_neigh neighbors.
degree(data.edge_index[1])
tensor([45., 45., 45., 46., 49., 49., 49., 49., 50., 49., 49., 50., 26., 35.])
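The interplay of radius and max_neigh seen in the degree output can be illustrated without any library. The following is a rough sketch of the idea only, not the AtomsToGraphs implementation: it assumes an orthorhombic, fully periodic cell, and neighbors_within_cutoff is a hypothetical helper. It collects every periodic image inside the cutoff, sorts by distance, and keeps the nearest max_neigh.

```python
import itertools
import math


def neighbors_within_cutoff(positions, cell_lengths, radius, max_neigh):
    """For each atom, return up to max_neigh (distance, j, shift) tuples.

    Hypothetical helper; assumes an orthorhombic cell periodic in x, y and z.
    """
    # Number of periodic images along each axis needed to cover the cutoff.
    reps = [math.ceil(radius / length) for length in cell_lengths]
    shifts = list(itertools.product(*(range(-r, r + 1) for r in reps)))
    result = []
    for i, pos_i in enumerate(positions):
        candidates = []
        for j, pos_j in enumerate(positions):
            for shift in shifts:
                if i == j and shift == (0, 0, 0):
                    continue  # an atom is not its own neighbor
                image = [p + s * length
                         for p, s, length in zip(pos_j, shift, cell_lengths)]
                dist = math.dist(pos_i, image)
                if dist <= radius:
                    candidates.append((dist, j, shift))
        candidates.sort()  # nearest first
        result.append(candidates[:max_neigh])  # discard the furthest
    return result


# One atom in a cubic cell of side 2.0: its six nearest images lie at distance 2.0.
atoms = [(0.0, 0.0, 0.0)]
box = (2.0, 2.0, 2.0)
single = neighbors_within_cutoff(atoms, box, radius=2.5, max_neigh=50)
capped = neighbors_within_cutoff(atoms, box, radius=2.5, max_neigh=4)
print(len(single[0]), len(capped[0]))
```

In the first call the cutoff alone limits the neighbor count; in the second, max_neigh truncates the list, which is the same effect that caps some node degrees at 50 in the output above.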
data.fixed
tensor([1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
data.force
tensor([[ 9.9356e-16, 4.5465e-15, 1.1354e-01],
[ 2.6749e-15, 3.7696e-15, 1.1344e-01],
[ 8.4481e-16, 2.7062e-16, 1.1344e-01],
[-6.6623e-18, 6.6196e-17, 1.1294e-01],
[-8.5221e-03, -8.5221e-03, -1.1496e-02],
[ 8.5221e-03, -8.5221e-03, -1.1496e-02],
[-8.5221e-03, 8.5221e-03, -1.1496e-02],
[ 8.5221e-03, 8.5221e-03, -1.1496e-02],
[ 1.9082e-17, 9.6277e-16, -1.0431e-01],
[-2.0583e-15, -4.3021e-16, -6.6610e-02],
[-5.5511e-17, -2.3592e-15, -6.6610e-02],
[-2.9409e-17, -4.3038e-15, -3.3250e-01],
[ 3.3204e-19, 6.7763e-21, -3.4247e-01],
[-4.5103e-17, -5.2042e-17, 5.0512e-01]])
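The forces printed above are nonzero even for the atoms flagged in data.fixed, so any convergence check on the free atoms has to apply the mask explicitly. A minimal pure-Python sketch follows, using three rows rounded from the output above; max_free_force is a hypothetical helper name, not an OCP function.

```python
import math


def max_free_force(forces, fixed):
    """Largest force norm among atoms whose fixed flag is 0."""
    norms = [
        math.sqrt(fx * fx + fy * fy + fz * fz)
        for (fx, fy, fz), flag in zip(forces, fixed)
        if flag == 0
    ]
    return max(norms) if norms else 0.0


# Rounded rows for a fixed bottom-layer Cu atom, the O atom and the C atom.
forces = [(0.0, 0.0, 0.11354), (0.0, 0.0, -0.34247), (0.0, 0.0, 0.50512)]
fixed = [1, 0, 0]

fmax = max_free_force(forces, fixed)
print(fmax)
```

With tensors, the same mask could be expressed as a boolean index on data.force; the list version is used here only to keep the sketch dependency-free.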
data.pos
tensor([[ 0.0000, 0.0000, 13.0000],
[ 2.5527, 0.0000, 13.0000],
[ 0.0000, 2.5527, 13.0000],
[ 2.5527, 2.5527, 13.0000],
[ 1.2763, 1.2763, 14.8050],
[ 3.8290, 1.2763, 14.8050],
[ 1.2763, 3.8290, 14.8050],
[ 3.8290, 3.8290, 14.8050],
[ 0.0000, 0.0000, 16.6100],
[ 2.5527, 0.0000, 16.6100],
[ 0.0000, 2.5527, 16.6100],
[ 2.5527, 2.5527, 16.6100],
[ 2.5527, 2.5527, 19.6100],
[ 2.5527, 2.5527, 18.4597]])
data.y
3.989314410668539
Adding additional info to your Data objects
In addition to the above information, the OCP repo requires several other pieces of information for your data to work with the provided trainers:
sid (int): A unique identifier for a particular system. It does not affect model performance; it is used when saving predictions.
fid (int) (S2EF only): If training for the S2EF task, each Data object must also carry a frame identifier that is unique among the Atoms objects coming from the same system.
tags (tensor): Tag information - 0 for subsurface, 1 for surface, 2 for adsorbate. Optional, can be used for training.
Other information may be added here as well if you choose to incorporate it in your models/frameworks.
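The slab built earlier in this notebook carries ASE builder tags rather than 0/1/2 labels: fcc100 numbers the layers from 1 at the top, and the CO atoms from molecule keep the default tag 0. One possible remapping to OC20-style labels (0 subsurface, 1 surface, 2 adsorbate) is sketched below. remap_layer_tags is a hypothetical helper, and treating only the top layer as "surface" is an assumption made for this toy system.

```python
def remap_layer_tags(ase_tags, surface_layer=1):
    """Map ASE builder tags (0 = adsorbate, 1..n = layers from the top)
    to 0 = subsurface, 1 = surface, 2 = adsorbate.

    Hypothetical helper for illustration; not part of ocpmodels.
    """
    remapped = []
    for tag in ase_tags:
        if tag == 0:
            remapped.append(2)  # adsorbate atoms
        elif tag == surface_layer:
            remapped.append(1)  # top slab layer
        else:
            remapped.append(0)  # everything below the top layer
    return remapped


# Tags in the atom order of the toy system: bottom layer first, CO last.
toy_tags = [3] * 4 + [2] * 4 + [1] * 4 + [0, 0]
new_tags = remap_layer_tags(toy_tags)
print(new_tags)
```

The result could then be written back to each Data object's tags field as a tensor before training.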
data_objects = []
for idx, system in enumerate(raw_data):
    data = a2g.convert(system)
    data.fid = idx
    data.sid = 0  # All frames come from the same system, so sid is arbitrarily set to 0
    data_objects.append(data)
data = data_objects[100]
data
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 635], cell_offsets=[635, 3], y=3.968355893395739, force=[14, 3], fixed=[14], fid=100, sid=0)
data.sid
0
data.fid
100
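When frames from several systems are combined, it can help to confirm that every (sid, fid) pair is unique before writing the dataset. The sketch below is one way to do that; duplicate_ids is a hypothetical helper, and SimpleNamespace objects stand in for Data objects so the example runs without PyTorch Geometric.

```python
from collections import Counter
from types import SimpleNamespace


def duplicate_ids(data_list):
    """Return the (sid, fid) pairs that occur more than once."""
    counts = Counter((d.sid, d.fid) for d in data_list)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Stand-ins for Data objects: two systems with three frames each.
frames = [SimpleNamespace(sid=s, fid=f) for s in (0, 1) for f in range(3)]
clean = duplicate_ids(frames)

# Introduce a clash on purpose to show what the check reports.
frames.append(SimpleNamespace(sid=1, fid=2))
clashes = duplicate_ids(frames)
print(clean, clashes)
```

Any object exposing sid and fid attributes works with this check, so the data_objects list built above could be passed in directly.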
Resources:
https://github.com/Open-Catalyst-Project/ocp/blob/6604e7130ea41fabff93c229af2486433093e3b4/ocpmodels/preprocessing/atoms_to_graphs.py
https://github.com/Open-Catalyst-Project/ocp/blob/master/scripts/preprocess_ef.py