Interacting with the OC20 datasets

The OC20 datasets are stored as LMDB files. Here we show how to interact with the datasets directly in order to better understand the data. We use two separate classes to read in the appropriate datasets:

S2EF - We use the TrajectoryLmdbDataset class to read in a directory of LMDB files containing the dataset.

IS2RE/IS2RS - We use the SinglePointLmdbDataset class to read in a single LMDB file containing the dataset.

First, let’s download the tutorial dataset [~1min].

%%bash
mkdir -p data
cd data
wget -q http://dl.fbaipublicfiles.com/opencatalystproject/data/tutorial_data.tar.gz -O tutorial_data.tar.gz
tar -xzvf tutorial_data.tar.gz
rm tutorial_data.tar.gz

./
./is2re/
./is2re/train_100/
./is2re/train_100/data.lmdb
./is2re/train_100/data.lmdb-lock
./is2re/val_20/
./is2re/val_20/data.lmdb
./is2re/val_20/data.lmdb-lock
./s2ef/
./s2ef/train_100/
./s2ef/train_100/data.lmdb
./s2ef/train_100/data.lmdb-lock
./s2ef/val_20/
./s2ef/val_20/data.lmdb
./s2ef/val_20/data.lmdb-lock
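To confirm that the archive unpacked as listed above, the LMDB files can be enumerated from Python. This is a minimal sketch using only the standard library. The helper name `find_lmdb_files` is ours, not part of `ocpmodels`, and the usage lines assume the archive was extracted into `data/` as in the download cell.

```python
from pathlib import Path


def find_lmdb_files(root):
    """Return the sorted POSIX-style relative paths of all *.lmdb files under root."""
    root = Path(root)
    # "*.lmdb" matches data.lmdb but not the data.lmdb-lock companion files
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.lmdb"))


# Hypothetical usage: only runs if the tutorial data has been downloaded into ./data
if Path("data").is_dir():
    for rel_path in find_lmdb_files("data"):
        print(rel_path)
```

With the tutorial archive, this should report one `data.lmdb` per split, four in total: a train and a validation file for each of S2EF and IS2RE.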

Now, let’s interact with the LMDB files via the TrajectoryLmdbDataset interface!

from ocpmodels.datasets import SinglePointLmdbDataset, TrajectoryLmdbDataset

# TrajectoryLmdbDataset is our custom Dataset class for reading the LMDBs as Data objects.
# For S2EF, pass the path to the folder that contains the LMDB files.
dataset = TrajectoryLmdbDataset({"src": "data/s2ef/train_100/"})

print("Size of the dataset created:", len(dataset))
print(dataset[0])

Size of the dataset created: 100
Data(edge_index=[2, 2964], y=6.282500615000004, pos=[86, 3], cell=[1, 3, 3], atomic_numbers=[86], natoms=86, cell_offsets=[2964, 3], force=[86, 3], fixed=[86], tags=[86], sid=[1], fid=[1], total_frames=74, id='0_0')

/home/runner/micromamba-root/envs/buildenv/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3433: UserWarning: TrajectoryLmdbDataset is deprecated and will be removed in the future.Please use 'LmdbDataset' instead.
  exec(code_obj, self.user_global_ns, self.user_ns)
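The printed shapes already say a good deal about the structure. In the PyTorch Geometric convention, `edge_index` has shape `[2, num_edges]`, with one column per edge, so this example has 2964 edges over 86 atoms. The sketch below does that arithmetic; the helper name `mean_edges_per_atom` is ours and the numbers are copied from the printed `Data` object above.

```python
def mean_edges_per_atom(num_edges, num_atoms):
    """Average number of graph edges per atom."""
    if num_atoms <= 0:
        raise ValueError("num_atoms must be positive")
    return num_edges / num_atoms


# Shapes taken from the printed example: edge_index=[2, 2964], natoms=86
print(f"{mean_edges_per_atom(2964, 86):.1f} edges per atom on average")
```

The same ratio could be computed for any entry from `data.edge_index.shape[1]` and `data.natoms`, assuming both attributes are present as in the example.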

data = dataset[0]
data

Data(edge_index=[2, 2964], y=6.282500615000004, pos=[86, 3], cell=[1, 3, 3], atomic_numbers=[86], natoms=86, cell_offsets=[2964, 3], force=[86, 3], fixed=[86], tags=[86], sid=[1], fid=[1], total_frames=74, id='0_0')
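To see roughly what the dataset class does under the hood, an entry can also be read straight from an LMDB file. The sketch below is an illustration under stated assumptions, not the library's implementation. It assumes each entry is a pickled Data object stored under its integer index encoded as an ASCII string (for example `b"0"`), that the `lmdb` Python package is installed, and that `torch_geometric` is importable so the pickle can be decoded. The helper names `entry_key` and `read_entry` are ours.

```python
import pickle


def entry_key(idx):
    """Key under which entry idx is assumed to be stored, e.g. 0 -> b"0"."""
    return f"{idx}".encode("ascii")


def read_entry(lmdb_path, idx):
    """Read and unpickle one entry from a single LMDB file (assumed layout)."""
    import lmdb  # third-party package, imported lazily so entry_key works without it

    # subdir=False because each data.lmdb here is a single file, not a directory
    env = lmdb.open(str(lmdb_path), subdir=False, readonly=True, lock=False)
    try:
        with env.begin() as txn:
            raw = txn.get(entry_key(idx))
        if raw is None:
            raise KeyError(f"no entry stored under key {entry_key(idx)!r}")
        return pickle.loads(raw)
    finally:
        env.close()


# Hypothetical usage, assuming the tutorial data was extracted into ./data:
# data = read_entry("data/s2ef/train_100/data.lmdb", 0)
```

The dataset classes may apply further conversion steps that this sketch omits, so they remain the recommended route for real work. As with any pickle, only load files from a source you trust.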

import torch

energies = torch.tensor([data.y for data in dataset])
energies

tensor([ 6.2825e+00,  4.1290e+00,  3.1451e+00,  3.0260e+00,  1.7921e+00,
         1.6451e+00,  1.2257e+00,  1.2161e+00,  1.0712e+00,  7.4727e-01,
         5.9575e-01,  5.7016e-01,  4.2819e-01,  3.1616e-01,  2.5283e-01,
         2.2425e-01,  2.2346e-01,  2.0530e-01,  1.6090e-01,  1.1807e-01,
         1.1691e-01,  9.1254e-02,  7.4997e-02,  6.3274e-02,  5.2782e-02,
         4.8892e-02,  3.9609e-02,  3.1746e-02,  2.7179e-02,  2.7007e-02,
         2.3709e-02,  1.8005e-02,  1.7676e-02,  1.4129e-02,  1.3162e-02,
         1.1374e-02,  7.4124e-03,  7.7525e-03,  6.1224e-03,  5.2787e-03,
         2.8587e-03,  1.1835e-04, -1.1200e-03, -1.3011e-03, -2.6812e-03,
        -5.9202e-03, -6.1644e-03, -6.9261e-03, -9.1364e-03, -9.2114e-03,
        -1.0665e-02, -1.3760e-02, -1.3588e-02, -1.4895e-02, -1.6190e-02,
        -1.8660e-02, -1.4980e-02, -1.8880e-02, -2.0218e-02, -2.0559e-02,
        -2.1013e-02, -2.2129e-02, -2.2748e-02, -2.3322e-02, -2.3382e-02,
        -2.3865e-02, -2.3973e-02, -2.4196e-02, -2.4755e-02, -2.4951e-02,
        -2.5078e-02, -2.5148e-02, -2.5257e-02, -2.5550e-02,  5.9721e+00,
         9.5081e+00,  2.6373e+00,  4.0946e+00,  1.4385e+00,  1.2700e+00,
         1.0081e+00,  5.3797e-01,  5.1462e-01,  2.8812e-01,  1.2429e-01,
        -1.1352e-02, -2.2293e-01, -3.9102e-01, -4.3574e-01, -5.3142e-01,
        -5.4777e-01, -6.3948e-01, -7.3816e-01, -8.2163e-01, -8.2526e-01,
        -8.8313e-01, -8.8615e-01, -9.3446e-01, -9.5100e-01, -9.5168e-01])
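The energies above are listed frame by frame. The jump back to a large value after the 74th entry is consistent with `total_frames=74` for the first system, which suggests that the 100 entries span more than one relaxation trajectory. A minimal sketch of splitting energies by system follows. It assumes `sid` identifies the system a frame belongs to and converts to a plain integer; the helper name `group_energies_by_system` is ours, and the toy records are made-up values for illustration, not OC20 data.

```python
from collections import defaultdict


def group_energies_by_system(records):
    """Group (sid, energy) pairs into {sid: [energy, ...]}, keeping frame order."""
    grouped = defaultdict(list)
    for sid, energy in records:
        grouped[sid].append(energy)
    return dict(grouped)


# Toy (sid, energy) pairs with made-up values, for illustration only
toy_records = [(10, 6.3), (10, 4.1), (10, 3.1), (20, 6.0), (20, 2.6)]
for sid, energies_for_sid in group_energies_by_system(toy_records).items():
    print(sid, len(energies_for_sid), min(energies_for_sid))

# With the dataset loaded above, the records could be built as follows
# (assumption: each Data object carries a one-element `sid` tensor and a float `y`):
# records = [(int(d.sid), float(d.y)) for d in dataset]
```

Grouping this way makes it possible to inspect each trajectory separately, for example to check that the energy decreases along a relaxation.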

import matplotlib.pyplot as plt

# Histogram of the per-frame energies; the log-scaled y-axis keeps sparsely populated bins visible
plt.hist(energies.numpy(), bins=50)
plt.yscale("log")
plt.xlabel("Energies")
plt.show()
