# Interacting with the OC20 datasets

The OC20 datasets are stored in LMDBs. Here we show how to interact with the datasets directly in order to better understand the data. We use two separate classes to read in the appropriate datasets:

*S2EF* - We use the [TrajectoryLmdbDataset](https://github.com/Open-Catalyst-Project/ocp/blob/master/ocpmodels/datasets/trajectory_lmdb.py) object to read in a **directory** of LMDB files containing the dataset.

*IS2RE/IS2RS* - We use the [SinglePointLmdbDataset](https://github.com/Open-Catalyst-Project/ocp/blob/master/ocpmodels/datasets/single_point_lmdb.py) class to read in a **single LMDB file** containing the dataset.
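
To picture what reading a **directory** of files involves compared with a single file, here is a minimal sketch of how a global sample index could be mapped onto one of several files. It is not the ocpmodels implementation: `ShardedIndex`, `locate`, and the shard sizes are hypothetical names and values chosen for illustration only.

```python
import bisect
from itertools import accumulate


class ShardedIndex:
    """Map a global sample index onto (shard number, index within that shard).

    Hypothetical helper for illustration; not part of ocpmodels.
    """

    def __init__(self, shard_lengths):
        # shard_lengths[i] is the number of samples stored in the i-th file.
        self.shard_lengths = list(shard_lengths)
        # Running totals, e.g. [40, 30, 30] -> [40, 70, 100].
        self.cumulative = list(accumulate(self.shard_lengths))

    def __len__(self):
        # Total number of samples across all shards.
        return self.cumulative[-1] if self.cumulative else 0

    def locate(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(f"index {idx} out of range for {len(self)} samples")
        # First shard whose running total is strictly greater than idx.
        shard = bisect.bisect_right(self.cumulative, idx)
        # Subtract the samples held by all earlier shards.
        offset = idx - (self.cumulative[shard - 1] if shard > 0 else 0)
        return shard, offset


# Made-up example: three files holding 40, 30 and 30 samples.
index = ShardedIndex([40, 30, 30])
print(len(index), index.locate(0), index.locate(40), index.locate(99))
```

With a single LMDB file there is only one shard, so the global index and the index within the file coincide.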



First, let's download the tutorial dataset (this takes about 1 minute).

In [None]:
%%bash
mkdir -p data
cd data
wget -q http://dl.fbaipublicfiles.com/opencatalystproject/data/tutorial_data.tar.gz -O tutorial_data.tar.gz
tar -xzvf tutorial_data.tar.gz
rm tutorial_data.tar.gz
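
If `wget` or `tar` is not available (for example on some Windows setups), the same download, unpack, and clean-up steps can be sketched with the Python standard library. This is an illustrative alternative rather than part of the original tutorial; `fetch_and_extract` and its parameters are hypothetical names.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_and_extract(url, dest_dir="data", archive_name="tutorial_data.tar.gz"):
    """Download a .tar.gz archive into dest_dir, unpack it, then delete the archive.

    Hypothetical helper mirroring the shell commands above.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    archive = dest / archive_name

    # Equivalent of `wget -q <url> -O <archive>`.
    urllib.request.urlretrieve(url, archive)

    # Equivalent of `tar -xzf <archive>`; only unpack archives you trust.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)

    # Equivalent of `rm <archive>`.
    archive.unlink()
    return dest


# Example call (commented out so that nothing is downloaded on import):
# fetch_and_extract(
#     "http://dl.fbaipublicfiles.com/opencatalystproject/data/tutorial_data.tar.gz"
# )
```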

Now, let's interact with the LMDB files via the TrajectoryLmdbDataset interface!

In [None]:
from ocpmodels.datasets import SinglePointLmdbDataset, TrajectoryLmdbDataset

# TrajectoryLmdbDataset is our custom Dataset class that reads the LMDBs as Data objects.
# Note that for S2EF we need to give the path to the folder containing the LMDB files.
dataset = TrajectoryLmdbDataset({"src": "data/s2ef/train_100/"})

print("Size of the dataset created:", len(dataset))
print(dataset[0])

In [None]:
data = dataset[0]
data

In [None]:
import torch

# Collect the energy label (y) of every sample in the dataset into a single tensor.
energies = torch.tensor([sample.y for sample in dataset])
energies

In [None]:
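import matplotlib.pyplot as plt

# Histogram of the energies; the log-scaled y-axis keeps sparsely populated bins visible.
plt.hist(energies, bins=50)
plt.yscale("log")
plt.xlabel("Energies")
plt.show()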
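
Alongside the histogram, a few summary statistics can help when getting a feel for the energy distribution. The following is a minimal sketch using only the standard library; `summarize` is a hypothetical helper, and the values in the example call are made up rather than taken from the dataset.

```python
import statistics


def summarize(values):
    """Return count, min, max, mean and sample standard deviation of a sequence.

    Hypothetical helper for illustration; not part of ocpmodels.
    """
    values = [float(v) for v in values]
    if len(values) < 2:
        raise ValueError("need at least two values to compute a standard deviation")
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) standard deviation
    }


# Made-up values, only to show the call.
print(summarize([1.0, 2.0, 3.0]))
```

In the notebook, the tensor built earlier could be passed in as a plain list, for example `summarize(energies.tolist())`.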