Dataset Overview#

The Open Catalyst 2020 Dataset (OC20) will be used throughout this tutorial. More details can be found here and the corresponding paper. Data is stored in PyTorch Geometric Data objects and stored in LMDB files. For each task we include several sized training splits. Validation/Test splits are broken into several subsplits: In Domain (ID), Out of Domain Adsorbate (OOD-Ads), Out of Domain Catalyast (OOD-Cat) and Out of Domain Adsorbate and Catalyst (OOD-Both). Split sizes are summarized below:

Train

  • S2EF - 200k, 2M, 20M, 134M(All)

  • IS2RE/IS2RS - 10k, 100k, 460k(All)

Val/Test

  • S2EF - ~1M across all subsplits

  • IS2RE/IS2RS - ~25k across all splits

Tutorial Use#

For the sake of this tutorial we provide much smaller splits (100 train, 20 val for all tasks) to allow users to easily store, train, and predict across the various tasks. Please refer here for details on how to download the full datasets for general use.

OC20 image