dig.threedgraph.dataset

Dataset interfaces under dig.threedgraph.dataset.

MD17

A PyTorch Geometric data interface for the MD17 dataset, introduced in the paper "Machine learning of accurate energy-conserving molecular force fields".

QM93D

A PyTorch Geometric data interface for the QM9 dataset, introduced in the paper "Quantum chemistry structures and properties of 134 kilo molecules".

class MD17(root='dataset/', name='benzene_old', transform=None, pre_transform=None, pre_filter=None)[source]

A PyTorch Geometric data interface for the MD17 dataset, introduced in the paper “Machine learning of accurate energy-conserving molecular force fields”. MD17 is a collection of eight molecular dynamics simulations of small organic molecules.

Parameters
  • root (string) – The dataset folder will be located at root/name.

  • name (string) – The name of the dataset. Available names are: aspirin, benzene_old, ethanol, malonaldehyde, naphthalene, salicylic, toluene, uracil. (default: benzene_old)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value indicating whether the data object should be included in the final dataset. (default: None)

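>>> from dig.threedgraph.dataset import MD17
>>> from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader in PyG < 2.0
>>> dataset = MD17(name='aspirin')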
>>> split_idx = dataset.get_idx_split(len(dataset.data.y), train_size=1000, valid_size=10000, seed=42)
>>> train_dataset, valid_dataset, test_dataset = dataset[split_idx['train']], dataset[split_idx['valid']], dataset[split_idx['test']]
>>> train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
>>> data = next(iter(train_loader))
>>> data
Batch(batch=[672], force=[672, 3], pos=[672, 3], ptr=[33], y=[32], z=[672])

The attributes of the output data are:

  • z: The atom type.

  • pos: The 3D position for atoms.

  • y: The property (energy) for the graph (molecule).

  • force: The 3D force for atoms.

  • batch: The assignment vector that maps each node to its graph identifier; it can be used to reconstruct the individual graphs.
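
To make the role of batch concrete, the sketch below groups per-atom values by graph index. It is only an illustration: plain Python lists stand in for the tensors of a real Batch, and split_by_graph is a hypothetical helper, not part of DIG or PyTorch Geometric.

```python
from collections import defaultdict


def split_by_graph(batch, values):
    """Group per-node values by the graph index stored in `batch`."""
    groups = defaultdict(list)
    for graph_id, value in zip(batch, values):
        groups[graph_id].append(value)
    # Return the groups ordered by graph index.
    return [groups[g] for g in sorted(groups)]


# Toy batch (assumed data): two molecules with 3 and 2 atoms.
batch = [0, 0, 0, 1, 1]   # graph index of each atom
z = [8, 1, 1, 6, 8]       # atom type of each atom

per_graph_z = split_by_graph(batch, z)
print(per_graph_z)
```

The same grouping applies to any per-atom attribute, such as pos or force, because they share the node ordering of batch.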

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

property processed_file_names

The names of the files in the self.processed_dir folder that must be present in order to skip processing.

property raw_file_names

The names of the files in the self.raw_dir folder that must be present in order to skip downloading.

class QM93D(root='dataset/', transform=None, pre_transform=None, pre_filter=None)[source]

A PyTorch Geometric data interface for the QM9 dataset, introduced in the paper “Quantum chemistry structures and properties of 134 kilo molecules”. It consists of about 130,000 equilibrium molecules with 12 regression targets: mu, alpha, homo, lumo, gap, r2, zpve, U0, U, H, G, Cv. Each molecule includes complete spatial information for the single low-energy conformation of its atoms.

Note

This interface uses the processed data from DimeNet, which includes the spatial information and type of each atom. You can also use the QM9 dataset in PyTorch Geometric.

Parameters
  • root (string) – The dataset folder will be located at root/qm9.

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value indicating whether the data object should be included in the final dataset. (default: None)
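
Since pre_filter is an ordinary callable that returns a boolean, a size-based filter might look like the sketch below. The names make_max_atoms_filter and keep_small are hypothetical, and SimpleNamespace objects stand in for torch_geometric.data.Data so the snippet runs without downloading the dataset. The only attribute relied on is the atom-type attribute z.

```python
from types import SimpleNamespace


def make_max_atoms_filter(max_atoms):
    """Build a pre_filter that keeps molecules with at most `max_atoms` atoms."""
    def keep(data):
        # `data.z` holds one atom type per atom, so its length is the atom count.
        return len(data.z) <= max_atoms
    return keep


# Hypothetical stand-ins for Data objects (real ones store `z` as a tensor).
water = SimpleNamespace(z=[8, 1, 1])
methane = SimpleNamespace(z=[6, 1, 1, 1, 1])

keep_small = make_max_atoms_filter(4)
kept = [keep_small(molecule) for molecule in (water, methane)]
print(kept)
```

A filter built this way would be passed as QM93D(pre_filter=keep_small).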

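>>> from dig.threedgraph.dataset import QM93D
>>> from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader in PyG < 2.0
>>> dataset = QM93D()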
>>> target = 'mu'
>>> dataset.data.y = dataset.data[target]
>>> split_idx = dataset.get_idx_split(len(dataset.data.y), train_size=110000, valid_size=10000, seed=42)
>>> train_dataset, valid_dataset, test_dataset = dataset[split_idx['train']], dataset[split_idx['valid']], dataset[split_idx['test']]
>>> train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
>>> data = next(iter(train_loader))
>>> data
Batch(Cv=[32], G=[32], H=[32], U=[32], U0=[32], alpha=[32], batch=[579], gap=[32], homo=[32], lumo=[32], mu=[32], pos=[579, 3], ptr=[33], r2=[32], y=[32], z=[579], zpve=[32])

The attributes of the output data are:

  • z: The atom type.

  • pos: The 3D position for atoms.

  • y: The target property for the graph (molecule).

  • batch: The assignment vector that maps each node to its graph identifier; it can be used to reconstruct the individual graphs.
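
The get_idx_split call in the example above takes a dataset size, a training size, a validation size, and a seed, and returns index sets under the keys train, valid, and test. The sketch below shows one way such a seeded split can be built with the standard library. It is not DIG's implementation: sketch_idx_split is a hypothetical function, and it assumes the test set is whatever remains after the training and validation indices are taken.

```python
import random


def sketch_idx_split(data_size, train_size, valid_size, seed):
    """Shuffle indices with a fixed seed and cut them into three disjoint parts."""
    if train_size + valid_size > data_size:
        raise ValueError("train_size + valid_size exceeds data_size")
    indices = list(range(data_size))
    random.Random(seed).shuffle(indices)  # local RNG: same seed, same split
    return {
        "train": indices[:train_size],
        "valid": indices[train_size:train_size + valid_size],
        "test": indices[train_size + valid_size:],  # assumed: everything left over
    }


split = sketch_idx_split(data_size=100, train_size=70, valid_size=10, seed=42)
sizes = {name: len(idx) for name, idx in split.items()}
print(sizes)
```

Fixing the seed keeps the three subsets identical across runs, which matters when results for different targets or models are compared on the same split.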

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

property processed_file_names

The names of the files in the self.processed_dir folder that must be present in order to skip processing.

property raw_file_names

The names of the files in the self.raw_dir folder that must be present in order to skip downloading.