dig.ggraph.dataset

Dataset interfaces under dig.ggraph.dataset.

MOSES

A PyTorch Geometric data interface for the MOSES dataset, which is from the paper “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models” and contains 4,591,276 molecules refined from the ZINC database.

PygDataset

A PyTorch Geometric data interface for datasets used in molecule generation.

QM9

A PyTorch Geometric data interface for the QM9 dataset, which is from the paper “MoleculeNet: A Benchmark for Molecular Machine Learning” and consists of about 130,000 molecules with 2 property optimization targets: penalized_logp and qed.

ZINC250k

A PyTorch Geometric data interface for the ZINC250k dataset, which comes from the ZINC database and the paper “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules” and contains about 250,000 molecular graphs with up to 38 heavy atoms.

ZINC800

A PyTorch Geometric data interface for the ZINC800 dataset, which contains the 800 selected molecules with the lowest penalized logP scores.

class MOSES(root='./', prop_name=None, conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)[source]

A PyTorch Geometric data interface for the MOSES dataset, which is from the paper “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models” and contains 4,591,276 molecules refined from the ZINC database.

Parameters
  • root (string, optional) – Root directory where the dataset should be saved.

  • prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: None)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • use_aug (bool, optional) – If True, data augmentation will be used. (default: False)

  • one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)

class PygDataset(root, name, prop_name='penalized_logp', conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)[source]

A PyTorch Geometric data interface for datasets used in molecule generation.

Note

Some datasets, such as moses, may not come with any labels because the original data file contains no properties. Processing saves only the property that was requested at that time, and the same property label is loaded whenever the processed dataset is reused. To re-process the dataset with a different property, change the processed_filename argument.
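Since the processed file stores only the property that was requested when it was built, one way to keep several properties side by side is to derive processed_filename from the property name. The sketch below assumes a hypothetical helper, processed_filename_for, which is not part of DIG; the commented-out PygDataset call uses only arguments from the signature above.

```python
def processed_filename_for(prop_name, base="data"):
    """Build a processed-file name that encodes the property label.

    Hypothetical helper, not part of DIG: it only produces a string that can
    be passed as the `processed_filename` argument.
    """
    suffix = "none" if prop_name is None else prop_name
    return f"{base}_{suffix}.pt"


# Each property gets its own file name, so a dataset built for one property
# is not silently reused when another property is requested.
qed_file = processed_filename_for("qed")
plogp_file = processed_filename_for("penalized_logp")

# Assumed usage (requires DIG and downloads the raw data on first run):
# from dig.ggraph.dataset import PygDataset
# dataset = PygDataset(root="./dataset", name="qm9", prop_name="qed",
#                      processed_filename=qed_file)
```

Here qed_file is "data_qed.pt" and plogp_file is "data_penalized_logp.pt"; any naming scheme works as long as each property maps to a distinct file.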

Parameters
  • root (string) – Root directory where the dataset should be saved.

  • name (string) – The name of the dataset. Available dataset names are as follows: zinc250k, zinc_800_graphaf, zinc_800_jt, zinc250k_property, qm9_property, qm9, moses.

  • prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: penalized_logp)

  • conf_dict (dictionary, optional) – A dictionary that stores all the configuration for the corresponding dataset. When a dictionary is passed, its contents are used as the dataset configuration. Useful for debugging and for customization by external contributors. (default: None)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • use_aug (bool, optional) – If True, data augmentation will be used. (default: False)

  • one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)
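The three callables act at different stages: pre_filter and pre_transform run once while the dataset is processed, and transform runs on every access. The toy class below is a minimal sketch of that behavior in plain Python. ToyDataset and its dictionary "molecules" are hypothetical stand-ins, and the filter-before-pre_transform order follows the usual PyTorch Geometric convention rather than anything confirmed for DIG.

```python
class ToyDataset:
    """Stand-in for a PyG-style dataset; illustrates hook order only."""

    def __init__(self, raw, transform=None, pre_transform=None, pre_filter=None):
        self.transform = transform
        items = list(raw)
        # Processing stage: runs once; the result is what would be saved to disk.
        if pre_filter is not None:
            items = [d for d in items if pre_filter(d)]
        if pre_transform is not None:
            items = [pre_transform(d) for d in items]
        self._processed = items

    def __len__(self):
        return len(self._processed)

    def __getitem__(self, idx):
        item = self._processed[idx]
        # Access stage: runs on every lookup and is never cached.
        return item if self.transform is None else self.transform(item)


# Toy "molecules": dicts with an atom count and a property label.
raw = [{"num_atom": n, "y": float(n)} for n in (5, 12, 40, 23)]

toy = ToyDataset(
    raw,
    pre_filter=lambda d: d["num_atom"] <= 38,         # drop oversized molecules
    pre_transform=lambda d: {**d, "y": d["y"] / 10},  # rescale the label once
    transform=lambda d: {**d, "seen": True},          # applied at access time
)
```

The 40-atom entry is removed before anything is stored, the rescaled label is part of the stored item, and the "seen" flag exists only on the object returned by indexing.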

download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

Parameters

idx – The index of the data object to retrieve.

Return type

A data object corresponding to the input index idx.

get_split_idx()[source]

Gets the train-valid set split indices of the dataset.

Return type

A dictionary for the training-validation split with keys train_idx and valid_idx.
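The returned dictionary can be used to carve the dataset into two subsets. The sketch below uses plain Python stand-ins: split_idx imitates the documented keys with made-up indices, and molecules is a hypothetical list in place of the dataset. The real values come from the library and may be tensors rather than lists.

```python
# Stand-in for the dictionary returned by dataset.get_split_idx().
split_idx = {"train_idx": [0, 2, 3, 5], "valid_idx": [1, 4]}

# Stand-in for the dataset itself: any indexable sequence works for the sketch.
molecules = ["C", "CC", "CCC", "CO", "CN", "C=O"]

train_set = [molecules[i] for i in split_idx["train_idx"]]
valid_set = [molecules[i] for i in split_idx["valid_idx"]]

# A sane split never places the same molecule in both subsets.
overlap = set(split_idx["train_idx"]) & set(split_idx["valid_idx"])
```

With a real dataset the same keys would typically be used as dataset[split_idx['train_idx']], assuming the indices are in a form that PyTorch Geometric indexing accepts.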

process()[source]

Processes the dataset from raw data file to the self.processed_dir folder.

If the one-shot format is required (one_shot=True), the processed data will include an extra dimension for the virtual node and edge features.
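One plausible reading of the extra virtual dimension is sketched below for node features only: every molecule is padded to a fixed number of rows, and an additional feature channel marks the padding rows as virtual nodes. This is an assumption for illustration, not DIG's confirmed layout; ATOM_TYPES, MAX_NODES, and one_shot_node_features are hypothetical names.

```python
ATOM_TYPES = [6, 7, 8, 9]   # hypothetical atom list (C, N, O, F)
MAX_NODES = 5               # hypothetical padded graph size


def one_shot_node_features(atomic_numbers):
    """Pad a molecule to MAX_NODES rows with one extra 'virtual node' channel."""
    width = len(ATOM_TYPES) + 1            # real atom types + virtual channel
    rows = []
    for z in atomic_numbers:
        row = [0] * width
        row[ATOM_TYPES.index(z)] = 1       # one-hot over the real atom types
        rows.append(row)
    for _ in range(MAX_NODES - len(atomic_numbers)):
        row = [0] * width
        row[-1] = 1                        # padding slot flagged as virtual
        rows.append(row)
    return rows


x = one_shot_node_features([6, 6, 8])      # heavy atoms of ethanol: C, C, O
```

Every row of x is still a valid one-hot vector, so a model that generates the whole graph in one pass can treat "no atom here" as just another class.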

property processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

class QM9(root='./', prop_name='penalized_logp', conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)[source]

A PyTorch Geometric data interface for the QM9 dataset, which is from the paper “MoleculeNet: A Benchmark for Molecular Machine Learning” and consists of about 130,000 molecules with 2 property optimization targets: penalized_logp and qed.

Parameters
  • root (string, optional) – Root directory where the dataset should be saved.

  • prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: penalized_logp)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • use_aug (bool, optional) – If True, data augmentation will be used. (default: False)

  • one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)

class ZINC250k(root='./', prop_name='penalized_logp', conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)[source]

A PyTorch Geometric data interface for the ZINC250k dataset, which comes from the ZINC database and the paper “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules” and contains about 250,000 molecular graphs with up to 38 heavy atoms.

Parameters
  • root (string, optional) – Root directory where the dataset should be saved.

  • prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: penalized_logp)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • use_aug (bool, optional) – If True, data augmentation will be used. (default: False)

  • one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)

The dataset can be merged into batches with torch_geometric.data.DataLoader or torch_geometric.data.DenseDataLoader. DataLoader concatenates all graph attributes into one large graph and records the node-to-graph assignment in an additional attribute batch, whereas DenseDataLoader works with dense adjacency matrices and stacks the attributes of each graph along a new leading batch dimension. You can iterate over either data loader to see what it yields.

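Examples

>>> from dig.ggraph.dataset import ZINC250k
>>> from torch_geometric.data import DataLoader, DenseDataLoader
>>> dataset = ZINC250k(root='./dataset', prop_name='penalized_logp')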
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
>>> denseloader = DenseDataLoader(dataset, batch_size=32, shuffle=True)
>>> data = next(iter(loader))
>>> data
Batch(adj=[128, 38, 38], batch=[1216], bfs_perm_origin=[1216], num_atom=[32], ptr=[33], smile=[32], x=[1216, 9], y=[32])
>>> data = next(iter(denseloader))
>>> data
Batch(adj=[32, 4, 38, 38], bfs_perm_origin=[32, 38], num_atom=[32, 1], smile=[32], x=[32, 38, 9], y=[32, 1])

The attributes of the output data are as follows:

  • x: The node features.

  • y: The property labels for the graph.

  • adj: The edge features in the form of adjacency matrices.

  • batch: The assignment vector that maps each node to its graph identifier and helps reconstruct single graphs.

  • bfs_perm_origin: The BFS order of the nodes in each graph.

  • num_atom: The number of atoms in each graph.

  • smile: The original SMILES strings of the graphs.
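The shapes in the DataLoader example are consistent with each graph contributing 38 node rows, since 32 × 38 = 1216. Under that assumption, which is inferred from the printed shapes only, the bookkeeping attributes batch and ptr can be reproduced in plain Python, and batch can be used to pick out the rows of a single graph. The helper node_rows_of is hypothetical.

```python
num_graphs = 32     # batch_size in the example
max_nodes = 38      # dataset.num_max_node for ZINC250k

# Sparse-style batching: the node rows of all graphs are concatenated, and
# `batch` records which graph each row came from.
batch = [g for g in range(num_graphs) for _ in range(max_nodes)]

# `ptr` holds the cumulative row offsets, one per graph plus a final end marker.
ptr = [g * max_nodes for g in range(num_graphs + 1)]


def node_rows_of(graph_id):
    """Row indices in the concatenated x that belong to one graph."""
    return [i for i, g in enumerate(batch) if g == graph_id]


rows = node_rows_of(3)
```

The lengths match the example output (batch=[1216], ptr=[33]), and the rows selected through batch for graph 3 are the same ones delimited by ptr[3] and ptr[4].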

The dataset object also provides the training-validation split indices through get_split_idx(), a list of all atom types in atom_list, and the maximum number of nodes (atoms) among all molecules in num_max_node.

Examples

>>> dataset.num_max_node
38
>>> dataset.atom_list
[6, 7, 8, 9, 15, 16, 17, 35, 53]
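atom_list has nine entries, which matches the node feature width of 9 in the batches shown earlier (x=[1216, 9]). That suggests a one-hot encoding over atom_list. The sketch below shows such an encoding and its inverse, assuming the feature columns follow the order of atom_list; the helpers one_hot_atom and atom_from_one_hot are hypothetical and not part of DIG.

```python
atom_list = [6, 7, 8, 9, 15, 16, 17, 35, 53]   # atomic numbers from the example


def one_hot_atom(atomic_number):
    """Return a one-hot vector over atom_list (assumed feature layout)."""
    vec = [0] * len(atom_list)
    vec[atom_list.index(atomic_number)] = 1   # raises ValueError if unknown
    return vec


def atom_from_one_hot(vec):
    """Invert one_hot_atom: recover the atomic number from a feature row."""
    return atom_list[vec.index(1)]


nitrogen = one_hot_atom(7)
```

Decoding generated node features back into atomic numbers, for example when converting a generated graph to a molecule, is the same lookup in reverse.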

class ZINC800(root='./', method='jt', prop_name='penalized_logp', conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)[source]

A PyTorch Geometric data interface for the ZINC800 dataset, which contains the 800 selected molecules with the lowest penalized logP scores. The jt method selects these molecules from the test set, whereas graphaf selects them from the training set.

Parameters
  • root (string, optional) – Root directory where the dataset should be saved.

  • method (string, optional) – Method name for ZINC800 dataset, can be either jt or graphaf. (default: jt)

  • prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: penalized_logp)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • use_aug (bool, optional) – If True, data augmentation will be used. (default: False)

  • one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)
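The two method values line up with the dataset names zinc_800_jt and zinc_800_graphaf accepted by PygDataset. The sketch below shows that correspondence with a hypothetical helper, zinc800_dataset_name; whether ZINC800 builds the name this way internally is an assumption.

```python
VALID_METHODS = ("jt", "graphaf")


def zinc800_dataset_name(method="jt"):
    """Map a ZINC800 method to a PygDataset name (hypothetical helper)."""
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    return f"zinc_800_{method}"


jt_name = zinc800_dataset_name("jt")
graphaf_name = zinc800_dataset_name("graphaf")
```

Validating the method up front gives a clear error for a misspelled value instead of a missing-file failure later in the pipeline.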