dig.ggraph.dataset
Dataset interfaces under dig.ggraph.dataset.
MOSES | A PyTorch Geometric data interface for the MOSES dataset.
PygDataset | A PyTorch Geometric data interface for datasets used in molecule generation.
QM9 | A PyTorch Geometric data interface for the QM9 dataset.
ZINC250k | A PyTorch Geometric data interface for the ZINC250k dataset.
ZINC800 | A PyTorch Geometric data interface for the ZINC800 dataset.
- class MOSES(root='./', prop_name=None, conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)
A PyTorch Geometric data interface for the MOSES dataset, which is from the paper “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models” and contains 4,591,276 molecules refined from the ZINC database.
- Parameters
root (string, optional) – Root directory where the dataset should be saved.
prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: None)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
use_aug (bool, optional) – If True, data augmentation will be used. (default: False)
one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)
- class PygDataset(root, name, prop_name='penalized_logp', conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)
A PyTorch Geometric data interface for datasets used in molecule generation.
Note
Some datasets, like moses, may not come with any node labels, since they don’t have any properties in the original data file. Processing saves only the property currently given as input, and the same property label is loaded whenever the processed dataset is used again. To re-process the dataset with a different property, change the argument processed_filename.
- Parameters
root (string) – Root directory where the dataset should be saved.
name (string) – The name of the dataset. Available dataset names are as follows: zinc250k, zinc_800_graphaf, zinc_800_jt, zinc250k_property, qm9_property, qm9, moses.
prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: penalized_logp)
conf_dict (dictionary, optional) – Dictionary that stores all the configuration for the corresponding dataset. When a dictionary is passed, its information is used. Useful for debugging and for customization by external contributors. (default: None)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
use_aug (bool, optional) – If True, data augmentation will be used. (default: False)
one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)
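The transform, pre_transform and pre_filter hooks differ mainly in when they run. The following is a minimal, library-free sketch of that ordering, not the implementation used by PygDataset. ToyDataset and the dictionary records are hypothetical stand-ins for the dataset class and for torch_geometric.data.Data objects, and the filter-before-transform order is an assumption.

```python
class ToyDataset:
    """Hypothetical stand-in that mimics the order in which the hooks run."""

    def __init__(self, raw_items, transform=None, pre_transform=None, pre_filter=None):
        self.transform = transform
        # "Processing": runs once, before the data would be saved to disk.
        processed = []
        for item in raw_items:
            if pre_filter is not None and not pre_filter(item):
                continue  # excluded from the final dataset
            if pre_transform is not None:
                item = pre_transform(item)
            processed.append(item)
        self._items = processed

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        item = self._items[idx]
        # Applied on every access; the stored item is left untouched.
        return self.transform(item) if self.transform is not None else item


raw = [{"num_atom": 3}, {"num_atom": 12}, {"num_atom": 7}]
toy = ToyDataset(
    raw,
    pre_filter=lambda d: d["num_atom"] <= 9,        # keep small molecules only
    pre_transform=lambda d: {**d, "cached": True},  # stored with the item
    transform=lambda d: {**d, "accessed": True},    # recomputed on each access
)
```

In this sketch the result of pre_transform is part of the stored item, while the result of transform exists only on the object returned by indexing.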
- get(idx)
Gets the data object at index idx.
- Parameters
idx – The index of the data that you want to reach.
- Return type
A data object corresponding to the input index idx.
- get_split_idx()
Gets the train-valid set split indices of the dataset.
- Return type
A dictionary for the training-validation split with keys train_idx and valid_idx.
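As a rough illustration of how such a dictionary might be consumed, the sketch below uses a hand-written split_idx and a plain list in place of a real dataset. Both are hypothetical; in practice the dictionary would come from get_split_idx() and the values may be tensors rather than lists.

```python
# Hypothetical split dictionary with the two keys described above.
split_idx = {"train_idx": [0, 2, 3, 5], "valid_idx": [1, 4]}

# Stand-in for a dataset: any indexable sequence works for this sketch.
molecules = ["C", "CC", "CCO", "c1ccccc1", "CN", "CCN"]

# Select the training and validation subsets by index.
train_set = [molecules[i] for i in split_idx["train_idx"]]
valid_set = [molecules[i] for i in split_idx["valid_idx"]]

# Check whether the two index sets share any element.
overlap = set(split_idx["train_idx"]) & set(split_idx["valid_idx"])
```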
- process()
Processes the dataset from the raw data file to the self.processed_dir folder. If one-shot format is required, the processed data type will include an extra dimension of virtual node and edge feature.
- property processed_file_names
The name of the files in the self.processed_dir folder that must be present in order to skip processing.
- property raw_file_names
The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
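The skip-if-present behaviour can be pictured with a small file-existence check. This is a sketch of the general idea only, not the library's code. needs_processing is a hypothetical helper, and data_qed.pt is a made-up filename used to show why a different processed_filename leads to re-processing.

```python
import tempfile
from pathlib import Path


def needs_processing(processed_dir, processed_file_names):
    """Return True if any expected processed file is missing."""
    return not all((Path(processed_dir) / name).exists() for name in processed_file_names)


with tempfile.TemporaryDirectory() as tmp:
    before = needs_processing(tmp, ["data.pt"])     # nothing cached yet
    (Path(tmp) / "data.pt").touch()                 # pretend processing ran
    after = needs_processing(tmp, ["data.pt"])      # cached file found, skip
    other = needs_processing(tmp, ["data_qed.pt"])  # different name, process again
```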
- class QM9(root='./', prop_name='penalized_logp', conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)
A PyTorch Geometric data interface for the QM9 dataset, which is from the “MoleculeNet: A Benchmark for Molecular Machine Learning” paper and consists of about 130,000 molecules with 2 property optimization targets: penalized_logp and qed.
- Parameters
root (string, optional) – Root directory where the dataset should be saved.
prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: penalized_logp)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
use_aug (bool, optional) – If True, data augmentation will be used. (default: False)
one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)
- class ZINC250k(root='./', prop_name='penalized_logp', conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)
A PyTorch Geometric data interface for the ZINC250k dataset, which comes from the ZINC database and the “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules” paper and contains about 250,000 molecular graphs with up to 38 heavy atoms.
- Parameters
root (string, optional) – Root directory where the dataset should be saved.
prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: penalized_logp)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
use_aug (bool, optional) – If True, data augmentation will be used. (default: False)
one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)
The dataset can be merged into a batch data format with torch_geometric.data.DataLoader and torch_geometric.data.DenseDataLoader. DataLoader concatenates all graph attributes into one large graph and puts the batch information into an additional attribute batch, while DenseDataLoader works with dense adjacency matrices and stacks the attributes of each graph along a new batch dimension. You can iterate over the data loader and see what it yields.
Examples
>>> from torch_geometric.data import DataLoader, DenseDataLoader
>>> from dig.ggraph.dataset import ZINC250k
>>> dataset = ZINC250k(root='./dataset', prop_name='penalized_logp')
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
>>> denseloader = DenseDataLoader(dataset, batch_size=32, shuffle=True)
>>> data = next(iter(loader))
>>> data
Batch(adj=[128, 38, 38], batch=[1216], bfs_perm_origin=[1216], num_atom=[32], ptr=[33], smile=[32], x=[1216, 9], y=[32])
>>> data = next(iter(denseloader))
>>> data
Batch(adj=[32, 4, 38, 38], bfs_perm_origin=[32, 38], num_atom=[32, 1], smile=[32], x=[32, 38, 9], y=[32, 1])
The attributes of the output data are:
x: The node features.
y: The property labels for the graph.
adj: The edge features in the form of adjacency matrices.
batch: The assignment vector which maps each node to its respective graph identifier and can help reconstruct single graphs.
bfs_perm_origin: The BFS search order for each single graph.
num_atom: Number of atoms for each graph.
smile: Original SMILES sequences for the graphs.
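To make the role of batch more concrete, here is a small pure-Python sketch of concatenating per-graph node features and rebuilding the single graphs from the assignment vector. It illustrates the idea only and is not PyTorch Geometric's collation code. collate_sparse and split_by_batch are hypothetical helper names, and atom symbols stand in for feature rows.

```python
def collate_sparse(graphs):
    """Concatenate node features of several graphs and build `batch` and `ptr`."""
    x, batch, ptr = [], [], [0]
    for graph_id, nodes in enumerate(graphs):
        x.extend(nodes)
        batch.extend([graph_id] * len(nodes))  # one graph id per node
        ptr.append(ptr[-1] + len(nodes))       # cumulative node offsets
    return x, batch, ptr


def split_by_batch(x, batch):
    """Recover per-graph node lists from the concatenated features."""
    graphs = {}
    for feature, graph_id in zip(x, batch):
        graphs.setdefault(graph_id, []).append(feature)
    return [graphs[i] for i in sorted(graphs)]


graphs = [["C", "C", "O"], ["N", "C"]]
x, batch, ptr = collate_sparse(graphs)
restored = split_by_batch(x, batch)
```

With 32 graphs of 38 node slots each, the same bookkeeping gives 1216 entries in batch and 33 entries in ptr, which matches the sizes printed in the DataLoader example.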
The dataset object also provides the training-validation split indices through get_split_idx(), a list of all atom types, atom_list, and the maximum number of nodes (atoms) among all molecules, num_max_node.
Examples
>>> dataset.num_max_node
38
>>> dataset.atom_list
[6, 7, 8, 9, 15, 16, 17, 35, 53]
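atom_list has nine entries and the node feature matrix x in the loader example has nine columns, which suggests one column per atom type. The sketch below assumes a plain one-hot encoding by position in atom_list, with all-zero rows as padding up to num_max_node. That encoding is an assumption made for illustration, and one_hot_atoms is a hypothetical helper, not part of the library.

```python
atom_list = [6, 7, 8, 9, 15, 16, 17, 35, 53]  # atomic numbers, as printed above


def one_hot_atoms(atomic_numbers, atom_list, num_max_node):
    """One row per node slot; rows past the real atoms stay all-zero (padding)."""
    rows = [[0] * len(atom_list) for _ in range(num_max_node)]
    for i, z in enumerate(atomic_numbers):
        rows[i][atom_list.index(z)] = 1
    return rows


# Heavy atoms of ethanol (C, C, O), padded to 5 node slots for brevity.
x = one_hot_atoms([6, 6, 8], atom_list, num_max_node=5)
```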
- class ZINC800(root='./', method='jt', prop_name='penalized_logp', conf_dict=None, transform=None, pre_transform=None, pre_filter=None, processed_filename='data.pt', use_aug=False, one_shot=False)
A PyTorch Geometric data interface for the ZINC800 dataset, which contains 800 selected molecules with the lowest penalized logP scores. Method jt selects them from the test set, while graphaf selects them from the train set.
- Parameters
root (string, optional) – Root directory where the dataset should be saved.
method (string, optional) – Method name for the ZINC800 dataset, can be either jt or graphaf. (default: jt)
prop_name (string, optional) – The molecular property desired and used as the optimization target. (default: penalized_logp)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
use_aug (bool, optional) – If True, data augmentation will be used. (default: False)
one_shot (bool, optional) – If True, the returned data will use one-shot format with an extra dimension of virtual node and edge feature. (default: False)
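Selecting the molecules with the lowest scores, as described for ZINC800, amounts to a k-smallest selection. The sketch below shows that idea on made-up scores. It is not how the shipped dataset was built, and select_lowest and the score values are hypothetical.

```python
import heapq


def select_lowest(scored_molecules, k):
    """Return the k (smiles, score) pairs with the lowest score, lowest first."""
    return heapq.nsmallest(k, scored_molecules, key=lambda pair: pair[1])


# Made-up penalized logP values, for illustration only.
scored = [("CCO", -0.5), ("c1ccccc1", 1.2), ("CCCCCCCC", -3.1), ("CN", -1.0)]
lowest = select_lowest(scored, k=2)
```

For ZINC800 the same selection would use k=800 over either the test set (jt) or the train set (graphaf).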