dig.ggraph.method

Method classes under dig.ggraph.method.

• Generator – The method base class for graph generation.

• GraphAF – The method class for the GraphAF algorithm proposed in the paper GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation.

• GraphDF – The method class for the GraphDF algorithm proposed in the paper GraphDF: A Discrete Flow Model for Molecular Graph Generation.

• GraphEBM – The method class for the GraphEBM algorithm proposed in the paper GraphEBM: Molecular Graph Generation with Energy-Based Models.

• JTVAE – The method class for the JTVAE algorithm proposed in the paper Junction Tree Variational Autoencoder for Molecular Graph Generation.

class Generator[source]

The method base class for graph generation. To write a new graph generation method, create a new class inheriting from this one and implement its interface functions; a sketch follows the function list below.

run_const_prop_opt(*args, **kwargs)[source]

Running molecule optimization for the constrained optimization task.

run_prop_opt(*args, **kwargs)[source]

Running graph generation for the property optimization task.

run_rand_gen(*args, **kwargs)[source]

Running graph generation for the random generation task.

train_const_prop_opt(loader, *args, **kwargs)[source]

Running training for the constrained optimization task.

Parameters

loader – The data loader for loading training samples.

train_prop_opt(*args, **kwargs)[source]

Running training for the property optimization task.

train_rand_gen(loader, *args, **kwargs)[source]

Running training for the random generation task.

Parameters

loader – The data loader for loading training samples.
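A minimal sketch of this extension pattern; the subclass below and its internals are hypothetical, not part of the library:

    from dig.ggraph.method import Generator

    class MyGenerator(Generator):
        """A hypothetical method class illustrating the interface contract."""

        def train_rand_gen(self, loader, *args, **kwargs):
            # Train the underlying generative model on graphs from the loader.
            for data in loader:
                pass  # model update step goes here

        def run_rand_gen(self, *args, **kwargs):
            # Return generated graphs once the model is trained.
            return []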

class GraphAF[source]

The method class for the GraphAF algorithm proposed in the paper GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation. This class provides interfaces for running random generation, property optimization, and constrained optimization with GraphAF. Please refer to the benchmark codes for usage examples.

run_cons_optim(dataset, model_conf_dict, checkpoint_path, repeat_time=200, min_optim_time=50, num_max_node=25, temperature=0.7, atomic_num_list=[6, 7, 8, 9])[source]

Running molecule optimization for the constrained optimization task.

Parameters
  • dataset – The dataset class for loading molecules to be optimized. Use dig.ggraph.dataset.ZINC800 as the dataset class.

  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • checkpoint_path (str) – The path to the saved model checkpoint file.

  • repeat_time (int, optional) – The maximum number of optimization attempts for each molecule before it is successfully optimized under the similarity threshold 0.6. (default: 200)

  • min_optim_time (int, optional) – The minimum number of optimization attempts for each molecule. (default: 50)

  • num_max_node (int, optional) – The maximum number of nodes in the optimized molecular graphs. (default: 25)

  • temperature (float, optional) – A float, the temperature parameter of the prior distribution. (default: 0.7)

  • atomic_num_list (list, optional) – A list of integers, the atomic numbers indicating the node types in the optimized molecular graphs. (default: [6, 7, 8, 9])

Return type

A tuple (mols_0, mols_2, mols_4, mols_6) of lists of optimized molecules (represented by rdkit Chem.Mol objects) under the similarity thresholds 0.0, 0.2, 0.4, and 0.6, respectively.

run_prop_optim(model_conf_dict, checkpoint_path, n_mols=100, num_min_node=7, num_max_node=25, temperature=0.75, atomic_num_list=[6, 7, 8, 9])[source]

Running graph generation for the property optimization task.

Parameters
  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • checkpoint_path (str) – The path to the saved model checkpoint file.

  • n_mols (int, optional) – The number of molecules to generate. (default: 100)

  • num_min_node (int, optional) – The minimum number of nodes in the generated molecular graphs. (default: 7)

  • num_max_node (int, optional) – The maximum number of nodes in the generated molecular graphs. (default: 25)

  • temperature (float, optional) – A float, the temperature parameter of the prior distribution. (default: 0.75)

  • atomic_num_list (list, optional) – A list of integers, the atomic numbers indicating the node types in the generated molecular graphs. (default: [6, 7, 8, 9])

Return type

all_mols, a list of generated molecules represented by rdkit Chem.Mol objects.

run_rand_gen(model_conf_dict, checkpoint_path, n_mols=100, num_min_node=7, num_max_node=25, temperature=0.75, atomic_num_list=[6, 7, 8, 9])[source]

Running graph generation for the random generation task.

Parameters
  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • checkpoint_path (str) – The path to the saved model checkpoint file.

  • n_mols (int, optional) – The number of molecules to generate. (default: 100)

  • num_min_node (int, optional) – The minimum number of nodes in the generated molecular graphs. (default: 7)

  • num_max_node (int, optional) – The maximum number of nodes in the generated molecular graphs. (default: 25)

  • temperature (float, optional) – A float, the temperature parameter of the prior distribution. (default: 0.75)

  • atomic_num_list (list, optional) – A list of integers, the atomic numbers indicating the node types in the generated molecular graphs. (default: [6, 7, 8, 9])

Return type

(all_mols, pure_valids): all_mols is a list of generated molecules represented by rdkit Chem.Mol objects; pure_valids is a list of 0/1 integers indicating, for each molecule, whether it was generated without bond resampling (1 means no resampling was needed).
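A hedged usage sketch for random generation with GraphAF; the contents of model_conf_dict and the checkpoint path below are assumptions that depend on the benchmark configuration:

    from dig.ggraph.method import GraphAF

    runner = GraphAF()
    model_conf = {}  # fill with the benchmark's GraphAF hyperparameters (keys are an assumption)
    all_mols, pure_valids = runner.run_rand_gen(
        model_conf_dict=model_conf,
        checkpoint_path='./graphaf_rand_gen.pth',  # hypothetical checkpoint file
        n_mols=100, num_min_node=7, num_max_node=25,
        temperature=0.75, atomic_num_list=[6, 7, 8, 9])
    # Fraction of molecules generated validly without bond resampling.
    print(sum(pure_valids) / len(all_mols))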

train_cons_optim(loader, lr, wd, max_iters, warm_up, model_conf_dict, pretrain_path, save_interval, save_dir)[source]

Running fine-tuning for the constrained optimization task.

Parameters
  • loader – The data loader for loading training samples. Use dig.ggraph.dataset.ZINC800 as the dataset class and wrap it with torch_geometric.data.DenseDataLoader to form the data loader.

  • lr (float) – The learning rate for training.

  • wd (float) – The weight decay factor for training.

  • max_iters (int) – The maximum number of training iterations.

  • warm_up (int) – The number of linear warm-up iterations.

  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • pretrain_path (str) – The path to the saved pretrained model parameters file.

  • save_interval (int) – The frequency at which the model parameters are saved to .pth files; e.g., if save_interval=20, the model parameters are saved every 20 training iterations.

  • save_dir (str) – The directory to save the model parameters.

train_prop_optim(lr, wd, max_iters, warm_up, model_conf_dict, pretrain_path, save_interval, save_dir)[source]

Running fine-tuning for the property optimization task.

Parameters
  • lr (float) – The learning rate for fine-tuning.

  • wd (float) – The weight decay factor for training.

  • max_iters (int) – The maximum number of training iterations.

  • warm_up (int) – The number of linear warm-up iterations.

  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • pretrain_path (str) – The path to the saved pretrained model file.

  • save_interval (int) – The frequency at which the model parameters are saved to .pth files; e.g., if save_interval=20, the model parameters are saved every 20 training iterations.

  • save_dir (str) – The directory to save the model parameters.

train_rand_gen(loader, lr, wd, max_epochs, model_conf_dict, save_interval, save_dir)[source]

Running training for the random generation task. A usage sketch follows the parameter list.

Parameters
  • loader – The data loader for loading training samples. Use dig.ggraph.dataset.QM9 or dig.ggraph.dataset.ZINC250k as the dataset class and wrap it with torch_geometric.data.DenseDataLoader to form the data loader.

  • lr (float) – The learning rate for training.

  • wd (float) – The weight decay factor for training.

  • max_epochs (int) – The maximum number of training epochs.

  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • save_interval (int) – The frequency at which the model parameters are saved to .pth files; e.g., if save_interval=2, the model parameters are saved every 2 training epochs.

  • save_dir (str) – The directory to save the model parameters.
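A hedged training sketch; the dataset's default constructor arguments and the hyperparameter values below are illustrative assumptions:

    from dig.ggraph.dataset import QM9
    from dig.ggraph.method import GraphAF
    from torch_geometric.data import DenseDataLoader

    dataset = QM9()  # constructor arguments left at their defaults (assumption)
    loader = DenseDataLoader(dataset, batch_size=32, shuffle=True)
    runner = GraphAF()
    model_conf = {}  # the benchmark's GraphAF hyperparameter dict (keys are an assumption)
    runner.train_rand_gen(loader=loader, lr=0.001, wd=0, max_epochs=10,
                          model_conf_dict=model_conf,
                          save_interval=2, save_dir='./graphaf_ckpts')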

class GraphDF[source]

The method class for the GraphDF algorithm proposed in the paper GraphDF: A Discrete Flow Model for Molecular Graph Generation. This class provides interfaces for running random generation, property optimization, and constrained optimization with the GraphDF algorithm. Please refer to the benchmark codes for usage examples.

run_const_prop_opt(dataset, model_conf_dict, checkpoint_path, repeat_time=200, min_optim_time=50, num_max_node=25, temperature=[0.3, 0.3], atomic_num_list=[6, 7, 8, 9])[source]

Running molecule optimization for the constrained optimization task.

Parameters
  • dataset – The dataset class for loading molecules to be optimized. Use dig.ggraph.dataset.ZINC800 as the dataset class.

  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • checkpoint_path (str) – The path to the saved model checkpoint file.

  • repeat_time (int, optional) – The maximum number of optimization attempts for each molecule before it is successfully optimized under the similarity threshold 0.6. (default: 200)

  • min_optim_time (int, optional) – The minimum number of optimization attempts for each molecule. (default: 50)

  • num_max_node (int, optional) – The maximum number of nodes in the optimized molecular graphs. (default: 25)

  • temperature (list, optional) – A list of two floats, the temperature parameters of the prior distribution. (default: [0.3, 0.3])

  • atomic_num_list (list, optional) – A list of integers, the atomic numbers indicating the node types in the optimized molecular graphs. (default: [6, 7, 8, 9])

Return type

A tuple (mols_0, mols_2, mols_4, mols_6) of lists of optimized molecules (represented by rdkit Chem.Mol objects) under the similarity thresholds 0.0, 0.2, 0.4, and 0.6, respectively.

run_prop_opt(model_conf_dict, checkpoint_path, n_mols=100, num_min_node=7, num_max_node=25, temperature=[0.3, 0.3], atomic_num_list=[6, 7, 8, 9])[source]

Running graph generation for the property optimization task.

Parameters
  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • checkpoint_path (str) – The path to the saved model checkpoint file.

  • n_mols (int, optional) – The number of molecules to generate. (default: 100)

  • num_min_node (int, optional) – The minimum number of nodes in the generated molecular graphs. (default: 7)

  • num_max_node (int, optional) – The maximum number of nodes in the generated molecular graphs. (default: 25)

  • temperature (list, optional) – A list of two floats, the temperature parameters of the prior distribution. (default: [0.3, 0.3])

  • atomic_num_list (list, optional) – A list of integers, the atomic numbers indicating the node types in the generated molecular graphs. (default: [6, 7, 8, 9])

Return type

all_mols, a list of generated molecules represented by rdkit Chem.Mol objects.

run_rand_gen(model_conf_dict, checkpoint_path, n_mols=100, num_min_node=7, num_max_node=25, temperature=[0.3, 0.3], atomic_num_list=[6, 7, 8, 9])[source]

Running graph generation for the random generation task.

Parameters
  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • checkpoint_path (str) – The path to the saved model checkpoint file.

  • n_mols (int, optional) – The number of molecules to generate. (default: 100)

  • num_min_node (int, optional) – The minimum number of nodes in the generated molecular graphs. (default: 7)

  • num_max_node (int, optional) – The maximum number of nodes in the generated molecular graphs. (default: 25)

  • temperature (list, optional) – A list of two floats, the temperature parameters of the prior distribution. (default: [0.3, 0.3])

  • atomic_num_list (list, optional) – A list of integers, the atomic numbers indicating the node types in the generated molecular graphs. (default: [6, 7, 8, 9])

Return type

(all_mols, pure_valids): all_mols is a list of generated molecules represented by rdkit Chem.Mol objects; pure_valids is a list of 0/1 integers indicating, for each molecule, whether it was generated without bond resampling (1 means no resampling was needed).
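The interface mirrors GraphAF's run_rand_gen, except that temperature is a list of two floats rather than a scalar. A hedged sketch with illustrative values:

    from dig.ggraph.method import GraphDF

    runner = GraphDF()
    all_mols, pure_valids = runner.run_rand_gen(
        model_conf_dict={},  # the benchmark's GraphDF hyperparameters (keys are an assumption)
        checkpoint_path='./graphdf_rand_gen.pth',  # hypothetical checkpoint file
        n_mols=100, num_min_node=7, num_max_node=25,
        temperature=[0.3, 0.3], atomic_num_list=[6, 7, 8, 9])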

train_const_prop_opt(loader, lr, wd, max_iters, warm_up, model_conf_dict, pretrain_path, save_interval, save_dir)[source]

Running fine-tuning for the constrained optimization task.

Parameters
  • loader – The data loader for loading training samples. Use dig.ggraph.dataset.ZINC800 as the dataset class and wrap it with torch_geometric.data.DenseDataLoader to form the data loader.

  • lr (float) – The learning rate for training.

  • wd (float) – The weight decay factor for training.

  • max_iters (int) – The maximum number of training iterations.

  • warm_up (int) – The number of linear warm-up iterations.

  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • pretrain_path (str) – The path to the saved pretrained model parameters file.

  • save_interval (int) – The frequency at which the model parameters are saved to .pth files; e.g., if save_interval=20, the model parameters are saved every 20 training iterations.

  • save_dir (str) – The directory to save the model parameters.

train_prop_opt(lr, wd, max_iters, warm_up, model_conf_dict, pretrain_path, save_interval, save_dir)[source]

Running fine-tuning for the property optimization task.

Parameters
  • lr (float) – The learning rate for fine-tuning.

  • wd (float) – The weight decay factor for training.

  • max_iters (int) – The maximum number of training iterations.

  • warm_up (int) – The number of linear warm-up iterations.

  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • pretrain_path (str) – The path to the saved pretrained model file.

  • save_interval (int) – The frequency at which the model parameters are saved to .pth files; e.g., if save_interval=20, the model parameters are saved every 20 training iterations.

  • save_dir (str) – The directory to save the model parameters.

train_rand_gen(loader, lr, wd, max_epochs, model_conf_dict, save_interval, save_dir)[source]

Running training for the random generation task.

Parameters
  • loader – The data loader for loading training samples. Use dig.ggraph.dataset.QM9, dig.ggraph.dataset.ZINC250k, or dig.ggraph.dataset.MOSES as the dataset class and wrap it with torch_geometric.data.DenseDataLoader to form the data loader.

  • lr (float) – The learning rate for training.

  • wd (float) – The weight decay factor for training.

  • max_epochs (int) – The maximum number of training epochs.

  • model_conf_dict (dict) – The python dict for configuring the model hyperparameters.

  • save_interval (int) – The frequency at which the model parameters are saved to .pth files; e.g., if save_interval=2, the model parameters are saved every 2 training epochs.

  • save_dir (str) – The directory to save the model parameters.

class GraphEBM(n_atom, n_atom_type, n_edge_type, hidden, device=None)[source]

The method class for the GraphEBM algorithm proposed in the paper GraphEBM: Molecular Graph Generation with Energy-Based Models. This class provides interfaces for running random generation, goal-directed generation (including property optimization and constrained optimization), and compositional generation with the GraphEBM algorithm. Please refer to the benchmark codes for usage examples; a construction sketch follows the parameter list.

Parameters
  • n_atom (int) – The maximum number of atoms.

  • n_atom_type (int) – The number of possible atom types.

  • n_edge_type (int) – The number of possible bond types.

  • hidden (int) – The number of hidden dimensions.

  • device (torch.device, optional) – The device where the model is deployed.
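Unlike GraphAF and GraphDF, GraphEBM takes its model dimensions at construction time. A hedged construction sketch; the dimension values below are illustrative assumptions for a QM9-like setup:

    import torch
    from dig.ggraph.method import GraphEBM

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    graphebm = GraphEBM(n_atom=9,       # max heavy atoms per molecule (assumption for QM9)
                        n_atom_type=5,  # e.g., C, N, O, F plus a virtual/padding type (assumption)
                        n_edge_type=4,  # e.g., single/double/triple bond plus no-bond (assumption)
                        hidden=64, device=device)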

run_comp_gen(checkpoint_path_qed, checkpoint_path_plogp, n_samples, c, ld_step, ld_noise, ld_step_size, clamp, atomic_num_list)[source]

Running graph generation for the compositional generation task.

Parameters
  • checkpoint_path_qed (str) – The path to the model checkpoint (.pt file) trained on the QED property.

  • checkpoint_path_plogp (str) – The path to the model checkpoint (.pt file) trained on the penalized logP (plogp) property.

  • n_samples (int) – The number of molecules to generate.

  • c (float) – The scaling hyperparameter for dequantization.

  • ld_step (int) – The number of iteration steps of Langevin dynamics.

  • ld_noise (float) – The standard deviation of the added noise in Langevin dynamics.

  • ld_step_size (int) – The step size of Langevin dynamics.

  • clamp (bool) – Whether to use gradient clamping in Langevin dynamics.

  • atomic_num_list (list) – The list used to indicate atom types.

Return type

gen_mols (list): A list of generated molecules represented by rdkit Chem.Mol objects.

run_const_prop_opt(checkpoint_path, initialization_loader, c, ld_step, ld_noise, ld_step_size, clamp, atomic_num_list, train_smiles)[source]

Running graph generation for the goal-directed generation task: constrained property optimization.

Parameters
  • checkpoint_path (str) – The path of the trained model, i.e., the .pt file.

  • initialization_loader – The data loader for loading samples to initialize the Langevin dynamics. Use dig.ggraph.dataset.QM9 or dig.ggraph.dataset.ZINC250k as the dataset class and wrap it with torch_geometric.data.DenseDataLoader to form the data loader.

  • c (float) – The scaling hyperparameter for dequantization.

  • ld_step (int) – The number of iteration steps of Langevin dynamics.

  • ld_noise (float) – The standard deviation of the added noise in Langevin dynamics.

  • ld_step_size (int) – The step size of Langevin dynamics.

  • clamp (bool) – Whether to use gradient clamping in Langevin dynamics.

  • atomic_num_list (list) – The list used to indicate atom types.

  • train_smiles (list) – A list of SMILES strings corresponding to the training samples.

Return type

mols_0_list (list), mols_2_list (list), mols_4_list (list), mols_6_list (list), imp_0_list (list), imp_2_list (list), imp_4_list (list), imp_6_list (list): lists of optimized molecules (represented by rdkit Chem.Mol objects) and the corresponding improvements under the similarity thresholds 0.0, 0.2, 0.4, and 0.6, respectively.

run_prop_opt(checkpoint_path, initialization_loader, c, ld_step, ld_noise, ld_step_size, clamp, atomic_num_list, train_smiles)[source]

Running graph generation for the goal-directed generation task: property optimization.

Parameters
  • checkpoint_path (str) – The path of the trained model, i.e., the .pt file.

  • initialization_loader – The data loader for loading samples to initialize the Langevin dynamics. Use dig.ggraph.dataset.QM9 or dig.ggraph.dataset.ZINC250k as the dataset class and wrap it with torch_geometric.data.DenseDataLoader to form the data loader.

  • c (float) – The scaling hyperparameter for dequantization.

  • ld_step (int) – The number of iteration steps of Langevin dynamics.

  • ld_noise (float) – The standard deviation of the added noise in Langevin dynamics.

  • ld_step_size (int) – The step size of Langevin dynamics.

  • clamp (bool) – Whether to use gradient clamping in Langevin dynamics.

  • atomic_num_list (list) – The list used to indicate atom types.

  • train_smiles (list) – A list of SMILES strings corresponding to the training samples.

Return type

save_mols_list (list), prop_list (list): save_mols_list is a list of generated molecules with high QED scores represented by rdkit Chem.Mol objects; prop_list is a list of the corresponding QED scores.

run_rand_gen(checkpoint_path, n_samples, c, ld_step, ld_noise, ld_step_size, clamp, atomic_num_list)[source]

Running graph generation for the random generation task.

Parameters
  • checkpoint_path (str) – The path of the trained model, i.e., the .pt file.

  • n_samples (int) – The number of molecules to generate.

  • c (float) – The scaling hyperparameter for dequantization.

  • ld_step (int) – The number of iteration steps of Langevin dynamics.

  • ld_noise (float) – The standard deviation of the added noise in Langevin dynamics.

  • ld_step_size (int) – The step size of Langevin dynamics.

  • clamp (bool) – Whether to use gradient clamping in Langevin dynamics.

  • atomic_num_list (list) – The list used to indicate atom types.

Return type

gen_mols (list): A list of generated molecules represented by rdkit Chem.Mol objects.
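A hedged generation sketch; the model dimensions, checkpoint path, and Langevin dynamics values below are illustrative assumptions rather than tuned settings:

    from dig.ggraph.method import GraphEBM

    graphebm = GraphEBM(n_atom=9, n_atom_type=5, n_edge_type=4, hidden=64)  # illustrative dims
    gen_mols = graphebm.run_rand_gen(
        checkpoint_path='./graphebm_qm9.pt',  # hypothetical checkpoint file
        n_samples=100,
        c=0, ld_step=150, ld_noise=0.005, ld_step_size=30, clamp=True,  # illustrative values
        atomic_num_list=[6, 7, 8, 9])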

train_goal_directed(loader, lr, wd, max_epochs, c, ld_step, ld_noise, ld_step_size, clamp, alpha, save_interval, save_dir)[source]

Running training for the goal-directed generation task.

Parameters
  • loader – The data loader for loading training samples. Use dig.ggraph.dataset.QM9 or dig.ggraph.dataset.ZINC250k as the dataset class and wrap it with torch_geometric.data.DenseDataLoader to form the data loader.

  • lr (float) – The learning rate for training.

  • wd (float) – The weight decay factor for training.

  • max_epochs (int) – The maximum number of training epochs.

  • c (float) – The scaling hyperparameter for dequantization.

  • ld_step (int) – The number of iteration steps of Langevin dynamics.

  • ld_noise (float) – The standard deviation of the added noise in Langevin dynamics.

  • ld_step_size (int) – The step size of Langevin dynamics.

  • clamp (bool) – Whether to use gradient clamping in Langevin dynamics.

  • alpha (float) – The weight coefficient of the loss function.

  • save_interval (int) – The frequency at which the model parameters are saved to .pt files; e.g., if save_interval=2, the model parameters are saved every 2 training epochs.

  • save_dir (str) – The directory to save the model parameters.

train_rand_gen(loader, lr, wd, max_epochs, c, ld_step, ld_noise, ld_step_size, clamp, alpha, save_interval, save_dir)[source]

Running training for the random generation task. A training sketch follows the parameter list.

Parameters
  • loader – The data loader for loading training samples. Use dig.ggraph.dataset.QM9 or dig.ggraph.dataset.ZINC250k as the dataset class and wrap it with torch_geometric.data.DenseDataLoader to form the data loader.

  • lr (float) – The learning rate for training.

  • wd (float) – The weight decay factor for training.

  • max_epochs (int) – The maximum number of training epochs.

  • c (float) – The scaling hyperparameter for dequantization.

  • ld_step (int) – The number of iteration steps of Langevin dynamics.

  • ld_noise (float) – The standard deviation of the added noise in Langevin dynamics.

  • ld_step_size (int) – The step size of Langevin dynamics.

  • clamp (bool) – Whether to use gradient clamping in Langevin dynamics.

  • alpha (float) – The weight coefficient of the loss function.

  • save_interval (int) – The frequency at which the model parameters are saved to .pt files; e.g., if save_interval=2, the model parameters are saved every 2 training epochs.

  • save_dir (str) – The directory to save the model parameters.
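A hedged training sketch; the dataset defaults, model dimensions, and hyperparameter values below are illustrative assumptions:

    from dig.ggraph.dataset import QM9
    from dig.ggraph.method import GraphEBM
    from torch_geometric.data import DenseDataLoader

    dataset = QM9()  # constructor arguments left at their defaults (assumption)
    loader = DenseDataLoader(dataset, batch_size=128, shuffle=True)
    graphebm = GraphEBM(n_atom=9, n_atom_type=5, n_edge_type=4, hidden=64)  # illustrative dims
    graphebm.train_rand_gen(loader=loader, lr=0.0001, wd=0, max_epochs=20,
                            c=0, ld_step=150, ld_noise=0.005, ld_step_size=30,
                            clamp=True, alpha=1, save_interval=2,
                            save_dir='./graphebm_ckpts')  # values are illustrative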

class JTVAE(list_smiles, build_vocab=True, device=None)[source]

The method class for the JTVAE algorithm proposed in the paper Junction Tree Variational Autoencoder for Molecular Graph Generation. This class provides interfaces for running random generation and constrained optimization with the JTVAE algorithm. Please refer to the benchmark codes for usage examples; a construction sketch follows the parameter list.

Parameters
  • list_smiles (list) – The list of SMILES strings in the training data.

  • build_vocab (boolean, optional) – Whether to build the vocabulary (needed the first time training on a dataset). (default: True)

  • device (torch.device, optional) – The device where the model is deployed.
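A hedged construction sketch; the toy SMILES list below is illustrative, whereas real usage passes the full training set:

    import torch
    from dig.ggraph.method import JTVAE

    smiles = ['CCO', 'CC(=O)O', 'c1ccccc1O']  # toy training SMILES (illustrative)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    jtvae = JTVAE(list_smiles=smiles, build_vocab=True, device=device)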

build_vocabulary(list_smiles)[source]

Building the vocabulary for training.

Parameters

list_smiles (list) – The list of SMILES strings in the dataset.

Return type

cset (list): A list of SMILES strings that form the vocabulary of the training data.

preprocess(list_smiles)[source]

Preprocess the molecules.

Parameters

list_smiles (list) – The list of SMILES strings in the dataset.

Return type

preprocessed (list): A list of preprocessed MolTree objects.

run_cons_optim(list_smiles, sim_cutoff=0.0)[source]

Optimize a set of molecules.

Parameters
  • list_smiles (list) – The list of SMILES strings in the training data.

  • sim_cutoff (float, optional) – The molecular similarity cutoff for the constrained optimization. (default: 0.0)

run_rand_gen(num_samples)[source]

Sample new molecules from the trained model.

Parameters

num_samples (int) – Number of samples to generate from the trained model.

Return type

samples (list): A list of generated molecules.
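A hedged sampling sketch; a model must be trained with train_rand_gen before sampling, which is elided here:

    from dig.ggraph.method import JTVAE

    smiles = ['CCO', 'CC(=O)O', 'c1ccccc1O']  # toy training SMILES (illustrative)
    jtvae = JTVAE(list_smiles=smiles)
    # ... train with jtvae.train_rand_gen(...) before sampling (omitted here) ...
    samples = jtvae.run_rand_gen(num_samples=10)  # a list of generated molecules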

train_cons_optim(loader, batch_size, num_epochs, hidden_size, latent_size, depth, beta, lr)[source]

Train the Junction Tree Variational Autoencoder for the constrained optimization task.

Parameters
  • loader (MolTreeFolder) – The MolTreeFolder loader.

  • batch_size (int) – The batch size.

  • num_epochs (int) – The number of epochs.

  • hidden_size (int) – The hidden size.

  • latent_size (int) – The latent size.

  • depth (int) – The depth of the network.

  • beta (float) – The KL regularization weight.

  • lr (float) – The learning rate for training.

train_rand_gen(loader, load_epoch, lr, anneal_rate, clip_norm, num_epochs, beta, max_beta, step_beta, anneal_iter, kl_anneal_iter, print_iter, save_iter)[source]

Train the Junction Tree Variational Autoencoder for the random generation task. A call sketch follows the parameter list.

Parameters
  • loader (MolTreeFolder) – The MolTreeFolder loader.

  • load_epoch (int) – The epoch whose saved state dictionary to load when resuming training.

  • lr (float) – The learning rate for training.

  • anneal_rate (float) – The multiplicative decay rate used to anneal the learning rate.

  • clip_norm (float) – The maximum norm used for gradient clipping.

  • num_epochs (int) – The number of training epochs.

  • beta (float) – The KL regularization weight.

  • max_beta (float) – The maximum KL regularization weight.

  • step_beta (float) – The KL regularization weight step size.

  • anneal_iter (int) – How often to step in annealing the learning rate.

  • kl_anneal_iter (int) – How often to step in annealing the KL regularization weight.

  • print_iter (int) – How often to print the iteration statistics.

  • save_iter (int) – How often to save the model parameters.
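A hedged call sketch; the MolTreeFolder loader construction is benchmark-specific and only stubbed here, and the hyperparameter values are illustrative assumptions:

    from dig.ggraph.method import JTVAE

    jtvae = JTVAE(list_smiles=['CCO', 'CC(=O)O', 'c1ccccc1O'])  # toy data (illustrative)
    # `loader` stands in for a MolTreeFolder built from the output of
    # jtvae.preprocess(), following the benchmark code (construction omitted).
    loader = ...
    jtvae.train_rand_gen(loader=loader, load_epoch=0, lr=0.001, anneal_rate=0.9,
                         clip_norm=50.0, num_epochs=20, beta=0.0, max_beta=1.0,
                         step_beta=0.002, anneal_iter=40000, kl_anneal_iter=2000,
                         print_iter=50, save_iter=5000)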