examol.score

Tools for quickly assigning scores to a molecule

examol.score.base

Base classes for scoring functions

class examol.score.base.MultiFidelityScorer[source]

Bases: Scorer

Base class for scorers which support multi-fidelity learning

All subclasses support a “lower_fidelities” keyword argument to the score() and retrain() functions that takes any lower-fidelity information available. Subclasses should train a multi-fidelity model if provided lower-fidelity data during training and use the lower-fidelity data to enhance prediction accuracy during scoring.
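
For example, a round of training and scoring with lower-fidelity data might look like the following sketch (the low_fi array and its exact layout are illustrative; see the concrete subclass for the expected format):

scorer = MultiFidelityScorer()  # In practice, a concrete subclass such as NFPScorer
low_fi = ...  # Lower-fidelity values gathered so far (layout is subclass-specific)
update_msg = scorer.retrain(model_msg, inputs, outputs, lower_fidelities=low_fi)
scores = scorer.score(model_msg, inputs, lower_fidelities=low_fi)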

retrain(model_msg: object, input_data: list, output_data: list, lower_fidelities: ndarray | None = None, **kwargs) object[source]

Retrain the scorer based on new training records

Parameters:
  • model_msg – Model to be retrained

  • input_data – Training set inputs, as generated by transform_inputs()

  • output_data – Training set outputs, as generated by transform_outputs()

  • lower_fidelities – Lower-fidelity data, if available

Returns:

Message defining how to update the model

score(model_msg: object, input_data: list, lower_fidelities: ndarray | None = None, **kwargs) ndarray[source]

Assign a score to molecules

Parameters:
  • model_msg – Model in a transmittable format, may need to be deserialized

  • input_data – Batch of inputs ready for the model, as generated by transform_inputs()

  • lower_fidelities – Properties of the molecule at lower levels, if known

Returns:

The scores for a set of records

class examol.score.base.Scorer[source]

Bases: object

Base class for algorithms which quickly assign a score to a molecule, typically using a machine learning model

Using a Scorer

Scoring a molecule requires transforming the molecule into a form compatible with a machine learning algorithm, then executing inference using the machine learning algorithm. We separate these two steps so that the former can run on local resources and the latter can run on larger remote resources. Running the scorer will then look something like

scorer = Scorer()
recipe = PropertyRecipe()  # Recipe that we are trying to predict
model = ...   # The model that we'll be sending to workers
inputs = scorer.transform_inputs(records)  # Readies records to run inference
model_msg = scorer.prepare_message(model)  # Readies model to be sent to a remote worker
scorer.score(model_msg, inputs)  # Can be run remotely

Note how the Scorer class does not hold on to the model as state. The Scorer is just the tool which holds the code needed to train and run the model.

Training operations are broken apart for similar reasons: we separate pre-processing the inputs and outputs, running the training itself, and updating a local copy of the model with the results of training.
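
A sketch of a full training round using only the methods documented below (variable names are illustrative):

inputs = scorer.transform_inputs(records)  # Pre-process on local resources
outputs = scorer.transform_outputs(records, recipe)
model_msg = scorer.prepare_message(model, training=True)  # Ready the model for a remote worker
update_msg = scorer.retrain(model_msg, inputs, outputs)  # Can be run remotely
model = scorer.update(model, update_msg)  # Apply the training results locally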

prepare_message(model: object, training: bool = False) object[source]

Get the model state as a serializable object

Parameters:
  • model – Model to be sent to score or retrain function

  • training – Whether to prepare the message for training or inference

Returns:

The model state as an object which can be serialized and then transmitted to a remote worker

retrain(model_msg: object, input_data: list, output_data: list, **kwargs) object[source]

Retrain the scorer based on new training records

Parameters:
  • model_msg – Model to be retrained

  • input_data – Training set inputs, as generated by transform_inputs()

  • output_data – Training set outputs, as generated by transform_outputs()

Returns:

Message defining how to update the model

score(model_msg: object, input_data: list, **kwargs) ndarray[source]

Assign a score to molecules

Parameters:
  • model_msg – Model in a transmittable format, may need to be deserialized

  • input_data – Batch of inputs ready for the model, as generated by transform_inputs()

Returns:

The scores for a set of records

transform_inputs(record_batch: list[MoleculeRecord]) list[source]

Form inputs for the model based on the data in a molecule record

Parameters:

record_batch – List of records to pre-process

Returns:

List of inputs ready for score() or retrain()

transform_outputs(records: list[MoleculeRecord], recipe: PropertyRecipe) ndarray[source]

Gather the target outputs of the model

Parameters:
  • records – List of records from which to extract outputs

  • recipe – Target recipe for the scorer for single-fidelity learning

Returns:

Outputs ready for model training

update(model: object, update_msg: object) object[source]

Update this local copy of a model

Parameters:
  • model – Model to be updated

  • update_msg – Update for the model

Returns:

Updated model

examol.score.nfp

Train neural network models using NFP

class examol.score.nfp.NFPMessage(model: tensorflow.keras.Model)[source]

Bases: object

Package for sending an MPNN model over connections that require pickling

get_model() tensorflow.keras.Model[source]

Get a copy of the model

Returns:

The model specified by this message
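
A minimal sketch of the intended round trip (only the standard pickle module is assumed):

import pickle

msg = NFPMessage(model)  # Wrap a Keras MPNN for transmission
payload = pickle.dumps(msg)  # Safe to send over connections that require pickling
model_copy = pickle.loads(payload).get_model()  # Rebuild the model on the receiving side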

class examol.score.nfp.NFPScorer(retrain_from_scratch: bool = True)[source]

Bases: MultiFidelityScorer

Train message-passing neural networks based on the NFP library.

NFP uses Keras to define message-passing networks; Keras is backed by TensorFlow for executing the networks on different hardware.

Multi-fidelity models predict the lowest, most-plentiful level of fidelity directly and correction factors to adjust the low-level predictions for the higher levels (i.e., delta learning). Training does not require all levels of fidelity to be available and will only measure loss against the available data. Inference predicts the low-fidelity value and all correction factors for higher levels, but uses known values in place of them if available.
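
As an illustration of the delta-learning scheme described above (a simplified sketch with a single lower level, not the library's internal code; NaN marking unknown values is an assumption):

import numpy as np

# The model predicts the low-fidelity value and a correction factor
low_pred, delta_pred = model_outputs[:, 0], model_outputs[:, 1]

# Substitute the measured low-fidelity value where one is known
low = np.where(np.isnan(known_low_fi), low_pred, known_low_fi)

# High-fidelity estimate: low-fidelity value plus the learned correction
high_estimate = low + delta_pred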

prepare_message(model: tensorflow.keras.models.Model, training: bool = False) dict | NFPMessage[source]

Get the model state as a serializable object

Parameters:
  • model – Model to be sent to score or retrain function

  • training – Whether to prepare the message for training or inference

Returns:

The model state as an object which can be serialized and then transmitted to a remote worker

retrain(model_msg: dict | NFPMessage, input_data: list, output_data: ndarray, lower_fidelities: None | ndarray = None, num_epochs: int = 4, batch_size: int = 32, validation_split: float = 0.1, learning_rate: float = 0.001, device_type: str = 'gpu', steps_per_exec: int = 1, patience: int = None, timeout: float = None, verbose: bool = False) tuple[list[ndarray], dict][source]

Retrain the scorer based on new training records

Parameters:
  • model_msg – Model to be retrained

  • input_data – Training set inputs, as generated by transform_inputs()

  • output_data – Training Set outputs, as generated by transform_outputs()

  • lower_fidelities – Lower-fidelity data, if available

  • num_epochs – Maximum number of epochs to run

  • batch_size – Number of molecules per training batch

  • validation_split – Fraction of molecules held out from training for validation

  • learning_rate – Learning rate for the Adam optimizer

  • device_type – Type of device used for training

  • steps_per_exec – Number of training steps to run per execution on the accelerator

  • patience – Number of epochs without improvement before terminating training. Default is 10% of num_epochs

  • timeout – Maximum training time in seconds

  • verbose – Whether to print training information to screen

Returns:

Message defining how to update the model
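
For instance, a hypothetical retraining call (argument values are illustrative, not recommendations):

model_msg = scorer.prepare_message(model, training=True)
update_msg = scorer.retrain(
    model_msg, inputs, outputs,
    num_epochs=64, batch_size=32, patience=8, timeout=3600,
)
model = scorer.update(model, update_msg)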

score(model_msg: NFPMessage, input_data: list[dict | tuple[dict, ndarray]], batch_size: int = 64, lower_fidelities: ndarray | None = None, **kwargs) ndarray[source]

Assign a score to molecules

Parameters:
  • model_msg – Model in a transmittable format

  • input_data – Batch of inputs ready for the model (in dictionary format)

  • batch_size – Number of molecules to evaluate at a time

  • lower_fidelities – Properties of the molecule at lower levels, if known

Returns:

The scores for a set of records

transform_inputs(record_batch: list[MoleculeRecord]) list[dict][source]

Form inputs for the model based on the data in a molecule record

Parameters:

record_batch – List of records to pre-process

Returns:

List of inputs ready for score() or retrain()

update(model: tensorflow.keras.models.Model, update_msg: tuple[list[ndarray], dict]) tensorflow.keras.models.Model[source]

Update this local copy of a model

Parameters:
  • model – Model to be updated

  • update_msg – Update for the model

Returns:

Updated model

class examol.score.nfp.ReduceAtoms(*args: Any, **kwargs: Any)[source]

Bases: Layer

Reduce the atoms along a certain direction

Parameters:

reduction_op – Name of the operation used for reduction

call(inputs, mask=None)[source]

Parameters:
  • inputs – Matrix to be reduced

  • mask – Identifies which of the rows being reduced are placeholders

get_config()[source]

examol.score.nfp.convert_string_to_dict(mol_string: str) dict[source]

Convert a molecule to an NFP-compatible dictionary form

Parameters:

mol_string – SMILES or InChI string

Returns:

Dictionary describing the molecule in NFP-compatible format
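
For example (a one-line sketch; 'CCO' is the SMILES string for ethanol):

mol_dict = convert_string_to_dict('CCO')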

examol.score.nfp.make_data_loader(mol_dicts: list[dict], values: ndarray | list[object] | None = None, batch_size: int = 32, repeat: bool = False, shuffle_buffer: int | None = None, value_spec: tensorflow.TensorSpec = tensorflow.TensorSpec, drop_last_batch: bool = False) tensorflow.data.Dataset[source]

Make an in-memory data loader for data compatible with NFP-style neural networks

Parameters:
  • mol_dicts – List of molecules parsed into the moldesign format

  • values – List of output values, if included in the output

  • value_spec – Tensorflow specification for the output

  • batch_size – Number of molecules per batch

  • repeat – Whether to create an infinitely-repeating iterator

  • shuffle_buffer – Size of a shuffle buffer. Use None to leave data unshuffled

  • drop_last_batch – Whether to drop the last batch in the dataset. Set to True if, for example, you need every batch to be the same size

Returns:

Data loader that generates molecules in the desired shapes
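
A minimal sketch of building loaders for training and for inference (the SMILES strings and values are illustrative; the default scalar value_spec is assumed to fit these outputs):

mol_dicts = [convert_string_to_dict(s) for s in ['C', 'CC', 'CCO']]
train_loader = make_data_loader(mol_dicts, values=[0.1, 0.4, 0.3], batch_size=2, shuffle_buffer=128)
infer_loader = make_data_loader(mol_dicts, batch_size=2)  # No values needed for inference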

examol.score.nfp.make_simple_network(atom_features: int = 64, message_steps: int = 8, output_layers: list[int] = (512, 256, 128), reduce_op: str = 'mean', atomwise: bool = True, outputs: int = 1) tensorflow.keras.models.Model[source]

Construct a basic MPNN model using the settings provided by a user

Models will have embeddings for atoms with atomic numbers up to 63, and 4 types of bonds (single, double, triple, aromatic).

The models use edge, node, and global updates for each message-passing layer and a separate set of MLPs for each of the outputs. There is also a “scaling” layer which can be used to adjust the mean and standard deviation of the prediction.

Parameters:
  • atom_features – Number of features used per atom and bond

  • message_steps – Number of message passing steps

  • output_layers – Number of neurons in the readout layers

  • reduce_op – Operation used to reduce from atom-level to molecule-level vectors

  • atomwise – Whether to reduce atomwise contributions after the output layers, or reduce to a single vector per molecule before the output layers

  • outputs – Number of output properties. Each will use their own output network

Returns:

A model instantiated with the user-defined options
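
For example, a small network with two outputs might be constructed as follows (layer sizes are illustrative):

model = make_simple_network(
    atom_features=32,
    message_steps=4,
    output_layers=[64, 32],
    outputs=2,  # Each output gets its own MLP head
)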

examol.score.rdkit