examol.score
Tools for quickly assigning scores to a molecule
examol.score.base
Base classes for scoring functions
- class examol.score.base.MultiFidelityScorer
Bases:
Scorer
Base class for scorers which support multi-fidelity learning
All subclasses support a “lower_fidelities” keyword argument to the score() and retrain() functions that takes any lower-fidelity information available. Subclasses should train a multi-fidelity model if provided lower-fidelity data during training and use the lower-fidelity data to enhance prediction accuracy during scoring.
- retrain(model_msg: object, input_data: list, output_data: list, lower_fidelities: ndarray | None = None, **kwargs) -> object
Retrain the scorer based on new training records
- Parameters:
model_msg – Model to be retrained
input_data – Training set inputs, as generated by
transform_inputs()
output_data – Training set outputs, as generated by
transform_outputs()
lower_fidelities – Lower-fidelity data, if available
- Returns:
Message defining how to update the model
- score(model_msg: object, input_data: list, lower_fidelities: ndarray | None = None, **kwargs) -> ndarray
Assign a score to molecules
- Parameters:
model_msg – Model in a transmittable format, may need to be deserialized
input_data – Batch of inputs ready for the model, as generated by
transform_inputs()
lower_fidelities – Properties of the molecule at lower levels, if known
- Returns:
The scores for a set of records
- class examol.score.base.Scorer
Bases:
object
Base class for algorithms which quickly assign a score to a molecule, typically using a machine learning model
Using a Scorer
Scoring a molecule requires transforming the molecule into a form compatible with a machine learning algorithm, then executing inference using the machine learning algorithm. We separate these two steps so that the former can run on local resources and the latter can run on larger remote resources. Running the scorer will then look something like:
    scorer = Scorer()
    recipe = PropertyRecipe()  # Recipe that we are trying to predict
    model = ...  # The model that we'll be sending to workers
    inputs = scorer.transform_inputs(records)  # Readies records to run inference
    model_msg = scorer.prepare_message(model)  # Readies model to be sent to a remote worker
    scorer.score(model_msg, inputs)  # Can be run remotely
Note how the Scorer class does not hold on to the model as state. The Scorer is just the tool which holds the code needed to train and run the model.
Training operations are broken into separate operations for similar reasons. We separate the training operation from pre-processing inputs and outputs, and from updating a local copy of the model given the results of training.
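The split between local pre-processing and remote execution can be sketched with a toy scorer. This is a minimal illustration, not examol code: MeanScorer, the dict-based records, the string recipe, and the dict "model" are all hypothetical stand-ins, and the local model update is done with a plain dict update.

```python
import pickle
from statistics import mean


class MeanScorer:
    """Toy stand-in for a Scorer (hypothetical, not part of examol).

    The "model" is a plain dict holding the mean of the training outputs.
    """

    def transform_inputs(self, record_batch):
        # Local pre-processing: reduce each record to a model-ready input
        return [len(record["smiles"]) for record in record_batch]

    def transform_outputs(self, records, recipe):
        # Local pre-processing: gather the training target for each record
        return [record["properties"][recipe] for record in records]

    def prepare_message(self, model, training=False):
        # Serialize the model so it can be sent to a remote worker
        return pickle.dumps(model)

    def retrain(self, model_msg, input_data, output_data, **kwargs):
        # Possibly remote: refit from scratch, return a message describing the update
        return {"mean": mean(output_data)}

    def score(self, model_msg, input_data, **kwargs):
        # Possibly remote: deserialize the model, then run inference
        model = pickle.loads(model_msg)
        return [model["mean"] for _ in input_data]


# Hypothetical records; real code would use MoleculeRecord objects
records = [
    {"smiles": "C", "properties": {"energy": 1.0}},
    {"smiles": "CC", "properties": {"energy": 3.0}},
]
scorer = MeanScorer()  # Holds code only, no model state
model = {"mean": 0.0}

inputs = scorer.transform_inputs(records)              # Local
outputs = scorer.transform_outputs(records, "energy")  # Local
train_msg = scorer.prepare_message(model, training=True)
update_msg = scorer.retrain(train_msg, inputs, outputs)  # Can be run remotely
model.update(update_msg)                               # Local: apply the training result

scores = scorer.score(scorer.prepare_message(model), inputs)  # Can be run remotely
```

Because the scorer never stores the model, the same scorer object can serve several models, and only the serialized message has to travel to the worker.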
- prepare_message(model: object, training: bool = False) -> object
Get the model state as a serializable object
- Parameters:
model – Model to be sent to score or retrain function
training – Whether to prepare the message for training or inference
- Returns:
The model state as an object which can be serialized and then transmitted to a remote worker
- retrain(model_msg: object, input_data: list, output_data: list, **kwargs) -> object
Retrain the scorer based on new training records
- Parameters:
model_msg – Model to be retrained
input_data – Training set inputs, as generated by
transform_inputs()
output_data – Training set outputs, as generated by
transform_outputs()
- Returns:
Message defining how to update the model
- score(model_msg: object, input_data: list, **kwargs) -> ndarray
Assign a score to molecules
- Parameters:
model_msg – Model in a transmittable format, may need to be deserialized
input_data – Batch of inputs ready for the model, as generated by
transform_inputs()
- Returns:
The scores for a set of records
- transform_inputs(record_batch: list[MoleculeRecord]) -> list
Form inputs for the model based on the data in a molecule record
- transform_outputs(records: list[MoleculeRecord], recipe: PropertyRecipe) -> ndarray
Gather the target outputs of the model
- Parameters:
records – List of records from which to extract outputs
recipe – Target recipe for the scorer for single-fidelity learning
- Returns:
Outputs ready for model training
examol.score.nfp
Train neural network models using NFP
- class examol.score.nfp.NFPMessage(model: tensorflow.keras.Model)
Bases:
object
Package for sending an MPNN model over connections that require pickling
- class examol.score.nfp.NFPScorer(retrain_from_scratch: bool = True)
Bases:
MultiFidelityScorer
Train message-passing neural networks based on the NFP library.
NFP uses Keras to define message-passing networks, and Keras is backed by TensorFlow for executing the networks on different hardware.
Multi-fidelity models predict the lowest, most-plentiful level of fidelity directly and correction factors to adjust the low-level predictions for the higher levels (i.e., delta learning). Training does not require all levels of fidelity to be available and will only measure loss against the available data. Inference predicts the low-fidelity value and all correction factors for higher levels, but uses known values in place of them if available.
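The inference rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the NFPScorer implementation: the function name predict_top_level is hypothetical, unknown lower-fidelity values are marked with NaN, and known values are assumed to be available contiguously from the lowest level upward.

```python
import math


def predict_top_level(predicted, known_values):
    """Estimate the highest-fidelity value with delta learning.

    predicted: [low-level value, correction to level 1, correction to level 2, ...]
    known_values: measured values for the lower levels (NaN where unknown),
        assumed to be known contiguously from the lowest level upward
    """
    total = predicted[0]  # Start from the predicted low-fidelity value
    first_delta = 1       # Index of the first correction still to be added
    for level, value in enumerate(known_values):
        if math.isnan(value):
            break
        # A known value replaces the prediction up to and including this level
        total = value
        first_delta = level + 1
    # Add the predicted corrections for every level above the last known one
    return total + sum(predicted[first_delta:])


nan = float("nan")
predicted = [1.0, 0.5, 0.25]  # Low-level prediction and two corrections
nothing_known = predict_top_level(predicted, [nan, nan])
lowest_known = predict_top_level(predicted, [1.5, nan])
both_known = predict_top_level(predicted, [1.5, 2.5])
```

When no lower-fidelity data are known, the estimate is the sum of all predicted terms; each known level removes one source of model error from that sum.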
- prepare_message(model: tensorflow.keras.models.Model, training: bool = False) -> dict | NFPMessage
Get the model state as a serializable object
- Parameters:
model – Model to be sent to score or retrain function
training – Whether to prepare the message for training or inference
- Returns:
The model state as an object which can be serialized and then transmitted to a remote worker
- retrain(model_msg: dict | NFPMessage, input_data: list, output_data: ndarray, lower_fidelities: None | ndarray = None, num_epochs: int = 4, batch_size: int = 32, validation_split: float = 0.1, learning_rate: float = 0.001, device_type: str = 'gpu', steps_per_exec: int = 1, patience: int = None, timeout: float = None, verbose: bool = False) -> tuple[list[ndarray], dict]
Retrain the scorer based on new training records
- Parameters:
model_msg – Model to be retrained
input_data – Training set inputs, as generated by
transform_inputs()
output_data – Training set outputs, as generated by
transform_outputs()
lower_fidelities – Lower-fidelity data, if available
num_epochs – Maximum number of epochs to run
batch_size – Number of molecules per training batch
validation_split – Fraction of molecules held out for validation
learning_rate – Learning rate for the Adam optimizer
device_type – Type of device used for training
steps_per_exec – Number of training steps to run per execution on the accelerator
patience – Number of epochs without improvement before terminating training. Default is 10% of
num_epochs
timeout – Maximum training time in seconds
verbose – Whether to print training information to screen
- Returns:
Message defining how to update the model
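How the num_epochs, patience, and timeout limits interact can be sketched with a generic early-stopping loop. This is an assumption-laden illustration, not the code NFPScorer runs: train_with_limits is a hypothetical helper, and the per-epoch validation losses are supplied as a list rather than computed by a model.

```python
import time


def train_with_limits(val_losses, patience, timeout=None):
    """Run "epochs" until the losses run out, patience is exhausted, or time is up.

    val_losses: validation loss after each epoch; its length plays the role of num_epochs
    patience: number of epochs without improvement before stopping
    timeout: maximum training time in seconds, or None for no limit
    """
    start = time.monotonic()
    best = float("inf")
    epochs_since_best = 0
    epochs_run = 0
    for loss in val_losses:
        # Stop before starting a new epoch if the time budget is spent
        if timeout is not None and time.monotonic() - start > timeout:
            break
        epochs_run += 1
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break  # No improvement for `patience` epochs
    return epochs_run, best


epochs_run, best_loss = train_with_limits([1.0, 0.8, 0.9, 0.85, 0.7], patience=2)
```

In this run the loop stops after the fourth epoch, so the final loss of 0.7 is never reached; a larger patience trades extra training time for the chance of finding such late improvements.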
- score(model_msg: NFPMessage, input_data: list[dict | tuple[dict, ndarray]], batch_size: int = 64, lower_fidelities: ndarray | None = None, **kwargs) -> ndarray
Assign a score to molecules
- Parameters:
model_msg – Model in a transmittable format
input_data – Batch of inputs ready for the model (in dictionary format)
batch_size – Number of molecules to evaluate at a time
lower_fidelities – Properties of the molecule at lower levels, if known
- Returns:
The scores for a set of records
- class examol.score.nfp.ReduceAtoms(*args: Any, **kwargs: Any)
Bases:
Layer
Reduce the atoms along a certain direction
- Parameters:
reduction_op – Name of the operation used for reduction
- examol.score.nfp.convert_string_to_dict(mol_string: str) -> dict
Convert a molecule to an NFP-compatible dictionary form
- Parameters:
mol_string – SMILES or InChI string
- Returns:
Dictionary
- examol.score.nfp.make_data_loader(mol_dicts: list[dict], values: ndarray | list[object] | None = None, batch_size: int = 32, repeat: bool = False, shuffle_buffer: int | None = None, value_spec: tensorflow.TensorSpec = tensorflow.TensorSpec, drop_last_batch: bool = False) -> tensorflow.data.Dataset
Make an in-memory data loader for data compatible with NFP-style neural networks
- Parameters:
mol_dicts – List of molecules parsed into the moldesign format
values – List of output values, if included in the output
value_spec – TensorFlow specification for the output
batch_size – Number of molecules per batch
repeat – Whether to create an infinitely-repeating iterator
shuffle_buffer – Size of a shuffle buffer. Use None to leave data unshuffled
drop_last_batch – Whether to drop the last batch when it is incomplete. Set to True if, for example, you need every batch to be the same size
- Returns:
Data loader that generates molecules in the desired shapes
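The effect of batch_size and drop_last_batch can be sketched in plain Python. This is a simplified illustration, not the TensorFlow-based loader: iter_batches is a hypothetical helper, and it shuffles the whole list at once rather than using a bounded shuffle buffer.

```python
import random


def iter_batches(items, batch_size=32, shuffle=False, drop_last_batch=False, seed=None):
    """Yield lists of up to `batch_size` items."""
    order = list(items)
    if shuffle:
        # Simplification: shuffle everything, not a fixed-size buffer
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        if drop_last_batch and len(batch) < batch_size:
            break  # Skip the incomplete final batch
        yield batch


kept_sizes = [len(b) for b in iter_batches(range(10), batch_size=4)]
dropped_sizes = [len(b) for b in iter_batches(range(10), batch_size=4, drop_last_batch=True)]
```

With ten items and a batch size of four, the final batch holds only two items, so dropping it leaves two full batches of equal size.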
- examol.score.nfp.make_simple_network(atom_features: int = 64, message_steps: int = 8, output_layers: list[int] = (512, 256, 128), reduce_op: str = 'mean', atomwise: bool = True, outputs: int = 1) -> tensorflow.keras.models.Model
Construct a basic MPNN model using the settings provided by a user
Models will have embeddings for atoms with atomic numbers up to 63, and 4 types of bonds (single, double, triple, aromatic).
The models use edge, node, and global updates for each message passing layer and a separate set of MLPs for each of the outputs. There is also a “scaling” layer which can be used to adjust the mean and standard deviation of the prediction.
- Parameters:
atom_features – Number of features used per atom and bond
message_steps – Number of message passing steps
output_layers – Number of neurons in the readout layers
reduce_op – Operation used to reduce from atom-level to molecule-level vectors
atomwise – Whether to reduce atomwise contributions after the output layers, or reduce to a single vector per molecule before the output layers
outputs – Number of output properties. Each uses its own output network
- Returns:
A model instantiated with the user-defined options
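The difference between the two settings of atomwise is the order of the reduction and the readout. The sketch below illustrates that ordering with scalars in place of per-atom feature vectors; the readout function is a hypothetical nonlinear stand-in for the output layers, and the reduction is a mean, matching the default reduce_op.

```python
from statistics import mean


def readout(x):
    # Stand-in for the output layers: any nonlinear function shows the effect
    return max(x, 0.0) ** 2


def predict(atom_values, atomwise=True):
    if atomwise:
        # Apply the readout to each atom, then reduce to one value per molecule
        return mean(readout(a) for a in atom_values)
    # Reduce to one value per molecule first, then apply the readout once
    return readout(mean(atom_values))


atoms = [1.0, 3.0]  # Toy per-atom features for a two-atom molecule
atomwise_result = predict(atoms, atomwise=True)
moleculewise_result = predict(atoms, atomwise=False)
```

The two results differ because the readout is nonlinear, which is why the choice changes what the model can represent rather than being a pure implementation detail.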