examol.score

Tools for quickly assigning scores to a molecule

examol.score.base

Base classes for scoring functions

class examol.score.base.MultiFidelityScorer[source]

Bases: Scorer

Base class for scorers which support multi-fidelity learning

All subclasses support a “lower_fidelities” keyword argument to the score() and retrain() functions that takes any lower-fidelity information available. Subclasses should train a multi-fidelity model if provided lower-fidelity data during training and use the lower-fidelity data to enhance prediction accuracy during scoring.
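
For example, a round of training and scoring with lower-fidelity data might look like the following sketch (the low_fi array and its exact layout are illustrative; see the concrete subclass for the expected format):

scorer = MultiFidelityScorer()  # In practice, a concrete subclass such as NFPScorer
low_fi = ...  # Lower-fidelity values gathered so far (layout is subclass-specific)
update_msg = scorer.retrain(model_msg, inputs, outputs, lower_fidelities=low_fi)
scores = scorer.score(model_msg, inputs, lower_fidelities=low_fi)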

retrain(model_msg: object, input_data: list, output_data: list, lower_fidelities: ndarray | None = None, **kwargs) object[source]

Retrain the scorer based on new training records

Parameters:
  • model_msg – Model to be retrained

  • input_data – Training set inputs, as generated by transform_inputs()

  • output_data – Training set outputs, as generated by transform_outputs()

  • lower_fidelities – Lower-fidelity data, if available

Returns:

Message defining how to update the model

score(model_msg: object, input_data: list, lower_fidelities: ndarray | None = None, **kwargs) ndarray[source]

Assign a score to molecules

Parameters:
  • model_msg – Model in a transmittable format, may need to be deserialized

  • input_data – Batch of inputs ready for the model, as generated by transform_inputs()

  • lower_fidelities – Properties of the molecule at lower levels, if known

Returns:

The scores for a set of records

class examol.score.base.Scorer[source]

Bases: object

Base class for algorithms which quickly assign a score to a molecule, typically using a machine learning model

Using a Scorer

Scoring a molecule requires transforming the molecule into a form compatible with a machine learning algorithm, then executing inference using the machine learning algorithm. We separate these two steps so that the former can run on local resources and the latter can run on larger remote resources. Running the scorer will then look something like

scorer = Scorer()
recipe = PropertyRecipe()  # Recipe that we are trying to predict
model = ...   # The model that we'll be sending to workers
inputs = scorer.transform_inputs(records)  # Readies records to run inference
model_msg = scorer.prepare_message(model)  # Readies model to be sent to a remote worker
scorer.score(model_msg, inputs)  # Can be run remotely

Note how the Scorer class does not hold on to the model as state. The Scorer is just the tool which holds the code needed to train and run the model.

Training operations are broken apart for similar reasons: we separate pre-processing the inputs and outputs, running the training itself, and updating a local copy of the model with the results of training.
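
A sketch of a full training round using only the methods documented below (variable names are illustrative):

inputs = scorer.transform_inputs(records)  # Pre-process on local resources
outputs = scorer.transform_outputs(records, recipe)
model_msg = scorer.prepare_message(model, training=True)  # Ready the model for a remote worker
update_msg = scorer.retrain(model_msg, inputs, outputs)  # Can be run remotely
model = scorer.update(model, update_msg)  # Apply the training results locally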

prepare_message(model: object, training: bool = False) object[source]

Get the model state as a serializable object

Parameters:
  • model – Model to be sent to score or retrain function

  • training – Whether to prepare the message for training or inference

Returns:

The model state as an object which can be serialized and then transmitted to a remote worker

retrain(model_msg: object, input_data: list, output_data: list, **kwargs) object[source]

Retrain the scorer based on new training records

Parameters:
  • model_msg – Model to be retrained

  • input_data – Training set inputs, as generated by transform_inputs()

  • output_data – Training set outputs, as generated by transform_outputs()

Returns:

Message defining how to update the model

score(model_msg: object, input_data: list, **kwargs) ndarray[source]

Assign a score to molecules

Parameters:
  • model_msg – Model in a transmittable format, may need to be deserialized

  • input_data – Batch of inputs ready for the model, as generated by transform_inputs()

Returns:

The scores for a set of records

transform_inputs(record_batch: list[MoleculeRecord]) list[source]

Form inputs for the model based on the data in a molecule record

Parameters:

record_batch – List of records to pre-process

Returns:

List of inputs ready for score() or retrain()

transform_outputs(records: list[MoleculeRecord], recipe: PropertyRecipe) ndarray[source]

Gather the target outputs of the model

Parameters:
  • records – List of records from which to extract outputs

  • recipe – Target recipe for the scorer for single-fidelity learning

Returns:

Outputs ready for model training

update(model: object, update_msg: object) object[source]

Update this local copy of a model

Parameters:
  • model – Model to be updated

  • update_msg – Update for the model

Returns:

Updated model

examol.score.nfp

Train neural network models using NFP

class examol.score.nfp.NFPMessage(model: tensorflow.keras.Model)[source]

Bases: object

Package for sending an MPNN model over connections that require pickling

get_model() tensorflow.keras.Model[source]

Get a copy of the model

Returns:

The model specified by this message
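
A minimal sketch of the intended round trip (only the standard pickle module is assumed):

import pickle

msg = NFPMessage(model)  # Wrap a Keras MPNN for transmission
payload = pickle.dumps(msg)  # Safe to send over connections that require pickling
model_copy = pickle.loads(payload).get_model()  # Rebuild the model on the receiving side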

class examol.score.nfp.NFPScorer(retrain_from_scratch: bool = True)[source]

Bases: MultiFidelityScorer

Train message-passing neural networks based on the NFP library.

NFP uses Keras to define message-passing networks; Keras is backed by TensorFlow for executing the networks on different hardware.

Multi-fidelity models predict the lowest, most-plentiful level of fidelity directly and correction factors to adjust the low-level predictions for the higher levels (i.e., delta learning). Training does not require all levels of fidelity to be available and will only measure loss against the available data. Inference predicts the low-fidelity value and all correction factors for higher levels, but uses known values in place of them if available.
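
As an illustration of the delta-learning scheme described above (a simplified sketch with a single lower level, not the library's internal code; NaN marking unknown values is an assumption):

import numpy as np

# The model predicts the low-fidelity value and a correction factor
low_pred, delta_pred = model_outputs[:, 0], model_outputs[:, 1]

# Substitute the measured low-fidelity value where one is known
low = np.where(np.isnan(known_low_fi), low_pred, known_low_fi)

# High-fidelity estimate: low-fidelity value plus the learned correction
high_estimate = low + delta_pred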

prepare_message(model: tensorflow.keras.models.Model, training: bool = False) dict | NFPMessage[source]

Get the model state as a serializable object

Parameters:
  • model – Model to be sent to score or retrain function

  • training – Whether to prepare the message for training or inference

Returns:

The model state as an object which can be serialized and then transmitted to a remote worker

retrain(model_msg: dict | NFPMessage, input_data: list, output_data: ndarray, lower_fidelities: None | ndarray = None, num_epochs: int = 4, batch_size: int = 32, validation_split: float = 0.1, learning_rate: float = 0.001, device_type: str = 'gpu', steps_per_exec: int = 1, patience: int = None, timeout: float = None, verbose: bool = False) tuple[list[ndarray], dict][source]

Retrain the scorer based on new training records

Parameters:
  • model_msg – Model to be retrained

  • input_data – Training set inputs, as generated by transform_inputs()

  • output_data – Training Set outputs, as generated by transform_outputs()

  • lower_fidelities – Lower-fidelity data, if available

  • num_epochs – Maximum number of epochs to run

  • batch_size – Number of molecules per training batch

  • validation_split – Fraction of molecules held out from training for validation

  • learning_rate – Learning rate for the Adam optimizer

  • device_type – Type of device used for training

  • steps_per_exec – Number of training steps to run per execution on the accelerator

  • patience – Number of epochs without improvement before terminating training. Default is 10% of num_epochs

  • timeout – Maximum training time in seconds

  • verbose – Whether to print training information to screen

Returns:

Message defining how to update the model
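
For instance, a hypothetical retraining call (argument values are illustrative, not recommendations):

model_msg = scorer.prepare_message(model, training=True)
update_msg = scorer.retrain(
    model_msg, inputs, outputs,
    num_epochs=64, batch_size=32, patience=8, timeout=3600,
)
model = scorer.update(model, update_msg)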

score(model_msg: NFPMessage, input_data: list[dict | tuple[dict, ndarray]], batch_size: int = 64, lower_fidelities: ndarray | None = None, **kwargs) ndarray[source]

Assign a score to molecules

Parameters:
  • model_msg – Model in a transmittable format

  • input_data – Batch of inputs ready for the model (in dictionary format)

  • batch_size – Number of molecules to evaluate at a time

  • lower_fidelities – Properties of the molecule at lower levels, if known

Returns:

The scores for a set of records

transform_inputs(record_batch: list[MoleculeRecord]) list[dict][source]

Form inputs for the model based on the data in a molecule record

Parameters:

record_batch – List of records to pre-process

Returns:

List of inputs ready for score() or retrain()

update(model: tensorflow.keras.models.Model, update_msg: tuple[list[ndarray], dict]) tensorflow.keras.models.Model[source]

Update this local copy of a model

Parameters:
  • model – Model to be updated

  • update_msg – Update for the model

Returns:

Updated model

class examol.score.nfp.ReduceAtoms(*args: Any, **kwargs: Any)[source]

Bases: Layer

Reduce the atoms along a certain direction

Parameters:

reduction_op – Name of the operation used for reduction

call(inputs, mask=None)[source]

Parameters:
  • inputs – Matrix to be reduced

  • mask – Identifies which of the rows being reduced are placeholders

get_config()[source]

examol.score.nfp.convert_string_to_dict(mol_string: str) dict[source]

Convert a molecule to an NFP-compatible dictionary form

Parameters:

mol_string – SMILES or InChI string

Returns:

Dictionary describing the molecule in NFP-compatible format
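
For example (a one-line sketch; 'CCO' is the SMILES string for ethanol):

mol_dict = convert_string_to_dict('CCO')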

examol.score.nfp.make_data_loader(mol_dicts: list[dict], values: ndarray | list[object] | None = None, batch_size: int = 32, repeat: bool = False, shuffle_buffer: int | None = None, value_spec: tensorflow.TensorSpec = tensorflow.TensorSpec, drop_last_batch: bool = False) tensorflow.data.Dataset[source]

Make an in-memory data loader for data compatible with NFP-style neural networks

Parameters:
  • mol_dicts – List of molecules parsed into the moldesign format

  • values – List of output values, if included in the output

  • value_spec – Tensorflow specification for the output

  • batch_size – Number of molecules per batch

  • repeat – Whether to create an infinitely-repeating iterator

  • shuffle_buffer – Size of a shuffle buffer. Use None to leave data unshuffled

  • drop_last_batch – Whether to drop the last batch in the dataset. Set to True if, for example, you need every batch to be the same size

Returns:

Data loader that generates molecules in the desired shapes
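
A minimal sketch of building loaders for training and for inference (the SMILES strings and values are illustrative; the default scalar value_spec is assumed to fit these outputs):

mol_dicts = [convert_string_to_dict(s) for s in ['C', 'CC', 'CCO']]
train_loader = make_data_loader(mol_dicts, values=[0.1, 0.4, 0.3], batch_size=2, shuffle_buffer=128)
infer_loader = make_data_loader(mol_dicts, batch_size=2)  # No values needed for inference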

examol.score.nfp.make_simple_network(atom_features: int = 64, message_steps: int = 8, output_layers: list[int] = (512, 256, 128), reduce_op: str = 'mean', atomwise: bool = True, outputs: int = 1) tensorflow.keras.models.Model[source]

Construct a basic MPNN model using the settings provided by a user

Models will have embeddings for atoms with atomic numbers up to 63, and 4 types of bonds (single, double, triple, aromatic).

The models use edge, node, and global updates for each message-passing layer and a separate set of MLPs for each of the outputs. There is also a “scaling” layer which can be used to adjust the mean and standard deviation of the prediction.

Parameters:
  • atom_features – Number of features used per atom and bond

  • message_steps – Number of message passing steps

  • output_layers – Number of neurons in the readout layers

  • reduce_op – Operation used to reduce from atom-level to molecule-level vectors

  • atomwise – Whether to reduce atomwise contributions after the output layers, or reduce to a single vector per molecule before the output layers

  • outputs – Number of output properties. Each will use their own output network

Returns:

A model instantiated with the user-defined options
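
For example, a small network with two outputs might be constructed as follows (layer sizes are illustrative):

model = make_simple_network(
    atom_features=32,
    message_steps=4,
    output_layers=[64, 32],
    outputs=2,  # Each output gets its own MLP head
)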

examol.score.rdkit