examol.steer

Collection of policies which interleave simulation and machine learning tasks.

Policies are defined as Colmena Thinker classes.

examol.steer.base

Base class that defines core routines used across many steering policies

class examol.steer.base.MoleculeThinker(queues: ColmenaQueues, rec: ResourceCounter, run_dir: Path, recipes: Sequence[PropertyRecipe], solution: SolutionSpecification, search_space: list[Path | str], database: MoleculeStore, pool: ProcessPoolExecutor)[source]

Bases: BaseThinker

Base for a thinker which performs molecular design

Parameters:
  • queues – Queues used to communicate with the task server

  • rec – Counter used to track availability of different resources

  • run_dir – Directory in which to store results

  • recipes – List of recipes to compute

  • solution – Description of how to solve the problem

  • database – List of molecule records

  • search_space – Search space of molecules to be evaluated, provided as a list of “.smi” or “.json” files

  • num_workers – Number of workers to use locally for the thinker

database: MoleculeStore

Access to the data available to the thinker

iterate_over_search_space(only_smiles: bool = False) → Iterator[MoleculeRecord | str][source]

Function to produce a stream of molecules from the input files

Parameters:

only_smiles – Whether to return only the SMILES string rather than the full record

Yields:

A MoleculeRecord for each molecule in the search space, or just its SMILES string
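
The method can be used from a subclass or driving script; a minimal sketch, assuming thinker is an already-constructed MoleculeThinker subclass:

   # Stream the search space as SMILES strings only
   for smiles in thinker.iterate_over_search_space(only_smiles=True):
       ...  # e.g., accumulate strings for featurization

   # With only_smiles=False (the default), full MoleculeRecord objects are yielded
   for record in thinker.iterate_over_search_space():
       ...  # e.g., inspect previously computed properties on the record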

store_simulation(result: Result)[source]

Store the output of a simulation

submit_simulation()[source]

Submit a simulation task when resources are available

task_iterator() → Iterator[tuple[MoleculeRecord, Iterable[PropertyRecipe], SimulationRequest]][source]

Iterate over the next tasks in the task queue

Yields:
  • Molecule being processed

  • Recipes being computed

  • Simulation to execute
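
Each yielded tuple can be unpacked directly. A minimal sketch of a consumer loop, assuming thinker is an already-constructed subclass (the loop body is illustrative):

   for record, recipes, request in thinker.task_iterator():
       # record:  MoleculeRecord for the molecule being processed
       # recipes: PropertyRecipe objects this simulation will help complete
       # request: SimulationRequest describing the computation to submit
       ...  # e.g., hand the request off to the task server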

task_queue: list[tuple[str, float]]

List of tasks to run. Each entry is a SMILES string and a score, and entries are arranged in descending order of priority

task_queue_lock: Condition

Lock used to control access to task_queue
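
Because the queue is shared between threads, reads and writes should hold task_queue_lock. A minimal sketch of the access pattern, assuming thinker is a constructed instance and that larger scores mean higher priority:

   # Add an entry and wake any waiting consumers
   with thinker.task_queue_lock:
       thinker.task_queue.append(('C1=CC=CC=C1', 0.75))           # (SMILES, score)
       thinker.task_queue.sort(key=lambda x: x[1], reverse=True)  # highest score first (assumption)
       thinker.task_queue_lock.notify_all()

   # Wait for work to appear, then take the top entry
   with thinker.task_queue_lock:
       while len(thinker.task_queue) == 0:
           thinker.task_queue_lock.wait()
       smiles, score = thinker.task_queue.pop(0)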

examol.steer.baseline

Baseline methods for steering a molecular design campaign

class examol.steer.baseline.BruteForceThinker(queues: ColmenaQueues, run_dir: Path, recipes: Sequence[PropertyRecipe], solution: SolutionSpecification, search_space: list[Path | str], database: MoleculeStore, pool: ProcessPoolExecutor, num_workers: int = 1, overselection: float = 0)[source]

Bases: MoleculeThinker

Run the full set of molecules selected at the start of the campaign

Parameters:
  • queues – Queues used to communicate with the task server

  • run_dir – Directory in which to store results

  • recipes – List of recipes to compute

  • solution – Description of how to solve the problem

  • database – List of molecule records

  • search_space – Search space of molecules to be evaluated, provided as a list of “.smi” or “.json” files

  • num_workers – Number of simulations to run in parallel

  • overselection – Additional fraction of molecules to select beyond the number requested by the user. The number of additional molecules is solution.num_to_run * overselection. Used to ensure the target number of molecules is evaluated even if some computations fail (see the sketch after this parameter list).
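
A minimal sketch of the arithmetic, taking the description of overselection literally (the exact rounding used by the class is not shown here):

   num_to_run = 100       # solution.num_to_run, an assumed value for illustration
   overselection = 0.2    # fraction requested above the target

   extra = int(num_to_run * overselection)  # 20 additional molecules
   total_selected = num_to_run + extra      # 120 molecules submitted in total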

startup()[source]

Pre-populate the database, if needed.

examol.steer.single

Single-objective and single-fidelity implementation of active learning. As easy as we get

class examol.steer.single.SingleStepThinker(queues: ColmenaQueues, run_dir: Path, recipes: Sequence[PropertyRecipe], solution: SingleFidelityActiveLearning, search_space: list[Path | str], database: MoleculeStore, pool: ProcessPoolExecutor, num_workers: int = 2, inference_chunk_size: int = 10000)[source]

Bases: MoleculeThinker

A thinker which submits all computations needed to evaluate a molecule whenever it is selected

Parameters:
  • queues – Queues used to communicate with the task server

  • run_dir – Directory in which to store logs, etc.

  • recipes – Recipes used to compute the target properties

  • database – Connection to the store of molecular data

  • solution – Settings related to tools used to solve the problem (e.g., active learning strategy)

  • search_space – Search space of molecules. Provided as a list of paths to “.smi” files

  • num_workers – Number of simulation tasks to run in parallel

  • inference_chunk_size – Number of molecules to run inference on per task
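
Putting the parameters above together, a minimal construction sketch; the queues, recipe, solution, and store objects are assumed to exist and their creation is not shown:

   from concurrent.futures import ProcessPoolExecutor
   from pathlib import Path

   from examol.steer.single import SingleStepThinker

   thinker = SingleStepThinker(
       queues=queues,                     # ColmenaQueues connected to a task server
       run_dir=Path('run'),               # where logs and results are written
       recipes=[recipe],                  # properties to compute
       solution=solution,                 # SingleFidelityActiveLearning settings
       search_space=[Path('space.smi')],  # list of ".smi" files
       database=store,                    # MoleculeStore holding known records
       pool=ProcessPoolExecutor(4),       # local pool used by the thinker
       num_workers=2,                     # simulations to run in parallel
       inference_chunk_size=10000,        # molecules per inference task
   )
   thinker.run()  # Colmena thinkers expose Thread-style run()/start() methods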

count_training_size(recipe: PropertyRecipe) → int[source]

Count the number of database entries available for training a model for the given recipe

Parameters:

recipe – Recipe being assessed

Returns:

Number of records for which this property is defined
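
For instance, a driving script might check that enough labelled data exist before retraining; a hypothetical sketch (the threshold and the .recipes attribute are assumptions):

   MIN_TRAINING_SIZE = 10  # illustrative threshold, not part of the class

   if all(thinker.count_training_size(r) >= MIN_TRAINING_SIZE
          for r in thinker.recipes):  # `.recipes` assumed to mirror the constructor argument
       thinker.retrain()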

get_additional_training_information(train_set: list[MoleculeRecord], recipe: PropertyRecipe) → dict[str, object][source]

Determine any additional information to be provided during training

An example could be to gather low-fidelity data with which to augment the training process

Parameters:
  • train_set – Training set for the model

  • recipe – Recipe being trained

Returns:

Additional options
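
A hypothetical override illustrating the intent; the lower_fidelities keyword and the value format are assumptions for illustration, not part of the documented interface:

   from examol.steer.single import SingleStepThinker

   class MyThinker(SingleStepThinker):
       def get_additional_training_information(self, train_set, recipe):
           extra = super().get_additional_training_information(train_set, recipe)
           # Assumed keyword: pass low-fidelity estimates alongside the targets
           extra['lower_fidelities'] = [
               ...  # gather a low-fidelity value for each record in train_set
           ]
           return extra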

property inference_store: Store | None

ProxyStore used for inference tasks

property num_models: int

Number of models being trained by this class

retrain()[source]

Retrain all models

run_inference()[source]

Store inference results, then update the task list

scorer: Scorer

Class used to communicate data and models to distributed workers

search_space_dir: Path

Cache directory for search space

search_space_inputs: list[list[object]]

Inputs (or proxies of inputs) to the machine learning models for each molecule in the search space

search_space_smiles: list[list[str]]

SMILES strings of molecules in the search space
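
The two attributes are parallel, chunked lists; a minimal sketch of how they line up, assuming each outer entry corresponds to one inference chunk:

   for chunk_smiles, chunk_inputs in zip(thinker.search_space_smiles,
                                         thinker.search_space_inputs):
       # chunk_smiles[i] is the SMILES string whose model input (or proxy) is chunk_inputs[i]
       assert len(chunk_smiles) == len(chunk_inputs)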

solution: SingleFidelityActiveLearning

startup()[source]

Pre-populate the database, if needed.

submit_inference() → tuple[list[list[str]], ndarray, list[ndarray]][source]

Submit all molecules to be evaluated, return placeholders for their outputs

Inference tasks are submitted with a few bits of metadata
  • recipe_id: Index of the recipe being evaluated

  • model_id: Index of the model being evaluated

  • chunk_id: Index of the chunk of molecules

  • chunk_size: Number of molecules in chunks being evaluated

Returns:

  • SMILES strings of the molecules being evaluated

  • Boolean array marking whether each inference task is done, with shape n_chunks x recipes x ensemble_size

  • List of arrays in which to store inference results: n_chunks arrays in total, each of shape recipes x batch_size x models (illustrated in the sketch after this list)
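
A minimal sketch of consuming these placeholders, under the assumption that the shapes are exactly as listed above:

   import numpy as np

   smiles, done, results = thinker.submit_inference()

   # `done` is n_chunks x recipes x ensemble_size; entries flip to True as
   # predictions arrive (the flipping itself happens elsewhere in the thinker)
   finished = bool(np.all(done))

   # `results[c]` holds predictions for chunk c, indexed as (recipe, molecule, model);
   # `smiles[c][m]` is the molecule whose predictions sit at results[c][:, m, :]
   ensemble_mean = results[0].mean(axis=-1)  # average over models, assuming chunk 0 is filled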

examol.steer.multifi

Scheduling strategies for multi-fidelity design campaigns

class examol.steer.multifi.PipelineThinker(queues: ColmenaQueues, run_dir: Path, recipes: Sequence[PropertyRecipe], database: MoleculeStore, solution: MultiFidelityActiveLearning, search_space: list[Path | str], pool: ProcessPoolExecutor, num_workers: int = 2, inference_chunk_size: int = 10000)[source]

Bases: SingleStepThinker

Thinker which runs each level of fidelity in incremental steps

See MultiFidelityActiveLearning for a description of the adjustable parameters.

already_in_db: set[str]

InChI keys of molecules from the search space which are already in the database

get_additional_training_information(train_set: list[MoleculeRecord], recipe: PropertyRecipe) → dict[str, object][source]

Determine any additional information to be provided during training

An example could be to gather low-fidelity data with which to augment the training process

Parameters:
  • train_set – Training set for the model

  • recipe – Recipe being trained

Returns:

Additional options

get_level(smiles: str) → int[source]

Get the current step number of a molecule

Parameters:

smiles – SMILES string of molecule in question

Returns:

Step level (0 means no data)
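
A minimal usage sketch; the decision made with the returned level is illustrative:

   level = thinker.get_level('C1=CC=CC=C1')  # benzene, for illustration
   if level == 0:
       ...  # no data yet: start the molecule at the first (cheapest) step
   else:
       ...  # data exist through step `level`; thinker.steps lists the recipes at each step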

get_relevant_database_records() → set[str][source]

Get only the entries from the database which are in the search space

Returns:

InChI keys from the database which are in the search space

property num_levels[source]

solution: MultiFidelityActiveLearning

property steps: Sequence[Sequence[PropertyRecipe]][source]
submit_inference() → tuple[list[list[str]], ndarray, list[ndarray]][source]

Submit all molecules to be evaluated, return placeholders for their outputs

Inference tasks are submitted with a few bits of metadata
  • recipe_id: Index of the recipe being evaluated

  • model_id: Index of the model being evaluated

  • chunk_id: Index of the chunk of molecules

  • chunk_size: Number of molecules in chunks being evaluated

Returns:

  • SMILES strings of the molecules being evaluated

  • Boolean array marking whether each inference task is done, with shape n_chunks x recipes x ensemble_size

  • List of arrays in which to store inference results: n_chunks arrays in total, each of shape recipes x batch_size x models