examol.steer¶
Collection of policies which interleave simulation and machine learning tasks.
Policies are defined as Colmena Thinker classes.
examol.steer.base¶
Base class that defines core routines used across many steering policies
- class examol.steer.base.MoleculeThinker(queues: ColmenaQueues, rec: ResourceCounter, run_dir: Path, recipes: Sequence[PropertyRecipe], solution: SolutionSpecification, search_space: list[Path | str], database: MoleculeStore, pool: ProcessPoolExecutor)[source]¶
Bases: BaseThinker
Base for a thinker which performs molecular design
- Parameters:
queues – Queues used to communicate with the task server
rec – Counter used to track availability of different resources
run_dir – Directory in which to store results
recipes – List of recipes to compute
solution – Description of how to solve the problem
database – Connection to the store of molecule records
search_space – Lists of molecules to be evaluated as a list of “.smi” or “.json” files
num_workers – Number of workers to use locally for the thinker
- database: MoleculeStore¶
Access to the data available to the thinker
- iterate_over_search_space(only_smiles: bool = False) Iterator[MoleculeRecord | str] [source]¶
Function to produce a stream of molecules from the input files
- Parameters:
only_smiles – Whether to return only the SMILES string rather than the full record
- Yields:
A MoleculeRecord for each molecule in the search space, or just the SMILES string
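The streaming behavior described above can be sketched as follows. This is a minimal, hypothetical illustration of reading SMILES strings line-by-line from ".smi" files (the real method also handles ".json" files containing full `MoleculeRecord` objects; `iterate_smiles` is an assumed name, not part of the API):

```python
from pathlib import Path
from typing import Iterator


def iterate_smiles(files: list[Path]) -> Iterator[str]:
    """Yield one SMILES string per non-blank line of each ".smi" file.

    A sketch of streaming a search space from disk rather than loading
    it all into memory at once.
    """
    for path in files:
        with open(path) as fp:
            for line in fp:
                line = line.strip()
                if line:  # skip blank lines
                    yield line
```

Streaming keeps memory use constant even when the search space holds millions of molecules.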
- task_iterator() Iterator[tuple[MoleculeRecord, Iterable[PropertyRecipe], SimulationRequest]] [source]¶
Iterate over the next tasks in the task queue
- Yields:
Molecule being processed
Recipes being computed
Simulation to execute
- task_queue: list[tuple[str, float]]¶
List of tasks to run. Each entry is a SMILES string and a score; entries are sorted in descending order of priority
- task_queue_lock: Condition¶
Lock used to control access to
task_queue
examol.steer.baseline¶
Baseline methods for steering a molecular design campaign
- class examol.steer.baseline.BruteForceThinker(queues: ColmenaQueues, run_dir: Path, recipes: Sequence[PropertyRecipe], solution: SolutionSpecification, search_space: list[Path | str], database: MoleculeStore, pool: ProcessPoolExecutor, num_workers: int = 1, overselection: float = 0)[source]¶
Bases: MoleculeThinker
Evaluate the set of molecules selected at the beginning of the run
- Parameters:
queues – Queues used to communicate with the task server
run_dir – Directory in which to store results
recipes – List of recipes to compute
solution – Description of how to solve the problem
database – Connection to the store of molecule records
search_space – Lists of molecules to be evaluated as a list of “.smi” or “.json” files
num_workers – Number of simulations to run in parallel
overselection – Additional fraction of molecules to select beyond the number requested by the user. The number of extra molecules will be
solution.num_to_run * overselection
. Used to ensure the target number of molecules is evaluated even if some simulations fail.
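The overselection arithmetic can be made concrete with a small sketch. The source states only that the number of extra molecules is `solution.num_to_run * overselection`; rounding that product to an integer is an assumption here, and `select_count` is a hypothetical helper name:

```python
def select_count(num_to_run: int, overselection: float) -> int:
    """Total molecules to select: the requested count plus a safety margin.

    The extra fraction guards against failed simulations so that roughly
    ``num_to_run`` molecules still finish successfully.
    """
    extra = round(num_to_run * overselection)  # rounding is an assumption
    return num_to_run + extra
```

For example, requesting 100 molecules with `overselection=0.25` selects 125 candidates.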
examol.steer.single¶
Single-objective, single-fidelity implementation of active learning; the simplest strategy provided
- class examol.steer.single.SingleStepThinker(queues: ColmenaQueues, run_dir: Path, recipes: Sequence[PropertyRecipe], solution: SingleFidelityActiveLearning, search_space: list[Path | str], database: MoleculeStore, pool: ProcessPoolExecutor, num_workers: int = 2, inference_chunk_size: int = 10000)[source]¶
Bases: MoleculeThinker
A thinker which submits all computations needed to evaluate a molecule whenever it is selected
- Parameters:
queues – Queues used to communicate with the task server
run_dir – Directory in which to store logs, etc.
recipes – Recipes used to compute the target properties
database – Connection to the store of molecular data
solution – Settings related to tools used to solve the problem (e.g., active learning strategy)
search_space – Search space of molecules. Provided as a list of paths to “.smi” files
num_workers – Number of simulation tasks to run in parallel
inference_chunk_size – Number of molecules to run inference on per task
- count_training_size(recipe: PropertyRecipe) int [source]¶
Count the number of entries available for training a model on the given recipe
- Parameters:
recipe – Recipe being assessed
- Returns:
Number of records for which this property is defined
- get_additional_training_information(train_set: list[MoleculeRecord], recipe: PropertyRecipe) dict[str, object] [source]¶
Determine any additional information to be provided during training
An example could be to gather low-fidelity data to use to augment the training process
- Parameters:
train_set – Training set for the model
recipe – Recipe being trained
- Returns:
Additional options
- search_space_dir: Path¶
Cache directory for search space
- search_space_inputs: list[list[object]]¶
Inputs (or proxies of inputs) to the machine learning models for each molecule in the search space
- solution: SingleFidelityActiveLearning¶
- submit_inference() tuple[list[list[str]], ndarray, list[ndarray]] [source]¶
Submit all molecules to be evaluated and return placeholders for their outputs
- Inference tasks are submitted with a few pieces of metadata:
recipe_id: Index of the recipe being evaluated
model_id: Index of the model being evaluated
chunk_id: Index of the chunk of molecules
chunk_size: Number of molecules in each chunk being evaluated
- Returns:
SMILES strings of the molecules being evaluated
Boolean array of shape n_chunks x recipes x ensemble_size marking whether each inference task is done
List of arrays in which to store inference results: a total of n_chunks arrays of shape recipes x batch_size x models
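The bookkeeping returned by this method can be sketched as follows. This is an illustrative, assumed implementation (the function name and exact array shapes follow the Returns description above, not the library source): split the molecules into chunks, build a boolean "done" mask, and preallocate one NaN-filled result array per chunk.

```python
import numpy as np


def make_inference_placeholders(
    smiles: list[str], chunk_size: int, n_recipes: int, n_models: int
) -> tuple[list[list[str]], np.ndarray, list[np.ndarray]]:
    """Chunk molecules for inference and preallocate result storage.

    Returns the chunked SMILES strings, a boolean mask of shape
    (n_chunks, n_recipes, n_models) marking completed tasks, and one
    NaN-filled array of shape (n_recipes, batch_size, n_models) per chunk.
    """
    chunks = [smiles[i:i + chunk_size] for i in range(0, len(smiles), chunk_size)]
    done = np.zeros((len(chunks), n_recipes, n_models), dtype=bool)
    results = [np.full((n_recipes, len(chunk), n_models), np.nan) for chunk in chunks]
    return chunks, done, results
```

NaN fill values make it easy to spot results that were never written back.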
examol.steer.multifi¶
Scheduling strategies for multi-fidelity design campaigns
- class examol.steer.multifi.PipelineThinker(queues: ColmenaQueues, run_dir: Path, recipes: Sequence[PropertyRecipe], database: MoleculeStore, solution: MultiFidelityActiveLearning, search_space: list[Path | str], pool: ProcessPoolExecutor, num_workers: int = 2, inference_chunk_size: int = 10000)[source]¶
Bases: SingleStepThinker
Thinker which runs each level of fidelity in incremental steps
See MultiFidelityActiveLearning for a description of the adjustable parameters.
- already_in_db: set[str]¶
InChI keys of molecules from the search space which are already in the database
- get_additional_training_information(train_set: list[MoleculeRecord], recipe: PropertyRecipe) dict[str, object] [source]¶
Determine any additional information to be provided during training
An example could be to gather low-fidelity data to use to augment the training process
- Parameters:
train_set – Training set for the model
recipe – Recipe being trained
- Returns:
Additional options
- get_level(smiles: str) int [source]¶
Get the current step number of a molecule
- Parameters:
smiles – SMILES string of molecule in question
- Returns:
Step level (0 means no data)
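The step-level computation can be sketched under one assumption: each fidelity step is a set of recipe names, a step counts as complete only when every recipe in it has a recorded value, and steps run strictly in order. The standalone `get_level` below is a hypothetical illustration, not the class method:

```python
def get_level(known_properties: set[str], steps: list[list[str]]) -> int:
    """Return the number of fidelity steps fully completed for a molecule.

    0 means no data: not even the first (cheapest) step has finished.
    """
    level = 0
    for step in steps:
        if all(recipe in known_properties for recipe in step):
            level += 1
        else:
            break  # steps run in order, so stop at the first incomplete one
    return level
```

A pipeline thinker can use this level to decide which fidelity to submit next for each molecule.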
- get_relevant_database_records() set[str] [source]¶
Get only the entries from the database which are in the search space
- Returns:
InChI keys from the database which are in the search space
- solution: MultiFidelityActiveLearning¶
- property steps: Sequence[Sequence[PropertyRecipe]][source]¶
- submit_inference() tuple[list[list[str]], ndarray, list[ndarray]] [source]¶
Submit all molecules to be evaluated and return placeholders for their outputs
- Inference tasks are submitted with a few pieces of metadata:
recipe_id: Index of the recipe being evaluated
model_id: Index of the model being evaluated
chunk_id: Index of the chunk of molecules
chunk_size: Number of molecules in each chunk being evaluated
- Returns:
SMILES strings of the molecules being evaluated
Boolean array of shape n_chunks x recipes x ensemble_size marking whether each inference task is done
List of arrays in which to store inference results: a total of n_chunks arrays of shape recipes x batch_size x models