flexs.baselines.explorers.ppo
PPO explorer.
class flexs.baselines.explorers.ppo.PPO(model, rounds, sequences_batch_size, model_queries_per_batch, starting_sequence, alphabet, log_file=None)[source]

Bases: flexs.explorer.Explorer
Explorer which uses PPO.
The algorithm is:

for N experiment rounds:
    collect samples with policy
    train policy on samples
A simpler baseline than DyNAPPOMutative with similar performance.
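The round loop above can be sketched in plain Python. This is a minimal illustration, not the flexs implementation: the stub policy, the toy GC-content score standing in for a flexs model, and the `train_policy` placeholder are all hypothetical.

```python
import random

random.seed(0)
ALPHABET = "ACGT"

def collect_samples(policy, batch_size, seq_len=8):
    """Roll out the (stubbed) policy to propose candidate sequences."""
    return ["".join(policy(ALPHABET) for _ in range(seq_len))
            for _ in range(batch_size)]

def train_policy(policy, samples, scores):
    """Placeholder for a PPO update; a real explorer would update policy weights."""
    return policy

def score(seq):
    """Toy fitness (GC content), standing in for a flexs model's predictions."""
    return sum(c in "GC" for c in seq) / len(seq)

def run(rounds=3, batch_size=4):
    policy = random.choice  # stub policy: uniform over the alphabet
    history = []
    for _ in range(rounds):                             # for N experiment rounds
        samples = collect_samples(policy, batch_size)   # collect samples with policy
        scores = [score(s) for s in samples]
        policy = train_policy(policy, samples, scores)  # train policy on samples
        history.append(max(scores))
    return history

print(run())
```

The real explorer replaces the stubs with a learned PPO policy and scores candidates through the supplied model under the `model_queries_per_batch` budget.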
add_last_seq_in_trajectory(experience, new_seqs)[source]

Add the last sequence in an episode's trajectory.
Given a trajectory object, checks whether it is the last one in its trajectory. Since the environment ends the episode when the score is non-increasing, the maximum-valued sequence from that episode is added to the batch.
If the episode is ending, it sets the environment's "current sequence" to the next one in last_batch, so that when the environment resets, mutants are generated from that new sequence.
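The documented behavior can be sketched as follows. This is a simplified stand-in, not the real method: the real signature takes only `(experience, new_seqs)` and keeps the episode bookkeeping on the explorer instance, and `Experience`, `TinyEnv`, `episode_seqs`, and the `is_last` flag are hypothetical placeholders for the underlying trajectory object and environment state.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """Stand-in for a trajectory step; only the end-of-episode flag matters here."""
    is_last: bool

class TinyEnv:
    """Minimal environment stub holding the 'current sequence' to mutate from."""
    def __init__(self, seq):
        self.seq = seq

def add_last_seq_in_trajectory(env, experience, new_seqs, episode_seqs, last_batch):
    """If the episode is ending, add its maximum-valued sequence to the batch
    and point the environment at the next seed sequence from last_batch."""
    if experience.is_last:
        # episode_seqs holds (sequence, score) pairs seen during the episode
        best = max(episode_seqs, key=lambda item: item[1])[0]
        new_seqs.add(best)
        # on reset, mutants will now be generated from this new sequence
        env.seq = last_batch.pop(0)

env = TinyEnv("AAAA")
new_seqs = set()
episode = [("AAAT", 0.2), ("AATT", 0.5), ("ATTT", 0.4)]
add_last_seq_in_trajectory(env, Experience(is_last=True), new_seqs, episode, ["CCCC"])
print(new_seqs, env.seq)
```

Here the highest-scoring sequence of the finished episode ("AATT") is banked, and the environment's seed becomes the next sequence in last_batch ("CCCC").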