flexs.baselines.explorers.ppo
PPO explorer.
class flexs.baselines.explorers.ppo.PPO(model, rounds, sequences_batch_size, model_queries_per_batch, starting_sequence, alphabet, log_file=None)[source]

Bases: flexs.explorer.Explorer
Explorer which uses PPO.
The algorithm is:

for N experiment rounds:
    collect samples with policy
    train policy on samples
A simpler baseline than DyNAPPOMutative with similar performance.
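The round loop above can be sketched in plain Python. This is a minimal illustration, not the flexs implementation: the stub policy, the toy GC-content score standing in for a flexs model, and the `train_policy` placeholder are all hypothetical.

```python
import random

random.seed(0)
ALPHABET = "ACGT"

def collect_samples(policy, batch_size, seq_len=8):
    """Roll out the (stubbed) policy to propose candidate sequences."""
    return ["".join(policy(ALPHABET) for _ in range(seq_len))
            for _ in range(batch_size)]

def train_policy(policy, samples, scores):
    """Placeholder for a PPO update; a real explorer would update policy weights."""
    return policy

def score(seq):
    """Toy fitness (GC content), standing in for a flexs model's predictions."""
    return sum(c in "GC" for c in seq) / len(seq)

def run(rounds=3, batch_size=4):
    policy = random.choice  # stub policy: uniform over the alphabet
    history = []
    for _ in range(rounds):                             # for N experiment rounds
        samples = collect_samples(policy, batch_size)   # collect samples with policy
        scores = [score(s) for s in samples]
        policy = train_policy(policy, samples, scores)  # train policy on samples
        history.append(max(scores))
    return history

print(run())
```

The real explorer replaces the stubs with a learned PPO policy and scores candidates through the supplied model under the `model_queries_per_batch` budget.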
add_last_seq_in_trajectory(experience, new_seqs)[source]

Add the last sequence in an episode's trajectory.
Given a trajectory object, checks whether it is the last one in its trajectory. Since the environment ends the episode when the score is non-increasing, the maximum-valued sequence from that episode is added to the batch.
If the episode is ending, it sets the environment's "current sequence" to the next one in last_batch, so that when the environment resets, mutants are generated from that new sequence.
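The documented behavior can be sketched as follows. This is a simplified stand-in, not the real method: the real signature takes only `(experience, new_seqs)` and keeps the episode bookkeeping on the explorer instance, and `Experience`, `TinyEnv`, `episode_seqs`, and the `is_last` flag are hypothetical placeholders for the underlying trajectory object and environment state.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """Stand-in for a trajectory step; only the end-of-episode flag matters here."""
    is_last: bool

class TinyEnv:
    """Minimal environment stub holding the 'current sequence' to mutate from."""
    def __init__(self, seq):
        self.seq = seq

def add_last_seq_in_trajectory(env, experience, new_seqs, episode_seqs, last_batch):
    """If the episode is ending, add its maximum-valued sequence to the batch
    and point the environment at the next seed sequence from last_batch."""
    if experience.is_last:
        # episode_seqs holds (sequence, score) pairs seen during the episode
        best = max(episode_seqs, key=lambda item: item[1])[0]
        new_seqs.add(best)
        # on reset, mutants will now be generated from this new sequence
        env.seq = last_batch.pop(0)

env = TinyEnv("AAAA")
new_seqs = set()
episode = [("AAAT", 0.2), ("AATT", 0.5), ("ATTT", 0.4)]
add_last_seq_in_trajectory(env, Experience(is_last=True), new_seqs, episode, ["CCCC"])
print(new_seqs, env.seq)
```

Here the highest-scoring sequence of the finished episode ("AATT") is banked, and the environment's seed becomes the next sequence in last_batch ("CCCC").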