A Minimal Example of PPO with Continuous Control

This case study is based on the example UtilsRL/examples/train_ppo_mujoco.py and walks through the basic usage of UtilsRL.

1. Set up logger and arguments

All of the training configurations are placed in UtilsRL/examples/configs/ppo_mujoco.py, so parsing that Python configuration file with parse_args() is all that is needed.

args = parse_args("./examples/configs/ppo_mujoco.py")
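To make the idea of a Python file acting as a configuration concrete, here is a minimal sketch using only the standard library. It is not how parse_args() is implemented; the helper name load_py_config and the demo file are hypothetical, and the sketch only shows that top-level assignments in a .py file can be collected into a namespace.

```python
import runpy
import tempfile
from pathlib import Path
from types import SimpleNamespace


def load_py_config(path):
    """Execute a Python file and collect its public top-level names.

    Hypothetical helper for illustration; not part of UtilsRL.
    """
    variables = runpy.run_path(str(path))
    # Drop dunder entries such as __name__ and __file__.
    public = {k: v for k, v in variables.items() if not k.startswith("_")}
    return SimpleNamespace(**public)


# Write a tiny made-up config file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "demo_config.py"
    cfg_path.write_text("max_epoch = 10\nactor_hidden_dims = [64, 64]\n")
    demo_args = load_py_config(cfg_path)

print(demo_args.max_epoch, demo_args.actor_hidden_dims)
```

Keeping the configuration in a Python file means values such as lists of hidden dimensions need no extra parsing step.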

The logger should be created according to args.log_path and the name of this trial, so we create a TensorboardLogger object with

logger = TensorboardLogger(args.log_path, args.name)

After this, set up the argument dict and register the logger, device and seed with

setup(args, logger, args.device)

Simple and pretty: everything about experiment management is done in three lines of code.

2. Add environment info to args

For convenience, we add environment information (such as the action shape and the observation shape) to the arguments. Here args is a NameSpace object, so both dict-like and dot-like assignment are accepted.

task = "Hopper-v3"
env = gym.make(task)
args["obs_space"] = env.observation_space
args["action_space"] = env.action_space
args["obs_shape"] = env.observation_space.shape[0]
args["action_shape"] = env.action_space.shape[0]
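The dual access style used above can be illustrated with a small stand-in class. This is only a sketch of the idea; DualAccessNamespace is a hypothetical name and is not the NameSpace class shipped with UtilsRL, whose behavior may differ in details.

```python
class DualAccessNamespace:
    """Hypothetical stand-in: attributes double as dict-style items."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __getitem__(self, key):
        return getattr(self, key)

    def __setitem__(self, key, value):
        setattr(self, key, value)

    def __contains__(self, key):
        return key in self.__dict__


demo = DualAccessNamespace(name="ppo")
demo["obs_shape"] = 11   # dict-like assignment (illustrative value)
demo.action_shape = 3    # dot-like assignment (illustrative value)
print(demo.obs_shape, demo["action_shape"])
```

Both styles write to the same underlying storage, so a value set one way can be read the other way.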

3. Define the structures of networks

An MLP backend is enough for MuJoCo environments. We wrap one MLP in a SquashedGaussianActor to form the actor and another in a SingleCritic to form the critic.

actor_backend = MLP(args.obs_shape, 0, args.actor_hidden_dims, activation=nn.Tanh, device=args.device)
critic1_backend = MLP(args.obs_shape, 0, args.critic_hidden_dims, activation=nn.Tanh, device=args.device)
# critic2_backend = MLP(args.obs_shape, 0, args.critic_hidden_dims, activation=nn.Tanh, device=args.device)

actor = SquashedGaussianActor(
    actor_backend, args.actor_hidden_dims[-1], args.action_shape,
    device=args.device, reparameterize=False, conditioned_logstd=False, hidden_dims=args.actor_output_hidden_dims
).to(args.device)
critic1 = SingleCritic(
    critic1_backend, args.critic_hidden_dims[-1], 1,
    device=args.device, hidden_dims=args.critic_output_hidden_dims
).to(args.device)
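For readers unfamiliar with the term, a squashed Gaussian policy usually samples from a Gaussian and passes the sample through tanh so that actions stay in (-1, 1). The sketch below shows that common construction for a single action dimension in plain Python. It assumes the standard tanh change-of-variables correction and is not taken from the SquashedGaussianActor source; the function name and the 1e-6 stabilizer are illustrative choices.

```python
import math
import random


def sample_squashed_gaussian(mean, logstd, rng, deterministic=False):
    """Sample a = tanh(u) with u ~ N(mean, exp(logstd)^2), for one dimension.

    Returns the action and its log-density, including the tanh correction.
    Hypothetical helper for illustration only.
    """
    std = math.exp(logstd)
    u = mean if deterministic else rng.gauss(mean, std)
    # Log-density of u under the Gaussian.
    logprob_u = -0.5 * ((u - mean) / std) ** 2 - logstd - 0.5 * math.log(2 * math.pi)
    action = math.tanh(u)
    # Change of variables: log p(a) = log p(u) - log(1 - tanh(u)^2).
    logprob = logprob_u - math.log(1.0 - action ** 2 + 1e-6)
    return action, logprob


rng = random.Random(0)
action, logprob = sample_squashed_gaussian(mean=0.5, logstd=-0.5, rng=rng)
print(action, logprob)
```

For a multi-dimensional action, the per-dimension log-densities would be summed.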

4. Define the actor update logic, action selection logic and training loop

This part is more about the PPO algorithm itself, so we refer readers to the source file for the details and reproduce only the training loop in this document. Note that observations are transformed by a RunningNormalizer before they are used for action selection and training, and that at the end of each epoch the collected observations are used to update the normalizer.
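As background for the training loop that follows, the running normalization mentioned above can be sketched in plain Python. The class below tracks the mean and variance of a single scalar feature and merges each new batch into the running statistics. It is an assumption-laden illustration, not the RunningNormalizer from UtilsRL: the class name, the eps value and the scalar-only interface are all hypothetical.

```python
import math


class RunningNormalizerSketch:
    """Tracks a running mean/variance of a scalar feature (illustrative only)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean
        self.eps = eps

    def update(self, batch):
        """Merge a batch of samples into the running statistics."""
        n = len(batch)
        if n == 0:
            return
        batch_mean = sum(batch) / n
        batch_m2 = sum((x - batch_mean) ** 2 for x in batch)
        delta = batch_mean - self.mean
        total = self.count + n
        # Parallel-variance merge of (old statistics) and (batch statistics).
        self.m2 += batch_m2 + delta ** 2 * self.count * n / total
        self.mean += delta * n / total
        self.count = total

    def transform(self, x):
        """Standardize x with the statistics seen so far."""
        if self.count == 0:
            return x  # nothing to normalize against yet
        std = math.sqrt(self.m2 / self.count)
        return (x - self.mean) / (std + self.eps)


norm = RunningNormalizerSketch()
norm.update([1.0, 2.0, 3.0])  # e.g. observations collected in one epoch
norm.update([4.0, 5.0])       # e.g. observations collected in the next epoch
print(norm.mean, norm.transform(3.0))
```

Updating only at the end of an epoch keeps the statistics fixed while that epoch's data is collected, so every sample in the epoch is normalized the same way.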

for i_epoch in Monitor("PPO Training").listen(range(args.max_epoch)):
    buffer.clear()
    obs, done = env.reset(), False
    sample_ph = buffer.get_placeholder(args.sample_per_epoch)
    traj_length = traj_return = traj_start = 0
    for env_step in range(args.sample_per_epoch):
        obs_norm = obs_normalizer.transform(torch.from_numpy(obs).float().to(args.device)).cpu().numpy()
        action, logprob = get_action(obs_norm)
        next_obs, reward, done, _ = env.step(action)
        # traj_return += reward
        traj_length += 1
        tot_env_step += 1

        value = get_value(obs_norm)

        sample_ph["obs"][env_step] = obs
        sample_ph["action"][env_step] = action
        sample_ph["logprob"][env_step] = logprob
        sample_ph["next_obs"][env_step] = next_obs
        sample_ph["reward"][env_step] = reward
        sample_ph["done"][env_step] = done
        sample_ph["value"][env_step] = value

        epoch_ended = env_step == args.sample_per_epoch - 1
        timeout = traj_length == args.max_traj_length

        if done or timeout or epoch_ended:
            if timeout or epoch_ended:
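                # Normalize next_obs like obs above, so the critic sees inputs on the same scale
                next_obs_norm = obs_normalizer.transform(torch.from_numpy(next_obs).float().to(args.device)).cpu().numpy()
                last_v = get_value(next_obs_norm)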
            else:
                last_v = 0
            gae, ret = compute_gae(sample_ph["reward"][traj_start:env_step+1], sample_ph["value"][traj_start:env_step+1], last_v=last_v)
            sample_ph["return"][traj_start:env_step+1] = ret.reshape(-1, 1)
            sample_ph["advantage"][traj_start:env_step+1] = gae.reshape(-1, 1)
            # for field in buffer.field_names:
                # sample_ph[field] = sample_ph[field][:traj_length]
            # buffer.add_samples(sample_ph)

            next_obs, done = env.reset(), False
            traj_length = 0
            traj_start = env_step + 1

        obs = next_obs

    if i_epoch < args.warmup_epoch:
        obs_torch = torch.from_numpy(sample_ph["obs"]).float().to(args.device)
        obs_normalizer.update(obs_torch)
        continue
    # sample_ph["obs"] = obs_normalizer.transform(obs_torch).cpu().numpy()
    buffer.add_samples(sample_ph)
    data_batch = buffer.random_batch(0)
    data_batch["obs"] = obs_normalizer.transform(torch.from_numpy(data_batch["obs"]).float().to(args.device)).cpu().numpy()
    train_loss = update(data_batch)

    if i_epoch % args.eval_interval == 0:
        traj_lengths = []
        traj_returns = []
        for traj_id in range(args.eval_num_traj):
            traj_return = traj_length = 0
            state, done = env.reset(), False
            for step in range(args.max_traj_length):
                state_norm = obs_normalizer.transform(torch.from_numpy(state).float().to(args.device)).cpu().numpy()
                action, _ = get_action(state_norm, deterministic=True)
                state, reward, done, _ = env.step(action)
                traj_return += reward
                traj_length += 1
                if done:
                    break
            traj_lengths.append(traj_length)
            traj_returns.append(traj_return)
        train_loss.update({
            "eval/traj_return": np.mean(traj_returns),
            "eval/traj_length": np.mean(traj_lengths)
        })

    obs_torch = torch.from_numpy(sample_ph["obs"]).float().to(args.device)
    obs_normalizer.update(obs_torch)
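The loop above calls compute_gae and update, whose bodies live in the source file. For orientation, here is a sketch of the textbook forms of both pieces on plain Python lists: generalized advantage estimation with a bootstrap value last_v, and the PPO clipped surrogate loss. The function names, the default gamma, lam and clip_range values, and the metric keys are assumptions made for illustration; the example's own implementations may differ.

```python
import math


def gae_sketch(rewards, values, last_v, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory segment.

    last_v is the value estimate of the state after the final step:
    0 for a true terminal state, a critic estimate if the segment was cut off.
    """
    advantages = [0.0] * len(rewards)
    next_value, next_adv = last_v, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_range=0.2):
    """Clipped surrogate loss (to be minimized), averaged over samples."""
    total, clipped = 0.0, 0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        total += min(ratio * adv, clipped_ratio * adv)
        clipped += ratio != clipped_ratio
    n = len(advantages)
    # Metric keys below are made up for this sketch.
    return {"loss/actor": -total / n, "misc/clip_fraction": clipped / n}


adv, ret = gae_sketch([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_v=0.0, gamma=1.0, lam=1.0)
print(adv, ret)
print(ppo_clip_loss([0.0, 0.0], [0.0, 0.0], [2.0, 1.0]))
```

The bootstrap argument is what distinguishes the two branches in the loop: a segment cut off by a timeout or by the end of the epoch still has future reward, so it is estimated with the critic, whereas a terminated trajectory uses 0.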

5. Record the results

The actor’s update function returns a dict recording several metrics of the training process, so we can log these statistics with one line of code:

logger.log_scalars("", train_loss, step=i_epoch)

Here i_epoch is the count of training epochs, and "" means the statistics are identified directly by the keys of train_loss. After training is done, you can view the curves by typing

tensorboard --logdir </path/to/log> --bind_all

in the terminal.
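As a footnote on the empty first argument passed to logger.log_scalars in this step, the sketch below shows one plausible way a prefix and a metrics dict could be combined into TensorBoard tags. This is an assumption about the convention, not the logger's actual code; build_tags is a hypothetical helper, and the behavior for a non-empty prefix in particular is a guess.

```python
def build_tags(prefix, scalars):
    """Map each metric key to a tag, prepending the prefix when it is non-empty.

    Hypothetical helper; the real logger may combine names differently.
    """
    tags = {}
    for key, value in scalars.items():
        tag = f"{prefix}/{key}" if prefix else key
        tags[tag] = float(value)
    return tags


metrics = {"eval/traj_return": 1234.5, "eval/traj_length": 800}
print(build_tags("", metrics))       # keys are used as the tags unchanged
print(build_tags("run1", metrics))   # assumed behavior for a non-empty prefix
```

Because keys such as "eval/traj_return" already carry their own grouping, an empty prefix is enough to keep the curves organized in TensorBoard.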