Primal Wasserstein Imitation Learning

In this study, the authors propose an algorithm called Primal Wasserstein Imitation Learning (PWIL), an Imitation Learning (IL) method. The approach is conceptually straightforward: it builds on the primal form of the Wasserstein distance between the expert's and the agent's state-action distributions. Unlike recent adversarial IL algorithms, which learn a reward function through interactions with the environment, PWIL uses a reward function derived offline that requires little fine-tuning.

The experiments show that PWIL efficiently recovers expert behavior on a variety of continuous-control tasks in the MuJoCo domain, both in terms of agent interactions and expert interactions with the environment. The authors also emphasize that the trained agent matches the expert's behavior in terms of the Wasserstein distance itself, rather than only the commonly used proxy of task performance. These properties make PWIL a compelling and sample-efficient imitation learning method with robust performance.

Wasserstein Distance

Suppose there is a metric space $(M, d)$, where $M$ is a set and $d$ is a metric on $M$. Given two distributions $µ$ and $ν$ on $M$ with finite $p$-th moments, the $p$-th order Wasserstein distance (Villani, 2008) is defined as:

\[\begin{align*} \cal{W}^{p}_{p}(µ, ν) = \inf_{θ∈Θ(µ,ν)} \int_{M×M} d(x, y)^p \, dθ(x, y) \end{align*}\]

It is computed through a coupling $θ$ that minimizes the integral of $d(x, y)^p$ over all pairs of points in $M×M$, where $Θ(µ, ν)$ denotes the set of all couplings between $µ$ and $ν$. In the following, we only consider distributions with finite support, i.e., distributions supported on a finite set of points. A coupling between two distributions with supports of size $T$ and $D$ can then be represented as a doubly stochastic matrix of size $T × D$ (rows summing to $1/T$, columns summing to $1/D$), and we use $Θ$ to refer to the set of all such matrices.
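For finite-support distributions, the infimum is a linear program over the coupling matrix. Below is a minimal sketch (not from the paper) that computes $\cal{W}^{p}_{p}$ between two uniformly weighted finite-support distributions with SciPy's LP solver; the function name `wasserstein_pp` and its arguments are illustrative.

```python
# A minimal sketch: W_p^p between two uniformly weighted finite-support
# distributions, obtained by solving the transport LP over T x D coupling matrices.
import numpy as np
from scipy.optimize import linprog

def wasserstein_pp(x, y, p=1):
    """x: (T, k) support of mu, y: (D, k) support of nu, both uniformly weighted."""
    T, D = len(x), len(y)
    # Pairwise costs d(x_i, y_j)^p, flattened row-major to match theta[i, j].
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p

    # Constraint: each row of theta sums to 1/T (all mass of each source atom moves).
    A_rows = np.kron(np.eye(T), np.ones((1, D)))
    # Constraint: each column of theta sums to 1/D (each target atom is filled).
    A_cols = np.kron(np.ones((1, T)), np.eye(D))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([np.full(T, 1.0 / T), np.full(D, 1.0 / D)])

    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun  # value of W_p^p(mu, nu)
```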

Objective

PWIL minimizes the Wasserstein distance between the state-action distribution of the learned policy $\hat{ρ}_π$ and that of the expert $\hat{ρ}_e$ in the trajectory setting:

\[\begin{align*} \inf_{π∈Π}{\cal{W}^{p}_{p}( \hat{\rho}_π,\hat{\rho}_e)}=\inf_{π∈Π}\inf_{θ∈Θ}∑_{i=1}^{T}∑_{j=1}^{D}{d((s^π_i,a^π_i),(s^e_j,a^e_j))^pθ[i,j]} \end{align*}\]

where $Θ$ is the set of coupling matrices between $\hat{ρ}_π$ and $\hat{ρ}_e$ defined above, and $d$ is a distance on the state-action space.

Reward Function

PWIL is built on top of an off-policy RL algorithm. The distance $d$ is the standardized Euclidean distance: the L2 distance on the concatenation of the observation and the action, weighted along each dimension by the inverse standard deviation of the expert demonstrations. The reward is then defined as a decreasing function of the cost:

\[\begin{align*} r_i = \alpha \exp (-\frac{\beta T}{\sqrt{|\cal{S}| + |\cal{A}|}} c_i) \end{align*}\]

where $c_i$ is the cost, at timestep $i$, of moving the agent's state-action pair $(s^π_i,a^π_i)$ onto the expert's state-action pairs under the coupling introduced below. The authors use $α=5$ and $β=5$ for all environments. The scaling factor \(\frac{T}{\sqrt{|\cal{S}| + |\cal{A}|}}\) acts as a normalizer for the dimensionality of the state and action spaces and for the time horizon of the task.
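As a small illustration of these two ingredients, a sketch of the standardized distance and the reward shaping could look as follows; the function and argument names are assumptions, not the released PWIL code.

```python
# Sketch of the distance and reward shaping described above (illustrative names).
import numpy as np

def standardized_dist(sa, expert_sa_point, expert_std):
    # L2 distance on the concatenated (observation, action) vector, weighted
    # per dimension by the inverse standard deviation of the expert demonstrations.
    return np.linalg.norm((sa - expert_sa_point) / (expert_std + 1e-8))

def reward_from_cost(c_i, dim_sa, T, alpha=5.0, beta=5.0):
    # r_i = alpha * exp(-(beta * T / sqrt(|S| + |A|)) * c_i): decreasing in the cost.
    return alpha * np.exp(-(beta * T / np.sqrt(dim_sa)) * c_i)
```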

Method

#1. Episodic version

In the episodic version, we compute the optimal coupling $θ^∗_π$ associated with the policy $π$:

\[\begin{align*} θ^∗_π = \underset{θ∈Θ}{\operatorname{arg \: min}}\sum^T_{i=1}\sum^D_{j=1}d((s^π_i, a^π_i),(s^e_j, a^e_j))θ[i, j] \end{align*}\]

which lets us rewrite the objective as a sum of per-timestep costs that an RL method can minimize:

\[\begin{align*} \inf_{π∈Π}{\cal{W}^{p}_{p}(\hat{\rho}_π,\hat{\rho}_e)} &= \inf_{π∈Π}\sum^{T}_{i=1}{c^*_{i, π}} \\ \mathit{where} \:\:\: c^*_{i, π} &= \sum^{D}_{j=1}{d((s^π_i, a^π_i),(s^e_j, a^e_j))θ^*_{π}[i, j]} \end{align*}\]

Note that $c^∗_{i,π}$ depends on the optimal coupling, which can only be computed at the end of an episode. This is an issue for agents that learn online or for tasks with long time horizons.
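Under the same assumptions as the Wasserstein sketch above (uniform weights, SciPy's LP solver, illustrative names), the episodic costs could be computed roughly as follows; note that the whole agent trajectory is needed before any cost, and hence any reward, can be assigned.

```python
# Sketch of the episodic variant: solve for the optimal coupling theta* once the
# episode is complete, then read off the per-timestep costs c*_i.
import numpy as np
from scipy.optimize import linprog

def episodic_costs(agent_sa, expert_sa, dist):
    """agent_sa: (T, k), expert_sa: (D, k); dist is the metric d. Returns c*_1..c*_T."""
    T, D = len(agent_sa), len(expert_sa)
    cost = np.array([[dist(a, e) for e in expert_sa] for a in agent_sa])
    A_eq = np.vstack([np.kron(np.eye(T), np.ones((1, D))),    # rows sum to 1/T
                      np.kron(np.ones((1, T)), np.eye(D))])   # columns sum to 1/D
    b_eq = np.concatenate([np.full(T, 1.0 / T), np.full(D, 1.0 / D)])
    theta_star = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                         bounds=(0, None), method="highs").x.reshape(T, D)
    return (cost * theta_star).sum(axis=1)  # c*_{i, pi} for i = 1..T
```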

#2. Greedy version

A way to avoid waiting until the end of the episode is to define a greedy coupling $θ^g_π∈Θ$ recursively for $1≤i≤T$ as:

\[\begin{align*} θ^g_π[i, :] &= \underset{θ[i, :]∈Θ_i}{\operatorname{arg \: min}} \sum^D_{j=1}d((s^π_i, a^π_i),(s^e_j, a^e_j))θ[i, j] \\ \mathit{where} \:\:\: Θ_i &= \left\{ θ[i, :] ∈ \mathbb{R}^D_+ \vert \underbrace{\sum^D_{j=1}θ[i, j] = \frac{1}{T}}_{\mathbf{constraint \: (a)}}, \underbrace{ \forall k \in [1 : D], \; \sum^{i-1}_{i'=1}{θ^g_π[i', k]} + θ[i, k] ≤ \frac{1}{D} }_{\mathbf{constraint \: (b)}} \right\} \end{align*}\]

In the terminology of the earth mover's distance, constraint (a) means that all the dirt at the $i$-th timestep must be moved, and constraint (b) limits the capacity of each target. The greedy coupling is suboptimal for the Wasserstein objective: because of the target-capacity constraint, early timesteps can consume capacity that the optimal coupling would reserve for later ones, as Figure 1 demonstrates. If the target-capacity constraint is dropped, each per-timestep minimization simply matches the agent's state-action pair to its closest expert pair; this is exact for each step in isolation, but it no longer corresponds to the Wasserstein distance and sometimes leads to a performance drop, as shown in the experimental results.

Now the cost function is defined as:

\[\begin{align*} c^g_{i, π} = \sum^{D}_{j=1}{d((s^π_i, a^π_i),(s^e_j, a^e_j))θ^g_{π}[i, j]} \end{align*}\]

Algorithm

Note that the runtime of a single reward-step computation is \(O((|\mathcal{S}|+|\mathcal{A}|)D + \frac{D^2}{T})\), where the $O((|\mathcal{S}|+|\mathcal{A}|)D)$ term comes from computing all distances $d((s,a),(s^e,a^e))$ at once and the \(O(\frac{D^2}{T})\) term comes from the while loop.
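To make the two terms of this runtime concrete, here is a rough sketch of a single greedy reward step; the class and variable names are hypothetical, not the authors' released implementation. The vectorized distance computation accounts for the $O((|\mathcal{S}|+|\mathcal{A}|)D)$ term, and the roughly $D/T$ iterations of an $O(D)$ argmin account for the $O(\frac{D^2}{T})$ term.

```python
# Sketch of a single greedy reward step (illustrative, not the released PWIL code).
import numpy as np

class GreedyCoupling:
    """Tracks the remaining capacity of each expert atom over one episode."""

    def __init__(self, expert_sa, horizon_T):
        self.expert_sa = np.asarray(expert_sa)            # (D, |S|+|A|) expert pairs
        self.inv_std = 1.0 / (self.expert_sa.std(axis=0) + 1e-8)
        self.T, self.D = horizon_T, len(self.expert_sa)
        self.capacity = np.full(self.D, 1.0 / self.D)     # each expert atom holds 1/D

    def cost(self, sa):
        """Greedy cost c^g_i for the agent's state-action pair at timestep i."""
        # O((|S|+|A|) D): all standardized distances d((s,a), (s^e_j, a^e_j)) at once.
        dist = np.linalg.norm((self.expert_sa - np.asarray(sa)) * self.inv_std, axis=1)
        mass, c_i = 1.0 / self.T, 0.0           # constraint (a): move 1/T of dirt in total
        while mass > 1e-12:                     # ~D/T iterations of O(D) each
            available = self.capacity > 1e-12   # constraint (b): respect target capacity
            if not available.any():
                break
            j = np.argmin(np.where(available, dist, np.inf))
            moved = min(mass, self.capacity[j])
            c_i += moved * dist[j]
            self.capacity[j] -= moved
            mass -= moved
        return c_i  # the reward is then r_i = alpha * exp(-beta * T / sqrt(|S|+|A|) * c_i)
```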

Experiment

Figure 2 compares PWIL with DAC (Discriminator-Actor-Critic) and BC (Behavioral Cloning). Note that PWIL is able to learn from a single demonstration even on challenging tasks such as Humanoid.

Figure 10 illustrates the outcomes of the ablation study:

  • PWIL-state omits the use of the expert’s actions, causing a substantial performance decline in Walker2d while leaving the others almost unaffected.

  • PWIL-nofill skips prefilling the replay buffer with expert transitions, resulting in a performance decrease in most environments, particularly Walker2d.

  • PWIL-L2 employs the simple L2 distance without normalization, leading to a notable drop in all environments except for Hopper and Walker2d.

  • PWIL-support removes the constraint on the target capacity, eliminating the while loop in the algorithm. This results in a significant drop in Walker2d and a slight decrease in Hopper.

Reference